# Exploring the CommonPlace file

This notebook explores the CommonPlace data stored in the data/raw folder:

- CommonPlace_20251109.csv

The following processing steps are applied:
- Extraction of the coordinates (latitude/longitude)
- Removal of places not needed for the project (Residential, Commercial, ...)
- Removal of places without a borough code (because they are located outside the boroughs)
- Assignment of an activity category (recreation, culture, transportation)

The data is then exported to: data/processed/CommonPlaces.csv
The following columns are kept:
 - 0   PLACEID             8120 non-null   int64
 - 1   BOROUGH CODE        8120 non-null   float64
 - 2   FACILITY TYPE       8120 non-null   int64
 - 3   FEATURE NAME        8120 non-null   object
 - 4   longitude           8120 non-null   float64
 - 5   latitude            8120 non-null   float64
 - 6   FACILITY TYPE NAME  8120 non-null   object
 - 7   ACTIVITY_CATEGORY   8120 non-null   object
 - 8   BOROUGH NAME        8120 non-null   object



In [1]:
import pandas as pd

df = pd.read_csv('../data/raw/CommonPlace_20251109.csv')
# PLACEID is read as text containing commas; strip them before casting to integer
df['PLACEID'] = df['PLACEID'].str.replace(',', '').astype('int64')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20586 entries, 0 to 20585
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   the_geom                  20586 non-null  object 
 1   SEGMENTID                 20584 non-null  object 
 2   COMPLEXID                 7606 non-null   object 
 3   SAFTYPE                   11037 non-null  object 
 4   PLACEID                   20586 non-null  int64  
 5   BIN                       14248 non-null  object 
 6   SOURCE                    20586 non-null  object 
 7   OBJECTID                  20586 non-null  object 
 8   SOS INDICATOR             20002 non-null  float64
 9   FACILITY DOMAINS          20586 non-null  int64  
 10  BOROUGH CODE              20375 non-null  float64
 11  SOURCE ID                 5895 non-null   object 
 12  CREATED_BY                983 non-null    object 
 13  CREATED_DATE              20583 non-null  object 
 14  MODIFI

In [2]:

# Strip the 'POINT ' prefix and the surrounding parentheses
df['the_geom'] = df['the_geom'].str.replace('POINT ', '', regex=False)
df['the_geom'] = df['the_geom'].str.strip('()')

# Split into longitude and latitude (the longitude comes first in the_geom)
df[['longitude', 'latitude']] = df['the_geom'].str.split(' ', expand=True).astype(float)

df.head()

Unnamed: 0,the_geom,SEGMENTID,COMPLEXID,SAFTYPE,PLACEID,BIN,SOURCE,OBJECTID,SOS INDICATOR,FACILITY DOMAINS,...,CREATED_DATE,MODIFIED_BY,MODIFIED_DATE,FACILITY TYPE,B7SC,PRIMARY ADDRESS POINT ID,FEATURE NAME,SECURITY LEVEL,longitude,latitude
0,-74.097961931446 40.634604200807,13847,,,11947,5002227.0,DOE,10555,2.0,2,...,2009 May 14 12:00:00 AM,,2020 Jun 18 09:27:09 AM,2,,5002977.0,IS 61 WILLIAM A MORRIS,1,-74.097962,40.634604
1,-73.981379489555 40.589105561411,19292,,,12280,3385667.0,DOE,5120,2.0,3,...,2009 May 14 12:00:00 AM,,2020 Apr 08 09:48:35 AM,2,,5132738.0,PS 721 BROOKLYN OCCUPATIONAL TRAINING CENTER,1,-73.981379,40.589106
2,-73.943478646583 40.724827480825,35645,4762.0,X,1036755,3338038.0,OTHER,20345,2.0,10,...,2020 Dec 10 01:57:39 PM,,2022 Oct 12 04:59:27 PM,4,31248503.0,5201395.0,MCGOLRICK PLAYGROUND COMFORT STATION,1,-73.943479,40.724827
3,-73.858490015009 40.708424926703,151696,,,3068,4095037.0,EMS,7507,1.0,2,...,2009 May 14 12:00:00 AM,,2017 Feb 08 03:08:17 PM,7,,105469.0,HOME DEPOT WOODHAVEN BLVD,1,-73.85849,40.708425
4,-74.024085858699 40.672444940613,17988,,,6914,,NOAA,7005,,11,...,2009 May 14 12:00:00 AM,,,6,,,BAY RIDGE CHANNEL LIGHTED GONG BUOY 11,1,-74.024086,40.672445
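
The coordinate extraction above can also be written as a single regular-expression capture, which makes the expected layout of `the_geom` explicit. This is a minimal sketch on a hypothetical two-row frame (`sample`), assuming every value follows the pattern `POINT (<longitude> <latitude>)`; rows that do not match would come out as NaN rather than raising an error.

```python
import pandas as pd

# Hypothetical sample mimicking the 'the_geom' column (longitude first, then latitude).
sample = pd.DataFrame({
    'the_geom': [
        'POINT (-74.097961931446 40.634604200807)',
        'POINT (-73.981379489555 40.589105561411)',
    ]
})

# Capture the two numbers between the parentheses; named groups become column names.
pattern = r'POINT \((?P<longitude>-?\d+(?:\.\d+)?) (?P<latitude>-?\d+(?:\.\d+)?)\)'
coords = sample['the_geom'].str.extract(pattern).astype(float)

sample[['longitude', 'latitude']] = coords
print(sample[['longitude', 'latitude']])
```

Compared with the replace/strip/split sequence, this variant leaves `the_geom` untouched, which can help when the raw value is still needed for debugging.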


In [3]:
# Drop the columns that are not needed

columns_to_remove = [
    'the_geom',
    'SEGMENTID',
    'COMPLEXID',
    'SAFTYPE',
    'BIN',
    'SOURCE',
    'OBJECTID',
    'SOS INDICATOR',
    'FACILITY DOMAINS',
    'SOURCE ID',
    'CREATED_BY',
    'CREATED_DATE',
    'MODIFIED_BY',
    'MODIFIED_DATE',
    'B7SC',
    'PRIMARY ADDRESS POINT ID',
    'SECURITY LEVEL',
]

df.drop(columns=columns_to_remove, inplace=True)

## Facility Types
1 Residential

2 Education Facility

3 Cultural Facility

4 Recreational Facility

5 Social Services

6 Transportation Facility

7 Commercial

8 Government Facility (non public safety)

9 Religious Institution

10 Health Services

11 Public Safety

12 Water

13 Miscellaneous
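
The code list above can be captured as a dictionary, so that the set of removed codes is derived rather than typed by hand. This is only a sketch: `FACILITY_TYPES` and `KEPT_CODES` are names chosen here for illustration, and the kept codes (3, 4, 6, 9) follow the filter applied in the next cell.

```python
# Facility type codes and labels, as listed above.
FACILITY_TYPES = {
    1: 'Residential',
    2: 'Education Facility',
    3: 'Cultural Facility',
    4: 'Recreational Facility',
    5: 'Social Services',
    6: 'Transportation Facility',
    7: 'Commercial',
    8: 'Government Facility (non public safety)',
    9: 'Religious Institution',
    10: 'Health Services',
    11: 'Public Safety',
    12: 'Water',
    13: 'Miscellaneous',
}

# Codes kept for the project (illustrative name).
KEPT_CODES = {3, 4, 6, 9}

# Everything else is removed.
removed = sorted(set(FACILITY_TYPES) - KEPT_CODES)
print(removed)
```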

In [4]:
import numpy as np

# Remove the facility types that are not relevant
facility_types_to_remove = [1, 2, 5, 7, 8, 10, 11, 12, 13]
df = df[~df['FACILITY TYPE'].isin(facility_types_to_remove)].copy()

# Add a descriptive label for each type
facility_type_labels = {
    3: 'Cultural Facility',
    4: 'Recreational Facility',
    6: 'Transportation Facility',
    9: 'Religious Institution'
}
df['FACILITY TYPE NAME'] = df['FACILITY TYPE'].map(facility_type_labels)

# Build the activity segmentation
conditions = [
    df['FACILITY TYPE'].isin([4]),     # recreation
    df['FACILITY TYPE'].isin([3, 9])   # culture or religion
]
choices = ['recreation', 'culture']

df['ACTIVITY_CATEGORY'] = np.select(conditions, choices, default='transportation')

# Check whether any entries are still missing a borough code
missing_boroughs = df['BOROUGH CODE'].isnull().sum()
print(f"Missing boroughs: {missing_boroughs}")


Missing boroughs: 191
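
To make the segmentation step easier to follow, here is a small sketch of how `np.select` behaves on a hypothetical four-row frame (`toy`): conditions are evaluated in order, the first one that is true picks the matching entry of `choices`, and rows matching no condition receive the `default`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one row per retained facility type.
toy = pd.DataFrame({'FACILITY TYPE': [3, 4, 6, 9]})

conditions = [
    toy['FACILITY TYPE'].isin([4]),      # recreation
    toy['FACILITY TYPE'].isin([3, 9]),   # culture or religion
]
choices = ['recreation', 'culture']

# Type 6 matches neither condition, so it falls back to the default.
toy['ACTIVITY_CATEGORY'] = np.select(conditions, choices, default='transportation')
print(toy)
```

Because type 6 is handled only through the default, any new facility type added to the retained set would also be labelled as transportation unless a condition is added for it.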


## Borough codes (the five NYC boroughs and neighboring areas)
1 Manhattan

2 Bronx

3 Brooklyn

4 Queens

5 Staten Island

6 Nassau County

7 Westchester

8 New Jersey

In [5]:
# Populate the borough names
df['BOROUGH NAME'] = df['BOROUGH CODE'].map({
    1: 'Manhattan',
    2: 'Bronx',
    3: 'Brooklyn',
    4: 'Queens',
    5: 'Staten Island',
    6: 'Nassau County',
    7: 'Westchester',
    8: 'New Jersey'
})
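
`BOROUGH CODE` is stored as float64 because the column contains missing values, yet the float codes (1.0, 3.0, ...) still match the integer keys of the mapping, as the outputs further down show. The sketch below illustrates this on a hypothetical frame (`toy`) and, as an optional extra that the notebook does not apply, converts the codes to pandas' nullable `Int64` dtype so they display as 1 instead of 1.0.

```python
import pandas as pd

# Hypothetical frame: float codes with one missing value.
toy = pd.DataFrame({'BOROUGH CODE': [1.0, 3.0, None, 5.0]})

borough_names = {
    1: 'Manhattan',
    2: 'Bronx',
    3: 'Brooklyn',
    4: 'Queens',
    5: 'Staten Island',
}

# Float codes are matched against the integer keys; missing codes stay missing.
toy['BOROUGH NAME'] = toy['BOROUGH CODE'].map(borough_names)

# Optional: nullable integer dtype keeps the missing value while dropping the ".0".
toy['BOROUGH CODE'] = toy['BOROUGH CODE'].astype('Int64')
print(toy)
```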

In [6]:
# Plot the entries on a map, colored by borough code

import plotly.express as px

fig = px.scatter_map(df, lat='latitude', lon='longitude', color='BOROUGH CODE',
                     hover_name='FEATURE NAME',
                     title='Common places')
fig.show()

In [7]:
# Remove entries without a borough code (they are located outside the New York boroughs)
df = df[~df['BOROUGH CODE'].isnull()]
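
The boolean-mask filter above is equivalent to `dropna` restricted to one column. A minimal sketch on a hypothetical three-row frame (`toy`) shows that both forms keep the same rows:

```python
import pandas as pd

# Hypothetical frame with one missing borough code.
toy = pd.DataFrame({
    'FEATURE NAME': ['A', 'B', 'C'],
    'BOROUGH CODE': [1.0, None, 4.0],
})

# Form used in the notebook: keep rows whose code is not null.
filtered_mask = toy[~toy['BOROUGH CODE'].isnull()]

# Equivalent form: drop rows where this specific column is missing.
filtered_dropna = toy.dropna(subset=['BOROUGH CODE'])

print(filtered_mask.equals(filtered_dropna))
```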

In [8]:
# Plot the remaining entries on a map, colored by activity category

import plotly.express as px

fig = px.scatter_map(df, lat='latitude', lon='longitude', color='ACTIVITY_CATEGORY',
                     hover_name='FEATURE NAME',
                     title='Common places')
fig.show()

In [9]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 8120 entries, 2 to 20580
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   PLACEID             8120 non-null   int64  
 1   BOROUGH CODE        8120 non-null   float64
 2   FACILITY TYPE       8120 non-null   int64  
 3   FEATURE NAME        8120 non-null   object 
 4   longitude           8120 non-null   float64
 5   latitude            8120 non-null   float64
 6   FACILITY TYPE NAME  8120 non-null   object 
 7   ACTIVITY_CATEGORY   8120 non-null   object 
 8   BOROUGH NAME        8120 non-null   object 
dtypes: float64(3), int64(2), object(4)
memory usage: 634.4+ KB


In [10]:
df.head(500)


Unnamed: 0,PLACEID,BOROUGH CODE,FACILITY TYPE,FEATURE NAME,longitude,latitude,FACILITY TYPE NAME,ACTIVITY_CATEGORY,BOROUGH NAME
2,1036755,3.0,4,MCGOLRICK PLAYGROUND COMFORT STATION,-73.943479,40.724827,Recreational Facility,recreation,Brooklyn
4,6914,3.0,6,BAY RIDGE CHANNEL LIGHTED GONG BUOY 11,-74.024086,40.672445,Transportation Facility,transportation,Brooklyn
5,15705,1.0,4,THOMAS JEFFERSON PARK,-73.936113,40.793070,Recreational Facility,recreation,Manhattan
7,1031750,3.0,6,NEWKIRK AV OVER NYCTA BRIGHTON,-73.962896,40.635726,Transportation Facility,transportation,Brooklyn
9,9539,4.0,4,ARTHUR ASHE STADIUM,-73.846946,40.749665,Recreational Facility,recreation,Queens
...,...,...,...,...,...,...,...,...,...
1279,6845,1.0,4,SOL BLOOM PLAYGROUND,-73.968822,40.789713,Recreational Facility,recreation,Manhattan
1280,15115,3.0,9,ST BRIGIDS RC CHURCH,-73.912111,40.701207,Religious Institution,culture,Brooklyn
1281,15852,4.0,4,TURTLE PLAYGROUND,-73.827530,40.742187,Recreational Facility,recreation,Queens
1282,3720,1.0,9,SHAARE ZEDEK CONGREGATION,-73.972613,40.792334,Religious Institution,culture,Manhattan


In [11]:
df.to_csv('../data/processed/CommonPlaces.csv')
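
By default `to_csv` also writes the DataFrame index as a first, unnamed column, which reappears as `Unnamed: 0` when the file is read back. The sketch below shows the difference on a hypothetical two-row frame (`toy`), using an in-memory buffer instead of a file; whether the index should be kept depends on how downstream notebooks use it, so this is an option to consider rather than a correction.

```python
import io
import pandas as pd

# Hypothetical frame with a non-contiguous index, as left by the filtering steps.
toy = pd.DataFrame(
    {'PLACEID': [1036755, 6914], 'BOROUGH NAME': ['Brooklyn', 'Brooklyn']},
    index=[2, 4],
)

# Default behavior: the index is written as an extra unnamed column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# With index=False only the data columns are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(list(with_index.columns))
print(list(without_index.columns))
```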

## Keyword-based categorization of common places

In [45]:
import re

def search_for_word_in_df_column(df, word, col_in, col_out, category, case_sensitive=False):
    """
    Search for a whole word (word) in each row of a column (col_in) of a DataFrame (df).
    When the word is found, the output column (col_out) is set to category.
    Rows whose output column is already filled keep their existing value.
    Modifies df in place and returns it.
    """
    # Create the output column (object dtype, so it can hold category labels)
    if col_out not in df.columns:
        df[col_out] = pd.Series(index=df.index, dtype='object')

    # Build a whole-word pattern
    pattern = rf'\b{re.escape(word)}\b'
    flags = 0 if case_sensitive else re.IGNORECASE

    # Rows where the word is found
    found = df[col_in].astype(str).str.contains(pattern, flags=flags, na=False, regex=True)

    # Rows where the output is still empty or null
    empty_lines = df[col_out].isna() | (df[col_out].astype(str).str.strip() == '')

    # Write the category only where the word is found and the output is empty
    to_fill = found & empty_lines
    df.loc[to_fill, col_out] = category
    return df
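
Two properties of this function matter when choosing keywords: the `\b` word boundaries mean that `park` does not match `PARKWAY`, and because filled rows are never overwritten, the first keyword that matches a row wins. The sketch below demonstrates both on a hypothetical three-row frame (`toy`), using `tag_keyword`, a condensed stand-in written for this example (case-insensitive only) rather than the notebook's function itself.

```python
import re
import pandas as pd

def tag_keyword(df, word, col_in, col_out, category):
    """Condensed, illustrative stand-in for search_for_word_in_df_column."""
    if col_out not in df.columns:
        df[col_out] = pd.Series(index=df.index, dtype='object')
    pattern = rf'\b{re.escape(word)}\b'
    found = df[col_in].astype(str).str.contains(pattern, flags=re.IGNORECASE, regex=True)
    # Only fill rows that match and have no category yet.
    df.loc[found & df[col_out].isna(), col_out] = category
    return df

toy = pd.DataFrame({'FEATURE NAME': [
    'SOL BLOOM PLAYGROUND',        # matches 'playground'
    'HENRY HUDSON PARKWAY MALLS',  # 'park' is not a whole word here
    'CHURCH PARK',                 # hypothetical name matching two keywords
]})

keywords = [('playground', 'fun'), ('church', 'religious building'), ('park', 'park')]
for word, category in keywords:
    tag_keyword(toy, word, 'FEATURE NAME', 'PLACE CATEGORY', category)

print(toy)
```

The order of the keyword list therefore acts as a priority order: placing `park` before `church` would categorize the third row as a park instead.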


In [None]:
search_word = [('playground', 'fun'), ('church', 'religious building'), ('park', 'park'),
               ('garden', 'park'), ('stadium', 'sport'), ('theater', 'theater'),
               ('museum', 'museum'), ('fields', 'sport'), ('chapel', 'religious building')]

for word, category in search_word:
    search_for_word_in_df_column(df=df, word=word, col_in='FEATURE NAME',
                                 col_out='PLACE CATEGORY', category=category)

# Non-transportation places that are still uncategorized
df[df['PLACE CATEGORY'].isna() & (df['ACTIVITY_CATEGORY'] != 'transportation')]

Unnamed: 0,PLACEID,BOROUGH CODE,FACILITY TYPE,FEATURE NAME,longitude,latitude,FACILITY TYPE NAME,ACTIVITY_CATEGORY,BOROUGH NAME,PLACE_CATEGORY,PLACE CATEGORY
11,1035190,1.0,4,HARLEM GROWN 134 ST GREEN HOUSE,-73.943313,40.814106,Recreational Facility,recreation,Manhattan,,
19,3633,3.0,9,BENSONHURST JEWISH COMMUNITY,-73.978016,40.613930,Religious Institution,culture,Brooklyn,,
80,1427,2.0,4,PADRE PLAZA COMMUNITY GARDEN,-73.917511,40.807942,Recreational Facility,recreation,Bronx,,
107,1031895,1.0,3,ARTS CLUB STUDIO BUILDING,-73.986696,40.737716,Cultural Facility,culture,Manhattan,,
119,1026988,4.0,4,KISSENA VELODROME,-73.809428,40.744777,Recreational Facility,recreation,Queens,,
...,...,...,...,...,...,...,...,...,...,...,...
20416,17202,3.0,4,FORT HAMILTON SENIOR CENTER,-74.031729,40.611847,Recreational Facility,recreation,Brooklyn,,
20438,1039384,1.0,4,JEFFERSON MARKET GARDEN,-73.999398,40.734517,Recreational Facility,recreation,Manhattan,,
20451,1036222,2.0,4,HENRY HUDSON PARKWAY MALLS,-73.908634,40.890159,Recreational Facility,recreation,Bronx,,
20458,1019884,3.0,4,ANCHORAGE PLAZA,-73.992168,40.702010,Recreational Facility,recreation,Brooklyn,,
