# Uber Pick-Ups Clustering in NYC

### Project goal: build unsupervised ML models to identify high-potential "hot zones", so that Uber drivers are in the right place at the right time of day and maximize the profit of their shifts

In [2]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

### Import Data

In [3]:
df_apr14 = pd.read_csv('uber-raw-data-apr14.csv')
df_may14 = pd.read_csv('uber-raw-data-may14.csv')
df_jun14 = pd.read_csv('uber-raw-data-jun14.csv')
df_jul14 = pd.read_csv('uber-raw-data-jul14.csv')
df_aug14 = pd.read_csv('uber-raw-data-aug14.csv')
df_sep14 = pd.read_csv('uber-raw-data-sep14.csv')

frames = [df_apr14, df_may14, df_jun14, df_jul14, df_aug14, df_sep14]
df14 = pd.concat(frames, ignore_index=True)
df14.head()

Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512


In [4]:
df14.shape

(4534327, 4)

In [5]:
df14.describe(include='all')

Unnamed: 0,Date/Time,Lat,Lon,Base
count,4534327,4534327.0,4534327.0,4534327
unique,260093,,,5
top,4/7/2014 20:21:00,,,B02617
freq,97,,,1458853
mean,,40.73926,-73.97302,
std,,0.03994991,0.0572667,
min,,39.6569,-74.929,
25%,,40.7211,-73.9965,
50%,,40.7422,-73.9834,
75%,,40.761,-73.9653,


In [6]:
df14.isnull().sum()

Date/Time    0
Lat          0
Lon          0
Base         0
dtype: int64

In [7]:
df14.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4534327 entries, 0 to 4534326
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   Date/Time  object 
 1   Lat        float64
 2   Lon        float64
 3   Base       object 
dtypes: float64(2), object(2)
memory usage: 138.4+ MB


There are no missing values. However, the Date/Time column is stored as strings (dtype `object`), so we need to parse it and extract its individual components.
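
Every timestamp shown so far follows the same month/day/year pattern (for example `4/1/2014 0:11:00`). Passing an explicit `format` to `pd.to_datetime` is one way to make the conversion unambiguous, and it is usually faster on millions of rows. This is a minimal sketch on a toy frame. The frame `toy` and the format string are assumptions inferred from the sample rows above, not something the notebook itself specifies.

```python
import pandas as pd

# Toy rows copied from the head of the dataset shown above.
toy = pd.DataFrame({"Date/Time": ["4/1/2014 0:11:00", "4/1/2014 0:17:00"]})

# Assumed format: month/day/year hour:minute:second (US style, no zero padding).
toy["Date/Time"] = pd.to_datetime(toy["Date/Time"], format="%m/%d/%Y %H:%M:%S")

# Once the column is datetime64, the .dt accessor exposes the components.
toy["day_of_week"] = toy["Date/Time"].dt.day_name()
toy["hour"] = toy["Date/Time"].dt.hour
print(toy)
```

If some rows did not match the assumed format, `pd.to_datetime` would raise an error, which makes malformed timestamps visible early.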

# Data Cleaning

In [18]:
df14['Date/Time'] = pd.to_datetime(df14['Date/Time'])
df14['year'] = df14['Date/Time'].dt.year
df14['month'] = df14['Date/Time'].dt.month
df14['day'] = df14['Date/Time'].dt.day
df14['day_of_week'] = df14['Date/Time'].dt.day_name()
df14['hour'] = df14['Date/Time'].dt.hour
df14['minute'] = df14['Date/Time'].dt.minute


display(df14.head())
print(df14.dtypes)

Unnamed: 0,Date/Time,Lat,Lon,Base,year,month,day,day_of_week,hour,minute
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,2014,4,1,Tuesday,0,11
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,2014,4,1,Tuesday,0,17
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,2014,4,1,Tuesday,0,21
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,2014,4,1,Tuesday,0,28
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,2014,4,1,Tuesday,0,33


Date/Time      datetime64[ns]
Lat                   float64
Lon                   float64
Base                   object
year                    int32
month                   int32
day                     int32
day_of_week            object
hour                    int32
minute                  int32
dtype: object


# EDA

The DataFrame contains more than 4.5 million rows, so we plot only a sample of the Uber pickups on a map.

In [19]:
data_sample = df14.sample(int(len(df14)*0.005)) # sample 0.5% of data
fig = px.scatter_mapbox(data_sample, lat='Lat', lon='Lon')
fig.update_layout(mapbox_style="open-street-map", width=600, height=300,
                  margin=dict(l=0, r=0, b=0, t=0))
fig.show()

In [None]:
data_day_hour = (df14.groupby(['day_of_week', 'hour'])
                    .size()
                    .reset_index(name='count'))

# Order the days of the week (Monday first)
jour_ordre = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
data_day_hour['day_of_week'] = pd.Categorical(data_day_hour['day_of_week'],
                                             categories=jour_ordre,
                                             ordered=True)

fig = px.line(
    data_day_hour,
    x='hour',
    y='count',
    color='day_of_week',
    facet_col='day_of_week',
    category_orders={'day_of_week': jour_ordre},  # facets follow jour_ordre, not first appearance
    title="Pickups by day of the week and hour"
)

fig.update_layout(showlegend=False, height=600)
fig.update_xaxes(tickvals=list(range(0, 25, 6)))  # same hour ticks on every facet
fig.show()




This plot shows the cumulative Uber pickup activity over the months of 2014 covered by the data (April to September), by day of the week and hour. Contrary to what one might expect, Saturday is not the busiest day. The plot alone does not explain why; the US convention of starting the week on Sunday is one possible reading, but it remains a hypothesis rather than something the data demonstrates.

In [44]:
df14.head()

Unnamed: 0,Date/Time,Lat,Lon,Base,year,month,day,day_of_week,hour,minute
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,2014,4,1,Tuesday,0,11
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,2014,4,1,Tuesday,0,17
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,2014,4,1,Tuesday,0,21
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,2014,4,1,Tuesday,0,28
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,2014,4,1,Tuesday,0,33


# Clustering on a specific day and hour

We focus on Thursday at 5 p.m. (hour 17), since according to our EDA this is the day and hour with the largest peak in activity.
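
The busiest (day, hour) pair can also be read off programmatically rather than from the plot. This is a minimal sketch on a hypothetical miniature of `df14`. The frame `toy` and its values are made up for illustration, and only the `groupby(...).size()` pattern comes from the EDA cell.

```python
import pandas as pd

# Hypothetical miniature of df14: one row per pickup.
toy = pd.DataFrame({
    "day_of_week": ["Thursday"] * 3 + ["Saturday"] * 2 + ["Thursday"],
    "hour":        [17, 17, 17, 23, 23, 8],
})

# Count pickups per (day, hour) pair, as in the EDA cell.
counts = toy.groupby(["day_of_week", "hour"]).size()

# idxmax returns the index label (here a (day, hour) tuple) of the largest count.
peak_day, peak_hour = counts.idxmax()
print(peak_day, peak_hour, counts.max())
```

Applied to the real `df14`, the same two lines would confirm or correct the day and hour picked by eye.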

In [50]:
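# .copy() makes the selection an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
data_day_hour_selected = df14[(df14['day_of_week'] == 'Thursday') & (df14['hour'] == 17)].drop(columns='Date/Time').copy()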
print(data_day_hour_selected.shape)
data_day_hour_selected

(56704, 9)


Unnamed: 0,Lat,Lon,Base,year,month,day,day_of_week,hour,minute
3119,40.7675,-73.9666,B02512,2014,4,3,Thursday,17,0
3120,40.7688,-73.8624,B02512,2014,4,3,Thursday,17,0
3121,40.7356,-74.0079,B02512,2014,4,3,Thursday,17,1
3122,40.6816,-73.9255,B02512,2014,4,3,Thursday,17,2
3123,40.7677,-73.9826,B02512,2014,4,3,Thursday,17,2
...,...,...,...,...,...,...,...,...,...
4492503,40.6774,-73.9764,B02764,2014,9,25,Thursday,17,59
4492504,40.7141,-74.0029,B02764,2014,9,25,Thursday,17,59
4492505,40.7511,-73.9824,B02764,2014,9,25,Thursday,17,59
4492506,40.7012,-73.9428,B02764,2014,9,25,Thursday,17,59


In [51]:
numeric_features_no_change = [0, 1]
numeric_transformer_no_change = Pipeline(steps=[
   ('passthrough', FunctionTransformer(lambda x: x))
])

# Only one preprocessing step is needed here, and it leaves the geographic coordinates unchanged.
# The date and time columns are not needed by the model.
# Encoding the Base code would add too many categorical columns for a clustering model to handle.

preprocessor = ColumnTransformer(
    transformers=[
        ('num_no_change', numeric_transformer_no_change, numeric_features_no_change), # only Lat and Lon go through the preprocessor, untouched
    ])

# Preprocessing on the dataset
print("Preprocessing the selected data...")
print(data_day_hour_selected.head())
X = preprocessor.fit_transform(data_day_hour_selected) # fit_transform !!
print('...Done.')
print(X[0:5, :])
print()

Preprocessing the selected data...
          Lat      Lon    Base  year  month  day day_of_week  hour  minute
3119  40.7675 -73.9666  B02512  2014      4    3    Thursday    17       0
3120  40.7688 -73.8624  B02512  2014      4    3    Thursday    17       0
3121  40.7356 -74.0079  B02512  2014      4    3    Thursday    17       1
3122  40.6816 -73.9255  B02512  2014      4    3    Thursday    17       2
3123  40.7677 -73.9826  B02512  2014      4    3    Thursday    17       2
...Done.
[[ 40.7675 -73.9666]
 [ 40.7688 -73.8624]
 [ 40.7356 -74.0079]
 [ 40.6816 -73.9255]
 [ 40.7677 -73.9826]]
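
Because the pipeline above only forwards `Lat` and `Lon`, the same matrix can be obtained with less machinery. The sketch below is an illustration on a toy frame, not the notebook's own code. It uses `ColumnTransformer`'s built-in `'passthrough'` option with column names and checks that the result matches a plain column selection. The names `toy`, `selector`, `X_ct` and `X_direct` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy frame with the same leading columns as data_day_hour_selected.
toy = pd.DataFrame({
    "Lat": [40.7675, 40.7688, 40.7356],
    "Lon": [-73.9666, -73.8624, -74.0079],
    "Base": ["B02512", "B02512", "B02512"],
})

# 'passthrough' forwards the listed columns unchanged; columns that are
# not listed are dropped by default (remainder='drop').
selector = ColumnTransformer([("geo", "passthrough", ["Lat", "Lon"])])
X_ct = np.asarray(selector.fit_transform(toy))

# Plain pandas selection of the same two columns.
X_direct = toy[["Lat", "Lon"]].to_numpy()
print(np.array_equal(X_ct, X_direct))
```

Selecting by name rather than by position (`[0, 1]`) also keeps the selection correct if the column order of the DataFrame changes.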



### KMeans

In [52]:
wcss = []  # within-cluster sum of squares (inertia) for each k
sil = []   # mean silhouette score for each k
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    sil.append(silhouette_score(X, kmeans.predict(X)))

print(wcss)
print(sil)

[126.03980576875453, 104.24519791406748, 78.51684677063278, 74.88918971727772, 50.77273795925569, 44.61018897147604, 37.70029646945845, 31.126536335766993, 28.092738554482242]
[0.7455990500143701, 0.7553101600074494, 0.45954125688312575, 0.4403337964652114, 0.46548177153982767, 0.3657454963439317, 0.4457475616576948, 0.4518462782743762, 0.4182736713066453]
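
The sweep above does not fix a random seed, so the numbers can change from run to run, and the full silhouette computation is expensive on tens of thousands of points. One possible variant, sketched here on synthetic data, sets `random_state` and `n_init` on `KMeans` and uses the `sample_size` argument of `silhouette_score` to score a random subset. The blobs from `make_blobs` are a stand-in for `X`, and all parameter values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: three well-separated 2-D blobs.
X_toy, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 0], [0, 10]],
                      cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    # Fixed seed and explicit n_init make the run repeatable.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    # sample_size scores a random subset instead of every pairwise distance.
    scores[k] = silhouette_score(X_toy, km.labels_, sample_size=300, random_state=0)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

With a fixed seed, the inertia and silhouette lists become comparable across notebook runs, which makes the choice of k easier to defend.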


In [53]:
fig = px.line(x = range(2,11), y = wcss, height=600, width=800)
fig.show()

In [54]:
fig = px.bar(x = range(2,11), y = sil, height=600, width=800)
fig.show()

At first glance, the elbow curve and the silhouette scores could suggest n_clusters = 5, since the elbow curve shows a marked drop between 5 and 6. However, the silhouette score at 5 is not very high (about 0.44). We therefore choose n_clusters = 3, which balances a silhouette score of about 0.75 (excellent, and the highest of the sweep), geographic plausibility, and a good trade-off between complexity and performance.

As a reminder, the elbow in the inertia curve indicates that:
- before the elbow, each added cluster explains a lot (inertia drops sharply)
- after the elbow, each added cluster adds little
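
To make the elbow reading concrete, the marginal drop in inertia for each added cluster can be computed directly. This sketch uses the inertia values printed by the sweep cell, rounded to two decimals. The variable names are illustrative.

```python
# Inertia values for k = 2..10, rounded from the output of the sweep cell.
wcss = [126.04, 104.25, 78.52, 74.89, 50.77, 44.61, 37.70, 31.13, 28.09]
ks = list(range(2, 11))

# drops[k] = how much inertia decreased when going from k-1 to k clusters.
drops = {k: wcss[i - 1] - wcss[i] for i, k in enumerate(ks) if i > 0}

for k, d in drops.items():
    print(f"{k - 1} -> {k}: inertia drops by {d:.2f}")
```

In this run the drops are not steadily shrinking: going from 4 to 5 clusters gains little, while going from 5 to 6 gains a lot. Such irregularity can come from KMeans converging to different local optima, so the elbow is best read together with the silhouette scores.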

Silhouette = (b - a) / max(a, b), where:
- a = mean distance to the other points of the same cluster
- b = mean distance to the points of the nearest other cluster

The score ranges from -1 to 1, so a silhouette of 0.75 is excellent and indicates very distinct, meaningful clusters.
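
The silhouette formula can be checked by hand on a tiny one-dimensional example. The helper `silhouette_of_point` below is a hypothetical illustration of the definition, not scikit-learn's implementation. It assumes `own_cluster` lists the other members of the point's cluster, excluding the point itself.

```python
from statistics import mean

def silhouette_of_point(point, own_cluster, other_clusters):
    """Silhouette of one 1-D point; own_cluster must not contain the point itself."""
    # a: mean distance to the other points of the same cluster
    a = mean(abs(point - p) for p in own_cluster)
    # b: mean distance to the points of the nearest other cluster
    b = min(mean(abs(point - p) for p in cluster) for cluster in other_clusters)
    return (b - a) / max(a, b)

# Point 0.0 shares a cluster with 1.0; the other cluster is {10.0, 11.0}.
s = silhouette_of_point(0.0, own_cluster=[1.0], other_clusters=[[10.0, 11.0]])
print(round(s, 3))
```

Here a = 1 and b = 10.5, so the point scores (10.5 - 1) / 10.5, close to 1. A point equally far from both clusters would score 0.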

In [61]:
kmeans = KMeans(n_clusters= 3)
kmeans.fit(X)

In [62]:
# add a column to the DataFrame holding the cluster assigned by KMeans
data_day_hour_selected.loc[:,'Cluster_KMeans'] = kmeans.predict(X)
data_day_hour_selected.head()

Unnamed: 0,Lat,Lon,Base,year,month,day,day_of_week,hour,minute,Cluster_KMeans
3119,40.7675,-73.9666,B02512,2014,4,3,Thursday,17,0,0
3120,40.7688,-73.8624,B02512,2014,4,3,Thursday,17,0,2
3121,40.7356,-74.0079,B02512,2014,4,3,Thursday,17,1,1
3122,40.6816,-73.9255,B02512,2014,4,3,Thursday,17,2,1
3123,40.7677,-73.9826,B02512,2014,4,3,Thursday,17,2,0


In [63]:
px.scatter_mapbox(
    data_day_hour_selected,
    lat="Lat",
    lon="Lon",
    color="Cluster_KMeans",
    mapbox_style="carto-positron",
    zoom=10,
    height=600,
    width=800
)
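
Beyond the map, a per-cluster summary gives each candidate hot zone a center and a pickup count. This is a minimal sketch on a toy frame with the same column names as `data_day_hour_selected`. The values are made up for illustration, and `summary` is a hypothetical name.

```python
import pandas as pd

# Toy frame with the same columns as the clustered DataFrame.
toy = pd.DataFrame({
    "Lat": [40.76, 40.77, 40.68, 40.69, 40.70],
    "Lon": [-73.97, -73.98, -73.93, -73.94, -73.95],
    "Cluster_KMeans": [0, 0, 1, 1, 1],
})

# One row per cluster: number of pickups and mean coordinates,
# sorted so that the busiest cluster comes first.
summary = (toy.groupby("Cluster_KMeans")
              .agg(n_pickups=("Lat", "size"),
                   center_lat=("Lat", "mean"),
                   center_lon=("Lon", "mean"))
              .sort_values("n_pickups", ascending=False))
print(summary)
```

For KMeans the mean coordinates per cluster should match `kmeans.cluster_centers_` up to convergence tolerance, but the count column is what ranks the zones by demand.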

# DBSCAN