## Uber Project 🚧

Uber's data team would like to work on a project where **their app would recommend hot-zones in major cities to be in at any given time of day.**  

In this project, we will :
* Create an algorithm to find hot zones 
* Visualize results on a map

#### Goal : show hot-zones for Uber pick-ups in New York City. 

Even though Uber wants hot-zones per hour and per day of week, for dataset size reasons we will focus only on September 2014, the month when people go back to school and work after the summer holidays. 

In [104]:
import pandas as pd
import datetime
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, DBSCAN
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")

In [105]:
df = pd.read_csv("data/6_uber-raw-data-sep14.csv", parse_dates=['Date/Time']) 
print(df.shape)
df.head()

(1028136, 4)


| | Date/Time | Lat | Lon | Base |
|---|---|---|---|---|
| 0 | 2014-09-01 00:01:00 | 40.2201 | -74.0021 | B02512 |
| 1 | 2014-09-01 00:01:00 | 40.75 | -74.0027 | B02512 |
| 2 | 2014-09-01 00:03:00 | 40.7559 | -73.9864 | B02512 |
| 3 | 2014-09-01 00:06:00 | 40.745 | -73.9889 | B02512 |
| 4 | 2014-09-01 00:11:00 | 40.8145 | -73.9444 | B02512 |


In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028136 entries, 0 to 1028135
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   Date/Time  1028136 non-null  datetime64[ns]
 1   Lat        1028136 non-null  float64       
 2   Lon        1028136 non-null  float64       
 3   Base       1028136 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 31.4+ MB


Each row in the dataset corresponds to one pick-up request from a client. There were more than 1 million pick-up requests in September 2014.

## EDA : data preprocessing and visualisation

In [107]:
# The dataset is quite big, so we only work on a random sample of 10,000 rows
sample = 10000
df = df.sample(sample, random_state=0)

In [108]:
# The Base column is not useful for this analysis
df = df.drop('Base', axis = 1)

**Datetime processing**

In [109]:
df['Hour'] = df['Date/Time'].dt.hour
df['Day'] = df['Date/Time'].dt.day
df['DayOfWeek'] = df['Date/Time'].dt.strftime('%A')
df['Weekend'] = df['DayOfWeek'].apply(lambda x : 'Weekday' if x in ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] else 'Weekend')
df['Time_Slot'] = df['Hour'].apply(lambda x : 'Rush_Hour' if x in [7, 8, 9, 17, 18, 19] 
                                   else 'Day' if x in [10, 11, 12, 13, 14, 15, 16]
                                   else 'Evening' if x in [20, 21, 22, 23]
                                   else  'Night')
df['DayOfWeek_TimeSlot'] = df['DayOfWeek']+'_'+df['Time_Slot']
df = df.sort_values(['Day', 'Hour'], ascending = [True, True])
df.head()

| | Date/Time | Lat | Lon | Hour | Day | DayOfWeek | Weekend | Time_Slot | DayOfWeek_TimeSlot |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 2014-09-01 00:48:00 | 40.7378 | -74.0395 | 0 | 1 | Monday | Weekday | Night | Monday_Night |
| 34385 | 2014-09-01 00:05:00 | 40.7257 | -73.99 | 0 | 1 | Monday | Weekday | Night | Monday_Night |
| 275048 | 2014-09-01 00:13:00 | 40.6886 | -73.9559 | 0 | 1 | Monday | Weekday | Night | Monday_Night |
| 275254 | 2014-09-01 00:59:00 | 40.7173 | -74.0018 | 0 | 1 | Monday | Weekday | Night | Monday_Night |
| 275002 | 2014-09-01 00:05:00 | 40.7034 | -73.9908 | 0 | 1 | Monday | Weekday | Night | Monday_Night |


**Evolution of demand**

In [122]:
fig = px.histogram(df, x='Date/Time', nbins=500, title = 'Number of demands throughout the month of September')
fig.update_xaxes(showgrid=True,
                 rangeslider = go.layout.xaxis.Rangeslider(visible = True)) 

fig.show()

September 1st, 2014 was a Monday. 

The histogram shows both a daily and a weekly seasonality.
Several factors that can influence the number of pick-up requests (weather, holidays...) are not in this dataset. We therefore choose to study the effect of the day of the week, combined with the time of day. 

The time of day matters because it lets us separate rush hour on weekdays from evening/night pick-ups on weekends.

In [123]:
fig = px.histogram(df, x="DayOfWeek", pattern_shape = 'Weekend', color="Time_Slot", text_auto = True, title = "Number of demands according to day of week")
fig.show()

In [124]:
fig = px.histogram(df, x="DayOfWeek", pattern_shape = 'Weekend', color="Time_Slot", barnorm='percent', text_auto='.2f', title = "Percentage of demands in time slot according to day of week")
fig.show()

Observations :

- demand is higher on Tuesdays and Saturdays than on the other days
- there is significantly more demand on Friday and Saturday evenings, as people tend to go out and drink, and during the night on Sundays (which corresponds to the end of a Saturday night out)
- rush hour accounts for more of the demand (about 10% more) on weekdays than on weekends
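
The shares read off the normalized histogram can also be checked numerically. The sketch below is only an illustration on made-up rows (the `toy` DataFrame is hypothetical, it merely reuses the notebook's `Weekend` and `Time_Slot` column names): `pd.crosstab` with `normalize="index"` gives, for each row category, the percentage of requests falling in each time slot.

```python
import pandas as pd

# Toy data, made up for illustration; same column names as the notebook's df
toy = pd.DataFrame({
    "Weekend":   ["Weekday"] * 6 + ["Weekend"] * 4,
    "Time_Slot": ["Rush_Hour", "Rush_Hour", "Rush_Hour", "Day", "Evening", "Night",
                  "Rush_Hour", "Evening", "Night", "Night"],
})

# Percentage of requests per time slot, computed separately for each row category
share = pd.crosstab(toy["Weekend"], toy["Time_Slot"], normalize="index") * 100
print(share.round(1))

# Difference in rush-hour share, expressed in percentage points
gap = share.loc["Weekday", "Rush_Hour"] - share.loc["Weekend", "Rush_Hour"]
print(f"Rush-hour share gap: {gap:.1f} percentage points")
```

On the real sample, the same call with `df["Weekend"]` and `df["Time_Slot"]` (or `df["DayOfWeek"]` for the per-day view) should give the numbers behind the chart.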


In [114]:
# Splitting the data by timeframes : an independent clustering will be run on each timeframe

timeframes = df['DayOfWeek_TimeSlot'].unique()
timeframe_list = []
for timeframe in timeframes:
    timeframe_list.append(df[df['DayOfWeek_TimeSlot'] == timeframe].copy())  # copy, so a 'cluster' column can be added later without writing to a slice of df
print("Number of timeframes: ", len(timeframe_list))

Number of timeframes:  28


## KMEANS clustering algorithm

- The lower the WCSS (inertia), the closer the points of a given cluster are to each other. We want an optimal 'k' with a low inertia (so that drivers do not have to drive too far to find clients).
- The higher the silhouette score, the better the separation between clusters, so we also want to keep this score as high as possible. 

**=> To get the best compromise between the two, we will standardize both metrics and, for each timeframe, keep the 'k' that maximizes `score = scaled silhouette score - scaled WCSS inertia`.**
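
To make the selection rule concrete, here is a minimal sketch on synthetic data. The three blob centers, the noise level and the range of `k` are assumptions chosen for illustration, not values from the Uber dataset. Both metrics are standardized across the candidate values of `k`, and the `k` with the highest difference is kept.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic pick-up coordinates: three blobs around hypothetical centers
rng = np.random.default_rng(0)
centers = np.array([[40.75, -73.99], [40.68, -73.97], [40.77, -73.87]])
X = np.vstack([c + rng.normal(scale=0.005, size=(100, 2)) for c in centers])

# Collect inertia (WCSS) and silhouette score for each candidate k
rows = []
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    rows.append({
        "K": k,
        "WCSS": kmeans.inertia_,
        "Silhouette Score": silhouette_score(X, kmeans.labels_),
    })
k_choice = pd.DataFrame(rows).set_index("K")

# Standardize both columns, then score = scaled silhouette - scaled WCSS
scaled = StandardScaler().fit_transform(k_choice[["WCSS", "Silhouette Score"]])
k_choice["score"] = scaled[:, 1] - scaled[:, 0]
best_k = int(k_choice["score"].idxmax())

print(k_choice.round(3))
print("Selected k:", best_k)
```

Because inertia generally keeps decreasing as `k` grows, this combined score can favor a larger `k` than the silhouette score alone would; printing the whole table makes that trade-off visible.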

In [115]:
# Identifying for each timeframe the best number of clusters to use

print(f"Finding the optimum k hyperparameter for each of the {len(timeframe_list)} timeframes :")

optimum_k_list = []
sc = StandardScaler()

for i in range(len(timeframe_list)):
    wcss_list = []
    sil_list = []
    k_list = []
    for k in range(2, 20, 2):
        X = timeframe_list[i][['Lat', 'Lon']]
        kmeans = KMeans(n_clusters = k, random_state = 0)
        kmeans.fit(X)
        wcss_list.append(kmeans.inertia_)
        sil_list.append(silhouette_score(X, kmeans.predict(X)))
        k_list.append(k)
        
    k_choice = pd.DataFrame({'K':k_list, 'WCSS':wcss_list, 'Silhouette Score':sil_list}).set_index('K')
    k_choice_scaled = sc.fit_transform(k_choice)
    k_choice['score'] = k_choice_scaled[:,1] - k_choice_scaled[:,0]
    optimum_k = k_choice.index[k_choice['score'].argmax()]
    optimum_k_list.append(optimum_k)

print(optimum_k_list)
print("Done ! ")

Finding the optimum k hyperparameter for each of the 28 timeframes :
[18, 10, 16, 12, 18, 16, 12, 10, 12, 6, 12, 18, 18, 18, 2, 4, 18, 8, 8, 10, 18, 10, 10, 12, 18, 16, 12, 12]
Done ! 


In [116]:
# Run optimized KMeans on each timeframe
 
print(f"Fitting one KMeans on each of the {len(timeframe_list)} timeframes :") 

for i in range(len(timeframe_list)):
    kmeans = KMeans(n_clusters = optimum_k_list[i], random_state = 0)
    kmeans.fit(timeframe_list[i][['Lat', 'Lon']])
    timeframe_list[i]['cluster'] = kmeans.labels_
    timeframe_list[i] = timeframe_list[i].sort_values('cluster')
    
print("Done !                 ")

Fitting one KMeans on each of the 28 timeframes :
Done !                 


In [117]:
df = pd.concat(timeframe_list)
df

| | Date/Time | Lat | Lon | Hour | Day | DayOfWeek | Weekend | Time_Slot | DayOfWeek_TimeSlot | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 90687 | 2014-09-08 03:40:00 | 40.7436 | -73.9808 | 3 | 8 | Monday | Weekday | Night | Monday_Night | 0 |
| 275786 | 2014-09-01 04:04:00 | 40.7556 | -73.9726 | 4 | 1 | Monday | Weekday | Night | Monday_Night | 0 |
| 550028 | 2014-09-22 06:28:00 | 40.7489 | -73.9747 | 6 | 22 | Monday | Weekday | Night | Monday_Night | 0 |
| 276010 | 2014-09-01 06:08:00 | 40.7677 | -73.9842 | 6 | 1 | Monday | Weekday | Night | Monday_Night | 0 |
| 366563 | 2014-09-08 00:55:00 | 40.7523 | -73.9744 | 0 | 8 | Monday | Weekday | Night | Monday_Night | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 698924 | 2014-09-07 23:33:00 | 40.6816 | -73.9742 | 23 | 7 | Sunday | Weekend | Evening | Sunday_Evening | 10 |
| 260782 | 2014-09-28 22:25:00 | 40.7322 | -73.8244 | 22 | 28 | Sunday | Weekend | Evening | Sunday_Evening | 11 |
| 366083 | 2014-09-07 22:41:00 | 40.7025 | -73.8176 | 22 | 7 | Sunday | Weekend | Evening | Sunday_Evening | 11 |
| 879507 | 2014-09-07 23:26:00 | 40.7399 | -73.8277 | 23 | 7 | Sunday | Weekend | Evening | Sunday_Evening | 11 |


In [132]:
for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] :

    fig = px.scatter_mapbox(df[df['DayOfWeek']==day], lat="Lat", lon="Lon", color=df[df['DayOfWeek']==day]['cluster'], title = day + ' clusters with KMeans', animation_frame = "DayOfWeek_TimeSlot", animation_group = "DayOfWeek_TimeSlot", zoom=10,color_continuous_scale='turbo', mapbox_style="carto-positron", height = 800)
    fig.show()
    with open('KMeans_graph.html', 'a') as f:
        f.write(fig.to_html(full_html=False, include_plotlyjs='cdn'))

## DBSCAN clustering algorithm

In [119]:
# Set the DBSCAN constant hyperparameters

eps = 0.5/100 # 0.005 degrees, i.e. roughly 0.5 km (1 degree of latitude is about 111 km)
metric = 'manhattan' 
min_samples = int(0.01 * np.mean([len(tf) for tf in timeframe_list])) # min_samples is set to 1% of the average number of pick-ups per timeframe

# Run distinct DBSCANs on each timeframe
 
print(f"Fitting one DBSCAN on each of the {len(timeframe_list)} timeframes (min_samples = {min_samples}) :") 
timeframes_nb = len(timeframe_list)

for i in range(timeframes_nb):
    dbscan = DBSCAN(eps = eps, min_samples = min_samples, metric = metric, algorithm="brute")
    dbscan.fit(timeframe_list[i][['Lat', 'Lon']])
    timeframe_list[i]['cluster'] = dbscan.labels_
    timeframe_list[i] = timeframe_list[i].sort_values('cluster')
    
print("Done !          ")

Fitting one DBSCAN on each of the 28 timeframes (min_samples = 3) :
Done !          
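
Note that `eps` above is expressed in degrees, and that a degree of longitude is shorter than a degree of latitude at New York's latitude, so the "0.5 km" radius is only approximate. One possible alternative, not used in this notebook, is to let DBSCAN work with great-circle distances through scikit-learn's `haversine` metric, which expects `[lat, lon]` in radians and an `eps` in radians as well. The helper name `dbscan_haversine`, the Earth radius constant and the four sample points below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used to convert km to radians


def dbscan_haversine(lat_lon_deg, eps_km=0.5, min_samples=3):
    """Cluster (lat, lon) points given in degrees, with eps expressed in km."""
    coords_rad = np.radians(np.asarray(lat_lon_deg, dtype=float))
    model = DBSCAN(
        eps=eps_km / EARTH_RADIUS_KM,  # haversine distances are in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)


# Made-up points: three close together in Midtown, one far away near JFK
points = [
    (40.7580, -73.9855),
    (40.7582, -73.9857),
    (40.7579, -73.9853),
    (40.6413, -73.7781),
]
labels = dbscan_haversine(points, eps_km=0.5, min_samples=3)
print(labels)
```

As in the notebook's DBSCAN run, points that do not belong to any dense area receive the label `-1` (noise).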


In [120]:
df_dbscan = pd.concat(timeframe_list)
df_dbscan

| | Date/Time | Lat | Lon | Hour | Day | DayOfWeek | Weekend | Time_Slot | DayOfWeek_TimeSlot | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| 90687 | 2014-09-08 03:40:00 | 40.7436 | -73.9808 | 3 | 8 | Monday | Weekday | Night | Monday_Night | -1 |
| 261205 | 2014-09-29 03:37:00 | 40.7054 | -73.7620 | 3 | 29 | Monday | Weekday | Night | Monday_Night | -1 |
| 9174 | 2014-09-08 03:46:00 | 40.7748 | -73.9843 | 3 | 8 | Monday | Weekday | Night | Monday_Night | -1 |
| 261284 | 2014-09-29 04:23:00 | 40.7737 | -73.9892 | 4 | 29 | Monday | Weekday | Night | Monday_Night | -1 |
| 549704 | 2014-09-22 05:43:00 | 40.7863 | -73.9775 | 5 | 22 | Monday | Weekday | Night | Monday_Night | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 794845 | 2014-09-21 20:54:00 | 40.7506 | -74.0032 | 20 | 21 | Sunday | Weekend | Evening | Sunday_Evening | 12 |
| 89444 | 2014-09-07 20:12:00 | 40.7490 | -74.0067 | 20 | 7 | Sunday | Weekend | Evening | Sunday_Evening | 12 |
| 260665 | 2014-09-28 21:54:00 | 40.6781 | -73.9808 | 21 | 28 | Sunday | Weekend | Evening | Sunday_Evening | 13 |
| 879124 | 2014-09-07 21:09:00 | 40.6764 | -73.9803 | 21 | 7 | Sunday | Weekend | Evening | Sunday_Evening | 13 |


In [133]:
for day in ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'] :

    fig = px.scatter_mapbox(df_dbscan[df_dbscan['DayOfWeek']==day], lat="Lat", lon="Lon", color=df_dbscan[df_dbscan['DayOfWeek']==day]['cluster'], title = day + ' clusters with DBSCAN', animation_frame = "DayOfWeek_TimeSlot", animation_group = "DayOfWeek_TimeSlot", zoom=10,color_continuous_scale='turbo', mapbox_style="carto-positron", height = 800)
    fig.show()
    with open('DBSCan_graph.html', 'a') as f:
        f.write(fig.to_html(full_html=False, include_plotlyjs='cdn'))