# Uber Pickups
#### <i>Author: Delphine César</i>

## Table of contents

<ul>
   <li><a href="#import">I - Import of libraries and dataset</a></li>
   <li><a href="#info">II - Dataset information</a></li>
   <li><a href="#engineering">III - Data engineering</a></li>
   <li><a href="#eda">IV - EDA</a></li>
   <li><a href="#ml">V - Machine Learning</a></li>
      <ul>
         <li><a href="#preprocessing">1 - Preprocessing</a></li>
         <li><a href="#kmeans">2 - Kmeans</a></li>
            <ul>
               <li><a href="#elbow">a - Elbow</a></li>
               <li><a href="#silhouette">b - Silhouette</a></li>
               <li><a href="#kmeanstrain">c - Model Training</a></li>
               <li><a href="#kmeansvisualization">d - Clustering Visualization</a></li>
            </ul>
         <li><a href="#dbscan">3 - DBScan</a></li>
            <ul> 
               <li><a href="#dbscanvisualization">a - Clustering Visualization</a></li>
            </ul>
      </ul>
</ul>

<a id='import'></a>
### I - Import of libraries and dataset

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

In [2]:
april14 = pd.read_csv("uber-raw-data-apr14.csv")
sep14 = pd.read_csv("uber-raw-data-sep14.csv")

dataset = pd.concat([april14, sep14])

<a id='info'></a>
### II - Dataset information

In [3]:
print("Number of rows : {}".format(dataset.shape[0]))
print()

print("Number of columns : {}".format(dataset.shape[1]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include='all')
display(data_desc)
print()

print("Percentage of missing values: ")
display(100*dataset.isnull().sum()/dataset.shape[0])

print("Columns type")
display(dataset.info())

Number of rows : 1592652

Number of columns : 4

Display of dataset: 


Unnamed: 0,Date/Time,Lat,Lon,Base
0,4/1/2014 0:11:00,40.769,-73.9549,B02512
1,4/1/2014 0:17:00,40.7267,-74.0345,B02512
2,4/1/2014 0:21:00,40.7316,-73.9873,B02512
3,4/1/2014 0:28:00,40.7588,-73.9776,B02512
4,4/1/2014 0:33:00,40.7594,-73.9722,B02512



Basics statistics: 


Unnamed: 0,Date/Time,Lat,Lon,Base
count,1592652,1592652.0,1592652.0,1592652
unique,84906,,,5
top,4/7/2014 20:21:00,,,B02617
freq,97,,,485696
mean,,40.7395,-73.97359,
std,,0.03921414,0.05569756,
min,,39.9897,-74.7736,
25%,,40.7213,-73.9967,
50%,,40.742,-73.9837,
75%,,40.761,-73.9654,



Percentage of missing values: 


Date/Time    0.0
Lat          0.0
Lon          0.0
Base         0.0
dtype: float64

Columns type
<class 'pandas.core.frame.DataFrame'>
Index: 1592652 entries, 0 to 1028135
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   Date/Time  1592652 non-null  object 
 1   Lat        1592652 non-null  float64
 2   Lon        1592652 non-null  float64
 3   Base       1592652 non-null  object 
dtypes: float64(2), object(2)
memory usage: 60.8+ MB


None

<a id='engineering'></a>
### III - Data engineering

In [4]:
# Convert the date column to datetime objects
dataset["Date/Time"] = pd.to_datetime(dataset["Date/Time"])

# Extract the day of the week (Monday = 0), the hour and the month
dataset["dayofweek"] = dataset["Date/Time"].dt.dayofweek
dataset["hour"] = dataset["Date/Time"].dt.hour
dataset["month"] = dataset["Date/Time"].dt.month

dataset.head()

Unnamed: 0,Date/Time,Lat,Lon,Base,dayofweek,hour,month
0,2014-04-01 00:11:00,40.769,-73.9549,B02512,1,0,4
1,2014-04-01 00:17:00,40.7267,-74.0345,B02512,1,0,4
2,2014-04-01 00:21:00,40.7316,-73.9873,B02512,1,0,4
3,2014-04-01 00:28:00,40.7588,-73.9776,B02512,1,0,4
4,2014-04-01 00:33:00,40.7594,-73.9722,B02512,1,0,4
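
The raw strings such as `4/1/2014 0:11:00` are ambiguous on their own (April 1st or January 4th). A minimal sketch of the same feature extraction with an explicit format is shown below. It assumes the month comes first, which matches the parsed rows above; the second timestamp is made up for illustration.

```python
import pandas as pd

# Two timestamps in the same style as the raw "Date/Time" column
# (the second one is illustrative, not copied from the dataset)
raw = pd.Series(["4/1/2014 0:11:00", "9/30/2014 22:57:00"])

# Explicit month/day/year format, so pandas does not have to guess
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

features = pd.DataFrame({
    "dayofweek": parsed.dt.dayofweek,  # Monday = 0, Sunday = 6
    "hour": parsed.dt.hour,
    "month": parsed.dt.month,
})
print(features)
```

Passing `format` also tends to speed up parsing on large columns, since pandas no longer has to infer the layout.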


<a id='eda'></a>
### IV - EDA

In [5]:
dataset_sample = dataset.sample(20000)

In [24]:
fig = px.histogram(dataset_sample, x="dayofweek", title='Pickups by day of week', height=400)
fig.update_layout(title_x=0.5)
fig.show()

fig = px.histogram(dataset_sample, x="hour", title='Pickups by hour of day', height=400)
fig.update_layout(title_x=0.5)
fig.show()

- There are fewer pickups on Mondays and Sundays.
- The peak hour for pickups is 6pm, which is probably when people leave work.

<a id='ml'></a>
### V - Machine Learning

<a id='preprocessing'></a>
##### 1 - Preprocessing

In [8]:
# Drop useless columns
dataset_sample.drop("Date/Time",axis=1, inplace=True)
dataset_sample.drop("Base",axis=1, inplace=True)

In [9]:
# Creation of a column with the days of the week written out in full 
day_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday",
             4: "Friday", 5: "Saturday", 6: "Sunday"}
dataset_sample['dayofweekfull'] = dataset_sample['dayofweek'].map(day_names)
dataset_sample = dataset_sample.sort_values(by = "dayofweek")

In [10]:
# We chose the latitude and longitude features to train our model
X = dataset_sample[["Lat", "Lon"]]
preprocessor = StandardScaler()
X = preprocessor.fit_transform(X)
X[:5]

array([[-0.20861232, -0.03516526],
       [ 0.98654156, -0.01683121],
       [-0.65041066, -0.66035639],
       [ 0.03654745, -0.62368829],
       [ 0.39151837, -0.03516526]])
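
As a reminder of what the scaling step does, here is a minimal sketch on three made-up coordinates (the values are illustrative, not taken from the dataset). `StandardScaler` subtracts each column's mean and divides by its population standard deviation, so latitude and longitude end up on comparable scales before any distance is computed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (lat, lon) pairs, not taken from the Uber dataset
coords = np.array([[40.70, -74.00],
                   [40.75, -73.98],
                   [40.80, -73.90]])

scaled = StandardScaler().fit_transform(coords)

# Same result by hand: (x - mean) / std, with the population std (ddof=0)
manual = (coords - coords.mean(axis=0)) / coords.std(axis=0)

print(np.allclose(scaled, manual))
```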

<a id='kmeans'></a>
##### 2 - Kmeans

<a id='elbow'></a>
a - Elbow

In [11]:
wcss =  []
k = []
for i in range (2,10): 
    kmeans = KMeans(n_clusters= i, random_state = 0, n_init='auto')
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
    k.append(i)
    print("WCSS for K={} --> {}".format(i, wcss[-1]))

WCSS for K=2 --> 28867.507993002924
WCSS for K=3 --> 19163.01746624043
WCSS for K=4 --> 15335.466110445055
WCSS for K=5 --> 12638.671121495701
WCSS for K=6 --> 10014.611950084913
WCSS for K=7 --> 9095.90749990509
WCSS for K=8 --> 7154.053964567218
WCSS for K=9 --> 6265.204161093208


In [12]:
# Let's visualize using plotly
# Create DataFrame
wcss_frame = pd.DataFrame(wcss)
k_frame = pd.Series(k)

# Create figure
fig= px.line(
    wcss_frame,
    x=k_frame,
    y=wcss_frame.iloc[:,-1]
)

# Create title and axis labels
fig.update_layout(
    yaxis_title="Inertia",
    xaxis_title="# Clusters",
    title="Inertia per number of clusters",
    title_x=0.5
)
# Render
fig.show() 

The elbow method is used to determine the optimal number of clusters: the point at which the inertia curve forms a sharp bend is taken as the optimum. In this case, it could be 3 or 6.
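
To make the reading of the curve a little less subjective, one possible sketch is to compute the relative drop in inertia brought by each additional cluster, using the WCSS values printed above. A large drop followed by a much smaller one hints at an elbow. This is only a heuristic added for illustration, not part of the original analysis.

```python
# WCSS values copied from the output of the elbow loop
wcss_by_k = {
    2: 28867.507993002924,
    3: 19163.01746624043,
    4: 15335.466110445055,
    5: 12638.671121495701,
    6: 10014.611950084913,
    7: 9095.90749990509,
    8: 7154.053964567218,
    9: 6265.204161093208,
}

# Relative decrease in inertia when going from K-1 to K clusters
relative_drop = {
    k: (wcss_by_k[k - 1] - wcss_by_k[k]) / wcss_by_k[k - 1]
    for k in sorted(wcss_by_k)[1:]
}

for k, drop in relative_drop.items():
    print("K={}: inertia decreases by {:.1%}".format(k, drop))
```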

<a id='silhouette'></a>
b - Silhouette

In [14]:
# Compute the mean silhouette score
sil = []
k = []

## Careful, you need to start at i=2 as silhouette score cannot accept less than 2 labels 
for i in range (2,10): 
    kmeans = KMeans(n_clusters= i, random_state = 0, n_init='auto')
    kmeans.fit(X)
    sil.append(silhouette_score(X, kmeans.predict(X)))
    k.append(i)
    print("Silhouette score for K={} is {}".format(i, sil[-1]))

Silhouette score for K=2 is 0.5737547812411594
Silhouette score for K=3 is 0.40907917804813976
Silhouette score for K=4 is 0.4215498068510321
Silhouette score for K=5 is 0.4334511125309361
Silhouette score for K=6 is 0.4523856203358424
Silhouette score for K=7 is 0.40939507206115183
Silhouette score for K=8 is 0.4234065617017932
Silhouette score for K=9 is 0.42469375194656284


In [15]:
# Create a data frame 
cluster_scores=pd.DataFrame(sil)
k_frame = pd.Series(k)

# Create figure
fig = px.bar(data_frame=cluster_scores,  
             x=k, 
             y=cluster_scores.iloc[:, -1]
            )

# Add title and axis labels
fig.update_layout(
    yaxis_title="Silhouette Score",
    xaxis_title="# Clusters",
    title="Silhouette Score per number of clusters",
    title_x=0.5
)

# Render
fig.show() 

The silhouette score measures the quality of a clustering. A high mean silhouette score indicates well-separated clusters. Here, the two highest scores are obtained for K=2 and K=6.

I chose to train the model with 6 clusters, since K=6 is also one of the candidates suggested by the elbow method.
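
For reference, the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points of its own cluster and `b` its mean distance to the points of the nearest other cluster. The sketch below, run on synthetic data generated with `make_blobs` (an assumption made for illustration, not the Uber sample), shows that `silhouette_score` is the mean of these per-point coefficients.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic 2D data drawn around three centers (illustrative only)
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(points)

# One coefficient per point, always between -1 and 1
per_point = silhouette_samples(points, labels)

# silhouette_score is the mean of the per-point coefficients
mean_score = silhouette_score(points, labels)
print(np.isclose(per_point.mean(), mean_score))
```

Looking at `per_point` rather than only the mean can help spot clusters whose points sit close to a neighbouring cluster.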

<a id='kmeanstrain'></a>
c - Model training

In [16]:
kmeans = KMeans(n_clusters=6, random_state=0, n_init='auto')
# Fit kmeans to our dataset
kmeans.fit(X)

<a id='kmeansvisualization'></a>
d - Clustering Visualization

In [17]:
dataset_sample['Cluster_KMeans'] = kmeans.predict(X)
dataset_sample.head()

Unnamed: 0,Lat,Lon,dayofweek,hour,month,dayofweekfull,Cluster_KMeans
67290,40.7313,-73.9755,0,12,4,Monday,3
210817,40.7781,-73.9745,0,9,9,Monday,0
244197,40.714,-74.0096,0,17,4,Monday,3
640767,40.7409,-74.0076,0,22,9,Monday,3
468102,40.7548,-73.9755,0,19,9,Monday,0


In [18]:
fig = px.scatter_mapbox(dataset_sample, lat="Lat", lon="Lon", color="Cluster_KMeans",
                        mapbox_style="carto-positron",
                       animation_frame = "dayofweekfull")
fig.show()

This KMeans clustering is not really satisfying: it splits Manhattan into several clusters.
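
One way to inspect where the clusters sit is to map the centroids back from the scaled space to latitude and longitude. The sketch below does this on a handful of made-up coordinates; the names `scaler`, `model` and `centers_latlon` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative coordinates forming two small groups (not real pickups)
coords = np.array([[40.70, -74.01], [40.71, -74.00], [40.72, -74.02],
                   [40.64, -73.78], [40.65, -73.79], [40.66, -73.77]])

scaler = StandardScaler()
scaled = scaler.fit_transform(coords)

model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scaled)

# Centroids live in the scaled space; convert them back to degrees
centers_latlon = scaler.inverse_transform(model.cluster_centers_)
print(centers_latlon)
```

In the notebook, the same idea would apply to `preprocessor.inverse_transform(kmeans.cluster_centers_)`, which gives one (latitude, longitude) pair per cluster that can be plotted on the map.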

<a id='dbscan'></a>
##### 3 - DBScan

In [49]:
# Instantiate DBSCAN
db = DBSCAN(eps=0.2, min_samples=60, metric="manhattan", algorithm="auto")  # choice of hyperparameters
db.fit(X)

In [50]:
dataset_sample["Cluster_DBScan"] = db.labels_
dataset_sample.head()

Unnamed: 0,Lat,Lon,dayofweek,hour,month,dayofweekfull,Cluster_KMeans,Cluster_DBScan
67290,40.7313,-73.9755,0,12,4,Monday,3,0
210817,40.7781,-73.9745,0,9,9,Monday,0,0
244197,40.714,-74.0096,0,17,4,Monday,3,0
640767,40.7409,-74.0076,0,22,9,Monday,3,0
468102,40.7548,-73.9755,0,19,9,Monday,0,0


<a id='dbscanvisualization'></a>
a - Clustering Visualization

In [51]:
fig = px.scatter_mapbox(dataset_sample[dataset_sample["Cluster_DBScan"] != -1], lat="Lat", lon="Lon", color="Cluster_DBScan",
                        mapbox_style="carto-positron",
                       animation_frame = "dayofweekfull")
fig.show()

With DBScan we obtain the following clusters:
- Cluster 0: Manhattan and Brooklyn, where people live and work
- Cluster 1: LaGuardia airport, the closest airport to the city center
- Cluster 2: JFK airport, the main international airport
- Cluster 3: Newark airport, the oldest airport serving the New York area
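
DBSCAN labels as `-1` the points that do not belong to any dense region, which is why the maps filter out that label. The sketch below illustrates this on a small hand-built dataset (two dense grids and one isolated point). The `eps` and `min_samples` values are chosen for this toy example only and differ from the ones used in the notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A 5x5 grid of points spaced 0.1 apart: a small dense region
grid = np.array([(x, y) for x in np.arange(5) * 0.1 for y in np.arange(5) * 0.1])

# Two dense regions far from each other, plus one isolated point
data = np.vstack([grid, grid + 5.0, [[20.0, 20.0]]])

labels = DBSCAN(eps=0.5, min_samples=5, metric="manhattan").fit_predict(data)

# Clusters are numbered from 0; noise points receive the label -1
n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print("clusters:", n_clusters, "noise points:", n_noise)
```

On the real sample, `dataset_sample["Cluster_DBScan"].value_counts()` would give the same kind of summary, including how many pickups were treated as noise.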

In [52]:
# Focus on one day : Friday, each hour of the day
dataset_sample_friday = dataset_sample[dataset_sample["dayofweekfull"] == "Friday"]
dataset_sample_friday.head()

Unnamed: 0,Lat,Lon,dayofweek,hour,month,dayofweekfull,Cluster_KMeans,Cluster_DBScan
729650,40.7812,-73.9853,4,16,9,Friday,0,0
947361,40.7476,-73.8536,4,19,9,Friday,2,-1
5518,40.7505,-73.9767,4,16,9,Friday,0,0
602092,40.7698,-73.9603,4,18,9,Friday,0,0
905054,40.7191,-74.0064,4,23,9,Friday,3,0


In [53]:
# Friday only: animate the DBSCAN clusters hour by hour
fig = px.scatter_mapbox(dataset_sample_friday[dataset_sample_friday["Cluster_DBScan"] != -1].sort_values("hour"), lat="Lat", lon="Lon", color="Cluster_DBScan",
                        mapbox_style="carto-positron",
                       animation_frame = "hour")
fig.show()