# Block 3 - Predictive analysis of structured data with artificial intelligence - Uber pickups

## Introduction

Uber is a company founded in 2009, initially to provide ridesharing services. Since then, Uber has expanded its activities to delivery (food, packages, couriers), freight transport, and alternative means of urban transportation (bikes, scooters). Uber operates in about 70 countries and 10,500 cities and generates an average of 23 million trips per day.

### Problem statement

Uber experiences ride cancellations because drivers are not always close enough to users.

The company would like to recommend hot-zones to its drivers.

### Scope

Uber aims to create an application that would recommend hot-zones to its drivers at any given time of the day. The Uber team already has data about pickups in major cities. They would like to test this new feature by focusing on New York City.

### Aim and objectives

Overall aim: Identify pickup hot-zones in New York City.

Objectives:
- 1 - Find hot-zones (compare two unsupervised algorithms).
- 2 - Visualize results on a dashboard.

## Methods

### 1 - Library import

### 2 - File reading and basic exploration

The dataset was composed of 8 files containing information on taxi and Uber pickups during the years 2014 and 2015. For the purpose of this project, only information about Uber pickups for the months of April, May, June, and September 2014 was used. Data from July and August 2014 were excluded to avoid introducing a bias due to summer holidays. Data from 2015 were excluded due to a lack of geographic coordinates.

The resulting dataset contained information on 2,908,931 pickups, including geographic coordinates, date, and time. It did not contain any missing values.

### 3 - Preprocessing and exploratory data analysis

Before performing an exploratory data analysis, two features were created, for the day of the week and the hour of the pickups.
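As a minimal sketch of this feature creation, the example below uses two made-up timestamps (they are assumptions for illustration, not rows from the dataset). It mainly shows the pandas convention for days of the week, which runs from 0 (Monday) to 6 (Sunday):

```python
import pandas as pd

# hypothetical timestamps, not taken from the dataset
toy = pd.DataFrame({"Date/Time": ["2014-04-01 08:15:00", "2014-04-06 23:40:00"]})
toy["Date/Time"] = pd.to_datetime(toy["Date/Time"])

# day_of_week is 0 for Monday, ..., 6 for Sunday
toy["day_of_week"] = toy["Date/Time"].dt.day_of_week
toy["hour"] = toy["Date/Time"].dt.hour

print(toy)
```

This convention is why Sundays are selected with `day_of_week == 6` later in the notebook.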

The pickup time distribution was displayed for each day of the week, as well as a map summarizing the locations of the pickups. It appeared that most rides occur at the end of each day, and that pickup locations could be narrowed down around Manhattan.

### 4 - Data selection

First, pickups considered as outliers (with latitude or longitude further away than four standard deviations from their respective means) were dropped.

Then, data was selected according to pickup peak hours, when user demand is highest. Indeed, it was assumed that high demand could cause a drop in drivers' availability, and therefore an increase in waiting time. Thus it was more relevant to identify hot-zones for peak hours, to respond to as many requests as possible. To identify peak hours for each day of the week, hours were sorted by decreasing pickup probability, and the hours whose cumulative probability stayed below 50% were selected as peak hours.
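The cumulative cut-off can be illustrated on a toy example. The ten pickup hours below are invented for the illustration; only the 50% threshold comes from the Code section:

```python
import pandas as pd

# hypothetical pickup hours for one day (10 pickups)
hours = pd.Series([18, 18, 18, 18, 17, 17, 17, 8, 8, 3], name = "hour")

# share of pickups per hour, sorted from the busiest hour down
proba = hours.value_counts(normalize = True).sort_values(ascending = False)

# cumulative share; hours are kept while the running total stays below 50%
proba_cum = proba.cumsum()
peak_hours = proba_cum[proba_cum < 0.5].index.tolist()

print(peak_hours)
```

Here only hour 18 (40% of pickups) is kept, because adding hour 17 would bring the running total to 70%. With a strict cut-off, the selected hours therefore cover somewhat less than half of the pickups.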

### 5 - Clustering with K-Means

The optimal number of clusters k was determined for each day of the week with the Elbow method, testing k from 1 to 10. To identify the maximum number of relevant clusters, the highest k leading to a decrease of more than 5% of the within-cluster sum of squares (expressed as a percentage of its value for k = 1) was selected.
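The selection rule can be sketched on made-up numbers. The within-cluster sums of squares below are hypothetical; the normalization and the 5-point threshold mirror what the Code section does:

```python
import numpy as np

# hypothetical within-cluster sums of squares for k = 1..6
ks = np.arange(1, 7)
wcss = np.array([1000.0, 600.0, 400.0, 330.0, 300.0, 290.0])

# express each value as a percentage of the k = 1 value
wcss_norm = wcss * 100 / wcss[0]

# drop (in percentage points) obtained when going from k-1 to k clusters
drop = wcss_norm[:-1] - wcss_norm[1:]

# highest k whose drop still exceeds 5 percentage points
best_k = int(ks[1:][drop > 5].max())

print(best_k)
```

In this example the successive drops are 40, 20, 7, 3, and 1 points, so k = 4 is the last value that still brings a gain above the threshold.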

K-Means clustering was then performed for each day of the week with the corresponding k and the default k-means++ method for initialization of the clusters.

### 6 - Clustering with DBSCAN

Clustering was also performed on the same data with DBSCAN to compare the two clustering methods. Since the data consisted of geographic coordinates, and since we were dealing with car rides in streets organized at right angles, the Manhattan distance was chosen as the metric to calculate distances between pickups. The maximum distance between pickups to be considered as neighbors (epsilon) was selected empirically and was identical for all days of the week (0.005 degrees). The minimum number of neighbors for a point to be considered as a core point was set to 1/500 of the number of peak-hour pickups for each day to account for differences in pickup density between days.
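To get a feel for these parameters, the sketch below converts epsilon from degrees to approximate ground distances. It assumes roughly 111 km per degree of latitude and a reference latitude of 40.75° (both are assumptions for illustration, as are the two example pickups and the helper `manhattan_deg`); the value 0.005 is the `eps` used in the Code section.

```python
import math

# epsilon used for the per-day DBSCAN (in degrees, Manhattan metric)
eps_deg = 0.005

# approximate ground length of one degree (assumed values)
km_per_deg_lat = 111.0
lat_nyc = 40.75  # assumed reference latitude, roughly mid-Manhattan
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat_nyc))

def manhattan_deg(p, q):
    """Manhattan (L1) distance between two (lat, lon) points, in degrees."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# two hypothetical pickups
a = (40.7500, -73.9900)
b = (40.7520, -73.9880)

d = manhattan_deg(a, b)
print("neighbors under eps:", d <= eps_deg)
print("eps north-south (m):", round(eps_deg * km_per_deg_lat * 1000))
print("eps east-west (m):", round(eps_deg * km_per_deg_lon * 1000))
```

Because a degree of longitude is shorter than a degree of latitude at this latitude, the same epsilon covers a few hundred meters in both directions but reaches further north-south than east-west.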

### 7 - Dashboard

For each day of the week, hot-zones during peak hours were displayed on the New York City map. Results obtained with the two methods were plotted in parallel for comparison.

### 8 - Focus on a Sunday in Manhattan

As a proof of concept, clustering was performed with DBSCAN to depict hot-zones per hour for Sundays in Manhattan.

## Conclusion

Uber's dataset, covering almost 3,000,000 pickups, allowed for the identification of hot-zones at peak hours, which could help better respond to user demand and therefore decrease the number of ride cancellations.

As expected for this type of data, where some clusters are elongated (the Manhattan district, for example) and where clusters have different shapes and sizes, DBSCAN performed better than K-Means in defining hot-zones. These hot-zones were broadly the same across days. They included the Manhattan district (with a very high demand all over the district), the three airports (John F. Kennedy International Airport, Newark Liberty International Airport, and LaGuardia Airport), and areas in Brooklyn such as Williamsburg (a trendy neighborhood also called "Little Berlin"), which are close to the bridges to Manhattan. Amusingly enough, a small cluster appeared during the weekend right at the Ikea store.
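The behavior of the two algorithms on elongated clusters can be reproduced on synthetic data. The sketch below is an illustration only: the points, `eps`, and `min_samples` are made up and unrelated to the Uber dataset. It builds one long, thin cluster and one small compact cluster, then compares the labels.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# synthetic data (not from the Uber dataset):
# one elongated cluster (a 20-unit-long line) and one small compact cluster
line = np.column_stack([np.linspace(0, 20, 400, endpoint = False), np.zeros(400)])
gx, gy = np.meshgrid(np.arange(5) * 0.05, np.arange(4) * 0.05)
blob = np.column_stack([10 + gx.ravel(), 3 + gy.ravel()])
points = np.vstack([line, blob])

# k-means is told there are two clusters
kmeans_labels = KMeans(n_clusters = 2, n_init = 10, random_state = 0).fit_predict(points)

# dbscan only needs a neighborhood radius and a density threshold
dbscan_labels = DBSCAN(eps = 0.5, min_samples = 5, metric = "manhattan").fit_predict(points)

print("k-means labels found along the line:", np.unique(kmeans_labels[:400]))
print("dbscan labels found along the line: ", np.unique(dbscan_labels[:400]))
print("dbscan labels of the compact cluster:", np.unique(dbscan_labels[400:]))
```

K-Means minimizes distances to cluster centers, so it tends to cut the long cluster in two, whereas DBSCAN follows chains of dense neighbors and keeps it whole.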

Because of limitations regarding the display of maps, the analysis was limited to peak hours and days of the week. However, the dashboard could easily be augmented with hot-zones depicted per hour for each day. Additionally, a map of hot-zones dedicated to the Manhattan district could be very insightful. A proof of concept for Sundays in Manhattan showed this approach to be feasible. Despite current limitations, Uber could already ask its drivers to be more present in the identified hot-zones at peak hours.

## Code

### 1 - Library import

In [None]:
### 1 - library import ### ----

import pandas as pd
import numpy as np

from sklearn.cluster import KMeans, DBSCAN

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


### 2 - File reading and basic exploration

In [None]:
### 2 - file reading and basic exploration - import dataset ### ----

# load data
data_uber_apr14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-apr14.csv")
data_uber_may14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-may14.csv")
data_uber_jun14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-jun14.csv")
data_uber_sep14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-sep14.csv")

# excluded data
#data_taxi = pd.read_csv("cnm_bloc3-3_taxi-zone-lookup.csv")
#data_uber_jul14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-jul14.csv")
#data_uber_aug14 = pd.read_csv("cnm_bloc3-3_uber-raw-data-aug14.csv")
#data_uber_janjune15 = pd.read_csv("cnm_bloc3-3_uber-raw-data-janjune-15.csv")

# concatenate selected data
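# ignore_index gives every row a unique label (each file restarts its index at 0)
data = pd.concat([data_uber_apr14, data_uber_may14, data_uber_jun14, data_uber_sep14], axis = 0,
    ignore_index = True)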


In [None]:
### 2 - file reading and basic exploration - get basic stats ### ----

# print shape of data
print("Number of rows: {}".format(data.shape[0]))
print("Number of columns: {}".format(data.shape[1]))
print()

# display dataset
pd.set_option('display.max_columns', None)
print("Dataset display: ")
display(data.head())
print()

# display basic statistics
print("Basic statistics: ")
data_desc = data.describe(include='all')
display(data_desc)
print()

# display percentage of missing values in columns and rows
percent_nan_col = data.isnull().sum() / data.shape[0] * 100
print("Percentage of missing values per column:\n{}".format(percent_nan_col))
print()
# share of rows (not columns) in which every value is missing
percent_nan_row = data.isnull().all(axis = 1).sum() / data.shape[0] * 100
print("Percentage of rows fully filled with missing values: {}".format(percent_nan_row))


### 3 - Preprocessing and exploratory data analysis

In [None]:
### 3 - preprocessing and exploratory data analysis - create features from date ### ----

# copy data for safety
data1 = data.copy()

# create usable features from the date column
# (infer_datetime_format is deprecated in recent pandas versions, where inference is the default)
data1["Date/Time"] = pd.to_datetime(data1["Date/Time"])
data1["day_of_week"] = data1["Date/Time"].dt.day_of_week
data1["hour"] = data1["Date/Time"].dt.hour

# drop useless columns
data1 = data1.drop(["Base", "Date/Time"], axis = 1)


In [None]:
### 3 - preprocessing and exploratory data analysis - distributions and map ### ----

# get unique days of week
days_unique = np.sort(data1["day_of_week"].unique())

# set figure to make subplots
fig1 = make_subplots(
    rows = 2,
    cols = 4,
    specs = [[{}, {}, {}, {}], [{}, {}, {}, {'type': 'mapbox'}]],
    subplot_titles = (
        "Monday",
        "Tuesday",
        "Wednesday",
        "Thursday",
        "Friday",
        "Saturday",
        "Sunday",
        "All days"),
    horizontal_spacing = 0.10,
    vertical_spacing = 0.30)

# plot distribution of pickups per day of the week (4 panels on row 1, 3 panels on row 2)
for i, day in enumerate(days_unique):
    fig1.add_trace(go.Histogram(
        x = data1.loc[data1["day_of_week"] == day,"hour"],
        marker_color = px.colors.qualitative.Vivid[7],
        showlegend = False),
        row = i // 4 + 1, col = i % 4 + 1)

# plot map
data_fig1 = data1.sample(5000, random_state = 0)
fig1.add_trace(go.Scattermapbox(
        lat = data_fig1["Lat"], 
        lon = data_fig1["Lon"],
        marker_color = px.colors.qualitative.Vivid[7],
        marker_size = 2),
        row = 2, col = 4)

# update layout
fig1.update_xaxes(title = "Hour", tickfont = dict(size = 10))
fig1.update_yaxes(tickfont = dict(size = 10), range = [-1000, 48000])
fig1.update_layout(
        margin = dict(l = 90),
        title_text = "Figure 1. Pickup time distribution",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 500)
fig1.update_mapboxes(
    style = "carto-positron",
    center = dict(
        lat = data1.loc[:,"Lat"].mean(),
        lon = data1.loc[:,"Lon"].mean()),
    zoom = 7.5)

fig1.show()


### 4 - Data selection

In [None]:
### 4 - data selection - narrow down around Manhattan ### ----

# consider as outliers pickups whose latitude or longitude is more than four standard deviations 
# away from the mean

# copy data for safety
data2 = data1.copy()

# set columns to check
columns_check = ["Lat","Lon"]

# initialise count for dropped rows
drop_count = 0

# loop through columns
for column in columns_check:

    # get lower and upper bounds
    bound_lower = data2[column].mean() - 4 * data2[column].std()
    bound_upper = data2[column].mean() + 4 * data2[column].std()

    # flag rows containing outliers
    drop_mask = (data2[column] < bound_lower) | (data2[column] > bound_upper)
    drop_count += drop_mask.sum()

    # drop rows containing outliers (a boolean mask stays correct even if index labels repeat)
    data2 = data2.loc[~drop_mask,:]

# print number of rows that were dropped
print("Number of rows that were dropped: {}".format(drop_count))

In [None]:
### 4 - data selection - narrow down to peak hours ### ----

# get unique days of week
days_unique = np.sort(data2["day_of_week"].unique())

# initialise variable to downsampled data
data_sample = pd.DataFrame(columns = ["Lat","Lon","day_of_week"])

# loop through days
for day in days_unique:

    # get data for the current day
    data_current = data2.loc[data2["day_of_week"] == day,:]

    # sort hours by pickup probability
    pickup_proba = pd.DataFrame((data_current["hour"].value_counts() / \
        data_current.shape[0]).sort_values(ascending = False)).reset_index()
    pickup_proba.columns = ["hour","proba"]

    # select peak hours (cumulative probabilities < 50%)
    pickup_proba["proba_cum"] = np.cumsum(pickup_proba["proba"])
    peak_hours = pickup_proba.loc[pickup_proba["proba_cum"] < 0.5,"hour"].values
    data_current = data_current.loc[data_current["hour"].isin(peak_hours),:]

    # store data
    data_sample = pd.concat([data_sample, data_current.loc[:,["Lat","Lon","day_of_week"]]])


### 5 - Clustering with K-Means

In [None]:
### 5 - clustering with k-means - get optimal k with elbow ### ----

# get optimal number of clusters per day of the week

# initialise variable to store number of clusters
cluster_nb = pd.DataFrame(index = range(0,len(days_unique)), columns = ["day", "cluster_nb"])
cluster_nb["day"] = days_unique

# loop through days
for day in days_unique:

    # set mask for day
    mask_day = data_sample["day_of_week"] == day

    # get current data
    data_current = data_sample.loc[mask_day,["Lat","Lon"]]

    # initialise temporary variable to store within sum of square
    wcss_temp = []
    wcss =  pd.DataFrame(index = range(0,10), columns = ["k","wcss","wcss_norm","diff"])
    wcss["k"] = range(1,11)

    # get within-cluster sum of squares for k ranging from 1 to 10
    for k in wcss["k"]: 
        kmeans = KMeans(n_clusters = k, random_state = 0, n_init = "auto")
        kmeans.fit(data_current)
        wcss_temp.append(kmeans.inertia_)
    
    # normalize wcss and select best k
    wcss["wcss"] = wcss_temp
    wcss["wcss_norm"] = wcss["wcss"] * 100 / wcss.loc[0,"wcss"]
    wcss.loc[1:,"diff"] = wcss.loc[0:8,"wcss_norm"].values - wcss.loc[1:,"wcss_norm"].values
    best_k = wcss.loc[wcss["diff"] > 5,"k"].max()
    cluster_nb.loc[cluster_nb["day"] == day,"cluster_nb"] = best_k


In [None]:
### 5 - clustering with k-means - cluster with k-means ### ----

# initialise variable to store clustered pickups and cluster weights
kmeans_data = pd.DataFrame(columns = ["Lat","Lon","day_of_week","cluster","cluster_weight","weight_sort"])

# loop through days
for day in days_unique:

    # set mask for day
    mask_day = data_sample["day_of_week"] == day

    # get current data
    data_current = data_sample.loc[mask_day,["Lat","Lon","day_of_week"]]
    cluster_nb_current = cluster_nb.loc[cluster_nb["day"] == day,"cluster_nb"].values[0]

    # fit k-means 
    kmeans = KMeans(n_clusters = cluster_nb_current, random_state = 0, n_init = "auto")
    kmeans.fit(data_current.loc[:,["Lat","Lon"]])

    # get clusters
    data_current = data_current.assign(cluster = kmeans.predict(data_current.loc[:,["Lat","Lon"]]))

    # add cluster weight (number of pickups in the cluster) for plotting
    cluster_weights = data_current["cluster"].value_counts()
    data_current["cluster_weight"] = data_current["cluster"].map(cluster_weights)

    # rank clusters by decreasing weight (0 = heaviest cluster) to color the map
    cluster_rank = cluster_weights.rank(method = "min", ascending = False).astype(int) - 1
    data_current["weight_sort"] = data_current["cluster"].map(cluster_rank)

    # concatenate
    kmeans_data = pd.concat([kmeans_data, data_current], ignore_index = True)



### 6 - Clustering with DBSCAN

In [None]:
### 6 - clustering with dbscan ### ----

# initialise variable to store clustered pickups and cluster weights
dbscan_data = pd.DataFrame(columns = ["Lat","Lon","day_of_week","cluster","cluster_weight","weight_sort"])

# loop through days
for day in days_unique:

    # set mask for day
    mask_day = data_sample["day_of_week"] == day

    # get current data
    data_current = data_sample.loc[mask_day,:]

    # fit dbscan
    dbscan = DBSCAN(eps = 0.005, min_samples = np.round(data_current.shape[0] / 500).astype(int), 
        metric = "manhattan", n_jobs = -2)
    dbscan.fit(data_current.loc[:,["Lat","Lon"]])

    # get clusters
    data_current = data_current.assign(cluster = dbscan.labels_)

    # add cluster weight (share of the day's pickups, in percent) for plotting
    cluster_weights = data_current["cluster"].value_counts() * 100 / data_current.shape[0]
    data_current["cluster_weight"] = data_current["cluster"].map(cluster_weights)

    # rank clusters by decreasing weight (0 = heaviest cluster) to color the map
    cluster_rank = cluster_weights.rank(method = "min", ascending = False).astype(int) - 1
    data_current["weight_sort"] = data_current["cluster"].map(cluster_rank)

    # concatenate
    dbscan_data = pd.concat([dbscan_data, data_current], ignore_index = True)


### 7 - Dashboard

In [None]:
### 7 - dashboard ### ----

# set figure to make subplots
fig2 = make_subplots(
    rows = 7,
    cols = 2,
    specs = [[{'type': 'mapbox'}, {'type': 'mapbox'}], [{'type': 'mapbox'}, {'type': 'mapbox'}], 
        [{'type': 'mapbox'}, {'type': 'mapbox'}], [{'type': 'mapbox'}, {'type': 'mapbox'}], 
        [{'type': 'mapbox'}, {'type': 'mapbox'}], [{'type': 'mapbox'}, {'type': 'mapbox'}], 
        [{'type': 'mapbox'}, {'type': 'mapbox'}]],
    subplot_titles = (
        "K-Means", "DBSCAN"),
    column_widths = [0.40, 0.40],
    horizontal_spacing = 0.15,
    vertical_spacing = 0.03)

# plot k-means clusters
for i in range(0,7):
    data_current = kmeans_data.loc[kmeans_data["day_of_week"] == days_unique[i],:].sample(n = 20000,
        random_state = 0)
    fig2.add_trace(go.Scattermapbox(
        lat = data_current["Lat"], 
        lon = data_current["Lon"],
        marker_colorscale = "Portland",
        marker_color = data_current["weight_sort"],
        marker_size = 3),
        row = i+1, col = 1)

# plot dbscan clusters (without outliers)
for i in range(0,7):
    data_current = dbscan_data.loc[dbscan_data["day_of_week"] == days_unique[i],:].sample(n = 20000,
        random_state = 0)
    drop_index = data_current.loc[data_current["cluster"] == -1,:].index
    data_current = data_current.drop(drop_index, axis = 0)
    fig2.add_trace(go.Scattermapbox(
        lat = data_current["Lat"], 
        lon = data_current["Lon"],
        marker_colorscale = "Portland",
        marker_color = data_current["weight_sort"],
        marker_size = 3),
        row = i+1, col = 2)

# add day annotations
fig2.add_annotation(text = "Monday", xref = "paper", yref = "paper", x = -0.07, y = 0.96, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Tuesday", xref = "paper", yref = "paper", x = -0.07, y = 0.81, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Wednesday", xref = "paper", yref = "paper", x = -0.07, y = 0.67, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Thursday", xref = "paper", yref = "paper", x = -0.07, y = 0.50, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Friday", xref = "paper", yref = "paper", x = -0.07, y = 0.35, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Saturday", xref = "paper", yref = "paper", x = -0.07, y = 0.19, textangle = -90, 
    showarrow = False)    
fig2.add_annotation(text = "Sunday", xref = "paper", yref = "paper", x = -0.07, y = 0.04, textangle = -90, 
    showarrow = False)    
fig2.update_annotations(font_size = 15)

# update layout
fig2.update_layout(
        margin = dict(l = 90, t= 120),
        title_text = "Figure 2. Hot-zones per week day",
        title_x = 0.5,
        title_y = 0.98,
        title_font_size = 18,
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 2000)
fig2.update_mapboxes(
    style="carto-positron",
    center = dict(
        lat = data2.loc[:,"Lat"].mean(),
        lon = data2.loc[:,"Lon"].mean()),
    zoom = 8.5)

fig2.show()


### 8 - Focus on a Sunday in Manhattan

In [None]:
### 8 - focus on a sunday in manhattan - process data ### ----

# copy data for safety
data3 = data1.copy()

# part 1 - select data

# set geographic boundaries for manhattan
bound_south = 40.70
bound_north = 40.87
bound_east = -73.92
bound_west = -74.02

# set mask and select data to keep
mask_keep = (data3["Lat"] > bound_south) & (data3["Lat"] < bound_north) & (data3["Lon"] < bound_east) & \
    (data3["Lon"] > bound_west) & (data3["day_of_week"] == 6)
data3 = data3.loc[mask_keep,:]


# part 2 - cluster with dbscan

# initialise variable to store clustered pickups
manhattan_data = pd.DataFrame(columns = ["Lat","Lon","hour","cluster"])

# get unique hours, sorted so that they are processed in chronological order
hours_unique = np.sort(data3["hour"].unique())

# loop through hours
for hour in hours_unique:

    # set mask for hour
    mask_hour = data3["hour"] == hour

    # get current data
    data_current = data3.loc[mask_hour,["Lat","Lon","hour"]]

    # fit dbscan
    dbscan = DBSCAN(eps = 0.0012, min_samples = np.round(data_current.shape[0] / 200).astype(int), 
        metric = "manhattan", n_jobs = -2)
    dbscan.fit(data_current.loc[:,["Lat","Lon"]])

    # get clusters
    data_current = data_current.assign(cluster = dbscan.labels_)

    # concatenate
    manhattan_data = pd.concat([manhattan_data, data_current], ignore_index = True)

# drop outliers for plotting
manhattan_data = manhattan_data.loc[manhattan_data["cluster"] != -1,:]


In [None]:
### 8 - focus on a sunday in manhattan - plot clusters on map ### ----

# plot dbscan clusters (without outliers)
fig3 = px.scatter_mapbox(
    manhattan_data, 
    lat = "Lat", 
    lon = "Lon", 
    color = "cluster",
    animation_frame = "hour")

fig3["layout"].pop("updatemenus") 

# update layout
fig3.update_layout(
        title_text = "Figure 3. Hot-zones in Manhattan on Sundays",
        title_x = 0.5,
        title_y = 0.95,
        title_font_size = 18,
        showlegend = False,
        plot_bgcolor = "rgba(0,0,0,0)",
        paper_bgcolor = "rgb(232,232,232)",
        width = 800,
        height = 600)
fig3.update_mapboxes(
    style="carto-positron",
    center = dict(
        lat = bound_north - (bound_north - bound_south) / 2,
        lon = bound_east - (bound_east - bound_west) / 2),
    zoom = 10)

fig3.show()