# Uber: identify hot zones in New York City
- Objective: Create an algorithm to identify hot zones for ride pickups in NYC. Create clusters and identify their characteristics in order to reduce customer waiting time. Visualize the results on a city map.

![New York Boroughs](https://www.worldatlas.com/r/w768/upload/c6/23/73/shutterstock-152208935.jpg)

## Imports
- Data Exploration

In [None]:
# data handling
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import plotly.express as px
import datetime


# machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings("ignore")

##### Map
The New York area is defined here as the bounding box that runs from the southwest end of Staten Island to the north end of the Bronx and the eastern end of Queens.


- __Latitude__: 40.5479 to 40.8673
- __Longitude__: -74.0374 to -73.7467

In [1]:
%%HTML
<div class='tableauPlaceholder' id='viz1691700265760' style='position: relative'><noscript><a href='#'><img alt='NYC Rides Dashboard ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ub&#47;Uber_rides_Dashboard&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Uber_rides_Dashboard&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ub&#47;Uber_rides_Dashboard&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1691700265760');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='727px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

- Show images of all Uber rides, compared to those in the New York Area
- It will be necessary to apply a filter with latitude and longitude in order to obtain those observations within NYC

In [None]:
# vscode and jupyter notebook data imports

taxi = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/taxi-zone-lookup.csv")
april_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-apr14.csv")
may_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-may14.csv")
june_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-jun14.csv")
july_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-jul14.csv")
august_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-aug14.csv")
sept_14 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-sep14.csv")
# jan_june_15 = pd.read_csv("/Users/student/Desktop/UnsupervisedML_Uber/uber-trip-data/.uber-raw-data-janjune-15.csv")

In [None]:
# taxi = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/taxi-zone-lookup.csv")
# april_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-apr14.csv")
# may_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-may14.csv")
# june_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-jun14.csv")
# july_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-jul14.csv")
# august_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-aug14.csv")
# sept_14 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-sep14.csv")
# jan_june_15 = pd.read_csv("/content/drive/MyDrive/Jedha/UnsupervisedML_Uber/uber-trip-data/uber-raw-data-janjune-15.csv")

In [None]:
# display 5 samples of each dataframe
frames = [taxi, april_14, may_14, june_14, july_14, august_14, sept_14] # , jan_june_15]

for frame in frames:
    print("Shape:", frame.shape)
    display(frame.head())
    # add space
    print()

- The frames from April '14 to September '14 all share the same columns, so they will be joined. For now we work on these data and leave the _taxi_ and _jan_june_15_ dataframes pending.
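Before joining, it can help to confirm that the monthly frames really have matching columns. The following is a minimal sketch on two toy frames: the rows are invented, and only the column names (`Date/Time`, `Lat`, `Lon`, `Base`) are taken from the notebook. The names `toy_april`, `toy_may` and `toy_df` are hypothetical.

```python
import pandas as pd

# toy stand-ins for two monthly frames (values are invented)
toy_april = pd.DataFrame({
    "Date/Time": ["4/1/2014 0:11:00"],
    "Lat": [40.7690],
    "Lon": [-73.9549],
    "Base": ["B02512"],
})
toy_may = pd.DataFrame({
    "Date/Time": ["5/1/2014 0:02:00"],
    "Lat": [40.7521],
    "Lon": [-73.9914],
    "Base": ["B02512"],
})

toy_frames = [toy_april, toy_may]

# every frame must have the same columns in the same order
first_cols = list(toy_frames[0].columns)
same_schema = all(list(f.columns) == first_cols for f in toy_frames)
print("Same columns:", same_schema)

# ignore_index=True gives the joined frame a fresh 0..n-1 index
toy_df = pd.concat(toy_frames, ignore_index=True)
print(toy_df.shape)
```

Passing `ignore_index=True` is a design choice, not something the notebook does: without it, each monthly frame keeps its own row numbers and the joined index contains repeats.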

In [None]:
# join dataframes
frames = april_14, may_14, june_14, july_14, august_14, sept_14
df = pd.concat(frames)
df.sample(5)

In [None]:
# save csv file, send it to tableau
df.to_csv("uber_data.csv")

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# change "base" to category
df["Base"] = df["Base"].astype("category")

In [None]:
# change date to datetime
df["Date/Time"] = pd.to_datetime(df["Date/Time"])
df["Date/Time"].head()

In [None]:
# create individual columns for date and time for exploratory analysis
df["Day"] = df["Date/Time"].dt.day
df["Month"] = df["Date/Time"].dt.month
df["Year"] = df["Date/Time"].dt.year
df["Time"] = df["Date/Time"].dt.time

# drop Date/Time column
df.drop(columns="Date/Time", inplace=True)

df.head()

In [None]:
# sanity check
df.info()

In [None]:
# missing values
df.isna().sum()

In [None]:
# duplicates
df.duplicated().sum()

In [None]:
# percentage of duplicates
(df.duplicated().sum()  / len(df)) * 100

In [None]:
# drop duplicates
df.drop_duplicates(inplace=True)

# sanity check
df.duplicated().sum()

In [None]:
# filter for latitude and longitude, to obtain observations only for NYC

# set values for minimum and maximum latitudes and longitudes
min_lat = 40.5479
max_lat = 40.8673
min_lon = -74.0374
max_lon = -73.7467

# latitude and longitude mask: keep rows inside the bounding box
df = df[(df["Lat"] >= min_lat) & (df["Lat"] <= max_lat) &
        (df["Lon"] >= min_lon) & (df["Lon"] <= max_lon)]

df.shape
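A bounding-box filter needs four conditions: a lower and an upper bound for each of `Lat` and `Lon`. One way to keep them readable is `Series.between`, which is inclusive on both ends by default. This is a sketch on invented coordinates; `filter_bbox` and `toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def filter_bbox(frame, min_lat, max_lat, min_lon, max_lon):
    """Keep rows whose Lat/Lon fall inside the box (bounds inclusive)."""
    in_lat = frame["Lat"].between(min_lat, max_lat)
    in_lon = frame["Lon"].between(min_lon, max_lon)
    return frame[in_lat & in_lon]

# invented points: two inside the box, one too far north, one too far west
toy = pd.DataFrame({
    "Lat": [40.75, 41.20, 40.60, 40.70],
    "Lon": [-73.98, -73.90, -74.50, -73.80],
})

inside = filter_bbox(toy, 40.5479, 40.8673, -74.0374, -73.7467)
print(inside)
```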

## Exploratory Data Analysis

In [None]:
sns.heatmap(df.select_dtypes("number").corr(), annot=True)

- Correlation between the numeric variables is very low, so none of them is redundant with another.

In [None]:
print("Average latitude", round(df["Lat"].mean(),2))

In [None]:
print("Average longitude", round(df["Lon"].mean(),2))

In [None]:
# for loop to create countplot for selected variables

cols = ["Day", "Month", "Base"]

for col in cols:
    plt.figure(figsize=(10,4))
    sns.countplot(data=df,
                  x=col,
                  palette="muted")
    plt.title(col)
    plt.show()

##### Insight
- There is a relationship with the time of day: Uber rides are at their highest during peak hours, between 15:00 and 21:00.
- There is, on the other hand, no specific pattern relating the day of the month to the number of Uber rides.
- Uber rides rose throughout the period, with September having around double the rides of April.
- Bases B02598 and B02617 are the most used, while B02764 and B02512 are barely used.
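The countplots in the cell above cover `Day`, `Month` and `Base`; an hourly view needs an hour-of-day column, which the cells shown do not create. A minimal sketch, assuming it is derived from `Date/Time` before that column is dropped. The timestamps are invented, and `Hour` and `rides_per_hour` are hypothetical names.

```python
import pandas as pd

# invented timestamps standing in for the Date/Time column
toy = pd.DataFrame({
    "Date/Time": pd.to_datetime([
        "2014-04-01 08:15:00",
        "2014-04-01 17:05:00",
        "2014-04-01 17:40:00",
        "2014-04-02 18:30:00",
    ])
})

# same idea as the Day/Month/Year columns, applied to the hour of the day
toy["Hour"] = toy["Date/Time"].dt.hour

# number of rides per hour, ordered by hour
rides_per_hour = toy["Hour"].value_counts().sort_index()
print(rides_per_hour)
```

With such a column in place, `"Hour"` could be added to the `cols` list of the countplot loop.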

In [None]:
# for loop to create boxplot for selected variables

cols = ["Lat", "Lon"]

for col in cols:
    plt.figure(figsize=(12,3))
    sns.boxplot(data=df,
            x=col)
    plt.title(col)
    plt.show()

##### Insight
- Uber pickups are distributed differently along the two axes: they are spread far more from north to south (latitude) than from west to east (longitude).
- Most Uber rides are located in the center of Lower Manhattan.

## Machine Learning

### Preprocessing

In [None]:
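# copy so cluster labels can be added later without a SettingWithCopyWarning
X = df[["Lat", "Lon"]].copy()
# other candidate features, not used here: "Day", "Month", "Hour", "Minute", "Base"
X.head()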

In [None]:
X.info()

In [None]:
# # create dummy variables
# X = pd.get_dummies(X, dtype=int, drop_first=True)
X_cols = list(X.columns)
X_cols

In [None]:
# standardize X (zero mean and unit variance per column)
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

# inspect one scaled row
X_norm[48]
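To make the scaling step concrete: `StandardScaler` subtracts each column's mean and divides by its standard deviation, and `inverse_transform` undoes this. The sketch below checks both on three invented coordinate pairs; `toy`, `toy_scaler` and `by_hand` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# three invented (Lat, Lon) pairs
toy = np.array([
    [40.70, -74.00],
    [40.75, -73.98],
    [40.80, -73.90],
])

toy_scaler = StandardScaler()
toy_norm = toy_scaler.fit_transform(toy)

# same result by hand: subtract the column mean, divide by the column std
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)
matches_by_hand = np.allclose(toy_norm, by_hand)
print(matches_by_hand)

# inverse_transform recovers the original coordinates
restored = toy_scaler.inverse_transform(toy_norm)
round_trip_ok = np.allclose(restored, toy)
print(round_trip_ok)
```

The round trip is what allows cluster centers found in the scaled space to be reported back as latitude and longitude.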

### K-Means

In [None]:
# fit baseline model (scikit-learn's default is n_clusters=8)
kmeans = KMeans(random_state=42)
kmeans.fit(X_norm)

In [None]:
# cluster centers (the mean of each cluster), converted back to latitude/longitude
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
display(cluster_centers)

In [None]:
# try different numbers of clusters, from 1 to 9
# mean distortion = average euclidean distance to the nearest centroid
clusters = range(1, 10)
mean_distortions = []

# loop to find the ideal number of clusters
for k in clusters:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_norm)
    # distance from each point to its nearest centroid, in the scaled space
    # (the centroids live in X_norm space, so X_norm must be used here, not X)
    nearest = np.min(cdist(X_norm, model.cluster_centers_, "euclidean"), axis=1)
    mean_distortions.append(nearest.mean())

In [None]:
# average euclidean distance from centroid
plt.plot(clusters, mean_distortions, "bx-")
plt.xlabel("K")
plt.ylabel("Average Distortion")
plt.title("Find K with Elbow Method")
plt.show()

- The optimal number of clusters appears to be four.
- The difference between four and nine, the next best option, is minimal, and nine clusters would add complexity and reduce interpretability, so it is best to stick with four.
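As a cross-check on the elbow reading, the curve can also be built from the `inertia_` attribute that `KMeans` exposes after fitting. Note that this is a different measure from the one plotted above: `inertia_` is the sum of squared distances to the nearest centroid, not the mean distance. The sketch uses four synthetic blobs rather than the Uber data, and `n_init=10` is set explicitly as an assumption to keep results stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: four well-separated blobs of 50 points each
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
toy = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

ks = range(1, 8)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(toy)
    # inertia_ = sum of squared distances of points to their nearest centroid
    inertias.append(model.inertia_)

# the drop should flatten sharply once k reaches the true number of blobs
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Using `inertia_` avoids building the full point-to-centroid distance matrix with `cdist`, which may matter on a dataset with millions of rows.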

#### K-Means 4

In [None]:
# fit the model and assign each observation to a cluster
kmeans_4 = KMeans(n_clusters=4, random_state=0)
kmeans_4.fit(X_norm)
kmeans_4_pred = kmeans_4.predict(X_norm)

X["kmeans_4_cluster"] = kmeans_4_pred
X.head()

In [None]:
#  silhouette score
# kmeans_4_sil_score = silhouette_score(X_norm, X["kmeans_4_cluster"])
# display(kmeans_4_sil_score)
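The silhouette call above is commented out. One plausible reason (an assumption, since the notebook does not say) is cost: the score relies on pairwise distances, which becomes very expensive on millions of rows. `silhouette_score` accepts a `sample_size` argument that scores a random subset instead. A minimal sketch on synthetic blobs, not on the Uber data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: four well-separated blobs of 500 points each
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5], [5, 0]])
toy = np.vstack([c + rng.normal(scale=0.3, size=(500, 2)) for c in centers])

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(toy)

# score a random subset of 300 points instead of all 2,000
score = silhouette_score(toy, labels, sample_size=300, random_state=0)
print(round(score, 3))
```

The sampled score is an estimate, so it will vary somewhat with `sample_size` and `random_state`.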

In [None]:
sns.countplot(data=X,
              x="kmeans_4_cluster")
plt.title("Quantity of Observations per Cluster")
plt.show()

In [None]:
X.boxplot(by="kmeans_4_cluster",
          layout=(5,2),
          figsize=(15,10))
plt.show()

In [None]:
# average latitude and longitude for each cluster
X.groupby("kmeans_4_cluster").mean()

In [None]:
# centroids for model
kmeans_4_clusters = scaler.inverse_transform(kmeans_4.cluster_centers_)
display(kmeans_4_clusters)

In [None]:
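# save csv file, send it to tableau
X.to_csv("kmeans_4_clusters.csv")

# files.download only exists on Google Colab; uncomment both lines when running there
# from google.colab import files
# files.download("kmeans_4_clusters.csv")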

##### Insight:
- Four main clusters were identified:
  - Lower Manhattan & Midwest Brooklyn (Red)
  - Midtown & Upper Manhattan (Blue)
  - Eastern Brooklyn & Eastern Queens (Orange)
  - Western Queens & The Bronx (Green)



### DBSCAN