# Understanding Hired Rides in NYC
## Introduction

Uber announced that its users in New York City will be able to order yellow taxis through the Uber app. To explore trends for Uber and yellow taxis, we analyze hired-ride trip data from Uber and NYC yellow cabs from January 2009 through June 2015, together with local historical weather data.
##### The analysis is broken into 4 parts (a detailed analysis of each part is shown in the Jupyter Notebook below):
<br>
Data Preprocessing
<br>
Storing Data
<br>
Understanding Data
<br>
Visualizing Data

## Project Setup
All import statements needed for the project

In [1]:

import math
import numpy as np
import bs4
import matplotlib.pyplot as plt
import pandas as pd
import requests
import sqlalchemy as db
from scipy.stats import sem
from keplergl import KeplerGl
from math import sin, cos, sqrt, atan2, radians
import geopandas as gpd
import re
import matplotlib.patches as mpatches
from matplotlib.animation import FuncAnimation

## Part 1: Data Preprocessing

### Calculating distance
Define a function called "calculate_distance" that calculates the distance between two coordinates in kilometers.

In [2]:
def calculate_distance(from_lat, from_long, to_lat, to_long):
    """Calculate the distance between two coordinates in kilometers.

    Keyword arguments:
    Inputs: 
        from_lat -- first coordinate's latitude
        from_long -- first coordinate's longitude
        to_lat -- second coordinate's latitude
        to_long -- second coordinate's longitude
    Output:
        distance -- distance between two coordinates in kilometers
    """
    
    R = 6373.0  # approximate radius of the Earth in kilometers
    lat1 = radians(from_lat)
    lon1 = radians(from_long)
    lat2 = radians(to_lat)
    lon2 = radians(to_long)
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance
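
As a quick sanity check on the haversine formula used above, the following minimal sketch restates it as a standalone helper (the name `haversine_km` is hypothetical and not part of the notebook) and compares one degree of latitude along a meridian against the closed-form value R·π/180, which is roughly 111 km for the radius used here.

```python
from math import atan2, cos, pi, radians, sin, sqrt

EARTH_RADIUS_KM = 6373.0  # same approximate radius as calculate_distance above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers (haversine formula)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))


# One degree of latitude along a meridian should equal R * (pi / 180).
one_degree = haversine_km(40.0, -74.0, 41.0, -74.0)
expected = EARTH_RADIUS_KM * pi / 180

# Identical points should be zero kilometers apart.
same_point = haversine_km(40.7, -74.0, 40.7, -74.0)

print(round(one_degree, 2), round(expected, 2), same_point)
```

A zero result for identical points matters here because trips with `distance <= 0` are later treated as invalid.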

### Processing Taxi Data

For the taxi data, we first downloaded the parquet files programmatically using regular expressions, the requests module, and the beautifulsoup module. Then, we drew a sample of 3000 rows from each month of data so that the sample size is consistent with that of the Uber data. The next step is cleaning the data: removing invalid rows, dropping unnecessary columns, and adding a distance column computed from the coordinates. Lastly, we append each month's data to one large dataframe.
<br>
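
The monthly sampling step can be sketched as follows. This is a minimal illustration on toy data, not the notebook's own cell: the helper name `sample_month` is hypothetical, and capping `n` at the row count is an added safeguard that the notebook itself does not apply.

```python
import pandas as pd


def sample_month(df, n=3000, seed=100):
    """Return at most n rows, sampled reproducibly (hypothetical helper)."""
    # DataFrame.sample raises ValueError when n exceeds the row count
    # (with replace=False), so cap n at the number of available rows.
    return df.sample(n=min(n, len(df)), random_state=seed)


# Toy stand-in for one month of trips.
month = pd.DataFrame({"fare_amount": range(10)})

first = sample_month(month, n=4)
second = sample_month(month, n=4)   # same seed, so the same rows are drawn
small = sample_month(month, n=3000)  # fewer rows available than requested
```

Fixing `random_state` makes the sample repeatable across runs, which keeps downstream results stable.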

##### Invalid Data Criteria:
passenger count=0
<br>
fare amount<=0
<br>
distance <=0
<br>
coordinates out of New York box or NaN
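
The criteria above can be expressed as one boolean mask. The sketch below uses a toy frame that already has a `distance` column and checks only the pickup point for brevity; both are simplifying assumptions, since the notebook computes distance after the coordinate filter and checks the dropoff point as well.

```python
import pandas as pd

# Bounding box used throughout this notebook (latitude and longitude limits).
LAT_MIN, LAT_MAX = 40.560445, 40.908524
LON_MIN, LON_MAX = -74.242330, -73.717047

# Toy rows: only the first one satisfies every criterion.
trips = pd.DataFrame({
    "passenger_count": [1, 0, 2, 1, 1],
    "fare_amount": [12.5, 8.0, -3.0, 9.0, 7.0],
    "distance": [2.1, 1.0, 3.0, 0.0, 1.5],
    "pickup_latitude": [40.75, 40.75, 40.75, 40.75, 41.50],
    "pickup_longitude": [-73.99, -73.99, -73.99, -73.99, -73.99],
})

# Series.between is inclusive on both ends and evaluates to False for NaN,
# so missing coordinates are rejected by the same mask.
valid = (
    (trips["passenger_count"] != 0)
    & (trips["fare_amount"] > 0)
    & (trips["distance"] > 0)
    & trips["pickup_latitude"].between(LAT_MIN, LAT_MAX)
    & trips["pickup_longitude"].between(LON_MIN, LON_MAX)
)
clean = trips[valid]
```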

__1. Find Urls of Taxi Data__

In [3]:
def find_taxi_csv_urls():
    """Get URLs using requests and beautifulsoup
    Keyword Arguments:
    Output:
        res -- a list containing all URLs of yellow taxi monthly data
    
    """
    TAXI_URL = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
    response = requests.get(TAXI_URL)
    html = response.content
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # Keep only the links whose text marks them as yellow taxi trip records
    res = []
    for link in soup.find_all("a"):
        if link.text == "Yellow Taxi Trip Records":
            res.append(link['href'])
    return res

__2. Download Taxi Data from 2009-01 to 2015-06__

In [None]:
res=find_taxi_csv_urls()
# Use a regular expression to pick the required URLs and requests to download the data
pattern = r"(2009-\d{2}|2010-\d{2}|2011-\d{2}|2012-\d{2}|2013-\d{2}|2014-\d{2}|2015-0[1-6])\.(parquet)"
for x in res:
    result = re.search(pattern, x)
    if result is None:
        continue  # skip links outside 2009-01 to 2015-06
    response = requests.get(x, stream=True)
    title = result.groups()[0]
    with open(title + ".parquet", "wb") as f:
        f.write(response.content)
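
To see which links the date filter keeps, the following sketch runs a more compact pattern covering the same date range as the one above against a few sample URLs. The URLs are hypothetical placeholders for illustration only; the real links come from the TLC page.

```python
import re

# Same date range as the pattern above, written more compactly:
# any month of 2009-2014, or January through June of 2015.
PATTERN = re.compile(r"(20(?:09|1[0-4])-\d{2}|2015-0[1-6])\.parquet")

# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/trip-data/yellow_tripdata_2009-01.parquet",
    "https://example.com/trip-data/yellow_tripdata_2015-06.parquet",
    "https://example.com/trip-data/yellow_tripdata_2015-07.parquet",
    "https://example.com/trip-data/yellow_tripdata_2016-03.parquet",
]

# Keep the captured "YYYY-MM" part of every URL that matches.
selected = [m.group(1) for m in map(PATTERN.search, urls) if m]
print(selected)
```

Only the first two URLs fall inside the January 2009 through June 2015 window, so only their `YYYY-MM` parts are kept.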

__3. Clean Data from 2011 to 2015 (Location ID is provided instead of Coordinates)__

In [4]:
df2 = gpd.read_file('taxi_zones.shp')
df2 = df2.to_crs(epsg=4326)  # EPSG 4326 = WGS84 = https://epsg.io/4326
def find_long(ID,df2):
    """Find longitude using taxi_zones.shp corresponding to the Location ID
    Key Arguments:
    Inputs:
        ID -- Location ID
        df2 -- dataframe of taxi_zones.shp
    
    Outputs:
        long-- longitude
        np.nan-- if the coordinate is out of new york box
    """
    long=df2.iloc[ID-1].geometry.centroid.x
    if -74.242330<=long<=-73.717047:
        return long
    else:
        return np.nan
def find_lat(ID,df2):
    """Find latitude using taxi_zones.shp corresponding to the Location ID
    Key Arguments:
    Inputs:
        ID -- Location ID
        df2 -- dataframe of taxi_zones.shp 
    
    Outputs:
        lat-- latitude
        np.nan-- if the coordinate is out of new york box
    """
    lat=df2.iloc[ID-1].geometry.centroid.y
    if 40.560445<=lat<=40.908524:
        return lat
    else:
        return np.nan
def normal_from_2011(parquet_file):
    """Clean Data: Normalize column names, remove invalid rows, sampling size=3000, find coordinates and add distance
    Key Arguments:
    Inputs:
        parquet_file -- a string of parquet file's name
    
    Outputs:
        taxi_df -- dataframe with columns of datetime, coordinates of pickup&dropoff, distance, and tip amount
    
    """
    
    df = pd.read_parquet(parquet_file,engine='pyarrow')
    df = df[(df.passenger_count != 0) & (df.fare_amount > 0)]
    df.rename(columns={'tpep_pickup_datetime':'pickup_datetime'},inplace=True)    
    df.set_index(pd.to_datetime(df["pickup_datetime"]),inplace=True)
    df = df[["PULocationID", "DOLocationID","tip_amount"]]
    taxi_df=df.sample(n=3000,random_state=100)
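    # Keep only valid Location IDs; copy so the column assignments below act on an independent frame
    taxi_df = taxi_df.loc[(taxi_df['PULocationID'] < 264) & (taxi_df['PULocationID'] >= 1)]
    taxi_df = taxi_df.loc[(taxi_df['DOLocationID'] < 264) & (taxi_df['DOLocationID'] >= 1)].copy()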
    
    taxi_df["pickup_latitude"]=np.nan
    taxi_df["pickup_longitude"]=np.nan
    taxi_df["dropoff_latitude"]=np.nan
    taxi_df["dropoff_longitude"]=np.nan
    lat1 = taxi_df.apply(
        lambda row: find_lat(row["PULocationID"].astype('int'),df2),axis=1)
    long1 = taxi_df.apply(
        lambda row: find_long(row["PULocationID"].astype('int'),df2),axis=1)
    
    taxi_df["pickup_latitude"] = lat1
    taxi_df["pickup_longitude"] = long1
 
    lat2 = taxi_df.apply(
        lambda row: find_lat(row["DOLocationID"].astype('int'),df2),axis=1)
    long2 = taxi_df.apply(
        lambda row: find_long(row["DOLocationID"].astype('int'),df2),axis=1)

    
    taxi_df["dropoff_latitude"] = lat2
    taxi_df["dropoff_longitude"] = long2   
    
    add_distance = taxi_df.apply(
                 lambda row: calculate_distance(row["pickup_latitude"], row["pickup_longitude"],row["dropoff_latitude"],row["dropoff_longitude"]),
                 axis=1)
    taxi_df['distance'] = add_distance
    taxi_df = taxi_df[taxi_df.distance > 0]
    taxi_df.dropna(inplace=True)
    
    return pd.DataFrame(taxi_df, columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude","dropoff_latitude","tip_amount","distance"])
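
The lookups above address the zone table by row position (`iloc[ID-1]`), which assumes the rows are ordered by Location ID with no gaps or duplicates; whether the shapefile satisfies that is not verified here. As a hedged alternative, the sketch below keys the lookup on an ID column and maps whole columns at once. The table contents and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical centroid table; in the notebook these values would be derived
# from taxi_zones.shp (the column names here are assumptions).
zones = pd.DataFrame({
    "LocationID": [1, 2, 4],
    "lon": [-74.17, -73.83, -73.98],
    "lat": [40.69, 40.62, 40.72],
})

# Index by LocationID so the lookup does not depend on row order.
centroids = zones.drop_duplicates("LocationID").set_index("LocationID")

trips = pd.DataFrame({"PULocationID": [4, 1, 3]})  # ID 3 is absent on purpose

# Series.map with a Series looks values up by index; unknown IDs become NaN.
trips["pickup_longitude"] = trips["PULocationID"].map(centroids["lon"])
trips["pickup_latitude"] = trips["PULocationID"].map(centroids["lat"])
```

Unknown IDs end up as NaN, so a later `dropna` removes them in the same way as out-of-box coordinates.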

__4. Clean Data of 2009 and 2010 Respectively__

In [5]:
def norm_2009(file):
    """Clean Data: extract necessary columns and normalize names, remove invalid rows, sample size=3000
    Key Arguments:
    Inputs:
        file -- a string of parquet file's name
    
    Outputs:
        par -- dataframe with columns of datetime, coordinates of pickup&dropoff and tip amount
    """
    par=pd.read_parquet(file,engine='pyarrow')
    par.rename(columns={'Trip_Pickup_DateTime':'pickup_datetime'},inplace=True)
    par.set_index(pd.to_datetime(par['pickup_datetime']),inplace=True)
    par = par[(par.Passenger_Count != 0) & (par.Fare_Amt > 0)]
    par=par[['Start_Lon','Start_Lat','End_Lon','End_Lat','Tip_Amt']]
    par=par.sample(3000,random_state=100)
    par.rename(columns={'Start_Lon': 'pickup_longitude', 'Start_Lat': 'pickup_latitude','End_Lon':'dropoff_longitude','End_Lat':'dropoff_latitude','Tip_Amt':'tip_amount'},inplace=True)
    return par
def norm_2010(parquet_file): 
    """Clean Data: extract necessary columns and normalize names, remove invalid rows, sample size=3000
    Key Arguments:
    Inputs:
        parquet_file -- a string of parquet file's name
    
    Outputs:
        taxi_df -- dataframe with columns of datetime, coordinates of pickup&dropoff and tip amount
    """
    df = pd.read_parquet(parquet_file,engine='pyarrow')
    df.set_index(pd.to_datetime(df["pickup_datetime"]),inplace=True)
    df = df[(df.passenger_count != 0) & (df.fare_amount > 0)]
    df = df[["pickup_longitude","pickup_latitude","dropoff_longitude","dropoff_latitude","tip_amount"]]
    taxi_df=df.sample(n=3000,random_state=100)
    return taxi_df
def normal_before_2011(file):
    """Calculate Distance: remove rows with coordinates outside of the New York box and add distance
    Key Arguments:
    Inputs:
        file -- a string of parquet file's name (its year prefix selects the 2009 or 2010 cleaner)
    
    Outputs:
        taxi_df -- dataframe with columns of datetime, coordinates of pickup&dropoff, distance and tip amount
    """
    if file[:4]=="2009":
        taxi_df=norm_2009(file)
    else:
        taxi_df=norm_2010(file)
    # Keep only trips whose pickup and dropoff both fall inside the New York bounding box
    taxi_df = taxi_df.loc[
        (taxi_df["pickup_latitude"] <= 40.908524) & (taxi_df["pickup_latitude"] >= 40.560445)
        & (taxi_df["dropoff_latitude"] <= 40.908524) & (taxi_df["dropoff_latitude"] >= 40.560445)
        & (taxi_df["pickup_longitude"] <= -73.717047) & (taxi_df["pickup_longitude"] >= -74.242330)
        & (taxi_df["dropoff_longitude"] <= -73.717047) & (taxi_df["dropoff_longitude"] >= -74.242330)
    ].copy()
    add_distance = taxi_df.apply(
        lambda row: calculate_distance(row["pickup_latitude"], row["pickup_longitude"],row["dropoff_latitude"],row["dropoff_longitude"]),axis=1)
    taxi_df['distance'] = add_distance
    taxi_df = taxi_df[taxi_df.distance > 0]
    taxi_df.dropna(inplace = True)
    return taxi_df
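
The 2009 and 2010 files differ mainly in their column names. A minimal sketch of bringing both to one schema with a single rename map is shown below; the helper name `to_common_schema` is hypothetical, while the column names are the ones used in the functions above.

```python
import pandas as pd

# Column names used by the 2009 files, mapped to the names used from 2010 on.
RENAME_2009 = {
    "Start_Lon": "pickup_longitude",
    "Start_Lat": "pickup_latitude",
    "End_Lon": "dropoff_longitude",
    "End_Lat": "dropoff_latitude",
    "Tip_Amt": "tip_amount",
}
COMMON_COLUMNS = list(RENAME_2009.values())


def to_common_schema(df):
    """Rename 2009-style columns if present, then keep the shared columns."""
    # DataFrame.rename ignores keys that are not present by default,
    # so 2010-style frames pass through unchanged.
    return df.rename(columns=RENAME_2009)[COMMON_COLUMNS]


df_2009 = pd.DataFrame({
    "Start_Lon": [-73.99], "Start_Lat": [40.75],
    "End_Lon": [-74.00], "End_Lat": [40.70],
    "Tip_Amt": [2.0], "Fare_Amt": [10.0],
})
df_2010 = pd.DataFrame({
    "pickup_longitude": [-73.98], "pickup_latitude": [40.76],
    "dropoff_longitude": [-73.97], "dropoff_latitude": [40.71],
    "tip_amount": [1.0], "fare_amount": [8.0],
})
both = pd.concat([to_common_schema(df_2009), to_common_schema(df_2010)])
```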

In [6]:
#create a list storing all the file names of yellow taxi monthly data
res=find_taxi_csv_urls()
title=[]
pattern = r"(2009-\d{2}|2010-\d{2}|2011-\d{2}|2012-\d{2}|2013-\d{2}|2014-\d{2}|2015-0[1-6])\.(parquet)"
for x in res:
    result = re.search(pattern, x)
    if result is not None:
        title.append(result.groups()[0] + ".parquet")

__5. Append Monthly Data to a Big Dataframe__

In [7]:
def get_and_clean_taxi_data(title):
    """Process monthly data and append it to a single DataFrame
    Key Arguments:
    Inputs:
        title -- a list of monthly parquet file names
    
    Outputs:
        taxi_data -- dataframe with columns of datetime, coordinates of pickup&dropoff, distance and tip amount
    """
    all_taxi_dataframes = []
    for file_name in title:
        # 2009 and 2010 files store coordinates; later files store Location IDs
        if file_name[:4] == "2009" or file_name[:4] == "2010":
            all_taxi_dataframes.append(normal_before_2011(file_name))
        else:
            all_taxi_dataframes.append(normal_from_2011(file_name))
    taxi_data = pd.concat(all_taxi_dataframes)
    return taxi_data
    
    

In [8]:
Taxi_Data=get_and_clean_taxi_data(title)
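
The dispatch-then-concatenate pattern used above can be sketched in isolation. The two cleaner functions below are stand-ins that only tag each file name, not the real cleaners, and `pick_cleaner` is a hypothetical helper; `ignore_index=True` is used only because the toy frames have no datetime index to preserve.

```python
import pandas as pd


def clean_before_2011(name):
    # Stand-in for normal_before_2011: one row tagged with the cleaner used.
    return pd.DataFrame({"file": [name], "cleaner": ["before_2011"]})


def clean_from_2011(name):
    # Stand-in for normal_from_2011.
    return pd.DataFrame({"file": [name], "cleaner": ["from_2011"]})


def pick_cleaner(file_name):
    """Choose a cleaner from the four-digit year prefix of the file name."""
    return clean_before_2011 if int(file_name[:4]) < 2011 else clean_from_2011


files = ["2009-01.parquet", "2010-12.parquet", "2011-01.parquet"]

# Build every monthly frame first, then concatenate once at the end.
combined = pd.concat([pick_cleaner(f)(f) for f in files], ignore_index=True)
```

Collecting the frames in a list and calling `pd.concat` once avoids repeatedly copying a growing dataframe.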

### Processing Uber Data

__1. Manually downloaded and stored Uber data as "uber_rides_sample.csv"__
<br>
__2. Replace index with pickup_datetime__
<br>
__3. Remove invalid trips__
<br>
&nbsp;&nbsp;&nbsp;&nbsp; * Trips outside the required coordinate box
<br>
&nbsp;&nbsp;&nbsp;&nbsp; * Trips with zero passenger count
<br>
&nbsp;&nbsp;&nbsp;&nbsp; * Trips with no fare
<br>
&nbsp;&nbsp;&nbsp;&nbsp; * Trips with no distance between dropoff and pickup
<br>
__4. Remove unnecessary columns__
<br>
&nbsp;&nbsp;&nbsp;&nbsp;The dataset now has only 4 columns: the longitudes and latitudes of the pickup and dropoff points.
<br>
__5. Add distance column__
<br>
&nbsp;&nbsp;&nbsp;&nbsp;Apply the calculate_distance function and add the result as a new distance column
<br>
__6. Drop NaN & Normalize column names__

In [9]:
def load_and_clean_uber_data(csv_file):
    """Load and clean the Uber data.

    Keyword arguments:
    Inputs: 
        csv_file -- Uber data's file name
    Output:
        uber -- cleaned dataframe with columns of pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, and distance
    """
    uber = pd.read_csv(csv_file) 
    uber.set_index(pd.to_datetime(uber['pickup_datetime']),inplace=True)
    uber.drop(["key","Unnamed: 0",'pickup_datetime'],axis=1,inplace=True)
    uber = uber[(uber.passenger_count != 0) & (uber.fare_amount > 0)]
    uber.drop(["passenger_count","fare_amount"],axis=1,inplace=True)
    # Cast the coordinate columns to float; the result must be assigned, since apply() alone returns a new Series
    coord_cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
    uber = uber.astype({col: float for col in coord_cols})
    # Keep only trips whose pickup and dropoff both fall inside the New York bounding box
    uber = uber.loc[
        (uber["pickup_latitude"] <= 40.908524) & (uber["pickup_latitude"] >= 40.560445)
        & (uber["dropoff_latitude"] <= 40.908524) & (uber["dropoff_latitude"] >= 40.560445)
        & (uber["pickup_longitude"] <= -73.717047) & (uber["pickup_longitude"] >= -74.242330)
        & (uber["dropoff_longitude"] <= -73.717047) & (uber["dropoff_longitude"] >= -74.242330)
    ].copy()
    add_distance = uber.apply(
        lambda row: calculate_distance(row["pickup_latitude"], row["pickup_longitude"],row["dropoff_latitude"],row["dropoff_longitude"]),axis=1)
    uber['distance'] = add_distance
    uber = uber[uber.distance > 0]
    uber.dropna(inplace = True)
    
    return uber




In [10]:
Uber_Data=load_and_clean_uber_data("uber_rides_sample.csv")
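
Coordinate columns read from a CSV can end up as strings when a few values are malformed. Whether that happens in `uber_rides_sample.csv` is not established here; assuming it might, the sketch below shows one way to coerce such columns so that unparseable values become NaN and are removed by `dropna`.

```python
import pandas as pd

# Toy frame in which one coordinate was read as a malformed string.
raw = pd.DataFrame({
    "pickup_latitude": ["40.75", "not-a-number", "40.70"],
    "pickup_longitude": ["-73.99", "-73.98", "-74.00"],
})

coord_cols = ["pickup_latitude", "pickup_longitude"]

# errors="coerce" turns unparseable values into NaN instead of raising.
numeric = raw[coord_cols].apply(pd.to_numeric, errors="coerce")

# Rows with any NaN coordinate are dropped, as in the cleaning function.
cleaned = numeric.dropna()
```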