# NYC Yellow Cab efficiecy
---

### Efficiency metric for public transport:
$$ \textbf{Efficiency} = \frac{\textit{Passenger displacement}}{\textit{cost}} $$
$$ \textbf{Passenger displacement} = \mu_{\textit{distance covered}} * \mu_{\textit{number of passengers}} $$
$$ \textbf{Cost} = \textit{(fuel cost + social carbon cost) * miles + idle time} + \textit{(drive salary + oppotunity cost) time} $$


#### Passenger displacement:
* Distance travelled by each passenger.
* Adds up when multiple passengers share a ride.

#### Cost components:
1. Driver salary 
    * $\$$20 / hour fixed salary.
2. Fuel cost
    * Assumes 13 mpg fuel efficiency.
    * Assumes 25 mph average crusing speed to determine idle time and 0.5 gallon per hour consumption when idle.
    * Based on a $\$$2.385 gallon and a $\$$0.453 social carbon cost per gallon emissions.
3. Opportunity cost 
    * $\$$15 / hour opportunity cost per passenger for commuting.

#### Aggregation strategy:
* Merges rides with similar pickup locations and times goint to a similar dropoff location.
* Controls how similar they have to be in order to merge using a wait time threshold.
* The wait time for ride $\textit{A}$ to be merged with ride $\textit{B}$ is its difference in pickup time added to the distance of their pickup and dropoff locations assuming a walking speed of 3.6 mph.
* Choose an optimal wait time threshold by computing mean efficiency across a range of reasonable wait times.    
    
#### Assumptions:
* Trips beginning and ending in Manhattan have twice the weight of those that only begin or end in Manhattan when determing efficiency. 
* Pickup and dropoff walking distances are are set to the Manhattan distance between pairs of pickup and dropoff points.
* Shared rides can take up to 6 passengers at once.
* Passengers can only aggregate at the beginning of a trip (i.e. no pickups in the middle of a trip).
* Traffic speed, fuel cost and driver salary are the same for non-aggregated and aggregated rides.

#### Data pipeline:
* 1st full week of June 2016 $\Rightarrow$ training (find optimal waiting time).
* 2nd full week of June 2016 $\Rightarrow$ validation (compare aggregated vs. yellow cab efficiency).

#### Things I would like to explore if I had more time:
1. Aggregate trips along routes.
2. Account for driver idle time on the efficiency metric.

#### Load packages and read input data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from functools import partial
from collections import Counter, defaultdict
import time

# gis imports
import geopandas as gpd
import folium
from shapely.geometry import Point, LineString, Polygon
from shapely.strtree import STRtree
from shapely.ops import nearest_points
from scipy.spatial import cKDTree
from osmnx import quadrat_cut_geometry
from multiprocessing import Pool

# read input data
pd.set_option('display.float_format', lambda x: '%.5f' % x)
boroughs = gpd.read_file('Borough Boundaries/geo_export_acd45f17-302a-4908-b0d7-223e9510dc04.shp')
trips_all = pd.read_csv('yellow_tripdata_2016-06.csv')
manhattan = boroughs.loc[boroughs.boro_name == 'Manhattan', 'geometry'].values[0]
boroughs['city'] = ['NYC'] * len(boroughs)
all_boroughs = boroughs.dissolve(by='city').geometry.values[0]
trips_all.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0,11135470.0
mean,1.52982,1.65727,3.04401,-73.05081,40.24282,1.04388,-73.12388,40.28391,1.34972,13.50708,0.34072,0.4973,1.84212,0.34021,0.29968,16.83016
std,0.49911,1.30249,21.83019,8.20805,4.52167,0.56606,7.88031,4.3412,0.4945,275.53582,0.53397,0.04452,2.71359,1.71971,0.01358,275.86082
min,1.0,0.0,0.0,-118.18626,0.0,1.0,-118.18626,0.0,1.0,-450.0,-41.23,-2.7,-67.7,-12.5,-0.3,-450.8
25%,1.0,1.0,1.0,-73.99178,40.73653,1.0,-73.99123,40.73492,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.8
50%,2.0,1.0,1.71,-73.98135,40.75358,1.0,-73.97935,40.75412,1.0,10.0,0.0,0.5,1.35,0.0,0.3,12.3
75%,2.0,2.0,3.23,-73.96617,40.76831,1.0,-73.96202,40.76954,2.0,15.5,0.5,0.5,2.46,0.0,0.3,18.36
max,2.0,9.0,71732.7,0.0,64.09648,99.0,106.24688,60.04071,5.0,628544.74,597.92,60.35,854.85,970.0,11.64,629033.78


## Calculating efficiency
---

* Create function to get trips starting or ending in Manhattan
* Create function to get efficiency and emissions for those trips 
* Calculate efficiency for Yellow cabs in Manhattan 


#### Function to get Manhattan trips

In [2]:
def get_trips_polygon(df: pd.DataFrame, polygon: Polygon, bounds: Polygon=None) -> pd.DataFrame:
    """
    Helper function to subset trips to those starting and/or ending inside a georefrenced polygon
    Inputs:
    df: pandas DataFrame with trips
    polygon: shapely polygon
    """
    # create columns for point geometries
    points_pick = [Point(*ele) for ele in zip(df['pickup_longitude'], df['pickup_latitude'])]
    points_drop = [Point(*ele) for ele in zip(df['dropoff_longitude'], df['dropoff_latitude'])]
    pick_id = [id(ele) for ele in points_pick]
    drop_id= [id(ele) for ele in points_drop]
               
    # create a dictionary from id to index
    id_to_idx = {point_id: df.index[idx] for idx, point_id in enumerate(pick_id)}
    for idx, point_id in enumerate(drop_id):
        id_to_idx[point_id] = df.index[idx]
    
    # create R-tree for points to query 
    pick_tree = STRtree(points_pick)
    drop_tree = STRtree(points_drop)
    
    # filter to stay within 4 borough boundaries
    if bounds:
        query_pick = pick_tree.query(bounds)
        quert_drop = drop_tree.query(bounds)
        ids_pick = set([id_to_idx[id(ele)] for ele in query_pick])
        ids_drop = set([id_to_idx[id(ele)] for ele in quert_drop])
        df = df.loc[list(ids_pick.intersection(ids_drop))]
        
        # update trees
        pick_tree = STRtree(query_pick)
        drop_tree = STRtree(quert_drop)

    # chop polygon into quadrats for R-Tree searching
    pol_cut = quadrat_cut_geometry(polygon, 0.025)
    
    # loop through quadrats storing points that start or end in each quadrant
    ids_pick = set()
    ids_drop = set()
    for quadrat in pol_cut:
        drop_quadrat = set([id_to_idx[id(ele)] for ele in drop_tree.query(quadrat)])
        pick_quadrat = set([id_to_idx[id(ele)] for ele in pick_tree.query(quadrat)])
        ids_drop = ids_drop.union(drop_quadrat)
        ids_pick = ids_pick.union(pick_quadrat)
    
    # keep points that start or end in Manhattan, give extra weight to those that do both
    weight2_trips = list(ids_drop.intersection(ids_pick))
    df = df.loc[df.index.isin(list(ids_pick.union(ids_drop)))]
    df = df.assign(weight=[1] * len(df))
    df.loc[weight2_trips, 'weight'] = 2

    # change datetime columns to datetime and return# change time columns to datetime
    for col in ['tpep_pickup_datetime', 'tpep_dropoff_datetime']:
        df[col] = df[col].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
        
    # return df
    return df


#### Get all Manhattan trips during the second week of June 2016

In [3]:
# subset trips by date
trips = trips_all.loc[[12 < int(ele[8:10]) < 20 for ele in trips_all.tpep_pickup_datetime]]

# subset by location and add geometries and weights
tic = time.time()
chunks = (trips.iloc[idx:idx + 10000] for idx in range(0, len(trips), 10000))
pool = Pool(24)
results = pool.map_async(partial(get_trips_polygon, polygon=manhattan,
                                bounds=all_boroughs), chunks)
trips = results.get()
pool.close()
pool.join()
print(f'Finished processing trips in {time.time() - tic} seconds')

# subset trips and explore resulting data
trips = pd.concat(trips)
trips = trips.assign(total_wait=[0] * len(trips)) 
trips.describe()

Finished processing trips in 18.226601362228394 seconds


Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,weight,total_wait
count,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0
mean,1.52998,1.65863,2.91308,-73.97651,40.75263,1.03359,-73.97534,40.75334,1.34364,12.84476,0.33835,0.49814,1.80424,0.34002,0.29979,16.12879,1.87539,0.0
std,0.4991,1.3036,3.6237,0.03227,0.02542,0.3355,0.0322,0.03014,0.49046,11.63771,0.59303,0.05078,2.53402,1.69356,0.01113,14.11393,0.33028,0.0
min,1.0,0.0,0.0,-74.22618,40.51811,1.0,-74.25534,40.51123,1.0,-88.0,-4.5,-0.5,-46.68,-12.5,-0.3,-88.8,1.0,0.0
25%,1.0,1.0,1.0,-73.99229,40.73904,1.0,-73.99157,40.73818,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.76,2.0,0.0
50%,2.0,1.0,1.7,-73.9821,40.75475,1.0,-73.98035,40.75529,1.0,9.5,0.0,0.5,1.35,0.0,0.3,12.09,2.0,0.0
75%,2.0,2.0,3.12,-73.96908,40.76853,1.0,-73.96519,40.77034,2.0,15.0,0.5,0.5,2.45,0.0,0.3,17.8,2.0,0.0
max,2.0,9.0,805.9,-73.71921,40.90976,99.0,-73.70019,40.91549,4.0,8452.0,597.92,60.35,854.85,554.0,0.3,8454.24,2.0,0.0


#### Cleanup trip data
* There is no missing data on columns of interest.
* Some trips have a passenger count of zero.
* Some trips have zero distance.
* Some trips have an unrealistic distance (> 500 mi).

#### Imputation strategy:
* Substitute passenger counts of zero with a sample from passenger counts
* Substitute missing / unrealistic distances with Manhattan distances

In [4]:
# get number of trips with zero distance, zero passengers and unrealisticly high distances
print(f'Number of trips with zero distance: {len(trips.loc[trips.trip_distance == 0])}')
print(f'Number of trips with zero passengers: {len(trips.loc[trips.passenger_count == 0])}')
print(f'Number of trips over 50 miles:  {len(trips.loc[trips.trip_distance > 50])}')

# Helper function top get manhattan distances between vectors of lat, lon
def manhattan_dist(lat1, lat2, lon1, lon2):
    dlat = abs(lat1 - lat2)
    dlon = abs(lon1 - lon2)
    return 69 * dlat + 69 * dlon * np.cos(np.radians(lon2)) 

# function to clean trips
def cleanup_trips(trips: pd.DataFrame):
    
    # Fill zero passenger count trips sampling from observed passenger counts
    passenger_counts = list(trips.loc[trips.passenger_count > 0].passenger_count)
    trips.loc[trips.passenger_count == 0, 'passenger_count'] = np.random.choice(passenger_counts, 
                         size=len(trips.loc[trips.passenger_count == 0]))

    # Fill trips with unrealistic or zero distance using the Manhattan distance 
    dist_trips = trips.loc[(trips.trip_distance < 50) & (trips.trip_distance != 0)]
    trips.loc[(trips.trip_distance < 50) | (trips.trip_distance != 0), 'trip_distance'] =  \
        manhattan_dist(dist_trips.pickup_latitude, dist_trips.dropoff_latitude, 
                       dist_trips.pickup_longitude, dist_trips.dropoff_longitude)

    return trips

# clean trips and display new values
trips = cleanup_trips(trips)
trips.describe()

Number of trips with zero distance: 6504
Number of trips with zero passengers: 45
Number of trips over 50 miles:  26


Unnamed: 0,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,weight,total_wait
count,2431120.0,2431120.0,2424590.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0,2431120.0
mean,1.52998,1.65866,1.93259,-73.97651,40.75263,1.03359,-73.97534,40.75334,1.34364,12.84476,0.33835,0.49814,1.80424,0.34002,0.29979,16.12879,1.87539,0.0
std,0.4991,1.30359,2.08733,0.03227,0.02542,0.3355,0.0322,0.03014,0.49046,11.63771,0.59303,0.05078,2.53402,1.69356,0.01113,14.11393,0.33028,0.0
min,1.0,1.0,0.0,-74.22618,40.51811,1.0,-74.25534,40.51123,1.0,-88.0,-4.5,-0.5,-46.68,-12.5,-0.3,-88.8,1.0,0.0
25%,1.0,1.0,0.69861,-73.99229,40.73904,1.0,-73.99157,40.73818,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.76,2.0,0.0
50%,2.0,1.0,1.2578,-73.9821,40.75475,1.0,-73.98035,40.75529,1.0,9.5,0.0,0.5,1.35,0.0,0.3,12.09,2.0,0.0
75%,2.0,2.0,2.36362,-73.96908,40.76853,1.0,-73.96519,40.77034,2.0,15.0,0.5,0.5,2.45,0.0,0.3,17.8,2.0,0.0
max,2.0,9.0,21.61685,-73.71921,40.90976,99.0,-73.70019,40.91549,4.0,8452.0,597.92,60.35,854.85,554.0,0.3,8454.24,2.0,0.0


#### Efficiency function

In [5]:
# function to get efficiency for trip
def get_efficiency(trips, gallon_price: float=2.385, mpg: float=13.0,
                   social_carbon_cost: float=0.453, idle_cost: float=0.5, 
                   avg_speed: float=25.0, driver_salary: float=20.0, 
                   opportunity_cost: float=15.0) -> pd.DataFrame:
    """
    Determines the efficiency of a trip as passenger displacement / trip cost
    
    gallon_price (float): price of a gallon of fuel in US$
    mpg (float): miles per gallon for vehicles
    idle_cost: gallons consumed per hour when car is idle
    avg_speed: average speed in mph when not suck in traffic
    driver_salary: driver hourly pay in US$
    opportunity_cost: hourly ocpportunity cost for passengers in US$
    
    returns: efficiency for trip, doubled if it starts and ends within Manhattan 
    """
    # add extra weight to manhattan trips
    trips = trips.append(trips.loc[trips.weight == 2], ignore_index=True)
    
    # get displacement
    displacement = trips['trip_distance'] * trips['passenger_count']
    
    # get duration in hours
    duration = (trips['tpep_dropoff_datetime'] - trips['tpep_pickup_datetime']).astype('timedelta64[s]') / 3600
    
    # get fuel and driver cost
    idle_time = np.clip(duration - trips['trip_distance'] / avg_speed, 0, 10E6)
    total_fuel = trips['trip_distance'] / mpg + idle_cost * idle_time
    cost_fuel = total_fuel * (gallon_price + social_carbon_cost)
    emissions = total_fuel * 9.07185 / 1000
    cost_salary = duration * driver_salary + ((trips['total_wait'] + duration)) * opportunity_cost
    
    # get efficiency 
    total_cost = cost_fuel + cost_salary
    efficiency = displacement / total_cost
    
    # return metrics
    metrics = pd.DataFrame({'efficiency': efficiency,
                            'duration': (trips['total_wait'] + duration) * 60,
                            'emissions': emissions})
    return metrics
    


#### Get efficiency for NYC yellow cabs

In [6]:
# get mean efficiency for yellow cabs
get_efficiency(trips).describe()

Unnamed: 0,efficiency,duration,emissions
count,4546369.0,4559293.0,4546379.0
mean,0.35166,15.28889,0.00204
std,0.37258,54.03392,0.0043
min,-0.00267,-5703.76667,0.0
25%,0.14947,6.55,0.00091
50%,0.23705,10.73333,0.00146
75%,0.39451,17.01667,0.00238
max,22.90345,1439.95,0.116


## Trip aggregation strategy
---

1. Filter trips by pickup and dropoff location and pickup time.
2. Convert distance of nearby pickups and dropoffs to time assuming a fixed walking speed.
3. Aggregate trips within a given waiting time prioritizing those that are closer and respecting a vehicle capacity.

#### Get first week of June 2016 to find the most efficient waiting time threshold

In [7]:
# subset training trips by date and cleanup trips
trips_train = trips_all.loc[[5 < int(ele[8:10]) < 13 for ele in trips_all.tpep_pickup_datetime]]
trips_train = cleanup_trips(trips_train)

# subset by location and add geometries and weights
tic = time.time()
chunks = (trips_train.iloc[idx:idx + 10000] for idx in range(0, len(trips_train), 10000))
pool = Pool(24)
results = pool.map_async(partial(get_trips_polygon, polygon=manhattan,
                                bounds=all_boroughs), chunks)
trips_train = results.get()
pool.close()
pool.join()
trips_train = pd.concat(trips_train)
print(f'Finished processing trips in {time.time() - tic} seconds')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


Finished processing trips in 19.793173789978027 seconds


#### Ride aggregation function
* 

In [8]:
# helper function to aggregate by waiting time
def agg_by_time(df: pd.DataFrame, walking_speed: float, max_wait: float,
                max_passengers: int) -> pd.DataFrame:
    
      
    # find distance buffer given walking speed
    dist_buffer = walking_speed * max_wait / 60 
    
    # add rides at capacity to final output
    agg = pd.DataFrame()
    at_capacity = df.passenger_count >= max_passengers
    agg = agg.append(df.loc[at_capacity])
    agg= agg.assign(total_wait=[0] * len(agg))
    df = df.loc[-at_capacity]
    
    # loop through trips
    while len(df) > 0:
        
        # get first row
        curr = df.iloc[0]
        lat_pick, lon_pick = curr.pickup_latitude, curr.pickup_longitude
        lat_drop, lon_drop = curr.dropoff_latitude, curr.dropoff_longitude
        pick_time = curr.tpep_pickup_datetime
        df = df.iloc[1:]
        
        # adjust max wait time for short trips to prevent too much waiting / travelling time
        wait_time = min(max_wait, (pick_time - curr.tpep_dropoff_datetime).seconds / 30)
        
        # filter by pickup time, latitude and dropoff latitude
        #time_diffs = [(df.iloc[idx].tpep_pickup_datetime - pick_time).seconds / 60 for idx in range(len(df))]
        time_diffs = (df.loc[:, 'tpep_pickup_datetime'] - pick_time).astype('timedelta64[s]') / 3600
        filtered = df.loc[np.array(time_diffs) < wait_time]
        lat_diffs_pick = abs(filtered.pickup_latitude - lat_pick)
        filtered = filtered.loc[np.array(lat_diffs_pick) < (dist_buffer / 69)]
        lat_diffs_drop = abs(filtered.dropoff_latitude - lat_drop)
        filtered = filtered.loc[np.array(lat_diffs_drop) < (dist_buffer / 69)]
        
        # subset by distance to dropoff location
        distance_drop = manhattan_dist([lat_drop] * len(filtered), filtered.dropoff_latitude,
                                       [lon_drop] * len(filtered), filtered.dropoff_longitude)
        filtered = filtered.assign(wait_time=(distance_drop / walking_speed)) 
        filtered = filtered.loc[filtered.wait_time < wait_time]
        
        # subset by pickup location
        distance_pick = manhattan_dist([lat_pick] * len(filtered), filtered.pickup_latitude,
                                       [lon_pick] * len(filtered), filtered.pickup_longitude)
        filtered['wait_time'] += distance_pick / walking_speed
        
        # threshold by waiting time
        filtered = filtered.loc[filtered.wait_time < wait_time]
         
        # add best passenger groups 
        filtered = filtered.sort_values(by=['wait_time'])
        total_wait = 0
        idcs = set([])
        for idx, row in filtered.iterrows():
            if curr.passenger_count == max_passengers:
                break
            if (row.passenger_count + curr.passenger_count) <= max_passengers:
                total_wait += row.passenger_count * row.wait_time
                curr.loc['passenger_count'] += row.passenger_count
                idcs.add(idx)  
        curr['total_wait'] = total_wait
        agg = agg.append(curr)
        
        # remove aggregated rides from dataframe
        df = df.loc[~df.index.isin(idcs)]
    
    # return aggregated rides
    return agg
          

#### Search for optimal waiting time threshold for aggregation 
> Caveat: cap waiting time as twice the trip duration to prevent unnecessary aggregation on short trips

In [None]:
# create date ranges for chunking
start_date = min(trips_train.tpep_pickup_datetime)
end_date = max(trips_train.tpep_pickup_datetime)
total_minutes = int((end_date - start_date).seconds / 60) + 1
date_range = [[start_date + timedelta(minutes=ele), 
               start_date + timedelta(minutes=ele + 15)] for ele in range(0, total_minutes, 15)]

# find mid latitude for chunking
mid_lat = np.median(trips_train.pickup_latitude)

# search for optimal waiting time 
efficiency_wait = {}
for max_wait in range(5, 26, 5):
    
    # generate chunks using date ranges
    chunks = (trips_train.loc[(trips_train.tpep_pickup_datetime >= date[0]) & 
                        (trips_train.tpep_pickup_datetime < date[1])]  
              for idx, date in enumerate(date_range))

    # process chunks
    tic = time.time()
    pool = Pool(24)
    results = pool.map_async(partial(agg_by_time, walking_speed=3.4, 
                                     max_wait=max_wait, max_passengers=6), chunks)
    agg_trips = results.get()
    pool.close()
    pool.join()
    agg_trips = pd.concat(agg_trips)
    print(f"finished aggregating with a {max_wait} minute threshold in {time.time() - tic} seconds.")
    print(f"number of aggregated trips: {len(agg_trips)}.")
    
    # add weight to Manhattan trips and get efficiency for shared rides 
    agg_trips = agg_trips.append(agg_trips.loc[agg_trips.weight == 2], ignore_index=True)
    efficiency_wait[max_wait] =  get_efficiency(agg_trips, opportunity_cost=15)
    print(f"mean efficiency: {efficiency_wait[max_wait].efficiency.mean()}\n")

    
        

#### Compare with NYC cabs, walking and biking in the second full week of June 2016
* Use best waiting time threshold for aggregating rides in the previous week.
* Obtain passenger mile / cost for each mode of transportation.
* Assume walking speed of 3.4 mph and biking speed of 11.5 mph.

In [None]:
# helper function to get walking and biking efficiency
def get_efficiency_walk(trips: pd.DataFrame, speed: float=3.4, opportunity_cost: float=15.0):
    
    # add extra weight to manhattan trips
    trips = trips.append(trips.loc[trips.weight==2], ignore_index=True)
    
    # get walking distances
    distances = manhattan_dist(trips.pickup_latitude, trips.dropoff_latitude, 
                               trips.pickup_longitude, trips.dropoff_longitude)
        
    # get efficiency 
    duration = distances / speed
    efficiency = distances / (duration * opportunity_cost)
    
    # return metrics
    metrics = pd.DataFrame({'efficiency': efficiency,
                            'duration': duration * 60,
                            'emissions': [0] * len(efficiency)})
    return metrics

# aggregate rides using a 15 minute cutoff
start_date = min(trips.tpep_pickup_datetime)
end_date = max(trips.tpep_pickup_datetime)
total_minutes = int((end_date - start_date).seconds / 60) + 1
date_range = [[start_date + timedelta(minutes=ele), 
               start_date + timedelta(minutes=ele + 15)] for ele in range(0, total_minutes, 15)]
chunks = (trips.loc[(trips.tpep_pickup_datetime >= date[0]) & 
                    (trips.tpep_pickup_datetime < date[1])]  
          for idx, date in enumerate(date_range))

pool = Pool(24)
results = pool.map_async(partial(agg_by_time, walking_speed=3.4, 
                                 max_wait=15, max_passengers=6), chunks)
agg_trips = results.get()
pool.close()
pool.join()
agg_trips = pd.concat(agg_trips)

# walking
walk_eff = get_efficiency_walk(trips)
print(f"mean efficiency for walking: {walk_eff.efficiency.mean()}\n")

# biking
bike_eff = get_efficiency_walk(trips, speed=11.5)
print(f"mean efficiency for biking: {bike_eff.efficiency.mean()}\n")

# yellow cab
cab_eff = get_efficiency(trips)
print(f"mean efficiency for yellow cabs: {cab_eff.efficiency.mean()}\n")

# ride share 
share_eff = get_efficiency(agg_trips)
print(f"mean efficiency for ride shares: {share_eff.efficiency.mean()}\n")


## Create Visualizations
---

* Neighborhood traffic map
* Efficiency by means of transportation
* Split by neighborhood: Downtown / Midtown / Uptown
* Split by time of day: Morning / Afternoon / Evening
* Split by weekdays vs. weekends

#### Neighborhood traffic map

In [None]:
eff = get_efficiency(trips)
pd.concat([trips.reset_index(), eff], axis=1, join='inner')

In [None]:
# get neighborhood locations and add efficiency metrics to trips
neighborhoods = gpd.read_file('Neighborhood Names GIS/geo_export_ecd7b650-b6be-4ee3-827b-92a33f23d30f.shp')
trips = pd.concat([trips.reset_index(), get_efficiency(trips)], axis=1, join='inner')
agg_trips = pd.concat([agg_trips.reset_index(), get_efficiency(agg_trips)], axis=1, join='inner')

# function to find traffic between neighborhoods
def get_traffic(trips: pd.DataFrame, neighborhoods: gpd.GeoDataFrame):
    
    # extract locations for neighborhoods, pickups and dropoffs
    pickup_points = list(zip(trips.pickup_latitude, trips.pickup_longitude))
    dropoff_points = list(zip(trips.dropoff_latitude, trips.dropoff_longitude))
    nb_points = list(zip(neighborhoods.geometry.y, neighborhoods.geometry.x))
    
    # get number of trips between each pair neighborhoods
    nb_tree = cKDTree(nb_points)
    _, idx_pick = nb_tree.query(pickup_points, k=1)
    _, idx_drop = nb_tree.query(dropoff_points, k=1)
    pairs = [tuple(sorted(ele)) for ele in (zip(neighborhoods.loc[idx_pick, 'name'].values,
                                                neighborhoods.loc[idx_drop, 'name'].values))
            if ele[0] != ele[1]]
    traffic = Counter(pairs)
    
    # get locations and efficiency for each pair of neighborhoods
    print(traffic)
    

traffic = get_traffic(trips, neighborhoods)


# create a categorical column for weekday vs. weekend


# plot perforformance of NYC yellow cabs vs. aggregated rides

#### Plot efficiency across time of the day 

In [None]:
import plotnine as p9

# helper function to add time of day column
def add_time_col(trips: pd.DataFrame):
    trips = trips.assign(time_of_day=['morning'] * len(trips))
    trips.loc[trips.tpep_pickup_datetime.dt.hour >= 12, 'time_of_day'] = 'afternoon'
    trips.loc[trips.tpep_pickup_datetime.dt.hour >= 17, 'time_of_day'] = 'evening'
    trips.loc[(trips.tpep_pickup_datetime.dt.hour >= 20) |
              (trips.tpep_pickup_datetime.dt.hour < 6), 'time_of_day'] = 'night'
    return trips

trips = add_time_col(trips)
print(Counter(trips['time_of_day']))
print(trips.tpep_pickup_datetime)
print(len(trips))

In [None]:
sample_trips = trips.sample(10000)
lines = []

points_pick = [Point(*ele) for ele in zip(sample_trips['pickup_longitude'], sample_trips['pickup_latitude'])]
points_drop = [Point(*ele) for ele in zip(sample_trips['dropoff_longitude'], sample_trips['dropoff_latitude'])]
lines = [LineString([ele1, ele2]) for ele1, ele2 in zip(points_pick, points_drop)]
    
coco = gpd.GeoDataFrame(geometry=lines)
coco['distance'] = list(sample_trips['trip_distance'])
coco['weight'] = list(sample_trips['weight'])
coco.to_file('coco.shp')
