# Efficiency Exploration

The goal is to reduce the number of drivers needed to complete rides without hurting user experience or profit. In this notebook I simulate ride aggregation, examine its impact on profit, number of cars, and user experience (primarily ride duration), and state my assumptions. I focus on the first full week of June 2016, the 6th through the 12th, in Manhattan.

Data:
- Yellow taxi trip data from https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page (yellow_tripdata_2016-06.csv)
- Shapefile for boroughs: https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm


In [1]:
import pandas as pd
import numpy as np
import datetime
import math
from pyproj import Geod
import geopandas as gpd
import warnings
import matplotlib.pyplot as plt
plt.style.use('ggplot')
warnings.filterwarnings('ignore')

df = pd.read_csv('yellow_tripdata_2016-06.csv')
#df = pd.read_csv('yellow_tripdata_2016-06.csv', nrows = 2000) #subset for quick testing
df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2016-06-09 21:06:36,2016-06-09 21:13:08,2,0.79,-73.98336,40.760937,1,N,-73.977463,40.753979,2,6.0,0.5,0.5,0.0,0.0,0.3,7.3
1,2,2016-06-09 21:06:36,2016-06-09 21:35:11,1,5.22,-73.98172,40.736668,1,N,-73.981636,40.670242,1,22.0,0.5,0.5,4.0,0.0,0.3,27.3
2,2,2016-06-09 21:06:36,2016-06-09 21:13:10,1,1.26,-73.994316,40.751072,1,N,-74.004234,40.742168,1,6.5,0.5,0.5,1.56,0.0,0.3,9.36
3,2,2016-06-09 21:06:36,2016-06-09 21:36:10,1,7.39,-73.982361,40.773891,1,N,-73.929466,40.85154,1,26.0,0.5,0.5,1.0,0.0,0.3,28.3
4,2,2016-06-09 21:06:36,2016-06-09 21:23:23,1,3.1,-73.987106,40.733173,1,N,-73.985909,40.766445,1,13.5,0.5,0.5,2.96,0.0,0.3,17.76


In [2]:
df = df.drop(columns=['VendorID','RatecodeID','store_and_fwd_flag','payment_type','extra','mta_tax','tip_amount','tolls_amount','improvement_surcharge','total_amount']) #keep focus columns
df = df.dropna()
df = df[(df != 0).all(axis=1)] #exclude rows containing any zero value

In [3]:
df['tpep_pickup_datetime']=pd.to_datetime(df['tpep_pickup_datetime']) #convert drop off and pickup to datetime
df['tpep_dropoff_datetime']=pd.to_datetime(df['tpep_dropoff_datetime'])

df['duration']=df['tpep_dropoff_datetime']-df['tpep_pickup_datetime'] #feature for duration of ride
df['duration']=df['duration'].dt.total_seconds()/60 #convert to minutes (float)

df = df.sort_values(by=['tpep_pickup_datetime']) #sort by pickup time
df.reset_index(drop=True,inplace=True) #reset after sorting

#used for matching nearby rides in my simulation algorithm later on
df['position'] = df.index.astype(int) 

df.head()

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,fare_amount,duration,position
0,2016-06-01 00:00:00,2016-06-01 00:07:41,1,1.3,-74.008446,40.706024,-74.01339,40.709644,7.5,7.683333,0
1,2016-06-01 00:00:00,2016-06-01 00:15:25,1,5.9,-73.962227,40.760635,-73.922287,40.827213,19.5,15.416667,1
2,2016-06-01 00:00:00,2016-06-01 00:14:11,4,2.3,-73.972916,40.754993,-73.992264,40.725243,11.5,14.183333,2
3,2016-06-01 00:00:00,2016-06-01 00:00:00,1,2.59,-73.997505,40.725487,-73.997742,40.744919,18.0,0.0,3
4,2016-06-01 00:00:01,2016-06-01 00:03:44,3,0.9,-74.002426,40.750156,-73.991066,40.755154,5.0,3.716667,4


In [5]:
#first full week of June 2016: Monday the 6th through Sunday the 12th
start_date = '06-06-2016'
end_date = '06-13-2016'
mask = (df['tpep_pickup_datetime'] > start_date) & (df['tpep_pickup_datetime'] < end_date) 
df = df.loc[mask] 
print(len(df))

2577836


In [6]:
drop_indices = np.random.choice(df.index, 2500000, replace=False) #work with subset for testing
df = df.drop(drop_indices)
n=len(df)

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.pickup_longitude, df.pickup_latitude)) #turn to geopandas
boroughs = "Borough Boundaries/geo_export_c9a5ff22-58ea-448e-9dbf-9521f130c237.shp" #bring in shapemap
data = gpd.read_file(boroughs)[['geometry','boro_name']] #turn into geopandas dataframe 
df = gpd.sjoin(gdf, data, how="inner", predicate='intersects') #spatial join to map to borough ('predicate' replaces the older 'op' argument in geopandas >= 0.10)
df = df[df['boro_name']=='Manhattan'].drop(columns=['index_right']) #only manhattan

df[['tpep_pickup_datetime','pickup_longitude','pickup_latitude','geometry','boro_name']].head()

Unnamed: 0,tpep_pickup_datetime,pickup_longitude,pickup_latitude,geometry,boro_name
1903107,2016-06-06 00:00:24,-74.007278,40.705364,POINT (-74.00728 40.70536),Manhattan
1903108,2016-06-06 00:00:26,-73.986488,40.730331,POINT (-73.98649 40.73033),Manhattan
1903291,2016-06-06 00:01:47,-74.011902,40.703724,POINT (-74.01190 40.70372),Manhattan
1903401,2016-06-06 00:02:38,-74.002243,40.739922,POINT (-74.00224 40.73992),Manhattan
1903413,2016-06-06 00:02:44,-74.003761,40.742073,POINT (-74.00376 40.74207),Manhattan


# Functions

I now create a few more functions for my algorithm to simulate aggregating rides heading in the same direction.

One feature I use is angle (the direction of the ride), measured along a straight line from pickup to dropoff. In production I would likely simulate the actual route of each ride and compare rides on that basis (i.e. is the pickup location of one passenger near a point along the route of another passenger).

In [7]:
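#get direction of pickup to dropoff (to determine rides going in same direction)
#note: math.sin/math.cos interpret their inputs as radians, while these coordinate columns are in degrees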
def angleFromCoordinate(df):
    
    dLon = (df['dropoff_longitude'] - df['pickup_longitude'])

    y = math.sin(dLon) * math.cos(df['dropoff_latitude'])
    x = math.cos(df['pickup_latitude']) * math.sin(df['dropoff_latitude']) - math.sin(df['pickup_latitude'])* math.cos(df['dropoff_latitude']) * math.cos(dLon)

    brng = math.atan2(y, x)

    brng = math.degrees(brng)
    brng = (brng + 360) % 360
    brng = 360 - brng

    return brng

df['angle'] = df.apply(angleFromCoordinate, axis=1)

df[['pickup_longitude','pickup_latitude','angle']].head()

Unnamed: 0,pickup_longitude,pickup_latitude,angle
1903107,-74.007278,40.705364,13.482426
1903108,-73.986488,40.730331,195.535016
1903291,-74.011902,40.703724,52.582385
1903401,-74.002243,40.739922,18.704163
1903413,-74.003761,40.742073,181.612315
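
`math.sin` and `math.cos` take radians, while the coordinate columns are in degrees. Below is a minimal sketch of the same straight-line bearing with the inputs converted first. The function name `bearing_degrees` and its scalar arguments are hypothetical and not part of the notebook. It returns the clockwise compass bearing rather than the `360 - brng` form used above, so its values will differ from those in the table.

```python
import math

def bearing_degrees(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Initial great-circle bearing from pickup to dropoff, in degrees clockwise from north."""
    # convert degrees to radians before applying trigonometric functions
    lat1 = math.radians(pickup_lat)
    lat2 = math.radians(dropoff_lat)
    d_lon = math.radians(dropoff_lon - pickup_lon)

    y = math.sin(d_lon) * math.cos(lat2)
    x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(d_lon)

    # atan2 returns a value in (-pi, pi]; shift it into [0, 360)
    return (math.degrees(math.atan2(y, x)) + 360) % 360

# a ride heading straight up the same longitude points north
north = bearing_degrees(40.0, -74.0, 41.0, -74.0)
# a ride heading straight along the equator points east
east = bearing_degrees(0.0, 0.0, 0.0, 1.0)
```

With a pandas dataframe it could be applied row by row, for example `df.apply(lambda r: bearing_degrees(r['pickup_latitude'], r['pickup_longitude'], r['dropoff_latitude'], r['dropoff_longitude']), axis=1)`.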


# Main Ride Share Algorithm

I create a function called shareRide to determine whether two rides are a good fit to aggregate. It is based on:

- Angle (the direction each ride is heading)
- Pickup time
- Total route distance
- Pickup location distance

I then set some constraints:
- Angles are within 15 degrees of each other
- Pickup times are within 5 minutes of each other
- Route distances differ by no more than 3 miles
- Pickup locations are within 700 meters of each other, a little under half a mile (using Geod for this distance)
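
The angle constraint has to handle wrap-around: headings of 350 and 10 degrees are 20 degrees apart, not 340. A minimal sketch of that comparison and of the constraint check follows. The helper names `angle_difference` and `within_constraints` and the sample values are hypothetical, chosen only for illustration.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two headings, in degrees (0 to 180)."""
    return 180 - abs(abs(a - b) - 180)

def within_constraints(values, limits):
    """True only if every measured value is at or below its limit."""
    return all(value <= limit for value, limit in zip(values, limits))

# limits: angle (degrees), pickup time (minutes), route distance (miles), pickup distance (meters)
limits = [15, 5, 3, 700]

# headings of 355 and 5 degrees are 10 degrees apart, so this pair passes
close_pair = within_constraints([angle_difference(355, 5), 3.0, 1.2, 400.0], limits)

# headings of 350 and 10 degrees are 20 degrees apart, so this pair fails on angle
wide_pair = within_constraints([angle_difference(350, 10), 3.0, 1.2, 400.0], limits)
```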

The algorithm takes in rides and compares them to each other to determine whether they are a good fit to aggregate. When two rides are a good fit, each ride's position in the dataframe is added to the other's matches list. The constraints can easily be adjusted.

I use a batch-based process so that the algorithm can scale. The data is already sorted by pickup time, so potential matches sit near each other in the dataframe, and I feed the dataframe in chunks, comparing all rows within a chunk to each other. This way the number of comparisons is not n * n but (x * x) * (n/x) = n * x, where n is the number of rides and x is a constant batch size chosen based on the density of rides.
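
To make the saving concrete, here is a small sketch that counts comparisons under both schemes. The function names and the ride count of 75,000 are hypothetical, and the sketch counts full batches only, ignoring any leftover rows.

```python
def pairwise_comparisons(n):
    """Comparisons needed when every ride is compared to every ride."""
    return n * n

def batched_comparisons(n, x):
    """Comparisons needed when rides are compared only within batches of size x."""
    full_batches = n // x
    return full_batches * x * x

n_rides = 75000   # hypothetical number of rides
batch_size = 150  # batch size x

all_pairs = pairwise_comparisons(n_rides)
batched = batched_comparisons(n_rides, batch_size)
ratio = all_pairs // batched  # equals n / x when n is a multiple of x
```

The trade-off is that two rides falling in different batches are never compared, so the batch size has to be large enough to cover the pickup-time window at the observed ride density.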

In production, I'd rather use a simulation of Via's own algorithm for determining if 2 rides should be aggregated, or create a lightweight version similar to Via's constraints.



In [8]:
wgs84_geod = Geod(ellps='WGS84')

#function to determine if 2 rides are a good fit to consolidate
def shareRide(unit1, unit2, time1, time2, dist1, dist2, lat1, lon1, lat2, lon2, constraints):

    angleDiff = 180 - abs(abs(unit1 - unit2) - 180) #angle difference

    diff = abs(time1 - time2) #pickup time difference
    pickupTimeDiff = diff.total_seconds()/60

    distanceDiff = abs(dist1 - dist2) #route distance difference

    az12,az21,pickupLocationDiff = wgs84_geod.inv(lon1, lat1, lon2, lat2) #pickup location difference
    
    values = [angleDiff, pickupTimeDiff, distanceDiff, pickupLocationDiff]
    
    #the rides are a fit only if every value is within its constraint
    return all(value <= limit for value, limit in zip(values, constraints))


In [9]:
constraints = [15,5,3,700] #(angle in degrees, pickup time in minutes, route distance in miles, pickup location distance in meters)
df_shared = pd.DataFrame() #dataframe to store aggregated rides
df['matches']=0
skip = 150 #constant for batches, chosen based on density of rides
start=0
end = skip 

#function to apply matching values to df rows
def func(unit1, time1, dist1, lat1, lon1):
        matches = df.iloc[start:end].apply(lambda row: (shareRide(row['angle'], unit1, 
                                              row['tpep_pickup_datetime'], time1,
                                                  row['trip_distance'], dist1,
                                                 row['pickup_latitude'],row['pickup_longitude'],
                                                 lat1, lon1, constraints)), axis=1)
        
        #focusing on only 2 car aggregation, adjust this for >2 car grouping ([0:2])
        return [i for i, x in enumerate(matches) if x][0:2]
    
#batch matching process for scalability, df sorted by pickup times to ensure close fits
while end <= len(df): 
    #apply aggregation algorithm of chunk against itself
    df['matches'][start:end]=df[start:end].apply(lambda row: func(row['angle'],row['tpep_pickup_datetime'], 
                              row['trip_distance'],
                              row['pickup_latitude'],row['pickup_longitude']), axis=1)
    
    
    #aggregate matches and add to dataframe of matched rides
    df_current = df[start:end].copy() #grab chunk
    df_current['matches']=df_current['matches'].astype('str')
    #keep parameters to observe
    df_current=df_current[['matches','fare_amount','duration','passenger_count','tpep_pickup_datetime','pickup_latitude','pickup_longitude']]
    #grouping rules
    d = {'fare_amount':'sum', 'duration':'sum','passenger_count':'sum', 'tpep_pickup_datetime':'first', 'pickup_latitude':'first', 'pickup_longitude':'first'} 
    
    #rides that match with each other and no other rides
    df_current = df_current[df_current.duplicated('matches') | df_current.duplicated('matches', keep='last')]
    df_current=df_current.groupby(['matches']).agg(d) #group by those that matched together
    
    df_shared = pd.concat([df_shared, df_current]) #add chunk to global df of matched rides (DataFrame.append was removed in pandas 2.0)
    
    start+=skip #advance chunk
    end+=skip

df_shared.head()


matches,fare_amount,duration,passenger_count,tpep_pickup_datetime,pickup_latitude,pickup_longitude
"[132, 134]",14.0,12.333333,2,2016-06-06 00:44:03,40.733292,-73.991173
"[35, 40]",30.5,30.25,2,2016-06-06 00:13:23,40.724777,-73.983513
"[37, 41]",33.5,35.016667,2,2016-06-06 00:13:40,40.718307,-73.988235
"[46, 58]",26.0,30.233333,5,2016-06-06 00:17:10,40.751881,-73.994583
"[62, 68]",10.5,9.5,2,2016-06-06 00:20:50,40.721066,-73.993973


# Example of 2 Rides That Aggregated

<img src="consolidated.png">

# Efficiency Evaluation

We've consolidated fare price, cars, passengers, and duration and can now evaluate efficiency overall.

##### Assumptions:
- Each passenger pays a flat fee of \$7 on average
- Drivers are paid \$20 an hour
- The cost saved by removing a driver (aggregating 2 cars into one) is the average duration of the combined rides, paid at the driver's hourly wage (\$20/hour)
- Each passenger on a shared ride adds 2 minutes for pickup and drop-off
- A shared ride cannot carry more than 4 passengers
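
The assumptions above can be sketched as a single calculation for one shared ride. The function `shared_ride_economics`, its constant names, and the sample inputs are hypothetical. The inputs are the summed fare and summed duration of the original rides, which is how the aggregated dataframe stores them.

```python
FLAT_FEE = 7.0               # dollars per passenger (assumption from the list above)
HOURLY_WAGE = 20.0           # dollars per driver hour (assumption from the list above)
MINUTES_PER_PASSENGER = 2.0  # extra pickup and drop-off time per passenger

def shared_ride_economics(total_fare, total_duration, passengers, cars_combined=2):
    """Compare a shared ride against the original rides it replaces."""
    new_fare = passengers * FLAT_FEE
    # average original ride duration (minutes) paid at the per-minute wage
    driver_cost_saved = total_duration / cars_combined * (HOURLY_WAGE / 60)
    new_duration = total_duration + passengers * MINUTES_PER_PASSENGER
    fare_difference = new_fare - total_fare
    return {
        'new fare': new_fare,
        'driver cost saved': driver_cost_saved,
        'duration difference': new_duration - total_duration,
        'fare difference': fare_difference,
        'profit difference': fare_difference + driver_cost_saved,
    }

# two single-passenger rides with a combined fare of $14 and a combined 12 minutes
example = shared_ride_economics(total_fare=14.0, total_duration=12.0, passengers=2)
```

In this example the flat fees exactly replace the original fares, so the only gain is the driver cost saved, about \$2, in exchange for 4 extra minutes.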

##### Efficiency estimation:
I define an efficiently aggregated ride as one that remains profitable once combined and keeps each user's ride no more than 5-10 minutes longer than it would have been before. If we at least remain largely net positive on revenue without hurting user experience, I also count this as efficient: we remove cars from the road, give users a better experience (a low-cost alternative to the fare they paid previously), and gain other benefits from the platform (reduced CO2 emissions, etc.).
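
One way to express the first criterion as a check on a single shared ride is sketched below. The function name `is_efficient` and the default threshold of 10 minutes (the upper end of the 5-10 minute range) are assumptions for illustration.

```python
def is_efficient(profit_difference, duration_difference, max_added_minutes=10.0):
    """True if the shared ride does not lose money and adds at most max_added_minutes."""
    return profit_difference >= 0 and duration_difference <= max_added_minutes

# profitable with 4 added minutes: counts as efficient
profitable_short = is_efficient(profit_difference=3.0, duration_difference=4.0)
# loses money: not efficient, even though the added time is small
unprofitable = is_efficient(profit_difference=-9.0, duration_difference=4.0)
# profitable but adds too much time: not efficient
too_slow = is_efficient(profit_difference=5.0, duration_difference=12.0)
```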


In [13]:
df_shared['list'] = df_shared.index
df_shared['cars combined'] = df_shared['list'].str.split(',').map(len) #number of cars consolidated into one
df_shared = df_shared.drop(columns=['list'])

df_shared = df_shared[df_shared['passenger_count']<5] #can't have more than 4 passengers
df_shared['new fare'] = df_shared['passenger_count']*7 #$7 flat fee per passenger

df_shared['driver cost saved'] = df_shared['duration']/2*(20/60) #average ride duration at the $20 hourly rate
#df_shared['driver cost saved'] = df_shared['duration']/df_shared['cars combined']*(20/60) #if allowing more than 2 cars aggregated

df_shared['new duration'] = df_shared['duration']+(df_shared['passenger_count']*2) #2 extra minutes per passenger for pickup and drop-off

#looking at change in fare, duration, profit, before and after aggregated
df_shared['fare difference'] = df_shared['new fare']-df_shared['fare_amount']
df_shared['duration difference'] = df_shared['new duration']-df_shared['duration']
df_shared['profit difference'] = df_shared['fare difference'] + df_shared['driver cost saved']

df_shared.head()

matches,fare_amount,duration,passenger_count,tpep_pickup_datetime,pickup_latitude,pickup_longitude,cars combined,new fare,driver cost saved,new duration,fare difference,duration difference,profit difference
"[132, 134]",14.0,12.333333,2,2016-06-06 00:44:03,40.733292,-73.991173,2,14,3.083333,16.333333,0.0,4.0,3.083333
"[35, 40]",30.5,30.25,2,2016-06-06 00:13:23,40.724777,-73.983513,2,14,7.5625,34.25,-16.5,4.0,-8.9375
"[37, 41]",33.5,35.016667,2,2016-06-06 00:13:40,40.718307,-73.988235,2,14,8.754167,39.016667,-19.5,4.0,-10.745833
"[62, 68]",10.5,9.5,2,2016-06-06 00:20:50,40.721066,-73.993973,2,14,2.375,13.5,3.5,4.0,5.875
"[71, 84]",13.0,9.033333,4,2016-06-06 00:24:13,40.754555,-73.983658,2,28,2.258333,17.033333,15.0,8.0,17.258333


In [20]:
print('total rides shared:', len(df_shared)) #number of shared rides created; each one replaces two original rides
print('total saved in driver costs: $', df_shared['driver cost saved'].sum())
print()

print('average increase in rider time:',df_shared['duration difference'].mean().round(2), 'minutes')
print('average increase in ride profit: $', df_shared['profit difference'].mean().round(2))
print('total revenue increase: $', df_shared['profit difference'].sum().round(2))

#df_shared.to_csv('matched rides.csv') #used in location and time analysis

total rides shared: 6696
total saved in driver costs: $ 33712.0

average increase in rider time: 5.03 minutes
average increase in ride profit: $ 3.1
total revenue increase: $ 20786.15


# Conclusion

Overall, we keep the added rider time to about 5 minutes on average while making about \$3.10 more per shared ride and removing the need for some drivers. Manhattan was an efficient area for aggregating rides in the week of June 6th through 12th. Further analysis of the area would be needed to ensure all costs are accounted for and that routes and passenger fares are accurately represented.

Again, this is my own estimate and method of determining efficiency for the algorithm above. Depending on the rideshare algorithm used in practice, results may vary. However, it can provide a metric and method for assessing the potential efficiency of other areas in future additions and iterations.