# Going to work as a yellow taxi driver

> Imagine that you decide to drive a taxi for 10 hours each week to earn a little extra money. Explain how you would approach maximizing your income as a taxi driver.

# Approach

If I want to maximize the amount of money I make, I should find the times of day each week which result in the most profitable taxi rides. I will look for times of day where there are lots of taxi rides, and where those taxi rides result in a high revenue per hour. Ideally (for me, the taxi driver trying to maximize the money I make) taxis would be able to negotiate their rates, and then I would pick times with the most demand for taxi rides relative to the number of taxis on the road. If I drove for Uber, this would probably involve looking for times when surge pricing is likely to be in effect. 

Unfortunately, I purchased a taxi medalion instead of an iPhone, so I can't drive for Uber. Taxis have a fixed taxi rate, that doesn't vary by time of day. As such, its not really great to wait for there to be very few taxis on the road, if at that time there will also be very few people waiting for taxis – I'll just spend a lot of time waiting to pick up passengers.

Instead, I'll pick times when there are lots of taxi rides recorded, implying there were lots of people who wanted to take a taxi. Preferrably, there will be lots of short taxi rides, because taxis have a fixed initial fare, and then a variable fare after the first moment, and it is generally more profitiable for me to pick up many riders, rather than give one longer ride.

Finally, its no good if there are lots of taxi rides, but they are spread out all over the city. So I'll concecntrate on rides split by pick up location as recorded in the taxi dataset.

## Weaknesses

There are some flaws to my approach. Noteably, it might be that although there are a lot of taxi riders at some location at some time, there are also a lot of taxis around at that time. Unfortuantely, the dataset doesn't identify individual taxis, so I can't figure out how much time taxis spend idling. If we could get data which identifies a taxi, that would be great. Without that data, we could model the process of taxis waiting around for a new ride. 

I'd approach this using a data-driven model, where I can use the data to provide a lower bound on the number of taxis in a given location at a given time. I would bound the number of taxis active at any time by assuming that no taxi can be conducting multiple trips at the same time. Then, I could model some over-supply of taxis. The process would look roughtly like the following:

- For each time interval, figure out how many taxis are occupied with fares on the road.
- Apply some over-supply parameter which determines how many extra taxis which are also on the road.
- At each time point, randomly assign a taxi to a ride.
- Continue the simulation for some time, recording, for each taxi, the effective hourly rate earned by that taxi.
- Compute the expected value of the effective hourly rate (or better, examine the distributions of hourly rates).

# Exploration

Let's explore the taxi data set, and understand what it contains, and what we might do with it...

In [7]:
%matplotlib inline
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
sns.set(style='whitegrid')

In [2]:
yellow_trips = pd.read_csv("data/yellow_tripdata_2017-06.csv", parse_dates=[1,2])

In [3]:
yellow_trips.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2017-06-08 07:52:31,2017-06-08 08:01:32,6,1.03,1,N,161,140,1,7.5,1.0,0.5,1.86,0.0,0.3,11.16
1,2,2017-06-08 08:08:18,2017-06-08 08:14:00,6,1.03,1,N,162,233,1,6.0,1.0,0.5,2.34,0.0,0.3,10.14
2,2,2017-06-08 08:16:49,2017-06-08 15:43:22,6,5.63,1,N,137,41,2,21.5,1.0,0.5,0.0,0.0,0.3,23.3
3,2,2017-06-29 15:52:35,2017-06-29 16:03:27,6,1.43,1,N,142,48,1,8.5,1.0,0.5,0.88,0.0,0.3,11.18
4,1,2017-06-01 00:00:00,2017-06-01 00:03:43,1,0.6,1,N,140,141,1,4.5,0.5,0.5,2.0,0.0,0.3,7.8


In [4]:
yellow_trips.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
VendorID,9656993.0,1.546961,0.49779,1.0,1.0,2.0,2.0,2.0
passenger_count,9656993.0,1.623943,1.264608,0.0,1.0,1.0,2.0,9.0
trip_distance,9656993.0,2.978617,5.704095,0.0,1.0,1.67,3.1,9496.98
RatecodeID,9656993.0,1.045527,0.566504,1.0,1.0,1.0,1.0,99.0
PULocationID,9656993.0,162.623514,66.752232,1.0,114.0,162.0,233.0,265.0
DOLocationID,9656993.0,160.737865,70.473428,1.0,107.0,162.0,233.0,265.0
payment_type,9656993.0,1.33404,0.492962,1.0,1.0,1.0,2.0,5.0
fare_amount,9656993.0,13.287274,215.167502,-550.0,6.5,9.5,15.0,630461.82
extra,9656993.0,0.341331,0.462329,-50.56,0.0,0.0,0.5,22.5
mta_tax,9656993.0,0.497225,0.076252,-0.5,0.5,0.5,0.5,140.0


## Infering Cash Tips

The data does not include tips on cash payments, so we are going to infer those as having about the same tipping rate as credit cards.

In [5]:
yellow_trips['tip_percent'] = (yellow_trips['tip_amount'] / (yellow_trips['total_amount'] - yellow_trips['tip_amount'])).clip(0, 1)

In [None]:
sns.distplot(yellow_trips.query("payment_type == 1 and total_amount > 0")['tip_percent'], kde=False)

The tip distribution is pretty narrow, and well focused around 20%. We will assume that the median tip percent for these credit card trips is also the median tip provided for cash trips.

In [None]:
median_cc_tip = yellow_trips.query('payment_type == 1 and total_amount > 0')['tip_percent'].quantile(0.5)
cash_trips = yellow_trips.eval('payment_type == 2 and total_amount > 0')
yellow_trips.loc[cash_trips, 'tip_amount'] = yellow_trips.loc[cash_trips, 'total_amount'] * median_cc_tip

## Data Transformation

Adding a few useful columns which are derived from the existing ones:

- Trip duration, to figure out how long the meter was running.
- Did the trip change taxi zones between pickup and drop-off.
- How much revenue did the trip generate for the driver, excluding the MTA tax, tolls, and the improvement surcharge. The remaining components I assume are linearly porportional to driver revenue, and therefore driver profit. There are certainly some non-linearities in this relationship, but I won't be modeling the driver's entire business model.

In [None]:
yellow_trips['trip_duration'] = pd.to_datetime(yellow_trips['tpep_dropoff_datetime']) - pd.to_datetime(yellow_trips['tpep_pickup_datetime'])
yellow_trips['delta_location'] = (yellow_trips['PULocationID'] != yellow_trips['DOLocationID']).astype(int)
yellow_trips['driver_revenue'] = yellow_trips['fare_amount'] + yellow_trips['tip_amount'] + yellow_trips['extra']

## Excluding unpaid trips

I can also exclude trips which weren't paid in cash or card, as they don't count as revenue and represent a small piece of the data which I can't account for. I could include "No Charge" trips, as those might represent times when a driver decided not to charge the riders (many of those trips include non-zero `total_amount` columns, but for this analysis, I'm choosing to infer that I'm a perfect driver, nothing is ever my fault, and I'll give myself the optimism of always charging for trips. This also amounts to assuming that the likelihood of not charging for a trip is independent of the time I work, which is probably not a bad assumption.

In [None]:
yellow_trips.payment_type.value_counts(normalize=True).map("{:.1%}".format)

In [None]:
VALID_PAYMENTS = yellow_trips.payment_type.isin((1,2))
include = VALID_PAYMENTS.copy()

## Additional Outliers

Above, I have excluded trips and infered data where the data dictionary suggests that data will either be incomplete, not relevant, or where transformation was necessary to inform the dataset. In this section, I'll look for additional annomolies in the data which ought to be cleaned out.

### Some fares are very large (>$100k)

In [None]:
yellow_trips.query('fare_amount > 1_000')[['VendorID', 'trip_duration', 'trip_distance', 'RatecodeID', 'PULocationID', 'DOLocationID', 'fare_amount']]

These trips fall into three categories
1. long trips which took a long time
2. short trips which have very little recorded distance and a very short duration but a high fare.
3. trips which have a fare of exactly 9999.99, which is probably indicative of a missing value. These trips have vendorID = 2

I've chosen to discard category (3) trips, as they look like likely errors or cases where the fare can't actually be recorded. Also, even if they were real, when deciding when and how to work betting on $10,000 trips isn't a great idea.

In [None]:
V2_OUT_OF_BOUNDS = 'VendorID == 2 and fare_amount == 9999.99'
yellow_trips.query(V2_OUT_OF_BOUNDS)

These four trips appear to have exceeded the fare the meter can report. We could leave them in the data set in the hopes that they indicate where we might pick up very long trips, as they appear to only be truncated.

Removing them from the dataset arguably makes the truncation worse, as the expected value of trips will decrease.

However, we only really want to work for 10 hours per week, so excluding them might help us find times where we don't have to go on 2 day, 21 hour taxi trips.

For category (2), we can exclude any trip with 0 distance and 0 duration on the taxi meter as likely erroneous data.

In [None]:
yellow_trips[VALID_PAYMENTS].query('trip_duration.dt.seconds < 1 and trip_distance < 0.01').describe().T

There is quite a bit of data here (almost 9000 trips), but it contains an error somewhere (no distance or time) and represents only 0.1% of trips taken, so we should feel okay excluding them. With more work, we could try to infer the correct values (e.g. inferring distance using the pick up and drop off locations)

In [None]:
VERY_SHORT_TRIPS = 'trip_duration.dt.seconds < 1 and trip_distance < 0.01'
print("{0:.4%}".format(yellow_trips[VALID_PAYMENTS].eval('trip_duration.dt.seconds < 1 and trip_distance < 0.01').mean()))

In [None]:
include &= ~(yellow_trips.eval(VERY_SHORT_TRIPS) | yellow_trips.eval(V2_OUT_OF_BOUNDS))

In [None]:
yellow_trips[include].query('fare_amount > 500')

There are only 2 trips which cost more than $10,000, and both look suspicious (short durations and short distances. I'll exclude any trip which costs more than that.

In [None]:
EXPENSIVE_TRIPS = yellow_trips.eval("total_amount > 10_000")
include &= ~EXPENSIVE_TRIPS

In [None]:
yellow_trips['trip_duration_s'] = yellow_trips['trip_duration'].dt.seconds + yellow_trips['trip_duration'].dt.days * 24 * 60 * 60

In [None]:
sns.pairplot(yellow_trips[include][['trip_duration_s', 'trip_distance', 'fare_amount']].sample(100_000))

Some of these outliers seem really large. I can estimate what the fare should have been based on the NYC TLC website's fare descriptor. We'll ignore anything complicated, and just use this to see if we can define what an outlier is in terms of trips that took longer than 60s or went further than 0.2 miles. For trips shorter than both of those, we'll also look at the fare distribution, and understand if there are problems in the data there.

### Long trip fare estimation

In [None]:
LONG_TRIPS = yellow_trips.eval('trip_distance > 0.2 or trip_duration_s > 60')

In [None]:
yellow_trips['estimated_fare'] = 2.5 + (yellow_trips['trip_distance'] / 5) * 0.5 + ((yellow_trips['trip_duration'].dt.seconds + yellow_trips['trip_duration'].dt.days * 24 * 60) / 60) * 0.5

In [None]:
sns.jointplot(data=yellow_trips[LONG_TRIPS & include].sample(100_000), x='fare_amount', y='estimated_fare')

Since I only want to drive a taxi for 10 hours a week, lets exclude these really long taxi sessions from our data.

I could identify them in a principled way (e.g. % error between estimated fare and actual fare), but for now, I'll take a short cut and just eliminate trips longer than 10 hours

In [None]:
LONG_DURATION_TRIPS = (yellow_trips.trip_duration.dt.seconds >= (10 * 60 * 60)) | (yellow_trips.trip_duration.dt.days > 1)

In [None]:
include &= ~LONG_DURATION_TRIPS

In [None]:
sns.jointplot(data=yellow_trips[LONG_TRIPS & include].sample(100_000), x='fare_amount', y='estimated_fare')

This looks decent. Let's take one last look at the overall distributions

In [None]:
yellow_trips[include].describe().T

Some trips still have a negative driver revenue amount – those are probably indicative of something stragne in the data, and I'll exclude them.

In [None]:
NO_REVENUE = yellow_trips['driver_revenue'] <= 0
include &= ~NO_REVENUE

In [None]:
print("{:.1%}".format(include.mean()))

In this process, I cleaned the dataset, and ended up excluding 1% of trips overall.

I'm going to save this data for easy re-processing later, after dropping the estimated fare (I'll use the actual value going forward)

_I'm using a pickle, because although its not particularily efficient or portable, it is a quick way to recover the work done above without re-computing anything_

In [None]:
yellow_trips[include].drop('estimated_fare', axis=1).to_pickle("data/yellow_tripdata_2017-06.cleaned.pkl", compression="gzip")

# Estimating Revenue

As a first pass estimate, I'll look at revenue per second, by hour, folded over the course of a week. This will help predict when profitible work times might occur. There is a weakness, though, and that is that revenue is only captured here when the meter is running.

In [None]:
yellow_trips = pd.read_pickle("data/yellow_tripdata_2017-06.cleaned.pkl", compression="gzip")

In [None]:
yellow_trips.head()

To find profitable times to drive a taxi, I'll transform the data to contain driver revenue per hour:

In [None]:
yellow_trips['driver_revenue_rate'] = yellow_trips['driver_revenue'] / yellow_trips['trip_duration_s'] * 60.0 * 60.0

Check the distribution of driver revenue rates – there may be some outliers:

In [None]:
np.percentile(yellow_trips['driver_revenue_rate'], [0.1, 99.5])

The bottom end of this range looks reasonable (effectively \$35/hour) but the top end is questionable.

First, we'll remove any trips with fares greater than $1000

In [None]:
yt_filtered_rates = yellow_trips.eval("driver_revenue_rate > 400 and fare_amount < 1000")

yellow_trips[yt_filtered_rates][['trip_duration_s', 'trip_distance', 'fare_amount']].describe()

Many of these high-rate trips are very short. I am tempted to remove them, but won't, since maybe the most profitable way to be a taxi driver is to find a location with lots of short trips.

Let's start by looking at what times of day ar profitable for taxi drivers:

In [None]:
trips_by_time = yellow_trips[~yt_filtered_rates][['tpep_pickup_datetime', 'driver_revenue_rate']].set_index('tpep_pickup_datetime')
earning_times = trips_by_time.resample('15min')['driver_revenue_rate'].median()

In [None]:
from matplotlib.gridspec import GridSpec

ONE_DAY = pd.date_range('2000-1-1 00:00', '2000-1-2 00:00', freq='1H').time

def rename_dow(df):
    """Rename a pandas dataframe which contains columns 0-6 to use day of week names"""
    return df.rename(columns={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})

def earnings_view(trips, dow=False):
    """Make an earnings view"""
    earnings = trips.set_index('tpep_pickup_datetime').resample('15min')['driver_revenue_rate'].median()
    if dow:
        earnings_groups = [earnings.index.dayofweek]
    else:
        earnings_groups = []
    earnings_time = earnings.groupby(earnings_groups + [earnings.index.time]).median()
    if earnings_time.index.nlevels > 1:
        earnings_time = earnings_time.unstack(0)
    if dow:
        earnings_time = rename_dow(earnings_time)
    
    demand = trips.set_index('tpep_pickup_datetime').resample('15min')['driver_revenue_rate'].count()
    if dow:
        demand_groups = [demand.index.dayofweek]
    else:
        demand_groups = []
    demand_time = demand.groupby(demand_groups + [demand.index.time]).median()
    if demand_time.index.nlevels > 1:
        demand_time = demand_time.unstack(0)  
    if dow:
        demand_time = rename_dow(demand_time)
        
        
    fig = plt.figure(figsize=(17.0, 6.0))
    gs = GridSpec(2, 2, figure=fig, width_ratios=[1.0, 0.1], hspace=0.5)
    
        
    ax_earnings = fig.add_subplot(gs[0,0])
    ax_trips = fig.add_subplot(gs[1,0])
    ax_legend = fig.add_subplot(gs[:,1])
    
    earnings_time.plot(xticks=ONE_DAY, ax=ax_earnings, legend=False)
    ax_earnings.set_title("Earnings by Time of Day")
    ax_earnings.set_ylabel("Earnings ($/hour/trip)")
    ax_earnings.legend(bbox_transform=ax_legend.transAxes, bbox_to_anchor=(0,0.5), loc='center left')
    
    demand_time.plot(xticks=ONE_DAY, ax=ax_trips, legend=False)
    ax_trips.set_title("Trips by Time of Day")
    ax_trips.set_ylabel("N Trips")
    ax_trips.set_xlabel("Time")
    
    ax_legend.remove()

In [None]:
earnings_view(yellow_trips[~yt_filtered_rates])

Earnings peak between 4am and 6am on the median day, but trip volume is also quite low at these times. Trip volume rises throughout the day, and stays relatively steady through midnight of each day.

I should check to see if there are weekly cycles in this data:

In [None]:
earnings_view(yellow_trips[~yt_filtered_rates], dow=True)

The earning data shows a familiar pattern – weekdays show the same shape, and weekends show different patterns, with less concentration around the early morning rush-hour time, and the trip count data is pretty consistent, with fewer rides on early saturday mornings and sundays generally having fewer trips overall.

There are two further things I could investigate to maximize my earnings here:

1. Is it more profitable to work on weekends, where there is a longer time in which there are high-margin taxi rides?
2. What is the optimal set of 15 minute blocks to work to maximize earnings?

Instead of doing anything sophisticated (after all, I lack the data to compare supply and demand in these times), let's make some simplifying assumptions. If I become a taxi driver, I can test these assumptions.

The periods of good work look (by eye) to be around midnight (when there is a good trade off between trip volume and earnings per hour, and after dinner (when earnings per hour is in the middle of the expeted range, and trip count is highest).

Instead of finely optimizing the time of day I'll drive as a taxi driver, I'll examine _where_ I might like to start my taxi journeys to make the most money.


## Geography

Lets consider weekday post-dinner-hour only (if I'm going to work 10 hours a week, it looks like working 10pm-12am Monday through Friday is a decent first bet) and see if I should optimize where I work.

In [None]:
after_ten = yellow_trips['tpep_pickup_datetime'].dt.time < dt.time(22,0,0)
weekday = yellow_trips['tpep_pickup_datetime'].dt.dayofweek.isin(set(range(5)))

dinnerhour_trips = yellow_trips[after_ten & weekday & ~yt_filtered_rates]

To avoid noisy locations, I'm going to exclude any location which has fewer than 80 trips in the dataset. 80 is a conservative lower bound, because 80 implies that in 4 weeks, there was an average of four trips per day at rush hour in those locations. I'll certainly need more taxi demand than that if my work is going to be interesting.

I'm also going to remove Pick Up zones 264 and 265, as they are labeled 'Unknown'

Finally, I'm going to filter out EWR (Newark Airport). It might be a profitable place to work as a taxi driver, but airports seem like a prime location for a mismatch between supply and demand – I've often observed long lines of waiting taxis at an airport, and wondered if those long lines (or entire parking lots) are worth the driver's time. That's certainly a problem to be investigated, but needs more data than this data set can provide.

In [None]:
keep_locations = (dinnerhour_trips.groupby('PULocationID')['driver_revenue_rate'].count() >= 80)
keep_filter = dinnerhour_trips['PULocationID'].isin(keep_locations.index[keep_locations])
keep_filter &= ~dinnerhour_trips['PULocationID'].isin((264,265))
keep_filter &= ~dinnerhour_trips['PULocationID'].isin((1,))

dinnerhour_trips_filtered_trips = dinnerhour_trips[keep_filter]

Looking at the distribtuion of driver revenue rates will tell me if there are specific locations which result in significantly higher earnings than others:

In [None]:
sns.distplot(dinnerhour_trips_filtered_trips.groupby('PULocationID')['driver_revenue_rate'].median(), kde=False, bins=30)

Lets look at the 5 most profitable pickup areas, and see if there is potential value in sticking to those locations:

In [None]:
locations = pd.read_csv('data/taxi+_zone_lookup.csv', index_col=0)

In [None]:
locations.head()

Rush-hour profitable taxi pickup locations

In [None]:
by_loc = dinnerhour_trips_filtered_trips.groupby('PULocationID')['driver_revenue_rate'].agg(['median', 'count']).sort_values('median', ascending=False)
by_loc.join(locations).head(7)

## In Demand Loactions

Lets look at trips in these locations

In [None]:
notairport_trips = yellow_trips[yellow_trips['PULocationID'].isin(set(by_loc.head(7).index) - set((138,132,1))) &  ~yt_filtered_rates]
earnings_view(notairport_trips, dow=True)

Many of these locations are in Queens, where both green and yellow taxis can pick up riders, and many show only 1 or two trips per 15 minute increment, even during the time I care about. 

Instead, I'll look at the combined earnings and trip count on the same plot:

In [None]:
by_loc = dinnerhour_trips_filtered_trips.groupby('PULocationID')['driver_revenue_rate'].agg(['median', 'count']).sort_values('median', ascending=False)
dinnerhour_trips_locations = by_loc.join(locations).rename(columns={'median':'Median Earnings / Hour'})
dinnerhour_trips_locations['Trips / Hour'] = dinnerhour_trips_locations['count'] / 2 / 5

In [None]:
sns.jointplot(data=dinnerhour_trips_locations, x='Median Earnings / Hour', y='Trips / Hour')

In [None]:
sns.jointplot(data=dinnerhour_trips_locations, x='Median Earnings / Hour', y='Trips / Hour')
plt.yscale('log')

This plot suggests that there is some market elasticity going on – locations with very few tirps result in higher earning rates. Soem locations have few trips and low earning rates, thoes are definitely to be avoided.

There are some outliers, and I can carve out a segment in this space to understand what outliers might be valuable. I'll carve out locations with the following properties:

- More than 900 trips and more than \$70 / hour in earnings.
- More than 100 trips and more than \$78 / hour in earnings.

These locations are carved out visually to capture outliers in this space. I could use a much more quantitative appraoch (e.g. fitting a relationship in log-linear space between these quantities and asking for outliers) but given the data at hand, it probably isn't worth it to automate this process.

Any fewer than 100 trips / hour and I risk not having a rider in the car at all times.

These regions are shown on the joint plot below:

In [None]:
from matplotlib.patches import Rectangle

In [None]:
j = sns.jointplot(data=dinnerhour_trips_locations, x='Median Earnings / Hour', y='Trips / Hour')
j.ax_joint.set_yscale('log')
j.ax_joint.add_artist(Rectangle((70, 900), 70, 100_000, fc='red', alpha=0.2))
j.ax_joint.add_artist(Rectangle((78, 100), 70, 100_000, fc='red', alpha=0.2))

In [None]:
r1 = (dinnerhour_trips_locations['Median Earnings / Hour'] >= 70) & (dinnerhour_trips_locations['Trips / Hour'] >= 900)
r2 = (dinnerhour_trips_locations['Median Earnings / Hour'] >= 78) & (dinnerhour_trips_locations['Trips / Hour'] >= 100)

dinnerhour_trips_locations[r1 | r2][['Borough', 'Zone', 'service_zone', 'Median Earnings / Hour', 'Trips / Hour']]

There is a lot of useful information here. First, JFK Airport shows up – but I would treat airports separately for the reasons described above.

Among the remaining zones, it stands out that specific portions of Queens are quite profitable, but have low trip pickup rates for yellow cabs. Excluding those areas, there seems to be a lot of \$70/hr taxi driving available around Harlem (zones 42, 74, and 41 represent a contiguous block). There is also a profitable area on the upper west side (zones 24, 238 and 151 are contiguous on the upper west side). Both of these represent good geographic areas to start.

Yorkville West (zone 263) might also make a good starting point.

# Summary

This gives me a pretty good starting point – I've identified times and locations which are likely to be profitable as a taxi driver.

I would be sure to record my work over the first few weeks, to iteratively find the most profitable times and areas to drive, specifically by incorporating variations in the hours and locations that I work.

## Coda: Airports

Airports might be a profitable place to work as a taxi driver, but they seem like a prime location for a mismatch between supply and demand – I've often observed long lines of waiting taxis at an airport, and wondered if those long lines (or entire parking lots) are worth the driver's time. That's certainly a problem to be investigated, but needs more data than this data set can provide. If I want to gather the data during my work time on airport profitablity, I would look for valuable times to work at airports:

In [None]:
airport_trips = yellow_trips[yellow_trips['PULocationID'].isin((1, 138,132)) & ~yt_filtered_rates]

earnings_view(airport_trips, dow=True)

We can conclude that trips departing between 4am and 6am are quite valuable (in terms of driver rate per hour), but that there is very low demand at airports in this time (undoubtably due to low arrival rates of passengers). We see that the driver earning rate increases after 7pm, and the trip count remains relatively steady at that time. If I wanted to try airport driving, I imight try it between 8pm and 10pm, two hours earlier than my proposed optimal taxi driving time.