## Trip Lab notebook
Predict when the driver will arrive at the pick up location before the driver has started driving there. 
What factors are most predictive of lateness?

#### Framing the business case
On-time arrival is an important metric for the business to optimize. 25% of negative customer comments left after a trip mention the word "late", and another 19% mention time-related issues. "Driver" and "Late" are the top 2 most frequently used words in negative comments. The business would like to reduce lateness to improve customer satisfaction, improve retention, and reduce support costs.

There are automated alerts in place when the driver is estimated to be late (based on current time, time of pickup, location of driver, location of pick up, predicted route and predicted traffic). By the time the driver is estimated to be late there may be too little time to correct the situation. The company incurs support costs working with the parties involved. It is not a good customer or driver experience. 

To reduce no-shows, the driver must check in 60 to 90 minutes before the scheduled start of the ride (otherwise the driver is replaced on that ride). The check-in period is an ideal time to alert the driver about the likelihood of arriving late to the pick up location. The driver is already interacting with the driver app at that time, so their attention is available without separate messaging. The driver could take actions to reduce the chance of lateness, such as starting the commute on time. (Open question: is there already an alert when the driver has not gone on-the-way on time?)

Accurate prediction is important. Reducing false alarms would tend to improve trust in this prediction and show respect for the driver's time. Driver incentives could be added to rides with a high likelihood of being late to encourage on-time behavior.

#### Data considerations to address this use case
We will ignore trips in the training and test sets when we know the driver never arrived (there is no arrival time in the data). The on-time property of these rides cannot be directly assessed. These could be unclaimed rides, or rides where the driver or the organizer canceled before the driver arrived.

We will predict on-time arrival on any trip which has not been canceled when the prediction is requested. Effectively, we assume the ride will not be canceled between the prediction time and the scheduled pick up. Predicting cancellation is a separate data science problem beyond the scope of this project. About 5% of remaining trips are canceled within 90 minutes of the scheduled start, so this assumption should have only a small effect.

The prediction model must only use data which is available at the prediction time. The following data items in this data set may not have final values at prediction time: trip_state, canceled_by, canceled_at_pdt, canceled_before_scheduled_start, departs_for_trip_at_pdt, origin_arrived_at_pdt, origin_departed_at_pdt, destination_arrived_at_pdt, rating, trip_paid_time, driver_fare_multiplier, events, has_cd_unclaimed_event, is_unfilled_canceled, is_cd_cancel, canceled_by_id, commute_distance, predicted_driver_fare_with_multiplier.


#### Labeling the data
For supervised learning the training and testing data must be labeled. The labels are the target output of the prediction model. Our label will be the duration in seconds between when the driver arrives and the scheduled start. We can use the canceled rides which were canceled after the driver arrived. This adds about 2% in volume to the complete rides.

The prediction target (label) arrived_seconds_after_scheduled_start will be <0 if arrived early, 0 if arrived exactly on time, >0 if arrived late. For trips where the driver never arrived, which ultimately get canceled, make the prediction target NaN. 

```
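import pandas as pd

def arrived_seconds_after_scheduled_start(origin_arrived_at_pdt, scheduled_starts_at_pdt):
    # NaN when the driver never arrived (arrival timestamp is NaT)
    if pd.isna(origin_arrived_at_pdt):
        return float("nan")
    return (origin_arrived_at_pdt - scheduled_starts_at_pdt).total_seconds()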
```

In addition to arrived_seconds_after_scheduled_start we will classify each trip to help interpret the results of the prediction in a business-friendly way. The classes are 
0. arrived on-time
1. almost on-time (arrived up to 5 minutes after scheduled start)
2. late (arrived 5 to 15 minutes after scheduled start)
3. extremely late (arrived 15 minutes or more after scheduled start).
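
The four classes above can be derived from arrived_seconds_after_scheduled_start. The following is a minimal sketch, not the notebook's actual implementation: the function name `classify_lateness` is hypothetical, and the handling of the exact boundaries (0 s counts as on-time, 300 s as almost on-time, 900 s as extremely late) is an assumption read from the class descriptions.

```python
import numpy as np
import pandas as pd

def classify_lateness(seconds_after_start):
    """Map seconds after scheduled start to class codes 0..3; NaN stays NaN."""
    s = pd.Series(seconds_after_start, dtype="float64")
    # Assumed boundaries: <= 0 s on-time, (0, 300] almost on-time,
    # (300, 900) late, >= 900 s extremely late.
    conditions = [s <= 0, s <= 300, s < 900, s >= 900]
    # np.select takes the first matching condition; NaN matches none of them
    # and therefore falls through to the default.
    codes = np.select(conditions, [0.0, 1.0, 2.0, 3.0], default=np.nan)
    return pd.Series(codes, index=s.index, name="lateness_class")
```

Trips where the driver never arrived keep a NaN class, which matches how the NaN prediction target is treated.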

LATER:
 * check-in time might be very informative. Could build a different model which uses check-in time when available.
 * could use prediction time itself compared to predicted on-the-way time, if we have that.
 * create transformed features such as time of day, day of week, distance from driver home to origin, how long before the scheduled start the trip was claimed, duration from the end of the driver's previously claimed trip to this trip, whether the driver's previous ride or recent rides were late, whether the trip was unclaimed near the scheduled time, claimed-scheduled.
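
A few of the transformed features listed above could be sketched as follows. This is an illustration under assumptions, not code from this notebook: the helper names and the new column names (`start_hour`, `start_day_of_week`, `claimed_hours_before_start`, `driver_home_to_origin_miles`) are hypothetical, and the haversine great-circle distance is only a stand-in for real driving distance.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3958.8 * 2 * np.arcsin(np.sqrt(a))  # 3958.8 = mean Earth radius in miles

def add_time_and_distance_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of trips with a few derived columns (hypothetical names)."""
    out = trips.copy()
    start = out["scheduled_starts_at_pdt"]
    out["start_hour"] = start.dt.hour + start.dt.minute / 60.0
    out["start_day_of_week"] = start.dt.dayofweek  # Monday = 0
    out["claimed_hours_before_start"] = (
        (start - out["claimed_at_pdt"]).dt.total_seconds() / 3600.0
    )
    out["driver_home_to_origin_miles"] = haversine_miles(
        out["driver_home_lat"], out["driver_home_lon"],
        out["origin_lat"], out["origin_lon"],
    )
    return out
```

All inputs used here are known before the driver departs, so none of these features leak information from the future.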

TRIP LIFECYCLE STATES
- invalid - trips remain invalid until they have enough information to schedule, route, and bill
- awaiting_carpool_invitation (private carpools only), carpools need invitations resolved to be valid
- unclaimed - valid trips which are booked but not claimed
- visible to drivers to claim (not an actual trip state). Policies embargo trips until 7 days before the scheduled start of the trip.
- claimed - driver commits to giving ride. aka filled.
- checked in - driver confirms intent to give ride 60 to 90 minutes before scheduled start; within 60 minutes, manual effort may be involved to fill the ride and get the driver checked in
- on the way - driver departs for the trip
- arrives at origin location (aka pick up point)
- waiting for passenger(s)
- departs origin location with passenger(s)
- arrives at destination
- complete after passenger(s) are safe at last destination
- canceled - the ride can be canceled anytime after first unclaimed and before complete by the organizer, safe ride specialist team, or automatically.

KINDS OF TRIPS (not mutually exclusive)
- multiple pickups or dropoffs 
- multiple organizers (private carpools or shuttle trips)
- multiple passengers
- shuttle
- affiliate organizer
- bonused
- recurring
- pickup or dropoff anchor



In [1]:
import datetime
a = datetime.datetime.now()
print(a)

2018-09-03 17:15:19.829058


In [2]:
b = datetime.datetime.now()
print(b)
c = b - a
print(c.total_seconds())
d = a - b
print(d.total_seconds())

2018-09-03 17:15:19.836220
0.007162
-0.007162


In [3]:
import pandas as pd
import numpy as np

In [4]:
dateColNames=['canceled_at_pdt', 'claimed_at_pdt', 'departs_for_trip_at_pdt',
       'created_at_pdt', 'scheduled_starts_at_pdt', 'scheduled_ends_at_pdt',
       'origin_arrived_at_pdt', 'origin_departed_at_pdt',
       'destination_arrived_at_pdt', 'lead_organizer_created_at_pdt',
       'driver_created_at_pdt']

In [5]:
df=pd.read_csv('/Users/bob 2/Projects/TripsData2017.txt', sep='\t', parse_dates=dateColNames)

In [6]:
df.shape

(217348, 74)

In [7]:
list(df)

['id',
 'trip_state',
 'canceled_by',
 'canceled_at_pdt',
 'canceled_before_scheduled_start',
 'claimed_at_pdt',
 'driver_id',
 'lead_organizer_id',
 'creator_id',
 'carpool',
 'is_repeating_ride',
 'shuttle',
 'trip_template_id',
 'departs_for_trip_at_pdt',
 'time_anchor',
 'created_at_pdt',
 'scheduled_starts_at_pdt',
 'scheduled_ends_at_pdt',
 'origin_arrived_at_pdt',
 'origin_departed_at_pdt',
 'destination_arrived_at_pdt',
 'origin_location_id',
 'destination_location_id',
 'origin_region_id',
 'origin_analysis_metro_name',
 'destination_region_id',
 'destination_region_name',
 'origin_region_name',
 'origin_metro_area_name',
 'destination_metro_area_name',
 'destination_analysis_metro_name',
 'origin_lat',
 'origin_lon',
 'destination_lat',
 'destination_lon',
 'passengers_ids',
 'organizers_count',
 'passengers_count',
 'driver_home_lat',
 'driver_home_lon',
 'organizer_home_lat',
 'organizer_home_lon',
 'rating',
 'canceled_by_id',
 'route_legs_count',
 'start_waypoints_zipcode

In [8]:
df.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217348 entries, 0 to 217347
Columns: 74 entries, id to driver_has_driven_this_route_before
dtypes: bool(8), datetime64[ns](11), float64(29), int64(5), object(21)
memory usage: 111.1+ MB


In [9]:
df.info

<bound method DataFrame.info of             id trip_state canceled_by     canceled_at_pdt  \
0       367322   canceled    Employee 2017-03-08 13:18:55   
1       492666   canceled    Employee 2017-09-11 09:09:43   
2       527985   canceled    Employee 2017-09-27 14:02:52   
3       415557   canceled    Employee 2017-05-22 13:47:32   
4       317210   canceled    Employee 2017-01-18 10:14:22   
5       338150   complete         NaN                 NaT   
6       310958   complete         NaN                 NaT   
7       338135   complete         NaN                 NaT   
8       324615   complete         NaN                 NaT   
9       247127   complete         NaN                 NaT   
10      306833   complete         NaN                 NaT   
11      436184   complete         NaN                 NaT   
12      473473   complete         NaN                 NaT   
13      354050   complete         NaN                 NaT   
14      415507   complete         NaN                

#### columns_from_the_future must not be used in predictions but some are needed to label the data

In [10]:
columns_from_the_future = ['trip_state', 'canceled_by', 'canceled_at_pdt', 'canceled_before_scheduled_start', 'departs_for_trip_at_pdt', 'origin_arrived_at_pdt', 'origin_departed_at_pdt', 'destination_arrived_at_pdt', 'rating', 'trip_paid_time', 'driver_fare_multiplier', 'events', 'has_cd_unclaimed_event', 'is_unfilled_canceled', 'is_cd_cancel', 'canceled_by_id', 'commute_distance', 'predicted_driver_fare_with_multiplier']
print(columns_from_the_future)
print(len(columns_from_the_future))

['trip_state', 'canceled_by', 'canceled_at_pdt', 'canceled_before_scheduled_start', 'departs_for_trip_at_pdt', 'origin_arrived_at_pdt', 'origin_departed_at_pdt', 'destination_arrived_at_pdt', 'rating', 'trip_paid_time', 'driver_fare_multiplier', 'events', 'has_cd_unclaimed_event', 'is_unfilled_canceled', 'is_cd_cancel', 'canceled_by_id', 'commute_distance', 'predicted_driver_fare_with_multiplier']
18


In [11]:
list(df.select_dtypes(['object']).columns)

['trip_state',
 'canceled_by',
 'time_anchor',
 'origin_analysis_metro_name',
 'destination_region_name',
 'origin_region_name',
 'origin_metro_area_name',
 'destination_metro_area_name',
 'destination_analysis_metro_name',
 'passengers_ids',
 'rating',
 'start_waypoints_zipcodes',
 'end_waypoints_zipcodes',
 'lead_organizer_platform',
 'lead_organizer_app_version',
 'driver_platform',
 'driver_app_version',
 'driver_gender',
 'coupon',
 'events',
 'has_cd_unclaimed_event']

In [12]:
df.dtypes

id                                                  int64
trip_state                                         object
canceled_by                                        object
canceled_at_pdt                            datetime64[ns]
canceled_before_scheduled_start                   float64
claimed_at_pdt                             datetime64[ns]
driver_id                                         float64
lead_organizer_id                                   int64
creator_id                                          int64
carpool                                              bool
is_repeating_ride                                    bool
shuttle                                              bool
trip_template_id                                  float64
departs_for_trip_at_pdt                    datetime64[ns]
time_anchor                                        object
created_at_pdt                             datetime64[ns]
scheduled_starts_at_pdt                    datetime64[ns]
scheduled_ends

In [13]:
df.head(1)

Unnamed: 0,id,trip_state,canceled_by,canceled_at_pdt,canceled_before_scheduled_start,claimed_at_pdt,driver_id,lead_organizer_id,creator_id,carpool,...,lead_organizer_previous_completed_trips,commute_distance,claimed_after_created,is_unfilled_canceled,is_cd_cancel,is_same_day_ride,trip_predicted_raw_fare,predicted_driver_fare,predicted_driver_fare_with_multiplier,driver_has_driven_this_route_before
0,367322,canceled,Employee,2017-03-08 13:18:55,18.68,2017-03-08 13:16:39,62908.0,87303,87303,False,...,0,,0.72,False,True,False,,,,False


#### Pick features to use as input to the prediction. For now, use only numeric columns, excluding those from the future.
Todo: add booleans, date features, and object features as categorical dummies.
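
One way the todo above might be approached is sketched below. The function name `encode_bool_and_categorical` is hypothetical, and the example values for `time_anchor` used in testing are assumptions; only the column names come from the data set.

```python
import pandas as pd

def encode_bool_and_categorical(trips, bool_cols, categorical_cols):
    """Turn boolean columns into 0/1 and object columns into dummy columns."""
    bools = trips[bool_cols].astype(int)
    # One indicator column per category value, named <column>_<value>
    dummies = pd.get_dummies(trips[categorical_cols], dtype=int)
    return pd.concat([bools, dummies], axis=1)
```

For example, `encode_bool_and_categorical(df, ['carpool', 'shuttle'], ['time_anchor'])` would produce columns that can be joined to the numeric features. High-cardinality object columns such as `passengers_ids` are probably poor candidates for dummies.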

In [14]:
feature_columns_to_use=list(set(df.select_dtypes(include=['number']).columns) - set(columns_from_the_future))
print(feature_columns_to_use)

['trip_template_id', 'origin_location_id', 'organizer_home_lon', 'destination_region_id', 'id', 'route_legs_count', 'organizers_count', 'passengers_count', 'driver_home_lon', 'lead_organizer_previous_completed_trips', 'driver_previous_completed_trips', 'lead_organizer_id', 'origin_lon', 'trip_predicted_raw_fare', 'creator_id', 'predicted_driver_fare', 'driver_home_lat', 'claimed_after_created', 'origin_lat', 'organizer_home_lat', 'origin_region_id', 'destination_lat', 'destination_location_id', 'total_predicted_distance_miles', 'destination_lon', 'driver_id', 'total_predicted_duration', 'coupon_consumed']


#### Compute the arrived_late label. The threshold should probably be 5 minutes late: with a 0-second threshold, 9% of trips count as "late"!

In [15]:
arrival_time_df=df[['id','scheduled_starts_at_pdt','origin_arrived_at_pdt']]
arrival_time_df.shape
arrival_time_df=arrival_time_df.assign(
    arrived_seconds_after_scheduled_start=lambda x:
        (x['origin_arrived_at_pdt'] - x['scheduled_starts_at_pdt']).dt.total_seconds()
)

# show number of nulls
arrival_time_df.isnull().sum()



id                                           0
scheduled_starts_at_pdt                      0
origin_arrived_at_pdt                    91673
arrived_seconds_after_scheduled_start    91673
dtype: int64

In [16]:
arrival_time_df.dropna().shape


(125675, 4)

In [17]:
arrival_time_df.dropna().hist(column='arrived_seconds_after_scheduled_start',bins=np.linspace(-1800,1800,60),grid=False)


array([[<matplotlib.axes._subplots.AxesSubplot object at 0x111990f98>]],
      dtype=object)

In [18]:
# arrives 8.5 minutes early on average
arrival_time_df[['arrived_seconds_after_scheduled_start']].dropna().mean()/60

arrived_seconds_after_scheduled_start   -8.456539
dtype: float64

In [19]:
# the most common arrival time (the mode) is about 4.5 minutes early
arrival_time_df[['arrived_seconds_after_scheduled_start']].dropna().mode()/60

Unnamed: 0,arrived_seconds_after_scheduled_start
0,-4.45


In [20]:
# show rows and label to verify logic
# arrived_late_df[arrived_late_df['label']].head # version where label is TRUE
arrival_time_df.head

<bound method NDFrame.head of             id scheduled_starts_at_pdt origin_arrived_at_pdt  \
0       367322     2017-03-09 08:00:00                   NaT   
1       492666     2017-09-11 10:00:00                   NaT   
2       527985     2017-09-27 15:07:00                   NaT   
3       415557     2017-05-22 21:16:00                   NaT   
4       317210     2017-01-22 10:00:00                   NaT   
5       338150     2017-02-14 16:00:00   2017-02-14 15:51:43   
6       310958     2017-01-24 16:20:00   2017-01-24 16:20:05   
7       338135     2017-02-08 16:00:00   2017-02-08 15:54:47   
8       324615     2017-01-26 16:30:00   2017-01-26 16:25:04   
9       247127     2017-01-23 14:45:00   2017-01-23 14:42:14   
10      306833     2017-01-11 14:25:00   2017-01-11 14:20:23   
11      436184     2017-06-26 13:30:00   2017-06-26 13:15:40   
12      473473     2017-08-21 14:30:00   2017-08-21 14:10:25   
13      354050     2017-02-27 15:00:00   2017-02-27 14:48:47   
14      41

In [21]:
# features look ok, contain some NaNs
df[feature_columns_to_use].head

<bound method NDFrame.head of         trip_template_id  origin_location_id  organizer_home_lon  \
0                    NaN            119715.0             -118.27   
1                    NaN             52933.0             -118.34   
2                    NaN             81534.0             -118.34   
3                    NaN            118697.0             -118.30   
4                    NaN             97313.0             -118.27   
5                    NaN             82879.0             -118.34   
6                 6325.0             83958.0             -118.34   
7                    NaN             84580.0             -118.11   
8                    NaN             87880.0             -118.22   
9                 3719.0             92149.0             -118.38   
10                   NaN            100484.0             -118.21   
11                   NaN             83332.0             -117.90   
12                   NaN             85043.0             -117.86   
13                

In [22]:
# hmmm too many NaNs
df[feature_columns_to_use].dropna().shape

(64, 28)

In [23]:
# replace NaNs with 0s, although another solution for missing latlons is probably better
df[feature_columns_to_use].fillna(0).shape

(217348, 28)
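
Filling with 0 puts missing latitudes and longitudes at a location far from any real trip. One possible alternative, sketched here as an assumption rather than as what this notebook does, is to fill each column with its median and keep an indicator of which values were missing. The function name and the `_was_missing` suffix are hypothetical.

```python
import pandas as pd

def impute_with_median(features: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs with the column median and add <col>_was_missing flags."""
    out = features.copy()
    for col in features.columns:
        # Record missingness first so the model can still use that signal
        out[col + "_was_missing"] = features[col].isna().astype(int)
        out[col] = features[col].fillna(features[col].median())
    return out
```

To avoid leaking test data into training, the medians would need to be computed on the training rows only.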

In [24]:
# eliminate rows where on-time arrival cannot be computed (canceled before arrival, possibly filled).
df.loc[df['origin_arrived_at_pdt'].notnull()].shape

(125675, 74)

In [25]:
# how many observations of each final state do we have left where on-time arrival can be labeled
df.loc[df['origin_arrived_at_pdt'].notnull()].groupby(['trip_state']).size()

trip_state
canceled      1529
complete    124146
dtype: int64

In [26]:
# how many are filled? all should be, but 1 weird trip arrived with no driver!
# probably a chaotic race condition as the scheduled start time approached: cancel by organizer, driver removed, and driver arriving
# its missing driver_id will be filled with 0, which is ok.
df.loc[df['origin_arrived_at_pdt'].notnull()].groupby(['trip_state',df.driver_id.notnull()]).size()

trip_state  driver_id
canceled    False             1
            True           1528
complete    True         124146
dtype: int64

In [27]:
# materialize the useful dataset for training and testing
df_X=df.loc[df['origin_arrived_at_pdt'].notnull()][feature_columns_to_use].fillna(0)
df_X.shape

(125675, 28)

In [28]:
# materialize the prediction target (arrived_seconds_after_scheduled_start)
y=(df.loc[df['origin_arrived_at_pdt'].notnull()].origin_arrived_at_pdt 
   - df.loc[df['origin_arrived_at_pdt'].notnull()].scheduled_starts_at_pdt).dt.total_seconds()

# describe and convert from scientific notation
y.describe().apply(lambda x: '%.0f' % x)

count      125675
mean         -507
std         22763
min      -7862491
25%          -657
50%          -324
75%           -96
max        204537
dtype: object

Hmmm, the std is huge (22,763 seconds), driven by extreme values such as the minimum of -7,862,491 seconds. Todo: saturate outlier y values.
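
Saturating the target could look like the sketch below. The function name `saturate_target` is hypothetical, and the bounds of one hour early and one hour late are assumptions chosen only for illustration.

```python
import pandas as pd

def saturate_target(y: pd.Series, lower_s: float = -3600.0,
                    upper_s: float = 3600.0) -> pd.Series:
    """Clip arrival offsets (seconds) to [lower_s, upper_s]; bounds are assumed."""
    # Values beyond the bounds are replaced by the bound itself, not dropped
    return y.clip(lower=lower_s, upper=upper_s)
```

Clipping keeps every trip in the data while limiting how much a handful of extreme arrivals can dominate a squared-error fit.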

In [29]:
# select a linear model
import sklearn.linear_model
model=sklearn.linear_model.LinearRegression()

In [30]:
# fit model
model=model.fit(df_X,y)

In [31]:
model.coef_

array([-1.62476201e-02,  2.70595633e-03, -1.81823726e+02,  4.50137898e+01,
       -3.25499236e-03, -3.22183801e+02,  1.01516860e+02,  2.48505505e+01,
       -8.58522900e+03,  3.59288569e-02,  1.10736255e+00,  4.80168450e-03,
        5.25099032e+03, -5.01205250e+00, -1.05420992e-02, -3.34419741e+01,
        4.33147288e+04, -1.67092237e-03, -2.45797966e+04, -6.14462926e+02,
        4.32138965e+01, -1.81633913e+04,  6.22559707e-04,  4.73574151e+01,
        2.69712025e+03,  3.79677890e-03,  2.63405249e+01, -4.39468979e+01])

In [32]:
model.intercept_

-98297.10086836905

In [33]:
# score against full training set
model.score(df_X,y)

0.3241908396604172

In [34]:
model_y_pred = model.predict(df_X)
model_y_pred.size

125675

In [35]:
type(model_y_pred)

numpy.ndarray

In [36]:
from sklearn.metrics import explained_variance_score
explained_variance_score(y, model_y_pred)

0.3241908396604172

In [37]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y, model_y_pred)

350179853.5337528

In [38]:
from sklearn.metrics import r2_score
r2_score(y, model_y_pred)

0.3241908396604172

regression score == r2_score, so just use model.score. 0.32 seems high for the data I gave it. How can I see what it did?
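
One way to see what the linear model did is to pair each coefficient with its feature name. This is a minimal sketch; the helper name `coefficients_by_feature` is hypothetical.

```python
import numpy as np
import pandas as pd

def coefficients_by_feature(coef, feature_names) -> pd.Series:
    """Label coefficients with feature names, largest magnitude first."""
    s = pd.Series(np.asarray(coef, dtype=float), index=list(feature_names),
                  name="coefficient")
    # Order by absolute value so strong negative effects are not buried
    return s.reindex(s.abs().sort_values(ascending=False).index)
```

Here it would be called as `coefficients_by_feature(model.coef_, df_X.columns)`. Because the features are not standardized, the raw magnitudes are not directly comparable across features: a large coefficient on a latitude column partly reflects that latitude varies over a small numeric range.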

In [39]:
# score a model using constant (the mean) as the prediction to set a baseline
k = pd.Series([-8.456539*60]*y.size)
print(k.head(10))
print(k.size)

0   -507.39234
1   -507.39234
2   -507.39234
3   -507.39234
4   -507.39234
5   -507.39234
6   -507.39234
7   -507.39234
8   -507.39234
9   -507.39234
dtype: float64
125675


In [40]:
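# caution: this scores the model's predictions against the constant k as if k were the truth,
# so it is not a baseline score; the baseline R^2 is r2_score(y, k)
model.score(df_X,k)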

-2.079540733227249e+33

In [41]:
explained_variance_score(y, k)

0.0

In [42]:
r2_score(y, k)

1.1102230246251565e-16
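
The near-zero R-squared above is expected: predicting the mean for every trip makes the residual sum of squares equal to the total sum of squares. A more informative baseline number is the error in seconds. The sketch below computes both from the definitions; the function names are hypothetical.

```python
import numpy as np

def r2(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mean_baseline(y_true):
    """Return (R^2, RMSE) for a model that always predicts the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    pred = np.full_like(y_true, y_true.mean())
    rmse = float(np.sqrt(np.mean((y_true - pred) ** 2)))
    return r2(y_true, pred), rmse
```

The RMSE of this baseline equals the population standard deviation of y, which gives a target in seconds for any real model to beat.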

##### hmmm.
Note: R-squared on the training data never decreases as you add more features to a linear model, even if they are unrelated to the response, so selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
A train/test split or cross-validation gives a more reliable estimate of out-of-sample error and is better for choosing which of your models will best generalize to out-of-sample data (source: https://www.ritchieng.com/machine-learning-evaluate-linear-regression-model/).
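
A held-out evaluation could be sketched as follows. The function name `holdout_and_cv_r2`, the 25% test size, and the 5 folds are assumptions, and the demo data is synthetic rather than the trips data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

def holdout_and_cv_r2(X, y, seed=0):
    """Return (R^2 on a held-out 25% split, array of 5-fold CV R^2 scores)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    holdout_r2 = model.score(X_test, y_test)  # scored on rows never used in fitting
    cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
    return holdout_r2, cv_r2

# Synthetic demo data (not from the trips data set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
holdout_r2, cv_r2 = holdout_and_cv_r2(X_demo, y_demo)
```

With the notebook's data the call would be `holdout_and_cv_r2(df_X, y)`. For trips, splitting by date rather than at random may give a more honest estimate, since the model would be used on future rides.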

used: https://stackoverflow.com/questions/41900387/mapping-column-names-to-random-forest-feature-importances

Later: ? fancier models
Hypotheses: new drivers, specific locations/routes/pickups, last minute unclaims/claims
Remove 0-weighted features?
Add the log of features?
Try some fancier models like random forests or whatever?
Deal with imbalanced classes (on-time arrivals are 91% of observations)?
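
As a starting point for the random forest idea, the sketch below fits a forest and maps its feature importances back to column names, in the spirit of the Stack Overflow link above. The function name, the 50 trees, and the synthetic demo data are assumptions, not results from the trips data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def forest_feature_importances(X: pd.DataFrame, y, seed=0) -> pd.Series:
    """Fit a random forest and return importances labeled by column name."""
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)

# Synthetic demo: only the "signal" column actually drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)),
                      columns=["signal", "noise_a", "noise_b"])
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)
importances = forest_feature_importances(X_demo, y_demo)
```

On the notebook's data the call would be `forest_feature_importances(df_X, y)`. Identifier columns such as `id` or `driver_id` may rank highly without being meaningful predictors, so they are worth reviewing before trusting the ranking.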
