# A better baseline solution

<p>The following tutorial illustrates a simple model for the NYC Taxi Trip Duration competition on <a href = "https://www.kaggle.com/c/nyc-taxi-trip-duration">Kaggle</a>. Our goal here is to create a model with minimal work to act as a *baseline*. This notebook reads the dataset, encodes the necessary columns, and trains a regressor.</p>

## Step 1: Download raw data
<p>The first step is to download the raw data from the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">Kaggle website</a>. Only two files are needed for this tutorial, `train.csv` and `test.csv`. If you have not already, download them and save them into the `data` folder. We again use the `pandas` data analysis library to read the data into a format that is convenient to work with in Python.</p>


In [1]:
import pandas as pd
import numpy as np
import utils

In [2]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

data_train, data_test = utils.read_data(TRAIN_DIR, TEST_DIR)

data_train.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,False,455
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,False,663
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,False,2124
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,False,429
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,False,435
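The `utils.read_data` helper is not shown in this notebook. As a rough sketch of what such a loader might do, the function below (a hypothetical `read_data_sketch`, not the actual helper) parses `pickup_datetime` as a timestamp, drops `dropoff_datetime` from the training set since it is absent from the test set and would reveal the target, and converts the `Y`/`N` flag to a boolean. The toy CSV rows are made up for illustration.

```python
import io

import pandas as pd


def read_data_sketch(train_path, test_path):
    """Hypothetical stand-in for utils.read_data (an assumption, not the real helper)."""
    # Parse the pickup timestamp so that the .dt accessor works later on.
    train = pd.read_csv(train_path, parse_dates=["pickup_datetime"])
    test = pd.read_csv(test_path, parse_dates=["pickup_datetime"])

    # dropoff_datetime only exists in the training file and gives away the duration.
    train = train.drop(columns=["dropoff_datetime"], errors="ignore")

    # Turn the 'Y'/'N' flag into a boolean column.
    for df in (train, test):
        df["store_and_fwd_flag"] = df["store_and_fwd_flag"] == "Y"
    return train, test


# Toy in-memory files standing in for data/train.csv and data/test.csv.
train_csv = io.StringIO(
    "id,pickup_datetime,dropoff_datetime,store_and_fwd_flag,trip_duration\n"
    "id_a,2016-03-14 17:24:55,2016-03-14 17:32:30,N,455\n"
    "id_b,2016-06-12 00:43:35,2016-06-12 00:54:38,Y,663\n"
)
test_csv = io.StringIO(
    "id,pickup_datetime,store_and_fwd_flag\n"
    "id_c,2016-06-30 23:59:58,N\n"
)
demo_train, demo_test = read_data_sketch(train_csv, test_csv)
```

Because `pd.read_csv` accepts both file paths and file-like objects, the same function would work with the `TRAIN_DIR` and `TEST_DIR` paths used above.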


<h2>Step 2: Prepare the Data </h2>

In [3]:
X_train = data_train.copy()
X_test = data_test.copy()

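<p>Next, to use machine learning algorithms, we need to turn the **pickup_datetime** column into numeric features. We split it into its year, month, day, hour, minute, and second components and then drop the original column.</p>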

In [4]:
X_test.loc[:, 'pickup_year'] = X_test['pickup_datetime'].dt.year
X_train.loc[:, 'pickup_year'] = X_train['pickup_datetime'].dt.year

X_test.loc[:, 'pickup_month'] = X_test['pickup_datetime'].dt.month
X_train.loc[:, 'pickup_month'] = X_train['pickup_datetime'].dt.month

X_test.loc[:, 'pickup_day'] = X_test['pickup_datetime'].dt.day
X_train.loc[:, 'pickup_day'] = X_train['pickup_datetime'].dt.day

X_test.loc[:, 'pickup_hour'] = X_test['pickup_datetime'].dt.hour
X_train.loc[:, 'pickup_hour'] = X_train['pickup_datetime'].dt.hour

X_test.loc[:, 'pickup_minute'] = X_test['pickup_datetime'].dt.minute
X_train.loc[:, 'pickup_minute'] = X_train['pickup_datetime'].dt.minute

X_test.loc[:, 'pickup_second'] = X_test['pickup_datetime'].dt.second
X_train.loc[:, 'pickup_second'] = X_train['pickup_datetime'].dt.second

In [5]:
X_test = X_test.drop(['pickup_datetime'], axis=1)
X_train = X_train.drop(['pickup_datetime'], axis=1)

In [6]:
X_train.head(5)

Unnamed: 0,id,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_year,pickup_month,pickup_day,pickup_hour,pickup_minute,pickup_second
0,id2875421,2,1,-73.982155,40.767937,-73.96463,40.765602,False,455,2016,3,14,17,24,55
1,id2377394,1,1,-73.980415,40.738564,-73.999481,40.731152,False,663,2016,6,12,0,43,35
2,id3858529,2,1,-73.979027,40.763939,-74.005333,40.710087,False,2124,2016,1,19,11,35,24
3,id3504673,2,1,-74.01004,40.719971,-74.012268,40.706718,False,429,2016,4,6,19,32,31
4,id2181028,2,1,-73.973053,40.793209,-73.972923,40.78252,False,435,2016,3,26,13,30,55
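The datetime expansion in this step repeats the same assignment for each component and each dataset. A more compact way to express it, sketched here with a hypothetical helper `expand_datetime` (the name and signature are ours, not part of `utils.py`), is to loop over the component names:

```python
import pandas as pd


def expand_datetime(df, column="pickup_datetime", prefix="pickup"):
    """Return a copy of df with `column` split into numeric parts and then dropped."""
    out = df.copy()
    for part in ("year", "month", "day", "hour", "minute", "second"):
        # getattr(series.dt, "year") is the same as series.dt.year, and so on.
        out[f"{prefix}_{part}"] = getattr(out[column].dt, part)
    return out.drop(columns=[column])


# One-row example using the first pickup time from the training data.
demo = pd.DataFrame({"pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55"])})
expanded = expand_datetime(demo)
```

Applying the same function to both `X_train` and `X_test` would keep the two feature matrices aligned by construction.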


<h2>Step 3: Build the Model </h2>

<p>We can make sure the `id` is not used to train the model by setting it as the index for both feature matrices.</p>

In [7]:
X_train = X_train.set_index(['id'])
X_test = X_test.set_index(['id'])

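Trip durations are skewed rather than evenly spread, and the competition is scored on RMSLE, so we train on the `log` of the trip duration plus one. Minimizing RMSE on this transformed target corresponds to minimizing RMSLE on the original durations.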

In [8]:
labels = np.log(X_train['trip_duration'].values + 1)
X_train = X_train.drop(['trip_duration'], axis=1)
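To make the transform concrete, here is a small sketch using the five durations from the table above. It shows that `np.log1p` is equivalent to `np.log(x + 1)`, that `np.expm1` undoes it, and that RMSE on the log scale equals RMSLE on the original scale. The `rmse` and `rmsle` helpers and the example predictions are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


durations = np.array([455.0, 663.0, 2124.0, 429.0, 435.0])  # seconds
log_labels = np.log1p(durations)        # same values as np.log(durations + 1)
recovered = np.expm1(log_labels)        # inverse transform, back to seconds

# Made-up predictions, only to compare the two error measures.
preds = np.array([500.0, 600.0, 1800.0, 450.0, 400.0])
error_on_log_scale = rmse(log_labels, np.log1p(preds))
error_rmsle = rmsle(durations, preds)
```

The two error values agree, which is why the `rmse` values reported during training can be read as RMSLE on the original durations.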

From there, we train an `xgboost` model and track its RMSE on the training data and on a held-out validation set.

In [9]:
model = utils.train_xgb(X_train, labels)

[0]	train-rmse:5.00492	valid-rmse:5.00431
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:0.998573	valid-rmse:0.99962
[20]	train-rmse:0.560765	valid-rmse:0.563015
[30]	train-rmse:0.514289	valid-rmse:0.517435
[40]	train-rmse:0.498138	valid-rmse:0.502091
[50]	train-rmse:0.475875	valid-rmse:0.480689
[60]	train-rmse:0.466928	valid-rmse:0.47247
[70]	train-rmse:0.446379	valid-rmse:0.452796
[80]	train-rmse:0.443136	valid-rmse:0.450146
[90]	train-rmse:0.437758	valid-rmse:0.445328
[100]	train-rmse:0.42893	valid-rmse:0.437137
[110]	train-rmse:0.424125	valid-rmse:0.432772
[120]	train-rmse:0.423322	valid-rmse:0.432275
[130]	train-rmse:0.419133	valid-rmse:0.428544
[140]	train-rmse:0.415563	valid-rmse:0.425435
[150]	train-rmse:0.413068	valid-rmse:0.423458
[160]	train-rmse:0.406811	valid-rmse:0.417733
[170]	train-rmse:0.403302	valid-rmse:0.414718
[180]	train-rmse:0.402445	valid-rmse:0.4142

<h2>Step 4: Make a Submission</h2>
Some utility functions are stored in `utils.py`. That file contains `predict_xgb`, which generates predictions for the test data from our model, and `feature_importances`, which we will use below.

In [10]:
submission = utils.predict_xgb(model, X_test)
submission.head(5)

id,trip_duration
id3004672,735.397034
id3505355,431.585022
id1217141,458.278717
id2150126,1094.633301
id1598245,285.801025
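Since the model was trained on `log(trip_duration + 1)`, its raw predictions are on the log scale and have to be mapped back to seconds before they are submitted. We do not show `predict_xgb` here, but the final step could be sketched as follows, with a hypothetical `build_submission` helper and made-up ids and predictions:

```python
import numpy as np
import pandas as pd


def build_submission(log_predictions, ids):
    """Map log-scale predictions back to seconds and index them by trip id."""
    # expm1 is the inverse of log1p, i.e. exp(x) - 1.
    durations = np.expm1(np.asarray(log_predictions, dtype=float))
    return pd.DataFrame({"trip_duration": durations},
                        index=pd.Index(ids, name="id"))


# Made-up log-scale predictions for two hypothetical trips.
demo_ids = ["id_a", "id_b"]
demo_log_preds = np.log1p([600.0, 1200.0])
demo_submission = build_submission(demo_log_preds, demo_ids)
```

Skipping the inverse transform would produce durations of only a few "seconds" each, so it is worth checking that the submitted values look like plausible trip lengths.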


In [11]:
submission.to_csv('trip_duration_baseline.csv', index=True, index_label='id')

<dl>
<dt>This solution:</dt>
<dd>Received a score of 0.46084 on the Kaggle competition.</dd>
<dd>Placed 578 out of 1083.</dd>
<dd>Beat 46% of competitors on the Kaggle competition.</dd>
<dd>Had a modeling RMSLE of 0.40937.</dd>
</dl>

September 7, 2017.

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [12]:
feature_names = X_train.columns.values
ft_importances = utils.feature_importances(model, feature_names)
ft_importances

Unnamed: 0,feature_name,importance
3,pickup_latitude,9476.0
5,dropoff_latitude,7998.0
4,dropoff_longitude,7753.0
2,pickup_longitude,7560.0
8,pickup_day,3235.0
10,pickup_hour,2995.0
9,pickup_second,2910.0
11,pickup_minute,2726.0
7,pickup_month,1751.0
1,passenger_count,918.0
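The `feature_importances` helper is not shown here. One plausible way to build such a table, assuming the scores come from something like `Booster.get_score(importance_type="weight")` in `xgboost` (which returns a dictionary and omits features that were never used in a split), is sketched below. The helper name `importances_frame` is hypothetical, and the dictionary simply re-enters a few values from the table above.

```python
import pandas as pd


def importances_frame(scores):
    """Turn a {feature_name: score} mapping into a table sorted by importance."""
    table = pd.DataFrame({
        "feature_name": list(scores.keys()),
        "importance": [float(v) for v in scores.values()],
    })
    return table.sort_values("importance", ascending=False)


# A few scores copied from the output above, deliberately out of order.
demo_scores = {
    "passenger_count": 918.0,
    "pickup_latitude": 9476.0,
    "pickup_hour": 2995.0,
    "dropoff_latitude": 7998.0,
}
demo_table = importances_frame(demo_scores)
```

Under this assumption, a column such as `pickup_year`, which is constant at 2016 in the rows shown, would never be chosen for a split and so would not appear in the table at all.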
