# A better baseline solution

<p>The following tutorial illustrates a simple model for the NYC Taxi Trip Duration competition on <a href = "https://www.kaggle.com/c/nyc-taxi-trip-duration">Kaggle</a>. Our goal here is to create a model with minimal work to act as a *baseline*. This notebook reads the dataset, encodes the necessary columns, and trains a regressor.</p>

## Step 1: Download raw data
<p>The first step is to download the raw data from the <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">Kaggle website</a>. For the purposes of this tutorial only two files are necessary, `train.csv` and `test.csv`; if you have not already, download them and save them into the `data` folder. We once again use the `pandas` data analysis library to read the data into a format that is usable from Python.</p>


In [1]:
import pandas as pd
import numpy as np
import taxi_utils

In [2]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

data_train, data_test = taxi_utils.read_data(TRAIN_DIR, TEST_DIR)

data_train.head(5)

Unnamed: 0,id,vendor_id,pickup_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
0,id2875421,2,2016-03-14 17:24:55,1,-73.982155,40.767937,-73.96463,40.765602,False,455
1,id2377394,1,2016-06-12 00:43:35,1,-73.980415,40.738564,-73.999481,40.731152,False,663
2,id3858529,2,2016-01-19 11:35:24,1,-73.979027,40.763939,-74.005333,40.710087,False,2124
3,id3504673,2,2016-04-06 19:32:31,1,-74.01004,40.719971,-74.012268,40.706718,False,429
4,id2181028,2,2016-03-26 13:30:55,1,-73.973053,40.793209,-73.972923,40.78252,False,435
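The body of `taxi_utils.read_data` is not shown in this notebook. As a rough sketch of what such a loader might do, the hypothetical `read_trips` function below parses `pickup_datetime` as a timestamp and turns `store_and_fwd_flag` into a boolean. The function name, the inline sample rows, and the assumption that the raw flag is stored as `"Y"`/`"N"` are illustrative and not taken from `taxi_utils`.

```python
import io

import pandas as pd

# Two illustrative rows with a subset of the columns (not the real file).
SAMPLE_CSV = io.StringIO(
    "id,vendor_id,pickup_datetime,passenger_count,store_and_fwd_flag,trip_duration\n"
    "id2875421,2,2016-03-14 17:24:55,1,N,455\n"
    "id2377394,1,2016-06-12 00:43:35,1,N,663\n"
)


def read_trips(path_or_buffer):
    """Hypothetical loader: parse timestamps and make the flag boolean."""
    df = pd.read_csv(path_or_buffer, parse_dates=["pickup_datetime"])
    # Assumption: the raw flag is "Y" or "N"; map it to True/False.
    df["store_and_fwd_flag"] = df["store_and_fwd_flag"] == "Y"
    return df


trips = read_trips(SAMPLE_CSV)
print(trips.dtypes)
```

Parsing the timestamp at load time is what makes the `.dt` accessor available in the next step.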


<h2>Step 2: Prepare the Data </h2>

In [3]:
X_train = data_train.copy()
X_test = data_test.copy()

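<p>Next, because most machine learning algorithms expect numeric inputs, we split the **pickup_datetime** column into separate numeric columns (year, month, day, hour, minute, and second) and then drop the original column.</p>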

In [4]:
X_test.loc[:, 'pickup_year'] = X_test['pickup_datetime'].dt.year
X_train.loc[:, 'pickup_year'] = X_train['pickup_datetime'].dt.year

X_test.loc[:, 'pickup_month'] = X_test['pickup_datetime'].dt.month
X_train.loc[:, 'pickup_month'] = X_train['pickup_datetime'].dt.month

X_test.loc[:, 'pickup_day'] = X_test['pickup_datetime'].dt.day
X_train.loc[:, 'pickup_day'] = X_train['pickup_datetime'].dt.day

X_test.loc[:, 'pickup_hour'] = X_test['pickup_datetime'].dt.hour
X_train.loc[:, 'pickup_hour'] = X_train['pickup_datetime'].dt.hour

X_test.loc[:, 'pickup_minute'] = X_test['pickup_datetime'].dt.minute
X_train.loc[:, 'pickup_minute'] = X_train['pickup_datetime'].dt.minute

X_test.loc[:, 'pickup_second'] = X_test['pickup_datetime'].dt.second
X_train.loc[:, 'pickup_second'] = X_train['pickup_datetime'].dt.second

In [5]:
X_test = X_test.drop(['pickup_datetime'], axis=1)
X_train = X_train.drop(['pickup_datetime'], axis=1)

In [6]:
X_train.head(5)

Unnamed: 0,id,vendor_id,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration,pickup_year,pickup_month,pickup_day,pickup_hour,pickup_minute,pickup_second
0,id2875421,2,1,-73.982155,40.767937,-73.96463,40.765602,False,455,2016,3,14,17,24,55
1,id2377394,1,1,-73.980415,40.738564,-73.999481,40.731152,False,663,2016,6,12,0,43,35
2,id3858529,2,1,-73.979027,40.763939,-74.005333,40.710087,False,2124,2016,1,19,11,35,24
3,id3504673,2,1,-74.01004,40.719971,-74.012268,40.706718,False,429,2016,4,6,19,32,31
4,id2181028,2,1,-73.973053,40.793209,-73.972923,40.78252,False,435,2016,3,26,13,30,55
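The same decomposition can be written more compactly with a loop over the component names. The sketch below runs on a small toy frame (the name `toy` and its two rows are illustrative) and relies only on the standard pandas `.dt` accessor.

```python
import pandas as pd

# Toy frame holding two of the pickup timestamps shown above.
toy = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2016-03-14 17:24:55", "2016-06-12 00:43:35"]
        )
    }
)

# Add one numeric column per datetime component, in the same order as above.
for part in ["year", "month", "day", "hour", "minute", "second"]:
    toy["pickup_" + part] = getattr(toy["pickup_datetime"].dt, part)

# Drop the original timestamp so that only numeric columns remain.
toy = toy.drop(columns=["pickup_datetime"])
print(toy)
```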


<h2>Step 3: Build the Model </h2>

<p>We can make sure the `id` is not used to train the model by setting it as the index for both feature matrices.</p>

In [7]:
X_train = X_train.set_index(['id'])
X_test = X_test.set_index(['id'])

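Trip durations are heavily right-skewed: most trips are short, with a long tail of very long ones. Taking the `log` of the trip duration (after adding 1) compresses that tail and improves the results of the regression. It also matches the competition metric, root mean squared logarithmic error (RMSLE).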

In [8]:
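# log1p(x) computes log(x + 1); predictions must later be mapped back to seconds.
labels = np.log1p(X_train['trip_duration'].values)
# Remove the target from the feature matrix so the model cannot train on it.
X_train = X_train.drop(['trip_duration'], axis=1)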
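To make the transformation concrete, the short sketch below uses three of the durations shown earlier and a set of made-up predictions. It shows that `np.expm1` inverts the transform and that the ordinary RMSE computed on the log-scale labels equals the RMSLE on the original scale. The helper name `rmsle` and the prediction values are illustrative.

```python
import numpy as np

durations = np.array([455.0, 663.0, 2124.0])   # seconds, from the rows above
labels = np.log1p(durations)                   # same as np.log(durations + 1)
recovered = np.expm1(labels)                   # inverse transform, back to seconds


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


predictions = np.array([500.0, 600.0, 2000.0])  # made-up predictions in seconds
rmse_on_logs = np.sqrt(np.mean((np.log1p(predictions) - labels) ** 2))

print(rmsle(durations, predictions), rmse_on_logs)
```

Because the two quantities coincide, a model that minimizes RMSE on the transformed labels is optimizing the competition metric directly.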

From there, we train a simple `xgboost` model and monitor how well it fits both the training data and a held-out validation set.

In [9]:
model = taxi_utils.train_xgb(X_train, labels)

[0]	train-rmse:5.00222	valid-rmse:5.00164
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:1.03448	valid-rmse:1.03527
[20]	train-rmse:0.588907	valid-rmse:0.591222
[30]	train-rmse:0.521448	valid-rmse:0.524679
[40]	train-rmse:0.49152	valid-rmse:0.49566
[50]	train-rmse:0.475846	valid-rmse:0.480838
[60]	train-rmse:0.47131	valid-rmse:0.47699
[70]	train-rmse:0.457938	valid-rmse:0.464323
[80]	train-rmse:0.452809	valid-rmse:0.459851
[90]	train-rmse:0.448331	valid-rmse:0.455781
[100]	train-rmse:0.442711	valid-rmse:0.450655
[110]	train-rmse:0.434419	valid-rmse:0.442925
[120]	train-rmse:0.430539	valid-rmse:0.439403
[130]	train-rmse:0.427872	valid-rmse:0.437131
[140]	train-rmse:0.426528	valid-rmse:0.436249
[150]	train-rmse:0.424141	valid-rmse:0.43431
[160]	train-rmse:0.421148	valid-rmse:0.43164
[170]	train-rmse:0.418708	valid-rmse:0.429658
[180]	train-rmse:0.41323	valid-rmse:0.424686
[19
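The log line "Will train until valid-rmse hasn't improved in 60 rounds" describes early stopping. As an illustration of that rule only, and not of the `xgboost` internals or of `taxi_utils.train_xgb`, the hypothetical helper below scans a list of validation scores and stops once the best score has not improved for `patience` consecutive rounds. The scores are made up.

```python
def early_stopping_round(valid_scores, patience):
    """Return (best_round, best_score) under a simple early-stopping rule."""
    best_score = float("inf")
    best_round = -1
    for round_index, score in enumerate(valid_scores):
        if score < best_score:
            # New best validation score: remember it and keep training.
            best_score, best_round = score, round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    return best_round, best_score


# Made-up validation RMSE values; the final 0.49 is never reached.
scores = [5.0, 1.0, 0.6, 0.5, 0.52, 0.51, 0.53, 0.49]
best_round, best_score = early_stopping_round(scores, patience=3)
print(best_round, best_score)
```

A larger `patience`, such as the 60 rounds used in the log above, tolerates longer plateaus before giving up.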

<h2>Step 4: Make a Submission</h2>
Some utility functions are stored in `taxi_utils.py`. That file contains `predict_xgb`, which uses our model to predict trip durations for the test data, and `feature_importances`, which we will use below.

In [10]:
submission = taxi_utils.predict_xgb(model, X_test)
submission.head(5)

id,trip_duration
id3004672,744.959778
id3505355,439.119873
id1217141,432.329071
id2150126,1111.528809
id1598245,314.304413


In [11]:
submission.shape

(625134, 1)

In [12]:
X_test.shape

(625134, 14)

In [13]:
submission.to_csv('trip_duration_baseline.csv', index=True, index_label='id')
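To see what the written file looks like, the sketch below builds a two-row toy frame with the same layout as `submission` (the name `toy_submission` is illustrative) and renders it to a string instead of a file. It assumes the expected submission format is an `id` column followed by `trip_duration`.

```python
import pandas as pd

# Toy frame indexed by id, mirroring the structure of `submission`.
toy_submission = pd.DataFrame(
    {"trip_duration": [744.959778, 439.119873]},
    index=pd.Index(["id3004672", "id3505355"], name="id"),
)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = toy_submission.to_csv(index=True, index_label="id")
header = csv_text.splitlines()[0]
print(csv_text)
```

Passing `index=True` matters here because the trip ids live in the index rather than in a regular column.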

<dl>
<dt>This solution:</dt>
<dd>Received a score of 0.46589 on the Kaggle competition.</dd>
<dd>Placed 738 out of 1257.</dd>
<dd>Beat 41% of competitors on the Kaggle competition.</dd>
<dd>Had a modeling RMSE of 0.41591.</dd>
</dl>

December 27, 2017.
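The 41% figure follows directly from the leaderboard position. A quick check of the arithmetic, using only the two numbers reported above:

```python
rank, total = 738, 1257

# Competitors ranked below this solution, and their share of the field.
beaten = total - rank
share_beaten = beaten / total

print(f"{beaten} of {total} competitors beaten ({share_beaten:.1%})")
```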

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [14]:
feature_names = X_train.columns.values
ft_importances = taxi_utils.feature_importances(model, feature_names)
ft_importances

Unnamed: 0,feature_name,importance
3,dropoff_longitude,8512.0
1,dropoff_latitude,8365.0
0,pickup_longitude,8213.0
2,pickup_latitude,7799.0
5,pickup_hour,3190.0
6,pickup_day,3071.0
7,pickup_minute,2854.0
11,pickup_second,2536.0
8,pickup_month,1535.0
4,passenger_count,831.0
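Raw importance scores are easier to compare as shares. The sketch below re-enters the values from the table above by hand and computes how much of the listed total comes from the four coordinate columns. The shares are relative to the listed features only, since the table may not show every feature the model used.

```python
import pandas as pd

# Values copied from the feature importance table above.
importances = pd.DataFrame(
    {
        "feature_name": [
            "dropoff_longitude", "dropoff_latitude", "pickup_longitude",
            "pickup_latitude", "pickup_hour", "pickup_day", "pickup_minute",
            "pickup_second", "pickup_month", "passenger_count",
        ],
        "importance": [
            8512.0, 8365.0, 8213.0, 7799.0, 3190.0,
            3071.0, 2854.0, 2536.0, 1535.0, 831.0,
        ],
    }
)

# Share of the listed total contributed by each feature.
importances["share"] = importances["importance"] / importances["importance"].sum()

coordinate_columns = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
]
is_coordinate = importances["feature_name"].isin(coordinate_columns)
coordinate_share = importances.loc[is_coordinate, "share"].sum()

print(f"Coordinates account for {coordinate_share:.1%} of the listed importance")
```

On these numbers the pickup and dropoff coordinates account for roughly 70% of the listed importance, which suggests that location-derived features are a natural place to look for improvements over this baseline.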


