<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> An Advanced Featuretools Approach with Custom Primitives</h2>
<p>The following tutorial illustrates an advanced featuretools model for the NYC Taxi Trip Duration competition on Kaggle using our custom primitive API. You will need to download the following five files into a `data` folder in this repository.</p>

<h2>Step 1: Download raw data </h2>
<a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">test.csv and train.csv</a>

The functions used here don't appear until Featuretools version 0.1.9. If you're using a lower version you will get an error.

In [1]:
import featuretools as ft
import pandas as pd
import numpy as np
import taxi_utils
ft.__version__

'0.5.1'

<h2>Step 2: Prepare the Data </h2>

In [2]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"

data_train, data_test = taxi_utils.read_data(TRAIN_DIR, TEST_DIR)

The `fastest_routes` dataset has information on the shortest path between two coordinates in NYC. We can merge it in with our dataset here and then merge together our test and train datasets after marking them as such.

In [3]:
# Make a train/test column
data_train['test_data'] = False
data_test['test_data'] = True

# Combine the data and convert some strings
data = pd.concat([data_train, data_test])
data.head(5)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


Unnamed: 0,dropoff_latitude,dropoff_longitude,id,passenger_count,pickup_datetime,pickup_latitude,pickup_longitude,store_and_fwd_flag,test_data,trip_duration,vendor_id
0,40.765602,-73.96463,id2875421,1,2016-03-14 17:24:55,40.767937,-73.982155,False,False,455.0,2
1,40.731152,-73.999481,id2377394,1,2016-06-12 00:43:35,40.738564,-73.980415,False,False,663.0,1
2,40.710087,-74.005333,id3858529,1,2016-01-19 11:35:24,40.763939,-73.979027,False,False,2124.0,2
3,40.706718,-74.012268,id3504673,1,2016-04-06 19:32:31,40.719971,-74.01004,False,False,429.0,2
4,40.78252,-73.972923,id2181028,1,2016-03-26 13:30:55,40.793209,-73.973053,False,False,435.0,2


## Step 3: Custom Primitives
The custom primitive API is new to Featuretools version 0.1.9. Our workflow will be as follows:
1. Make a new LatLong class which is a tuple of a latitude and longitude
2. Define some functions for LatLong
3. Make new primitives from those functions

For the first step, we'll make new columns for the pickup and dropoff locations which are tuples.

In [4]:
data["pickup_latlong"] = data[['pickup_latitude', 'pickup_longitude']].apply(tuple, axis=1)
data["dropoff_latlong"] = data[['dropoff_latitude', 'dropoff_longitude']].apply(tuple, axis=1)
data = data.drop(["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"], axis = 1)

<p>We can define the `pickup_latlong` to be a `TripStart` type which will be built on our `LatLong` type. This is a way to tell DFS that certain primitives are only important in one direction, from the beginning of a trip to the end. 

Marking separate types for `TripStart` and `TripEnd` can be skipped as of Featuretools version 0.1.17 with the `commutative` functionality in DFS.</p>

In [5]:
from featuretools import variable_types as vtypes

class LatLong(vtypes.Variable):
    _dtype_repr = "latlong"

class TripStart(LatLong):
    _dtype_repr = "latlong"

class TripEnd(LatLong):
    _dtype_repr = "latlong"
    
trip_variable_types = {
    'passenger_count': vtypes.Ordinal, 
    'vendor_id': vtypes.Categorical,
    'pickup_latlong': TripStart,
    'dropoff_latlong': TripEnd,
}

es = ft.EntitySet("taxi")

es.entity_from_dataframe(entity_id="trips",
                         dataframe=data,
                         index="id",
                         time_index='pickup_datetime',
                         variable_types=trip_variable_types)

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="vendors",
                    index="vendor_id")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="passenger_cnt",
                    index="passenger_count")

cutoff_time = es['trips'].df[['id', 'pickup_datetime']]

<p>Next, we can create primitives for our new `pickup_latlong` and `dropoff_latlong` which are of type `TripStart` and `TripEnd`. 

The distance between two `LatLong` types is most accurately represented by the Haversine distance: the nearest length along a sphere. We can also define the Cityblock distance (also called the taxicab metric), which takes into account that we can't move diagonally through buildings in Manhattan.</p>

In [6]:
from featuretools.primitives import make_agg_primitive, make_trans_primitive

def haversine(latlong1, latlong2):
    lat_1s = np.array([x[0] for x in latlong1])
    lon_1s = np.array([x[1] for x in latlong1])
    lat_2s = np.array([x[0] for x in latlong2])
    lon_2s = np.array([x[1] for x in latlong2])
    lon1, lat1, lon2, lat2 = map(np.radians, [lon_1s, lat_1s, lon_2s, lat_2s])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    km = 6367 * 2 * np.arcsin(np.sqrt(a))
    return km

def cityblock(latlong1, latlong2):
    lon_dis = haversine(latlong1, latlong2)
    lat_dist = haversine(latlong1, latlong2)
    return lon_dis + lat_dist

def bearing(latlong1, latlong2):
    lat1 = np.array([x[0] for x in latlong1])
    lon1 = np.array([x[1] for x in latlong1])
    lat2 = np.array([x[0] for x in latlong2])
    lon2 = np.array([x[1] for x in latlong2])
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))

We can also make som primitives directly from `LatLong` types. In particular, we'll almost certainly want to extract the Latitude, the Longitude and the central point between two such coordinates.

In [7]:
def latitude(latlong):
    return np.array([x[0] for x in latlong])

def longitude(latlong):
    return np.array([x[1] for x in latlong])

<p>Now lets make primitives from those functions!</p>

In [8]:
Bearing = make_trans_primitive(function=bearing,
                               input_types=[TripStart, TripEnd],
                               return_type=vtypes.Numeric)

Cityblock = make_trans_primitive(function=cityblock,
                                 input_types=[TripStart, TripEnd],
                                 return_type=vtypes.Numeric)

Haversine = make_trans_primitive(function=haversine,
                                 input_types=[TripStart, TripEnd],
                                 return_type=vtypes.Numeric)

Latitude = make_trans_primitive(function=latitude, 
                                input_types=[LatLong],
                                return_type=vtypes.Numeric)

Longitude = make_trans_primitive(function=longitude,
                                 input_types=[LatLong],
                                 return_type=vtypes.Numeric)

## Seed Features
A "Seed Feature" is a feature that we don't necessarily plan on using as a primitive outside of this problem. Deep Feature Synthesis allows for automatic stacking on top of seed features as well. 

Here we can define some seed features which are relevant directly to predicting trip duration. The feature `Geobox` finds if a latlong is in a rectangle defined by two coordinate pairs. The feature `Numbox` is given here as a similar example, though it's more straightforward to define each of those seed features directly.

In [9]:
def geobox(latlong, bottomleft=None, topright=None):
    lat = np.array([x[0] for x in latlong])
    lon = np.array([x[1] for x in latlong])
    boxlats = [bottomleft[0],topright[0]]
    boxlongs = [bottomleft[1], topright[1]]
    output = []
    for i, name in enumerate(lat):
        if (min(boxlats) <= lat[i] and lat[i] <= max(boxlats) and 
                min(boxlongs) <= lon[i] and lon[i] <= max(boxlongs)):
            output.append(True)
        else: 
            output.append(False)
    return output

def numbox(number, less=None, more=None):
    output = []
    for i, name in enumerate(number):
        if less<=number[i] and number[i]<=more:
            output.append(True)
        else: 
            output.append(False)
    return output

def geobox_get_name(self):
    return u"GEOBOX({}, {}, {})".format(self.base_features[0].get_name(),
                                        str(self.kwargs['bottomleft']),
                                        str(self.kwargs['topright']))

Geobox = make_trans_primitive(function=geobox,
                              input_types=[LatLong],
                              return_type=vtypes.Boolean,
                              cls_attributes={"get_name": geobox_get_name})

def numbox_get_name(self):
    return u"NUMBOX({}, {}, {})".format(self.base_features[0].get_name(),
                                        str(self.kwargs['less']), 
                                        str(self.kwargs['more']))

Numbox = make_trans_primitive(function=numbox,
                              input_types=[vtypes.Ordinal],
                              return_type = vtypes.Boolean,
                              cls_attributes = {"get_name": numbox_get_name})

With those functions defined, we can now implement a geographic box around JFK airport with the `GEOBOX` primitive. We can also group together some times of day with `NUMBOX`.

In [10]:
from featuretools.primitives import (Feature, Hour)

seed_features = []

jfk_pick = Geobox(es['trips']['pickup_latlong'], 
                  bottomleft = (40.62, -73.85), 
                  topright = (40.70, -73.75))
jfk_drop = Geobox(es['trips']['dropoff_latlong'],
                  bottomleft = (40.62, -73.85), 
                  topright = (40.70, -73.75))

yonkers_pick = Geobox(es['trips']['pickup_latlong'], 
                      bottomleft = (40.70, -73.97), 
                      topright = (40.77, -73.9))
yonkers_drop = Geobox(es['trips']['dropoff_latlong'],
                      bottomleft = (40.70, -73.97), 
                      topright = (40.77, -73.9))

     
rush = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 7, more = 11)
noon = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 11, more = 13)
night = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 18, more = 23)

seed_features = [jfk_pick, jfk_drop,
                 yonkers_pick, yonkers_drop, rush, noon, night]

## Step 4: Using custom primitives in DFS

Let's see what features are created if we run DFS now:

In [11]:
agg_primitives = []
trans_primitives = [Bearing, Haversine, Cityblock, Latitude, Longitude]

# calculate feature_matrix using deep feature synthesis
features = ft.dfs(entityset=es,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=agg_primitives,
                  drop_contains=['trips.test_data'],
                  verbose=True,
                  cutoff_time=cutoff_time,
                  approximate='36d',
                  seed_features=seed_features,
                  max_depth=3,
                  max_features=40,
                  features_only=True)
features

Built 19 features


[<Feature: store_and_fwd_flag>,
 <Feature: test_data>,
 <Feature: trip_duration>,
 <Feature: passenger_count>,
 <Feature: vendor_id>,
 <Feature: GEOBOX(pickup_latlong, (40.62, -73.85), (40.7, -73.75))>,
 <Feature: GEOBOX(dropoff_latlong, (40.62, -73.85), (40.7, -73.75))>,
 <Feature: GEOBOX(pickup_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: GEOBOX(dropoff_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: BEARING(pickup_latlong, dropoff_latlong)>,
 <Feature: HAVERSINE(pickup_latlong, dropoff_latlong)>,
 <Feature: CITYBLOCK(pickup_latlong, dropoff_latlong)>,
 <Feature: LATITUDE(pickup_latlong)>,
 <Feature: LATITUDE(dropoff_latlong)>,
 <Feature: LONGITUDE(pickup_latlong)>,
 <Feature: LONGITUDE(dropoff_latlong)>,
 <Feature: NUMBOX(HOUR(pickup_datetime), 7, 11)>,
 <Feature: NUMBOX(HOUR(pickup_datetime), 11, 13)>,
 <Feature: NUMBOX(HOUR(pickup_datetime), 18, 23)>]

Our new features were applied exactly where they should be! The distances were only calcuated between the pickup_latlong and the dropoff_latlong, while primitives like `LATITUDE` were calculated on both latlong columns.

For this dataset, we have approximately 2 million distinct cutoff times, times at which DFS will have to recalculate aggregation primitives. Calculating those features at specific times every hour and using that number instead will save computation time and perhaps not lose too much information. The `approximate` parameter in DFS lets you specify a window size to use when approximating.

With the features from before, we calculate the whole feature matrix:

In [12]:
agg_primitives = ['Sum', 'Mean', 'Median', 'Std', 'Count', 'Min', 'Max', 'Num_Unique', 'Skew']
trans_primitives = [Bearing, Haversine, Cityblock, Latitude, Longitude, 
                    'Day', 'Hour', 'Minute', 'Month', 'Weekday', 'Week', 'Weekend']

# this allows us to create features that are conditioned on a second value before we calculate.
es.add_interesting_values()

# calculate feature_matrix using deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="trips",
                                  trans_primitives=trans_primitives,
                                  agg_primitives=agg_primitives,
                                  drop_contains=['trips.test_data'],
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  approximate='36d',
                                  seed_features=seed_features,
                                  max_depth=4)
feature_matrix.head()

Built 166 features
Elapsed: 35:25 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


Unnamed: 0_level_0,store_and_fwd_flag,test_data,trip_duration,passenger_count,vendor_id,"GEOBOX(pickup_latlong, (40.62, -73.85), (40.7, -73.75))","GEOBOX(dropoff_latlong, (40.62, -73.85), (40.7, -73.75))","GEOBOX(pickup_latlong, (40.7, -73.97), (40.77, -73.9))","GEOBOX(dropoff_latlong, (40.7, -73.97), (40.77, -73.9))","BEARING(pickup_latlong, dropoff_latlong)",...,passenger_cnt.NUM_UNIQUE(trips.MONTH(pickup_datetime)),passenger_cnt.NUM_UNIQUE(trips.WEEKDAY(pickup_datetime)),passenger_cnt.NUM_UNIQUE(trips.WEEK(pickup_datetime)),"passenger_cnt.SKEW(trips.BEARING(pickup_latlong, dropoff_latlong))","passenger_cnt.SKEW(trips.HAVERSINE(pickup_latlong, dropoff_latlong))","passenger_cnt.SKEW(trips.CITYBLOCK(pickup_latlong, dropoff_latlong))",passenger_cnt.SKEW(trips.LATITUDE(pickup_latlong)),passenger_cnt.SKEW(trips.LATITUDE(dropoff_latlong)),passenger_cnt.SKEW(trips.LONGITUDE(pickup_latlong)),passenger_cnt.SKEW(trips.LONGITUDE(dropoff_latlong))
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
id0190469,False,False,849.0,5,2,False,False,False,False,16.442364,...,,,,,,,,,,
id0621643,False,True,,2,2,False,False,False,True,10.237892,...,,,,,,,,,,
id1384355,False,True,,1,1,False,False,False,False,30.275237,...,,,,,,,,,,
id1665586,False,False,1294.0,1,1,False,False,False,True,145.360294,...,,,,,,,,,,
id1210365,False,False,408.0,5,2,False,False,False,False,43.630148,...,,,,,,,,,,


## Step 5: Build the Model

<p>As before, we need to retrieve our labels for the train dataset, so we should merge our current feature matrix with the original dataset. </p>

<p>We use the `log` of the trip duration since that measure is better at distinguishing distances within the city</p>

In [13]:
# separates the whole feature matrix into train data feature matrix, train data labels, and test data feature matrix 
X_train, labels, X_test = taxi_utils.get_train_test_fm(feature_matrix)
labels = np.log(labels.values + 1)

In [14]:
model = taxi_utils.train_xgb(X_train, labels)

[0]	train-rmse:5.00585	valid-rmse:5.00563
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:0.92119	valid-rmse:0.921423
[20]	train-rmse:0.395373	valid-rmse:0.398681
[30]	train-rmse:0.354122	valid-rmse:0.359861
[40]	train-rmse:0.344739	valid-rmse:0.352303
[50]	train-rmse:0.338561	valid-rmse:0.347664
[60]	train-rmse:0.332614	valid-rmse:0.343897
[70]	train-rmse:0.328251	valid-rmse:0.341719
[80]	train-rmse:0.325735	valid-rmse:0.340382
[90]	train-rmse:0.323722	valid-rmse:0.339769
[100]	train-rmse:0.322003	valid-rmse:0.339312
[110]	train-rmse:0.320877	valid-rmse:0.338868
[120]	train-rmse:0.319926	valid-rmse:0.338511
[130]	train-rmse:0.318342	valid-rmse:0.337846
[140]	train-rmse:0.317369	valid-rmse:0.337434
[150]	train-rmse:0.315876	valid-rmse:0.336961
[160]	train-rmse:0.315199	valid-rmse:0.33668
[170]	train-rmse:0.313739	valid-rmse:0.33597
[180]	train-rmse:0.312947	valid-rmse:0.3358

In [15]:
submission = taxi_utils.predict_xgb(model, X_test)
submission.head(5)

Unnamed: 0_level_0,trip_duration
id,Unnamed: 1_level_1
id0621643,1223.690674
id1384355,1549.502686
id2568735,1886.844482
id3700764,1383.922363
id3008929,292.624847


In [16]:
submission.to_csv('trip_duration_ft_custom_primitives.csv', index=True, index_label='id')

<dl>
<dt>This solution:</dt>
<dd>&nbsp; &nbsp; Received a score of 0.39256 on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Placed 416 out of 1257.</dd>
<dd>&nbsp; &nbsp; Beat 67% of competitors on the Kaggle competition.</dd>
<dd>&nbsp; &nbsp; Had a modeling RMSLE of 0.33454</dd>
</dl>

December 27, 2017.

<h2>Additional Analysis</h2>
<p>Lets look at how important the features we created were for the model.</p>

In [17]:
feature_names = X_train.columns.values
ft_importances = taxi_utils.feature_importances(model, feature_names)
ft_importances[:50]

Unnamed: 0,feature_name,importance
39,LATITUDE(dropoff_latlong),4377.0
5,LATITUDE(pickup_latlong),4374.0
4,"HAVERSINE(pickup_latlong, dropoff_latlong)",4250.0
6,"BEARING(pickup_latlong, dropoff_latlong)",3907.0
28,LONGITUDE(pickup_latlong),3814.0
69,LONGITUDE(dropoff_latlong),3752.0
20,"CITYBLOCK(pickup_latlong, dropoff_latlong)",3283.0
0,HOUR(pickup_datetime),2680.0
32,DAY(pickup_datetime),1752.0
53,WEEKDAY(pickup_datetime),1307.0


In [21]:
# Save output files

import os

try:
    os.mkdir("output")
except:
    pass

feature_matrix.to_csv('output/feature_matrix.csv')
cutoff_time.to_csv('output/cutoff_times.csv')

<p>
    <img src="https://www.featurelabs.com/wp-content/uploads/2017/12/logo.png" alt="Featuretools" />
</p>

Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/).