<center><a href="https://www.featuretools.com/"><img src="img/featuretools-logo.png" width="400" height="200" /></a></center>

<h2> An Advanced Featuretools Approach with Custom Primitives</h2>
<p>This tutorial shows how to use the custom primitives API in Featuretools. It assumes that you have completed Notebook 4. If you have not, download the following five files into a `data` folder in this repository.</p>

<h2>Step 1: Download raw data </h2>
+ <a href="https://www.kaggle.com/c/nyc-taxi-trip-duration/data">test.csv and train.csv</a>
+ <a href="https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm/data">fastest_routes train1, train2 and test</a>

The functions used here were introduced in Featuretools version 0.1.9. With an earlier version, the code below will raise an error.
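
A small guard can make that failure explicit before any other cell runs. The sketch below assumes the version string consists only of dot-separated integers (such as `0.1.14`); the helper name `meets_minimum` is hypothetical and not part of Featuretools.

```python
def meets_minimum(installed, minimum="0.1.9"):
    """Compare dotted numeric version strings component by component."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

# In the notebook this could be called as meets_minimum(ft.__version__).
print(meets_minimum("0.1.14"))
print(meets_minimum("0.1.8"))
```

Comparing tuples of integers rather than raw strings matters here: as plain strings, `"0.1.14"` sorts before `"0.1.9"`.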

In [2]:
import featuretools as ft
import pandas as pd
import numpy as np
import utils
ft.__version__

'0.1.14'

<h2>Step 2: Prepare the Data </h2>

In [3]:
TRAIN_DIR = "data/train.csv"
TEST_DIR = "data/test.csv"
data_train, data_test = utils.read_data(TRAIN_DIR, TEST_DIR, nrows=None)

# Load other data https://www.kaggle.com/oscarleo/new-york-city-taxi-with-osrm
fastest_routes_part1 = pd.read_csv('data/fastest_routes_train_part_1.csv', 
                                   usecols=['id', 'number_of_steps'])
fastest_routes_part2 = pd.read_csv('data/fastest_routes_train_part_2.csv', 
                                   usecols=['id', 'number_of_steps'])
fastest_routes_train = pd.concat((fastest_routes_part1, fastest_routes_part2))

fastest_routes_test = pd.read_csv('data/fastest_routes_test.csv',
                                  usecols=['id', 'number_of_steps'])

data_train = data_train.merge(fastest_routes_train, how='left', on='id')
data_test = data_test.merge(fastest_routes_test, how='left', on='id')

# Make a train/test column
data_train['test_data'] = False
data_test['test_data'] = True

# Combine the data and convert some strings
data = pd.concat([data_train, data_test])
data.loc[:, 'store_and_fwd_flag'] = data['store_and_fwd_flag'].map({'Y': 1, 'N': 0})
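
To make the two pandas operations in this cell concrete, here is a toy version with made-up rows (the values are illustrative only). It shows that a left merge keeps every trip even when no route exists, and that `map` converts any value missing from the dictionary into `NaN`.

```python
import pandas as pd

trips = pd.DataFrame({"id": ["a", "b", "c"],
                      "store_and_fwd_flag": ["Y", "N", "N"]})
routes = pd.DataFrame({"id": ["a", "b"], "number_of_steps": [7, 10]})

# A left merge keeps every trip; trips without a route get NaN.
merged = trips.merge(routes, how="left", on="id")

# map() turns any value that is not a key of the dictionary into NaN.
merged["store_and_fwd_flag"] = merged["store_and_fwd_flag"].map({"Y": 1, "N": 0})
print(merged)
```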

## Step 3: Custom Primitives
The custom primitive API is new in Featuretools version 0.1.9. Our workflow will be as follows:
1. Make a new LatLong class which is a tuple of a latitude and longitude
2. Define some functions for LatLong
3. Make new primitives from those functions

For the first step, we'll make new columns for the pickup and dropoff locations which are tuples.

In [4]:
# Keep the first 500,000 rows; copy so the assignments below do not write to a slice
data = data[0:500000].copy()
data["pickup_latlong"] = data[['pickup_latitude', 'pickup_longitude']].apply(tuple, axis=1)
data["dropoff_latlong"] = data[['dropoff_latitude', 'dropoff_longitude']].apply(tuple, axis=1)
data = data.drop(["pickup_latitude", "pickup_longitude", "dropoff_latitude", "dropoff_longitude"], axis = 1)
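
On a toy frame, the tuple columns look like this. The sketch uses `zip` instead of `.apply(tuple, axis=1)`, which should produce equivalent tuples while avoiding the construction of a Series for every row; the coordinates are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"pickup_latitude": [40.76, 40.64],
                    "pickup_longitude": [-73.98, -73.78]})

# Pair the two columns element by element into (latitude, longitude) tuples.
toy["pickup_latlong"] = list(zip(toy["pickup_latitude"], toy["pickup_longitude"]))
toy = toy.drop(["pickup_latitude", "pickup_longitude"], axis=1)
print(toy)
```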

<p>We can define the `pickup_latlong` to be a `TripStart` type which will be built on our `LatLong` type. This is a way to tell DFS that certain primitives are only important in one direction, from the beginning of a trip to the end.</p>

In [5]:
from featuretools import variable_types as vtypes

class LatLong(vtypes.Variable):
    _dtype_repr = "latlong"

class TripStart(LatLong):
    _dtype_repr = "latlong"

class TripEnd(LatLong):
    _dtype_repr = "latlong"
    
trip_variable_types = {
    'passenger_count': vtypes.Ordinal, 
    'vendor_id': vtypes.Categorical,
    'pickup_latlong': TripStart,
    'dropoff_latlong': TripEnd,
}

es = ft.EntitySet("taxi")

es.entity_from_dataframe(entity_id="trips",
                         dataframe=data,
                         index="id",
                         time_index='pickup_datetime',
                         variable_types=trip_variable_types)

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="vendors",
                    index="vendor_id")

es.normalize_entity(base_entity_id="trips",
                    new_entity_id="passenger_cnt",
                    index="passenger_count")

cutoff_time = es['trips'].df[['id', 'pickup_datetime']][es['trips'].df.pickup_datetime>pd.Timestamp('2016-2-15')]
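
The reason for subclassing `LatLong` can be illustrated with plain Python classes. This is only a sketch of the subclass idea; the classes and the `matches` helper are stand-ins, and this is not a description of how Featuretools matches inputs internally.

```python
class LatLong:            # stand-in for the LatLong variable type
    pass

class TripStart(LatLong):
    pass

class TripEnd(LatLong):
    pass

def matches(input_types, column_types):
    """True when each column type is a subclass of the required input type."""
    return (len(input_types) == len(column_types) and
            all(issubclass(col, req)
                for req, col in zip(input_types, column_types)))

print(matches([TripStart, TripEnd], [TripStart, TripEnd]))  # pickup -> dropoff
print(matches([TripStart, TripEnd], [TripEnd, TripStart]))  # reversed order
print(matches([LatLong], [TripEnd]))                        # any LatLong column
```

A function that declares `[TripStart, TripEnd]` only fits the pickup-to-dropoff ordering, while one that declares `[LatLong]` fits either column.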

<p>Next, we can create some primitives for our new `pickup_latlong` and `dropoff_latlong` columns, which are of type `TripStart` and `TripEnd`. We'll define functions for the Haversine distance, the Cityblock distance and the Bearing, as well as a couple of other primitives.</p>

In [6]:
from featuretools.primitives import make_agg_primitive, make_trans_primitive

def haversine(latlong1, latlong2):
    lat_1s = np.array([x[0] for x in latlong1])
    lon_1s = np.array([x[1] for x in latlong1])
    lat_2s = np.array([x[0] for x in latlong2])
    lon_2s = np.array([x[1] for x in latlong2])
    lon1, lat1, lon2, lat2 = map(np.radians, [lon_1s, lat_1s, lon_2s, lat_2s])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    km = 6367 * 2 * np.arcsin(np.sqrt(a))
    return km

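def cityblock(latlong1, latlong2):
    # NOTE: both terms use identical arguments, so this returns twice the
    # haversine distance rather than separate longitude and latitude legs.
    lon_dis = haversine(latlong1, latlong2)
    lat_dist = haversine(latlong1, latlong2)
    return lon_dis + lat_dist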

def bearing(latlong1, latlong2):
    lat1 = np.array([x[0] for x in latlong1])
    lon1 = np.array([x[1] for x in latlong1])
    lat2 = np.array([x[0] for x in latlong2])
    lon2 = np.array([x[1] for x in latlong2])
    delta_lon = np.radians(lon2 - lon1)
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    x = np.cos(lat2) * np.sin(delta_lon)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon)
    return np.degrees(np.arctan2(x, y))

def latitude(latlong):
    return np.array([x[0] for x in latlong])

def longitude(latlong):
    return np.array([x[1] for x in latlong])

def center(latlong1, latlong2):
    return np.array([[.5*(latlong1[i][0] + latlong2[i][0]),
                          .5*(latlong1[i][1] + latlong2[i][1])]
                     for i, x in enumerate(latlong1)])

def geobox(latlong, bottomleft=None, topright=None):
    lat = np.array([x[0] for x in latlong])
    lon = np.array([x[1] for x in latlong])
    boxlats = [bottomleft[0],topright[0]]
    boxlongs = [bottomleft[1], topright[1]]
    output = []
    for i, name in enumerate(lat):
        if (min(boxlats) <= lat[i] and lat[i] <= max(boxlats) and 
                min(boxlongs) <= lon[i] and lon[i] <= max(boxlongs)):
            output.append(True)
        else: 
            output.append(False)
    return output

def numbox(number, less=None, more=None):
    output = []
    for i, name in enumerate(number):
        if less<=number[i] and number[i]<=more:
            output.append(True)
        else: 
            output.append(False)
    return output
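
Because `cityblock` above calls `haversine` with identical arguments for both legs, it returns exactly twice the haversine distance. The feature matrix shown later in this notebook reflects that: the `CITYBLOCK` column is twice the `HAVERSINE` column. If a true city-block distance is wanted, one option is to travel east-west along the pickup latitude and then north-south along the dropoff longitude. The sketch below is an assumption about the intended behaviour, not the tutorial's implementation, and `haversine_km` and `cityblock_km` are hypothetical names.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6367):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

def cityblock_km(latlong1, latlong2):
    """East-west leg along the pickup latitude plus a north-south leg."""
    lat1, lon1 = np.array(latlong1, dtype=float).T
    lat2, lon2 = np.array(latlong2, dtype=float).T
    east_west = haversine_km(lat1, lon1, lat1, lon2)
    north_south = haversine_km(lat1, lon2, lat2, lon2)
    return east_west + north_south

# Made-up points: one short trip and one trip that does not move at all.
pickups = [(40.75, -73.99), (40.64, -73.78)]
dropoffs = [(40.76, -73.97), (40.64, -73.78)]
print(cityblock_km(pickups, dropoffs))
```

Swapping this in would change the `CITYBLOCK` values, so the recorded outputs in this notebook would no longer match.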

<p>Now let's make primitives from those functions!</p>

In [7]:
Bearing = make_trans_primitive(function=bearing,
                               input_types=[TripStart, TripEnd],
                               return_type=vtypes.Numeric)

Cityblock = make_trans_primitive(function=cityblock,
                                 input_types=[TripStart, TripEnd],
                                 return_type=vtypes.Numeric)

Haversine = make_trans_primitive(function=haversine,
                                 input_types=[TripStart, TripEnd],
                                 return_type=vtypes.Numeric)

Latitude = make_trans_primitive(function=latitude, 
                                input_types=[LatLong],
                                return_type=vtypes.Numeric)

Longitude = make_trans_primitive(function=longitude,
                                 input_types=[LatLong],
                                 return_type=vtypes.Numeric)

Center = make_trans_primitive(function=center,
                              input_types=[TripStart, TripEnd],
                              return_type=LatLong)
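
Since these functions operate on whole columns of latlong tuples, it can help to check one on a few points with known answers before relying on the resulting feature. The block below is a compact restatement of the `bearing` formula from this notebook (not a new method), applied to made-up points due north, east and south of the origin.

```python
import numpy as np

def bearing(latlong1, latlong2):
    """Initial compass bearing in degrees from each start to each end point."""
    lat1, lon1 = np.radians(np.array(latlong1, dtype=float)).T
    lat2, lon2 = np.radians(np.array(latlong2, dtype=float)).T
    delta_lon = lon2 - lon1
    x = np.cos(lat2) * np.sin(delta_lon)
    y = (np.cos(lat1) * np.sin(lat2) -
         np.sin(lat1) * np.cos(lat2) * np.cos(delta_lon))
    return np.degrees(np.arctan2(x, y))

origin = [(0.0, 0.0)] * 3
targets = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # north, east, south
print(bearing(origin, targets))
```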

<p>Let's define some more primitives with a custom name. </p>

In [8]:
def geobox_get_name(self):
    return u"GEOBOX({}, {}, {})".format(self.base_features[0].get_name(),
                                        str(self.kwargs['bottomleft']),
                                        str(self.kwargs['topright']))

Geobox = make_trans_primitive(function=geobox,
                              input_types=[LatLong],
                              return_type=vtypes.Boolean,
                              cls_attributes={"get_name": geobox_get_name})

def numbox_get_name(self):
    return u"NUMBOX({}, {}, {})".format(self.base_features[0].get_name(),
                                        str(self.kwargs['less']), 
                                        str(self.kwargs['more']))

Numbox = make_trans_primitive(function=numbox,
                              input_types=[vtypes.Ordinal],
                              return_type = vtypes.Boolean,
                              cls_attributes = {"get_name": numbox_get_name})
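
To see what name `geobox_get_name` produces without building an entity set, the function can be called on a stand-in object. `FakeBase` and `FakeGeobox` below are hypothetical classes that only mimic the two attributes the function reads (`base_features` and `kwargs`); they are not Featuretools classes.

```python
class FakeBase:
    """Stand-in for a base feature; only get_name() is needed here."""
    def __init__(self, name):
        self._name = name

    def get_name(self):
        return self._name

class FakeGeobox:
    """Stand-in exposing the attributes that geobox_get_name reads."""
    def __init__(self, base, **kwargs):
        self.base_features = [base]
        self.kwargs = kwargs

def geobox_get_name(self):
    return u"GEOBOX({}, {}, {})".format(self.base_features[0].get_name(),
                                        str(self.kwargs['bottomleft']),
                                        str(self.kwargs['topright']))

feature = FakeGeobox(FakeBase("pickup_latlong"),
                     bottomleft=(40.62, -73.85), topright=(40.70, -73.75))
print(geobox_get_name(feature))
```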

The main idea behind custom primitives is reuse beyond a single problem. It's easy to imagine wanting the latitude of any `LatLong` type, or the distance travelled whenever there is a `TripStart` and a `TripEnd`. However, some features are only useful for one specific problem. These are called seed features, and Featuretools can add them to DFS as well. Here, we use the `GEOBOX` primitive to add geographic boxes around JFK airport and around a second area (named `yonkers` in the code). We also group together some times of day with `NUMBOX`.

In [9]:
from featuretools.primitives import (Feature, Hour)

seed_features = []

jfk_pick = Geobox(es['trips']['pickup_latlong'], 
                  bottomleft = (40.62, -73.85), 
                  topright = (40.70, -73.75))
jfk_drop = Geobox(es['trips']['dropoff_latlong'],
                  bottomleft = (40.62, -73.85), 
                  topright = (40.70, -73.75))

yonkers_pick = Geobox(es['trips']['pickup_latlong'], 
                      bottomleft = (40.70, -73.97), 
                      topright = (40.77, -73.9))
yonkers_drop = Geobox(es['trips']['dropoff_latlong'],
                      bottomleft = (40.70, -73.97), 
                      topright = (40.77, -73.9))

     
rush = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 7, more = 11)
noon = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 11, more = 13)
night = Numbox(Hour(es["trips"]["pickup_datetime"]), less = 18, more = 23)

seed_features = [jfk_pick, jfk_drop,
                 yonkers_pick, yonkers_drop, rush, noon, night]
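
The box tests behind these seed features can be tried on a few values directly. The sketch below is a vectorized variant with hypothetical names (`in_box`, `in_range`); unlike `geobox`, it assumes `bottomleft` really holds the smaller coordinates. The two sample points are approximate coordinates for JFK airport and Times Square.

```python
import numpy as np

def in_box(latlongs, bottomleft, topright):
    """Inclusive test of (lat, lon) pairs against a rectangular box."""
    lat, lon = np.array(latlongs, dtype=float).T
    return ((bottomleft[0] <= lat) & (lat <= topright[0]) &
            (bottomleft[1] <= lon) & (lon <= topright[1]))

def in_range(values, less, more):
    """Inclusive test less <= value <= more, mirroring numbox."""
    values = np.asarray(values)
    return (less <= values) & (values <= more)

points = [(40.6413, -73.7781), (40.7580, -73.9855)]  # approx. JFK, Times Square
print(in_box(points, bottomleft=(40.62, -73.85), topright=(40.70, -73.75)))

hours = [8, 11, 20]
print(in_range(hours, 7, 11))   # rush window
print(in_range(hours, 11, 13))  # noon window
```

Because both bounds are inclusive, hour 11 falls into both the `rush` and the `noon` windows defined above.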

## Step 4: Using custom primitives in DFS

Let's see what features are created if we run DFS now:

In [10]:
agg_primitives = []
trans_primitives = [Bearing, Haversine, Cityblock, Center, Latitude, Longitude]

# calculate feature_matrix using deep feature synthesis
features = ft.dfs(entityset=es,
                  target_entity="trips",
                  trans_primitives=trans_primitives,
                  agg_primitives=agg_primitives,
                  drop_contains=['trips.test_data'],
                  verbose=True,
                  cutoff_time=cutoff_time,
                  approximate='36d',
                  seed_features=seed_features,
                  max_depth=3,
                  max_features=40,
                  features_only=True)
features

Building features: 31it [00:00, 4250.10it/s]


[<Feature: passenger_count>,
 <Feature: number_of_steps>,
 <Feature: store_and_fwd_flag>,
 <Feature: vendor_id>,
 <Feature: test_data>,
 <Feature: trip_duration>,
 <Feature: LATITUDE(dropoff_latlong)>,
 <Feature: LONGITUDE(pickup_latlong)>,
 <Feature: BEARING(pickup_latlong, dropoff_latlong)>,
 <Feature: GEOBOX(dropoff_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: GEOBOX(pickup_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: HAVERSINE(pickup_latlong, dropoff_latlong)>,
 <Feature: GEOBOX(pickup_latlong, (40.62, -73.85), (40.7, -73.75))>,
 <Feature: CITYBLOCK(pickup_latlong, dropoff_latlong)>,
 <Feature: LONGITUDE(dropoff_latlong)>,
 <Feature: GEOBOX(dropoff_latlong, (40.62, -73.85), (40.7, -73.75))>,
 <Feature: LATITUDE(pickup_latlong)>,
 <Feature: NUMBOX(HOUR(pickup_datetime), 18, 23)>,
 <Feature: LONGITUDE(CENTER(pickup_latlong, dropoff_latlong))>,
 <Feature: LATITUDE(CENTER(pickup_latlong, dropoff_latlong))>,
 <Feature: NUMBOX(HOUR(pickup_datetime), 11, 13)>,
 <Feature:

Our new primitives were applied exactly where they should be! The distances were only calculated between `pickup_latlong` and `dropoff_latlong`, while primitives like `LATITUDE` were calculated on both latlong columns. With those in hand, we calculate the whole feature matrix:

In [11]:
from featuretools.primitives import (Sum, Mean, Median, Std, Count, Min, Max, NUnique, Skew,
                                     Day, Hour, Minute, Month, Weekday, Week, Weekend)
agg_primitives = [Sum, Mean, Median, Std, Count, Min, Max, NUnique, Skew]
trans_primitives = [Bearing, Haversine, Cityblock, Center, Latitude, Longitude, 
                    Day, Hour, Minute, Month, Weekday, Week, Weekend]

# calculate feature_matrix using deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity="trips",
                                  trans_primitives=trans_primitives,
                                  agg_primitives=agg_primitives,
                                  drop_contains=['trips.test_data'],
                                  verbose=True,
                                  cutoff_time=cutoff_time,
                                  approximate='36d',
                                  seed_features=seed_features,
                                  max_depth=4)
feature_matrix.head()

Building features: 432it [00:00, 2127.80it/s]
Progress: 100%|██████████| 5/5 [04:22<00:00, 53.22s/cutoff time]


id,store_and_fwd_flag,number_of_steps,test_data,trip_duration,vendor_id,passenger_count,LONGITUDE(pickup_latlong),MONTH(pickup_datetime),"HAVERSINE(pickup_latlong, dropoff_latlong)","CITYBLOCK(pickup_latlong, dropoff_latlong)",...,"vendors.MEAN(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","vendors.MIN(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","vendors.SKEW(trips.LATITUDE(CENTER(pickup_latlong, dropoff_latlong)))","vendors.MAX(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","vendors.MEAN(trips.LATITUDE(CENTER(pickup_latlong, dropoff_latlong)))","passenger_cnt.MAX(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","vendors.STD(trips.LATITUDE(CENTER(pickup_latlong, dropoff_latlong)))","passenger_cnt.MEDIAN(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","passenger_cnt.SUM(trips.LONGITUDE(CENTER(pickup_latlong, dropoff_latlong)))","passenger_cnt.MIN(trips.LATITUDE(CENTER(pickup_latlong, dropoff_latlong)))"
id0001948,,7.0,False,191.0,1,1,-73.976723,2,0.637667,1.275333,...,-73.974061,-74.017681,-0.536945,-73.773407,40.75135,-73.773407,0.024105,-73.979847,-1436576.0,40.639
id0002179,,10.0,False,1860.0,2,3,-73.974907,2,4.7071,9.414199,...,-73.973544,-74.018089,-0.563442,-73.778049,40.751491,-73.777454,0.023997,-73.980869,-87806.48,40.645321
id0003337,,13.0,False,1479.0,2,1,-73.994598,2,4.38751,8.775019,...,-73.973544,-74.018089,-0.563442,-73.778049,40.751491,-73.773407,0.023997,-73.979847,-1436576.0,40.639
id0003864,,9.0,False,749.0,2,1,-74.010231,2,1.475399,2.950798,...,-73.973544,-74.018089,-0.563442,-73.778049,40.751491,-73.773407,0.023997,-73.979847,-1436576.0,40.639
id0006435,,8.0,False,752.0,1,2,-73.959831,2,2.141434,4.282868,...,-73.974061,-74.017681,-0.536945,-73.773407,40.75135,-73.781967,0.024105,-73.979868,-304175.6,40.644646


In [12]:
features[:30]

[<Feature: store_and_fwd_flag>,
 <Feature: number_of_steps>,
 <Feature: test_data>,
 <Feature: trip_duration>,
 <Feature: vendor_id>,
 <Feature: passenger_count>,
 <Feature: LONGITUDE(pickup_latlong)>,
 <Feature: MONTH(pickup_datetime)>,
 <Feature: HAVERSINE(pickup_latlong, dropoff_latlong)>,
 <Feature: CITYBLOCK(pickup_latlong, dropoff_latlong)>,
 <Feature: MINUTE(pickup_datetime)>,
 <Feature: LONGITUDE(dropoff_latlong)>,
 <Feature: LATITUDE(pickup_latlong)>,
 <Feature: HOUR(pickup_datetime)>,
 <Feature: GEOBOX(dropoff_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: DAY(pickup_datetime)>,
 <Feature: IS_WEEKEND(pickup_datetime)>,
 <Feature: BEARING(pickup_latlong, dropoff_latlong)>,
 <Feature: WEEK(pickup_datetime)>,
 <Feature: GEOBOX(pickup_latlong, (40.62, -73.85), (40.7, -73.75))>,
 <Feature: WEEKDAY(pickup_datetime)>,
 <Feature: LATITUDE(dropoff_latlong)>,
 <Feature: GEOBOX(pickup_latlong, (40.7, -73.97), (40.77, -73.9))>,
 <Feature: GEOBOX(dropoff_latlong, (40.62, -73.85), (

## Step 5: Build the Model

<p>As before, we need to retrieve our labels for the train dataset, so we should merge our current feature matrix with the original dataset. </p>

<p>We also take the log of the trip duration plus one, `log(trip_duration + 1)`, so that a more linear relationship can be found.</p>

In [13]:
# separates the whole feature matrix into train data feature matrix, train data labels, and test data feature matrix 
X_train, labels, X_test = utils.get_train_test_fm(feature_matrix)
labels = np.log(labels.values + 1)
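
A model trained on these labels predicts on the log scale, so predictions need the inverse transform before they can be read as seconds. The sketch below uses `np.log1p` and `np.expm1`, NumPy's counterparts of `log(x + 1)` and `exp(x) - 1`; the three durations are taken from the feature matrix preview above, and the 10% overestimate is made up for illustration.

```python
import numpy as np

durations = np.array([191.0, 1860.0, 1479.0])  # trip durations in seconds

labels = np.log1p(durations)     # same values as np.log(durations + 1)
recovered = np.expm1(labels)     # inverse transform, back to seconds

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# RMSE on the log-transformed labels is the RMSLE of the raw durations.
print(rmse(labels, np.log1p(durations * 1.1)))
```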

In [14]:
model = utils.train_xgb(X_train, labels)

[0]	train-rmse:5.00902	valid-rmse:5.01039
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 60 rounds.
[10]	train-rmse:0.921313	valid-rmse:0.922226
[20]	train-rmse:0.38935	valid-rmse:0.396997
[30]	train-rmse:0.342259	valid-rmse:0.355585
[40]	train-rmse:0.329698	valid-rmse:0.347305
[50]	train-rmse:0.321991	valid-rmse:0.342474
[60]	train-rmse:0.316924	valid-rmse:0.340564
[70]	train-rmse:0.312619	valid-rmse:0.338999
[80]	train-rmse:0.308778	valid-rmse:0.337238
[90]	train-rmse:0.30645	valid-rmse:0.336333
[100]	train-rmse:0.304547	valid-rmse:0.335969
[110]	train-rmse:0.302373	valid-rmse:0.335708
[120]	train-rmse:0.301056	valid-rmse:0.335559
[130]	train-rmse:0.298848	valid-rmse:0.335223
[140]	train-rmse:0.296946	valid-rmse:0.334768
[150]	train-rmse:0.295887	valid-rmse:0.334681
[160]	train-rmse:0.295072	valid-rmse:0.334643
[170]	train-rmse:0.293767	valid-rmse:0.334252
[180]	train-rmse:0.292196	valid-rmse:0.333

<h2>Additional Analysis</h2>
<p>Let's look at how important each feature was for the model.</p>

In [15]:
ft_importances = utils.feature_importances(model, X_train.columns.values)
ft_importances[:50]

,feature_name,importance
131,"HAVERSINE(pickup_latlong, dropoff_latlong)",2509.0
124,"BEARING(pickup_latlong, dropoff_latlong)",2220.0
30,"LATITUDE(CENTER(pickup_latlong, dropoff_latlong))",2094.0
134,LONGITUDE(dropoff_latlong),1918.0
116,LATITUDE(dropoff_latlong),1902.0
119,LATITUDE(pickup_latlong),1878.0
142,"LONGITUDE(CENTER(pickup_latlong, dropoff_latlo...",1861.0
129,LONGITUDE(pickup_latlong),1759.0
132,"CITYBLOCK(pickup_latlong, dropoff_latlong)",1616.0
120,HOUR(pickup_datetime),1339.0
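
The internals of `utils.feature_importances` are not shown in this notebook. As a minimal sketch of the ranking step only, assuming the importances are available as a mapping from feature name to score, the table above corresponds to sorting by score in descending order. The three scores are copied from that table.

```python
scores = {
    "HOUR(pickup_datetime)": 1339.0,
    "HAVERSINE(pickup_latlong, dropoff_latlong)": 2509.0,
    "BEARING(pickup_latlong, dropoff_latlong)": 2220.0,
}

# Sort (name, score) pairs so the most important feature comes first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, importance in ranked:
    print("{:<45} {:>8.1f}".format(name, importance))
```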
