# Overview
This notebook will have 4 components matching the modules of Henchman:
1. diagnostics
2. selection
3. learning
4. plotting

We will demonstrate the available functionality using the flight dataset from Featuretools. If you have both `henchman` and `featuretools` installed, you should be able to run the following without issue. Here we build the feature matrix and entityset we'll be using.

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft

import numpy as np
import pandas as pd
import featuretools as ft

es = ft.demo.load_flight(categorical_filter={'dest_city': ['New York, NY'], 
                                             'origin_city': ['New York, NY']}, 
                         verbose=True)
def make_cutoffs(es, delay=75, advance='24h'):
    # Predict for all non-canceled, non-diverted flights
    tmp = es['trip_logs'].df[(es['trip_logs'].df['cancelled'] == False) & (es['trip_logs'].df['diverted'] == False)]

    # Set the cutoff time to be `advance` hours before the scheduled departure time (default 24)
    cutoff_times = tmp[['trip_log_id', 'scheduled_dep_time']]
    cutoff_times['scheduled_dep_time'] = cutoff_times['scheduled_dep_time'] - pd.Timedelta(advance)

    # Check if the flight will have a delay of `delay` (default 75)
    label = (es['trip_logs'].df['arr_delay'] > delay).reset_index()
    label = label.rename(columns={'index': 'trip_log_id'})
    
    # Rename the columns
    cutoff_times = cutoff_times.merge(label).rename(columns={'arr_delay': 'label', 'scheduled_dep_time': 'cutoff_time'})
    return cutoff_times

cutoff_times = make_cutoffs(es, delay=15, advance='24h')
fm, features = ft.dfs(entityset=es, 
                      target_entity='trip_logs',
                      cutoff_time=cutoff_times,
                      n_jobs=2,
                      approximate='12h',
                      verbose=True)
fm_enc, features_enc = ft.encode_features(fm, features)
fm.to_csv('fm.csv')
fm_enc.to_csv('fm_enc.csv')

100%|██████████| 100/100 [01:13<00:00,  1.35it/s]


Built 115 features
EntitySet scattered to workers in 4.016 seconds
Elapsed: 04:59 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


In [2]:
# If you've already built the feature matrix once, comment out dfs and uncomment these
# fm = pd.read_csv('fm.csv', index_col='trip_log_id')
# fm_enc = pd.read_csv('fm_enc.csv', index_col='trip_log_id')

# Diagnostics
There are consistent questions we'd like to ask about dataframes. The `henchman.diagnostics` module provides plaintext answers to those questions. Let's take a macroscopic look at all the entities in the entityset.

In [3]:
from henchman.diagnostics import overview
for entity_name in es.entity_dict:
    print('\n--------\n'+entity_name+'\n--------')
    overview(es[entity_name].df)


--------
airports
--------

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 3
Number of rows: 85

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 0.02 MB
Average memory by column: 0.00 MB

+--------------+
|  Data Types  |
+--------------+
        index
0            
object      3

--------
flights
--------

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 9
Number of rows: 2739

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 1.23 MB
Average memory by column: 0.12 MB

+--------------+
|  Data Types  |
+--------------+
                index
0                    
int64               2
datetime64[ns]      1
object     

There is also functionality to find common dataset problems. It is possible to modify every warning threshold, but the defaults tend to do a good job of finding red flags. Here are the warnings for the flight feature matrix.

In [4]:
from henchman.diagnostics import warnings
warnings(fm)


+------------+
+------------+
distance and scheduled_elapsed_time are linearly correlated: 0.972
distance and flights.distance_group are linearly correlated: 0.979
distance and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
distance and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MAX(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.distance_group are linearly correlated: 0.953
scheduled_elapsed_time and flights.MIN(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MEAN(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.MAX(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.999
scheduled_elapsed_time and fligh

flight_id has many unique values: 2716
flights.origin has many unique values: 83
flights.dest has many unique values: 85
flights.origin_city has many unique values: 79
flights.airports.dest_city has many unique values: 81


Lots of highly correlated features in our feature matrix! Many of them come from correlations that had already existed in `trip_logs`. It's often the case that you'd like a full profile for your dataframe. There's a function for that as well. This will crash the notebook preview, but begins returning output very quickly.

In [5]:
from henchman.diagnostics import profile
# profile(fm_enc)

Note that the column summaries are created according to the pandas `dtype`, so that we're not trying to average categorical values.

# Selection
We would like to remove some of our highly correlated features prior to machine learning. We would have moderate success just randomly selecting a smaller feature subset. Before we start, we're going to separate off a holdout set for testing.

In [6]:
from henchman.learning import create_holdout
from henchman.selection import RandomSelect
X = fm_enc.copy().fillna(0)
y = X.pop('label')

X, X_ho, y, y_ho = create_holdout(X, y)
rand_selector = RandomSelect(n_feats=50)
rand_selector.fit(X)
X_rand = rand_selector.transform(X)
warnings(X_rand)
X_rand.head()


+------------+
+------------+
DataFrame has 414 duplicates
WEEKDAY(scheduled_dep_time) = 6 and WEEKDAY(dep_time) = 6 are linearly correlated: 0.985
DAY(scheduled_dep_time) = 24 and DAY(dep_time) = 24 are linearly correlated: 0.991
flights.MAX(trip_logs.scheduled_elapsed_time) and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.999
flights.STD(trip_logs.arr_delay) and flights.MAX(trip_logs.dep_delay) are linearly correlated: 0.913
DAY(flight_date) = 24 and DAY(dep_time) = 24 are linearly correlated: 0.991


Unnamed: 0_level_0,WEEKDAY(scheduled_dep_time) = 6,flights.dest = SFO,DAY(scheduled_dep_time) = 24,flights.origin = JFK,flights.MAX(trip_logs.scheduled_elapsed_time),flights.dest = JFK,flights.WEEKDAY(first_trip_logs_time) = 2,YEAR(dep_time) = 2017,flights.carrier = VX,flights.MEAN(trip_logs.air_time),flights.STD(trip_logs.arr_delay),"flights.airports.dest_city = San Francisco, CA",flights.origin_state = TX,DAY(flight_date) = 24,flights.flight_num = 2240,flights.flight_num = 347,DAY(dep_time) = 20,DAY(dep_time) = 23,flights.carrier = AA,flights.SKEW(trip_logs.arr_delay),flights.STD(trip_logs.national_airspace_delay),DAY(scheduled_arr_time) = 27,flights.airports.dest_state = MA,DAY(scheduled_arr_time) = 13,flight_id = DL-2404:SAN->JFK,flights.carrier = B6,flights.origin = BOS,WEEKDAY(time_index) = 4,DAY(time_index) = 5,flights.distance_group = 7,WEEKDAY(flight_date) = 1,WEEKDAY(dep_time) = 6,DAY(dep_time) = 24,flights.MEAN(trip_logs.dep_delay),flights.MIN(trip_logs.dep_delay),flights.MEAN(trip_logs.weather_delay),flights.MAX(trip_logs.dep_delay),flights.MIN(trip_logs.scheduled_elapsed_time),MONTH(scheduled_dep_time) = 1,"flights.origin_city = Chicago, IL","flights.airports.dest_city = Orlando, FL",YEAR(arr_time) = 2017,flights.flight_num = 418,flights.DAY(first_trip_logs_time) = 9,YEAR(flight_date) = 2017,DAY(arr_time) = 10,DAY(dep_time) = 16,flights.origin_state = PR,WEEKDAY(flight_date) = 3,flights.MONTH(first_trip_logs_time) = 9
trip_log_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1
22874,1,0,0,0,17700000000000,1,0,1,0,0.0,0.0,0,0,0,0,0,0,0,1,0.0,0.0,0,0,0,0,0,0,0,0,0,0,1,0,0.0,0.0,0.0,0.0,17700000000000,1,0,0,1,0,0,1,0,0,0,0,1
1140,1,0,0,0,14400000000000,1,0,1,0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,1,0,0,0,1,0,1,0,0.0,0.0,0.0,0.0,14340000000000,1,0,0,1,0,0,1,0,0,1,0,1
8941,1,0,0,0,9780000000000,1,0,1,0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,1,0,0.0,0.0,0.0,0.0,9480000000000,1,0,0,1,0,0,1,0,0,0,0,1
564,1,0,0,1,10080000000000,0,0,1,0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,1,0,0,0,0,0,1,0,0.0,0.0,0.0,0.0,9960000000000,1,0,1,1,0,0,1,0,0,0,0,1
8678,1,0,0,1,13500000000000,0,0,1,0,0.0,0.0,0,0,0,0,0,0,0,0,0.0,0.0,0,0,0,0,1,0,0,0,1,0,1,0,0.0,0.0,0.0,0.0,13200000000000,1,0,0,1,0,0,1,0,0,0,0,1


We can be a little bit more precise with our selection as well:

In [7]:
from henchman.selection import Dendrogram
corr_selector = Dendrogram(X, max_threshes=250)
X_careful = corr_selector.transform(X, n_feats=50)

100%|██████████| 236/236 [00:47<00:00,  4.96it/s]
100%|██████████| 236/236 [00:18<00:00, 13.09it/s]

There are 50 distinct connected components at thresh step 10 in the Dendrogram
You might also be interested in 51 components at step 9





The Dendrogram object actually has the complete connectivity of `X` according to pairwise correlation. It chooses at random one representative from each connected component of the graph, and does that for every threshold of connectivity. That is to say, it's a one time calculation to find a feature set of an arbitrary size. 

In [8]:
corr_selector.transform(X, n_feats=100).head()

There are 91 distinct connected components at thresh step 4 in the Dendrogram
You might also be interested in 113 components at step 3


Unnamed: 0_level_0,distance,flight_id = DL-984:ATL->JFK,flight_id = DL-906:ATL->LGA,flight_id = DL-41:JFK->LAX,flight_id = DL-400:JFK->LAS,flight_id = DL-2404:SAN->JFK,flight_id = B6-616:SFO->JFK,flight_id = B6-1816:JFK->SYR,flight_id = B6-1415:JFK->SFO,flight_id = B6-1323:JFK->LAX,flight_id = unknown,WEEKDAY(scheduled_dep_time) = 0,WEEKDAY(scheduled_dep_time) = 1,WEEKDAY(scheduled_dep_time) = 4,WEEKDAY(scheduled_dep_time) = 6,WEEKDAY(scheduled_dep_time) = 3,WEEKDAY(scheduled_dep_time) = 5,WEEKDAY(scheduled_dep_time) = unknown,flights.carrier = UA,flights.carrier = VX,flights.carrier = NK,flights.carrier = F9,flights.carrier = OO,flights.carrier = unknown,YEAR(scheduled_arr_time) = 2017,YEAR(scheduled_arr_time) = unknown,WEEKDAY(time_index) = unknown,flights.origin = ORD,WEEKDAY(arr_time) = unknown,YEAR(dep_time) = 2017,YEAR(dep_time) = unknown,YEAR(time_index) = 2016,YEAR(time_index) = unknown,YEAR(flight_date) = 2017,YEAR(flight_date) = unknown,YEAR(scheduled_dep_time) = 2017,YEAR(scheduled_dep_time) = unknown,WEEKDAY(scheduled_arr_time) = unknown,MONTH(arr_time) = 3,MONTH(arr_time) = unknown,MONTH(dep_time) = 3,MONTH(dep_time) = unknown,MONTH(flight_date) = unknown,flights.origin_state = CO,flights.distance_group = 9,flights.distance_group = unknown,YEAR(arr_time) = 2017,YEAR(arr_time) = unknown,MONTH(scheduled_dep_time) = unknown,flights.flight_num = 347,flights.flight_num = 423,flights.flight_num = 2240,flights.flight_num = 418,flights.flight_num = 34,flights.flight_num = 24,flights.flight_num = 1805,flights.flight_num = 117,flights.flight_num = 224,flights.flight_num = 2183,MONTH(scheduled_arr_time) = 3,MONTH(scheduled_arr_time) = unknown,WEEKDAY(dep_time) = unknown,WEEKDAY(flight_date) = unknown,MONTH(time_index) = unknown,flights.SUM(trip_logs.security_delay),flights.SKEW(trip_logs.distance),flights.SKEW(trip_logs.carrier_delay),flights.SKEW(trip_logs.taxi_in),flights.YEAR(first_trip_logs_time) = 2016,flights.YEAR(first_trip_logs_time) = unknown,flights.SKEW(trip_logs.national_airspace_delay),flights.SKEW(trip_logs.arr_delay),flights.SKEW(trip_logs.taxi_out),flights.MONTH(first_trip_logs_time) = 9,flights.MONTH(first_trip_logs_time) = unknown,flights.DAY(first_trip_logs_time) = 20,flights.DAY(first_trip_logs_time) = 6,flights.DAY(first_trip_logs_time) = 9,flights.DAY(first_trip_logs_time) = 5,flights.DAY(first_trip_logs_time) = 19,flights.DAY(first_trip_logs_time) = unknown,flights.MIN(trip_logs.security_delay),flights.SKEW(trip_logs.weather_delay),flights.MIN(trip_logs.national_airspace_delay),flights.SKEW(trip_logs.security_delay),flights.STD(trip_logs.distance),flights.SKEW(trip_logs.air_time),flights.SKEW(trip_logs.late_aircraft_delay),flights.SKEW(trip_logs.dep_delay),flights.WEEKDAY(first_trip_logs_time) = unknown,flights.SKEW(trip_logs.scheduled_elapsed_time)
trip_log_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1
22874,2248.0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0
1140,1598.0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,-0.707107
8941,1069.0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,-4.419303
564,944.0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,-1.601282
8678,1598.0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1,0,0.0,0.0,0.0,1,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,-4.089039


# Learning

The biggest offender of code reuse in our demos is in the machine learning section. The workflow is always the same with potentially different *models* and *metrics*. This is provided as the single function `create_model`.

In [9]:
from henchman.learning import create_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

scores, fit_model = create_model(X, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores), np.std(scores)))

Average score of 0.55 with stdev 0.027


In [10]:
scores_rand, fit_model_rand = create_model(X_rand, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_rand), np.std(scores_rand)))

Average score of 0.55 with stdev 0.005


In [11]:
scores_careful, fit_model_careful = create_model(X_careful, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_careful), np.std(scores_careful)))

Average score of 0.58 with stdev 0.017


We can also check the normalized feature importances from the `learning` module.

In [12]:
from henchman.learning import feature_importances
top_feats = feature_importances(X_careful, fit_model_careful, n_feats=5)

1: flights.SKEW(trip_logs.scheduled_elapsed_time) [1.000]
2: distance [0.547]
3: flights.SUM(trip_logs.security_delay) [0.037]
4: flights.DAY(first_trip_logs_time) = 20 [0.002]
5: flights.distance_group = unknown [0.002]
-----



In [13]:
# Test on test set
X = fm_enc.copy().fillna(0)
y = X.pop('label')
X_final = corr_selector.transform(X, n_feats=50)
real_scores, _ = create_model(X_final, y, RandomForestClassifier(), roc_auc_score)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(real_scores), np.std(scores_careful)))

There are 50 distinct connected components at thresh step 10 in the Dendrogram
You might also be interested in 51 components at step 9
Average score of 0.62 with stdev 0.017


# Plotting
The API for this section is more of a work in progress. Here are some example plots.

In [14]:
from bokeh.io import output_notebook, show
import henchman.plotting as hplot
output_notebook()
show(hplot.feature_importances(X_careful, fit_model_careful, n_feats=5))

In [15]:
show(hplot.static_histogram(X['flights.STD(trip_logs.arr_delay)'], n_bins=50))

In [16]:
show(hplot.static_histogram_and_label(X[top_feats[0]], y, n_bins=50))