# Overview
This notebook has four parts, one for each module of Henchman:
1. diagnostics
2. selection
3. learning
4. plotting

We will demonstrate the available functionality using the flight dataset from Featuretools. If you have both `henchman` and `featuretools` installed, you should be able to run the following without issue. Here we build the feature matrix and entityset we'll be using.

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft

es = ft.demo.load_flight(categorical_filter={'dest_city': ['New York, NY'], 
                                             'origin_city': ['New York, NY']}, 
                         verbose=True)
def make_cutoffs(es, delay=75, advance='24h'):
    trip_logs = es['trip_logs'].df

    # Predict for all non-canceled, non-diverted flights
    tmp = trip_logs[(trip_logs['cancelled'] == False) & (trip_logs['diverted'] == False)]

    # Set the cutoff time to be `advance` before the scheduled departure time (default 24 hours).
    # Copy first so the assignment below does not write to a slice of `tmp`.
    cutoff_times = tmp[['trip_log_id', 'scheduled_dep_time']].copy()
    cutoff_times['scheduled_dep_time'] = cutoff_times['scheduled_dep_time'] - pd.Timedelta(advance)

    # Label each flight by whether its arrival delay exceeds `delay` (default 75)
    label = (trip_logs['arr_delay'] > delay).reset_index()
    label = label.rename(columns={'index': 'trip_log_id'})

    # Attach the labels and rename the columns
    cutoff_times = cutoff_times.merge(label).rename(columns={'arr_delay': 'label', 'scheduled_dep_time': 'cutoff_time'})
    return cutoff_times

cutoff_times = make_cutoffs(es, delay=15, advance='24h')
fm, features = ft.dfs(entityset=es, 
                      target_entity='trip_logs',
                      cutoff_time=cutoff_times,
                      n_jobs=2,
                      approximate='12h',
                      verbose=True)
fm_enc, features_enc = ft.encode_features(fm, features)
fm.to_csv('fm.csv')
fm_enc.to_csv('fm_enc.csv')

100%|██████████| 100/100 [01:23<00:00,  1.20it/s]


Built 115 features
EntitySet scattered to workers in 4.743 seconds
Elapsed: 07:04 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
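
To see what `make_cutoffs` is producing, here is a minimal sketch of the same idea on three made-up trips, using only the standard library. The records, the helper `toy_cutoffs`, and its defaults are assumptions for illustration; they are not part of Featuretools or Henchman.

```python
from datetime import datetime, timedelta

# Hypothetical toy records standing in for rows of `trip_logs`.
trips = [
    {"trip_log_id": 1, "scheduled_dep_time": datetime(2017, 1, 5, 9, 0), "arr_delay": 40.0},
    {"trip_log_id": 2, "scheduled_dep_time": datetime(2017, 1, 6, 18, 30), "arr_delay": -5.0},
    {"trip_log_id": 3, "scheduled_dep_time": datetime(2017, 1, 7, 7, 15), "arr_delay": 15.0},
]


def toy_cutoffs(trips, delay=15, advance=timedelta(hours=24)):
    """Return one {id, cutoff_time, label} row per trip."""
    rows = []
    for trip in trips:
        rows.append({
            "trip_log_id": trip["trip_log_id"],
            # Features for this trip may only use data recorded before this moment.
            "cutoff_time": trip["scheduled_dep_time"] - advance,
            # The comparison is strict, so a delay equal to `delay` is not a positive label.
            "label": trip["arr_delay"] > delay,
        })
    return rows


toy_rows = toy_cutoffs(trips)
for row in toy_rows:
    print(row["trip_log_id"], row["cutoff_time"], row["label"])
```

The third trip shows the boundary case: a delay of exactly 15 is not labeled as delayed.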


# Diagnostics
There are questions we consistently ask about dataframes. The `henchman.diagnostics` module provides plaintext answers to those questions. Let's take a macroscopic look at all the entities in the entityset.

In [2]:
from henchman.diagnostics import overview
for entity_name in es.entity_dict:
    print('\n--------\n'+entity_name+'\n--------')
    overview(es[entity_name].df)


--------
airports
--------

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 3
Number of rows: 85

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 0.02 MB
Average memory by column: 0.00 MB

+--------------+
|  Data Types  |
+--------------+
        index
0            
object      3

--------
flights
--------

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 9
Number of rows: 2739

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 1.23 MB
Average memory by column: 0.12 MB

+--------------+
|  Data Types  |
+--------------+
                index
0                    
int64               2
datetime64[ns]      1
object     

There is also functionality to find common dataset problems. It is possible to modify every warning threshold, but the defaults tend to do a good job of finding red flags. Here are the warnings for the flight feature matrix.

In [3]:
# If you've already built the feature matrices, start here
# import numpy as np
# import pandas as pd
# import featuretools as ft
# fm = pd.read_csv('fm.csv', index_col='trip_log_id')
# fm_enc = pd.read_csv('fm_enc.csv', index_col='trip_log_id')

In [4]:
from henchman.diagnostics import warnings
warnings(fm)


+------------+
+------------+
distance and scheduled_elapsed_time are linearly correlated: 0.972
distance and flights.distance_group are linearly correlated: 0.979
distance and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
distance and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MAX(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.distance_group are linearly correlated: 0.953
scheduled_elapsed_time and flights.MIN(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MEAN(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.MAX(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.999
scheduled_elapsed_time and fligh

flights.SKEW(trip_logs.carrier_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.taxi_in) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.national_airspace_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.arr_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.taxi_out) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.weather_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.security_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.air_time) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.late_aircraft_delay) has 56353 missing values: (100% of total)
flights.SKEW(trip_logs.dep_delay) has 56353 missing values: (100% of total)
flight_id has many unique values: 2716
flights.origin has many unique values: 83
flights.dest has many unique values: 85
flights.origin_city has many unique values: 79
flights.airports.dest_city has many uni
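
To make the idea of a warning threshold concrete, here is a minimal standard-library sketch of two such checks. It is not Henchman's implementation: `toy_warnings`, the parameter names `corr_thresh` and `missing_thresh`, and their default values are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as 0 here
    return cov / sqrt(var_x * var_y)


def toy_warnings(columns, corr_thresh=0.9, missing_thresh=0.1):
    """Collect plain-text warnings for a dict of column name -> list of values."""
    messages = []
    # Missing values: None marks a missing entry in this sketch.
    for name, values in columns.items():
        n_missing = sum(v is None for v in values)
        if n_missing / len(values) > missing_thresh:
            messages.append(f"{name} has {n_missing} missing values")
    # Pairwise correlation, computed over rows where both values are present.
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
        if len(pairs) < 2:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) > corr_thresh:
            messages.append(f"{a} and {b} are linearly correlated: {r:.3f}")
    return messages


# Made-up columns: two nearly proportional, one unrelated, one entirely missing.
toy = {
    "distance": [100, 200, 300, 400, 500],
    "scheduled_elapsed_time": [35, 52, 71, 88, 110],
    "taxi_out": [12, 30, 9, 22, 15],
    "skew": [None, None, None, None, None],
}
default_msgs = toy_warnings(toy)
strict_msgs = toy_warnings(toy, corr_thresh=0.999)
print("\n".join(default_msgs))
```

Raising `corr_thresh` to 0.999 in the second call drops the correlation warning and keeps only the missing-value one, which is the kind of trade-off a threshold lets us control.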

There are lots of highly correlated features in our feature matrix! Many of them come from correlations that already existed in `trip_logs`.

Often you'd like a full profile of your dataframe. There's a function for that as well:

In [5]:
from henchman.diagnostics import profile
profile(fm_enc)


+--------------+
|  Data Shape  |
+--------------+
Number of columns: 359
Number of rows: 56353

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 56353
Average missing values by column: 1815.18

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 161.90 MB
Average memory by column: 0.45 MB

+--------------+
|  Data Types  |
+--------------+
         index
0             
bool         1
int64      286
float64     72

+------------+
+------------+
DataFrame has 52 duplicates
distance and scheduled_elapsed_time are linearly correlated: 0.972
distance and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
distance and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MAX(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.MI

MONTH(scheduled_dep_time) = 1 and MONTH(scheduled_arr_time) = 1 are linearly correlated: 0.998
MONTH(scheduled_dep_time) = 1 and MONTH(scheduled_arr_time) = 2 are linearly correlated: -0.995
MONTH(scheduled_dep_time) = 1 and MONTH(time_index) = 10 are linearly correlated: -0.900
MONTH(scheduled_dep_time) = 1 and MONTH(time_index) = 9 are linearly correlated: 0.900
MONTH(scheduled_dep_time) = 2 and MONTH(scheduled_arr_time) = 1 are linearly correlated: -0.998
MONTH(scheduled_dep_time) = 2 and MONTH(scheduled_arr_time) = 2 are linearly correlated: 0.995
MONTH(scheduled_dep_time) = 2 and MONTH(time_index) = 10 are linearly correlated: 0.900
MONTH(scheduled_dep_time) = 2 and MONTH(time_index) = 9 are linearly correlated: -0.900
MONTH(scheduled_arr_time) = 1 and MONTH(scheduled_arr_time) = 2 are linearly correlated: -0.997
MONTH(scheduled_arr_time) = 1 and MONTH(time_index) = 10 are linearly correlated: -0.902
MONTH(scheduled_arr_time) = 1 and MONTH(time_index) = 9 are linearly correlated: 


## flights.carrier = EV ##
Maximum: 1, Minimum: 0, Mean: 0.07
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = WN ##
Maximum: 1, Minimum: 0, Mean: 0.06
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = UA ##
Maximum: 1, Minimum: 0, Mean: 0.04
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = VX ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = NK ##
Maximum: 1, Minimum: 0, Mean: 0.02
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = F9 ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = OO ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = unknown ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## YEAR(scheduled_arr_time) = 2017 ##
Maximum: 1, Minimum: 1, Mean: 1.00
Quartile 3: 1.00 

Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## MONTH(dep_time) = unknown ##
Maximum: 0, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## MONTH(flight_date) = 1 ##
Maximum: 1, Minimum: 0, Mean: 0.52
Quartile 3: 1.00 | Median: 1.00| Quartile 1: 0.00

## MONTH(flight_date) = 2 ##
Maximum: 1, Minimum: 0, Mean: 0.48
Quartile 3: 1.00 | Median: 0.00| Quartile 1: 0.00

## MONTH(flight_date) = unknown ##
Maximum: 0, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## DAY(arr_time) = 27 ##
Maximum: 1, Minimum: 0, Mean: 0.04
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## DAY(arr_time) = 20 ##
Maximum: 1, Minimum: 0, Mean: 0.04
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## DAY(arr_time) = 24 ##
Maximum: 1, Minimum: 0, Mean: 0.04
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## DAY(arr_time) = 17 ##
Maximum: 1, Minimum: 0, Mean: 0.04
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

#

Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(dep_time) = 6 ##
Maximum: 1, Minimum: 0, Mean: 0.15
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(dep_time) = 2 ##
Maximum: 1, Minimum: 0, Mean: 0.14
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(dep_time) = 3 ##
Maximum: 1, Minimum: 0, Mean: 0.13
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(dep_time) = 5 ##
Maximum: 1, Minimum: 0, Mean: 0.10
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(dep_time) = unknown ##
Maximum: 0, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(flight_date) = 0 ##
Maximum: 1, Minimum: 0, Mean: 0.17
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(flight_date) = 1 ##
Maximum: 1, Minimum: 0, Mean: 0.16
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(flight_date) = 4 ##
Maximum: 1, Minimum: 0, Mean: 0.15
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(flight_date

Quartile 3: 49.97 | Median: 29.81| Quartile 1: 14.16
Missing: 2203

## flights.SKEW(trip_logs.national_airspace_delay) ##
Maximum: nan, Minimum: nan, Mean: nan
Quartile 3: nan | Median: nan| Quartile 1: nan
Missing: 56353

## flights.STD(trip_logs.arr_delay) ##
Maximum: 452.549444812, Minimum: 0.0, Mean: 36.10
Quartile 3: 46.73 | Median: 28.97| Quartile 1: 16.41
Missing: 2203

## flights.MEAN(trip_logs.taxi_out) ##
Maximum: 142.0, Minimum: 0.0, Mean: 18.85
Quartile 3: 23.50 | Median: 18.42| Quartile 1: 14.27
Missing: 2203

## flights.SKEW(trip_logs.arr_delay) ##
Maximum: nan, Minimum: nan, Mean: nan
Quartile 3: nan | Median: nan| Quartile 1: nan
Missing: 56353

## flights.MAX(trip_logs.arr_delay) ##
Maximum: 1369.0, Minimum: -59.0, Mean: 128.51
Quartile 3: 171.00 | Median: 90.00| Quartile 1: 31.00
Missing: 2203

## flights.MAX(trip_logs.national_airspace_delay) ##
Maximum: 948.0, Minimum: 0.0, Mean: 47.47
Quartile 3: 56.00 | Median: 26.00| Quartile 1: 0.00
Missing: 2203

## flights.SUM

Maximum: 554.0, Minimum: 0.0, Mean: 0.12
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00
Missing: 2203


Note that the column summaries are created according to the pandas `dtype`, so that we're not trying to average categorical values.
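
As a rough illustration of choosing a summary by dtype, here is a small pandas sketch. The helper `summarize_column` and the toy dataframe are made up for this example and do not reproduce Henchman's output format.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def summarize_column(col):
    """Return a one-line summary chosen by the column's dtype."""
    # Check bool first: pandas also counts bool as a numeric dtype.
    if is_bool_dtype(col):
        return f"{col.name}: {col.mean():.0%} True"
    if is_numeric_dtype(col):
        return (f"{col.name}: max {float(col.max())}, min {float(col.min())}, "
                f"mean {col.mean():.2f}, missing {int(col.isnull().sum())}")
    # Anything else is treated as categorical: count values instead of averaging.
    return f"{col.name}: {col.nunique()} unique values, most common {col.mode().iloc[0]!r}"


toy = pd.DataFrame({
    "distance": [944.0, 1598.0, None, 2248.0],
    "carrier": ["UA", "WN", "UA", "EV"],
    "label": [True, False, False, True],
})
summaries = [summarize_column(toy[name]) for name in toy.columns]
print("\n".join(summaries))
```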

# Selection
We would like to remove some of our highly correlated features prior to machine learning. Even randomly selecting a smaller feature subset gives moderate success. Before we start, we're going to separate off a holdout set for testing.

In [6]:
from henchman.learning import create_holdout
from henchman.selection import RandomSelect
X = fm_enc.copy().fillna(0)
y = X.pop('label')

X, X_ho, y, y_ho = create_holdout(X, y)
rand_selector = RandomSelect(n_feats=50)
rand_selector.fit(X)
X_rand = rand_selector.transform(X)
warnings(X_rand)
X_rand.head()


+------------+
+------------+
DataFrame has 337 duplicates
flights.origin_state = NY and flights.origin_city = New York, NY are linearly correlated: 0.975
WEEKDAY(dep_time) = 0 and WEEKDAY(flight_date) = 0 are linearly correlated: 0.987
MONTH(dep_time) = 2 and MONTH(flight_date) = 1 are linearly correlated: -1.000
flights.MIN(trip_logs.scheduled_elapsed_time) and scheduled_elapsed_time are linearly correlated: 0.999
DAY(time_index) = 25 and DAY(scheduled_arr_time) = 23 are linearly correlated: 0.902
DAY(arr_time) = 23 and DAY(scheduled_arr_time) = 23 are linearly correlated: 0.952


trip_log_id,flights.STD(trip_logs.taxi_in),flights.STD(trip_logs.distance),flights.origin = SFO,flights.SKEW(trip_logs.scheduled_elapsed_time),flights.SUM(trip_logs.dep_delay),flights.origin_state = NY,flight_id = unknown,WEEKDAY(dep_time) = 0,flights.MAX(trip_logs.air_time),flights.SUM(trip_logs.national_airspace_delay),flights.SKEW(trip_logs.weather_delay),DAY(dep_time) = 20,flights.carrier = UA,MONTH(dep_time) = 2,flights.distance_group = 7,"flights.airports.dest_city = Boston, MA",flights.DAY(first_trip_logs_time) = 4,"flights.airports.dest_city = Los Angeles, CA",flights.airports.dest_state = TX,WEEKDAY(scheduled_arr_time) = 2,WEEKDAY(time_index) = 3,WEEKDAY(arr_time) = 5,DAY(time_index) = 27,YEAR(dep_time) = 2017,flights.MIN(trip_logs.scheduled_elapsed_time),flights.WEEKDAY(first_trip_logs_time) = 4,flights.distance_group = 9,flights.origin = FLL,flights.distance_group = 6,WEEKDAY(flight_date) = 0,flights.airports.dest_state = PR,flights.SUM(trip_logs.weather_delay),flights.dest = MCO,flights.origin_state = unknown,DAY(time_index) = 25,DAY(scheduled_arr_time) = 16,"flights.origin_city = New York, NY",DAY(scheduled_dep_time) = 3,flights.SKEW(trip_logs.dep_delay),WEEKDAY(scheduled_arr_time) = 3,MONTH(flight_date) = 1,scheduled_elapsed_time,DAY(time_index) = 26,DAY(arr_time) = 23,DAY(flight_date) = 10,flights.STD(trip_logs.dep_delay),DAY(time_index) = 19,WEEKDAY(scheduled_arr_time) = unknown,DAY(scheduled_arr_time) = 23,DAY(scheduled_arr_time) = 26
22874,0.0,0.0,0,0.0,0.0,0,1,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,17700000000000,0,1,0,0,0,0,0.0,0,1,0,0,0,0,0.0,0,1,17700000000000,0,0,0,0.0,0,0,0,0
1140,0.0,0.0,0,-0.707107,0.0,0,1,0,0.0,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,14340000000000,0,0,0,0,0,0,0.0,0,0,0,0,0,0,0.0,0,1,14400000000000,0,0,0,0.0,0,0,0,0
8941,0.0,0.0,0,-4.419303,0.0,0,1,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,9480000000000,0,0,1,0,0,0,0.0,0,0,0,0,0,0,0.0,0,1,9480000000000,0,0,0,0.0,0,0,0,0
564,0.0,0.0,0,-1.601282,0.0,1,1,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,1,9960000000000,0,0,0,0,0,0,0.0,1,0,0,0,1,0,0.0,0,1,9960000000000,0,0,0,0.0,0,0,0,0
8678,0.0,0.0,0,-4.089039,0.0,1,1,0,0.0,0.0,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1,13200000000000,0,0,0,0,0,1,0.0,0,0,0,0,1,0,0.0,0,1,13200000000000,0,0,0,0.0,0,0,0,0


We can be a little bit more precise with our selection as well:

In [7]:
from henchman.selection import Dendrogram
corr_selector = Dendrogram(X, max_threshes=100)
X_careful = corr_selector.transform(X, n_feats=50)

100%|██████████| 100/100 [00:23<00:00,  4.34it/s]
100%|██████████| 100/100 [00:15<00:00,  6.46it/s]

There are 50 distinct connected components at thresh step 8 in the Dendrogram
You might also be interested in 51 components at step 7





The `Dendrogram` object stores the complete connectivity of `X` according to pairwise correlation. It chooses one representative at random from each connected component of the graph, and does that for every connectivity threshold. That is to say, after a one-time calculation we can ask for a feature set of arbitrary size.
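
The idea can be sketched in plain Python: connect two features when their absolute correlation exceeds a threshold, record the connected components for every threshold once, then pick one feature per component. This is only a sketch under our own assumptions. The helpers `components_at`, `build_toy_dendrogram` and `pick_features`, the threshold list, and the toy columns are hypothetical and do not mirror the internals of `henchman.selection.Dendrogram`.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return 0.0 if var_x == 0 or var_y == 0 else cov / sqrt(var_x * var_y)


def components_at(columns, thresh):
    """Group column names linked by |correlation| > thresh (union-find)."""
    parent = {name: name for name in columns}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(columns, 2):
        if abs(pearson(columns[a], columns[b])) > thresh:
            parent[find(a)] = find(b)

    groups = {}
    for name in columns:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def build_toy_dendrogram(columns, threshes):
    """One-time pass: store the components for every threshold."""
    return {t: components_at(columns, t) for t in threshes}


def pick_features(dendrogram, n_feats, seed=0):
    """Take the strictest threshold with at most n_feats components; pick one name from each."""
    rng = random.Random(seed)
    for thresh in sorted(dendrogram, reverse=True):
        if len(dendrogram[thresh]) <= n_feats:
            return [rng.choice(comp) for comp in dendrogram[thresh]]
    # No threshold is loose enough: fall back to the loosest one available.
    return [rng.choice(comp) for comp in dendrogram[min(dendrogram)]]


# Made-up columns for illustration.
toy = {
    "distance": [1, 2, 3, 4, 5],
    "air_time": [2, 4, 6, 8, 11],
    "slack": [5, 3, 4, 1, 2],
    "is_weekend": [1, 0, 1, 0, 1],
}
dendrogram = build_toy_dendrogram(toy, threshes=[0.95, 0.7, 0.5])
picked = pick_features(dendrogram, n_feats=2)
print(picked)
```

Once `dendrogram` is built, calling `pick_features` with a different `n_feats` needs no new correlations, which is the point of doing the calculation once.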

In [8]:
corr_selector.transform(X, n_feats=45).head()

There are 45 distinct connected components at thresh step 12 in the Dendrogram
You might also be interested in 46 components at step 11


trip_log_id,distance,MONTH(dep_time) = unknown,MONTH(flight_date) = unknown,WEEKDAY(dep_time) = unknown,flights.SKEW(trip_logs.taxi_in),flights.YEAR(first_trip_logs_time) = 2016,flights.YEAR(first_trip_logs_time) = unknown,flights.SKEW(trip_logs.carrier_delay),WEEKDAY(scheduled_dep_time) = unknown,flights.SKEW(trip_logs.air_time),flights.SKEW(trip_logs.national_airspace_delay),flights.SKEW(trip_logs.arr_delay),flights.SKEW(trip_logs.taxi_out),YEAR(scheduled_arr_time) = 2017,YEAR(scheduled_arr_time) = unknown,flights.MONTH(first_trip_logs_time) = unknown,WEEKDAY(time_index) = unknown,YEAR(arr_time) = 2017,YEAR(arr_time) = unknown,MONTH(scheduled_dep_time) = unknown,flights.SKEW(trip_logs.weather_delay),WEEKDAY(arr_time) = unknown,YEAR(dep_time) = 2017,YEAR(dep_time) = unknown,flights.STD(trip_logs.distance),MONTH(scheduled_arr_time) = 3,MONTH(scheduled_arr_time) = unknown,flights.SKEW(trip_logs.late_aircraft_delay),flights.SKEW(trip_logs.dep_delay),YEAR(time_index) = 2016,YEAR(time_index) = unknown,WEEKDAY(flight_date) = unknown,MONTH(time_index) = unknown,YEAR(flight_date) = 2017,YEAR(flight_date) = unknown,YEAR(scheduled_dep_time) = 2017,YEAR(scheduled_dep_time) = unknown,flights.SKEW(trip_logs.security_delay),WEEKDAY(scheduled_arr_time) = unknown,flights.WEEKDAY(first_trip_logs_time) = unknown,MONTH(arr_time) = 3,MONTH(arr_time) = unknown,flights.SKEW(trip_logs.distance),flights.MIN(trip_logs.security_delay),MONTH(dep_time) = 3
22874,2248.0,0,0,0,0.0,1,0,0.0,0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,0.0,0,1,0,0.0,0,0,0.0,0.0,1,0,0,0,1,0,1,0,0.0,0,0,0,0,0.0,0.0,0
1140,1598.0,0,0,0,0.0,1,0,0.0,0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,0.0,0,1,0,0.0,0,0,0.0,0.0,1,0,0,0,1,0,1,0,0.0,0,0,0,0,0.0,0.0,0
8941,1069.0,0,0,0,0.0,1,0,0.0,0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,0.0,0,1,0,0.0,0,0,0.0,0.0,1,0,0,0,1,0,1,0,0.0,0,0,0,0,0.0,0.0,0
564,944.0,0,0,0,0.0,1,0,0.0,0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,0.0,0,1,0,0.0,0,0,0.0,0.0,1,0,0,0,1,0,1,0,0.0,0,0,0,0,0.0,0.0,0
8678,1598.0,0,0,0,0.0,1,0,0.0,0,0.0,0.0,0.0,0.0,1,0,0,0,1,0,0,0.0,0,1,0,0.0,0,0,0.0,0.0,1,0,0,0,1,0,1,0,0.0,0,0,0,0,0.0,0.0,0


# Learning

The most frequently repeated code in our demos is in the machine learning section. The workflow is always the same, with potentially different *models* and *metrics*. This is provided as the single function `create_model`.
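
For readers who want to see the shape of that repeated workflow, here is a dependency-free sketch: split, fit, score, and collect the scores. How Henchman actually chooses its splits is not shown in this notebook, so the interleaved folds below are an assumption, and `toy_create_model`, `MajorityClassifier` and `accuracy` are hypothetical stand-ins rather than library code.

```python
from statistics import mean, pstdev


class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.prediction_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.prediction_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def toy_create_model(X, y, model, metric, n_splits=4):
    """Refit `model` once per fold; return the fold scores and the last fitted model."""
    scores = []
    for fold in range(n_splits):
        # Every n_splits-th row (offset by `fold`) is set aside for scoring.
        test_idx = set(range(fold, len(X), n_splits))
        X_train = [row for i, row in enumerate(X) if i not in test_idx]
        y_train = [lab for i, lab in enumerate(y) if i not in test_idx]
        X_test = [row for i, row in enumerate(X) if i in test_idx]
        y_test = [lab for i, lab in enumerate(y) if i in test_idx]
        model.fit(X_train, y_train)
        scores.append(metric(y_test, model.predict(X_test)))
    return scores, model


# Made-up data: eight rows, two positive labels.
X_toy = [[i] for i in range(8)]
y_toy = [0, 0, 0, 1, 0, 0, 1, 0]
toy_scores, toy_model = toy_create_model(X_toy, y_toy, MajorityClassifier(), accuracy)
print('Average score of {:.2f} with stdev {:.3f}'.format(mean(toy_scores), pstdev(toy_scores)))
```

Swapping the model or the metric only changes the two arguments, which is why a single function can cover the whole workflow.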

In [9]:
from henchman.learning import create_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

scores, fit_model = create_model(X, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores), np.std(scores)))

Average score of 0.56 with stdev 0.022


In [10]:
scores_rand, fit_model_rand = create_model(X_rand, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_rand), np.std(scores_rand)))

Average score of 0.57 with stdev 0.018


In [11]:
scores_careful, fit_model_careful = create_model(X_careful, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_careful), np.std(scores_careful)))

Average score of 0.58 with stdev 0.017


We can also check the normalized feature importances from the `learning` module.

In [12]:
from henchman.learning import feature_importances
top_feats = feature_importances(X_careful, fit_model_careful, n_feats=5)

1: flights.SKEW(trip_logs.scheduled_elapsed_time) [1.000]
2: distance [0.632]
3: flights.SUM(trip_logs.security_delay) [0.039]
4: flights.distance_group = unknown [0.003]
5: flights.DAY(first_trip_logs_time) = 20 [0.003]
-----
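
The top score above is 1.000, which suggests the importances are scaled by the largest one. Assuming that is the normalization, a minimal sketch looks like this; `normalized_importances` and the raw numbers are made up for illustration.

```python
def normalized_importances(names, raw_importances, n_feats=5):
    """Scale importances so the largest is 1.0 and return the top `n_feats` pairs."""
    top = max(raw_importances)
    if top == 0:
        return []  # nothing to rank when every importance is zero
    scaled = [(name, value / top) for name, value in zip(names, raw_importances)]
    scaled.sort(key=lambda pair: pair[1], reverse=True)
    return scaled[:n_feats]


# Hypothetical raw importances, such as a fitted model's `feature_importances_`.
names = ["distance", "carrier = UA", "taxi_out", "weekday = 6"]
raw = [0.30, 0.05, 0.50, 0.15]
ranked = normalized_importances(names, raw, n_feats=3)
for rank, (name, value) in enumerate(ranked, start=1):
    print("{}: {} [{:.3f}]".format(rank, name, value))
```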



In [13]:
# Score once more, this time on the full feature matrix (holdout rows included)
X = fm_enc.copy().fillna(0)
y = X.pop('label')
X_final = corr_selector.transform(X, n_feats=50)
real_scores, _ = create_model(X_final, y, RandomForestClassifier(), roc_auc_score)
# The stdev printed here is that of the earlier cross-validation (`scores_careful`)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(real_scores), np.std(scores_careful)))

There are 50 distinct connected components at thresh step 8 in the Dendrogram
You might also be interested in 51 components at step 7
Average score of 0.62 with stdev 0.017


# Plotting
The API for this section is more of a work in progress. Here are some example plots.

In [14]:
from bokeh.io import output_notebook, show
import henchman.plotting as hplot
output_notebook()
show(hplot.feature_importances(X_careful, fit_model_careful, n_feats=5))

In [15]:
show(hplot.static_histogram(X['flights.STD(trip_logs.arr_delay)'], n_bins=50))

In [20]:
show(hplot.static_histogram_and_label(X[top_feats[0]], y, n_bins=25))

In [17]:
show(hplot.dynamic_histogram(X['flights.STD(trip_logs.arr_delay)']))

In [23]:
show(hplot.dynamic_histogram_and_label(X[top_feats[1]], y, normalized=False))

In [43]:
show(hplot.static_piechart(fm['flights.carrier']))
show(hplot.static_piechart(fm['label']))