# Overview
This notebook will have 4 components matching the modules of Henchman:
1. diagnostics
2. selection
3. learning
4. plotting

We will demonstrate the available functionality using the flight dataset from Featuretools. If you have both `henchman` and `featuretools` installed, you should be able to run the following without issue. Here we build the feature matrix and entityset we'll be using.

In [28]:
import numpy as np
import pandas as pd
import featuretools as ft

es = ft.demo.load_flight(categorical_filter={'dest_city': ['New York, NY'], 
                                             'origin_city': ['New York, NY']}, 
                         verbose=True)
def make_cutoffs(es, delay=75, advance='24h'):
    # Predict for all non-canceled, non-diverted flights
    tmp = es['trip_logs'].df[(es['trip_logs'].df['cancelled'] == False) & (es['trip_logs'].df['diverted'] == False)]

    # Set the cutoff time to be `advance` hours before the scheduled departure time (default 24)
    cutoff_times = tmp[['trip_log_id', 'scheduled_dep_time']]
    cutoff_times['scheduled_dep_time'] = cutoff_times['scheduled_dep_time'] - pd.Timedelta(advance)

    # Check if the flight will have a delay of `delay` (default 75)
    label = (es['trip_logs'].df['arr_delay'] > delay).reset_index()
    label = label.rename(columns={'index': 'trip_log_id'})
    
    # Rename the columns
    cutoff_times = cutoff_times.merge(label).rename(columns={'arr_delay': 'label', 'scheduled_dep_time': 'cutoff_time'})
    return cutoff_times

cutoff_times = make_cutoffs(es, delay=15, advance='24h')
fm, features = ft.dfs(entityset=es, 
                      target_entity='trip_logs',
                      cutoff_time=cutoff_times,
                      n_jobs=2,
                      approximate='12h',
                      verbose=True)
fm_enc, features_enc = ft.encode_features(fm, features)
fm.to_csv('fm.csv')
fm_enc.to_csv('fm_enc.csv')

100%|██████████| 100/100 [01:25<00:00,  1.17it/s]


Built 115 features
EntitySet scattered to workers in 5.786 seconds
Elapsed: 05:41 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


# Diagnostics
There are consistent questions we'd like to ask about dataframes. The `henchman.diagnostics` module provides plaintext answers to those questions. Let's take a macroscopic look at all the entities in the entityset.

In [2]:
# from henchman.diagnostics import overview
# for entity_name in es.entity_dict:
#     print('\n--------\n'+entity_name+'\n--------')
#     overview(es[entity_name].df)

There is also functionality to find common dataset problems. It is possible to modify every warning threshold, but the defaults tend to do a good job of finding red flags. Here are the warnings for the flight feature matrix.

In [3]:
# If you've already built the feature matrices, start here
import numpy as np
import pandas as pd
import featuretools as ft
fm = pd.read_csv('fm.csv', index_col='trip_log_id')
fm_enc = pd.read_csv('fm_enc.csv', index_col='trip_log_id')

In [4]:
from henchman.diagnostics import warnings
warnings(fm)


+------------+
+------------+
distance and scheduled_elapsed_time are linearly correlated: 0.972
distance and flights.distance_group are linearly correlated: 0.979
distance and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
distance and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MAX(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.distance_group are linearly correlated: 0.953
scheduled_elapsed_time and flights.MIN(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MEAN(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.MAX(trip_logs.distance) are linearly correlated: 0.972
scheduled_elapsed_time and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.999
scheduled_elapsed_time and fligh

flight_id has many unique values: 2716
flights.origin has many unique values: 83
flights.dest has many unique values: 85
flights.origin_city has many unique values: 79
flights.airports.dest_city has many unique values: 81


Lots of highly correlated features in our feature matrix! Many of them come from correlations that had already existed in `trip_logs`. It's often the case that you'd like a full profile for your dataframe. There's a function for that as well:

In [5]:
from henchman.diagnostics import profile
profile(fm_enc)


+--------------+
|  Data Shape  |
+--------------+
Number of columns: 359
Number of rows: 56353

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 56353
Average missing values by column: 1815.18

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 161.90 MB
Average memory by column: 0.45 MB

+--------------+
|  Data Types  |
+--------------+
         index
0             
bool         1
int64      286
float64     72

+------------+
+------------+
DataFrame has 52 duplicates
distance and scheduled_elapsed_time are linearly correlated: 0.972
distance and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
distance and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.972
distance and flights.MAX(trip_logs.air_time) are linearly correlated: 0.904
scheduled_elapsed_time and flights.MI

flights.airports.dest_state = GA and flights.airports.dest_city = Atlanta, GA are linearly correlated: 0.954
flights.MEAN(trip_logs.air_time) and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.903
flights.MEAN(trip_logs.air_time) and flights.MIN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.903
flights.MEAN(trip_logs.air_time) and flights.MEAN(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.903
flights.MEAN(trip_logs.air_time) and flights.MAX(trip_logs.air_time) are linearly correlated: 0.969
flights.STD(trip_logs.security_delay) and flights.MEAN(trip_logs.security_delay) are linearly correlated: 0.942
flights.STD(trip_logs.security_delay) and flights.MAX(trip_logs.security_delay) are linearly correlated: 0.976
flights.MAX(trip_logs.carrier_delay) and flights.STD(trip_logs.carrier_delay) are linearly correlated: 0.949
flights.MAX(trip_logs.distance) and flights.MAX(trip_logs.scheduled_elapsed_time) are linearly correlated: 0.971
fli

Maximum: 1, Minimum: 0, Mean: 0.02
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = F9 ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = OO ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.carrier = unknown ##
Maximum: 1, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## YEAR(scheduled_arr_time) = 2017 ##
Maximum: 1, Minimum: 1, Mean: 1.00
Quartile 3: 1.00 | Median: 1.00| Quartile 1: 1.00

## YEAR(scheduled_arr_time) = unknown ##
Maximum: 0, Minimum: 0, Mean: 0.00
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(time_index) = 6 ##
Maximum: 1, Minimum: 0, Mean: 0.17
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(time_index) = 0 ##
Maximum: 1, Minimum: 0, Mean: 0.16
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## WEEKDAY(time_index) = 3 ##
Maximum: 1, Minimum: 0, Mean: 0.15
Quartile 3: 0.00 | Med


## DAY(arr_time) = unknown ##
Maximum: 1, Minimum: 0, Mean: 0.63
Quartile 3: 1.00 | Median: 1.00| Quartile 1: 0.00

## flights.origin_city = New York, NY ##
Maximum: 1, Minimum: 0, Mean: 0.50
Quartile 3: 1.00 | Median: 1.00| Quartile 1: 0.00

## flights.origin_city = Los Angeles, CA ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.origin_city = Chicago, IL ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.origin_city = Boston, MA ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.origin_city = Atlanta, GA ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.origin_city = Miami, FL ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.origin_city = Fort Lauderdale, FL ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 

Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = NY ##
Maximum: 1, Minimum: 0, Mean: 0.51
Quartile 3: 1.00 | Median: 1.00| Quartile 1: 0.00

## flights.airports.dest_state = FL ##
Maximum: 1, Minimum: 0, Mean: 0.11
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = CA ##
Maximum: 1, Minimum: 0, Mean: 0.07
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = TX ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = GA ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = IL ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = MA ##
Maximum: 1, Minimum: 0, Mean: 0.03
Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.airports.dest_state = NC ##
Maximum: 1, Minimum: 0, Mean: 0.03


Quartile 3: 0.00 | Median: 0.00| Quartile 1: 0.00

## flights.STD(trip_logs.weather_delay) ##
Maximum: 319.996386698, Minimum: 0.0, Mean: 4.06
Quartile 3: 0.79 | Median: 0.00| Quartile 1: 0.00
Missing: 2203

## flights.MEAN(trip_logs.taxi_in) ##
Maximum: 105.0, Minimum: 0.0, Mean: 7.52
Quartile 3: 9.10 | Median: 6.93| Quartile 1: 5.25
Missing: 2203

## flights.SKEW(trip_logs.air_time) ##
Maximum: nan, Minimum: nan, Mean: nan
Quartile 3: nan | Median: nan| Quartile 1: nan
Missing: 56353

## flights.SKEW(trip_logs.late_aircraft_delay) ##
Maximum: nan, Minimum: nan, Mean: nan
Quartile 3: nan | Median: nan| Quartile 1: nan
Missing: 56353

## flights.MAX(trip_logs.taxi_out) ##
Maximum: 164.0, Minimum: 0.0, Mean: 44.25
Quartile 3: 58.00 | Median: 41.00| Quartile 1: 27.00
Missing: 2203

## flights.STD(trip_logs.scheduled_elapsed_time) ##
Maximum: 1.05801133829e+12, Minimum: 0.0, Mean: 80672912093.52
Quartile 3: 119967545638.00 | Median: 60749753451.20| Quartile 1: 16710179720.50

## flights.S

Note that the column summaries are created according to the pandas `dtype`, so that we're not trying to average categorical values.

# Selection
We would like to remove some of our highly correlated features prior to machine learning. We would have moderate success just randomly selecting a smaller feature subset. Before we start, we're going to separate off a holdout set for testing.

In [6]:
from henchman.learning import create_holdout
from henchman.selection import RandomSelect
X = fm_enc.copy().fillna(0)
y = X.pop('label')

X, X_ho, y, y_ho = create_holdout(X, y)
rand_selector = RandomSelect(n_feats=50)
rand_selector.fit(X)
X_rand = rand_selector.transform(X)
warnings(X_rand)
X_rand.head()


+------------+
+------------+
DataFrame has 342 duplicates
DAY(dep_time) = 20 and DAY(scheduled_arr_time) = 20 are linearly correlated: 0.912
DAY(scheduled_dep_time) = 17 and DAY(arr_time) = 17 are linearly correlated: 0.917
DAY(scheduled_dep_time) = 10 and DAY(dep_time) = 10 are linearly correlated: 0.988


Unnamed: 0_level_0,flights.STD(trip_logs.carrier_delay),flights.SKEW(trip_logs.scheduled_elapsed_time),DAY(dep_time) = 20,DAY(scheduled_arr_time) = 13,WEEKDAY(scheduled_arr_time) = unknown,flights.flight_num = unknown,DAY(scheduled_arr_time) = 20,flights.dest = LGA,flights.MAX(trip_logs.taxi_in),flights.distance_group = 7,...,DAY(time_index) = 4,WEEKDAY(arr_time) = unknown,DAY(dep_time) = 27,flights.distance_group = 4,DAY(dep_time) = 10,flights.SUM(trip_logs.weather_delay),DAY(arr_time) = 17,flights.DAY(first_trip_logs_time) = 11,DAY(flight_date) = 23,flights.DAY(first_trip_logs_time) = 7
trip_log_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
22874,0.0,0.0,0,0,0,1,0,0,0.0,0,...,0,0,0,0,0,0.0,0,0,0,0
1140,0.0,-0.707107,0,0,0,1,0,0,0.0,1,...,0,0,0,0,0,0.0,0,0,0,0
8941,0.0,-4.419303,0,0,0,1,0,0,0.0,0,...,0,0,0,0,0,0.0,0,0,0,0
564,0.0,-1.601282,0,0,0,1,0,0,0.0,0,...,0,0,0,1,0,0.0,0,0,0,0
8678,0.0,-4.089039,0,0,0,1,0,0,0.0,1,...,0,0,0,0,0,0.0,0,0,0,0


We can be a little bit more precise with our selection as well:

In [7]:
from henchman.selection import Dendrogram
corr_selector = Dendrogram(X, max_threshes=100)
X_careful = corr_selector.transform(X, n_feats=50)

100%|██████████| 100/100 [00:21<00:00,  4.65it/s]
100%|██████████| 100/100 [00:14<00:00,  6.83it/s]

There are 50 distinct connected components at thresh step 8 in the Dendrogram
You might also be interested in 51 components at step 7





The Dendrogram object actually has the complete connectivity of `X` according to pairwise correlation. It chooses at random one representative from each connected component of the graph, and does that for every threshold of connectivity. That is to say, it's a one time calculation to find a feature set of an arbitrary size. 

In [8]:
corr_selector.transform(X, n_feats=45).head()

There are 45 distinct connected components at thresh step 12 in the Dendrogram
You might also be interested in 46 components at step 11


Unnamed: 0_level_0,distance,MONTH(dep_time) = unknown,MONTH(flight_date) = unknown,WEEKDAY(dep_time) = unknown,flights.SKEW(trip_logs.taxi_in),flights.YEAR(first_trip_logs_time) = 2016,flights.YEAR(first_trip_logs_time) = unknown,flights.SKEW(trip_logs.carrier_delay),WEEKDAY(scheduled_dep_time) = unknown,flights.SKEW(trip_logs.air_time),...,YEAR(scheduled_dep_time) = 2017,YEAR(scheduled_dep_time) = unknown,flights.SKEW(trip_logs.security_delay),WEEKDAY(scheduled_arr_time) = unknown,flights.WEEKDAY(first_trip_logs_time) = unknown,MONTH(arr_time) = 3,MONTH(arr_time) = unknown,flights.SKEW(trip_logs.distance),flights.MIN(trip_logs.security_delay),MONTH(dep_time) = 3
trip_log_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
22874,2248.0,0,0,0,0.0,1,0,0.0,0,0.0,...,1,0,0.0,0,0,0,0,0.0,0.0,0
1140,1598.0,0,0,0,0.0,1,0,0.0,0,0.0,...,1,0,0.0,0,0,0,0,0.0,0.0,0
8941,1069.0,0,0,0,0.0,1,0,0.0,0,0.0,...,1,0,0.0,0,0,0,0,0.0,0.0,0
564,944.0,0,0,0,0.0,1,0,0.0,0,0.0,...,1,0,0.0,0,0,0,0,0.0,0.0,0
8678,1598.0,0,0,0,0.0,1,0,0.0,0,0.0,...,1,0,0.0,0,0,0,0,0.0,0.0,0


# Learning

The biggest offender of code reuse in our demos is in the machine learning section. The workflow is always the same with potentially different *models* and *metrics*. This is provided as the single function `create_model`.

In [9]:
from henchman.learning import create_model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

scores, fit_model = create_model(X, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores), np.std(scores)))

Average score of 0.55 with stdev 0.035


In [10]:
scores_rand, fit_model_rand = create_model(X_rand, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_rand), np.std(scores_rand)))

Average score of 0.54 with stdev 0.030


In [11]:
scores_careful, fit_model_careful = create_model(X_careful, y, RandomForestClassifier(), roc_auc_score, n_splits=4)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(scores_careful), np.std(scores_careful)))

Average score of 0.58 with stdev 0.018


We can also check the normalized feature importances from the `learning` module.

In [12]:
from henchman.learning import feature_importances
top_feats = feature_importances(X_careful, fit_model_careful, n_feats=5)

1: flights.SKEW(trip_logs.scheduled_elapsed_time) [1.000]
2: distance [0.623]
3: flights.SUM(trip_logs.security_delay) [0.043]
4: flights.distance_group = unknown [0.002]
5: flights.DAY(first_trip_logs_time) = 20 [0.002]
-----



In [13]:
# Test on test set
X = fm_enc.copy().fillna(0)
y = X.pop('label')
X_final = corr_selector.transform(X, n_feats=50)
real_scores, _ = create_model(X_final, y, RandomForestClassifier(), roc_auc_score)
print('Average score of {:.2f} with stdev {:.3f}'.format(np.mean(real_scores), np.std(scores_careful)))

There are 50 distinct connected components at thresh step 8 in the Dendrogram
You might also be interested in 51 components at step 7
Average score of 0.60 with stdev 0.018


# Plotting
The API for this section is more of a work in progress. Here are some example plots.

In [15]:
import henchman.plotting as hplot
hplot.show(hplot.feature_importances(X_careful, fit_model_careful, n_feats=5))

In [16]:
hplot.show(hplot.static_histogram(X['flights.STD(trip_logs.arr_delay)'], n_bins=50))

In [17]:
hplot.show(hplot.static_histogram_and_label(X[top_feats[0]], y, n_bins=25))

In [18]:
hplot.show(hplot.dynamic_histogram(X['flights.STD(trip_logs.arr_delay)']))

In [27]:
hplot.show(hplot.dynamic_histogram_and_label(X[top_feats[1]], y, normalized=False))

In [26]:
hplot.show(hplot.dynamic_piechart(fm['flights.carrier']))

In [25]:
hplot.show(hplot.static_scatterplot(X[top_feats[0]], X[top_feats[1]]))
hplot.show(hplot.static_scatterplot_and_label(X[top_feats[0]], X[top_feats[1]], y))

In [29]:
from henchman.selection import RandomSelect
sample_df = pd.read_csv('fm.csv', index_col='trip_log_id').iloc[:100,:15]

X = sample_df.copy().fillna(0)
y = fm_enc.copy().pop('label')[:100]
sel = RandomSelect(n_feats=8)
sel.fit(X)
sel.transform(X).head()

from henchman.selection import Dendrogram
from henchman.learning import inplace_encoder
X = inplace_encoder(X)

sel2 = Dendrogram(X)
sel2.transform(X, n_feats=8).head()

  y = column_or_1d(y, warn=True)
100%|██████████| 202/202 [00:00<00:00, 2190.44it/s]
100%|██████████| 202/202 [00:00<00:00, 16990.52it/s]


IndexError: list index out of range

In [76]:
def _make_scatter_source(col1, col2):
    tmp = pd.DataFrame({col1.name: col1, col2.name: col2})
    tmp['pairs'] = tmp.apply(lambda row: (row[0], row[1]), axis=1)
    source = pd.DataFrame(tmp.groupby('pairs').first())
    source['count'] = tmp.groupby('pairs').count().iloc[:, 1]
    source['x'] = source[col1.name]
    source['y'] = source[col2.name]
    return source

col1 = fm['flights.MEAN(trip_logs.arr_delay)'].fillna(0)
col2 = fm['WEEKDAY(scheduled_dep_time)'].fillna(0)

def static_scatterplot(col1, col2, hover=True):
    '''Creates a static scatterplot.
    Plots two numeric variables against one another. In this function,
    we only take one from each numeric pair and count how many times it
    appears in the data.


    Args:
        col1 (pd.Series): The column to use for the x_axis.
        col2 (pd.Series): The column to use for the y_axis.
        hover (bool): Whether or not to include the hover tooltip. Default is True.

    Example:
        If the dataframe ``X`` has columns named ``amount`` and ``num_purchases``:

        >>> import henchman.plotting as hplot
        >>> plot = hplot.static_scatterplot(X['num_purchases'], X['amount'])
        >>> hplot.show(plot)
    '''
    source = ColumnDataSource(_make_scatter_source(col1, col2))
    tools = ['box_zoom', 'save', 'reset']
    if hover:
        hover = HoverTool(tooltips=[
            (col1.name, '@x'),
            (col2.name, '@y'),
            ('count', '@count'),
        ])
        tools += [hover]

    p = figure(tools=tools)
    p.scatter(x='x',
              y='y',
              source=source,
              alpha=.5)
    return p
show(static_scatterplot(col1, col2))

In [70]:
pd.DataFrame({col1.name: col1, col2.name: col2})

Unnamed: 0_level_0,WEEKDAY(scheduled_dep_time),flights.MEAN(trip_logs.arr_delay)
trip_log_id,Unnamed: 1_level_1,Unnamed: 2_level_1
22874,6,0.000000
1140,6,
8941,6,0.000000
564,6,
8678,6,0.000000
15702,6,
16000,6,0.000000
25397,6,
23757,6,
14879,6,0.000000
