<a href="https://colab.research.google.com/github/allen44/riiid-test-answer-prediction/blob/main/feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%cd /content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction/
%pwd

/content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction


'/content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction'

## Preprocessed data

In [2]:
import pickle
from pathlib import Path

# Define data paths
df_train_preprocessed_path = Path('/content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction/data/intermediate/df_train_preprocessed.pkl.gzip')
df_lectures_preprocessed_path = Path('/content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction/data/intermediate/df_lectures_preprocessed.pkl.gzip')
df_questions_preprocessed_path = Path('/content/drive/MyDrive/Colab Notebooks/riiid-test-answer-prediction/data/intermediate/df_questions_preprocessed.pkl.gzip')

Using the insights gained from the EDA, we can load the preprocessed data, saved as pickle files, ready for feature engineering.

In [3]:
with open(df_train_preprocessed_path, 'rb') as f:
  df_train = pickle.load(f)

with open(df_lectures_preprocessed_path, 'rb') as f:
  df_lectures = pickle.load(f)

with open(df_questions_preprocessed_path, 'rb') as f:
  df_questions = pickle.load(f)

assert df_train['content_id'].dtype == df_lectures['lecture_id'].dtype
assert df_questions['question_id'].dtype == df_lectures['lecture_id'].dtype

df_train.shape, df_lectures.shape, df_questions.shape 

((101230331, 9), (418, 4), (13522, 192))

In [5]:
# Use a subset of df_train
df_train = df_train[df_train.index % 1000 == 0]
df_train.head()

row_id,timestamp,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation
0,2018-01-02 03:03:47.951,115,5692,0,1,3,1,,True
1000,2018-06-25 19:32:50.920,13134,9155,0,299,3,1,27666.0,True
2000,2018-01-20 21:54:35.915,24418,476,0,112,0,1,15000.0,True
3000,2018-02-26 01:28:59.702,24418,5168,0,1000,1,1,6000.0,True
4000,2018-03-08 03:28:44.125,24418,3746,0,1747,3,0,27000.0,True


## Manually add features

## Install and import featuretools

In [10]:
# % pip install featuretools dask distributed tornado
import featuretools as ft

# Prepare data

First, we create an entity set and add each dataframe in our dataset to it as an entity.

In [11]:
# Define entity set
es = ft.EntitySet()

# Add df_train to entity set
es.entity_from_dataframe(entity_id='train', 
                         dataframe=df_train,
                         index='row_id',
                         time_index='timestamp')

Entityset: None
  Entities:
    train [Rows: 101231, Columns: 10]
  Relationships:
    No relationships

In [12]:
# Add df_lectures to entity set
es.entity_from_dataframe(entity_id='lectures', dataframe=df_lectures, index='lecture_id')

Entityset: None
  Entities:
    train [Rows: 101231, Columns: 10]
    lectures [Rows: 418, Columns: 4]
  Relationships:
    No relationships

In [13]:
# Add df_questions to entity set
es.entity_from_dataframe(entity_id='questions', dataframe=df_questions, index='question_id')

Entityset: None
  Entities:
    train [Rows: 101231, Columns: 10]
    lectures [Rows: 418, Columns: 4]
    questions [Rows: 13522, Columns: 192]
  Relationships:
    No relationships

In [14]:
es['train'].variables

[<Variable: row_id (dtype = index)>,
 <Variable: timestamp (dtype: datetime_time_index, format: None)>,
 <Variable: user_id (dtype = categorical)>,
 <Variable: content_id (dtype = categorical)>,
 <Variable: content_type_id (dtype = categorical)>,
 <Variable: task_container_id (dtype = categorical)>,
 <Variable: user_answer (dtype = categorical)>,
 <Variable: answered_correctly (dtype = categorical)>,
 <Variable: prior_question_elapsed_time (dtype = numeric)>,
 <Variable: prior_question_had_explanation (dtype = boolean)>]

In [15]:
es['lectures'].variables

[<Variable: lecture_id (dtype = index)>,
 <Variable: tag (dtype = categorical)>,
 <Variable: part (dtype = categorical)>,
 <Variable: type_of (dtype = categorical)>]

In [16]:
es['questions'].variables

[<Variable: question_id (dtype = index)>,
 <Variable: bundle_id (dtype = categorical)>,
 <Variable: correct_answer (dtype = categorical)>,
 <Variable: part (dtype = categorical)>,
 <Variable: 0 (dtype = boolean)>,
 <Variable: 1 (dtype = boolean)>,
 <Variable: 10 (dtype = boolean)>,
 <Variable: 100 (dtype = boolean)>,
 <Variable: 101 (dtype = boolean)>,
 <Variable: 102 (dtype = boolean)>,
 <Variable: 103 (dtype = boolean)>,
 <Variable: 104 (dtype = boolean)>,
 <Variable: 105 (dtype = boolean)>,
 <Variable: 106 (dtype = boolean)>,
 <Variable: 107 (dtype = boolean)>,
 <Variable: 108 (dtype = boolean)>,
 <Variable: 109 (dtype = boolean)>,
 <Variable: 11 (dtype = boolean)>,
 <Variable: 110 (dtype = boolean)>,
 <Variable: 111 (dtype = boolean)>,
 <Variable: 112 (dtype = boolean)>,
 <Variable: 113 (dtype = boolean)>,
 <Variable: 114 (dtype = boolean)>,
 <Variable: 115 (dtype = boolean)>,
 <Variable: 116 (dtype = boolean)>,
 <Variable: 117 (dtype = boolean)>,
 <Variable: 118 (dtype = boolean)>

In [17]:
import gc

del df_train
del df_lectures
del df_questions

gc.collect()

153

In [18]:
r_lectures_train = ft.Relationship(es["lectures"]["lecture_id"],
                                   es["train"]["content_id"])

r_questions_train = ft.Relationship(es["questions"]["question_id"],
                                    es["train"]["content_id"])

es = es.add_relationship(r_lectures_train)
es = es.add_relationship(r_questions_train)
es

Entityset: None
  Entities:
    train [Rows: 101231, Columns: 10]
    lectures [Rows: 418, Columns: 4]
    questions [Rows: 13522, Columns: 192]
  Relationships:
    train.content_id -> lectures.lecture_id
    train.content_id -> questions.question_id

In [19]:
es['train']['answered_correctly'].interesting_values = [0, 1]

In [20]:
feature_defs = ft.dfs(entityset=es, target_entity='train', 
                      where_primitives = ['sum', 'mean'],
                      max_depth=2, features_only=True)

print(f'This will generate {len(feature_defs)} features.\n')

This will generate 242 features.



  where_primitives: ['mean', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [21]:
import random; random.seed(42)

random.sample(feature_defs, 10)

[<Feature: questions.60>,
 <Feature: questions.107>,
 <Feature: prior_question_elapsed_time>,
 <Feature: questions.84>,
 <Feature: questions.145>,
 <Feature: questions.138>,
 <Feature: questions.133>,
 <Feature: questions.113>,
 <Feature: questions.83>,
 <Feature: questions.105>]

### Aggregation Primitives

In [None]:
all_p = ft.list_primitives()
trans_p = all_p.loc[all_p['type'] == 'transform'].copy()
agg_p = all_p.loc[all_p['type'] == 'aggregation'].copy()

pd.options.display.max_colwidth = 100
# agg_p

In [23]:
# Specify aggregation primitives
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count']

## Transform Primitives

In [None]:
# trans_p

In [25]:
# Specify transformation primitives
trans_primitives = ['cum_sum', 'diff', 'time_since_previous']
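
To make the difference between the two lists concrete, here is a rough pandas analogue on a tiny table. The values are invented for illustration and are not taken from `df_train`. An aggregation primitive collapses the child rows of each parent into a single value, while a transform primitive returns one value per input row.

```python
import pandas as pd

# Toy interaction log; the values are invented for illustration only.
toy = pd.DataFrame({
    "content_id": [10, 10, 20, 20, 20],
    "prior_question_elapsed_time": [1000.0, 3000.0, 2000.0, 4000.0, 6000.0],
})

# Aggregation-style: many child rows -> one value per parent (content_id).
agg_view = toy.groupby("content_id")["prior_question_elapsed_time"].agg(["mean", "count"])

# Transform-style: one output value per input row.
toy["cum_sum"] = toy["prior_question_elapsed_time"].cumsum()
toy["diff"] = toy["prior_question_elapsed_time"].diff()

print(agg_view)
print(toy)
```

This is only an analogy: Featuretools decides which primitive applies to which column from the variable types and the relationships in the entity set.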

### Where Primitives

These primitives are applied to the `interesting_values` to build conditional features. 

In [26]:
# Specify where primitives
where_primitives = ['sum', 'mean', 'percent_true', 'all', 'any']
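
As a rough illustration of what a conditional feature means, the sketch below computes the pandas equivalent of a mean that is restricted to rows where `answered_correctly` equals each interesting value. The toy values and the output column names are invented for this example; they are not the names Featuretools would generate.

```python
import pandas as pd

# Toy interaction log; the values are invented for illustration only.
toy = pd.DataFrame({
    "content_id": [10, 10, 10, 20, 20],
    "answered_correctly": [1, 0, 1, 0, 1],
    "prior_question_elapsed_time": [1000.0, 9000.0, 3000.0, 5000.0, 7000.0],
})

interesting_values = [0, 1]

# One conditional mean per interesting value, grouped by the parent key.
conditional = {}
for value in interesting_values:
    subset = toy[toy["answered_correctly"] == value]
    conditional[f"mean_elapsed_where_correct_eq_{value}"] = (
        subset.groupby("content_id")["prior_question_elapsed_time"].mean()
    )
conditional = pd.DataFrame(conditional)

print(conditional)
```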

## Custom Primitives

For this problem, I wrote a custom primitive that calculates the sum of a numeric column over the calendar month prior to the cutoff time.

The second custom primitive finds, for each row, the time elapsed since the most recent `True` value.

In [27]:
def total_previous_month(numeric, datetime, time):
    """Return total of `numeric` column in the month prior to `time`."""
    df = pd.DataFrame({'value': numeric, 'date': datetime})
    previous_month = time.month - 1
    year = time.year
   
    # Handle January
    if previous_month == 0:
        previous_month = 12
        year = time.year - 1
        
    # Filter data and sum up total
    df = df[(df['date'].dt.month == previous_month) & (df['date'].dt.year == year)]
    total = df['value'].sum()
    
    return total


def time_since_true(boolean, datetime):
    """Calculate time since previous true value"""
    
    if np.any(np.array(list(boolean)) == 1):
        # Create dataframe sorted from newest to oldest
        df = pd.DataFrame({'value': boolean, 'date': datetime}).\
                sort_values('date', ascending = False).reset_index()

        # Holds the true date handled in the previous iteration (a more recent one)
        older_date = None

        # Iterate through each true date, most recent first
        for date in df.loc[df['value'] == 1, 'date']:

            # Most recent true value: nothing has been processed yet
            if older_date is None:
                # Subset to times on or after true
                times_after_idx = df.loc[df['date'] >= date].index

            else:
                # Subset to times on or after true but before previous true
                times_after_idx = df.loc[(df['date'] >= date) & (df['date'] < older_date)].index
            older_date = date
            # Calculate time since previous true
            df.loc[times_after_idx, 'time_since_previous'] = (df.loc[times_after_idx, 'date'] - date).dt.total_seconds()

        return list(df['time_since_previous'])[::-1]
    
    # Handle case with no true values
    else:
        return [np.nan for _ in range(len(boolean))]
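
The month arithmetic in `total_previous_month` can also be written with pandas periods, which roll over from January to the previous December without a special case. The sketch below is an alternative formulation on invented values, not the function registered in this notebook, and `total_previous_month_period` is a hypothetical name.

```python
import pandas as pd

def total_previous_month_period(numeric, datetime, time):
    """Sum `numeric` over rows whose date falls in the calendar month before `time`."""
    dates = pd.to_datetime(pd.Series(list(datetime)))
    values = pd.Series(list(numeric), dtype="float64")
    # Subtracting 1 from a monthly Period handles the year boundary automatically.
    previous_month = pd.Timestamp(time).to_period("M") - 1
    mask = dates.dt.to_period("M") == previous_month
    return values[mask].sum()

# Invented example data spanning a year boundary.
dates = ["2017-11-30", "2017-12-05", "2017-12-20", "2018-01-03"]
values = [8.0, 1.0, 2.0, 4.0]

january_total = total_previous_month_period(values, dates, "2018-01-15")
february_total = total_previous_month_period(values, dates, "2018-02-01")
print(january_total, february_total)
```

With a cutoff in January 2018 only the two December 2017 rows are summed, which is the case the explicit `if previous_month == 0` branch covers in the original function.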

### Custom Primitive Implementation

Making a custom primitive is simple: first we define a function (`total_previous_month`), and then we call `make_agg_primitive` with the `input_types`, a `return_type`, and `uses_calc_time`, which indicates whether the primitive requires the `cutoff_time`.

This primitive is an aggregation primitive because it takes in multiple numbers (the values recorded in the previous month) and returns a single number (their total).

In [28]:
from featuretools.primitives import make_agg_primitive

# Takes in a number and outputs a number
total_previous = make_agg_primitive(total_previous_month, 
                                    input_types = [ft.variable_types.Numeric,
                                                   ft.variable_types.Datetime],
                                    return_type = ft.variable_types.Numeric, 
                                    uses_calc_time = True)

In [29]:
from featuretools.primitives import make_trans_primitive

# Specify the inputs and return
time_since = make_trans_primitive(time_since_true, 
                                  input_types = [ft.variable_types.Boolean, 
                                                  ft.variable_types.Datetime],
                                  return_type = ft.variable_types.Numeric)

Now we just have to pass these in alongside the built-in primitives for Featuretools to use them in calculations.



Let's add the two custom primitives to the respective lists. In the final version of feature engineering, I did not use the `time_since` primitive. I ran into problems with the implementation but would encourage anyone to try and fix it or build their own custom primitive[s].

In [30]:
agg_primitives.append(total_previous)
trans_primitives.append(time_since)
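
For anyone who wants to revisit `time_since_true`, one possible starting point is a vectorized version built on forward-filling the date of the most recent `True` row. This is a sketch under assumptions, not a drop-in fix that has been run through Featuretools: `time_since_last_true` is a hypothetical name, the sample values are invented, and unlike the loop-based function it returns results in the original input order rather than in date order.

```python
import pandas as pd

def time_since_last_true(boolean, datetime):
    """Seconds since the most recent True value at or before each row (NaN if none)."""
    frame = pd.DataFrame({
        "value": pd.Series(list(boolean)) == 1,
        "date": pd.to_datetime(pd.Series(list(datetime))),
    })
    # Work in chronological order, then restore the input order at the end.
    ordered = frame.sort_values("date", kind="stable")
    # Keep the date only on True rows and carry it forward to later rows.
    last_true = ordered["date"].where(ordered["value"]).ffill()
    elapsed = (ordered["date"] - last_true).dt.total_seconds()
    return elapsed.reindex(frame.index).to_numpy()

# Invented example: True values at 10 seconds and at 60 seconds.
flags = [False, True, False, True]
stamps = ["2018-01-01 00:00:00", "2018-01-01 00:00:10",
          "2018-01-01 00:00:30", "2018-01-01 00:01:00"]
result = time_since_last_true(flags, stamps)
print(result)
```

Rows that precede the first `True` value come out as NaN, which matches the behaviour of the loop-based version for those rows.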

## Deep Feature Synthesis with Specified Primitives

We'll run Deep Feature Synthesis again to make the feature definitions, this time using the selected primitives and the custom primitives.

In [31]:
feature_defs = ft.dfs(entityset=es, target_entity='train', 
                      # cutoff_time = cutoff_times, 
                      agg_primitives = agg_primitives,
                      trans_primitives = trans_primitives,
                      where_primitives = where_primitives,
                      chunk_size = 100, #len(cutoff_times), 
                      # cutoff_time_in_index = True,
                      max_depth = 2, 
                      features_only = True)

  where_primitives: ['all', 'any', 'mean', 'percent_true', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


In [32]:
print(f'This will generate {len(feature_defs)} features.')

This will generate 452 features.


In [33]:
random.sample(feature_defs, 15)

[<Feature: TIME_SINCE_TRUE(questions.172, timestamp)>,
 <Feature: TIME_SINCE_TRUE(questions.33, timestamp)>,
 <Feature: TIME_SINCE_TRUE(questions.111, timestamp)>,
 <Feature: questions.121>,
 <Feature: TIME_SINCE_TRUE(questions.132, timestamp)>,
 <Feature: lectures.LAST(train.user_id)>,
 <Feature: questions.correct_answer>,
 <Feature: questions.bundle_id>,
 <Feature: questions.124>,
 <Feature: questions.182>,
 <Feature: questions.20>,
 <Feature: questions.NUM_UNIQUE(train.user_id)>,
 <Feature: TIME_SINCE_TRUE(questions.138, timestamp)>,
 <Feature: lectures.part>,
 <Feature: TIME_SINCE_TRUE(questions.119, timestamp)>]

# Run Deep Feature Synthesis

Once we're happy with the features that will be generated, we can run deep feature synthesis to make the actual features. We need to change `features_only` to `False` and then we're good to go.

In [34]:
from timeit import default_timer as timer

start = timer()
feature_matrix, feature_defs = ft.dfs(entityset=es, 
                                      target_entity='train', 
                                      # cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      verbose = 1, 
                                      chunk_size = 100,  
                                      # n_jobs = -1,
                                      # cutoff_time_in_index = True
                                      )
end = timer()
print(f'{round(end - start)} seconds elapsed.')

Built 452 features
Elapsed: 00:00 | Progress:   0%|          

  where_primitives: ['all', 'any', 'mean', 'percent_true', 'sum']
This may be caused by a using a value of max_depth that is too small, not setting interesting values, or it may indicate no compatible variable types for the primitive were found in the data.


Elapsed: 23:19 | Progress: 100%|██████████
1400 seconds elapsed.


The `chunk_size` parameter may need to be adjusted to optimize the calculation, so it is worth experimenting with different values. In my experience a larger value generally makes the calculation run faster, although the best setting depends on the machine in use and the number of unique cutoff times.

In [55]:
feature_matrix

row_id,user_id,content_id,content_type_id,task_container_id,user_answer,answered_correctly,prior_question_elapsed_time,prior_question_had_explanation,CUM_SUM(prior_question_elapsed_time),DIFF(prior_question_elapsed_time),TIME_SINCE_PREVIOUS(timestamp),"TIME_SINCE_TRUE(prior_question_had_explanation, timestamp)",lectures.tag,lectures.part,lectures.type_of,questions.bundle_id,questions.correct_answer,questions.part,questions.0,questions.1,questions.10,questions.100,questions.101,questions.102,questions.103,questions.104,questions.105,questions.106,questions.107,questions.108,questions.109,questions.11,questions.110,questions.111,questions.112,questions.113,questions.114,questions.115,questions.116,questions.117,...,"TIME_SINCE_TRUE(questions.63, timestamp)","TIME_SINCE_TRUE(questions.64, timestamp)","TIME_SINCE_TRUE(questions.65, timestamp)","TIME_SINCE_TRUE(questions.66, timestamp)","TIME_SINCE_TRUE(questions.67, timestamp)","TIME_SINCE_TRUE(questions.68, timestamp)","TIME_SINCE_TRUE(questions.69, timestamp)","TIME_SINCE_TRUE(questions.7, timestamp)","TIME_SINCE_TRUE(questions.70, timestamp)","TIME_SINCE_TRUE(questions.71, timestamp)","TIME_SINCE_TRUE(questions.72, timestamp)","TIME_SINCE_TRUE(questions.73, timestamp)","TIME_SINCE_TRUE(questions.74, timestamp)","TIME_SINCE_TRUE(questions.75, timestamp)","TIME_SINCE_TRUE(questions.76, timestamp)","TIME_SINCE_TRUE(questions.77, timestamp)","TIME_SINCE_TRUE(questions.78, timestamp)","TIME_SINCE_TRUE(questions.79, timestamp)","TIME_SINCE_TRUE(questions.8, timestamp)","TIME_SINCE_TRUE(questions.80, timestamp)","TIME_SINCE_TRUE(questions.81, timestamp)","TIME_SINCE_TRUE(questions.82, timestamp)","TIME_SINCE_TRUE(questions.83, timestamp)","TIME_SINCE_TRUE(questions.84, timestamp)","TIME_SINCE_TRUE(questions.85, timestamp)","TIME_SINCE_TRUE(questions.86, timestamp)","TIME_SINCE_TRUE(questions.87, timestamp)","TIME_SINCE_TRUE(questions.88, timestamp)","TIME_SINCE_TRUE(questions.89, timestamp)","TIME_SINCE_TRUE(questions.9, timestamp)","TIME_SINCE_TRUE(questions.90, timestamp)","TIME_SINCE_TRUE(questions.91, timestamp)","TIME_SINCE_TRUE(questions.92, timestamp)","TIME_SINCE_TRUE(questions.93, timestamp)","TIME_SINCE_TRUE(questions.94, timestamp)","TIME_SINCE_TRUE(questions.95, timestamp)","TIME_SINCE_TRUE(questions.96, timestamp)","TIME_SINCE_TRUE(questions.97, timestamp)","TIME_SINCE_TRUE(questions.98, timestamp)","TIME_SINCE_TRUE(questions.99, timestamp)"
0,115,5692,0,1,3,1,,True,,,,0.0,,,,5692,3,5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,,,,0.000000e+00,,,0.000000e+00,,,,0.000,0.000000e+00,,0.000000e+00,,,0.000000e+00,0.000,0.000,0.000000e+00,0.000,,,,,,,0.000000e+00,0.000,,0.000000e+00,0.000,0.000,0.000000e+00,,0.000000e+00,0.000,,,
164,3505219,7900,0,0,0,1,,True,,,0.000,0.0,,,,7900,0,1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,,,,0.000000e+00,,,0.000000e+00,,,,0.000,0.000000e+00,,0.000000e+00,,,0.000000e+00,0.000,0.000,0.000000e+00,0.000,,,,,,,0.000000e+00,0.000,,0.000000e+00,0.000,0.000,0.000000e+00,,0.000000e+00,0.000,,,
451,9063293,5529,0,0,2,1,,True,,,0.000,0.0,,,,5529,2,5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,,,,0.000000e+00,,,0.000000e+00,,,,0.000,0.000000e+00,,0.000000e+00,,,0.000000e+00,0.000,0.000,0.000000e+00,0.000,,,,,,,0.000000e+00,0.000,,0.000000e+00,0.000,0.000,0.000000e+00,,0.000000e+00,0.000,,,
526,10501741,7900,0,0,0,1,,True,,,0.000,0.0,,,,7900,0,1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,,,,0.000000e+00,,,0.000000e+00,,,,0.000,0.000000e+00,,0.000000e+00,,,0.000000e+00,0.000,0.000,0.000000e+00,0.000,,,,,,,0.000000e+00,0.000,,0.000000e+00,0.000,0.000,0.000000e+00,,0.000000e+00,0.000,,,
811,16467140,7900,0,0,2,0,,True,,,0.000,0.0,,,,7900,0,1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,,,,0.000000e+00,,,0.000000e+00,,,,0.000,0.000000e+00,,0.000000e+00,,,0.000000e+00,0.000,0.000,0.000000e+00,0.000,,,,,,,0.000000e+00,0.000,,0.000000e+00,0.000,0.000,0.000000e+00,,0.000000e+00,0.000,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,697068556,6821,0,1879,0,1,37750.0,True,2.511123e+09,-65850.0,130740.453,0.0,,,,6821,0,6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,2.264906e+07,6.042727e+06,1.692607e+07,1.486060e+07,1.437942e+07,1.132949e+07,8.320505e+06,1.046060e+07,2491099.950,705002.340,2283997.436,6.203755e+06,4587680.864,1.831762e+07,1.196415e+07,4636684.671,1.709451e+07,3176999.892,742690.095,1.350814e+07,185514.294,5892677.262,1.105405e+07,3260647.318,2.478788e+07,5.527849e+07,6.639053e+06,9.752815e+06,4993450.837,6270002.866,1.673886e+07,2791677.268,185514.294,1.198574e+07,1.079791e+07,1.456008e+07,636452.912,4636684.671,1.412204e+07,331073.124
14376,311071830,10556,0,934,1,1,24000.0,True,2.511147e+09,-13750.0,190173.210,0.0,,,,10556,1,1,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,2.283924e+07,6.232901e+06,1.711624e+07,1.505078e+07,1.456959e+07,1.151966e+07,8.510678e+06,1.065077e+07,2681273.160,895175.550,2474170.646,6.393928e+06,4777854.074,1.850779e+07,1.215432e+07,4826857.881,1.728469e+07,3367173.102,932863.305,1.369831e+07,375687.504,6082850.472,1.124422e+07,3450820.528,2.497806e+07,5.546866e+07,6.829227e+06,9.942988e+06,5183624.047,0.000,1.692903e+07,2981850.478,0.000,1.217592e+07,1.098809e+07,1.475026e+07,826626.122,4826857.881,1.431221e+07,521246.334
44262,937249824,2726,0,1289,2,1,6666.0,True,2.511153e+09,-17334.0,752570.045,0.0,,,,2725,2,4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,2.359181e+07,6.985471e+06,1.786881e+07,1.580335e+07,1.532216e+07,1.227223e+07,9.263248e+06,1.140334e+07,3433843.205,1647745.595,3226740.691,7.146498e+06,5530424.119,1.926036e+07,1.290689e+07,5579427.926,1.803726e+07,4119743.147,1685433.350,1.445088e+07,0.000,0.000,1.199679e+07,4203390.573,2.573063e+07,5.622123e+07,7.581797e+06,1.069556e+07,5936194.092,752570.045,1.768160e+07,3734420.523,752570.045,1.292849e+07,1.174066e+07,1.550283e+07,1579196.167,5579427.926,1.506478e+07,1273816.379
25738,550249756,11961,0,579,1,0,17000.0,True,2.511170e+09,10334.0,333641.656,0.0,,,,11961,3,2,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,...,2.392545e+07,7.319112e+06,1.820246e+07,1.613699e+07,1.565580e+07,1.260587e+07,9.596890e+06,1.173698e+07,3767484.861,1981387.251,3560382.347,7.480140e+06,5864065.775,1.959401e+07,1.324053e+07,5913069.582,1.837090e+07,4453384.803,2019075.006,1.478452e+07,0.000,333641.656,1.233043e+07,4537032.229,2.606427e+07,5.655487e+07,7.915438e+06,1.102920e+07,6269835.748,1086211.701,1.801524e+07,4068062.179,0.000,1.326213e+07,1.207430e+07,1.583647e+07,1912837.823,5913069.582,1.539842e+07,1607458.035


We can save these feature definitions to a file with `ft.save_features` and reload them later with `ft.load_features`, which allows us to make exactly the same features for another entityset of the same format. This is useful when we have multiple partitions and want to make the same features for each. Instead of remaking the feature definitions, we pass the loaded definitions to a call to `calculate_feature_matrix`.

In [None]:
feature_defs = ft.load_features('./data/churn/features.txt')
print(f'There are {len(feature_defs)} features.')

# Conclusions

Automated feature engineering is a significant improvement over manual feature engineering in terms of both time and modeling performance. In this notebook, we implemented an automated feature engineering workflow with Featuretools for the Riiid answer prediction data. Given the `train`, `lectures`, and `questions` tables, we can now calculate a feature matrix with several hundred features for predicting whether a question is answered correctly. Because the `cutoff_time` argument was left commented out in the `ft.dfs` calls above, the aggregation features are computed over all rows rather than only the data available before each row's timestamp.

Along the way, we implemented a number of Featuretools concepts:

1. An entityset and entities
2. Relationships between entities
3. Cutoff times
4. Feature primitives
5. Custom primitives
6. Deep feature synthesis

These concepts will serve us well in future machine learning projects that we can tackle with automated feature engineering.

## Next Steps

Although we often hear that "data is the fuel of machine learning", data is not exactly a fuel but more like crude oil. _Features_ are the refined product that we feed into a machine learning model to make accurate predictions. After performing prediction engineering and automated feature engineering, the next step is to use these features in a predictive model to estimate the _label_ using the _features_. 

Generating hundreds of features automatically is impressive, but if those features do not allow a model to learn our prediction problem then they are not much help! The next step is to use our features and labeled historical examples to train a machine learning model to predict whether a question is answered correctly. We'll make sure to test our model using a hold-out testing set to estimate performance on new data. Then, after validating our model, we can use it on new examples by passing the data through the feature engineering process.


If you want to see how to parallelize feature engineering in Spark, see the `Feature Engineering on Spark` notebook. Otherwise, the next notebook is `Modeling`, where we develop a machine learning model to predict `answered_correctly` using the historical labeled examples and the automatically engineered features.