In [30]:
import numpy as np
import pandas as pd
import featuretools as ft
import utilities
ft.__version__

'0.1.17'

# Phase 1: Creating a useful dataset structure
Since we have so many categorical columns, it's worth taking a moment to think about how this data is structured. At the base level we have `transactions`, every event that is recorded in the data. The columns of those transactions have variables that can be grouped together. As an example, there are only 78 distinct `problem_steps` for the 6778 transactions we have. Associated to each such problem step, we have a variety of knowledge components (KC) and custom fields (CF) associated to that step.

We create an entityset structure using the `learnlab_to_entityset` function in [utilities](utilities.py).

In [31]:
data = pd.read_csv('data/data.txt', '\t')
es = utilities.learnlab_to_entityset(data)
es

Entityset: Dataset
  Entities:
    transactions (shape = [6778, 26])
    problem_steps (shape = [78, 49])
    problems (shape = [20, 1])
    sessions (shape = [59, 3])
    students (shape = [59, 2])
    ...And 3 more
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Session Id -> sessions.Session Id
    sessions.Anon Student Id -> students.Anon Student Id
    transactions.Class -> classes.Class
    ...and 2 more

# Phase 2: Building Features
We create a custom primitive: `ProbFail`, which calculates the likelihood that a boolean variable is false. It's worth noting that the opposite of this primitive is built in to Featuretools: `PercentTrue`. One of the many advantages in defining custom primitives is that we can define the name and input types as we would like. If you're interested in creating your own custom primitives for this dataset, copy and modify this code as necessary.

In [32]:
from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes

def probability(boolean):
    numtrue = len([x for x in boolean if x==1])
    return 1 - numtrue/len(boolean)

ProbFail = make_agg_primitive(probability,
                              input_types=[vtypes.Boolean],
                              name='failure_rate',
                              description='Calculates likelihood a boolean is false over a region',
                              return_type=vtypes.Numeric)


Here we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: we can automatically calculate features as if at a given point in time using Deep Feature Synthesis. Furthermore, we can guarentee that future values for `Outcome` won't be used for any calculations because we set the time index of that value to be after the cutoff time.

Lastly, we can automatically apply `Prob` while grouping by any of the entities we created before.

In [33]:
# Automatically generate features on collected data
from featuretools.primitives import Sum, Mean, Median, Count, Hour 
cutoff_times = es['transactions'].df[['Transaction Id', 'End Time', 'Outcome']][500:]
fm, features = ft.dfs(entityset=es, 
                      target_entity='transactions',
                      agg_primitives=[ProbFail],
                      trans_primitives=[],
                      seed_features=[],
                      max_depth=3,
                      approximate='1m',
                      cutoff_time=cutoff_times,
                      verbose=True)
print('Created {} features'.format(len(features)))

Building features: 147it [00:00, 9258.67it/s]
Progress: 100%|██████████| 118/118 [01:33<00:00,  1.26cutoff time/s]
Created 74 features


In [34]:
features[-8:]

[<Feature: problem_steps.Problem Name>,
 <Feature: sessions.Anon Student Id>,
 <Feature: classes.School>,
 <Feature: problem_steps.FAILURE_RATE(transactions.Outcome)>,
 <Feature: sessions.FAILURE_RATE(transactions.Outcome)>,
 <Feature: attempts.FAILURE_RATE(transactions.Outcome)>,
 <Feature: problem_steps.problems.FAILURE_RATE(transactions.Outcome)>,
 <Feature: sessions.students.FAILURE_RATE(transactions.Outcome)>]

Let's parse a couple of features. The feature `problem_steps.FAILURE_RATE(transactions.Outcome)` is exactly the percent of students who did not succeed on a given `problem_step` as calculated at a given time. Similarly, the `attempts.FAILURE_RATE(transactions.Outcome)` is the failure rate as grouped by the problem attempt (i.e. more students miss the questions on an earlier attempts than later ones).

# Phase 3: Making predictions
We were able to add our label `Outcome` to our feature matrix using cutoff times. We now pop it out of the feature matrix and make predictions over time.

In [35]:
from featuretools.selection import remove_low_information_features
fm_enc, _ = ft.encode_features(fm, features)
fm_enc = fm_enc.fillna(0)
fm_enc = remove_low_information_features(fm_enc)
labels = fm.pop('Outcome')

In [36]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score


splitter = TimeSeriesSplit(n_splits=5, max_train_size=None)
i=0
for train_index, test_index in splitter.split(fm):
    clf = RandomForestClassifier()
    X_train, X_test = fm_enc.iloc[train_index], fm_enc.iloc[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    score = round(roc_auc_score(preds, y_test), 2)
    print("AUC score on time split {} is {}".format(i, score))
    feature_imps = [(imp, fm_enc.columns[i]) for i, imp in enumerate(clf.feature_importances_)]
    feature_imps.sort()
    feature_imps.reverse()
    print("Top 5 features: {}".format([f[1] for f in feature_imps[0:5]]))
    print("-----\n")
    
    i += 1

AUC score on time split 0 is 0.61
Top 5 features: ['sessions.FAILURE_RATE(transactions.Outcome)', 'sessions.students.FAILURE_RATE(transactions.Outcome)', 'attempts.FAILURE_RATE(transactions.Outcome)', 'problem_steps.FAILURE_RATE(transactions.Outcome)', 'problem_steps.problems.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 1 is 0.6
Top 5 features: ['sessions.FAILURE_RATE(transactions.Outcome)', 'sessions.students.FAILURE_RATE(transactions.Outcome)', 'attempts.FAILURE_RATE(transactions.Outcome)', 'problem_steps.FAILURE_RATE(transactions.Outcome)', 'problem_steps.problems.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 2 is 0.59
Top 5 features: ['sessions.students.FAILURE_RATE(transactions.Outcome)', 'sessions.FAILURE_RATE(transactions.Outcome)', 'attempts.FAILURE_RATE(transactions.Outcome)', 'problem_steps.FAILURE_RATE(transactions.Outcome)', 'problem_steps.problems.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 3 is 0.62
Top 