# Making predictions from a LearnLab dataset
<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

In this tutorial, we show how to use [Featuretools](www.featuretools.com) on the standard LearnLab dataset structure. The workflow shown here can be used to quickly **organize** and **make predictions** about any LearnLab dataset.

*If you're running this notebook yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You will only need the `.txt` file. The infrastructure in this notebook will work with **any** learnlab dataset, but you will need to change the filename in the following cell.*

## Highlights
* Show how to import a LearnLab dataset into featuretools
* Show how to make custom primitives for stacking
* Show efficacy of automatic feature generation with these datasets

In [27]:
import numpy as np
import pandas as pd
import featuretools as ft
import utilities
print('Using Featuretools version {}'.format(ft.__version__))
data = pd.read_csv('data/ds2174_tx_All_Data_3991_2017_1128_123859.txt', '\t')


Using Featuretools version 0.1.17


# Phase 1: Creating a useful dataset structure
At the beginning of any project, it is worthwhile to take a moment to think about how your dataset is structured.

In these datasets the unique events come from `transactions`: places where a student interacts with a system. However, the columns of those transactions have variables that can be grouped together. 

For instance, there are only 78 distinct `problem_steps` for the 6778 transactions we in the geometry dataset. Associated to each problem step, we have a variety of knowledge components (KC) and custom fields (CF).

We create an entityset structure using the `learnlab_to_entityset` function in [utilities](utilities.py). If you're interested in how `learnlab_to_entityset` is structured, there's an associated notebook [entityset_function](entityset_function.ipynb) explains choices made in more detail.

In [16]:
# Note that each branch is a one -> many relationship

# schools       students     problems
#        \        |         /
#   classes   sessions   problem steps
#          \     |       /
#           transactions  -- attempts
#

es = utilities.learnlab_to_entityset(data)
es

Entityset: Dataset
  Entities:
    transactions (shape = [6778, 27])
    problem_steps (shape = [78, 49])
    problems (shape = [20, 1])
    classes (shape = [1, 2])
    schools (shape = [1, 1])
    ...And 1 more
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Class -> classes.Class
    classes.School -> schools.School
    transactions.Attempt At Step -> attempts.Attempt At Step

# Phase 2: Building Features
We create a custom primitive: `ProbFail`, which calculates the likelihood that a boolean variable is false. It's worth noting that the opposite of this primitive is built in to Featuretools: `PercentTrue`. One of the many advantages in defining custom primitives is that we can define the name and input types as we would like. If you're interested in creating your own custom primitives for this dataset, copy and modify this step as necessary.

In [17]:
from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes

def probability(boolean):
    numtrue = len([x for x in boolean if x==1])
    return 1 - numtrue/len(boolean)

ProbFail = make_agg_primitive(probability,
                              input_types=[vtypes.Boolean],
                              name='failure_rate',
                              description='Calculates likelihood a boolean is false over a region',
                              return_type=vtypes.Numeric)


Next, we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: we can automatically calculate features as if at a given point in time using Deep Feature Synthesis. Furthermore, we can guarentee that future values for `Outcome` won't be used for any calculations because we set the time index of that value to be after the cutoff time.

Lastly, we can automatically apply `ProbFail` while grouping by any of the entities we created before.

In [19]:
# Automatically generate features on collected data
from featuretools.primitives import Sum, Mean, Median, Count, Hour 
cutoff_times = es['transactions'].df[['Transaction Id', 'End Time', 'Outcome']]
fm, features = ft.dfs(entityset=es, 
                      target_entity='transactions',
                      agg_primitives=[Sum, Mean, ProbFail],
                      trans_primitives=[Hour],
                      max_depth=3,
                      approximate='2m',
                      cutoff_time=cutoff_times,
                      verbose=True)
print('Created {} features'.format(len(features)))

Building features: 402it [00:00, 5109.05it/s]
Progress: 100%|██████████| 61/61 [00:50<00:00,  1.20cutoff time/s]
Created 177 features


In [22]:
features[-5:]

[<Feature: problem_steps.problems.MEAN(transactions.Feedback Text)>,
 <Feature: problem_steps.problems.MEAN(transactions.Feedback Classification)>,
 <Feature: problem_steps.problems.MEAN(transactions.Help Level)>,
 <Feature: problem_steps.problems.MEAN(transactions.Total Num Hints)>,
 <Feature: problem_steps.problems.FAILURE_RATE(transactions.Outcome)>]

The feature `problem_steps.problems.FAILURE_RATE(transactions.Outcome)` is exactly the percent of students who did not succeed on a given `problem_step` as calculated at a given time. A feature like `attempts.FAILURE_RATE(transactions.Outcome)` would be the failure rate as grouped by the problem attempt (i.e. more students miss the questions on an earlier attempts than later ones). It's easy to see how a feature like this might be important in predicting the outcome of a given transaction.

# Phase 3: Making predictions
We were able to add our label `Outcome` to our feature matrix using cutoff times. We now pop it out of the feature matrix and make predictions using a timeseries split and the `roc_auc_score`.

In [23]:
from featuretools.selection import remove_low_information_features
fm_enc, _ = ft.encode_features(fm, features)
fm_enc = fm_enc.fillna(0)
fm_enc = remove_low_information_features(fm_enc)
labels = fm.pop('Outcome')

In [26]:
print("Using {} features".format(len(fm_enc)))
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score


splitter = TimeSeriesSplit(n_splits=5, max_train_size=None)
i=0
for train_index, test_index in splitter.split(fm):
    clf = RandomForestClassifier()
    X_train, X_test = fm_enc.iloc[train_index], fm_enc.iloc[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    score = round(roc_auc_score(preds, y_test), 2)
    print("AUC score on time split {} is {}".format(i, score))
    feature_imps = [(imp, fm_enc.columns[i]) for i, imp in enumerate(clf.feature_importances_)]
    feature_imps.sort()
    feature_imps.reverse()
    print("Top 5 features: {}".format([f[1] for f in feature_imps[0:5]]))
    print("-----\n")
    
    i += 1

Using 6778 features
AUC score on time split 0 is 0.54
Top 5 features: ['Attempt At Step = 1', 'problem_steps.MEAN(transactions.Duration (sec))', 'Attempt At Step = 2', 'Anon Student Id = unknown', 'attempts.MEAN(transactions.Problem View)']
-----

AUC score on time split 1 is 0.6
Top 5 features: ['Problem View', 'attempts.FAILURE_RATE(transactions.Outcome)', 'problem_steps.MEAN(transactions.Duration (sec))', 'attempts.SUM(transactions.Duration (sec))', 'problem_steps.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 2 is 0.57
Top 5 features: ['Problem View', 'problem_steps.FAILURE_RATE(transactions.Outcome)', 'attempts.SUM(transactions.Problem View)', 'Attempt At Step = 1', 'problem_steps.MEAN(transactions.Duration (sec))']
-----

AUC score on time split 3 is 0.58
Top 5 features: ['Problem View', 'Attempt At Step = 2', 'problem_steps.FAILURE_RATE(transactions.Outcome)', 'Session Id = unknown', 'problem_steps.MEAN(transactions.Duration (sec))']
-----

AUC score on time

# Next Steps
This notebook showed how to structure your data and make predictions with machine learning. Rather than spending time creating features, it's now possible to explore the relationships and implications betweem thousands of features directly. Reasonable next steps might be to:
1. Make plots to better understand the relationship between existing features and the label 
2. Reduce the total number of features and tune the machine learning model
3. Create discipline specific *custom primitives* that might be useful for this prediction problem


