# Making predictions from a LearnLab dataset
<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

In this tutorial, we show how to use [Featuretools](https://www.featuretools.com) on the standard LearnLab dataset structure. The workflow shown here can be used to quickly **organize** and **make predictions** about any LearnLab dataset. In this notebook we use machine learning to predict whether a student will get a problem right or wrong before they attempt it.

*If you're running this notebook yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You will only need the `.txt` file. The infrastructure in this notebook will work with **any** LearnLab dataset, but you will need to change the filename in the following cell.*

## Highlights
* Show how to import a LearnLab dataset into Featuretools
* Show how to make custom primitives for stacking
* Show efficacy of automatic feature generation with these datasets

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft
import utils
print('Using Featuretools version {}'.format(ft.__version__))
# LearnLab exports are tab-delimited; pass the separator by keyword
data = pd.read_csv('data/ds2174_tx_All_Data_3991_2017_1128_123859.txt', sep='\t')


Using Featuretools version 0.1.17


# Phase 1: Creating a useful dataset structure
At the beginning of any project, it is worthwhile to take a moment to think about how your dataset is structured.

In these datasets the unique events come from `transactions`: places where a student interacts with a system. However, the columns of those transactions have variables that can be grouped together. 

For instance, there are only 78 distinct `problem_steps` for the 6778 transactions we have in the geometry dataset. Associated with each problem step, we have a variety of knowledge components (KC) and custom fields (CF).
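One quick way to spot columns that can be grouped like this is to compare the number of distinct values in a column with the number of rows. The following is a minimal sketch on an invented toy table; the column names mirror the LearnLab export, but the rows and the `toy` variable are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the transaction table (invented rows, real-looking column names).
toy = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4", "t5"],
    "Step Name": ["(AREA Q1)", "(AREA Q1)", "(AREA Q2)", "(AREA Q1)", "(AREA Q2)"],
    "KC (Geometry)": ["Geometry"] * 5,
})

# Distinct values per column: a count far below the number of rows suggests
# the column describes a separate entity rather than the transaction itself.
distinct_counts = toy.nunique()
print(distinct_counts)

# Number of transactions that share each step.
rows_per_step = toy.groupby("Step Name").size()
print(rows_per_step)
```

On the real data, the same `nunique` call on `data` would show which columns repeat heavily across transactions.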

We create an entityset structure using the `learnlab_to_entityset` function in [utilities](utilities.py). If you're interested in how `learnlab_to_entityset` is structured, the associated notebook [entityset_function](entityset_function.ipynb) explains the choices made in more detail.

In [2]:
# Note that each branch is a one -> many relationship

# schools       students     problems
#        \        |         /
#   classes   sessions   problem steps
#          \     |       /
#           transactions  -- attempts
#

es = utils.learnlab_to_entityset(data)
es

Entityset: Dataset
  Entities:
    transactions (shape = [6778, 26])
    problem_steps (shape = [78, 49])
    problems (shape = [20, 1])
    sessions (shape = [59, 3])
    students (shape = [59, 2])
    ...And 3 more
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Session Id -> sessions.Session Id
    sessions.Anon Student Id -> students.Anon Student Id
    transactions.Class -> classes.Class
    ...and 2 more

Here, we've moved the "Knowledge Components" and "Custom Fields" columns into a `problem_steps` entity. The EntitySet construct lets us repeat as little information as possible in the dataset.
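The idea of moving descriptive columns into their own table can be sketched with plain pandas. This is an illustration on invented rows, not the implementation of `learnlab_to_entityset`; the `toy`, `step_cols`, `problem_steps` and `transactions` names below are assumptions for the example.

```python
import pandas as pd

# Toy wide table: step-level columns (KC / CF) are repeated on every transaction.
toy = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Step Name": ["(AREA Q1)", "(AREA Q1)", "(AREA Q2)", "(AREA Q2)"],
    "KC (Geometry)": ["rectangle-area", "rectangle-area", "triangle-area", "triangle-area"],
    "CF (Factor figure-type)": ["rectangle", "rectangle", "triangle", "triangle"],
    "Outcome": [0, 1, 1, 0],
})

# Columns that describe the step rather than the individual transaction.
step_cols = [c for c in toy.columns if c.startswith(("KC", "CF"))]

# One row per step: the descriptive columns are now stored only once.
problem_steps = toy[["Step Name"] + step_cols].drop_duplicates().reset_index(drop=True)

# Transactions keep only the key that points at the step table.
transactions = toy.drop(columns=step_cols)

# Joining on the key recovers the wide table; validate checks the
# many-to-one direction (many transactions, one step).
rebuilt = transactions.merge(problem_steps, on="Step Name", validate="many_to_one")
print(problem_steps)
```

The `validate="many_to_one"` argument raises an error if a step name appears more than once in `problem_steps`, which is a cheap check that the split produced a proper key.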

In [3]:
es['problem_steps'].df.head(3)

| Step Name (index) | Step Name | KC (Geometry) | KC Category (Geometry) | KC (Textbook) | KC Category (Textbook) | KC (Single-KC) | KC Category (Single-KC) | KC (Unique-step) | KC Category (Unique-step) | KC (NewModel) | ... | CF (Factor embeddedness) | CF (Factor figure-part) | CF (Factor figure-type) | CF (Factor non-standard-orientation-or-shape) | CF (Factor parallelogram) | CF (Factor parallelogram-type) | CF (Factor repeat) | CF (Factor required) | CF (Factor trapezoid-part) | Problem Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (APOTHEM QUESTION2) | (APOTHEM QUESTION2) | Geometry | | pentagon-area | | Single-KC | | KC25 | | pentagon-area | ... | alone | apothem | pentagon | 0 | 0 | 0 | repeat | additional | 0 | PENTAGON_ABCDE |
| (AREA QUESTION1) | (AREA QUESTION1) | Geometry | | rectangle-area | | Single-KC | | KC12 | | square-rect-area | ... | alone | area | rectangle | 0 | 1 | rectangle | repeat | required | 0 | RECTANGLE_ABCD |
| (AREA QUESTION2) | (AREA QUESTION2) | Geometry | | triangle-area | | Single-KC | | KC40 | | triangle-area | ... | alone | area | triangle | 0 | 0 | 0 | initial | required | 0 | DESIGNING_A_QUILT |


Here, each row of `problem_steps` is identified by `Step Name`, which has a one-to-many relationship with the `transactions` entity, while `Problem Name` gives a many-to-one relationship to the `problems` entity. The `transactions` entity holds everything that's left over from our entityset construction.

In [4]:
es['transactions'].head(3)

| Transaction Id (index) | Sample Name | Transaction Id | Session Id | Time | Time Zone | Duration (sec) | Student Response Type | Student Response Subtype | Tutor Response Type | Tutor Response Subtype | ... | Outcome | Selection | Action | Input | Feedback Text | Feedback Classification | Help Level | Total Num Hints | Class | End Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 499a0a18d7b6d96d4ee9c16d4bead6f2 | All Data | 499a0a18d7b6d96d4ee9c16d4bead6f2 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:00 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | 0 | (CIRCLE-AREA_A QUESTION1) | | | | | | | | 1996-02-01 00:00:00 |
| d398b66148a76c537cba816efe946b85 | All Data | d398b66148a76c537cba816efe946b85 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:01 | US/Eastern | 1 | ATTEMPT | | RESULT | | ... | 1 | (CIRCLE-AREA_A QUESTION1) | | | | | | | | 1996-02-01 00:00:02 |
| 1c133061306fd8e099eb0e4f2ac21430 | All Data | 1c133061306fd8e099eb0e4f2ac21430 | GEO-408d5ed7:10e14be5d3a:-6e40 | 1996-02-01 00:00:02 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | 1 | (AREA QUESTION1) | | | | | | | | 1996-02-01 00:00:02 |


# Phase 2: Building Features

Next, we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: using Deep Feature Synthesis, we can automatically calculate each feature as it would have looked at a given point in time. Furthermore, we can guarantee that future values of `Outcome` won't be used in any calculation because we set the time index of that value to be after the cutoff time.

We use the `autorun_dfs` function from `utils`, which runs Deep Feature Synthesis once the entityset, the target entity, and the label are defined.
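To make the cutoff-time idea concrete, here is a minimal pandas sketch of a single "as of this moment" feature: the mean of earlier outcomes in the same session. It is not how Featuretools computes features internally; the toy rows, the `prior_mean` helper and the `prior_session_mean_outcome` column are assumptions chosen for illustration.

```python
import pandas as pd

# Invented transactions for two sessions.
toy = pd.DataFrame({
    "Session Id": ["s1", "s1", "s1", "s2", "s2"],
    "Time": pd.to_datetime([
        "1996-02-01 00:00:00", "1996-02-01 00:00:05", "1996-02-01 00:00:09",
        "1996-02-01 00:00:02", "1996-02-01 00:00:07",
    ]),
    "Outcome": [0, 1, 1, 1, 0],
})
toy = toy.sort_values(["Session Id", "Time"]).reset_index(drop=True)


def prior_mean(outcomes):
    # shift(1) hides the current row, so the running mean at each row
    # only sees outcomes that were recorded strictly earlier.
    return outcomes.shift(1).expanding().mean()


toy["prior_session_mean_outcome"] = (
    toy.groupby("Session Id")["Outcome"].transform(prior_mean)
)
print(toy)
```

The first transaction of each session has no history, so its feature is `NaN`; every other row is computed without looking at its own `Outcome` or any later one.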

In [5]:
fm_enc, label = utils.autorun_dfs(es, target_entity='transactions', label='Outcome')
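# len(fm_enc) is the number of rows (one per transaction), not the number of feature columns
print("Created a feature matrix with {} rows".format(len(fm_enc)))

Building features: 515it [00:00, 4193.95it/s]
Progress: 100%|██████████| 61/61 [01:13<00:00,  1.20s/cutoff time]
Created a feature matrix with 6778 rows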


# Phase 3: Making predictions
Using the feature matrix `fm_enc` and the label `label`, we compute the `roc_auc_score` on each of the five folds of a time series split.
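The sketch below shows what scoring over a time series split can look like with scikit-learn alone. It is an approximation written for illustration, not the code of `utils.score_with_tssplit`: the synthetic data, the choice of `RandomForestClassifier`, and the feature names `f0` to `f3` are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered data: only the first column carries signal.
rng = np.random.RandomState(0)
n = 600
X = rng.normal(size=(n, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
names = ["f0", "f1", "f2", "f3"]

splitter = TimeSeriesSplit(n_splits=5)
scores = []
top_features = []
for i, (train_idx, test_idx) in enumerate(splitter.split(X)):
    # Every training row comes before every test row, so the model
    # is never fit on data from the "future" of the fold it is scored on.
    assert train_idx.max() < test_idx.min()

    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])

    # AUC is computed from the predicted probability of the positive class.
    probs = clf.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], probs))

    # Rank features by importance, most important first.
    order = np.argsort(clf.feature_importances_)[::-1]
    top_features.append([names[j] for j in order[:2]])

    print("AUC score on time split {} is {:.2f}".format(i, scores[-1]))
    print("Top 2 features: {}".format(top_features[-1]))
```

Because the training window grows with each fold, later folds are fit on more history than earlier ones, which is one reason scores can differ between splits.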

In [6]:
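# len(fm_enc) is the number of rows (one per transaction), not the number of feature columns
print("Using a feature matrix with {} rows".format(len(fm_enc)))
from sklearn.model_selection import TimeSeriesSplit

splitter = TimeSeriesSplit(n_splits=5, max_train_size=None)
utils.score_with_tssplit(fm_enc, label, splitter)

Using a feature matrix with 6778 rows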
AUC score on time split 0 is 0.55
Top 5 features: ['sessions.MEAN(transactions.Is Last Attempt)', 'attempts.MEAN(transactions.Problem View)', 'attempts.SUM(transactions.Problem View)', 'sessions.SUM(transactions.Problem View)', 'Attempt At Step = 2']
-----

AUC score on time split 1 is 0.58
Top 5 features: ['sessions.MEAN(transactions.Is Last Attempt)', 'sessions.students.SUM(transactions.Duration (sec))', 'attempts.SUM(transactions.Duration (sec))', 'sessions.SUM(transactions.Problem View)', 'sessions.MEAN(transactions.Duration (sec))']
-----

AUC score on time split 2 is 0.57
Top 5 features: ['sessions.MEAN(transactions.Duration (sec))', 'sessions.students.MEAN(transactions.Duration (sec))', 'sessions.students.MEAN(transactions.Is Last Attempt)', 'Problem View', 'sessions.students.SUM(transactions.Is Last Attempt)']
-----

AUC score on time split 3 is 0.63
Top 5 features: ['sessions.MEAN(transactions.Duration (sec))', 'Problem View', 'sessions.students.MEAN(transa

# Next Steps
This notebook showed how to structure your data and make predictions with machine learning. Rather than spending time creating features, it's now possible to explore the relationships between thousands of features, and their implications, directly. Reasonable next steps might be to:
1. Make plots to better understand the relationship between existing features and the label 
2. Reduce the total number of features and tune the machine learning model
3. Create discipline-specific *custom primitives* that might be useful for this prediction problem
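As one possible starting point for the second step, the sketch below drops features that are almost perfectly correlated with an earlier column. It is a generic pandas technique, not something the notebook's `utils` module is known to provide; the `drop_correlated` helper, its `threshold` default and the toy matrix are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(fm, threshold=0.95):
    """Drop each column whose absolute correlation with an earlier column exceeds threshold."""
    corr = fm.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return fm.drop(columns=to_drop), to_drop


# Toy feature matrix: "b" is an exact multiple of "a".
toy_fm = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_correlated(toy_fm)
print(dropped)
print(list(reduced.columns))
```

Applied to a numeric feature matrix such as `fm_enc`, the same helper would return a smaller matrix and the list of removed column names, which can then be reviewed before retraining.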




# Appendix: Custom Primitives
It's often the case that you'd like to create discipline-specific primitives. Here we create a custom primitive, `ProbFail`, which calculates the likelihood that a boolean variable is false. One of the many advantages of defining custom primitives is that we can choose the name and input types ourselves. If you're interested in creating your own custom primitives for this dataset, copy and modify this step as necessary.
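The function behind such a primitive is ordinary Python, so its behavior can be checked on plain lists before it is handed to Featuretools. The sketch below uses a stand-in named `failure_rate`; the name and the empty-input handling are assumptions for this example and are not part of the notebook's code.

```python
def failure_rate(values):
    """Return the fraction of entries that are not equal to 1 (True)."""
    values = list(values)
    if not values:
        # No observations: the rate is undefined, so return NaN.
        return float("nan")
    num_true = sum(1 for v in values if v == 1)
    # Float division keeps the result fractional under Python 2 as well.
    return 1 - num_true / float(len(values))


print(failure_rate([1, 0, 0, 1]))
print(failure_rate([True, True, True]))
```

Checking edge cases such as an empty input at this stage is cheaper than discovering them after a full feature matrix has been built.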

In [7]:
from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes

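def probability(boolean):
    # Count the entries equal to 1 (True), then return the complementary fraction
    numtrue = len([x for x in boolean if x == 1])
    # Divide by a float so the result is not truncated under Python 2
    return 1 - numtrue / float(len(boolean))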

ProbFail = make_agg_primitive(probability,
                              input_types=[vtypes.Boolean],
                              name='failure_rate',
                              description='Calculates likelihood a boolean is false over a region',
                              return_type=vtypes.Numeric)

fm_enc2, label2 = utils.autorun_dfs(es, target_entity='transactions', 
                                  label='Outcome',
                                  custom_agg=[ProbFail])

utils.score_with_tssplit(fm_enc2, label2, splitter)


Building features: 527it [00:00, 4864.54it/s]
Progress: 100%|██████████| 61/61 [01:16<00:00,  1.25s/cutoff time]
AUC score on time split 0 is 0.55
Top 5 features: ['attempts.FAILURE_RATE(transactions.Outcome)', 'sessions.SUM(transactions.Problem View)', 'sessions.MEAN(transactions.Duration (sec))', 'sessions.students.MEAN(transactions.Is Last Attempt)', 'Attempt At Step = 2']
-----

AUC score on time split 1 is 0.63
Top 5 features: ['sessions.MEAN(transactions.Is Last Attempt)', 'sessions.students.MEAN(transactions.Duration (sec))', 'sessions.students.SUM(transactions.Problem View)', 'Attempt At Step = 1', 'sessions.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 2 is 0.57
Top 5 features: ['sessions.MEAN(transactions.Duration (sec))', 'Problem View', 'sessions.students.FAILURE_RATE(transactions.Outcome)', 'sessions.SUM(transactions.Duration (sec))', 'attempts.FAILURE_RATE(transactions.Outcome)']
-----

AUC score on time split 3 is 0.62
Top 5 features: ['Problem View