# Making predictions from a DataShop dataset
<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

In this tutorial, we show how to predict whether a student will successfully answer a problem using a dataset from [CMU DataShop](https://pslcdatashop.web.cmu.edu/). The workflow given here can be used to quickly **organize** and **make predictions** about any column from any similarly structured DataShop dataset.

*If you're running this notebook yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You only need the `.txt` file. The infrastructure in this notebook will work with **any** LearnLab dataset, but you will need to change the file name in the first cell.*

## Highlights
* Show how to import a DataShop dataset into featuretools
* Show how to make custom primitives for stacking
* Show efficacy of automatic feature generation with these datasets

# Step 1: Creating a useful dataset structure
At the beginning of any project, it is worthwhile to take a moment to think about how your dataset is structured.

In these datasets, the unique events are `transactions`: places where a student interacts with a system. However, many transaction columns describe things that repeat across transactions, so they can be grouped into entities of their own.

For instance, there are only 59 distinct students for the 6778 transactions in the geometry dataset. Those students log in to the system and have individual sessions. We can break down problems and problem steps in a similar way.
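One quick way to spot columns worth splitting into their own entity is to compare the number of distinct values per column with the number of rows. The sketch below uses a small made-up DataFrame rather than the real export. The column names follow the DataShop format, but the values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for a DataShop transaction export (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4", "t5"],
    "Anon Student Id": ["Stu_a", "Stu_a", "Stu_b", "Stu_b", "Stu_b"],
    "Session Id": ["s1", "s1", "s2", "s3", "s3"],
    "Problem Name": ["p1", "p2", "p1", "p1", "p2"],
})

# Columns with far fewer distinct values than rows are candidates
# for an entity of their own.
distinct_counts = transactions.nunique()
print(distinct_counts)
```

On the real geometry dataset, the same call would be expected to show the gap between 59 students and 6778 transactions described above.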

We create an entityset structure using the `datashop_to_entityset` function in [utilities](utilities.py). If you're interested in how `datashop_to_entityset` is structured, there's an associated notebook [entityset_function](entityset_function.ipynb) which explains choices made in more detail.

In [1]:
# Note that each branch is a one -> many relationship

# schools       students     problems
#        \        |         /
#   classes   sessions   problem steps
#          \     |       /
#           transactions  -- attempts
#

import utils

filename = 'data/ds2174_tx_All_Data_3991_2017_1128_123859.txt'
es = utils.datashop_to_entityset(filename)
es

Entityset: Dataset
  Entities:
    transactions (shape = [6778, 73])
    problem_steps (shape = [78, 2])
    problems (shape = [20, 1])
    sessions (shape = [59, 3])
    students (shape = [59, 2])
    ...And 3 more
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Session Id -> sessions.Session Id
    sessions.Anon Student Id -> students.Anon Student Id
    transactions.Class -> classes.Class
    ...and 2 more

Here, we've set up entities for unique pieces of information. As an example, there are only 59 students while there are nearly 7000 transactions. Our `students` entity represents that: there are only 59 rows, one for each Anonymous student ID.

In [2]:
es['students'].df.head(3)

| Anon Student Id (index) | Anon Student Id | first_sessions_time |
|---|---|---|
| Stu_c0bf45c22dc46067350d304ce330067e | Stu_c0bf45c22dc46067350d304ce330067e | 1996-02-01 00:00:00 |
| Stu_af3a2f63bda8c1338556108cb8d519a0 | Stu_af3a2f63bda8c1338556108cb8d519a0 | 1996-02-01 00:00:02 |
| Stu_d7f18a5fa205a889b0c5b0b56a7127d3 | Stu_d7f18a5fa205a889b0c5b0b56a7127d3 | 1996-02-01 00:00:02 |


The `transactions` entity holds everything that is left over after we've *normalized* out the other entities.

In [3]:
es['transactions'].head(3)

| Transaction Id (index) | Sample Name | Transaction Id | Session Id | Time | Time Zone | Duration (sec) | Student Response Type | Student Response Subtype | Tutor Response Type | Tutor Response Subtype | ... | CF (Factor embeddedness) | CF (Factor figure-part) | CF (Factor figure-type) | CF (Factor non-standard-orientation-or-shape) | CF (Factor parallelogram) | CF (Factor parallelogram-type) | CF (Factor repeat) | CF (Factor required) | CF (Factor trapezoid-part) | End Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 499a0a18d7b6d96d4ee9c16d4bead6f2 | All Data | 499a0a18d7b6d96d4ee9c16d4bead6f2 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:00 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | embedded | area | circle | 0 | 0 | 0 | initial | required | 0 | 1996-02-01 00:00:00 |
| d398b66148a76c537cba816efe946b85 | All Data | d398b66148a76c537cba816efe946b85 | GEO-408d5ed7:10e14be5d3a:-8000 | 1996-02-01 00:00:01 | US/Eastern | 1 | ATTEMPT | | RESULT | | ... | embedded | area | circle | 0 | 0 | 0 | initial | required | 0 | 1996-02-01 00:00:02 |
| 1c133061306fd8e099eb0e4f2ac21430 | All Data | 1c133061306fd8e099eb0e4f2ac21430 | GEO-408d5ed7:10e14be5d3a:-6e40 | 1996-02-01 00:00:02 | US/Eastern | 0 | ATTEMPT | | RESULT | | ... | alone | area | rectangle | 0 | 1 | rectangle | repeat | required | 0 | 1996-02-01 00:00:02 |
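The idea of normalizing entities out of the transaction log can be sketched in plain pandas. This is only an illustration on made-up data and not the implementation of `datashop_to_entityset`. The `first_sessions_time` column name matches the `students` output above, while `first_transactions_time` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy transaction log (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["s1", "s1", "s2", "s3"],
    "Anon Student Id": ["Stu_a", "Stu_a", "Stu_b", "Stu_b"],
    "Time": pd.to_datetime([
        "1996-02-01 00:00:00", "1996-02-01 00:00:01",
        "1996-02-01 00:00:02", "1996-02-01 00:00:05",
    ]),
})

# One row per session: its parent student and the time it was first seen.
sessions = (
    transactions.groupby(["Session Id", "Anon Student Id"], as_index=False)["Time"]
    .min()
    .rename(columns={"Time": "first_transactions_time"})  # hypothetical name
)

# One row per student: the time of that student's earliest session.
students = (
    sessions.groupby("Anon Student Id", as_index=False)["first_transactions_time"]
    .min()
    .rename(columns={"first_transactions_time": "first_sessions_time"})
)

# The student id is now reachable through the session, so the
# transaction table no longer needs to carry it.
transactions_normalized = transactions.drop(columns=["Anon Student Id"])
```

Each parent table ends up with one row per id, and the child table keeps only the key needed to look the parent up.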


# Step 2: Building Features

Next, we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: using Deep Feature Synthesis, we can automatically calculate each row's features using only the data available at that row's cutoff time. Furthermore, we can guarantee that future values for `Outcome` won't be used for any calculations because we set the time index of that value to be after the cutoff time.

We use the `create_features` function from `utils`, which runs Deep Feature Synthesis on the entityset with the label and target entity already defined.

In [4]:
fm_enc, label = utils.create_features(es, label='Outcome')
# len(fm_enc) is the number of rows (one per transaction), not the number of features
print("Feature matrix has {} rows".format(len(fm_enc)))

Building features: 792it [00:00, 4811.19it/s]
Progress: 100%|██████████| 61/61 [01:38<00:00,  1.62s/cutoff time]
Feature matrix has 6778 rows
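To make the cutoff-time idea concrete, here is a minimal pandas sketch on made-up data. It is not how featuretools implements cutoff times. It only illustrates the principle that a feature for a given row is computed from strictly earlier rows. The `prior_mean_duration` column name is hypothetical.

```python
import pandas as pd

# Toy log, sorted by time (values are made up).
log = pd.DataFrame({
    "Anon Student Id": ["Stu_a", "Stu_a", "Stu_a", "Stu_b", "Stu_b"],
    "Time": pd.to_datetime([
        "1996-02-01 00:00:00", "1996-02-01 00:00:05", "1996-02-01 00:00:09",
        "1996-02-01 00:00:02", "1996-02-01 00:00:07",
    ]),
    "Duration (sec)": [4.0, 2.0, 6.0, 1.0, 3.0],
}).sort_values("Time").reset_index(drop=True)

# For each row, average the durations of the same student's EARLIER rows only.
# shift(1) excludes the current row, so nothing at or after the row's own
# time leaks into its feature value.
log["prior_mean_duration"] = (
    log.groupby("Anon Student Id")["Duration (sec)"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(log)
```

The first row for each student has no history, so its feature is missing. Later rows only ever see the past.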


# Step 3: Making predictions
Using the feature matrix `fm_enc` and the label `label`, we compute the `roc_auc_score` on each of the five folds of a time series split.
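As a rough illustration of what a time series split does, the sketch below applies scikit-learn's `TimeSeriesSplit` to twelve time-ordered toy rows and computes a ROC AUC on a tiny hand-made example. It does not reproduce `utils.score_with_tssplit`, whose internals are not shown in this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered rows standing in for a feature matrix.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=5)
folds = list(splitter.split(X))
for i, (train_idx, test_idx) in enumerate(folds):
    # Every training row comes before every test row, and the
    # training window grows from one fold to the next.
    print(i, train_idx.tolist(), test_idx.tolist())

# ROC AUC compares true binary labels against predicted scores.
auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because each fold trains only on earlier rows, the scores estimate how well a model trained on the past predicts later transactions.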

In [5]:
from sklearn.model_selection import TimeSeriesSplit

# len(fm_enc) is the number of rows (one per transaction), not the number of features
print("Using {} rows".format(len(fm_enc)))

splitter = TimeSeriesSplit(n_splits=5, max_train_size=None)
utils.score_with_tssplit(fm_enc, label, splitter)

Using 6778 rows
AUC score on time split 0 is 0.65
Feature Importances: 
1: problem_steps.SUM(transactions.Problem View)
2: Attempt At Step = 2
3: attempts.SUM(transactions.Problem View)
4: Attempt At Step = 1
5: problem_steps.SUM(transactions.Duration (sec))
-----

AUC score on time split 1 is 0.62
Feature Importances: 
1: sessions.SUM(transactions.Duration (sec))
2: sessions.students.MEAN(transactions.Duration (sec))
3: sessions.MEAN(transactions.Duration (sec))
4: sessions.students.MEAN(transactions.CF (Factor parallelogram))
5: sessions.students.MEAN(transactions.Is Last Attempt)
-----

AUC score on time split 2 is 0.56
Feature Importances: 
1: problem_steps.MEAN(transactions.Duration (sec))
2: sessions.students.MEAN(transactions.Is Last Attempt)
3: attempts.SUM(transactions.CF (Factor backward))
4: sessions.students.MEAN(transactions.Duration (sec))
5: sessions.students.SUM(transactions.Duration (sec))
-----

AUC score on time split 3 is 0.63
Feature Importances: 
1: problem_st

# Next Steps
This notebook showed how to structure your data and make predictions with machine learning. Rather than spending time creating features by hand, it's now possible to directly explore the relationships among thousands of automatically generated features. Reasonable next steps might be to:
1. Make plots to better understand the relationship between existing features and the label 
2. Reduce the total number of features and tune the machine learning model
3. Create discipline-specific *custom primitives* that might be useful for this prediction problem
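As one possible starting point for reducing the number of features, the sketch below drops constant columns and then one column from each highly correlated pair. It runs on a made-up matrix. The 0.95 threshold is an arbitrary assumption for illustration, and this is not a step the notebook itself performs.

```python
import numpy as np
import pandas as pd

# Toy feature matrix (values are made up).
fm = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_doubled": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "a"
    "constant": [5.0, 5.0, 5.0, 5.0],   # carries no information
    "b": [1.0, 0.0, 1.0, 0.0],
})

# Drop columns that have a single distinct value.
fm = fm.loc[:, fm.nunique() > 1]

# Keep only the upper triangle of the absolute correlation matrix so that
# each pair is inspected once, then drop the later column of any pair
# whose correlation exceeds the threshold.
corr = fm.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = fm.drop(columns=to_drop)
print(list(reduced.columns))
```

A pass like this is cheap to run before any model-based selection or tuning.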




# Appendix: Custom Primitives
It's often the case that you'd like to create discipline-specific primitives. Here we create a custom primitive, `ProbFail`, which calculates the fraction of values of a boolean variable that are false. One of the many advantages of defining custom primitives is that we can choose the name and input types ourselves. If you're interested in creating your own custom primitives for this dataset, copy and modify this step as necessary.
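Before wrapping a function as a primitive, it can help to check its behavior on a plain pandas Series. The sketch below mirrors the logic of the `probability` function used in this appendix and runs it on made-up outcomes, without involving featuretools.

```python
import pandas as pd

def probability(boolean):
    """Return the fraction of entries that are not true (1)."""
    numtrue = len([x for x in boolean if x == 1])
    return 1 - numtrue / len(boolean)

# Three correct answers out of four, so one failure in four.
outcomes = pd.Series([True, False, True, True])
rate = probability(outcomes)
print(rate)  # 0.25
```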

In [6]:
from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes

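def probability(boolean):
    # Count the entries equal to True (1) and return the complementary fraction
    numtrue = len([x for x in boolean if x == 1])
    # float() keeps this a true division on Python 2 as well
    return 1 - numtrue / float(len(boolean))

# Wrap the function as an aggregation primitive: Boolean input, Numeric output
ProbFail = make_agg_primitive(probability,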
                              input_types=[vtypes.Boolean],
                              name='failure_rate',
                              description='Calculates likelihood a boolean is false over a region',
                              return_type=vtypes.Numeric)

fm_enc2, label2 = utils.create_features(es,
                                  label='Outcome',
                                  custom_agg=[ProbFail])

utils.score_with_tssplit(fm_enc2, label2, splitter)


Building features: 804it [00:00, 5014.04it/s]
Progress: 100%|██████████| 61/61 [01:44<00:00,  1.72s/cutoff time]
AUC score on time split 0 is 0.65
Feature Importances: 
1: Attempt At Step = 2
2: problem_steps.FAILURE_RATE(transactions.Outcome)
3: sessions.FAILURE_RATE(transactions.Outcome)
4: attempts.FAILURE_RATE(transactions.Outcome)
5: problem_steps.MEAN(transactions.Duration (sec))
-----

AUC score on time split 1 is 0.57
Feature Importances: 
1: sessions.MEAN(transactions.Duration (sec))
2: problem_steps.FAILURE_RATE(transactions.Outcome)
3: sessions.students.MEAN(transactions.Duration (sec))
4: Attempt At Step = 2
5: sessions.FAILURE_RATE(transactions.Outcome)
-----

AUC score on time split 2 is 0.58
Feature Importances: 
1: sessions.students.MEAN(transactions.Duration (sec))
2: sessions.MEAN(transactions.Duration (sec))
3: problem_steps.MEAN(transactions.Duration (sec))
4: attempts.FAILURE_RATE(transactions.Outcome)
5: problem_steps.SUM(transactions.Duration (sec))
-----

AUC sc

In [8]:
from bokeh.io import show, output_notebook

output_notebook()
p = utils.plot(fm_enc2,
               col1='sessions.FAILURE_RATE(transactions.Outcome)',
               col2='problem_steps.MEAN(transactions.Duration (sec))',
               label=label2)
show(p)