# Predicting incorrect student answers from DataShop data
<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

In this tutorial, we show how to predict whether a student will successfully answer a problem using a dataset from [CMU DataShop](https://pslcdatashop.web.cmu.edu/). While online courses are logistically efficient, their structure can make it more difficult for a teacher to understand how students in the class are learning. To try to fill in those gaps, we can apply machine learning. However, building an accurate machine learning model requires extracting information called **features**. Finding the right features is a crucial part of both reaching a satisfactory answer and interpreting the dataset as a whole. [Featuretools](http://www.featuretools.com) simplifies this process of **feature engineering**.

*If you're running this notebook yourself, please download the [geometry dataset](https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=76) into the `data` folder in this repository. You can follow these [instructions](https://pslcdatashop.web.cmu.edu/help?datasetId=76&page=export) to download the data. You will only need the `.txt` file. The infrastructure in this notebook will work with **any** DataShop dataset, but you will need to change the file name in the first cell.*

## Highlights
* Show how to import a DataShop dataset into Featuretools
* Demonstrate the efficacy of automatic feature generation by training a machine learning model
* Give an example of how Featuretools can reveal and help answer interesting questions

# Step 1: Load data
At the beginning of any project, it is worthwhile to take a moment to think about how your dataset is structured.

In these datasets the unique events come from `transactions`: places where a student interacts with a system. Each transaction has a `time-index`, the time at which information in a row becomes known. Furthermore, the columns of those transactions have variables that can be grouped together. 

For instance, there are only 59 distinct students for the 6778 transactions we have in the geometry dataset. Those students log in to the system and have individual sessions. We can break down problems and problem steps in a similar way.
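One quick way to spot columns that can be grouped into their own entity is to count distinct values per column. The sketch below uses a tiny made-up table rather than the geometry dataset; the column names mirror DataShop's, but the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for a DataShop transaction export (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4", "t5"],
    "Anon Student Id": ["s1", "s1", "s2", "s2", "s2"],
    "Session Id": ["a", "a", "b", "c", "c"],
    "Step Name": ["q1", "q2", "q1", "q1", "q2"],
})

# Number of distinct values per column. Columns with few distinct values
# relative to the number of rows are candidates for their own entity.
distinct_counts = transactions.nunique()
print(distinct_counts)
```

On the real data, the same call would be expected to show far fewer students than transactions, which is the pattern described above.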

Featuretools stores data in an `EntitySet`. This is an abstraction which allows us to hold on to not only the data itself, but also to metadata like relationships and column types.

We create an entityset structure using the `datashop_to_entityset` function in [utils](utils.py). If you're interested in how `datashop_to_entityset` is structured, there's an associated notebook [entityset_function](entityset_function.ipynb) which explains choices made in more detail.

In [1]:
# Note that each branch is a one -> many relationship

# schools       students     problems
#        \        |         /
#   classes   sessions   problem steps
#          \     |       /
#           transactions
#

import utils

filename = 'data/ds76_tx_All_Data_74_2018_0912_070949.txt'
es = utils.datashop_to_entityset(filename)
es

Entityset: Dataset
  Entities:
    problem_steps [Rows: 78, Columns: 156]
    sessions [Rows: 59, Columns: 3]
    students [Rows: 59, Columns: 2]
    transactions [Rows: 6778, Columns: 26]
    problems [Rows: 20, Columns: 2]
    classes [Rows: 1, Columns: 2]
    schools [Rows: 1, Columns: 1]
  Relationships:
    transactions.Step Name -> problem_steps.Step Name
    problem_steps.Problem Name -> problems.Problem Name
    transactions.Session Id -> sessions.Session Id
    sessions.Anon Student Id -> students.Anon Student Id
    transactions.Class -> classes.Class
    classes.School -> schools.School

Our `students` entity reflects that structure: it has only 59 rows, one for each anonymous student ID.

In [2]:
es['students'].df.head(3)

Unnamed: 0,Anon Student Id,first_sessions_time
Stu_c0bf45c22dc46067350d304ce330067e,Stu_c0bf45c22dc46067350d304ce330067e,1996-02-01 00:00:00
Stu_af3a2f63bda8c1338556108cb8d519a0,Stu_af3a2f63bda8c1338556108cb8d519a0,1996-02-01 00:00:02
Stu_d7f18a5fa205a889b0c5b0b56a7127d3,Stu_d7f18a5fa205a889b0c5b0b56a7127d3,1996-02-01 00:00:02


Featuretools allows us to make new entities by grouping on categorical values. Through this process of *normalization* we have ended up with the 7 connected entities listed above, starting from a single table of transactions. We can look at what is left in `transactions` after normalization.
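To make the idea of normalization concrete, here is a minimal pandas sketch on invented data. It is not how `datashop_to_entityset` is implemented; it only illustrates pulling a repeated group of columns out into a separate table keyed by a categorical column.

```python
import pandas as pd

# Toy transactions table (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["a", "a", "b", "b"],
    "Anon Student Id": ["s1", "s1", "s2", "s2"],
    "Outcome": [0, 1, 1, 1],
})

# Pull the session-level columns into their own table, one row per session.
sessions = (transactions[["Session Id", "Anon Student Id"]]
            .drop_duplicates()
            .reset_index(drop=True))

# The student id now lives in `sessions`, so it can be dropped from
# transactions. "Session Id" stays behind as the key linking the tables.
transactions_normalized = transactions.drop(columns=["Anon Student Id"])

# Joining on the key recovers the original information.
recovered = transactions_normalized.merge(sessions, on="Session Id")
```

No information is lost: the student for any transaction can still be found through its session.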

In [3]:
es['transactions'].df.head(3)

Unnamed: 0,Transaction Id,Sample Name,Total Num Hints,Level (Unit),Tutor Response Subtype,Time Zone,Input,Problem View,Feedback Classification,Class,...,End Time,Time,Action,Outcome,Step Name,Student Response Type,Student Response Subtype,Tutor Response Type,Is Last Attempt,Duration (sec)
bdc63de3ac6ace889eed3850997ea333,bdc63de3ac6ace889eed3850997ea333,All Data,,Area,,US/Eastern,,1,,,...,1996-02-01 00:00:00,1996-02-01 00:00:00,,0,(CIRCLE-AREA_A QUESTION1),ATTEMPT,,RESULT,0,0
34f5d8fcf207513d222bf6d1bced2046,34f5d8fcf207513d222bf6d1bced2046,All Data,,Area,,US/Eastern,,1,,,...,1996-02-01 00:00:02,1996-02-01 00:00:01,,1,(CIRCLE-AREA_A QUESTION1),ATTEMPT,,RESULT,1,1
171f36289815d9782f8beab6ad8b0fac,171f36289815d9782f8beab6ad8b0fac,All Data,,Area,,US/Eastern,,1,,,...,1996-02-01 00:00:02,1996-02-01 00:00:02,,0,(AREA QUESTION1),ATTEMPT,,RESULT,0,0


# Step 2: Building Features
Next, we calculate a feature matrix on the `transactions` entity to try to predict the outcome of a given transaction. It's at this step that our previous setup pays off: we can automatically calculate features using data from the whole `EntitySet`. 
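As a rough picture of what a feature built from a related entity looks like, the pandas sketch below computes a per-session mean and attaches it to each transaction. The data is invented, and the sketch deliberately ignores time; it is meant only to show the shape of a feature named like `sessions.MEAN(transactions.Duration (sec))`, not how Featuretools computes it.

```python
import pandas as pd

# Toy transactions table (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["a", "a", "b", "b"],
    "Duration (sec)": [10.0, 20.0, 5.0, 15.0],
})

# Aggregate over the child rows of each session ...
session_mean = (transactions.groupby("Session Id")["Duration (sec)"]
                .mean()
                .rename("sessions.MEAN(transactions.Duration (sec))"))

# ... then bring the aggregated value back down to each transaction.
feature_matrix = transactions.join(session_mean, on="Session Id")
```

Because this version uses every row in a session, including later ones, it would leak future information on time-ordered data.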

## Cutoff times
We are going to be generating features and doing predictive modeling on time-sensitive data. That comes with a high risk of label leakage. 

In this case, we are predicting whether a student will get a particular problem correct. For a fixed problem, the feature "There exists an attempt number three" would be highly predictive of the result on attempts one and two. There can only be a third attempt if the first two attempts were wrong! In that way, storing future `attempt` information in a feature to predict `Outcome` would yield higher test accuracy than the model deserves. It's not OK to have the feature "There exists an attempt number three" while predicting attempts one and two, because it contains information that cannot be known at that point in time.
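The toy example below illustrates this leak. The data is invented and assumes a student stops attempting a step once they answer correctly; the column names `student`, `attempt`, and `has_third_attempt` are hypothetical.

```python
import pandas as pd

# Toy attempt log: each student retries until they get the step right.
attempts = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s2", "s3", "s3"],
    "attempt": [1, 2, 3, 1, 1, 2],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Leaky feature: does this student ever reach a third attempt?
# It is computed from the whole log, including rows from the future.
max_attempt = attempts.groupby("student")["attempt"].transform("max")
attempts["has_third_attempt"] = max_attempt >= 3

# Earlier attempts of students who do reach attempt three.
leaky_rows = attempts[attempts["has_third_attempt"] & (attempts["attempt"] < 3)]
print(leaky_rows[["student", "attempt", "Outcome"]])
```

Every row in `leaky_rows` has `Outcome` equal to 0, so a model given `has_third_attempt` could "predict" those failures perfectly without learning anything useful.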

To circumvent that, we introduce the notion of [cutoff_times](https://docs.featuretools.com/automated_feature_engineering/handling_time.html). A `cutoff_time` has an index column and a datetime column indicating the last acceptable date we can use while generating features for a historical training example. We can also add in a label, which will be passed through Deep Feature Synthesis ([DFS](https://docs.featuretools.com/automated_feature_engineering/afe.html)) untouched so we can recover it later.

Setting cutoff times immediately mitigates the risk of mistakenly using future data, controls the number of predictions we make, and controls what data is used while calculating features.
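The effect of a cutoff time can be sketched in plain pandas by filtering on the time column before aggregating. This is an illustration on invented data, not Featuretools' implementation; the helper `mean_duration_at` is hypothetical, and keeping rows at or before the cutoff is an assumption of this sketch.

```python
import pandas as pd

# Toy transactions in a single session (values are made up).
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Session Id": ["a", "a", "a", "a"],
    "End Time": pd.to_datetime([
        "1996-02-01 00:00:00", "1996-02-01 00:00:05",
        "1996-02-01 00:00:09", "1996-02-01 00:00:12"]),
    "Duration (sec)": [4.0, 6.0, 8.0, 10.0],
})

def mean_duration_at(cutoff, session_id):
    """Mean duration in a session using only rows known at `cutoff`."""
    visible = transactions[(transactions["Session Id"] == session_id)
                           & (transactions["End Time"] <= cutoff)]
    return visible["Duration (sec)"].mean()

cutoff = pd.Timestamp("1996-02-01 00:00:05")
value_at_cutoff = mean_duration_at(cutoff, "a")       # uses t1 and t2 only
value_using_all = transactions["Duration (sec)"].mean()  # uses future rows too
```

The two values differ (5.0 versus 7.0 here), which is the gap between a feature computed as of a cutoff and one computed with hindsight.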


In [4]:
cutoff_times = es['transactions'].df[['Transaction Id', 'End Time', 'Outcome']]
cutoff_times.head()

Unnamed: 0,Transaction Id,End Time,Outcome
bdc63de3ac6ace889eed3850997ea333,bdc63de3ac6ace889eed3850997ea333,1996-02-01 00:00:00,0
34f5d8fcf207513d222bf6d1bced2046,34f5d8fcf207513d222bf6d1bced2046,1996-02-01 00:00:02,1
171f36289815d9782f8beab6ad8b0fac,171f36289815d9782f8beab6ad8b0fac,1996-02-01 00:00:02,0
5ad4ed1a55851edb89fc23d8ed4076e1,5ad4ed1a55851edb89fc23d8ed4076e1,1996-02-01 00:00:02,1
de0bfeea6b32cb37891c0b7dbc6b7a18,de0bfeea6b32cb37891c0b7dbc6b7a18,1996-02-01 00:00:02,1


With that in hand, we can guarantee that future values for `Outcome` won't be used for any calculations because we set the time index of `Outcome` to be after the cutoff time.

From there, we can call `ft.dfs` to generate our features and feature matrix. Deep Feature Synthesis creates features using reusable functions ([Primitives](https://docs.featuretools.com/automated_feature_engineering/primitives.html)). The algorithm attempts to combine primitives together with actual data to create a feature matrix. Here, we'll use the primitives `Sum`, `Mean`, `PercentTrue` and `Hour`. On an ordinary laptop, you should expect the following cell to take roughly 40 minutes to complete, as there are more than 3000 unique cutoff times. For faster results, uncomment the `approximate` line.

In [5]:
import featuretools as ft
from featuretools.selection import remove_low_information_features

import pandas as pd
import numpy as np
pd.options.display.max_columns = 500

fm, features = ft.dfs(entityset=es,
                      target_entity='transactions',
                      agg_primitives=['Sum', 'Mean', 'Percent_True'],
                      trans_primitives=['Hour'],
                      max_depth=3,
                      # approximate='2m',
                      cutoff_time=cutoff_times[1000:],
                      verbose=True)

# Encode the feature matrix using One-Hot encoding
fm_enc, f_enc = ft.encode_features(fm, features)
fm_enc = fm_enc.fillna(0)
fm_enc = remove_low_information_features(fm_enc)

# Pop the label
label = fm_enc.pop('Outcome')

fm.tail()

Built 464 features
Elapsed: 47:11:55 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks  


Unnamed: 0_level_0,Input,Selection,Tutor Response Type,Attempt At Step,Tutor Response Subtype,Problem View,Level (Unit),Time Zone,Help Level,Class,Student Response Type,Sample Name,Session Id,Step Name,Total Num Hints,Feedback Classification,Feedback Text,Student Response Subtype,Action,problem_steps.KC Category (hLFASearchModel1-back-context).1,problem_steps.KC Category (Area),problem_steps.KC (Lasso Model).1,problem_steps.CF (Factor repeat),problem_steps.KC (hLFASearchAICWholeModel3arith),problem_steps.KC (hLFASearchModel1-renamed-chgd2),problem_steps.KC Category (Geometry),problem_steps.KC Category (Lasso Model).4,problem_steps.KC (LuS-divide-compose-by-addition),problem_steps.KC Category (LFASearchAIC2_with_texkbook_new_decompose),problem_steps.KC Category (Orige-trap-merge),problem_steps.KC Category (LFASearchAICWholeModel2),problem_steps.KC (Textbook_New_Decompose),problem_steps.KC Category (hLFASearchModel1_context-single),problem_steps.KC (hLFASearchModel1-context).1,problem_steps.KC Category (hLFASearchModel1-back-context).2,problem_steps.KC (Textbook),problem_steps.KC (Item),problem_steps.CF (Factor add-or-m),problem_steps.KC Category (Single_Plus_Unique),problem_steps.KC Category (hLFASearchModel1-renamed-chgd2),problem_steps.KC (combineTraps),problem_steps.KC (Merge-Trap),problem_steps.KC Category (Textbook_New_Decompose),HOUR(End Time),problem_steps.KC (Concepts),problem_steps.KC Category (hLFASearchModel1-back-context),problem_steps.KC Category (Lasso Model).2,problem_steps.KC (Single_Plus_Unique),problem_steps.KC (xDecmpTrapCheat),problem_steps.KC (DecompArithDiam),problem_steps.KC (hLFASearchAICWholeModel3arith0),problem_steps.KC Category (hLFASearchModel1-backward).1,problem_steps.KC Category (Concepts),problem_steps.KC Category (WESST),problem_steps.KC Category (new1),problem_steps.KC Category (DecomposeArith),problem_steps.KC (Yu-Ju_Textbook_modified),problem_steps.KC Category (Orig-trap-merge),problem_steps.CF (Factor 
basic-shape),problem_steps.KC Category (hLFASearchModel1-context).1,problem_steps.KC Category (new),problem_steps.KC (Single_Plus_Unique).1,problem_steps.KC Category (hLFASearchAICWholeModel3arith2),problem_steps.KC (Circle-Collapse),problem_steps.KC Category (Original-test),HOUR(Problem Start Time),problem_steps.KC Category (Lasso Model),problem_steps.KC Category (Merge-Trap),problem_steps.KC Category (DecompArithDiam2),problem_steps.KC Category (xDecmpTrapCheat),problem_steps.CF (Factor required),problem_steps.KC (LFASearchAIC1_with_texkbook_new_decompose),problem_steps.KC Category (Item Model),problem_steps.KC Category (Original),problem_steps.KC (Lasso Model).4,problem_steps.KC (Lasso Model).5,problem_steps.KC (Geometry),problem_steps.KC (Orige-trap-merge),problem_steps.KC (WESST),problem_steps.KC Category (hLFASearchModel1-renamed-chgd),problem_steps.KC Category (Decompose),classes.School,problem_steps.KC (new KC model name),problem_steps.CF (Factor parallelogram-type),problem_steps.KC Category (new KC model name),problem_steps.CF (Factor trapezoid-part),problem_steps.KC Category (Lasso Model).5,problem_steps.CF (Factor embeddedness),problem_steps.KC Category (Circle-Collapse),problem_steps.KC (hLFASearchModel1_context-single),problem_steps.KC (Lasso Model),problem_steps.Problem Name,problem_steps.KC (DecompArithDiam2),problem_steps.KC (Original),problem_steps.KC (JohnsNewModel),problem_steps.KC Category (Lasso Model).1,problem_steps.KC (LFASearchAIC2_no_textbook_new_decompose),problem_steps.KC (hLFASearchModel1-context),problem_steps.KC (hLFASearchModel1-renamed-chgd),problem_steps.CF (Factor backward),problem_steps.KC Category (Yu-Ju_Textbook_modified2),problem_steps.KC (Original-test),problem_steps.KC Category (hLFASearchAICWholeModel3arith0),problem_steps.KC Category (JohnsNewModel),problem_steps.KC (Lasso Model).2,problem_steps.KC Category (Yu-Ju_Textbook_modified),problem_steps.KC Category (Lasso Model).3,problem_steps.KC 
(combineTraps-diffSize),problem_steps.KC Category (Textbook),problem_steps.KC (DecomposeArith),problem_steps.KC Category (Single-KC),problem_steps.CF (Factor cir-quad),problem_steps.KC (Lasso Model).3,problem_steps.KC Category (Unique-step),problem_steps.KC (Decompose_height),problem_steps.KC (hLFASearchAICWholeModel3arith2),problem_steps.CF (Factor embedd3-tri-reg_prob_fix),problem_steps.KC Category (hLFASearchModel1-renamed),problem_steps.KC Category (hLFASearchModel1-renamed-chgd3),problem_steps.KC (LFASearchAICWholeModel3),problem_steps.KC Category (Single_Plus_Unique).1,problem_steps.KC (hLFASearchModel1-back-context),problem_steps.CF (Factor base-or-height),problem_steps.KC (Single-KC),problem_steps.KC Category (LuS-divide-compose-by-addition),problem_steps.KC (Decompose),problem_steps.KC Category (original_geometryConcept),problem_steps.CF (Factor circle-goal),problem_steps.KC (original_geometryConcept),problem_steps.KC (Unique-step),problem_steps.KC Category (hLFASearchModel1-backward),problem_steps.KC (new trap merge),problem_steps.KC Category (Decompose_height),problem_steps.KC (hLFASearchModel1-renamed),sessions.Anon Student Id,problem_steps.KC (hLFASearchModel1-back-context).2,HOUR(Time),problem_steps.KC Category (DecompArithDiam),problem_steps.KC (LFASearchAICWholeModel2),problem_steps.KC Category (Textbook New),problem_steps.KC Category (LFASearchAICWholeModel3),problem_steps.KC Category (combineTraps),problem_steps.KC (Orig-trap-merge),problem_steps.CF (Factor figure-part),problem_steps.KC Category (new_),problem_steps.KC Category (new trap merge),problem_steps.KC Category (Item),problem_steps.KC (LFASearchAIC2_with_texkbook_new_decompose),problem_steps.KC (hLFASearchModel1-backward).1,problem_steps.CF (Factor base-formula-p),problem_steps.KC (Textbook New),problem_steps.KC (hLFASearchModel1-back-context).1,problem_steps.KC (new_),problem_steps.KC Category (combineTraps-diffSize),problem_steps.KC Category (importTest1),problem_steps.KC 
(hLFASearchModel1-renamed-chgd3),problem_steps.KC (Lasso Model).6,problem_steps.CF (Factor circle-given),problem_steps.KC (importTest1),problem_steps.KC (new),problem_steps.KC Category (hLFASearchAICWholeModel3arith),problem_steps.KC Category (LFASearchAIC2_no_textbook_new_decompose),problem_steps.KC (Area),problem_steps.KC (Yu-Ju_Textbook_modified2),problem_steps.KC Category (hLFASearchModel1-context),problem_steps.KC (LFASearchAIC1_no_textbook_new_decompose),problem_steps.KC Category (LFASearchAIC1_no_textbook_new_decompose),problem_steps.CF (Factor circle-formula),problem_steps.CF (Factor non-standard-orientation-or-shape),problem_steps.KC (hLFASearchModel1-backward),problem_steps.KC Category (modified),problem_steps.CF (Factor parallelogram),problem_steps.KC Category (LFASearchAIC1_with_texkbook_new_decompose),problem_steps.KC Category (Lasso Model).6,problem_steps.CF (Factor figure-type),problem_steps.KC (Item Model),problem_steps.KC (modified),problem_steps.KC (new1),sessions.MEAN(transactions.Input),problem_steps.SUM(transactions.Total Num Hints),problem_steps.MEAN(transactions.Tutor Response Subtype),problem_steps.MEAN(transactions.Is Last Attempt),problem_steps.MEAN(transactions.Feedback Text),sessions.MEAN(transactions.Is Last Attempt),classes.SUM(transactions.Duration (sec)),classes.MEAN(transactions.Is Last Attempt),classes.MEAN(transactions.Feedback Classification),problem_steps.SUM(transactions.Student Response Subtype),sessions.SUM(transactions.Tutor Response Subtype),classes.MEAN(transactions.Help Level),sessions.SUM(transactions.Feedback Classification),classes.PERCENT_TRUE(transactions.Outcome),problem_steps.SUM(transactions.Input),sessions.SUM(transactions.Input),sessions.MEAN(transactions.Total Num Hints),problem_steps.SUM(transactions.Is Last Attempt),problem_steps.MEAN(transactions.Feedback Classification),problem_steps.MEAN(transactions.Problem View),problem_steps.SUM(transactions.Action),classes.SUM(transactions.Problem 
View),classes.MEAN(transactions.Feedback Text),classes.SUM(transactions.Action),problem_steps.MEAN(transactions.Help Level),sessions.SUM(transactions.Student Response Subtype),sessions.PERCENT_TRUE(transactions.Outcome),sessions.MEAN(transactions.Duration (sec)),problem_steps.MEAN(transactions.Duration (sec)),classes.SUM(transactions.Help Level),classes.SUM(transactions.Feedback Text),sessions.SUM(transactions.Is Last Attempt),classes.SUM(transactions.Total Num Hints),classes.MEAN(transactions.Total Num Hints),problem_steps.MEAN(transactions.Student Response Subtype),problem_steps.SUM(transactions.Tutor Response Subtype),sessions.MEAN(transactions.Feedback Classification),sessions.SUM(transactions.Help Level),sessions.MEAN(transactions.Help Level),problem_steps.HOUR(first_transactions_time),classes.SUM(transactions.Tutor Response Subtype),classes.MEAN(transactions.Student Response Subtype),sessions.SUM(transactions.Total Num Hints),problem_steps.SUM(transactions.Duration (sec)),problem_steps.MEAN(transactions.Input),sessions.MEAN(transactions.Problem View),sessions.SUM(transactions.Problem View),sessions.MEAN(transactions.Tutor Response Subtype),classes.MEAN(transactions.Tutor Response Subtype),classes.MEAN(transactions.Duration (sec)),classes.SUM(transactions.Is Last Attempt),classes.SUM(transactions.Student Response Subtype),classes.MEAN(transactions.Problem View),sessions.SUM(transactions.Duration (sec)),problem_steps.SUM(transactions.Feedback Text),problem_steps.SUM(transactions.Problem View),problem_steps.MEAN(transactions.Action),sessions.HOUR(first_transactions_time),sessions.SUM(transactions.Action),problem_steps.PERCENT_TRUE(transactions.Outcome),classes.SUM(transactions.Input),sessions.MEAN(transactions.Student Response Subtype),sessions.MEAN(transactions.Action),sessions.SUM(transactions.Feedback Text),problem_steps.SUM(transactions.Feedback Classification),problem_steps.MEAN(transactions.Total Num Hints),sessions.MEAN(transactions.Feedback 
Text),classes.MEAN(transactions.Input),classes.MEAN(transactions.Action),classes.SUM(transactions.Feedback Classification),problem_steps.SUM(transactions.Help Level),problem_steps.problems.MEAN(problem_steps.KC Category (Original)),sessions.students.SUM(transactions.Tutor Response Subtype),problem_steps.problems.MEAN(problem_steps.KC Category (Textbook_New_Decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (Textbook New)),problem_steps.problems.MEAN(transactions.Student Response Subtype),classes.schools.SUM(transactions.Help Level),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAIC1_with_texkbook_new_decompose)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-context).1),sessions.students.MEAN(transactions.Duration (sec)),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAIC2_with_texkbook_new_decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (Geometry)),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1_context-single)),problem_steps.problems.SUM(problem_steps.KC Category (Textbook)),problem_steps.problems.SUM(problem_steps.CF (Factor backward)),problem_steps.problems.MEAN(problem_steps.KC Category (Merge-Trap)),problem_steps.problems.MEAN(problem_steps.KC Category (importTest1)),problem_steps.problems.MEAN(transactions.Action),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchAICWholeModel3arith0)),problem_steps.problems.SUM(transactions.Input),problem_steps.problems.MEAN(problem_steps.KC Category (modified)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchAICWholeModel3arith0)),problem_steps.problems.MEAN(problem_steps.KC Category (Orig-trap-merge)),problem_steps.problems.SUM(problem_steps.KC Category (importTest1)),sessions.students.MEAN(transactions.Student Response Subtype),sessions.students.MEAN(transactions.Input),problem_steps.problems.MEAN(problem_steps.KC Category 
(hLFASearchAICWholeModel3arith2)),classes.schools.MEAN(transactions.Is Last Attempt),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).6),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-renamed-chgd)),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model)),problem_steps.problems.SUM(transactions.Is Last Attempt),problem_steps.problems.MEAN(problem_steps.KC Category (DecompArithDiam)),problem_steps.problems.SUM(problem_steps.KC Category (Single_Plus_Unique)),problem_steps.problems.MEAN(problem_steps.KC Category (Original-test)),problem_steps.problems.MEAN(problem_steps.KC Category (Single_Plus_Unique)),problem_steps.problems.MEAN(problem_steps.KC Category (JohnsNewModel)),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).2),problem_steps.problems.SUM(problem_steps.KC Category (JohnsNewModel)),problem_steps.problems.SUM(problem_steps.KC Category (Orig-trap-merge)),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).4),problem_steps.problems.SUM(problem_steps.KC Category (DecompArithDiam2)),sessions.students.SUM(transactions.Problem View),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-renamed-chgd3)),problem_steps.problems.MEAN(problem_steps.KC Category (new_)),problem_steps.problems.MEAN(problem_steps.CF (Factor parallelogram)),problem_steps.problems.SUM(problem_steps.KC Category (xDecmpTrapCheat)),problem_steps.problems.SUM(transactions.Feedback Classification),problem_steps.problems.MEAN(problem_steps.KC Category (Decompose)),problem_steps.problems.SUM(problem_steps.KC Category (new1)),problem_steps.problems.SUM(problem_steps.KC Category (LuS-divide-compose-by-addition)),problem_steps.problems.SUM(transactions.Action),classes.schools.SUM(transactions.Student Response Subtype),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-context)),problem_steps.problems.SUM(transactions.Feedback Text),problem_steps.problems.SUM(problem_steps.KC 
Category (hLFASearchModel1-back-context).1),sessions.students.MEAN(transactions.Help Level),classes.schools.SUM(transactions.Is Last Attempt),problem_steps.problems.SUM(problem_steps.CF (Factor non-standard-orientation-or-shape)),problem_steps.problems.MEAN(problem_steps.KC Category (xDecmpTrapCheat)),problem_steps.problems.MEAN(transactions.Feedback Text),problem_steps.problems.SUM(problem_steps.KC Category (LFASearchAIC2_with_texkbook_new_decompose)),problem_steps.problems.MEAN(problem_steps.KC (new)),problem_steps.problems.MEAN(problem_steps.KC Category (Textbook)),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-backward).1),problem_steps.problems.MEAN(problem_steps.KC Category (Single_Plus_Unique).1),problem_steps.problems.SUM(problem_steps.KC Category (Single-KC)),classes.schools.SUM(transactions.Duration (sec)),sessions.students.MEAN(transactions.Is Last Attempt),problem_steps.problems.SUM(problem_steps.KC Category (LFASearchAIC1_no_textbook_new_decompose)),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).3),problem_steps.problems.SUM(problem_steps.KC Category (combineTraps)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-renamed)),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAIC2_no_textbook_new_decompose)),classes.schools.MEAN(transactions.Problem View),problem_steps.problems.MEAN(transactions.Feedback Classification),classes.schools.MEAN(transactions.Tutor Response Subtype),problem_steps.problems.PERCENT_TRUE(transactions.Outcome),problem_steps.problems.MEAN(problem_steps.KC Category (Orige-trap-merge)),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-renamed-chgd2)),sessions.students.MEAN(transactions.Tutor Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).1),problem_steps.problems.SUM(transactions.Tutor Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (new KC model 
name)),problem_steps.problems.MEAN(transactions.Input),sessions.students.HOUR(first_sessions_time),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).1),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-back-context).2),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchAICWholeModel3arith)),sessions.students.MEAN(transactions.Problem View),problem_steps.problems.MEAN(problem_steps.KC Category (Yu-Ju_Textbook_modified2)),problem_steps.problems.SUM(problem_steps.KC Category (Textbook New)),classes.schools.SUM(transactions.Problem View),problem_steps.problems.SUM(problem_steps.KC Category (LFASearchAIC2_no_textbook_new_decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (DecompArithDiam2)),sessions.students.SUM(transactions.Input),problem_steps.problems.SUM(problem_steps.KC Category (Single_Plus_Unique).1),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAIC1_no_textbook_new_decompose)),problem_steps.problems.MEAN(transactions.Help Level),problem_steps.problems.SUM(problem_steps.KC Category (LFASearchAICWholeModel2)),classes.schools.MEAN(transactions.Duration (sec)),classes.schools.MEAN(transactions.Input),problem_steps.problems.MEAN(problem_steps.KC Category (Single-KC)),problem_steps.problems.SUM(problem_steps.KC Category (Area)),classes.schools.SUM(transactions.Feedback Classification),problem_steps.problems.HOUR(first_problem_steps_time),sessions.students.SUM(transactions.Feedback Text),problem_steps.problems.MEAN(problem_steps.KC Category (Item)),problem_steps.problems.SUM(problem_steps.KC Category (Yu-Ju_Textbook_modified2)),problem_steps.problems.SUM(problem_steps.KC Category (Orige-trap-merge)),classes.schools.MEAN(transactions.Student Response Subtype),problem_steps.problems.MEAN(problem_steps.KC Category (new KC model name)),classes.schools.PERCENT_TRUE(transactions.Outcome),classes.schools.SUM(transactions.Action),classes.schools.MEAN(transactions.Feedback 
Text),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-renamed)),problem_steps.problems.SUM(transactions.Problem View),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model)),problem_steps.problems.MEAN(problem_steps.KC Category (new trap merge)),problem_steps.problems.MEAN(problem_steps.CF (Factor non-standard-orientation-or-shape)),sessions.students.SUM(transactions.Total Num Hints),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-backward)),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).5),problem_steps.problems.MEAN(problem_steps.KC Category (Area)),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-back-context).1),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-context).1),problem_steps.problems.MEAN(problem_steps.KC Category (WESST)),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAICWholeModel2)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-backward)),problem_steps.problems.SUM(transactions.Total Num Hints),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).5),classes.schools.SUM(transactions.Total Num Hints),problem_steps.problems.MEAN(problem_steps.KC Category (LFASearchAICWholeModel3)),problem_steps.problems.MEAN(problem_steps.KC Category (Item Model)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-renamed-chgd3)),problem_steps.problems.SUM(problem_steps.CF (Factor embedd3-tri-reg_prob_fix)),problem_steps.problems.MEAN(problem_steps.KC Category (LuS-divide-compose-by-addition)),sessions.students.SUM(transactions.Is Last Attempt),problem_steps.problems.SUM(problem_steps.KC Category (Decompose_height)),problem_steps.problems.MEAN(problem_steps.KC Category (original_geometryConcept)),classes.schools.SUM(transactions.Input),problem_steps.problems.SUM(problem_steps.KC Category (original_geometryConcept)),classes.schools.SUM(transactions.Tutor Response 
Subtype),problem_steps.problems.SUM(problem_steps.KC Category (WESST)),sessions.students.MEAN(transactions.Total Num Hints),problem_steps.problems.MEAN(problem_steps.KC Category (combineTraps-diffSize)),sessions.students.SUM(transactions.Duration (sec)),problem_steps.problems.MEAN(problem_steps.KC Category (Concepts)),problem_steps.problems.MEAN(problem_steps.CF (Factor backward)),problem_steps.problems.MEAN(problem_steps.KC Category (Unique-step)),problem_steps.problems.SUM(problem_steps.KC Category (Circle-Collapse)),problem_steps.problems.SUM(problem_steps.CF (Factor parallelogram)),sessions.students.SUM(transactions.Student Response Subtype),classes.schools.MEAN(transactions.Feedback Classification),problem_steps.problems.SUM(problem_steps.KC Category (Yu-Ju_Textbook_modified)),problem_steps.problems.SUM(problem_steps.KC Category (Unique-step)),problem_steps.problems.MEAN(transactions.Is Last Attempt),sessions.students.SUM(transactions.Action),problem_steps.problems.MEAN(problem_steps.KC Category (combineTraps)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-backward).1),problem_steps.problems.SUM(problem_steps.KC Category (Item Model)),problem_steps.problems.SUM(problem_steps.KC Category (new trap merge)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchAICWholeModel3arith2)),sessions.students.PERCENT_TRUE(transactions.Outcome),problem_steps.problems.SUM(problem_steps.KC Category (Concepts)),classes.schools.MEAN(transactions.Action),problem_steps.problems.MEAN(transactions.Duration (sec)),classes.schools.SUM(transactions.Feedback Text),problem_steps.problems.SUM(problem_steps.KC Category (combineTraps-diffSize)),problem_steps.problems.MEAN(problem_steps.KC Category (Decompose_height)),problem_steps.problems.SUM(problem_steps.KC Category (DecomposeArith)),problem_steps.problems.MEAN(problem_steps.KC Category (Yu-Ju_Textbook_modified)),problem_steps.problems.SUM(problem_steps.KC Category 
(new_)),sessions.students.MEAN(transactions.Action),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1_context-single)),classes.schools.MEAN(transactions.Total Num Hints),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).2),problem_steps.problems.SUM(transactions.Student Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (Geometry)),problem_steps.problems.SUM(problem_steps.KC Category (Merge-Trap)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchAICWholeModel3arith)),problem_steps.problems.SUM(problem_steps.KC (new)),problem_steps.problems.SUM(problem_steps.KC Category (Textbook_New_Decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (new)),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).4),problem_steps.problems.SUM(problem_steps.KC Category (new)),problem_steps.problems.MEAN(problem_steps.KC Category (Lasso Model).3),problem_steps.problems.SUM(transactions.Duration (sec)),problem_steps.problems.SUM(problem_steps.KC Category (DecompArithDiam)),problem_steps.problems.MEAN(problem_steps.KC Category (Circle-Collapse)),problem_steps.problems.SUM(problem_steps.KC Category (Original)),sessions.students.MEAN(transactions.Feedback Classification),problem_steps.problems.SUM(problem_steps.KC Category (modified)),problem_steps.problems.MEAN(problem_steps.KC Category (new1)),problem_steps.problems.MEAN(transactions.Total Num Hints),problem_steps.problems.SUM(problem_steps.KC Category (LFASearchAICWholeModel3)),sessions.students.MEAN(transactions.Feedback Text),classes.schools.MEAN(transactions.Help Level),problem_steps.problems.MEAN(transactions.Tutor Response Subtype),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-back-context)),problem_steps.problems.MEAN(problem_steps.CF (Factor embedd3-tri-reg_prob_fix)),problem_steps.problems.SUM(transactions.Help Level),problem_steps.problems.SUM(problem_steps.KC Category 
(LFASearchAIC1_with_texkbook_new_decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (DecomposeArith)),sessions.students.SUM(transactions.Feedback Classification),problem_steps.problems.MEAN(transactions.Problem View),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-context)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-back-context).2),problem_steps.problems.SUM(problem_steps.KC Category (Item)),problem_steps.problems.SUM(problem_steps.KC Category (Original-test)),problem_steps.problems.SUM(problem_steps.KC Category (Lasso Model).6),problem_steps.problems.SUM(problem_steps.KC Category (Decompose)),problem_steps.problems.MEAN(problem_steps.KC Category (hLFASearchModel1-back-context)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-renamed-chgd)),problem_steps.problems.SUM(problem_steps.KC Category (hLFASearchModel1-renamed-chgd2)),sessions.students.SUM(transactions.Help Level),Outcome
Transaction Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 
100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1,Unnamed: 129_level_1,Unnamed: 130_level_1,Unnamed: 131_level_1,Unnamed: 132_level_1,Unnamed: 133_level_1,Unnamed: 134_level_1,Unnamed: 135_level_1,Unnamed: 136_level_1,Unnamed: 137_level_1,Unnamed: 138_level_1,Unnamed: 139_level_1,Unnamed: 140_level_1,Unnamed: 141_level_1,Unnamed: 142_level_1,Unnamed: 143_level_1,Unnamed: 144_level_1,Unnamed: 145_level_1,Unnamed: 146_level_1,Unnamed: 147_level_1,Unnamed: 148_level_1,Unnamed: 149_level_1,Unnamed: 150_level_1,Unnamed: 151_level_1,Unnamed: 152_level_1,Unnamed: 153_level_1,Unnamed: 154_level_1,Unnamed: 155_level_1,Unnamed: 156_level_1,Unnamed: 157_level_1,Unnamed: 158_level_1,Unnamed: 159_level_1,Unnamed: 160_level_1,Unnamed: 161_level_1,Unnamed: 162_level_1,Unnamed: 163_level_1,Unnamed: 164_level_1,Unnamed: 165_level_1,Unnamed: 166_level_1,Unnamed: 167_level_1,Unnamed: 168_level_1,Unnamed: 169_level_1,Unnamed: 170_level_1,Unnamed: 171_level_1,Unnamed: 172_level_1,Unnamed: 173_level_1,Unnamed: 174_level_1,Unnamed: 175_level_1,Unnamed: 176_level_1,Unnamed: 177_level_1,Unnamed: 178_level_1,Unnamed: 179_level_1,Unnamed: 180_level_1,Unnamed: 181_level_1,Unnamed: 182_level_1,Unnamed: 183_level_1,Unnamed: 184_level_1,Unnamed: 185_level_1,Unnamed: 186_level_1,Unnamed: 187_level_1,Unnamed: 188_level_1,Unnamed: 189_level_1,Unnamed: 190_level_1,Unnamed: 191_level_1,Unnamed: 192_level_1,Unnamed: 193_level_1,Unnamed: 194_level_1,Unnamed: 
195_level_1,Unnamed: 196_level_1,Unnamed: 197_level_1,Unnamed: 198_level_1,Unnamed: 199_level_1,Unnamed: 200_level_1,Unnamed: 201_level_1,Unnamed: 202_level_1,Unnamed: 203_level_1,Unnamed: 204_level_1,Unnamed: 205_level_1,Unnamed: 206_level_1,Unnamed: 207_level_1,Unnamed: 208_level_1,Unnamed: 209_level_1,Unnamed: 210_level_1,Unnamed: 211_level_1,Unnamed: 212_level_1,Unnamed: 213_level_1,Unnamed: 214_level_1,Unnamed: 215_level_1,Unnamed: 216_level_1,Unnamed: 217_level_1,Unnamed: 218_level_1,Unnamed: 219_level_1,Unnamed: 220_level_1,Unnamed: 221_level_1,Unnamed: 222_level_1,Unnamed: 223_level_1,Unnamed: 224_level_1,Unnamed: 225_level_1,Unnamed: 226_level_1,Unnamed: 227_level_1,Unnamed: 228_level_1,Unnamed: 229_level_1,Unnamed: 230_level_1,Unnamed: 231_level_1,Unnamed: 232_level_1,Unnamed: 233_level_1,Unnamed: 234_level_1,Unnamed: 235_level_1,Unnamed: 236_level_1,Unnamed: 237_level_1,Unnamed: 238_level_1,Unnamed: 239_level_1,Unnamed: 240_level_1,Unnamed: 241_level_1,Unnamed: 242_level_1,Unnamed: 243_level_1,Unnamed: 244_level_1,Unnamed: 245_level_1,Unnamed: 246_level_1,Unnamed: 247_level_1,Unnamed: 248_level_1,Unnamed: 249_level_1,Unnamed: 250_level_1,Unnamed: 251_level_1,Unnamed: 252_level_1,Unnamed: 253_level_1,Unnamed: 254_level_1,Unnamed: 255_level_1,Unnamed: 256_level_1,Unnamed: 257_level_1,Unnamed: 258_level_1,Unnamed: 259_level_1,Unnamed: 260_level_1,Unnamed: 261_level_1,Unnamed: 262_level_1,Unnamed: 263_level_1,Unnamed: 264_level_1,Unnamed: 265_level_1,Unnamed: 266_level_1,Unnamed: 267_level_1,Unnamed: 268_level_1,Unnamed: 269_level_1,Unnamed: 270_level_1,Unnamed: 271_level_1,Unnamed: 272_level_1,Unnamed: 273_level_1,Unnamed: 274_level_1,Unnamed: 275_level_1,Unnamed: 276_level_1,Unnamed: 277_level_1,Unnamed: 278_level_1,Unnamed: 279_level_1,Unnamed: 280_level_1,Unnamed: 281_level_1,Unnamed: 282_level_1,Unnamed: 283_level_1,Unnamed: 284_level_1,Unnamed: 285_level_1,Unnamed: 286_level_1,Unnamed: 287_level_1,Unnamed: 288_level_1,Unnamed: 289_level_1,Unnamed: 
290_level_1,Unnamed: 291_level_1,Unnamed: 292_level_1,Unnamed: 293_level_1,Unnamed: 294_level_1,Unnamed: 295_level_1,Unnamed: 296_level_1,Unnamed: 297_level_1,Unnamed: 298_level_1,Unnamed: 299_level_1,Unnamed: 300_level_1,Unnamed: 301_level_1,Unnamed: 302_level_1,Unnamed: 303_level_1,Unnamed: 304_level_1,Unnamed: 305_level_1,Unnamed: 306_level_1,Unnamed: 307_level_1,Unnamed: 308_level_1,Unnamed: 309_level_1,Unnamed: 310_level_1,Unnamed: 311_level_1,Unnamed: 312_level_1,Unnamed: 313_level_1,Unnamed: 314_level_1,Unnamed: 315_level_1,Unnamed: 316_level_1,Unnamed: 317_level_1,Unnamed: 318_level_1,Unnamed: 319_level_1,Unnamed: 320_level_1,Unnamed: 321_level_1,Unnamed: 322_level_1,Unnamed: 323_level_1,Unnamed: 324_level_1,Unnamed: 325_level_1,Unnamed: 326_level_1,Unnamed: 327_level_1,Unnamed: 328_level_1,Unnamed: 329_level_1,Unnamed: 330_level_1,Unnamed: 331_level_1,Unnamed: 332_level_1,Unnamed: 333_level_1,Unnamed: 334_level_1,Unnamed: 335_level_1,Unnamed: 336_level_1,Unnamed: 337_level_1,Unnamed: 338_level_1,Unnamed: 339_level_1,Unnamed: 340_level_1,Unnamed: 341_level_1,Unnamed: 342_level_1,Unnamed: 343_level_1,Unnamed: 344_level_1,Unnamed: 345_level_1,Unnamed: 346_level_1,Unnamed: 347_level_1,Unnamed: 348_level_1,Unnamed: 349_level_1,Unnamed: 350_level_1,Unnamed: 351_level_1,Unnamed: 352_level_1,Unnamed: 353_level_1,Unnamed: 354_level_1,Unnamed: 355_level_1,Unnamed: 356_level_1,Unnamed: 357_level_1,Unnamed: 358_level_1,Unnamed: 359_level_1,Unnamed: 360_level_1,Unnamed: 361_level_1,Unnamed: 362_level_1,Unnamed: 363_level_1,Unnamed: 364_level_1,Unnamed: 365_level_1,Unnamed: 366_level_1,Unnamed: 367_level_1,Unnamed: 368_level_1,Unnamed: 369_level_1,Unnamed: 370_level_1,Unnamed: 371_level_1,Unnamed: 372_level_1,Unnamed: 373_level_1,Unnamed: 374_level_1,Unnamed: 375_level_1,Unnamed: 376_level_1,Unnamed: 377_level_1,Unnamed: 378_level_1,Unnamed: 379_level_1,Unnamed: 380_level_1,Unnamed: 381_level_1,Unnamed: 382_level_1,Unnamed: 383_level_1,Unnamed: 384_level_1,Unnamed: 
385_level_1,Unnamed: 386_level_1,Unnamed: 387_level_1,Unnamed: 388_level_1,Unnamed: 389_level_1,Unnamed: 390_level_1,Unnamed: 391_level_1,Unnamed: 392_level_1,Unnamed: 393_level_1,Unnamed: 394_level_1,Unnamed: 395_level_1,Unnamed: 396_level_1,Unnamed: 397_level_1,Unnamed: 398_level_1,Unnamed: 399_level_1,Unnamed: 400_level_1,Unnamed: 401_level_1,Unnamed: 402_level_1,Unnamed: 403_level_1,Unnamed: 404_level_1,Unnamed: 405_level_1,Unnamed: 406_level_1,Unnamed: 407_level_1,Unnamed: 408_level_1,Unnamed: 409_level_1,Unnamed: 410_level_1,Unnamed: 411_level_1,Unnamed: 412_level_1,Unnamed: 413_level_1,Unnamed: 414_level_1,Unnamed: 415_level_1,Unnamed: 416_level_1,Unnamed: 417_level_1,Unnamed: 418_level_1,Unnamed: 419_level_1,Unnamed: 420_level_1,Unnamed: 421_level_1,Unnamed: 422_level_1,Unnamed: 423_level_1,Unnamed: 424_level_1,Unnamed: 425_level_1,Unnamed: 426_level_1,Unnamed: 427_level_1,Unnamed: 428_level_1,Unnamed: 429_level_1,Unnamed: 430_level_1,Unnamed: 431_level_1,Unnamed: 432_level_1,Unnamed: 433_level_1,Unnamed: 434_level_1,Unnamed: 435_level_1,Unnamed: 436_level_1,Unnamed: 437_level_1,Unnamed: 438_level_1,Unnamed: 439_level_1,Unnamed: 440_level_1,Unnamed: 441_level_1,Unnamed: 442_level_1,Unnamed: 443_level_1,Unnamed: 444_level_1,Unnamed: 445_level_1,Unnamed: 446_level_1,Unnamed: 447_level_1,Unnamed: 448_level_1,Unnamed: 449_level_1,Unnamed: 450_level_1,Unnamed: 451_level_1,Unnamed: 452_level_1,Unnamed: 453_level_1,Unnamed: 454_level_1,Unnamed: 455_level_1,Unnamed: 456_level_1,Unnamed: 457_level_1,Unnamed: 458_level_1,Unnamed: 459_level_1,Unnamed: 460_level_1,Unnamed: 461_level_1,Unnamed: 462_level_1,Unnamed: 463_level_1,Unnamed: 464_level_1,Unnamed: 465_level_1
14e2e9870f1b6b6d6f02fea8cd1248ad,,(SQUARE-AREA QUESTION2),RESULT,1,,2,Area,US/Eastern,,,ATTEMPT,All Data,GEO-408d5ed7:10e14be5d3a:-63b8,(SQUARE-AREA QUESTION2),,,,,,,,Merge-Trap-ALT:PARALLELOGRAM-AREA,repeat,Geometry*parallelogram-area,parallelogram-area,,,parallelogram-area,,,,parallelogram-area,,,,square-area,POGS(SQUARE-AREA QUESTION2),0,,,Geometry*parallelogram-area,ALT:PARALLELOGRAM-AREA,,,square,,,KC59,parallelogram-area,parallelogram-area,Geometry*parallelogram-area,,,,,,square-area,,parallelogram,,,Single-KC,,parallelogram-area,,1,,,,,additional,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,,,,,Geometry,ALT:PARALLELOGRAM-AREA,square-rectangle-area,,,,ALT:PARALLELOGRAM,square,,0,,embedded,,parallelogram-area,DecomposeArith-parallelogram-area,POGS,parallelogram-area,ALT:PARALLELOGRAM-AREA,square-area,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,parallelogram-area,parallelogram-area,0,,ALT:PARALLELOGRAM-AREA,,,,,,Geometry*parallelogram-area,,parallelogram-area,,quad,,,square-area,Geometry*parallelogram-area,0,,,Geometry*parallelogram-area,,parallelogram-area,0,Single-KC,,square-area,,0,ALT:area,KC59,,ALT:PARALLELOGRAM-AREA,,parallelogram-area,Stu_ad3610752c4af1c3cac6638ef588e02b,,1,,Geometry*parallelogram-area,,,,ALT:PARALLELOGRAM-AREA,area,,,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,parallelogram-area,f,parallelogram-area,,KC59,,,parallelogram-area,,0,Geometry*parallelogram-area,90.588235,,,Non-area formula,square-area & circle area composition,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,,0,0,forward,,1,,,square,POGS (SQUARE-AREA 
QUESTION2),,,,0.0,,0.758621,,0.585,,,,0.0,0.0,,0.0,,0.0,0.0,,44.0,,1.322034,0.0,,,,,0.0,0.836634,24.105,15.793103,,,117.0,,,,0.0,,0.0,,0,,,0.0,916.0,,1.237624,250,,,,,,,4821.0,0.0,78,,0,0.0,0.932203,,,,0.0,0.0,,,,,,0.0,,0.0,,,,,,0.0,24.105,,,,0.0,3,,,,,0.0,,0.0,,0.0,,,,,,,,306.0,,0.0,,,,,0.0,0.0,,0.0,250,,,0.25,0.0,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0,,,0.0,94.970081,,,,0.0,,0.585,0.0,0.0,0.0,0.0,,,,,0.856566,,,,0.0,0.0,0.0,,0,,,,1.237624,,0.0,,0.0,,0.0,0.0,,,0.0,,,,0.0,,0,0.0,,0.0,0.0,,,,,,,681,0.0,,0,0.0,,,,,,,,0.0,0.0,0.0,,,,0.0,0,,117.0,0.0,,,0.0,,0.0,,,4821.0,,0.375,,0.0,2,0.0,,0.0,0.0,0.62069,0.0,,0.0,0.0,0.0,0.0,0.836634,0.0,,11.361055,,0.0,,0.0,,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,759.760649,0.0,,0.0,0.0,,5601.0,0.0,,0.0,,0.0,,,0.0,,,,0.0,0,0.0,0.0,,0.0,1.375758,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1
e55b242b3a622d8bb742c8219d123505,,(SQUARE-AREA QUESTION3),RESULT,1,,2,Area,US/Eastern,,,ATTEMPT,All Data,GEO-408d5ed7:10e14be5d3a:-63b8,(SQUARE-AREA QUESTION3),,,,,,,,Merge-Trap-ALT:PARALLELOGRAM-AREA,repeat,Geometry*parallelogram-area,parallelogram-area,,,parallelogram-area,,,,parallelogram-area,,,,square-area,POGS(SQUARE-AREA QUESTION3),0,,,Geometry*parallelogram-area,ALT:PARALLELOGRAM-AREA,,,square,,,KC50,parallelogram-area,parallelogram-area,Geometry*parallelogram-area,,,,,,square-area,,parallelogram,,,Single-KC,,parallelogram-area,,1,,,,,additional,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,,,,,Geometry,ALT:PARALLELOGRAM-AREA,square-rectangle-area,,,,ALT:PARALLELOGRAM,square,,0,,embedded,,parallelogram-area,DecomposeArith-parallelogram-area,POGS,parallelogram-area,ALT:PARALLELOGRAM-AREA,square-area,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,parallelogram-area,parallelogram-area,0,,ALT:PARALLELOGRAM-AREA,,,,,,Geometry*parallelogram-area,,parallelogram-area,,quad,,,square-area,Geometry*parallelogram-area,0,,,Geometry*parallelogram-area,,parallelogram-area,0,Single-KC,,square-area,,0,ALT:area,KC50,,ALT:PARALLELOGRAM-AREA,,parallelogram-area,Stu_ad3610752c4af1c3cac6638ef588e02b,,1,,Geometry*parallelogram-area,,,,ALT:PARALLELOGRAM-AREA,area,,,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,parallelogram-area,f,parallelogram-area,,KC50,,,parallelogram-area,,0,Geometry*parallelogram-area,90.588235,,,Non-area formula,square-area,,all*parallelogram-area*ALT:PARALLELOGRAM-AREA,,0,0,forward,,1,,,square,POGS (SQUARE-AREA 
QUESTION3),,,,0.0,,0.758621,,0.587065,,,,0.0,0.0,,0.0,,0.0,0.0,,44.0,,1.305085,0.0,,,,,0.0,0.841584,24.0199,11.672414,,,118.0,,,,0.0,,0.0,,0,,,0.0,677.0,,1.237624,250,,,,,,,4828.0,0.0,77,,0,0.0,0.932203,,,,0.0,0.0,,,,,,0.0,,0.0,,,,,,0.0,24.0199,,,,0.0,3,,,,,0.0,,0.0,,0.0,,,,,,,,307.0,,0.0,,,,,0.0,0.0,,0.0,250,,,0.25,0.0,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0,,,0.0,94.970081,,,,0.0,,0.587065,0.0,0.0,0.0,0.0,,,,,0.858586,,,,0.0,0.0,0.0,,0,,,,1.237624,,0.0,,0.0,,0.0,0.0,,,0.0,,,,0.0,,0,0.0,,0.0,0.0,,,,,,,681,0.0,,0,0.0,,,,,,,,0.0,0.0,0.0,,,,0.0,0,,118.0,0.0,,,0.0,,0.0,,,4828.0,,0.375,,0.0,2,0.0,,0.0,0.0,0.621457,0.0,,0.0,0.0,0.0,0.0,0.841584,0.0,,11.352227,,0.0,,0.0,,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,759.760649,0.0,,0.0,0.0,,5608.0,0.0,,0.0,,0.0,,,0.0,,,,0.0,0,0.0,0.0,,0.0,1.375758,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1
49ca43d57e9f630d6564e6ff07a944be,,(SCRAP-METAL-AREA QUESTION1),RESULT,1,,2,Area,US/Eastern,,,ATTEMPT,All Data,GEO-408d5ed7:10e14be5d3a:-63b8,(SCRAP-METAL-AREA QUESTION1),,,,,,,,DecompArithDiam2-decompose,initial,Geometry*decomp-trap,decomp-trap,,,compose-by-addition-easy,,,,decompose,,,,compose-by-addition,POGS(SCRAP-METAL-AREA QUESTION1),a,,,Geometry*decomp-trap*Geometry*decomp-trap*trap...,ALT:COMPOSE-BY-ADDITION,,,compose-by-addition,,,KC94,decomp-trap,decompose,Geometry*decomp-trap,,,,,,square-area & circle area composition,,0,,,Single-KC,,compose-by-addition,,1,,,,,additional,all*decomp-trap*decompose,,,new trap merge-ALT:COMPOSE-BY-ADDITION,textbook2-compose-by-addition,Geometry,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,,,ALT:COMPOSE-BY-ADDITION,0,,0,,embedded,,decomp-trap,Area-Area formula,POGS,decompose,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,all*decomp-trap*decompose,decomp-trap,decomp-trap,1,,ALT:COMPOSE-BY-ADDITION,,,Decompose-decompose,,,Geometry*decomp-trap*Geometry*decomp-trap*trap...,,decompose,,0,Item-POGS(SCRAP-METAL-AREA QUESTION1),,decompose,Geometry*decomp-trap,0,,,Geometry*decomp-trap,,decomp-trap,0,Single-KC,,decompose,,0,ALT:COMPOSE-BY-ADDITION,KC94,,ALT:COMPOSE-BY-ADDITION,,decomp-trap,Stu_ad3610752c4af1c3cac6638ef588e02b,,1,,Geometry*decomp-trap,,,,ALT:COMPOSE-BY-ADDITION,area-difference,,,,all*decomp-trap*decompose,forward,no-f,compose-by-addition,,KC94,,,decomp-trap,xDecmpTrapCheat-decomp-trap,0,Geometry*decomp-trap,103.448276,,,Area formula,square-area & circle area composition,,all*decomp-trap*decompose,,0,0,decomp-trap,,0,,,0,POGS (SCRAP-METAL-AREA 
QUESTION1),,,,0.0,,0.528846,,0.589109,,,,0.0,0.0,,0.0,,0.0,0.0,,55.0,,1.333333,0.0,,,,,0.0,0.842365,23.935644,6.509615,,,119.0,,,,0.0,,0.0,,0,,,0.0,677.0,,1.241379,252,,,,,,,4835.0,0.0,140,,0,0.0,0.72381,,,,0.0,0.0,,,,,,0.0,,0.0,,,,,,0.0,23.935644,,,,0.0,3,,,,,0.0,,0.0,,0.0,,,,,,,,308.0,,0.0,,,,,0.0,0.0,,0.0,252,,,0.25,0.0,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0,,,0.0,94.970081,,,,0.0,,0.589109,0.0,0.0,0.0,0.0,,,,,0.858871,,,,0.0,0.0,0.0,,0,,,,1.241379,,0.0,,0.0,,0.0,0.0,,,0.0,,,,0.0,,0,0.0,,0.0,0.0,,,,,,,683,0.0,,0,0.0,,,,,,,,0.0,0.0,0.0,,,,0.0,0,,119.0,0.0,,,0.0,,0.0,,,4835.0,,0.375,,0.0,2,0.0,,0.0,0.0,0.622222,0.0,,0.0,0.0,0.0,0.0,0.842365,0.0,,11.343434,,0.0,,0.0,,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,759.760649,0.0,,0.0,0.0,,5615.0,0.0,,0.0,,0.0,,,0.0,,,,0.0,0,0.0,0.0,,0.0,1.377016,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1
c2c7a400ebe3ed5c82a09b1c86b4de8e,,(SCRAP-METAL-AREA QUESTION2),RESULT,1,,2,Area,US/Eastern,,,ATTEMPT,All Data,GEO-408d5ed7:10e14be5d3a:-63b8,(SCRAP-METAL-AREA QUESTION2),,,,,,,,DecompArithDiam2-arithmetic,initial,compose-subtract,subtract,,,compose-by-addition-easy,,,,decompose,,,,compose-by-addition,POGS(SCRAP-METAL-AREA QUESTION2),a,,,compose-subtract,ALT:COMPOSE-BY-ADDITION,,,compose-by-addition,,,KC33,Subtract,Subtract,compose-subtract,,,,,,square-area & circle area composition,,0,,,Single-KC,,compose-by-addition,,1,,,,,additional,all*Subtract,,,Item-POGS(SCRAP-METAL-AREA QUESTION2),new trap merge-ALT:COMPOSE-BY-ADDITION,Geometry,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,,,ALT:COMPOSE-BY-ADDITION,0,,0,,embedded,,Subtract,Area-Area formula,POGS,arithmetic,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,all*Subtract,Subtract,subtract,1,,ALT:COMPOSE-BY-ADDITION,,,Decompose-decompose,,,compose-subtract,,Subtract,,0,DecomposeArith-Subtract,,decompose,compose-subtract,0,,,Geometry*Subtract,,Subtract,0,Single-KC,,decompose,,0,ALT:COMPOSE-BY-ADDITION,KC33,,ALT:COMPOSE-BY-ADDITION,,Subtract1,Stu_ad3610752c4af1c3cac6638ef588e02b,,1,,Geometry*Subtract,,,,ALT:COMPOSE-BY-ADDITION,area-difference,,,,all*Subtract,forward,no-f,compose-by-addition,,KC33,,,subtract,,0,compose-subtract,101.724138,,,Area formula,square-area & circle area composition,,all*Subtract,,0,0,Subtract,,0,,,0,POGS (SCRAP-METAL-AREA 
QUESTION2),,,,0.0,,0.595745,,0.591133,,,,0.0,0.0,,0.0,,0.0,0.0,,56.0,,1.368421,0.0,,,,,0.0,0.843137,23.857143,14.404255,,,120.0,,,,0.0,,0.0,,0,,,0.0,1354.0,,1.245098,254,,,,,,,4843.0,0.0,130,,0,0.0,0.810526,,,,0.0,0.0,,,,,,0.0,,0.0,,,,,,0.0,23.857143,,,,0.0,3,,,,,0.0,,0.0,,0.0,,,,,,,,309.0,,0.0,,,,,0.0,0.0,,0.0,254,,,0.25,0.0,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0,,,0.0,94.970081,,,,0.0,,0.591133,0.0,0.0,0.0,0.0,,,,,0.859155,,,,0.0,0.0,0.0,,0,,,,1.245098,,0.0,,0.0,,0.0,0.0,,,0.0,,,,0.0,,0,0.0,,0.0,0.0,,,,,,,685,0.0,,0,0.0,,,,,,,,0.0,0.0,0.0,,,,0.0,0,,120.0,0.0,,,0.0,,0.0,,,4843.0,,0.375,,0.0,2,0.0,,0.0,0.0,0.622984,0.0,,0.0,0.0,0.0,0.0,0.843137,0.0,,11.336694,,0.0,,0.0,,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,759.760649,0.0,,0.0,0.0,,5623.0,0.0,,0.0,,0.0,,,0.0,,,,0.0,0,0.0,0.0,,0.0,1.37827,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1
3159d6a4d2dbc03c50f000ac6e208c67,,(SCRAP-METAL-AREA QUESTION3),RESULT,1,,2,Area,US/Eastern,,,ATTEMPT,All Data,GEO-408d5ed7:10e14be5d3a:-63b8,(SCRAP-METAL-AREA QUESTION3),,,,,,,,DecompArithDiam2-arithmetic,initial,compose-subtract,subtract,,,compose-by-addition-easy,,,,decompose,,,,compose-by-addition,POGS(SCRAP-METAL-AREA QUESTION3),a,,,compose-subtract,ALT:COMPOSE-BY-ADDITION,,,compose-by-addition,,,KC77,Subtract,Subtract,compose-subtract,,,,,,square-area & circle area composition,,0,,,Single-KC,,compose-by-addition,,1,,,,,additional,all*Subtract,,,new trap merge-ALT:COMPOSE-BY-ADDITION,,Geometry,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,,,ALT:COMPOSE-BY-ADDITION,0,,0,,embedded,,Subtract,Area-Area formula,POGS,arithmetic,ALT:COMPOSE-BY-ADDITION,compose-by-addition,,all*Subtract,Subtract,subtract,1,,ALT:COMPOSE-BY-ADDITION,,,Decompose-decompose,,,compose-subtract,,Subtract,,0,DecomposeArith-Subtract,,decompose,compose-subtract,0,,,Geometry*Subtract,,Subtract,0,Single-KC,,decompose,,0,ALT:COMPOSE-BY-ADDITION,KC77,,ALT:COMPOSE-BY-ADDITION,,Subtract1,Stu_ad3610752c4af1c3cac6638ef588e02b,,2,,Geometry*Subtract,,,,ALT:COMPOSE-BY-ADDITION,area-difference,,,,all*Subtract,forward,no-f,compose-by-addition,,KC77,,,subtract,,0,compose-subtract,100.0,,,Area formula,square-area & circle area composition,,all*Subtract,,0,0,Subtract,,0,,,0,POGS (SCRAP-METAL-AREA 
QUESTION3),,,,0.0,,0.643678,,0.593137,,,,0.0,0.0,,0.0,,0.0,0.0,,56.0,,1.352273,0.0,,,,,0.0,0.843902,23.794118,15.931034,,,121.0,,,,0.0,,0.0,,0,,,0.0,1386.0,,1.24878,256,,,,,,,4854.0,0.0,119,,0,0.0,0.875,,,,0.0,0.0,,,,,,0.0,,0.0,,,,,,0.0,23.794118,,,,0.0,3,,,,,0.0,,0.0,,0.0,,,,,,,,310.0,,0.0,,,,,0.0,0.0,,0.0,256,,,0.25,0.0,0.0,,0.0,0.0,0.0,,,0.0,0.0,,,0,,,0.0,94.970081,,,,0.0,,0.593137,0.0,0.0,0.0,0.0,,,,,0.859438,,,,0.0,0.0,0.0,,0,,,,1.24878,,0.0,,0.0,,0.0,0.0,,,0.0,,,,0.0,,0,0.0,,0.0,0.0,,,,,,,687,0.0,,0,0.0,,,,,,,,0.0,0.0,0.0,,,,0.0,0,,121.0,0.0,,,0.0,,0.0,,,4854.0,,0.375,,0.0,2,0.0,,0.0,0.0,0.623742,0.0,,0.0,0.0,0.0,0.0,0.843902,0.0,,11.336016,,0.0,,0.0,,0.0,,0.0,,0.0,0.0,0.0,0.0,0.0,759.760649,0.0,,0.0,0.0,,5634.0,0.0,,0.0,,0.0,,,0.0,,,,0.0,0,0.0,0.0,,0.0,1.379518,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,1


Above, you can scroll to the right to see the 227 features we created. If you look at the column names, you can see that we've done more than individually apply primitives one at a time to the raw data. Features were stacked and combined across entities in an exhaustive way. Using Deep Feature Synthesis is powerful because it greatly increases the likelihood of finding important features while decreasing the workload of the data scientist.

# Step 3: Making predictions

Here we split the data into two parts using `train_test_split` from scikit-learn. We pass `shuffle=False` because shuffling time-sensitive data risks leaking labels from the future into the training set.
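As a quick check of what an unshuffled split does, here is a minimal sketch on a hypothetical array whose values stand in for timestamps (the data is an assumption, not the DataShop feature matrix). With `shuffle=False` and the default test size, the split is a single cut, so every training row precedes every test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: row i "happened" at time i.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20) % 2

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)

# The latest training time should be earlier than the earliest test time.
latest_train_time = int(X_train.max())
earliest_test_time = int(X_test.min())
print(latest_train_time, earliest_test_time)
```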

We can do feature selection with [Recursive Feature Elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). This recursively removes features by checking feature importances (according to some model) with smaller and smaller feature sets. Here we'll set `RFE` to select 20 features.  

In [6]:
from sklearn.model_selection import TimeSeriesSplit, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from utils import feature_importances
from sklearn.feature_selection import RFE

# 1. Split X and y into a train and test set
X_train, X_test, y_train, y_test = train_test_split(fm_enc, label, shuffle=False)

# 2. Select features using RFE
clf = RandomForestClassifier()
estimator = clf
selector = RFE(estimator, n_features_to_select=20, step=1)  # keyword form is required by recent scikit-learn releases
selector = selector.fit(X_train, y_train)
X_train.iloc[:, selector.support_].tail()



Unnamed: 0_level_0,Attempt At Step = 2,problem_steps.MEAN(transactions.Is Last Attempt),sessions.MEAN(transactions.Is Last Attempt),sessions.PERCENT_TRUE(transactions.Outcome),sessions.MEAN(transactions.Duration (sec)),problem_steps.MEAN(transactions.Duration (sec)),problem_steps.SUM(transactions.Duration (sec)),sessions.SUM(transactions.Duration (sec)),problem_steps.SUM(transactions.Problem View),problem_steps.PERCENT_TRUE(transactions.Outcome),sessions.students.MEAN(transactions.Duration (sec)),sessions.students.MEAN(transactions.Is Last Attempt),problem_steps.problems.PERCENT_TRUE(transactions.Outcome),sessions.students.MEAN(transactions.Problem View),problem_steps.problems.SUM(transactions.Problem View),sessions.students.SUM(transactions.Duration (sec)),problem_steps.problems.MEAN(transactions.Is Last Attempt),sessions.students.PERCENT_TRUE(transactions.Outcome),problem_steps.problems.MEAN(transactions.Duration (sec)),problem_steps.problems.SUM(transactions.Duration (sec))
Transaction Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
39eae7f8590d2b934315064e9855afff,0,0.444444,0.360656,0.767347,8.02459,7.571429,477.0,1958.0,74,0.609375,8.02459,0.360656,0.702128,1.440816,160,1958.0,0.507143,0.767347,8.85,1239.0
882b1884a9ae8720e1dc00747889797d,0,0.75,0.359712,0.744681,9.978417,7.5,1290.0,1387.0,181,0.83237,9.978417,0.359712,0.85061,1.787234,345,1387.0,0.773006,0.744681,9.800613,3195.0
b2fb7c0e5010305b5af967a99c21ae1d,0,0.619048,0.502732,0.820652,9.295082,10.785714,453.0,1701.0,47,0.837209,9.295082,0.502732,0.829457,1.929348,279,1701.0,0.580392,0.820652,9.85098,2512.0
cf0d3d46691ec06d7c3d4f522d1a8e83,1,0.425963,0.4375,0.753846,13.59375,4.351927,4291.0,1740.0,1486,0.765182,13.59375,0.4375,0.770965,1.907692,1625,1740.0,0.435411,0.753846,5.274616,5839.0
ccd74208da111b48054439b5eca33dca,1,0.627907,0.344398,0.744856,8.078838,13.139535,565.0,1947.0,51,0.822222,8.078838,0.344398,0.833333,1.205761,279,1947.0,0.582031,0.744856,9.855469,2523.0
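To see what the fitted selector exposes, here is a small sketch on synthetic data (the dataset, the forest size and the choice of 4 features are all assumptions for illustration). It shows that `support_` is a boolean mask over the columns, which is what the `iloc` call above relies on, and that `ranking_` is 1 for every kept feature.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Drop one feature per round (step=1) until 4 remain.
toy_selector = RFE(RandomForestClassifier(n_estimators=20, random_state=0),
                   n_features_to_select=4, step=1)
toy_selector.fit(X, y)

n_kept = int(toy_selector.support_.sum())    # number of True entries in the mask
X_selected = toy_selector.transform(X)       # keeps only the selected columns
print(toy_selector.ranking_)
```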


Finally, we can train a Random Forest Classifier to make predictions. Those predictions can be checked against our `y_test` from above, and scored with a [roc_auc_score](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). Below we'll train and score our model and output the five most important features according to this model.

In [7]:
# 3. Train a Random Forest Classifier
clf.fit(selector.transform(X_train), y_train)

# 4. Make predictions and score
probs = clf.predict_proba(selector.transform(X_test))
print("Auc score of {:.3f}".format(roc_auc_score(y_test, probs[:,1])))

feats = feature_importances(X_train.iloc[:, selector.support_], clf)

Auc score of 0.646
Feature Importances: 
1: Attempt At Step = 2
2: problem_steps.PERCENT_TRUE(transactions.Outcome)
3: problem_steps.problems.PERCENT_TRUE(transactions.Outcome)
4: sessions.PERCENT_TRUE(transactions.Outcome)
5: problem_steps.MEAN(transactions.Duration (sec))
-----
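To help read the AUC score, recall that 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below uses four made-up predictions (an assumption, unrelated to the model above) to show that `roc_auc_score` equals the fraction of (correct, incorrect) pairs in which the correct example gets the higher score. The manual version ignores ties for simplicity.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# The same quantity by hand: how often does a positive outrank a negative?
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairwise = sum(p > n for p, n in product(pos, neg)) / (len(pos) * len(neg))

print(auc, pairwise)
```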



Let's examine a feature. The feature `problem_steps.MEAN(transactions.Duration (sec))` is the average time spent on a given problem step. It's easy to see how 'amount of time people spend on this problem' might be related to problem difficulty and ultimately the `Outcome` of a given attempt.
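To make that feature name concrete, here is a rough pandas equivalent on a hypothetical toy table. The `Step Name` column and the values are assumptions for illustration, and the sketch ignores any time-aware calculation that Featuretools performs, so it only shows the shape of the aggregation.

```python
import pandas as pd

# Hypothetical transactions; the real table has many more columns.
transactions = pd.DataFrame({
    "Transaction Id": ["t1", "t2", "t3", "t4"],
    "Step Name": ["s1", "s1", "s2", "s2"],
    "Duration (sec)": [4.0, 8.0, 10.0, 30.0],
})

# Average duration per step, broadcast back onto every transaction of that step.
step_mean = transactions.groupby("Step Name")["Duration (sec)"].transform("mean")
transactions["problem_steps.MEAN(transactions.Duration (sec))"] = step_mean
print(transactions)
```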

# Next Steps
This notebook showed how to structure your data and make predictions with machine learning. Rather than spending time creating features by hand, it's now possible to explore the relationships among thousands of features directly. Reasonable next steps might be to:
1. Try [plotting](#Appendix:-Plotting) some of the generated features
2. Run feature selection and tune the machine learning model
3. Explore other prediction problems on this `EntitySet`
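For the tuning step, one possible starting point is a grid search that validates only on later rows, in keeping with the unshuffled split used earlier. This is a sketch under assumptions: the data is synthetic, and the parameter grid is a hypothetical, deliberately small one rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the selected training features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid; widen it for a real search.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

# TimeSeriesSplit always validates on rows that come after the training rows.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```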




# Appendix: Plotting
Here, we'll plot two important features from the model above. Plots help us understand why certain automatically generated features are useful and let us match the model's results to our own intuition.

In [8]:
from bokeh.io import show, output_notebook, output_file

output_notebook()
output_file('difficulty_vs_time.html')

p = utils.datashop_plot(fm,
                        col1='problem_steps.problems.PERCENT_TRUE(transactions.Outcome)',
                        col2='problem_steps.problems.MEAN(transactions.Duration (sec))',
                        label=label,
                        names=['Problem difficulty versus problem time', 
                               'Success rate on this problem', 
                               'Average time on this problem'])
show(p)


![](data/images/exampleimage.png)

If you're interested in understanding particular points and clusters, [click here](https://www.featuretools.com/wp-content/uploads/2018/03/difficulty_vs_time.html) for an interactive HTML version. That version lets you zoom in and hover over individual points to see which problem step and problem each one belongs to.

Notice that while a feature like *Success rate on this problem* would have only one value per problem if we used all of the data, the graph shows it changing over time. The blue dots represent successful answers, while the grey dots indicate incorrect answers. To start our analysis, let's get a baseline by asking how often students are correct on average:

In [9]:
print('Overall success rate is {:.2f}%'.format(100 * np.mean(fm['Outcome'])))

Overall success rate is 79.39%


That is, if you were to pick a point at random, there's a roughly 79% chance it will be a correct answer. There are regions of the graph where a point is more likely to be correct and others where it is more likely to be incorrect, and the decision trees that make up our Random Forest can pick up on those regions. From the graph, it looks like there is a spike of correct answers near problems that take about 10 seconds. We can check this by looking at the success rate for problems that take longer:

In [10]:
maxtime = 15
duration_feat = fm['problem_steps.problems.MEAN(transactions.Duration (sec))']
problem_feat = fm['problem_steps.problems.PERCENT_TRUE(transactions.Outcome)']

print('If problem takes more than {} seconds: {:.2f}% of problems answered correctly'.format(maxtime,
    100 * np.mean(fm[(duration_feat >= maxtime)]['Outcome'])))

If problem takes more than 15 seconds: 74.20% of problems answered correctly
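The single threshold can also be extended to several duration ranges. The sketch below uses made-up values in place of the duration feature and the `Outcome` label (both are assumptions, as are the bin edges) and computes the success rate inside each range.

```python
import pandas as pd

# Made-up average durations and 0/1 outcomes.
toy = pd.DataFrame({
    "avg_duration": [5, 8, 12, 14, 16, 22, 30, 9],
    "Outcome":      [1, 1, 1,  0,  0,  1,  0,  1],
})

# Bucket the durations into left-closed ranges, then average the 0/1
# Outcome per bucket to get the success rate inside that bucket.
bins = [0, 10, 15, 60]
toy["duration_bin"] = pd.cut(toy["avg_duration"], bins=bins, right=False)
rate_by_bin = toy.groupby("duration_bin", observed=True)["Outcome"].mean()
print(rate_by_bin)
```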


In other words, the average time spent on a problem is an indicator of whether or not a student will answer the problem correctly in this dataset. There are a number of possible interpretations and testable hypotheses associated with that result. The averages clearly don't tell the whole story, so let's look at success rate by problem.



In [11]:
split_line = .85

print('If Success Rate > {}: {:.2f}% of problems answered correctly'.format(split_line,
    100 * np.mean(fm[(problem_feat >= split_line)]['Outcome'])))

If Success Rate > 0.85: 88.07% of problems answered correctly


In [12]:
print('Problems with higher success than {}:'.format(split_line))
for f in fm[problem_feat >= split_line]['problem_steps.Problem Name'].unique():
    print(f)

Problems with higher success than 0.85:
BUILDING_A_SIDEWALK
RECTANGLE_ABCD
POGS
PAINTING_THE_WALL
CIRCLE_O
DESIGNING_A_QUILT


That is, of the 20 problems in this dataset, only 6 had a success rate over 85% at some point in time. In that way, the machine learning model has indicated that how previous students did on a problem is a good predictor of how later students will do on it in this dataset.
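The phrase "at some point in time" matters because the success rate is recomputed as more attempts arrive. As a rough illustration, not Featuretools' own implementation, the sketch below computes a running success rate per problem from earlier attempts only, on a hypothetical toy table with assumed column names.

```python
import pandas as pd

# Hypothetical attempts, already sorted by time; 1 = correct, 0 = incorrect.
attempts = pd.DataFrame({
    "Problem Name": ["A", "A", "A", "B", "B", "B"],
    "Outcome":      [1,   1,   0,   0,   1,   1],
})

# Success rate over all earlier attempts on the same problem.
# shift(1) drops the current row so its own label is not used.
attempts["prior_success_rate"] = (
    attempts.groupby("Problem Name")["Outcome"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

# Problems whose running rate ever reached the threshold.
split_line = 0.85
above = attempts["prior_success_rate"] >= split_line
ever_above = attempts.loc[above, "Problem Name"].unique()
print(list(ever_above))
```

The first attempt on each problem has no history, so its running rate is missing and never counts toward the threshold.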

In addition to our earlier conclusion that "the problems that took a long time had worse scores", we have a secondary conclusion that "some problems are harder than others". What makes this line of inquiry neat is that we didn't have to do very much work to reveal interesting questions. In that way we have used automated feature engineering to make explicit our implicit understanding of this dataset.

In [13]:
# Save output files

import os

# Create the output folder if it does not exist yet
os.makedirs("output", exist_ok=True)

fm.to_csv('output/feature_matrix.csv')
cutoff_times[1000:].to_csv('output/cutoff_times.csv')

<p>
    <img src="https://www.featurelabs.com/wp-content/uploads/2017/12/logo.png" alt="Featuretools" />
</p>

Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact/).