In [5]:
#example with Python
#reading data from a file

In [6]:
#this line is only necessary for Jupyter Binder
#if you're trying this example on your local environment, you can skip this step!
import sys
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install statsmodels



In [7]:
#import all the python libraries that we will need for our analysis


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import statsmodels.api as sm


In [9]:
# Read data from the file 'geometryITS.csv'
# (located in the working directory of your Python process)
 
data = pd.read_csv("geometryITS.csv") 


# Preview the first 5 lines of the loaded data 
data.head()

Unnamed: 0,Row,Sample,Anon Student Id,Problem Hierarchy,Problem Name,Problem View,Step Name,Step Start Time,First Transaction Time,Correct Transaction Time,...,Step Duration (sec),Correct Step Duration (sec),Error Step Duration (sec),First Attempt,Incorrects,Hints,Corrects,Condition,KC,Opportunity
0,1,All Data,Stu_02ee1b3f31a6f6a7f4b8012298b2395e,Unit Area,RECTANGLE_ABCD,1,(AREA QUESTION1),02/01/1996 00:00,02/01/1996 00:00,02/01/1996 00:00,...,0,0,.,correct,0,0,1,.,Non-area formula,1
1,2,All Data,Stu_02ee1b3f31a6f6a7f4b8012298b2395e,Unit Area,RECTANGLE_ABCD,1,(HEIGHT QUESTION2),02/01/1996 00:00,02/01/1996 00:02,02/01/1996 00:02,...,128,128,.,correct,0,0,1,.,Non-area formula,2
2,3,All Data,Stu_02ee1b3f31a6f6a7f4b8012298b2395e,Unit Area,RECTANGLE_ABCD,1,(BASE QUESTION3),02/01/1996 00:02,02/01/1996 00:03,02/01/1996 00:03,...,29,29,.,correct,0,0,1,.,Non-area formula,3
3,4,All Data,Stu_02ee1b3f31a6f6a7f4b8012298b2395e,Unit Area,BUILDING_A_SIDEWALK,1,(POOL-AREA QUESTION1),02/01/1996 00:04,02/01/1996 00:04,02/01/1996 00:04,...,0,0,.,correct,0,0,1,.,Non-area formula,4
4,5,All Data,Stu_02ee1b3f31a6f6a7f4b8012298b2395e,Unit Area,BUILDING_A_SIDEWALK,1,(LARGE-RECTANGLE-AREA QUESTION1),02/01/1996 00:04,02/01/1996 00:04,02/01/1996 00:04,...,12,12,.,correct,0,0,1,.,Non-area formula,5


In [10]:
#view all variables that are recorded in the dataset
list(data)

['Row',
 'Sample',
 'Anon Student Id',
 'Problem Hierarchy',
 'Problem Name',
 'Problem View',
 'Step Name',
 'Step Start Time',
 'First Transaction Time',
 'Correct Transaction Time',
 'Step End Time',
 'Step Duration (sec)',
 'Correct Step Duration (sec)',
 'Error Step Duration (sec)',
 'First Attempt',
 'Incorrects',
 'Hints',
 'Corrects',
 'Condition',
 'KC',
 'Opportunity']

## What are we doing today?

In this example, we will implement the Additive Factors Model (AFM) proposed by Cen et al.
[Cen, H., Koedinger, K., & Junker, B. (2008, June). Comparing two IRT models for conjunctive skills. In International Conference on Intelligent Tutoring Systems (pp. 796-798). Springer, Berlin, Heidelberg.]

The AFM student model predicts a student's performance on every step based on the student's prior practice (the number of practice opportunities) and other factors, such as step difficulty, student proficiency, and learning rate.

AFM is a logistic regression model. The code in this example fits a linear mixed-effects model (`MixedLM` from statsmodels) to the 0/1 outcome instead, and trains and tests it on the Geometry Tutor dataset.

## Student model 
The AFM equation (see Cen's paper) states that, to predict a student's performance, we need as input the knowledge component (KC) involved in each step and the number of practice opportunities, that is, how many times the student has already practiced a step involving this KC. Also, since the model predicts individual learning, we have to take the student's ID into account.
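
For reference, AFM is usually written in the literature as a logistic model. The notation below is an assumption based on the common presentation of the model, not something defined in this notebook: $p_{ij}$ is the probability that student $i$ answers step $j$ correctly, $\theta_i$ is the student's proficiency, $\beta_k$ is the easiness of KC $k$, $\gamma_k$ is its learning rate, $T_{ik}$ is the number of prior opportunities student $i$ had on KC $k$, and $q_{jk}$ is 1 if step $j$ involves KC $k$ and 0 otherwise.

$$
\ln\frac{p_{ij}}{1 - p_{ij}} = \theta_i + \sum_k q_{jk}\left(\beta_k + \gamma_k T_{ik}\right)
$$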

The result of each student's attempt is recorded in the column "First Attempt" (the name of the column is misleading, I know!). We use it to create the "outcome" variable, that is, the variable we want to predict. Here, 0 signifies an incorrect answer and 1 signifies a correct answer.

Simply put, the equation that we want to implement should look something like this:

Outcome ~ Opportunity + KC + (1| Anon Student ID)   [equation (1)]

which roughly translates to: the outcome is predicted from the knowledge component (skill) being tested and the number of prior opportunities the student had, with a random effect per student that accounts for individual characteristics.
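
To make the roles of the terms more concrete, here is a minimal sketch of how a logistic version of such a model would turn them into a probability for a step that involves a single KC. The function name `afm_probability` and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the Geometry Tutor data.

```python
import math


def afm_probability(theta, beta, gamma, opportunities):
    """Probability of a correct answer for a step with a single KC."""
    # Linear predictor: student proficiency + KC easiness + learning rate * practice.
    logit = theta + beta + gamma * opportunities
    # The logistic link maps the linear predictor into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up parameter values, for illustration only.
p_first = afm_probability(theta=0.2, beta=-0.5, gamma=0.1, opportunities=0)
p_later = afm_probability(theta=0.2, beta=-0.5, gamma=0.1, opportunities=10)
print(p_first, p_later)
```

With a positive learning rate, the predicted probability rises with the number of prior opportunities, which is the learning effect the model is meant to capture.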

Now let's go and fit the model!

In [19]:
#create the "outcome" column: 1 by default, 0 where the first attempt was incorrect
data["outcome"] = 1

#assign through a single .loc call to avoid chained assignment
data.loc[data["First Attempt"] == "incorrect", "outcome"] = 0
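
The coding rule can be checked on a few toy rows. The sketch below is only an illustration: the value "hint" is included on the assumption that it can occur in "First Attempt", and the column `outcome_strict` is a hypothetical alternative, not the rule used in this notebook.

```python
import pandas as pd

# Toy rows; "hint" is an assumed possible value of the column.
toy = pd.DataFrame({"First Attempt": ["correct", "incorrect", "hint", "correct"]})

# Same rule as in the cell above: only "incorrect" becomes 0, everything else stays 1.
toy["outcome"] = (toy["First Attempt"] != "incorrect").astype(int)

# Stricter variant (an assumption, not the notebook's rule): only "correct" counts as 1.
toy["outcome_strict"] = (toy["First Attempt"] == "correct").astype(int)

print(toy)
```

The two variants differ only on rows whose first attempt is neither "correct" nor "incorrect", so it is worth checking `data["First Attempt"].value_counts()` before choosing one.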


## Fitting the AFM model
Now we will fit our model. We will use part of the dataset to train the model (say, 80% of the data) and hold out the rest to test the model's performance (the remaining 20%).

The datasets used for training have the suffix "_train" while the datasets saved for testing have the suffix "_test". 

In [20]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=0)

In [21]:
# linear mixed-effects model with a random intercept for each student
model = sm.MixedLM.from_formula("outcome ~ Opportunity + KC", data_train, groups=data_train["Anon Student Id"])
result = model.fit()
result.summary()

| Model: | MixedLM | Dependent Variable: | outcome |
|---|---|---|---|
| No. Observations: | 4083 | Method: | REML |
| No. Groups: | 59 | Scale: | 0.1788 |
| Min. group size: | 2 | Likelihood: | -2333.1763 |
| Max. group size: | 196 | Converged: | Yes |
| Mean group size: | 69.2 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.720 | 0.021 | 33.593 | 0.000 | 0.678 | 0.762 |
| KC[T.Non-area formula] | 0.008 | 0.014 | 0.582 | 0.561 | -0.020 | 0.037 |
| Opportunity | 0.000 | 0.000 | 0.292 | 0.770 | -0.000 | 0.001 |
| Group Var | 0.012 | 0.011 | | | | |


Now let's use the newly fitted model to predict the outcomes of the test set. To keep it simple, we're just calculating the Mean Absolute Error as a performance metric. Because both the predicted and the observed outcome are either 0 or 1, this is simply the share of misclassified steps. Of course there are other, more sophisticated evaluation methods, but these are beyond the scope of this course!

In [22]:
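#predict on the test set
#work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
data_test = data_test.copy()
data_test["y_pred"] = result.predict(data_test)

#classify predictions below 0.5 as incorrect (0), the rest as correct (1)
data_test["outcome_pred"] = 1
data_test.loc[data_test["y_pred"] < 0.5, "outcome_pred"] = 0

#mean absolute error between predicted and observed outcome
error = (data_test["outcome_pred"] - data_test["outcome"]).abs()
error.mean()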


0.2595494613124388

So, our model is able to predict student performance with a 26% error.
Not bad for such a simple approach, right?
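
To see what this number measures, here is a minimal sketch on made-up values (the arrays below are not taken from the dataset). It shows that, once predictions are thresholded at 0.5, the mean absolute error coincides with the misclassification rate, i.e. one minus the accuracy.

```python
import numpy as np

# Made-up predictions and observed outcomes, for illustration only.
y_pred = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
outcome = np.array([1, 0, 0, 0, 1])

# Threshold at 0.5: below 0.5 becomes 0, everything else becomes 1.
outcome_pred = np.where(y_pred < 0.5, 0, 1)

mae = np.abs(outcome_pred - outcome).mean()
error_rate = (outcome_pred != outcome).mean()
accuracy = (outcome_pred == outcome).mean()

print(mae, error_rate, accuracy)
```

In this toy case one of the five steps is misclassified, so both the error rate and the mean absolute error are one fifth.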

## Assignment 3

Pavlik et al. [Pavlik Jr, P. I., Cen, H., & Koedinger, K. R. (2009). Performance Factors Analysis--A New Alternative to Knowledge Tracing. Online Submission.] took the AFM model further by proposing to use prior correct and prior incorrect answers for each step instead of prior opportunities. They named this new model the Performance Factors Analysis Model (PFM).
Adapting equation (1), the simplified equation for the PFM model can be written as:

Outcome ~ Prior Correct + Prior Incorrect + KC + (1| Anon Student ID)   [equation (2)]

For the third assignment, you have to do the following:
1. Modify the AFM model provided in this example in order to implement the PFM model. You already have information about the correct and incorrect answers in the dataset. 
2. Use the PFM model to predict student performance and compare the results with the AFM model. Discuss your findings.
3. Propose additional factors that you could take into account in order to improve the student model. You can be as creative as you like, as long as you keep your feet on the ground. Think of improvement vs. feasibility.

I'm looking forward to reading your ideas!



## Checking the Available Columns

Before modifying the AFM model above, I need to check again which columns are available in the data.

In [27]:
list(data)
# data.head()

['Row',
 'Sample',
 'Anon Student Id',
 'Problem Hierarchy',
 'Problem Name',
 'Problem View',
 'Step Name',
 'Step Start Time',
 'First Transaction Time',
 'Correct Transaction Time',
 'Step End Time',
 'Step Duration (sec)',
 'Correct Step Duration (sec)',
 'Error Step Duration (sec)',
 'First Attempt',
 'Incorrects',
 'Hints',
 'Corrects',
 'Condition',
 'KC',
 'Opportunity',
 'outcome']

## Task 1. Using Prior Correct and Prior Incorrect as Our Predictors

In [68]:
modelPFM = sm.MixedLM.from_formula("outcome ~ Corrects + Incorrects + KC", data_train, groups=data_train["Anon Student Id"])
resultPFM = modelPFM.fit()
resultPFM.summary()



| Model: | MixedLM | Dependent Variable: | outcome |
|---|---|---|---|
| No. Observations: | 4083 | Method: | REML |
| No. Groups: | 59 | Scale: | 0.0104 |
| Min. group size: | 2 | Likelihood: | 3508.8920 |
| Max. group size: | 196 | Converged: | Yes |
| Mean group size: | 69.2 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.769 | 0.007 | 112.925 | 0.000 | 0.756 | 0.782 |
| KC[T.Non-area formula] | 0.006 | 0.003 | 1.971 | 0.049 | 0.000 | 0.013 |
| Corrects | 0.209 | 0.006 | 35.063 | 0.000 | 0.198 | 0.221 |
| Incorrects | -0.923 | 0.004 | -260.320 | 0.000 | -0.930 | -0.916 |
| Group Var | 0.000 | 0.000 | | | | |
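
Equation (2) is stated in terms of prior correct and prior incorrect answers, while the `Corrects` and `Incorrects` columns hold counts per step. As a minimal sketch, running prior counts could be derived from the outcome column with a grouped cumulative sum. The toy data, the column names `prior_correct` and `prior_incorrect`, and the choice to count per student and KC are all assumptions made for illustration; the sketch also assumes the rows are already in chronological order for each student.

```python
import pandas as pd

# Toy data standing in for the tutor log; values are made up for illustration.
toy = pd.DataFrame({
    "Anon Student Id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "KC": ["A", "A", "A", "B", "A", "A"],
    "outcome": [1, 0, 1, 1, 0, 0],
})

# Count separately for every (student, KC) pair.
grouped = toy.groupby(["Anon Student Id", "KC"])["outcome"]

# Running total of successes minus the current row = successes strictly before it.
toy["prior_correct"] = grouped.cumsum() - toy["outcome"]

# Number of earlier rows in the group minus earlier successes = earlier failures.
toy["prior_incorrect"] = grouped.cumcount() - toy["prior_correct"]

print(toy)
```

With such columns in place, the formula could become `outcome ~ prior_correct + prior_incorrect + KC`. Whether that improves on AFM would have to be checked on the test set.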


## Task 2. Use the PFM model to predict student performance and compare the results with the AFM model

In [69]:
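#predict on the test set
#work on an explicit copy so that adding columns does not trigger SettingWithCopyWarning
data_test = data_test.copy()
data_test["y_pred_pfm"] = resultPFM.predict(data_test)

#classify predictions below 0.5 as incorrect (0), the rest as correct (1)
data_test["outcome_pred_pfm"] = 1
data_test.loc[data_test["y_pred_pfm"] < 0.5, "outcome_pred_pfm"] = 0

#mean absolute error between predicted and observed outcome
error = (data_test["outcome_pred_pfm"] - data_test["outcome"]).abs()
error.mean()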


0.00881488736532811

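Our model gives a very low *0.88%* error rate when predicting student performance, which on paper is a huge improvement over AFM. The number should be read with caution, though: in this dataset `Corrects` and `Incorrects` describe the attempts on the current step itself (the first preview row already has Corrects = 1 on the student's very first opportunity), so they largely encode the outcome being predicted rather than prior performance.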

Comparing AFM (Additive Factors Model) with the new PFM (Performance Factors Model), here are some of my findings:
- PFM uses the student's performance as the main indicator of learning, through its two variables: correct and incorrect responses.
- A PFM prediction can be converted into a prediction of performance latency or duration. With that, PFM can estimate the cost of each action, which can then be used to decide which action is optimal.

## Task 3. Additional Factors to Improve The Student Model

Some additional factors that could improve the model are:
- 'Correct Step Duration (sec)'
- 'Error Step Duration (sec)'

Like "Prior Corrects" and "Prior Incorrects", these two factors can help determine student performance, and they can support decisions about which action is optimal in the future.
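
Before these columns can enter a model formula they need to be numeric. In the data preview a missing duration is shown as ".", so the columns are likely read as text. The sketch below shows one way to convert them, using toy values that are not taken from the dataset.

```python
import pandas as pd

# Toy values, for illustration only; "." marks a missing duration as in the preview.
toy = pd.DataFrame({
    "Correct Step Duration (sec)": ["0", "128", "."],
    "Error Step Duration (sec)": [".", ".", "45"],
})

duration_cols = ["Correct Step Duration (sec)", "Error Step Duration (sec)"]
for col in duration_cols:
    # errors="coerce" turns anything that is not a number (such as ".") into NaN.
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy)
```

Both durations are measured on the current step, and in the preview only one of them appears to be filled in, depending on whether the first attempt was correct. As predictors they would therefore probably need to be lagged, for example by using the duration of the student's previous step.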