In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [2]:
df = pd.read_csv('assistment_data_corrected.csv', encoding="ISO-8859-1")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401756 entries, 0 to 401755
Data columns (total 30 columns):
order_id                401756 non-null int64
assignment_id           401756 non-null int64
user_id                 401756 non-null int64
assistment_id           401756 non-null int64
problem_id              401756 non-null int64
original                401756 non-null int64
correct                 401756 non-null int64
attempt_count           401756 non-null int64
ms_first_response       401756 non-null int64
tutor_mode              401756 non-null object
answer_type             401756 non-null object
sequence_id             401756 non-null int64
student_class_id        401756 non-null int64
position                401756 non-null int64
type                    401756 non-null object
base_sequence_id        401756 non-null int64
skill_id                338001 non-null float64
skill_name              325637 non-null object
teacher_id              401756 non-null int64
school_id

### An explanation of the fields
This is taken from [the ASSISTments data website](https://sites.google.com/site/assistmentsdata/how-to-interpret).
* `order_id`: These IDs are chronological and refer to the ID of the original problem log.
* `assignment_id`: Each time a teacher assigns a problem set that assignment gets a separate number. This is that number. 
* `user_id`: The ID of the student doing the problem.
* `assistment_id`: Similar to `problem_id`. The ID of a problem one will see in the builder. If a problem has multiple main problems and/or scaffolding, everything relating to one problem is called an assistment and has the same `assistment_id`. If you see problem logs with the same assistment number, they are multiple main problems (or scaffolding problems) that are part of the same overarching problem.
* `problem_id`: The ID of the problem. If a problem has multiple main problems, each multiple main problem will have a different problem_id.
* `original`: 
  * 1 = Main problem
  * 0 = Scaffolding problem  
    If a problem has scaffolding and the student answers incorrectly or asks for the problem to be broken into steps, a new problem will be created called a scaffolding problem. This creates a separate problem log row in the file with the variable original = 0.
* `correct`: 
  * 1 = Correct on first attempt
  * 0 = Incorrect on first attempt, or asked for help  
This column is often the target for prediction. (Minor note: Neil Heffernan notes that while this is true most of the time, there are also essay questions that teachers can grade. Neil thinks that a value of, say, 0.25 means the teacher gave it a 1 out of 4.)
* `attempt_count`: Number of attempts (number of times a student entered an answer)
* `ms_first_response`: Time between start time and first student action (asking for a hint or entering an answer), in milliseconds
* `tutor_mode`: Tutor, test mode, pre-test, or post-test
* `answer_type`: Unknown
* `sequence_id`: ASSISTments is very confusing in how they use "Problem Set" and "Sequence".  The same object that is called a "Sequence" in the database is exposed to teachers as a "Problem Set".  If you have a sequence ID, you can use the converter [here](http://users.wpi.edu/~xxiong/app.html) to get the corresponding problem set number to use ASSISTments and see exactly what the content looked like.
* `student_class_id`: The ID of the class, the same for all students in the same class. If you want to do hierarchical linear modeling, you can use this as the class ID. We can also give you a teacher ID. You might also want to look at section ID.
* `position`: The placement of the assignment within the teacher's assignment page (i.e., 5 means the 5th problem set assigned)
* `type`: Determined by assignment type id. Usually ClassAssignment, but sometimes ARRS or remedial.
* `base_sequence_id`: This is to account for if a sequence has been copied. This will point to the original copy, or be the same as sequence_id if it hasn't been copied.
* `skill_id`: ID of the skill associated with the problem. For the skill builder dataset, different skills for the same data record are in different rows. This means that if a student answers a multi-skill question, the record is duplicated several times, and each duplicate is tagged with one of the skills. For the non skill builder dataset, different skills for the same data record are in the same row, separated by commas.
* `skill_name`: Skill name associated with the problem.
* `teacher_id`: Unknown
* `school_id`: ID number for the school
* `hint_count`: Number of hints a student asked for during the duration of the problem.
* `hint_total`: Number of possible hints on this problem.  We tell you the total number of hints so you can compute something like a % of hints used.  Not all problems have all the same number of hints. 
* `overlap_time`: The time in milliseconds for the student to complete the problem. This field is often computed incorrectly. Many data sets display overlap time the same as the first response time. You could compute overlap time using other fields, for example from the start times of two consecutive problems.
* `template_id`: The template ID of the ASSISTment. ASSISTments with the same template ID have similar questions.
* `answer_id`: Only exists for multiple choice or choose all that apply questions. 
  * A number =  the answer the student put in corresponds with one of the answers for that problem
  * 0 or empty = the student put an answer not corresponding with one of the answers for that problem
* `answer_text`: The answer the student entered. Or the value the student selected in a multiple choice or choose all that apply problem.
* `first_action`: 
  * 0 = attempt
  * 1 = hint
  * 2 = scaffolding
  * empty = student clicked on the problem but did nothing else
* `bottom_hint`: 
  * 1 = The student asked for the bottom out hint
  * 0 = The student did not ask for the bottom out hint.
  * If this is blank it means the student did not ask for a hint.  Remember that for scaffolding questions they can not get a hint.
  * The bottom out hint is the last hint for a problem and will generally contain the problem’s answer.
* `opportunity`: The number of opportunities the student has to practice on this skill. For the skill builder dataset, opportunities for different skills of the same data record are in different rows. This means if a student answers a multi skill question, this record is duplicated several times, and each duplication is tagged with one of the multi skills and the corresponding opportunity count.
* `opportunity_original`: The number of opportunities the student has to practice on this skill counting only original problems. For the skill builder dataset, original opportunities for different skills of the same data record are in different rows. This means if a student answers a multi skill question, this record is duplicated several times, and each duplication is tagged with one of the multi skills and the corresponding original opportunity count.
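The `hint_total` entry above mentions computing a percentage of hints used. A minimal sketch of that calculation follows; the toy values are made up for illustration and are not taken from the dataset. It assumes that a problem with no available hints should give a missing value rather than a division error.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real log (values are invented for illustration).
toy = pd.DataFrame({
    "hint_count": [0, 2, 3, 0],
    "hint_total": [3, 4, 3, 0],
})

# Replace a zero hint_total with NaN so the division yields NaN instead of inf/NaN noise.
hint_fraction = toy["hint_count"] / toy["hint_total"].replace(0, np.nan)
print(hint_fraction.tolist())
```

The same expression should work on `df` directly, provided both columns are numeric.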

In [4]:
overlap_time = df['overlap_time']

In [5]:
correct = df['correct']

In [6]:
median_overlap = df['overlap_time'].median()

In [7]:
overlap = [0 if x < median_overlap else 1 for x in overlap_time]

In [8]:
np.unique(overlap)

array([0, 1])

In [9]:
from scipy.stats import pointbiserialr
from sklearn.metrics import matthews_corrcoef

In [10]:
matthews_corrcoef(overlap, correct)

-0.23279380539926584
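For two binary variables, the Matthews correlation coefficient is the phi coefficient, which equals the Pearson correlation of the 0/1 codes. With the coding above (`overlap` = 1 for at or above the median, `correct` = 1 for correct on first attempt), a negative value means longer overlap times go with fewer correct first attempts. The sketch below checks the phi/Pearson equivalence on synthetic data using NumPy only; the arrays `x` and `y` are invented and are not the notebook's columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary variables: y copies x, then about 30% of entries are flipped.
x = rng.integers(0, 2, size=1000)
flip = rng.random(1000) < 0.3
y = np.where(flip, 1 - x, x)

# Confusion-matrix counts, cast to float to keep the products well behaved.
tp = float(np.sum((x == 1) & (y == 1)))
tn = float(np.sum((x == 0) & (y == 0)))
fp = float(np.sum((x == 0) & (y == 1)))
fn = float(np.sum((x == 1) & (y == 0)))

# Phi coefficient (the binary form of the Matthews correlation coefficient).
phi = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Pearson correlation of the same 0/1 codes.
pearson = np.corrcoef(x, y)[0, 1]
print(phi, pearson)
```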

In [11]:
df = df[pd.notnull(df['skill_id'])]
groups = list(df.groupby(['user_id', 'skill_id']))

In [13]:
len(df)

338001

In [12]:
len(groups)

41982

In [14]:
list(groups[9][1]['correct'])

[0, 1, 1, 1]
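The response list above is in whatever row order the file has within the group. Since `order_id` is described as chronological, one option is to sort by it before grouping. A minimal sketch on a toy frame follows; the values are invented, and whether the real file is already in chronological order is not checked here.

```python
import pandas as pd

# Toy log with rows deliberately out of chronological order.
toy = pd.DataFrame({
    "order_id": [5, 1, 3, 2, 4],
    "user_id":  [1, 1, 1, 2, 2],
    "skill_id": [10.0, 10.0, 10.0, 10.0, 10.0],
    "correct":  [1, 0, 1, 1, 0],
})

# Sort chronologically, then collect one response sequence per (student, skill).
# groupby keeps the sorted row order within each group.
sequences = (
    toy.sort_values("order_id")
       .groupby(["user_id", "skill_id"])["correct"]
       .apply(list)
)
print(sequences.to_dict())
```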

In [15]:
from WebApp.server.BKT import BKT

In [16]:
model = BKT(np.array(groups[9][1]['correct']).reshape(1, -1))

In [17]:
model.fit()

In [18]:
A, pi, B = model.get_model_params()

In [19]:
# Start probabilities
print(A)

[1.00000000e+000 2.00698872e-112]


In [20]:
# Transition probabilities
print(pi)

[[1.17179653e-15 1.00000000e+00]
 [1.17179653e-15 1.00000000e+00]]


In [21]:
# Emission probabilities
print(B)

[[1.00000000e+000 3.51538960e-015]
 [6.68996241e-113 1.00000000e+000]]


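Since the main diagonal elements of the emission matrix are both essentially 1, my guess is that for this particular run, the model's hidden state 0 corresponds to the unlearned state (which emits observation 0, an incorrect answer), and hidden state 1 corresponds to the learned state (which emits observation 1, a correct answer). Looking at the transition probabilities, they say that p(unlearned --> learned) = 1 and p(learned --> learned) = 1. This makes sense for the observed sequence [0, 1, 1, 1]: the student starts unlearned, answers incorrectly once, then moves to the learned state and stays there.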
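One way to check this reading is to run the HMM forward algorithm on the sequence with the fitted parameters rounded to exact 0s and 1s. The sketch below is independent of the `BKT` class. It assumes the transition matrix is row-stochastic (`trans[i, j]` = p(next state j | current state i)) and that the emission matrix is indexed as `emit[state, observation]`; both are assumptions about the printed layout, not confirmed conventions of that class.

```python
import numpy as np

# Fitted parameters from the cells above, rounded to exact 0/1 values.
start = np.array([1.0, 0.0])      # printed as A: the student starts in hidden state 0
trans = np.array([[0.0, 1.0],     # printed as pi: state 0 always moves to state 1
                  [0.0, 1.0]])    # state 1 always stays in state 1
emit = np.eye(2)                  # printed as B: state k always emits observation k

obs = [0, 1, 1, 1]                # the response sequence for this student and skill

# Forward algorithm: alpha[k] = p(observations so far, current hidden state = k).
alpha = start * emit[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]

likelihood = alpha.sum()
print(likelihood)
```

Under these assumptions the sequence has likelihood 1, so the rounded parameters reproduce the four observed responses exactly.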