# Bayesian Knowledge Tracing

In [2]:
# Imports
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

## Data

We will fit data from the ASSISTments platform (https://www.commonsense.org/education/website/assistments). The data are from the 2012-2013 school year. First, download the data from https://raw.githubusercontent.com/CAHLR/pyBKT-examples/master/data/as.csv into the folder **./data/**.

In [16]:
# Inspect the dataset
DATASET = "data/as.csv"

df = pd.read_csv(DATASET, encoding="latin", low_memory=False)
df.head()

Unnamed: 0,order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,attempt_count,ms_first_response,tutor_mode,...,hint_count,hint_total,overlap_time,template_id,answer_id,answer_text,first_action,bottom_hint,opportunity,opportunity_original
0,33022537,277618,64525,33139,51424,1,1,1,32454,tutor,...,0,3,32454,30799,,26,0,,1,1.0
1,33022709,277618,64525,33150,51435,1,1,1,4922,tutor,...,0,3,4922,30799,,55,0,,2,2.0
2,35450204,220674,70363,33159,51444,1,0,2,25390,tutor,...,0,3,42000,30799,,88,0,,1,1.0
3,35450295,220674,70363,33110,51395,1,1,1,4859,tutor,...,0,3,4859,30059,,41,0,,2,2.0
4,35450311,220674,70363,33196,51481,1,0,14,19813,tutor,...,3,4,124564,30060,,65,0,0.0,3,3.0


Here we focus on the following variables:
- *assistment_id* - The ID of the ASSISTment. An assistment consists of one or more problems.
- *user_id* - The ID of the student doing the problem.
- *problem_id* - The ID of the problem.
- *skill_name* - Skill name associated with the problem (knowledge component).
- *correct*
    - 1 - correct on the first attempt
    - 0 - incorrect on the first attempt or asked for help / hint.
- *attempt_count* - Number of student attempts on this problem.
- *hint_count* - Number of hints the student requested on this problem.
- *template_id* - The ID of the template in ASSISTment. Assistments with the same template ID have similar questions.

The remaining variables are explained here: https://sites.google.com/site/assistmentsdata/datasets/2012-13-school-data-with-affect

In [13]:
print(
    "The dataset has {} observations, {} problems, {} skills and {} users".format(
        len(df),
        df["problem_id"].nunique(),
        df["skill_name"].nunique(),
        df["user_id"].nunique(),
    )
)

The dataset has 525534 observations, 26688 problems, 110 skills and 4217 users


Inspect the different skills:

In [17]:
print(df["skill_name"].unique().tolist())

# Drop rows with missing skill_name
df = df.dropna(subset=["skill_name"])

['Box and Whisker', 'Circle Graph', 'Histogram as Table or Graph', 'Number Line', 'Scatter Plot', 'Stem and Leaf Plot', 'Table', 'Venn Diagram', 'Mean', 'Median', 'Mode', 'Range', 'Counting Methods', 'Probability of Two Distinct Events', 'Probability of a Single Event', 'Interior Angles Figures with More than 3 Sides', 'Interior Angles Triangle', 'Congruence', 'Complementary and Supplementary Angles', 'Angles on Parallel Lines Cut by a Transversal', 'Pythagorean Theorem', 'Nets of 3D Figures', 'Unit Conversion Within a System', 'Effect of Changing Dimensions of a Shape Prportionally', nan, 'Area Circle', 'Circumference ', 'Perimeter of a Polygon', 'Reading a Ruler or Scale', 'Calculations with Similar Figures', 'Conversion of Fraction Decimals Percents', 'Equivalent Fractions', 'Ordering Positive Decimals', 'Ordering Fractions', 'Ordering Integers', 'Ordering Real Numbers', 'Rounding', 'Addition Whole Numbers', 'Division Fractions', 'Estimation', 'Fraction Of', 'Least Common Multiple',

Inspect the data for a single student:

In [18]:
df[df["user_id"] == 64525].sort_values("order_id")

Unnamed: 0,order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,attempt_count,ms_first_response,tutor_mode,...,hint_count,hint_total,overlap_time,template_id,answer_id,answer_text,first_action,bottom_hint,opportunity,opportunity_original
288311,21441630,263315,64525,34563,54003,1,0,1,20797,tutor,...,0,0,20859,30677,,-12.2,0,,1,1.0
288312,21441970,263315,64525,34560,53991,1,1,1,13797,tutor,...,0,0,13797,30677,,-6.5,0,,2,2.0
288313,21442097,263315,64525,34580,54071,1,0,1,14172,tutor,...,0,0,14235,30677,,1.6,0,,3,3.0
288314,21442513,263315,64525,34566,54015,1,1,1,48813,tutor,...,0,0,48813,30677,,5.2,0,,4,4.0
288315,21442851,263315,64525,34559,53987,1,1,1,22187,tutor,...,0,0,22187,30677,,1.4,0,,5,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446259,33273093,277671,64525,37536,59762,1,0,1,19048,tutor,...,0,0,19173,30988,,y=1/5-4,0,,12,6.0
446260,33273154,277671,64525,37536,59763,0,1,1,3978,tutor,...,0,4,3978,30988,,5-Jan,0,,13,
446261,33273172,277671,64525,37536,59764,0,1,1,14960,tutor,...,0,2,14960,30988,,-4,0,,14,
446262,33273227,277671,64525,37536,59765,0,1,1,7363,tutor,...,0,4,7363,30988,,y=1/5x-4,0,,15,


## Model fitting
When calling *fit*, we can specify a list of skill names to fit (in this case "Addition and Subtraction Integers", "Multiplication and Division Integers", and "Addition and Subtraction Positive Decimals"). The parameters are estimated separately for each skill. If no skills are specified, the model is fit on every skill in the dataset, which may take a long time.

- **Parameters**: 
    - num_fits (5) - the number of initialization fits used for the BKT model.
    - parallel (True) - whether to use multi-threading.
    - skills ('.\*') - regular expression used to indicate the skills the BKT model will be run on.
    - forgets (False) - include forgetting in the model.    

- **Inputs**:
The input dataframe should have the following columns:
    - order_id ('order_id') - indicates question order.
    - skill_name ('skill_name') - skill name (knowledge component) associated with the question.
    - correct ('correct') - the correct (1) / incorrect (0) label.
    - user_id ('user_id') - ID of the student answering the question.
    
More details in https://github.com/CAHLR/pyBKT#creating-and-training-models
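
To make the role of these columns concrete, the sketch below groups a few made-up rows into one response sequence per (skill, student) pair, ordered by `order_id`. It only illustrates the data layout a BKT model consumes and is not pyBKT's internal code; the rows and the helper name `build_sequences` are hypothetical.

```python
from collections import defaultdict

# Made-up rows using the four required columns (values are illustrative only).
rows = [
    {"order_id": 3, "user_id": "u1", "skill_name": "Mean", "correct": 1},
    {"order_id": 1, "user_id": "u1", "skill_name": "Mean", "correct": 0},
    {"order_id": 2, "user_id": "u2", "skill_name": "Mean", "correct": 1},
    {"order_id": 4, "user_id": "u1", "skill_name": "Median", "correct": 1},
]


def build_sequences(rows):
    """Return {(skill_name, user_id): [correct, ...]} ordered by order_id."""
    sequences = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["order_id"]):
        sequences[(row["skill_name"], row["user_id"])].append(row["correct"])
    return dict(sequences)


sequences = build_sequences(rows)
for key, responses in sequences.items():
    print(key, responses)
```

Each sequence is what a per-skill model sees for one student: an ordered list of correct (1) and incorrect (0) responses.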

In [None]:
from pyBKT.models import Model

# Fix the random seeds for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

# Skills (knowledge components) to fit
SKILL_LIST = [
    "Addition and Subtraction Integers",
    "Multiplication and Division Integers",
    "Addition and Subtraction Positive Decimals",
]

# Fit one set of BKT parameters per skill, using a single random initialization
BKT = Model(seed=SEED, parallel=True)
BKT.fit(data=df, skills=SKILL_LIST, num_fits=1)

# Evaluate on the same data used for training
auc_train = BKT.evaluate(data=df, metric="auc")
accuracy_train = BKT.evaluate(data=df, metric="accuracy")

print(f"Training AUC: {auc_train}; Training accuracy: {accuracy_train}")

Training AUC: 0.8841882194587217; Training accuracy: 0.8247150335110807
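
As a rough guide to what these two numbers measure, the sketch below computes accuracy and AUC from a handful of made-up labels and predicted probabilities. The 0.5 threshold for accuracy and the helper names are assumptions for illustration; pyBKT's own `evaluate` may differ in details.

```python
def accuracy(labels, probs, threshold=0.5):
    """Share of responses whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)


def auc(labels, probs):
    """Probability that a random correct response gets a higher predicted
    probability than a random incorrect one (ties count as one half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 1, 0]            # made-up observed responses
probs = [0.9, 0.6, 0.7, 0.4, 0.2]   # made-up predicted probabilities
acc_value = accuracy(labels, probs)
auc_value = auc(labels, probs)
print(acc_value, auc_value)
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the predictions rank correct responses relative to incorrect ones.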


Inspect the estimated parameters.

- **prior** ($P(\text{L}_0)$) - the prior probability of knowledge component (skill) mastery.
- **learns** ($P(\text{L})$) - the probability of transitioning to the mastery state given non-mastery, over a single learning opportunity.
- **slips** ($P(\text{S})$) - the probability of slipping (making a mistake) when the learner is in the mastered state.
- **guesses** ($P(\text{G})$) - the probability that the student guesses the right answer while not knowing the skill.
- **forgets** ($P(\text{F})$) - the probability of transitioning to the non-mastery state given mastery (i.e., forgetting something that the student previously learned).

There is a different set of parameters for each skill. Note that the **forgets** parameter is 0 because forgetting is disabled by default (forgets=False). If you run the fit with forgets=True, this parameter is estimated as well (see below).
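
In the standard BKT formulation, these parameters combine as follows. This is the textbook update, stated here as a summary and not as a transcription of pyBKT's code. Here $P(L_t)$ is the mastery probability before the response at step $t$. To avoid a clash with the mastery state $L_t$, the learn probability (written $P(\text{L})$ in the list above) is denoted $P(T)$.

$$P(\text{correct}_t) = P(L_t)\,\bigl(1 - P(S)\bigr) + \bigl(1 - P(L_t)\bigr)\,P(G)$$

$$P(L_t \mid \text{correct}_t) = \frac{P(L_t)\,\bigl(1 - P(S)\bigr)}{P(L_t)\,\bigl(1 - P(S)\bigr) + \bigl(1 - P(L_t)\bigr)\,P(G)}$$

$$P(L_t \mid \text{incorrect}_t) = \frac{P(L_t)\,P(S)}{P(L_t)\,P(S) + \bigl(1 - P(L_t)\bigr)\,\bigl(1 - P(G)\bigr)}$$

$$P(L_{t+1}) = P(L_t \mid \text{obs}_t)\,\bigl(1 - P(F)\bigr) + \bigl(1 - P(L_t \mid \text{obs}_t)\bigr)\,P(T)$$

With $P(F) = 0$ the last line reduces to the usual no-forgetting transition, and $P(L_1) = P(\text{L}_0)$ is the prior.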

In [47]:
BKT.params()

skill,param,class,value
Median,prior,default,0.55343
Median,learns,default,0.0822
Median,guesses,default,0.32537
Median,slips,default,0.17197
Median,forgets,default,0.0
Mean,prior,default,0.66453
Mean,learns,default,0.15594
Mean,guesses,default,0.05222
Mean,slips,default,0.32271
Mean,forgets,default,0.0


In [56]:
# Alternatively
BKT.coef_

{'Multiplication and Division Integers': {'prior': np.float64(0.7577838698725574),
  'learns': array([0.00242479]),
  'guesses': array([0.51341267]),
  'slips': array([0.06144345]),
  'forgets': array([0.])},
 'Addition and Subtraction Positive Decimals': {'prior': np.float64(0.6364501170581747),
  'learns': array([0.08539341]),
  'guesses': array([0.13081135]),
  'slips': array([0.20636621]),
  'forgets': array([0.])},
 'Addition and Subtraction Integers': {'prior': np.float64(0.5407906675471538),
  'learns': array([0.00610081]),
  'guesses': array([0.4368426]),
  'slips': array([0.07898355]),
  'forgets': array([0.])}}

We can initialize the parameters with the `coef_` attribute.

In [29]:
# Setting prior of the model for a certain skill
BKT.coef_ = {SKILL_LIST[0]: {"prior": 0.6}}

# Train the model with the pre-initialized parameters
BKT.fit(data=df, skills=SKILL_LIST)
auc_train = BKT.evaluate(data=df, metric="auc")
accuracy_train = BKT.evaluate(data=df, metric="accuracy")

print(f"Training AUC: {auc_train}; Training accuracy: {accuracy_train}")

Training AUC: 0.7703892445446437; Training accuracy: 0.7699721148671788


## Prediction
Once we have trained a model, we can make predictions on new data.

In [23]:
pred = BKT.predict(data=df)

Inspect the predictions:
- *correct_predictions* - between 0 and 1, estimated probability of answering correctly
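- *state_predictions* - between 0 and 1, estimated probability that the student has mastered the knowledge component when reaching the item, i.e. before the response to that item is taken into account (in the output below, the first row of each student has the same value regardless of the response)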

In [30]:
pred[pred["skill_name"] == SKILL_LIST[0]][
    [
        "user_id",
        "correct",
        "correct_predictions",
        "state_predictions",
        "skill_name",
    ]
].head(20)

Unnamed: 0,user_id,correct,correct_predictions,state_predictions,skill_name
265408,53167,1,0.69884,0.54125,Addition and Subtraction Integers
265409,53167,1,0.78298,0.71503,Addition and Subtraction Integers
265410,53167,1,0.84446,0.842,Addition and Subtraction Integers
265411,53167,0,0.88163,0.91878,Addition and Subtraction Integers
265412,53167,1,0.73497,0.61587,Addition and Subtraction Integers
265413,53167,1,0.8111,0.77311,Addition and Subtraction Integers
265414,53167,1,0.86216,0.87857,Addition and Subtraction Integers
265415,53167,1,0.89135,0.93886,Addition and Subtraction Integers
265416,53167,1,0.90654,0.97022,Addition and Subtraction Integers
265417,64525,1,0.69884,0.54125,Addition and Subtraction Integers
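
The two prediction columns can be traced by hand with the standard BKT update. The sketch below is a minimal, self-contained version of that update and is not pyBKT's implementation. The parameter values are rounded from the `coef_` output shown earlier for "Addition and Subtraction Integers" and should be treated as illustrative, and the function name `bkt_trace` is hypothetical. The response sequence mirrors the first five rows of student 53167, so the printed values should land close to those rows. An exact match is not expected, because the parameters are rounded and may come from a different fit.

```python
def bkt_trace(responses, prior, learn, guess, slip, forget=0.0):
    """Return a list of (p_correct, p_mastery) pairs, one per response.

    p_mastery is the mastery estimate *before* the response is observed,
    and p_correct is the predicted probability of a correct answer.
    """
    trace = []
    p_mastery = prior
    for correct in responses:
        p_correct = p_mastery * (1 - slip) + (1 - p_mastery) * guess
        trace.append((p_correct, p_mastery))
        # Bayes update of the mastery estimate given the observed response
        if correct:
            posterior = p_mastery * (1 - slip) / p_correct
        else:
            posterior = p_mastery * slip / (1 - p_correct)
        # Learning (and optional forgetting) transition to the next step
        p_mastery = posterior * (1 - forget) + (1 - posterior) * learn
    return trace


# Values rounded from the coef_ output above; treat them as illustrative.
trace = bkt_trace(
    responses=[1, 1, 1, 0, 1],
    prior=0.5408, learn=0.0061, guess=0.4368, slip=0.0790,
)
for p_correct, p_mastery in trace:
    print(f"{p_correct:.3f}  {p_mastery:.3f}")
```

The mastery estimate rises after each correct response and drops after the incorrect one, which is the pattern visible in the table.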


## Further extensions

### Enable forgetting

We train the model with forgets=True. This model assumes that the student can forget a previously learned concept. We fit a single skill to keep the training time short. Observe that the estimated probability of forgetting a skill is rather low, as you would expect.

In [32]:
BKT.fit(data=df, skills=SKILL_LIST[0], forgets=True)
BKT.params()

skill,param,class,value
Multiplication and Division Integers,prior,default,0.95161
Multiplication and Division Integers,learns,default,0.0166
Multiplication and Division Integers,guesses,default,0.00103
Multiplication and Division Integers,slips,default,0.09768
Multiplication and Division Integers,forgets,default,0.00233
Addition and Subtraction Positive Decimals,prior,default,0.61776
Addition and Subtraction Positive Decimals,learns,default,0.11423
Addition and Subtraction Positive Decimals,guesses,default,0.10877
Addition and Subtraction Positive Decimals,slips,default,0.18664
Addition and Subtraction Positive Decimals,forgets,default,0.01368
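
One way to read a small forgetting probability: if the evidence from responses is ignored, the learn and forget transitions alone pull the mastery probability toward the fixed point learn / (learn + forget). The sketch below iterates that transition with illustrative values taken from the table above (0.0166 and 0.00233); the function name `iterate_mastery` is hypothetical.

```python
def iterate_mastery(p_mastery, learn, forget, steps):
    """Apply the learn/forget transition `steps` times, with no observations."""
    for _ in range(steps):
        p_mastery = p_mastery * (1 - forget) + (1 - p_mastery) * learn
    return p_mastery


learn, forget = 0.0166, 0.00233  # illustrative, taken from the table above
fixed_point = learn / (learn + forget)
print(f"fixed point: {fixed_point:.3f}")
for start in (0.1, 0.95):
    print(start, "->", round(iterate_mastery(start, learn, forget, 2000), 3))
```

Both starting points converge to the same value, so even a forgetting probability well below the learn probability caps the long-run mastery estimate below 1.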


In [None]:
auc_train = BKT.evaluate(data=df, metric="auc")
accuracy_train = BKT.evaluate(data=df, metric="accuracy")

print(f"Training AUC: {auc_train}; Training accuracy: {accuracy_train}")

### Multiguess 
The **multiguess** option estimates separate guess and slip parameters for different sets of items. The input dataframe needs a column that identifies sets of similar problems. Here we use *template_id*, so all problems with the same *template_id* value share the same guess and slip parameters. We pass the name of this column via the parameter *multigs*.
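
Conceptually, the mastery estimate is still tracked per skill, but the guess and slip used to predict a response depend on the item's class. The sketch below illustrates this with two hypothetical template IDs and made-up parameter values; it is not pyBKT's implementation.

```python
# Hypothetical per-template guess/slip values (made up for illustration).
GUESS_SLIP = {
    "template_A": {"guess": 0.40, "slip": 0.05},  # e.g. multiple choice
    "template_B": {"guess": 0.05, "slip": 0.20},  # e.g. open answer
}


def p_correct(p_mastery, template_id, params=GUESS_SLIP):
    """Predicted probability of a correct answer for an item of a template."""
    guess = params[template_id]["guess"]
    slip = params[template_id]["slip"]
    return p_mastery * (1 - slip) + (1 - p_mastery) * guess


for template_id in GUESS_SLIP:
    print(template_id, round(p_correct(0.5, template_id), 3))
```

For the same mastery estimate, the predicted probability of a correct answer differs between the two templates, which is the extra flexibility this option adds.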

In [None]:
BKT.fit(data=df, skills=SKILL_LIST[0], multigs="template_id")
BKT.params()

In [None]:
auc_train = BKT.evaluate(data=df, metric="auc")
accuracy_train = BKT.evaluate(data=df, metric="accuracy")

print(f"Training AUC: {auc_train}; Training accuracy: {accuracy_train}")

We can now check the guess and slip parameters for each *template_id*.

In [18]:
params = BKT.params()
plt.figure(figsize=(12, 6))
plt.plot(params.loc[(SKILL_LIST[0], "guesses")], label="Guesses")
plt.plot(params.loc[(SKILL_LIST[0], "slips")], label="Slips")
plt.xlabel("Template ID")
plt.ylabel("Guess/Slip Rate")
plt.title("BKT Parameters per Template ID Class")
plt.legend()
plt.show()

count   816.00000
mean     32.70588
std      39.55317
min       1.00000
25%      10.00000
50%      20.00000
75%      35.25000
max     296.00000
Name: problem_id, dtype: float64
count   816.00000
mean      0.94730
std       0.68707
min       0.00000
25%       1.00000
50%       1.00000
75%       1.00000
max       4.00000
Name: skill_name, dtype: float64
