## Deep Knowledge Tracing

We will try out three different deep learning models.

In [None]:
# Import libraries
import gdown
import os
import pandas as pd
import json
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# If running on colab, clone the repo
!git clone https://github.com/benjamingarzon/AMLD2025-Education-Workshop.git
%cd AMLD2025-Education-Workshop

Cloning into 'AMLD2025-Education-Workshop'...
remote: Enumerating objects: 144, done.[K
remote: Counting objects: 100% (144/144), done.[K
remote: Compressing objects: 100% (105/105), done.[K
remote: Total 144 (delta 57), reused 113 (delta 29), pack-reused 0 (from 0)[K
Receiving objects: 100% (144/144), 1.67 MiB | 5.27 MiB/s, done.
Resolving deltas: 100% (57/57), done.
/content/AMLD2025-Education-Workshop


# Data

We will fit data from the ASSISTments platform (https://www.commonsense.org/education/website/assistments). These data come from the 2009 school year.

First make sure to download the data from https://drive.google.com/uc?id=1y_1QJ1piwRlM9WXjyR2_4Yk19NubVpjM and save it in the folder **./DeepModels/datasets/ASSIST2009/**. The data preparation script assumes that the data are in a file **./DeepModels/datasets/ASSIST2009/skill_builder_data.csv**.

In [None]:
url = "https://drive.google.com/uc?id=1y_1QJ1piwRlM9WXjyR2_4Yk19NubVpjM"
output = "skill_builder_data.zip"
gdown.download(url, output, quiet=False)
!unzip skill_builder_data.zip 

In [None]:
os.makedirs("DeepModels/datasets/ASSIST2009", exist_ok=True)
!mv skill_builder_data.csv DeepModels/datasets/ASSIST2009/

Inspect the data:
- *user_id* - Student ID.
- *log_id* - Indicates the order in which the items were presented.
- *sequence_id* - Knowledge Component (KC). Groups items (exercises/questions) of a similar topic or skill.
- *correct* - Whether the response is correct (1) or incorrect (0).

In [2]:
data_dir = "DeepModels/datasets/ASSIST2009"
data = pd.read_csv(
    os.path.join(data_dir, "skill_builder_data.csv"),
    sep=","
)

print(data.head(10))
print(data.user_id.nunique())
print(data.shape)
print(data.correct.value_counts())

   Unnamed: 0  order_id  assignment_id  user_id  assistment_id  problem_id  \
0           1  33022537         277618    64525          33139       51424   
1           2  33022709         277618    64525          33150       51435   
2           3  35450204         220674    70363          33159       51444   
3           4  35450295         220674    70363          33110       51395   
4           5  35450311         220674    70363          33196       51481   
5           6  35450555         220674    70363          33172       51457   
6           7  35450573         220674    70363          33174       51459   
7           8  35480603         220674    70363          33123       51408   
8           9  33140811         220674    70677          33168       51453   
9          10  33140919         220674    70677          33112       51397   

   original  correct  attempt_count  ms_first_response  ... hint_count  \
0         1        1              1              32454  ...        

# Preprocessing

Since datasets can differ in how they are formatted, the **data_loaders** folder contains a preprocessing script for each dataset. These scripts are called automatically when you run the model, so there is no need to call them separately. To avoid a long wait, we can run *train.py* with the option *--only_preprocess*, which runs the preprocessing and skips the training.

In [9]:
os.chdir("DeepModels/")

In [10]:
!python train.py --model_name=dkt --dataset_name=ASSIST2009 --only_preprocess

Briefly, the preprocessing implements the following steps (see train.py and data_loaders/assist2009.py):

- Sort responses by log_id (time) for each student.
- Remove observations if correct is different from 1 or 0.
- Reindex students and items (consecutive indices starting from 0).
- Split sequences of KCs and responses so that they have uniform length (*seq_len*).
- Split the dataset into train and test sets (e.g., 90%/10%).
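
The steps above can be sketched in a few lines of plain Python. This is only an illustration on a made-up toy log, not the code in *data_loaders/assist2009.py*: the function name `preprocess`, the tuple layout of `rows`, and the random split over sequences are assumptions made for the example.

```python
import random
from collections import defaultdict

# Toy interaction log: (user_id, log_id, kc, correct). All values are made up.
rows = [
    (70, 3, "Area Circle", 1),
    (70, 1, "Box and Whisker", 0),
    (70, 2, "Area Circle", 1),
    (64, 5, "Area Circle", 0),
    (64, 4, "Box and Whisker", 2),  # not 0 or 1, so it is dropped
    (64, 6, "Box and Whisker", 1),
]


def preprocess(rows, seq_len, train_ratio, seed=0):
    """Hypothetical helper that mirrors the listed preprocessing steps."""
    # Remove observations whose response is not 0 or 1.
    rows = [r for r in rows if r[3] in (0, 1)]

    # Reindex students and KCs with consecutive indices starting from 0.
    u_list = sorted({r[0] for r in rows})
    q_list = sorted({r[2] for r in rows})
    u2idx = {u: i for i, u in enumerate(u_list)}
    q2idx = {q: i for i, q in enumerate(q_list)}

    # Group responses by student.
    by_user = defaultdict(list)
    for user, log_id, kc, correct in rows:
        by_user[user].append((log_id, q2idx[kc], correct))

    q_seqs, r_seqs = [], []
    for user in u_list:
        events = sorted(by_user[user])  # sort by log_id (time)
        q = [e[1] for e in events]
        r = [e[2] for e in events]
        # Cut each student's sequence into chunks of at most seq_len steps.
        for start in range(0, len(q), seq_len):
            q_seqs.append(q[start:start + seq_len])
            r_seqs.append(r[start:start + seq_len])

    # Random train/test split over the resulting sequences.
    indices = list(range(len(q_seqs)))
    random.Random(seed).shuffle(indices)
    n_train = int(len(indices) * train_ratio)
    return u2idx, q2idx, q_seqs, r_seqs, indices[:n_train], indices[n_train:]


u2idx, q2idx, q_seqs, r_seqs, train_idx, test_idx = preprocess(
    rows, seq_len=2, train_ratio=0.5
)
print(q_seqs)  # [[0, 1], [1, 0], [0]]
print(r_seqs)  # [[0, 1], [0, 1], [1]]
```

Student 70 has three valid responses, so with `seq_len=2` the sequence is cut into one chunk of length 2 and one of length 1; the shorter chunk would later be padded.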

The preprocessing creates the following files that are used to train the model:

- *u_list.pkl*: List of student ids.
- *u2idx.pkl*: Dict with mapping of student ids to indices.
- *q_list.pkl*: List of knowledge component (KC) ids.
- *q2idx.pkl*: Dict with mapping of KC ids to indices.
- *r_seqs.pkl*: List of sequences of responses. Each element is an array that corresponds to the sequence of responses of one student, ordered by timestamp (log_id).
- *q_seqs.pkl*: List of sequences of KCs. Each element is an array that corresponds to the sequence of KCs of one student, ordered by timestamp (log_id). It matches r_seqs.
- *train_indices.pkl*, *test_indices.pkl*: Training/test indices.

Inspect these files:

In [11]:
data_dir = "./datasets/ASSIST2009"
pkl_files = [f for f in os.listdir(data_dir) if f.endswith(".pkl")]
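
# Load every pickle file into a dict keyed by file name (without extension)
data_dict = {}
for file in pkl_files:
    with open(os.path.join(data_dir, file), "rb") as f:
        data_dict[os.path.splitext(file)[0]] = pickle.load(f)

# Student ids and KC ids
print(data_dict["u_list"][:5])
print(data_dict["q_list"][:5])

# KC and response sequences of the first five students
print(data_dict["q_seqs"][:5])
print(data_dict["r_seqs"][:5])
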
print(len(data_dict["train_indices"]))
print(len(data_dict["test_indices"]))

[   14 21825 52613 53167 64525]
['Addition Whole Numbers' 'Angles on Parallel Lines Cut by a Transversal'
 'Area Circle' 'Box and Whisker' 'Calculations with Similar Figures']
[array([ 5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5,  5, 18, 28, 28, 18, 28,
       18, 18]), array([33, 33, 33, 33, 33, 33, 33]), array([20]), array([ 8, 30,  0,  0,  0, 22, 23, 19,  4, 33, 33, 26, 26, 33, 33, 33, 33,
       33, 33,  9,  0,  0, 23, 21, 22,  7,  7,  7,  7,  7,  6,  6,  6,  6,
        6,  6, 33, 33, 33,  4,  2,  2,  2,  2,  2,  0,  0,  0,  0,  0,  0,
        0, 23,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 34, 34, 34, 34, 34,
       34, 11, 11, 11, 11, 23, 23, 23, 21, 21, 21, 21, 21, 21, 22, 22, 22,
       22, 22, 22, 22, 22, 22, 30,  8, 19, 27, 33, 33, 33, 33, 33, 18, 17,
       31, 31, 31, 31, 31,  4,  4,  4,  4,  4,  4,  4]), array([ 0, 30,  8, 22, 23,  0,  0,  0,  0,  0,  0,  0, 23,  9,  9,  9,  9,
        9,  9,  9,  9, 34, 11, 23, 23, 23, 21, 21, 21, 21, 21, 21, 22, 22,
       22, 22, 22, 22, 2

Finally, *q_seqs* and *r_seqs* are converted into datasets that consist of pairs *(q, r)*, where *q* and *r* are 1-d arrays: *q* is a sequence of KC indices and *r* is a sequence of responses (0 or 1). The sequences have length *seq_len* and are padded at the end with -1.
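
As a rough illustration of the padding, the sketch below right-pads a short sequence with -1 and builds a mask of the real (non-padded) positions. The helper names `pad_sequence` and `valid_mask` are hypothetical and the example values are made up; the repository may implement this differently.

```python
PAD = -1  # padding value described above


def pad_sequence(seq, seq_len, pad_value=PAD):
    """Right-pad seq with pad_value so that it has exactly seq_len items."""
    if len(seq) > seq_len:
        raise ValueError("sequence is longer than seq_len; split it first")
    return list(seq) + [pad_value] * (seq_len - len(seq))


def valid_mask(seq, pad_value=PAD):
    """True for real observations, False for padded positions."""
    return [x != pad_value for x in seq]


q = pad_sequence([5, 5, 18, 28], seq_len=6)  # KC indices
r = pad_sequence([1, 0, 1, 1], seq_len=6)    # responses
mask = valid_mask(q)

print(q)     # [5, 5, 18, 28, -1, -1]
print(r)     # [1, 0, 1, 1, -1, -1]
print(mask)  # [True, True, True, True, False, False]
```

A mask like this is typically used so that padded positions do not contribute to the loss or to the evaluation metric.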

# Configuration

The configuration file *config.json* lets you specify the training parameters and the parameters of each model.

In [12]:
with open("config.json", "r") as f:
    print(f.read())

{
    "train_config": {
        "batch_size": 256,
        "num_epochs": 200,
        "train_ratio": 0.9,
        "learning_rate": 0.001,
        "optimizer": "adam",
        "seq_len": 100
    },
    "dkt": {
        "emb_size": 50,
        "hidden_size": 50
    },
    "dkvmn": {
        "dim_s": 50,
        "size_m": 20
    },
    "sakt": {
        "n": 100,
        "d": 100,
        "num_attn_heads": 5,
        "dropout": 0.2
    }
}
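
To show how such a file can be consumed, here is a minimal sketch that parses a trimmed copy of the configuration with the standard `json` module and picks the shared training settings plus the block for one model. The helper `get_model_settings` is hypothetical; how *train.py* actually reads the file is not shown here.

```python
import json

# Trimmed copy of the configuration printed above (only the dkt block is kept).
config_text = """
{
    "train_config": {
        "batch_size": 256,
        "num_epochs": 200,
        "train_ratio": 0.9,
        "learning_rate": 0.001,
        "optimizer": "adam",
        "seq_len": 100
    },
    "dkt": {
        "emb_size": 50,
        "hidden_size": 50
    }
}
"""


def get_model_settings(config, model_name):
    """Return (training settings, model settings) for one model name."""
    if model_name == "train_config" or model_name not in config:
        raise KeyError(f"no model section named {model_name!r}")
    return config["train_config"], config[model_name]


config = json.loads(config_text)
train_cfg, model_cfg = get_model_settings(config, "dkt")

print(train_cfg["seq_len"], train_cfg["batch_size"])  # 100 256
print(model_cfg)  # {'emb_size': 50, 'hidden_size': 50}
```

Keeping the training settings separate from the per-model blocks means that all three models share the same batch size, sequence length and optimizer unless the file is edited.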


# Fitting the model

Now you can fit a model. There are three possibilities: Deep Knowledge Tracing (**dkt**, https://arxiv.org/abs/1506.05908), Dynamic Key-Value Memory Networks (**dkvmn**, https://arxiv.org/abs/1611.08108), and Self-Attentive Knowledge Tracing (**sakt**, https://arxiv.org/abs/1907.06837). After training, the results for the test set (loss and AUC per epoch) can be found in the folder *ckpts*. You can see the code for these models in the folder **./models**.
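
For readers unfamiliar with the metric: the AUC is the probability that a randomly chosen correct response receives a higher predicted probability than a randomly chosen incorrect one. The sketch below computes it by direct pairwise comparison and skips padded positions (-1). The function `masked_auc` and the example values are made up for illustration, and whether *train.py* masks padding in exactly this way is an assumption.

```python
def masked_auc(targets, scores, pad_value=-1):
    """Pairwise AUC over non-padded positions (ties count as half a win)."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    # Positions where the target equals pad_value fall in neither list.
    if not pos or not neg:
        raise ValueError("AUC needs at least one correct and one incorrect response")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up responses (padded with -1) and predicted probabilities of success.
targets = [1, 0, 1, 0, -1, -1]
scores = [0.9, 0.4, 0.35, 0.2, 0.5, 0.5]

print(masked_auc(targets, scores))  # 0.75
```

The double loop is quadratic in the number of responses, which is fine for a toy example; library implementations use a sort-based computation instead.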

In [None]:
# model_name can be dkt, dkvmn or sakt
!python train.py --model_name=dkt --dataset_name=ASSIST2009

Epoch: 1,   AUC: 0.6172876917652882,   Loss Mean: 0.6766604781150818
Epoch: 2,   AUC: 0.6496446325024796,   Loss Mean: 0.6522507071495056
Epoch: 3,   AUC: 0.66781883817627,   Loss Mean: 0.6416162848472595
Epoch: 4,   AUC: 0.6816481995884448,   Loss Mean: 0.6361633539199829
Epoch: 5,   AUC: 0.6897770595008041,   Loss Mean: 0.6319649815559387
Epoch: 6,   AUC: 0.6965738204183785,   Loss Mean: 0.6292781829833984
Epoch: 7,   AUC: 0.7017494141889699,   Loss Mean: 0.6272977590560913
Epoch: 8,   AUC: 0.7049351374899437,   Loss Mean: 0.6282800436019897
Epoch: 9,   AUC: 0.7061627771367227,   Loss Mean: 0.6237517595291138
Epoch: 10,   AUC: 0.7081211772744564,   Loss Mean: 0.622798502445221
Epoch: 11,   AUC: 0.71038235325766,   Loss Mean: 0.6240051984786987
Epoch: 12,   AUC: 0.7117253709345316,   Loss Mean: 0.6226149797439575
Epoch: 13,   AUC: 0.7137116216236746,   Loss Mean: 0.6218594312667847
Epoch: 14,   AUC: 0.7137068941508146,   Loss Mean: 0.6206926703453064
Epoch: 15,   AUC: 0.71508001732430

Since this takes a while to run, precomputed results for the three models (dkt, dkvmn, sakt) are available in the folder *ckpts_precomputed/[model_name]/[dataset_name]*. Keep in mind that a fair comparison between the models would require tuning the hyperparameters of each model and using early stopping on a validation dataset, which takes multiple runs.
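
Early stopping itself is simple to express. The sketch below scans a list of validation AUCs and stops once the AUC has not improved for `patience` consecutive epochs. The function name, the `patience` parameter and the AUC values are all made up for illustration; *train.py* is not assumed to provide this.

```python
def best_epoch_with_patience(val_aucs, patience):
    """Return (best_epoch, best_auc, stopped_at) with 1-based epoch numbers."""
    if not val_aucs:
        raise ValueError("val_aucs must contain at least one value")

    best_auc = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, auc in enumerate(val_aucs, start=1):
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training here

    return best_epoch, best_auc, epoch


# Hypothetical validation AUCs, one per epoch.
val_aucs = [0.62, 0.65, 0.67, 0.66, 0.68, 0.675, 0.67, 0.66, 0.70]
best_epoch, best_auc, stopped_at = best_epoch_with_patience(val_aucs, patience=3)

print(best_epoch, best_auc, stopped_at)  # 5 0.68 8
```

In this toy run the final value (0.70) is never reached because training stops at epoch 8, which illustrates the trade-off that the choice of `patience` controls.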

In [None]:
results_dict = {}
combined_df = None
for model_name in ["dkt", "dkvmn", "sakt"]:
    ckpt_dir = f"ckpts_precomputed/{model_name}/ASSIST2009"

    if not os.path.exists(ckpt_dir):
        continue

    pkl_files = [f for f in os.listdir(ckpt_dir) if f.endswith(".pkl")]

    if len(pkl_files) == 0:
        continue

    results_dict[model_name] = {}

    for file in pkl_files:
        with open(os.path.join(ckpt_dir, file), "rb") as f:
            results_dict[model_name][os.path.splitext(file)[0]] = pickle.load(f)

    # Create DataFrames for plotting
    df = pd.DataFrame(
        {
            # Number epochs from 1 to match the training log
            "Epoch": range(1, len(results_dict[model_name]["aucs"]) + 1),
            "AUC": results_dict[model_name]["aucs"],
            "Loss": results_dict[model_name]["loss_means"],
            "Model": model_name,
        }
    )

    if combined_df is None:
        combined_df = df
    else:
        combined_df = pd.concat([combined_df, df])

fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Plot Loss
sns.lineplot(ax=axes[0], data=combined_df, x="Epoch", y="Loss", hue="Model")
axes[0].set_title("Loss")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].grid(True)

# Plot AUC
sns.lineplot(ax=axes[1], data=combined_df, x="Epoch", y="AUC", hue="Model")
axes[1].set_title("AUC")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("AUC")
axes[1].grid(True)

plt.show()