## Deep Knowledge Tracing

We will try out three deep learning models for knowledge tracing (dkt, dkvmn and sakt). The code is in the folder *DeepModels*.

# Data

We will fit data from the ASSISTments platform (https://www.commonsense.org/education/website/assistments). These data were collected in 2015.

First make sure to download the data from https://drive.google.com/file/d/0B_hO8cnpcIMgUGZzRnh3bHJrSjQ/view?resourcekey=0-dGtan-IMFc3IjQ749-FgQA to the folder **./DeepModels/datasets/ASSIST2015/**. The data preparation script assumes that the data are in a file **./DeepModels/datasets/ASSIST2015/2015_100_skill_builders_main_problems.csv**.

In [9]:
# Imports
import os
import pandas as pd
import json
import pickle
import seaborn as sns
import matplotlib.pyplot as plt

In [10]:
os.path.abspath(os.getcwd())
os.makedirs("DeepModels/datasets/ASSIST2015", exist_ok=True)

Inspect the data:
- *user_id* - Student ID. 
- *log_id* - Response log ID. Indicates the order in which the responses were recorded.
- *sequence_id* - Knowledge Component (KC). Groups items (exercises/questions) of a similar topic or skill.
- *correct* - Whether the response is correct (1) or incorrect (0).

In [None]:
data_dir = "DeepModels/datasets/ASSIST2015"
data = pd.read_csv(
    os.path.join(data_dir, "2015_100_skill_builders_main_problems.csv"), sep=","
)

print(data.head(10))
print(data.user_id.nunique())
print(data.shape)
print(data.correct.value_counts())

# Preprocessing

Datasets can differ in how they are formatted, so the **data_loaders** folder contains dataset-specific preprocessing scripts. These are called automatically when you run the model, so there is no need to call them separately. To avoid a long wait, we can run *train.py* with the option *--only_preprocess*, which runs only the preprocessing.

In [4]:
os.chdir("DeepModels/")

In [5]:
!python train.py --model_name=dkt --dataset_name=ASSIST2015 --only_preprocess 

Briefly, the preprocessing implements the following steps:

- Sort responses by log_id (time) for each student.
- Remove observations unless correct is 1 or 0 (only binary responses are kept).
- Reindex students and items.
- Split sequences of KCs and responses so that they have uniform length (*seq_len*).
- Split the dataset into training and test sets (e.g., 90%/10%).
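To make the first steps concrete, here is a minimal sketch on a toy log, not the code from **data_loaders**. It assumes the same column names as the ASSIST2015 file; the values, the variable `toy`, and the choice to number ids in sorted order are made up for illustration. It covers filtering, sorting, reindexing, and building one sequence per student; cutting to *seq_len* and the train/test split are left out.

```python
import pandas as pd

# Toy log with the same columns as the ASSIST2015 file (values are made up).
toy = pd.DataFrame(
    {
        "user_id": [7, 7, 3, 3, 3, 7],
        "log_id": [12, 10, 5, 4, 6, 11],
        "sequence_id": [100, 100, 200, 100, 200, 300],
        "correct": [1.0, 0.0, 1.0, 0.5, 0.0, 1.0],
    }
)

# Keep only binary responses, then order each student's responses by log_id.
toy = toy[toy["correct"].isin([0, 1])]
toy = toy.sort_values(["user_id", "log_id"])

# Reindex students and KCs to consecutive integers starting at 0.
# (Assumption: ids are numbered in sorted order; the real script may differ.)
u2idx = {u: i for i, u in enumerate(sorted(toy["user_id"].unique().tolist()))}
q2idx = {q: i for i, q in enumerate(sorted(toy["sequence_id"].unique().tolist()))}

# Build one KC sequence and one matching response sequence per student.
q_seqs, r_seqs = [], []
for _, group in toy.groupby("user_id"):
    q_seqs.append([q2idx[q] for q in group["sequence_id"]])
    r_seqs.append([int(r) for r in group["correct"]])

print(q_seqs)
print(r_seqs)
```

The row with `correct = 0.5` is dropped by the filter, so student 3 ends up with two responses and student 7 with three.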

The preprocessing creates the following files that are used to train the model:

- *u_list.pkl*: List of student ids.
- *u2idx.pkl*: Dict with mapping of student ids to indices.
- *q_list.pkl*: List of knowledge component (KC) ids.
- *q2idx.pkl*: Dict with mapping of KC ids to indices.
- *r_seqs.pkl*: List of sequences of responses. Each element is an array that corresponds to the sequence of responses of one student, ordered by timestamp (log_id).
- *q_seqs.pkl*: List of sequences of KCs. Each element is an array that corresponds to the sequence of KCs of one student, ordered by timestamp (log_id). It matches r_seqs.
- *train_indices.pkl*, *test_indices.pkl*: Training/test indices.
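As an illustration of how index files such as *train_indices.pkl* and *test_indices.pkl* can define a split, the sketch below shuffles the positions of a toy list of sequences and cuts them 90%/10%. This is an assumption about the mechanism, not the repository's code: the seed, the ratio variable `train_ratio`, and the toy `q_seqs` are hypothetical.

```python
import random

# Toy stand-ins for the preprocessed sequences (20 sequences of length 3).
q_seqs = [[i, i, i] for i in range(20)]

train_ratio = 0.9       # assumed 90%/10% split
rng = random.Random(0)  # fixed seed so the split is reproducible

# Shuffle the positions of the sequences and cut the shuffled list in two.
indices = list(range(len(q_seqs)))
rng.shuffle(indices)
n_train = int(len(indices) * train_ratio)
train_indices, test_indices = indices[:n_train], indices[n_train:]

# The index lists select the sequences that belong to each split.
train_q = [q_seqs[i] for i in train_indices]
test_q = [q_seqs[i] for i in test_indices]
print(len(train_q), len(test_q))
```

Storing indices rather than copies of the data keeps *q_seqs* and *r_seqs* aligned: the same index list is applied to both.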

Inspect these files:

In [None]:
data_dir = "./datasets/ASSIST2015"
# Load every pickle file produced by the preprocessing into a dict keyed by file name
pkl_files = [f for f in os.listdir(data_dir) if f.endswith(".pkl")]

data_dict = {}
for file in pkl_files:
    with open(os.path.join(data_dir, file), "rb") as f:
        data_dict[os.path.splitext(file)[0]] = pickle.load(f)

# Student and KC ids
print(data_dict["u_list"][:5])
print(data_dict["q_list"][:5])

# KC sequences and the matching response sequences
print(data_dict["q_seqs"][:5])
print(data_dict["r_seqs"][:5])

# Sizes of the training and test index lists
print(len(data_dict["train_indices"]))
print(len(data_dict["test_indices"]))

Finally, *q_seqs* and *r_seqs* are converted into datasets that consist of pairs *(q, r)*, where *q* and *r* are 1-d arrays: *q* is a sequence of KC indices and *r* is the matching sequence of responses (0 or 1). The sequences have length *seq_len*; shorter ones are padded at the end with -1.
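The cutting and padding can be sketched in a few lines of plain Python. The helper `split_and_pad` is hypothetical and only illustrates the idea; the repository's implementation may differ in details such as the exact chunk length.

```python
def split_and_pad(seq, seq_len, pad_val=-1):
    """Cut seq into chunks of seq_len and pad the last chunk with pad_val."""
    chunks = [seq[i : i + seq_len] for i in range(0, len(seq), seq_len)]
    return [chunk + [pad_val] * (seq_len - len(chunk)) for chunk in chunks]


# One student's KC indices and responses (made-up values).
q = [4, 4, 7, 7, 7, 2, 2]
r = [1, 0, 0, 1, 1, 0, 1]

q_chunks = split_and_pad(q, seq_len=3)
r_chunks = split_and_pad(r, seq_len=3)
print(q_chunks)  # [[4, 4, 7], [7, 7, 2], [2, -1, -1]]
print(r_chunks)  # [[1, 0, 0], [1, 1, 0], [1, -1, -1]]

# Padded positions can be masked out later by checking for the pad value.
mask = [[value != -1 for value in chunk] for chunk in q_chunks]
```

Because *q* and *r* are cut at the same positions, each pair of chunks stays aligned, and the -1 entries mark positions that should not contribute to the loss or the evaluation.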

# Configuration

The configuration file *config.json* lets you specify parameters for the training and for the different models.

In [None]:
with open("config.json", "r") as f:
    print(f.read())

# Fitting the model

Now you can run the model. After running it, the results for the test set (losses and AUCs) can be found in the folder *ckpts*.
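For readers unfamiliar with the metric: AUC is the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one. The sketch below computes it directly from that definition in plain Python and skips padded positions (-1). The function `auc` and the toy values are illustrative assumptions; the training script may compute the metric differently, for example with scikit-learn's `roc_auc_score`.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Responses padded with -1 and made-up model predictions for each position.
r = [1, 0, 1, 1, -1, -1]
p = [0.9, 0.4, 0.3, 0.8, 0.5, 0.5]

# Drop padded positions before evaluating.
valid = [(y, s) for y, s in zip(r, p) if y != -1]
labels, scores = zip(*valid)
test_auc = auc(labels, scores)
print(test_auc)
```

Here two of the three (correct, incorrect) pairs are ordered correctly, so the AUC is 2/3. A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.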

In [None]:
# model_name can be dkt, dkvmn or sakt
!python train.py --model_name=dkvmn --dataset_name=ASSIST2015 

Since this takes a while to run, precomputed results for the three models (dkt, dkvmn, sakt) are provided in the folder *ckpts_precomputed/<model_name>/<dataset_name>*.

In [None]:
results_dict = {}
combined_df = None
for model_name in ["dkt", "dkvmn", "sakt"]:
    ckpt_dir = f"ckpts_precomputed/{model_name}/ASSIST2015"
    pkl_files = [f for f in os.listdir(ckpt_dir) if f.endswith(".pkl")]

    if len(pkl_files) == 0:
        continue

    results_dict[model_name] = {}

    for file in pkl_files:
        with open(os.path.join(ckpt_dir, file), "rb") as f:
            results_dict[model_name][os.path.splitext(file)[0]] = pickle.load(f)

    # Create DataFrames for plotting
    df = pd.DataFrame(
        {
            "Epoch": range(len(results_dict[model_name]["aucs"])),
            "AUC": results_dict[model_name]["aucs"],
            "Loss": results_dict[model_name]["loss_means"],
            "Model": model_name,
        }
    )

    if combined_df is None:
        combined_df = df
    else:
        # ignore_index avoids duplicate index labels across models
        combined_df = pd.concat([combined_df, df], ignore_index=True)

fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Plot Loss
sns.lineplot(ax=axes[0], data=combined_df, x="Epoch", y="Loss", hue="Model")
axes[0].set_title("Loss")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].grid(True)

# Plot AUC
sns.lineplot(ax=axes[1], data=combined_df, x="Epoch", y="AUC", hue="Model")
axes[1].set_title("AUC")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("AUC")
axes[1].grid(True)

plt.show()