# Deep Knowledge Tracing (DKT) Implementation

## Introduction
Deep Knowledge Tracing (DKT) is a deep learning method for tracking students' knowledge states in educational settings. Unlike traditional knowledge tracing, DKT uses Recurrent Neural Networks (RNNs) to model the student learning process. This lets it predict the probability that a student will answer a specific problem correctly.

## Environment Setup
AWS Inferentia2 (INF2) is an accelerator specifically designed for deep learning inference. We install the necessary libraries to utilize it.

In [None]:
# Install Neuron SDK and PyTorch NeuronX
!pip install torch-neuronx torchvision neuronx-cc[tensorflow] -U

# Install other required packages 
!pip install numpy pandas scikit-learn matplotlib -U

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting torch-neuronx
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-2.1.2.2.3.2-py3-none-any.whl (2.6 MB)
Collecting torchvision
  Using cached torchvision-0.19.1-cp38-cp38-manylinux1_x86_64.whl.metadata (6.0 kB)
Collecting torch==2.1.* (from torch-neuronx)
  Using cached torch-2.1.2-cp38-cp38-manylinux1_x86_64.whl.metadata (25 kB)
INFO: pip is looking at multiple versions of torch-neuronx to determine which version is compatible with other requirements. This could take a while.
Collecting torch-neuronx
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-2.1.2.2.3.1-py3-none-any.whl (2.6 MB)
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-2.1.2.2.3.0-py3-none-any.whl (2.6 MB)
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-2.1.2.2.2.0-py3-none-any.whl (2.5 MB)
  Using 

## Library Import

Import required libraries.
- numpy: Basic library for numerical operations 
- pandas: Data processing and analysis
- torch: Deep learning model implementation
- torch_neuronx: AWS Inferentia2 optimization
- sklearn: Performance evaluation and data preprocessing

In [28]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch_neuronx
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

## Dataset Introduction
The ASSISTments 2009 dataset is a widely used public dataset in educational data mining with these characteristics:
- Students' mathematics problem-solving records
- Information including problem ID, student ID, correctness
- Data collected from real educational environments

In [None]:
# Set dataset path (change to actual path)
data_path = 'datasets/ASSIST2009/skill_builder_data.csv'

# Load data
data = pd.read_csv(data_path, encoding='ISO-8859-1', low_memory=False)
data.head()

Unnamed: 0,order_id,assignment_id,user_id,assistment_id,problem_id,original,correct,attempt_count,ms_first_response,tutor_mode,...,hint_count,hint_total,overlap_time,template_id,answer_id,answer_text,first_action,bottom_hint,opportunity,opportunity_original
0,33022537,277618,64525,33139,51424,1,1,1,32454,tutor,...,0,3,32454,30799,,26,0,,1,1.0
1,33022709,277618,64525,33150,51435,1,1,1,4922,tutor,...,0,3,4922,30799,,55,0,,2,2.0
2,35450204,220674,70363,33159,51444,1,0,2,25390,tutor,...,0,3,42000,30799,,88,0,,1,1.0
3,35450295,220674,70363,33110,51395,1,1,1,4859,tutor,...,0,3,4859,30059,,41,0,,2,2.0
4,35450311,220674,70363,33196,51481,1,0,14,19813,tutor,...,3,4,124564,30060,,65,0,0.0,3,3.0


### Data Preprocessing

Preprocess the data into the format required for model training.

In [None]:
# Select required columns
data = data[['user_id', 'problem_id', 'correct']]

# Remove missing values
data = data.dropna()

# ID encoding
from sklearn.preprocessing import LabelEncoder

user_encoder = LabelEncoder()
data['user_id'] = user_encoder.fit_transform(data['user_id'])

item_encoder = LabelEncoder()
data['problem_id'] = item_encoder.fit_transform(data['problem_id'])

# Check number of problems and students
num_students = data['user_id'].nunique()
num_questions = data['problem_id'].nunique()

print(f'Number of students: {num_students}')
print(f'Number of problems: {num_questions}')

Number of students: 4217
Number of problems: 26688
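The effect of the ID encoding can be seen on a handful of IDs. The sketch below uses `np.unique(..., return_inverse=True)`, which for numeric IDs should produce the same sorted, zero-based codes as `LabelEncoder`; the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical raw problem IDs, standing in for the `problem_id` column
raw_ids = np.array([51424, 51435, 51444, 51395, 51424])

# Sorted unique values and, for each row, the position of its value in that list
classes, encoded = np.unique(raw_ids, return_inverse=True)

print(classes)           # sorted unique raw IDs
print(encoded)           # contiguous codes starting at 0
print(classes[encoded])  # decoding recovers the raw IDs
```

Because the codes start at 0, the smallest problem ID shares the value 0 with the zero padding that `DKTDataset` uses later in this notebook.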


### Create Sequence Data

Create sequences of problem-solving history for each student.

In [None]:
# Group by student
grouped = data.groupby('user_id')

# Generate sequences
sequences = []

for _, group in grouped:
    seq = list(zip(group['problem_id'].values, group['correct'].values))
    sequences.append(seq)

print(f'Total number of sequences: {len(sequences)}')

Total number of sequences: 4217
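Sequence models depend on the order of interactions, and `groupby` keeps rows in whatever order they appear in the DataFrame. A minimal sketch of building chronologically ordered sequences follows, assuming `order_id` reflects the order in which responses were logged and is kept during column selection (the preprocessing cell in this notebook drops it). The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical log rows, deliberately out of order
df = pd.DataFrame({
    'order_id':   [30, 10, 20, 40],
    'user_id':    [1, 1, 2, 1],
    'problem_id': [7, 5, 9, 6],
    'correct':    [1, 0, 1, 1],
})

# Sort first: groupby preserves the row order inside each group
df = df.sort_values('order_id')
toy_sequences = [
    list(zip(g['problem_id'].tolist(), g['correct'].tolist()))
    for _, g in df.groupby('user_id')
]
print(toy_sequences)  # [[(5, 0), (7, 1), (6, 1)], [(9, 1)]]
```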


### Configure Dataset and DataLoader

Define PyTorch Dataset and DataLoader for model training.

In [None]:
class DKTDataset(Dataset):
    def __init__(self, sequences, num_questions, seq_len=100):
        self.sequences = sequences
        self.num_questions = num_questions
        self.seq_len = seq_len
        
    def __len__(self):
        return len(self.sequences)
    
    def __getitem__(self, idx):
        seq = self.sequences[idx]
        
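        # Keep only the most recent seq_len interactions
        seq = seq[-self.seq_len:]
        
        # Zero-padded arrays; 0 doubles as the padding value, so steps with
        # encoded problem ID 0 are also excluded by the `y != 0` mask used later
        x = np.zeros(self.seq_len, dtype=int)
        y = np.zeros(self.seq_len, dtype=int)
        
        for i, (q, r) in enumerate(seq):
            x[i] = q + self.num_questions * r  # interaction token: problem ID offset by correctness
            y[i] = q                           # problem whose prediction is scored at this step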
        
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

# Create dataset
dataset = DKTDataset(sequences, num_questions)

# Split into training and test sets
from torch.utils.data import random_split

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# Create DataLoaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# drop_last=True keeps every test batch at the fixed size the compiled model expects
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, drop_last=True)
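The input encoding `q + num_questions * r` used in `DKTDataset` packs a problem index and a 0/1 response into a single integer token. The following sketch shows the encoding and its inverse; the helper names and the small vocabulary size are hypothetical and exist only for illustration.

```python
def encode_interaction(q, r, num_questions):
    """Combine a problem index and a 0/1 response into one token."""
    return q + num_questions * r


def decode_interaction(token, num_questions):
    """Recover (problem index, response) from a token."""
    return token % num_questions, token // num_questions


toy_num_questions = 5  # hypothetical small vocabulary
token = encode_interaction(q=3, r=1, num_questions=toy_num_questions)
print(token)                                         # 8
print(decode_interaction(token, toy_num_questions))  # (3, 1)
```

Tokens below `num_questions` therefore denote incorrect answers and tokens at or above it denote correct ones, which is how the training loop recovers its targets.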

## DKT Model Architecture
The DKT model has three layers:
1. Embedding Layer: Converts interaction tokens (problem ID combined with correctness) to vectors
2. LSTM Layer: Processes the sequence of embedded interactions
3. Output Layer: Predicts the probability of a correct answer for each problem

The model is built on an LSTM (Long Short-Term Memory) network: it takes a student's past problem-solving history as input and predicts the probability of a correct answer on the next problem.

In [8]:
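class DKTModel(nn.Module):
    def __init__(self, num_questions, hidden_size):
        super().__init__()
        self.num_questions = num_questions
        self.hidden_size = hidden_size
        # One embedding per interaction token (problem ID x correct/incorrect)
        self.embedding = nn.Embedding(num_questions * 2, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # One output per problem at every time step
        self.fc = nn.Linear(hidden_size, num_questions)
    
    def forward(self, x):
        embed = self.embedding(x)   # (batch, seq_len, hidden_size)
        h, _ = self.lstm(embed)     # (batch, seq_len, hidden_size)
        out = self.fc(h)            # (batch, seq_len, num_questions)
        preds = torch.sigmoid(out)  # per-problem probability of a correct answer
        return preds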
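The description above frames DKT as next-step prediction: the interaction at step t is the input, and the problem and response at step t + 1 are the target. The `DKTDataset` in this notebook instead scores the problem at the same step as its input token. A minimal sketch of a next-step alignment is shown below; `next_step_pairs` is a hypothetical helper, not part of the notebook, and it uses an explicit mask rather than treating ID 0 as padding.

```python
import numpy as np


def next_step_pairs(seq, num_questions, seq_len):
    """Align inputs at step t with the problem and response at step t + 1.

    seq is a list of (problem_index, response) tuples in chronological order.
    """
    seq = seq[-(seq_len + 1):]             # one extra step so every input has a target
    x = np.zeros(seq_len, dtype=np.int64)
    q_next = np.zeros(seq_len, dtype=np.int64)
    r_next = np.zeros(seq_len, dtype=np.int64)
    mask = np.zeros(seq_len, dtype=bool)
    for t in range(len(seq) - 1):
        q, r = seq[t]
        x[t] = q + num_questions * r       # interaction observed at step t
        q_next[t], r_next[t] = seq[t + 1]  # problem and response at step t + 1
        mask[t] = True                     # marks real (non-padded) steps
    return x, q_next, r_next, mask


x_toy, q_next, r_next, mask = next_step_pairs(
    [(2, 1), (0, 0), (3, 1)], num_questions=5, seq_len=4
)
print(x_toy)   # [7 0 0 0]
print(q_next)  # [0 3 0 0]
print(r_next)  # [0 1 0 0]
print(mask)    # [ True  True False False]
```

With this layout the loss would compare `preds[t, q_next[t]]` against `r_next[t]` at masked positions, so the model never sees the answer it is asked to predict.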

## Model Training

Train the model. Key choices during training:
- Batch size selection
- Learning rate adjustment  
- Loss function (BCE Loss)
- Optimizer (Adam)
- Number of epochs
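The loss used in the training cell can be written out explicitly. As a sketch, let $M$ be the set of non-padded (sequence, step) positions in a batch, $q_{b,t}$ the problem at position $(b, t)$, $r_{b,t} \in \{0, 1\}$ its correctness, and $p_{b,t}$ the model output for problem $q_{b,t}$ at that position. These symbols are introduced here for illustration. Binary cross-entropy with mean reduction, the default for `nn.BCELoss`, is then:

$$
\mathcal{L} = -\frac{1}{|M|} \sum_{(b,t) \in M} \Big[ r_{b,t} \log p_{b,t} + (1 - r_{b,t}) \log\big(1 - p_{b,t}\big) \Big]
$$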

In [None]:
# Train on CPU; INF2 is used only for the compiled inference step
device = torch.device('cpu')

# Set hyperparameters
num_epochs = 10
hidden_size = 100

# Initialize model
model = DKTModel(num_questions, hidden_size).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
train_losses = []

model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for x, y in train_loader:
        x = x.to(device)
        y = y.to(device)
        
        optimizer.zero_grad()
        preds = model(x)
        
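        # Select, for each non-padded step, the prediction for that step's problem
        target_mask = (y != 0)  # 0 marks padding
        target_indices = y[target_mask]
        
        preds = preds[target_mask, target_indices]
        # Correctness is recovered from the input token:
        # tokens >= num_questions encode correct answers
        targets = (x[target_mask] >= num_questions).float()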
        
        loss = criterion(preds, targets)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    train_losses.append(avg_loss)
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}')

# Save trained model
torch.save(model.state_dict(), 'dkt_model.pth')

Epoch [1/10], Loss: 0.6603
Epoch [2/10], Loss: 0.5177
Epoch [3/10], Loss: 0.2917
Epoch [4/10], Loss: 0.1760
Epoch [5/10], Loss: 0.1177
Epoch [6/10], Loss: 0.0840
Epoch [7/10], Loss: 0.0631
Epoch [8/10], Loss: 0.0489
Epoch [9/10], Loss: 0.0390
Epoch [10/10], Loss: 0.0319
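The indexing expression `preds[target_mask, target_indices]` in the training loop combines a boolean mask with an integer index tensor. The sketch below reproduces it on a toy tensor; the shapes and values are hypothetical and chosen so the selected entries are easy to check by hand.

```python
import torch

# Hypothetical toy shapes: 2 sequences, 3 steps, 4 problems
toy_preds = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4) / 24.0
toy_y = torch.tensor([[1, 3, 0],   # 0 marks padding
                      [2, 0, 0]])

toy_mask = (toy_y != 0)            # shape (2, 3), True at real steps
toy_indices = toy_y[toy_mask]      # problem index for each real step

# The mask selects (sequence, step) rows; the index tensor picks one problem per row
selected = toy_preds[toy_mask, toy_indices]
print(toy_indices)     # tensor([1, 3, 2])
print(selected.shape)  # torch.Size([3])
```

The result is a flat tensor with one probability per real interaction, which matches the shape of the target vector passed to `nn.BCELoss`.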


## Model Compilation (Neuron Optimization)

Compile the trained model using `torch_neuronx` for inference on INF2.

The reasons for creating example_inputs are:

1. **Define Input Shape for Model Compilation**
   - The `torch_neuronx.trace()` function needs the concrete shape and data type of the input tensors in order to compile the model into a form optimized for AWS Inferentia2
   - Example inputs inform the model about the structure of data it will actually receive

2. **Compilation Optimization**
   ```python
   example_inputs = torch.randint(0, num_questions * 2, (batch_size, 100), dtype=torch.long)
   ```
   - batch_size: Number of sequences processed at once (64 here)
   - 100: Sequence length
   - num_questions * 2: Exclusive upper bound of the input values (each token combines a problem ID with correctness)

3. **Execution Graph Generation**
   - The compiler uses these example inputs to optimize the model's computation graph
   - Analyzes and optimizes computation patterns that will be used in actual inference

4. **Hardware Optimization**
   ```python
   model_neuron = torch_neuronx.trace(model, example_inputs)
   ```
   - Optimizes computations for AWS Inferentia2 hardware
   - Improves memory usage and computation speed

Example:
```python
# For actual batch size of 64 and sequence length of 100
batch_size = 64
seq_length = 100
num_questions = 100  # Illustrative value; in this notebook it is set during preprocessing

# Create example inputs
example_inputs = torch.randint(
    low=0,  # minimum value
    high=num_questions * 2,  # exclusive upper bound (num_questions * 2)
    size=(batch_size, seq_length),  # input tensor size
    dtype=torch.long  # data type
)

# Compile model with these example inputs
model_neuron = torch_neuronx.trace(model, example_inputs)
```

While these example inputs are dummy data rather than actual data, they contain all the structural information needed for model compilation and optimization.

In [None]:
# Load model
model.load_state_dict(torch.load('dkt_model.pth'))
model.eval()

# Create example inputs
example_inputs = torch.randint(0, num_questions * 2, (batch_size, 100), dtype=torch.long)

# Compile with Neuron
model_neuron = torch_neuronx.trace(model, example_inputs)

# Save compiled model
model_neuron.save('dkt_model_neuron.pt')

2024-11-25 14:22:00.000511:  18793  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-11-25 14:22:00.000512:  18793  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/f1751cb8-0a67-4cb8-bc6a-c417a2e51eee/model.MODULE_10466109672838001554+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_compile_workdir/f1751cb8-0a67-4cb8-bc6a-c417a2e51eee/model.MODULE_10466109672838001554+d41d8cd9.neff', '--verbose=35']
.
Compiler status PASS
2024-11-25 14:22:02.000881:  18863  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-11-25 14:22:02.000882:  18863  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ec2-user/neuroncc_compile_workdir/20158125-f6fd-4581-b4a0-c79305a82d2d/model.MODULE_10076059576887794638+d41d8cd9.hlo.pb', '--output', '/tmp/ec2-user/neuroncc_comp

## Performance Evaluation

Perform inference on INF2 using the compiled model. DKT models are commonly evaluated with the following metrics; the cell below reports AUC:
- AUC (Area Under the ROC Curve)
- Accuracy
- Changes in loss function values

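When running the compiled model on AWS Inferentia hardware, the input tensor shape must exactly match the shape used during compilation, here (64, 100). This is why the test DataLoader is created with `drop_last=True`: a smaller final batch would not match the traced shape.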
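Dropping the last incomplete batch means some test sequences are never scored. One possible alternative is to pad a short batch with all-zero rows up to the traced batch size and discard the extra predictions afterwards. The sketch below illustrates this idea; `pad_batch` is a hypothetical helper, not part of `torch_neuronx`, and the final call to the compiled model is left as a comment because it requires Neuron hardware.

```python
import torch


def pad_batch(x, batch_size):
    """Pad an (n, seq_len) batch with all-zero rows up to batch_size rows.

    Returns the padded batch and the number of real rows n.
    """
    n = x.shape[0]
    if n == batch_size:
        return x, n
    padding = torch.zeros(batch_size - n, x.shape[1], dtype=x.dtype)
    return torch.cat([x, padding], dim=0), n


# Hypothetical final batch of 3 sequences for a model traced with batch size 64
x_last = torch.randint(0, 10, (3, 100), dtype=torch.long)
x_padded, n_real = pad_batch(x_last, batch_size=64)
print(x_padded.shape, n_real)  # torch.Size([64, 100]) 3

# preds = model_neuron(x_padded)[:n_real]  # keep only predictions for real rows
```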

In [None]:
# Load compiled model
model_neuron = torch.jit.load('dkt_model_neuron.pt')
model_neuron.eval()

all_preds = []
all_targets = []
total_samples = 0
total_sequences = 0

with torch.no_grad():
    for batch_idx, (x, y) in enumerate(test_loader):
        x = x.to(device)
        y = y.to(device)
        preds = model_neuron(x)
        
        batch_samples = 0
        for i in range(len(y)):
            target_mask = (y[i] != 0)
            target_indices = y[i][target_mask]
            pred = preds[i][target_mask, target_indices]
            
            # Calculate actual number of problems in current sequence
            num_problems = len(target_indices)
            batch_samples += num_problems
            
            all_preds.extend(pred.cpu().numpy())
            all_targets.extend((x[i][target_mask] >= num_questions).cpu().numpy())
        
        total_samples += batch_samples
        total_sequences += len(y)

print(f'\n[Test Statistics]')
print(f'Total sequences: {total_sequences}')
print(f'Total problems: {total_samples}')
print(f'Average problems per sequence: {total_samples/total_sequences:.2f}')
print(f'Total data size: {len(all_preds)}')

# Performance evaluation
auc = roc_auc_score(all_targets, all_preds)
print(f'\nTest Performance:')
print(f'AUC: {auc * 100:.2f}%')


[Test Statistics]
Total sequences: 832
Total problems: 35537
Average problems per sequence: 42.71
Total data size: 35537

Test Performance:
AUC: 99.17%
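Accuracy is listed among the metrics but is not computed in the evaluation cell. A minimal sketch of how it could be derived from the collected predictions follows, assuming a decision threshold of 0.5 (the threshold and the helper name are assumptions, and the sample values are made up).

```python
import numpy as np


def accuracy_at_threshold(preds, targets, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets).astype(bool)
    return float(((preds >= threshold) == targets).mean())


# Toy values; in the notebook this would be called with all_preds and all_targets
toy_acc = accuracy_at_threshold([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(toy_acc)  # 0.5
```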


## Conclusion

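In this notebook, we implemented a DKT model in PyTorch, trained it on the ASSISTments 2009 dataset, compiled it with `torch-neuronx`, and ran inference on AWS Inferentia2 (INF2). The compiled model reached a test AUC of 99.17%. Because each input token encodes the response to the same problem that is scored at that step, this figure does not measure next-step prediction and should not be compared directly with AUC values reported for DKT in the literature.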

## References

- [Deep Knowledge Tracing Paper](https://papers.nips.cc/paper/2015/file/efc3f4b5768f887a677ce7f1dba75504-Paper.pdf)
- [ASSISTments Dataset](https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data)
- [AWS Neuron SDK Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuronx/tutorials/torch-neuronx.html)