# Predict Student Performance from Game Play

## Is it reasonable to treat session information as sequential data?

The idea behind this notebook is to encode all the rows of a session as sequential data, and then use a recurrent neural network (RNN) to predict whether the user in that session will answer each question correctly.

If the results are promising, this could greatly reduce the effort of feature engineering, or serve as a reliable way to encode useful features that can be combined with features from statistical analysis to build a better classifier.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import torch
import torch.nn as nn

In [2]:
# Load the dataset
dtypes = {
    'elapsed_time': np.int32,
    'event_name': 'category', 
    'name': 'category',
    'level': 'category',
    'room_coor_x': np.float32,
    'room_coor_y': np.float32,
    'screen_coor_x': np.float32,
    'screen_coor_y': np.float32,
    'hover_duration': np.float32,
    'text': 'category',
    'fqid': 'category',
    'room_fqid': 'category',
    'text_fqid': 'category',
    'fullscreen': 'category',
    'hq': 'category',
    'music': 'category',
    'level_group': 'category'
}

df = pd.read_csv('data/train.csv', dtype=dtypes)

# Print the first 5 rows
df.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


In [3]:
df.shape

(26296946, 20)

Specifically, take one `session_id`...

In [4]:
session_1_df = df[df['session_id'] == 20090312431273200]
session_1_df

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
876,20090312431273200,927,1267357,navigate_click,undefined,22,,927.307251,-10.355928,838.0,335.0,,,tomap,tunic.historicalsociety.entry,,0,0,1,13-22
877,20090312431273200,928,1268292,map_hover,basic,22,,,,,,366.0,,tomap,tunic.historicalsociety.entry,,0,0,1,13-22
878,20090312431273200,929,1269474,map_click,undefined,22,,457.523010,22.141338,443.0,316.0,,,tunic.capitol_2,tunic.historicalsociety.entry,,0,0,1,13-22
879,20090312431273200,930,1270708,navigate_click,undefined,22,,224.190323,-60.268669,404.0,337.0,,,chap4_finale_c,tunic.capitol_2.hall,,0,0,1,13-22


...which contains 881 recorded actions. We treat the session as a document of 881 words and check whether a prediction made from this document is reliable.

**The strategy for encoding a record into numeric format:**

- Employed columns: `event_name`, `name`, `level`, `room_coor_x`, `room_coor_y`, `screen_coor_x`, `screen_coor_y`, `hover_duration`.
- Set all null values to 0, since `event_name` already identifies why these values are missing.
- Encode all categorical columns (using one-hot encoding).
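
The strategy above can be sketched on a tiny, made-up session. The three rows and their values below are hypothetical and only illustrate the two steps (null filling and one-hot encoding); the real notebook applies them to the full training frame.

```python
import numpy as np
import pandas as pd

# Hypothetical three-event session; the values are made up for illustration.
toy = pd.DataFrame({
    "event_name": pd.Categorical(["navigate_click", "map_hover", "navigate_click"]),
    "room_coor_x": [12.5, np.nan, -3.0],
    "hover_duration": [np.nan, 366.0, np.nan],
})

# Nulls become 0; event_name still tells which kind of event produced them.
numeric_cols = ["room_coor_x", "hover_duration"]
toy[numeric_cols] = toy[numeric_cols].fillna(0)

# One-hot encode the categorical column; numeric columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["event_name"])
print(encoded.columns.tolist())
```

Each row of `encoded` is then one "word" of the session, represented as a fixed-length numeric vector.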

In [5]:
df.set_index(['session_id', 'index'], inplace=True)

In [6]:
df = df[['event_name', 'name', 'level', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration']]
for col in ['room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration']:
    # Scaling the coordinates and durations
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    df[col] = df[col].fillna(0)
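
The scaling above uses the minimum and maximum of whichever frame is being processed. If a separate test frame is scaled later with its own minimum and maximum, the same raw value can map to different scaled values in the two sets. One possible alternative, sketched here with hypothetical helper names `fit_min_max` and `apply_min_max` (they are not part of this notebook), is to store the training statistics and reuse them.

```python
import numpy as np
import pandas as pd

def fit_min_max(frame, columns):
    """Return {column: (min, max)} computed on the given frame (hypothetical helper)."""
    return {c: (frame[c].min(), frame[c].max()) for c in columns}

def apply_min_max(frame, stats):
    """Scale columns with previously fitted statistics, then fill nulls with 0."""
    out = frame.copy()
    for c, (lo, hi) in stats.items():
        span = hi - lo
        # Guard against a constant column, which would divide by zero.
        out[c] = (out[c] - lo) / span if span != 0 else 0.0
        out[c] = out[c].fillna(0)
    return out

# Toy frames standing in for the train and test sets.
toy_train = pd.DataFrame({"room_coor_x": [0.0, 10.0, np.nan]})
toy_test = pd.DataFrame({"room_coor_x": [5.0, 20.0]})

stats = fit_min_max(toy_train, ["room_coor_x"])
scaled_train = apply_min_max(toy_train, stats)
scaled_test = apply_min_max(toy_test, stats)
```

With this approach, test values outside the training range fall outside [0, 1]; whether that is acceptable depends on the model.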

We use a custom `GetDummies` class for one-hot encoding, for two reasons:

1. `OneHotEncoder` runs much slower than `pd.get_dummies` when encoding large data.
2. `pd.get_dummies` can produce a different number of columns for two datasets if they contain different sets of unique categorical values.
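
The second point can be seen on two small frames. The category values below are made up for illustration; the alignment step with `reindex` is one possible way to get the same effect as the custom class, not the code used in this notebook.

```python
import pandas as pd

# Two toy frames whose categorical values only partly overlap.
train = pd.DataFrame({"event_name": ["map_click", "map_hover", "person_click"]})
test = pd.DataFrame({"event_name": ["map_click", "notebook_click"]})

train_cols = pd.get_dummies(train).columns
test_encoded = pd.get_dummies(test)

# The two encodings do not have the same width or the same column names.
print(len(train_cols), test_encoded.shape[1])

# Align the test encoding to the training columns: columns missing from the
# test frame are added as zeros, and columns unseen during training are dropped.
aligned = test_encoded.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
```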

In [7]:
import sklearn.base


class GetDummies(sklearn.base.TransformerMixin):
    """Fast one-hot-encoder that makes use of pandas.get_dummies() safely
    on train/test splits.
    """
    def __init__(self, dtypes=None):
        self.input_columns = None
        self.final_columns = None
        if dtypes is None:
            dtypes = [object, 'category']
        self.dtypes = dtypes

    def fit(self, X, y=None, **kwargs):
        self.input_columns = list(X.select_dtypes(self.dtypes).columns)
        X = pd.get_dummies(X, columns=self.input_columns)
        self.final_columns = X.columns
        return self
        
    def transform(self, X, y=None, **kwargs):
        X = pd.get_dummies(X, columns=self.input_columns)
        X_columns = X.columns
        # if columns in X had values not in the data set used during
        # fit add them and set to 0
        missing = set(self.final_columns) - set(X_columns)
        for c in missing:
            X[c] = 0
        # remove any new columns that may have resulted from values in
        # X that were not in the data set when fit
        return X[self.final_columns]
    
    def get_feature_names(self):
        return tuple(self.final_columns)

In [8]:
get_dummies = GetDummies()
df = get_dummies.fit_transform(df)
df.shape

(26296946, 45)

In [9]:
grouped_data = df.groupby('session_id').apply(lambda x: np.array(x))
grouped_data

session_id
20090312431273200    [[0.485034091309792, 0.519126217135245, 0.1980...
20090312433251036    [[0.4908728283107921, 0.6860461376622912, 0.20...
20090312455206810    [[0.37456928210933976, 0.5028902871479044, 0.0...
20090313091715820    [[0.5599869522339036, 0.48434410658334565, 0.3...
20090313571836404    [[0.5047014159447397, 0.6251614129316955, 0.23...
                                           ...                        
22100215342220508    [[0.5649325929502113, 0.4650824706845987, 0.33...
22100215460321130    [[0.48160799532744986, 0.7014456247891683, 0.2...
22100217104993650    [[0.48196107183558146, 0.657998119962089, 0.19...
22100219442786200    [[0.48564869520463416, 0.5033919618394872, 0.1...
22100221145014656    [[0.42572481545752805, 0.6573140219057517, 0.0...
Length: 23562, dtype: object

In [10]:
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        # Get the numpy array at the given index
        return torch.from_numpy(self.data[idx]).float()

In [11]:
def collate_fn_padd(batch):
    """
    Pads a batch of variable-length sequences to the longest one.

    Note: tensors are built manually here since the ToTensor transform
    assumes it takes in images rather than arbitrary tensors.
    """
    # Get sequence lengths
    lengths = [t.shape[0] for t in batch]
    try:
        n_features = batch[0].shape[1]
    except IndexError:
        # 1-D sequences have no feature axis
        n_features = 1
    max_length = max(lengths)
    if max_length == 0:
        max_length += 1
    batch_size = len(lengths)

    padded_tensor = torch.zeros(batch_size, max_length, n_features, dtype=torch.float32)
    for i, val in enumerate(batch):
        l = lengths[i]
        if n_features == 1:
            padded_tensor[i, :l] = val.reshape(-1, 1)
        else:
            padded_tensor[i, :l] = val
    
    return padded_tensor
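
As a point of comparison, PyTorch ships `torch.nn.utils.rnn.pad_sequence`, which performs the same zero padding. The sketch below is an alternative, not the function used in this notebook; the name `collate_with_lengths` is hypothetical, and it assumes every sequence is a 2-D tensor of shape `(length, n_features)`. It also returns the true lengths, which are needed if the padding should later be ignored.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_with_lengths(batch):
    """Pad a list of (length, n_features) tensors and keep their true lengths."""
    lengths = torch.tensor([t.shape[0] for t in batch])
    # pad_sequence fills with zeros by default.
    padded = pad_sequence(batch, batch_first=True)
    return padded, lengths

# Toy batch: two sequences of different lengths with 2 features each.
toy_batch = [torch.ones(3, 2), torch.ones(5, 2)]
padded, lengths = collate_with_lengths(toy_batch)
print(padded.shape, lengths.tolist())
```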

In [12]:
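# Create an instance of the custom dataset
dataset = MyDataset(grouped_data.values)

# Create a PyTorch DataLoader
# Caution: with shuffle=True the batches arrive in random order, while the
# loops below slice the labels by position (i * batch_size), so inputs and
# labels only line up if the loader keeps the original order.
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn_padd)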

Now we collect and process the labels...

In [13]:
label_df = pd.read_csv('data/train_labels.csv')
label_df['session'] = label_df.session_id.apply(lambda x: int(x.split('_')[0]) )
label_df['question_idx'] = label_df.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )
label_df.drop("session_id", axis=1, inplace=True)
pivoted_questions = label_df.pivot(columns='question_idx', values='correct', index='session')
pivoted_questions['total_score'] = pivoted_questions.iloc[:, 0:18].sum(axis=1)
pivoted_questions.columns = [f'q_{i}' for i in range(1, 19)] + ['total_score']
pivoted_questions

session,q_1,q_2,q_3,q_4,q_5,q_6,q_7,q_8,q_9,q_10,q_11,q_12,q_13,q_14,q_15,q_16,q_17,q_18,total_score
20090312431273200,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,16
20090312433251036,0,1,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,10
20090312455206810,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,17
20090313091715820,0,1,1,1,1,0,1,1,1,0,0,1,0,1,0,1,1,1,12
20090313571836404,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,1,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100215342220508,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,16
22100215460321130,0,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1,12
22100217104993650,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,15
22100219442786200,0,1,1,1,1,1,1,0,1,0,1,1,0,1,0,1,1,1,13


In [14]:
# Define the LSTM model
class StackedLSTM(nn.Module):
    def __init__(self, n_layers, n_hidden, n_features, n_embeddings):
        super(StackedLSTM, self).__init__()
        self.embedding = nn.Linear(n_features, n_embeddings)
        self.lstm = nn.LSTM(n_embeddings, n_hidden, n_layers, batch_first=True)
        self.linear = nn.Linear(n_hidden, 18)
        
    def forward(self, x):
        # Project the input features into the embedding space
        embed_out = self.embedding(x)

        # Pass the embeddings through the LSTM layers
        lstm_out, _ = self.lstm(embed_out)

        # Keep only the output at the last time step
        # (in a padded batch this can be a padding step for shorter sequences)
        out = lstm_out[:, -1, :]
        
        # Pass the last-step output through the linear layer
        out = self.linear(out)
        
        # Apply sigmoid activation function to the output
        out = torch.sigmoid(out)
        
        return out

# Create an instance of the model
n_layers = 3  # Number of LSTM layers
n_hidden = 16  # Number of LSTM units
n_embeddings = 16  # Number of dimensions in the embedding layer
n_features = 45  # Number of features in each sequence

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = StackedLSTM(n_layers, n_hidden, n_features, n_embeddings).to(device)
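
Because batches are zero-padded, `lstm_out[:, -1, :]` reads the state after the padding steps for every sequence shorter than the longest one in the batch. A possible variant, sketched below under the assumption that the collate function also returns the true sequence lengths, packs the batch with `pack_padded_sequence` so that the final hidden state corresponds to the last real event. The class name `PackedLSTM` is hypothetical and this is not the model trained in this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedLSTM(nn.Module):
    """Variant of the stacked LSTM that ignores padded time steps (sketch)."""

    def __init__(self, n_layers, n_hidden, n_features, n_embeddings, n_out=18):
        super().__init__()
        self.embedding = nn.Linear(n_features, n_embeddings)
        self.lstm = nn.LSTM(n_embeddings, n_hidden, n_layers, batch_first=True)
        self.linear = nn.Linear(n_hidden, n_out)

    def forward(self, x, lengths):
        embed_out = self.embedding(x)
        # lengths must be a 1-D CPU tensor of positive integers.
        packed = pack_padded_sequence(
            embed_out, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the top layer's hidden state at each sequence's last real step.
        return torch.sigmoid(self.linear(h_n[-1]))

# Toy check: 2 sequences, padded length 5, 4 features each.
toy_model = PackedLSTM(n_layers=2, n_hidden=8, n_features=4, n_embeddings=6)
toy_x = torch.randn(2, 5, 4)
toy_lengths = torch.tensor([3, 5])
with torch.no_grad():
    probs = toy_model(toy_x, toy_lengths)
print(probs.shape)
```

With packing, the values stored in the padded positions no longer influence the prediction.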

In [15]:
from tqdm import tqdm

# Define number of output labels (number of questions)
n_out = 18

# Define the batch size
batch_size = 32

# Define the number of epochs
n_epochs = 3

# Data size
n_samples = len(grouped_data)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train the model
model.train()
for epoch in range(n_epochs):
    for i, sample in tqdm(enumerate(dataloader)):
        model.zero_grad()
        
        # Get label
        labels = torch.from_numpy(pivoted_questions.iloc[i*batch_size:(i+1)*batch_size, :18].values).float()
        
        sample = sample.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(sample)

        # Compute the loss
        loss = criterion(outputs, labels)
        
        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        sample = sample.to('cpu')
        labels = labels.to('cpu')
        
    # Print the loss of the last batch after every epoch
    print(f'Epoch {epoch+1}/{n_epochs}, Loss: {loss.item():.4f}')

737it [16:27,  1.34s/it]
Epoch 1/3, Loss: 0.5186
737it [16:22,  1.33s/it]
Epoch 2/3, Loss: 0.5188
737it [16:24,  1.34s/it]
Epoch 3/3, Loss: 0.5191

In [16]:
# Evaluate the model
pred_list = []
true_list = []

model.eval()
for i, sample in tqdm(enumerate(dataloader)):
    model.zero_grad()
        
    # Get label
    labels = torch.from_numpy(pivoted_questions.iloc[i*batch_size:(i+1)*batch_size, :18].values).float()
    
    sample = sample.to(device)
    labels = labels.to(device)

    # Forward pass
    outputs = model(sample)

    sample = sample.to('cpu')
    labels = labels.to('cpu')

    pred_list.append(outputs.data.cpu().numpy())
    true_list.append(labels.data.cpu().numpy())

737it [06:02,  2.03it/s]


In [17]:
test_pred_flattened = np.concatenate(pred_list).ravel()
test_true_flattened = np.concatenate(true_list).ravel()

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(test_true_flattened, np.round(test_pred_flattened)))
print(precision_score(test_true_flattened, np.round(test_pred_flattened)))
print(recall_score(test_true_flattened, np.round(test_pred_flattened)))

0.7327311395938847
0.7465808293014176
0.9404686722067959
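
To judge how informative these numbers are, it may help to compare them with a constant predictor that always outputs the most common label of each question. The helper name `constant_baseline_accuracy` is hypothetical and the label matrix below is a toy example, not the competition data.

```python
import numpy as np

def constant_baseline_accuracy(y_true):
    """Accuracy of always predicting each question's most common label."""
    y_true = np.asarray(y_true)
    # Majority label per column (question); ties go to 1.
    majority = (y_true.mean(axis=0) >= 0.5).astype(y_true.dtype)
    return float((y_true == majority).mean())

# Toy label matrix: 4 sessions x 3 questions.
toy_labels = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
])
baseline = constant_baseline_accuracy(toy_labels)
print(baseline)
```

In this notebook the same function could be called on `pivoted_questions.iloc[:, :18].values`; a model is only extracting signal from the sequences if it clearly beats that value.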


**In general, a black-box RNN can produce predictions from the action sequence recorded for a user (73.3% accuracy on the training set). However, the loss did not converge (it stayed around 0.518), and I am still wondering why. I would welcome any comments on how to improve this notebook, or on mistakes I may have made. Thanks for reading!**
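
One thing that may be worth checking is whether each input batch is paired with the labels of the same sessions, since the loader shuffles the sessions while the labels are sliced by position. A minimal sketch of a dataset that returns each sequence together with its labels is shown below; the names `SessionDataset` and `collate_pairs` are hypothetical, and the toy data is constructed so that the pairing can be verified.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SessionDataset(Dataset):
    """Returns (sequence, labels) pairs so shuffling cannot separate them."""

    def __init__(self, sequences, labels):
        assert len(sequences) == len(labels)
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = torch.from_numpy(np.asarray(self.sequences[idx], dtype=np.float32))
        lab = torch.from_numpy(np.asarray(self.labels[idx], dtype=np.float32))
        return seq, lab

def collate_pairs(batch):
    """Zero-pad the sequences of a batch and stack the matching labels."""
    seqs, labs = zip(*batch)
    max_len = max(s.shape[0] for s in seqs)
    padded = torch.zeros(len(seqs), max_len, seqs[0].shape[1])
    for i, s in enumerate(seqs):
        padded[i, :s.shape[0]] = s
    return padded, torch.stack(labs)

# Toy data: session i has length i + 1, is filled with the value i,
# and has 18 labels that are all equal to i.
toy_sequences = [np.full((i + 1, 2), i, dtype=np.float32) for i in range(6)]
toy_labels = np.array([[i] * 18 for i in range(6)], dtype=np.float32)

loader = DataLoader(SessionDataset(toy_sequences, toy_labels),
                    batch_size=4, shuffle=True, collate_fn=collate_pairs)
for padded, batch_labels in loader:
    # Even after shuffling, every sequence arrives with its own labels.
    assert torch.equal(padded[:, 0, 0], batch_labels[:, 0])
```

In this notebook the dataset could be built as `SessionDataset(grouped_data.values, pivoted_questions.loc[grouped_data.index].iloc[:, :18].values)`, assuming every session in `grouped_data` also appears in the label table.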

In [19]:
# For test set

# Remove the training set to save RAM
del(df)
del(grouped_data)
del(dataloader)
del(dataset)

In [20]:
test_df = pd.read_csv('data/test.csv', dtype=dtypes)
test_df.set_index(['session_id', 'index'], inplace=True)

test_df = test_df[['event_name', 'name', 'level', 'room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration']]
for col in ['room_coor_x', 'room_coor_y', 'screen_coor_x', 'screen_coor_y', 'hover_duration']:
    # Scaling the coordinates and durations
    test_df[col] = (test_df[col] - test_df[col].min()) / (test_df[col].max() - test_df[col].min())
    test_df[col] = test_df[col].fillna(0)
    
test_df = get_dummies.transform(test_df)
grouped_data = test_df.groupby('session_id').apply(lambda x: np.array(x))

dataset = MyDataset(grouped_data.values)
# Keep the original order so predictions line up with the session ids below
dataloader = DataLoader(dataset, batch_size=3, shuffle=False, collate_fn=collate_fn_padd)

# Make predictions
pred_list = []

model.eval()
for i, sample in tqdm(enumerate(dataloader)):
    model.zero_grad()
    sample = sample.to(device)
    # Forward pass
    outputs = model(sample)
    sample = sample.to('cpu')
    pred_list.append(outputs.data.cpu().numpy())
    
pred_flattened = np.concatenate(pred_list).ravel()
# Take the ids from grouped_data so they follow the same order as the sequences
session_ids = grouped_data.index.tolist()

from functools import reduce
session_ids = reduce(lambda x, y: x + [f'{y}_q{i}' for i in range(1, 19)], session_ids, [])

test_result = pd.DataFrame({
    'session_id': session_ids,
    'correct': (pred_flattened > 0.6).astype('int')
})
test_result.head()

1it [00:00,  4.53it/s]


Unnamed: 0,session_id,correct
0,20090109393214576_q1,1
1,20090109393214576_q2,1
2,20090109393214576_q3,1
3,20090109393214576_q4,1
4,20090109393214576_q5,0
