# Topic 2: Predict Student Performance from Game Play

**Session 1: Exploratory Data Analysis**

## Context

The goal is to predict student performance during game-based learning in real time, using game logs as the training data.

For each `<session_id>`_`<question no.>` pair, I predict the `correct` column: whether the user in that session will answer the question correctly, using only the information recorded earlier in the session.

## Files

- **train.csv** - the training set
- **test.csv** - the test set
- **sample_submission.csv** - a sample submission file in the correct format
- **train_labels.csv** - `correct` value for all 18 questions for each session in the training set

## Columns in training set

- **session_id** - The ID of the session the event took place in
- **index** - The index of the event for the session
- **elapsed_time** - How much time has passed (in ms) between the start of the session and when the event was recorded
- **event_name** - The name of the event type
- **name** - The event name (e.g., identifies whether a notebook_click is opening or closing the notebook)
- **level** - The level of the game in which the event occurred (0 to 22)
- **page** - The page number of the event (only for notebook-related events)
- **room_coor_x** - The x-coordinate of the click in reference to the in-game room (only for click events)
- **room_coor_y** - The y-coordinate of the click in reference to the in-game room (only for click events)
- **screen_coor_x** - The x-coordinate of the click in reference to the player's screen (only for click events)
- **screen_coor_y** - The y-coordinate of the click in reference to the player's screen (only for click events)
- **hover_duration** - How long (in ms) the hover happened for (only for hover events)
- **text** - The text the player sees during the event
- **fqid** - The fully qualified ID of the event
- **room_fqid** - The fully qualified ID of the room the event took place in
- **text_fqid** - The fully qualified ID of the text
- **fullscreen** - Whether the player is in fullscreen mode
- **hq** - Whether the player is in high-quality mode
- **music** - Whether the game music is on or off
- **level_group** - Which group of levels & questions the record belongs to (0-4, 5-12, 13-22)
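
As a small illustration of how `level` relates to `level_group`, the sketch below bins a few level values into the three groups listed above. The `levels` series is made-up toy data, and the bin edges are an assumption read directly from the group labels (0-4, 5-12, 13-22).

```python
import pandas as pd

# Toy level values (hypothetical, for illustration only)
levels = pd.Series([0, 4, 5, 12, 13, 22], name="level")

# Right-closed bins: (-1, 4], (4, 12], (12, 22]
level_group = pd.cut(
    levels,
    bins=[-1, 4, 12, 22],
    labels=["0-4", "5-12", "13-22"],
)
print(level_group.tolist())
```

In the real data the `level_group` column is already provided, so this is only useful as a consistency check.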

In [24]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the dataset
dtypes = {
    'elapsed_time': np.int32,
    'event_name': 'category', 
    'name': 'category',
    'level': np.uint8,
    'room_coor_x': np.float32,
    'room_coor_y': np.float32,
    'screen_coor_x': np.float32,
    'screen_coor_y': np.float32,
    'hover_duration': np.float32,
    'text': 'category',
    'fqid': 'category',
    'room_fqid': 'category',
    'text_fqid': 'category',
    'fullscreen': 'category',
    'hq': 'category',
    'music': 'category',
    'level_group': 'category'
}

df = pd.read_csv('data/train.csv', dtype=dtypes)

# Print the first 5 rows
df.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


In [4]:
# Shape of the dataset
df.shape

(26296946, 20)

In [5]:
# Number of sessions
df['session_id'].nunique()

23562

In [6]:
label_df = pd.read_csv('data/train_labels.csv')
label_df.head()

Unnamed: 0,session_id,correct
0,20090312431273200_q1,1
1,20090312433251036_q1,0
2,20090312455206810_q1,1
3,20090313091715820_q1,0
4,20090313571836404_q1,1


In [7]:
label_df.shape

(424116, 2)

In [12]:
label_df['session'] = label_df.session_id.apply(lambda x: int(x.split('_')[0]) )
label_df['question_idx'] = label_df.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )
label_df.drop("session_id", axis=1, inplace=True)

In [13]:
label_df.head()

Unnamed: 0,correct,session,question_idx
0,1,20090312431273200,1
1,0,20090312433251036,1
2,1,20090312455206810,1
3,0,20090313091715820,1
4,1,20090313571836404,1
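
The same split can also be done with vectorized string methods instead of `apply`. The sketch below is one possible variant on a hypothetical two-row frame (`toy_labels`); it assumes every identifier has the form `<session>_q<number>`, as in the rows shown above.

```python
import pandas as pd

# Toy labels in the same format as train_labels.csv (values are illustrative)
toy_labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312433251036_q18"],
    "correct": [1, 0],
})

# Split "<session>_q<number>" into two columns in one pass
parts = toy_labels["session_id"].str.split("_q", expand=True)
toy_labels["session"] = parts[0].astype("int64")
toy_labels["question_idx"] = parts[1].astype("int64")
toy_labels = toy_labels.drop(columns="session_id")
print(toy_labels)
```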


In [14]:
label_df['question_idx'].value_counts()

1     23562
2     23562
17    23562
16    23562
15    23562
14    23562
13    23562
12    23562
11    23562
10    23562
9     23562
8     23562
7     23562
6     23562
5     23562
4     23562
3     23562
18    23562
Name: question_idx, dtype: int64

- Every session contains 18 questions, so each student answers all 18 regardless of the level group a question falls into.

In [46]:
label_df = pd.read_csv('data/train_labels.csv')
label_df['session'] = label_df.session_id.apply(lambda x: int(x.split('_')[0]) )
label_df['question_idx'] = label_df.session_id.apply(lambda x: int(x.split('_')[-1][1:]) )
label_df.drop("session_id", axis=1, inplace=True)
pivoted_questions = label_df.pivot(columns='question_idx', values='correct', index='session')
pivoted_questions['total_score'] = pivoted_questions.iloc[:, 0:18].sum(axis=1)
pivoted_questions.columns = [f'q_{i}' for i in range(1, 19)] + ['total_score']
pivoted_questions

session,q_1,q_2,q_3,q_4,q_5,q_6,q_7,q_8,q_9,q_10,q_11,q_12,q_13,q_14,q_15,q_16,q_17,q_18,total_score
20090312431273200,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,16
20090312433251036,0,1,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,10
20090312455206810,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,17
20090313091715820,0,1,1,1,1,0,1,1,1,0,0,1,0,1,0,1,1,1,12
20090313571836404,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,1,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100215342220508,1,1,1,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,16
22100215460321130,0,1,1,1,0,1,1,0,1,0,1,1,0,1,0,1,1,1,12
22100217104993650,1,1,1,1,1,1,1,1,1,0,1,1,1,1,0,0,1,1,15
22100219442786200,0,1,1,1,1,1,1,0,1,0,1,1,0,1,0,1,1,1,13


- Later on, this table will be joined with the training table for further analysis.
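
A minimal sketch of what such a join could look like, assuming the feature table is indexed by session and the score table by the same session values. Both frames here (`toy_features`, `toy_scores`) are hypothetical toy data, not the real tables.

```python
import pandas as pd

# Toy per-(session, level_group) features, indexed by session id
toy_features = pd.DataFrame(
    {"level_group": ["0-4", "5-12", "0-4"], "elapsed_time": [100.0, 250.0, 80.0]},
    index=pd.Index([1, 1, 2], name="session_id"),
)

# Toy per-session scores, indexed by session id
toy_scores = pd.DataFrame(
    {"q_1": [1, 0], "total_score": [16, 10]},
    index=pd.Index([1, 2], name="session"),
)

# Left join on the index: every feature row gets its session's scores
joined = toy_features.join(toy_scores, how="left")
print(joined)
```

A left join keeps one row per (session, level group), so session-level scores are repeated across that session's level groups.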

In [17]:
# Observe the unique values in some categorical columns
for col in df.select_dtypes('category'):
    nunique = df[col].nunique()
    first_20_unique_list = df[col].unique().tolist()[:20]
    print(f'{col :-<50} ({nunique}) {first_20_unique_list}')

event_name---------------------------------------- (11) ['cutscene_click', 'person_click', 'navigate_click', 'observation_click', 'notification_click', 'object_click', 'object_hover', 'map_hover', 'map_click', 'checkpoint', 'notebook_click']
name---------------------------------------------- (6) ['basic', 'undefined', 'close', 'open', 'prev', 'next']
text---------------------------------------------- (597) ['undefined', 'Whatcha doing over there, Jo?', 'Just talking to Teddy.', 'I gotta run to my meeting!', 'Can I come, Gramps?', 'Sure thing, Jo. Grab your notebook and come upstairs!', 'See you later, Teddy.', "I get to go to Gramps's meeting!", 'Now where did I put my notebook?', '\\u00f0\\u0178\\u02dc\\u00b4', nan, 'I love these photos of me and Teddy!', 'Found it!', 'Gramps is in trouble for losing papers?', "This can't be right!", 'Gramps is a great historian!', "Hmm. Button's still not working.", "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!", 'Who wants to inv

In [18]:
CATEGORICAL = ['event_name', 'name','fqid', 'room_fqid', 'text_fqid']
NUMERICAL = ['elapsed_time','level','page','room_coor_x', 'room_coor_y', 
        'screen_coor_x', 'screen_coor_y', 'hover_duration']

In [19]:
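def feature_engineer(dataset_df):
    """Aggregate event logs into one row per (session_id, level_group)."""
    dfs = []
    # Categorical columns: number of distinct values seen
    for c in CATEGORICAL:
        tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('nunique')
        tmp.name = tmp.name + '_nunique'
        dfs.append(tmp)
    # Numerical columns: mean (keeps the original column name)
    for c in NUMERICAL:
        tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('mean')
        dfs.append(tmp)
    # Numerical columns: standard deviation
    for c in NUMERICAL:
        tmp = dataset_df.groupby(['session_id','level_group'])[c].agg('std')
        tmp.name = tmp.name + '_std'
        dfs.append(tmp)
    dataset_df = pd.concat(dfs,axis=1)
    # Aggregates that could not be computed (NaN) are replaced with -1
    dataset_df = dataset_df.fillna(-1)
    # Keep level_group as a column and index by session_id only
    dataset_df = dataset_df.reset_index()
    dataset_df = dataset_df.set_index('session_id')
    return dataset_df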

In [21]:
processed_df = feature_engineer(df)
print("Full prepared dataset shape is {}".format(processed_df.shape))

Full prepared dataset shape is (70686, 22)


In [22]:
processed_df

session_id,level_group,event_name_nunique,name_nunique,fqid_nunique,room_fqid_nunique,text_fqid_nunique,elapsed_time,level,page,room_coor_x,...,screen_coor_y,hover_duration,elapsed_time_std,level_std,page_std,room_coor_x_std,room_coor_y_std,screen_coor_x_std,screen_coor_y_std,hover_duration_std
20090312431273200,0-4,10,3,30,7,17,8.579356e+04,1.945455,-1.000000,7.701275,...,383.044861,2389.500000,4.924654e+04,1.230975,-1.000000,399.296038,129.292411,214.871000,104.082743,3227.370757
20090312431273200,13-22,10,3,49,12,35,1.040601e+06,17.402381,-1.000000,-130.347168,...,379.301025,899.925903,1.266661e+05,2.358652,-1.000000,622.061374,230.370874,240.280218,99.067861,1305.088265
20090312431273200,5-12,10,3,39,11,24,3.572052e+05,8.054054,-1.000000,14.306062,...,378.784912,969.333313,8.017568e+04,2.096919,-1.000000,357.227701,137.409476,203.268560,120.255453,1316.408315
20090312433251036,0-4,11,4,22,6,11,9.763342e+04,1.870504,0.000000,-84.045959,...,370.723083,1378.750000,6.737271e+04,1.232616,0.000000,445.980041,156.186242,252.554707,121.062929,2114.876406
20090312433251036,13-22,11,6,73,16,43,2.498852e+06,17.762529,5.100000,-30.762283,...,387.930084,720.384949,7.773825e+05,1.825923,0.863075,529.575656,234.279590,259.288856,133.345693,1990.705518
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22100219442786200,13-22,11,4,49,12,33,9.619192e+05,17.671395,5.230769,-158.599136,...,444.510040,1110.500000,1.516019e+05,2.359474,0.908083,589.562720,273.090325,248.584999,134.772721,1675.299532
22100219442786200,5-12,11,6,41,11,20,3.866058e+05,8.111511,1.833333,-2.569203,...,414.301208,1328.250000,9.665042e+04,2.180934,0.923548,390.345335,147.579436,250.827193,135.693654,1910.823123
22100221145014656,0-4,11,4,27,7,17,2.036104e+05,2.061611,0.333333,-1.339606,...,358.964813,4164.636230,1.085422e+05,1.276526,0.516398,392.539487,159.619091,213.638122,128.499750,6725.520698
22100221145014656,13-22,11,4,54,13,36,4.899580e+06,18.127632,5.181818,-57.838512,...,375.670624,669.000000,3.370855e+05,2.210473,0.906924,566.855306,249.156178,232.192779,122.521568,1115.700943


## Analyzing a single session

In [27]:
# Examine 1 specific session
session_1_df = df[df['session_id'] == 20090312431273200]

In [64]:
session_1_df = session_1_df.sort_values('elapsed_time')
session_1_df.head()

Unnamed: 0,session_id,index,elapsed_time,event_name,name,level,page,room_coor_x,room_coor_y,screen_coor_x,...,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,time_diff
0,20090312431273200,0,0,cutscene_click,basic,0,,-413.991394,-159.314682,380.0,...,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4,0.0
2,20090312431273200,2,831,person_click,basic,0,,-413.991394,-159.314682,380.0,...,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,-492.0
3,20090312431273200,3,1147,person_click,basic,0,,-413.991394,-159.314682,380.0,...,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,316.0
1,20090312431273200,1,1323,person_click,basic,0,,-413.991394,-159.314682,380.0,...,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,1323.0
4,20090312431273200,4,1863,person_click,basic,0,,-412.991394,-159.314682,381.0,...,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4,716.0


Candidate features under consideration:

- **time_diff**: The time between two consecutive actions, which may reflect how focused or distracted the student is. It is computed by differencing `elapsed_time` between adjacent actions. Because `elapsed_time` does not always increase with the index order, the events need to be sorted by `elapsed_time` first. This is a supporting feature from which others are computed.
- **event_count**: How many times each event type occurred. The assumption is that some actions are strongly correlated with performance; in other words, students who perform more (or fewer) of a specific action may score higher.
- **action_count_on_level**, **action_count_on_level_group**: The number of actions taken on a specific level/level group.
- **time_spent_on_level**, **time_spent_on_level_group**: The amount of time the student spends on a specific level/level group. It may indicate which level/level group is more efficient for learning.
- **clicking_distance**: The distance the mouse moves between two consecutive clicks. The assumption is that students who click around aimlessly get lower scores.
- **average_hover_duration**: The mean of `hover_duration`.
- **full_screen_time**, **full_screen_proportion**: The amount/proportion of the session time the student spends in fullscreen mode, presumably to focus fully on the game.
- **hq_on_time**, **hq_on_proportion**: The amount/proportion of the session time the student spends in high-quality mode.
- **music_on_time**, **music_on_proportion**: The amount/proportion of the session time the student plays with background music on. This helps check whether background music is a distraction or an encouragement for learning.

In [65]:
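# Time between consecutive events (the frame is already sorted by elapsed_time)
time_diff = session_1_df['elapsed_time'].diff(1).fillna(0)
# Store it as a column, since later cells read session_1_df['time_diff']
session_1_df = session_1_df.assign(time_diff=time_diff)
time_diff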

0         0.0
2       831.0
3       316.0
1       176.0
4       540.0
        ...  
876     966.0
877     935.0
878    1182.0
879    1234.0
880    1971.0
Name: elapsed_time, Length: 881, dtype: float64
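
To extend this beyond one session, the difference has to be taken within each session so that the first event of a session is not compared with the last event of the previous one. A minimal sketch on hypothetical toy data (`toy_events`), assuming the same `session_id` and `elapsed_time` columns:

```python
import pandas as pd

# Toy log with two sessions; session 1 has out-of-order elapsed_time
toy_events = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "index": [0, 1, 2, 0, 1],
    "elapsed_time": [0, 1323, 831, 0, 500],
})

# Sort within each session by elapsed_time, then difference per session
toy_events = toy_events.sort_values(["session_id", "elapsed_time"], kind="stable")
toy_events["time_diff"] = (
    toy_events.groupby("session_id")["elapsed_time"].diff().fillna(0)
)
print(toy_events)
```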

In [66]:
event_name = ['cutscene_click', 'person_click', 'navigate_click', 'observation_click', 'notification_click', 'object_click', 'object_hover', 'map_hover', 'map_click', 'checkpoint', 'notebook_click']
event_count = session_1_df.groupby('event_name')['event_name'].count().to_dict()
event_count_dict = {i: event_count.get(i, 0) for i in event_name}
event_count_dict

{'cutscene_click': 100,
 'person_click': 249,
 'navigate_click': 354,
 'observation_click': 8,
 'notification_click': 27,
 'object_click': 59,
 'object_hover': 38,
 'map_hover': 27,
 'map_click': 16,
 'checkpoint': 3,
 'notebook_click': 0}
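
For many sessions at once, the same counts can be laid out as one row per session. The sketch below uses `pd.crosstab` on hypothetical toy data; `EVENT_NAMES` is shortened to three event types for illustration, and the `_count` suffix is an assumed naming choice.

```python
import pandas as pd

# Shortened list of event types (the real list has 11 entries)
EVENT_NAMES = ["cutscene_click", "person_click", "notebook_click"]

toy_events = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "event_name": ["cutscene_click", "person_click", "person_click", "cutscene_click"],
})

# One row per session, one column per event type; absent types become 0
event_counts = (
    pd.crosstab(toy_events["session_id"], toy_events["event_name"])
    .reindex(columns=EVENT_NAMES, fill_value=0)
    .add_suffix("_count")
)
print(event_counts)
```

Reindexing on the full list of event names keeps the column set identical across sessions, even when a session never triggers some event type.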

In [73]:
level_stats = session_1_df.groupby(['level']).size()
level_stats_dict = {f'action_count_on_level_{i}': level_stats[i] for i in range(23)} 
level_stats_dict

{'action_count_on_level_0': 28,
 'action_count_on_level_1': 32,
 'action_count_on_level_2': 39,
 'action_count_on_level_3': 53,
 'action_count_on_level_4': 13,
 'action_count_on_level_5': 18,
 'action_count_on_level_6': 81,
 'action_count_on_level_7': 44,
 'action_count_on_level_8': 36,
 'action_count_on_level_9': 35,
 'action_count_on_level_10': 15,
 'action_count_on_level_11': 57,
 'action_count_on_level_12': 10,
 'action_count_on_level_13': 40,
 'action_count_on_level_14': 15,
 'action_count_on_level_15': 42,
 'action_count_on_level_16': 42,
 'action_count_on_level_17': 34,
 'action_count_on_level_18': 133,
 'action_count_on_level_19': 33,
 'action_count_on_level_20': 30,
 'action_count_on_level_21': 44,
 'action_count_on_level_22': 7}
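
Indexing `level_stats[i]` for every `i` in `range(23)` works here because this session has events on all 23 levels. For a session that skips a level, the lookup would raise a `KeyError`. A hedged sketch of a more defensive variant, using a hypothetical toy session that only reaches levels 0, 1 and 3:

```python
import pandas as pd

# Toy session that never records an event on level 2 (or on levels 4-22)
toy_session = pd.DataFrame({"level": [0, 0, 1, 3]})

# Reindex onto all 23 levels so missing levels count as 0
level_counts = (
    toy_session.groupby("level").size().reindex(range(23), fill_value=0)
)
level_counts_dict = {
    f"action_count_on_level_{i}": int(level_counts[i]) for i in range(23)
}
print(level_counts_dict["action_count_on_level_0"])
print(level_counts_dict["action_count_on_level_2"])
```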

In [80]:
level_group_stats = session_1_df.groupby(['level_group']).size()
level_group_stats_dict = {f'action_count_on_level_group_{i}': level_group_stats.loc[i] for i in ['0-4', '5-12', '13-22']} 
level_group_stats_dict

{'action_count_on_level_group_0-4': 165,
 'action_count_on_level_group_5-12': 296,
 'action_count_on_level_group_13-22': 420}

In [82]:
time_level_stats = session_1_df.groupby(['level'])['time_diff'].sum()
# Separate name so the action-count dictionary above is not overwritten
time_level_stats_dict = {f'time_spent_on_level_{i}': time_level_stats[i] for i in range(23)} 
time_level_stats_dict

{'time_spent_on_level_0': 25766.0,
 'time_spent_on_level_1': 39693.0,
 'time_spent_on_level_2': 35254.0,
 'time_spent_on_level_3': 55041.0,
 'time_spent_on_level_4': 39106.0,
 'time_spent_on_level_5': 42874.0,
 'time_spent_on_level_6': 75613.0,
 'time_spent_on_level_7': 34330.0,
 'time_spent_on_level_8': 35991.0,
 'time_spent_on_level_9': 30799.0,
 'time_spent_on_level_10': 16550.0,
 'time_spent_on_level_11': 58645.0,
 'time_spent_on_level_12': 9573.0,
 'time_spent_on_level_13': 370588.0,
 'time_spent_on_level_14': 12603.0,
 'time_spent_on_level_15': 38048.0,
 'time_spent_on_level_16': 33014.0,
 'time_spent_on_level_17': 48465.0,
 'time_spent_on_level_18': 126026.0,
 'time_spent_on_level_19': 27244.0,
 'time_spent_on_level_20': 65438.0,
 'time_spent_on_level_21': 42196.0,
 'time_spent_on_level_22': 9822.0}

In [84]:
time_level_group_stats = session_1_df.groupby(['level_group'])['time_diff'].sum()
# Separate name so the action-count dictionary above is not overwritten
time_level_group_stats_dict = {f'time_spent_on_level_group_{i}': time_level_group_stats.loc[i] for i in ['0-4', '5-12', '13-22']} 
time_level_group_stats_dict

{'time_spent_on_level_group_0-4': 194860.0,
 'time_spent_on_level_group_5-12': 304375.0,
 'time_spent_on_level_group_13-22': 773444.0}

In [89]:
# Use the room coordinates; the screen x, y are ignored
coordinates = session_1_df[['room_coor_x', 'room_coor_y']].copy()
coordinates.dropna(inplace=True)
clicking_x_diff = coordinates['room_coor_x'].diff(1).fillna(0)
clicking_y_diff = coordinates['room_coor_y'].diff(1).fillna(0)
# Note: this is the squared difference between the x and y displacements,
# not the Euclidean distance sqrt(dx**2 + dy**2)
distance_diff = (clicking_x_diff - clicking_y_diff) ** 2
mean_distance_diff = distance_diff.mean()
{'average_clicking_distance': mean_distance_diff}

{'average_clicking_distance': 67199.421875}
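
If `clicking_distance` is meant as the straight-line distance between two consecutive clicks, a Euclidean version would combine the x and y displacements as $\sqrt{\Delta x^2 + \Delta y^2}$. The sketch below shows this on hypothetical toy data (`toy_clicks`); it is an alternative formulation and was not used to produce the value above.

```python
import numpy as np
import pandas as pd

# Toy clicks: (0, 0) -> (3, 4) -> (3, 4), with one non-click row in between
toy_clicks = pd.DataFrame({
    "room_coor_x": [0.0, 3.0, np.nan, 3.0],
    "room_coor_y": [0.0, 4.0, np.nan, 4.0],
})

# Drop rows without a click position, then difference consecutive clicks
coords = toy_clicks[["room_coor_x", "room_coor_y"]].dropna()
dx = coords["room_coor_x"].diff()
dy = coords["room_coor_y"].diff()

# Euclidean distance per step; the first click has no predecessor
click_distance = np.hypot(dx, dy).dropna()
mean_click_distance = click_distance.mean()
print(mean_click_distance)
```

Dropping the first step instead of filling it with 0 avoids pulling the mean toward zero.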

In [90]:
{'average_hover_duration': session_1_df['hover_duration'].mean()}

{'average_hover_duration': 1115.2923583984375}

In [92]:
session_total_time = session_1_df['elapsed_time'].max()

# Total time with each flag on: sum the time_diff of the events where the flag is 1
full_screen_time = (session_1_df['time_diff'] * session_1_df['fullscreen'].astype('int')).sum()
full_screen_proportion = full_screen_time / session_total_time
hq_on_time = (session_1_df['time_diff'] * session_1_df['hq'].astype('int')).sum()
hq_on_proportion = hq_on_time / session_total_time
music_on_time = (session_1_df['time_diff'] * session_1_df['music'].astype('int')).sum()
music_on_proportion = music_on_time / session_total_time

{
    'full_screen_time': full_screen_time,
    'full_screen_proportion': full_screen_proportion,
    'hq_on_time': hq_on_time,
    'hq_on_proportion': hq_on_proportion,
    'music_on_time': music_on_time,
    'music_on_proportion': music_on_proportion
}

{'full_screen_time': 0.0,
 'full_screen_proportion': 0.0,
 'hq_on_time': 0.0,
 'hq_on_proportion': 0.0,
 'music_on_time': 1272679.0,
 'music_on_proportion': 1.0}

In [None]:
def feature_extraction_over_single_session(session_df):
    session_info_dict = 