# Introduction

- **session_id** - the ID of the session the event took place in
- **index** - the index of the event for the session
- **elapsed_time** - how much time has passed (in milliseconds) between the start of the session and when the event was recorded
- **event_name** - the name of the event type
- **name** - the event name (e.g. identifies whether a notebook_click is opening or closing the notebook)
- **level** - what level of the game the event occurred in (0 to 22)
- **page** - the page number of the event (only for notebook-related events)
- **room_coor_x** - the coordinates of the click in reference to the in-game room (only for click events)
- **room_coor_y** - the coordinates of the click in reference to the in-game room (only for click events)
- **screen_coor_x** - the coordinates of the click in reference to the player’s screen (only for click events)
- **screen_coor_y** - the coordinates of the click in reference to the player’s screen (only for click events)
- **hover_duration** - how long (in milliseconds) the hover happened for (only for hover events)
- **text** - the text the player sees during this event
- **fqid** - the fully qualified ID of the event
- **room_fqid** - the fully qualified ID of the room the event took place in
- **text_fqid** - the fully qualified ID of the
- **fullscreen** - whether the player is in fullscreen mode
- **hq** - whether the game is in high-quality
- **music** - whether the game music is on or off
- **level_group** - which group of levels - and group of questions - this row belongs to (0-4, 5-12, 13-22)
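
The column notes above say that several fields are populated only for particular event types. A minimal sketch for checking that kind of claim, run here on a small hand-made frame (the rows are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; real rows would come from train.csv.
toy = pd.DataFrame({
    "event_name": ["navigate_click", "object_hover", "notebook_click", "map_hover"],
    "room_coor_x": [10.5, np.nan, -3.0, np.nan],
    "hover_duration": [np.nan, 120.0, np.nan, 45.0],
    "page": [np.nan, np.nan, 1.0, np.nan],
})

# Share of non-missing values per column, broken down by event type.
coverage = toy.drop(columns="event_name").notna().groupby(toy["event_name"]).mean()
print(coverage)
```

A value of 1.0 means the column is always filled for that event type, and 0.0 means it is never filled.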

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Train

## Data

In [3]:
# Reference: https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359
dtypes={
    'elapsed_time': np.int32,
    'event_name': 'category',
    'name': 'category',
    'level': np.uint8,
    'room_coor_x': np.float32,
    'room_coor_y': np.float32,
    'screen_coor_x': np.float32,
    'screen_coor_y': np.float32,
    'hover_duration': np.float32,
    'text': 'category',
    'fqid': 'category',
    'room_fqid': 'category',
    'text_fqid': 'category',
    'fullscreen': 'category',
    'hq': 'category',
    'music': 'category',
    'level_group': 'category'
}

train_data = pd.read_csv('../data/train.csv', dtype=dtypes)

In [4]:
# Reference: https://www.kaggle.com/code/kimtaehun/lightgbm-baseline-with-aggregated-log-data?scriptVersionId=118573291&cellId=15
def summarize_data_info(df: pd.DataFrame) -> pd.DataFrame:
    summary = pd.DataFrame(df.dtypes, columns=['data_type'])
    
    # Fraction (0 to 1) of missing values per column.
    summary['perc_missing'] = df.isnull().sum().values / len(df)
    summary['n_unique'] = df.nunique().values
    
    summary['first_value'] = df.loc[0].values
    summary['second_value'] = df.loc[1].values
    summary['third_value'] = df.loc[2].values
    
    df_describe = pd.DataFrame(df.describe(include='all').transpose())
    summary['min'] = df_describe['min'].values
    summary['max'] = df_describe['max'].values
    
    print(f'Data Shape: {df.shape}')
    
    return summary

In [5]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int64,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,uint8,0.0,23,0,0,0,0.0,22.0
page,float64,0.978532,7,,,,0.0,6.0
room_coor_x,float32,0.078841,12538215,-413.991394,-413.991394,-413.991394,-1992.354614,1261.773804
room_coor_y,float32,0.078841,9551136,-159.314682,-159.314682,-159.314682,-918.162354,543.616394
screen_coor_x,float32,0.078841,57477,380.0,380.0,380.0,0.0,1919.0


In [6]:
# Reduce Memory Usage
# reference : https://www.kaggle.com/code/arjanso/reducing-dataframe-memory-size-by-65 @ARJANGROEN

def reduce_memory_usage(df):
    
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype.name
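        if col_type not in ('datetime64[ns]', 'category'):
            if (col_type != 'object'):
                c_min = df[col].min()
                c_max = df[col].max()

                # Note: unsigned names such as 'uint8' do not start with 'int',
                # so those columns are handled by the float branch below.
                if str(col_type)[:3] == 'int':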
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)

                else:
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        pass
            else:
                df[col] = df[col].astype('category')
    mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage became: ",mem_usg," MB")
    
    return df
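
The helper above matches dtype names that start with `int`, so unsigned columns end up in the float branch, and floats may be reduced to 16 bits. One alternative sketch, assuming it is acceptable to keep floats at 32 bits, relies on the `downcast` argument of `pd.to_numeric`. The function name `downcast_numeric` and the toy frame are hypothetical:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink numeric columns without changing their kind."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Non-negative integers can use the unsigned types.
            kind = "unsigned" if out[col].min() >= 0 else "integer"
            out[col] = pd.to_numeric(out[col], downcast=kind)
        elif is_float_dtype(out[col]):
            # downcast="float" stops at float32, so float16 is never produced.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame with invented values.
toy = pd.DataFrame({
    "level": np.array([0, 5, 22], dtype=np.int64),
    "index": np.array([-1, 300, 20473], dtype=np.int64),
    "room_coor_x": np.array([-413.75, 0.5, 1261.75], dtype=np.float64),
})
result = downcast_numeric(toy)
print(result.dtypes)
```

Non-numeric columns are left untouched in this sketch; converting `object` columns to `category` would still need a separate step.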

In [7]:
train_data = reduce_memory_usage(train_data)

Memory usage of dataframe is 1529.83 MB
Memory usage became:  1053.3384094238281  MB


In [8]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0
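
The summary after the downcast shows coordinates changing (for example, `room_coor_x` goes from -413.991394 to -414.0, and its `n_unique` drops from 12538215 to 29854). A short sketch of why, using only NumPy scalar casts:

```python
import numpy as np

# First-row values taken from the float32 summary shown earlier.
x32 = np.float32(-413.991394)
y32 = np.float32(-159.314682)

# float16 carries 11 significant bits, so values between 256 and 512
# snap to a grid of 0.25, and values between 128 and 256 to a grid of 0.125.
x16 = np.float16(x32)
y16 = np.float16(y32)
print(float(x16), float(y16))

# Integers above 2048 are no longer all representable either.
big = float(np.float16(2049))
print(big)
```

Whether this loss matters depends on how the coordinates are used later; aggregated statistics such as means are likely to shift only slightly, but exact-match comparisons would be affected.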


### `Text` Field Preprocessing

In [9]:
from typing import Dict

def preprocess_text_str(text_str: str) -> str:
    s = str(text_str).replace("\\", "")
    text_str_ = "undefined" if s.startswith("u0") or (s in ["undefined", "nan"]) else s
    
    text_str__clean = text_str_.split("u0")[0] if "u0" in text_str_ else text_str_ 
    
    return text_str__clean

def create_text_field__clean_dict(data: pd.DataFrame, text_field: str) -> Dict:
    text_values = list(data[text_field].unique())
    text_values_ = [preprocess_text_str(s) for s in text_values]
    
    text_field__clean_dict = dict(zip(text_values, text_values_))
    
    return text_field__clean_dict

def map_text_field(data: pd.DataFrame, text_field: str, text_field__clean_dict: Dict) -> pd.DataFrame:
    data[text_field] = data[text_field].map(text_field__clean_dict).fillna("undefined")
    
    return data

def recategorize_category_typed_fields(data: pd.DataFrame) -> pd.DataFrame:
    for field_name, dtype in data.dtypes.items():
        if dtype == "category":
            data[field_name] = data[field_name].astype(str).astype("category")
            
    return data

In [10]:
text_values = list(train_data["text"].unique())
text_values[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 '\\u00f0\\u0178\\u02dc\\u00b4',
 nan,
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]

In [11]:
text_values_ = [preprocess_text_str(s) for s in text_values]
text_values_[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 'undefined',
 'undefined',
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]

In [12]:
text_field__clean_dict = dict(zip(text_values, text_values_))
train_data["text"] = train_data["text"].map(text_field__clean_dict).fillna("undefined").astype('category')
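
To make the cleaning rule easier to follow, here is a small self-contained sketch that repeats the logic of `preprocess_text_str` and runs it on a few inputs. The last sample is hypothetical and is included only to show the partial-strip case:

```python
def preprocess_text_str(text_str) -> str:
    # Same logic as the notebook cell, repeated so the sketch runs on its own.
    s = str(text_str).replace("\\", "")
    s = "undefined" if s.startswith("u0") or s in ("undefined", "nan") else s
    return s.split("u0")[0] if "u0" in s else s

samples = [
    "Found it!",                      # ordinary line: unchanged
    "\\u00f0\\u0178\\u02dc\\u00b4",   # escaped emoji bytes only
    float("nan"),                     # missing value
    "Yes! \\u00f0\\u0178",            # hypothetical: text followed by escaped bytes
]
cleaned = [preprocess_text_str(s) for s in samples]
print(cleaned)
```

Note that the partial-strip case keeps the trailing space before the removed bytes; adding `.strip()` would be one way to normalize it, if that is desired.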

In [13]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0


### Train Labels

In [14]:
train_labels = pd.read_csv("../data/train_labels.csv")

In [15]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 2)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,object,0.0,424116,20090312431273200_q1,20090312433251036_q1,20090312455206810_q1,,
correct,int64,0.0,2,1,0,1,0.0,1.0


In [16]:
train_labels['question_no'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[-1][1:]))
train_labels['session_id'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[0]))

train_labels["session_id"].nunique()

23562

In [17]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 3)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312455206810,2.009031e+16,2.210022e+16
correct,int64,0.0,2,1,0,1,0.0,1.0
question_no,int64,0.0,18,1,1,1,1.0,18.0
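
The label split above uses two row-wise `apply` calls. A possible vectorized variant, sketched on toy rows in the same `<session>_q<number>` shape (the frame below is invented for illustration):

```python
import pandas as pd

# Toy labels shaped like the session_id column of train_labels.csv.
labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312431273200_q18", "20090312433251036_q3"],
    "correct": [1, 0, 1],
})

# Split once on "_q" and cast both parts, instead of two Python-level apply calls.
parts = labels["session_id"].str.split("_q", expand=True)
labels["question_no"] = parts[1].astype("int64")
labels["session_id"] = parts[0].astype("int64")
print(labels)
```

On the full label file, this should produce the same two columns as the cell above.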


#### Validity check >>> Train Labels

In [18]:
train_labels.groupby("session_id")["question_no"].nunique().value_counts()

18    23562
Name: question_no, dtype: int64

In [19]:
question_no__list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
len(train_labels[~train_labels["question_no"].isin(question_no__list)])

0

In [20]:
len(train_labels) == (23562 * 18)

True

#### Validity check >>> Session ids in datasets

In [21]:
train_data__session_id_unique_vals = train_data["session_id"].drop_duplicates().sort_values().reset_index(drop=True)
train_labels__session_id_unique_vals = train_labels["session_id"].drop_duplicates().sort_values().reset_index(drop=True)

pd.testing.assert_series_equal(train_data__session_id_unique_vals, train_labels__session_id_unique_vals)
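
`assert_series_equal` only reports that the two id lists differ. If the check ever fails, a set difference can show which ids are missing on each side. A minimal sketch with toy ids (invented, with one mismatch added on purpose):

```python
import pandas as pd

# Toy id columns; the labels frame is missing session 3 on purpose.
events = pd.DataFrame({"session_id": [1, 1, 2, 3, 3]})
labels = pd.DataFrame({"session_id": [1, 2, 2]})

event_ids = set(events["session_id"])
label_ids = set(labels["session_id"])

# Ids present on one side only; both lists are empty when the files agree.
only_in_events = sorted(event_ids - label_ids)
only_in_labels = sorted(label_ids - event_ids)
print(only_in_events, only_in_labels)
```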

### Downsampling

In [22]:
session_ids = sorted(train_labels["session_id"].unique())

np.random.seed(42)
np.random.shuffle(session_ids)

session_ids[:5]

[22010107585684490,
 20100413373831344,
 21000409261644490,
 20110314164224844,
 21080621495509370]

In [23]:
N_CHUNKS = 10

np.random.seed(42)
chunk_ids = np.random.randint(N_CHUNKS, size=len(session_ids))

session_chunk_df = pd.DataFrame({"session_id": session_ids, "chunk_id": chunk_ids})
session_chunk_df["chunk_id"].value_counts()

0    2418
9    2407
5    2395
6    2360
1    2358
2    2350
3    2347
7    2320
8    2311
4    2296
Name: chunk_id, dtype: int64

In [24]:
session_chunk_df["chunk_id"].nunique()

10
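
Drawing a random chunk id per session, as above, gives chunks of unequal size (2296 to 2418 sessions here). If near-equal chunks are preferred, one option is to shuffle the ids and split them evenly. This is a sketch under that assumption, with stand-in ids rather than the real ones:

```python
import numpy as np

N_CHUNKS = 10
session_ids = np.arange(23562)  # stand-in ids; the count matches the notebook

rng = np.random.default_rng(42)
shuffled = rng.permutation(session_ids)

# array_split tolerates lengths that are not a multiple of N_CHUNKS.
chunks = np.array_split(shuffled, N_CHUNKS)
sizes = [len(c) for c in chunks]
print(sizes)
```

With 23562 sessions and 10 chunks, the sizes differ by at most one.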

# Features Per Event

# Event Categories

In [25]:
for i in train_data["event_name"].unique().categories:
    print(i)

checkpoint
cutscene_click
map_click
map_hover
navigate_click
notebook_click
notification_click
object_click
object_hover
observation_click
person_click


In [74]:
from typing import List

def convert_to_numeric_type(data: pd.DataFrame, feature_fields_list: List[str]) -> pd.DataFrame:
    for col_name, dtype in data[feature_fields_list].dtypes.items():
        if str(dtype).startswith("int"):
            data[col_name] = data[col_name].astype("int64")
        elif str(dtype).startswith("float"):
            data[col_name] = data[col_name].astype("float64")
        elif str(dtype) == "category":
            data[col_name] = data[col_name].astype(str).astype("int8")
        else:
            pass
            
    return data

def create_event_features(event_data: pd.DataFrame, feature_fields_list: List) -> pd.DataFrame:
    df_event = convert_to_numeric_type(event_data, feature_fields_list)
    
    stat_list = ["min", "max", "median", "mean", "std", "sum", "count"]
    
    df__event_features = df_event.groupby(["session_id", "level_group"])[feature_fields_list].agg(stat_list).round(2)
    df__event_features.columns = [f"{col_name}__{stat}" for col_name, stat in df__event_features.columns.to_flat_index()]
    
    return df__event_features

def encode_category_field__onehot(event_data: pd.DataFrame, cat_field: str, onehot_dict: Dict[str, str]) -> pd.DataFrame:
    for col_name, value in onehot_dict.items():
        event_data[col_name] = (event_data[cat_field] == value).astype("int8").astype("category")
        
    return event_data

def get_event_data(data: pd.DataFrame, event_name: str) -> pd.DataFrame:
    event_data = data[data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)
    
    return event_data

def create_event_flag(event_data: pd.DataFrame, flag_conds_dict: Dict) -> pd.Series:
    feature_flag = None
    for field_name, flag_cond in flag_conds_dict.items():
        key, value = flag_cond
        
        if key == "is_equal":
            flag = (event_data[field_name] == value).astype(bool)
        
        elif key == "isin_list":
            flag = event_data[field_name].isin(value).astype(bool)
            
        else:
            raise ValueError(f"Unsupported flag condition: {key}")
        
        if feature_flag is None:
            feature_flag = flag
        else:
            feature_flag = feature_flag & flag
            
    feature_flag = feature_flag.astype(int).astype("category")
    
    return feature_flag

def generate_event_features(data: pd.DataFrame, event_name: str, feature_fields_dict: Dict) -> pd.DataFrame:
    event_data = get_event_data(data, event_name)
    feature_fields_dict__event = feature_fields_dict[event_name]
    feature_fields_list = feature_fields_dict__event["numeric"]
    
    if "onehot" in feature_fields_dict__event:
        for flag_name, flag_conds_dict in feature_fields_dict__event["onehot"].items():
            event_data[flag_name] = create_event_flag(event_data, flag_conds_dict)
            feature_fields_list = feature_fields_list + [flag_name]
                
    event_features = create_event_features(event_data, feature_fields_list).reset_index()
    event_features.columns = ["session_id", "level_group", *[f"{event_name}__{feat_name}" for feat_name in event_features.columns[2:]]]
    
    return event_features
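
The core of `create_event_features` is a grouped aggregation followed by flattening the two-level column index into `field__stat` names. A small sketch of just that step, on toy data invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "level_group": ["0-4", "0-4", "5-12", "0-4"],
    "elapsed_time": [100, 300, 900, 50],
})

stats = toy.groupby(["session_id", "level_group"])[["elapsed_time"]].agg(
    ["min", "max", "mean", "count"]
)
# agg with a list of statistics yields a two-level column index; join the levels.
stats.columns = [f"{col}__{stat}" for col, stat in stats.columns.to_flat_index()]
stats = stats.reset_index()
print(stats)
```

Each output row corresponds to one (`session_id`, `level_group`) pair, which is the grain used for the merged feature table later on.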

In [75]:
feature_fields_dict = {
    "checkpoint": {
        "numeric": [
            "index", "elapsed_time", "fullscreen", "hq", "music"
        ],
    },
    "cutscene_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
    },
    "map_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
        "onehot": {
            "is_name__basic": {
                "name" : ("is_equal", "basic")
            },
            "is_name__close": {
                "name" : ("is_equal", "close")
            },
            "is_name__undefined": {
                "name" : ("is_equal", "undefined")
            },
        },
    },
    "map_hover": {
        "numeric": [
            "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music"
        ],
    },
    "navigate_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
    },
    "notebook_click": {
        "numeric": [
            "index", "elapsed_time", "level", "page", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
        "onehot": {
            "is_name__basic": {
                "name" : ("is_equal", "basic")
            },
            "is_name__open": {
                "name" : ("is_equal", "open")
            },
            "is_name__close": {
                "name" : ("is_equal", "close")
            },
            "is_name__prev": {
                "name" : ("is_equal", "prev")
            },
            "is_name__next": {
                "name" : ("is_equal", "next")
            },
        },
    },
    "notification_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
    },
    "object_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
        "onehot": {
            "is_name__basic": {
                "name" : ("is_equal", "basic")
            },
            "is_name__close": {
                "name" : ("is_equal", "close")
            },
        },
    },
    "object_hover": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
        "onehot": {
            "is_name__basic": {
                "name" : ("is_equal", "basic")
            },
            "is_name__undefined": {
                "name" : ("is_equal", "undefined")
            },
        },
    },
    "observation_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
    },
    "person_click": {
        "numeric": [
            "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
            "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
        ],
    },
}
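
Each `onehot` entry in the configuration maps a flag name to a dictionary of `{field: (operator, value)}` conditions, which are combined with a logical AND. A minimal sketch of how such a spec can be evaluated, using a hypothetical two-condition spec and toy rows (neither appears in the notebook):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["basic", "open", "close", "basic"], "page": [0, 1, 1, 2]})

# Hypothetical spec in the same {field: (operator, value)} shape as the config.
spec = {"name": ("isin_list", ["open", "close"]), "page": ("is_equal", 1)}

mask = pd.Series(True, index=toy.index)
for field, (op, value) in spec.items():
    if op == "is_equal":
        mask &= toy[field] == value
    elif op == "isin_list":
        mask &= toy[field].isin(value)
    else:
        raise ValueError(f"unknown operator: {op}")

flag = mask.astype(int)
print(flag.tolist())
```

Only rows that satisfy every condition in the spec receive a 1.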

In [76]:
all_event_features = None
for event_name in feature_fields_dict.keys():
    event_features = generate_event_features(train_data, event_name, feature_fields_dict)
    
    if all_event_features is None:
        all_event_features = event_features.copy()
    else:
        all_event_features = all_event_features.merge(event_features, on=["session_id", "level_group"], how="outer")
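
The loop above folds the per-event tables together with repeated outer merges. The same pattern can be written with `functools.reduce`; the sketch below uses three toy tables (invented) to show how outer merging leaves gaps where a session has no events of a given type:

```python
from functools import reduce

import pandas as pd

keys = ["session_id", "level_group"]
a = pd.DataFrame({"session_id": [1, 2], "level_group": ["0-4", "0-4"], "a__count": [3, 5]})
b = pd.DataFrame({"session_id": [1], "level_group": ["0-4"], "b__count": [7]})
c = pd.DataFrame({"session_id": [2, 3], "level_group": ["0-4", "0-4"], "c__count": [1, 2]})

# Outer merge keeps every (session_id, level_group) pair seen in any table.
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"), [a, b, c])
print(merged)
```

Missing combinations show up as NaN, which is consistent with the small non-zero `perc_missing` values for the checkpoint features in the summary that follows.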

In [77]:
summary = summarize_data_info(all_event_features)
summary

Data Shape: (70686, 807)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.000000,23562,20090312431273200,20090312431273200,20090312431273200,20090312431273200.0,22100221145014656.0
level_group,category,0.000000,3,0-4,13-22,5-12,,
checkpoint__index__min,float64,0.000014,2366,164.0,931.0,470.0,0.0,20473.0
checkpoint__index__max,float64,0.000014,2432,164.0,931.0,470.0,0.0,20473.0
checkpoint__index__median,float64,0.000014,2541,164.0,931.0,470.0,0.0,20473.0
...,...,...,...,...,...,...,...,...
person_click__music__median,float64,0.000000,2,1.0,1.0,1.0,0.0,1.0
person_click__music__mean,float64,0.000000,2,1.0,1.0,1.0,0.0,1.0
person_click__music__std,float64,0.000000,1,0.0,0.0,0.0,0.0,0.0
person_click__music__sum,int64,0.000000,243,22,123,104,0.0,492.0


In [78]:
all_event_features

Unnamed: 0,session_id,level_group,checkpoint__index__min,checkpoint__index__max,checkpoint__index__median,checkpoint__index__mean,checkpoint__index__std,checkpoint__index__sum,checkpoint__index__count,checkpoint__elapsed_time__min,...,person_click__hq__std,person_click__hq__sum,person_click__hq__count,person_click__music__min,person_click__music__max,person_click__music__median,person_click__music__mean,person_click__music__std,person_click__music__sum,person_click__music__count
0,20090312431273200,0-4,164.0,164.0,164.0,164.0,,164,1,194860.0,...,0.0,0,22,1,1,1.0,1.0,0.0,22,22
1,20090312431273200,13-22,931.0,931.0,931.0,931.0,,931,1,1272679.0,...,0.0,0,123,1,1,1.0,1.0,0.0,123,123
2,20090312431273200,5-12,470.0,470.0,470.0,470.0,,470,1,499235.0,...,0.0,0,104,1,1,1.0,1.0,0.0,104,104
3,20090312433251036,0-4,138.0,138.0,138.0,138.0,,138,1,233752.0,...,0.0,0,18,0,0,0.0,0.0,0.0,0,18
4,20090312433251036,13-22,1875.0,1875.0,1875.0,1875.0,,1875,1,3815334.0,...,0.0,0,145,0,0,0.0,0.0,0.0,0,145
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70681,22100219442786200,13-22,920.0,920.0,920.0,920.0,,920,1,1218877.0,...,0.0,0,101,1,1,1.0,1.0,0.0,101,101
70682,22100219442786200,5-12,453.0,453.0,453.0,453.0,,453,1,561672.0,...,0.0,0,95,1,1,1.0,1.0,0.0,95,95
70683,22100221145014656,0-4,210.0,210.0,210.0,210.0,,210,1,435055.0,...,0.0,0,27,1,1,1.0,1.0,0.0,27,27
70684,22100221145014656,13-22,1604.0,1604.0,1604.0,1604.0,,1604,1,5487952.0,...,0.0,0,139,1,1,1.0,1.0,0.0,139,139
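
In the table above, `checkpoint__index__std` is empty wherever the count is 1, because the sample standard deviation (pandas uses `ddof=1` by default) is undefined for a single observation. A short sketch on toy data, including one possible way to fill the gap (an assumption, not something the notebook does):

```python
import pandas as pd

toy = pd.DataFrame({"session_id": [1, 2, 2], "index": [164, 10, 30]})

# Session 1 has a single row, so its sample standard deviation is NaN.
std = toy.groupby("session_id")["index"].std()
print(std)

# One possible treatment (an assumption): treat a single observation
# as having zero spread.
std_filled = std.fillna(0.0)
print(std_filled)
```

Whether to fill these values or leave them as NaN depends on the downstream model; some tree-based libraries accept missing values directly.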


In [79]:
all_event_features.to_csv("../data/all_features.csv", index=False)