# Introduction

- **session_id** - the ID of the session the event took place in
- **index** - the index of the event for the session
- **elapsed_time** - how much time has passed (in milliseconds) between the start of the session and when the event was recorded
- **event_name** - the name of the event type
- **name** - the event name (e.g. identifies whether a notebook_click is opening or closing the notebook)
- **level** - what level of the game the event occurred in (0 to 22)
- **page** - the page number of the event (only for notebook-related events)
- **room_coor_x** - the x-coordinate of the click relative to the in-game room (only for click events)
- **room_coor_y** - the y-coordinate of the click relative to the in-game room (only for click events)
- **screen_coor_x** - the x-coordinate of the click relative to the player’s screen (only for click events)
- **screen_coor_y** - the y-coordinate of the click relative to the player’s screen (only for click events)
- **hover_duration** - how long (in milliseconds) the hover happened for (only for hover events)
- **text** - the text the player sees during this event
- **fqid** - the fully qualified ID of the event
- **room_fqid** - the fully qualified ID of the room the event took place in
- **text_fqid** - the fully qualified ID of the
- **fullscreen** - whether the player is in fullscreen mode
- **hq** - whether the game is in high-quality mode
- **music** - whether the game music is on or off
- **level_group** - which group of levels - and group of questions - this row belongs to (0-4, 5-12, 13-22)
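The `level_group` column is a coarse binning of `level`. The sketch below shows that relationship, assuming the three ranges listed above are closed integer intervals. The toy frame `events` is hypothetical and stands in for the real event log.

```python
import pandas as pd

# Hypothetical toy frame; only the `level` column matters here.
events = pd.DataFrame({"level": [0, 4, 5, 12, 13, 22]})

# pd.cut uses right-closed bins by default, so these edges give
# (-1, 4], (4, 12] and (12, 22], i.e. levels 0-4, 5-12 and 13-22.
events["level_group"] = pd.cut(
    events["level"],
    bins=[-1, 4, 12, 22],
    labels=["0-4", "5-12", "13-22"],
)

print(events)
```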

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Train

## Data

In [3]:
# Reference: https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359
dtypes={
    'elapsed_time': np.int32,
    'event_name': 'category',
    'name': 'category',
    'level': np.uint8,
    'room_coor_x': np.float32,
    'room_coor_y': np.float32,
    'screen_coor_x': np.float32,
    'screen_coor_y': np.float32,
    'hover_duration': np.float32,
    'text': 'category',
    'fqid': 'category',
    'room_fqid': 'category',
    'text_fqid': 'category',
    'fullscreen': 'category',
    'hq': 'category',
    'music': 'category',
    'level_group': 'category'
}

train_data = pd.read_csv('../data/train.csv', dtype=dtypes)

In [4]:
## Reference: https://www.kaggle.com/code/kimtaehun/lightgbm-baseline-with-aggregated-log-data?scriptVersionId=118573291&cellId=15
def summarize_data_info(df: pd.DataFrame) -> pd.DataFrame:
    summary = pd.DataFrame(df.dtypes, columns=['data_type'])
    
    # Fraction of missing values per column (0.0 to 1.0), despite the "perc_" prefix
    summary['perc_missing'] = df.isnull().sum().values / len(df)
    summary['n_unique'] = df.nunique().values
    
    summary['first_value'] = df.loc[0].values
    summary['second_value'] = df.loc[1].values
    summary['third_value'] = df.loc[2].values
    
    df_describe = pd.DataFrame(df.describe(include='all').transpose())
    summary['min'] = df_describe['min'].values
    summary['max'] = df_describe['max'].values
    
    print(f'Data Shape: {df.shape}')
    
    return summary

In [5]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int64,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,uint8,0.0,23,0,0,0,0.0,22.0
page,float64,0.978532,7,,,,0.0,6.0
room_coor_x,float32,0.078841,12538215,-413.991394,-413.991394,-413.991394,-1992.354614,1261.773804
room_coor_y,float32,0.078841,9551136,-159.314682,-159.314682,-159.314682,-918.162354,543.616394
screen_coor_x,float32,0.078841,57477,380.0,380.0,380.0,0.0,1919.0


In [6]:
# Reduce Memory Usage
# reference : https://www.kaggle.com/code/arjanso/reducing-dataframe-memory-size-by-65 @ARJANGROEN

def reduce_memory_usage(df):
    
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
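    for col in df.columns:
        col_type = df[col].dtype.name
        # Leave datetime and categorical columns untouched
        if col_type != 'datetime64[ns]' and col_type != 'category':
            if col_type != 'object':
                c_min = df[col].min()
                c_max = df[col].max()

                # Only dtype names starting with 'int' match here; unsigned types such
                # as uint8 fall through to the float branch below and become float16
                if str(col_type)[:3] == 'int':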
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)

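                else:
                    # Range check only: float16 has an 11-bit significand, so values
                    # that fit in range are still rounded (coordinates lose precision)
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max: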
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        pass
            else:
                df[col] = df[col].astype('category')
    mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage became: ",mem_usg," MB")
    
    return df

In [7]:
train_data = reduce_memory_usage(train_data)

Memory usage of dataframe is 1529.83 MB
Memory usage became:  1053.3384094238281  MB


In [8]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0
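The drop in `n_unique` for the coordinate columns after downcasting (for example `room_coor_x` going from 12538215 to 29854 distinct values) is consistent with float16 rounding. The sketch below reproduces the effect on the first `room_coor_x` value from the summary. The value 1900.4 is a made-up illustration, not taken from the data.

```python
import numpy as np

# First room_coor_x value as it appears in the float32 summary.
x32 = np.float32(-413.991394)
x16 = np.float16(x32)
print(x32, "->", x16)

# float16 can hold magnitudes up to 65504, so the range check in
# reduce_memory_usage passes, but between 1024 and 2048 only whole
# numbers are representable.
print(np.finfo(np.float16).max)
print(np.float16(1900.4))  # illustrative value, rounds to a whole number
```

If sub-pixel coordinate precision matters for a feature, keeping these columns as float32 is the safer choice.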


### `Text` Field Preprocessing

In [9]:
from typing import Dict

def preprocess_text_str(text_str: str) -> str:
    s = str(text_str).replace("\\", "")
    text_str_ = "undefined" if s.startswith("u0") or (s in ["undefined", "nan"]) else s
    
    text_str__clean = text_str_.split("u0")[0] if "u0" in text_str_ else text_str_ 
    
    return text_str__clean

def create_text_field__clean_dict(data: pd.DataFrame, text_field: str) -> Dict:
    text_values = list(data[text_field].unique())
    text_values_ = [preprocess_text_str(s) for s in text_values]
    
    text_field__clean_dict = dict(zip(text_values, text_values_))
    
    return text_field__clean_dict

def map_text_field(data: pd.DataFrame, text_field: str, text_field__clean_dict: Dict) -> pd.DataFrame:
    data[text_field] = data[text_field].map(text_field__clean_dict).fillna("undefined")
    
    return data

def recategorize_category_typed_fields(data: pd.DataFrame) -> pd.DataFrame:
    for field_name, dtype in data.dtypes.items():
        if dtype == "category":
            data[field_name] = data[field_name].astype(str).astype("category")
            
    return data
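
The cleaning rule in `preprocess_text_str` relies on what escaped Unicode sequences look like once backslashes are removed. The sketch below walks through that logic inline on two strings. The first mirrors the escaped value that appears in the `text` column. The second, with a trailing escape, is a hypothetical input used only to show the `split` step.

```python
# An all-escape value: a literal backslash followed by "u00f0", and so on.
raw = "\\u00f0\\u0178\\u02dc\\u00b4"
stripped = raw.replace("\\", "")
print(stripped)                   # the backslashes are gone, "u0..." remains
print(stripped.startswith("u0"))  # such values are mapped to "undefined"

# Hypothetical value with a trailing escape: everything from the first
# "u0" onward is dropped by the split.
raw_trailing = "Hey!\\u00a0"
cleaned = raw_trailing.replace("\\", "").split("u0")[0]
print(cleaned)
```

One caveat of this rule: any genuine text containing the substring `u0` would also be truncated at that point.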

In [10]:
text_values = list(train_data["text"].unique())
text_values[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 '\\u00f0\\u0178\\u02dc\\u00b4',
 nan,
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]

In [11]:
text_values_ = [preprocess_text_str(s) for s in text_values]
text_values_[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 'undefined',
 'undefined',
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]

In [12]:
text_field__clean_dict = dict(zip(text_values, text_values_))
train_data["text"] = train_data["text"].map(text_field__clean_dict).fillna("undefined").astype('category')

In [13]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0


### Train Labels

In [14]:
train_labels = pd.read_csv("../data/train_labels.csv")

In [15]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 2)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,object,0.0,424116,20090312431273200_q1,20090312433251036_q1,20090312455206810_q1,,
correct,int64,0.0,2,1,0,1,0.0,1.0


In [16]:
train_labels['question_no'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[-1][1:]))
train_labels['session_id'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[0]) )

train_labels["session_id"].nunique()

23562
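
The two `apply` calls above split each label key of the form `<session_id>_q<question_no>`. The sketch below does the same split in one vectorized pass with `Series.str.extract`, assuming every key follows that pattern. The toy `labels` frame is hypothetical.

```python
import pandas as pd

# Hypothetical two-row stand-in for train_labels.
labels = pd.DataFrame({
    "session_id": ["20090312431273200_q1", "20090312433251036_q18"],
    "correct": [1, 0],
})

# Named groups become the column names of the extracted frame.
parts = labels["session_id"].str.extract(
    r"^(?P<session_id>\d+)_q(?P<question_no>\d+)$"
)
labels["question_no"] = parts["question_no"].astype("int64")
labels["session_id"] = parts["session_id"].astype("int64")

print(labels)
```

Keys that do not match the pattern would come back as NaN and make the `astype` call fail, which doubles as a format check.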

In [17]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 3)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312455206810,2.009031e+16,2.210022e+16
correct,int64,0.0,2,1,0,1,0.0,1.0
question_no,int64,0.0,18,1,1,1,1.0,18.0


#### Validity check >>> Train Labels

In [18]:
train_labels.groupby("session_id")["question_no"].nunique().value_counts()

18    23562
Name: question_no, dtype: int64

In [19]:
question_no__list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
len(train_labels[~train_labels["question_no"].isin(question_no__list)])

0

In [20]:
len(train_labels) == (23562 * 18)

True

#### Validity check >>> Session ids in datasets

In [21]:
train_data__session_id_unique_vals = train_data["session_id"].drop_duplicates().sort_values().reset_index(drop=True)
train_labels__session_id_unique_vals = train_labels["session_id"].drop_duplicates().sort_values().reset_index(drop=True)

pd.testing.assert_series_equal(train_data__session_id_unique_vals, train_labels__session_id_unique_vals)

### Downsampling

In [22]:
session_ids = sorted(train_labels["session_id"].unique())

np.random.seed(42)
np.random.shuffle(session_ids)

session_ids[:5]

[22010107585684490,
 20100413373831344,
 21000409261644490,
 20110314164224844,
 21080621495509370]

In [23]:
N_CHUNKS = 10

np.random.seed(42)
chunk_ids = np.random.randint(N_CHUNKS, size=len(session_ids))

session_chunk_df = pd.DataFrame({"session_id": session_ids, "chunk_id": chunk_ids})
session_chunk_df["chunk_id"].value_counts()

0    2418
9    2407
5    2395
6    2360
1    2358
2    2350
3    2347
7    2320
8    2311
4    2296
Name: chunk_id, dtype: int64

In [24]:
session_chunk_df["chunk_id"].nunique()

10
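
The chunk assignment above only labels sessions; it does not yet subset the event log. One way to pull out the events for a single chunk is sketched below. The helper name `select_chunk` and the toy frames are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy stand-ins for session_chunk_df and train_data.
session_chunk_df = pd.DataFrame(
    {"session_id": [11, 12, 13, 14], "chunk_id": [0, 1, 0, 2]}
)
train_data = pd.DataFrame(
    {"session_id": [11, 11, 12, 13, 14, 14], "index": [0, 1, 0, 0, 0, 1]}
)

def select_chunk(events: pd.DataFrame, session_chunks: pd.DataFrame, chunk_id: int) -> pd.DataFrame:
    """Return all events whose session was assigned to `chunk_id`."""
    keep = session_chunks.loc[session_chunks["chunk_id"] == chunk_id, "session_id"]
    return events[events["session_id"].isin(keep)].reset_index(drop=True)

sample = select_chunk(train_data, session_chunk_df, chunk_id=0)
print(sample)
```

Sampling whole sessions rather than individual rows keeps every event of a selected session together, which matters for per-session features.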

# Numeric Features Per Event

### Event Categories

In [25]:
for i in train_data["event_name"].unique().categories:
    print(i)

checkpoint
cutscene_click
map_click
map_hover
navigate_click
notebook_click
notification_click
object_click
object_hover
observation_click
person_click


### `event_name` == `"checkpoint"`

In [26]:
event_name = "checkpoint"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,164,194860,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,1,0-4
1,20090312431273200,470,499235,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,1,5-12
2,20090312431273200,931,1272679,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,0,0,1,13-22
3,20090312433251036,138,233752,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,0,0-4
4,20090312433251036,544,817609,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,0,5-12


#### `level_group` == `"0-4"`

In [27]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,164,194860,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,1
1,20090312433251036,138,233752,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,0
2,20090312455206810,148,363226,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,1,1,1
3,20090313091715820,175,192793,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,1,1,1
4,20090313571836404,111,195851,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,1


In [28]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (23713, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23561,20090312431273200,20090312433251036,20090312455206810,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,687,164,138,148,0.0,5135.0
elapsed_time,int32,0.0,23048,194860,233752,363226,589.0,1986921747.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,1,4.0,4.0,4.0,4.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"5-12"`

In [29]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,470,499235,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,1
1,20090312433251036,544,817609,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,0
2,20090312455206810,402,632860,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,1,1,1
3,20090313091715820,510,749302,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,1,1,1
4,20090313571836404,360,527617,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,1


In [30]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (23682, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312455206810,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1122,470,544,402,26.0,18196.0
elapsed_time,int32,0.0,23472,499235,817609,632860,893.0,1987182816.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,1,12.0,12.0,12.0,12.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"13-22"`

In [31]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,931,1272679,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,0,0,1
1,20090312433251036,1875,3815334,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,0,0,0
2,20090312455206810,826,1189050,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,1,1,1
3,20090313091715820,1039,1621368,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,1,1,1
4,20090313571836404,783,1174676,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,0,0,1


In [32]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (23633, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312455206810,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1869,931,1875,826,0.0,20473.0
elapsed_time,int32,0.0,23548,1272679,3815334,1189050,300.0,1468325947.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,1,22.0,22.0,22.0,22.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### Numeric Fields

In [33]:
checkpoint__numeric_fields = ["index", "elapsed_time", "fullscreen", "hq", "music"]
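
A list like this can drive per-session aggregation. The sketch below shows one possible shape for that step, assuming `fullscreen`, `hq` and `music` hold string categories such as `"0"` and `"1"` (they are loaded with the `category` dtype earlier). The toy frame, the choice of `mean` and `max`, and the `checkpoint__` column prefix are all illustrative assumptions.

```python
import pandas as pd

checkpoint__numeric_fields = ["index", "elapsed_time", "fullscreen", "hq", "music"]

# Hypothetical toy checkpoint rows for two sessions.
toy = pd.DataFrame({
    "session_id": [1, 1, 2],
    "index": [164, 470, 138],
    "elapsed_time": [194860, 499235, 233752],
    "fullscreen": pd.Categorical(["0", "0", "1"]),
    "hq": pd.Categorical(["0", "0", "1"]),
    "music": pd.Categorical(["1", "1", "0"]),
})

# Categorical flags are cast to strings and then to numbers before aggregating.
numeric = toy[checkpoint__numeric_fields].apply(lambda s: pd.to_numeric(s.astype(str)))
numeric["session_id"] = toy["session_id"]

features = numeric.groupby("session_id").agg(["mean", "max"])
# Flatten the (field, statistic) column MultiIndex into single names.
features.columns = [f"checkpoint__{col}__{stat}" for col, stat in features.columns]

print(features)
```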

### `event_name` == `"cutscene_click"`

In [34]:
event_name = "cutscene_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,basic,0.0,,-414.0,-159.375,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,41,45062,basic,1.0,,93.8125,-60.34375,338.0,368.0,,Let's get started. The Wisconsin Wonders exhib...,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
2,20090312431273200,42,46046,basic,1.0,,134.0,-85.6875,390.0,386.0,,Who wants to investigate the shirt artifact?,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
3,20090312431273200,43,47362,basic,1.0,,125.9375,-83.375,390.0,385.0,,Not Leopold here. He's been losing papers lately.,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
4,20090312431273200,44,48112,basic,1.0,,123.6875,-80.0625,389.0,383.0,,Hey!,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4


#### `level_group` == `"0-4"`

In [35]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,0,0,basic,0.0,,-414.0,-159.375,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1
1,20090312431273200,41,45062,basic,1.0,,93.8125,-60.34375,338.0,368.0,,Let's get started. The Wisconsin Wonders exhib...,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1
2,20090312431273200,42,46046,basic,1.0,,134.0,-85.6875,390.0,386.0,,Who wants to investigate the shirt artifact?,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1
3,20090312431273200,43,47362,basic,1.0,,125.9375,-83.375,390.0,385.0,,Not Leopold here. He's been losing papers lately.,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1
4,20090312431273200,44,48112,basic,1.0,,123.6875,-80.0625,389.0,383.0,,Hey!,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1


In [36]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (787584, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2641,0,41,42,0.0,5107.0
elapsed_time,int32,0.0,230243,0,45062,46046,0.0,1986889118.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,5,0.0,1.0,1.0,0.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,17676,-414.0,93.8125,134.0,-794.0,941.0
room_coor_y,float16,0.0,15108,-159.375,-60.34375,-85.6875,-532.0,543.5
screen_coor_x,float16,0.0,3375,380.0,338.0,390.0,0.0,1876.0
screen_coor_y,float16,0.0,2255,494.0,368.0,386.0,0.0,1175.0


#### `level_group` == `"5-12"`

In [37]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,188,234969,basic,5.0,,-668.0,-187.625,89.0,509.0,,Oh no!,what_happened,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.what_happ...,0,0,1
1,20090312431273200,189,235717,basic,5.0,,-623.0,-157.125,308.0,470.0,,undefined,what_happened,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.what_happ...,0,0,1
2,20090312431273200,190,236350,basic,5.0,,-713.5,-156.125,260.0,467.0,,What happened here?!,what_happened,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.what_happ...,0,0,1
3,20090312431273200,191,237000,basic,5.0,,-729.0,-158.75,258.0,469.0,,I don't know!,what_happened,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.what_happ...,0,0,1
4,20090312431273200,192,237734,basic,5.0,,-712.5,-152.0,279.0,462.0,,I got here and the whole place was a mess!,what_happened,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.what_happ...,0,0,1


In [38]:
summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (292379, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2310,188,189,190,0.0,18198.0
elapsed_time,int32,0.0,224618,234969,235717,236350,52.0,1986974520.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,4,5.0,5.0,5.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,13656,-668.0,-623.0,-713.5,-993.0,765.5
room_coor_y,float16,0.0,10862,-187.625,-157.125,-156.125,-459.0,341.5
screen_coor_x,float16,0.0,2211,89.0,308.0,260.0,0.0,1773.0
screen_coor_y,float16,0.0,1627,509.0,470.0,467.0,5.0,1265.0


#### `level_group` == `"13-22"`

In [39]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,523,844329,basic,13.0,,-50.5625,-43.375,509.0,360.0,,Jo!,ch3start,tunic.historicalsociety.basement,tunic.historicalsociety.basement.ch3start,0,0,1
1,20090312431273200,524,845479,basic,13.0,,119.9375,-124.8125,635.0,420.0,,Check out the next artifact!,ch3start,tunic.historicalsociety.basement,tunic.historicalsociety.basement.ch3start,0,0,1
2,20090312431273200,525,848147,basic,13.0,,147.5,-135.25,495.0,420.0,,What is it?,ch3start,tunic.historicalsociety.basement,tunic.historicalsociety.basement.ch3start,0,0,1
3,20090312431273200,526,848745,basic,13.0,,164.625,-135.375,494.0,420.0,,"I think it's a flag! Pretty interesting, huh?",ch3start,tunic.historicalsociety.basement,tunic.historicalsociety.basement.ch3start,0,0,1
4,20090312431273200,527,849311,basic,13.0,,169.125,-136.75,493.0,421.0,,"It's really cool, Gramps. But I'm worried abou...",ch3start,tunic.historicalsociety.basement,tunic.historicalsociety.basement.ch3start,0,0,1


In [40]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (1623072, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3296,523,524,525,0.0,19407.0
elapsed_time,int32,0.0,1141780,844329,845479,848147,100.0,1988105886.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,5,13.0,13.0,13.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,19645,-50.5625,119.9375,147.5,-1419.0,949.0
room_coor_y,float16,0.0,18398,-43.375,-124.8125,-135.25,-530.0,543.5
screen_coor_x,float16,0.0,3954,509.0,635.0,495.0,0.0,1886.0
screen_coor_y,float16,0.0,2717,360.0,420.0,420.0,0.0,1414.0


#### Numeric Fields

In [41]:
cutscene_click__numeric_fields = ["index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
                                  "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"]

#### Text Fields

In [42]:
cutscene_click__text_fields = ["text", "fqid", "room_fqid", "text_fqid"]

### `event_name` == `"map_click"`

In [43]:
event_name = "map_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,129,135990,undefined,3.0,,168.0,-142.25,263.0,417.0,,undefined,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,162,162438,undefined,4.0,,-538.0,6.0,462.0,324.0,,undefined,tunic.capitol_0,tunic.kohlcenter.halloffame,,0,0,1,0-4
2,20090312431273200,183,228133,undefined,5.0,,456.75,167.125,559.0,198.0,,undefined,tunic.historicalsociety,tunic.capitol_0.hall,,0,0,1,5-12
3,20090312431273200,242,280148,close,6.0,,1111.0,419.5,843.0,72.0,,undefined,,tunic.historicalsociety.entry,,0,0,1,5-12
4,20090312431273200,285,324396,undefined,7.0,,418.5,-201.0,420.0,453.0,,undefined,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1,5-12


#### `level_group` == `"0-4"`

In [44]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

#### `level_group` == `"5-12"`

In [46]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,183,228133,undefined,5.0,,456.75,167.125,559.0,198.0,,undefined,tunic.historicalsociety,tunic.capitol_0.hall,,0,0,1
1,20090312431273200,242,280148,close,6.0,,1111.0,419.5,843.0,72.0,,undefined,,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,285,324396,undefined,7.0,,418.5,-201.0,420.0,453.0,,undefined,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1
3,20090312431273200,329,357345,undefined,8.0,,432.5,60.0,587.0,270.0,,undefined,tunic.drycleaner,tunic.humanecology.frontdesk,,0,0,1
4,20090312431273200,362,390068,undefined,9.0,,-30.953125,39.0,298.0,291.0,,undefined,tunic.library,tunic.drycleaner.frontdesk,,0,0,1


In [47]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (205314, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2783,183,242,285,0.0,18093.0
elapsed_time,int32,0.0,187424,228133,280148,324396,113.0,1987178723.0
name,category,0.0,3,undefined,close,undefined,,
level,float16,0.0,7,5.0,6.0,7.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,15553,456.75,1111.0,418.5,-995.0,1171.0
room_coor_y,float16,0.0,12845,167.125,419.5,-201.0,-531.5,536.5
screen_coor_x,float16,0.0,1976,559.0,843.0,420.0,0.0,1894.0
screen_coor_y,float16,0.0,1571,198.0,72.0,453.0,0.0,1062.0


#### `level_group` == `"13-22"`

In [48]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,521,841512,undefined,13.0,,431.75,216.0,541.0,166.0,,undefined,tunic.historicalsociety,tunic.capitol_1.hall,,0,0,1
1,20090312431273200,721,1044018,undefined,18.0,,967.0,-341.0,753.0,539.0,,undefined,tunic.wildlife,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,825,1131795,undefined,19.0,,417.5,-340.25,749.0,217.0,,undefined,tunic.flaghouse,tunic.wildlife.center,,0,0,1
3,20090312431273200,857,1159661,undefined,20.0,,58.96875,24.390625,325.0,299.0,,undefined,tunic.library,tunic.flaghouse.entry,,0,0,1
4,20090312431273200,897,1233809,undefined,21.0,,-275.25,207.125,534.0,171.0,,undefined,tunic.historicalsociety,tunic.library.frontdesk,,0,0,1


In [49]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (257860, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3605,521,721,825,0.0,20470.0
elapsed_time,int32,0.0,246812,841512,1044018,1131795,248.0,1988601973.0
name,category,0.0,3,undefined,undefined,undefined,,
level,float16,0.0,10,13.0,18.0,19.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,12325,431.75,967.0,417.5,-998.0,1171.0
room_coor_y,float16,0.0,15479,216.0,-341.0,-340.25,-918.0,536.5
screen_coor_x,float16,0.0,2294,541.0,753.0,749.0,0.0,1872.0
screen_coor_y,float16,0.0,1935,166.0,539.0,217.0,0.0,1311.0


#### Numeric and One-Hot Fields

In [112]:
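# `page` is entirely missing for map_click (perc_missing == 1.0 in the
# summaries above), so it is left out of the numeric fields.
map_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

# `name` takes up to 3 distinct values for map_click, so it is listed
# for one-hot encoding.
map_click__one_hot_fields = [
    "name"
]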

#### Text Fields

In [51]:
map_click__text_fields = ["fqid", "room_fqid"]
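
The same filter-then-drop pattern is repeated for every `event_name` and `level_group` pair in this section. A minimal sketch of how it could be wrapped in one helper is shown here. `filter_event_level_group` and the toy `demo` frame are hypothetical names introduced for illustration and are not part of the notebook.

```python
import pandas as pd


def filter_event_level_group(data, event_name, level_group):
    """Return rows for one (event_name, level_group) pair, dropping both key columns.

    Hypothetical helper: mirrors the two-step filtering used in the cells above.
    """
    mask = (data["event_name"] == event_name) & (data["level_group"] == level_group)
    return (
        data[mask]
        .drop(columns=["event_name", "level_group"])
        .reset_index(drop=True)
    )


# Toy data standing in for train_data (values are made up for illustration).
demo = pd.DataFrame({
    "session_id": [1, 1, 2, 2],
    "event_name": ["map_click", "map_hover", "map_click", "map_click"],
    "level_group": ["0-4", "0-4", "5-12", "0-4"],
    "level": [3, 3, 6, 4],
})

map_click_0_4 = filter_event_level_group(demo, "map_click", "0-4")
print(map_click_0_4)
```

With a helper like this, the three level groups could be processed in a loop over `["0-4", "5-12", "13-22"]` instead of three copied cells.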

### `event_name` == `"map_hover"`

In [52]:
event_name = "map_hover"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,127,135124,basic,3.0,,,,,,234.0,undefined,tunic.historicalsociety,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,128,135256,basic,3.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1,0-4
2,20090312431273200,160,161405,basic,4.0,,,,,,250.0,undefined,toentry,tunic.kohlcenter.halloffame,,0,0,1,0-4
3,20090312431273200,161,161822,basic,4.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.kohlcenter.halloffame,,0,0,1,0-4
4,20090312431273200,182,226643,basic,5.0,,,,,,750.0,undefined,toentry,tunic.capitol_0.hall,,0,0,1,5-12


#### `level_group` == `"0-4"`

In [53]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,127,135124,basic,3.0,,,,,,234.0,undefined,tunic.historicalsociety,tunic.historicalsociety.entry,,0,0,1
1,20090312431273200,128,135256,basic,3.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,160,161405,basic,4.0,,,,,,250.0,undefined,toentry,tunic.kohlcenter.halloffame,,0,0,1
3,20090312431273200,161,161822,basic,4.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.kohlcenter.halloffame,,0,0,1
4,20090312433251036,101,161157,basic,3.0,,,,,,517.0,undefined,tomap,tunic.historicalsociety.entry,,0,0,0


In [54]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (45130, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,19330,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1001,127,128,160,0.0,3701.0
elapsed_time,int32,0.0,42053,135124,135256,161405,16.0,1986894774.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,2,3.0,3.0,4.0,3.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"5-12"`

In [55]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,182,226643,basic,5.0,,,,,,750.0,undefined,toentry,tunic.capitol_0.hall,,0,0,1
1,20090312431273200,240,278898,basic,6.0,,,,,,582.0,undefined,tomap,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,241,280113,basic,6.0,,,,,,350.0,undefined,tunic.capitol_1,tunic.historicalsociety.entry,,0,0,1
3,20090312431273200,284,323996,basic,7.0,,,,,,16.0,undefined,tunic.capitol_1,tunic.historicalsociety.entry,,0,0,1
4,20090312431273200,361,389751,basic,9.0,,,,,,133.0,undefined,tunic.capitol_1,tunic.drycleaner.frontdesk,,0,0,1


In [56]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (323170, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21688,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2324,182,240,241,0.0,14937.0
elapsed_time,int32,0.0,284686,226643,278898,280113,123.0,1987141625.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,7,5.0,6.0,6.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"13-22"`

In [57]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,516,839629,basic,13.0,,,,,,67.0,undefined,tunic.drycleaner,tunic.capitol_1.hall,,0,0,1
1,20090312431273200,517,840662,basic,13.0,,,,,,983.0,undefined,tunic.drycleaner,tunic.capitol_1.hall,,0,0,1
2,20090312431273200,518,840780,basic,13.0,,,,,,100.0,undefined,tunic.historicalsociety,tunic.capitol_1.hall,,0,0,1
3,20090312431273200,519,840830,basic,13.0,,,,,,35.0,undefined,tunic.capitol_1,tunic.capitol_1.hall,,0,0,1
4,20090312431273200,520,841212,basic,13.0,,,,,,367.0,undefined,tunic.library,tunic.capitol_1.hall,,0,0,1


In [58]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (576859, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21688,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3926,516,517,518,0.0,20469.0
elapsed_time,int32,0.0,526354,839629,840662,840780,195.0,1988601273.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,10,13.0,13.0,13.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### Numeric Fields

In [59]:
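# Click coordinates and `page` are entirely missing for map_hover
# (perc_missing == 1.0 in the summaries above); `hover_duration` is used instead.
map_hover__numeric_fields = [
    "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music"
]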

#### Text Fields

In [60]:
map_hover__text_fields = ["fqid", "room_fqid"]
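
One way a numeric field list like the one above might be used downstream is to collapse the event rows into one feature row per session. The sketch below assumes that goal (this section of the notebook does not show an aggregation step) and uses made-up values. `aggregate_numeric` is a hypothetical helper.

```python
import pandas as pd

# Subset of map_hover__numeric_fields, kept short for the example.
numeric_fields = ["elapsed_time", "hover_duration"]

# Made-up map_hover rows for two sessions.
hover = pd.DataFrame({
    "session_id": [1, 1, 2],
    "elapsed_time": [1000, 3000, 500],
    "hover_duration": [200.0, 400.0, 50.0],
})


def aggregate_numeric(data, fields, stats=("mean", "max")):
    """Aggregate each numeric field per session and flatten the column names."""
    agg = data.groupby("session_id")[fields].agg(list(stats))
    # agg() with a list of statistics yields (field, stat) column pairs.
    agg.columns = [f"{field}_{stat}" for field, stat in agg.columns]
    return agg


hover_features = aggregate_numeric(hover, numeric_fields)
print(hover_features)
```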

### `event_name` == `"navigate_click"`

In [61]:
event_name = "navigate_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,10,9133,undefined,0.0,,501.0,-160.75,605.0,445.0,,undefined,teddy,tunic.historicalsociety.closet,,0,0,1,0-4
1,20090312431273200,12,12030,undefined,0.0,,510.0,-106.375,614.0,386.0,,undefined,photo,tunic.historicalsociety.closet,,0,0,1,0-4
2,20090312431273200,14,14814,undefined,0.0,,274.0,-196.75,406.0,486.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4
3,20090312431273200,15,15498,undefined,0.0,,185.75,-205.75,363.0,492.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4
4,20090312431273200,16,16046,undefined,0.0,,0.583496,-225.75,234.0,510.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4


#### `level_group` == `"0-4"`

In [62]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,10,9133,undefined,0.0,,501.0,-160.75,605.0,445.0,,undefined,teddy,tunic.historicalsociety.closet,,0,0,1
1,20090312431273200,12,12030,undefined,0.0,,510.0,-106.375,614.0,386.0,,undefined,photo,tunic.historicalsociety.closet,,0,0,1
2,20090312431273200,14,14814,undefined,0.0,,274.0,-196.75,406.0,486.0,,undefined,,tunic.historicalsociety.closet,,0,0,1
3,20090312431273200,15,15498,undefined,0.0,,185.75,-205.75,363.0,492.0,,undefined,,tunic.historicalsociety.closet,,0,0,1
4,20090312431273200,16,16046,undefined,0.0,,0.583496,-225.75,234.0,510.0,,undefined,,tunic.historicalsociety.closet,,0,0,1


In [63]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (1807806, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3162,10,12,14,0.0,5134.0
elapsed_time,int32,0.0,440016,9133,12030,14814,0.0,1986899034.0
name,category,0.0,1,undefined,undefined,undefined,,
level,float16,0.0,5,0.0,0.0,0.0,0.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,22024,501.0,510.0,274.0,-1218.0,1172.0
room_coor_y,float16,0.0,19530,-160.75,-106.375,-196.75,-591.5,536.5
screen_coor_x,float16,0.0,5211,605.0,614.0,406.0,0.0,1914.0
screen_coor_y,float16,0.0,2846,445.0,386.0,486.0,0.0,1411.0


#### `level_group` == `"5-12"`

In [64]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,175,221485,undefined,5.0,,280.5,-19.703125,653.0,321.0,,undefined,boss,tunic.capitol_0.hall,,0,0,1
1,20090312431273200,178,223735,undefined,5.0,,331.75,-220.625,688.0,454.0,,undefined,,tunic.capitol_0.hall,,0,0,1
2,20090312431273200,179,224235,undefined,5.0,,404.25,-226.5,716.0,454.0,,undefined,,tunic.capitol_0.hall,,0,0,1
3,20090312431273200,180,224802,undefined,5.0,,612.5,-230.625,804.0,454.0,,undefined,,tunic.capitol_0.hall,,0,0,1
4,20090312431273200,181,225803,undefined,5.0,,755.0,-213.125,824.0,441.0,,undefined,toentry,tunic.capitol_0.hall,,0,0,1


In [65]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (3192522, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,11135,175,178,179,0.0,18195.0
elapsed_time,int32,0.0,1191636,221485,223735,224235,0.0,1987181016.0
name,category,0.0,1,undefined,undefined,undefined,,
level,float16,0.0,8,5.0,5.0,5.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,24307,280.5,331.75,404.25,-1215.0,1173.0
room_coor_y,float16,0.0,21229,-19.703125,-220.625,-226.5,-592.5,532.0
screen_coor_x,float16,0.0,5432,653.0,688.0,716.0,0.0,1917.0
screen_coor_y,float16,0.0,3311,321.0,454.0,454.0,0.0,1440.0


#### `level_group` == `"13-22"`

In [66]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,512,836732,undefined,13.0,,290.25,-204.5,651.0,445.0,,undefined,,tunic.capitol_1.hall,,0,0,1
1,20090312431273200,513,837245,undefined,13.0,,353.75,-210.375,672.0,445.0,,undefined,,tunic.capitol_1.hall,,0,0,1
2,20090312431273200,514,837779,undefined,13.0,,587.5,-280.75,780.0,489.0,,undefined,,tunic.capitol_1.hall,,0,0,1
3,20090312431273200,515,838446,undefined,13.0,,751.5,-102.125,823.0,365.0,,undefined,toentry,tunic.capitol_1.hall,,0,0,1
4,20090312431273200,522,842396,undefined,13.0,,593.0,170.375,523.0,223.0,,undefined,tobasement,tunic.historicalsociety.entry,,0,0,1


In [67]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (6326105, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,7354,512,513,514,0.0,20472.0
elapsed_time,int32,0.0,2910215,836732,837245,837779,0.0,1988606704.0
name,category,0.0,1,undefined,undefined,undefined,,
level,float16,0.0,10,13.0,13.0,13.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,26195,290.25,353.75,587.5,-1992.0,1259.0
room_coor_y,float16,0.0,24383,-204.5,-210.375,-280.75,-918.0,536.5
screen_coor_x,float16,0.0,6379,651.0,672.0,780.0,0.0,1919.0
screen_coor_y,float16,0.0,3588,445.0,445.0,489.0,0.0,1439.0


#### Numeric Fields

In [68]:
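# `page` is entirely missing for navigate_click, and `name` is constant
# ("undefined") in all three level groups, so neither is included.
navigate_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]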

#### Text Fields

In [69]:
navigate_click__text_fields = ["fqid", "room_fqid"]
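
Because `elapsed_time` is measured from the start of the session, the gap between consecutive events may be a more useful quantity than the raw value. As a sketch, assuming rows should be ordered by `index` within each session, the per-session difference could be computed as follows. The `time_since_prev` column name is hypothetical and the second session's values are made up.

```python
import pandas as pd

clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "index": [10, 12, 14, 5, 6],
    "elapsed_time": [9133, 12030, 14814, 400, 1000],
})

# Sort so that diff() compares each event with the previous one in the same session.
clicks = clicks.sort_values(["session_id", "index"]).reset_index(drop=True)

# The first event of each session has no predecessor, so its gap is NaN.
clicks["time_since_prev"] = clicks.groupby("session_id")["elapsed_time"].diff()
print(clicks)
```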

### `event_name` == `"notebook_click"`

In [70]:
event_name = "notebook_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312433251036,60,60743,open,2.0,0.0,-1112.0,-518.5,30.0,639.0,,undefined,,tunic.historicalsociety.entry,,0,0,0,0-4
1,20090312433251036,61,61761,close,2.0,0.0,73.25,428.25,789.0,58.0,,undefined,,tunic.historicalsociety.entry,,0,0,0,0-4
2,20090312433251036,209,351064,open,6.0,1.0,-490.75,-429.75,61.0,629.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12
3,20090312433251036,210,354779,basic,6.0,1.0,-97.625,-304.25,343.0,539.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12
4,20090312433251036,211,357947,close,6.0,1.0,556.0,342.5,812.0,75.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12


#### `level_group` == `"0-4"`

In [71]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312433251036,60,60743,open,2.0,0.0,-1112.0,-518.5,30.0,639.0,,undefined,,tunic.historicalsociety.entry,,0,0,0
1,20090312433251036,61,61761,close,2.0,0.0,73.25,428.25,789.0,58.0,,undefined,,tunic.historicalsociety.entry,,0,0,0
2,20090313091715820,123,80710,open,3.0,0.0,-564.0,-431.75,33.0,731.0,,undefined,,tunic.historicalsociety.collection,,1,1,1
3,20090313091715820,124,82222,close,3.0,0.0,594.0,403.25,969.0,56.0,,undefined,,tunic.historicalsociety.collection,,1,1,1
4,20090313571836404,47,72818,open,2.0,0.0,-481.0,-486.25,38.0,616.0,,undefined,,tunic.historicalsociety.entry,,0,0,1


In [72]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (81733, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,13943,20090312433251036,20090312433251036,20090313091715820,2.009031243325104e+16,2.2100221145014656e+16
index,int16,0.0,1262,60,61,123,0.0,2858.0
elapsed_time,int32,0.0,72223,60743,61761,80710,9183.0,1747004800.0
name,category,0.0,5,open,close,open,,
level,float16,0.0,4,2.0,2.0,3.0,1.0,4.0
page,float16,0.0,2,0.0,0.0,0.0,0.0,1.0
room_coor_x,float16,0.0,8913,-1112.0,73.25,-564.0,-1182.0,1140.0
room_coor_y,float16,0.0,5529,-518.5,428.25,-431.75,-560.5,517.5
screen_coor_x,float16,0.0,1854,30.0,789.0,33.0,0.0,1919.0
screen_coor_y,float16,0.0,1451,639.0,58.0,731.0,0.0,1409.0


#### `level_group` == `"5-12"`

In [73]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312433251036,209,351064,open,6.0,1.0,-490.75,-429.75,61.0,629.0,,undefined,,tunic.historicalsociety.basement,,0,0,0
1,20090312433251036,210,354779,basic,6.0,1.0,-97.625,-304.25,343.0,539.0,,undefined,,tunic.historicalsociety.basement,,0,0,0
2,20090312433251036,211,357947,close,6.0,1.0,556.0,342.5,812.0,75.0,,undefined,,tunic.historicalsociety.basement,,0,0,0
3,20090312433251036,413,651200,open,11.0,3.0,-465.25,-493.75,48.0,628.0,,undefined,,tunic.historicalsociety.entry,,0,0,0
4,20090312433251036,414,654048,close,11.0,3.0,795.0,390.5,822.0,85.0,,undefined,,tunic.historicalsociety.entry,,0,0,0


In [74]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (182143, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,16228,20090312433251036,20090312433251036,20090312433251036,2.009031243325104e+16,2.2100221145014656e+16
index,int16,0.0,4516,209,210,211,0.0,17446.0
elapsed_time,int32,0.0,168120,351064,354779,357947,110180.0,1987139057.0
name,category,0.0,5,open,basic,close,,
level,float16,0.0,8,6.0,6.0,6.0,5.0,12.0
page,float16,0.0,4,1.0,1.0,1.0,0.0,3.0
room_coor_x,float16,0.0,10959,-490.75,-97.625,556.0,-1208.0,1171.0
room_coor_y,float16,0.0,6169,-429.75,-304.25,342.5,-583.0,534.5
screen_coor_x,float16,0.0,2312,61.0,343.0,812.0,0.0,1897.0
screen_coor_y,float16,0.0,1833,629.0,539.0,75.0,0.0,1418.0


#### `level_group` == `"13-22"`

In [75]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312433251036,638,1239444,open,15.0,4.0,-1537.0,-356.5,31.0,629.0,,undefined,,tunic.historicalsociety.cage,,0,0,0
1,20090312433251036,639,1244575,basic,15.0,4.0,-1378.0,-126.9375,232.0,427.0,,undefined,,tunic.historicalsociety.cage,,0,0,0
2,20090312433251036,640,1246154,close,15.0,4.0,-689.5,257.75,839.0,88.0,,undefined,,tunic.historicalsociety.cage,,0,0,0
3,20090312433251036,736,1417626,open,17.0,4.0,-180.75,-469.5,48.0,618.0,,undefined,,tunic.historicalsociety.entry,,0,0,0
4,20090312433251036,737,1445314,close,17.0,4.0,1063.0,417.75,812.0,73.0,,undefined,,tunic.historicalsociety.entry,,0,0,0


In [76]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (300668, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,18712,20090312433251036,20090312433251036,20090312433251036,2.009031243325104e+16,2.2100221145014656e+16
index,int16,0.0,3388,638,639,640,0.0,20384.0
elapsed_time,int32,0.0,285877,1239444,1244575,1246154,325220.0,1988605536.0
name,category,0.0,5,open,basic,close,,
level,float16,0.0,10,15.0,15.0,15.0,13.0,22.0
page,float16,0.0,7,4.0,4.0,4.0,0.0,6.0
room_coor_x,float16,0.0,13373,-1537.0,-1378.0,-689.5,-1991.0,1258.0
room_coor_y,float16,0.0,10571,-356.5,-126.9375,257.75,-915.5,535.0
screen_coor_x,float16,0.0,2794,31.0,232.0,839.0,0.0,1904.0
screen_coor_y,float16,0.0,2337,629.0,427.0,88.0,0.0,1419.0


#### Numeric and One-Hot Fields

In [77]:
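# notebook_click has `page` populated (perc_missing == 0.0 in the
# summaries above), so it is included here.
notebook_click__numeric_fields = [
    "index", "elapsed_time", "level", "page", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

# `name` has 5 distinct values for notebook_click, so it is listed
# for one-hot encoding.
notebook_click__onehot_fields = ["name"]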

#### Text Fields

In [78]:
notebook_click__text_fields = ["room_fqid"]
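
A minimal sketch of how a one-hot field such as `name` could be encoded and then counted per session, using `pandas.get_dummies`. The rows are made up, and the notebook may encode the field differently (for example with a scikit-learn encoder).

```python
import pandas as pd

# Made-up notebook_click rows; "open", "basic" and "close" appear in the samples above.
notebook = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "name": ["open", "basic", "close", "open"],
})

# One indicator column per distinct value of `name`.
one_hot = pd.get_dummies(notebook["name"], prefix="name", dtype="int8")

# Summing the indicators per session gives a count of each click type.
name_counts = one_hot.groupby(notebook["session_id"]).sum()
print(name_counts)
```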

### `event_name` == `"notification_click"`

In [79]:
event_name = "notification_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,26,24348,basic,0.0,,-472.25,-117.9375,554.0,394.0,,Found it!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.notebook,0,0,1,0-4
1,20090312431273200,29,32229,basic,1.0,,-182.5,-1.90625,767.0,305.0,,Gramps is in trouble for losing papers?,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
2,20090312431273200,30,33063,basic,1.0,,-182.5,-55.875,767.0,359.0,,This can't be right!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
3,20090312431273200,31,34245,basic,1.0,,-182.5,-55.875,767.0,359.0,,Gramps is a great historian!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
4,20090312431273200,85,89809,basic,2.0,,-86.875,-96.8125,355.0,397.0,,This looks like a clue!,,tunic.historicalsociety.collection,tunic.historicalsociety.collection.tunic.slip,0,0,1,0-4


#### `level_group` == `"0-4"`

In [80]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,26,24348,basic,0.0,,-472.25,-117.9375,554.0,394.0,,Found it!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.notebook,0,0,1
1,20090312431273200,29,32229,basic,1.0,,-182.5,-1.90625,767.0,305.0,,Gramps is in trouble for losing papers?,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1
2,20090312431273200,30,33063,basic,1.0,,-182.5,-55.875,767.0,359.0,,This can't be right!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1
3,20090312431273200,31,34245,basic,1.0,,-182.5,-55.875,767.0,359.0,,Gramps is a great historian!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1
4,20090312431273200,85,89809,basic,2.0,,-86.875,-96.8125,355.0,397.0,,This looks like a clue!,,tunic.historicalsociety.collection,tunic.historicalsociety.collection.tunic.slip,0,0,1


In [81]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (183243, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1864,26,29,30,0.0,5104.0
elapsed_time,int32,0.0,129241,24348,32229,33063,229.0,1986886387.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,4,0.0,1.0,1.0,0.0,3.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,9947,-472.25,-182.5,-182.5,-1022.5,873.0
room_coor_y,float16,0.0,9803,-117.9375,-1.90625,-55.875,-475.25,470.25
screen_coor_x,float16,0.0,1877,554.0,767.0,767.0,0.0,1764.0
screen_coor_y,float16,0.0,1731,394.0,305.0,359.0,0.0,1420.0


#### `level_group` == `"5-12"`

In [82]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,313,346295,basic,7.0,,133.25,-104.0,491.0,434.0,,This place was around in 1916! I can start there!,,tunic.humanecology.frontdesk,tunic.humanecology.frontdesk.businesscards.car...,0,0,1
1,20090312431273200,349,381368,basic,8.0,,256.25,12.0,687.0,318.0,,It's a match!,,tunic.drycleaner.frontdesk,tunic.drycleaner.frontdesk.logbook.page.bingo,0,0,1
2,20090312431273200,350,381935,basic,8.0,,256.25,12.0,687.0,318.0,,Theodora Youmans must be the owner!,,tunic.drycleaner.frontdesk,tunic.drycleaner.frontdesk.logbook.page.bingo,0,0,1
3,20090312431273200,383,412184,basic,9.0,,-33.15625,-122.0,414.0,452.0,,Youmans was a suffragist!,,tunic.library.microfiche,tunic.library.microfiche.reader.paper2.bingo,0,0,1
4,20090312431273200,384,412767,basic,9.0,,-33.15625,-122.0,414.0,452.0,,She helped get votes for women!,,tunic.library.microfiche,tunic.library.microfiche.reader.paper2.bingo,0,0,1


In [83]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (222801, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2173,313,349,350,0.0,17073.0
elapsed_time,int32,0.0,201230,346295,381368,381935,82.0,1987171306.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,6,7.0,8.0,8.0,7.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,11176,133.25,256.25,256.25,-590.5,575.5
room_coor_y,float16,0.0,5931,-104.0,12.0,12.0,-433.0,410.5
screen_coor_x,float16,0.0,2116,491.0,687.0,687.0,0.0,1731.0
screen_coor_y,float16,0.0,1683,434.0,318.0,318.0,0.0,1408.0


#### `level_group` == `"13-22"`

In [84]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,598,909809,basic,15.0,,-269.25,-360.5,398.0,551.0,,Those are the same glasses!,,tunic.historicalsociety.entry,tunic.historicalsociety.entry.directory.closeu...,0,0,1
1,20090312431273200,599,910274,basic,15.0,,-269.25,-360.5,398.0,551.0,,The archivist must've taken Teddy!,,tunic.historicalsociety.entry,tunic.historicalsociety.entry.directory.closeu...,0,0,1
2,20090312431273200,601,912192,basic,15.0,,478.0,426.0,857.0,68.0,,Those are the same glasses!,,tunic.historicalsociety.entry,tunic.historicalsociety.entry.directory.closeu...,0,0,1
3,20090312431273200,602,912976,basic,15.0,,481.25,460.25,859.0,47.0,,The archivist must've taken Teddy!,,tunic.historicalsociety.entry,tunic.historicalsociety.entry.directory.closeu...,0,0,1
4,20090312431273200,787,1104231,basic,18.0,,1011.5,-410.75,698.0,352.0,,That hoofprint doesn't match the flag!,,tunic.wildlife.center,tunic.wildlife.center.tracks.hub.deer,0,0,1


In [85]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (242957, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3319,598,599,601,0.0,20458.0
elapsed_time,int32,0.0,232948,909809,910274,912192,182.0,1988525224.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,10,15.0,15.0,15.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,10014,-269.25,-269.25,478.0,-914.0,1262.0
room_coor_y,float16,0.0,9569,-360.5,-360.5,426.0,-812.0,536.5
screen_coor_x,float16,0.0,2248,398.0,398.0,857.0,0.0,1906.0
screen_coor_y,float16,0.0,1764,551.0,551.0,68.0,0.0,1419.0


#### Numeric Fields

In [86]:
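# `page` is entirely missing for notification_click, and `name` is constant
# ("basic") in all three level groups, so neither is included.
notification_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]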

#### Text Fields

In [87]:
notification_click__text_fields = ["text", "room_fqid", "text_fqid"]
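
The sections that follow repeat one pattern per event type: filter `train_data` on `event_name`, then on each `level_group`, dropping each column once it is constant. A minimal sketch of that pattern as a reusable helper, assuming only pandas; `split_by_level_group`, `LEVEL_GROUPS`, and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd

# The three level groups named in the data description.
LEVEL_GROUPS = ["0-4", "5-12", "13-22"]


def split_by_level_group(events, event_name):
    """Return {level_group: DataFrame} for a single event type.

    Mirrors the per-cell filtering used in this notebook: the constant
    columns (event_name, level_group) are dropped after filtering.
    """
    subset = events[events["event_name"] == event_name].drop(columns=["event_name"])
    return {
        group: subset[subset["level_group"] == group]
        .drop(columns=["level_group"])
        .reset_index(drop=True)
        for group in LEVEL_GROUPS
    }


# Tiny made-up frame standing in for train_data.
toy = pd.DataFrame({
    "event_name": ["object_click", "object_click", "object_hover", "object_click"],
    "level": [1, 7, 2, 15],
    "level_group": ["0-4", "5-12", "0-4", "13-22"],
})

parts = split_by_level_group(toy, "object_click")
for group, frame in parts.items():
    print(group, frame.shape)
```

Each value in `parts` could then be passed to the notebook's own summary helpers in a loop instead of one cell per group.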

### `event_name` == `"object_click"`

In [88]:
event_name = "object_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,27,25766,close,0.0,,-206.5,199.125,822.0,76.0,,undefined,notebook,tunic.historicalsociety.closet,,0,0,1,0-4
1,20090312431273200,32,36433,close,1.0,,-113.5,241.125,836.0,62.0,,undefined,retirement_letter,tunic.historicalsociety.closet,,0,0,1,0-4
2,20090312431273200,50,57277,basic,1.0,,856.5,69.75,839.0,291.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4
3,20090312431273200,51,58244,close,1.0,,848.0,402.0,834.0,87.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4
4,20090312431273200,68,73927,close,2.0,,439.0,416.0,833.0,74.0,,undefined,directory,tunic.historicalsociety.entry,,0,0,1,0-4


#### `level_group` == `"0-4"`

In [89]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,27,25766,close,0.0,,-206.5,199.125,822.0,76.0,,undefined,notebook,tunic.historicalsociety.closet,,0,0,1
1,20090312431273200,32,36433,close,1.0,,-113.5,241.125,836.0,62.0,,undefined,retirement_letter,tunic.historicalsociety.closet,,0,0,1
2,20090312431273200,50,57277,basic,1.0,,856.5,69.75,839.0,291.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1
3,20090312431273200,51,58244,close,1.0,,848.0,402.0,834.0,87.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1
4,20090312431273200,68,73927,close,2.0,,439.0,416.0,833.0,74.0,,undefined,directory,tunic.historicalsociety.entry,,0,0,1


In [90]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (364862, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2214,27,32,50,0.0,5105.0
elapsed_time,int32,0.0,232496,25766,36433,57277,10.0,1986888039.0
name,category,0.0,2,close,close,basic,,
level,float16,0.0,5,0.0,1.0,1.0,0.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,11720,-206.5,-113.5,856.5,-1020.5,923.0
room_coor_y,float16,0.0,10404,199.125,241.125,69.75,-523.5,543.5
screen_coor_x,float16,0.0,2914,822.0,836.0,839.0,0.0,1901.0
screen_coor_y,float16,0.0,2640,76.0,62.0,291.0,0.0,1372.0


#### `level_group` == `"5-12"`

In [91]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,305,339994,basic,7.0,,396.5,-11.0,755.0,341.0,,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1
1,20090312431273200,307,342227,basic,7.0,,400.25,0.0,758.0,330.0,,undefined,businesscards.card_1.next,tunic.humanecology.frontdesk,,0,0,1
2,20090312431273200,309,343111,basic,7.0,,225.25,-19.0,583.0,349.0,,undefined,businesscards,tunic.humanecology.frontdesk,,0,0,1
3,20090312431273200,312,345413,basic,7.0,,135.25,-105.0,493.0,435.0,,undefined,businesscards.card_bingo.bingo,tunic.humanecology.frontdesk,,0,0,1
4,20090312431273200,314,346661,basic,7.0,,133.25,-104.0,491.0,434.0,,undefined,businesscards,tunic.humanecology.frontdesk,,0,0,1


In [92]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (1120858, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,5604,305,307,309,0.0,17245.0
elapsed_time,int32,0.0,752387,339994,342227,343111,0.0,1987171611.0
name,category,0.0,2,basic,basic,basic,,
level,float16,0.0,8,7.0,7.0,7.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,14151,396.5,400.25,225.25,-906.0,859.5
room_coor_y,float16,0.0,12965,-11.0,0.0,-19.0,-520.0,535.0
screen_coor_x,float16,0.0,3624,755.0,758.0,583.0,0.0,1906.0
screen_coor_y,float16,0.0,3186,341.0,330.0,349.0,0.0,1417.0


#### `level_group` == `"13-22"`

In [93]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,597,909078,basic,15.0,,-269.25,-360.5,398.0,551.0,,undefined,directory.closeup.archivist,tunic.historicalsociety.entry,,0,0,1
1,20090312431273200,600,910909,basic,15.0,,-171.625,-326.25,458.0,530.0,,undefined,directory.closeup.archivist,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,604,914363,close,15.0,,414.5,473.25,818.0,39.0,,undefined,directory,tunic.historicalsociety.entry,,0,0,1
3,20090312431273200,786,1103032,basic,18.0,,1019.5,-408.0,704.0,350.0,,undefined,tracks.hub.deer,tunic.wildlife.center,,0,0,1
4,20090312431273200,789,1104813,basic,18.0,,1010.0,-410.75,697.0,352.0,,undefined,tracks,tunic.wildlife.center,,0,0,1


In [94]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (712491, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,4162,597,600,604,0.0,20462.0
elapsed_time,int32,0.0,634652,909078,910909,914363,0.0,1988526677.0
name,category,0.0,2,basic,basic,close,,
level,float16,0.0,10,15.0,15.0,15.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,13714,-269.25,-171.625,414.5,-954.0,1249.0
room_coor_y,float16,0.0,14879,-360.5,-326.25,473.25,-811.0,536.5
screen_coor_x,float16,0.0,3555,398.0,458.0,818.0,0.0,1919.0
screen_coor_y,float16,0.0,3160,551.0,530.0,39.0,0.0,1374.0


#### Numeric Fields

In [95]:
object_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

object_click__onehot_fields = ["name"]
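
The `onehot` list presumably marks `name` for one-hot encoding later on, since it takes only two values for this event type. A minimal sketch of that step with `pandas.get_dummies` on a made-up frame; how the notebook actually encodes the field is not shown in this section, and `toy_clicks` and `onehot_fields` are stand-in names:

```python
import pandas as pd

# Made-up rows using the two `name` values seen for object_click.
toy_clicks = pd.DataFrame({
    "name": ["close", "basic", "close"],
    "level": [0, 1, 2],
})

onehot_fields = ["name"]  # stand-in for object_click__onehot_fields

# Replace each listed column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy_clicks, columns=onehot_fields, dtype=int)
print(encoded)
```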

#### Text Fields

In [96]:
object_click__text_fields = ["fqid", "room_fqid"]

### `event_name` == `"object_hover"`

In [97]:
event_name = "object_hover"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,49,52328,basic,1.0,,,,,,7899.0,undefined,groupconvo,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,82,87242,basic,2.0,,,,,,400.0,undefined,tunic,tunic.historicalsociety.collection,,0,0,1,0-4
2,20090312431273200,87,92242,undefined,2.0,,,,,,3949.0,undefined,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,1,0-4
3,20090312431273200,148,153655,undefined,3.0,,,,,,6350.0,undefined,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1,0-4
4,20090312431273200,303,338929,undefined,7.0,,,,,,68.0,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1,5-12


#### `level_group` == `"0-4"`

In [98]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,49,52328,basic,1.0,,,,,,7899.0,undefined,groupconvo,tunic.historicalsociety.entry,,0,0,1
1,20090312431273200,82,87242,basic,2.0,,,,,,400.0,undefined,tunic,tunic.historicalsociety.collection,,0,0,1
2,20090312431273200,87,92242,undefined,2.0,,,,,,3949.0,undefined,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,1
3,20090312431273200,148,153655,undefined,3.0,,,,,,6350.0,undefined,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1
4,20090312433251036,73,106194,undefined,2.0,,,,,,701.0,undefined,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,0


In [99]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (107127, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21689,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1464,49,82,87,0.0,3680.0
elapsed_time,int32,0.0,91471,52328,87242,92242,543.0,1986887172.0
name,category,0.0,2,basic,basic,undefined,,
level,float16,0.0,5,1.0,2.0,2.0,0.0,4.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"5-12"`

In [100]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,303,338929,undefined,7.0,,,,,,68.0,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1
1,20090312431273200,304,339045,undefined,7.0,,,,,,50.0,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1
2,20090312431273200,306,341328,undefined,7.0,,,,,,1950.0,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1
3,20090312431273200,308,342946,undefined,7.0,,,,,,867.0,undefined,businesscards.card_1.next,tunic.humanecology.frontdesk,,0,0,1
4,20090312431273200,310,343961,undefined,7.0,,,,,,17.0,undefined,businesscards.card_bingo.next,tunic.humanecology.frontdesk,,0,0,1


In [101]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (524264, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21690,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2472,303,304,306,0.0,17074.0
elapsed_time,int32,0.0,423095,338929,339045,341328,29.0,1987170838.0
name,category,0.0,2,undefined,undefined,undefined,,
level,float16,0.0,8,7.0,7.0,7.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### `level_group` == `"13-22"`

In [102]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,595,904875,basic,15.0,,,,,,917.0,undefined,directory,tunic.historicalsociety.entry,,0,0,1
1,20090312431273200,596,907873,undefined,15.0,,,,,,716.0,undefined,directory.closeup.archivist,tunic.historicalsociety.entry,,0,0,1
2,20090312431273200,603,913192,undefined,15.0,,,,,,4750.0,undefined,directory.closeup.archivist,tunic.historicalsociety.entry,,0,0,1
3,20090312431273200,784,1099731,basic,18.0,,,,,,784.0,undefined,tracks,tunic.wildlife.center,,0,0,1
4,20090312431273200,785,1099799,undefined,18.0,,,,,,17.0,undefined,tracks.hub.deer,tunic.wildlife.center,,0,0,1


In [103]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (425694, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21689,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3483,595,596,603,0.0,20461.0
elapsed_time,int32,0.0,396526,904875,907873,913192,104.0,1988525877.0
name,category,0.0,2,basic,undefined,undefined,,
level,float16,0.0,10,15.0,15.0,15.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### Numeric Fields

In [104]:
object_hover__numeric_fields = [
    "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music"
]

object_hover__onehot_fields = ["name"]

#### Text Fields

In [105]:
object_hover__text_fields = ["fqid", "room_fqid"]
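
The hover summaries above report `perc_missing` of 1.0 for all four coordinate columns, which is why they are absent from the numeric list. A small sketch of deriving that list programmatically rather than by eye; `usable_numeric_fields` and the toy frame are hypothetical:

```python
import numpy as np
import pandas as pd


def usable_numeric_fields(df, exclude=("session_id",)):
    """List numeric columns that hold at least one non-missing value."""
    numeric = df.select_dtypes(include="number").columns
    return [c for c in numeric if c not in exclude and df[c].notna().any()]


# Made-up hover rows: coordinates are never recorded, hover_duration is.
toy_hover = pd.DataFrame({
    "session_id": [1, 1, 2],
    "index": [49, 82, 73],
    "elapsed_time": [52328, 87242, 106194],
    "room_coor_x": [np.nan, np.nan, np.nan],
    "hover_duration": [7899.0, 400.0, 701.0],
    "fqid": ["groupconvo", "tunic", "tunic.hub.slip"],
})

hover_numeric = usable_numeric_fields(toy_hover)
print(hover_numeric)
```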

### `event_name` == `"observation_click"`

In [106]:
event_name = "observation_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,13,13030,basic,0.0,,487.0,-98.5625,614.0,386.0,,I love these photos of me and Teddy!,photo,tunic.historicalsociety.closet,tunic.historicalsociety.closet.photo,0,0,1,0-4
1,20090312431273200,37,41297,basic,1.0,,-400.25,-117.5,179.0,405.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1,0-4
2,20090312431273200,108,109825,basic,3.0,,14.359375,-156.25,444.0,485.0,,Better check back later.,outtolunch,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.outtolunch,0,0,1,0-4
3,20090312431273200,112,117142,basic,3.0,,-7.492188,-61.71875,480.0,365.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1,0-4
4,20090312431273200,256,300382,basic,6.0,,75.625,-32.0,419.0,362.0,,I bet the archivist could use this!,magnify,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.magnify,0,0,1,5-12


#### `level_group` == `"0-4"`

In [107]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,13,13030,basic,0.0,,487.0,-98.5625,614.0,386.0,,I love these photos of me and Teddy!,photo,tunic.historicalsociety.closet,tunic.historicalsociety.closet.photo,0,0,1
1,20090312431273200,37,41297,basic,1.0,,-400.25,-117.5,179.0,405.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1
2,20090312431273200,108,109825,basic,3.0,,14.359375,-156.25,444.0,485.0,,Better check back later.,outtolunch,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.outtolunch,0,0,1
3,20090312431273200,112,117142,basic,3.0,,-7.492188,-61.71875,480.0,365.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1
4,20090312433251036,29,36447,basic,1.0,,-990.0,2.251953,134.0,327.0,,I should see what Grampa is up to!,block_tocollection,tunic.historicalsociety.entry,tunic.historicalsociety.entry.block_tocollection,0,0,0


In [108]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (40850, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,15847,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,996,13,37,108,1.0,3598.0
elapsed_time,int32,0.0,36606,13030,41297,109825,1799.0,1746734366.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,4,0.0,1.0,3.0,0.0,3.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,11288,487.0,-400.25,14.359375,-1199.0,1173.0
room_coor_y,float16,0.0,8410,-98.5625,-117.5,-156.25,-502.25,450.5
screen_coor_x,float16,0.0,1181,614.0,179.0,444.0,0.0,1726.0
screen_coor_y,float16,0.0,885,386.0,405.0,485.0,6.0,1084.0


#### `level_group` == `"5-12"`

In [109]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,256,300382,basic,6.0,,75.625,-32.0,419.0,362.0,,I bet the archivist could use this!,magnify,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.magnify,0,0,1
1,20090312433251036,157,281616,basic,5.0,,86.75,-151.625,517.0,483.0,,Better check back later.,outtolunch,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.outtolunch,0,0,0
2,20090312433251036,165,288949,basic,5.0,,133.875,-67.3125,590.0,369.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,0
3,20090312433251036,222,378996,basic,6.0,,132.625,67.0,480.0,263.0,,I bet the archivist could use this!,magnify,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.magnify,0,0,0
4,20090312455206810,217,445207,basic,6.0,,-78.125,-7.332031,369.0,460.0,,I bet the archivist could use this!,magnify,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.magnify,1,1,1


In [110]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (64988, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312433251036,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,1528,256,157,165,1.0,16167.0
elapsed_time,int32,0.0,62388,300382,281616,288949,88.0,1987020699.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,8,6.0,5.0,5.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,12193,75.625,86.75,133.875,-972.5,736.0
room_coor_y,float16,0.0,7060,-32.0,-151.625,-67.3125,-467.25,301.0
screen_coor_x,float16,0.0,1422,419.0,517.0,590.0,0.0,1695.0
screen_coor_y,float16,0.0,1085,362.0,483.0,369.0,23.0,1250.0


#### `level_group` == `"13-22"`

In [111]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,608,920474,basic,15.0,,-49.4375,-23.0,374.0,353.0,,Yes! It's the key for Teddy's cage!,key,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.key,0,0,1
1,20090312431273200,725,1050634,basic,18.0,,-583.5,-775.5,215.0,577.0,,People sure drink a lot of coffee around here.,coffee,tunic.wildlife.center,tunic.wildlife.center.coffee,0,0,1
2,20090312431273200,773,1091181,basic,18.0,,439.0,-423.5,547.0,349.0,,"It's OK, girl! Look, I found you a cricket!",remove_cup,tunic.wildlife.center,tunic.wildlife.center.remove_cup,0,0,1
3,20090312433251036,620,1217273,basic,14.0,,86.0,-107.8125,735.0,416.0,,It's locked!,lockeddoor,tunic.historicalsociety.cage,tunic.historicalsociety.cage.lockeddoor,0,0,0
4,20090312433251036,667,1275220,basic,15.0,,-0.777344,-182.0,423.0,512.0,,Yes! It's the key for Teddy's cage!,key,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.key,0,0,0


In [150]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary



#### Numeric Fields

In [151]:
observation_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

#### Text Fields

In [152]:
observation_click__text_fields = ["text", "fqid", "room_fqid", "text_fqid"]

### `event_name` == `"person_click"`

In [144]:
event_name = "person_click"
train_data__en_filtered = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

train_data__en_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,1,1323,basic,0.0,,-414.0,-159.375,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
1,20090312431273200,2,831,basic,0.0,,-414.0,-159.375,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,3,1147,basic,0.0,,-414.0,-159.375,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,4,1863,basic,0.0,,-413.0,-159.375,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,5,3423,basic,0.0,,-413.0,-157.375,381.0,492.0,,"Sure thing, Jo. Grab your notebook and come up...",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


#### `level_group` == `"0-4"`

In [145]:
level_group = "0-4"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,1,1323,basic,0.0,,-414.0,-159.375,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1
1,20090312431273200,2,831,basic,0.0,,-414.0,-159.375,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1
2,20090312431273200,3,1147,basic,0.0,,-414.0,-159.375,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1
3,20090312431273200,4,1863,basic,0.0,,-413.0,-159.375,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1
4,20090312431273200,5,3423,basic,0.0,,-413.0,-157.375,381.0,492.0,,"Sure thing, Jo. Grab your notebook and come up...",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1


In [146]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (484889, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2423,1,2,3,0.0,5064.0
elapsed_time,int32,0.0,195832,1323,831,1147,0.0,1986867521.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,4,0.0,0.0,0.0,0.0,3.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,15315,-414.0,-414.0,-414.0,-794.0,861.5
room_coor_y,float16,0.0,13607,-159.375,-159.375,-159.375,-523.5,527.0
screen_coor_x,float16,0.0,2616,380.0,380.0,380.0,0.0,1852.0
screen_coor_y,float16,0.0,1928,494.0,494.0,494.0,1.0,1177.0


#### `level_group` == `"5-12"`

In [147]:
level_group = "5-12"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,176,222334,basic,5.0,,273.0,-19.234375,649.0,321.0,,"What are you still doing here, Jolie?",boss,tunic.capitol_0.hall,tunic.capitol_0.hall.boss.talktogramps,0,0,1
1,20090312431273200,177,223251,basic,5.0,,240.875,-228.25,628.0,459.0,,Go find your grampa and get to work!,boss,tunic.capitol_0.hall,tunic.capitol_0.hall.boss.talktogramps,0,0,1
2,20090312431273200,194,239167,basic,6.0,,-615.5,28.296875,312.0,269.0,,Can you help me tidy up?,gramps,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.gramps.he...,0,0,1
3,20090312431273200,222,262132,basic,6.0,,-716.5,-212.125,192.0,509.0,,Who could've done this?,gramps,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.gramps.news,0,0,1
4,20090312431273200,223,262516,basic,6.0,,-725.0,-205.125,196.0,502.0,,It must've been Wells.,gramps,tunic.historicalsociety.closet_dirty,tunic.historicalsociety.closet_dirty.gramps.news,0,0,1


In [148]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (2692117, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3597,176,177,194,0.0,15768.0
elapsed_time,int32,0.0,1055703,222334,223251,239167,0.0,1987158004.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,8,5.0,5.0,6.0,5.0,12.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,21206,273.0,240.875,-615.5,-923.5,651.0
room_coor_y,float16,0.0,16541,-19.234375,-228.25,28.296875,-528.5,456.75
screen_coor_x,float16,0.0,4009,649.0,628.0,312.0,0.0,1853.0
screen_coor_y,float16,0.0,3092,321.0,459.0,269.0,0.0,1270.0


#### `level_group` == `"13-22"`

In [149]:
level_group = "13-22"
train_data__en_lg_filtered = train_data__en_filtered[train_data__en_filtered["level_group"] == level_group].drop(labels=["level_group"], axis=1).reset_index(drop=True)

train_data__en_lg_filtered.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music
0,20090312431273200,538,858662,basic,13.0,,43.90625,-137.875,406.0,415.0,,We'll find Teddy.,gramps,tunic.historicalsociety.basement,tunic.historicalsociety.basement.gramps.whatdo,0,0,1
1,20090312431273200,539,860233,basic,13.0,,-188.125,-56.375,236.0,356.0,,We just have to keep our eyes open!,gramps,tunic.historicalsociety.basement,tunic.historicalsociety.basement.gramps.whatdo,0,0,1
2,20090312431273200,560,877826,basic,14.0,,-642.0,-235.25,621.0,522.0,,I wonder whose glasses these are.,glasses,tunic.historicalsociety.cage,tunic.historicalsociety.cage.glasses.beforeteddy,0,0,1
3,20090312431273200,565,881960,basic,14.0,,87.1875,-112.125,722.0,418.0,,Teddy!!!,teddy,tunic.historicalsociety.cage,tunic.historicalsociety.cage.teddy.trapped,0,0,1
4,20090312431273200,566,882426,basic,14.0,,103.375,-111.625,722.0,418.0,,Hang on. I'll get you out of there!,teddy,tunic.historicalsociety.cage,tunic.historicalsociety.cage.teddy.trapped,0,0,1


In [150]:
train_data__en_lg_filtered = recategorize_category_typed_fields(train_data__en_lg_filtered)

summary = summarize_data_info(train_data__en_lg_filtered)
summary

Data Shape: (2875847, 18)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,4351,538,539,560,0.0,20388.0
elapsed_time,int32,0.0,1829252,858662,860233,877826,100.0,1988597956.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,10,13.0,13.0,14.0,13.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,21764,43.90625,-188.125,-642.0,-1335.0,1059.0
room_coor_y,float16,0.0,19629,-137.875,-56.375,-235.25,-839.5,527.5
screen_coor_x,float16,0.0,4051,406.0,236.0,621.0,0.0,1875.0
screen_coor_y,float16,0.0,3107,415.0,356.0,522.0,0.0,1388.0


#### Numeric Fields

In [151]:
person_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

#### Text Fields

In [152]:
person_click__text_fields = ["text", "fqid", "room_fqid", "text_fqid"]
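
With one set of lists per event type, flat variable names are easy to mix up when cells are copied between sections. One option, sketched here as an assumption rather than the notebook's approach, is a single dictionary keyed by `event_name`. `FIELDS_BY_EVENT` and `fields_for` are hypothetical names; the lists are the ones defined in the sections above, assuming the lists in the `observation_click` and `person_click` sections are meant for those events:

```python
# Numeric columns shared by all click-type events in the sections above.
CLICK_NUMERIC = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music",
]
# Hover events carry hover_duration instead of coordinates.
HOVER_NUMERIC = [
    "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music",
]

FIELDS_BY_EVENT = {
    "notification_click": {
        "numeric": CLICK_NUMERIC,
        "text": ["text", "room_fqid", "text_fqid"],
    },
    "object_click": {
        "numeric": CLICK_NUMERIC,
        "onehot": ["name"],
        "text": ["fqid", "room_fqid"],
    },
    "object_hover": {
        "numeric": HOVER_NUMERIC,
        "onehot": ["name"],
        "text": ["fqid", "room_fqid"],
    },
    "observation_click": {
        "numeric": CLICK_NUMERIC,
        "text": ["text", "fqid", "room_fqid", "text_fqid"],
    },
    "person_click": {
        "numeric": CLICK_NUMERIC,
        "text": ["text", "fqid", "room_fqid", "text_fqid"],
    },
}


def fields_for(event_name, kind):
    """Return a copy of the field list; empty if the kind is not defined."""
    return list(FIELDS_BY_EVENT[event_name].get(kind, []))


print(fields_for("object_hover", "numeric"))
print(fields_for("person_click", "onehot"))
```

Looking fields up by key means a later section cannot silently overwrite an earlier event's lists.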