# Introduction

Each row of the training data is one logged game event, described by the following columns:

- **session_id** - the ID of the session the event took place in
- **index** - the index of the event for the session
- **elapsed_time** - how much time has passed (in milliseconds) between the start of the session and when the event was recorded
- **event_name** - the name of the event type
- **name** - the event name (e.g. identifies whether a notebook_click is opening or closing the notebook)
- **level** - what level of the game the event occurred in (0 to 22)
- **page** - the page number of the event (only for notebook-related events)
- **room_coor_x** - the x-coordinate of the click in reference to the in-game room (only for click events)
- **room_coor_y** - the y-coordinate of the click in reference to the in-game room (only for click events)
- **screen_coor_x** - the x-coordinate of the click in reference to the player’s screen (only for click events)
- **screen_coor_y** - the y-coordinate of the click in reference to the player’s screen (only for click events)
- **hover_duration** - how long (in milliseconds) the hover happened for (only for hover events)
- **text** - the text the player sees during this event
- **fqid** - the fully qualified ID of the event
- **room_fqid** - the fully qualified ID of the room the event took place in
- **text_fqid** - the fully qualified ID of the
- **fullscreen** - whether the player is in fullscreen mode
- **hq** - whether the game is in high-quality
- **music** - whether the game music is on or off
- **level_group** - which group of levels - and group of questions - this row belongs to (0-4, 5-12, 13-22)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Train

## Data

In [3]:
# Reference: https://www.kaggle.com/competitions/predict-student-performance-from-game-play/discussion/384359
dtypes={
    'elapsed_time': np.int32,
    'event_name': 'category',
    'name': 'category',
    'level': np.uint8,
    'room_coor_x': np.float32,
    'room_coor_y': np.float32,
    'screen_coor_x': np.float32,
    'screen_coor_y': np.float32,
    'hover_duration': np.float32,
    'text': 'category',
    'fqid': 'category',
    'room_fqid': 'category',
    'text_fqid': 'category',
    'fullscreen': 'category',
    'hq': 'category',
    'music': 'category',
    'level_group': 'category'
}

train_data = pd.read_csv('../data/train.csv', dtype=dtypes)

In [4]:
# Reference: https://www.kaggle.com/code/kimtaehun/lightgbm-baseline-with-aggregated-log-data?scriptVersionId=118573291&cellId=15
def summarize_data_info(df: pd.DataFrame) -> pd.DataFrame:
    summary = pd.DataFrame(df.dtypes, columns=['data_type'])
    
    # Fraction (0 to 1) of missing values per column; despite the name, it is not scaled to 100
    summary['perc_missing'] = df.isnull().sum().values / len(df)
    summary['n_unique'] = df.nunique().values
    
    summary['first_value'] = df.loc[0].values
    summary['second_value'] = df.loc[1].values
    summary['third_value'] = df.loc[2].values
    
    df_describe = pd.DataFrame(df.describe(include='all').transpose())
    summary['min'] = df_describe['min'].values
    summary['max'] = df_describe['max'].values
    
    print(f'Data Shape: {df.shape}')
    
    return summary

In [5]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int64,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,uint8,0.0,23,0,0,0,0.0,22.0
page,float64,0.978532,7,,,,0.0,6.0
room_coor_x,float32,0.078841,12538215,-413.991394,-413.991394,-413.991394,-1992.354614,1261.773804
room_coor_y,float32,0.078841,9551136,-159.314682,-159.314682,-159.314682,-918.162354,543.616394
screen_coor_x,float32,0.078841,57477,380.0,380.0,380.0,0.0,1919.0


In [6]:
# Reduce Memory Usage
# reference : https://www.kaggle.com/code/arjanso/reducing-dataframe-memory-size-by-65 @ARJANGROEN

def reduce_memory_usage(df):
    
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype.name
        if col_type != 'datetime64[ns]' and col_type != 'category':
            if (col_type != 'object'):
                c_min = df[col].min()
                c_max = df[col].max()

                if str(col_type)[:3] == 'int':
                    if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)

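                else:
                    # NOTE: unsigned ints (e.g. 'uint8') also land here, because the
                    # prefix test above only matches 'int*'. float16 also rounds
                    # coordinates, which shows up as a drop in n_unique afterwards.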
                    if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                        df[col] = df[col].astype(np.float16)
                    elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                        df[col] = df[col].astype(np.float32)
                    else:
                        pass
            else:
                df[col] = df[col].astype('category')
    mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage became: ",mem_usg," MB")
    
    return df

In [7]:
train_data = reduce_memory_usage(train_data)

Memory usage of dataframe is 1529.83 MB
Memory usage became:  1053.3384094238281  MB


In [8]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0
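
Two changes in the summary above are side effects of `reduce_memory_usage` rather than properties of the data: `level` moves from `uint8` to `float16`, and `room_coor_x` drops from 12,538,215 to 29,854 unique values. The sketch below reproduces both effects in plain NumPy, using values copied from the summary tables. It is an illustration only and not part of the pipeline.

```python
import numpy as np

# 1) The 'int' prefix test does not match unsigned dtypes, so a uint8 column
#    falls through to the float branch of reduce_memory_usage.
uint8_name = np.dtype(np.uint8).name          # 'uint8'
matches_int_branch = uint8_name[:3] == "int"  # False

# 2) float16 keeps roughly three significant decimal digits, so nearby
#    coordinates collapse onto the same value.
raw = np.array([-413.991394, -159.314682], dtype=np.float32)
as_half = raw.astype(np.float16)

print(matches_int_branch)
print(as_half.tolist())
```

If exact coordinates matter for later features, keeping the float columns at `float32` would be a safer choice, at the cost of a smaller memory saving.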


### `Text` Field Preprocessing

In [9]:
from typing import Dict

def preprocess_text_str(text_str: str) -> str:
    s = str(text_str).replace("\\", "")
    text_str_ = "undefined" if s.startswith("u0") or (s in ["undefined", "nan"]) else s
    
    text_str__clean = text_str_.split("u0")[0] if "u0" in text_str_ else text_str_ 
    
    return text_str__clean

def create_text_field__clean_dict(data: pd.DataFrame, text_field: str) -> Dict:
    text_values = list(data[text_field].unique())
    text_values_ = [preprocess_text_str(s) for s in text_values]
    
    text_field__clean_dict = dict(zip(text_values, text_values_))
    
    return text_field__clean_dict

def map_text_field(data: pd.DataFrame, text_field: str, text_field__clean_dict: Dict) -> pd.DataFrame:
    data[text_field] = data[text_field].map(text_field__clean_dict).fillna("undefined")
    
    return data

def recategorize_category_typed_fields(data: pd.DataFrame) -> pd.DataFrame:
    for field_name, dtype in data.dtypes.items():
        if dtype == "category":
            data[field_name] = data[field_name].astype(str).astype("category")
            
    return data

In [10]:
text_values = list(train_data["text"].unique())
text_values[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 '\\u00f0\\u0178\\u02dc\\u00b4',
 nan,
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]

In [11]:
text_values_ = [preprocess_text_str(s) for s in text_values]
text_values_[:20]

['undefined',
 'Whatcha doing over there, Jo?',
 'Just talking to Teddy.',
 'I gotta run to my meeting!',
 'Can I come, Gramps?',
 'Sure thing, Jo. Grab your notebook and come upstairs!',
 'See you later, Teddy.',
 "I get to go to Gramps's meeting!",
 'Now where did I put my notebook?',
 'undefined',
 'undefined',
 'I love these photos of me and Teddy!',
 'Found it!',
 'Gramps is in trouble for losing papers?',
 "This can't be right!",
 'Gramps is a great historian!',
 "Hmm. Button's still not working.",
 "Let's get started. The Wisconsin Wonders exhibit opens tomorrow!",
 'Who wants to investigate the shirt artifact?',
 "Not Leopold here. He's been losing papers lately."]
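
The cleaning rule can be checked in isolation. The sketch below restates the logic of `preprocess_text_str` in a self-contained form and applies it to a few strings. The last sample (`"Hey! \\u00f0"`) is a hypothetical input made up to show the truncation case; it is not taken from the data.

```python
def clean_text(text_str) -> str:
    # Same rule as preprocess_text_str: drop backslashes, then map
    # escape-only strings and missing values to "undefined".
    s = str(text_str).replace("\\", "")
    if s.startswith("u0") or s in ("undefined", "nan"):
        return "undefined"
    # Cut everything from the first escaped code point onwards.
    return s.split("u0")[0]

samples = [
    "Found it!",                     # ordinary text is returned unchanged
    "\\u00f0\\u0178\\u02dc\\u00b4",  # escape-only string from the data
    float("nan"),                    # missing value
    "Hey! \\u00f0",                  # hypothetical: text followed by an escape
]
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Note that the truncation case keeps the trailing space before the escape sequence, and that any genuine text containing the substring `u0` would also be cut. Neither seems to matter for the values listed above, but it is worth keeping in mind.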

In [12]:
text_field__clean_dict = dict(zip(text_values, text_values_))
train_data["text"] = train_data["text"].map(text_field__clean_dict).fillna("undefined").astype('category')

In [13]:
summary = summarize_data_info(train_data)
summary

Data Shape: (26296946, 20)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,20348,0,1,2,0.0,20473.0
elapsed_time,int32,0.0,5042639,0,1323,831,0.0,1988606704.0
event_name,category,0.0,11,cutscene_click,person_click,person_click,,
name,category,0.0,6,basic,basic,basic,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,0.978532,7,,,,0.0,6.0
room_coor_x,float16,0.078841,29854,-414.0,-414.0,-414.0,-1992.0,1262.0
room_coor_y,float16,0.078841,27847,-159.375,-159.375,-159.375,-918.0,543.5
screen_coor_x,float16,0.078841,6866,380.0,380.0,380.0,0.0,1919.0


### Train Labels

In [14]:
train_labels = pd.read_csv("../data/train_labels.csv")

In [15]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 2)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,object,0.0,424116,20090312431273200_q1,20090312433251036_q1,20090312455206810_q1,,
correct,int64,0.0,2,1,0,1,0.0,1.0


In [16]:
train_labels['question_no'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[-1][1:]))
train_labels['session_id'] = train_labels['session_id'].apply(lambda x: int(x.split('_')[0]) )

train_labels["session_id"].nunique()

23562

In [17]:
summary = summarize_data_info(train_labels)
summary

Data Shape: (424116, 3)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312433251036,20090312455206810,2.009031e+16,2.210022e+16
correct,int64,0.0,2,1,0,1,0.0,1.0
question_no,int64,0.0,18,1,1,1,1.0,18.0
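
The two `apply` calls above run a Python lambda per row. A vectorized alternative, sketched here on a three-row toy frame, is to extract both parts with a single regular expression. The first two IDs come from the summary above; the third row (`_q18`) and its `correct` value are made up for illustration.

```python
import pandas as pd

labels = pd.DataFrame({
    "session_id": [
        "20090312431273200_q1",
        "20090312433251036_q1",
        "20090312431273200_q18",  # hypothetical row
    ],
    "correct": [1, 0, 1],
})

# Group 0 is the numeric session id, group 1 the question number after "_q".
parts = labels["session_id"].str.extract(r"^(\d+)_q(\d+)$")
labels["question_no"] = parts[1].astype("int64")
labels["session_id"] = parts[0].astype("int64")

print(labels)
```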


#### Validity check >>> Train Labels

In [18]:
train_labels.groupby("session_id")["question_no"].nunique().value_counts()

18    23562
Name: question_no, dtype: int64

In [19]:
question_no__list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
len(train_labels[~train_labels["question_no"].isin(question_no__list)])

0

In [20]:
len(train_labels) == (23562 * 18)

True

#### Validity check >>> Session ids in datasets

In [21]:
train_data__session_id_unique_vals = train_data["session_id"].drop_duplicates().sort_values().reset_index(drop=True)
train_labels__session_id_unique_vals = train_labels["session_id"].drop_duplicates().sort_values().reset_index(drop=True)

pd.testing.assert_series_equal(train_data__session_id_unique_vals, train_labels__session_id_unique_vals)

### Downsampling

In [22]:
session_ids = sorted(train_labels["session_id"].unique())

np.random.seed(42)
np.random.shuffle(session_ids)

session_ids[:5]

[22010107585684490,
 20100413373831344,
 21000409261644490,
 20110314164224844,
 21080621495509370]

In [23]:
N_CHUNKS = 10

np.random.seed(42)
chunk_ids = np.random.randint(N_CHUNKS, size=len(session_ids))

session_chunk_df = pd.DataFrame({"session_id": session_ids, "chunk_id": chunk_ids})
session_chunk_df["chunk_id"].value_counts()

0    2418
9    2407
5    2395
6    2360
1    2358
2    2350
3    2347
7    2320
8    2311
4    2296
Name: chunk_id, dtype: int64

In [24]:
session_chunk_df["chunk_id"].nunique()

10
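
The chunk assignment is not used further in this section. One way it could be used to pull a subsample of events is sketched below on toy data. The helper `select_chunk` and the toy frames are assumptions for illustration, not code from the notebook.

```python
import pandas as pd

# Toy stand-ins for train_data and session_chunk_df.
events = pd.DataFrame({
    "session_id": [1, 1, 2, 3, 3, 3, 4],
    "index":      [0, 1, 0, 0, 1, 2, 0],
})
session_chunks = pd.DataFrame({
    "session_id": [1, 2, 3, 4],
    "chunk_id":   [0, 1, 0, 1],
})

def select_chunk(events: pd.DataFrame, session_chunks: pd.DataFrame, chunk_id: int) -> pd.DataFrame:
    """Return all events whose session was assigned to the given chunk."""
    keep = session_chunks.loc[session_chunks["chunk_id"] == chunk_id, "session_id"]
    return events[events["session_id"].isin(keep)].reset_index(drop=True)

chunk_0 = select_chunk(events, session_chunks, chunk_id=0)
print(chunk_0)
```

Because whole sessions are assigned to a chunk, every event of a session stays together, which keeps per-session aggregates valid on the subsample.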

# Features Per Event

## Event Categories

In [25]:
for i in train_data["event_name"].unique().categories:
    print(i)

checkpoint
cutscene_click
map_click
map_hover
navigate_click
notebook_click
notification_click
object_click
object_hover
observation_click
person_click


In [26]:
from typing import List

def convert_to_numeric_type(data: pd.DataFrame, feature_fields_list: List[str]) -> pd.DataFrame:
    for col_name, dtype in data[feature_fields_list].dtypes.items():
        if str(dtype).startswith("int"):
            data[col_name] = data[col_name].astype("int64")
        elif str(dtype).startswith("float"):
            data[col_name] = data[col_name].astype("float64")
        elif str(dtype) == "category":
            data[col_name] = data[col_name].astype(str).astype("int8")
        else:
            pass
            
    return data

def create_event_features(event_data: pd.DataFrame, feature_fields_list: List) -> pd.DataFrame:
    df_event = convert_to_numeric_type(event_data, feature_fields_list)
    
    stat_list = ["min", "max", "median", "mean", "std", "sum", "count"]
    
    df__event_features = df_event.groupby(["session_id", "level_group"])[feature_fields_list].agg(stat_list).round(2)
    df__event_features.columns = [f"{col_name}__{stat}" for col_name, stat in df__event_features.columns.to_flat_index()]
    
    return df__event_features

def encode_category_field__onehot(event_data: pd.DataFrame, cat_field: str, onehot_dict: Dict[str, str]) -> pd.DataFrame:
    for col_name, value in onehot_dict.items():
        event_data[col_name] = (event_data[cat_field] == value).astype("int8").astype("category")
        
    return event_data
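
To make the column naming in `create_event_features` concrete, here is a minimal sketch of the same aggregate-then-flatten pattern on a toy frame. The data is invented, and only three of the seven statistics are used to keep the output short.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id":   [1, 1, 2],
    "level_group":  ["0-4", "0-4", "0-4"],
    "elapsed_time": [0, 10, 5],
})

# Aggregating with a list of statistics yields a two-level column index ...
feats = toy.groupby(["session_id", "level_group"])[["elapsed_time"]].agg(["min", "max", "mean"])

# ... which is flattened into "<column>__<stat>" names.
feats.columns = [f"{col}__{stat}" for col, stat in feats.columns.to_flat_index()]
feats = feats.reset_index()

print(feats)
```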

### `event_name` == `"checkpoint"`

In [27]:
event_name = "checkpoint"
checkpoint__train_data = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

checkpoint__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,164,194860,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,1,0-4
1,20090312431273200,470,499235,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,1,5-12
2,20090312431273200,931,1272679,basic,22.0,,,,,,,undefined,chap4_finale_c,tunic.capitol_2.hall,,0,0,1,13-22
3,20090312433251036,138,233752,basic,4.0,,,,,,,undefined,chap1_finale_c,tunic.capitol_0.hall,,0,0,0,0-4
4,20090312433251036,544,817609,basic,12.0,,,,,,,undefined,chap2_finale_c,tunic.capitol_1.hall,,0,0,0,5-12


In [28]:
checkpoint__train_data = recategorize_category_typed_fields(checkpoint__train_data)

summary = summarize_data_info(checkpoint__train_data)
summary

Data Shape: (71028, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2435,164,470,931,0.0,20473.0
elapsed_time,int32,0.0,69855,194860,499235,1272679,300.0,1987182816.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,3,4.0,12.0,22.0,4.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


In [29]:
checkpoint__feature_fields_list = ["index", "elapsed_time", "fullscreen", "hq", "music"]

In [30]:
numeric_features__checkpoint = create_event_features(checkpoint__train_data, checkpoint__feature_fields_list).reset_index()
numeric_features__checkpoint.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,164.0,164.0,164.0,164.0,,164,1,194860.0,...,,0,1,1.0,1.0,1.0,1.0,,1,1
1,20090312431273200,13-22,931.0,931.0,931.0,931.0,,931,1,1272679.0,...,,0,1,1.0,1.0,1.0,1.0,,1,1
2,20090312431273200,5-12,470.0,470.0,470.0,470.0,,470,1,499235.0,...,,0,1,1.0,1.0,1.0,1.0,,1,1
3,20090312433251036,0-4,138.0,138.0,138.0,138.0,,138,1,233752.0,...,,0,1,0.0,0.0,0.0,0.0,,0,1
4,20090312433251036,13-22,1875.0,1875.0,1875.0,1875.0,,1875,1,3815334.0,...,,0,1,0.0,0.0,0.0,0.0,,0,1


In [31]:
summary = summarize_data_info(numeric_features__checkpoint)
summary

Data Shape: (70686, 37)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,float64,1.4e-05,2366,164.0,931.0,470.0,0.0,20473.0
index__max,float64,1.4e-05,2432,164.0,931.0,470.0,0.0,20473.0
index__median,float64,1.4e-05,2541,164.0,931.0,470.0,0.0,20473.0
index__mean,float64,1.4e-05,2545,164.0,931.0,470.0,0.0,20473.0
index__std,float64,0.995261,290,,,,39.6,1837.77
index__sum,int64,0.0,2493,164,931,470,0.0,20473.0
index__count,int64,0.0,4,1,1,1,0.0,3.0
elapsed_time__min,float64,1.4e-05,69512,194860.0,1272679.0,499235.0,300.0,1468325947.0
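
The feature table has exactly 23562 × 3 = 70686 rows, a minimum `index__count` of 0, and a few missing `min`/`max` values. This is consistent with grouping on a categorical `level_group` while unobserved categories are kept, which fills in every session/level-group combination even when no event exists. The sketch below shows that behaviour on toy data, assuming the notebook ran with `observed=False` as the effective setting; the `observed` argument is passed explicitly because its default differs between pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({
    "session_id":  [1, 1, 1, 2],
    "level_group": pd.Categorical(["0-4", "5-12", "13-22", "0-4"]),
    "index":       [10, 20, 30, 5],
})
keys = ["session_id", "level_group"]

# Keep unobserved categories: one row per session x level_group combination.
full = toy.groupby(keys, observed=False)["index"].agg(["min", "count"]).reset_index()

# Only combinations that actually occur in the data.
seen = toy.groupby(keys, observed=True)["index"].agg(["min", "count"]).reset_index()

print(len(full), len(seen))
print(full)
```

Rows with a count of 0 carry no information beyond "this event type never occurred", so whether to keep them depends on how the features are merged later.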


### `event_name` == `"cutscene_click"`

In [32]:
event_name = "cutscene_click"
cutscene_click__train_data = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

cutscene_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,0,0,basic,0.0,,-414.0,-159.375,380.0,494.0,,undefined,intro,tunic.historicalsociety.closet,tunic.historicalsociety.closet.intro,0,0,1,0-4
1,20090312431273200,41,45062,basic,1.0,,93.8125,-60.34375,338.0,368.0,,Let's get started. The Wisconsin Wonders exhib...,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
2,20090312431273200,42,46046,basic,1.0,,134.0,-85.6875,390.0,386.0,,Who wants to investigate the shirt artifact?,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
3,20090312431273200,43,47362,basic,1.0,,125.9375,-83.375,390.0,385.0,,Not Leopold here. He's been losing papers lately.,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4
4,20090312431273200,44,48112,basic,1.0,,123.6875,-80.0625,389.0,383.0,,Hey!,groupconvo,tunic.historicalsociety.entry,tunic.historicalsociety.entry.groupconvo,0,0,1,0-4


In [33]:
cutscene_click__train_data = recategorize_category_typed_fields(cutscene_click__train_data)

summary = summarize_data_info(cutscene_click__train_data)
summary

Data Shape: (2703035, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3583,0,41,42,0.0,19407.0
elapsed_time,int32,0.0,1528317,0,45062,46046,0.0,1988105886.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,14,0.0,1.0,1.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,21727,-414.0,93.8125,134.0,-1419.0,949.0
room_coor_y,float16,0.0,19919,-159.375,-60.34375,-85.6875,-532.0,543.5
screen_coor_x,float16,0.0,4428,380.0,338.0,390.0,0.0,1886.0
screen_coor_y,float16,0.0,3017,494.0,368.0,386.0,0.0,1414.0


In [34]:
cutscene_click__feature_fields_list = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

In [35]:
numeric_features__cutscene_click = create_event_features(cutscene_click__train_data, cutscene_click__feature_fields_list).reset_index()
numeric_features__cutscene_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,0,151,56.5,63.75,29.89,1785,28,0,...,0.0,0,28,1,1,1.0,1.0,0.0,28,28
1,20090312431273200,13-22,523,699,637.5,618.05,58.78,37083,60,844329,...,0.0,0,60,1,1,1.0,1.0,0.0,60,60
2,20090312431273200,5-12,188,214,204.5,200.58,9.82,2407,12,234969,...,0.0,0,12,1,1,1.0,1.0,0.0,12,12
3,20090312433251036,0-4,0,126,48.5,49.81,25.68,1793,36,0,...,0.0,0,36,0,0,0.0,0.0,0.0,0,36
4,20090312433251036,13-22,586,1419,698.0,769.8,265.25,50037,65,1183009,...,0.0,0,65,0,0,0.0,0.0,0.0,0,65


In [36]:
pd.set_option("display.max_rows", 100)

In [37]:
summary = summarize_data_info(numeric_features__cutscene_click)
summary

Data Shape: (70686, 72)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1363,0,523,188,0.0,18345.0
index__max,int64,0.0,1959,151,699,214,21.0,19407.0
index__median,float64,0.0,2992,56.5,637.5,204.5,6.0,18446.5
index__mean,float64,0.0,33067,63.75,618.05,200.58,13.27,18540.38
index__std,float64,0.0,11998,29.89,58.78,9.82,4.74,2644.45
index__sum,int64,0.0,23987,1785,37083,2407,146.0,1409069.0
index__count,int64,0.0,145,28,60,12,11.0,293.0
elapsed_time__min,int64,0.0,46490,0,844329,234969,0.0,1194312858.0


### `event_name` == `"map_click"`

In [38]:
event_name = "map_click"
map_click__train_data = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

map_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,129,135990,undefined,3.0,,168.0,-142.25,263.0,417.0,,undefined,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,162,162438,undefined,4.0,,-538.0,6.0,462.0,324.0,,undefined,tunic.capitol_0,tunic.kohlcenter.halloffame,,0,0,1,0-4
2,20090312431273200,183,228133,undefined,5.0,,456.75,167.125,559.0,198.0,,undefined,tunic.historicalsociety,tunic.capitol_0.hall,,0,0,1,5-12
3,20090312431273200,242,280148,close,6.0,,1111.0,419.5,843.0,72.0,,undefined,,tunic.historicalsociety.entry,,0,0,1,5-12
4,20090312431273200,285,324396,undefined,7.0,,418.5,-201.0,420.0,453.0,,undefined,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1,5-12


In [39]:
map_click__train_data = recategorize_category_typed_fields(map_click__train_data)

summary = summarize_data_info(map_click__train_data)
summary

Data Shape: (517242, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,4287,129,162,183,0.0,20470.0
elapsed_time,int32,0.0,471822,135990,162438,228133,113.0,1988601973.0
name,category,0.0,3,undefined,undefined,undefined,,
level,float16,0.0,19,3.0,4.0,5.0,3.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,17550,168.0,-538.0,456.75,-998.0,1173.0
room_coor_y,float16,0.0,17467,-142.25,6.0,167.125,-918.0,536.5
screen_coor_x,float16,0.0,2721,263.0,462.0,559.0,0.0,1894.0
screen_coor_y,float16,0.0,2386,417.0,324.0,198.0,0.0,1311.0


#### OneHot Encoding

- **map_click one_hot_fields**: `name`

In [40]:
onehot_cat_field_str = "name"
map_click__train_data[onehot_cat_field_str].value_counts()

undefined    442532
basic         46087
close         28623
Name: name, dtype: int64

In [41]:
onehot_dict = {
    "is_name__basic": "basic",
    "is_name__close": "close",
    "is_name__undefined": "undefined",
}

In [42]:
map_click__train_data = encode_category_field__onehot(map_click__train_data, onehot_cat_field_str, onehot_dict)
map_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,...,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,is_name__basic,is_name__close,is_name__undefined
0,20090312431273200,129,135990,undefined,3.0,,168.0,-142.25,263.0,417.0,...,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1,0-4,0,0,1
1,20090312431273200,162,162438,undefined,4.0,,-538.0,6.0,462.0,324.0,...,tunic.capitol_0,tunic.kohlcenter.halloffame,,0,0,1,0-4,0,0,1
2,20090312431273200,183,228133,undefined,5.0,,456.75,167.125,559.0,198.0,...,tunic.historicalsociety,tunic.capitol_0.hall,,0,0,1,5-12,0,0,1
3,20090312431273200,242,280148,close,6.0,,1111.0,419.5,843.0,72.0,...,,tunic.historicalsociety.entry,,0,0,1,5-12,0,1,0
4,20090312431273200,285,324396,undefined,7.0,,418.5,-201.0,420.0,453.0,...,tunic.humanecology,tunic.historicalsociety.entry,,0,0,1,5-12,0,0,1
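
For comparison, the same indicator columns can be produced with `pd.get_dummies`. This is an alternative sketch on toy data, not the method used in this notebook; the explicit `onehot_dict` approach above has the advantage that the set of output columns is fixed even when a category is absent from a particular subset.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["undefined", "close", "basic", "undefined"]})

# One int8 indicator column per distinct value, named "is_name__<value>".
dummies = pd.get_dummies(toy["name"], prefix="is_name", prefix_sep="__", dtype="int8")
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)
```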


In [43]:
summary = summarize_data_info(map_click__train_data)
summary

Data Shape: (517242, 22)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,4287,129,162,183,0.0,20470.0
elapsed_time,int32,0.0,471822,135990,162438,228133,113.0,1988601973.0
name,category,0.0,3,undefined,undefined,undefined,,
level,float16,0.0,19,3.0,4.0,5.0,3.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,17550,168.0,-538.0,456.75,-998.0,1173.0
room_coor_y,float16,0.0,17467,-142.25,6.0,167.125,-918.0,536.5
screen_coor_x,float16,0.0,2721,263.0,462.0,559.0,0.0,1894.0
screen_coor_y,float16,0.0,2386,417.0,324.0,198.0,0.0,1311.0


In [44]:
map_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music", 
]

map_click__onehot_encoded_fields = ["is_name__basic", "is_name__close", "is_name__undefined"]

map_click__feature_fields_list = [*map_click__numeric_fields, *map_click__onehot_encoded_fields]

numeric_features__map_click = create_event_features(map_click__train_data, map_click__feature_fields_list).reset_index()
numeric_features__map_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,is_name__close__std,is_name__close__sum,is_name__close__count,is_name__undefined__min,is_name__undefined__max,is_name__undefined__median,is_name__undefined__mean,is_name__undefined__std,is_name__undefined__sum,is_name__undefined__count
0,20090312431273200,0-4,129,162,145.5,145.5,23.33,291,2,135990,...,0.0,0,2,1,1,1.0,1.0,0.0,2,2
1,20090312431273200,13-22,521,929,841.0,791.67,150.72,4750,6,841512,...,0.0,0,6,1,1,1.0,1.0,0.0,6,6
2,20090312431273200,5-12,183,467,345.5,336.75,96.09,2694,8,228133,...,0.35,1,8,0,1,1.0,0.88,0.35,7,8
3,20090312433251036,0-4,104,136,135.0,125.0,18.19,375,3,162990,...,0.0,0,3,0,1,1.0,0.67,0.58,2,3
4,20090312433251036,13-22,584,1873,1066.0,1172.2,331.72,52749,45,1179708,...,0.25,3,45,0,1,1.0,0.76,0.43,34,45


In [45]:
pd.set_option("display.max_rows", 100)

In [46]:
summary = summarize_data_info(numeric_features__map_click)
summary

Data Shape: (70686, 93)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1352,129,521,183,0.0,18336.0
index__max,int64,0.0,2426,162,929,467,27.0,20470.0
index__median,float64,0.0,3495,145.5,841.0,345.5,1.0,20334.0
index__mean,float64,0.0,25272,145.5,791.67,336.75,15.0,19991.31
index__std,float64,2.8e-05,18799,23.33,150.72,96.09,9.06,4774.66
index__sum,int64,0.0,15666,291,4750,2694,30.0,6780473.0
index__count,int64,0.0,93,2,6,8,1.0,616.0
elapsed_time__min,int64,0.0,67978,135990,841512,228133,113.0,1188866840.0


### `event_name` == `"map_hover"`

In [47]:
event_name = "map_hover"
map_hover__train_data = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

map_hover__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,127,135124,basic,3.0,,,,,,234.0,undefined,tunic.historicalsociety,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,128,135256,basic,3.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.historicalsociety.entry,,0,0,1,0-4
2,20090312431273200,160,161405,basic,4.0,,,,,,250.0,undefined,toentry,tunic.kohlcenter.halloffame,,0,0,1,0-4
3,20090312431273200,161,161822,basic,4.0,,,,,,17.0,undefined,tunic.kohlcenter,tunic.kohlcenter.halloffame,,0,0,1,0-4
4,20090312431273200,182,226643,basic,5.0,,,,,,750.0,undefined,toentry,tunic.capitol_0.hall,,0,0,1,5-12


In [48]:
map_hover__train_data = recategorize_category_typed_fields(map_hover__train_data)

summary = summarize_data_info(map_hover__train_data)
summary

Data Shape: (945159, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21688,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3957,127,128,160,0.0,20469.0
elapsed_time,int32,0.0,825536,135124,135256,161405,16.0,1988601273.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,19,3.0,3.0,4.0,3.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


In [49]:
map_hover__feature_fields_list = [ "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music"]

numeric_features__map_hover = create_event_features(map_hover__train_data, map_hover__feature_fields_list).reset_index()
numeric_features__map_hover.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,127.0,161.0,144.0,144.0,19.06,576,4,135124.0,...,0.0,0,4,1.0,1.0,1.0,1.0,0.0,4,4
1,20090312431273200,13-22,516.0,928.0,770.5,711.29,158.11,9958,14,839629.0,...,0.0,0,14,1.0,1.0,1.0,1.0,0.0,14,14
2,20090312431273200,5-12,182.0,466.0,361.0,340.33,106.22,3063,9,226643.0,...,0.0,0,9,1.0,1.0,1.0,1.0,0.0,9,9
3,20090312433251036,0-4,101.0,103.0,102.0,102.0,1.0,306,3,161157.0,...,0.0,0,3,0.0,0.0,0.0,0.0,0.0,0,3
4,20090312433251036,13-22,583.0,1872.0,1085.5,1160.44,316.2,215842,186,1178909.0,...,0.0,0,186,0.0,0.0,0.0,0.0,0.0,0,186


In [50]:
summary = summarize_data_info(numeric_features__map_hover)
summary

Data Shape: (65064, 51)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21688,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,float64,0.036241,1355,127.0,516.0,182.0,0.0,18333.0
index__max,float64,0.036241,2302,161.0,928.0,466.0,0.0,20469.0
index__median,float64,0.036241,3452,144.0,770.5,361.0,0.0,20330.0
index__mean,float64,0.036241,31752,144.0,711.29,340.33,0.0,19800.16
index__std,float64,0.123432,18777,19.06,158.11,106.22,0.71,7534.93
index__sum,int64,0.0,24581,576,9958,3063,0.0,1213421.0
index__count,int64,0.0,202,4,14,9,0.0,381.0
elapsed_time__min,float64,0.036241,60827,135124.0,839629.0,226643.0,16.0,1188865706.0
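
The summary above reports 65064 rows, exactly 3 × 21688 sessions, and `index__count` and `index__sum` are never missing while `index__min` is missing for about 3.6% of rows. One plausible explanation, assuming `create_event_features` groups by `session_id` and the categorical `level_group` (its definition is not shown in this section), is that pandas emits a row for every session/category pair, including pairs with no events. A minimal sketch with made-up data:

```python
import pandas as pd

# Toy events: session 1 has rows in two level groups, session 2 in only one.
events = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "level_group": pd.Categorical(
        ["0-4", "0-4", "5-12", "0-4"],
        categories=["0-4", "5-12", "13-22"],
    ),
    "index": [10, 12, 40, 7],
})

# observed=False keeps every (session, category) pair, even empty ones.
features = (
    events.groupby(["session_id", "level_group"], observed=False)["index"]
    .agg(["min", "sum", "count"])
)
print(features)
```

Empty pairs get a count and sum of 0 but a missing minimum, which matches the pattern of `perc_missing` values in the summary.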


### `event_name` == `"navigate_click"`

In [51]:
event_name = "navigate_click"
# Keep only this event type; the now-constant event_name column is dropped.
navigate_click__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

navigate_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,10,9133,undefined,0.0,,501.0,-160.75,605.0,445.0,,undefined,teddy,tunic.historicalsociety.closet,,0,0,1,0-4
1,20090312431273200,12,12030,undefined,0.0,,510.0,-106.375,614.0,386.0,,undefined,photo,tunic.historicalsociety.closet,,0,0,1,0-4
2,20090312431273200,14,14814,undefined,0.0,,274.0,-196.75,406.0,486.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4
3,20090312431273200,15,15498,undefined,0.0,,185.75,-205.75,363.0,492.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4
4,20090312431273200,16,16046,undefined,0.0,,0.583496,-225.75,234.0,510.0,,undefined,,tunic.historicalsociety.closet,,0,0,1,0-4
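
The same filter, drop, and reset pattern recurs for every event type in this part of the notebook. As a sketch only (the dictionary layout and the name `split_by_event` are illustrative and not part of the notebook), the per-event frames could be produced in one pass:

```python
import pandas as pd


def split_by_event(df: pd.DataFrame) -> dict:
    """Return {event_name: rows of that event, without the event_name column}."""
    return {
        str(name): group.drop(columns=["event_name"]).reset_index(drop=True)
        for name, group in df.groupby("event_name", observed=True)
    }


# Made-up rows standing in for train_data.
toy = pd.DataFrame({
    "session_id": [1, 1, 2, 2],
    "event_name": pd.Categorical(
        ["navigate_click", "map_hover", "navigate_click", "object_click"]
    ),
    "elapsed_time": [100, 250, 90, 400],
})

frames = split_by_event(toy)
print(frames["navigate_click"])
```

Keeping one named variable per event type, as the notebook does, is easier to inspect cell by cell; the dictionary form mainly helps when the same steps must be repeated on a test split.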


In [52]:
navigate_click__train_data = recategorize_category_typed_fields(navigate_click__train_data)

summary = summarize_data_info(navigate_click__train_data)
summary

Data Shape: (11326433, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,13859,10,12,14,0.0,20472.0
elapsed_time,int32,0.0,3657724,9133,12030,14814,0.0,1988606704.0
name,category,0.0,1,undefined,undefined,undefined,,
level,float16,0.0,23,0.0,0.0,0.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,28184,501.0,510.0,274.0,-1992.0,1259.0
room_coor_y,float16,0.0,25911,-160.75,-106.375,-196.75,-918.0,536.5
screen_coor_x,float16,0.0,6722,605.0,614.0,406.0,0.0,1919.0
screen_coor_y,float16,0.0,3862,445.0,386.0,486.0,0.0,1440.0
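
The downcast dtypes in this summary can be checked against the reported maxima: `index` peaks at 20472 (`int16`), `elapsed_time` at 1988606704 (`int32`), and the coordinate columns are stored as `float16`. The sketch below relies only on NumPy's documented type limits to show the available headroom and the rounding that half precision introduces; the sample coordinate values are illustrative.

```python
import numpy as np

# Integer headroom for the downcast columns.
int16_max = np.iinfo(np.int16).max
int32_max = np.iinfo(np.int32).max
assert 20472 <= int16_max
assert 1988606704 <= int32_max

# float16 has an 11-bit significand: integers are exact only up to 2048,
# and the spacing between neighbouring values grows with magnitude.
exact = float(np.float16(1919.0))    # representable exactly
rounded = float(np.float16(2049.0))  # not representable, rounds to 2048
coarse = float(np.float16(1000.3))   # spacing is 0.5 in [512, 1024)

print(exact, rounded, coarse)
```

Screen coordinates up to 1919 therefore survive the cast unchanged, while fractional room coordinates are rounded to the nearest representable step.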


In [53]:
navigate_click__feature_fields_list = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

In [54]:
numeric_features__navigate_click = create_event_features(navigate_click__train_data, navigate_click__feature_fields_list).reset_index()
numeric_features__navigate_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,10,163,106.0,91.74,48.2,7431,81,9133,...,0.0,0,81,1,1,1.0,1.0,0.0,81,81
1,20090312431273200,13-22,512,930,675.5,698.16,121.01,118687,170,836732,...,0.0,0,170,1,1,1.0,1.0,0.0,170,170
2,20090312431273200,5-12,175,469,286.0,302.64,89.81,31172,103,221485,...,0.0,0,103,1,1,1.0,1.0,0.0,103,103
3,20090312433251036,0-4,13,137,90.0,74.59,42.19,3655,49,5149,...,0.0,0,49,0,0,0.0,0.0,0.0,0,49
4,20090312433251036,13-22,579,1874,1196.0,1197.36,369.63,762719,637,1176483,...,0.0,0,637,0,0,0.0,0.0,0.0,0,637


In [55]:
summary = summarize_data_info(numeric_features__navigate_click)
summary

Data Shape: (70686, 72)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1319,10,512,175,0.0,18325.0
index__max,int64,0.0,2411,163,930,469,83.0,20472.0
index__median,float64,0.0,3293,106.0,675.5,286.0,33.5,19287.5
index__mean,float64,0.0,42654,91.74,698.16,302.64,46.69,19354.37
index__std,float64,0.0,20163,48.2,121.01,89.81,26.74,4216.07
index__sum,int64,0.0,52823,7431,118687,31172,1296.0,79007976.0
index__count,int64,0.0,1017,81,170,103,25.0,7871.0
elapsed_time__min,int64,0.0,62676,9133,836732,221485,0.0,1188861174.0


### `event_name` == `"notebook_click"`

In [56]:
event_name = "notebook_click"
# Keep only this event type; the now-constant event_name column is dropped.
notebook_click__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

notebook_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312433251036,60,60743,open,2.0,0.0,-1112.0,-518.5,30.0,639.0,,undefined,,tunic.historicalsociety.entry,,0,0,0,0-4
1,20090312433251036,61,61761,close,2.0,0.0,73.25,428.25,789.0,58.0,,undefined,,tunic.historicalsociety.entry,,0,0,0,0-4
2,20090312433251036,209,351064,open,6.0,1.0,-490.75,-429.75,61.0,629.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12
3,20090312433251036,210,354779,basic,6.0,1.0,-97.625,-304.25,343.0,539.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12
4,20090312433251036,211,357947,close,6.0,1.0,556.0,342.5,812.0,75.0,,undefined,,tunic.historicalsociety.basement,,0,0,0,5-12


In [57]:
notebook_click__train_data = recategorize_category_typed_fields(notebook_click__train_data)

summary = summarize_data_info(notebook_click__train_data)
summary

Data Shape: (564544, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,20887,20090312433251036,20090312433251036,20090312433251036,2.009031243325104e+16,2.2100221145014656e+16
index,int16,0.0,6006,60,61,209,0.0,20384.0
elapsed_time,int32,0.0,515782,60743,61761,351064,9183.0,1988605536.0
name,category,0.0,5,open,close,open,,
level,float16,0.0,22,2.0,2.0,6.0,1.0,22.0
page,float16,0.0,7,0.0,0.0,1.0,0.0,6.0
room_coor_x,float16,0.0,15130,-1112.0,73.25,-490.75,-1991.0,1258.0
room_coor_y,float16,0.0,11968,-518.5,428.25,-429.75,-915.5,535.0
screen_coor_x,float16,0.0,3246,30.0,789.0,61.0,0.0,1919.0
screen_coor_y,float16,0.0,2825,639.0,58.0,629.0,0.0,1419.0


#### One-Hot Encoding

- **notebook_click one_hot_fields**: `name`

In [58]:
onehot_cat_field_str = "name"
notebook_click__train_data[onehot_cat_field_str].value_counts()

open     235139
close    235132
basic     63416
prev      19250
next      11607
Name: name, dtype: int64

In [59]:
# Maps each new indicator column to the `name` value it flags.
onehot_dict = {
    "is_name__basic": "basic",
    "is_name__open": "open",
    "is_name__close": "close",
    "is_name__prev": "prev",
    "is_name__next": "next",
}

In [60]:
notebook_click__train_data = encode_category_field__onehot(notebook_click__train_data, onehot_cat_field_str, onehot_dict)
notebook_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,...,text_fqid,fullscreen,hq,music,level_group,is_name__basic,is_name__open,is_name__close,is_name__prev,is_name__next
0,20090312433251036,60,60743,open,2.0,0.0,-1112.0,-518.5,30.0,639.0,...,,0,0,0,0-4,0,1,0,0,0
1,20090312433251036,61,61761,close,2.0,0.0,73.25,428.25,789.0,58.0,...,,0,0,0,0-4,0,0,1,0,0
2,20090312433251036,209,351064,open,6.0,1.0,-490.75,-429.75,61.0,629.0,...,,0,0,0,5-12,0,1,0,0,0
3,20090312433251036,210,354779,basic,6.0,1.0,-97.625,-304.25,343.0,539.0,...,,0,0,0,5-12,1,0,0,0,0
4,20090312433251036,211,357947,close,6.0,1.0,556.0,342.5,812.0,75.0,...,,0,0,0,5-12,0,0,1,0,0
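
`encode_category_field__onehot` is defined elsewhere in the notebook; judging from the output above, it adds one 0/1 column per entry of `onehot_dict`. A minimal stand-in under that assumption (the function name `onehot_from_dict` is hypothetical, and the real helper may differ in dtype or in how it handles missing values):

```python
import pandas as pd


def onehot_from_dict(df: pd.DataFrame, field: str, mapping: dict) -> pd.DataFrame:
    """Add one 0/1 indicator column per (new_column -> category value) entry."""
    out = df.copy()
    for new_column, value in mapping.items():
        # Compare as strings so the check also works on categorical columns.
        out[new_column] = (out[field].astype(str) == value).astype("int8")
    return out


toy = pd.DataFrame({"name": ["open", "close", "open", "basic", "close"]})
mapping = {
    "is_name__basic": "basic",
    "is_name__open": "open",
    "is_name__close": "close",
}

encoded = onehot_from_dict(toy, "name", mapping)
print(encoded)
```

An explicit dictionary keeps the set of indicator columns fixed, so a column still exists (filled with zeros) when a value does not occur in a given split.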


In [61]:
summary = summarize_data_info(notebook_click__train_data)
summary

Data Shape: (564544, 24)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,20887,20090312433251036,20090312433251036,20090312433251036,2.009031243325104e+16,2.2100221145014656e+16
index,int16,0.0,6006,60,61,209,0.0,20384.0
elapsed_time,int32,0.0,515782,60743,61761,351064,9183.0,1988605536.0
name,category,0.0,5,open,close,open,,
level,float16,0.0,22,2.0,2.0,6.0,1.0,22.0
page,float16,0.0,7,0.0,0.0,1.0,0.0,6.0
room_coor_x,float16,0.0,15130,-1112.0,73.25,-490.75,-1991.0,1258.0
room_coor_y,float16,0.0,11968,-518.5,428.25,-429.75,-915.5,535.0
screen_coor_x,float16,0.0,3246,30.0,789.0,61.0,0.0,1919.0
screen_coor_y,float16,0.0,2825,639.0,58.0,629.0,0.0,1419.0


In [62]:
notebook_click__numeric_fields = [
    "index", "elapsed_time", "level", "page", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

notebook_click__onehot_encoded_fields = ["is_name__basic", "is_name__open", "is_name__close", "is_name__prev", "is_name__next"]

notebook_click__feature_fields_list = [*notebook_click__numeric_fields, *notebook_click__onehot_encoded_fields]

numeric_features__notebook_click = create_event_features(notebook_click__train_data, notebook_click__feature_fields_list).reset_index()
numeric_features__notebook_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,is_name__prev__std,is_name__prev__sum,is_name__prev__count,is_name__next__min,is_name__next__max,is_name__next__median,is_name__next__mean,is_name__next__std,is_name__next__sum,is_name__next__count
0,20090312433251036,0-4,60.0,61.0,60.5,60.5,0.71,121,2,60743.0,...,0.0,0,2,0.0,0.0,0.0,0.0,0.0,0,2
1,20090312433251036,13-22,638.0,1866.0,1449.0,1408.28,377.69,70414,50,1239444.0,...,0.14,1,50,0.0,1.0,0.0,0.02,0.14,1,50
2,20090312433251036,5-12,209.0,538.0,413.0,361.71,150.68,2532,7,351064.0,...,0.0,0,7,0.0,0.0,0.0,0.0,0.0,0,7
3,20090312455206810,0-4,,,,,,0,0,,...,,0,0,,,,,,0,0
4,20090312455206810,13-22,521.0,820.0,733.5,694.58,104.43,18059,26,796699.0,...,0.0,0,26,0.0,0.0,0.0,0.0,0.0,0,26


In [63]:
pd.set_option("display.max_rows", 200)

In [64]:
summary = summarize_data_info(numeric_features__notebook_click)
summary

Data Shape: (62661, 114)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,20887,20090312433251036,20090312433251036,20090312433251036,2.009031243325104e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,float64,0.219882,1634,60.0,638.0,209.0,0.0,18394.0
index__max,float64,0.219882,2320,61.0,1866.0,538.0,11.0,20384.0
index__median,float64,0.219882,3524,60.5,1449.0,413.0,8.0,20058.0
index__mean,float64,0.219882,23019,60.5,1408.28,361.71,10.5,19941.41
index__std,float64,0.219882,16850,0.71,377.69,150.68,0.71,3634.01
index__sum,int64,0.0,17665,121,70414,2532,0.0,29832207.0
index__count,int64,0.0,115,2,50,7,0.0,2588.0
elapsed_time__min,float64,0.219882,48059,60743.0,1239444.0,351064.0,9183.0,1987137011.0


### `event_name` == `"notification_click"`

In [65]:
event_name = "notification_click"
# Keep only this event type; the now-constant event_name column is dropped.
notification_click__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

notification_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,26,24348,basic,0.0,,-472.25,-117.9375,554.0,394.0,,Found it!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.notebook,0,0,1,0-4
1,20090312431273200,29,32229,basic,1.0,,-182.5,-1.90625,767.0,305.0,,Gramps is in trouble for losing papers?,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
2,20090312431273200,30,33063,basic,1.0,,-182.5,-55.875,767.0,359.0,,This can't be right!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
3,20090312431273200,31,34245,basic,1.0,,-182.5,-55.875,767.0,359.0,,Gramps is a great historian!,,tunic.historicalsociety.closet,tunic.historicalsociety.closet.retirement_lett...,0,0,1,0-4
4,20090312431273200,85,89809,basic,2.0,,-86.875,-96.8125,355.0,397.0,,This looks like a clue!,,tunic.historicalsociety.collection,tunic.historicalsociety.collection.tunic.slip,0,0,1,0-4


In [66]:
notification_click__train_data = recategorize_category_typed_fields(notification_click__train_data)

summary = summarize_data_info(notification_click__train_data)
summary

Data Shape: (649001, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3457,26,29,30,0.0,20458.0
elapsed_time,int32,0.0,552331,24348,32229,33063,82.0,1988525224.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,20,0.0,1.0,1.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,15642,-472.25,-182.5,-182.5,-1022.5,1262.0
room_coor_y,float16,0.0,13756,-117.9375,-1.90625,-55.875,-812.0,536.5
screen_coor_x,float16,0.0,2871,554.0,767.0,767.0,0.0,1906.0
screen_coor_y,float16,0.0,2303,394.0,305.0,359.0,0.0,1420.0


In [67]:
notification_click__feature_fields_list = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

numeric_features__notification_click = create_event_features(notification_click__train_data, notification_click__feature_fields_list).reset_index()
numeric_features__notification_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,26,146,58.0,72.25,51.52,578,8,24348,...,0.0,0,8,1,1,1.0,1.0,0.0,8,8
1,20090312431273200,13-22,598,922,813.0,766.6,149.22,7666,10,909809,...,0.0,0,10,1,1,1.0,1.0,0.0,10,10
2,20090312431273200,5-12,313,458,384.0,393.56,53.21,3542,9,346295,...,0.0,0,9,1,1,1.0,1.0,0.0,9,9
3,20090312433251036,0-4,21,122,76.0,83.0,41.6,415,5,12113,...,0.0,0,5,0,0,0.0,0.0,0.0,0,5
4,20090312433251036,13-22,753,1837,1121.5,1323.07,388.31,18523,14,1481128,...,0.0,0,14,0,0,0.0,0.0,0.0,0,14


In [68]:
summary = summarize_data_info(numeric_features__notification_click)
summary

Data Shape: (70686, 72)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1717,26,598,313,0.0,19694.0
index__max,int64,0.0,2375,146,922,458,64.0,20458.0
index__median,float64,0.0,3889,58.0,813.0,384.0,12.0,20451.0
index__mean,float64,0.0,23966,72.25,766.6,393.56,36.83,20302.33
index__std,float64,0.0,20032,51.52,149.22,53.21,20.8,2825.41
index__sum,int64,0.0,15120,578,7666,3542,221.0,365906.0
index__count,int64,0.0,53,8,10,9,5.0,96.0
elapsed_time__min,int64,0.0,65185,24348,909809,346295,82.0,1195638449.0


### `event_name` == `"object_click"`

In [69]:
event_name = "object_click"
# Keep only this event type; the now-constant event_name column is dropped.
object_click__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

object_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,27,25766,close,0.0,,-206.5,199.125,822.0,76.0,,undefined,notebook,tunic.historicalsociety.closet,,0,0,1,0-4
1,20090312431273200,32,36433,close,1.0,,-113.5,241.125,836.0,62.0,,undefined,retirement_letter,tunic.historicalsociety.closet,,0,0,1,0-4
2,20090312431273200,50,57277,basic,1.0,,856.5,69.75,839.0,291.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4
3,20090312431273200,51,58244,close,1.0,,848.0,402.0,834.0,87.0,,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4
4,20090312431273200,68,73927,close,2.0,,439.0,416.0,833.0,74.0,,undefined,directory,tunic.historicalsociety.entry,,0,0,1,0-4


In [70]:
object_click__train_data = recategorize_category_typed_fields(object_click__train_data)

summary = summarize_data_info(object_click__train_data)
summary

Data Shape: (2198211, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,6981,27,32,50,0.0,20462.0
elapsed_time,int32,0.0,1494730,25766,36433,57277,0.0,1988526677.0
name,category,0.0,2,close,close,basic,,
level,float16,0.0,23,0.0,1.0,1.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,17563,-206.5,-113.5,856.5,-1020.5,1249.0
room_coor_y,float16,0.0,17191,199.125,241.125,69.75,-811.0,543.5
screen_coor_x,float16,0.0,4285,822.0,836.0,839.0,0.0,1919.0
screen_coor_y,float16,0.0,4224,76.0,62.0,291.0,0.0,1417.0


#### One-Hot Encoding

- **object_click one_hot_fields**: `name`

In [71]:
onehot_cat_field_str = "name"
object_click__train_data[onehot_cat_field_str].value_counts()

basic    1785270
close     412941
Name: name, dtype: int64

In [72]:
# Maps each new indicator column to the `name` value it flags.
onehot_dict = {
    "is_name__basic": "basic",
    "is_name__close": "close",
}

In [73]:
object_click__train_data = encode_category_field__onehot(object_click__train_data, onehot_cat_field_str, onehot_dict)
object_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,...,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,is_name__basic,is_name__close
0,20090312431273200,27,25766,close,0.0,,-206.5,199.125,822.0,76.0,...,undefined,notebook,tunic.historicalsociety.closet,,0,0,1,0-4,0,1
1,20090312431273200,32,36433,close,1.0,,-113.5,241.125,836.0,62.0,...,undefined,retirement_letter,tunic.historicalsociety.closet,,0,0,1,0-4,0,1
2,20090312431273200,50,57277,basic,1.0,,856.5,69.75,839.0,291.0,...,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4,1,0
3,20090312431273200,51,58244,close,1.0,,848.0,402.0,834.0,87.0,...,undefined,report,tunic.historicalsociety.entry,,0,0,1,0-4,0,1
4,20090312431273200,68,73927,close,2.0,,439.0,416.0,833.0,74.0,...,undefined,directory,tunic.historicalsociety.entry,,0,0,1,0-4,0,1


In [74]:
summary = summarize_data_info(object_click__train_data)
summary

Data Shape: (2198211, 21)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,6981,27,32,50,0.0,20462.0
elapsed_time,int32,0.0,1494730,25766,36433,57277,0.0,1988526677.0
name,category,0.0,2,close,close,basic,,
level,float16,0.0,23,0.0,1.0,1.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,17563,-206.5,-113.5,856.5,-1020.5,1249.0
room_coor_y,float16,0.0,17191,199.125,241.125,69.75,-811.0,543.5
screen_coor_x,float16,0.0,4285,822.0,836.0,839.0,0.0,1919.0
screen_coor_y,float16,0.0,4224,76.0,62.0,291.0,0.0,1417.0


In [75]:
object_click__numeric_fields = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

object_click__onehot_encoded_fields = ["is_name__basic", "is_name__close"]

object_click__feature_fields_list = [*object_click__numeric_fields, *object_click__onehot_encoded_fields]

numeric_features__object_click = create_event_features(object_click__train_data, object_click__feature_fields_list).reset_index()
numeric_features__object_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,is_name__basic__std,is_name__basic__sum,is_name__basic__count,is_name__close__min,is_name__close__max,is_name__close__median,is_name__close__mean,is_name__close__std,is_name__close__sum,is_name__close__count
0,20090312431273200,0-4,27,149,83.0,83.91,45.02,923,11,25766,...,0.52,5,11,0,1,1.0,0.55,0.52,6,11
1,20090312431273200,13-22,597,924,857.0,820.75,102.73,16415,20,909078,...,0.44,15,20,0,1,0.0,0.25,0.44,5,20
2,20090312431273200,5-12,305,460,366.0,373.89,53.21,10469,28,339994,...,0.39,23,28,0,1,0.0,0.18,0.39,5,28
3,20090312433251036,0-4,22,124,112.0,94.2,34.01,1413,15,13148,...,0.46,11,15,0,1,0.0,0.27,0.46,4,15
4,20090312433251036,13-22,744,1861,1309.0,1406.34,324.27,116726,83,1452629,...,0.41,66,83,0,1,0.0,0.2,0.41,17,83
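
Aggregating a 0/1 indicator has a direct reading: `__sum` is the number of matching events and `__mean` is their share. In the first row above, 6 of 11 `object_click` events are `close`, which reproduces the reported mean of 0.55 and, with the sample (n-1) standard deviation that pandas uses by default, the reported 0.52:

```python
import pandas as pd

# 6 "close" events and 5 "basic" events, as in the first row of the table.
is_close = pd.Series([1] * 6 + [0] * 5)

share = round(is_close.mean(), 2)   # 6 / 11
spread = round(is_close.std(), 2)   # sample standard deviation, ddof=1

print(share, spread)
```

For a binary column the standard deviation is fully determined by the mean and the count, so these three aggregates carry overlapping information.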


In [77]:
summary = summarize_data_info(numeric_features__object_click)
summary

Data Shape: (70686, 86)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1607,27,597,305,0.0,19686.0
index__max,int64,0.0,2382,149,924,460,66.0,20462.0
index__median,float64,0.0,3800,83.0,857.0,366.0,14.0,20281.0
index__mean,float64,0.0,43099,83.91,820.75,373.89,40.0,20152.27
index__std,float64,0.0,17662,45.02,102.73,53.21,18.33,2437.95
index__sum,int64,0.0,33099,923,16415,10469,240.0,37472305.0
index__count,int64,0.0,356,11,20,28,6.0,2709.0
elapsed_time__min,int64,0.0,65478,25766,909078,339994,0.0,1195530470.0


### `event_name` == `"object_hover"`

In [79]:
event_name = "object_hover"
# Keep only this event type; the now-constant event_name column is dropped.
object_hover__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

object_hover__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,49,52328,basic,1.0,,,,,,7899.0,undefined,groupconvo,tunic.historicalsociety.entry,,0,0,1,0-4
1,20090312431273200,82,87242,basic,2.0,,,,,,400.0,undefined,tunic,tunic.historicalsociety.collection,,0,0,1,0-4
2,20090312431273200,87,92242,undefined,2.0,,,,,,3949.0,undefined,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,1,0-4
3,20090312431273200,148,153655,undefined,3.0,,,,,,6350.0,undefined,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1,0-4
4,20090312431273200,303,338929,undefined,7.0,,,,,,68.0,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1,5-12


In [80]:
object_hover__train_data = recategorize_category_typed_fields(object_hover__train_data)

summary = summarize_data_info(object_hover__train_data)
summary

Data Shape: (1057085, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21690,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3598,49,82,87,0.0,20461.0
elapsed_time,int32,0.0,879279,52328,87242,92242,29.0,1988525877.0
name,category,0.0,2,basic,basic,undefined,,
level,float16,0.0,23,1.0,2.0,2.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


#### One-Hot Encoding

- **object_hover one_hot_fields**: `name`

In [81]:
onehot_cat_field_str = "name"
object_hover__train_data[onehot_cat_field_str].value_counts()

undefined    936820
basic        120265
Name: name, dtype: int64

In [82]:
# Maps each new indicator column to the `name` value it flags.
onehot_dict = {
    "is_name__basic": "basic",
    "is_name__undefined": "undefined",
}

In [83]:
object_hover__train_data = encode_category_field__onehot(object_hover__train_data, onehot_cat_field_str, onehot_dict)
object_hover__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,...,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group,is_name__basic,is_name__undefined
0,20090312431273200,49,52328,basic,1.0,,,,,,...,undefined,groupconvo,tunic.historicalsociety.entry,,0,0,1,0-4,1,0
1,20090312431273200,82,87242,basic,2.0,,,,,,...,undefined,tunic,tunic.historicalsociety.collection,,0,0,1,0-4,1,0
2,20090312431273200,87,92242,undefined,2.0,,,,,,...,undefined,tunic.hub.slip,tunic.historicalsociety.collection,,0,0,1,0-4,0,1
3,20090312431273200,148,153655,undefined,3.0,,,,,,...,undefined,plaque.face.date,tunic.kohlcenter.halloffame,,0,0,1,0-4,0,1
4,20090312431273200,303,338929,undefined,7.0,,,,,,...,undefined,businesscards.card_0.next,tunic.humanecology.frontdesk,,0,0,1,5-12,0,1


In [84]:
summary = summarize_data_info(object_hover__train_data)
summary

Data Shape: (1057085, 21)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21690,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,3598,49,82,87,0.0,20461.0
elapsed_time,int32,0.0,879279,52328,87242,92242,29.0,1988525877.0
name,category,0.0,2,basic,basic,undefined,,
level,float16,0.0,23,1.0,2.0,2.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,1.0,0,,,,,
room_coor_y,float16,1.0,0,,,,,
screen_coor_x,float16,1.0,0,,,,,
screen_coor_y,float16,1.0,0,,,,,


In [85]:
object_hover__numeric_fields = [
    "index", "elapsed_time", "level", "hover_duration", "fullscreen", "hq", "music"
]

object_hover__onehot_encoded_fields = ["is_name__basic", "is_name__undefined"]

object_hover__feature_fields_list = [*object_hover__numeric_fields, *object_hover__onehot_encoded_fields]

numeric_features__object_hover = create_event_features(object_hover__train_data, object_hover__feature_fields_list).reset_index()
numeric_features__object_hover.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,is_name__basic__std,is_name__basic__sum,is_name__basic__count,is_name__undefined__min,is_name__undefined__max,is_name__undefined__median,is_name__undefined__mean,is_name__undefined__std,is_name__undefined__sum,is_name__undefined__count
0,20090312431273200,0-4,49.0,148.0,84.5,91.5,41.27,366,4,52328.0,...,0.58,2,4,0.0,1.0,0.5,0.5,0.58,2,4
1,20090312431273200,13-22,595.0,923.0,875.0,807.08,130.17,10492,13,904875.0,...,0.38,2,13,0.0,1.0,1.0,0.85,0.38,11,13
2,20090312431273200,5-12,303.0,459.0,377.0,374.67,60.53,7868,21,338929.0,...,0.3,2,21,0.0,1.0,1.0,0.9,0.3,19,21
3,20090312433251036,0-4,73.0,123.0,115.0,101.2,24.13,506,5,106194.0,...,0.0,0,5,1.0,1.0,1.0,1.0,0.0,5,5
4,20090312433251036,13-22,747.0,1860.0,1309.0,1352.53,345.64,89267,66,1455145.0,...,0.24,4,66,0.0,1.0,1.0,0.94,0.24,62,66


In [87]:
summary = summarize_data_info(numeric_features__object_hover)
summary

Data Shape: (65070, 65)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,21690,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,float64,3.1e-05,1563,49.0,595.0,303.0,0.0,19685.0
index__max,float64,3.1e-05,2289,148.0,923.0,459.0,36.0,20461.0
index__median,float64,3.1e-05,3712,84.5,875.0,377.0,14.0,20442.0
index__mean,float64,3.1e-05,35556,91.5,807.08,374.67,33.33,20275.27
index__std,float64,0.004718,17894,41.27,130.17,60.53,2.63,3080.01
index__sum,int64,0.0,24048,366,10492,7868,0.0,485563.0
index__count,int64,0.0,109,4,13,21,0.0,164.0
elapsed_time__min,float64,3.1e-05,63029,52328.0,904875.0,338929.0,29.0,1195528472.0


### `event_name` == `"observation_click"`

In [88]:
event_name = "observation_click"
# Keep only this event type; the now-constant event_name column is dropped.
observation_click__train_data = train_data[train_data["event_name"] == event_name].drop(columns=["event_name"]).reset_index(drop=True)

observation_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,13,13030,basic,0.0,,487.0,-98.5625,614.0,386.0,,I love these photos of me and Teddy!,photo,tunic.historicalsociety.closet,tunic.historicalsociety.closet.photo,0,0,1,0-4
1,20090312431273200,37,41297,basic,1.0,,-400.25,-117.5,179.0,405.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1,0-4
2,20090312431273200,108,109825,basic,3.0,,14.359375,-156.25,444.0,485.0,,Better check back later.,outtolunch,tunic.historicalsociety.stacks,tunic.historicalsociety.stacks.outtolunch,0,0,1,0-4
3,20090312431273200,112,117142,basic,3.0,,-7.492188,-61.71875,480.0,365.0,,Hmm. Button's still not working.,janitor,tunic.historicalsociety.basement,tunic.historicalsociety.basement.janitor,0,0,1,0-4
4,20090312431273200,256,300382,basic,6.0,,75.625,-32.0,419.0,362.0,,I bet the archivist could use this!,magnify,tunic.historicalsociety.frontdesk,tunic.historicalsociety.frontdesk.magnify,0,0,1,5-12


In [89]:
observation_click__train_data = recategorize_category_typed_fields(observation_click__train_data)

summary = summarize_data_info(observation_click__train_data)
summary

Data Shape: (212355, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,2501,13,37,108,1.0,19657.0
elapsed_time,int32,0.0,201238,13030,41297,109825,88.0,1988385266.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,22,0.0,1.0,3.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,18460,487.0,-400.25,14.359375,-1972.0,1173.0
room_coor_y,float16,0.0,12759,-98.5625,-117.5,-156.25,-885.0,450.5
screen_coor_x,float16,0.0,2065,614.0,179.0,444.0,0.0,1726.0
screen_coor_y,float16,0.0,1648,386.0,405.0,485.0,6.0,1250.0


In [90]:
observation_click__feature_fields_list = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]

In [91]:
numeric_features__observation_click = create_event_features(observation_click__train_data, observation_click__feature_fields_list).reset_index()
numeric_features__observation_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,13.0,112.0,72.5,67.5,50.07,270,4,13030.0,...,0.0,0,4,1.0,1.0,1.0,1.0,0.0,4,4
1,20090312431273200,13-22,608.0,773.0,725.0,702.0,84.87,2106,3,920474.0,...,0.0,0,3,1.0,1.0,1.0,1.0,0.0,3,3
2,20090312431273200,5-12,256.0,256.0,256.0,256.0,,256,1,300382.0,...,,0,1,1.0,1.0,1.0,1.0,,1,1
3,20090312433251036,0-4,29.0,31.0,30.0,30.0,1.41,60,2,36447.0,...,0.0,0,2,0.0,0.0,0.0,0.0,0.0,0,2
4,20090312433251036,13-22,620.0,1474.0,981.0,1038.2,410.68,5191,5,1217273.0,...,0.0,0,5,0.0,0.0,0.0,0.0,0.0,0,5


In [92]:
summary = summarize_data_info(numeric_features__observation_click)
summary

Data Shape: (70686, 72)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,float64,0.109145,1458,13.0,608.0,256.0,1.0,18378.0
index__max,float64,0.109145,2094,112.0,773.0,256.0,3.0,19657.0
index__median,float64,0.109145,2978,72.5,725.0,256.0,3.0,19593.0
index__mean,float64,0.109145,12759,67.5,702.0,256.0,3.0,19125.4
index__std,float64,0.2726,16515,50.07,84.87,,0.71,3101.72
index__sum,int64,0.0,7307,270,2106,256,0.0,338375.0
index__count,int64,0.0,32,4,3,1,0.0,53.0
elapsed_time__min,float64,0.109145,60720,13030.0,920474.0,300382.0,88.0,1746734366.0
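
In the summary above, `index__std` is missing more often than `index__min`, and `index__count` has a minimum of 0. One reading that fits these numbers: groups with no `observation_click` events have every statistic missing, while groups with exactly one event (such as row 2 of the feature table) additionally lose the sample standard deviation, which pandas computes with `ddof=1`. The toy example below is a sketch of that behaviour; the reindexing step used to create the empty group is an assumption, since the notebook does not show how its zero-count rows arise.

```python
import numpy as np
import pandas as pd

# Toy events: session 2 has no events at all in level group "5-12".
events = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "level_group": ["0-4", "0-4", "5-12", "0-4"],
    "index": [13, 112, 256, 29],
})

agg = (
    events.groupby(["session_id", "level_group"])["index"]
    .agg(["min", "std", "count"])
)

# Assumed step: expand to every (session, level_group) pair so that
# groups without events appear as rows with count 0.
full_grid = pd.MultiIndex.from_product(
    [[1, 2], ["0-4", "5-12"]], names=["session_id", "level_group"]
)
agg = agg.reindex(full_grid)
agg["count"] = agg["count"].fillna(0).astype(int)

no_events = agg["count"] == 0       # every statistic is NaN here
single_event = agg["count"] == 1    # only std is NaN here

print(agg)
print("missing min:", agg["min"].isna().sum())
print("missing std:", agg["std"].isna().sum())
```

Whether to fill these gaps (for example with 0 for `std` in single-event groups) or leave them as NaN for a model that handles missing values natively is a modelling decision the notebook does not settle at this point.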


### `event_name` == `"person_click"`

In [93]:
# Keep only person_click rows; event_name is constant after filtering, so drop it.
event_name = "person_click"
person_click__train_data = train_data[train_data["event_name"] == event_name].drop(labels=["event_name"], axis=1).reset_index(drop=True)

person_click__train_data.head()

Unnamed: 0,session_id,index,elapsed_time,name,level,page,room_coor_x,room_coor_y,screen_coor_x,screen_coor_y,hover_duration,text,fqid,room_fqid,text_fqid,fullscreen,hq,music,level_group
0,20090312431273200,1,1323,basic,0.0,,-414.0,-159.375,380.0,494.0,,"Whatcha doing over there, Jo?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
1,20090312431273200,2,831,basic,0.0,,-414.0,-159.375,380.0,494.0,,Just talking to Teddy.,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
2,20090312431273200,3,1147,basic,0.0,,-414.0,-159.375,380.0,494.0,,I gotta run to my meeting!,gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
3,20090312431273200,4,1863,basic,0.0,,-413.0,-159.375,381.0,494.0,,"Can I come, Gramps?",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4
4,20090312431273200,5,3423,basic,0.0,,-413.0,-157.375,381.0,492.0,,"Sure thing, Jo. Grab your notebook and come up...",gramps,tunic.historicalsociety.closet,tunic.historicalsociety.closet.gramps.intro_0_...,0,0,1,0-4


In [94]:
person_click__train_data = recategorize_category_typed_fields(person_click__train_data)

summary = summarize_data_info(person_click__train_data)
summary

Data Shape: (6052853, 19)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
index,int16,0.0,4878,1,2,3,0.0,20388.0
elapsed_time,int32,0.0,2686162,1323,831,1147,0.0,1988597956.0
name,category,0.0,1,basic,basic,basic,,
level,float16,0.0,22,0.0,0.0,0.0,0.0,22.0
page,float16,1.0,0,,,,,
room_coor_x,float16,0.0,24453,-414.0,-414.0,-414.0,-1335.0,1059.0
room_coor_y,float16,0.0,21702,-159.375,-159.375,-159.375,-839.5,527.5
screen_coor_x,float16,0.0,4630,380.0,380.0,380.0,0.0,1875.0
screen_coor_y,float16,0.0,3513,494.0,494.0,494.0,0.0,1388.0


In [95]:
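# Numeric fields to aggregate; page is entirely missing for person_click
# (perc_missing = 1.0 in the summary above), so it is left out.
person_click__feature_fields_list = [
    "index", "elapsed_time", "level", "room_coor_x", "room_coor_y",
    "screen_coor_x", "screen_coor_y", "fullscreen", "hq", "music"
]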

In [96]:
numeric_features__person_click = create_event_features(person_click__train_data, person_click__feature_fields_list).reset_index()
numeric_features__person_click.head()

Unnamed: 0,session_id,level_group,index__min,index__max,index__median,index__mean,index__std,index__sum,index__count,elapsed_time__min,...,hq__std,hq__sum,hq__count,music__min,music__max,music__median,music__mean,music__std,music__sum,music__count
0,20090312431273200,0-4,1,120,62.0,52.09,44.93,1146,22,831,...,0.0,0,22,1,1,1.0,1.0,0.0,22,22
1,20090312431273200,13-22,538,909,770.0,771.89,90.65,94942,123,858662,...,0.0,0,123,1,1,1.0,1.0,0.0,123,123
2,20090312431273200,5-12,176,440,311.0,322.3,75.29,33519,104,222334,...,0.0,0,104,1,1,1.0,1.0,0.0,104,104
3,20090312433251036,0-4,1,88,45.5,44.94,40.31,809,18,218,...,0.0,0,18,0,0,0.0,0.0,0.0,0,18
4,20090312433251036,13-22,616,1823,1521.0,1435.01,307.43,208076,145,1212275,...,0.0,0,145,0,0,0.0,0.0,0.0,0,145


In [97]:
summary = summarize_data_info(numeric_features__person_click)
summary

Data Shape: (70686, 72)


Unnamed: 0,data_type,perc_missing,n_unique,first_value,second_value,third_value,min,max
session_id,int64,0.0,23562,20090312431273200,20090312431273200,20090312431273200,2.00903124312732e+16,2.2100221145014656e+16
level_group,category,0.0,3,0-4,13-22,5-12,,
index__min,int64,0.0,1363,1,538,176,0.0,18373.0
index__max,int64,0.0,2383,120,909,440,9.0,20388.0
index__median,float64,0.0,3492,62.0,770.0,311.0,5.0,19660.5
index__mean,float64,0.0,39654,52.09,771.89,322.3,5.0,19706.88
index__std,float64,0.0,16315,44.93,90.65,75.29,2.74,3066.1
index__sum,int64,0.0,41111,1146,94942,33519,45.0,3665479.0
index__count,int64,0.0,243,22,123,104,9.0,492.0
elapsed_time__min,int64,0.0,53858,831,858662,222334,0.0,1194396086.0
