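## Purpose

In this competition, students wrote essays while their detailed writing behavior (typing, deleting, moving text, and so on) was logged. It is a tabular competition in which the essay score is predicted from those logs.  
  
This notebook walks through the whole pipeline, from preprocessing through training to submission.  
It is based on the following notebook:  
[https://www.kaggle.com/code/alexryzhkov/lgbm-and-nn-on-sentences/notebook](https://www.kaggle.com/code/alexryzhkov/lgbm-and-nn-on-sentences/notebook)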

## 1. Installing LightAutoML
Before running this cell, attach the [LightAutoML 038 dependecies](https://www.kaggle.com/code/alexryzhkov/lightautoml-038-dependecies) dataset to the notebook via Add Data.

In [1]:
!pip install --no-index -U --find-links=/kaggle/input/lightautoml-038-dependecies lightautoml==0.3.8
!pip install --no-index -U --find-links=/kaggle/input/lightautoml-038-dependecies pandas==2.0.3

Looking in links: /kaggle/input/lightautoml-038-dependecies
Processing /kaggle/input/lightautoml-038-dependecies/lightautoml-0.3.8-py3-none-any.whl
Processing /kaggle/input/lightautoml-038-dependecies/AutoWoE-1.3.2-py3-none-any.whl (from lightautoml==0.3.8)
Processing /kaggle/input/lightautoml-038-dependecies/cmaes-0.10.0-py3-none-any.whl (from lightautoml==0.3.8)
Processing /kaggle/input/lightautoml-038-dependecies/joblib-1.2.0-py3-none-any.whl (from lightautoml==0.3.8)
Processing /kaggle/input/lightautoml-038-dependecies/json2html-1.3.0.tar.gz (from lightautoml==0.3.8)
  Preparing metadata (setup.py) ... done
Processing /kaggle/input/lightautoml-038-dependecies/lightgbm-3.2.1-py3-none-manylinux1_x86_64.whl (from lightautoml==0.3.8)
Processing /kaggle/input/lightautoml-038-dependecies/pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (from lightautoml==0.3.8)
Processing /kaggle/input/lightautoml-038-dependecies/poetry_core-1.8.1-py3-none-any.whl (from

## 2. Import

In [2]:
%matplotlib inline
import gc
import os
import itertools
import pickle
import re
import time
from random import choice, choices
from functools import reduce
from tqdm import tqdm
from itertools import cycle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from scipy import stats
from scipy.stats import skew, kurtosis
from sklearn import metrics, model_selection, preprocessing, linear_model, ensemble, decomposition, tree
import lightgbm as lgb
import copy

## 3. Loading the data
The competition provides the following four input files.
- train_logs.csv : keystroke logs (training data)
- train_scores.csv : essay scores (training data)
- test_logs.csv : keystroke logs (test data)
- sample_submission.csv : CSV template for the submission

In [3]:
INPUT_DIR = '../input/linking-writing-processes-to-writing-quality'
train_logs = pd.read_csv(f'{INPUT_DIR}/train_logs.csv')
train_scores = pd.read_csv(f'{INPUT_DIR}/train_scores.csv')
test_logs = pd.read_csv(f'{INPUT_DIR}/test_logs.csv')
ss_df = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')

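Next, check an overview of the loaded data.  
The shapes and the first five rows show that train_logs and test_logs store, for each event, the student id, the start and end times of the action, and the type of activity (Input, Remove/Cut, and so on).  
train_scores stores the student id and the essay score.  
The submission uses the same two columns: the student id and the predicted essay score.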

In [4]:
print("train_logs shape : ", train_logs.shape)
print("train_scores shape : ", train_scores.shape)
print("test_logs shape : ", test_logs.shape)
print("sample_submission shape : ", ss_df.shape)

train_logs shape :  (8405898, 11)
train_scores shape :  (2471, 2)
test_logs shape :  (6, 11)
sample_submission shape :  (3, 2)


In [5]:
display(train_logs.head())
display(train_scores.head())

Unnamed: 0,id,event_id,down_time,up_time,action_time,activity,down_event,up_event,text_change,cursor_position,word_count
0,001519c8,1,4526,4557,31,Nonproduction,Leftclick,Leftclick,NoChange,0,0
1,001519c8,2,4558,4962,404,Nonproduction,Leftclick,Leftclick,NoChange,0,0
2,001519c8,3,106571,106571,0,Nonproduction,Shift,Shift,NoChange,0,0
3,001519c8,4,106686,106777,91,Input,q,q,q,1,1
4,001519c8,5,107196,107323,127,Input,q,q,q,2,1


Unnamed: 0,id,score
0,001519c8,3.5
1,0022f953,3.5
2,0042269b,6.0
3,0059420b,2.0
4,0075873a,4.0


In [6]:
display(test_logs.head())

Unnamed: 0,id,event_id,down_time,up_time,action_time,activity,down_event,up_event,text_change,cursor_position,word_count
0,0000aaaa,1,338433,338518,85,Input,Space,Space,,0,0
1,0000aaaa,2,760073,760160,87,Input,Space,Space,,1,0
2,2222bbbb,1,711956,712023,67,Input,q,q,q,0,1
3,2222bbbb,2,290502,290548,46,Input,q,q,q,1,1
4,4444cccc,1,635547,635641,94,Input,Space,Space,,0,0


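Next, load the reconstructed essays.  
This dataset was built by replaying the keystroke logs in train_logs to reconstruct the essay each student wrote. The characters actually typed are replaced with "q", but important information such as the essay length can still be recovered.  

Reference:  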
[https://www.kaggle.com/code/hiarsl/feature-engineering-sentence-paragraph-features](https://www.kaggle.com/code/hiarsl/feature-engineering-sentence-paragraph-features)

In [7]:
train_essays = pd.read_csv('../input/writing-quality-challenge-constructed-essays/train_essays_02.csv')
train_essays.index = train_essays["Unnamed: 0"]
train_essays.index.name = None
train_essays.drop(columns=["Unnamed: 0"], inplace=True)
train_essays.head()

Unnamed: 0,essay
001519c8,qqqqqqqqq qq qqqqq qq qqqq qqqq. qqqqqq qqq q...
0022f953,"qqqq qq qqqqqqqqqqq ? qq qq qqq qqq qqq, qqqqq..."
0042269b,qqqqqqqqqqq qq qqqqq qqqqqqqqq qq qqqqqqqqqqq ...
0059420b,qq qqqqqqq qqqqqq qqqqqqqqqqqqq qqqq q qqqq qq...
0075873a,"qqqqqqqqqqq qq qqq qqqqq qq qqqqqqqqqq, qqq qq..."


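The essays for the test data are reconstructed as well.  
For the test data, the essays are rebuilt from test_logs with the following functions.
- processingInputs function : rebuilds a small piece of text from each keystroke record. It is described as being called from getEssays, but the code in this notebook defines no separate function and inlines this logic in getEssays.
- getEssays function : reconstructs the essays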

In [8]:
def getEssays(df):
    """
    Reconstruct essays from keystroke logs.
    [input]
     df(pd.DataFrame) : data frame of keystroke logs
    [output]
     essayFrame(pd.DataFrame) : data frame of reconstructed essays, indexed by id
    """
    textInputDf = df[['id', 'activity', 'cursor_position', 'text_change']]
    textInputDf = textInputDf[textInputDf.activity != 'Nonproduction']
    valCountsArr = textInputDf['id'].value_counts(sort=False).values
    lastIndex = 0
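    # Explicit dtype avoids the default-dtype warning some pandas versions emit for an empty Series
    essaySeries = pd.Series(dtype=object)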
    for index, valCount in enumerate(valCountsArr):
        currTextInput = textInputDf[['activity', 'cursor_position', 'text_change']].iloc[lastIndex : lastIndex + valCount]
        lastIndex += valCount
        essayText = ""
        for Input in currTextInput.values:
            if Input[0] == 'Replace':
                replaceTxt = Input[2].split(' => ')
                essayText = essayText[:Input[1] - len(replaceTxt[1])] + replaceTxt[1] +\
                essayText[Input[1] - len(replaceTxt[1]) + len(replaceTxt[0]):]
                continue
            if Input[0] == 'Paste':
                essayText = essayText[:Input[1] - len(Input[2])] + Input[2] + essayText[Input[1] - len(Input[2]):]
                continue
            if Input[0] == 'Remove/Cut':
                essayText = essayText[:Input[1]] + essayText[Input[1] + len(Input[2]):]
                continue
            if "M" in Input[0]:
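                # Move activities are parsed as "Move From [a, b] To [c, d]";
                # the first 10 characters ("Move From ") are dropped before splitting
                croppedTxt = Input[0][10:]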
                splitTxt = croppedTxt.split(' To ')
                valueArr = [item.split(', ') for item in splitTxt]
                moveData = (int(valueArr[0][0][1:]), 
                            int(valueArr[0][1][:-1]), 
                            int(valueArr[1][0][1:]), 
                            int(valueArr[1][1][:-1]))
                if moveData[0] != moveData[2]:
                    if moveData[0] < moveData[2]:
                        essayText = essayText[:moveData[0]] + essayText[moveData[1]:moveData[3]] +\
                        essayText[moveData[0]:moveData[1]] + essayText[moveData[3]:]
                    else:
                        essayText = essayText[:moveData[2]] + essayText[moveData[0]:moveData[1]] +\
                        essayText[moveData[2]:moveData[0]] + essayText[moveData[1]:]
                continue
            essayText = essayText[:Input[1] - len(Input[2])] + Input[2] + essayText[Input[1] - len(Input[2]):]
        essaySeries[index] = essayText
    essaySeries.index =  textInputDf['id'].unique()
    return pd.DataFrame(essaySeries, columns=['essay'])

In [9]:
# Reconstruct the essays for the test dataset
test_essays = getEssays(test_logs)
test_essays.head()

Unnamed: 0,essay
0000aaaa,
2222bbbb,qq
4444cccc,q
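
To make the index arithmetic in getEssays easier to follow, here is a minimal sketch that replays a toy event list through its two simplest branches. The helper name `apply_event` and the events themselves are hypothetical and exist only for illustration. The slicing mirrors the Input/Paste and Remove/Cut branches above, which treat cursor_position as the cursor position after the event.

```python
def apply_event(text, activity, cursor_position, text_change):
    """Apply one event to the text (hypothetical helper, not part of the notebook)."""
    if activity == 'Remove/Cut':
        # The removed span starts at the cursor position recorded after the deletion.
        return text[:cursor_position] + text[cursor_position + len(text_change):]
    # Input / Paste: the inserted span ends at the recorded cursor position.
    start = cursor_position - len(text_change)
    return text[:start] + text_change + text[start:]


toy_events = [
    ('Input', 1, 'q'),       # "q"
    ('Input', 2, 'q'),       # "qq"
    ('Input', 3, ' '),       # "qq "
    ('Input', 4, 'q'),       # "qq q"
    ('Remove/Cut', 3, 'q'),  # "qq "
    ('Input', 2, 'q'),       # "qqq " (typed in the middle of the text)
]

essay_text = ""
for activity, cursor_position, text_change in toy_events:
    essay_text = apply_event(essay_text, activity, cursor_position, text_change)

print(repr(essay_text))
```

The Replace and Move branches follow the same idea, except that they first parse the old and new text or the source and target ranges out of the logged strings.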


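## 4. Feature engineering

This section builds features from each student's keystroke logs in preparation for assembling the training data.  
The first step derives descriptors of each essay (such as sentence lengths and their averages) from the reconstructed essays.  
Two kinds of helpers are defined here: functions that return the first and third quartiles, and functions that extract sentence information from the reconstructed text.  
- q1 function : returns the first quartile (25th percentile)
- q3 function : returns the third quartile (75th percentile)
- split_essays_into_sentences function : splits each reconstructed essay into sentences and returns each sentence's length in characters and its word count
- compute_sentence_aggregations function : aggregates the per-sentence values into per-essay statistics such as the sentence count and the mean sentence length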
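
As a small illustration of how q1 and q3 behave inside a grouped aggregation, the sketch below applies them to a toy table. The data frame `toy` and its values are made up for this example. pandas names the resulting columns after the function names, so flattening the column MultiIndex gives names such as `sent_len_q1`.

```python
import pandas as pd


def q1(x):
    return x.quantile(0.25)


def q3(x):
    return x.quantile(0.75)


# Toy per-sentence lengths for two hypothetical essays
toy = pd.DataFrame({
    'id': ['a', 'a', 'a', 'a', 'b', 'b'],
    'sent_len': [10, 20, 30, 40, 5, 15],
})

toy_agg = toy.groupby(['id']).agg(['mean', q1, q3])
# Flatten the (column, aggregation) MultiIndex the same way the notebook does
toy_agg.columns = ['_'.join(x) for x in toy_agg.columns]
print(toy_agg)
```

With the default linear interpolation, essay `a` gets a first quartile of 17.5 and a third quartile of 32.5.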

In [10]:
# Functions returning the first and third quartiles
def q1(x):
    return x.quantile(0.25)
def q3(x):
    return x.quantile(0.75)

In [11]:
AGGREGATIONS = ['count', 'mean', 'std', 'min', 'max', 'first', 'last', 'sem', q1, 'median', q3, 'skew', pd.DataFrame.kurt, 'sum']

def split_essays_into_sentences(df):
    """
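    Build a per-sentence data frame from the reconstructed essays.
    [input]
     df(pd.DataFrame) : data frame of reconstructed essays, indexed by id
    [output]
     essay_df(pd.DataFrame) : one row per sentence, with its length and word count
     
    Each essay is split on sentence-ending punctuation (".", "?" and "!").
    The character length and word count of each sentence are added as columns.
    Note: df itself is modified in place (the id and sent columns are added to it).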
    """
    essay_df = df
    essay_df['id'] = essay_df.index
    essay_df['sent'] = essay_df['essay'].apply(lambda x: re.split('\\.|\\?|\\!',x))
    essay_df = essay_df.explode('sent')
    essay_df['sent'] = essay_df['sent'].apply(lambda x: x.replace('\n','').strip())
    # Number of characters in sentences
    essay_df['sent_len'] = essay_df['sent'].apply(lambda x: len(x))
    # Number of words in sentences
    essay_df['sent_word_count'] = essay_df['sent'].apply(lambda x: len(x.split(' ')))
    essay_df = essay_df[essay_df.sent_len!=0].reset_index(drop=True)
    return essay_df

def compute_sentence_aggregations(df):
    """
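    Compute per-essay statistics from the per-sentence data frame.
    [input]
     df(pd.DataFrame) : per-sentence data frame from split_essays_into_sentences
    [output]
     sent_agg_df(pd.DataFrame) : one row per id with the aggregated sentence statistics
     
    For each id, the aggregations listed in AGGREGATIONS (mean, standard deviation, and so on)
    are applied to the sentence lengths and the sentence word counts.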
    """
    sent_agg_df = pd.concat(
        [df[['id','sent_len']].groupby(['id']).agg(AGGREGATIONS), df[['id','sent_word_count']].groupby(['id']).agg(AGGREGATIONS)], axis=1
    )
    sent_agg_df.columns = ['_'.join(x) for x in sent_agg_df.columns]
    sent_agg_df['id'] = sent_agg_df.index
    sent_agg_df = sent_agg_df.reset_index(drop=True)
    sent_agg_df.drop(columns=["sent_word_count_count"], inplace=True)
    sent_agg_df = sent_agg_df.rename(columns={"sent_len_count":"sent_count"})
    return sent_agg_df

In [12]:
# Sentence features for the train and test datasets
train_sent_df = split_essays_into_sentences(train_essays)
train_sent_agg_df = compute_sentence_aggregations(train_sent_df)
test_sent_df = split_essays_into_sentences(test_essays)
test_sent_agg_df = compute_sentence_aggregations(test_sent_df)

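Running these functions yields, for each essay, summary statistics over its sentences as features. An essay with no non-empty sentence produces no row, which is why the test output below has rows for only two of the three test ids.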
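
The splitting step can be traced on a single string. The sketch below is a plain-Python walk-through of what split_essays_into_sentences does per essay, using a made-up essay `toy_essay`; it is an illustration, not a replacement for the pandas version.

```python
import re

toy_essay = "qqq qq. qqqq q qq? q!\nqq qqq"

# Same pattern as split_essays_into_sentences: split on ".", "?" or "!"
sentences = re.split('\\.|\\?|\\!', toy_essay)
# Remove newlines and surrounding whitespace, then drop empty sentences
sentences = [s.replace('\n', '').strip() for s in sentences]
sentences = [s for s in sentences if len(s) != 0]

sent_len = [len(s) for s in sentences]
sent_word_count = [len(s.split(' ')) for s in sentences]
print(sentences, sent_len, sent_word_count)
```

Because the punctuation marks are consumed by the split, they do not count toward the sentence length.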

In [13]:
display(train_sent_agg_df.head())
display(test_sent_agg_df.head())

Unnamed: 0,sent_count,sent_len_mean,sent_len_std,sent_len_min,sent_len_max,sent_len_first,sent_len_last,sent_len_sem,sent_len_q1,sent_len_median,...,sent_word_count_first,sent_word_count_last,sent_word_count_sem,sent_word_count_q1,sent_word_count_median,sent_word_count_q3,sent_word_count_skew,sent_word_count_kurt,sent_word_count_sum,id
0,14,106.142857,41.12805,31,196,31,89,10.991934,75.5,119.5,...,6,16,1.736577,12.25,21.0,22.0,-0.506007,-0.526754,256,001519c8
1,15,107.666667,64.713287,19,226,19,143,16.708899,56.5,92.0,...,3,30,3.269872,12.0,20.0,31.0,0.391857,-0.935036,325,0022f953
2,19,133.842105,33.480115,73,189,139,161,7.680865,108.0,139.0,...,21,26,1.207599,17.5,21.0,26.5,-0.24256,-1.171619,408,0042269b
3,13,86.846154,33.195999,39,144,99,80,9.206914,62.0,80.0,...,17,14,1.800997,11.0,15.0,18.0,0.656055,-0.538051,208,0059420b
4,16,86.8125,44.09417,22,182,75,22,11.023543,60.0,74.0,...,11,3,2.166927,11.0,12.5,18.25,1.148513,0.888421,255,0075873a


Unnamed: 0,sent_count,sent_len_mean,sent_len_std,sent_len_min,sent_len_max,sent_len_first,sent_len_last,sent_len_sem,sent_len_q1,sent_len_median,...,sent_word_count_first,sent_word_count_last,sent_word_count_sem,sent_word_count_q1,sent_word_count_median,sent_word_count_q3,sent_word_count_skew,sent_word_count_kurt,sent_word_count_sum,id
0,1,2.0,,2,2,2,2,,2.0,2.0,...,1,1,,1.0,1.0,1.0,,,1,2222bbbb
1,1,1.0,,1,1,1,1,,1.0,1.0,...,1,1,,1.0,1.0,1.0,,,1,4444cccc


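Next, extract paragraph-level features from the essays.  
The functions below split each essay on newlines (\n) and compute statistics of the paragraphs (mean, standard deviation, and so on).  
- split_essays_into_paragraphs function : splits each essay into paragraphs and returns each paragraph's length in characters and its word count
- compute_paragraph_aggregations function : aggregates the per-paragraph values into per-essay statistics such as the paragraph count and the mean paragraph length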
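
One detail worth knowing is how words are counted. The notebook uses `split(' ')`, which counts the empty strings between consecutive spaces as words. The sketch below contrasts this with `split()` on a made-up essay; the alternative is shown only for comparison and is not what the notebook does. The same effect seems to explain why the test output in this section reports a word count of 3 for a paragraph that is only 2 characters long.

```python
toy_essay = "qq qqq\n\n  \nq qq  qqq"

# Split on newlines and drop empty paragraphs, as split_essays_into_paragraphs does
paragraphs = [p for p in toy_essay.split('\n') if len(p) != 0]

# split(' ') as used in the notebook: consecutive spaces yield empty "words"
word_count_space = [len(p.split(' ')) for p in paragraphs]
# split() with no argument: a run of whitespace is treated as one separator
word_count_any = [len(p.split()) for p in paragraphs]

print(paragraphs)
print(word_count_space, word_count_any)
```

A paragraph made of two spaces therefore counts as three words under `split(' ')` and zero words under `split()`.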

In [14]:
def split_essays_into_paragraphs(df):
    essay_df = df
    essay_df['id'] = essay_df.index
    essay_df['paragraph'] = essay_df['essay'].apply(lambda x: x.split('\n'))
    essay_df = essay_df.explode('paragraph')
    # Number of characters in paragraphs
    essay_df['paragraph_len'] = essay_df['paragraph'].apply(lambda x: len(x)) 
    # Number of words in paragraphs
    essay_df['paragraph_word_count'] = essay_df['paragraph'].apply(lambda x: len(x.split(' ')))
    essay_df = essay_df[essay_df.paragraph_len!=0].reset_index(drop=True)
    return essay_df

def compute_paragraph_aggregations(df):
    paragraph_agg_df = pd.concat(
        [df[['id','paragraph_len']].groupby(['id']).agg(AGGREGATIONS), df[['id','paragraph_word_count']].groupby(['id']).agg(AGGREGATIONS)], axis=1
    ) 
    paragraph_agg_df.columns = ['_'.join(x) for x in paragraph_agg_df.columns]
    paragraph_agg_df['id'] = paragraph_agg_df.index
    paragraph_agg_df = paragraph_agg_df.reset_index(drop=True)
    paragraph_agg_df.drop(columns=["paragraph_word_count_count"], inplace=True)
    paragraph_agg_df = paragraph_agg_df.rename(columns={"paragraph_len_count":"paragraph_count"})
    return paragraph_agg_df

In [15]:
# Paragraph features for the train and test datasets
train_paragraph_df = split_essays_into_paragraphs(train_essays)
train_paragraph_agg_df = compute_paragraph_aggregations(train_paragraph_df)
test_paragraph_df = split_essays_into_paragraphs(test_essays)
test_paragraph_agg_df = compute_paragraph_aggregations(test_paragraph_df)

Running these functions yields, for each essay, summary statistics over its paragraphs as features.

In [16]:
display(train_paragraph_agg_df.head())
display(test_paragraph_agg_df.head())

Unnamed: 0,paragraph_count,paragraph_len_mean,paragraph_len_std,paragraph_len_min,paragraph_len_max,paragraph_len_first,paragraph_len_last,paragraph_len_sem,paragraph_len_q1,paragraph_len_median,...,paragraph_word_count_first,paragraph_word_count_last,paragraph_word_count_sem,paragraph_word_count_q1,paragraph_word_count_median,paragraph_word_count_q3,paragraph_word_count_skew,paragraph_word_count_kurt,paragraph_word_count_sum,id
0,3,508.0,134.208793,390,654,390,480,77.485483,435.0,480.0,...,71,86,11.976829,78.5,86.0,99.0,0.770543,,269,001519c8
1,6,278.166667,98.554384,176,462,240,284,40.234659,228.75,261.0,...,53,60,8.316316,47.75,56.5,62.25,1.299614,2.342703,355,0022f953
2,6,429.5,101.087586,296,568,491,296,41.268834,356.75,444.5,...,79,45,6.926599,55.5,73.5,78.75,-0.502908,-1.536764,410,0042269b
3,3,384.0,56.471232,347,449,347,356,32.603681,351.5,356.0,...,62,65,5.897269,63.5,65.0,73.0,1.565482,,208,0059420b
4,5,283.4,232.336609,23,627,351,23,103.90409,124.0,292.0,...,61,3,18.706683,26.0,52.0,61.0,0.68676,0.722916,256,0075873a


Unnamed: 0,paragraph_count,paragraph_len_mean,paragraph_len_std,paragraph_len_min,paragraph_len_max,paragraph_len_first,paragraph_len_last,paragraph_len_sem,paragraph_len_q1,paragraph_len_median,...,paragraph_word_count_first,paragraph_word_count_last,paragraph_word_count_sem,paragraph_word_count_q1,paragraph_word_count_median,paragraph_word_count_q3,paragraph_word_count_skew,paragraph_word_count_kurt,paragraph_word_count_sum,id
0,1,2.0,,2,2,2,2,,2.0,2.0,...,3,3,,3.0,3.0,3.0,,,3,0000aaaa
1,1,2.0,,2,2,2,2,,2.0,2.0,...,1,1,,1.0,1.0,1.0,,,1,2222bbbb
2,1,2.0,,2,2,2,2,,2.0,2.0,...,2,2,,2.0,2.0,2.0,,,2,4444cccc


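The third group of features comes from the Preprocessor class, which extracts statistics directly from the keystroke logs. The three count functions do not return raw counts: each count is converted to 1 + log(count / total) and multiplied by an IDF-style weight. The class produces the following features.
- Lag features : within each id, the log is shifted by several gaps with shift to obtain differences in time, cursor position, and word count between events.
- Counts of each activity type such as Input and Remove/Cut (activity_counts function)
- Counts of each down_event and up_event key type recorded while the student wrote the essay (event_counts function)
- Counts of each text_change value, such as typed or deleted characters (text_change_counts function)
- Number of punctuation events (match_punctuations function)
- Statistics of the typed words (get_input_words function)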
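
The lag features can be pictured on a toy log. The sketch below, with made-up ids and timestamps, computes the gap between the start of an event and the end of the previous event of the same id, which is what `action_time_gap1` holds in make_feats.

```python
import pandas as pd

toy_logs = pd.DataFrame({
    'id':        ['a', 'a', 'a', 'b', 'b'],
    'down_time': [100, 250, 400, 50, 90],
    'up_time':   [180, 300, 480, 70, 130],
})

gap = 1
# shift() is applied per id, so the first event of each id has no predecessor
toy_logs['up_time_shift1'] = toy_logs.groupby('id')['up_time'].shift(gap)
toy_logs['action_time_gap1'] = toy_logs['down_time'] - toy_logs['up_time_shift1']
print(toy_logs)
```

The first row of every id is NaN because nothing precedes it; the grouped aggregations skip those values by default.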

In [17]:
# The following code comes almost entirely from Abdullah's notebook: https://www.kaggle.com/code/abdullahmeda/enter-ing-the-timeseries-space-sec-3-new-aggs
# Abdullah's code is based on work shared in previous notebooks (e.g., https://www.kaggle.com/code/hengzheng/link-writing-simple-lgbm-baseline)

from collections import defaultdict

class Preprocessor:
    
    def __init__(self, seed):
        self.seed = seed
        
        self.activities = ['Input', 'Remove/Cut', 'Nonproduction', 'Replace', 'Paste']
        self.events = ['q', 'Space', 'Backspace', 'Shift', 'ArrowRight', 'Leftclick', 'ArrowLeft', '.', ',', 
              'ArrowDown', 'ArrowUp', 'Enter', 'CapsLock', "'", 'Delete', 'Unidentified']
        self.text_changes = ['q', ' ', 'NoChange', '.', ',', '\n', "'", '"', '-', '?', ';', '=', '/', '\\', ':']
        self.punctuations = ['"', '.', ',', "'", '-', ';', ':', '?', '!', '<', '>', '/',
                        '@', '#', '$', '%', '^', '&', '*', '(', ')', '_', '+']
        self.gaps = [1, 2, 3, 5, 10, 20, 50, 100]
        
        self.idf = defaultdict(float)
    
    def activity_counts(self, df):
        tmp_df = df.groupby('id').agg({'activity': list}).reset_index()
        ret = list()
        for li in tqdm(tmp_df['activity'].values):
            items = list(Counter(li).items())
            di = dict()
            for k in self.activities:
                di[k] = 0
            for item in items:
                k, v = item[0], item[1]
                if k in di:
                    di[k] = v
            ret.append(di)
        ret = pd.DataFrame(ret)
        cols = [f'activity_{i}_count' for i in range(len(ret.columns))]
        ret.columns = cols

        cnts = ret.sum(1)

        for col in cols:
            if col in self.idf.keys():
                idf = self.idf[col]
            else:
                idf = df.shape[0] / (ret[col].sum() + 1)
                idf = np.log(idf)
                self.idf[col] = idf

            ret[col] = 1 + np.log(ret[col] / cnts)
            ret[col] *= idf

        return ret

    def event_counts(self, df, colname):
        tmp_df = df.groupby('id').agg({colname: list}).reset_index()
        ret = list()
        for li in tqdm(tmp_df[colname].values):
            items = list(Counter(li).items())
            di = dict()
            for k in self.events:
                di[k] = 0
            for item in items:
                k, v = item[0], item[1]
                if k in di:
                    di[k] = v
            ret.append(di)
        ret = pd.DataFrame(ret)
        cols = [f'{colname}_{i}_count' for i in range(len(ret.columns))]
        ret.columns = cols

        cnts = ret.sum(1)

        for col in cols:
            if col in self.idf.keys():
                idf = self.idf[col]
            else:
                idf = df.shape[0] / (ret[col].sum() + 1)
                idf = np.log(idf)
                self.idf[col] = idf
            
            ret[col] = 1 + np.log(ret[col] / cnts)
            ret[col] *= idf

        return ret

    def text_change_counts(self, df):
        tmp_df = df.groupby('id').agg({'text_change': list}).reset_index()
        ret = list()
        for li in tqdm(tmp_df['text_change'].values):
            items = list(Counter(li).items())
            di = dict()
            for k in self.text_changes:
                di[k] = 0
            for item in items:
                k, v = item[0], item[1]
                if k in di:
                    di[k] = v
            ret.append(di)
        ret = pd.DataFrame(ret)
        cols = [f'text_change_{i}_count' for i in range(len(ret.columns))]
        ret.columns = cols

        cnts = ret.sum(1)

        for col in cols:
            if col in self.idf.keys():
                idf = self.idf[col]
            else:
                idf = df.shape[0] / (ret[col].sum() + 1)
                idf = np.log(idf)
                self.idf[col] = idf
            
            ret[col] = 1 + np.log(ret[col] / cnts)
            ret[col] *= idf
            
        return ret

    def match_punctuations(self, df):
        tmp_df = df.groupby('id').agg({'down_event': list}).reset_index()
        ret = list()
        for li in tqdm(tmp_df['down_event'].values):
            cnt = 0
            items = list(Counter(li).items())
            for item in items:
                k, v = item[0], item[1]
                if k in self.punctuations:
                    cnt += v
            ret.append(cnt)
        ret = pd.DataFrame({'punct_cnt': ret})
        return ret

    def get_input_words(self, df):
        tmp_df = df[(~df['text_change'].str.contains('=>'))&(df['text_change'] != 'NoChange')].reset_index(drop=True)
        tmp_df = tmp_df.groupby('id').agg({'text_change': list}).reset_index()
        tmp_df['text_change'] = tmp_df['text_change'].apply(lambda x: ''.join(x))
        tmp_df['text_change'] = tmp_df['text_change'].apply(lambda x: re.findall(r'q+', x))
        tmp_df['input_word_count'] = tmp_df['text_change'].apply(len)
        tmp_df['input_word_length_mean'] = tmp_df['text_change'].apply(lambda x: np.mean([len(i) for i in x] if len(x) > 0 else 0))
        tmp_df['input_word_length_max'] = tmp_df['text_change'].apply(lambda x: np.max([len(i) for i in x] if len(x) > 0 else 0))
        tmp_df['input_word_length_std'] = tmp_df['text_change'].apply(lambda x: np.std([len(i) for i in x] if len(x) > 0 else 0))
        tmp_df.drop(['text_change'], axis=1, inplace=True)
        return tmp_df
    
    def make_feats(self, df):
        
        feats = pd.DataFrame({'id': df['id'].unique().tolist()})
        
        print("Engineering time data")
        for gap in self.gaps:
            df[f'up_time_shift{gap}'] = df.groupby('id')['up_time'].shift(gap)
            df[f'action_time_gap{gap}'] = df['down_time'] - df[f'up_time_shift{gap}']
        df.drop(columns=[f'up_time_shift{gap}' for gap in self.gaps], inplace=True)

        print("Engineering cursor position data")
        for gap in self.gaps:
            df[f'cursor_position_shift{gap}'] = df.groupby('id')['cursor_position'].shift(gap)
            df[f'cursor_position_change{gap}'] = df['cursor_position'] - df[f'cursor_position_shift{gap}']
            df[f'cursor_position_abs_change{gap}'] = np.abs(df[f'cursor_position_change{gap}'])
        df.drop(columns=[f'cursor_position_shift{gap}' for gap in self.gaps], inplace=True)

        print("Engineering word count data")
        for gap in self.gaps:
            df[f'word_count_shift{gap}'] = df.groupby('id')['word_count'].shift(gap)
            df[f'word_count_change{gap}'] = df['word_count'] - df[f'word_count_shift{gap}']
            df[f'word_count_abs_change{gap}'] = np.abs(df[f'word_count_change{gap}'])
        df.drop(columns=[f'word_count_shift{gap}' for gap in self.gaps], inplace=True)
        
        print("Engineering statistical summaries for features")
        feats_stat = [
            ('event_id', ['max']),
            ('up_time', ['max']),
            ('action_time', ['max', 'min', 'mean', 'std', 'quantile', 'sem', 'sum', 'skew', pd.DataFrame.kurt]),
            ('activity', ['nunique']),
            ('down_event', ['nunique']),
            ('up_event', ['nunique']),
            ('text_change', ['nunique']),
            ('cursor_position', ['nunique', 'max', 'quantile', 'sem', 'mean']),
            ('word_count', ['nunique', 'max', 'quantile', 'sem', 'mean'])]
        for gap in self.gaps:
            feats_stat.extend([
                (f'action_time_gap{gap}', ['max', 'min', 'mean', 'std', 'quantile', 'sem', 'sum', 'skew', pd.DataFrame.kurt]),
                (f'cursor_position_change{gap}', ['max', 'mean', 'std', 'quantile', 'sem', 'sum', 'skew', pd.DataFrame.kurt]),
                (f'word_count_change{gap}', ['max', 'mean', 'std', 'quantile', 'sem', 'sum', 'skew', pd.DataFrame.kurt])
            ])
        
        pbar = tqdm(feats_stat)
        for item in pbar:
            colname, methods = item[0], item[1]
            for method in methods:
                pbar.set_postfix()
                if isinstance(method, str):
                    method_name = method
                else:
                    method_name = method.__name__
                pbar.set_postfix(column=colname, method=method_name)
                tmp_df = df.groupby(['id']).agg({colname: method}).reset_index().rename(columns={colname: f'{colname}_{method_name}'})
                feats = feats.merge(tmp_df, on='id', how='left')

        print("Engineering activity counts data")
        tmp_df = self.activity_counts(df)
        feats = pd.concat([feats, tmp_df], axis=1)
        
        print("Engineering event counts data")
        tmp_df = self.event_counts(df, 'down_event')
        feats = pd.concat([feats, tmp_df], axis=1)
        tmp_df = self.event_counts(df, 'up_event')
        feats = pd.concat([feats, tmp_df], axis=1)
        
        print("Engineering text change counts data")
        tmp_df = self.text_change_counts(df)
        feats = pd.concat([feats, tmp_df], axis=1)
        
        print("Engineering punctuation counts data")
        tmp_df = self.match_punctuations(df)
        feats = pd.concat([feats, tmp_df], axis=1)

        print("Engineering input words data")
        tmp_df = self.get_input_words(df)
        feats = pd.merge(feats, tmp_df, on='id', how='left')

        print("Engineering ratios data")
        feats['word_time_ratio'] = feats['word_count_max'] / feats['up_time_max']
        feats['word_event_ratio'] = feats['word_count_max'] / feats['event_id_max']
        feats['event_time_ratio'] = feats['event_id_max']  / feats['up_time_max']
        feats['idle_time_ratio'] = feats['action_time_gap1_sum'] / feats['up_time_max']

        return feats

In [18]:
preprocessor = Preprocessor(seed=42)
train_feats = preprocessor.make_feats(train_logs)
test_feats = preprocessor.make_feats(test_logs)
nan_cols = train_feats.columns[train_feats.isna().any()].tolist()
train_feats = train_feats.drop(columns=nan_cols)
test_feats = test_feats.drop(columns=nan_cols)

Engineering time data
Engineering cursor position data
Engineering word count data
Engineering statistical summaries for features


100%|██████████| 33/33 [03:03<00:00,  5.55s/it, column=word_count_change100, method=kurt]         


Engineering activity counts data


100%|██████████| 2471/2471 [00:00<00:00, 6506.83it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering event counts data


100%|██████████| 2471/2471 [00:00<00:00, 6180.00it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)
100%|██████████| 2471/2471 [00:00<00:00, 6105.97it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering text change counts data


100%|██████████| 2471/2471 [00:00<00:00, 6072.22it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering punctuation counts data


100%|██████████| 2471/2471 [00:00<00:00, 6173.79it/s]


Engineering input words data


  feats['word_time_ratio'] = feats['word_count_max'] / feats['up_time_max']
  feats['word_event_ratio'] = feats['word_count_max'] / feats['event_id_max']
  feats['event_time_ratio'] = feats['event_id_max']  / feats['up_time_max']
  feats['idle_time_ratio'] = feats['action_time_gap1_sum'] / feats['up_time_max']


Engineering ratios data
Engineering time data
Engineering cursor position data
Engineering word count data
Engineering statistical summaries for features


100%|██████████| 33/33 [00:02<00:00, 14.08it/s, column=word_count_change100, method=kurt]         


Engineering activity counts data


100%|██████████| 3/3 [00:00<00:00, 6963.43it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering event counts data


100%|██████████| 3/3 [00:00<00:00, 3387.06it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)
100%|██████████| 3/3 [00:00<00:00, 20262.34it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering text change counts data


100%|██████████| 3/3 [00:00<00:00, 12958.71it/s]
  result = getattr(ufunc, method)(*inputs, **kwargs)


Engineering punctuation counts data


100%|██████████| 3/3 [00:00<00:00, 19815.61it/s]
  feats['word_time_ratio'] = feats['word_count_max'] / feats['up_time_max']
  feats['word_event_ratio'] = feats['word_count_max'] / feats['event_id_max']
  feats['event_time_ratio'] = feats['event_id_max']  / feats['up_time_max']
  feats['idle_time_ratio'] = feats['action_time_gap1_sum'] / feats['up_time_max']


Engineering input words data
Engineering ratios data


Running the cell above extracts features from the keystroke logs (train_logs and test_logs). For details, see [https://www.kaggle.com/code/abdullahmeda/enter-ing-the-timeseries-space-sec-3-new-aggs](https://www.kaggle.com/code/abdullahmeda/enter-ing-the-timeseries-space-sec-3-new-aggs).
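
As one concrete case, the input-word statistics from get_input_words can be reproduced by hand. The sketch below uses a made-up list of text_change values for a single id and follows the same steps: join the changes, find runs of "q", and summarize their lengths.

```python
import re
import numpy as np

# Hypothetical text_change values of one id, in event order
toy_text_changes = ['q', 'q', ' ', 'q', 'q', 'q', '.', ' ', 'q']

joined = ''.join(toy_text_changes)   # "qq qqq. q"
words = re.findall(r'q+', joined)    # each run of "q" is treated as one word
lengths = [len(w) for w in words]

input_word_count = len(words)
input_word_length_mean = np.mean(lengths)
input_word_length_max = np.max(lengths)
input_word_length_std = np.std(lengths)
print(words, input_word_count, input_word_length_mean, input_word_length_max)
```

Since the changes are simply concatenated in event order, a word typed out of order (for example after moving the cursor) may be split or merged differently than in the final essay.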

In [19]:
display(train_feats.head())
display(test_feats.head())

Unnamed: 0,id,event_id_max,up_time_max,action_time_max,action_time_min,action_time_mean,action_time_std,action_time_quantile,action_time_sem,action_time_sum,...,text_change_14_count,punct_cnt,input_word_count,input_word_length_mean,input_word_length_max,input_word_length_std,word_time_ratio,word_event_ratio,event_time_ratio,idle_time_ratio
0,001519c8,2557,1801969,2259,0,116.246774,91.797374,112.0,1.815369,297243,...,-inf,37,366,5.325137,20,3.487804,0.000142,0.100117,0.001419,0.832534
1,0022f953,2454,1788969,1758,0,112.221271,55.431189,115.0,1.118966,275391,...,-inf,53,385,4.41039,33,3.199496,0.000181,0.131622,0.001372,0.828944
2,0042269b,4136,1771669,3005,0,101.837766,82.383766,94.0,1.281007,421201,...,-inf,47,627,5.446571,25,3.474895,0.000228,0.097679,0.002335,0.759751
3,0059420b,1556,1404469,806,0,121.848329,113.768226,110.0,2.884139,189596,...,-inf,18,251,4.609562,19,2.949601,0.000147,0.132391,0.001108,0.835531
4,0075873a,2531,1662472,701,0,123.943896,62.082013,129.0,1.234013,313702,...,-inf,66,412,4.76699,18,2.986064,0.000152,0.099565,0.001522,0.764103


Unnamed: 0,id,event_id_max,up_time_max,action_time_max,action_time_min,action_time_mean,action_time_std,action_time_quantile,action_time_sem,action_time_sum,...,text_change_14_count,punct_cnt,input_word_count,input_word_length_mean,input_word_length_max,input_word_length_std,word_time_ratio,word_event_ratio,event_time_ratio,idle_time_ratio
0,0000aaaa,2,760160,87,85,86.0,1.414214,86.0,1.0,172,...,-inf,0,0,0.0,0,0.0,0.0,0.0,3e-06,0.554561
1,2222bbbb,2,712023,67,46,56.5,14.849242,56.5,10.5,113,...,-inf,0,1,2.0,2,0.0,1e-06,0.5,3e-06,-0.592005
2,4444cccc,2,635641,94,56,75.0,26.870058,75.0,19.0,150,...,-inf,0,1,1.0,1,0.0,2e-06,0.5,3e-06,-0.708962
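
The `-inf` entries in `text_change_14_count` above follow from the log-scaling in the count functions: when a category never occurs for an id, `np.log(0)` evaluates to `-inf`. The sketch below reproduces this on made-up counts. The variable `n_events` stands in for `df.shape[0]`, and the final `replace` is only one possible guard, shown as an assumption rather than something the notebook does. Note also that `isna()` does not treat `-inf` as missing by default, so the `nan_cols` filter would not drop such columns.

```python
import numpy as np
import pandas as pd

# Hypothetical raw counts per id for two activity types
counts = pd.DataFrame({'Input': [8, 5], 'Remove/Cut': [2, 0]}, index=['a', 'b'])
totals = counts.sum(axis=1)
n_events = int(totals.sum())  # stands in for df.shape[0] in the notebook

with np.errstate(divide='ignore'):  # silence the divide-by-zero warning for the demo
    tf = 1 + np.log(counts.div(totals, axis=0))
idf = np.log(n_events / (counts.sum(axis=0) + 1))
weighted = tf * idf
print(weighted)

# One possible guard (an assumption, not what the notebook does): map -inf to 0
weighted_safe = weighted.replace(-np.inf, 0.0)
print(weighted_safe)
```

Whether to keep, clip, or replace these values depends on the downstream model, so the guard should be read as an option rather than a recommendation.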


In [20]:
# Code for additional aggregations comes from here: https://www.kaggle.com/code/abdullahmeda/enter-ing-the-timeseries-space-sec-3-new-aggs

train_agg_fe_df = train_logs.groupby("id")[['down_time', 'up_time', 'action_time', 'cursor_position', 'word_count']].agg(
    ['mean', 'std', 'min', 'max', 'last', 'first', 'sem', 'median', 'sum'])
train_agg_fe_df.columns = ['_'.join(x) for x in train_agg_fe_df.columns]
train_agg_fe_df = train_agg_fe_df.add_prefix("tmp_")
train_agg_fe_df.reset_index(inplace=True)

test_agg_fe_df = test_logs.groupby("id")[['down_time', 'up_time', 'action_time', 'cursor_position', 'word_count']].agg(
    ['mean', 'std', 'min', 'max', 'last', 'first', 'sem', 'median', 'sum'])
test_agg_fe_df.columns = ['_'.join(x) for x in test_agg_fe_df.columns]
test_agg_fe_df = test_agg_fe_df.add_prefix("tmp_")
test_agg_fe_df.reset_index(inplace=True)

train_feats = train_feats.merge(train_agg_fe_df, on='id', how='left')
test_feats = test_feats.merge(test_agg_fe_df, on='id', how='left')
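
The column handling in this cell can be checked on a toy frame. The sketch below, with made-up data, applies the same three steps: aggregate several columns at once, flatten the resulting MultiIndex, and add the `tmp_` prefix. The prefix keeps the new columns from colliding with names such as `action_time_mean` that already exist in train_feats.

```python
import pandas as pd

toy_logs = pd.DataFrame({
    'id': ['a', 'a', 'b'],
    'action_time': [10, 30, 5],
    'word_count': [1, 2, 1],
})

toy_agg = toy_logs.groupby('id')[['action_time', 'word_count']].agg(['mean', 'max'])
# ('action_time', 'mean') becomes 'action_time_mean'
toy_agg.columns = ['_'.join(x) for x in toy_agg.columns]
# Prefix the names so they stay distinct from features computed earlier
toy_agg = toy_agg.add_prefix('tmp_')
toy_agg.reset_index(inplace=True)
print(toy_agg.columns.tolist())
```

Without the prefix, merging on `id` would leave pandas to disambiguate duplicate names with `_x` and `_y` suffixes.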


In [21]:
# Adding the additional features to the original feature set

train_feats = train_feats.merge(train_sent_agg_df, on='id', how='left')
train_feats = train_feats.merge(train_paragraph_agg_df, on='id', how='left')
test_feats = test_feats.merge(test_sent_agg_df, on='id', how='left')
test_feats = test_feats.merge(test_paragraph_agg_df, on='id', how='left')
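
Because these merges use `how='left'`, an id that is missing from an aggregate table keeps its row but gets NaN in the new columns. In the test output earlier, the sentence aggregates contain no row for `0000aaaa`, so this situation does occur. The sketch below shows the effect on made-up frames and one simple way to list the affected ids.

```python
import pandas as pd

toy_feats = pd.DataFrame({'id': ['0000aaaa', '2222bbbb', '4444cccc'],
                          'event_id_max': [2, 2, 2]})
# Sentence aggregates with no row for the first id
toy_sent_agg = pd.DataFrame({'id': ['2222bbbb', '4444cccc'],
                             'sent_count': [1, 1]})

merged = toy_feats.merge(toy_sent_agg, on='id', how='left')
missing_ids = merged.loc[merged['sent_count'].isna(), 'id'].tolist()
print(merged)
print(missing_ids)
```

How these NaN values are treated afterwards (left as-is, filled, or dropped) is a modelling decision that this sketch does not settle.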