# 1. Introduction
At the end of this competition, I was 9th place on public LB and expected to win this competition.
However, I could not prevent shake-down, and dropped down to 37th on private LB.
Even though private score was not impressive one, I would like to share my solution for thanks to kernel authers and OSS developpers.

My approach consists of two types of models.
The one is neural networks (self-attention models), and the other is Light GBM.
I mainly worked on improving self-attention models.
For Light GBM, I integreted public kernels and modified only some parts.
Finally, I ensembled both types of models.
# 2. Import packages
All packages are imported here.

In [1]:
import re
import time
import json
import random
import warnings
from pathlib import Path
from functools import partial, reduce
from collections import Counter

import multiprocessing
from multiprocessing import Process

from tqdm import tqdm_notebook as tqdm
import lightgbm as lgb
import numpy as np
import pandas as pd
import scipy as sp
from scipy import optimize
from sklearn.metrics import mean_squared_error
from numba import jit
from IPython.display import display

import chainer
from chainer import functions as F
from chainer import links as L
from chainer import cuda, serializers, reporter
from chainer.dataset import convert
from chainer.datasets.dict_dataset import DictDataset
from chainer.training.extensions import Evaluator

# 3. Setup
File paths are setup here.
I used to switch directories for local debug.
Please set `debug_lgbm = True` for dry-run (trial of training with small data).

In [2]:
warnings.filterwarnings('ignore')

# dir_dataset = Path('../../dataset')
dir_dataset = Path('/kaggle/input/data-science-bowl-2019')

# dir_model = Path('_model')
dir_model = Path('/kaggle/input/dsb2019-37th-models')

debug_lgbm = False

# ４. Self-Attention Model
Self-Attention Models are the core components of my solution.
These models were trained offline, and uploaded for prediction.
## 4.1 Utility function
Utility functions are implemented here.
Evaluate function is forked from Aman Arora ( @aroraaman )'s implementation.
* Quadratic Kappa Metric explained in 5 simple steps<br>https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps

In [3]:
def calc_weighted_kappa(truth, pred, categories=(0, 1, 2, 3)):

    truth = pd.Categorical(truth, categories=categories)
    pred = pd.Categorical(pred, categories=categories)

    confusion_matrix = pd.crosstab(truth, pred, dropna=False)

    O = confusion_matrix.values

    true_counts = truth.value_counts()
    pred_counts = pred.value_counts()

    E = np.outer(true_counts, pred_counts)

    n = len(categories)
    w = np.zeros((n, n))

    for i in range(len(w)):
        for j in range(len(w)):
            w[i][j] = float(((i - j) ** 2) / (n - 1) ** 2)

    E = E / E.sum()
    O = O / O.sum()

    num = np.sum(w * O)
    den = np.sum(w * E)

    weighted_kappa = (1 - (num / den))

    return weighted_kappa

Optimized Rounder is useful module to tune thresholds. I forked it from Naveen Asaithambi ( @naveenasaithambi )'s implementation.
* OptimizedRounder() - Improved<br>https://www.kaggle.com/naveenasaithambi/optimizedrounder-improved

In [4]:
class OptimizedRounder(object):
    """
    An optimizer for rounding thresholds
    to maximize Quadratic Weighted Kappa (QWK) score
    """

    def __init__(self, thresholds):
        self.coef_ = thresholds

    def _kappa_loss(self, coef, X, y):
        """
        Get loss according to
        using current coefficients

        :param coef: A list of coefficients that will be used for rounding
        :param X: The raw predictions
        :param y: The ground truth labels
        """
        X_p = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels=[0, 1, 2, 3])

        return -calc_weighted_kappa(y, X_p)

    def fit(self, X, y):
        """
        Optimize rounding thresholds

        :param X: The raw predictions
        :param y: The ground truth labels
        """
        loss_partial = partial(self._kappa_loss, X=X, y=y)
        initial_coef = [0.5, 1.5, 2.5]
        try:
            new_coef = optimize.minimize(loss_partial, initial_coef, method='nelder-mead')['x']
        except ValueError:
            new_coef = initial_coef
        self.coef_.update(new_coef)

    def predict(self, X, coef):
        """
        Make predictions with specified thresholds

        :param X: The raw predictions
        :param coef: A list of coefficients that will be used for rounding
        """
        return pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels=[0, 1, 2, 3])

    def coefficients(self):
        """
        Return the optimized coefficients
        """
        return self.coef_.coefficients()


class SharedThresholds:

    def __init__(self):
        self.thresholds = None

    def update(self, new_thresholds):
        self.thresholds = new_thresholds

    def coefficients(self):
        return self.thresholds

    def save(self, path):
        np.savetxt(str(path), np.array(self.thresholds))

    def load(self, path):
        self.thresholds = np.loadtxt(str(path))

## 4.2 Model definition
The network of self-attention models is implemented here.
This architecture is so-called Transfomer.
I empoyed Encoder part of it.

Subset of Transfomer is forked from below.
* Transformer - Attention Is All You Need<br>https://github.com/soskek/attention_is_all_you_need<br>Copyright (c) 2017, Sosuke Kobayashi

In [5]:
class ConvolutionSentence(L.Convolution2D):
    """ Position-wise Linear Layer for Sentence Block
    Position-wise linear layer for array of shape
    (batchsize, dimension, sentence_length)
    can be implemented a convolution layer.
    """

    def __init__(self, in_channels, out_channels,
                 ksize=1, stride=1, pad=0, nobias=False,
                 initialW=None, initial_bias=None):
        super(ConvolutionSentence, self).__init__(
            in_channels, out_channels,
            ksize, stride, pad, nobias,
            initialW, initial_bias)

    def __call__(self, x):
        """Applies the linear layer.
        Args:
            x (~chainer.Variable): Batch of input vector block. Its shape is
                (batchsize, in_channels, sentence_length).
        Returns:
            ~chainer.Variable: Output of the linear layer. Its shape is
                (batchsize, out_channels, sentence_length).
        """
        x = F.expand_dims(x, axis=3)
        y = super(ConvolutionSentence, self).__call__(x)
        y = F.squeeze(y, axis=3)
        return y


def seq_func(func, x, reconstruct_shape=True):
    """ Change implicitly function's target to ndim=3
    Apply a given function for array of ndim 3,
    shape (batchsize, dimension, sentence_length),
    instead for array of ndim 2.
    """

    batch, units, length = x.shape
    e = F.transpose(x, (0, 2, 1)).reshape(batch * length, units)
    e = func(e)
    if not reconstruct_shape:
        return e
    out_units = e.shape[1]
    e = F.transpose(e.reshape((batch, length, out_units)), (0, 2, 1))
    assert(e.shape == (batch, out_units, length))
    return e


linear_init = chainer.initializers.LeCunUniform()


class MultiHeadAttention(chainer.Chain):
    """ Multi Head Attention Layer for Sentence Blocks
    For batch computation efficiency, dot product to calculate query-key
    scores is performed all heads together.
    """

    def __init__(self, n_units, h=8, dropout=0.1, self_attention=True):
        super(MultiHeadAttention, self).__init__()
        with self.init_scope():
            if self_attention:
                self.W_QKV = ConvolutionSentence(
                    n_units, n_units * 3, nobias=True,
                    initialW=linear_init)
            else:
                self.W_Q = ConvolutionSentence(
                    n_units, n_units, nobias=True,
                    initialW=linear_init)
                self.W_KV = ConvolutionSentence(
                    n_units, n_units * 2, nobias=True,
                    initialW=linear_init)
            self.finishing_linear_layer = ConvolutionSentence(
                n_units, n_units, nobias=True,
                initialW=linear_init)
        self.h = h
        self.scale_score = 1. / (n_units // h) ** 0.5
        self.dropout = dropout
        self.is_self_attention = self_attention

    def __call__(self, x, z=None, mask=None):
        xp = self.xp
        h = self.h

        # temporary mask
        mask = np.zeros((8, x.shape[2], x.shape[2]), dtype=np.bool)

        if self.is_self_attention:
            Q, K, V = F.split_axis(self.W_QKV(x), 3, axis=1)
        else:
            Q = self.W_Q(x)
            K, V = F.split_axis(self.W_KV(z), 2, axis=1)
        batch, n_units, n_querys = Q.shape
        _, _, n_keys = K.shape

        # Calculate Attention Scores with Mask for Zero-padded Areas
        # Perform Multi-head Attention using pseudo batching
        # all together at once for efficiency

        batch_Q = F.concat(F.split_axis(Q, h, axis=1), axis=0)
        batch_K = F.concat(F.split_axis(K, h, axis=1), axis=0)
        batch_V = F.concat(F.split_axis(V, h, axis=1), axis=0)
        assert(batch_Q.shape == (batch * h, n_units // h, n_querys))
        assert(batch_K.shape == (batch * h, n_units // h, n_keys))
        assert(batch_V.shape == (batch * h, n_units // h, n_keys))

        # mask = xp.concatenate([mask] * h, axis=0)
        batch_A = F.matmul(batch_Q, batch_K, transa=True) * self.scale_score
        # batch_A = F.where(mask, batch_A, xp.full(batch_A.shape, -xp.inf, 'f'))
        # batch_A = F.softmax(batch_A, axis=2)
        # batch_A = F.where(
        #     xp.isnan(batch_A.data), xp.zeros(batch_A.shape, 'f'), batch_A)
        # assert(batch_A.shape == (batch * h, n_querys, n_keys))

        # Calculate Weighted Sum
        batch_A, batch_V = F.broadcast(
            batch_A[:, None], batch_V[:, :, None])
        batch_C = F.sum(batch_A * batch_V, axis=3)
        assert(batch_C.shape == (batch * h, n_units // h, n_querys))
        C = F.concat(F.split_axis(batch_C, h, axis=0), axis=1)
        assert(C.shape == (batch, n_units, n_querys))
        C = self.finishing_linear_layer(C)
        return C


class FeedForwardLayer(chainer.Chain):
    def __init__(self, n_units: int, ff_inner: int, ff_slope: float):
        super(FeedForwardLayer, self).__init__()
        n_inner_units = n_units * ff_inner
        self.slope = ff_slope
        with self.init_scope():
            self.W_1 = ConvolutionSentence(n_units, n_inner_units, initialW=linear_init)
            self.W_2 = ConvolutionSentence(n_inner_units, n_units, initialW=linear_init)
            self.act = F.leaky_relu

    def __call__(self, e):
        e = self.W_1(e)
        e = self.act(e, slope=self.slope)
        e = self.W_2(e)
        return e


class LayerNormalizationSentence(L.LayerNormalization):
    """ Position-wise Linear Layer for Sentence Block
    Position-wise layer-normalization layer for array of shape
    (batchsize, dimension, sentence_length).
    """

    def __init__(self, *args, **kwargs):
        super(LayerNormalizationSentence, self).__init__(*args, **kwargs)

    def __call__(self, x):
        y = seq_func(super(LayerNormalizationSentence, self).__call__, x)
        return y


class EncoderLayer(chainer.Chain):
    def __init__(self, n_units, ff_inner: int, ff_slope: float, h: int, dropout1: float, dropout2: float):
        super(EncoderLayer, self).__init__()
        with self.init_scope():
            self.self_attention = MultiHeadAttention(n_units, h)
            self.feed_forward = FeedForwardLayer(n_units, ff_inner, ff_slope)
            self.ln_1 = LayerNormalizationSentence(n_units, eps=1e-6)
            self.ln_2 = LayerNormalizationSentence(n_units, eps=1e-6)
        self.dropout1 = dropout1
        self.dropout2 = dropout2

    def __call__(self, e, xx_mask):
        sub = self.self_attention(e, e, xx_mask)
        e = e + F.dropout(sub, self.dropout1)
        e = self.ln_1(e)

        sub = self.feed_forward(e)
        e = e + F.dropout(sub, self.dropout2)
        e = self.ln_2(e)
        return e

The entire model receive query features and history features.
Encoder layer extract high level features from history and they are concatenated with query features.
Subsequently, output values are calculated throught FC layers.

In [6]:
class DSB2019Net(chainer.Chain):

    def __init__(self, dim_input: int, dim_enc: int, dim_fc: int,
                 ff_inner: int, ff_slope: float, head: int,
                 dropout1: float, dropout2: float, dropout3: float, **kwargs):

        super(DSB2019Net, self).__init__()

        self.dropout3 = dropout3

        with self.init_scope():

            self.cur_fc1 = L.Linear(128)
            self.cur_fc2 = L.Linear(128)

            self.hist_conv1 = ConvolutionSentence(dim_input, int(dim_enc))
            self.hist_enc1 = EncoderLayer(int(dim_enc), ff_inner, ff_slope, head, dropout1, dropout2)

            self.fc1 = L.Linear(dim_fc)
            self.fc2 = L.Linear(1)

    def __call__(self, query, history, targets):

        out = self.predict(query, history)
        loss = F.mean_absolute_error(out, targets)
        reporter.report({'loss': loss}, self)
        return loss

    def predict(self, query, history, **kwargs):
        """
            query: [batch_size, feature]
            history: [batch_size, time_step, feature]
        """

        h_cur = F.leaky_relu(self.cur_fc1(query))
        h_cur = self.cur_fc2(h_cur)

        h_hist = F.swapaxes(history, 1, 2)

        h_hist = self.hist_conv1(h_hist)
        h_hist = self.hist_enc1(h_hist, xx_mask=None)

        h_hist_ave = F.average(h_hist, axis=2)
        h_hist_max = F.max(h_hist, axis=2)

        h = F.concat([h_cur, h_hist_ave, h_hist_max], axis=1)

        h = F.dropout(F.leaky_relu(self.fc1(h)), ratio=self.dropout3)
        out = self.fc2(h)

        return out

Evaluator is used for generating a submission.
This is used for validation during training too.

In [7]:
class ThresholdEvaluator(Evaluator):

    def __init__(self, iterator, target, name, thresholds, converter=convert.concat_examples, device=None,
                 is_validate=False, is_submit=False, installation_id=None, submission_name='submission'):

        super(ThresholdEvaluator, self).__init__(iterator, target, converter=converter, device=device)

        self.is_validate = is_validate
        self.is_submit = is_submit
        self.name = name

        self.rounder = OptimizedRounder(thresholds)
        self.installation_id = installation_id
        self.submission_name = submission_name

    def evaluate(self):
        iterator = self._iterators['main']
        eval_func = self._targets['main']

        iterator.reset()
        it = iterator

        y_total = []
        t_total = []

        for batch in it:
            in_arrays = self.converter(batch, self.device)
            with chainer.no_backprop_mode(), chainer.using_config('train', False):
                y = eval_func.predict(**in_arrays)

            y_data = cuda.to_cpu(y.data)
            y_total.append(y_data)
            t_total.append(cuda.to_cpu(in_arrays['targets']))

        y_truth = np.concatenate(t_total).flatten()
        y_pred_value = np.concatenate(y_total).flatten()

        if self.is_validate:
            self.rounder.fit(y_pred_value, y_truth)

        y_pred_label = self.rounder.predict(y_pred_value, self.rounder.coefficients())

        if self.is_submit:

            print('\nsave submission !')

            submit = pd.DataFrame()
            submit['installation_id'] = self.installation_id
            submit['accuracy_group'] = y_pred_label
            submit.sort_values('installation_id', inplace=True)
            submit.to_csv(f'{self.submission_name}.csv', index=False)

        if self.is_validate:

            valid_score = calc_weighted_kappa(y_truth, y_pred_label)
            observation = {}
            with reporter.report_scope(observation):
                reporter.report({'qw_kappa': valid_score}, self._targets['main'])
            return observation

        return {}

## 4.3 Feature Extraction
Since history data is given to the model directly, feature extraction is very simple.

As a history feature, the number of event code, title, types are counted for each game_session.
Duration of geme_session is also calculated and concatenated with history features.

As a query feature, the number of `correct` and `incorrect` are counted respectively. 

In [8]:
session_merge = partial(pd.merge, on='game_session')

def extract_features(data, data_labels, event_codes, titles, types, num_history_step: int):

    data['timestamp'] = pd.to_datetime(data['timestamp'])
    data['correct'] = data['event_data'].map(lambda x: '"correct":true' in x)
    data['incorrect'] = data['event_data'].map(lambda x: '"correct":false' in x)

    data_gp = data.groupby(['installation_id', 'game_session'])
    data_time = data_gp['timestamp'].agg(min).reset_index()

    data_time_max = data_gp['timestamp'].agg(max).reset_index()[['game_session', 'timestamp']]
    data_time_max.columns = ['game_session', 'timestamp_end']

    data_level = data_gp['f_level'].agg(np.max).reset_index()[['game_session', 'f_level']]
    data_level = data_level.fillna(0.0)

    data_count = data_gp[['correct', 'incorrect']].agg(sum).reset_index()[['game_session', 'correct', 'incorrect']]

    data_code = pd.crosstab(data['game_session'], data['event_code']).astype(np.float32)
    data_title = pd.crosstab(data['game_session'], data['title']).astype(np.float32)
    data_type = pd.crosstab(data['game_session'], data['type']).astype(np.float32)

    data_title_str = data.drop_duplicates('game_session', keep='last').copy()[['game_session', 'title']]

    data_feature = reduce(
        session_merge,
        [data_time, data_code, data_title, data_type, data_time_max, data_title_str, data_count, data_level]
    )
    data_feature.index = data_feature['game_session']
    data_feature_gp = data_feature.groupby('installation_id')

    list_history = list()
    list_current = list()

    num_unique_geme_session = len(set(data_feature['game_session']))
    num_unique_id_and_game_session = len(set(zip(data_feature['installation_id'], data_feature['game_session'])))
    assert num_unique_geme_session == num_unique_id_and_game_session

    assessments = [
        'Mushroom Sorter (Assessment)',
        'Bird Measurer (Assessment)',
        'Cauldron Filler (Assessment)',
        'Cart Balancer (Assessment)',
        'Chest Sorter (Assessment)'
    ]

    for _, row in tqdm(data_labels.iterrows(), total=len(data_labels), miniters=100):

        same_id = data_feature_gp.get_group(row['installation_id'])

        target_timestamp = same_id.loc[row['game_session'], 'timestamp']

        same_id_before = same_id.loc[same_id['timestamp'] < target_timestamp].copy()
        same_id_before.sort_values('timestamp', inplace=True)

        same_id_before['duration'] = (same_id_before['timestamp_end'] - same_id_before['timestamp']).dt.total_seconds()
        same_id_before['duration'] = np.log1p(same_id_before['duration'])

        h_feature = same_id_before.iloc[-num_history_step:][event_codes + titles + types + ['duration']]
        h_feature = np.log1p(h_feature.values)

        c_feature = (same_id.loc[row['game_session']][assessments].values != 0).astype(np.int32)

        query_title = row['title']
        success_exp = np.sum(same_id_before.query('title==@query_title')['correct'])
        failure_exp = np.sum(same_id_before.query('title==@query_title')['incorrect'])

        c_feature = np.append(c_feature, np.log1p(success_exp))
        c_feature = np.append(c_feature, np.log1p(failure_exp))
        c_feature = np.append(c_feature, (success_exp + 1) / (success_exp + failure_exp + 2) - 0.5)
        c_feature = np.append(c_feature, (target_timestamp.hour - 12.0) / 10.0)

        if len(h_feature) < num_history_step:
            h_feature = np.pad(h_feature, ((num_history_step - len(h_feature), 0), (0, 0)),
                               mode='constant', constant_values=0)

        list_history.append(h_feature)
        list_current.append(c_feature)

    history = np.asarray(list_history)
    current = np.asarray(list_current)

    return history, current

## 4.4 Main function
Data Loading and simple preprocessing are implemented here.

In [9]:
def single_fold(dir_dataset: Path, dir_model: Path, num_history_step: int,
                batch_size: int, seed: int, device: int, **kwargs):

    random.seed(seed)
    np.random.seed(seed)

    tic = time.time()

    sub = pd.read_csv(dir_dataset / 'sample_submission.csv')

    test_installation_id = list(set(sub.installation_id))

    print('test installation id: {}'.format(test_installation_id[:10]))

    test = pd.read_csv(dir_dataset / 'test.csv')
    test = test[test.installation_id.isin(test_installation_id)]
    print('test shape: {}'.format(test.shape))

    test.sort_values(['installation_id', 'timestamp'], inplace=True)
    test_labels = test.drop_duplicates('installation_id', keep='last').copy()
    test_labels.reset_index(drop=True, inplace=True)
    test_labels['accuracy_group'] = -1  # dummy label

    event_codes = pd.read_csv(dir_model / f'event_codes.csv')['event_code'].tolist()
    titles = pd.read_csv(dir_model / 'media_sequence.csv')['title'].tolist()
    types = ['Activity', 'Assessment', 'Clip', 'Game']
    re_level = re.compile(r'.*"level":([0-9]+).*')

    test['event_code'] = pd.Categorical(test['event_code'], categories=event_codes)
    test['title'] = pd.Categorical(test['title'], categories=titles)
    test['type'] = pd.Categorical(test['type'], categories=types)
    test['f_level'] = test['event_data'].map(
        lambda x: int(re.sub(re_level, '\\1', x)) + 1 if '"level"' in x else np.nan)

    print(' test shape: {}'.format(test.shape))

    model = DSB2019Net(len(event_codes + titles + types) + 1, **kwargs)

    serializers.load_npz(dir_model / f'model_seed{seed}.npz', model)
    model.to_cpu()

    test_history, test_current = extract_features(test, test_labels, event_codes, titles, types, num_history_step)

    test_dataset = DictDataset(history=test_history.astype(np.float32),
                               query=test_current.astype(np.float32),
                               targets=np.asarray(test_labels[['accuracy_group']], dtype=np.float32))

    test_iter = chainer.iterators.SerialIterator(test_dataset, batch_size, repeat=False, shuffle=False)

    thresholds = SharedThresholds()
    thresholds.load(dir_model / f'thresholds_seed{seed}.txt')

    dir_model.mkdir(exist_ok=True)

    ThresholdEvaluator(test_iter, model, 'test', thresholds, device=device, is_submit=True,
                       installation_id=test_labels['installation_id'],
                       submission_name=f'submission_seed{seed}').evaluate()

    elapsed_time = time.time() - tic
    print('elapsed time: {:.1f} [min]'.format(elapsed_time / 60.0))


def main_nn(dir_dataset: Path, dir_model: Path, device: int, seeds_nn: str):

    list_seed = [int(s) for s in seeds_nn.split(',')]

    for seed in list_seed:

        with open(str(dir_model / f'parameters_seed{seed}.json'), 'r') as f:
            hyper_params = json.load(f)

        single_fold(dir_dataset=dir_dataset,
                    dir_model=dir_model,
                    seed=seed,
                    device=device,
                    **hyper_params)

## 4.5 Execution
Pretrained models are loaded and predictions are executed.
10 pre-submisions are saved on local strage.

In [10]:
seeds_nn = '3048,3049,3050,3051,3052,3053,3054,3055,3056,3057'

main_nn(
    dir_dataset=dir_dataset,
    dir_model=dir_model,
    device=-1,
    seeds_nn=seeds_nn
)

test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.4 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.4 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.3 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.4 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]
test installation id: ['f2a1b17d', '08671ec7', 'be0381e4', '326575f9', '125a3d09', 'bf287639', '9e7e6cd8', 'ae10e514', 'e566fc58', '5cdb3a18']
test shape: (1156414, 11)
 test shape: (1156414, 12)


HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))



save submission !
elapsed time: 1.5 [min]


# 5. Light GBM
I learned many things from public kernel.
Especially, below three kernels were informative for me.
I integreted these kernels and adjusted a little.

Andrew Lukyanenko ( @artgor ) created regression based kernel.
It tought us a good way to handle QWK.
* Quick and dirty regression<br>https://www.kaggle.com/artgor/quick-and-dirty-regression

Bhavika ( @bhavikapanara ) share additional features which can boost public LB.
* 2019 DSB - With more features (QWK-0.549)<br>https://www.kaggle.com/bhavikapanara/2019-dsb-with-more-features-qwk-0-549

Memento Mori ( @nxrprime ) integrated above ideas and create nice script based kernel (I personally prefer script format rather than notebook).
* simple<br>https://www.kaggle.com/nxrprime/simple

## 5.1 Functions and Classes
All functions and classes for Light GBM are implemented here.

In [11]:
def read_data(dir_dataset: Path, debug: bool):

    start = time.time()
    print('Start read data')

    print('Reading train data....')
    if debug:
        print('debug mode !')
        train = pd.read_csv(dir_dataset / 'train.csv', nrows=10000)
        train_labels = pd.read_csv(dir_dataset / 'train_labels.csv', nrows=100)
    else:
        print('full data mode !')
        train = pd.read_csv(dir_dataset / 'train.csv')
        train_labels = pd.read_csv(dir_dataset / 'train_labels.csv')
    print('Training.csv file have {} rows and {} columns'.format(train.shape[0], train.shape[1]))
    print('Train_labels.csv file have {} rows and {} columns'.format(train_labels.shape[0], train_labels.shape[1]))

    print('Reading test data....')
    test = pd.read_csv(dir_dataset / 'test.csv')
    print('Test.csv file have {} rows and {} columns'.format(test.shape[0], test.shape[1]))

    print('Reading specs.csv file....')
    specs = pd.read_csv(dir_dataset / 'specs.csv')
    print('Specs.csv file have {} rows and {} columns'.format(specs.shape[0], specs.shape[1]))

    print('Reading sample_submission.csv file....')
    sample_submission = pd.read_csv(dir_dataset / 'sample_submission.csv')
    print('Sample_submission.csv file have {} rows and {} columns'.format(
        sample_submission.shape[0], sample_submission.shape[1]))

    print("read data done, time - ", time.time() - start)
    return train, test, train_labels, specs, sample_submission


def encode_title(train, test, train_labels):
    start = time.time()

    print("Start encoding data")

    str_concat = lambda x, y: str(x) + '_' + str(y)
    sorted_list = lambda set_obj: sorted(list(set_obj))

    # encode title
    train['title_event_code'] = sorted_list(map(str_concat, train['title'], train['event_code']))
    test['title_event_code'] = sorted_list(map(str_concat, test['title'], test['event_code']))
    all_title_event_code = sorted_list(set(train["title_event_code"].unique()).union(test["title_event_code"].unique()))

    train['type_world'] = sorted_list(map(str_concat, train['type'], train['world']))
    test['type_world'] = sorted_list(map(str_concat, test['type'], test['world']))
    all_type_world = sorted_list(set(train["type_world"].unique()).union(test["type_world"].unique()))

    # make a list with all the unique 'titles' from the train and test set
    list_of_user_activities = sorted_list(set(train['title'].unique()).union(set(test['title'].unique())))

    # make a list with all the unique 'event_code' from the train and test set
    list_of_event_code = sorted_list(set(train['event_code'].unique()).union(set(test['event_code'].unique())))
    list_of_event_id = sorted_list(set(train['event_id'].unique()).union(set(test['event_id'].unique())))

    # make a list with all the unique worlds from the train and test set
    list_of_worlds = sorted_list(set(train['world'].unique()).union(set(test['world'].unique())))

    # create a dictionary numerating the titles
    activities_map = dict(zip(list_of_user_activities, np.arange(len(list_of_user_activities))))
    activities_labels = dict(zip(np.arange(len(list_of_user_activities)), list_of_user_activities))
    activities_world = dict(zip(list_of_worlds, np.arange(len(list_of_worlds))))
    assess_titles = sorted_list(set(train[train['type'] == 'Assessment']['title'].value_counts().index).union(
        set(test[test['type'] == 'Assessment']['title'].value_counts().index)))

    # replace the text titles with the number titles from the dict
    train['title'] = train['title'].map(activities_map)
    test['title'] = test['title'].map(activities_map)
    train['world'] = train['world'].map(activities_world)
    test['world'] = test['world'].map(activities_world)
    train_labels['title'] = train_labels['title'].map(activities_map)
    win_code = dict(zip(activities_map.values(), (4100 * np.ones(len(activities_map))).astype('int')))
    # then, it set one element, the 'Bird Measurer (Assessment)' as 4110, 10 more than the rest
    win_code[activities_map['Bird Measurer (Assessment)']] = 4110
    # convert text into datetime
    train['timestamp'] = pd.to_datetime(train['timestamp'])
    test['timestamp'] = pd.to_datetime(test['timestamp'])
    print("End encoding data, time - ", time.time() - start)

    event_data = {}
    event_data["train_labels"] = train_labels
    event_data["win_code"] = win_code
    event_data["list_of_user_activities"] = list_of_user_activities
    event_data["list_of_event_code"] = list_of_event_code
    event_data["activities_labels"] = activities_labels
    event_data["assess_titles"] = assess_titles
    event_data["list_of_event_id"] = list_of_event_id
    event_data["all_title_event_code"] = all_title_event_code
    event_data["activities_map"] = activities_map
    event_data["all_type_world"] = all_type_world

    return train, test, event_data


def get_all_features(feature_dict, ac_data):
    if len(ac_data['durations']) > 0:
        feature_dict['installation_duration_mean'] = np.mean(ac_data['durations'])
        feature_dict['installation_duration_sum'] = np.sum(ac_data['durations'])
    else:
        feature_dict['installation_duration_mean'] = 0
        feature_dict['installation_duration_sum'] = 0

    return feature_dict


def get_data(user_sample, event_data, test_set):
    """
    The user_sample is a DataFrame from train or test where the only one
    installation_id is filtered
    And the test_set parameter is related with the labels processing, that is only requered
    if test_set=False
    """
    # Constants and parameters declaration
    last_assesment = {}

    last_activity = 0

    user_activities_count = {'Clip': 0, 'Activity': 0, 'Assessment': 0, 'Game': 0}

    assess_4020_acc_dict = {'Cauldron Filler (Assessment)_4020_accuracy': 0,
                            'Mushroom Sorter (Assessment)_4020_accuracy': 0,
                            'Bird Measurer (Assessment)_4020_accuracy': 0,
                            'Chest Sorter (Assessment)_4020_accuracy': 0}

    game_time_dict = {'Clip_gametime': 0, 'Game_gametime': 0,
                      'Activity_gametime': 0, 'Assessment_gametime': 0}

    accuracy_groups = {0: 0, 1: 0, 2: 0, 3: 0}
    all_assessments = []
    accumulated_accuracy_group = 0
    accumulated_accuracy = 0
    accumulated_correct_attempts = 0
    accumulated_uncorrect_attempts = 0
    accumulated_actions = 0

    # Newly added features
    accumulated_game_miss = 0
    Cauldron_Filler_4025 = 0
    mean_game_round = 0
    mean_game_duration = 0
    mean_game_level = 0
    Assessment_mean_event_count = 0
    Game_mean_event_count = 0
    Activity_mean_event_count = 0
    chest_assessment_uncorrect_sum = 0

    counter = 0
    durations = []
    durations_game = []
    durations_activity = []
    last_accuracy_title = {'acc_' + title: -1 for title in event_data["assess_titles"]}
    last_game_time_title = {'lgt_' + title: 0 for title in event_data["assess_titles"]}
    ac_game_time_title = {'agt_' + title: 0 for title in event_data["assess_titles"]}
    ac_true_attempts_title = {'ata_' + title: 0 for title in event_data["assess_titles"]}
    ac_false_attempts_title = {'afa_' + title: 0 for title in event_data["assess_titles"]}
    event_code_count: dict[str, int] = {ev: 0 for ev in event_data["list_of_event_code"]}
    event_code_proc_count = {str(ev) + "_proc" : 0. for ev in event_data["list_of_event_code"]}
    event_id_count: dict[str, int] = {eve: 0 for eve in event_data["list_of_event_id"]}
    title_count: dict[str, int] = {eve: 0 for eve in event_data["activities_labels"].values()}
    title_event_code_count: dict[str, int] = {t_eve: 0 for t_eve in event_data["all_title_event_code"]}
    type_world_count: dict[str, int] = {w_eve: 0 for w_eve in event_data["all_type_world"]}
    session_count = 0

    # iterates through each session of one installation_id
    for i, session in user_sample.groupby('game_session', sort=False):
        # i = game_session_id
        # session is a DataFrame that contain only one game_session
        # get some sessions information
        session_type = session['type'].iloc[0]
        session_title = session['title'].iloc[0]
        session_title_text = event_data["activities_labels"][session_title]

        if session_type == "Activity":
            Activity_mean_event_count = (Activity_mean_event_count + session['event_count'].iloc[-1]) / 2.0

        if session_type == "Game":
            Game_mean_event_count = (Game_mean_event_count + session['event_count'].iloc[-1]) / 2.0

            game_s = session[session.event_code == 2030]
            misses_cnt = cnt_miss(game_s)
            accumulated_game_miss += misses_cnt

            try:
                game_round = json.loads(session['event_data'].iloc[-1])["round"]
                mean_game_round = (mean_game_round + game_round) / 2.0
            except:
                pass

            try:
                game_duration = json.loads(session['event_data'].iloc[-1])["duration"]
                mean_game_duration = (mean_game_duration + game_duration) / 2.0
            except:
                pass

            try:
                game_level = json.loads(session['event_data'].iloc[-1])["level"]
                mean_game_level = (mean_game_level + game_level) / 2.0
            except:
                pass

        # for each assessment, and only this kind off session, the features below are processed
        # and a register are generated
        if (session_type == 'Assessment') & (test_set or len(session) > 1):
            # search for event_code 4100, that represents the assessments trial
            all_attempts = session.query(f'event_code == {event_data["win_code"][session_title]}')
            # then, check the numbers of wins and the number of losses
            true_attempts = all_attempts['event_data'].str.contains('true').sum()
            false_attempts = all_attempts['event_data'].str.contains('false').sum()
            # copy a dict to use as feature template, it's initialized with some itens:
            # {'Clip':0, 'Activity': 0, 'Assessment': 0, 'Game':0}
            features = user_activities_count.copy()
            features.update(last_accuracy_title.copy())
            features.update(event_code_count.copy())
            features.update(title_count.copy())
            features.update(game_time_dict.copy())
            features.update(event_id_count.copy())
            features.update(title_event_code_count.copy())
            features.update(assess_4020_acc_dict.copy())
            features.update(type_world_count.copy())
            features.update(last_game_time_title.copy())
            features.update(ac_game_time_title.copy())
            features.update(ac_true_attempts_title.copy())
            features.update(ac_false_attempts_title.copy())

            features.update(event_code_proc_count.copy())
            features['installation_session_count'] = session_count
            features['accumulated_game_miss'] = accumulated_game_miss
            features['mean_game_round'] = mean_game_round
            features['mean_game_duration'] = mean_game_duration
            features['mean_game_level'] = mean_game_level
            features['Assessment_mean_event_count'] = Assessment_mean_event_count
            features['Game_mean_event_count'] = Game_mean_event_count
            features['Activity_mean_event_count'] = Activity_mean_event_count
            features['chest_assessment_uncorrect_sum'] = chest_assessment_uncorrect_sum

            variety_features = [('var_event_code', event_code_count),
                                ('var_event_id', event_id_count),
                                ('var_title', title_count),
                                ('var_title_event_code', title_event_code_count),
                                ('var_type_world', type_world_count)]

            for name, dict_counts in variety_features:
                arr = np.array(list(dict_counts.values()))
                features[name] = np.count_nonzero(arr)

            # get installation_id for aggregated features
            features['installation_id'] = session['installation_id'].iloc[-1]
            # add title as feature, remembering that title represents the name of the game
            features['session_title'] = session['title'].iloc[0]
            # the 4 lines below add the feature of the history of the trials of this player
            # this is based on the all time attempts so far, at the moment of this assessment
            features['accumulated_correct_attempts'] = accumulated_correct_attempts
            features['accumulated_uncorrect_attempts'] = accumulated_uncorrect_attempts
            accumulated_correct_attempts += true_attempts
            accumulated_uncorrect_attempts += false_attempts

            # ----------------------------------------------
            ac_true_attempts_title['ata_' + session_title_text] += true_attempts
            ac_false_attempts_title['afa_' + session_title_text] += false_attempts

            last_game_time_title['lgt_' + session_title_text] = session['game_time'].iloc[-1]
            ac_game_time_title['agt_' + session_title_text] += session['game_time'].iloc[-1]
            # ----------------------------------------------

            # the time spent in the app so far
            if durations == []:
                features['duration_mean'] = 0
                features['duration_std'] = 0
                features['last_duration'] = 0
                features['duration_max'] = 0
            else:
                features['duration_mean'] = np.mean(durations)
                features['duration_std'] = np.std(durations)
                features['last_duration'] = durations[-1]
                features['duration_max'] = np.max(durations)
            durations.append((session.iloc[-1, 2] - session.iloc[0, 2]).seconds)

            if durations_game == []:
                features['duration_game_mean'] = 0
                features['duration_game_std'] = 0
                features['game_last_duration'] = 0
                features['game_max_duration'] = 0
            else:
                features['duration_game_mean'] = np.mean(durations_game)
                features['duration_game_std'] = np.std(durations_game)
                features['game_last_duration'] = durations_game[-1]
                features['game_max_duration'] = np.max(durations_game)

            if durations_activity == []:
                features['duration_activity_mean'] = 0
                features['duration_activity_std'] = 0
                features['game_activity_duration'] = 0
                features['game_activity_max'] = 0
            else:
                features['duration_activity_mean'] = np.mean(durations_activity)
                features['duration_activity_std'] = np.std(durations_activity)
                features['game_activity_duration'] = durations_activity[-1]
                features['game_activity_max'] = np.max(durations_activity)

            # the accuracy is the all time wins divided by the all time attempts
            features['accumulated_accuracy'] = accumulated_accuracy / counter if counter > 0 else 0
            # --------------------------
            features['Cauldron_Filler_4025'] = Cauldron_Filler_4025 / counter if counter > 0 else 0

            Assess_4025 = session[(session.event_code == 4025) & (session.title == 'Cauldron Filler (Assessment)')]
            true_attempts_ = Assess_4025['event_data'].str.contains('true').sum()
            false_attempts_ = Assess_4025['event_data'].str.contains('false').sum()

            if (true_attempts_ + false_attempts_) != 0:
                cau_assess_accuracy_ = true_attempts_ / (true_attempts_ + false_attempts_)
            else:
                cau_assess_accuracy_ = 0
            Cauldron_Filler_4025 += cau_assess_accuracy_

            chest_assessment_uncorrect_sum += len(session[session.event_id == "df4fe8b6"])

            Assessment_mean_event_count = (Assessment_mean_event_count + session['event_count'].iloc[-1]) / 2.0
            # ----------------------------
            accuracy = true_attempts / (true_attempts + false_attempts) if (true_attempts + false_attempts) != 0 else 0
            accumulated_accuracy += accuracy
            last_accuracy_title['acc_' + session_title_text] = accuracy
            # a feature of the current accuracy categorized
            # it is a counter of how many times this player was in each accuracy group
            if accuracy == 0:
                features['accuracy_group'] = 0
            elif accuracy == 1:
                features['accuracy_group'] = 3
            elif accuracy == 0.5:
                features['accuracy_group'] = 2
            else:
                features['accuracy_group'] = 1
            features.update(accuracy_groups)
            accuracy_groups[features['accuracy_group']] += 1
            # mean of the all accuracy groups of this player
            features['accumulated_accuracy_group'] = accumulated_accuracy_group / counter if counter > 0 else 0
            accumulated_accuracy_group += features['accuracy_group']
            # how many actions the player has done so far, it is initialized as 0 and updated some lines below
            features['accumulated_actions'] = accumulated_actions

            # there are some conditions to allow this features to be inserted in the datasets
            # if it's a test set, all sessions belong to the final dataset
            # it it's a train, needs to be passed throught this clausule
            # : session.query(f'event_code == {win_code[session_title]}')
            # that means, must exist an event_code 4100 or 4110
            if test_set:
                last_assesment = features.copy()

            if true_attempts + false_attempts > 0:
                all_assessments.append(features)

            counter += 1

        if session_type == 'Game':
            durations_game.append((session.iloc[-1, 2] - session.iloc[0, 2]).seconds)

        if session_type == 'Activity':
            durations_activity.append((session.iloc[-1, 2] - session.iloc[0, 2]).seconds)

        session_count += 1

        # this piece counts how many actions was made in each event_code so far
        def update_counters(counter: dict, col: str):
            num_of_session_count = Counter(session[col])
            for k in num_of_session_count.keys():
                x = k
                if col == 'title':
                    x = event_data["activities_labels"][k]
                counter[x] += num_of_session_count[k]
            return counter

        def update_proc(count: dict):
            res = {}
            for k, val in count.items():
                res[str(k) + "_proc"] = (float(val) * 100.0) / accumulated_actions
            return res

        event_code_count = update_counters(event_code_count, "event_code")

        event_id_count = update_counters(event_id_count, "event_id")
        title_count = update_counters(title_count, 'title')
        title_event_code_count = update_counters(title_event_code_count, 'title_event_code')
        type_world_count = update_counters(type_world_count, 'type_world')

        assess_4020_acc_dict = get_4020_acc(session, assess_4020_acc_dict, event_data)
        game_time_dict[session_type + '_gametime'] = (game_time_dict[session_type + '_gametime'] + (
                    session['game_time'].iloc[-1] / 1000.0)) / 2.0

        # counts how many actions the player has done so far, used in the feature of the same name
        accumulated_actions += len(session)
        event_code_proc_count = update_proc(event_code_count)

        if last_activity != session_type:
            user_activities_count[session_type] += 1
            last_activitiy = session_type

            # if it't the test_set, only the last assessment must be predicted, the previous are scraped
    if test_set:
        return last_assesment, all_assessments
    # in the train_set, all assessments goes to the dataset
    return all_assessments


def cnt_miss(df):
    cnt = 0
    for e in range(len(df)):
        x = df['event_data'].iloc[e]
        y = json.loads(x)['misses']
        cnt += y
    return cnt


def get_4020_acc(df, counter_dict, event_data):
    for e in ['Cauldron Filler (Assessment)', 'Bird Measurer (Assessment)',
              'Mushroom Sorter (Assessment)', 'Chest Sorter (Assessment)']:
        Assess_4020 = df[(df.event_code == 4020) & (df.title == event_data["activities_map"][e])]
        true_attempts_ = Assess_4020['event_data'].str.contains('true').sum()
        false_attempts_ = Assess_4020['event_data'].str.contains('false').sum()

        if (true_attempts_ + false_attempts_) != 0:
            measure_assess_accuracy_ = true_attempts_ / (true_attempts_ + false_attempts_)
        else:
            measure_assess_accuracy_ = 0
        counter_dict[e + "_4020_accuracy"] += (counter_dict[e + "_4020_accuracy"] + measure_assess_accuracy_) / 2.0

    return counter_dict


def get_users_data(users_list, return_dict,  event_data, test_set):

    if test_set:
        for user in users_list:
            return_dict.append(get_data(user, event_data, test_set))
    else:
        answer = []
        for user in users_list:
            answer += get_data(user, event_data, test_set)
        return_dict += answer


def get_data_parallel(users_list, event_data, test_set):

    manager = multiprocessing.Manager()
    return_dict = manager.list()
    threads_number = event_data["process_numbers"]
    data_len = len(users_list)
    processes = []
    cur_start = 0
    cur_stop = 0
    for index in range(threads_number):
        cur_stop += (data_len-1) // threads_number

        if index != (threads_number - 1):
            p = Process(target=get_users_data, args=(users_list[cur_start:cur_stop], return_dict, event_data, test_set))
        else:
            p = Process(target=get_users_data, args=(users_list[cur_start:], return_dict, event_data, test_set))

        processes.append(p)
        cur_start = cur_stop

    for proc in processes:
        proc.start()

    for proc in processes:
        proc.join()

    return list(return_dict)


def get_train_and_test(train, test, event_data):

    start = time.time()
    print("Start get_train_and_test")

    compiled_train = []
    compiled_test = []

    user_train_list = []
    user_test_list = []

    stride_size = event_data["strides"]
    for i, (ins_id, user_sample) in enumerate(tqdm(train.groupby('installation_id', sort=False), miniters=100)):
        user_train_list.append(user_sample)
        if (i + 1) % stride_size == 0:
            compiled_train += get_data_parallel(user_train_list, event_data, False)
            del user_train_list
            user_train_list = []

    if len(user_train_list) > 0:
        compiled_train += get_data_parallel(user_train_list, event_data, False)
        del user_train_list

    for i, (ins_id, user_sample) in enumerate(tqdm(test.groupby('installation_id', sort=False), miniters=100)):
        user_test_list.append(user_sample)
        if (i + 1) % stride_size == 0:
            compiled_test += get_data_parallel(user_test_list, event_data, True)
            del user_test_list
            user_test_list = []

    if len(user_test_list) > 0:
        compiled_test += get_data_parallel(user_test_list, event_data, True)
        del user_test_list

    reduce_train = pd.DataFrame(compiled_train)

    reduce_test = [x[0] for x in compiled_test]

    reduce_train_from_test = []
    for i in [x[1] for x in compiled_test]:
        reduce_train_from_test += i

    reduce_test = pd.DataFrame(reduce_test)
    reduce_train_from_test = pd.DataFrame(reduce_train_from_test)
    print("End get_train_and_test, time - ", time.time() - start)
    return reduce_train, reduce_test, reduce_train_from_test


def get_train_and_test_single_proc(train, test, event_data):

    compiled_train = []
    compiled_test = []
    compiled_test_his = []
    for ins_id, user_sample in tqdm(train.groupby('installation_id', sort=False), miniters=100):
        compiled_train += get_data(user_sample, event_data, False)
    for ins_id, user_sample in tqdm(test.groupby('installation_id', sort=False), miniters=100):
        test_data = get_data(user_sample, event_data, True)
        compiled_test.append(test_data[0])
        compiled_test_his += test_data[1]

    reduce_train = pd.DataFrame(compiled_train)
    reduce_test = pd.DataFrame(compiled_test)
    reduce_test_his = pd.DataFrame(compiled_test_his)

    return reduce_train, reduce_test, reduce_test_his


def predict(sample_submission, y_pred, file_name='submission.csv'):
    sample_submission['accuracy_group'] = y_pred
    sample_submission['accuracy_group'] = sample_submission['accuracy_group'].astype(int)
    sample_submission.to_csv(file_name, index=False)
    print(sample_submission['accuracy_group'].value_counts(normalize=True))


def get_random_assessment(reduce_train):
    used_idx = []
    for iid in tqdm(set(reduce_train['installation_id']), miniters=200):
        list_ = list(reduce_train[reduce_train['installation_id'] == iid].index)
        cur = random.choices(list_, k=1)[0]
        used_idx.append(cur)
    reduce_train_t = reduce_train.loc[used_idx]
    return reduce_train_t, used_idx

It was obvious that scaling by ajust_factor was risky.
Whan I chose two submissions at the end of the competition, I chose the one which used ajust_factor and the other one which didn't use ajust_factor.

In [12]:
# function to exclude columns from the train and test set if the mean is different,
# also adjust test column by a factor to simulate the same distribution
def exclude(reduce_train, reduce_test, features):
    to_exclude = []
    ajusted_test = reduce_test.copy()
    for feature in features:
        if feature not in ['accuracy_group', 'installation_id', 'session_title']:
            data = reduce_train[feature]
            train_mean = data.mean()
            data = ajusted_test[feature]
            test_mean = data.mean()
            try:
                ajust_factor = train_mean / test_mean
                if ajust_factor > 10 or ajust_factor < 0.1:  # or error > 0.01:
                    to_exclude.append(feature)
                    print(feature)
                else:
                    ajusted_test[feature] *= ajust_factor
            except:
                to_exclude.append(feature)
                print(feature)
    return to_exclude, ajusted_test


def remove_correlated_features(reduce_train, features):
    counter = 0
    to_remove = []
    for feat_a in features:
        for feat_b in features:
            if feat_a != feat_b and feat_a not in to_remove and feat_b not in to_remove:
                c = np.corrcoef(reduce_train[feat_a], reduce_train[feat_b])[0][1]
                if c > 0.995:
                    counter += 1
                    to_remove.append(feat_b)
                    print('{}: FEAT_A: {} FEAT_B: {} - Correlation: {}'.format(counter, feat_a, feat_b, c))
    return to_remove

This qwk function was shared by CPMP ( @cpmpml ).
* Fast QWK Computation<br>https://www.kaggle.com/c/data-science-bowl-2019/discussion/114133#latest-660168

In [13]:
@jit
def qwk(a1, a2):
    max_rat = 3
    a1 = np.asarray(a1, dtype=int)
    a2 = np.asarray(a2, dtype=int)

    hist1 = np.zeros((max_rat + 1,))
    hist2 = np.zeros((max_rat + 1,))

    o = 0
    for k in range(a1.shape[0]):
        i, j = a1[k], a2[k]
        hist1[i] += 1
        hist2[j] += 1
        o += (i - j) * (i - j)

    e = 0
    for i in range(max_rat + 1):
        for j in range(max_rat + 1):
            e += hist1[i] * hist2[j] * (i - j) * (i - j)

    e = e / a1.shape[0]

    return 1 - o / e


class MyModel:

    def __init__(self, train_all, features, list_seed):
        self.bin_models = []
        self.models = []
        self.rounders = []
        self.features = features
        self.list_seed = list_seed

        params_origin = {
            'num_boost_round': 1000,
            'boosting_type': 'gbdt',  # 'dart', 'dart', 'gbdt'
            'metric': "None",
            'objective': 'regression',  # regression',quantile fair huber poisson
            'n_jobs': -1,
            'num_leaves': 32,
            'learning_rate': 0.08,
            'max_depth': 14,
            'lambda_l1': 2.0,
            'lambda_l2': 1.0,
            'bagging_fraction': 0.90,
            'bagging_freq': 1,
            'feature_fraction': 0.90,
            'early_stopping_rounds': 300,
            'verbose': 0,
        }

        oof_rmse_scores = []
        oof_cohen_scores = []

        target = 'accuracy_group'

        for model_number, seed in enumerate(self.list_seed):

            random.seed(seed)
            np.random.seed(seed)

            print(f'model_number: {model_number}')

            installation_id_all = sorted(list(set(train_all['installation_id'])))
            random.shuffle(installation_id_all)

            num_train = int(len(installation_id_all) * 0.8)
            train_installation_id = installation_id_all[:num_train]
            valid_installation_id = installation_id_all[num_train:]

            train = train_all.query('installation_id in @train_installation_id')
            valid = train_all.query('installation_id in @valid_installation_id')

            print('train installation id: {}'.format(train_installation_id[:10]))
            print('valid installation id: {}'.format(valid_installation_id[:10]))

            x_train, x_valid = train[features], valid[features]
            y_train, y_valid = train[target], valid[target]
            x_train.drop('installation_id', inplace=True, axis=1)

            x_valid, idx_val = get_random_assessment(x_valid)
            x_valid.drop('installation_id', inplace=True, axis=1)
            y_valid = y_valid.loc[idx_val]

            train_set = lgb.Dataset(x_train, y_train, categorical_feature=['session_title'])
            val_set = lgb.Dataset(x_valid, y_valid, categorical_feature=['session_title'])

            params = dict(params_origin)
            params['seed'] = seed

            model = lgb.train(params, train_set,
                              valid_sets=[train_set, val_set], verbose_eval=10,
                              feval=eval_qwk_lgb_metric)

            self.models.append(model)
            reg_pred = model.predict(x_valid)

            optR = OptimizedRounder_LGBM()
            optR.fit(reg_pred, y_valid)
            coef = optR.coefficients()
            self.rounders.append(optR)

            oof_rmse_score = np.sqrt(mean_squared_error(y_valid, reg_pred))
            oof_cohen_score = qwk(y_valid, optR.predict(reg_pred, coef))

            print('RMSE:', oof_rmse_score)
            print(' QWK:', oof_cohen_score)

            oof_rmse_scores.append(oof_rmse_score)
            oof_cohen_scores.append(oof_cohen_score)

        print('mean RMSE: ', sum(oof_rmse_scores) / len(oof_rmse_scores))
        print('mean  QWK: ', sum(oof_cohen_scores) / len(oof_cohen_scores))

    def predict(self, test):

        print('number of models: {}'.format(len(self.models)))

        current_features = [x for x in self.features if x not in ['installation_id']]

        list_y_pred = list()

        for i in range(len(self.list_seed)):

            model = self.models[i]
            coef = self.rounders[i].coefficients()

            y_pred_i = self.rounders[i].predict(model.predict(test[current_features]), coef)
            list_y_pred.append(np.squeeze(y_pred_i))

        y_preds = np.stack(list_y_pred, axis=1)

        assert len(y_preds) == len(test)
        assert y_preds.shape[1] == len(self.list_seed)

        print(y_preds[:10])

        return y_preds


class OptimizedRounder_LGBM(object):
    """
    An optimizer for rounding thresholds
    to maximize Quadratic Weighted Kappa (QWK) score
    # https://www.kaggle.com/naveenasaithambi/optimizedrounder-improved
    """

    def __init__(self):
        self.coef_ = 0

    def _kappa_loss(self, coef, X, y):
        """
        Get loss according to
        using current coefficients

        :param coef: A list of coefficients that will be used for rounding
        :param X: The raw predictions
        :param y: The ground truth labels
        """
        X_p = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels=[0, 1, 2, 3])

        return -qwk(y, X_p)

    def fit(self, X, y):
        """
        Optimize rounding thresholds

        :param X: The raw predictions
        :param y: The ground truth labels
        """
        loss_partial = partial(self._kappa_loss, X=X, y=y)
        initial_coef = [1.10, 1.72, 2.25]
        self.coef_ = sp.optimize.minimize(loss_partial, initial_coef,
                                          method='nelder-mead', options={'maxiter': 5000})

    def predict(self, X, coef):
        """
        Make predictions with specified thresholds

        :param X: The raw predictions
        :param coef: A list of coefficients that will be used for rounding
        """
        return pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels=[0, 1, 2, 3])

    def coefficients(self):
        """
        Return the optimized coefficients
        """
        return self.coef_['x']


def eval_qwk_lgb_metric(y_pred, true):
    y_true = true.label

    dist = Counter(y_true)
    for k in dist:
        dist[k] /= len(y_true)

    acum = 0
    bound = {}
    for i in range(3):
        acum += dist[i]
        bound[i] = np.percentile(y_pred, acum * 100)

    def classify(x):
        if x <= bound[0]:
            return 0
        elif x <= bound[1]:
            return 1
        elif x <= bound[2]:
            return 2
        else:
            return 3

    y_pred = np.array(list(map(classify, y_pred)))

    return 'cappa', qwk(y_true, y_pred), True


def main_lgbm(dir_dataset: Path, debug: Path, seeds_lgbm: str):

    list_seed = [int(s) for s in seeds_lgbm.split(',')]

    in_kaggle = False
    random.seed(42)
    np.random.seed(42)
    start_program = time.time()

    event_data = {}
    if in_kaggle:
        event_data["strides"] = 300
        event_data["process_numbers"] = 4
    else:
        event_data["strides"] = 300
        event_data["process_numbers"] = 3

    # read data
    train, test, train_labels, specs, sample_submission = read_data(dir_dataset, debug)
    # get useful dict with mapping encode
    train, test, event_data_update = encode_title(train, test, train_labels)
    event_data.update(event_data_update)

    # reduce_train, reduce_test, reduce_train_from_test = get_train_and_test_single_proc(train, test, event_data)
    reduce_train, reduce_test, reduce_train_from_test = get_train_and_test(train, test, event_data)
    dels = [train, test]
    del dels

    sample_submission = pd.read_csv(dir_dataset / 'sample_submission.csv')

    reduce_train.sort_values("installation_id", axis=0, ascending=True, inplace=True, na_position='last')
    reduce_test.sort_values("installation_id", axis=0, ascending=True, inplace=True, na_position='last')

    reduce_train = pd.concat([reduce_train, reduce_train_from_test], ignore_index=True)

    old_features = list(reduce_train.columns[0:99]) + list(reduce_train.columns[886:])
    el_features = ['accuracy_group', 'accuracy', 'installation_id']
    old_features = [col for col in old_features if col not in el_features]
    event_id_features = list(reduce_train.columns[99:483])
    title_event_code_cross = list(reduce_train.columns[483:886])
    features = old_features + event_id_features + title_event_code_cross

    to_remove = remove_correlated_features(reduce_train, features)
    features = [col for col in features if col not in to_remove]
    print('Training with {} features'.format(len(features)))

    features.append('installation_id')

    # to avoid below error
    # lightgbm.basic.LightGBMError: Do not support special JSON characters in feature name.
    features = [str(s).replace(',', '__') for s in features]
    reduce_train.columns = [str(s).replace(',', '__') for s in reduce_train.columns]
    reduce_test.columns = [str(s).replace(',', '__') for s in reduce_test.columns]

    to_exclude, ajusted_test = exclude(reduce_train, reduce_test, features)
    features = [col for col in features if col not in to_exclude]

    my_model = MyModel(reduce_train, features, list_seed=list_seed)
    train_pred = my_model.predict(reduce_train)
    test_pred = my_model.predict(ajusted_test)

    y = reduce_train['accuracy_group'].values

    for i, seed in enumerate(list_seed):

        train_pred_round = train_pred[:, i]
        test_pred_round = test_pred[:, i]

        print('train cappa rounding: {:.4f}'.format(qwk(y, train_pred_round)))

        predict(sample_submission, test_pred_round, f'submission_seed{seed}.csv')

    print("Program full time:", time.time() - start_program)

## 5.2 Execution
19 LGBM models are trained here.
19 pre-submissions are saved on local strage.

In [14]:
seeds_lgbm = '1048,1049,1050,1051,1052,1053,1054,1055,1056,1057' \
             ',1058,1059,1060,1061,1062,1063,1064,1065,1066'

main_lgbm(
    dir_dataset=dir_dataset,
    debug=debug_lgbm,
    seeds_lgbm=seeds_lgbm
)

Start read data
Reading train data....
full data mode !
Training.csv file have 11341042 rows and 11 columns
Train_labels.csv file have 17690 rows and 7 columns
Reading test data....
Test.csv file have 1156414 rows and 11 columns
Reading specs.csv file....
Specs.csv file have 386 rows and 3 columns
Reading sample_submission.csv file....
Sample_submission.csv file have 1000 rows and 2 columns
read data done, time -  79.28344202041626
Start encoding data
End encoding data, time -  76.40388250350952
Start get_train_and_test


HBox(children=(FloatProgress(value=0.0, max=17000.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=1000.0), HTML(value='')))


End get_train_and_test, time -  2273.1146750450134
1: FEAT_A: Clip FEAT_B: 27253bdc - Correlation: 1.0
2: FEAT_A: 2000 FEAT_B: installation_session_count - Correlation: 0.9999999999999999
3: FEAT_A: 2020 FEAT_B: 2030 - Correlation: 0.996191280717455
4: FEAT_A: 2040 FEAT_B: 2050 - Correlation: 0.9965872887962527
5: FEAT_A: 2040 FEAT_B: 2b9272f4 - Correlation: 0.9965033653798532
6: FEAT_A: 2040 FEAT_B: 37c53127 - Correlation: 0.9965872887962527
7: FEAT_A: 2040 FEAT_B: 5a848010 - Correlation: 0.9976173802635967
8: FEAT_A: 2040 FEAT_B: 73757a5e - Correlation: 0.9967092436106245
9: FEAT_A: 2040 FEAT_B: dcaede90 - Correlation: 1.0
10: FEAT_A: 3010 FEAT_B: 3110 - Correlation: 0.9999291513813332
11: FEAT_A: 3020 FEAT_B: 3120 - Correlation: 0.9998781338229461
12: FEAT_A: 3021 FEAT_B: 3121 - Correlation: 0.9999096941873034
13: FEAT_A: 4031 FEAT_B: 1996c610 - Correlation: 0.9999999999999999
14: FEAT_A: 4050 FEAT_B: a1192f43 - Correlation: 0.9999999999999999
15: FEAT_A: 4220 FEAT_B: 1340b8d7 - Co

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.607886	valid_1's cappa: 0.499267
[20]	training's cappa: 0.622464	valid_1's cappa: 0.497783
[30]	training's cappa: 0.641854	valid_1's cappa: 0.514104
[40]	training's cappa: 0.660084	valid_1's cappa: 0.509653
[50]	training's cappa: 0.675255	valid_1's cappa: 0.500751
[60]	training's cappa: 0.686374	valid_1's cappa: 0.502976
[70]	training's cappa: 0.697039	valid_1's cappa: 0.495558
[80]	training's cappa: 0.707456	valid_1's cappa: 0.50446
[90]	training's cappa: 0.717708	valid_1's cappa: 0.511429
[100]	training's cappa: 0.72581	valid_1's cappa: 0.511429
[110]	training's cappa: 0.736434	valid_1's cappa: 0.512767
[120]	training's cappa: 0.742965	valid_1's cappa: 0.513509
[130]	training's cappa: 0.750736	valid_1's cappa: 0.515587
[140]	training's cappa: 0.759086	valid_1's cappa: 0.509653
[150]	training's cappa: 0.766445	valid_1's cappa: 0.511136
[160]	training's cappa: 0.772686	valid_1's cappa: 0.508911
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.599196	valid_1's cappa: 0.543962
[20]	training's cappa: 0.622809	valid_1's cappa: 0.552768
[30]	training's cappa: 0.642575	valid_1's cappa: 0.571791
[40]	training's cappa: 0.660617	valid_1's cappa: 0.565303
[50]	training's cappa: 0.674664	valid_1's cappa: 0.568908
[60]	training's cappa: 0.686362	valid_1's cappa: 0.580442
[70]	training's cappa: 0.698142	valid_1's cappa: 0.581163
[80]	training's cappa: 0.708424	valid_1's cappa: 0.579721
[90]	training's cappa: 0.719677	valid_1's cappa: 0.579
[100]	training's cappa: 0.729757	valid_1's cappa: 0.585488
[110]	training's cappa: 0.737246	valid_1's cappa: 0.586209
[120]	training's cappa: 0.745665	valid_1's cappa: 0.581323
[130]	training's cappa: 0.755988	valid_1's cappa: 0.57339
[140]	training's cappa: 0.764488	valid_1's cappa: 0.572512
[150]	training's cappa: 0.77056	valid_1's cappa: 0.573233
[160]	training's cappa: 0.775782	valid_1's cappa: 0.57107
[170]	tr

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.601802	valid_1's cappa: 0.520766
[20]	training's cappa: 0.62228	valid_1's cappa: 0.537126
[30]	training's cappa: 0.644874	valid_1's cappa: 0.54483
[40]	training's cappa: 0.6604	valid_1's cappa: 0.549441
[50]	training's cappa: 0.676113	valid_1's cappa: 0.54483
[60]	training's cappa: 0.687456	valid_1's cappa: 0.54409
[70]	training's cappa: 0.701427	valid_1's cappa: 0.54409
[80]	training's cappa: 0.711909	valid_1's cappa: 0.556671
[90]	training's cappa: 0.722146	valid_1's cappa: 0.54927
[100]	training's cappa: 0.729516	valid_1's cappa: 0.54927
[110]	training's cappa: 0.740631	valid_1's cappa: 0.553711
[120]	training's cappa: 0.748556	valid_1's cappa: 0.54927
[130]	training's cappa: 0.756581	valid_1's cappa: 0.551491
[140]	training's cappa: 0.76391	valid_1's cappa: 0.552231
[150]	training's cappa: 0.77128	valid_1's cappa: 0.55001
[160]	training's cappa: 0.778077	valid_1's cappa: 0.550751
[170]	training'

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.604058	valid_1's cappa: 0.481633
[20]	training's cappa: 0.623867	valid_1's cappa: 0.490179
[30]	training's cappa: 0.644629	valid_1's cappa: 0.4856
[40]	training's cappa: 0.661825	valid_1's cappa: 0.487889
[50]	training's cappa: 0.673649	valid_1's cappa: 0.478731
[60]	training's cappa: 0.686866	valid_1's cappa: 0.478731
[70]	training's cappa: 0.698713	valid_1's cappa: 0.481932
[80]	training's cappa: 0.708341	valid_1's cappa: 0.487126
[90]	training's cappa: 0.719555	valid_1's cappa: 0.493232
[100]	training's cappa: 0.727756	valid_1's cappa: 0.487889
[110]	training's cappa: 0.737028	valid_1's cappa: 0.481784
[120]	training's cappa: 0.746458	valid_1's cappa: 0.481784
[130]	training's cappa: 0.754026	valid_1's cappa: 0.476115
[140]	training's cappa: 0.762426	valid_1's cappa: 0.479025
[150]	training's cappa: 0.768448	valid_1's cappa: 0.475824
[160]	training's cappa: 0.776333	valid_1's cappa: 0.484073
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.610184	valid_1's cappa: 0.498189
[20]	training's cappa: 0.628753	valid_1's cappa: 0.496497
[30]	training's cappa: 0.646261	valid_1's cappa: 0.504902
[40]	training's cappa: 0.661821	valid_1's cappa: 0.507194
[50]	training's cappa: 0.677125	valid_1's cappa: 0.506166
[60]	training's cappa: 0.690076	valid_1's cappa: 0.509654
[70]	training's cappa: 0.701856	valid_1's cappa: 0.518999
[80]	training's cappa: 0.712282	valid_1's cappa: 0.520947
[90]	training's cappa: 0.721622	valid_1's cappa: 0.524003
[100]	training's cappa: 0.732532	valid_1's cappa: 0.516111
[110]	training's cappa: 0.739738	valid_1's cappa: 0.525531
[120]	training's cappa: 0.748393	valid_1's cappa: 0.524003
[130]	training's cappa: 0.756726	valid_1's cappa: 0.517126
[140]	training's cappa: 0.764456	valid_1's cappa: 0.515598
[150]	training's cappa: 0.771863	valid_1's cappa: 0.517126
[160]	training's cappa: 0.778988	valid_1's cappa: 0.513306
[1

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.60631	valid_1's cappa: 0.525191
[20]	training's cappa: 0.629471	valid_1's cappa: 0.536394
[30]	training's cappa: 0.648934	valid_1's cappa: 0.531611
[40]	training's cappa: 0.662809	valid_1's cappa: 0.535311
[50]	training's cappa: 0.677315	valid_1's cappa: 0.540491
[60]	training's cappa: 0.689175	valid_1's cappa: 0.540491
[70]	training's cappa: 0.700034	valid_1's cappa: 0.540491
[80]	training's cappa: 0.710292	valid_1's cappa: 0.539011
[90]	training's cappa: 0.721312	valid_1's cappa: 0.54493
[100]	training's cappa: 0.730969	valid_1's cappa: 0.542711
[110]	training's cappa: 0.738181	valid_1's cappa: 0.535311
[120]	training's cappa: 0.745514	valid_1's cappa: 0.535311
[130]	training's cappa: 0.752647	valid_1's cappa: 0.534571
[140]	training's cappa: 0.760861	valid_1's cappa: 0.536051
[150]	training's cappa: 0.768635	valid_1's cappa: 0.534571
[160]	training's cappa: 0.776128	valid_1's cappa: 0.535311
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.608821	valid_1's cappa: 0.516354
[20]	training's cappa: 0.625922	valid_1's cappa: 0.516818
[30]	training's cappa: 0.645449	valid_1's cappa: 0.515402
[40]	training's cappa: 0.660457	valid_1's cappa: 0.526504
[50]	training's cappa: 0.676296	valid_1's cappa: 0.531007
[60]	training's cappa: 0.687268	valid_1's cappa: 0.543013
[70]	training's cappa: 0.697392	valid_1's cappa: 0.53776
[80]	training's cappa: 0.708117	valid_1's cappa: 0.53701
[90]	training's cappa: 0.718881	valid_1's cappa: 0.533258
[100]	training's cappa: 0.727765	valid_1's cappa: 0.533258
[110]	training's cappa: 0.737769	valid_1's cappa: 0.533258
[120]	training's cappa: 0.746413	valid_1's cappa: 0.530416
[130]	training's cappa: 0.753696	valid_1's cappa: 0.529506
[140]	training's cappa: 0.761139	valid_1's cappa: 0.53701
[150]	training's cappa: 0.767542	valid_1's cappa: 0.534759
[160]	training's cappa: 0.774625	valid_1's cappa: 0.540762
[170]

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.601208	valid_1's cappa: 0.487488
[20]	training's cappa: 0.623914	valid_1's cappa: 0.494363
[30]	training's cappa: 0.645634	valid_1's cappa: 0.507688
[40]	training's cappa: 0.661325	valid_1's cappa: 0.512998
[50]	training's cappa: 0.674894	valid_1's cappa: 0.509887
[60]	training's cappa: 0.687463	valid_1's cappa: 0.506191
[70]	training's cappa: 0.698791	valid_1's cappa: 0.503973
[80]	training's cappa: 0.708878	valid_1's cappa: 0.508409
[90]	training's cappa: 0.718245	valid_1's cappa: 0.508409
[100]	training's cappa: 0.729412	valid_1's cappa: 0.510626
[110]	training's cappa: 0.738338	valid_1's cappa: 0.510626
[120]	training's cappa: 0.745944	valid_1's cappa: 0.511366
[130]	training's cappa: 0.754189	valid_1's cappa: 0.512844
[140]	training's cappa: 0.761955	valid_1's cappa: 0.507669
[150]	training's cappa: 0.76936	valid_1's cappa: 0.505452
[160]	training's cappa: 0.775885	valid_1's cappa: 0.503234
[17

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.60246	valid_1's cappa: 0.485057
[20]	training's cappa: 0.623778	valid_1's cappa: 0.505538
[30]	training's cappa: 0.642243	valid_1's cappa: 0.509829
[40]	training's cappa: 0.660597	valid_1's cappa: 0.522257
[50]	training's cappa: 0.673292	valid_1's cappa: 0.520778
[60]	training's cappa: 0.686351	valid_1's cappa: 0.521518
[70]	training's cappa: 0.696781	valid_1's cappa: 0.515601
[80]	training's cappa: 0.708465	valid_1's cappa: 0.525215
[90]	training's cappa: 0.720958	valid_1's cappa: 0.531132
[100]	training's cappa: 0.72872	valid_1's cappa: 0.536308
[110]	training's cappa: 0.737049	valid_1's cappa: 0.534829
[120]	training's cappa: 0.745459	valid_1's cappa: 0.529652
[130]	training's cappa: 0.753827	valid_1's cappa: 0.53409
[140]	training's cappa: 0.760943	valid_1's cappa: 0.53335
[150]	training's cappa: 0.768584	valid_1's cappa: 0.541485
[160]	training's cappa: 0.776144	valid_1's cappa: 0.537787
[170]	

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.598361	valid_1's cappa: 0.521731
[20]	training's cappa: 0.619741	valid_1's cappa: 0.522203
[30]	training's cappa: 0.639096	valid_1's cappa: 0.535834
[40]	training's cappa: 0.655131	valid_1's cappa: 0.543225
[50]	training's cappa: 0.670922	valid_1's cappa: 0.550015
[60]	training's cappa: 0.684973	valid_1's cappa: 0.550968
[70]	training's cappa: 0.694231	valid_1's cappa: 0.559866
[80]	training's cappa: 0.7053	valid_1's cappa: 0.556173
[90]	training's cappa: 0.716071	valid_1's cappa: 0.552095
[100]	training's cappa: 0.724696	valid_1's cappa: 0.55301
[110]	training's cappa: 0.732997	valid_1's cappa: 0.548963
[120]	training's cappa: 0.74231	valid_1's cappa: 0.552481
[130]	training's cappa: 0.748991	valid_1's cappa: 0.54805
[140]	training's cappa: 0.758425	valid_1's cappa: 0.547311
[150]	training's cappa: 0.764985	valid_1's cappa: 0.546573
[160]	training's cappa: 0.772274	valid_1's cappa: 0.544357
[170]	t

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.606401	valid_1's cappa: 0.533713
[20]	training's cappa: 0.627506	valid_1's cappa: 0.579515
[30]	training's cappa: 0.643254	valid_1's cappa: 0.570505
[40]	training's cappa: 0.659851	valid_1's cappa: 0.573508
[50]	training's cappa: 0.674245	valid_1's cappa: 0.578764
[60]	training's cappa: 0.688436	valid_1's cappa: 0.578764
[70]	training's cappa: 0.699433	valid_1's cappa: 0.571256
[80]	training's cappa: 0.708692	valid_1's cappa: 0.570505
[90]	training's cappa: 0.718354	valid_1's cappa: 0.575761
[100]	training's cappa: 0.728827	valid_1's cappa: 0.582519
[110]	training's cappa: 0.737762	valid_1's cappa: 0.584771
[120]	training's cappa: 0.746132	valid_1's cappa: 0.579515
[130]	training's cappa: 0.754663	valid_1's cappa: 0.572758
[140]	training's cappa: 0.760849	valid_1's cappa: 0.574259
[150]	training's cappa: 0.76756	valid_1's cappa: 0.578014
[160]	training's cappa: 0.775323	valid_1's cappa: 0.577263
[17

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.601664	valid_1's cappa: 0.541097
[20]	training's cappa: 0.621417	valid_1's cappa: 0.553718
[30]	training's cappa: 0.640304	valid_1's cappa: 0.567786
[40]	training's cappa: 0.654634	valid_1's cappa: 0.561538
[50]	training's cappa: 0.667175	valid_1's cappa: 0.560044
[60]	training's cappa: 0.682741	valid_1's cappa: 0.564273
[70]	training's cappa: 0.695304	valid_1's cappa: 0.574983
[80]	training's cappa: 0.704851	valid_1's cappa: 0.572743
[90]	training's cappa: 0.715233	valid_1's cappa: 0.569008
[100]	training's cappa: 0.725866	valid_1's cappa: 0.573489
[110]	training's cappa: 0.736701	valid_1's cappa: 0.574236
[120]	training's cappa: 0.744422	valid_1's cappa: 0.574236
[130]	training's cappa: 0.752265	valid_1's cappa: 0.571249
[140]	training's cappa: 0.759178	valid_1's cappa: 0.571249
[150]	training's cappa: 0.767708	valid_1's cappa: 0.560044
[160]	training's cappa: 0.775066	valid_1's cappa: 0.5632
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.612449	valid_1's cappa: 0.513707
[20]	training's cappa: 0.629498	valid_1's cappa: 0.51647
[30]	training's cappa: 0.646203	valid_1's cappa: 0.526216
[40]	training's cappa: 0.661608	valid_1's cappa: 0.522467
[50]	training's cappa: 0.675988	valid_1's cappa: 0.532213
[60]	training's cappa: 0.689664	valid_1's cappa: 0.530307
[70]	training's cappa: 0.700966	valid_1's cappa: 0.527715
[80]	training's cappa: 0.710304	valid_1's cappa: 0.525466
[90]	training's cappa: 0.717538	valid_1's cappa: 0.523967
[100]	training's cappa: 0.727324	valid_1's cappa: 0.520218
[110]	training's cappa: 0.737331	valid_1's cappa: 0.518719
[120]	training's cappa: 0.745093	valid_1's cappa: 0.522467
[130]	training's cappa: 0.750411	valid_1's cappa: 0.520218
[140]	training's cappa: 0.758882	valid_1's cappa: 0.517969
[150]	training's cappa: 0.76688	valid_1's cappa: 0.522467
[160]	training's cappa: 0.775311	valid_1's cappa: 0.523217
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.606459	valid_1's cappa: 0.479669
[20]	training's cappa: 0.621507	valid_1's cappa: 0.479205
[30]	training's cappa: 0.641353	valid_1's cappa: 0.480437
[40]	training's cappa: 0.658327	valid_1's cappa: 0.488892
[50]	training's cappa: 0.671932	valid_1's cappa: 0.496578
[60]	training's cappa: 0.686742	valid_1's cappa: 0.498883
[70]	training's cappa: 0.697444	valid_1's cappa: 0.497508
[80]	training's cappa: 0.707817	valid_1's cappa: 0.501958
[90]	training's cappa: 0.718478	valid_1's cappa: 0.505966
[100]	training's cappa: 0.728687	valid_1's cappa: 0.501353
[110]	training's cappa: 0.73779	valid_1's cappa: 0.505032
[120]	training's cappa: 0.746195	valid_1's cappa: 0.500421
[130]	training's cappa: 0.753986	valid_1's cappa: 0.498883
[140]	training's cappa: 0.762268	valid_1's cappa: 0.501189
[150]	training's cappa: 0.770592	valid_1's cappa: 0.504264
[160]	training's cappa: 0.777521	valid_1's cappa: 0.508106
[17

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.602038	valid_1's cappa: 0.544453
[20]	training's cappa: 0.621606	valid_1's cappa: 0.557643
[30]	training's cappa: 0.640847	valid_1's cappa: 0.560688
[40]	training's cappa: 0.657766	valid_1's cappa: 0.575154
[50]	training's cappa: 0.671446	valid_1's cappa: 0.569825
[60]	training's cappa: 0.684549	valid_1's cappa: 0.57287
[70]	training's cappa: 0.696005	valid_1's cappa: 0.569825
[80]	training's cappa: 0.707623	valid_1's cappa: 0.571347
[90]	training's cappa: 0.71757	valid_1's cappa: 0.571347
[100]	training's cappa: 0.726946	valid_1's cappa: 0.573052
[110]	training's cappa: 0.735059	valid_1's cappa: 0.57287
[120]	training's cappa: 0.74515	valid_1's cappa: 0.581245
[130]	training's cappa: 0.753783	valid_1's cappa: 0.577438
[140]	training's cappa: 0.761842	valid_1's cappa: 0.572109
[150]	training's cappa: 0.768324	valid_1's cappa: 0.575916
[160]	training's cappa: 0.775703	valid_1's cappa: 0.571347
[170]	

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.613242	valid_1's cappa: 0.455473
[20]	training's cappa: 0.631195	valid_1's cappa: 0.456989
[30]	training's cappa: 0.64743	valid_1's cappa: 0.460781
[40]	training's cappa: 0.661987	valid_1's cappa: 0.464573
[50]	training's cappa: 0.676856	valid_1's cappa: 0.472916
[60]	training's cappa: 0.688135	valid_1's cappa: 0.471399
[70]	training's cappa: 0.701209	valid_1's cappa: 0.469882
[80]	training's cappa: 0.711473	valid_1's cappa: 0.470641
[90]	training's cappa: 0.720098	valid_1's cappa: 0.469124
[100]	training's cappa: 0.730636	valid_1's cappa: 0.475191
[110]	training's cappa: 0.739222	valid_1's cappa: 0.477466
[120]	training's cappa: 0.746442	valid_1's cappa: 0.469882
[130]	training's cappa: 0.753661	valid_1's cappa: 0.472157
[140]	training's cappa: 0.759398	valid_1's cappa: 0.473674
[150]	training's cappa: 0.766228	valid_1's cappa: 0.472157
[160]	training's cappa: 0.77376	valid_1's cappa: 0.469124
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.603216	valid_1's cappa: 0.500632
[20]	training's cappa: 0.627011	valid_1's cappa: 0.515234
[30]	training's cappa: 0.644989	valid_1's cappa: 0.517424
[40]	training's cappa: 0.660364	valid_1's cappa: 0.528375
[50]	training's cappa: 0.677163	valid_1's cappa: 0.529565
[60]	training's cappa: 0.689553	valid_1's cappa: 0.521954
[70]	training's cappa: 0.700086	valid_1's cappa: 0.524724
[80]	training's cappa: 0.711899	valid_1's cappa: 0.524599
[90]	training's cappa: 0.721905	valid_1's cappa: 0.516841
[100]	training's cappa: 0.732725	valid_1's cappa: 0.519614
[110]	training's cappa: 0.739802	valid_1's cappa: 0.515964
[120]	training's cappa: 0.748385	valid_1's cappa: 0.518884
[130]	training's cappa: 0.756398	valid_1's cappa: 0.519614
[140]	training's cappa: 0.763516	valid_1's cappa: 0.522534
[150]	training's cappa: 0.768966	valid_1's cappa: 0.51694
[160]	training's cappa: 0.777061	valid_1's cappa: 0.521804
[17

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.600809	valid_1's cappa: 0.505474
[20]	training's cappa: 0.624277	valid_1's cappa: 0.509697
[30]	training's cappa: 0.642958	valid_1's cappa: 0.514985
[40]	training's cappa: 0.660766	valid_1's cappa: 0.514985
[50]	training's cappa: 0.675243	valid_1's cappa: 0.521264
[60]	training's cappa: 0.687142	valid_1's cappa: 0.522017
[70]	training's cappa: 0.695999	valid_1's cappa: 0.524791
[80]	training's cappa: 0.70733	valid_1's cappa: 0.530825
[90]	training's cappa: 0.717722	valid_1's cappa: 0.528154
[100]	training's cappa: 0.726606	valid_1's cappa: 0.530071
[110]	training's cappa: 0.735966	valid_1's cappa: 0.533842
[120]	training's cappa: 0.743899	valid_1's cappa: 0.537614
[130]	training's cappa: 0.751911	valid_1's cappa: 0.538368
[140]	training's cappa: 0.759843	valid_1's cappa: 0.535351
[150]	training's cappa: 0.767101	valid_1's cappa: 0.535351
[160]	training's cappa: 0.774399	valid_1's cappa: 0.53686
[170

HBox(children=(FloatProgress(value=0.0, max=835.0), HTML(value='')))


Training until validation scores don't improve for 300 rounds
[10]	training's cappa: 0.605209	valid_1's cappa: 0.516257
[20]	training's cappa: 0.626248	valid_1's cappa: 0.526841
[30]	training's cappa: 0.64551	valid_1's cappa: 0.524598
[40]	training's cappa: 0.66168	valid_1's cappa: 0.527577
[50]	training's cappa: 0.675354	valid_1's cappa: 0.524558
[60]	training's cappa: 0.686326	valid_1's cappa: 0.530002
[70]	training's cappa: 0.697297	valid_1's cappa: 0.524558
[80]	training's cappa: 0.708864	valid_1's cappa: 0.531405
[90]	training's cappa: 0.718643	valid_1's cappa: 0.535123
[100]	training's cappa: 0.727905	valid_1's cappa: 0.536633
[110]	training's cappa: 0.736133	valid_1's cappa: 0.532105
[120]	training's cappa: 0.743686	valid_1's cappa: 0.528331
[130]	training's cappa: 0.751785	valid_1's cappa: 0.526822
[140]	training's cappa: 0.758434	valid_1's cappa: 0.53135
[150]	training's cappa: 0.766225	valid_1's cappa: 0.529841
[160]	training's cappa: 0.773897	valid_1's cappa: 0.535123
[170]

# 6. Ensemble
All pre-submissions are ensembled here.
I implemented weighted voting.

In [15]:
def ensemble(submissions_all: pd.DataFrame, weights):

    assert submissions_all.shape[1] == len(weights)

    vote_table_all = np.zeros((len(submissions_all), 4))

    for i in range(len(weights)):

        vote_table = np.zeros((len(submissions_all), 4))
        vote_table[np.arange(len(submissions_all)), submissions_all.iloc[:, i]] = 1.0

        vote_table_all += vote_table * weights[i]

    display(pd.DataFrame(vote_table_all[:20]))

    return np.argmax(vote_table_all, axis=1)

Final submission (submission.csv) is generated here.
Since, self-attention models achieved higher score on public LB, I gave higher weight with them.
The each weight of self-attention model is 1.0, and each weight of Light GBM model is 0.25.
Considering number of models, each sample get 14.75 weight in total (1.0 x 10 + 0.25 x 19).

In [16]:
list_seed_lgbm = [int(s) for s in seeds_lgbm.split(',')]
list_seed_nn = [int(s) for s in seeds_nn.split(',')]

seed_str = seeds_lgbm + ',' + seeds_nn
list_seed = [int(s) for s in seed_str.split(',')]

submission1 = pd.read_csv(f'submission_seed{list_seed[0]}.csv')['installation_id']

submission_all = pd.concat([pd.read_csv(f'submission_seed{seed}.csv')[['accuracy_group']] for seed in list_seed],
                           axis=1)

ensemble_weights = np.append(np.full(len(list_seed_lgbm), 0.25),
                             np.ones(len(list_seed_nn)))

ensemble_pred = ensemble(submission_all, ensemble_weights)

submission = pd.DataFrame()
submission['installation_id'] = submission1
submission['accuracy_group'] = ensemble_pred
submission.to_csv('submission.csv', index=False)

print('prediction counts')
print(submission['accuracy_group'].value_counts())

Unnamed: 0,0,1,2,3
0,0.0,2.0,5.75,7.0
1,0.0,0.0,0.25,14.5
2,0.0,0.0,1.0,13.75
3,0.0,0.25,2.75,11.75
4,0.0,1.75,4.0,9.0
5,0.0,0.0,0.0,14.75
6,0.0,1.0,8.75,5.0
7,0.0,0.75,10.0,4.0
8,14.75,0.0,0.0,0.0
9,4.0,10.75,0.0,0.0


prediction counts
3    479
2    200
0    161
1    160
Name: accuracy_group, dtype: int64


# 7. Conclusion
In this competition, I employed self-attention models and Light GBM.
Emsembling with them improved public LB score more.
It was interesting that self-attention models could be used for tabular data competition.

Even though my approach would have potential to win this competition, I lost my way at somewhere.
I was misled by public LB and it caused shake-down.
I want to learn from this failure and try to win next competition.