<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Student-Model" data-toc-modified-id="Student-Model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Student Model</a></span><ul class="toc-item"><li><span><a href="#Data-processing" data-toc-modified-id="Data-processing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data processing</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#Define-Model" data-toc-modified-id="Define-Model-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Define Model</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Train</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Evaluation</a></span><ul class="toc-item"><li><span><a href="#ROC-AUC" data-toc-modified-id="ROC-AUC-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>ROC AUC</a></span></li><li><span><a href="#Compression-rate" data-toc-modified-id="Compression-rate-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Compression rate</a></span></li></ul></li></ul></li></ul></div>

# Student Model


Нужно обучть небольшую модель на [soft таргетах](https://drive.google.com/file/d/1tBbPOUT-Ow9f3zTDApykGXYwt-KslYle/view?usp=sharing)  модели учителя, которая не сильно уступала бы в качестве учителю.

In [1]:
import os
import pandas as pd
import numpy as np
import wandb

import torch.nn.functional as F

from category_encoders import CatBoostEncoder

from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split

from deepctr_torch.inputs import SparseFeat, DenseFeat, get_feature_names
from deepctr_torch.models.dcn import DCN

from collections import defaultdict

import matplotlib.pyplot as plt

  from ._conv import register_converters as _register_converters


In [2]:
DATA_PATH = '../../data/criteo'

TRAIN_DATA = os.path.join(DATA_PATH, 'train.csv')

## Data processing

Данные на Train/Validation/Test нужно разбить как 80/10/10

The data loading part was copied from the teacher model

In [3]:
dense_features_indices = [i for i in range(1, 14)]
sparse_features_indices = [i for i in range(14, 40)]

dense_features = ['c{}'.format(i) for i in dense_features_indices]
sparse_features = ['c{}'.format(i) for i in sparse_features_indices]

len(dense_features_indices), len(sparse_features_indices)

(13, 26)

In [4]:
data = pd.read_csv(TRAIN_DATA, index_col='id')
data.rename(columns=dict([(col, col[1:] if col[0] == '_' else col) for col in data.columns]), inplace=True)

data[sparse_features] = data[sparse_features].fillna('-1', )
data[dense_features] = data[dense_features].fillna(0, )

hard_target = ['c0']

  mask |= (ar1 == a)


In [5]:
soft_targets = pd.read_csv('soft_targets_full.csv', index_col='id', squeeze=True)

data['c0_soft'] = soft_targets

soft_target = ['c0_soft']
targets = hard_target + soft_target

  mask |= (ar1 == a)


In [6]:
data.head()

Unnamed: 0_level_0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c31,c32,c33,c34,c35,c36,c37,c38,c39,c0_soft
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,1,0.0,-1,0.0,0.0,1465.0,0.0,17.0,0.0,4.0,...,e5f8f18f,-1,-1,f3ddd519,-1,32c7478e,b34f3128,-1,-1,0.532062
26,1,0.0,1,20.0,16.0,1548.0,93.0,42.0,32.0,912.0,...,1f868fdd,21ddcdc9,a458ea53,7eee76d1,-1,32c7478e,9af06ad9,9d93af03,cdfe5ab7,0.483268
39,0,8.0,0,15.0,20.0,115.0,24.0,8.0,23.0,24.0,...,1304f63b,21ddcdc9,b1252a9d,07b2853e,-1,32c7478e,94bde4f2,010f6491,09b76f8d,0.126496
41,1,88.0,319,0.0,4.0,5.0,4.0,89.0,40.0,88.0,...,bbf70d82,-1,-1,16e2e3b3,-1,32c7478e,d859b4dd,-1,-1,0.750299
85,0,0.0,53,0.0,10.0,6550.0,98.0,34.0,11.0,349.0,...,fa0643ee,21ddcdc9,b1252a9d,0094bc78,-1,32c7478e,29ece3ed,001f3601,402185f3,0.784883


In [7]:
len(data)

3664931

Before processing the categorial features, let's split the dataset

In [8]:
mms = MinMaxScaler(feature_range=(0, 1))
data[dense_features] = mms.fit_transform(data[dense_features])
train, test = train_test_split(data, test_size=0.2, shuffle=False)
validation, test = train_test_split(test, test_size=0.5, shuffle=False)

print(len(train), len(validation), len(test))

2931944 366493 366494


### Processing categorial features

In [9]:
sparse_features_dims = {feat: len(data[feat].unique()) for feat in sparse_features}

In [10]:
sparse_features_dims

{'c14': 1445,
 'c15': 556,
 'c16': 1130758,
 'c17': 360209,
 'c18': 304,
 'c19': 21,
 'c20': 11845,
 'c21': 631,
 'c22': 3,
 'c23': 49223,
 'c24': 5194,
 'c25': 985420,
 'c26': 3157,
 'c27': 26,
 'c28': 11588,
 'c29': 715441,
 'c30': 10,
 'c31': 4681,
 'c32': 2029,
 'c33': 4,
 'c34': 870796,
 'c35': 17,
 'c36': 15,
 'c37': 87605,
 'c38': 84,
 'c39': 58187}

#### High cardinality features

We have about 3.6M samples in the dataset. For categorial features with high cardinality there are only a few training examples per each category. It is unlikely that we can learn meaningful high-dimensional embeddings for such categories. Also, such embeddings will consume a lot of memory, even if we apply hashing trick and reduce the number of distinct embedding vectors to 50,000 as it was done for the teacher model. Let's take another approach and encode them with some cool supervised techniques (that's why we split the data beforehand).

In [11]:
high_cardinality_features = [feat for feat, cnt in sparse_features_dims.items() if cnt >= 40000]

In [12]:
high_cardinality_features

['c16', 'c17', 'c23', 'c25', 'c29', 'c34', 'c37', 'c39']

In [13]:
enc = CatBoostEncoder(cols=high_cardinality_features, verbose=1)
enc.fit(train[high_cardinality_features], train[hard_target])

CatBoostEncoder(a=1,
                cols=['c16', 'c17', 'c23', 'c25', 'c29', 'c34', 'c37', 'c39'],
                drop_invariant=False, handle_missing='value',
                handle_unknown='value', random_state=None, return_df=True,
                sigma=None, verbose=1)

In [14]:
for df in [train, validation, test]:
    df.loc[:, high_cardinality_features] = enc.transform(df.loc[:, high_cardinality_features])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [15]:
dense_features = dense_features + high_cardinality_features
sparse_features = [feat for feat in sparse_features if feat not in high_cardinality_features]

#### Low cardinality features

Now let's deal with low (kind of) cardinality features. For them we will use label encoding.

In [16]:
for feat in sparse_features:
    le = LabelEncoder()
    le.fit(data.loc[:, feat])

    for df in [train, validation, test]:
        df.loc[:, feat] = le.transform(df.loc[:, feat])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [17]:
train.head()

Unnamed: 0_level_0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c31,c32,c33,c34,c35,c36,c37,c38,c39,c0_soft
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,1,0.0,9.1e-05,0.0,0.0,0.000558,0.0,0.001717,0.0,0.000207,...,4211,0,0,0.365174,0,2,0.260759,0,0.267092,0.532062
26,1,0.0,0.000181,0.000305,0.018244,0.00059,0.000798,0.004242,0.006335,0.047188,...,588,248,2,0.254942,0,2,0.496607,52,0.506555,0.483268
39,0,0.001385,0.000136,0.000229,0.022805,4.4e-05,0.000206,0.000808,0.004554,0.001242,...,354,248,3,0.254942,0,2,0.156868,2,0.313736,0.126496
41,1,0.015238,0.014591,0.0,0.004561,2e-06,3.4e-05,0.008989,0.007919,0.004553,...,3393,0,0,0.211829,0,2,0.230221,0,0.267092,0.750299
85,0,0.0,0.002537,0.0,0.011403,0.002497,0.000841,0.003434,0.002178,0.018058,...,4577,248,3,0.254942,0,2,0.383206,1,0.311325,0.784883


Finally, prepare input for the model (mostly copied from the teacher model)

In [18]:
fixlen_feature_columns = [SparseFeat(feat, 
                                     vocabulary_size=vocab_size, 
                                     embedding_dim=min(int(6 * (vocab_size) ** (0.25)), 100), 
                                     use_hash=False, dtype='string') 
                          for feat, vocab_size in sparse_features_dims.items()] + \
                        [DenseFeat(feat, 1,) for feat in dense_features]


linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns, )

In [19]:
def gen_model_input(df):
    return {name: (pd.core.series.Series(df[name]) if name in sparse_features else np.array(df[name])) for name in feature_names}


train_model_input = gen_model_input(train)
validation_model_input = gen_model_input(validation)
test_model_input = gen_model_input(test)

In [20]:
train_model_targets = train[targets]
validation_model_targets = validation[targets]
test_model_targets = test[targets]

### Model

Можно также использовать Pruning и/или Quantinization.

### Define Model

As a first solution let's use the same model as the teacher (DCN) but with smaller number of parameters.

To do so, we should redefine the loss function to use both soft and hard targets.

In [21]:
def distillation_loss(y_pred, y_true, weight=0.99):
    print(y_pred)
    print(y_true)
    loss_hard = F.binary_cross_entropy(y_pred, y_true[:, 0], reduction='sum')
    loss_soft = F.binary_cross_entropy(y_pred, y_true[:, 1], reduction='sum')
    return weight * loss_soft + (1 - weight) * loss_hard

In [22]:
model = DCN(linear_feature_columns, dnn_feature_columns, cross_num=2,
            dnn_hidden_units=(32, 32), l2_reg_linear=0, l2_reg_embedding=0,
            l2_reg_cross=0, l2_reg_dnn=0, init_std=0.0001, seed=1024, 
            dnn_use_bn=True, dnn_activation='relu', task='binary')

model.compile("adam", distillation_loss)

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 286176400 bytes. Error code 12 (Cannot allocate memory)


### Train

In [31]:
model.fit(train_model_input, train_model_targets.values, epochs=5, verbose=1)

NameError: name 'model' is not defined

## Evaluation

Наша основная задача получить модель, которая 
* в терминах ROC AUC не намного хуже модели учителя, и в то же время 
* сильно меньше по размеру

### ROC AUC

Сравним ROC AUC модели ученика с показателем для учителя.

ROC AUC учителя: 0.802

### Compression rate

Пусть 
* $a$ - \# of the parameters in the original model $M$
* $a^{*}$ - \# of the parameters in compressed model $M^{*}$

тогда compression rate is $$\alpha(M,M^{*}) = \frac{a}{a^{*}}$$

Можно также посчитать comression rate просто как отношение фактических размеров моделей.

Размер модели учителя - 168MB
