<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Student-Model" data-toc-modified-id="Student-Model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Student Model</a></span><ul class="toc-item"><li><span><a href="#Data-processing" data-toc-modified-id="Data-processing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data processing</a></span></li><li><span><a href="#Model" data-toc-modified-id="Model-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model</a></span><ul class="toc-item"><li><span><a href="#Define-Model" data-toc-modified-id="Define-Model-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Define Model</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Train</a></span></li></ul></li><li><span><a href="#Evaluation" data-toc-modified-id="Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Evaluation</a></span><ul class="toc-item"><li><span><a href="#ROC-AUC" data-toc-modified-id="ROC-AUC-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>ROC AUC</a></span></li><li><span><a href="#Compression-rate" data-toc-modified-id="Compression-rate-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Compression rate</a></span></li></ul></li></ul></li></ul></div>

# Student Model


Нужно обучть небольшую модель на [soft таргетах](https://drive.google.com/file/d/1tBbPOUT-Ow9f3zTDApykGXYwt-KslYle/view?usp=sharing)  модели учителя, которая не сильно уступала бы в качестве учителю.

In [1]:
import os
import pandas as pd
import numpy as np

import torch
import torch.nn.functional as F

from category_encoders import CatBoostEncoder, TargetEncoder

from sklearn.metrics import log_loss, roc_auc_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from deepctr_torch.inputs import SparseFeat, DenseFeat, get_feature_names, build_input_features
from deepctr_torch.models.dcn import DCN

from collections import defaultdict

from tqdm import tqdm
import matplotlib.pyplot as plt

In [2]:
DATA_PATH = '../../data/criteo'

TRAIN_DATA = os.path.join(DATA_PATH, 'train.csv')

## Data processing

Данные на Train/Validation/Test нужно разбить как 80/10/10

The data loading part was copied from the teacher model

In [3]:
dense_features_indices = [i for i in range(1, 14)]
sparse_features_indices = [i for i in range(14, 40)]

dense_features = ['c{}'.format(i) for i in dense_features_indices]
sparse_features = ['c{}'.format(i) for i in sparse_features_indices]

len(dense_features_indices), len(sparse_features_indices)

(13, 26)

In [4]:
data = pd.read_csv(TRAIN_DATA, index_col='id')
data.rename(columns=dict([(col, col[1:] if col[0] == '_' else col) for col in data.columns]), inplace=True)

data[sparse_features] = data[sparse_features].fillna('-1', )
data[dense_features] = data[dense_features].fillna(0, )

hard_target = ['c0']

  mask |= (ar1 == a)


In [5]:
soft_targets = pd.read_csv('soft_targets_full.csv', index_col='id', squeeze=True)

data['c0_soft'] = soft_targets

soft_target = ['c0_soft']
targets = hard_target + soft_target

In [6]:
data.head()

Unnamed: 0_level_0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c31,c32,c33,c34,c35,c36,c37,c38,c39,c0_soft
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,1,0.0,-1,0.0,0.0,1465.0,0.0,17.0,0.0,4.0,...,e5f8f18f,-1,-1,f3ddd519,-1,32c7478e,b34f3128,-1,-1,0.532062
26,1,0.0,1,20.0,16.0,1548.0,93.0,42.0,32.0,912.0,...,1f868fdd,21ddcdc9,a458ea53,7eee76d1,-1,32c7478e,9af06ad9,9d93af03,cdfe5ab7,0.483268
39,0,8.0,0,15.0,20.0,115.0,24.0,8.0,23.0,24.0,...,1304f63b,21ddcdc9,b1252a9d,07b2853e,-1,32c7478e,94bde4f2,010f6491,09b76f8d,0.126496
41,1,88.0,319,0.0,4.0,5.0,4.0,89.0,40.0,88.0,...,bbf70d82,-1,-1,16e2e3b3,-1,32c7478e,d859b4dd,-1,-1,0.750299
85,0,0.0,53,0.0,10.0,6550.0,98.0,34.0,11.0,349.0,...,fa0643ee,21ddcdc9,b1252a9d,0094bc78,-1,32c7478e,29ece3ed,001f3601,402185f3,0.784883


In [7]:
len(data)

3664931

Before processing the categorial features, let's split the dataset

In [8]:
mms = MinMaxScaler(feature_range=(0, 1))
data[dense_features] = mms.fit_transform(data[dense_features])
train, test = train_test_split(data, test_size=0.2, shuffle=False)
validation, test = train_test_split(test, test_size=0.5, shuffle=False)

print(len(train), len(validation), len(test))

2931944 366493 366494


### Processing categorial features

In [9]:
sparse_features_dims = {feat: len(data[feat].unique()) for feat in sparse_features}

In [10]:
sparse_features_dims

{'c14': 1445,
 'c15': 556,
 'c16': 1130758,
 'c17': 360209,
 'c18': 304,
 'c19': 21,
 'c20': 11845,
 'c21': 631,
 'c22': 3,
 'c23': 49223,
 'c24': 5194,
 'c25': 985420,
 'c26': 3157,
 'c27': 26,
 'c28': 11588,
 'c29': 715441,
 'c30': 10,
 'c31': 4681,
 'c32': 2029,
 'c33': 4,
 'c34': 870796,
 'c35': 17,
 'c36': 15,
 'c37': 87605,
 'c38': 84,
 'c39': 58187}

#### High cardinality features

We have about 3.6M samples in the dataset. For categorial features with high cardinality there are only a few training examples per each category. It is unlikely that we can learn meaningful high-dimensional embeddings for such categories. Also, such embeddings will consume a lot of memory, even if we apply hashing trick and reduce the number of distinct embedding vectors to 50,000 as it was done for the teacher model. Let's take another approach and encode them with some cool supervised techniques (that's why we split the data beforehand).

In [11]:
high_cardinality_features = [feat for feat, cnt in sparse_features_dims.items() if cnt >= 40000]

In [12]:
high_cardinality_features

['c16', 'c17', 'c23', 'c25', 'c29', 'c34', 'c37', 'c39']

In [13]:
enc = CatBoostEncoder(cols=high_cardinality_features, verbose=1)
enc.fit(train[high_cardinality_features], train[hard_target])

catboost_features = [f'{feat}_catboost' for feat in high_cardinality_features]

for df in [train, validation, test]:
    for feat in catboost_features:
        df[feat] = 0
    df.loc[:, catboost_features] = enc.transform(df.loc[:, high_cardinality_features]).values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [14]:
enc = TargetEncoder(cols=high_cardinality_features, verbose=1)
enc.fit(train[high_cardinality_features], train[hard_target])

target_features = [f'{feat}_target' for feat in high_cardinality_features]

for df in [train, validation, test]:
    for feat in target_features:
        df[feat] = 0
    df.loc[:, target_features] = enc.transform(df.loc[:, high_cardinality_features]).values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [15]:
for df in [train, validation, test]:
    df = df.drop(high_cardinality_features, axis=1)

In [16]:
dense_features = dense_features + catboost_features + target_features
sparse_features = [feat for feat in sparse_features if feat not in high_cardinality_features]
sparse_features_dims = {feat: dim for feat, dim in sparse_features_dims.items() if feat not in high_cardinality_features}

#### Low cardinality features

Now let's deal with low (kind of) cardinality features. For them we will use label encoding.

In [17]:
for feat in sparse_features:
    le = LabelEncoder()
    le.fit(data.loc[:, feat])

    for df in [train, validation, test]:
        df.loc[:, feat] = le.transform(df.loc[:, feat])

In [18]:
train.head()

Unnamed: 0_level_0,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,...,c37_catboost,c39_catboost,c16_target,c17_target,c23_target,c25_target,c29_target,c34_target,c37_target,c39_target
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12,1,0.0,9.1e-05,0.0,0.0,0.000558,0.0,0.001717,0.0,0.000207,...,0.260759,0.267092,0.254942,0.360034,0.370765,0.254942,0.365178,0.365178,0.26076,0.267092
26,1,0.0,0.000181,0.000305,0.018244,0.00059,0.000798,0.004242,0.006335,0.047188,...,0.496607,0.506555,0.254942,0.66,0.555449,0.254942,0.254942,0.254942,0.497268,0.507289
39,0,0.001385,0.000136,0.000229,0.022805,4.4e-05,0.000206,0.000808,0.004554,0.001242,...,0.156868,0.313736,0.254942,0.03039,0.211873,0.254942,0.254942,0.254942,0.143134,0.323989
41,1,0.015238,0.014591,0.0,0.004561,2e-06,3.4e-05,0.008989,0.007919,0.004553,...,0.230221,0.267092,0.208054,0.230209,0.352697,0.211798,0.230209,0.211798,0.230209,0.267092
85,0,0.0,0.002537,0.0,0.011403,0.002497,0.000841,0.003434,0.002178,0.018058,...,0.383206,0.311325,0.254942,0.3125,0.217176,0.254942,0.254942,0.254942,0.384615,0.3125


Finally, prepare input for the model (mostly copied from the teacher model)

In [19]:
fixlen_feature_columns = [SparseFeat(feat, 
                                     vocabulary_size=vocab_size, 
                                     embedding_dim=min(int(6 * (vocab_size) ** (0.25)), 100), 
                                     use_hash=False, dtype='string') 
                          for feat, vocab_size in sparse_features_dims.items()] + \
                        [DenseFeat(feat, 1,) for feat in dense_features]


linear_feature_columns = fixlen_feature_columns
dnn_feature_columns = fixlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns, )

In [20]:
def gen_model_input(df):
    return {name: (pd.core.series.Series(df[name]) if name in sparse_features else np.array(df[name])) for name in feature_names}


train_model_input = gen_model_input(train)
validation_model_input = gen_model_input(validation)
test_model_input = gen_model_input(test)

In [21]:
train_model_targets = train[targets]
validation_model_targets = validation[targets]
test_model_targets = test[targets]

### Model

Можно также использовать Pruning и/или Quantinization.

### Define Model

Let's use the same model as the teacher (DCN) but with smaller number of parameters.

To do so, we should redefine the loss function to use both soft and hard targets.

In [55]:
weight = torch.FloatTensor([0.1, 0.9])

def distillation_loss(y_pred, y_true, reduction='sum'):
    y_pred = y_pred.unsqueeze(-1).expand(-1, 2)
    loss = F.binary_cross_entropy(y_pred, y_true, reduction='none').sum(axis=0)
    weighted_loss = torch.mul(loss, weight).sum()
    return weighted_loss

In [23]:
def metric_auc(y_true, y_pred):
    return roc_auc_score(y_true[:, 0], y_pred)

def metric_distillation_loss(y_true, y_pred):
    return distillation_loss(y_pred, y_true)

In [35]:
model = DCN(linear_feature_columns, dnn_feature_columns, cross_num=2,
            dnn_hidden_units=(128, 64), l2_reg_linear=0, l2_reg_embedding=0,
            l2_reg_cross=0, l2_reg_dnn=0, init_std=0.0001, seed=1024, 
            dnn_use_bn=True, dnn_activation='relu', task='binary')

model.compile("adam", distillation_loss, metrics=[])

In [36]:
# model.metrics['auc'] = metric_auc
# model.metrics['distillation_loss'] = metric_distillation_loss

### Train

In [37]:
# The implementation in the library works incorrectly due to issues with reg loss. 
# I copied the training part from the lib and fixed it a little bit

# model.fit(train_model_input, train_model_targets.values[:, 0], 
#           epochs=5, verbose=1)

In [38]:
import torch.utils.data as Data
from torch.utils.data import DataLoader

In [39]:
BATCH_SIZE = 512
EPOCHS = 1
EVAL_BATCHES = 1000

In [40]:
def prepare_data_loader(x, y, batch_size=BATCH_SIZE):
    if isinstance(x, dict):
        x = [x[feature] for feature in build_input_features(linear_feature_columns + dnn_feature_columns)]

    for i in range(len(x)):
        if len(x[i].shape) == 1:
            x[i] = np.expand_dims(x[i], axis=1)

    
    tensor_data = Data.TensorDataset(torch.from_numpy(np.concatenate(x, axis=-1)), torch.from_numpy(y))
        
    loader = DataLoader(dataset=tensor_data, shuffle=True, batch_size=batch_size)
    
    return loader

In [41]:
train_loader = prepare_data_loader(train_model_input, train_model_targets.values, BATCH_SIZE)
val_loader = prepare_data_loader(validation_model_input, validation_model_targets.values, BATCH_SIZE)
test_loader = prepare_data_loader(test_model_input, test_model_targets.values, BATCH_SIZE)

In [42]:
def eval(model, loader, n_samples, batch_size=BATCH_SIZE):
    loss_cur = 0
    cnt = 0

    val_pred = np.empty(n_samples)
    val_true = np.empty(n_samples)

    model.eval()
    for index, (x_val, y_val) in enumerate(loader):
        with torch.no_grad():
            x = x_val.float()
            y = y_val.float()

            y_pred = model(x).squeeze()

            val_pred[index * batch_size:(index + 1) * batch_size] = y_pred[:]
            val_true[index * batch_size:(index + 1) * batch_size] = y[:, 0]

            loss_cur += loss_func(y_pred, y.squeeze(), reduction='sum')
            cnt += 1

    average_loss = loss_cur / cnt
    auc = roc_auc_score(val_true, val_pred)
    
    return average_loss, auc

In [43]:
def train(model, optim, loss_func, epochs, eval_batches=1000, batch_size=BATCH_SIZE):
    for epoch in range(epochs):
        print(f'Start epoch {epoch + 1}')

        loss_cur = 0
        model.train()
        for index, (x_train, y_train) in enumerate(train_loader):
            x = x_train.float()
            y = y_train.float()

            y_pred = model(x).squeeze()

            optim.zero_grad()
            loss = loss_func(y_pred, y.squeeze(), reduction='sum')

            loss_cur += loss.item()

            loss.backward(retain_graph=True)
            optim.step()

            if (index + 1) % eval_batches == 0 or index + 1 == steps_per_epoch:
                print(f'Step {index + 1} of {steps_per_epoch}')
                print(f'Average train loss: {loss_cur / eval_batches}')
                loss_cur = 0
                
                average_val_loss, val_auc = eval(model, val_loader, len(validation_model_targets))

                print(f'Average validation loss: {average_val_loss}')
                print(f'Validation ROC AUC: {val_auc}')

        print(f'Finished epoch')
    
    average_test_loss, test_auc = eval(model, test_loader, len(test_model_targets))

    print(f'Average test loss: {average_test_loss}')
    print(f'Test ROC AUC: {test_auc}')

In [50]:
loss_func = distillation_loss
optim = torch.optim.Adam(model.parameters(), lr=0.0005)

sample_num = len(train_model_targets)
steps_per_epoch = (sample_num - 1) // BATCH_SIZE + 1

print(f"Train on {sample_num} samples, validate on {len(validation_model_targets)} samples, {steps_per_epoch} steps per epoch")

Train on 2931944 samples, validate on 366493 samples, 5727 steps per epoch


In [48]:
train(model, optim, loss_func, epochs=3)

Start epoch 1
Step 1000 of 5727
Average train loss: 225.89276454162598
Average validation loss: 228.2914581298828
Validation ROC AUC: 0.7869350674827783
Step 2000 of 5727
Average train loss: 225.7315050201416
Average validation loss: 228.20632934570312
Validation ROC AUC: 0.787350988959743
Step 3000 of 5727
Average train loss: 225.21759950256347
Average validation loss: 228.2144775390625
Validation ROC AUC: 0.7873141901963668
Step 4000 of 5727
Average train loss: 225.773056640625
Average validation loss: 228.10110473632812
Validation ROC AUC: 0.7875991977008148
Step 5000 of 5727
Average train loss: 225.6276220550537
Average validation loss: 228.0832977294922
Validation ROC AUC: 0.7878775984946009
Step 5727 of 5727
Average train loss: 163.8836019821167
Average validation loss: 228.08460998535156
Validation ROC AUC: 0.7876097888764665
Finished epoch
Start epoch 2
Step 1000 of 5727
Average train loss: 225.30787113952636
Average validation loss: 228.1519775390625
Validation ROC AUC: 0.7877

In [51]:
train(model, optim, loss_func, epochs=1)

Start epoch 1
Step 1000 of 5727
Average train loss: 224.91948080444337
Average validation loss: 229.39315795898438
Validation ROC AUC: 0.7891213346828043
Step 2000 of 5727
Average train loss: 224.81551387023927
Average validation loss: 229.46002197265625
Validation ROC AUC: 0.788949591428552
Step 3000 of 5727
Average train loss: 224.64388711547852
Average validation loss: 229.48814392089844
Validation ROC AUC: 0.7891638395126233
Step 4000 of 5727
Average train loss: 224.5111068572998
Average validation loss: 229.56686401367188
Validation ROC AUC: 0.7893397359785075
Step 5000 of 5727
Average train loss: 224.6653574066162
Average validation loss: 229.5581817626953
Validation ROC AUC: 0.7889826557033917
Step 5727 of 5727
Average train loss: 163.3666459197998
Average validation loss: 229.51560974121094
Validation ROC AUC: 0.7892915225558679
Finished epoch
Average test loss: 232.68360900878906
Test ROC AUC: 0.7872441827415813


## Evaluation

Наша основная задача получить модель, которая 
* в терминах ROC AUC не намного хуже модели учителя, и в то же время 
* сильно меньше по размеру

### ROC AUC

Сравним ROC AUC модели ученика с показателем для учителя.

ROC AUC учителя: 0.802

In [60]:
f'{(0.7872 / 0.802 * 100):.3f}%'

'98.155%'

### Compression rate

Пусть 
* $a$ - \# of the parameters in the original model $M$
* $a^{*}$ - \# of the parameters in compressed model $M^{*}$

тогда compression rate is $$\alpha(M,M^{*}) = \frac{a}{a^{*}}$$

Можно также посчитать comression rate просто как отношение фактических размеров моделей.

Размер модели учителя - 168MB


In [61]:
torch.save(model.state_dict(), 'model_compressed.pth')

9.2MB

In [62]:
168 / 9.2

18.260869565217394