TODO

[kaggle 22nd place solution](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/257302)

[github](https://github.com/kurupical/commonlit)

[inference](https://www.kaggle.com/kurupical/191-192-202-228-251-253-268-288-278-final?scriptVersionId=69642056)


## Description

### What worked for me
- Model ensemble: I considered diversity the most important factor in this competition.
    - At the beginning of the competition, I tested how effective ensembling was.
    - Through the middle stage, I fixed the model to roberta-large and worked on improving its score.
    - At the end, I applied the same method to other models. The key parameters for this task turned out to be {learning_rate, number of layers to re-initialize}, so I tuned those for each model.
- Re-initialization
    - This paper (https://arxiv.org/pdf/2006.05987.pdf) shows that fine-tuning after re-initializing the last N layers works well.
    - Different models have different optimal N. Most models used N=4~5; the gpt2 models used N=6.
- LSTM head
    - Feeding BERT's first and last hidden layers into an LSTM layer worked well.
    - I think the first layer represents vocabulary difficulty and the last layer represents sentence difficulty. Both are important for inferring readability.
- Removing dropout improved CV by 0.01~0.02.
- Gradient clipping (0.2 or 0.5 worked well for me, improving CV by about 0.005).
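
The re-initialization item above can be sketched as follows. This is a minimal illustration, not the code from the linked repository: the helper name `reinit_last_layers`, the BERT-style normal init with `std=0.02`, and the toy layer stack are all assumptions. With a Hugging Face encoder the list of blocks would typically be something like `model.encoder.layer`, but check the model class you use.

```python
import torch
from torch import nn


def reinit_last_layers(layers: nn.ModuleList, n: int, std: float = 0.02) -> None:
    """Re-initialize the last `n` blocks of `layers` in place (hypothetical helper)."""
    if n <= 0:
        return  # layers[-0:] would select every block, so guard explicitly
    for block in layers[-n:]:
        for module in block.modules():
            if isinstance(module, nn.Linear):
                # Assumed BERT-style init: N(0, std) weights, zero bias
                nn.init.normal_(module.weight, mean=0.0, std=std)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)


# Toy stand-in for a transformer encoder stack (6 blocks).
torch.manual_seed(0)
toy_layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8)) for _ in range(6)]
)
before = [block[0].weight.clone() for block in toy_layers]
reinit_last_layers(toy_layers, n=2)
changed = [
    not torch.equal(old, block[0].weight) for old, block in zip(before, toy_layers)
]
print(changed)  # only the last two blocks should report a change
```

Only the top blocks lose their pretrained weights; the lower blocks and the embeddings are left untouched.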

### What did not work for me
- Feeding the attention matrix into a 2D CNN (ResNet18 or a simple 2D CNN)
    - I thought this could capture the complexity of sentences with relative pronouns.
- Masking 5%~10% of the vocabulary (input tokens).
- Minimizing KLDiv loss to fit the target distribution.
- Scaling the target to 0~1 and minimizing cross-entropy loss.
- "base" models other than mpnet. I got 0.47x CV but 0.48x ~ 0.49x on the Public LB.
- Stacking with LightGBM.
- Other models (results are in the table below; single-model CV was good, but they got zero weight in the ensemble).
- T5. The notebook below reaches 0.47 LB with T5, so I tried it but failed: I got only 0.49x (fold 0 only) with learning_rate=1.5e-4.
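
The "zero weight for ensemble" remark refers to blend weights; the configuration lists `ensemble: Nelder-Mead`. A minimal sketch of fitting such weights on out-of-fold predictions with `scipy.optimize.minimize` is shown here. The synthetic data, the function names, and the choice to leave the weights unconstrained are assumptions for illustration, not the author's exact procedure.

```python
import numpy as np
from scipy.optimize import minimize


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_blend_weights(oof_preds: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """oof_preds has shape (n_samples, n_models); returns one weight per model."""
    n_models = oof_preds.shape[1]
    x0 = np.full(n_models, 1.0 / n_models)  # start from a plain average
    result = minimize(
        lambda w: rmse(y_true, oof_preds @ w),
        x0,
        method="Nelder-Mead",
    )
    return result.x


# Synthetic example: two useful models and one uninformative model.
rng = np.random.default_rng(0)
y = rng.normal(size=500)
oof = np.stack(
    [
        y + rng.normal(scale=0.3, size=500),
        y + rng.normal(scale=0.3, size=500),
        rng.normal(size=500),  # pure noise, should end up with little weight
    ],
    axis=1,
)
weights = fit_blend_weights(oof, y)
print("weights:", np.round(weights, 3))
print("blend RMSE:", rmse(y, oof @ weights), "mean RMSE:", rmse(y, oof.mean(axis=1)))
```

A model whose errors are not complementary to the others tends to receive a weight near zero under this kind of search, even if its single-model CV looks fine.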

Configuration for almost all models:
```
epochs = 4
optimizer: AdamW
scheduler: linear_schedule_with_warmup(warmup: 5%)
lr_bert: 3e-5
batch_size: 12
gradient clipping: 0.2~0.5
reinitialize layers: last 2~6 layers
ensemble: Nelder-Mead
custom head(finally concat all)
    averaging last 4 hidden layer
    LSTM head
    vocabulary dense
hidden_states: (batch_size, vocab_size, bert_hidden_size)
  linear_vocab = nn.Sequential(
      nn.Linear(bert_hidden_size, 128),
      nn.GELU(),
      nn.Linear(128, 64),
      nn.GELU()
  )
  linear_final = nn.Linear(vocab_size * 64, 128)
  out = linear_vocab(hidden_states).view(len(input_ids), -1) # final shape: (batch_size, vocab_size * 64)
  out = linear_final(out) # out shape: (batch_size, 128)
17 hand-made features
    sentence count
    average character count in documents
```
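
The "LSTM head" entry in the configuration can be illustrated with a small module. This is a sketch under assumptions, not the repository's implementation: it assumes `hidden_states` is the per-layer tuple that Hugging Face models return with `output_hidden_states=True` (index 0 being the embedding output), that the selected layers are concatenated on the feature axis, and that the LSTM output is mean-pooled over non-padding tokens. The class name and sizes are hypothetical.

```python
import torch
from torch import nn


class LSTMHead(nn.Module):
    """Bidirectional LSTM over selected hidden layers (illustrative sketch)."""

    def __init__(self, hidden_size: int, lstm_size: int = 64, indices=(-1, 0)):
        super().__init__()
        self.indices = indices
        self.lstm = nn.LSTM(
            input_size=hidden_size * len(indices),
            hidden_size=lstm_size,
            batch_first=True,
            bidirectional=True,
        )
        self.out = nn.Linear(lstm_size * 2, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of (batch, seq_len, hidden_size), one per layer
        x = torch.cat([hidden_states[i] for i in self.indices], dim=-1)
        x, _ = self.lstm(x)  # (batch, seq_len, 2 * lstm_size)
        # Mean-pool over real tokens only (assumed pooling choice)
        mask = attention_mask.unsqueeze(-1).to(x.dtype)
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.out(pooled).squeeze(-1)  # (batch,)


# Random tensors stand in for a 12-layer encoder's outputs (13 entries).
torch.manual_seed(0)
hidden = tuple(torch.randn(2, 5, 16) for _ in range(13))
mask = torch.ones(2, 5, dtype=torch.long)
mask[1, 3:] = 0  # second sample has two padding positions
head = LSTMHead(hidden_size=16)
pred = head(hidden, mask)
print(pred.shape)
```

The default `indices=(-1, 0)` mirrors the `rnn_hidden_indice` value used in the hyperparameters, i.e. the last and the first hidden layer.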

The main hyperparameters:

|nlp_model_name|funnel-large-base|funnel-large|
|----|----|----|
|dropout|0|0|
|batch_size|12|12|
|lr_bert|2E-05|2E-05|
|lr_fc|5E-05|5E-05|
|warmup_ratio|0.05|0.05|
|epochs|6|6|
|activation|GELU|GELU|
|optimizer|AdamW|AdamW|
|weight_decay|0.1|0.1|
|rnn_module|LSTM|LSTM|
|rnn_module_num|0|1|
|rnn_hidden_indice|(-1, 0)|(-1, 0)|
|linear_vocab_enable|False|True|
|multi_dropout_ratio|0.3|0.3|
|multi_dropout_num|10|10|
|max_length|256|256|
|hidden_stack_enable|True|True|
|reinit_layers|4|4|
|gradient_clipping|0.2|0.2|
|feature_enable|False|True|
|stochastic_weight_avg|False|False|
|val_check_interval|0.05|0.05|
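
The `multi_dropout_ratio` and `multi_dropout_num` rows look like the settings of a multi-sample dropout head, where several dropout masks share one output layer and the results are averaged. That reading is an assumption based on the parameter names; the class below is a hypothetical sketch, not code from the repository.

```python
import torch
from torch import nn


class MultiSampleDropoutHead(nn.Module):
    """Average one linear layer's output over several dropout masks (sketch)."""

    def __init__(self, in_features: int, num: int = 10, ratio: float = 0.3):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(ratio) for _ in range(num)])
        self.fc = nn.Linear(in_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each dropout layer draws its own mask in training mode
        outs = [self.fc(drop(x)) for drop in self.dropouts]
        return torch.stack(outs, dim=0).mean(dim=0).squeeze(-1)  # (batch,)


torch.manual_seed(0)
md_head = MultiSampleDropoutHead(in_features=32, num=10, ratio=0.3)
x = torch.randn(4, 32)

md_head.train()
train_out = md_head(x)  # averaged over 10 different masks

md_head.eval()
eval_out = md_head(x)   # dropout is a no-op, so this equals fc(x)
print(train_out.shape, eval_out.shape)
```

In evaluation mode the head reduces to a plain linear layer, so the extra cost is paid only during training.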

In [1]:
from IPython.display import clear_output, Image
!pip install transformers
clear_output()

In [9]:
!sudo apt-get update -y
!sudo apt-get install python3.9

#change alternatives
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 2
# !sudo update-alternatives --config python3
clear_output()
!python --version

In [2]:
import os
import re
import torch
import itertools
import transformers
import pandas as pd
import numpy as np
from tqdm import tqdm
from numpy import random
from torch import nn, optim
import matplotlib.pyplot as plt

from dataclasses import dataclass
"""
Any: any operation or method call can be performed on a value of type Any,
     and it can be assigned to any variable.

Optional: note that this is not the same as an optional argument,
          i.e. one that has a default value.
"""
from typing import Any, Optional, List, Tuple

"""
psutil: retrieves information about running processes
and system utilization (CPU, memory, disks, network, sensors) in Python.
"""
import psutil

path_tr = '/content/drive/MyDrive/CommonLit/input/train.csv'
path_test = '/content/drive/MyDrive/CommonLit/input/test.csv'
path_sub = '/content/drive/MyDrive/CommonLit/input/sample_submission.csv'

SEED = 13
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## dataclass

In [3]:
@dataclass
class Config:
    experiment_name: str
    seed: int = 10
    debug: bool = False
    fold: int = 0

    nlp_model_name: str = "roberta-base"
    linear_dim: int = 64
    linear_vocab_dim_1: int = 64
    linear_vocab_dim: int = 16
    linear_perplexity_dim: int = 64
    linear_final_dim: int = 256
    dropout: float = 0
    dropout_stack: float = 0
    dropout_output_hidden: float = 0
    dropout_attn: float = 0
    batch_size: int = 32

    lr_bert: float = 3e-5
    lr_fc: float = 5e-5
    lr_rnn: float = 1e-3
    lr_tcn: float = 1e-3
    lr_cnn: float = 1e-3
    warmup_ratio: float = 0.1
    training_steps_ratio: float = 1
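    # NOTE: this branch runs once, at class-definition time, against the class-level
    # default `debug = False`, so Config(debug=True) still gets epochs = 6.
    if debug:
        epochs: int = 2
        epochs_max: int = 8
    else:
        epochs: int = 6
        epochs_max: int = 6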

    activation: Any = nn.GELU
    # optimizer: Any = transformers.AdamW
    weight_decay: float = 0.1

    rnn_module: Any = nn.LSTM  # the module class itself, not an instance
    rnn_module_num: int = 0
    rnn_module_dropout: float = 0
    rnn_module_activation: Any = None
    rnn_module_shrink_ratio: float = 0.25
    rnn_hidden_indice: Tuple[int, ...] = (-1, 0)
    bidirectional: bool = True

    tcn_module_enable: bool = False
    tcn_module_num: int = 3
    # tcn_module: nn.Module = TemporalConvNet
    tcn_module_kernel_size: int = 4
    tcn_module_dropout: float = 0

    linear_vocab_enable: bool = False
    augmantation_range: Tuple[float, float] = (0, 0)
    lr_bert_decay: float = 1

    multi_dropout_ratio: float = 0.3
    multi_dropout_num: int = 10
    fine_tuned_path: Optional[str] = None

    # convnet
    cnn_model_name: str = "resnet18"
    cnn_pretrained: bool = False
    self_attention_enable: bool = False

    mask_p: float = 0
    max_length: int = 256

    hidden_stack_enable: bool = False
    prep_enable: bool = False
    kl_div_enable: bool = False

    # reinit
    reinit_pooler: bool = True
    reinit_layers: int = 4

    # pooler
    pooler_enable: bool = True

    word_axis: bool = False

    # conv1d
    conv1d_num: int = 1
    conv1d_stride: int = 2
    conv1d_kernel_size: int = 2

    attention_pool_enable: bool = False
    conv2d_hidden_channel: int = 32

    simple_structure: bool = False
    crossentropy: bool = False
    crossentropy_min: int = -8
    crossentropy_max: int = 4

    accumulate_grad_batches: int = 1
    gradient_clipping: float = 0.2

    dropout_bert: float = 0

    feature_enable: bool = False
    decoder_only: bool = True

    stochastic_weight_avg: bool = False
    val_check_interval: float = 0.05

    attention_head_enable: bool = False

cfg = Config('test1')
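
An `if debug:` statement in a dataclass body is evaluated when the class is defined, so it cannot react to the `debug` value passed to the constructor. One way to make the epoch counts follow the flag is `__post_init__`. The class below is a trimmed-down, hypothetical stand-in for `Config` that only illustrates the pattern.

```python
from dataclasses import dataclass, field


@dataclass
class DebugAwareConfig:  # hypothetical, reduced version of Config
    experiment_name: str
    debug: bool = False
    # Excluded from __init__; filled in by __post_init__ below
    epochs: int = field(init=False)
    epochs_max: int = field(init=False)

    def __post_init__(self):
        # Runs after __init__, so it sees the per-instance `debug` value
        if self.debug:
            self.epochs, self.epochs_max = 2, 8
        else:
            self.epochs, self.epochs_max = 6, 6


print(DebugAwareConfig("test1"))
print(DebugAwareConfig("test1", debug=True))
```

The values 2/8 and 6/6 are taken from the branches in the original class body.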

In [4]:
cfg

Config(experiment_name='test1', seed=10, debug=False, fold=0, nlp_model_name='roberta-base', linear_dim=64, linear_vocab_dim_1=64, linear_vocab_dim=16, linear_perplexity_dim=64, linear_final_dim=256, dropout=0, dropout_stack=0, dropout_output_hidden=0, dropout_attn=0, batch_size=32, lr_bert=3e-05, lr_fc=5e-05, lr_rnn=0.001, lr_tcn=0.001, lr_cnn=0.001, warmup_ratio=0.1, training_steps_ratio=1, epochs=6, epochs_max=6, activation=<class 'torch.nn.modules.activation.GELU'>, weight_decay=0.1, rnn_module=<class 'torch.nn.modules.rnn.LSTM'>, rnn_module_num=0, rnn_module_dropout=0, rnn_module_activation=None, rnn_module_shrink_ratio=0.25, rnn_hidden_indice=(-1, 0), bidirectional=True, tcn_module_enable=False, tcn_module_num=3, tcn_module_kernel_size=4, tcn_module_dropout=0, linear_vocab_enable=False, augmantation_range=(0, 0), lr_bert_decay=1, multi_dropout_ratio=0.3, multi_dropout_num=10, fine_tuned_path=None, cnn_model_name='resnet18', cnn_pretrained=False, self_attention_enable=False, mask_

## feature_engineering

In [11]:
def total_words(x):
    return len(x.split(" "))

def total_unique_words(x):
    return len(np.unique(x.split(" ")))

def total_charactors(x):
    x = x.replace(" ", "")
    return len(x)

def total_sentence(x):
    x = x.replace("!", "[end]").replace("?", "[end]").replace(".", "[end]")
    return len(x.split("[end]"))
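
A quick check of what these helpers return on a short string may help when reading the feature table. The functions are repeated here only so the snippet runs on its own; the sample sentence is made up. Note that `total_sentence` counts the empty piece after a trailing terminator, so it returns the number of `.`, `!` and `?` characters plus one.

```python
import numpy as np


def total_words(x):
    return len(x.split(" "))


def total_unique_words(x):
    return len(np.unique(x.split(" ")))


def total_charactors(x):
    return len(x.replace(" ", ""))


def total_sentence(x):
    x = x.replace("!", "[end]").replace("?", "[end]").replace(".", "[end]")
    return len(x.split("[end]"))


sample = "The cat sat. The cat ran!"
print(total_words(sample))         # 6
print(total_unique_words(sample))  # 4 ("The", "cat", "sat.", "ran!")
print(total_charactors(sample))    # 20
print(total_sentence(sample))      # 3, although there are two sentences
```

Because the offset is the same for every excerpt that ends with a terminator, it mostly washes out once the features are standardized, but abbreviations such as "Mrs." still inflate the count.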

In [7]:
df = pd.read_csv(path_tr)
df.columns

Index(['id', 'url_legal', 'license', 'excerpt', 'target', 'standard_error'], dtype='object')

In [12]:
# .copy() makes df_ret an independent frame, which avoids pandas' SettingWithCopyWarning below
df_ret = df[["id", "excerpt", "target", "standard_error"]].copy()
excerpt = df["excerpt"].values
df_ret["total_words"] = [total_words(x) for x in excerpt]
df_ret["total_unique_words"] = [total_unique_words(x) for x in excerpt]
df_ret["total_characters"] = [total_charactors(x) for x in excerpt]
df_ret["total_sentence"] = [total_sentence(x) for x in excerpt]

# Ratio features
df_ret["div_sentence_characters"] = df_ret["total_sentence"] / df_ret["total_characters"]
df_ret["div_sentence_words"] = df_ret["total_sentence"] / df_ret["total_words"]
df_ret["div_characters_words"] = df_ret["total_characters"] / df_ret["total_words"]
df_ret["div_words_unique_words"] = df_ret["total_words"] / df_ret["total_unique_words"]

# Counts of punctuation and other special characters
for i, word in enumerate(["!", "?", "(", ")", "'", '"', ";", ".", ","]):
    df_ret[f"count_word_special_{i}"] = [x.count(word) for x in excerpt]
df_ret.fillna(0, inplace=True)

df_ret.head()


Unnamed: 0,id,excerpt,target,standard_error,total_words,total_unique_words,total_characters,total_sentence,div_sentence_characters,div_sentence_words,div_characters_words,div_words_unique_words,count_word_special_0,count_word_special_1,count_word_special_2,count_word_special_3,count_word_special_4,count_word_special_5,count_word_special_6,count_word_special_7,count_word_special_8
0,c12129c31,When the young people returned to the ballroom...,-0.340259,0.464009,174,112,819,12,0.014652,0.068966,4.706897,1.553571,0,0,0,0,0,0,0,11,14
1,85aa80a4c,"All through dinner time, Mrs. Fayre was somewh...",-0.315372,0.480805,164,123,774,18,0.023256,0.109756,4.719512,1.333333,5,2,0,0,3,12,0,10,24
2,b69ac6792,"As Roger had predicted, the snow departed as q...",-0.580118,0.476676,162,124,747,13,0.017403,0.080247,4.611111,1.306452,1,0,0,0,4,10,2,11,17
3,dd1000b26,And outside before the palace a great garden w...,-1.054013,0.450007,163,117,747,6,0.008032,0.03681,4.582822,1.393162,0,0,0,0,0,0,2,5,23
4,37c1b32fb,Once upon a time there were Three Bears who li...,0.247197,0.510845,147,51,577,6,0.010399,0.040816,3.92517,2.882353,0,0,0,0,0,0,10,5,13


In [18]:
cfg.feature_columns = [x for x in df_ret.columns if x not in ["id", "excerpt", "target", "kfold", "standard_error"]]
"""
.mean
    array([1.71654905e+02, 1.13895554e+02, 8.01077982e+02, 1.08479181e+01,
        1.37249582e-02, 6.32846225e-02, 4.66937398e+00, 1.51519104e+00,
        4.57304164e-01, 3.55681016e-01, 3.68031052e-01, 3.68031052e-01,
        1.15031757e+00, 2.38884968e+00, 8.71912491e-01, 9.03493296e+00,
        1.17314749e+01])
"""
cfg.feature_mean = df_ret[cfg.feature_columns].mean().values
cfg.feature_std = df_ret[cfg.feature_columns].std().values
cfg.feature_columns

['total_words',
 'total_unique_words',
 'total_characters',
 'total_sentence',
 'div_sentence_characters',
 'div_sentence_words',
 'div_characters_words',
 'div_words_unique_words',
 'count_word_special_0',
 'count_word_special_1',
 'count_word_special_2',
 'count_word_special_3',
 'count_word_special_4',
 'count_word_special_5',
 'count_word_special_6',
 'count_word_special_7',
 'count_word_special_8']

## Dataset

In [None]:
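from torch.utils.data import Dataset


class CommonLitDataset(Dataset):
    """
    Returns:
        input_ids_masked - input_ids with random positions replaced by the mask token
        attention_mask
        token_type_ids
        input_ids

        features - array of normalized hand-made features
        target - target
        std - "standard_error" from the original data
    """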
    def __init__(self, df, tokenizer, cfg, transforms=None):
        self.df = df.reset_index()
        self.augmentations = transforms
        self.cfg = cfg
        self.tokenizer = tokenizer

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, index):
        row = self.df.iloc[index]

        text_original = row["excerpt"]

        text = self.tokenizer(text_original,
                              padding="max_length",
                              max_length=self.cfg.max_length,
                              truncation=True,
                              return_tensors="pt",
                              return_token_type_ids=True)
        input_ids = text["input_ids"][0].detach().cpu().numpy()
        input_ids_masked = [x if np.random.random() > self.cfg.mask_p else self.tokenizer.mask_token_id for x in input_ids]
        # Use the notebook-level `device` instead of hard-coding "cuda", so this also runs on CPU
        input_ids_masked = torch.LongTensor(input_ids_masked).to(device)
        attention_mask = text["attention_mask"][0]
        token_type_ids = text["token_type_ids"][0]
        std = row["standard_error"]

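        # Standardize the hand-made features: (current row - mean over all rows) / std over all rows.
        # The explicit float cast guards against an object-dtype array, since `row` mixes strings and numbers.
        features = ((row[self.cfg.feature_columns].fillna(0).values.astype(float) - self.cfg.feature_mean) / self.cfg.feature_std)
        """
        Example of the resulting array: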
        array([ 0.13797008, -0.14786123,  0.17155507,  0.24627862,  0.14901487,
                0.21311318,  0.08742476,  0.26920065, -0.39593219, -0.3817358 ,
               -0.38798113, -0.38889757, -0.63937729, -0.57226923, -0.66869829,
                0.4939904 ,  0.48264299])                
        """
        features = torch.tensor(features, dtype=torch.float)
        target = torch.tensor(row["target"], dtype=torch.float)
        return input_ids_masked, attention_mask, token_type_ids, input_ids, features, target, std
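
The masking line in `__getitem__` can replace any position, including padding and special tokens. A variant that only masks real, non-special tokens could look like the sketch below. It is an assumption about what may be desirable, not the author's method; the function name, the example token ids, and the `special_ids` argument are hypothetical.

```python
import torch


def mask_tokens(input_ids, attention_mask, mask_token_id, special_ids, mask_p):
    """Replace a random fraction `mask_p` of real, non-special tokens with the mask token."""
    probs = torch.rand(input_ids.shape)   # uniform values in [0, 1)
    maskable = attention_mask.bool()      # skip padding positions
    for special_id in special_ids:        # skip e.g. <s>, </s>, <pad>
        maskable = maskable & (input_ids != special_id)
    masked = input_ids.clone()
    masked[(probs < mask_p) & maskable] = mask_token_id
    return masked


# Made-up ids: 0 = <s>, 2 = </s>, 1 = <pad>, 50264 = <mask>
ids = torch.tensor([0, 10, 11, 12, 2, 1, 1])
attn = torch.tensor([1, 1, 1, 1, 1, 0, 0])
all_masked = mask_tokens(ids, attn, mask_token_id=50264, special_ids=(0, 1, 2), mask_p=1.0)
none_masked = mask_tokens(ids, attn, mask_token_id=50264, special_ids=(0, 1, 2), mask_p=0.0)
print(all_masked.tolist())
print(none_masked.tolist())
```

With a real tokenizer, `tokenizer.all_special_ids` and `tokenizer.mask_token_id` would supply the ids used here as placeholders.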

## Module&Heads, Layers

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.core.lightning import LightningModule
from pytorch_lightning.utilities import rank_zero_warn

In [None]:
import gc
import torch
from torch import nn
from torch.nn import functional as F
# Applies weight normalization to a parameter in the given module.
from torch.nn.utils import weight_norm
"""
https://github.com/rwightman/pytorch-image-models#introduction

PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers,
schedulers, data loaders / augmentations, and reference training / validation scripts that
aim to pull together a wide variety of SOTA models with the ability to reproduce ImageNet
training results.
"""
import timm
try:
    import mlflow.pytorch
except Exception as e:
    print("error: mlflow is not found")