TODO

[Kaggle 22nd place solution](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/257302)

[github](https://github.com/kurupical/commonlit)

[inference](https://www.kaggle.com/kurupical/191-192-202-228-251-253-268-288-278-final?scriptVersionId=69642056)


### worked for me
- model ensemble: I considered diversity the most important factor in this competition.
    - At the beginning of the competition, I tested the effectiveness of ensembling.
    - Up to the middle stage, I fixed the model to roberta-large and tried to improve the score.
    - At the end, I applied the method to other models. I found that the key parameters for this task are {learning_rate, N layers to re-initialize}, so I tuned those parameters for each model.
- re-initialization
    - This paper (https://arxiv.org/pdf/2006.05987.pdf) shows that fine-tuning with the last N layers re-initialized works well.
    - Different models have different optimal N. Most models used N=4~5; the gpt2 models used N=6.
- LSTM head
    - Feeding BERT's first and last hidden layers into an LSTM layer worked well.
    - I think the first layer represents vocabulary difficulty and the last layer represents sentence difficulty. Both are important for inferring readability.
- Remove dropout. This improved CV by 0.01~0.02.
- gradient clipping. (0.2 or 0.5 worked well for me, improving CV by about 0.005.)
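
The re-initialization item above can be sketched as follows. This is a minimal illustration, not the code from the repository: the helper name `reinit_last_layers`, the toy "encoder" made of small feed-forward blocks, and the `std=0.02` initializer are all assumptions. With a Hugging Face BERT-style model, the list of blocks is typically found at `model.encoder.layer`.

```python
import torch
from torch import nn


def reinit_last_layers(layers: nn.ModuleList, n: int, std: float = 0.02) -> None:
    """Reset the parameters of the last `n` blocks in `layers`, in place."""
    if n <= 0:
        return
    for block in list(layers)[-n:]:
        for module in block.modules():
            if isinstance(module, nn.Linear):
                # Assumed initializer: normal(0, std) weights and zero bias.
                nn.init.normal_(module.weight, mean=0.0, std=std)
                if module.bias is not None:
                    nn.init.zeros_(module.bias)
            elif isinstance(module, nn.LayerNorm):
                nn.init.ones_(module.weight)
                nn.init.zeros_(module.bias)


# Toy stand-in for a pretrained encoder: 6 small feed-forward blocks.
torch.manual_seed(0)
encoder_layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.LayerNorm(8)) for _ in range(6)]
)
before = [layer[0].weight.detach().clone() for layer in encoder_layers]

reinit_last_layers(encoder_layers, n=4)

# True where the first Linear weight of a block was changed by the reset.
changed = [
    not torch.equal(b, layer[0].weight.detach())
    for b, layer in zip(before, encoder_layers)
]
print(changed)
```

Only the last four blocks are touched; the lower blocks keep their (here random, in practice pretrained) weights.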

### did not work for me
- Feeding the attention matrix into a 2D CNN (such as ResNet18 or a simple 2D CNN)
    - I thought this could represent the complexity of sentences with relative pronouns.
- Masking 5%~10% of the vocabulary.
- Minimizing KLDiv loss to fit the target distribution.
- Scaling the target to 0~1 and minimizing cross-entropy loss.
- "base" models other than mpnet. I got 0.47x CV, but Public LB was 0.48x ~ 0.49x.
- Stacking using LightGBM.
- Other models (results are in the table below; single-model CV was good, but they got zero weight in the ensemble).
- T5. The notebook below achieves 0.47 LB using T5, so I tried it but failed. I got only 0.49x (fold 0 only) with learning_rate=1.5e-4.
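
For reference, the "scale target to 0~1 and minimize cross-entropy" idea from the list above can be sketched in plain Python. The bounds -8 and 4 are the `crossentropy_min` / `crossentropy_max` defaults in the `Config` class; the function names and the sigmoid output layer are assumptions, since the write-up does not say how the loss was wired up.

```python
import math

# Bounds taken from crossentropy_min / crossentropy_max in Config.
TARGET_MIN, TARGET_MAX = -8.0, 4.0


def scale_target(y: float) -> float:
    """Map a raw readability score into [0, 1]."""
    return (y - TARGET_MIN) / (TARGET_MAX - TARGET_MIN)


def unscale_prediction(logit: float) -> float:
    """Map a model logit back to the raw score range via a sigmoid."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return TARGET_MIN + p * (TARGET_MAX - TARGET_MIN)


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def soft_cross_entropy(logit: float, y: float) -> float:
    """Binary cross-entropy between sigmoid(logit) and the scaled (soft) target.

    Uses log(sigmoid(z)) = -softplus(-z) and log(1 - sigmoid(z)) = -softplus(z).
    """
    t = scale_target(y)
    return t * softplus(-logit) + (1.0 - t) * softplus(logit)


print(scale_target(-2.0), unscale_prediction(0.0), soft_cross_entropy(0.0, -2.0))
```

The loss is smallest when `sigmoid(logit)` equals the scaled target, so predictions are recovered with `unscale_prediction` at inference time.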

configuration for almost all models:
```
epochs = 4
optimizer: AdamW
scheduler: linear_schedule_with_warmup(warmup: 5%)
lr_bert: 3e-5
batch_size: 12
gradient clipping: 0.2~0.5
reinitialize layers: last 2~6 layers
ensemble: Nelder-Mead
custom head(finally concat all)
    averaging last 4 hidden layer
    LSTM head
    vocabulary dense
hidden_states: (batch_size, vocab_size, bert_hidden_size)
  linear_vocab = nn.Sequential(
      nn.Linear(bert_hidden_size, 128),
      nn.GELU(),
      nn.Linear(128, 64),
      nn.GELU()
  )
  linear_final = nn.Linear(vocab_size * 64, 128)
  out = linear_vocab(hidden_states).view(len(input_ids), -1) # final shape: (batch_size, vocab_size * 64)
  out = linear_final(out) # out shape: (batch_size, 128)
17 hand-made features
    sentence count
    average character count in documents
```
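
The `ensemble: Nelder-Mead` entry in the configuration above refers to fitting blend weights with a derivative-free optimizer. A minimal sketch, assuming out-of-fold predictions are available as a `(n_models, n_samples)` array and that RMSE is the objective; the data here is synthetic and `fit_ensemble_weights` is a hypothetical helper, not the author's function.

```python
import numpy as np
from scipy.optimize import minimize


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def fit_ensemble_weights(oof_preds: np.ndarray, y_true: np.ndarray) -> np.ndarray:
    """Fit one blend weight per model by minimizing RMSE with Nelder-Mead.

    oof_preds has shape (n_models, n_samples). The weights are unconstrained
    here (they are not forced to be non-negative or to sum to 1).
    """
    n_models = oof_preds.shape[0]
    x0 = np.full(n_models, 1.0 / n_models)  # start from a plain average
    result = minimize(
        lambda w: rmse(y_true, w @ oof_preds), x0, method="Nelder-Mead"
    )
    return result.x


rng = np.random.default_rng(0)
y = rng.normal(size=500)
# Synthetic stand-ins for three models with different noise levels.
oof = np.stack([y + rng.normal(scale=s, size=500) for s in (0.3, 0.5, 0.8)])

weights = fit_ensemble_weights(oof, y)
print(weights, rmse(y, weights @ oof), rmse(y, oof.mean(axis=0)))
```

Because the search starts from equal weights, the fitted blend is never worse than the plain average on the data it was fitted on; whether it generalizes should still be checked on held-out folds.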

The main hyperparameters:

|nlp_model_name|funnel-large-base|funnel-large|
|----|----|----|
|dropout|0|0|
|batch_size|12|12|
|lr_bert|2E-05|2E-05|
|lr_fc|5E-05|5E-05|
|warmup_ratio|0.05|0.05|
|epochs|6|6|
|activation|GELU|GELU|
|optimizer|AdamW|AdamW|
|weight_decay|0.1|0.1|
|rnn_module|LSTM|LSTM|
|rnn_module_num|0|1|
|rnn_hidden_indice|(-1, 0)|(-1, 0)|
|linear_vocab_enable|False|True|
|multi_dropout_ratio|0.3|0.3|
|multi_dropout_num|10|10|
|max_length|256|256|
|hidden_stack_enable|True|True|
|reinit_layers|4|4|
|gradient_clipping|0.2|0.2|
|feature_enable|False|True|
|stochastic_weight_avg|False|False|
|val_check_interval|0.05|0.05|
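
To make the `warmup_ratio` row concrete, here is a small sketch of a linear warmup followed by linear decay, written as the kind of step-to-multiplier function that `LambdaLR` accepts. It is meant to mirror the usual linear-schedule-with-warmup behaviour; the function name and the step counts are illustrative assumptions.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_ratio: float = 0.05) -> float:
    """Learning-rate multiplier: ramp 0 -> 1 during warmup, then decay linearly to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000  # illustrative; in practice epochs * steps per epoch
multipliers = [linear_warmup_decay(s, total_steps) for s in (0, 25, 50, 525, 1000)]
print(multipliers)
```

With `warmup_ratio=0.05` the first 5% of steps ramp the learning rate up to `lr_bert` (or `lr_fc`), and the remaining 95% decay it linearly to zero. The function could be passed to a scheduler as `LambdaLR(optimizer, lambda s: linear_warmup_decay(s, total_steps))`.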

In [None]:
from IPython.display import clear_output, Image
!pip install transformers
clear_output()

In [None]:
import os
import re
import torch
import itertools
import transformers
import pandas as pd
import numpy as np
from tqdm import tqdm
from numpy import random
from torch import nn, optim
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn import model_selection
from torch.optim.lr_scheduler import LambdaLR

# psutil: retrieve information about running processes and system
# utilization (CPU, memory, disks, network, sensors) in Python.
import psutil

path_tr = '/content/drive/MyDrive/CommonLit/input/train.csv'
path_test = '/content/drive/MyDrive/CommonLit/input/test.csv'
path_sub = '/content/drive/MyDrive/CommonLit/input/sample_submission.csv'

SEED = 13
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
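import dataclasses
from typing import Any, Tuple

# Assumption: AdamW from torch.optim; the original may have used transformers.AdamW.
from torch.optim import AdamW

# NOTE: TemporalConvNet is not defined in this notebook. It must be defined or
# imported (presumably from the repository linked above) before running this cell.

@dataclasses.dataclass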
class Config:
    experiment_name: str
    seed: int = 10
    debug: bool = False
    fold: int = 0

    nlp_model_name: str = "roberta-base"
    linear_dim: int = 64
    linear_vocab_dim_1: int = 64
    linear_vocab_dim: int = 16
    linear_perplexity_dim: int = 64
    linear_final_dim: int = 256
    dropout: float = 0
    dropout_stack: float = 0
    dropout_output_hidden: float = 0
    dropout_attn: float = 0
    batch_size: int = 32

    lr_bert: float = 3e-5
    lr_fc: float = 5e-5
    lr_rnn: float = 1e-3
    lr_tcn: float = 1e-3
    lr_cnn: float = 1e-3
    warmup_ratio: float = 0.1
    training_steps_ratio: float = 1
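    epochs: int = 6
    epochs_max: int = 6

    def __post_init__(self):
        # Debug runs use a shorter schedule. A class-level `if debug:` is
        # evaluated once at class-definition time (where debug is False), so
        # Config(debug=True) would never take that branch; check the instance
        # value here instead.
        if self.debug:
            self.epochs = 2
            self.epochs_max = 8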

    activation: Any = nn.GELU
    optimizer: Any = AdamW
    weight_decay: float = 0.1

    rnn_module: nn.Module = nn.LSTM
    rnn_module_num: int = 0
    rnn_module_dropout: float = 0
    rnn_module_activation: Any = None
    rnn_module_shrink_ratio: float = 0.25
    rnn_hidden_indice: Tuple[int, ...] = (-1, 0)  # hidden-state indices fed to the RNN head (last, first)
    bidirectional: bool = True

    tcn_module_enable: bool = False
    tcn_module_num: int = 3
    tcn_module: nn.Module = TemporalConvNet
    tcn_module_kernel_size: int = 4
    tcn_module_dropout: float = 0

    linear_vocab_enable: bool = False
    augmantation_range: Tuple[float, float] = (0, 0)
    lr_bert_decay: float = 1

    multi_dropout_ratio: float = 0.3
    multi_dropout_num: int = 10
    fine_tuned_path: str = None

    # convnet
    cnn_model_name: str = "resnet18"
    cnn_pretrained: bool = False
    self_attention_enable: bool = False

    mask_p: float = 0
    max_length: int = 256

    hidden_stack_enable: bool = False
    prep_enable: bool = False
    kl_div_enable: bool = False

    # reinit
    reinit_pooler: bool = True
    reinit_layers: int = 4

    # pooler
    pooler_enable: bool = True

    word_axis: bool = False

    # conv1d
    conv1d_num: int = 1
    conv1d_stride: int = 2
    conv1d_kernel_size: int = 2

    attention_pool_enable: bool = False
    conv2d_hidden_channel: int = 32

    simple_structure: bool = False
    crossentropy: bool = False
    crossentropy_min: int = -8
    crossentropy_max: int = 4

    accumulate_grad_batches: int = 1
    gradient_clipping: float = 0.2

    dropout_bert: float = 0

    feature_enable: bool = False
    decoder_only: bool = True

    stochastic_weight_avg: bool = False
    val_check_interval: float = 0.05

    attention_head_enable: bool = False

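# experiment_name has no default, so it must be passed; "example" is a placeholder.
cfg = Config(experiment_name="example")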
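
Since learning rate and the number of re-initialized layers were the parameters tuned per model, one convenient pattern is to derive variants from a base config with `dataclasses.replace`. The sketch below uses a hypothetical, trimmed-down `MiniConfig` so that it runs on its own; the field names match the full `Config`, but the experiment names and the grid values are made up for illustration.

```python
import dataclasses


@dataclasses.dataclass
class MiniConfig:
    # Hypothetical, trimmed-down stand-in for the full Config above.
    experiment_name: str
    nlp_model_name: str = "roberta-base"
    lr_bert: float = 3e-5
    reinit_layers: int = 4
    gradient_clipping: float = 0.2


base = MiniConfig(experiment_name="baseline")

# One config per (learning rate, re-initialized layers) pair; base is left untouched.
variants = [
    dataclasses.replace(
        base, experiment_name=f"lr{lr}_reinit{n}", lr_bert=lr, reinit_layers=n
    )
    for lr in (2e-5, 3e-5)
    for n in (4, 5)
]

for cfg_variant in variants:
    print(cfg_variant.experiment_name, cfg_variant.lr_bert, cfg_variant.reinit_layers)
```

`dataclasses.replace` returns a new instance each time, so every experiment keeps its own immutable-by-convention settings while sharing the untouched defaults.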