TODO

[Kaggle 22nd place solution](https://www.kaggle.com/c/commonlitreadabilityprize/discussion/257302)

[github](https://github.com/kurupical/commonlit)

[inference](https://www.kaggle.com/kurupical/191-192-202-228-251-253-268-288-278-final?scriptVersionId=69642056)


## Description

### worked for me
- model ensemble: I thought diversity was the most important thing in this competition.
    - At the beginning of the competition, I tested the effectiveness of ensembling.
    - Up to the middle stage, I fixed the model to roberta-large and tried to improve its score.
    - At the end, I applied the method to other models. I found that the key parameters for this task are {learning_rate, N layers to re-initialize}, so I tuned those parameters for each model.
- re-initialization
    - This paper (https://arxiv.org/pdf/2006.05987.pdf) shows that fine-tuning after re-initializing the last N layers works well.
    - Different models have different optimal N. Most models used N=4~5; gpt2 models used N=6.
- LSTM head
    - Feeding BERT's first and last hidden layers into an LSTM layer worked well.
    - I think the first layer represents vocabulary difficulty and the last layer represents sentence difficulty. Both are important for inferring readability.
- Remove dropout. Improved CV by 0.01~0.02.
- gradient clipping. (0.2 or 0.5 worked well for me, improving CV by about 0.005)
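
As a rough illustration of the ensembling step, the sketch below fits blend weights with Nelder-Mead on synthetic out-of-fold predictions. The data, the RMSE objective written here, and the use of `scipy.optimize.minimize` are assumptions for illustration; the author's actual weighting code may differ.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)  # synthetic targets, not competition data

# Three synthetic "models": the target plus noise of different strength.
oof_preds = np.stack(
    [y_true + rng.normal(scale=s, size=200) for s in (0.3, 0.5, 0.8)], axis=1
)


def rmse(weights):
    """RMSE of the weighted blend of out-of-fold predictions."""
    blended = oof_preds @ weights
    return float(np.sqrt(np.mean((blended - y_true) ** 2)))


n_models = oof_preds.shape[1]
init = np.full(n_models, 1.0 / n_models)  # start from a plain average
result = minimize(rmse, init, method="Nelder-Mead")
best_weights = result.x

print("weights:", best_weights)
print("plain average RMSE:", rmse(init), "blended RMSE:", rmse(best_weights))
```

Nelder-Mead is derivative-free, so a model that adds nothing to the blend can simply drift toward a weight near zero.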

### not worked for me
- Input the attention matrix to a 2D-CNN (like ResNet18 or a simple 2D-CNN)
    - I thought this could represent the complexity of sentences with relative pronouns.
- Masking 5%~10% of the vocabulary.
- Minimize KLDiv loss to fit the distribution.
- Scale the target to 0~1 and minimize cross-entropy loss
- "base" models excluding mpnet. I got 0.47x CV but Public LB: 0.48x ~ 0.49x.
- Stacking using LightGBM.
- other models. (Results are in the table below. Single-model CV was good, but they got zero weight in the ensemble)
- T5. The notebook below achieves 0.47 LB using T5, so I tried it but failed.
I got only 0.49x (fold 0 only) with learning_rate=1.5e-4
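
The "scale target to 0~1" idea from the list above can be sketched as follows. The bounds -8 and 4 mirror the `crossentropy_min` / `crossentropy_max` defaults in the `Config` class further down; whether the author used exactly this formulation is an assumption, and all function names here are hypothetical.

```python
import math


def scale_target(target, lo=-8.0, hi=4.0):
    """Map a readability score from [lo, hi] to [0, 1], clipping outliers."""
    scaled = (target - lo) / (hi - lo)
    return min(max(scaled, 0.0), 1.0)


def unscale_target(prob, lo=-8.0, hi=4.0):
    """Map a value in [0, 1] back to the original score range."""
    return prob * (hi - lo) + lo


def binary_cross_entropy(prob, soft_label, eps=1e-7):
    """Cross-entropy between a predicted probability and a soft label."""
    prob = min(max(prob, eps), 1 - eps)
    return -(soft_label * math.log(prob) + (1 - soft_label) * math.log(1 - prob))


label = scale_target(-2.0)  # midpoint of [-8, 4]
print(label, unscale_target(label))
print(binary_cross_entropy(0.5, label), binary_cross_entropy(0.3, label))
```

The loss is smallest when the predicted probability equals the soft label, so at inference time the prediction is mapped back with `unscale_target`.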

configuration for almost all models:
```
epochs = 4
optimizer: AdamW
scheduler: linear_schedule_with_warmup(warmup: 5%)
lr_bert: 3e-5
batch_size: 12
gradient clipping: 0.2~0.5
reinitialize layers: last 2~6 layers
ensemble: Nelder-Mead
custom head(finally concat all)
    averaging last 4 hidden layer
    LSTM head
    vocabulary dense
hidden_states: (batch_size, vocab_size, bert_hidden_size)
  linear_vocab = nn.Sequential(
      nn.Linear(bert_hidden_size, 128),
      nn.GELU(),
      nn.Linear(128, 64),
      nn.GELU()
  )
  linear_final = nn.Linear(vocab_size * 64, 128)
  out = linear_vocab(hidden_states).view(len(input_ids), -1) # final shape: (batch_size, vocab_size * 64)
  out = linear_final(out) # out shape: (batch_size, 128)
17 hand-made features
    sentence count
    average character count in documents
```
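
The scheduler line in the configuration presumably refers to a linear warmup followed by linear decay, as in `transformers.get_linear_schedule_with_warmup`. The sketch below reproduces that multiplier in plain Python so the shape of the schedule is easy to inspect; `total_steps = 1000` is an arbitrary assumption.

```python
def linear_warmup_decay(step, total_steps, warmup_ratio=0.05):
    """Learning-rate multiplier: linear ramp to 1.0, then linear decay to 0.0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 3e-5      # lr_bert from the configuration above
total_steps = 1000  # assumed value for illustration
lrs = [base_lr * linear_warmup_decay(s, total_steps) for s in range(total_steps + 1)]

print(lrs[0], lrs[25], lrs[50], lrs[-1])
```

With a 5% warmup the peak learning rate is reached at step 50 of 1000, after which it decays linearly to zero.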

The main hyperparameters:

|nlp_model_name|funnel-large-base|funnel-large|
|----|----|----|
|dropout|	0|	0|
|batch_size|	12|	12|
|lr_bert|	2E-05|	2E-05|
|lr_fc|	5E-05|	5E-05|
|warmup_ratio|	0.05|	0.05|
|epochs|	6|	6|
|activation|	GELU|	GELU|
|optimizer|	AdamW|	AdamW|
|weight_decay|	0.1|	0.1|
|rnn_module|	LSTM|	LSTM|
|rnn_module_num|	0|	1|
|rnn_hidden_indice|	(-1, 0)|	(-1, 0)|
|linear_vocab_enable|	False|	True|
|multi_dropout_ratio|	0.3|	0.3|
|multi_dropout_num|	10|	10|
|max_length|	256|	256|
|hidden_stack_enable|	True|	True|
|reinit_layers|	4|	4|
|gradient_clipping|	0.2|	0.2|
|feature_enable|	False|	True|
|stochastic_weight_avg|	False|	False|
|val_check_interval|	0.05|	0.05|

See https://github.com/kurupical/commonlit/blob/master/exp/exp278.py for details.

In [1]:
from IPython.display import clear_output, Image
!pip install transformers
clear_output()

In [None]:
!sudo apt-get update -y
!sudo apt-get install python3.9

#change alternatives
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 1
!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 2
# !sudo update-alternatives --config python3
clear_output()
!python --version

In [2]:
import os
import re
import torch
import itertools
import transformers
import pandas as pd
import numpy as np
from tqdm import tqdm
from numpy import random
from torch import nn, optim
import matplotlib.pyplot as plt

from dataclasses import dataclass
"""
Any: any operation or method call can be performed on a value of type Any,
     and it can be assigned to any variable.

Optional: note that this is not the same as an optional argument,
          i.e. one that has a default value.
"""
from typing import Any, Optional, List, Tuple

"""
Retrieves information about running processes
and system utilization (CPU, memory, disks, network, sensors) in Python.
"""
import psutil

path_tr = '/content/drive/MyDrive/CommonLit/input/train.csv'
path_test = '/content/drive/MyDrive/CommonLit/input/test.csv'
path_sub = '/content/drive/MyDrive/CommonLit/input/sample_submission.csv'

SEED = 13
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## dataclass

In [3]:
@dataclass
class Config:
    experiment_name: str
    seed: int = 10
    debug: bool = False
    fold: int = 0

    nlp_model_name: str = "roberta-base"
    linear_dim: int = 64
    linear_vocab_dim_1: int = 64
    linear_vocab_dim: int = 16
    linear_perplexity_dim: int = 64
    linear_final_dim: int = 256
    dropout: float = 0
    dropout_stack: float = 0
    dropout_output_hidden: float = 0
    dropout_attn: float = 0
    batch_size: int = 32

    lr_bert: float = 3e-5
    lr_fc: float = 5e-5
    lr_rnn: float = 1e-3
    lr_tcn: float = 1e-3
    lr_cnn: float = 1e-3
    warmup_ratio: float = 0.1
    training_steps_ratio: float = 1
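    # Note: this branch is evaluated once, when the class body runs, using the
    # default `debug = False`; passing debug=True to Config() does not change epochs.
    if debug: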
        epochs: int = 2
        epochs_max: int = 8
    else:
        epochs: int = 6
        epochs_max: int = 6

    activation: Any = nn.GELU
    # optimizer: Any = transformers.AdamW
    weight_decay: float = 0.1

    rnn_module: nn.Module = nn.LSTM
    rnn_module_num: int = 0
    rnn_module_dropout: float = 0
    rnn_module_activation: Any = None
    rnn_module_shrink_ratio: float = 0.25
    rnn_hidden_indice: Tuple[int, ...] = (-1, 0)
    bidirectional: bool = True

    tcn_module_enable: bool = False
    tcn_module_num: int = 3
    # tcn_module: nn.Module = TemporalConvNet
    tcn_module_kernel_size: int = 4
    tcn_module_dropout: float = 0

    linear_vocab_enable: bool = False
    augmantation_range: Tuple[float, float] = (0, 0)
    lr_bert_decay: float = 1

    multi_dropout_ratio: float = 0.3
    multi_dropout_num: int = 10
    fine_tuned_path: Optional[str] = None

    # convnet
    cnn_model_name: str = "resnet18"
    cnn_pretrained: bool = False
    self_attention_enable: bool = False

    mask_p: float = 0
    max_length: int = 256

    hidden_stack_enable: bool = False
    prep_enable: bool = False
    kl_div_enable: bool = False

    # reinit
    reinit_pooler: bool = True
    reinit_layers: int = 4

    # pooler
    pooler_enable: bool = True

    word_axis: bool = False

    # conv1d
    conv1d_num: int = 1
    conv1d_stride: int = 2
    conv1d_kernel_size: int = 2

    attention_pool_enable: bool = False
    conv2d_hidden_channel: int = 32

    simple_structure: bool = False
    crossentropy: bool = False
    crossentropy_min: int = -8
    crossentropy_max: int = 4

    accumulate_grad_batches: int = 1
    gradient_clipping: float = 0.2

    dropout_bert: float = 0

    feature_enable: bool = False
    decoder_only: bool = True

    stochastic_weight_avg: bool = False
    val_check_interval: float = 0.05

    attention_head_enable: bool = False

cfg = Config('test1')
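
One detail of the `Config` class is worth illustrating: an `if debug:` statement inside a class body runs once, at class-definition time, against the default value. The sketch below contrasts that with a `__post_init__` hook, which reacts to the value actually passed in. Both class names are hypothetical and not part of the author's code.

```python
from dataclasses import dataclass


@dataclass
class ClassBodyConfig:
    debug: bool = False
    if debug:  # evaluated once, while the class body executes
        epochs: int = 2
    else:
        epochs: int = 6


@dataclass
class PostInitConfig:
    debug: bool = False
    epochs: int = 6

    def __post_init__(self):
        # Runs for every instance, so it sees the debug flag passed by the caller.
        if self.debug:
            self.epochs = 2


print(ClassBodyConfig(debug=True).epochs)
print(PostInitConfig(debug=True).epochs)
```

If a debug run with fewer epochs is wanted, the `__post_init__` variant (or passing `epochs` explicitly) is one way to get it.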

In [None]:
cfg

Config(experiment_name='test1', seed=10, debug=False, fold=0, nlp_model_name='roberta-base', linear_dim=64, linear_vocab_dim_1=64, linear_vocab_dim=16, linear_perplexity_dim=64, linear_final_dim=256, dropout=0, dropout_stack=0, dropout_output_hidden=0, dropout_attn=0, batch_size=32, lr_bert=3e-05, lr_fc=5e-05, lr_rnn=0.001, lr_tcn=0.001, lr_cnn=0.001, warmup_ratio=0.1, training_steps_ratio=1, epochs=6, epochs_max=6, activation=<class 'torch.nn.modules.activation.GELU'>, weight_decay=0.1, rnn_module=<class 'torch.nn.modules.rnn.LSTM'>, rnn_module_num=0, rnn_module_dropout=0, rnn_module_activation=None, rnn_module_shrink_ratio=0.25, rnn_hidden_indice=(-1, 0), bidirectional=True, tcn_module_enable=False, tcn_module_num=3, tcn_module_kernel_size=4, tcn_module_dropout=0, linear_vocab_enable=False, augmantation_range=(0, 0), lr_bert_decay=1, multi_dropout_ratio=0.3, multi_dropout_num=10, fine_tuned_path=None, cnn_model_name='resnet18', cnn_pretrained=False, self_attention_enable=False, mask_

## feature_engineering

In [None]:
def total_words(x):
    return len(x.split(" "))

def total_unique_words(x):
    return len(np.unique(x.split(" ")))

def total_charactors(x):
    x = x.replace(" ", "")
    return len(x)

def total_sentence(x):
    x = x.replace("!", "[end]").replace("?", "[end]").replace(".", "[end]")
    return len(x.split("[end]"))
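
A quick check of how these counters behave on a short string. The two functions are restated from the cell above so the snippet runs on its own; `total_sentence_nonempty` is a hypothetical variant, not part of the author's features.

```python
def total_words(text):
    return len(text.split(" "))


def total_sentence(text):
    marked = text.replace("!", "[end]").replace("?", "[end]").replace(".", "[end]")
    return len(marked.split("[end]"))


def total_sentence_nonempty(text):
    """Hypothetical variant that ignores empty segments."""
    marked = text.replace("!", "[end]").replace("?", "[end]").replace(".", "[end]")
    return sum(1 for part in marked.split("[end]") if part.strip())


sample = "Hi there. How are you?"
print(total_words(sample), total_sentence(sample), total_sentence_nonempty(sample))
```

Because the sample ends with a terminator, `split` yields a trailing empty segment and `total_sentence` counts one more than the number of sentences. Abbreviations such as "Mrs." are also counted as sentence ends. Since the feature is standardized and fed to a learned layer, a constant offset like this is probably harmless.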

In [None]:
df = pd.read_csv(path_tr)
df.columns

Index(['id', 'url_legal', 'license', 'excerpt', 'target', 'standard_error'], dtype='object')

In [None]:
df_ret = df[["id", "excerpt", "target", "standard_error"]].copy()  # explicit copy avoids SettingWithCopyWarning
excerpt = df["excerpt"].values
df_ret["total_words"] = [total_words(x) for x in excerpt]
df_ret["total_unique_words"] = [total_unique_words(x) for x in excerpt]
df_ret["total_characters"] = [total_charactors(x) for x in excerpt]
df_ret["total_sentence"] = [total_sentence(x) for x in excerpt]

df_ret["div_sentence_characters"] = df_ret["total_sentence"] / df_ret["total_characters"]
df_ret["div_sentence_words"] = df_ret["total_sentence"] / df_ret["total_words"]
df_ret["div_characters_words"] = df_ret["total_characters"] / df_ret["total_words"]
df_ret["div_words_unique_words"] = df_ret["total_words"] / df_ret["total_unique_words"]

for i, word in enumerate(["!", "?", "(", ")", "'", '"', ";", ".", ","]):
    df_ret[f"count_word_special_{i}"] = [x.count(word) for x in excerpt]
df_ret.fillna(0, inplace=True)

df_ret.head()


| |id|excerpt|target|standard_error|total_words|total_unique_words|total_characters|total_sentence|div_sentence_characters|div_sentence_words|div_characters_words|div_words_unique_words|count_word_special_0|count_word_special_1|count_word_special_2|count_word_special_3|count_word_special_4|count_word_special_5|count_word_special_6|count_word_special_7|count_word_special_8|
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0|c12129c31|When the young people returned to the ballroom...|-0.340259|0.464009|174|112|819|12|0.014652|0.068966|4.706897|1.553571|0|0|0|0|0|0|0|11|14|
|1|85aa80a4c|All through dinner time, Mrs. Fayre was somewh...|-0.315372|0.480805|164|123|774|18|0.023256|0.109756|4.719512|1.333333|5|2|0|0|3|12|0|10|24|
|2|b69ac6792|As Roger had predicted, the snow departed as q...|-0.580118|0.476676|162|124|747|13|0.017403|0.080247|4.611111|1.306452|1|0|0|0|4|10|2|11|17|
|3|dd1000b26|And outside before the palace a great garden w...|-1.054013|0.450007|163|117|747|6|0.008032|0.03681|4.582822|1.393162|0|0|0|0|0|0|2|5|23|
|4|37c1b32fb|Once upon a time there were Three Bears who li...|0.247197|0.510845|147|51|577|6|0.010399|0.040816|3.92517|2.882353|0|0|0|0|0|0|10|5|13|


In [None]:
cfg.feature_columns = [x for x in df_ret.columns if x not in ["id", "excerpt", "target", "kfold", "standard_error"]]
"""
.mean
    array([1.71654905e+02, 1.13895554e+02, 8.01077982e+02, 1.08479181e+01,
        1.37249582e-02, 6.32846225e-02, 4.66937398e+00, 1.51519104e+00,
        4.57304164e-01, 3.55681016e-01, 3.68031052e-01, 3.68031052e-01,
        1.15031757e+00, 2.38884968e+00, 8.71912491e-01, 9.03493296e+00,
        1.17314749e+01])
"""
cfg.feature_mean = df_ret[cfg.feature_columns].mean().values
cfg.feature_std = df_ret[cfg.feature_columns].std().values
cfg.feature_columns

['total_words',
 'total_unique_words',
 'total_characters',
 'total_sentence',
 'div_sentence_characters',
 'div_sentence_words',
 'div_characters_words',
 'div_words_unique_words',
 'count_word_special_0',
 'count_word_special_1',
 'count_word_special_2',
 'count_word_special_3',
 'count_word_special_4',
 'count_word_special_5',
 'count_word_special_6',
 'count_word_special_7',
 'count_word_special_8']
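
The mean and standard deviation stored on `cfg` are used to z-score the features inside the dataset. A minimal NumPy sketch of the same normalization is shown below on made-up rows; the guard against a zero standard deviation is an addition for illustration and is not in the original code.

```python
import numpy as np

# Made-up feature rows; the last column is constant on purpose.
features = np.array([
    [174.0, 12.0, 0.0],
    [164.0, 18.0, 0.0],
    [147.0, 6.0, 0.0],
])

mean = features.mean(axis=0)
std = features.std(axis=0, ddof=1)       # ddof=1 matches the pandas DataFrame.std default
safe_std = np.where(std == 0, 1.0, std)  # avoid 0/0 for constant columns
normalized = (features - mean) / safe_std

print(normalized)
```

Note that `numpy.std` defaults to `ddof=0` while pandas defaults to `ddof=1`, so the statistics must come from the same library at training and inference time to reproduce the same inputs.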

## Dataset

In [None]:
from torch.utils.data import Dataset


class CommonLitDataset(Dataset):
    """
    return:
        input_ids_masked
        attention_mask
        token_type_ids
        input_ids

        features - array of normalized hand-made features
        target - target
        std - "standard_error" from the original data
    """
    def __init__(self, df, tokenizer, cfg, transforms=None):
        self.df = df.reset_index()
        self.augmentations = transforms
        self.cfg = cfg
        self.tokenizer = tokenizer

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, index):
        row = self.df.iloc[index]

        text_original = row["excerpt"]

        text = self.tokenizer(text_original,
                              padding="max_length",
                              max_length=self.cfg.max_length,
                              truncation=True,
                              return_tensors="pt",
                              return_token_type_ids=True)
        input_ids = text["input_ids"][0].detach().cpu().numpy()
        input_ids_masked = [x if np.random.random() > self.cfg.mask_p else self.tokenizer.mask_token_id for x in input_ids]
        input_ids_masked = torch.LongTensor(input_ids_masked).to("cuda")
        attention_mask = text["attention_mask"][0]
        token_type_ids = text["token_type_ids"][0]
        std = row["standard_error"]

        features = ((row[self.cfg.feature_columns].fillna(0).values - self.cfg.feature_mean) / self.cfg.feature_std)
        """
        (current row - mean over all rows) / std over all rows, e.g.
        array([ 0.13797008, -0.14786123,  0.17155507,  0.24627862,  0.14901487,
                0.21311318,  0.08742476,  0.26920065, -0.39593219, -0.3817358 ,
               -0.38798113, -0.38889757, -0.63937729, -0.57226923, -0.66869829,
                0.4939904 ,  0.48264299])                
        """
        features = torch.tensor(features, dtype=torch.float)
        target = torch.tensor(row["target"], dtype=torch.float)
        return input_ids_masked, attention_mask, token_type_ids, input_ids, features, target, std
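
The random masking in `__getitem__` can be isolated as a small pure-Python function. Unlike the original line, this sketch skips special tokens and padding, which is a deliberate variation rather than the author's behavior. The token ids are illustrative values in the style of a RoBERTa vocabulary.

```python
import random


def mask_tokens(input_ids, mask_token_id, mask_p, special_ids=(), rng=random):
    """Replace each non-special token with mask_token_id with probability mask_p."""
    masked = []
    for token_id in input_ids:
        if token_id not in special_ids and rng.random() < mask_p:
            masked.append(mask_token_id)
        else:
            masked.append(token_id)
    return masked


# Illustrative ids: 0 = <s>, 2 = </s>, 1 = <pad>, 50264 = <mask>
ids = [0, 713, 16, 10, 1296, 2, 1, 1]
special = {0, 1, 2}
rng = random.Random(13)

print(mask_tokens(ids, 50264, 0.1, special, rng))
```

With `mask_p = 0` (the `Config` default) the input is returned unchanged, so the masked and unmasked ids are identical.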

## Module&Heads, Layers

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning import LightningModule
from pytorch_lightning.utilities import rank_zero_warn

In [None]:
import gc
import torch
from torch import nn
from torch.nn import functional as F
# Applies weight normalization to a parameter in the given module.
from torch.nn.utils import weight_norm
"""
https://github.com/rwightman/pytorch-image-models#introduction

PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers,
schedulers, data loaders / augmentations, and reference training / validation scripts that
aims to pull together a wide variety of SOTA models with the ability to reproduce ImageNet
training results.
"""
import timm

### TemporalConvNet
[link](https://web.cse.ohio-state.edu/~wang.77/papers/Pandey-Wang1.icassp19.pdf)


The proposed CNN is an encoder-decoder based architecture with an additional temporal convolutional module (TCM) inserted between the encoder and the decoder. We call this architecture a temporal convolutional neural network (TCNN).

The encoder in the TCNN creates a low-dimensional representation of a noisy input frame. The TCM uses causal and dilated convolutional layers to utilize the encoder output of the current and previous frames. The decoder uses the TCM output to reconstruct the enhanced frame. The proposed model is trained in a speaker- and noise-independent way. Experimental results show that the proposed model gives consistently better enhancement results than a state-of-the-art real-time convolutional recurrent model. Moreover, since the model is fully convolutional, it has far fewer trainable parameters than earlier models.

In [None]:
class Chomp1d(nn.Module):
    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size].contiguous()


class TemporalBlock(nn.Module):
    def __init__(self, n_inputs, n_outputs, kernel_size, stride, dilation, padding, dropout=0.2):
        super(TemporalBlock, self).__init__()
        self.conv1 = weight_norm(nn.Conv1d(n_inputs, n_outputs, kernel_size,
                                           stride=stride, padding=(kernel_size-1)*dilation,
                                           dilation=dilation))
        self.chomp1 = Chomp1d(padding)
        self.relu1 = nn.ReLU()
        self.dropout1 = nn.Dropout(dropout)

        self.conv2 = weight_norm(nn.Conv1d(n_outputs, n_outputs, kernel_size,
                                           stride=stride, padding=(kernel_size-1)*dilation,
                                           dilation=dilation))
        self.chomp2 = Chomp1d(padding)
        self.relu2 = nn.ReLU()
        self.dropout2 = nn.Dropout(dropout)

        self.net = nn.Sequential(self.conv1, self.chomp1, self.relu1, self.dropout1,
                                 self.conv2, self.chomp2, self.relu2, self.dropout2)
        # 1x1 convolution on the residual path; no padding, so its length matches the chomped output
        self.downsample = nn.Conv1d(n_inputs, n_outputs, 1) if n_inputs != n_outputs else None
        self.relu = nn.ReLU()
        self.init_weights()

    def init_weights(self):
        self.conv1.weight.data.normal_(0, 0.01)
        self.conv2.weight.data.normal_(0, 0.01)
        if self.downsample is not None:
            self.downsample.weight.data.normal_(0, 0.01)

    def forward(self, x):
        out = self.net(x)
        res = x if self.downsample is None else self.downsample(x)
        return self.relu(out + res)


class TemporalConvNet(nn.Module):
    def __init__(self, num_inputs, num_channels, kernel_size=2, dropout=0.2):
        super(TemporalConvNet, self).__init__()
        layers = []
        num_levels = len(num_channels)
   
        for i in range(num_levels):
            dilation_size = 2 ** i
            in_channels = num_inputs if i == 0 else num_channels[i-1]
            out_channels = num_channels[i]  
            layers += [TemporalBlock(in_channels, out_channels, kernel_size, stride=1, dilation=dilation_size,
                                     padding=(kernel_size-1) * dilation_size, dropout=dropout)]

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
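
The padding and `Chomp1d` pair is what keeps each block causal and length-preserving. The arithmetic can be checked without building the network, using the output-length formula from the `nn.Conv1d` documentation. The helper names below are hypothetical.

```python
def conv1d_output_length(length, kernel_size, dilation=1, padding=0, stride=1):
    """Output length of a 1D convolution (same formula as in the nn.Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def temporal_block_length(length, kernel_size, dilation):
    """Length after one padded convolution followed by Chomp1d."""
    padding = (kernel_size - 1) * dilation
    after_conv = conv1d_output_length(length, kernel_size, dilation, padding)
    return after_conv - padding  # Chomp1d drops the last `padding` steps


def receptive_field(kernel_size, num_levels):
    """Receptive field of a stack with two convolutions per block and dilation 2**i."""
    return 1 + 2 * (kernel_size - 1) * sum(2 ** i for i in range(num_levels))


for level in range(3):
    print(level, temporal_block_length(256, kernel_size=4, dilation=2 ** level))
print("receptive field:", receptive_field(kernel_size=4, num_levels=3))
```

Padding both sides by `(kernel_size - 1) * dilation` and chomping the right side is equivalent to left-only padding, so each output step depends only on current and earlier inputs.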

In [None]:
"""
cfg.max_length, cfg.tcn_module_num, cfg.tcn_module_kernel_size
>>> (256, 3, 4)

in model:
    num_levels = 3
    in_channels=256
    out_channels=256
"""

tcn_model = TemporalConvNet(
    num_inputs=256,
    num_channels=[256] * 3,
    kernel_size=cfg.tcn_module_kernel_size,
    dropout=cfg.tcn_module_dropout
)

In [None]:
tcn_model

TemporalConvNet(
  (network): Sequential(
    (0): TemporalBlock(
      (conv1): Conv1d(256, 256, kernel_size=(4,), stride=(1,), padding=(3,))
      (chomp1): Chomp1d()
      (relu1): ReLU()
      (dropout1): Dropout(p=0, inplace=False)
      (conv2): Conv1d(256, 256, kernel_size=(4,), stride=(1,), padding=(3,))
      (chomp2): Chomp1d()
      (relu2): ReLU()
      (dropout2): Dropout(p=0, inplace=False)
      (net): Sequential(
        (0): Conv1d(256, 256, kernel_size=(4,), stride=(1,), padding=(3,))
        (1): Chomp1d()
        (2): ReLU()
        (3): Dropout(p=0, inplace=False)
        (4): Conv1d(256, 256, kernel_size=(4,), stride=(1,), padding=(3,))
        (5): Chomp1d()
        (6): ReLU()
        (7): Dropout(p=0, inplace=False)
      )
      (relu): ReLU()
    )
    (1): TemporalBlock(
      (conv1): Conv1d(256, 256, kernel_size=(4,), stride=(1,), padding=(6,), dilation=(2,))
      (chomp1): Chomp1d()
      (relu1): ReLU()
      (dropout1): Dropout(p=0, inplace=False)
    

### modules

#### linear vocab enable (linear vocabulary)

three variants

In [None]:
# e.g. input_size = 768, which is the hidden size of the backbone model's output
linear_vocab = nn.Sequential(
    nn.Linear(768, 64),
    nn.Dropout(0),
    cfg.activation(),
    nn.Linear(64, 16),
    nn.Dropout(0),
    cfg.activation()
)
# if "large-base"
linear_vocab_final = nn.Sequential(
    nn.Linear(16 * 256 // 4, 256),
    # nn.BatchNorm1d(cfg.linear_final_dim),
    cfg.activation(),
    nn.Dropout(0)
)
# else
linear_vocab_final = nn.Sequential(
    nn.Linear(16 * 256 // 4, 256),
    # nn.BatchNorm1d(cfg.linear_final_dim),
    cfg.activation(),
    nn.Dropout(0)
)

#### attention_enable

##### Simple

In [None]:
convnet = nn.Sequential(
    nn.Conv2d(bert.config.num_hidden_layers * bert.config.num_attention_heads,
              cfg.conv2d_hidden_channel, kernel_size=(1, 1), stride=(1, 1), bias=False),
    nn.ReLU(),
    nn.Conv2d(cfg.conv2d_hidden_channel, 1, kernel_size=(1, 1), stride=(1, 1), bias=False),
    nn.ReLU(),
    Lambda(lambda x: x.view(x.size(0), -1)),
)
convnet.num_features = 256 ** 2
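
The cell above relies on a `Lambda` wrapper and a loaded `bert` model that are not defined in this notebook. The sketch below is one plausible way to write such a wrapper and run the 1x1-convolution head on a random attention stack. The sizes (12 layers, 12 heads, a short sequence of 8 tokens) and the wrapper itself are assumptions for illustration.

```python
import torch
from torch import nn


class Lambda(nn.Module):
    """Wrap an arbitrary function so it can sit inside nn.Sequential."""

    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)


num_layers, num_heads, seq_len = 12, 12, 8  # assumed sizes, short sequence for speed

head = nn.Sequential(
    nn.Conv2d(num_layers * num_heads, 32, kernel_size=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(32, 1, kernel_size=1, bias=False),
    nn.ReLU(),
    Lambda(lambda x: x.view(x.size(0), -1)),  # flatten to (batch, seq_len * seq_len)
)

# One attention map per (layer, head), stacked along the channel axis.
attentions = torch.rand(2, num_layers * num_heads, seq_len, seq_len)
flat = head(attentions)
print(flat.shape)
```

With `max_length = 256` the flattened size becomes `256 ** 2`, which is the `num_features` value set in the cell above.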

##### TIMM

In [None]:
!pip install timm --quiet
clear_output()
import timm

In [None]:
timm.list_models()

['adv_inception_v3',
 'bat_resnext26ts',
 'botnet26t_256',
 'botnet50ts_256',
 'cait_m36_384',
 'cait_m48_448',
 'cait_s24_224',
 'cait_s24_384',
 'cait_s36_384',
 'cait_xs24_384',
 'cait_xxs24_224',
 'cait_xxs24_384',
 'cait_xxs36_224',
 'cait_xxs36_384',
 'coat_lite_mini',
 'coat_lite_small',
 'coat_lite_tiny',
 'coat_mini',
 'coat_tiny',
 'convit_base',
 'convit_small',
 'convit_tiny',
 'cspdarknet53',
 'cspdarknet53_iabn',
 'cspresnet50',
 'cspresnet50d',
 'cspresnet50w',
 'cspresnext50',
 'cspresnext50_iabn',
 'darknet53',
 'deit_base_distilled_patch16_224',
 'deit_base_distilled_patch16_384',
 'deit_base_patch16_224',
 'deit_base_patch16_384',
 'deit_small_distilled_patch16_224',
 'deit_small_patch16_224',
 'deit_tiny_distilled_patch16_224',
 'deit_tiny_patch16_224',
 'densenet121',
 'densenet121d',
 'densenet161',
 'densenet169',
 'densenet201',
 'densenet264',
 'densenet264d_iabn',
 'densenetblur121d',
 'dla34',
 'dla46_c',
 'dla46x_c',
 'dla60',
 'dla60_res2net',
 'dla60_res2n

In [None]:
#loads model
res_18 = timm.create_model(
    model_name='resnet18',
    pretrained=True,
    num_classes=0)

###### SWIN

https://github.com/microsoft/Swin-Transformer

Swin Transformer (the name Swin stands for "Shifted window") was originally described in [arxiv](https://arxiv.org/abs/2103.14030) and can serve as a general-purpose backbone for computer vision.

It is essentially a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.

Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.

In [None]:
convnet.patch_embed.proj = nn.Conv2d(
    bert.config.num_hidden_layers * bert.config.num_attention_heads,
    96, kernel_size=(4, 4), stride=(4, 4)
)

###### VIT
https://huggingface.co/google/vit-base-patch16-224

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. A [CLS] token is also added to the beginning of the sequence to use it for classification tasks. Absolute position embeddings are added before feeding the sequence to the layers of the Transformer encoder.

By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: for instance, if you have a dataset of labeled images, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of the entire image.

You can use the raw model for image classification.

In [None]:
convnet.patch_embed.proj = nn.Conv2d(
    bert.config.num_hidden_layers * bert.config.num_attention_heads,
    768, kernel_size=(32, 32), stride=(32, 32)
)

###### efficientnet

In [None]:
convnet.conv_stem = nn.Conv2d(
    bert.config.num_hidden_layers * bert.config.num_attention_heads,
    32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)

###### resnet

In [None]:
convnet.conv1 = nn.Conv2d(
    bert.config.num_hidden_layers * bert.config.num_attention_heads,
    64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

###### linear_conv_final

In [None]:
linear_conv_final = nn.Sequential(
    nn.Linear(convnet.num_features, cfg.linear_final_dim),
    # nn.BatchNorm1d(cfg.linear_final_dim),
    cfg.activation(),
    nn.Dropout(cfg.dropout)
)

##### linear features

In [None]:
linear_feature = nn.Sequential(
    nn.Linear(17, cfg.linear_final_dim),
    cfg.activation(),
    nn.Dropout(cfg.dropout)
)

##### lstm

In [None]:
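from collections import OrderedDict


# `bert` (the backbone model) and `LSTMModule` are not defined in this notebook;
# they are presumably defined in the author's repository.
def make_lstm_module():
    ret = []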
    if cfg.word_axis:
        hidden_size = cfg.max_length * len(cfg.rnn_hidden_indice)
    else:
        hidden_size = bert.config.hidden_size * len(cfg.rnn_hidden_indice)

    for i in range(cfg.rnn_module_num):
        ret.append((f"lstm_module_{i}", LSTMModule(cfg=cfg, hidden_size=hidden_size)))
        if cfg.bidirectional:
            hidden_size = int(hidden_size * cfg.rnn_module_shrink_ratio * 2)
        else:
            hidden_size = int(hidden_size * cfg.rnn_module_shrink_ratio)
    return nn.Sequential(OrderedDict(ret))


lstm = make_lstm_module()
if cfg.bidirectional:
    if cfg.word_axis: # bool param
        lstm_size = int(cfg.max_length * len(cfg.rnn_hidden_indice) * (
            (2 * cfg.rnn_module_shrink_ratio) ** cfg.rnn_module_num))
    else:
        lstm_size = int(bert.config.hidden_size * len(cfg.rnn_hidden_indice) * (
            (2 * cfg.rnn_module_shrink_ratio) ** cfg.rnn_module_num))
else:
    if cfg.word_axis:
        lstm_size = int(cfg.max_length * len(cfg.rnn_hidden_indice) * (
            cfg.rnn_module_shrink_ratio ** cfg.rnn_module_num))

    else:
        lstm_size = int(bert.config.hidden_size * len(cfg.rnn_hidden_indice) * (
            cfg.rnn_module_shrink_ratio ** cfg.rnn_module_num))
            
            
linear_lstm_final = nn.Sequential(
    nn.Linear(lstm_size, cfg.linear_final_dim),
    # nn.BatchNorm1d(cfg.linear_final_dim),
    cfg.activation(),
    nn.Dropout(cfg.dropout)
)
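
Since `LSTMModule` is not shown in this notebook, the sketch below gives one hypothetical module whose output width is consistent with the `lstm_size` arithmetic above: a bidirectional LSTM whose per-direction hidden size is the input size times `rnn_module_shrink_ratio`. The class name, the mean pooling, and the small tensor sizes are assumptions.

```python
import torch
from torch import nn


class LSTMModuleSketch(nn.Module):
    """Hypothetical stand-in for LSTMModule: shrinks the feature size by shrink_ratio."""

    def __init__(self, hidden_size, shrink_ratio=0.25, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(
            hidden_size,
            int(hidden_size * shrink_ratio),
            batch_first=True,
            bidirectional=bidirectional,
        )

    def forward(self, x):
        out, _ = self.lstm(x)  # keep the per-token outputs, drop the (h, c) state
        return out


hidden_size, seq_len = 64, 10
first_layer = torch.randn(2, seq_len, hidden_size)  # stands in for hidden_states[0]
last_layer = torch.randn(2, seq_len, hidden_size)   # stands in for hidden_states[-1]

# Mirrors rnn_hidden_indice = (-1, 0): concatenate last and first layer per token.
stacked = torch.cat([last_layer, first_layer], dim=-1)

module = LSTMModuleSketch(hidden_size * 2)
sequence_out = module(stacked)
pooled = sequence_out.mean(dim=1)  # assumed pooling over tokens

print(sequence_out.shape, pooled.shape)
```

Here the input width is `64 * 2 = 128` and the bidirectional output width is `2 * int(128 * 0.25) = 64`, which matches `hidden_size * len(rnn_hidden_indice) * (2 * shrink_ratio) ** 1`.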

#### re-init

In [None]:
def reinit_bert(self):
    def get_model_type(x):
        if "distilbert" in x: return "distilbert"
        if "albert" in x: return "albert"
        if "roberta" in x: return "roberta"
        if "bert" in x: return "bert"

    # re-init pooler
    if cfg.reinit_pooler and not cfg.prep_enable:
        if "bert" in cfg.nlp_model_name or "roberta" in cfg.nlp_model_name or "luke" in cfg.nlp_model_name:
            bert.pooler.dense.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
            bert.pooler.dense.bias.data.zero_()
            for p in bert.pooler.parameters():
                p.requires_grad = True
        elif "xlnet" in cfg.nlp_model_name:
            raise ValueError(f"{cfg.nlp_model_name} does not have a pooler at the end")
        else:
            raise NotImplementedError

    # re-init layers
    if cfg.reinit_layers > 0:
        if "albert" in cfg.nlp_model_name:
            raise ValueError("albert not reinit")

        elif "bert" in cfg.nlp_model_name or "roberta" in cfg.nlp_model_name:
            if cfg.prep_enable:
                if get_model_type(cfg.nlp_model_name) == "bert":
                    layers = bert.bert.encoder.layer[-cfg.reinit_layers:]
                elif get_model_type(cfg.nlp_model_name) == "roberta":
                    layers = bert.roberta.encoder.layer[-cfg.reinit_layers:]
            else:
                if get_model_type(cfg.nlp_model_name) == "distilbert":
                    layers = bert.transformer.layer[-cfg.reinit_layers:]
                else:
                    layers = bert.encoder.layer[-cfg.reinit_layers:]
            for layer in layers:
                for module in layer.modules():
                    if isinstance(module, nn.Linear):
                        # Slightly different from the TF version which uses truncated_normal for initialization
                        # cf https://github.com/pytorch/pytorch/pull/5617
                        module.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
                        if module.bias is not None:
                            module.bias.data.zero_()
                    elif isinstance(module, nn.Embedding):
                        module.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
                        if module.padding_idx is not None:
                            module.weight.data[module.padding_idx].zero_()
                    elif isinstance(module, nn.LayerNorm):
                        module.bias.data.zero_()
                        module.weight.data.fill_(1.0)

        elif "xlnet" in cfg.nlp_model_name:
            for layer in bert.layer[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "luke" in cfg.nlp_model_name:
            if cfg.prep_enable:
                raise NotImplementedError
            else:
                layers = bert.encoder.layer[-cfg.reinit_layers:]
            for layer in layers:
                for module in layer.modules():
                    if isinstance(module, nn.Linear):
                        # Slightly different from the TF version which uses truncated_normal for initialization
                        # cf https://github.com/pytorch/pytorch/pull/5617
                        module.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
                        if module.bias is not None:
                            module.bias.data.zero_()
                    elif isinstance(module, nn.Embedding):
                        module.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
                        if module.padding_idx is not None:
                            module.weight.data[module.padding_idx].zero_()
                    elif isinstance(module, nn.LayerNorm):
                        module.bias.data.zero_()
                        module.weight.data.fill_(1.0)
        elif "funnel" in cfg.nlp_model_name:
            for layer in bert.encoder.blocks[2][-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "bart" in cfg.nlp_model_name:
            for layer in bert.decoder.layers[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
            if not cfg.decoder_only:
                for layer in bert.encoder.layers[-cfg.reinit_layers:]:
                    for module in layer.modules():
                        bert._init_weights(module)
        elif "electra" in cfg.nlp_model_name:
            for layer in bert.encoder.layer[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "gpt2" in cfg.nlp_model_name:
            for layer in bert.h[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "gpt-neo" in cfg.nlp_model_name:
            for layer in bert.h[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "t5" in cfg.nlp_model_name:
            for layer in bert.encoder.block[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
            if not cfg.decoder_only:
                for layer in bert.decoder.block[-cfg.reinit_layers:]:
                    for module in layer.modules():
                        bert._init_weights(module)
        elif "mpnet" in cfg.nlp_model_name:
            for layer in bert.encoder.layer[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
        elif "layoutlm" in cfg.nlp_model_name:
            for layer in bert.encoder.layer[-cfg.reinit_layers:]:
                for module in layer.modules():
                    bert._init_weights(module)
    """
    for layer in [linear1, linear2, linear1_std, linear2_std, linear_perp, linear_vocab,
                    linear_tcn_final, linear_lstm_final, linear_hidden_final,
                    linear_conv_final, linear_vocab_final]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=bert.config.initializer_range)
                if module.bias is not None:
                    module.bias.data.zero_()
    """

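The per-architecture branches above all repeat one pattern, so a generic sketch may help. The helper below is hypothetical: `reinit_last_layers` and the toy `blocks` stack are not part of the original code, and the default `std=0.02` is an assumed BERT-style `initializer_range`. It returns early for `n == 0`, because a slice like `layers[-0:]` selects every layer rather than none.

```python
import torch
from torch import nn


def reinit_last_layers(layers: nn.ModuleList, n: int, std: float = 0.02) -> None:
    """Re-initialize the last `n` blocks of `layers` in place (hypothetical helper)."""
    if n <= 0:
        # layers[-0:] would select *all* blocks, so return early instead.
        return
    for layer in layers[-n:]:
        for module in layer.modules():
            if isinstance(module, nn.Linear):
                module.weight.data.normal_(mean=0.0, std=std)
                if module.bias is not None:
                    module.bias.data.zero_()
            elif isinstance(module, nn.LayerNorm):
                module.bias.data.zero_()
                module.weight.data.fill_(1.0)


# Toy stand-in for an encoder: four blocks of Linear + LayerNorm.
torch.manual_seed(0)
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8)) for _ in range(4)]
)
# Pretend these are pretrained weights by filling everything with a constant.
for p in blocks.parameters():
    p.data.fill_(5.0)

# Only the last two blocks are reset; the first two keep their values.
reinit_last_layers(blocks, n=2)
```
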
#### forward layer

In [None]:
def forward(self, input_ids_masked, attention_mask, token_type_ids, input_ids, features):
    def f(x_in, perplexity=None):
        x_in = F.dropout(x_in, p=cfg.multi_dropout_ratio, training=True)
        if perplexity is not None:
            x_out_mean = linear1(torch.cat([x_in, perplexity], dim=1))
            x_out_mean = linear2(torch.cat([x_out_mean, perplexity], dim=1))
        else:
            x_out_mean = linear1(x_in)
            x_out_mean = linear2(x_out_mean)
        return x_out_mean

    def g(x_in, perplexity=None):
        x_in = F.dropout(x_in, p=cfg.multi_dropout_ratio, training=True)
        if perplexity is not None:
            x_out_std = linear1_std(torch.cat([x_in, perplexity], dim=1))
            x_out_std = linear2_std(torch.cat([x_out_std, perplexity], dim=1))
        else:
            x_out_std = linear1_std(x_in)
            x_out_std = linear2_std(x_out_std)
        x_out_std = torch.exp(x_out_std) ** 0.5
        return x_out_std

    if not cfg.prep_enable:
        x = bert(input_ids=input_ids_masked,
                        attention_mask=attention_mask,
                        output_attentions=True,
                        output_hidden_states=True)
        if "deberta" in cfg.nlp_model_name:
            x = [x[0], x[1], x[2]]
        elif "xlnet" in cfg.nlp_model_name:
            if len(x) == 4:
                x = [x[0], x[2], x[3], x[1]]
            else:
                x = [x[0], x[1], x[2]]
        elif "albert" in cfg.nlp_model_name:
            x = [x[0], x[1]]
        elif "distilbert" in cfg.nlp_model_name:
            x = [x[0], x[1], x[2]]
        elif "funnel" in cfg.nlp_model_name:
            x = [x[0], x[1], x[2]]
        elif "bart" in cfg.nlp_model_name:
            x = [x[0], x[2], x[3], None]  # x[2]: decoder hidden_states, x[3]: cross attention
        elif "electra" in cfg.nlp_model_name:
            x = [x[0], x[1], x[2]]
        elif "t5" in cfg.nlp_model_name:
            x = [x[0], x[1], x[2]]
        elif "mpnet" in cfg.nlp_model_name:
            x = [x[0], x[2], x[3], x[1]]
        else:
            x = [x[0], x[2], x[3], x[1]]
    elif "funnel" in cfg.nlp_model_name:
        x = bert.funnel(input_ids=input_ids_masked,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids,
                                output_attentions=True,
                                output_hidden_states=True)
        if cfg.prep_enable:
            input_ids_pred = bert.lm_head(x[0])
    elif "albert" in cfg.nlp_model_name:
        x = bert.albert(input_ids=input_ids_masked,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids,
                                output_attentions=True,
                                output_hidden_states=True)
        if cfg.prep_enable:
            input_ids_pred = bert.predictions(x[0])
    elif "deberta" in cfg.nlp_model_name:
        x = bert.deberta(input_ids=input_ids_masked,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids,
                                output_attentions=True,
                                output_hidden_states=True)
        if cfg.prep_enable:
            input_ids_pred = bert.cls(x[0])
    elif "roberta" in cfg.nlp_model_name and "bigbird" not in cfg.nlp_model_name:
        x = bert.roberta(input_ids=input_ids_masked,
                                attention_mask=attention_mask,
                                token_type_ids=token_type_ids,
                                output_attentions=True,
                                output_hidden_states=True)
        if cfg.prep_enable:
            input_ids_pred = bert.lm_head(x[0])

    elif "bert" in cfg.nlp_model_name or "bigbird" in cfg.nlp_model_name:
        x = bert.bert(input_ids=input_ids_masked,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            output_attentions=True,
                            output_hidden_states=True)
        if cfg.prep_enable:
            input_ids_pred = bert.cls(x[0])
    else:
        x = bert(input_ids=input_ids_masked,
                        attention_mask=attention_mask,
                        output_attentions=True,
                        output_hidden_states=True)
        if "luke-base" in cfg.nlp_model_name or "luke-large" in cfg.nlp_model_name:
            x = [x[0], x[2], x[3]]

    # x[0]: last hidden state, x[1]: all hidden states, x[2]: attention matrices,
    # x[3]: pooler output (only for models that return one)
    if cfg.prep_enable:
        loss = torch.nn.functional.cross_entropy(input_ids_pred.view(-1, bert.config.vocab_size), input_ids.view(-1), reduction="none")
        perplexity = loss.view(len(input_ids), -1) * attention_mask
        perplexity = perplexity.sum(dim=1) / attention_mask.sum(dim=1)
        perplexity = perplexity.view(-1, 1)

    # base feature
    x_bert = []
    if cfg.pooler_enable:
        x_bert.append(x[3])
    if cfg.linear_vocab_enable:
        xx = dropout(linear_vocab(x[0]).view(len(input_ids), -1))
        x_bert.append(linear_vocab_final(xx))
    if cfg.self_attention_enable:
        xx = torch.cat([dropout_attn(xx) for xx in x[2]], dim=1)
        xx = convnet(xx)
        xx = linear_conv_final(xx)
        x_bert.append(xx)
    if cfg.hidden_stack_enable:
        if "albert" in cfg.nlp_model_name:
            xx = linear_hidden_final(x[0].mean(dim=1))
            x_bert.append(xx)
        else:
            if "funnel" in cfg.nlp_model_name:
                xx = torch.stack([dropout_bert_stack(xx) for xx in x[1][-3:]]).mean(dim=[0, 2])
            else:
                xx = torch.stack([dropout_bert_stack(xx) for xx in x[1][-4:]]).mean(dim=0)
                xx = torch.sum(
                    xx * attention_mask.unsqueeze(-1), dim=1, keepdim=False
                )
                xx = xx / torch.sum(attention_mask, dim=-1, keepdim=True)
            xx = linear_hidden_final(xx)
            x_bert.append(xx)

    # residual feature
    if cfg.rnn_module_num > 0:
        if cfg.word_axis:
            x_lstm = lstm(torch.cat([x[1][idx] for idx in cfg.rnn_hidden_indice], dim=1).transpose(2, 1)).mean(dim=1)
        else:
            x_lstm = lstm(torch.cat([x[1][idx] for idx in cfg.rnn_hidden_indice], dim=2)).mean(dim=1)
        x_lstm = linear_lstm_final(x_lstm)
        x_bert.append(x_lstm)
    if cfg.attention_head_enable:
        weights_attn = attention_head(x[0])
        x_attn = torch.sum(weights_attn * x[0], dim=1)
        x_attn = linear_attention_head_final(x_attn)
        x_bert.append(x_attn)
    if cfg.tcn_module_enable:
        if cfg.word_axis:
            x_tcn = tcn(dropout(x[0])).mean(dim=1)
        else:
            x_tcn = tcn(dropout(x[0]).permute(0, 2, 1)).mean(dim=2)
        x_tcn = linear_tcn_final(x_tcn)
        x_bert.append(x_tcn)
    if cfg.attention_pool_enable:
        xx = torch.cat([xx for xx in x[2]], dim=1).mean(dim=1).reshape(len(input_ids), -1)
        xx = linear_attention_pool_final(xx)
        x_bert.append(xx)
    if cfg.feature_enable:
        xx = linear_feature(features)
        x_bert.append(xx)

    x_bert = torch.cat(x_bert, dim=1)

    if cfg.prep_enable:
        perplexity = linear_perp(perplexity)
        x_out_mean = torch.stack([f(x_bert, perplexity) for _ in range(cfg.multi_dropout_num)]).mean(dim=0)
        x_out_std = torch.stack([g(x_bert, perplexity) for _ in range(cfg.multi_dropout_num)]).mean(dim=0)
    elif cfg.simple_structure:
        x_out_mean = linear_simple(x_bert)
        x_out_std = None
    else:
        x_out_mean = torch.stack([f(x_bert) for _ in range(cfg.multi_dropout_num)]).mean(dim=0)
        x_out_std = torch.stack([g(x_bert) for _ in range(cfg.multi_dropout_num)]).mean(dim=0)
    # Return from every branch, not only the last one.
    return x_out_mean, x_out_std
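
The `hidden_stack_enable` branch averages the last four hidden layers and then mean-pools over non-padding tokens. Below is a minimal standalone sketch of that pooling, assuming the usual `(batch, seq_len, hidden)` layout. `masked_mean_pool` and the toy tensors are illustrative names, not part of the original model.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask, last_n=4):
    """Average the last `last_n` layers, then mean-pool over real tokens."""
    # (last_n, batch, seq_len, hidden) -> (batch, seq_len, hidden)
    stacked = torch.stack(list(hidden_states[-last_n:])).mean(dim=0)
    mask = attention_mask.unsqueeze(-1).to(stacked.dtype)  # (batch, seq_len, 1)
    summed = (stacked * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                # (batch, 1), avoids division by zero
    return summed / counts


# Toy input: 5 "layers", batch of 2, sequence length 3, hidden size 2.
# Layer k is filled with the value k.
hidden_states = tuple(torch.full((2, 3, 2), float(k)) for k in range(5))
# Put a large value on the padded position; it must not affect the result.
for h in hidden_states:
    h[0, 2, :] = 100.0
attention_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden_states, attention_mask)  # shape (2, 2)
```

Multiplying by the mask before summing keeps padded positions out of the average, which is why the sum is divided by the per-row token count rather than the sequence length.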

#### optimizers

In [None]:
def configure_optimizers(self):
    def extract_params(named_parameters, lr, weight_decay, no_decay=False):
        ret = {}
        no_decay_ary = ["bias", "LayerNorm.weight", "pooler"]

        if no_decay:
            # bias / LayerNorm parameters: no weight decay (pooler is handled separately)
            ret["params"] = [p for n, p in named_parameters if any(nd in n for nd in no_decay_ary) and "pooler" not in n]
            ret["weight_decay"] = 0
        else:
            # all remaining weights: apply weight decay
            ret["params"] = [p for n, p in named_parameters if not any(nd in n for nd in no_decay_ary)]
            ret["weight_decay"] = weight_decay
        ret["lr"] = lr
        return ret

    params = []
    if cfg.prep_enable:
        if "funnel" in cfg.nlp_model_name:
            params.append({"params": bert.lm_head.parameters(), "weight_decay": cfg.weight_decay, "lr": cfg.lr_bert})
        elif "albert" in cfg.nlp_model_name:
            params.append({"params": bert.predictions.parameters(), "weight_decay": cfg.weight_decay, "lr": cfg.lr_bert})
        elif "deberta" in cfg.nlp_model_name:
            params.append({"params": bert.cls.parameters(), "weight_decay": cfg.weight_decay, "lr": cfg.lr_bert})
        elif "roberta" in cfg.nlp_model_name and "bigbird" not in cfg.nlp_model_name:
            params.append({"params": bert.lm_head.parameters(), "weight_decay": cfg.weight_decay, "lr": cfg.lr_bert})
        elif "bert" in cfg.nlp_model_name or "bigbird" in cfg.nlp_model_name:
            params.append({"params": bert.cls.parameters(), "weight_decay": cfg.weight_decay, "lr": cfg.lr_bert})
        else:
            raise ValueError("no masked-LM head parameters are defined for this model")
    params.append(extract_params(bert.named_parameters(), lr=cfg.lr_bert, weight_decay=cfg.weight_decay, no_decay=False))
    params.append(extract_params(bert.named_parameters(), lr=cfg.lr_bert, weight_decay=0, no_decay=True))

    if cfg.linear_vocab_enable:
        params.append(extract_params(linear_vocab.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_vocab.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(linear_vocab_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_vocab_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.self_attention_enable:
        params.append(extract_params(linear_conv_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_conv_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(convnet.named_parameters(), lr=cfg.lr_cnn, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(convnet.named_parameters(), lr=cfg.lr_cnn, weight_decay=0, no_decay=True))
    if cfg.attention_pool_enable:
        params.append(extract_params(linear_attention_pool_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_attention_pool_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.tcn_module_enable:
        params.append(extract_params(linear_tcn_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_tcn_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(tcn.named_parameters(), lr=cfg.lr_tcn, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(tcn.named_parameters(), lr=cfg.lr_tcn, weight_decay=0, no_decay=True))
    if cfg.rnn_module_num > 0:
        params.append(extract_params(linear_lstm_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_lstm_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(lstm.named_parameters(), lr=cfg.lr_rnn, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(lstm.named_parameters(), lr=cfg.lr_rnn, weight_decay=0, no_decay=True))
    if cfg.hidden_stack_enable:
        params.append(extract_params(linear_hidden_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_hidden_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.simple_structure:
        params.append(extract_params(linear_simple.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_simple.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    else:
        params.append(extract_params(linear1.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear1.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(linear2.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear2.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(linear1_std.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear1_std.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(linear2_std.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear2_std.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.attention_head_enable:
        params.append(extract_params(attention_head.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(attention_head.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
        params.append(extract_params(linear_attention_head_final.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_attention_head_final.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.pooler_enable:
        params.append(extract_params(bert.pooler.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(bert.pooler.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))
    if cfg.feature_enable:
        params.append(extract_params(linear_feature.named_parameters(), lr=cfg.lr_fc, weight_decay=cfg.weight_decay, no_decay=False))
        params.append(extract_params(linear_feature.named_parameters(), lr=cfg.lr_fc, weight_decay=0, no_decay=True))

    optimizer = cfg.optimizer(params)
    num_warmup_steps = int(cfg.epochs_max * len(df_train) / cfg.batch_size * cfg.warmup_ratio)
    num_training_steps = int(cfg.epochs_max * len(df_train) / cfg.batch_size) * cfg.training_steps_ratio

    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=num_warmup_steps,
                                                num_training_steps=num_training_steps)
    return [optimizer], [scheduler]
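
For intuition about the scheduler, the learning-rate multiplier can be written out directly. The function below is a plain-Python restatement of a linear warmup and decay rule, written as an assumption about how the `transformers` helper behaves rather than a copy of its source. `linear_warmup_factor` is a hypothetical name and the step counts are made-up values.

```python
def linear_warmup_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at `step` (hypothetical helper)."""
    if step < num_warmup_steps:
        # Ramp linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly to 0 at num_training_steps, never going negative.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Made-up example: 10 warmup steps out of 100 scheduled steps.
factors = [linear_warmup_factor(s, 10, 100) for s in (0, 5, 10, 55, 100, 120)]
```

Under this rule, setting `cfg.training_steps_ratio` above 1 makes `num_training_steps` larger than the number of steps actually run, so the multiplier would not reach zero by the end of training.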
