This seminar was prepared with the help of the following materials:
- [bertviz tool demo](https://colab.research.google.com/drive/1YoJqS9cPGu3HL2_XExw3kCsRBtySQS2v?usp=sharing#scrollTo=bYs0L8Ftt_Hu);
- [How to use BERT from HuggingFace](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)


![alt text](https://hsto.org/webt/uh/cd/qv/uhcdqv--w2t4i8srv9rtzjgk9ac.png)

In [None]:
# the main install of the whole notebook
!pip install transformers datasets bertviz -q

## 1.Byte-pair-encoding

Byte-pair encoding (BPE) is a simple data compression algorithm first [introduced in 1994](https://www.derczynski.com/papers/archive/BPE_Gage.pdf). It was later adapted for NLP, for the task of word segmentation, in [this article](https://arxiv.org/pdf/1508.07909.pdf). BPE allows for the
representation of an open vocabulary through
a fixed-size vocabulary of variable-length
character sequences, making it a very suitable word segmentation strategy for neural
network models.

The code below shows a toy example of learned BPE
operations. At test time, we first split words into
sequences of characters, then apply the learned operations to merge the characters into larger, known
symbols. This is applicable to any word, and
allows for open-vocabulary networks with fixed
symbol vocabularies.
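In the example from the paper, the out-of-vocabulary word
‘lower’ would be segmented into ‘low er·’ (‘·’ marks the end of a word; the code below uses `</w>` for the same purpose).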

![alt text](https://alexanderdyakonov.files.wordpress.com/2019/11/bpe.jpg)

Source: [Subword Tokenization](https://dyakonov.org/2019/11/29/%D1%82%D0%BE%D0%BA%D0%B5%D0%BD%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D1%8F-%D0%BD%D0%B0-%D0%BF%D0%BE%D0%B4%D1%81%D0%BB%D0%BE%D0%B2%D0%B0-subword-tokenization/)

### 1.1.BPE simple version

In [None]:
import re, collections

def get_stats(vocab):
  """count how often each pair of adjacent symbols occurs, weighted by word frequency"""
  pairs = collections.defaultdict(int)
  for word, freq in vocab.items(): # iterate over words and their frequencies
    symbols = word.split()
    for i in range(len(symbols)-1): # add the word frequency to every adjacent symbol pair
      pairs[symbols[i],symbols[i+1]] += freq
  return pairs

# (?<!\S) is a negative lookbehind: the match must not be preceded by a non-whitespace character, see http://www.rexegg.com/regex-disambiguation.html#lookbehind
# (?!\S) is a negative lookahead: the match must not be followed by a non-whitespace character, see http://www.rexegg.com/regex-disambiguation.html#negative-lookahead
def merge_vocab(pair, v_in):
  """merge every occurrence of `pair` in the vocabulary into a single symbol"""
  v_out = {}
  bigram = re.escape(' '.join(pair)) # join the pair with a space and escape regex metacharacters
  p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)') # match the bigram only as two whole symbols, i.e. delimited by whitespace or by the string boundaries
  for word in v_in:
    # print("orig_word", word)
    w_out = p.sub(''.join(pair), word)
    # print("w_out", w_out)
    v_out[w_out] = v_in[word]
  return v_out

vocab = {'l o w </w>' : 5, 'l o w e r </w>' : 2,'n e w e s t </w>':6, 'w i d e s t </w>':3}

num_merges = 10
for i in range(num_merges):
  pairs = get_stats(vocab)
  print("pairs_loop", pairs)
  best = max(pairs, key=pairs.get) #get the characters pair with the highest frequency
  print("best", best)
  vocab = merge_vocab(best, vocab)
  print("vocab", vocab)
  print("="*100)

pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 8, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('e', 's'): 9, ('s', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3})
best ('e', 's')
vocab {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'es'): 6, ('es', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'es'): 3})
best ('es', 't')
vocab {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
pairs_loop defaultdict(<class 'int'>, {('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 2, ('e', 'r'): 2, ('r', '</w>'): 2, ('n', 'e'): 6, ('e', 'w'): 6, ('w', 'est'): 6, ('est', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'est'): 3})
best ('est', '</w>')
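
The training loop above prints the merges but does not store them, so it cannot segment new words on its own. The sketch below is one possible way to record the merge operations and replay them at test time. The helper names `learn_bpe`, `merge_symbols` and `segment` are illustrative assumptions, not part of the original notebook, and the merging is done on symbol lists rather than with the regex used above.

```python
import collections

def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` in a symbol list with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(vocab, num_merges):
    """Return the ordered list of merge operations learned from a {word: frequency} vocabulary."""
    vocab = dict(vocab)
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[a, b] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties are resolved by first occurrence
        merges.append(best)
        vocab = {' '.join(merge_symbols(w.split(), best)): f for w, f in vocab.items()}
    return merges

def segment(word, merges):
    """Split a word into characters plus </w>, then replay the learned merges in order."""
    symbols = list(word) + ['</w>']
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols

toy_vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_bpe(toy_vocab, num_merges=10)
print(segment('lowest', merges))  # ['low', 'est</w>']
print(segment('lower', merges))   # ['low', 'e', 'r', '</w>']
```

The word "lowest" never occurs in the toy vocabulary, yet it is covered by two learned symbols. The exact segmentation depends on the training vocabulary and on the number of merges: with this vocabulary and 10 merges, "lower" stays split into `low e r </w>`.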

### 1.2.Transformers tokenizers

In [None]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# subwords:'gp', '##u'
print(tokenizer.tokenize("I have a new GPU!"))

['i', 'have', 'a', 'new', 'gp', '##u', '!']


In [None]:
print(tokenizer.tokenize("dsfgshdlfkjgs"))

['ds', '##f', '##gs', '##hd', '##lf', '##k', '##j', '##gs']
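
Note that `BertTokenizer` uses WordPiece rather than BPE: the vocabulary is learned differently, and at tokenization time each word is split greedily into the longest matching vocabulary entries, with `##` marking pieces that continue a word. The sketch below illustrates that greedy longest-match-first step under simplifying assumptions: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also handles lowercasing, punctuation splitting and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into WordPiece-style subwords."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter candidate
        if piece is None:
            return [unk_token]  # no prefix matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {'new', 'gp', '##u', 'g', '##p'}  # hypothetical vocabulary for illustration
print(wordpiece_tokenize('gpu', toy_vocab))  # ['gp', '##u']
print(wordpiece_tokenize('xyz', toy_vocab))  # ['[UNK]']
```

Because the real vocabulary contains every single character (with and without `##`), arbitrary strings such as "dsfgshdlfkjgs" are split into many short pieces instead of collapsing to `[UNK]`.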


In [None]:
tokenizer.encode("jhgsdf I have a new GPU! jsdfjh")

[101,
 1046,
 25619,
 16150,
 2546,
 1045,
 2031,
 1037,
 2047,
 14246,
 2226,
 999,
 1046,
 16150,
 2546,
 3501,
 2232,
 102]

In [None]:
tokenizer.convert_ids_to_tokens(tokenizer.encode("jhgsdf I have a new GPU! jsdfjh"))

['[CLS]',
 'j',
 '##hg',
 '##sd',
 '##f',
 'i',
 'have',
 'a',
 'new',
 'gp',
 '##u',
 '!',
 'j',
 '##sd',
 '##f',
 '##j',
 '##h',
 '[SEP]']

In [None]:
tokenizer.decode(tokenizer.encode("jhgsdf I have a new GPU! jsdfjh"))

'[CLS] jhgsdf i have a new gpu! jsdfjh [SEP]'

In [None]:
tokenizer.decode(tokenizer.encode("jhgsdf I have a new GPU! jsdfjh"), skip_special_tokens=True)

'jhgsdf i have a new gpu! jsdfjh'

In [None]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [None]:
tokenizer.eos_token

Using eos_token, but it is not set yet.


In [None]:
tokenizer("I have a new GPU!")

{'input_ids': [101, 1045, 2031, 1037, 2047, 14246, 2226, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
# padding="max_length" pads up to the tokenizer's model_max_length (512 for this checkpoint)
encoding = tokenizer("I have a new GPU!", add_special_tokens = True,
                     truncation = True, padding = "max_length",
                     return_attention_mask = True, return_tensors = "pt")

In [None]:
encoding.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [None]:
encoding

{'input_ids': tensor([[  101,  1045,  2031,  1037,  2047, 14246,  2226,   999,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,  

## 2.Attention Visualization

In [None]:
# Load model and retrieve attention weights
from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel, BertForQuestionAnswering

model_version = 'bert-base-uncased'
do_lower_case = True
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]  # tuple with one attention tensor per layer
sentence_b_start = token_type_ids[0].tolist().index(1)  # position of the first token of sentence B
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list)

### 2.1.Model View

The model view gives a bird's-eye view of attention across all of the layers (rows) and heads (columns) in the model. In this case we are showing bert-base, which has 12 layers and 12 heads (both zero-indexed in the visualization).

In [None]:
model_view(attention, tokens, sentence_b_start)

Output hidden; open in https://colab.research.google.com to view.

### 2.2.Head view

The attention-head view visualizes attention in one or more heads in a particular layer of the model.

In [None]:
head_view(attention, tokens, sentence_b_start)

Output hidden; open in https://colab.research.google.com to view.

### 2.3.Neuron View

The neuron view visualizes attention, as well as the query and key values used to compute it, in a particular attention head.

**NOTE: This visualization requires Chrome when run in Colab**
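
To make the connection between queries, keys and attention weights concrete, here is a minimal pure-Python sketch of scaled dot-product attention weights for a single head. The toy vectors and the function names `softmax` and `attention_weights` are assumptions made for illustration; in BERT the queries and keys are learned linear projections of the token representations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """For each query, softmax over the scaled dot products with all keys."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # each score is the sum of element-wise products q_i * k_i, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# toy 2-dimensional queries and keys
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and tells how strongly one token attends to every other token. These per-head rows are what the views in this section draw as lines of varying intensity.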

In [None]:
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case)
model_type = 'bert'
show(model, model_type, tokenizer, sentence_a, sentence_b, layer=2, head=0)

Output hidden; open in https://colab.research.google.com to view.

## 3.Text Classification using BERT

### 3.1.Work with Transformers

In [None]:
import transformers

GitHub: https://github.com/huggingface/transformers


Examples of how to do common tasks:
https://github.com/huggingface/transformers/tree/master/examples


All available Hugging Face models can be found here:
https://huggingface.co/models

The library is built around three types of classes for each model:

* ***model classes***, e.g., BertModel: ~100 PyTorch models (torch.nn.Module subclasses) that work with the pretrained weights provided in the library. In TF2, these are tf.keras.Model subclasses.

* ***configuration classes***, e.g., BertConfig, which store all the parameters required to build a model. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).

* ***tokenizer classes***, e.g., BertTokenizer, which store the vocabulary for each model and provide methods for encoding strings into lists of token embedding indices to be fed to a model, and for decoding such lists back into strings.

All these classes can be instantiated from pretrained instances and saved locally using two methods:

* *from_pretrained()* lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself or stored locally (or on a server) by the user,

* *save_pretrained()* lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().

`AutoModel` (`AutoModelFor*`) and `AutoTokenizer` are factory classes: their `from_pretrained()` inspects the loaded checkpoint's configuration and returns an instance of the matching model-specific class (such as `BertModel` or `BertTokenizer`).
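
The dispatch idea behind the `Auto*` classes can be illustrated with a toy, standard-library-only sketch. Everything here (`ToyModel`, `ToyBertModel`, `ToyGPT2Model`, `ToyAutoModel`, `REGISTRY`) is hypothetical and is not the library's implementation. It only mirrors the pattern of saving a configuration with a `model_type` field and picking the concrete class from that field when loading.

```python
import json
import os
import tempfile

class ToyModel:
    """Hypothetical base class: holds a config dict and can save it to a directory."""
    model_type = None

    def __init__(self, config):
        self.config = config

    def save_pretrained(self, directory):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, 'config.json'), 'w') as f:
            json.dump(self.config, f)

class ToyBertModel(ToyModel):
    model_type = 'bert'

class ToyGPT2Model(ToyModel):
    model_type = 'gpt2'

# maps the model_type string stored in the config to a concrete class
REGISTRY = {cls.model_type: cls for cls in (ToyBertModel, ToyGPT2Model)}

class ToyAutoModel:
    @staticmethod
    def from_pretrained(directory):
        """Read the saved config and instantiate the class registered for its model_type."""
        with open(os.path.join(directory, 'config.json')) as f:
            config = json.load(f)
        return REGISTRY[config['model_type']](config)

with tempfile.TemporaryDirectory() as tmp:
    ToyBertModel({'model_type': 'bert', 'hidden_size': 768}).save_pretrained(tmp)
    restored = ToyAutoModel.from_pretrained(tmp)

print(type(restored).__name__)  # ToyBertModel
```

The real classes additionally load weights and tokenizer files, but the save-then-dispatch round trip follows the same shape.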


#### 3.1.1.Masked Language Modeling

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
model.eval();

In [None]:
text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
inputs = tokenizer(text, return_tensors = "pt")  # renamed from `input` to avoid shadowing the builtin
# position of the [MASK] token in the sequence
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

with torch.no_grad():
    logits = model(**inputs).logits
    probs = F.softmax(logits, dim = -1)
    mask_probs = probs[0, mask_index, :]  # shape: (number of masks, vocab_size)
    top_10 = torch.topk(mask_probs, 10, dim = 1).indices[0]  # ids of the 10 most probable tokens
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

The capital of France, paris, contains the Eiffel Tower.
The capital of France, lyon, contains the Eiffel Tower.
The capital of France, lille, contains the Eiffel Tower.
The capital of France, toulouse, contains the Eiffel Tower.
The capital of France, marseille, contains the Eiffel Tower.
The capital of France, orleans, contains the Eiffel Tower.
The capital of France, strasbourg, contains the Eiffel Tower.
The capital of France, nice, contains the Eiffel Tower.
The capital of France, cannes, contains the Eiffel Tower.
The capital of France, versailles, contains the Eiffel Tower.


In [None]:
text

'The capital of France, [MASK], contains the Eiffel Tower.'

In [None]:
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

#### 3.1.2.Language Modeling

BERT can be fine-tuned as a decoder (with a causal attention mask, so that it predicts the next token).

Because it already has an MLM head, the language-modeling head of the decoder can be initialized from it. However, without fine-tuning such a model performs poorly.
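
As a rough illustration of what a causal attention mask does, the pure-Python sketch below builds a lower-triangular mask and applies it before the softmax, so that position i can only attend to positions 0..i. The helper names `causal_mask` and `masked_softmax` are assumptions for this example. Inside the library, the mask is applied to the attention scores when the model is configured as a decoder.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over scores, with masked-out positions forced to probability 0."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)  # position 0 is never masked, so m is finite
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
scores = [[0.0] * 3 for _ in range(3)]  # uniform toy attention scores
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

BERT was pretrained with bidirectional attention, so switching to this mask changes what every layer sees, which is one reason the model needs fine-tuning before it generates sensible continuations.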

In [None]:
from transformers import BertTokenizer, BertLMHeadModel
import torch
from torch.nn import functional as F
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertLMHeadModel.from_pretrained('bert-base-uncased', return_dict=True, is_decoder = True)

text = "A knife is very "
inputs = tokenizer.encode_plus(text, return_tensors = "pt")
# the tokenizer appends [SEP], so the last position predicts the token that follows [SEP]
with torch.no_grad():
    output = model(**inputs).logits[:, -1, :]
softmax = F.softmax(output, -1)
index = torch.argmax(softmax, dim = -1)
x = tokenizer.decode(index)
print(x)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


.


#### 3.1.3.Next Sentence Prediction

In [None]:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
from torch.nn import functional as F
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

prompt = "London is the capital of Great Britain"
next_sentence = "I like playing football."
encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')
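with torch.no_grad():
    outputs = model(**encoding)[0]  # logits of shape (batch_size, 2)
    softmax = F.softmax(outputs, dim = 1)
# index 0: sentence B is a continuation of sentence A; index 1: sentence B is a random sentence
print(softmax)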

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[7.5438e-04, 9.9925e-01]])


In [None]:
encoding

{'input_ids': tensor([[ 101, 2414, 2003, 1996, 3007, 1997, 2307, 3725,  102, 1045, 2066, 2652,
         2374, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

#### 3.1.4.Pipelines

[Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) in the Hugging Face Transformers library are abstractions that wrap a model and its tokenizer behind a single task-oriented call.



In [None]:
from transformers import pipeline

# Allocate a pipeline for question-answering
question_answerer = pipeline('question-answering')
question_answerer({
     'question': 'What is the name of the repository ?',
     'context': 'Pipeline have been included in the huggingface/transformers repository'
})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.513595461845398,
 'start': 35,
 'end': 59,
 'answer': 'huggingface/transformers'}

In [None]:
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
classifier = pipeline('sentiment-analysis')
classifier('We are very happy to use transformers repository.')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.99924635887146}]

### 3.2.Application Example

We will fine-tune a BERT-based model to classify [restaurant reviews](https://huggingface.co/datasets/blinoff/restaurants_reviews) (written in Russian) by their overall `general` rating.

In [None]:
import pandas as pd
import numpy as np
import torch
from tqdm.auto import tqdm, trange

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
from datasets import Dataset

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_json('https://huggingface.co/datasets/blinoff/restaurants_reviews/resolve/main/restaurants_reviews.jsonl', lines=True)

In [None]:
pd.options.display.max_colwidth = 300

In [None]:
print(df.shape)
df.sample(3)

(47139, 6)


Unnamed: 0,review_id,general,food,interior,service,text
18910,18910,0,8,8,10,"Обедал раза по два и здесь и на Ваське , на Ваське понравилось больше . Видимо из-за вида . Все время брал одно и тоже . Не травился . Хотя верю что такое может быть , даже японцы , чокнутые на рыбе , и то травятся , дело тут не в ресторане . Очень душевные юги-официанты . В принципе , от ..."
14885,14885,0,4,7,8,"Кафе "" Траппист "" - само по себе событие . Огромная карта бельгийского пива ( подозреваю , самая большая в России ) , свежие беломорские мидии , запрет на курение - есть чем впечатлиться , есть о чем говорить . Но это теория . Практика в моем случае состоит из двух посещений - ноябрьском и св..."
31614,31614,0,7,10,9,Праздновали свадьбу 13 ноября . Сейчас эмоции все подостыли и можно спокойно поделиться свои мнением . Выбор ресторана занял не долго времени . После того как побывали в Нева-Холл остальные варианты отпали в сразу . Персонал вежливый и учтивый . Выбор блюд был большой и все остались сыты и ...


In [None]:
df.groupby('general').sample(1)

Unnamed: 0,review_id,general,food,interior,service,text
7701,7701,0,2,4,1,"Ужасное впечатление . Мало того , что обслуживают как эстонские черепахи , ещё и приносят не то ! Причём народу на тот момент было немного , но девушка , похожая на смерть , видно выбрала себе не ту профессию . Для начала Вельвет с его "" бурей в стакане "" после отстоя пены сильно не дотянул д..."
34850,34850,1,0,0,0,"Безобразие ужасное место мало того что напитки были тухлые я про сок . Хамы , девушка Анастасия предлагала нам после закрытия места разобраться за углом я честно просто в шоке все понятно они наверно устали , но вы работаете для людей а если не можете разговаривать то сидите в архиве и работайт..."
36424,36424,2,0,0,0,"Были в этом ресторане по акции , обещали скидку 30 % . Еда не очень - ассорти мясное из 3х видов колбасы . Сырное ассорти , нарезано толстыми кусками . На горячее взяли шашлык на углях , мясо на кости показалось с жалком или оно не умеют готовить баранину . Но это бы ничего , но нас об счита..."
35708,35708,3,0,0,0,"В воскресенье вечером были в Белом Кролике . Прекрасный вид , стильный интерьер , интересные блюда в меню . Но впечатление несколько подпортилось по нескольким пунктам . Официанты . Первый официант все время нас переспрашивал , практически каждую фразу , хотя не шептали , да и музыка негромк..."
37399,37399,4,0,0,0,"Интерьер по-домашнему уютный , цены демократичные , пробовала шашлык на мангале , осталась довольна ))"
34409,34409,5,0,0,0,"Господа-Граждане-Товарищи кто хочет отдохнуть в реально хорошем месте , то вы не зря это читаете . Мы были большой компанией . Выбор места сабонтуя для встречи одноклассников лежал на мне ( как и в студенчестве самой заводной и ответственной в этом плане ) . А я подхожу к этому с особой тщате..."


In [None]:
df.general.value_counts()

0    43940
5     2164
1      462
4      257
2      166
3      150
Name: general, dtype: int64

In [None]:
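# keep only reviews with a non-zero overall rating (0 appears to mean "not rated")
g = df[df.general>0]

# shift the ratings 1..5 to class labels 0..4
data = Dataset.from_dict({'text': g.text, 'label': g.general-1}).train_test_split(test_size=0.2, seed=1)
data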

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2559
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 640
    })
})

🤗 The base model is [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base).

In [None]:
base_model = 'ai-forever/ruBert-base'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model)

In [None]:
data_tokenized = data.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512), batched=True, remove_columns=['text'])

In [None]:
data_tokenized

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2559
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 640
    })
})

In [None]:
print(data_tokenized['train'][0])

{'label': 1, 'input_ids': [101, 945, 86782, 1055, 736, 1613, 965, 3844, 110, 11239, 126, 57893, 133, 2065, 734, 24350, 110755, 46789, 1151, 1712, 702, 378, 160, 57031, 17398, 27204, 49342, 650, 378, 158, 41832, 121, 4024, 9198, 57741, 680, 107, 5850, 56602, 52417, 126, 6167, 20220, 24326, 63915, 8928, 378, 121, 750, 22008, 1179, 53362, 177, 107, 36466, 110870, 14394, 133, 18777, 64866, 126, 43752, 5608, 20473, 4305, 78726, 133, 40816, 945, 1003, 672, 58207, 656, 126, 3966, 64440, 1721, 107, 3313, 126, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [None]:
collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
train_dataloader = DataLoader(data_tokenized['train'], shuffle=True, batch_size=4, collate_fn=collator)
val_dataloader = DataLoader(data_tokenized['test'], shuffle=False, batch_size=4, collate_fn=collator)

In [None]:
from torch.optim import Adam

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=5)

Some weights of the model checkpoint at ai-forever/ruBert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not ini

In [None]:
type(model)

transformers.models.bert.modeling_bert.BertForSequenceClassification

The model is [BertForSequenceClassification](https://github.com/huggingface/transformers/blob/v4.19.4/src/transformers/models/bert/modeling_bert.py#L1508).

![alt text](https://jalammar.github.io/images/distilBERT/bert-model-calssification-output-vector-cls.png)

Source: [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)

Approximately, `BertForSequenceClassification` looks like this, but with extra features inherited from Transformers and with built-in loss computation:

In [None]:
import torch
from transformers import BertModel  # re-import: the name was shadowed by bertviz's BertModel in section 2.3

class BertClassifierSimple(torch.nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = torch.nn.Dropout(self.bert.config.hidden_dropout_prob)  # BertConfig has no `dropout` attribute
        self.out = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        bert_output = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pooled = bert_output[1]  # pooled [CLS] representation (pooler_output)
        output = self.out(self.dropout(pooled))  # raw scores (logits) to be put into a softmax transformation
        return output

In [None]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(120138, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [None]:
if torch.cuda.is_available():
    model.cuda()

In [None]:
# to train only the classification head, pass model.classifier.parameters() instead
optimizer = Adam(model.parameters(), lr=1e-6)  # with tiny batches, LR should be very small as well
# other optimizers (e.g. Adagrad) could be tried here

In [None]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
losses = []
for epoch in trange(3):
    pbar = tqdm(train_dataloader)
    model.train()
    for i, batch in enumerate(pbar):
        out = model(**batch.to(model.device))  # the loss is computed inside the model from batch['labels']
        out.loss.backward()
        # the optimizer steps after every batch; a condition such as (i + 1) % k == 0
        # would instead accumulate gradients over k batches
        if i % 1 == 0:
            optimizer.step()
            optimizer.zero_grad()
        losses.append(out.loss.item())
        pbar.set_description(f'loss: {np.mean(losses[-100:]):2.2f}')
    # evaluate on the held-out split after every epoch
    model.eval()
    eval_losses = []
    eval_preds = []
    eval_targets = []
    for batch in tqdm(val_dataloader):
        with torch.no_grad():
            out = model(**batch.to(model.device))
        eval_losses.append(out.loss.item())
        eval_preds.extend(out.logits.argmax(1).tolist())
        eval_targets.extend(batch['labels'].tolist())
    print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


recent train loss 0.8778408291935921 eval loss 0.9332559704780579 accuracy 0.653125


recent train loss 0.7436618354916572 eval loss 0.7658384933136404 accuracy 0.7609375


recent train loss 0.6787145826220512 eval loss 0.7192965406458824 accuracy 0.7703125


In [None]:
model.eval()
eval_losses = []
eval_preds = []
eval_targets = []
for batch in tqdm(val_dataloader):
    with torch.no_grad():
        out = model(**batch.to(model.device))
    eval_losses.append(out.loss.item())
    eval_preds.extend(out.logits.argmax(1).tolist())
    eval_targets.extend(batch['labels'].tolist())
print('recent train loss', np.mean(losses[-100:]), 'eval loss', np.mean(eval_losses), 'accuracy', np.mean(np.array(eval_targets) == eval_preds))

recent train loss 0.6787145826220512 eval loss 0.7192965406458824 accuracy 0.7703125


In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(eval_targets, eval_preds)

array([[ 84,   0,   0,   0,  11],
       [ 37,   0,   0,   0,   5],
       [ 18,   0,   0,   0,   7],
       [  5,   0,   0,   0,  56],
       [  8,   0,   0,   0, 409]])
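
The matrix is easier to interpret once it is reduced to a few numbers. The sketch below copies the matrix printed above, reads rows as true labels and columns as predicted labels (scikit-learn's convention), and derives the accuracy, the per-class recall, and how often each label is predicted. The variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are true labels 0..4, columns are predicted labels 0..4.
cm = [
    [84, 0, 0, 0, 11],
    [37, 0, 0, 0, 5],
    [18, 0, 0, 0, 7],
    [5, 0, 0, 0, 56],
    [8, 0, 0, 0, 409],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy is the share of examples on the diagonal (correct predictions).
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Recall per class: correct predictions divided by the number of true examples.
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# How often each label is predicted at all (column sums).
predicted_counts = [sum(row[j] for row in cm) for j in range(n_classes)]

print(f'accuracy: {accuracy:.4f}')
for label in range(n_classes):
    print(f'label {label}: recall={recall[label]:.3f}, '
          f'predicted {predicted_counts[label]} times')
```

Labels 1 to 3 are never predicted, so all correct predictions come from labels 0 and 4.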

Save the model for future use

In [None]:
model.save_pretrained('sentiment_classifier')
tokenizer.save_pretrained('sentiment_classifier')

('sentiment_classifier/tokenizer_config.json',
 'sentiment_classifier/special_tokens_map.json',
 'sentiment_classifier/vocab.txt',
 'sentiment_classifier/added_tokens.json',
 'sentiment_classifier/tokenizer.json')

In [None]:
!ls sentiment_classifier -alsh

total 686M
4.0K drwxr-xr-x 2 root root 4.0K Jun 30 15:45 .
4.0K drwxr-xr-x 1 root root 4.0K Jun 30 15:45 ..
4.0K -rw-r--r-- 1 root root 1.1K Jun 30 15:45 config.json
681M -rw-r--r-- 1 root root 681M Jun 30 15:45 pytorch_model.bin
4.0K -rw-r--r-- 1 root root  125 Jun 30 15:45 special_tokens_map.json
4.0K -rw-r--r-- 1 root root  394 Jun 30 15:45 tokenizer_config.json
3.6M -rw-r--r-- 1 root root 3.6M Jun 30 15:45 tokenizer.json
1.7M -rw-r--r-- 1 root root 1.7M Jun 30 15:45 vocab.txt


Load the model from disk and use it for inference

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('sentiment_classifier')
tokenizer = AutoTokenizer.from_pretrained('sentiment_classifier')

In [None]:
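def classify(text):
    # tokenize one text (truncated to 512 tokens) and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512).to(model.device)
    with torch.no_grad():
        proba = torch.softmax(model(**inputs).logits, -1)  # class probabilities
    return proba.cpu().numpy()[0]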

In [None]:
classify('Мне было скучно')

array([0.19382632, 0.13150567, 0.17139487, 0.21016386, 0.29310927],
      dtype=float32)

In [None]:
classify('Мне было весело')

array([0.10384298, 0.09211494, 0.11016095, 0.14700359, 0.5468776 ],
      dtype=float32)
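
To read these vectors as ratings, the most probable label can be mapped back to a score. This is a minimal sketch that assumes label `k` stands for rating `k + 1` (as in the `df.general-1` label construction used for this dataset); `to_rating` is a hypothetical helper, and the two vectors are copied from the outputs above.

```python
def to_rating(proba):
    """Return (rating, confidence) for one probability vector over labels 0..4."""
    best = max(range(len(proba)), key=lambda i: proba[i])
    # Assumption: labels were built as rating - 1, so add 1 to get the rating back.
    return best + 1, proba[best]

bored = [0.19382632, 0.13150567, 0.17139487, 0.21016386, 0.29310927]  # 'Мне было скучно'
fun = [0.10384298, 0.09211494, 0.11016095, 0.14700359, 0.5468776]     # 'Мне было весело'

print(to_rating(bored))
print(to_rating(fun))
```

Both texts get the top rating, but the first one with a much lower probability, so the confidence is worth reporting next to the predicted rating.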

# Homework

* For the final part of the notebook (the experiments with the restaurant reviews dataset), select the optimal number of fine-tuning epochs (the one that yields the lowest validation loss, or the one at which the validation loss reaches a plateau) and compute the accuracy score for it.

* Then take the "cointegrated/rubert-tiny" model from HuggingFace and run the same experiment (compute the accuracy score for the model trained for its optimal number of epochs).

* Compare the results of both models, their optimal numbers of training epochs, and their fine-tuning times. Create a comparison table and write a short analysis in a Markdown cell of the notebook.

As a solution, send the following:
* 1) A modified version of the notebook (with the "cointegrated/rubert-tiny" model fine-tuned for the optimal number of epochs), including the comparison table and the analysis of the results.
* 2) A screenshot of the comparison table and the analysis of the results.
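
The selection rule from the first task can be sketched as a small function. `pick_best_epoch` and its `tol` threshold are hypothetical names introduced only for illustration; the losses are the rounded validation losses of the three-epoch run above.

```python
def pick_best_epoch(eval_losses, tol=0.0):
    """Return the 1-based epoch to stop at.

    Stops at the first epoch whose successor improves the validation loss
    by at most `tol`; if every epoch improves by more, returns the last epoch.
    """
    pairs = zip(eval_losses, eval_losses[1:])
    for epoch, (current, following) in enumerate(pairs, start=1):
        if current - following <= tol:
            return epoch
    return len(eval_losses)

eval_losses = [0.933, 0.766, 0.719]  # rounded values from the run above

print(pick_best_epoch(eval_losses))            # strict minimum: tol=0
print(pick_best_epoch(eval_losses, tol=0.05))  # treat gains below 0.05 as a plateau
```

With `tol=0` the function returns the first local minimum of the validation loss; a positive `tol` encodes the "plateau" variant of the criterion, and its value is a judgment call.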

In [4]:
# !pip install transformers datasets bertviz -q

In [5]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm, trange
from sklearn.metrics import confusion_matrix

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

## Data download

In [6]:
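df = pd.read_json('https://huggingface.co/datasets/blinoff/restaurants_reviews/resolve/main/restaurants_reviews.jsonl', lines=True)
df = df[df.general>0]  # keep only reviews with a positive `general` score

# shift the `general` score by one to get 0-based class labels, then hold out 20% for validation
data = Dataset.from_dict({'text': df.text, 'label': df.general-1}).train_test_split(test_size=0.2, seed=1)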
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2559
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 640
    })
})

## ruBert-base

In [17]:
base_model = 'ai-forever/ruBert-base'
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=5)
if torch.cuda.is_available():
    model.cuda()

Some weights of the model checkpoint at ai-forever/ruBert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not ini

In [18]:
data_tokenized = data.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512),
                          batched=True,
                          remove_columns=['text'])
collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(data_tokenized['train'], shuffle=True, batch_size=4, collate_fn=collator)
val_dataloader = DataLoader(data_tokenized['test'], shuffle=False, batch_size=4, collate_fn=collator)
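
The texts are tokenized without padding, and `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled. The pure-Python sketch below illustrates that idea only: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is assumed to match the `padding_idx=0` shown in the printed model. The real collator additionally returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id to the right of every shorter sequence.
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])  # made-up token ids
print(batch['input_ids'])
print(batch['attention_mask'])
```

Padding per batch rather than to a fixed 512 tokens keeps short reviews cheap to process, which matters with a batch size of 4.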

In [19]:
def train_gen_model(model, model_name, train_dataloader, val_dataloader, epoch_count):
    data_list = []
    optimizer = AdamW(model.parameters(), lr=1e-6)  # with tiny batches, LR should be very small as well

    for epoch in trange(epoch_count):
        losses = []
        print(f' epoch num {epoch+1} '.center(80, '-'))
        # training
        pbar = tqdm(train_dataloader)
        model.train()
        for i, batch in enumerate(pbar):
            out = model(**batch.to(model.device))
            out.loss.backward()
            if i % 1 == 0:  # always true: step on every batch (a larger modulus would accumulate gradients)
                optimizer.step()
                optimizer.zero_grad()
            losses.append(out.loss.item())
            pbar.set_description(f'train process ({np.mean(losses[-100:]):2.2f})')
        # validation
        pbar = tqdm(val_dataloader)
        model.eval()
        eval_losses = []
        eval_preds = []
        eval_targets = []
        for batch in pbar:
            with torch.no_grad():
                out = model(**batch.to(model.device))
            eval_losses.append(out.loss.item())
            eval_preds.extend(out.logits.argmax(1).tolist())
            eval_targets.extend(batch['labels'].tolist())
            pbar.set_description(f'eval process ({np.mean(eval_losses[-100:]):2.2f})')

        print('train loss', round(np.mean(losses), 3),
              'eval loss', round(np.mean(eval_losses), 3),
              'accuracy', round(np.mean(np.array(eval_targets) == eval_preds), 3))
        data_list.append([np.mean(losses),
                          np.mean(eval_losses),
                          np.mean(np.array(eval_targets) == eval_preds)])
    columns = ['train_loss', 'eval_loss', 'eval_accur']
    columns = [(model_name, i) for i in columns]
    return pd.DataFrame(data_list, columns=columns)
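
The homework also asks for a comparison of fine-tuning times, which `train_gen_model` does not record. One way to measure it without touching the function is a small wrapper; `timed` is a hypothetical helper, and the workload below is a stand-in so that the sketch runs on its own.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook the call would look like
# timed(train_gen_model, model, 'ruBert-base', train_dataloader, val_dataloader, 5)
result, seconds = timed(sum, range(1_000_000))
print(f'result={result}, elapsed={seconds:.3f}s')
```

The measured time covers both training and validation passes, so it is comparable between models only if both are run with the same number of epochs and the same data loaders.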

In [20]:
df_1 = train_gen_model(model, 'ruBert-base', train_dataloader, val_dataloader, 5)

--------------------------------- epoch num 1 ----------------------------------


train loss 1.071 eval loss 0.93 accuracy 0.698
--------------------------------- epoch num 2 ----------------------------------


train loss 0.782 eval loss 0.775 accuracy 0.752
--------------------------------- epoch num 3 ----------------------------------


train loss 0.691 eval loss 0.726 accuracy 0.764
--------------------------------- epoch num 4 ----------------------------------


train loss 0.652 eval loss 0.699 accuracy 0.769
--------------------------------- epoch num 5 ----------------------------------


train loss 0.628 eval loss 0.698 accuracy 0.77


## rubert-tiny

In [21]:
base_model = 'cointegrated/rubert-tiny'
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=5)
if torch.cuda.is_available():
    model.cuda()
#-------------------------------------------------------------------------------------------
data_tokenized = data.map(lambda x: tokenizer(x['text'], truncation=True, max_length=512),
                          batched=True,
                          remove_columns=['text'])
collator = DataCollatorWithPadding(tokenizer=tokenizer)
train_dataloader = DataLoader(data_tokenized['train'], shuffle=True, batch_size=4, collate_fn=collator)
val_dataloader = DataLoader(data_tokenized['test'], shuffle=False, batch_size=4, collate_fn=collator)
#-------------------------------------------------------------------------------------------
df_2 = train_gen_model(model, 'rubert-tiny', train_dataloader, val_dataloader, 5)

Some weights of the model checkpoint at cointegrated/rubert-tiny were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cointegrated/rubert-tiny a

--------------------------------- epoch num 1 ----------------------------------


train loss 1.458 eval loss 1.281 accuracy 0.652
--------------------------------- epoch num 2 ----------------------------------


train loss 1.152 eval loss 1.117 accuracy 0.652
--------------------------------- epoch num 3 ----------------------------------


train loss 1.045 eval loss 1.069 accuracy 0.652
--------------------------------- epoch num 4 ----------------------------------


train loss 0.997 eval loss 1.037 accuracy 0.652
--------------------------------- epoch num 5 ----------------------------------


train loss 0.962 eval loss 1.008 accuracy 0.652


## Conclusion

ruBert-base starts to overfit after the first epoch, so its optimal number of training epochs is 1; the same holds for rubert-tiny.

In [23]:
df_0 = pd.DataFrame(list(range(1,len(df_1)+1)), columns=[('','epoch')])
comparison_df = pd.concat([df_0, df_1, df_2], axis=1)
comparison_df.columns = pd.MultiIndex.from_tuples(comparison_df.columns, names=['model',''])
comparison_df

| epoch | ruBert-base train_loss | ruBert-base eval_loss | ruBert-base eval_accur | rubert-tiny train_loss | rubert-tiny eval_loss | rubert-tiny eval_accur |
|---|---|---|---|---|---|---|
| 1 | 1.070625 | 0.930234 | 0.698438 | 1.458118 | 1.281334 | 0.651563 |
| 2 | 0.781987 | 0.774909 | 0.751563 | 1.152164 | 1.117055 | 0.651563 |
| 3 | 0.691176 | 0.725557 | 0.764062 | 1.044989 | 1.068617 | 0.651563 |
| 4 | 0.651958 | 0.698736 | 0.76875 | 0.99723 | 1.036946 | 0.651563 |
| 5 | 0.628007 | 0.697591 | 0.770312 | 0.962301 | 1.008325 | 0.651563 |
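
An accuracy that does not change between epochs is worth checking against a majority-class baseline. The sketch below takes the true-label counts from the row sums of the confusion matrix printed earlier in the notebook, assuming that matrix was computed on the same validation split (both runs evaluate 640 reviews), and compares the majority share with the rubert-tiny accuracy from the table.

```python
# Confusion matrix from the earlier ruBert-base run:
# rows are true labels 0..4, so row sums are the true-label counts.
cm = [
    [84, 0, 0, 0, 11],
    [37, 0, 0, 0, 5],
    [18, 0, 0, 0, 7],
    [5, 0, 0, 0, 56],
    [8, 0, 0, 0, 409],
]

class_counts = [sum(row) for row in cm]
# Accuracy of a model that always predicts the most frequent label.
majority_share = max(class_counts) / sum(class_counts)

tiny_accuracy = 0.651563  # rubert-tiny eval_accur from the table, same in all epochs

print('true-label counts:', class_counts)
print(f'majority-class baseline: {majority_share:.4f}')
print('matches rubert-tiny accuracy:', abs(majority_share - tiny_accuracy) < 1e-5)
```

If the two numbers coincide, the result is consistent with rubert-tiny predicting the most frequent label for every review, in which case its falling validation loss has not yet turned into better predictions.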
