# Using Pretrained Architectures: A Hugging Face Example


The 🤗 Transformers library was designed to simplify working with large, complex transformer models that have millions or even tens of billions of parameters.

It provides a unified interface for loading, training, and saving a wide range of transformer models, which greatly simplifies testing and deploying them.

* The library's main advantages are ease of use (a modern NLP model can be run in a few lines of code) and flexibility, thanks to integration with popular machine learning frameworks such as PyTorch and TensorFlow.
* Each model in 🤗 Transformers is implemented as separate classes for each of these frameworks, which keeps the code easy to read and modify.
* Another important feature is that all of a model's logic lives in a single file, so you can experiment with and modify one model without affecting the others. This sets 🤗 Transformers apart from other machine learning libraries.

`AutoModel` is a class for instantiating any model from a given checkpoint.

The `AutoModel` class and its derivatives are generic interfaces to the large set of models available in the library. Their key feature is that they automatically determine which model architecture corresponds to the given checkpoint (specified by a name or path string) and create an instance of that model.

If you know exactly which model you want to use, you can instead use the specific class that describes its architecture directly.
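
The dispatch idea behind the Auto classes can be sketched in plain Python. This is a toy illustration, not the library's actual code: `ToyT5`, `ToyBert`, `MODEL_REGISTRY`, and `auto_from_config` are hypothetical names. In 🤗 Transformers, the `model_type` field stored in a checkpoint's `config.json` plays the role of the lookup key.

```python
# Toy illustration of config-driven dispatch; all names here are hypothetical.
class ToyT5:
    def __init__(self, config):
        self.config = config


class ToyBert:
    def __init__(self, config):
        self.config = config


# Registry mapping the `model_type` field of a config to a model class.
MODEL_REGISTRY = {"t5": ToyT5, "bert": ToyBert}


def auto_from_config(config: dict):
    """Pick the class registered for config["model_type"] and instantiate it."""
    model_type = config["model_type"]
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return MODEL_REGISTRY[model_type](config)


toy_model = auto_from_config({"model_type": "t5", "d_model": 512})
print(type(toy_model).__name__)
```

Calling a specific class directly (here `ToyT5(config)`) skips the lookup, which mirrors choosing a concrete model class when the architecture is known in advance.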

In [None]:
%%bash
pip install transformers datasets evaluate



## Summarization

Summarization produces a shorter version of a document or article that retains all the important information. Like translation, it is another example of a task that can be framed as a sequence-to-sequence problem.

A summary can be:

* **Extractive:** the most relevant information is extracted from the document.
* **Abstractive:** new text is generated that captures the most relevant information.
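
To make the extractive case concrete, here is a minimal sketch that scores sentences by average word frequency and returns the top ones unchanged. The function `extractive_summary` and the sample text are hypothetical and chosen only for illustration; real extractive systems are considerably more elaborate.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        # Average document-level frequency of the words in the sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = "The bill funds schools. The bill funds roads. Critics disagree."
print(extractive_summary(doc, num_sentences=1))
```

Every sentence in the output is copied verbatim from the input. An abstractive model such as T5, used below, generates new token sequences instead.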


In [None]:
# California state bills subset of the BillSum dataset, used for abstractive summarization.

from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")
billsum = billsum.train_test_split(test_size=0.2)


In [None]:
billsum['train']

Dataset({
    features: ['text', 'summary', 'title'],
    num_rows: 989
})

In [None]:
print(f"\n\nTEXT:\n\n{billsum['train'][0]['text']}")
print(f"\n\nSUMMARY:\n\n{billsum['train'][0]['summary']}")



TEXT:

The people of the State of California do enact as follows:


SECTION 1.
Section 13332.19 of the Government Code is amended to read:
13332.19.
(a) For the purposes of this section, the following definitions shall apply:
(1) “Design-build” means a construction procurement process in which both the design and construction of a project are procured from a single entity.
(2) “Design-build project” means a capital outlay project using the design-build construction procurement process.
(3) “Design-build entity” means a partnership, corporation, or other legal entity that is able to provide appropriately licensed contracting, architectural, and engineering services as needed.
(4) “Design-build solicitation package” means the performance criteria, any concept drawings, the form of contract, and all other documents and information that serve as the basis on which bids or proposals will be solicited from the design-build entities.
(5) “Design-build phase” means the period following the a

In [None]:
from transformers import AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
prefix = "summarize: "


def preprocess_function(examples, prefix):

    # Prefix the input with a prompt so that T5 "knows" this is a summarization task.

    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


In [None]:
tokenized_billsum = billsum.map(lambda x: preprocess_function(x, prefix), batched=True)


In [None]:
# Now create batches with the DataCollatorForSeq2Seq class.

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)
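
The collator pads each batch to the length of its longest sequence rather than to a global maximum. A minimal sketch of that dynamic padding for the inputs, assuming a pad token id of 0 (the value T5 uses) and a hypothetical helper named `pad_batch`:

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one in the batch and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = pad_batch([[21603, 10, 37, 1], [21603, 10, 1]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding per batch keeps short batches short, which saves computation compared with padding every example to `max_length`.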

In [None]:
%%bash
pip install rouge_score



In [None]:
import evaluate

rouge = evaluate.load("rouge") # bleu, BERTScore, Meteor etc. link: https://fabianofalcao.medium.com/metrics-for-evaluating-summarization-of-texts-performed-by-transformers-how-to-evaluate-the-b3ce68a309c3


In [None]:
import numpy as np


def compute_metrics(eval_pred):

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
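
For intuition about what the `rouge1` value measures, here is a simplified ROUGE-1 F1 computed from unigram overlap. It is a sketch only: the hypothetical `rouge1_f1` splits on whitespace and skips the stemming and tokenization that the `rouge_score` package applies, so its numbers will not match the library exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


toy_score = rouge1_f1(
    "the bill lowers drug costs",
    "the act lowers prescription drug costs",
)
print(round(toy_score, 4))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.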

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_billsum_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | No log | 2.791936 | 0.1313 | 0.0381 | 0.1086 | 0.1088 | 19.0 |




KeyboardInterrupt: 
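
Training was interrupted after the first epoch, so the only saved checkpoint corresponds to the end of epoch 1. Its step number follows from the dataset and batch sizes shown above. This is a small arithmetic sketch that assumes a single device and no gradient accumulation; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch; the last batch may be smaller."""
    return math.ceil(num_examples / batch_size)


summarization_steps = steps_per_epoch(989, 16)
print(summarization_steps)  # 62
```

This matches the `checkpoint-62` directory that the inference cells load from.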

In [None]:
text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

In [None]:
tokenizer = AutoTokenizer.from_pretrained("./my_awesome_billsum_model/checkpoint-62")
inputs = tokenizer(text, return_tensors="pt").input_ids

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("./my_awesome_billsum_model/checkpoint-62")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

In [None]:
outputs.shape

torch.Size([1, 62])

In [None]:
tokenizer.decode(outputs[0], skip_special_tokens=False)

"<pad> the Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in American history. it'll ask the ultra-wealthy and corporations to pay their fair share.</s>"

Now let's write the training loop ourselves.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# del model
# del trainer
# torch.cuda.empty_cache()

In [None]:
def preprocess_function_2(examples, prefix, use_padding=True):

    # Prefix the input with a prompt so that T5 "knows" this is a summarization task.

    inputs = [prefix + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding=use_padding)

    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True, padding=use_padding)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
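
With `padding=True`, the label sequences contain pad token ids, and the loss is then computed on those padding positions as well. One common remedy, which the cell above does not apply, is to replace pad ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal sketch, assuming a pad id of 0 and a hypothetical helper `mask_padding`:

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(labels, pad_id=0):
    """Replace pad token ids in label sequences with IGNORE_INDEX."""
    return [[tok if tok != pad_id else IGNORE_INDEX for tok in seq] for seq in labels]


padded_labels = [[5637, 3526, 1, 0, 0], [37, 6708, 5287, 273, 1]]
masked_labels = mask_padding(padded_labels)
print(masked_labels)
```

If labels are masked this way, the -100 values have to be mapped back to the pad id before decoding, as the first `compute_metrics` function in this notebook does with `np.where`.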

In [None]:
tokenized_billsum = billsum.map(lambda x: preprocess_function_2(x, prefix, use_padding=True), batched=True)


In [None]:
tokenized_billsum = tokenized_billsum.remove_columns(["text", "summary", "title"])

In [None]:
tokenized_billsum['train']

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 989
})

In [None]:
# Set the dataset format so that it returns PyTorch tensors instead of lists:

tokenized_billsum.set_format("torch")

In [None]:
train_dataset = tokenized_billsum["train"].shuffle(seed=42)
eval_dataset = tokenized_billsum["test"].shuffle(seed=42)

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
from transformers import get_scheduler
from torch.optim import AdamW



# AdamW optimizer; the learning rate is worth tuning
optimizer = AdamW(model.parameters(), lr=1e-4)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

# Linear warmup over the first ~10% of steps, followed by cosine decay
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # ~10% of total steps is a common choice
    num_training_steps=num_training_steps
)
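
The shape of this schedule can be reproduced in a few lines. The sketch below computes the learning-rate multiplier for linear warmup followed by a half-cosine decay. `warmup_cosine_multiplier` is a hypothetical helper, and the step counts (372 total, 37 warmup) assume the batch size and epoch count used in this notebook.

```python
import math


def warmup_cosine_multiplier(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half cosine: 1 at the end of warmup, 0 at the final step.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps, warmup_steps, base_lr = 372, 37, 1e-4
for step in (0, 18, 37, 204, 372):
    print(step, base_lr * warmup_cosine_multiplier(step, warmup_steps, total_steps))
```

The learning rate starts at zero, peaks at the base value when warmup ends, and returns to zero at the last step.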

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=512, bias=False)
              (k): Linear(in_features=512, out_features=512, bias=False)
              (v): Linear(in_features=512, out_features=512, bias=False)
              (o): Linear(in_features=512, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 8)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=512, out_features=2048, bias=False)
              (wo): Linear(in_features=2048, out_features=512, bias=False)
              (dropout): Drop

In [None]:
test = next(iter(train_dataloader))
test.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [None]:
train_batch = {k: v for k, v in test.items()}
train_batch

{'input_ids': tensor([[21603,    10,    37,  ...,   489, 12278,     1],
         [21603,    10,    37,  ...,     8,    20,     1],
         [21603,    10,    37,  ...,    97, 25429,     1],
         ...,
         [21603,    10,    37,  ...,  6859,  3636,     1],
         [21603,    10,    37,  ...,   153,   405,     1],
         [21603,    10,    37,  ...,   255,    65,     1]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([[ 5637,  3526,  1895,  ...,   973,     6,     1],
         [   37,  6708,  5287,  ...,   273, 15375,     1],
         [ 3526,  1895,   973,  ...,     3,     9,     1],
         ...,
         [ 5637, 17061,    53,  ...,  8401,  1201,     1],
         [17061,    53,   973,  ...,   538,  2465,     1],
         [17061,    53,   973,  ...,  7328,     6,    



What is fed into the model?

**input ids:** a sequence of integers that maps each input token to its index in the vocabulary.  
**labels:** a sequence of integers that maps each token of the target text (here, the summary) to its index in the vocabulary.  
**segment mask:** (optional) a sequence of zeros and ones indicating whether the input consists of one or two sentences. For a single sentence it is a vector of all zeros; for two sentences, zeros for the first and ones for the second. The T5 tokenizer does not produce it, which is why it is absent from the batch keys above.  
**attention mask:** (optional) a sequence of zeros and ones, where ones mark real tokens and zeros mark padding.
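
When only `labels` are passed, an encoder-decoder model such as T5 derives the decoder inputs from them by shifting the labels one position to the right. A minimal sketch of that step, assuming T5's convention that the decoder start token is the pad token (id 0); `shift_right` here is a simplified, hypothetical helper operating on a plain list:

```python
def shift_right(labels, decoder_start_token_id=0, pad_id=0):
    """Build decoder inputs: prepend the start token, drop the last label,
    and map ignored positions (-100) back to the pad id."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    return [pad_id if tok == -100 else tok for tok in shifted]


toy_labels = [5637, 3526, 1895, 1]
decoder_inputs = shift_right(toy_labels)
print(decoder_inputs)  # [0, 5637, 3526, 1895]
```

At position `i` the decoder sees the tokens before `i` and is trained to predict `labels[i]`. The leading start token is also why the decoded summary earlier in this notebook begins with `<pad>`.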



In [None]:
from tqdm.auto import tqdm

def train(model, num_training_steps, train_dataloader, num_epochs):

    progress_bar = tqdm(range(num_training_steps))

    # TODO: add evaluation to the training loop

    model.train()
    for epoch in range(num_epochs):
        for batch in train_dataloader:
            train_batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**train_batch)
            loss = outputs.loss
            loss.backward()

            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

In [None]:
def compute_metrics_2(eval_pred, metric):

    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    # result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [None]:
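def evaluation(model, eval_dataloader, metric, target_metric_name, target_metric_value):

    model.eval()
    all_predictions, all_labels = [], []
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        # Greedy token choice from teacher-forced logits (not free-running generation)
        predictions = torch.argmax(outputs.logits, dim=-1)
        # Accumulate over all batches so the metric covers the whole eval set,
        # not just the last batch
        all_predictions.extend(predictions.cpu().tolist())
        all_labels.extend(batch["labels"].cpu().tolist())

    items = compute_metrics_2([all_predictions, all_labels], metric)
    print(items)

    if items[target_metric_name] > target_metric_value:
        model.save_pretrained(f"best_model_seminar_2_{metric.name}={target_metric_value}")
        print("Model saved.")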

In [None]:
train(model, num_training_steps, train_dataloader, num_epochs)

  0%|          | 0/372 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
evaluation(model, eval_dataloader, rouge, 'rouge1', 0.5)

{'rouge1': 0.5601, 'rouge2': 0.2428, 'rougeL': 0.4487, 'rougeLsum': 0.4467}
Model saved.


In [None]:
model_torch = AutoModelForSeq2SeqLM.from_pretrained("./best_model_seminar_2_rouge=0.5")

## Translation

In [None]:
books = load_dataset("opus_books", "en-ru")


In [None]:
books = books["train"].train_test_split(test_size=0.2)

In [None]:
books['train'][0]

{'id': '13604',
 'translation': {'en': "You in your purity cannot understand all I suffer!'",
  'ru': 'Ты не можешь со своею чистотой понять всего того, чем я страдаю.'}}

In [None]:
from transformers import AutoTokenizer, Seq2SeqTrainingArguments

model_name = "ai-forever/ruT5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [None]:
source_lang = "en"
target_lang = "ru"
prefix = "translate English to Russian: "


def preprocess_function_3(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

In [None]:
tokenized_books = books.map(preprocess_function_3, batched=True)


In [None]:
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoModelForSeq2SeqLM

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
%pip install sacrebleu



In [None]:
import evaluate

metric = evaluate.load("sacrebleu")


In [None]:
import numpy as np


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
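
The BLEU score reported by `sacrebleu` is built from clipped n-gram precisions. The sketch below computes that single ingredient for one sentence pair. It is a simplification with hypothetical helper names: the full metric also combines several n-gram orders, applies a brevity penalty, and uses its own tokenization.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(prediction: str, reference: str, n: int) -> float:
    """Share of predicted n-grams found in the reference, with counts clipped
    to how often each n-gram occurs in the reference."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in pred.items())
    return clipped / total


toy_pred = "the cat sat on the mat"
toy_ref = "the cat is on the mat"
p1 = modified_precision(toy_pred, toy_ref, 1)
p2 = modified_precision(toy_pred, toy_ref, 2)
print(p1, p2)
```

Clipping prevents a prediction from being rewarded for repeating a correct word more often than the reference contains it.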

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [None]:
# Just in case the Trainer cannot find the required version of transformers
# ! pip install -U accelerate
# ! pip install -U transformers

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books["train"],
    eval_dataset=tokenized_books["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
| --- | --- | --- | --- | --- |
| 1 | 3.0257 | 2.348334 | 9.8035 | 14.9603 |




KeyboardInterrupt: 

In [None]:
text = "translate English to Russian: Hi, My name is Bert"

In [None]:
checkpoints = 'checkpoint-875'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(f"./my_awesome_opus_books_model/{checkpoints}")
inputs = tokenizer(text, return_tensors="pt").input_ids

model = AutoModelForSeq2SeqLM.from_pretrained(f"./my_awesome_opus_books_model/{checkpoints}")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)


tokenizer.decode(outputs[0], skip_special_tokens=True)

'Вот, мое имя Борис.'
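
The generation call above samples the next token rather than taking the most probable one, with the candidates restricted by `top_k` and `top_p`. Here is a rough sketch of that filtering on a made-up five-token distribution. The helper `top_k_top_p_filter` and the probabilities are hypothetical, and the library operates on logits with its own edge-case handling, so treat this as an approximation of the idea.

```python
def top_k_top_p_filter(probs, top_k=30, top_p=0.95):
    """Return the renormalized distribution over tokens that survive both filters."""
    # Top-k: keep the k most probable tokens and renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.04, "e": 0.01}
filtered = top_k_top_p_filter(toy_probs, top_k=4, top_p=0.9)
print(filtered)
```

Because the next token is then drawn at random from the surviving candidates, repeated runs of the same `generate` call with `do_sample=True` can return different outputs.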

- Why do we get such a poor translation?
- What can be done to improve it?
- What can be done to get a better result right away?
