# Finetune RuGPTs in huggingface
How to fine-tune RuGPTs models with Hugging Face Transformers. The example uses RuGPT3Small; other models will need more GPU memory.

This notebook is valid for all RuGPTs models except RuGPT3XL.
## Install env

In [2]:
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-10.1 /usr/local/cuda

In [3]:
%%bash
export LD_LIBRARY_PATH=/usr/lib/

In [4]:
!apt-get install clang-9 llvm-9 llvm-9-dev llvm-9-tools

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  binfmt-support libclang-common-9-dev libclang-cpp9 libclang1-9 libffi-dev
  libllvm9 libomp-9-dev libomp5-9 libpfm4 llvm-9-runtime python-chardet
  python-pkg-resources python-pygments python-yaml python3-pkg-resources
  python3-pygments python3-yaml
Suggested packages:
  clang-9-doc libomp-9-doc llvm-9-doc python-setuptools ttf-bitstream-vera
  python3-setuptools
The following NEW packages will be installed:
  binfmt-support clang-9 libclang-common-9-dev libclang-cpp9 libclang1-9
  libffi-dev libllvm9 libomp-9-dev libomp5-9 libpfm4 llvm-9 llvm-9-dev
  llvm-9-runtime llvm-9-tools python-chardet python-pkg-resources
  python-pygments python-yaml python3-pkg-resources python3-pygments
  python3-yaml
0 upgra

In [5]:
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

CUDA version: 10.1
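The cell above extracts the release number by splitting the `nvcc --version` output on `", "`, which depends on the exact punctuation of that line. A slightly more defensive variant, given here only as a sketch, uses a regular expression instead. The helper name `parse_cuda_release` and the sample string are illustrative and not part of the original notebook.

```python
import re


def parse_cuda_release(nvcc_output: str) -> str:
    """Return the CUDA release string (e.g. "10.1") found in `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release number in the nvcc output")
    return match.group(1)


# Illustrative sample of the relevant nvcc line, not captured from this notebook.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(parse_cuda_release(sample))  # 10.1
```

In the notebook, the string passed in would be the decoded result of `subprocess.check_output(["nvcc", "--version"])`.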


If the code below doesn't work, check your CUDA version and look up the matching install command at https://pytorch.org/get-started/previous-versions/

In [6]:
!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.7.1+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.7.1%2Bcu101-cp37-cp37m-linux_x86_64.whl (735.4 MB)
Collecting torchvision==0.8.2+cu101
  Downloading https://download.pytorch.org/whl/cu101/torchvision-0.8.2%2Bcu101-cp37-cp37m-linux_x86_64.whl (12.8 MB)
Collecting torchaudio==0.7.2
  Downloading torchaudio-0.7.2-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempt

In [7]:
!pip install transformers==3.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==3.5.1
  Downloading transformers-3.5.1-py3-none-any.whl (1.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting sentencepiece==0.1.91
  Downloading sentencepiece-0.1.91-cp37-cp37m-manylinux1_x86_64.whl (1.1 MB)
Collecting tokenizers==0.9.3
  Downloading tokenizers-0.9.3-cp37-cp37m-manylinux1_x86_64.whl (2.9 MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.53-py3-none-any.whl size=895260 sha256=89ba758298ff6c7eaf0a558e5e9e44fadbfab0d7f5f5772fc4fdcb4f03c7fb18
  Stored in

In [8]:
!git clone  https://github.com/sberbank-ai/ru-gpts

Cloning into 'ru-gpts'...
remote: Enumerating objects: 686, done.
remote: Counting objects: 100% (244/244), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 686 (delta 189), reused 165 (delta 165), pack-reused 442
Receiving objects: 100% (686/686), 377.90 KiB | 4.78 MiB/s, done.
Resolving deltas: 100% (427/427), done.


In [9]:
!mkdir models/

## Download files

In [None]:
!wget -O train.txt https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0
!wget -O valid.txt https://www.dropbox.com/s/mworl3ld6r3bg62/valid.txt?dl=0

## Train 
The following code downloads our model and tokenizer from Hugging Face and fine-tunes the model to generate essays.

Training took about ten minutes and reached a perplexity of 13-16.
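For reference, perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below assumes the training script reports perplexity this way, which is the usual convention for language-model evaluation but is not confirmed by the notebook itself; the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Mean losses that correspond to the perplexity range quoted above.
print(round(math.log(13), 2), round(math.log(16), 2))  # 2.56 2.77
```

Under that assumption, a perplexity of 13-16 corresponds to a mean evaluation loss of roughly 2.56 to 2.77.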

In [11]:
!export PYTHONPATH=${PYTHONPATH}:/ru-gpts/
!CUDA_VISIBLE_DEVICES=0 python ru-gpts/pretrain_transformers.py \
    --output_dir=models/essays \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3small_based_on_gpt2 \
    --do_train \
    --train_data_file=train.txt \
    --do_eval \
    --eval_data_file=valid.txt \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 5 \
    --block_size 2048 \
    --overwrite_output_dir

Epoch:   0% 0/5 [00:00<?, ?it/s]
Iteration:   0% 0/109 [00:00<?, ?it/s]
Iteration:   1% 1/109 [00:00<01:44,  1.04it/s]
Iteration:   2% 2/109 [00:01<01:35,  1.13it/s]
Iteration:   3% 3/109 [00:02<01:32,  1.14it/s]
Iteration:   4% 4/109 [00:03<01:31,  1.15it/s]
Iteration:   5% 5/109 [00:04<01:30,  1.15it/s]
Iteration:   6% 6/109 [00:05<01:29,  1.15it/s]
Iteration:   6% 7/109 [00:06<01:28,  1.15it/s]
Iteration:   7% 8/109 [00:06<01:27,  1.15it/s]
Iteration:   8% 9/109 [00:07<01:26,  1.15it/s]
Iteration:   9% 10/109 [00:08<01:25,  1.15it/s]
Iteration:  10% 11/109 [00:09<01:25,  1.15it/s]
Iteration:  11% 12/109 [00:10<01:24,  1.15it/s]
Iteration:  12% 13/109 [00:11<01:23,  1.15it/s]
Iteration:  13% 14/109 [00:12<01:22,  1.15it/s]
Iteration:  14% 15/109 [00:13<01:21,  1.15it/s]
Iteration:  15% 16/109 [00:13<01:21,  1.15it/s]
Iteration:  16% 17/109 [00:14<01:20,  1.15it/s]
Iteration:  17% 18/109 [00:15<01:19,  1.15it/s]
Iteration:  17% 
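To make the `--block_size 2048` argument more concrete: a common approach in language-model fine-tuning scripts is to tokenize the whole training file into one long stream and cut it into consecutive blocks of `block_size` tokens, dropping the incomplete tail. The sketch below illustrates that idea on toy data. It is an assumption about how `pretrain_transformers.py` prepares its data, not a copy of its code, and `split_into_blocks` is a hypothetical helper.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Cut a token stream into consecutive blocks of exactly `block_size` tokens.

    The trailing partial block, if any, is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy example: 10 token ids, blocks of 4 -> two full blocks, 2 ids dropped.
toy_ids = list(range(10))
print(split_into_blocks(toy_ids, block_size=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

If that assumption holds, the 109 iterations per epoch in the log above (with a batch size of 1) would correspond to 109 blocks of 2048 tokens each.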

## Evaluate model
### Generate with the generate_transformers.py script

In [12]:
!python ru-gpts/generate_transformers.py \
    --model_type=gpt2 \
    --model_name_or_path=models/essays \
    --k=5 \
    --p=0.95 \
    --length=500 \
    --repetition_penalty=5

Context >>> 


### Generate manually

In [13]:
import numpy as np
import torch

In [14]:
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x7f2087a83950>

In [15]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [16]:
tok = GPT2Tokenizer.from_pretrained("models/essays")

In [17]:
model = GPT2LMHeadModel.from_pretrained("models/essays")

In [18]:
model.cuda()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [49]:
text = "вход: А не пора ли нам основать новый, народный театр?\nвыход: "
inpt = tok.encode(text, return_tensors="pt")

In [50]:
out = model.generate(inpt.cuda(), max_length=70, repetition_penalty=15.0, do_sample=True, top_k=5, top_p=0.95, temperature=1.2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [51]:
tok.decode(out[0])

'вход: А не пора ли нам основать новый, народный театр?\nвыход:  Не знаю. Я сам только что оттуда приехал; но я еще ничего в театрах делать умею... так вы дайте мне подумать над вашим предложением". \n "Вы хотите создать драматический кружок?" - спросил он меня однажды на одном вечере по поводу драмы и'
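The `generate` call above combines `top_k=5` with `top_p=0.95`. As a rough illustration of what these two filters do to a next-token distribution, the sketch below keeps the `top_k` most likely tokens, renormalizes, and then keeps the smallest prefix whose cumulative probability reaches `top_p`. It is a simplified pure-Python model, not the library's implementation; it ignores `temperature` and `repetition_penalty`, and the function name and toy probabilities are made up for the example.

```python
from typing import Dict


def top_k_top_p_filter(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Apply top-k, then nucleus (top-p) filtering, and renormalize the survivors."""
    if top_k <= 0 or not 0.0 < top_p <= 1.0:
        raise ValueError("top_k must be positive and top_p must be in (0, 1]")
    # Step 1: keep only the top_k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    # Step 2: walk down the renormalized ranking until top_p mass is covered.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p / total))
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Step 3: renormalize what is left so it sums to 1 again.
    norm = sum(p for _, p in kept)
    return {token: p / norm for token, p in kept}


# Toy next-token distribution (hypothetical values).
toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(sorted(filtered))  # ['a', 'b']
```

Sampling then draws the next token from the filtered distribution, so a small `top_k` such as 5 keeps generation close to the model's most likely continuations.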