<a href="https://colab.research.google.com/github/danyesss/NLP/blob/main/My_RuGPT3FinetuneHF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune RuGPTs with Hugging Face
How to fine-tune RuGPTs models with Hugging Face. The example uses RuGPT3Small; other models need more GPU memory.

This notebook is valid for all RuGPTs models except RuGPT3XL.

## Install env

In [None]:
!rm -rf /usr/local/cuda
!ln -s /usr/local/cuda-10.1 /usr/local/cuda

In [None]:
%%bash
# Note: this export only affects the subshell of this cell, not later cells.
export LD_LIBRARY_PATH=/usr/lib/

In [None]:
!apt-get install clang-9 llvm-9 llvm-9-dev llvm-9-tools

Reading package lists... Done
Building dependency tree       
Reading state information... Done
clang-9 is already the newest version (1:9-2~ubuntu18.04.2).
llvm-9 is already the newest version (1:9-2~ubuntu18.04.2).
llvm-9-dev is already the newest version (1:9-2~ubuntu18.04.2).
llvm-9-tools is already the newest version (1:9-2~ubuntu18.04.2).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


In [None]:
import subprocess

CUDA_version = [s for s in subprocess.check_output(["nvcc", "--version"]).decode("UTF-8").split(", ") if s.startswith("release")][0].split(" ")[-1]
print("CUDA version:", CUDA_version)

if CUDA_version == "10.0":
    torch_version_suffix = "+cu100"
elif CUDA_version == "10.1":
    torch_version_suffix = "+cu101"
elif CUDA_version == "10.2":
    torch_version_suffix = ""
else:
    torch_version_suffix = "+cu110"

CUDA version: 10.1
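
The version lookup above can also be written as a small helper that is easy to check without a GPU. This is only a sketch: `parse_cuda_release`, `torch_suffix`, and the sample `nvcc` text are hypothetical, and the suffix table simply mirrors the branches of the cell above.

```python
import re

# Suffix table copied from the if/elif chain above; any other release falls back to "+cu110".
SUFFIX_BY_RELEASE = {"10.0": "+cu100", "10.1": "+cu101", "10.2": ""}


def parse_cuda_release(nvcc_output):
    """Return the release number (e.g. "10.1") found in `nvcc --version` text."""
    found = re.search(r"release (\d+\.\d+)", nvcc_output)
    if found is None:
        raise ValueError("no 'release X.Y' field in nvcc output")
    return found.group(1)


def torch_suffix(release):
    """Map a CUDA release to the wheel suffix used in the pip command."""
    return SUFFIX_BY_RELEASE.get(release, "+cu110")


# Sample text shaped like the last line of `nvcc --version` (for illustration only).
sample = "Cuda compilation tools, release 10.1, V10.1.243"
release = parse_cuda_release(sample)
print(release, repr(torch_suffix(release)))
```

Keeping the table in one dictionary makes it easier to see which releases share the fallback suffix.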


If the install command below fails, check your CUDA version and look up the matching install command at https://pytorch.org/get-started/previous-versions/

In [None]:
!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
!pip install transformers==3.5.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!git clone  https://github.com/sberbank-ai/ru-gpts

fatal: destination path 'ru-gpts' already exists and is not an empty directory.


In [None]:
!mkdir models/

## Download files

In [None]:
# !wget -O train.txt https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0
# !wget -O valid.txt https://www.dropbox.com/s/mworl3ld6r3bg62/valid.txt?dl=0

--2022-08-02 16:38:49--  https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.64.18, 2620:100:6019:18::a27d:412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.64.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/oa3v9c7g9bp40xw/train.txt [following]
--2022-08-02 16:38:50--  https://www.dropbox.com/s/raw/oa3v9c7g9bp40xw/train.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc1ead17b8dd979001022af50272.dl.dropboxusercontent.com/cd/0/inline/BqT2ZnWVFY8zEJJ6_XKyTETiqpPNpoJD821VEQ_zxax24qyJ1aIXGHoa7AiNPpKILhRHt-8W4e9JQ4lZm96LfMPFnCa-xX0SWD8xtPO6bWXa4t9keEH-F32p3TW8ihK1-oYbf-AT7f1ZkH01atNarIrihv7cjItCFrI1WtpNEn04BA/file# [following]
--2022-08-02 16:38:50--  https://uc1ead17b8dd979001022af50272.dl.dropboxusercontent.com/cd/0/inline/BqT2ZnWVFY8zEJJ6_XKyTETiqpPNpoJD821VEQ_zxax24qyJ1aIXGHoa7AiNPpKILhRHt-

### Data validation

In [None]:
!pip install -q tendo
# Downloader.
!pip install awscli

In [None]:
from fastai.basics import *
from multiprocessing import Pool
import regex as re
# It is better to train the model over SSH: in a notebook, the session glitches once the output gets too large to handle.
# I used ngrok for this.

# from colab_ssh import launch_ssh

In [None]:
# Change to your directory with your files.
DATASET_PATH = "/content/raw_data/"  # alternative: "/content/dataset"
SAVE_TO_PATH = "/content/data_to_learn/"

In [None]:
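def process_function(path_to_file):
    # Insert a space between a non-word character (other than a space) and the word character after it.
    match = re.compile(r'(?=[^ ])([\W])([\w])')
    # Collapse any character repeated three or more times to exactly three.
    match2 = re.compile(r'(.|\s)\1\1+')
    with open(path_to_file, 'r', encoding='utf-8') as f:
        lines = f.read()
    # Make sure the text starts with a space.
    if lines and lines[0] != ' ':
        lines = ' ' + lines
    lines = match.sub(r'\g<1> \g<2>', lines)
    lines = match2.sub(r'\1' * 3, lines)
    # Write the cleaned text under the same file name in SAVE_TO_PATH.
    path = os.path.join(SAVE_TO_PATH, os.path.split(path_to_file)[1])
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(lines)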
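
To see what the two patterns in `process_function` do, the sketch below applies the same steps to a string instead of a file. It uses the standard-library `re` module rather than `regex`, and `clean_text` and the sample input are hypothetical.

```python
import re  # the standard-library module is enough for these two patterns

# Same patterns as in process_function above.
SPACE_AFTER_PUNCT = re.compile(r'(?=[^ ])([\W])([\w])')
REPEATED_CHAR = re.compile(r'(.|\s)\1\1+')


def clean_text(text):
    """Apply the same three steps as process_function, but to a string."""
    # Step 1: make sure the text starts with a space.
    if text and text[0] != ' ':
        text = ' ' + text
    # Step 2: put a space between punctuation and the word character that follows it.
    text = SPACE_AFTER_PUNCT.sub(r'\g<1> \g<2>', text)
    # Step 3: shorten runs of three or more identical characters to exactly three.
    return REPEATED_CHAR.sub(r'\1' * 3, text)


print(repr(clean_text("Hello,world!!!!! ok")))  # prints ' Hello, world!!! ok'
```

The comma gains a trailing space, and the run of five exclamation marks shrinks to three.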

In [None]:
txts = get_files(DATASET_PATH, '.txt')
# Create the output folder if it does not exist.
os.makedirs(SAVE_TO_PATH, exist_ok=True)

# Process the files in parallel; the context manager shuts the pool down afterwards.
with Pool(64) as pool:
    for _ in progress_bar(pool.imap_unordered(process_function, txts), len(txts)):
        pass

### My files setup

In [None]:
with open('/content/russian_corpus_for_vocab_valid.txt', encoding="utf-8", errors='ignore') as f:
    text = f.read()

In [None]:
text = text.encode("utf-8")

In [None]:
# Write the UTF-8 bytes to valid.txt; the context manager closes the file.
with open("valid.txt", "wb") as file:
    file.write(text)
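
Reading with `errors='ignore'` silently drops bytes that are not valid UTF-8, so the rewritten `valid.txt` can be shorter than the source file. A small illustration with a made-up byte string:

```python
# Made-up bytes: valid UTF-8 text with one stray 0xFF byte in the middle.
raw = "стихи".encode("utf-8") + b"\xff" + " и проза".encode("utf-8")

# errors="ignore" skips the undecodable byte instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="ignore")
clean = text.encode("utf-8")

print(text)
print(len(raw) - len(clean), "byte(s) dropped")
```

Comparing the byte counts before and after is a cheap way to find out whether a corpus lost anything in this step.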

## Train
The following code downloads the model and tokenizer from Hugging Face and fine-tunes the model to generate essays.

Training took around ten minutes and reached a perplexity of 13-16.
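
Perplexity is the exponential of the mean per-token cross-entropy loss (in nats), so a perplexity of 13-16 corresponds to an evaluation loss of roughly 2.6-2.8. A minimal sketch of that relationship, using hypothetical loss values:

```python
import math


def perplexity(mean_loss):
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(mean_loss)


# Hypothetical evaluation losses, chosen to bracket the range reported above.
for loss in (2.56, 2.77):
    print(f"loss={loss:.2f} -> perplexity={perplexity(loss):.1f}")
```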

In [None]:
# --block_size 2048 \  (alternative value for the training command below)

In [None]:
# Each `!` line runs in its own subshell, so PYTHONPATH is set inline for this command.
!PYTHONPATH=${PYTHONPATH}:ru-gpts/ CUDA_VISIBLE_DEVICES=0 python ru-gpts/pretrain_transformers.py \
    --output_dir=models/poems \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3small_based_on_gpt2 \
    --do_train \
    --train_data_file=brodsky_valid.txt \
    --do_eval \
    --eval_data_file=brodsky_valid.txt \
    --per_gpu_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 10 \
    --block_size 1024 \
    --overwrite_output_dir


    

08/03/2022 06:05:29 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1024, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='brodsky_valid.txt', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=False, mlm_probability=0.15, model_name_or_path='sberbank-ai/rugpt3small_based_on_gpt2', model_type='gpt2', n_gpu=1, no_cuda=False, num_train_epochs=10.0, output_dir='models/poems', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=500, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, tokenizer_name=None, train_data_file='brodsky_valid.txt', warmup_steps=0, weight_decay=0.01)
08/03/2022 06:05:29 - INFO - __main_
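
`--block_size` sets how many tokens each training example contains. The sketch below assumes that the script cuts the tokenized file into consecutive, non-overlapping blocks and drops the incomplete tail, as the legacy Hugging Face language-modeling example does. This has not been checked against `pretrain_transformers.py`, and `split_into_blocks` is a hypothetical helper.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token list into consecutive blocks of exactly block_size tokens.

    The incomplete tail is dropped (assumed behaviour, see the note above).
    """
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


# Stand-in for a tokenized corpus: 2500 made-up token ids.
tokens = list(range(2500))
blocks = split_into_blocks(tokens, block_size=1024)
print(len(blocks), "blocks;", len(tokens) - 1024 * len(blocks), "tokens dropped")
```

Under this assumption, a small corpus yields only a handful of examples per epoch, and a file shorter than one block yields none.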

## Evaluate model
### Generate with the generate_transformers.py script

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from google.colab import files
files.download('./essays.zip')

In [None]:
!python ru-gpts/generate_transformers.py \
    --model_type=gpt2 \
    --model_name_or_path=models/essays \
    --k=5 \
    --p=0.95 \
    --length=500 \
    --repetition_penalty=5

Traceback (most recent call last):
  File "ru-gpts/generate_transformers.py", line 268, in <module>
    main()
  File "ru-gpts/generate_transformers.py", line 204, in main
    model = model_class.from_pretrained(args.model_name_or_path)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py", line 947, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_gpt2.py", line 702, in __init__
    self.transformer = GPT2Model(config)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_gpt2.py", line 491, in __init__
    self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_gpt2.py", line 491, in <listcomp>
    self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
  File "/usr/local/lib/python3.7/dist-packages/transf

### Generate manually

In [None]:
import numpy as np
import torch

In [None]:
# !pip install -q torch==1.4.0 -f https://download.pytorch.org/whl/cu101/torch_stable.html

     |████████████████████████████████| 753.4 MB 7.0 kB/s 
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.8.2+cu101 requires torch==1.7.1, but you have torch 1.4.0 which is incompatible.
torchtext 0.13.0 requires torch==1.12.0, but you have torch 1.4.0 which is incompatible.
torchaudio 0.7.2 requires torch==1.7.1, but you have torch 1.4.0 which is incompatible.
fastai 2.7.7 requires torch<1.13,>=1.7, but you have torch 1.4.0 which is incompatible.

In [None]:
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x7f0038b910b0>

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
tok = GPT2Tokenizer.from_pretrained("models/poems")

In [None]:
model = GPT2LMHeadModel.from_pretrained("models/poems")

In [None]:
model.cuda()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [None]:
# text = "<s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\nСочинение: "
# text = "<s>Тема: «Жизнь развивается неспеша. Время идет.». (А.С. Пушкин)\nСочинение: "
# text = "Я был только тем, чего\n ты касалась ладонью,\n над чем в глухую, воронью\n ночь склоняла чело.\n"
text = "Прощай,\n позабудь\n и не обессудь.\n А письма сожги,\n как мост."
inpt = tok.encode(text, return_tensors="pt")

In [None]:
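out = model.generate(inpt.cuda(), max_length=200, repetition_penalty=5.0, do_sample=True, top_k=5, top_p=0.95, temperature=0.5)
# repetition_penalty > 1 penalizes tokens that were already generated; temperature < 1 sharpens the distribution.
# top_k keeps only the k most likely next tokens; top_p keeps the smallest set of tokens
# whose cumulative probability reaches p.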

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
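
The sampling arguments passed to `model.generate` above narrow the next-token distribution before a token is drawn. The sketch below is a simplified, pure-Python illustration of temperature scaling followed by top-k and top-p filtering. It is not the library's implementation, it leaves out `repetition_penalty`, and `filter_top_k_top_p` and the logits are made up.

```python
import math


def filter_top_k_top_p(logits, top_k, top_p, temperature=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k largest logits, most likely first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the surviving logits (shifted by the maximum for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in zip(order, probs):
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


# Made-up logits for a six-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]
result = filter_top_k_top_p(logits, top_k=5, top_p=0.95, temperature=0.5)
print(result)
```

With a low temperature most of the probability mass sits on the top token, so the top-p cut leaves only a few candidates to sample from.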


In [None]:
tok.decode(out[0])

'Прощай,\n позабудь\n и не обессудь.\n А письма сожги,\n как мост. — ( В письме)...» И далее по тексту: « Я уезжаю».]. Прощайте! Не грустите!» Молча смотрят в окно на пустой аэродромные деревья... Потом он добавил с гримасой неуместной печали;): «… Вы знаете что- нибудь о судьбе Лжеца?»[* Поэма приближается к концу.]. Он поднял взгляд от лирики сквозь решетку форточки своей комнаты для прислуги.[ 1 - Здесь так светло.] Поднялся ветер со стороны подоконника за окнами его апартаментов? О нет!.. Все лампы сейчас вне досягаемости их взгляда из тех самых темных углов души моей.) Огоньки вдоль рамок устремлены вверх почти отвесно вниз через квадратную раму окна во все углы моего существования таким образом, чтобы лучи ее пересекали друг друга наискось перпендикулярно поверхности листа бумаги. Это значит : если свет проходит прямо посередине между лампочками'

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

MessageError: ignored

In [None]:
!zip -r /content/models /content/

  adding: content/ (stored 0%)
  adding: content/.config/ (stored 0%)
  adding: content/.config/configurations/ (stored 0%)
  adding: content/.config/configurations/config_default (deflated 15%)
  adding: content/.config/.last_opt_in_prompt.yaml (stored 0%)
  adding: content/.config/active_config (stored 0%)
  adding: content/.config/.last_survey_prompt.yaml (stored 0%)
  adding: content/.config/logs/ (stored 0%)
  adding: content/.config/logs/2022.07.28/ (stored 0%)
  adding: content/.config/logs/2022.07.28/13.39.23.304837.log (deflated 86%)
  adding: content/.config/logs/2022.07.28/13.39.49.666021.log (deflated 54%)
  adding: content/.config/logs/2022.07.28/13.39.30.781967.log (deflated 54%)
  adding: content/.config/logs/2022.07.28/13.39.04.909036.log (deflated 54%)
  adding: content/.config/logs/2022.07.28/13.38.43.753460.log (deflated 91%)
  adding: content/.config/logs/2022.07.28/13.39.50.360650.log (deflated 54%)
  adding: content/.config/config_sentinel (stored 0%)
  adding: co

In [None]:
model_save_name = 'pelevin_0.pt'
path = f"/content/gdrive/My Drive/{model_save_name}"
# torch.save(model.state_dict(), path)


In [None]:
GPT2Tokenizer.from_pretrained(path)
# tok = GPT2Tokenizer.from_pretrained(path)
# model = GPT2LMHeadModel.from_pretrained(path)

OSError: ignored

In [None]:
GPT2Tokenizer.state_dict()

AttributeError: ignored
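
The two cells above pass the path of a single `.pt` file to `from_pretrained` and call `state_dict()` on the tokenizer class, and both raise errors. In `transformers`, models and tokenizers are written to a directory with `save_pretrained` and read back with `from_pretrained`. A minimal sketch of that round trip, assuming a hypothetical helper `save_and_reload` and a hypothetical target directory:

```python
import os


def save_and_reload(model, tokenizer, out_dir):
    """Save a model and tokenizer to out_dir, then load fresh copies from it.

    Works with any objects that follow the transformers convention of
    `save_pretrained(directory)` and the classmethod `from_pretrained(directory)`.
    """
    os.makedirs(out_dir, exist_ok=True)
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return type(model).from_pretrained(out_dir), type(tokenizer).from_pretrained(out_dir)


# Usage in this notebook (the directory name is hypothetical):
# model, tok = save_and_reload(model, tok, "/content/gdrive/My Drive/poems_model")
```

Saving to a directory keeps the weights, config, and vocabulary files together, which is the layout `from_pretrained` expects.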

In [None]:
!zip -r /content/essays.zip /content/models/poems/checkpoint-2000 /content/models/poems/config.json /content/models/poems/eval_results.txt /content/models/poems/merges.txt /content/models/poems/pytorch_model.bin /content/models/poems/special_tokens_map.json /content/models/poems/tokenizer_config.json /content/models/poems/training_args.bin /content/models/poems/vocab.json

  adding: content/models/poems/checkpoint-2000/ (stored 0%)
  adding: content/models/poems/checkpoint-2000/config.json (deflated 48%)
  adding: content/models/poems/checkpoint-2000/pytorch_model.bin (deflated 16%)
  adding: content/models/poems/checkpoint-2000/vocab.json (deflated 76%)
  adding: content/models/poems/checkpoint-2000/training_args.bin (deflated 43%)
  adding: content/models/poems/checkpoint-2000/tokenizer_config.json (deflated 66%)
  adding: content/models/poems/checkpoint-2000/merges.txt (deflated 76%)
  adding: content/models/poems/checkpoint-2000/optimizer.pt (deflated 8%)
  adding: content/models/poems/checkpoint-2000/special_tokens_map.json (deflated 72%)
  adding: content/models/poems/checkpoint-2000/scheduler.pt (deflated 49%)
  adding: content/models/poems/config.json (deflated 48%)
  adding: content/models/poems/eval_results.txt (stored 0%)
  adding: content/models/poems/merges.txt (deflated 76%)
  adding: content/models/poems/pytorch_model.bin (deflated 16%)
  

In [None]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [None]:
!cp models.zip "drive/My Drive/"