# ru_transformers - GPT-2 tuning for horoscopes
Original git: https://github.com/mgrankin/ru_transformers

You can open the notebook in Colab:
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZnTI4Cg9YGK4wIdE16AjdfiUpcxTpa0A?usp=sharing)

In [6]:
import regex as re
import os
from fastai.basics import *
from multiprocessing import Pool
from colab_ssh import launch_ssh


In [1]:
%tensorflow_version 2.x
!rm -rf '/content/ru_transformers' || :
!git clone https://github.com/mgrankin/ru_transformers
%cd ru_transformers
from google.colab import drive
drive.mount('/content/gdrive') # for saving models and importing data
!pip install -q -r "tpu_requirements.txt"
!pip install -q tendo
!pip install awscli
!pip install colab_ssh
!nvidia-smi

Cloning into 'ru_transformers'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 1959 (delta 14), reused 22 (delta 6), pack-reused 1921
Receiving objects: 100% (1959/1959), 5.72 MiB | 22.03 MiB/s, done.
Resolving deltas: 100% (1292/1292), done.
/content/ru_transformers
Mounted at /content/gdrive

In [None]:
%set_env CUDA_HOME=/usr/local/cuda-10.1
!git clone https://github.com/NVIDIA/apex
%cd apex
# fix setup.py if complains for version mismatch
!pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
%cd ..

env: CUDA_HOME=/usr/local/cuda-10.1
Cloning into 'apex'...
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 7456 (delta 0), reused 0 (delta 0), pack-reused 7455
Receiving objects: 100% (7456/7456), 13.91 MiB | 4.07 MiB/s, done.
Resolving deltas: 100% (5038/5038), done.
/content/ru_transformers/apex
  cmdoptions.check_install_build_global(options)
Created temporary directory: /tmp/pip-ephem-wheel-cache-ljqbdlcq
Created temporary directory: /tmp/pip-req-tracker-djjax0z6
Created requirements tracker '/tmp/pip-req-tracker-djjax0z6'
Created temporary directory: /tmp/pip-install-46ed7biz
Processing /content/ru_transformers/apex
  Created temporary directory: /tmp/pip-req-build-pworqt62
  Added file:///content/ru_transformers/apex to build tracker '/tmp/pip-req-tracker-djjax0z6'
    Running setup.py (path:/tmp/pip-req-build-pworqt62/setup.py) egg_info for package from file:///content/ru_transformers/apex
    Running command python setup.

In [2]:
!cd ..; aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru/unfreeze_all gpt2 # download the models

download: s3://models.dobro.ai/gpt2/ru/unfreeze_all/m_checkpoint-3364613/step.txt to gpt2/m_checkpoint-3364613/step.txt
download: s3://models.dobro.ai/gpt2/ru/unfreeze_all/s_checkpoint-1900000/config.json to gpt2/s_checkpoint-1900000/config.json
download: s3://models.dobro.ai/gpt2/ru/unfreeze_all/m_checkpoint-3364613/config.json to gpt2/m_checkpoint-3364613/config.json
download: s3://models.dobro.ai/gpt2/ru/unfreeze_all/m_checkpoint-3364613/training_args.bin to gpt2/m_checkpoint-3364613/training_args.bin

In [3]:
DATASET_PATH = "/content/gdrive/My Drive/dataset"
SAVE_TO_PATH = "./dataset/prepared/"

In [4]:
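def process_function(path_to_file):
    # Matches a non-space, non-word character followed directly by a word character
    match = re.compile(r'(?=[^ ])([\W])([\w])')
    # Matches any character (newlines included) repeated three or more times in a row
    match2 = re.compile(r'(.|\s)\1\1+')
    with open(path_to_file, 'r', encoding='utf-8') as f:
        lines = f.read()
    # Make sure the text starts with a space
    if lines and lines[0] != ' ': lines = ' ' + lines
    lines = match.sub(r'\g<1> \g<2>', lines)  # insert a space after the punctuation
    lines = match2.sub(r'\1'*3, lines)  # collapse long runs to exactly three characters
    # Write the result under the same file name in SAVE_TO_PATH
    path = os.path.join(SAVE_TO_PATH, os.path.split(path_to_file)[1])
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(lines)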
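
To see what the two regular expressions in `process_function` do, here is a minimal sketch that applies them to an in-memory string instead of a file. The `normalize` helper and the constant names are hypothetical, introduced only for this illustration, and the sketch uses the standard `re` module rather than `regex`, assuming both treat these patterns the same way.

```python
import re

# Same two patterns as in process_function, compiled once.
PUNCT_BEFORE_WORD = re.compile(r'(?=[^ ])([\W])([\w])')
REPEATED_CHAR = re.compile(r'(.|\s)\1\1+')


def normalize(text: str) -> str:
    """Apply the same text clean-up as process_function, without touching files."""
    # Prepend a space if the text does not already start with one.
    if text and text[0] != ' ':
        text = ' ' + text
    # Insert a space between punctuation and the word character that follows it.
    text = PUNCT_BEFORE_WORD.sub(r'\g<1> \g<2>', text)
    # Collapse runs of three or more identical characters to exactly three.
    text = REPEATED_CHAR.sub(r'\1' * 3, text)
    return text


print(repr(normalize('«Привет')))    # ' « Привет'
print(repr(normalize('word,word')))  # ' word, word'
print(repr(normalize('aaaaa!!!!')))  # ' aaa!!!'
```

Trying a few lines of your own corpus this way is a quick check before running the whole dataset through the pool.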

In [7]:
txts = get_files(DATASET_PATH, '.txt')
os.makedirs(SAVE_TO_PATH, exist_ok=True)

# Process the files in parallel; the loop body is empty because it only drives the progress bar
with Pool(64) as pool:
    for _ in progress_bar(pool.imap_unordered(process_function, txts), len(txts)):
        pass

In [8]:
!mkdir ./dataset/validation

In [9]:
!mkdir ./dataset/train

In [10]:
!mv ./dataset/prepared/corpus.txt ./dataset/train/


In [11]:
!mv ./dataset/prepared/validation.txt ./dataset/validation
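
The four shell cells above fail on a second run, because `mkdir` complains about existing directories and `mv` about missing files. A minimal Python sketch of the same layout step that is safe to re-run is shown below; the `arrange_dataset` helper is hypothetical, and the file names `corpus.txt` and `validation.txt` are taken from the cells above.

```python
import shutil
from pathlib import Path


def arrange_dataset(root: Path) -> None:
    """Move the prepared files into train/ and validation/ under root."""
    prepared = root / "prepared"
    moves = {
        "corpus.txt": root / "train",
        "validation.txt": root / "validation",
    }
    for name, target_dir in moves.items():
        # Create the target directory; no error if it already exists.
        target_dir.mkdir(parents=True, exist_ok=True)
        source = prepared / name
        # Skip files that were already moved by an earlier run.
        if source.exists():
            shutil.move(str(source), str(target_dir / name))


# In the notebook this would be called as: arrange_dataset(Path("./dataset"))
```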

In [19]:
os.chdir('/content/')

In [15]:
%set_env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [16]:
# You can also connect over SSH: ssh root@some.domain -p 12345
#launch_ssh("YOUR TOKEN HERE", "YOUR PASSWORD HERE")

In [17]:
os.chdir('/content/ru_transformers')

In [None]:
!python3 run_lm_finetuning.py \
    --output_dir="/content/gdrive/My Drive/gpt2-ru" \
    --model_type=gpt2 \
    --model_name_or_path=gpt2/s_checkpoint-1900000/ \
    --do_train \
    --train_data_file=./dataset/train/corpus.txt \
    --per_gpu_train_batch_size=2 \
    --save_steps=10000 \
    --logging_steps=1 \
    --warmup_samples 16000 \
    --learning_rate 3e-5 \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./dataset/validation/validation.txt \
    --num_train_epochs 5.0 \
    --unfreeze_level 0 \
    --fp16 \
    --fp16_opt_level O2 \
    --overwrite_output_dir

2020-10-21 09:09:02.518664: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:09:03.892569: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-21 09:09:03.892803: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-21 09:09:03.893613: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-10-21 09:09:03.893660: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:09:03.895635: I tensorflow/stream_executor/platform/defa

In [None]:
# Start
# python3 run_lm_finetuning.py \
#     --output_dir="/content/gdrive/My Drive/gpt2-ru" \
#     --model_type=gpt2 \
#     --model_name_or_path=./gpt2/m_checkpoint-3364613/ \
#     --do_train \
#     --train_data_file=./dataset/train/ \
#     --per_gpu_train_batch_size=2 \
#     --save_steps=10000 \
#     --logging_steps=1 \
#     --warmup_samples 16000 \
#     --learning_rate 3e-5 \
#     --tokenizer_class YTEncoder \
#     --tokenizer_name bpe/yt.model \
#     --do_eval \
#     --evaluate_during_training \
#     --eval_steps 1000 \
#     --eval_data_file=./dataset/validation/ \
#     --num_train_epochs 1.0 \
#     --unfreeze_level 0 \
#     --fp16 \
#     --fp16_opt_level O2 \
#     --overwrite_output_dir

# Continue
# python3 run_lm_finetuning.py \
#     --output_dir="/content/gdrive/My Drive/gpt2-ru" \
#     --model_type=gpt2 \
#     --model_name_or_path="/content/gdrive/My Drive/gpt2-ru" \
#     --do_train \
#     --train_data_file="./dataset/prepared/" \
#     --per_gpu_train_batch_size=2 \
#     --save_steps=10000 \
#     --logging_steps=1 \
#     --warmup_samples 16000 \
#     --learning_rate 3e-5 \
#     --tokenizer_class YTEncoder \
#     --tokenizer_name bpe/yt.model \
#     --do_eval \
#     --evaluate_during_training \
#     --eval_steps 1000 \
#     --eval_data_file=./dataset/eval/ \
#     --num_train_epochs 2.0 \
#     --unfreeze_level 0 \
#     --fp16 \
#     --fp16_opt_level O2 \
#     --overwrite_output_dir
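
The "Start" and "Continue" commands differ only in the model path, the data paths, and the number of epochs. A minimal sketch that builds both from one function follows; `build_command` is a hypothetical helper, and every flag and value in it is copied from the commands above. Passing the arguments as a list also avoids quoting problems with the space in `My Drive`.

```python
import shlex


def build_command(output_dir, model_path, train_data, eval_data, epochs):
    """Return the argument list for run_lm_finetuning.py with the flags used above."""
    return [
        "python3", "run_lm_finetuning.py",
        f"--output_dir={output_dir}",
        "--model_type=gpt2",
        f"--model_name_or_path={model_path}",
        "--do_train",
        f"--train_data_file={train_data}",
        "--per_gpu_train_batch_size=2",
        "--save_steps=10000",
        "--logging_steps=1",
        "--warmup_samples", "16000",
        "--learning_rate", "3e-5",
        "--tokenizer_class", "YTEncoder",
        "--tokenizer_name", "bpe/yt.model",
        "--do_eval",
        "--evaluate_during_training",
        "--eval_steps", "1000",
        f"--eval_data_file={eval_data}",
        "--num_train_epochs", str(epochs),
        "--unfreeze_level", "0",
        "--fp16",
        "--fp16_opt_level", "O2",
        "--overwrite_output_dir",
    ]


OUTPUT_DIR = "/content/gdrive/My Drive/gpt2-ru"

# First run: start from the downloaded small checkpoint.
start = build_command(OUTPUT_DIR, "gpt2/s_checkpoint-1900000/",
                      "./dataset/train/corpus.txt",
                      "./dataset/validation/validation.txt", 5.0)

# Later runs: continue from whatever was saved to the output directory.
cont = build_command(OUTPUT_DIR, OUTPUT_DIR,
                     "./dataset/train/corpus.txt",
                     "./dataset/validation/validation.txt", 2.0)

# shlex.join (Python 3.8+) prints a shell-safe version for inspection.
print(shlex.join(start))
```

To actually launch training, the list can be passed to `subprocess.run(start, check=True)` from inside the `ru_transformers` directory.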

# Generation

You can play with the model here: give it the beginning of a phrase and it will continue the text. The model is loaded from Google Drive; if you train your own, point the path at your own checkpoint.
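
If you train your own model, the checkpoint path changes with every save. A minimal sketch for picking the most recent one is shown below; `latest_checkpoint` is a hypothetical helper, and it assumes checkpoints are saved as `checkpoint-<step>` directories inside the output directory, as the path in the next cell suggests.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: checkpoint-<step>, e.g. checkpoint-1900915.
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None if there is none."""
    best_step, best_path = -1, None
    for child in output_dir.iterdir():
        found = CHECKPOINT_RE.match(child.name)
        if found and child.is_dir() and int(found.group(1)) > best_step:
            best_step, best_path = int(found.group(1)), child
    return best_path


# In the notebook: latest_checkpoint(Path("/content/gdrive/My Drive/gpt2-ru"))
```

The steps are compared as integers, so `checkpoint-1900915` correctly ranks above `checkpoint-990000`.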




In [None]:
!python3 evaluate_model.py --length 39 --model_path "/content/gdrive/My Drive/gpt2-ru/checkpoint-1900915" --continuous_run

2020-10-21 09:43:53.161083: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:43:55.176095: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-21 09:43:55.180351: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-21 09:43:55.181192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-10-21 09:43:55.181236: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:43:55.183343: I tensorflow/stream_executor/platform/defa

The same generation with a larger model, but without fine-tuning.

In [None]:
!python3 evaluate_model.py --length 39 --model_path "/content/ru_transformers/gpt2/m_checkpoint-3364613" --continuous_run

2020-10-21 09:50:45.990143: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:50:47.919881: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-21 09:50:47.923786: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-10-21 09:50:47.924541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-10-21 09:50:47.924584: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-21 09:50:47.926534: I tensorflow/stream_executor/platform/defa