## README

Please check your Colab GPU. It should have at least 16 GB of memory to fine-tune the model!

In [None]:
!nvidia-smi
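
The check above can also be done programmatically. The following is a minimal sketch, not part of the original notebook: the helper names and the `MIN_MEMORY_MIB` threshold are assumptions. The threshold is set a little below 16384 MiB because cards sold as "16 GB" typically report somewhat less than that in `nvidia-smi`.

```python
import subprocess

# Assumed threshold: nominal 16 GB cards usually report slightly under 16384 MiB.
MIN_MEMORY_MIB = 15000


def parse_total_memory_mib(csv_text):
    """Parse nvidia-smi CSV output (one integer per GPU) into a list of MiB values."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_total_memory_mib():
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_memory_mib(result.stdout)


def has_enough_memory(memory_mib, minimum=MIN_MEMORY_MIB):
    """Return True if at least one GPU meets the (assumed) minimum."""
    return any(value >= minimum for value in memory_mib)


# Usage inside the notebook (requires a GPU runtime):
# print(has_enough_memory(query_total_memory_mib()))
```

If the check returns `False`, switching the Colab runtime to a different GPU type or choosing a smaller model are the obvious options.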

## INSTALL TRANSFORMERS / CUDA / APEX

In [None]:
!pip3 install transformers==2.8.0

In [None]:
!wget https://raw.githubusercontent.com/sberbank-ai/ru-gpts/master/pretrain_transformers.py
!wget https://raw.githubusercontent.com/sberbank-ai/ru-gpts/master/generate_transformers.py

In [None]:
%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

In [None]:
!sh setup.sh

## GET DATA



In [None]:
!rm -rf ruGPT3
!rm -rf sample_data
!git clone https://github.com/gotzmann/ruGPT3.git

In [None]:
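import glob

# Sort the file list so the corpus is assembled in a reproducible order
datasets = sorted(glob.glob("ruGPT3/datasets/*"))
with open('corpus.txt', 'w', encoding='utf-8') as outfile:
    for name in datasets:
        with open(name, encoding='utf-8') as infile:
            outfile.write(infile.read())
            # A blank line separates documents; the preprocessing step splits on it
            outfile.write("\n\n")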

## PREPROCESSING

In [None]:
with open('corpus.txt', 'r', encoding='utf-8') as infile:
    data = infile.read()

# Do we need some preprocessing here? If so, apply it to the raw string before splitting.

# data = data.replace('COVID-19', 'коронавирус')
# data = data.replace('COVID', 'коронавирус')

# Documents are separated by blank lines
data = data.split('\n\n')

# Should we split the whole dataset into train / validation portions?

# import numpy as np
# import random

# random.seed(1980)
# np.random.seed(1980)

# val_ind = set(random.sample(range(len(data)), int(len(data) / 5)))
# train = [data[i] for i in range(len(data)) if i not in val_ind]
# valid = [data[i] for i in range(len(data)) if i in val_ind]
# train = [doc.replace('\n', ' ') for doc in train]
# valid = [doc.replace('\n', ' ') for doc in valid]

# Or just train on the whole data? Note that valid is then identical to train,
# so the evaluation score does not measure performance on unseen text.

train = [doc.replace('\n', ' ') for doc in data]
valid = [doc.replace('\n', ' ') for doc in data]

# Do we need JSON for GPT3 Large and plaintext for Medium / Small?

# train = [doc.replace('"', '\\"') for doc in train]
# valid = [doc.replace('"', '\\"') for doc in valid]

# train = ['{ "text" : "' + doc + '" }' for doc in train]
# valid = ['{ "text" : "' + doc + '" }' for doc in valid]

# One document per line, as expected by the --line_by_line option used below
with open("train.data", "w", encoding="utf-8") as file:
    file.write("\n".join(train))

with open("valid.data", "w", encoding="utf-8") as file:
    file.write("\n".join(valid))

len(train), len(valid)
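
If a held-out validation set is wanted, the commented-out split above can be wrapped in a small helper. This is a minimal sketch rather than the notebook's own code: the function name `split_train_valid` and its parameters are hypothetical, with the 1/5 validation share and the seed 1980 taken from the commented lines.

```python
import random


def split_train_valid(docs, valid_fraction=0.2, seed=1980):
    """Return (train, valid) lists using a random but reproducible hold-out."""
    rng = random.Random(seed)  # local generator, leaves the global seed untouched
    n_valid = int(len(docs) * valid_fraction)
    # A set makes the membership test below O(1) per document
    valid_idx = set(rng.sample(range(len(docs)), n_valid))
    train = [doc for i, doc in enumerate(docs) if i not in valid_idx]
    valid = [doc for i, doc in enumerate(docs) if i in valid_idx]
    return train, valid


# Usage with the list of documents built above:
# train, valid = split_train_valid(data)
```

With such a split, the evaluation pass during fine-tuning runs on documents the model has not been trained on.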

## FINE-TUNING

In [None]:
# To fit all the data on a Colab GPU, we use the Medium model with a block size of 1024
# If your GPU has more than 16 GB of memory, consider the Large model with a block size of 2048

!python pretrain_transformers.py \
    --output_dir=model \
    --model_type=gpt2 \
    --model_name_or_path=sberbank-ai/rugpt3medium_based_on_gpt2 \
    --do_train \
    --train_data_file=train.data \
    --do_eval \
    --eval_data_file=valid.data \
    --fp16 \
    --per_gpu_train_batch_size 1 \
    --num_train_epochs 10 \
    --block_size 1024 \
    --overwrite_output_dir \
    --line_by_line 

## 42 :)

In [None]:
!python generate_transformers.py \
    --model_type=gpt2 \
    --model_name_or_path=model \
    --k=50 \
    --p=0.95 \
    --length=100 \
    --temperature=0.75 \
    --num_return_sequences=3
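
To give a feel for what the sampling options in the command above control, here is a small pure-Python sketch. It assumes that `--k` and `--p` are the usual top-k and nucleus (top-p) cutoffs and that `--temperature` divides the logits before the softmax. It is an illustration of the technique, not the code `generate_transformers.py` runs, and the function names are hypothetical.

```python
import math
import random


def filter_distribution(logits, k=50, p=0.95, temperature=0.75):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]

    # Numerically stable softmax over the surviving tokens
    peak = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the shortest prefix whose cumulative probability reaches p
    kept = []
    cumulative = 0.0
    for token_id, prob in zip(ranked, probs):
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= p:
            break

    # Renormalise so the kept probabilities sum to 1
    mass = sum(prob for _, prob in kept)
    return {token_id: prob / mass for token_id, prob in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

Under these assumptions, lowering `--temperature`, `--k`, or `--p` makes the output more conservative, while raising them makes it more varied.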

## EXAMPLES

Q: Что такое коронавирус? A:

Q: Каковы первые симптомы коронавируса? A:

Q: Что лучше защищает от коронавируса: маска или респиратор? A:

Q: Какие лекарственные препараты назначают пациентам с коронавирусом? A: