# Models Finetuning

In this notebook, several models are fine-tuned to perform sentence simplification in Russian. All models are tuned with the parameters suggested for the RuSimpleSentEval competition. The main objective is not to achieve the best performance but to compare models trained with and without translated data. In every case training lasts 5 epochs. The initial plan covered the following models:

* A model trained on pairs of an original English sentence and a simplified Russian sentence, so it learns to translate and simplify at the same time.
* A model trained only on the data translated into Russian.
* A model trained first on the original data and then on the translated corpus.

All the models are evaluated and compared.

The first round of training was quite unsuccessful, so the approach was changed: 1) filter the data, and 2) use the Russian Paraphraser corpus.

The following models were trained:
* Model trained on filtered translated data
* Model trained on filtered translated data and then on Paraphraser
* Model trained on Paraphraser
* Model trained on Paraphraser + filtered translated data
* Model trained on Paraphraser + filtered translated data with control tokens

P.S. This notebook is heavily based on the competition repository: https://github.com/dialogue-evaluation/RuSimpleSentEval

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### Necessary libraries

In [None]:
import pandas as pd
import re
# import nltk
# nltk.download('punkt')

In [None]:
! wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz
! tar -xzvf /content/mbart.cc25.v2.tar.gz
! apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

In [None]:
!git clone https://github.com/google/sentencepiece.git 
%cd sentencepiece
!mkdir build

Cloning into 'sentencepiece'...
remote: Enumerating objects: 3706, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 3706 (delta 4), reused 1 (delta 0), pack-reused 3691
Receiving objects: 100% (3706/3706), 28.59 MiB | 18.92 MiB/s, done.
Resolving deltas: 100% (2596/2596), done.
/content/sentencepiece


In [None]:
%cd build
!cmake ..
!make
!make install
!ldconfig -v

In [None]:
# from sentencepiece git
# !git clone https://github.com/google/sentencepiece.git 
# %cd sentencepiece
# %mkdir build
# %cd build
# !cmake ..
# !make -j $(nproc)
# !sudo make install
# !sudo ldconfig -v

In [None]:
%cd /content

/content


In [None]:
# !git clone https://github.com/pytorch/fairseq
# !cd fairseq
# %pip install --editable ./

In [None]:
!git clone https://github.com/pytorch/fairseq
%cd /content/fairseq/
!python -m pip install --editable .
%cd /content

! echo $PYTHONPATH

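import os
# Append fairseq to PYTHONPATH; fall back to an empty value if the variable is unset
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ":/content/fairseq/"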

! echo $PYTHONPATH

### Loading data...

I use the original WikiLarge, its Google translation, and Paraphraser.

In [None]:
! mkdir data
! gdown https://drive.google.com/uc?id=1bJo8TagTGKa0uyppQRqsHrKHyYO5tcZc
! gdown https://drive.google.com/uc?id=11lqipq6ggrgCk8bVxQ4-uuPVMCKN5ebU
! gdown https://drive.google.com/uc?id=1dB3X-Wx8qU_5nDG_pxAmLvo5H_sgnHrE

In [None]:
%cd /content/fairseq

In [None]:
data_train = pd.read_csv('/content/wiki_train_cleaned_translated_sd.csv')
data_dev = pd.read_csv('/content/wiki_dev_cleaned_translated_sd.csv')
data_test  = pd.read_csv('/content/wiki_test_cleaned_translated_sd.csv')

As a test set I use the dev part of the Russian dataset collected for the RuSimpleSentEval competition: https://github.com/dialogue-evaluation/RuSimpleSentEval

In [None]:
######
data_test = pd.read_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_dev_eng.csv', sep='\t')

I get rid of the sentences where the simplified versions coincide with the original sentences:

In [None]:
data_train = data_train[data_train.target_x!=data_train.target_y]
data_dev = data_dev[data_dev.target_x!=data_dev.target_y]

I also filter the data by length: a pair is kept only if the simplified sentence has fewer than 80% as many space-separated words as the original.

In [None]:
dat_train = data_train[(data_train['target_y'].apply(lambda x: len(x.split(' ')))/data_train['target_x'].apply(lambda x: len(x.split(' '))))<0.8]

data_train = dat_train[:82000]
data_dev = dat_train[82000:]
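
The two filters above (drop identical pairs, then keep only clearly shorter simplifications) can be sketched on plain strings. This is a minimal illustration rather than the notebook's exact code: the helper names `length_ratio` and `keep_pair` are made up here, and the 0.8 threshold is taken from the cell above.

```python
def length_ratio(original: str, simplified: str) -> float:
    """Word-count ratio of the simplified sentence to the original one."""
    # split(' ') mirrors the single-space split used in the notebook
    return len(simplified.split(' ')) / len(original.split(' '))


def keep_pair(original: str, simplified: str, threshold: float = 0.8) -> bool:
    """Keep a pair only if it differs from the source and is clearly shorter."""
    return original != simplified and length_ratio(original, simplified) < threshold


pairs = [
    ("a b c d e f g h i j", "a b c d e"),          # ratio 0.5, kept
    ("a b c d e f g h i j", "a b c d e f g h i"),  # ratio 0.9, dropped
    ("a b c", "a b c"),                            # identical, dropped
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1
```

With a DataFrame, the same predicate could be applied row by row, for example `df[df.apply(lambda r: keep_pair(r['target_x'], r['target_y']), axis=1)]`.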

In [None]:
data_train.shape

(82000, 5)

In [None]:
data_dev.shape


(20066, 5)

For additional model pretraining I use the Paraphraser corpus, which has proven to be quite effective.

In [None]:
! gdown https://drive.google.com/uc?id=1JaNqhyZf-3Fybs3iTo90__4eEN4CmhMl
import json
from sklearn.utils import shuffle

with open('/content/ParaPhraserPlus.json', 'r', encoding='utf-8') as f:
  data = json.load(f)

# Use the first two headlines of every entry as a (source, target) pair
src, dst = [], []
for i in data.keys():
  src.append(data[i]['headlines'][0])
  dst.append(data[i]['headlines'][1])
data = pd.DataFrame(list(zip(src, dst)), columns=['src','dst'])
data = shuffle(data)
data.drop_duplicates(subset=['dst'], inplace=True)
# Sample 241,000 pairs: 240,000 for training and 1,000 for validation
data_new = data.sample(241000)
data_train = data_new[:240000]
data_dev = data_new[240000:]

Downloading...
From: https://drive.google.com/uc?id=1JaNqhyZf-3Fybs3iTo90__4eEN4CmhMl
To: /content/ParaPhraserPlus.json
1.11GB [00:08, 139MB/s]
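
A small sketch of the pair extraction on a toy dictionary. The layout (each entry holding a `headlines` list) is inferred from the cell above; the sample entries and the function name `headline_pairs` are invented for illustration, and the guard for entries with fewer than two headlines is an addition that the notebook code does not have.

```python
def headline_pairs(corpus: dict) -> list:
    """Return (src, dst) pairs built from the first two headlines of each entry,
    keeping only the first pair seen for every distinct dst."""
    pairs, seen_dst = [], set()
    for entry in corpus.values():
        headlines = entry.get("headlines", [])
        if len(headlines) < 2:      # skip entries that cannot form a pair
            continue
        src, dst = headlines[0], headlines[1]
        if dst in seen_dst:         # mirrors drop_duplicates(subset=['dst']), which keeps the first occurrence
            continue
        seen_dst.add(dst)
        pairs.append((src, dst))
    return pairs


toy = {
    "1": {"headlines": ["A", "B", "C"]},
    "2": {"headlines": ["D", "B"]},   # duplicate dst, dropped
    "3": {"headlines": ["E"]},        # too short, skipped
}
print(headline_pairs(toy))  # [('A', 'B')]
```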


## Trial with all data

In [None]:
data_train = pd.read_csv('/content/wiki_train_cleaned_translated_sd.csv')
data_dev = pd.read_csv('/content/wiki_dev_cleaned_translated_sd.csv')
data_test  = pd.read_csv('/content/wiki_test_cleaned_translated_sd.csv')
data_train = pd.concat((data_train, data_dev, data_test))
data_train = data_train[data_train.target_x!=data_train.target_y]
data_train = data_train[(data_train['target_y'].apply(lambda x: len(x.split(' ')))/data_train['target_x'].apply(lambda x: len(x.split(' '))))<0.8]

In [None]:
data_test = pd.read_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_dev_eng.csv', sep='\t')

In [None]:
! gdown https://drive.google.com/uc?id=1JaNqhyZf-3Fybs3iTo90__4eEN4CmhMl
import json
from sklearn.utils import shuffle
with open('/content/ParaPhraserPlus.json', 'r') as f:
  data = json.loads(f.read())

import random
src, dst = [], []
for i in data.keys():
  src.append(data[i]['headlines'][0])
  dst.append(data[i]['headlines'][1])
data = pd.DataFrame(list(zip(src, dst)), columns=['target_x','target_y'])
data = data.sample(241000)
# random.shuffle(data)
data.head(3)
data.drop_duplicates(subset=['target_y'], inplace=True)

Downloading...
From: https://drive.google.com/uc?id=1JaNqhyZf-3Fybs3iTo90__4eEN4CmhMl
To: /content/ParaPhraserPlus.json
1.11GB [00:11, 93.9MB/s]


In [None]:
# Drop the English columns so the frame matches the Paraphraser columns (target_x, target_y)
data_train = data_train.drop(columns=['src', 'dst'])

In [None]:
data_train = pd.concat((data_train, data))

In [None]:
from sklearn.utils import shuffle

In [None]:
data_train = shuffle(data_train)

In [None]:
data_dev = data_train.sample(n=2000, random_state=42)
data_train = data_train.drop(data_dev.index)
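
One thing to watch in a split like this: `pd.concat` keeps the original row labels, so the combined frame can contain repeated labels, and `drop(data_dev.index)` then removes every row that shares a label with a sampled row. A minimal sketch on toy data, assuming a fresh positional index is acceptable (`holdout_split` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd


def holdout_split(df, n_dev, seed=42):
    """Split df into (train, dev) after giving every row a unique label."""
    df = df.reset_index(drop=True)               # labels become 0..len(df)-1
    dev = df.sample(n=n_dev, random_state=seed)
    train = df.drop(dev.index)                   # removes exactly the sampled rows
    return train, dev


a = pd.DataFrame({"target_x": ["a1", "a2", "a3"], "target_y": ["b1", "b2", "b3"]})
b = pd.DataFrame({"target_x": ["c1", "c2", "c3"], "target_y": ["d1", "d2", "d3"]})
combined = pd.concat((a, b))                     # index labels are 0, 1, 2, 0, 1, 2
train, dev = holdout_split(combined, n_dev=2)
print(len(train), len(dev))  # 4 2
```

Passing `ignore_index=True` to `pd.concat` would give the same unique labels at concatenation time.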

In [None]:
data_train.shape

(334256, 3)

## ACCESS

In [None]:
! gdown https://drive.google.com/uc?id=1ZelXO1Toyfk7ezW50HEaAdnO5XoORa4P
data = pd.read_csv('/content/asset_data_ru.csv')
data_test = pd.read_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_dev_eng.csv', sep='\t')

from sklearn.utils import shuffle
data = shuffle(data)
data_dev = data.sample(n=500, random_state=42)
data_train = data.drop(data_dev.index)

Downloading...
From: https://drive.google.com/uc?id=1ZelXO1Toyfk7ezW50HEaAdnO5XoORa4P
To: /content/asset_data_ru.csv
14.7MB [00:00, 69.0MB/s]


In [None]:
data_train.head()

Unnamed: 0.1,Unnamed: 0,src,dst,target_x,target_y
6300,2710,"A Bulldog, also known as British Bulldog or English Bulldog, is a breed of dog which traces its ancestry to England.","A Bulldog, also known as British Bulldog or English Bulldog, is a breed of dog which originates from England.","Бульдог, также известный как британский бульдог или английский бульдог, - это порода собак, которая ведет свое происхождение от Англии.","Бульдог, также известный как британский бульдог или английский бульдог, - это порода собак, которая происходит из Англии."
486,486,"Mariel of Redwall is a fantasy novel by Brian Jacques, published in 1991.",Mariel of Redwall is a 1991 fantasy book by Brian Jacques.,"Мариэль из Редволла - это фантастический роман Брайана Жака, опубликованный в 1991 году.",Мариэль из Редволла - это книга в жанре фэнтези Брайана Жака 1991 года.
19210,15620,The Movie and subsequent additions to the franchise.,The Movie and future additions to the franchise.,Фильм и последующие дополнения к франшизе.,Фильм и будущие дополнения к франшизе.
19504,15914,"According to an interview in the UK newspaper The Sun, Heyman wrote the brand's weekly scripts and submitted them to writers for possible changes, and then Vince McMahon for final approval.","According to The Sun, Heyman wrote the brand's scripts and gave them to writers for possible changes, and then to Vince McMahon for final approval.","Согласно интервью британской газете The Sun, Хейман написал еженедельные сценарии бренда и отправил их авторам для возможных изменений, а затем Винсу МакМахону для окончательного утверждения.","Согласно The Sun, Хейман написал сценарии бренда и передал их сценаристам для возможных изменений, а затем Винсу МакМахону для окончательного утверждения."
573,573,After graduation he returned to Yerevan to teach at the local Conservatory and later he was appointed artistic director of the Armenian Philarmonic Orchestra.,After graduation he returned to Yerevan to teach at the conservatory. Later he was appointed artistic director of the Armenian Philarmonic Orchestra.,"После окончания школы вернулся в Ереван, чтобы преподавать в местной консерватории, а затем был назначен художественным руководителем оркестра Армянской филармонии.","После окончания школы вернулся в Ереван, чтобы преподавать в консерватории. Позже был назначен художественным руководителем Армянского филармонического оркестра."


--------------------------------------------

## Data preprocessing

In [None]:
! mkdir /content/fairseq/data

In [None]:
### process WikiLarge

with open('/content/fairseq/data/test.en', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/train.en', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/dev.en', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/test.ru', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['target_y']+'\n')

with open('/content/fairseq/data/train.ru', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['target_y']+'\n')

with open('/content/fairseq/data/dev.ru', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['target_y']+'\n')

In [None]:
### process WikiLarge but with Russian test
with open('/content/fairseq/data/test.src', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['INPUT:source']+'\n')

with open('/content/fairseq/data/train.src', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/dev.src', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/test.dst', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['OUTPUT:output']+'\n')

with open('/content/fairseq/data/train.dst', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['dst']+'\n')

with open('/content/fairseq/data/dev.dst', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['dst']+'\n')


In [None]:
! echo $DATA_DIR




In [None]:
SPM="/content/sentencepiece/build/src/spm_encode"
BPE_MODEL="/content/mbart.cc25.v2/sentence.bpe.model"
DATA_DIR="/content/fairseq/data"
SRC="en"
TGT="ru" #en

!$SPM --model=$BPE_MODEL < $DATA_DIR/train.$SRC > $DATA_DIR/train.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/train.$TGT > $DATA_DIR/train.spm.$TGT &
!$SPM --model=$BPE_MODEL < $DATA_DIR/dev.$SRC > $DATA_DIR/dev.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/dev.$TGT > $DATA_DIR/dev.spm.$TGT &
!$SPM --model=$BPE_MODEL < $DATA_DIR/test.$SRC > $DATA_DIR/test.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/test.$TGT > $DATA_DIR/test.spm.$TGT &

In [None]:

PREPROCESSED_DATA_DIR="/content/fairseq/data"
DICT="/content/mbart.cc25.v2/dict.txt"
!fairseq-preprocess \
  --source-lang en \
  --target-lang ru \
  --trainpref /content/fairseq/data/train.spm \
  --validpref /content/fairseq/data/dev.spm \
  --testpref /content/fairseq/data/test.spm \
  --destdir /content/fairseq/data \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --srcdict /content/mbart.cc25.v2/dict.txt \
  --tgtdict /content/mbart.cc25.v2/dict.txt \
  --workers 70

Second training with ru-ru

In [None]:
! rm -r /content/fairseq/data

In [None]:
! mkdir /content/fairseq/data

In [None]:
### process translated WikiLarge
with open('/content/fairseq/data/test.src', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['target_x']+'\n')

with open('/content/fairseq/data/train.src', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['target_x']+'\n')

with open('/content/fairseq/data/dev.src', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['target_x']+'\n')

with open('/content/fairseq/data/test.dst', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['target_y']+'\n')

with open('/content/fairseq/data/train.dst', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['target_y']+'\n')

with open('/content/fairseq/data/dev.dst', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['target_y']+'\n')

In [None]:
#### process translated WikiLarge + russian dev set as test
with open('/content/fairseq/data/test.src', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['INPUT:source']+'\n')

with open('/content/fairseq/data/train.src', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['target_x']+'\n')

with open('/content/fairseq/data/dev.src', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['target_x']+'\n')

with open('/content/fairseq/data/test.dst', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['OUTPUT:output']+'\n')

with open('/content/fairseq/data/train.dst', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['target_y']+'\n')

with open('/content/fairseq/data/dev.dst', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['target_y']+'\n')

In [None]:
#### process paraphraser
with open('/content/fairseq/data/test.src', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['INPUT:source']+'\n')

with open('/content/fairseq/data/train.src', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/dev.src', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['src']+'\n')

with open('/content/fairseq/data/test.dst', "a") as f:
  for i, row in data_test.iterrows():
    f.write(row['OUTPUT:output']+'\n')

with open('/content/fairseq/data/train.dst', "a") as f:
  for i, row in data_train.iterrows():
    f.write(row['dst']+'\n')

with open('/content/fairseq/data/dev.dst', "a") as f:
  for i, row in data_dev.iterrows():
    f.write(row['dst']+'\n')
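
The export cells above differ only in which columns they write. A table-driven sketch of the same idea follows; `write_parallel` and the `COLUMNS` mapping are hypothetical names, and the column pairs are the ones used in the cells above. Unlike the notebook cells, it opens files in `"w"` mode, so re-running it does not append a second copy of the data.

```python
import os
import tempfile


def write_parallel(rows, src_col, dst_col, prefix):
    """Write one sentence per line to <prefix>.src and <prefix>.dst."""
    with open(prefix + ".src", "w", encoding="utf-8") as f_src, \
         open(prefix + ".dst", "w", encoding="utf-8") as f_dst:
        for row in rows:
            # Collapse runs of whitespace (including line breaks) so that
            # both files keep exactly one line per pair
            f_src.write(" ".join(str(row[src_col]).split()) + "\n")
            f_dst.write(" ".join(str(row[dst_col]).split()) + "\n")


# Column pairs taken from the cells above: name -> (source column, target column)
COLUMNS = {
    "translated_wikilarge": ("target_x", "target_y"),
    "paraphraser": ("src", "dst"),
    "russian_test": ("INPUT:source", "OUTPUT:output"),
}

rows = [{"target_x": "длинное исходное предложение", "target_y": "короткое предложение"}]
prefix = os.path.join(tempfile.mkdtemp(), "train")
write_parallel(rows, *COLUMNS["translated_wikilarge"], prefix)
write_parallel(rows, *COLUMNS["translated_wikilarge"], prefix)  # second run overwrites
with open(prefix + ".dst", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['короткое предложение']
```

With a DataFrame, the rows could be passed as `data_train.to_dict("records")`.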

In [None]:
SPM="/content/sentencepiece/build/src/spm_encode"
BPE_MODEL="/content/mbart.cc25.v2/sentence.bpe.model"
DATA_DIR="/content/fairseq/data"
SRC="src"
TGT="dst"

!$SPM --model=$BPE_MODEL < $DATA_DIR/train.$SRC > $DATA_DIR/train.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/train.$TGT > $DATA_DIR/train.spm.$TGT &
!$SPM --model=$BPE_MODEL < $DATA_DIR/dev.$SRC > $DATA_DIR/dev.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/dev.$TGT > $DATA_DIR/dev.spm.$TGT &
!$SPM --model=$BPE_MODEL < $DATA_DIR/test.$SRC > $DATA_DIR/test.spm.$SRC &
!$SPM --model=$BPE_MODEL < $DATA_DIR/test.$TGT > $DATA_DIR/test.spm.$TGT &

In [None]:

PREPROCESSED_DATA_DIR="/content/fairseq/data"
DICT="/content/mbart.cc25.v2/dict.txt"
!fairseq-preprocess \
  --source-lang src \
  --target-lang dst \
  --trainpref /content/fairseq/data/train.spm \
  --validpref /content/fairseq/data/dev.spm \
  --testpref /content/fairseq/data/test.spm \
  --destdir /content/fairseq/data \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --srcdict /content/mbart.cc25.v2/dict.txt \
  --tgtdict /content/mbart.cc25.v2/dict.txt \
  --workers 70

2021-05-11 19:05:01 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, azureml_logging=False, bf16=False, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='/content/fairseq/data', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_file=None, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer=None, padding_factor=8, plasma_path='/tmp/plasma', profile=False, quantization_config_path=None, reset_logging=False, scoring='bleu', seed=1, simul_type=None, source_lang='src', srcdict='/content/mbart.cc25.v2/dict.txt', suppress_crashes=False, target_lang='dst', task='translation', tensorboard_logdir=None, testpref='/con

The training code was the same every time; only the "src" and "dst" data changed. So instead of repeating it six times, I reused this one cell and swapped in the necessary data.


In [None]:
# ! rm -r /content/drive/MyDrive/checkpoints_ru_ru_added
! mkdir /content/drive/MyDrive/checkpoints_paraphrases_wiki_filtered_20_epochs

In [None]:
! mkdir /content/drive/MyDrive/checkpoints_ru_ru_add

**Also**, it is necessary to make the following change in /content/fairseq/fairseq/tasks/translation_from_pretrained_bart.py:

```python
def __init__(self, args, src_dict, tgt_dict):
    super().__init__(args, src_dict, tgt_dict)
    self.args = args                  # add this line
    self.langs = args.langs.split(",")
    for d in [src_dict, tgt_dict]:
        for l in self.langs:
```


The next two cells were supposed to install apex for faster training, but the installation failed with an error, so they are commented out.

In [None]:
# %%writefile setup.sh

# export CUDA_HOME=/usr/local/cuda-10.1
# git clone https://github.com/NVIDIA/apex
# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Writing setup.sh


In [None]:
# !sh setup.sh

# Training

In [None]:
# those are just some variations in parameters that I tried
# > train_log.txt &
#  --update-freq 1
#  --ddp-backend no_c10d
# --max-tokens 1024
# --batch-size 4 2
# --max-epoch 25
# --fp16 \?????
# --update-freq? increase????
# --update-freq 2??? 5??
# 3
# --max-tokens 300
#  --ddp-backend no_c10d \
# --fp16 \
# --memory-efficient-fp16 \
# --save-interval-updates 5000 \
# /content/mbart.cc25.v2/model.pt
# --max-epoch 10

In [None]:
!fairseq-train /content/fairseq/data \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large --layernorm-embedding \
  --task translation_from_pretrained_bart \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 54725  \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 \
  --source-lang src --target-lang dst \
  --batch-size 16 \
  --update-freq 4 \
  --memory-efficient-fp16 \
  --validate-interval 1 \
  --patience 3 \
  --max-epoch 5 \
  --save-interval 5 --keep-last-epochs 10 --keep-best-checkpoints 2 \
  --ddp-backend no_c10d \
  --seed 42 --log-format simple --log-interval 500 \
  --restore-file /content/mbart.cc25.v2/model.pt \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN \
  --scoring bleu \
  --save-dir /content/drive/MyDrive/checkpoints_ru_ru_add

2021-05-11 19:06:53 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 500, 'log_format': 'simple', 'log_file': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 42, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_w
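
As a rough sanity check on `--total-num-update`, the number of optimizer updates can be estimated from the dataset size. This is a back-of-the-envelope sketch, assuming a single GPU, a training set of 240,000 pairs (the size of the Paraphraser split prepared earlier), and that `--update-freq 4` is the value in effect. Because `--max-tokens 1024` can make batches smaller than 16 sentences, the real count may be higher.

```python
import math


def updates_per_epoch(n_pairs, batch_size, update_freq, n_gpus=1):
    """Lower bound on optimizer updates per epoch when batches are full."""
    sentences_per_update = batch_size * update_freq * n_gpus
    return math.ceil(n_pairs / sentences_per_update)


n_pairs = 240_000
per_epoch = updates_per_epoch(n_pairs, batch_size=16, update_freq=4)
print(per_epoch, per_epoch * 5)  # 3750 18750
```

Under these assumptions, five epochs come to fewer updates than the 54,725 given to the polynomial schedule, so training would end before the learning rate has fully decayed.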

In [None]:
#https://github.com/awesomedata/awesome-public-datasets

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

In [None]:
data_train.sample(10)

Unnamed: 0.1,Unnamed: 0,src,dst,target_x,target_y
198495,198495,"The Mahseer ( Tor putitora ) , an indigenous riverine fish found in the Hub River , grows up to 2m in length and provides for excellent angling .","The Mahseer ( Tor putitora ) , an indigenous riverine fish found in the Hub River , grows up to 2m in length and is fished .","Махсир (Tor putitora), местная речная рыба, обитающая в реке Хаб, вырастает до 2 м в длину и обеспечивает отличную рыбалку.","Махсир (Тор путитора), местная речная рыба, обитающая в реке Хаб, вырастает до 2 м в длину, и ее ловят."
245110,245110,"A constituent country is a country that is part of a larger entity , such as a sovereign state or supranational body .","A constituent country is a country which makes up a part of a larger country , or federation .","Составляющая страна - это страна, которая является частью более крупного образования, такого как суверенное государство или наднациональный орган.","Составляющая страна - это страна, которая является частью более крупной страны или федерации."
95476,95476,"Chinese in Penang , Kuala Lumpur , of Malaysia also pray to Lord Murugan during Thaipusam .",He is the son of Lord Shiva and Goddess Parvati .,"Китайцы в Пенанге, Куала-Лумпур, Малайзии, также молятся Господу Муругану во время Тайпусама.",Он сын Господа Шивы и богини Парвати.
216288,216288,15 5 | - style = '' background-color : #c 0ffff '' | Argon | | Ar | | 18 | | 39.948 ( 1 ) The isotopic composition varies in terrestrial material such that a more precise atomic weight can not be given .,15 5 | - | Argon | | Ar | | 18 | | 39.948 ( 1 ) The isotopic composition varies in terrestrial material such that a more precise atomic weight can not be given .,"15 5 | - style = '' background-color: #c 0ffff '' | Аргон | | Ar | | 18 | | 39.948 (1) Изотопный состав земного материала различается, поэтому более точный атомный вес дать невозможно.","15 5 | - | Аргон | | Ar | | 18 | | 39.948 (1) Изотопный состав земного материала различается, поэтому более точный атомный вес дать невозможно."
138765,138765,Old French was the Romance dialect continuum spoken in territories which span roughly the northern half of modern France and parts of modern Belgium and Switzerland from around 900 to 1300 .,Old French was the Romance dialect continuum spoken in the places of northern half of modern France and parts of modern Belgium and Switzerland from around 1000 to 1300 .,"Старофранцузский язык был континуумом романского диалекта, на котором говорили на территориях, которые охватывали примерно северную половину современной Франции и части современной Бельгии и Швейцарии примерно с 900 по 1300 год.","Старофранцузский был континуумом романского диалекта, на котором говорили в местах северной половины современной Франции и некоторых частях современной Бельгии и Швейцарии примерно с 1000 по 1300 год."
29328,29328,The M6 is the longest motorway in the United Kingdom and one of the busiest .,The M6 motorway is the longest motorway in the United Kingdom .,M6 - самая длинная автомагистраль в Соединенном Королевстве и одна из самых загруженных.,Автомагистраль M6 - самая длинная автомагистраль в Соединенном Королевстве.
170674,170674,DIN 476 : international paper sizes ( now ISO 216 or DIN EN ISO 216 ) DIN 946 : Determination of coefficient of friction of bolt/nut assemblies under specified conditions .,Example of DIN standards DIN 476 : international paper sizes ( now ISO 216 or DIN EN ISO 216 ) .,DIN 476: международные форматы бумаги (теперь ISO 216 или DIN EN ISO 216) DIN 946: Определение коэффициента трения сборок болт / гайка при определенных условиях.,Пример стандартов DIN DIN 476: международные форматы бумаги (теперь ISO 216 или DIN EN ISO 216).
150709,150709,"In the early 2000s , the genre name began to describe a different , slower and less dissonant style that borrowed from alternative rock .","While many types of music have screaming vocals , screamo usually has a certain kind of harsher screaming .","В начале 2000-х название жанра начало описывать другой, более медленный и менее противоречивый стиль, заимствованный из альтернативного рока.","В то время как во многих музыкальных стилях есть кричащий вокал, у скримо обычно более резкий крик."
22361,22361,"They were first observed in 1869 by German physicist Johann Hittorf , and were named in 1876 by Eugen Goldstein kathodenstrahlen , or cathode rays .",A Cathode ray is a stream of electrons that are seen in vacuum tubes .,"Впервые они были обнаружены в 1869 году немецким физиком Иоганном Хитторфом и в 1876 году были названы Ойгеном Гольдштейном катоденстрахленом, или катодными лучами.","Катодный луч - это поток электронов, который виден в электронных лампах."
93282,93282,"Following the title 's introduction in 1975 , Harley Race became the inaugural champion on January 1 .",The title became '' Undisputed '' in January 1981 when no other United States title was recognized in other promotions governed by the National Wrestling Alliance .,После введения этого титула в 1975 году Harley Race 1 января стал первым чемпионом.,"Этот титул стал «бесспорным» в январе 1981 года, когда ни один другой титул Соединенных Штатов не был признан в других акциях, проводимых Национальным борцовским альянсом."


### Test run: check that everything works and generate predictions

In [None]:
! pip install sentencepiece



In [None]:
!fairseq-generate /content/fairseq/data \
  --path /content/drive/MyDrive/checkpoints_ru_ru_add/checkpoint_best.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  --source-lang src --target-lang dst \
  --bpe 'sentencepiece' --sentencepiece-model /content/mbart.cc25.v2/sentence.bpe.model \
  --sacrebleu --remove-bpe 'sentencepiece' \
  --batch-size 32 --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN > model_prediction.txt
!cat model_prediction.txt | grep -P "^H" | sort -V | cut -f 3- > model_prediction_wiki_15ru_ru.hyp
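
The `grep | sort | cut` pipeline above picks the hypothesis lines out of the `fairseq-generate` log. A Python sketch of the same extraction, assuming the usual tab-separated `H-<id>`, score, text layout of those lines (`extract_hypotheses` is a hypothetical helper, and the sample lines are invented):

```python
def extract_hypotheses(lines):
    """Return hypotheses from fairseq-generate output, ordered by sentence id."""
    hyps = []
    for line in lines:
        if not line.startswith("H-"):
            continue
        # Split into at most three fields: tag, score, hypothesis text
        tag, _score, text = line.rstrip("\n").split("\t", 2)
        hyps.append((int(tag[2:]), text))
    return [text for _, text in sorted(hyps)]


sample = [
    "S-1\tвторое исходное\n",
    "H-1\t-0.52\tвторое упрощённое\n",
    "S-0\tпервое исходное\n",
    "H-0\t-0.31\tпервое упрощённое\n",
]
print(extract_hypotheses(sample))  # ['первое упрощённое', 'второе упрощённое']
```

Sorting by the numeric id restores the order of the input file, since `fairseq-generate` does not emit sentences in their original order.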




In [None]:
! rm -rf /content/drive/MyDrive/checkpoints_ru_ru_add

In [None]:
a = 4

In [None]:
!cp /content/model_prediction_wiki_20_filtered.hyp /content/drive/MyDrive/MT_sentence_simpl/predictions/model_prediction_wiki_20_filtered.hyp

# SARI evaluation

In [None]:
%cd /content

/content


In [None]:
! git clone https://github.com/feralvam/easse
! git clone https://github.com/Andoree/sent_simplification.git
%cp /content/sent_simplification/sari.py /content/easse/easse
%cd easse
! pip install .

In [None]:
%cd /content
! mkdir prepared_data

/content


Prepare data for SARI calculation

In [None]:
!rm -r prepared_data

In [None]:
! python /content/sent_simplification/refs_to_easse_format.py \
--input_path /content/drive/MyDrive/MT_sentence_simpl/wiki_test_dev_eng.csv \
--output_dataset_name test_ref_data \
--src_column "INPUT:source" \
--trg_column "OUTPUT:output" \
--output_dir /content/prepared_data


1000
3406
3406
Overall number of references: 3406


In [None]:
with open('/content/drive/MyDrive/MT_sentence_simpl/predictions/model_prediction_paraphraser2.hyp', 'r') as f:
  # Strip whitespace and append a final period to every prediction
  sentences = [i.strip()+'.' for i in f.readlines()]

# Drop repeated predictions while keeping their original order
lt = list()
st = set()
for i in sentences:
  if i not in st:
    lt.append(i)
    st.add(i)

with open('/content/model_prediction_paraphraser2.hyp', 'w') as f:
  for i in lt:
    f.write(i+'\n')
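
Deduplicating on the hypothesis text assumes that different source sentences never produce the same output. If the hypothesis file is line-aligned with a source file that repeats each source once per reference (an assumption about this test set, suggested by the 1000 and 3406 counts printed above), deduplicating on the source keeps the alignment explicit. A hedged sketch with a hypothetical helper `first_hyp_per_source`:

```python
def first_hyp_per_source(sources, hypotheses):
    """Keep the first hypothesis for every distinct source, in source order."""
    if len(sources) != len(hypotheses):
        raise ValueError("sources and hypotheses must be line-aligned")
    first = {}
    for src, hyp in zip(sources, hypotheses):
        first.setdefault(src, hyp)   # dicts preserve insertion order (Python 3.7+)
    return list(first.values())


sources = ["s1", "s1", "s2", "s3", "s3"]
hyps = ["h", "h", "h", "g", "g"]     # s1 and s2 happen to share an output
print(first_hyp_per_source(sources, hyps))  # ['h', 'h', 'g']
```

In this toy case, deduplicating on the hypothesis alone would leave two lines for three sources, which would shift every later line against the references.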

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
! easse evaluate \
--test_set custom \
--metrics sari \
--refs_sents_paths /content/prepared_data/test_ref_data.ref.0,/content/prepared_data/test_ref_data.ref.1,/content/prepared_data/test_ref_data.ref.2,/content/prepared_data/test_ref_data.ref.3,/content/prepared_data/test_ref_data.ref.4 \
--orig_sents_path /content/prepared_data/test_ref_data.src \
--sys_sents_path /content/model_prediction_paraphraser2.hyp -q

{'sari': 34.851, 'quality_estimation': {'Compression ratio': 0.605, 'Sentence splits': 0.984, 'Levenshtein similarity': 0.7, 'Exact copies': 0.031, 'Additions proportion': 0.043, 'Deletions proportion': 0.45, 'Lexical complexity score': 10.745}}
