In this notebook, the English WikiLarge dataset, collected from Wikipedia, is explored and translated into Russian with an MT model from Hugging Face. A second translation of the same data, obtained through the Google Translate API, is also considered. All of the data is then evaluated.


# Libraries and dependencies

In [None]:
! pip install googletrans
! pip install transformers
! pip install laserembeddings
! pip install sentence_transformers
! pip install language_tool_python
! pip install textstat

In [None]:
! python -m laserembeddings download-models

Downloading models into /usr/local/lib/python3.7/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


In [None]:
import pandas as pd
import numpy as np
import torch
import language_tool_python
import re
import textstat
from laserembeddings import Laser
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import MarianMTModel, MarianTokenizer
from transformers.hf_api import HfApi
from googletrans import Translator
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

# Data loading...

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# ! tar -xf /content/drive/MyDrive/MT_sentence_simpl/data-simplification.tar.bz2
! gdown https://drive.google.com/uc?id=0B6-YKFW-MnbOYWxUMTBEZ1FBam8
! tar -xf /content/data-simplification.tar.bz2

In [None]:
# train-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.train.src') as f:
 data_train =  f.readlines()
data_train = [i.strip('\n') for i in data_train]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.train.dst') as f:
 data_train_dst =  f.readlines()
data_train_dst = [i.strip('\n') for i in data_train_dst]

df_train = pd.DataFrame.from_records(zip(data_train, data_train_dst), columns=['src', 'dst'])

# valid-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.valid.src') as f:
 data_valid =  f.readlines()
data_valid = [i.strip('\n') for i in data_valid]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.valid.dst') as f:
 data_valid_dst =  f.readlines()
data_valid_dst = [i.strip('\n') for i in data_valid_dst]

df_val = pd.DataFrame.from_records(zip(data_valid, data_valid_dst), columns=['src', 'dst'])

# test-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.test.src') as f:
 data_test =  f.readlines()
data_test = [i.strip('\n') for i in data_test]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.test.dst') as f:
 data_test_dst =  f.readlines()
data_test_dst = [i.strip('\n') for i in data_test_dst]

df_test = pd.DataFrame.from_records(zip(data_test, data_test_dst), columns=['src', 'dst'])

### First look at the data:

In [None]:
df_train.sample(5)

Unnamed: 0,src,dst
96347,"Daintree National Park and Cape Tribulation , about 130 k north of Cairns , are popular areas for experiencing a tropical rainforest .","The Daintree National Park and Cape Tribulation , about 130km north of Cairns , are popular areas for experiencing a tropical rainforest ."
65533,YÅ kai are a class of preternatural creatures in Japanese folklore ranging from the evil oni -LRB- ogre -RRB- to the mischievous kitsune -LRB- fox -RRB- or snow woman Yuki-onna . Some possess part animal and part human features -LRB- e.g. Kappa and Tengu -RRB- .,"YÅ kai are a type of creatures appeared in Japanese old stories , such as oni -LRB- the evil -RRB- , kappa , or tengu ."
239,"Photography is the process , activity and art of creating still or moving pictures by recording radiation on a radiation-sensitive medium , such as a photographic film , or an electronic sensor .",The picture the lens makes is recorded on photographic film .
200665,Qalandarabad is major town in Baldheri Union Council and is famous throughout Hazara for its Chapal Kebabs .,Qalandarabad
232250,"Negros Oriental -LRB- Filipino : Silangang Negros -RRB- -LRB- also called Oriental Negros , '' Eastern Negros '' -RRB- is a province of the Philippines located in the Central Visayas region .","Negros Oriental , sometimes called Oriental Negros -LRB- East Negros -RRB- , is a province in the Philippines ."


In [None]:
df_test.sample(5)

Unnamed: 0,src,dst
345,"On 15 April 1989 , the ground was the scene of one of the worst sporting tragedies of all time when 94 Liverpool fans -LRB- the final death toll was 96 -RRB- were crushed to death in an FA Cup semi-final in the infamous Hillsborough disaster .",The device can be designed for use in less exact environments .
192,"The park , which receives approximately thirty-five million visitors annually , is the most visited urban park in the United States .",The lawyer Brandon -LRB- Waise Lee -RRB- was his idol as MK Sun grew up to be a lawyer .
147,It was founded in 1440 by King Henry VI as '' The King 's College of Our Lady of Eton besides Wyndsor '' .,"Later , Esperanto speakers started to see the language and culture that had grown up around it as ends in themselves , though Esperanto is never accepted by the United Nations of other international organizations ."
66,"Etymology The Portuguese Man O ' War -LRB- named caravela-portuguesa in Portuguese -RRB- is named for its air bladder , which looks similar to the triangular sails of the Portuguese ship -LRB- man-of-war -RRB- Caravela latina -LRB- two - or three-masted lateen-rigged ship caravel -RRB- , of the 15th and 16th centuries .","Rollo swore fealty to Charles , changed to Christianity , and undertook to standby the northern region of France against the incursions of other Viking groups ."
41,"Sudan is situated in northern Africa , bordering the Red Sea and it has a coastline of 853 km along the Red Sea .",Working Group I : makes note of climate system and climate change


Initial sizes are:

* train: 296402
* dev: 992
* test: 359

However, some of the sentences are of very poor quality (fragments, list items, lines without final punctuation), so they are removed before translation.

In [None]:
# dataset sizes

df_train.shape[0], df_val.shape[0], df_test.shape[0]

(296402, 992, 359)

# Basic cleaning

In [None]:
# preprocessing to clean the texts

# train-------------------------------------------------------------
for i, obj  in enumerate(data_train):
  data_train[i] = data_train[i].replace('-LRB-', '(')
  data_train[i] = data_train[i].replace('-RRB-', ')')

for i, obj in enumerate(data_train_dst):
  data_train_dst[i] = data_train_dst[i].replace('-LRB-', '(')
  data_train_dst[i] = data_train_dst[i].replace('-RRB-', ')')

# valid-------------------------------------------------------------
for i, obj  in enumerate(data_valid):
  data_valid[i] = data_valid[i].replace('-LRB-', '(')
  data_valid[i] = data_valid[i].replace('-RRB-', ')')

for i, obj in enumerate(data_valid_dst):
  data_valid_dst[i] = data_valid_dst[i].replace('-LRB-', '(')
  data_valid_dst[i] = data_valid_dst[i].replace('-RRB-', ')')

# test-------------------------------------------------------------
for i, obj  in enumerate(data_test):
  data_test[i] = data_test[i].replace('-LRB-', '(')
  data_test[i] = data_test[i].replace('-RRB-', ')')

for i, obj in enumerate(data_test_dst):
  data_test_dst[i] = data_test_dst[i].replace('-LRB-', '(')
  data_test_dst[i] = data_test_dst[i].replace('-RRB-', ')')
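The six loops above all apply the same two replacements: `-LRB-` and `-RRB-` are the Penn Treebank escapes for round brackets. As a minimal sketch, the step could be expressed once as a function; `restore_brackets` and `PTB_BRACKETS` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): map the
# Penn Treebank bracket tokens used in WikiLarge back to parentheses.
PTB_BRACKETS = {'-LRB-': '(', '-RRB-': ')'}


def restore_brackets(sentences):
    """Return a new list with -LRB-/-RRB- replaced by '(' and ')'."""
    cleaned = []
    for sentence in sentences:
        for token, char in PTB_BRACKETS.items():
            sentence = sentence.replace(token, char)
        cleaned.append(sentence)
    return cleaned


example = ['the evil oni -LRB- ogre -RRB- .']
print(restore_brackets(example))
```

With such a helper, each split would need a single call, for example `data_train = restore_brackets(data_train)`.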


Extra preprocessing: drop malformed sentences (together with their paired sentence) and remove bad punctuation from the rest

In [None]:
deleted = []
for i, j in enumerate(data_train):
  if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
    deleted.append(j)
    data_train.pop(i)
    data_train_dst.pop(i)
  else:
    if '()' in data_train[i]:
       data_train[i] = data_train[i].replace('()', '')
    
    if '( )' in data_train[i]:
       data_train[i] = data_train[i].replace('( )', '')
    if ',,' in data_train[i]:
      data_train[i] = data_train[i].replace(',,', '')
    if ';)' in data_train[i]:
      data_train[i] = data_train[i].replace(';)', ')')
    if ':)' in data_train[i]:
      data_train[i] = data_train[i].replace(':)', ')')
    if '(:' in data_train[i]:
      data_train[i] = data_train[i].replace('(:', '(')
    if '(;' in data_train[i]:
      data_train[i] = data_train[i].replace('(;', '(')
    if "& ndash ;" in data_train[i]: 
      data_train[i] = data_train[i].replace("& ndash ;", '')
    if "& minus ;" in data_train[i]: 
      data_train[i] = data_train[i].replace("& minus ;", '')
    if "( , ; ; ; )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( , ; ; ; )", '')
    if "( , ; , ; , ; ; )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( , ; , ; , ; ; )", '')
    if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

19493
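One caveat about the loop above: it calls `pop(i)` on the list that `enumerate` is still iterating over. In Python this shifts the remaining elements left, so the sentence immediately after a removed one is never checked. The sketch below shows one possible alternative that builds new lists instead. `filter_pairs` and `is_bad` are hypothetical names, and `is_bad` is a deliberately simplified stand-in for the notebook's full condition; using this approach would also change the counts reported in the cell outputs.

```python
def filter_pairs(src_sentences, dst_sentences, is_bad):
    """Keep only pairs whose source sentence passes the check.

    Builds new lists instead of popping during iteration, so no
    element is skipped.
    """
    kept_src, kept_dst, deleted = [], [], []
    for src, dst in zip(src_sentences, dst_sentences):
        if is_bad(src):
            deleted.append(src)
        else:
            kept_src.append(src)
            kept_dst.append(dst)
    return kept_src, kept_dst, deleted


# Simplified stand-in for the notebook's condition (assumption):
# flag empty lines, one-word lines, and lines without final punctuation.
def is_bad(sentence):
    return (not sentence
            or len(sentence.split()) == 1
            or sentence[-1] not in "?.'")


src = ['Walter Kogler', 'a family of', 'It is a town .', 'Paris']
dst = ['x .', 'y .', 'It is a town .', 'z .']
kept_src, kept_dst, deleted = filter_pairs(src, dst, is_bad)
print(kept_src, deleted)
```

Note that the two consecutive bad lines at the start of `src` are both caught here, whereas a pop-while-iterating loop would skip the second one.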

In [None]:
deleted[:10]

["Many still refer to 25 , 50 and 75 paise as 4 , 8 and 12 annas respectively , not unlike the usage of '' bit '' in American English for â",
 'Chronic obstructive pulmonary disease ( COPD ) , interstitial lung disease ( ILD )',
 '1810 & ndash ; Ernst Kummer , German mathematician ( d. 1893 )',
 'That same year , Mattel also introduced Fashion Polly !',
 "They have appeared on the cover of that company 's free magazine , V !",
 "She won two matches against Sonja Graf for the Women 's World Champion title ; ( +3 -- 1 = 0 ) at Rotterdam 1934 , and ( +9 -- 2 = 5 ) at Semmering 1937 .",
 'Walter Kogler',
 'a family of',
 "Poanes benito Freeman , 1979 -- Benito 's Skipper",
 ", Prutky 's Travels in Ethiopia and other Countries with notes by Richard Pankhurst ( London : Hakluyt Society , 1991 ) In the South Asian subcontinent , food is traditionally always eaten without utensils ."]

In [None]:
deleted = []
for i, j in enumerate(data_train_dst):
  if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2  or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
    deleted.append(j)
    data_train_dst.pop(i)
    data_train.pop(i)
  else:
    if '()' in data_train_dst[i]:
       data_train_dst[i] = data_train_dst[i].replace('()', '')
    
    if '( )' in data_train_dst[i]:
       data_train_dst[i] = data_train_dst[i].replace('( )', '')
    if ',,' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(',,', '')
    if ';)' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(';)', ')')
    if ':)' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(':)', ')')
    if '(:' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace('(:', '(')
    if '(;' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace('(;', '(')
    if "& ndash ;" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("& ndash ;", '')
    if "& minus ;" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("& minus ;", '')
    if "( , ; ; ; )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( , ; ; ; )", '')
    if "( , ; , ; , ; ; )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( , ; , ; , ; ; )", '')
    if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_train = pd.DataFrame.from_records(list(zip(data_train, data_train_dst)), columns=['src', 'dst'])
df_train.shape[0]

246993

In [None]:
deleted = []
for i, j in enumerate(data_valid):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in (
    '?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_valid.pop(i)
        data_valid_dst.pop(i)
    else:
        if '()' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('()', '')

        if '( )' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('( )', '')
        if ',,' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(',,', '')
        if ';)' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(';)', ')')
        if ':)' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(':)', ')')
        if '(:' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('(:', '(')
        if '(;' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('(;', '(')
        if "& ndash ;" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("& ndash ;", '')
        if "& minus ;" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

62

In [None]:
deleted = []
for i, j in enumerate(data_valid_dst):
    if (len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or j[-1] not in ('?.\'')
            or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j):
        deleted.append(j)
        data_valid_dst.pop(i)
        data_valid.pop(i)
    else:
        if '()' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('()', '')

        if '( )' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('( )', '')
        if ',,' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(',,', '')
        if ';)' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(';)', ')')
        if ':)' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(':)', ')')
        if '(:' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('(:', '(')
        if '(;' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('(;', '(')
        if "& ndash ;" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("& ndash ;", '')
        if "& minus ;" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_val = pd.DataFrame.from_records(list(zip(data_valid, data_valid_dst)), columns=['src', 'dst'])
df_val.shape[0]

818

In [None]:
deleted = []
for i, j in enumerate(data_test):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in (
    '?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_test.pop(i)
        data_test_dst.pop(i)
    else:
        if '()' in data_test[i]:
            data_test[i] = data_test[i].replace('()', '')

        if '( )' in data_test[i]:
            data_test[i] = data_test[i].replace('( )', '')
        if ',,' in data_test[i]:
            data_test[i] = data_test[i].replace(',,', '')
        if ';)' in data_test[i]:
            data_test[i] = data_test[i].replace(';)', ')')
        if ':)' in data_test[i]:
            data_test[i] = data_test[i].replace(':)', ')')
        if '(:' in data_test[i]:
            data_test[i] = data_test[i].replace('(:', '(')
        if '(;' in data_test[i]:
            data_test[i] = data_test[i].replace('(;', '(')
        if "& ndash ;" in data_test[i]:
            data_test[i] = data_test[i].replace("& ndash ;", '')
        if "& minus ;" in data_test[i]:
            data_test[i] = data_test[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_test[i]:
            data_test[i] = data_test[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_test[i]:
            data_test[i] = data_test[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_test[i]:
            data_test[i] = data_test[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

0

In [None]:
deleted = []
for i, j in enumerate(data_test_dst):
    if (len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or j[-1] not in ('?.\'')
            or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j):
        deleted.append(j)
        data_test_dst.pop(i)
        data_test.pop(i)
    else:
        if '()' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('()', '')

        if '( )' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('( )', '')
        if ',,' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(',,', '')
        if ';)' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(';)', ')')
        if ':)' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(':)', ')')
        if '(:' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('(:', '(')
        if '(;' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('(;', '(')
        if "& ndash ;" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("& ndash ;", '')
        if "& minus ;" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_test = pd.DataFrame.from_records(list(zip(data_test, data_test_dst)), columns=['src', 'dst'])
df_test.shape[0]

326

In [None]:
 sum(df_train.src == df_train.dst)

11

In [None]:
sum(df_train['src'].apply(lambda x: len(x.split(' ')))<df_train['dst'].apply(lambda x: len(x.split(' '))))

70489

In [None]:
70489/df_train.shape[0]

0.28538865473920316

In [None]:
sum(df_val['src'].apply(lambda x: len(x.split(' ')))<df_val['dst'].apply(lambda x: len(x.split(' '))))

230

In [None]:
230/df_val.shape[0]

0.28117359413202936

In [None]:
sum(df_test['src'].apply(lambda x: len(x.split(' ')))<df_test['dst'].apply(lambda x: len(x.split(' '))))

82

In [None]:
82/df_test.shape[0]

0.25153374233128833
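The cells above count the pairs in which the source sentence has fewer space-separated tokens than its simplification, then divide by the split size; this is the case for roughly 25-29% of pairs in every split. As a minimal sketch, the count and the ratio can be combined in one function. `share_longer` is a hypothetical name, and the function works on any two equally long sequences of strings, including the `src` and `dst` columns of the DataFrames.

```python
def share_longer(src_sentences, dst_sentences):
    """Fraction of pairs whose simplification has more tokens than the source.

    Tokens are obtained with split(' '), as in the cells above.
    """
    pairs = list(zip(src_sentences, dst_sentences))
    longer = sum(len(src.split(' ')) < len(dst.split(' ')) for src, dst in pairs)
    return longer / len(pairs)


toy_src = ['a b c', 'a b', 'a b c d']
toy_dst = ['a b', 'a b c', 'a']
print(share_longer(toy_src, toy_dst))
```

For example, `share_longer(df_train['src'], df_train['dst'])` would be expected to reproduce the train ratio computed above.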

After cleaning, the sizes reported by the cells above are:

* train: 246993
* dev: 818
* test: 326

So cleaning removed a noticeable share of every split, including 33 of the 359 test pairs.

# Translation
Two translations are used:
* Machine translation with the 'Helsinki-NLP/opus-mt-en-ru' model from the transformers library
* Machine translation with the Google Translate API (these translations are downloaded ready-made because the API is paid)



/usr/local/lib/python3.7/dist-packages/torch/tensor.py \
https://github.com/simple-ai-pixel/gpt3_ru/blob/main/gpt.ipynb

In [None]:
# helper functions

def batch_generator(
        list_of_sentences,
        size=64
):
    # Yield consecutive slices of at most `size` sentences.
    # Stepping by `size` avoids an empty final batch when the length is a multiple of `size`.
    for start in range(0, len(list_of_sentences), size):
        yield list_of_sentences[start:start + size]


def translate_to_russin(model, tok, src, dst):

  translations_src = []
  for i in tqdm_notebook(batch_generator(src)):
    l = tok(i, return_tensors="pt", padding=True).to(device)
    translated = model.generate(**l)
    translations_src.extend([tok.decode(t, skip_special_tokens=True) for t in translated])

  translations_dst = []
  for i in tqdm_notebook(batch_generator(dst)):
    l = tok(i, return_tensors="pt", padding=True).to(device)
    translated = model.generate(**l)
    translations_dst.extend([tok.decode(t, skip_special_tokens=True) for t in translated])

  return translations_src, translations_dst


In [None]:
# import en-ru model from transformers
model_name = 'Helsinki-NLP/opus-mt-en-ru'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
model.to(device)

### Translation in parts because of GPU limitations

In [None]:
# val--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_valid, data_valid_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_val['target_x'] = src
df_val['target_y'] = dst
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_val_mtt.csv')

In [None]:
# test--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_test, data_test_dst)
for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_test['target_x'] = src
df_test['target_y'] = dst
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_mtt.csv')

In [None]:
# train--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_train[232000:], data_train_dst[232000:])

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_train_22000 = pd.DataFrame(list(zip(data_train[232000:], data_train_dst[232000:], src, dst)), columns=['src', 'dst', 'target_x', 'target_y'])
df_train_22000.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt_232-246k.csv')


### Normal code for the whole dataset

Some cleaning to remove common translation artifacts is applied
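The cleaning consists of four string replacements that remove escaped HTML entities and empty brackets left in the model output. A minimal sketch of the same step as a reusable function follows; `strip_mt_artifacts` and `MT_ARTIFACTS` are hypothetical names introduced for illustration, and the replacements are applied in the same order as in the cells.

```python
# Hypothetical helper: the same four replacements the cells apply inline.
MT_ARTIFACTS = ('&gt;', '&lt;', '()', '&quot;')


def strip_mt_artifacts(sentences):
    """Return a new list with common translation artifacts removed."""
    cleaned = []
    for sentence in sentences:
        for artifact in MT_ARTIFACTS:
            sentence = sentence.replace(artifact, '')
        cleaned.append(sentence)
    return cleaned


sample = ['&quot;Привет&quot; ()', 'a &gt; b']
print(strip_mt_artifacts(sample))
```

The function leaves surrounding whitespace untouched, exactly as the inline replacements do, so a removed artifact can leave a double or trailing space behind.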

In [None]:
# train--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_train, data_train_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_train['target_x'] = src
df_train['target_y'] = dst

# val--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_valid, data_valid_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_val['target_x'] = src
df_val['target_y'] = dst

# test--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_test, data_test_dst)
for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_test['target_x'] = src
df_test['target_y'] = dst

df_train.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt.csv')
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_val_mtt.csv')
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_mtt.csv')

### Making the whole dataset from pieces

In [None]:
# 1-2
! gdown https://drive.google.com/uc?id=1-0oww8dmt7jRiE4uMVVd1GqrU51rpWNm
#https://drive.google.com/file/d/1-0oww8dmt7jRiE4uMVVd1GqrU51rpWNm/view?usp=sharing

# 2 - 22
! gdown https://drive.google.com/uc?id=1-2CTWEuaCaTeOSWVHWVD9hytmnsT-xov
# https://drive.google.com/file/d/1-2CTWEuaCaTeOSWVHWVD9hytmnsT-xov/view?usp=sharing

# 22-42
! gdown https://drive.google.com/uc?id=1-4tocDbuVFdcEfWnGsXJYl6UNUigT2O6
# https://drive.google.com/file/d/1-4tocDbuVFdcEfWnGsXJYl6UNUigT2O6/view?usp=sharing

# 42-62
! gdown https://drive.google.com/uc?id=1-9Yxy9dIQT_jtih9cgkAIyQibeZ1iPAF
# https://drive.google.com/file/d/1-9Yxy9dIQT_jtih9cgkAIyQibeZ1iPAF/view?usp=sharing

# 62-82
! gdown https://drive.google.com/uc?id=15HE_G_1iRvLsfUHKOOa0bvuYYtAJ503P
# https://drive.google.com/file/d/15HE_G_1iRvLsfUHKOOa0bvuYYtAJ503P/view?usp=sharing

# 82 - 112
! gdown https://drive.google.com/uc?id=1-1zRK3DO25xazojKj31he1U9cUmtqovg
# https://drive.google.com/file/d/1-1zRK3DO25xazojKj31he1U9cUmtqovg/view?usp=sharing

# 112-132
! gdown https://drive.google.com/uc?id=1-2o-bgldNJWYvEtY6uN0w5fae6KbajdR
# https://drive.google.com/file/d/1-2o-bgldNJWYvEtY6uN0w5fae6KbajdR/view?usp=sharing

# 132 -152
! gdown https://drive.google.com/uc?id=1e7sPM2AL03EBlJuNzCQgKHL_Zq8Tc5UP
#https://drive.google.com/file/d/1e7sPM2AL03EBlJuNzCQgKHL_Zq8Tc5UP/view?usp=sharing

# 152 - 182
! gdown https://drive.google.com/uc?id=1--lcK4KBaK_hkK4lVBbf-ze7lSdxSt2k
# https://drive.google.com/file/d/1--lcK4KBaK_hkK4lVBbf-ze7lSdxSt2k/view?usp=sharing

# 182 - 212
! gdown https://drive.google.com/uc?id=1-0loBcmLIYdxtMIBWxyyAHQyPCp1mJF9
# https://drive.google.com/file/d/1-0loBcmLIYdxtMIBWxyyAHQyPCp1mJF9/view?usp=sharing

# 212 - 232
! gdown https://drive.google.com/uc?id=1-4qPowP66ximyi7Q6lqsit9BKmqAinzz
# https://drive.google.com/file/d/1-4qPowP66ximyi7Q6lqsit9BKmqAinzz/view?usp=sharing

# 232 - 246 
! gdown https://drive.google.com/uc?id=1-63gYL7NMFB4Tvh0y24CkDd370ZOsbTV
# https://drive.google.com/file/d/1-63gYL7NMFB4Tvh0y24CkDd370ZOsbTV/view?usp=sharing

In [None]:
# from pathlib import Path
# for i in p.glob('./wiki_train_mtt*.csv'):
#   print(p.cwd().joinpath(i))

In [None]:
paths = ['/content/wiki_train_mtt_1-2k.csv',
         '/content/wiki_train_mtt_2-22k.csv',
         '/content/wiki_train_mtt_22-42k.csv',
         '/content/wiki_train_mtt_42-62k.csv',
         '/content/wiki_train_mtt_62-82k.csv',
         '/content/wiki_train_mtt_82-112k.csv',
         '/content/wiki_train_mtt_112-132k.csv',
         '/content/wiki_train_mtt_132-152k.csv',
         '/content/wiki_train_mtt_152-182k.csv',
         '/content/wiki_train_mtt_182-212k.csv', 
         '/content/wiki_train_mtt_212-232k.csv',
         '/content/wiki_train_mtt_232-246k.csv'
         ]
# read every piece and concatenate them in the order listed above
train_full = pd.concat([pd.read_csv(path) for path in paths])

train_full.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt.csv', index=False)
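The order of the pieces matters, because the rows must end up in the same order as the cleaned training data. If the list of files were collected automatically (for instance with a glob, as the commented-out cell hints) instead of typed by hand, a plain lexicographic sort would put `112-132k` before `2-22k`. A minimal sketch of sorting by the numeric start of the range follows; `chunk_start` is a hypothetical helper and it assumes file names that end in `_<start>-<end>k.csv`, as the ones above do.

```python
import re


def chunk_start(path):
    """Return the first number of the '<start>-<end>k' range in a chunk file name."""
    match = re.search(r'_(\d+)-(\d+)k\.csv$', path)
    if match is None:
        raise ValueError(f'unexpected chunk name: {path}')
    return int(match.group(1))


names = [
    '/content/wiki_train_mtt_112-132k.csv',
    '/content/wiki_train_mtt_1-2k.csv',
    '/content/wiki_train_mtt_22-42k.csv',
    '/content/wiki_train_mtt_2-22k.csv',
]
ordered = sorted(names, key=chunk_start)
print(ordered)
```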

Load the translations produced by the transformer model

In [None]:
! gdown https://drive.google.com/uc?id=1--3BHQFq5hKy5RiI4G1nfFG6Erj9iN4s
! gdown https://drive.google.com/uc?id=17dwqCJc7hWpG39LrJMFmsqbTMiFt7KtD
! gdown https://drive.google.com/uc?id=1URMCB5ocu_Xeco_df3ol4XnfY1k6LpmU

df_train_mt = pd.read_csv('/content/wiki_train_mtt.csv')
df_val_mt = pd.read_csv('/content/wiki_val_mtt.csv')
df_test_mt = pd.read_csv('/content/wiki_test_mtt.csv')

Downloading...
From: https://drive.google.com/uc?id=1--3BHQFq5hKy5RiI4G1nfFG6Erj9iN4s
To: /content/wiki_test_mtt.csv
100% 220k/220k [00:00<00:00, 28.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=17dwqCJc7hWpG39LrJMFmsqbTMiFt7KtD
To: /content/wiki_val_mtt.csv
100% 563k/563k [00:00<00:00, 30.8MB/s]
Downloading...
From: https://drive.google.com/uc?id=1URMCB5ocu_Xeco_df3ol4XnfY1k6LpmU
To: /content/wiki_train_mtt.csv
167MB [00:01, 114MB/s]


A better translation, obtained via the Google Translate API

In [None]:
! gdown https://drive.google.com/uc?id=1dB3X-Wx8qU_5nDG_pxAmLvo5H_sgnHrE
! gdown https://drive.google.com/uc?id=1bJo8TagTGKa0uyppQRqsHrKHyYO5tcZc
! gdown https://drive.google.com/uc?id=11lqipq6ggrgCk8bVxQ4-uuPVMCKN5ebU

df_val_gl = pd.read_csv('/content/wiki_dev_cleaned_translated_sd.csv')
df_test_gl = pd.read_csv('/content/wiki_test_cleaned_translated_sd.csv')
df_train_gl = pd.read_csv('/content/wiki_train_cleaned_translated_sd.csv')

Downloading...
From: https://drive.google.com/uc?id=1dB3X-Wx8qU_5nDG_pxAmLvo5H_sgnHrE
To: /content/wiki_train_cleaned_translated_sd.csv
172MB [00:01, 130MB/s]
Downloading...
From: https://drive.google.com/uc?id=1bJo8TagTGKa0uyppQRqsHrKHyYO5tcZc
To: /content/wiki_dev_cleaned_translated_sd.csv
100% 545k/545k [00:00<00:00, 81.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=11lqipq6ggrgCk8bVxQ4-uuPVMCKN5ebU
To: /content/wiki_test_cleaned_translated_sd.csv
100% 254k/254k [00:00<00:00, 37.1MB/s]


In [None]:
df_val_gl.sample(3)

Unnamed: 0.1,Unnamed: 0,src,dst,target_x,target_y
208,208,"Additionally , since this AO system provides an excellent and stable correction ( angular resolution of 0.060 arcsec in K band ) , a 15-km moonlet at 1000 km of Hektor 's primary was detected .","Additionally , since this AO system provides an excellent and stable correction ( angular resolution of 0.060 arcsec in K band ) , a 15-km moon at 1000 km from Hektor was found .","Кроме того, поскольку эта система AO обеспечивает отличную и стабильную коррекцию (угловое разрешение 0,060 угловой секунды в диапазоне K), была обнаружена луна длиной 15 км на 1000 км от главной звезды Гектора.","Кроме того, поскольку эта система AO обеспечивает отличную и стабильную коррекцию (угловое разрешение 0,060 угловой секунды в диапазоне K), была обнаружена 15-километровая луна в 1000 км от Гектора."
29,29,Class 316 and Class 457 were TOPS classifications assigned to a single electric multiple unit ( EMU ) at different stages of its use as a prototype for the Networker series .,Class 316 and Class 457 were two suggested TOPS classifications . They were given to a single electric multiple unit ( EMU ) at different stages of its use as a prototype for the Networker series .,"Классы 316 и 457 были классификациями TOPS, присвоенными одиночному электрическому блоку (EMU) на разных этапах его использования в качестве прототипа для серии Networker.",Класс 316 и класс 457 были двумя предложенными классификациями TOPS. Они были переданы в единый электрический многоканальный блок (EMU) на разных этапах его использования в качестве прототипа для серии Networker.
51,51,"Mount Batur ( Gunung Batur ) is an active volcano located at the center of two concentric calderas north west of Mount Agung , Bali , Indonesia .",Mount Batur or Gunung Batur is a volcano on Bali .,"Гора Батур (Гунунг Батур) - действующий вулкан, расположенный в центре двух концентрических кальдер к северо-западу от горы Агунг, Бали, Индонезия.",Гора Батур или Гунунг Батур - вулкан на Бали.


# Evaluation of all the data

### WikiLarge

* Cosine Similarity between original/simple sentences
* Flesch Kincaid Grade Level
* Grammar Check

### WikiLarge MT (Helsinki OPUS) and Google API translations + original dataset (dev + test parts) collected via Toloka

* Cosine Similarity between original/simple sentences
* Cosine Similarity between original sentences and their translation, simple sentences and their translations
* Flesch Kincaid Grade Level
* Grammar Check



# Evaluation of WikiLarge

### Cosine Similarity between original/simple sentences

In [None]:
from transformers import RobertaTokenizer, RobertaModel, AutoConfig, AutoTokenizer, AutoModelForMaskedLM
device = "cuda" if torch.cuda.is_available() else "cpu"
config = AutoConfig.from_pretrained("roberta-base") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
config.output_hidden_states = True

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", config=config)
model.to(device)

# from sklearn.metrics.pairwise import cosine_similarity

# shape should be [1, something (768, ex)]

# import numpy as np
# def cs(a, b):
#   return (a @ b.T)/(np.linalg.norm(a)*np.linalg.norm(b))

def calc_cos_sim(df, model, tok, x, y, column_name):
    """Add df[column_name]: cosine similarity between mean-pooled embeddings of columns x and y."""

    def embed(text):
        # Pad/truncate to 50 tokens; padding positions are included in the mean below
        ids = tok.encode(text, padding='max_length', max_length=50, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            output = model(ids)
        # output[-1] is the tuple of hidden states; index 0 is the embedding-layer output
        sent_emb = output[-1][0]
        return sent_emb.mean(axis=1).cpu().numpy()

    cos_sim = []
    for index, row in df.iterrows():
        emb_source = embed(row[x])
        emb_target = embed(row[y])
        cos_sim.append(cosine_similarity(emb_source, emb_target)[0][0])
    df[column_name] = cos_sim
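
Because `calc_cos_sim` pads every sentence to 50 tokens and averages over all positions, padding vectors contribute to the sentence embedding. A minimal sketch of mask-aware mean pooling followed by cosine similarity, written in plain Python so it runs without a model. `masked_mean` and `cosine` are hypothetical helper names, and the toy vectors stand in for token embeddings:

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy example: two real tokens followed by two padding positions.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]
pooled = masked_mean(tokens, mask)        # padding rows are ignored
unmasked = masked_mean(tokens, [1] * 4)   # equivalent to a plain mean over all positions
print(pooled, unmasked, cosine(pooled, [1.0, 0.0]))
```

With a Hugging Face tokenizer, the mask is returned when the tokenizer is called directly (`tok(text, ...)["attention_mask"]`) rather than through `tok.encode`.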

In [None]:
calc_cos_sim(df_test, model, tok, 'src', 'dst', 'cos_sim_src_dst')
calc_cos_sim(df_val, model, tok, 'src', 'dst', 'cos_sim_src_dst')
calc_cos_sim(df_train, model, tok, 'src', 'dst', 'cos_sim_src_dst')

In [None]:
df_train.cos_sim_src_dst.mean(), df_val.cos_sim_src_dst.mean(), df_test.cos_sim_src_dst.mean()

(0.8546753056345266, 0.8547325381886187, 0.9688304633450654)

### Flesch Kincaid Grade Level

In [None]:
textstat.set_lang('en')

By concatenating all sentences

In [None]:
a, b = ' '.join(list(df_train['src'].values)), ' '.join(list(df_train['dst'].values))
fk_src = textstat.flesch_kincaid_grade(a)
fk_dst = textstat.flesch_kincaid_grade(b)
asl_src = textstat.syllable_count(a)
asl_dst = textstat.syllable_count(b)
aws_src = textstat.lexicon_count(a)
aws_dst = textstat.lexicon_count(b)
ease_src = textstat.flesch_reading_ease(a)
ease_dst = textstat.flesch_reading_ease(b)
fk_src, fk_dst, asl_src, asl_dst, aws_src, aws_dst, ease_src, ease_dst

(7.1, 5.0, 6415755, 5114582, 5546816, 4426502, 106.25, 113.27)
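
For reference, the grade-level and reading-ease scores above follow the standard Flesch formulas, which only need word, sentence, and syllable counts. The sketch below restates them; `fk_grade` and `reading_ease` are illustrative names, and `textstat` applies its own tokenization, syllable counting, and rounding, so its outputs will not match these helpers exactly. Note that in the cell above `asl_*` and `aws_*` hold raw syllable and word totals, not averages.

```python
def fk_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * n_words / n_sentences + 11.8 * n_syllables / n_words - 15.59

def reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease from raw counts (higher means easier)."""
    return 206.835 - 1.015 * n_words / n_sentences - 84.6 * n_syllables / n_words

# Syllables per word for the train source, from the totals printed above.
syllables_per_word_src = 6415755 / 5546816
print(round(syllables_per_word_src, 3))

# Toy counts (not from the dataset): 100 words, 5 sentences, 150 syllables.
print(fk_grade(100, 5, 150), reading_ease(100, 5, 150))
```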

In [None]:
a, b = ' '.join(list(df_val['src'].values)), ' '.join(list(df_val['dst'].values))
fk_src = textstat.flesch_kincaid_grade(a)
fk_dst = textstat.flesch_kincaid_grade(b)
asl_src = textstat.syllable_count(a)
asl_dst = textstat.syllable_count(b)
aws_src = textstat.lexicon_count(a)
aws_dst = textstat.lexicon_count(b)
ease_src = textstat.flesch_reading_ease(a)
ease_dst = textstat.flesch_reading_ease(b)
fk_src, fk_dst, asl_src, asl_dst, aws_src, aws_dst, ease_src, ease_dst

(7.3, 5.2, 21474, 17343, 18633, 15033, 105.73, 112.62)

In [None]:
a, b = ' '.join(list(df_test['src'].values)), ' '.join(list(df_test['dst'].values))
fk_src = textstat.flesch_kincaid_grade(a)
fk_dst = textstat.flesch_kincaid_grade(b)
asl_src = textstat.syllable_count(a)
asl_dst = textstat.syllable_count(b)
aws_src = textstat.lexicon_count(a)
aws_dst = textstat.lexicon_count(b)
ease_src = textstat.flesch_reading_ease(a)
ease_dst = textstat.flesch_reading_ease(b)
fk_src, fk_dst, asl_src, asl_dst, aws_src, aws_dst, ease_src, ease_dst

(4.9, 4.3, 7359, 7101, 6465, 6307, 115.64, 117.72)

Mean value over sentences

In [None]:
df_train['fkg_src'] = df_train['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_train['fkg_dst'] = df_train['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_val['fkg_src'] = df_val['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_val['fkg_dst'] = df_val['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_test['fkg_src'] = df_test['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_test['fkg_dst'] = df_test['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))

In [None]:
df_train['asl_src'] = df_train['src'].apply(lambda x: textstat.syllable_count(x))
df_train['asl_dst'] = df_train['dst'].apply(lambda x: textstat.syllable_count(x))

df_val['asl_src'] = df_val['src'].apply(lambda x: textstat.syllable_count(x))
df_val['asl_dst'] = df_val['dst'].apply(lambda x: textstat.syllable_count(x))

df_test['asl_src'] = df_test['src'].apply(lambda x: textstat.syllable_count(x))
df_test['asl_dst'] = df_test['dst'].apply(lambda x: textstat.syllable_count(x))



df_train['aws_src'] = df_train['src'].apply(lambda x: textstat.lexicon_count(x))
df_train['aws_dst'] = df_train['dst'].apply(lambda x: textstat.lexicon_count(x))

df_val['aws_src'] = df_val['src'].apply(lambda x: textstat.lexicon_count(x))
df_val['aws_dst'] = df_val['dst'].apply(lambda x: textstat.lexicon_count(x))

df_test['aws_src'] = df_test['src'].apply(lambda x: textstat.lexicon_count(x))
df_test['aws_dst'] = df_test['dst'].apply(lambda x: textstat.lexicon_count(x))

df_train['ease_src'] = df_train['src'].apply(lambda x: textstat.flesch_reading_ease(x))
df_train['ease_dst'] = df_train['dst'].apply(lambda x: textstat.flesch_reading_ease(x))

df_val['ease_src'] = df_val['src'].apply(lambda x: textstat.flesch_reading_ease(x))
df_val['ease_dst'] = df_val['dst'].apply(lambda x: textstat.flesch_reading_ease(x))

df_test['ease_src'] = df_test['src'].apply(lambda x: textstat.flesch_reading_ease(x))
df_test['ease_dst'] = df_test['dst'].apply(lambda x: textstat.flesch_reading_ease(x))

In [None]:
df_train.fkg_src.mean(), df_train.fkg_dst.mean(), df_train.fkg_src.mean() - df_train.fkg_dst.mean()

(11.971624297044338, 9.28312057426604, 2.6885037227782984)

In [None]:
df_val.fkg_src.mean(), df_val.fkg_dst.mean(), df_val.fkg_src.mean() - df_val.fkg_dst.mean()

(12.096454767726172, 9.439119804400974, 2.657334963325198)

In [None]:
df_test.fkg_src.mean(), df_test.fkg_dst.mean(), df_test.fkg_src.mean() - df_test.fkg_dst.mean()

(11.017484662576688, 9.821779141104283, 1.195705521472405)
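
These per-sentence means are noticeably higher than the grades computed on the concatenated text earlier (for example 11.97 versus 7.1 for the train source). The two aggregations are not expected to agree: the concatenated score uses ratios of corpus totals and depends on how `textstat` splits the joined text into sentences, while the per-sentence mean averages one score per row. A toy sketch of the arithmetic, using the standard Flesch-Kincaid grade formula and made-up counts (not taken from WikiLarge):

```python
def fk_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * n_words / n_sentences + 11.8 * n_syllables / n_words - 15.59

# Hypothetical (words, syllables) counts for two sentences.
sentences = [(5, 6), (25, 45)]

# Corpus-level score: one formula call on the totals.
total_words = sum(w for w, _ in sentences)
total_syllables = sum(s for _, s in sentences)
pooled = fk_grade(total_words, len(sentences), total_syllables)

# Sentence-level score: mean of one formula call per sentence.
per_sentence = [fk_grade(w, 1, s) for w, s in sentences]
mean_of_scores = sum(per_sentence) / len(per_sentence)

print(round(pooled, 2), round(mean_of_scores, 2))
```

The direction and size of the gap depend on the data; the sketch only shows that the two numbers answer different questions.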

In [None]:
df_train.ease_src.mean(), df_train.ease_dst.mean(), df_train.ease_src.mean() - df_train.ease_dst.mean()

(108.62351200250956, 115.37726150150199, -6.753749498992434)

In [None]:
df_val.ease_src.mean(), df_val.ease_dst.mean(), df_val.ease_src.mean() - df_val.ease_dst.mean()

(108.42540342298297, 114.76784841075778, -6.342444987774812)

In [None]:
df_test.ease_src.mean(), df_test.ease_dst.mean(), df_test.ease_src.mean() - df_test.ease_dst.mean()

(112.74450920245376, 115.11917177914093, -2.3746625766871716)

In [None]:
df_train.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_train_CosSImFKG.csv')
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_val_CosSImFKG.csv')
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_test_CosSImFKG.csv')

#Alex.waters

### Grammar Checker

In [None]:
tool = language_tool_python.LanguageTool('en')

Downloading LanguageTool: 100%|██████████| 190M/190M [00:15<00:00, 12.0MB/s]
Unzipping /tmp/tmpz7h9vuh4.zip to /root/.cache/language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.2.zip to /root/.cache/language_tool_python.


In [None]:
def get_mistakes_summary(df):
    """Count LanguageTool matches per category for the 'src' and 'dst' columns of df."""
    matches_src = []
    for sentence in df['src'].values:
        matches_src.extend(tool.check(sentence))

    matches_dst = []
    for sentence in df['dst'].values:
        matches_dst.extend(tool.check(sentence))

    # Use the union of categories so both dicts share the same keys
    categories = set(m.category for m in matches_src + matches_dst)

    categories_src = {c: 0 for c in categories}
    categories_dst = {c: 0 for c in categories}

    for m in matches_src:
        categories_src[m.category] += 1

    for m in matches_dst:
        categories_dst[m.category] += 1

    return categories_src, categories_dst
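
The tallying step of `get_mistakes_summary` can also be written with `collections.Counter`. The sketch below uses a hypothetical `FakeMatch` stand-in with a `category` attribute so that it runs without LanguageTool; with the real tool, the two lists would come from `tool.check(...)` as in the function above. `tally_categories` is an illustrative name, not part of the notebook.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a LanguageTool match; only `.category` is used.
FakeMatch = namedtuple("FakeMatch", "category")

def tally_categories(matches_src, matches_dst):
    """Count matches per category, with zero-filled keys shared by both sides."""
    src_counts = Counter(m.category for m in matches_src)
    dst_counts = Counter(m.category for m in matches_dst)
    categories = sorted(set(src_counts) | set(dst_counts))
    return ({c: src_counts[c] for c in categories},
            {c: dst_counts[c] for c in categories})

src = [FakeMatch("TYPOGRAPHY"), FakeMatch("TYPOGRAPHY"), FakeMatch("GRAMMAR")]
dst = [FakeMatch("TYPOGRAPHY"), FakeMatch("REDUNDANCY")]
src_counts, dst_counts = tally_categories(src, dst)
print(src_counts)
print(dst_counts)
```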

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test)

In [None]:
src_errors

{'CASING': 3,
 'COLLOCATIONS': 1,
 'GRAMMAR': 3,
 'MISC': 2,
 'PUNCTUATION': 13,
 'REDUNDANCY': 0,
 'TYPOGRAPHY': 826,
 'TYPOS': 2}

In [None]:
dst_errors

{'CASING': 4,
 'COLLOCATIONS': 0,
 'GRAMMAR': 11,
 'MISC': 2,
 'PUNCTUATION': 14,
 'REDUNDANCY': 1,
 'TYPOGRAPHY': 744,
 'TYPOS': 4}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val)

In [None]:
src_errors

{'CASING': 1,
 'COLLOCATIONS': 3,
 'CONFUSED_WORDS': 1,
 'GRAMMAR': 12,
 'MISC': 7,
 'NONSTANDARD_PHRASES': 0,
 'PUNCTUATION': 21,
 'REDUNDANCY': 2,
 'SEMANTICS': 3,
 'STYLE': 3,
 'TYPOGRAPHY': 2476,
 'TYPOS': 6}

In [None]:
dst_errors

{'CASING': 2,
 'COLLOCATIONS': 2,
 'CONFUSED_WORDS': 2,
 'GRAMMAR': 12,
 'MISC': 15,
 'NONSTANDARD_PHRASES': 2,
 'PUNCTUATION': 43,
 'REDUNDANCY': 3,
 'SEMANTICS': 1,
 'STYLE': 4,
 'TYPOGRAPHY': 2019,
 'TYPOS': 9}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_train)

In [None]:
src_errors

{'CASING': 1069,
 'COLLOCATIONS': 230,
 'COMPOUNDING': 20,
 'CONFUSED_WORDS': 222,
 'GRAMMAR': 3288,
 'MISC': 2755,
 'NONSTANDARD_PHRASES': 57,
 'PUNCTUATION': 9216,
 'REDUNDANCY': 1495,
 'SEMANTICS': 127,
 'STYLE': 539,
 'TYPOGRAPHY': 763452,
 'TYPOS': 2092}

In [None]:
dst_errors

{'CASING': 1113,
 'COLLOCATIONS': 231,
 'COMPOUNDING': 17,
 'CONFUSED_WORDS': 238,
 'GRAMMAR': 4364,
 'MISC': 3319,
 'NONSTANDARD_PHRASES': 182,
 'PUNCTUATION': 14757,
 'REDUNDANCY': 1418,
 'SEMANTICS': 98,
 'STYLE': 524,
 'TYPOGRAPHY': 609128,
 'TYPOS': 2513}

# WikiLarge translated with the Helsinki MT model from transformers

### Cosine Similarity between original/simple sentences

In [None]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# config = AutoConfig.from_pretrained("DeepPavlov/rubert-base-cased") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
# config.output_hidden_states = True

# tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
# model = AutoModelForMaskedLM.from_pretrained("DeepPavlov/rubert-base-cased", config=config)
# model.to(device)

In [None]:
calc_cos_sim(df_test_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_val_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_train_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')


In [None]:
df_train_mt.cos_sim_x_y.mean(), df_val_mt.cos_sim_x_y.mean(), df_test_mt.cos_sim_x_y.mean()

0.9468623274033268 0.9478329893174264 0.9841433672085862

### Cosine Similarity between English sentences and their translations


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentenceTransformer('LaBSE')
model.to(device)

def calc_cos_sim_(df, model):
    LABSE_orig = []
    LABSE_simpl = []
    for index, row in df.iterrows():

        # original
          emb_source = model.encode(row['src'])
          emb_target = model.encode(row['target_x'])

          cos_val = cosine_similarity(emb_source.reshape(-1, emb_source.shape[0]), emb_target.reshape(-1, emb_target.shape[0]))[0][0]
          LABSE_orig.append(cos_val)
          
        # simplified
          emb_source = model.encode(row['dst'])
          emb_target = model.encode(row['target_y'])

          cos_val = cosine_similarity(emb_source.reshape(-1, emb_source.shape[0]), emb_target.reshape(-1, emb_target.shape[0]))[0][0]
          LABSE_simpl.append(cos_val)

    df['LABSE_orig'] = LABSE_orig
    df['LABSE_simpl'] = LABSE_simpl

calc_cos_sim_(df_test_mt, model)
calc_cos_sim_(df_val_mt, model)
calc_cos_sim_(df_train_mt, model)

In [None]:
df_train_mt.LABSE_orig.mean(), df_val_mt.LABSE_orig.mean(), df_test_mt.LABSE_orig.mean()

(0.8862167089812055, 0.8850720814067169, 0.8860442356829263)

In [None]:
df_train_mt.LABSE_simpl.mean(), df_val_mt.LABSE_simpl.mean(), df_test_mt.LABSE_simpl.mean()

(0.8818510114127176, 0.8806601700653074, 0.8815181845901934)

### Flesch Kincaid Grade Level

In [None]:
# plainrussian

from math import sqrt
import csv


from numpy import mean, arange


# Russian sounds and characters
RU_CONSONANTS_LOW = [u'к', u'п', u'с', u'т', u'ф', u'х', u'ц', u'ч', u'ш', u'щ']
RU_CONSONANTS_HIGH = [u'б', u'в', u'г', u'д', u'ж', u'з']
RU_CONSONANTS_SONOR = [u'л', u'м', u'н', u'р']
RU_CONSONANTS_YET = [u'й']

RU_CONSONANTS = RU_CONSONANTS_HIGH + RU_CONSONANTS_LOW + RU_CONSONANTS_SONOR + RU_CONSONANTS_YET
RU_VOWELS = [u'а', u'е', u'и', u'у', u'о', u'я', u'ё', u'э', u'ю', u'я', u'ы']
RU_MARKS = [u'ь', u'ъ']
SENTENCE_SPLITTERS = [u'.', u'?', u'!']
RU_LETTERS = RU_CONSONANTS + RU_MARKS + RU_VOWELS
SPACES = [u' ', u'\t']

# List of prepared texts

GRADE_TEXT = {
    1: u'1 - 3-й класс (возраст примерно: 6-8 лет)',
    2: u'1 - 3-й класс (возраст примерно: 6-8 лет)',
    3: u'1 - 3-й класс (возраст примерно: 6-8 лет)',
    4: u'4 - 6-й класс (возраст примерно: 9-11 лет)',
    5: u'4 - 6-й класс (возраст примерно: 9-11 лет)',
    6: u'4 - 6-й класс (возраст примерно: 9-11 лет)',
    7: u'7 - 9-й класс (возраст примерно: 12-14 лет)',
    8: u'7 - 9-й класс (возраст примерно: 12-14 лет)',
    9: u'7 - 9-й класс (возраст примерно: 12-14 лет)',
    10: u'10 - 11-й класс (возраст примерно: 15-16 лет)',
    11: u'10 - 11-й класс (возраст примерно: 15-16 лет)',
    12: u'1 - 3 курсы ВУЗа (возраст примерно: 17-19 лет)',
    13: u'1 - 3 курсы ВУЗа (возраст примерно: 17-19 лет)',
    14: u'1 - 3 курсы ВУЗа (возраст примерно: 17-19 лет)',
    15: u'4 - 6 курсы ВУЗа (возраст примерно: 20-22 лет)',
    16: u'4 - 6 курсы ВУЗа (возраст примерно: 20-22 лет)',
    17: u'4 - 6 курсы ВУЗа (возраст примерно: 20-22 лет)',
}

POST_GRADE_TEXT_18_24 = u'Аспирантура, второе высшее образование, phD'


def calc_SMOG(n_psyl, n_sent):
    """SMOG metric for English"""
    n = 1.0430 * sqrt((float(30.0) / n_sent) * n_psyl) + 3.1291
    return n

def calc_Gunning_fog(n_psyl, n_words, n_sent):
    """Gunning fog metric for English"""
    n = 0.4 * ((float(n_words)/ n_sent) + 100 * (float(n_psyl) / n_words))
    return n

def calc_Dale_Chale(n_psyl, n_words, n_sent):
    """Dale-Chall metric for English"""
    n = 0.1579 * (100.0 * n_psyl / n_words) + 0.0496 * (float(n_words) / n_sent)
    return n

def calc_Flesh_Kincaid(n_syllabes, n_words, n_sent):
    """Flesch Reading Ease metric for English"""
    n = 206.835 - 1.015 * (float(n_words) / n_sent) - 84.6 * (float(n_syllabes) / n_words)
    return n


def calc_Flesh_Kincaid_rus(n_syllabes, n_words, n_sent):
    """Flesch Reading Ease metric adapted for Russian"""
    n = 220.755 - 1.315 * (float(n_words) / n_sent) - 50.1 * (float(n_syllabes) / n_words)
    return n

def calc_Flesh_Kincaid_Grade_rus(n_syllabes, n_words, n_sent):
    """Flesch-Kincaid Grade metric for Russian"""
#    n = 0.59 * (float(n_words) / n_sent) + 6.2 * (float(n_syllabes) / n_words) - 16.59
    n = 0.49 * (float(n_words) / n_sent) + 7.3 * (float(n_syllabes) / n_words) - 16.59
    return n



def calc_Flesh_Kincaid_Grade_rus_adapted(n_syllabes, n_words, n_sent, X, Y, Z):
    """Flesch-Kincaid Grade metric for Russian with explicit parameters"""
#    n = 0.59 * (float(n_words) / n_sent) + 6.2 * (float(n_syllabes) / n_words) - 16.59
    if n_words == 0 or n_sent == 0: return 0
    n = X * (float(n_words) / n_sent) + Y * (float(n_syllabes) / n_words) - Z
    return n


#X_GRADE = 0.186
#Y_GRADE = 7.21
#Z_GRADE = 15.443

# Flesch-Kincaid Grade constants. See http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
FLG_X_GRADE = 0.318
FLG_Y_GRADE = 14.2
FLG_Z_GRADE = 30.5

def calc_Flesh_Kincaid_Grade_rus_flex(n_syllabes, n_words, n_sent):
    """Flesch-Kincaid Grade metric for Russian with constant parameters"""
    if n_words == 0 or n_sent == 0: return 0
    n = FLG_X_GRADE * (float(n_words) / n_sent) + FLG_Y_GRADE * (float(n_syllabes) / n_words) - FLG_Z_GRADE
    return n


# Coleman-Liau constants. See http://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

CLI_X_GRADE = 0.055
CLI_Y_GRADE = 0.35
CLI_Z_GRADE = 20.33


def calc_Coleman_Liau_index_adapted(n_letters, n_words, n_sent, x, y, z):
    """ Coleman-Liau metric for Russian with adapted parameters """
    if n_words == 0: return 0
    n = x * (n_letters * (100.0 / n_words)) - y * (n_sent * (100.0 / n_words)) - z
    return n

def calc_Coleman_Liau_index(n_letters, n_words, n_sent):
    """ Coleman-Liau metric for Russian with constant parameters """
    if n_words == 0: return 0
    n = CLI_X_GRADE * (n_letters * (100.0 / n_words)) - CLI_Y_GRADE * (n_sent * (100.0 / n_words)) - CLI_Z_GRADE
    return n


# SMOG index constants. See http://en.wikipedia.org/wiki/SMOG
SMOG_X_GRADE = 1.1
SMOG_Y_GRADE = 64.6
SMOG_Z_GRADE = 0.05

def calc_SMOG_index(n_psyl, n_sent):
    """SMOG metric for Russian with constant parameters"""
    n = SMOG_X_GRADE * sqrt((float(SMOG_Y_GRADE) / n_sent) * n_psyl) + SMOG_Z_GRADE
    return n

def calc_SMOG_index_adapted(n_psyl, n_sent, x, y, z):
    """SMOG metric for Russian, adapted with explicit coefficients"""
    n = x * sqrt((float(y) / n_sent) * n_psyl) + z
    return n

DC_X_GRADE = 0.552
DC_Y_GRADE = 0.273

def calc_Dale_Chale_index(n_psyl, n_words, n_sent):
    """Dale-Chall metric for Russian with constant parameters"""
    n = DC_X_GRADE * (100.0 * n_psyl / n_words) + DC_Y_GRADE * (float(n_words) / n_sent)
    return n


def calc_Dale_Chale_adapted(n_psyl, n_words, n_sent, x, y):
    """Dale-Chall metric for Russian with adapted parameters"""
    n = x * (100.0 * n_psyl / n_words) + y * (float(n_words) / n_sent)
    return n

ARI_X_GRADE = 6.26
ARI_Y_GRADE = 0.2805
ARI_Z_GRADE = 31.04


def calc_ARI_index_adapted(n_letters, n_words, n_sent, x, y, z):
    """ Automated Readability Index (ARI) for Russian with adapted parameters """
    if n_words == 0 or n_sent == 0: return 0
    n = x * (float(n_letters) / n_words) + y * (float(n_words) / n_sent) - z
    return n

def calc_ARI_index(n_letters, n_words, n_sent):
    """ Automated Readability Index (ARI) for Russian with constant parameters """
    if n_words == 0 or n_sent == 0: return 0
    n = ARI_X_GRADE * (float(n_letters) / n_words) + ARI_Y_GRADE * (float(n_words) / n_sent) - ARI_Z_GRADE
    return n


def load_words(filename):
    """Load words from filename"""
    # Python 3: decode while reading instead of calling .decode() on str
    with open(filename, 'r', encoding='utf8') as f:
        return [l.strip() for l in f]

#FAM_WORDS = load_words('1norm50000.txt')

bad_chars = '(){}<>"\'!?,.:;'


def calc_text_metrics(filename, verbose=True):
    """Compute readability metrics for a text file"""
    # Python 3: decode while reading instead of calling .decode() on str
    with open(filename, 'r', encoding='utf8') as f:
        text = f.read()
    return calc_readability_metrics(text, verbose)


# Number of syllabes for long words
COMPLEX_SYL_FACTOR = 4

def calc_readability_metrics(text, verbose=True):
    sentences = 0
    chars = 0
    spaces = 0
    letters = 0
    syllabes = 0
    words = 0
    complex_words = 0
    simple_words = 0
    wsyllabes = {}

    wordStart = False
    for l in text.splitlines():
        chars += len(l)
#        l = l.decode('utf8')
        for ch in l:
            if ch in SENTENCE_SPLITTERS:
                sentences += 1
            if ch in SPACES:
                spaces += 1

        for w in l.split():
            has_syl = False
            wsyl = 0
#            if len(w) > 1: words += 1
            for ch in w:
                if ch in RU_LETTERS:
                    letters += 1
                if ch in RU_VOWELS:
                    syllabes += 1
                    has_syl = True
                    wsyl += 1
            if wsyl > COMPLEX_SYL_FACTOR:
                complex_words += 1
            elif wsyl < COMPLEX_SYL_FACTOR+1 and wsyl > 0:
                simple_words += 1
            if has_syl:
                words += 1
                v = wsyllabes.get(str(wsyl), 0)
                wsyllabes[str(wsyl)] = v + 1
    metrics = {'c_share': float(complex_words) * 100 / words if words > 0 else 0,
               'avg_slen' : float(words) / sentences if sentences > 0 else 0,
               'avg_syl' : float(syllabes) / words if words > 0 else 0,
               'n_syllabes': syllabes,
               'n_words' : words,
               'n_sentences': sentences,
               'n_complex_words': complex_words,
               'n_simple_words' : simple_words,
               'chars': chars,
               'letters' : letters,
               'spaces' : spaces,
               'index_fk_rus': calc_Flesh_Kincaid_Grade_rus_flex(syllabes, words, sentences),
               'index_cl_rus' : calc_Coleman_Liau_index(letters, words, sentences),
               'index_dc_rus' : calc_Dale_Chale_index(complex_words, words, sentences),
               'index_SMOG_rus' : calc_SMOG_index(complex_words, sentences),
               'index_ari_rus' : calc_ARI_index(letters, words, sentences),
#               'index_fk_rus': calc_Flesh_Kincaid_Grade_rus(syllabes, words, sentences),
               'wsyllabes' : wsyllabes
    }
    del text
    return metrics
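
To make the counting logic easier to follow, here is a small self-contained sketch of the vowel-counting heuristic and the Russian Flesch-Kincaid grade with the constants defined above (0.318, 14.2, 30.5). `count_ru` and `fk_grade_ru` are hypothetical names. The letter lists above contain only lowercase characters, so a capitalized vowel at the start of a word is not counted as a syllable; the `lowercase` switch is an assumption added here to show the effect, not part of the original code.

```python
RU_VOWELS = set("аеёиоуыэюя")
SENTENCE_SPLITTERS = set(".?!")

def count_ru(text, lowercase=True):
    """Return (syllables, words, sentences) using the vowel-counting heuristic."""
    if lowercase:
        text = text.lower()
    # Every '.', '?' or '!' counts as one sentence end, as in the code above.
    sentences = sum(ch in SENTENCE_SPLITTERS for ch in text)
    syllables = words = 0
    for token in text.split():
        n = sum(ch in RU_VOWELS for ch in token)
        syllables += n
        words += n > 0  # a token counts as a word only if it has a vowel
    return syllables, words, sentences

def fk_grade_ru(syllables, words, sentences, x=0.318, y=14.2, z=30.5):
    """Flesch-Kincaid grade with the Russian constants used in this notebook."""
    if words == 0 or sentences == 0:
        return 0
    return x * words / sentences + y * syllables / words - z

print(count_ru("Он ушёл."))                   # counts after lowercasing
print(count_ru("Он ушёл.", lowercase=False))  # the capital vowel is missed
print(fk_grade_ru(*count_ru("Он ушёл.")))
```

Very short inputs produce negative grades, which is expected for a formula calibrated on longer texts.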

In [None]:
a,b = '\n'.join(list(df_test_mt['target_x'].values)), '\n'.join(list(df_test_mt['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 13.623906705539358,
  'avg_syl': 2.789000641985876,
  'c_share': 12.261930237534774,
  'chars': 39562,
  'index_SMOG_rus': 11.47718333406957,
  'index_ari_rus': 13.508345120439422,
  'index_cl_rus': 12.883353306227264,
  'index_dc_rus': 10.487912021731441,
  'index_fk_rus': 13.436211448560961,
  'letters': 30402,
  'n_complex_words': 573,
  'n_sentences': 343,
  'n_simple_words': 4100,
  'n_syllabes': 13033,
  'n_words': 4673,
  'spaces': 5260,
  'wsyllabes': {'1': 973,
   '10': 1,
   '2': 1230,
   '3': 1149,
   '4': 748,
   '5': 355,
   '6': 154,
   '7': 52,
   '8': 11}},
 {'avg_slen': 12.621468926553673,
  'avg_syl': 2.7309758281110117,
  'c_share': 11.078782452999105,
  'chars': 37032,
  'index_SMOG_rus': 10.50465997793507,
  'index_ari_rus': 12.355281747416662,
  'index_cl_rus': 11.91328558639212,
  'index_dc_rus': 9.56114893100466,
  'index_fk_rus': 12.293483877820428,
  'letters': 28446,
  'n_complex_words': 495,
  'n_sentences': 354,
  'n_simple_words': 3973,
  'n_

In [None]:
a,b = '\n'.join(list(df_val_mt['target_x'].values)), '\n'.join(list(df_val_mt['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 14.151111111111112,
  'avg_syl': 2.7993875628140703,
  'c_share': 12.821922110552764,
  'chars': 110547,
  'index_SMOG_rus': 11.959150450155732,
  'index_ari_rus': 13.796333902847572,
  'index_cl_rus': 13.102160804020102,
  'index_dc_rus': 10.940954338358459,
  'index_fk_rus': 13.751356725293128,
  'letters': 83144,
  'n_complex_words': 1633,
  'n_sentences': 900,
  'n_simple_words': 11103,
  'n_syllabes': 35653,
  'n_words': 12736,
  'spaces': 14878,
  'wsyllabes': {'1': 2542,
   '10': 4,
   '11': 2,
   '2': 3477,
   '3': 3202,
   '4': 1882,
   '5': 1013,
   '6': 460,
   '7': 110,
   '8': 30,
   '9': 14}},
 {'avg_slen': 10.148261758691207,
  'avg_syl': 2.7216120906801007,
  'c_share': 10.780856423173804,
  'chars': 84154,
  'index_SMOG_rus': 9.297650450509348,
  'index_ari_rus': 11.335727977469055,
  'index_cl_rus': 10.951209068010083,
  'index_dc_rus': 8.72150820571464,
  'index_fk_rus': 11.37403892692123,
  'letters': 62672,
  'n_complex_words': 1070,
  'n_sentences': 

In [None]:
a,b = '\n'.join(list(df_train_mt['target_x'].values)), '\n'.join(list(df_train_mt['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 13.049542555561745,
  'avg_syl': 2.7694332731752147,
  'c_share': 12.116218548708735,
  'chars': 32549775,
  'index_SMOG_rus': 11.167064806039756,
  'index_ari_rus': 13.080458957824057,
  'index_cl_rus': 12.535891680705788,
  'index_dc_rus': 10.250677756555579,
  'index_fk_rus': 12.975707011756683,
  'letters': 24227272,
  'n_complex_words': 454171,
  'n_sentences': 287248,
  'n_simple_words': 3294284,
  'n_syllabes': 10381096,
  'n_words': 3748455,
  'spaces': 4426893,
  'wsyllabes': {'1': 761801,
   '10': 591,
   '11': 170,
   '12': 86,
   '13': 17,
   '14': 11,
   '15': 2,
   '2': 1025158,
   '3': 952200,
   '4': 555125,
   '43': 1,
   '5': 290202,
   '6': 121861,
   '7': 31819,
   '8': 6989,
   '9': 2422}},
 {'avg_slen': 10.103324336138884,
  'avg_syl': 2.7021020808595124,
  'c_share': 10.662112890697701,
  'chars': 24845256,
  'index_SMOG_rus': 9.226197025142815,
  'index_ari_rus': 11.21290988485832,
  'index_cl_rus': 10.839036612660973,
  'index_dc_rus': 8.643693859

In [None]:
df_train_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_train_CosSImFKG.csv')
df_val_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_val_CosSImFKG.csv')
df_test_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_test_CosSImFKG.csv')


### Grammar Checker

In [None]:
tool = language_tool_python.LanguageTool('ru')

Downloading LanguageTool: 100%|██████████| 190M/190M [00:10<00:00, 18.8MB/s]
Unzipping /tmp/tmpytc6_ady.zip to /root/.cache/language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.2.zip to /root/.cache/language_tool_python.


In [None]:
def get_mistakes_summary(df):
    """Count LanguageTool matches per category for the 'target_x' and 'target_y' columns of df."""
    matches_src = []
    for sentence in df['target_x'].values:
        matches_src.extend(tool.check(sentence))

    matches_dst = []
    for sentence in df['target_y'].values:
        matches_dst.extend(tool.check(sentence))

    # Use the union of categories so both dicts share the same keys
    categories = set(m.category for m in matches_src + matches_dst)

    categories_src = {c: 0 for c in categories}
    categories_dst = {c: 0 for c in categories}

    for m in matches_src:
        categories_src[m.category] += 1

    for m in matches_dst:
        categories_dst[m.category] += 1

    return categories_src, categories_dst

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test_mt)

In [None]:
src_errors

{'CASING': 1,
 'GRAMMAR': 2,
 'LOGIC': 1,
 'PUNCTUATION': 11,
 'STYLE': 6,
 'TYPOGRAPHY': 62,
 'TYPOS': 237}

In [None]:
dst_errors

{'CASING': 1,
 'GRAMMAR': 3,
 'LOGIC': 0,
 'PUNCTUATION': 10,
 'STYLE': 6,
 'TYPOGRAPHY': 49,
 'TYPOS': 242}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val_mt)

In [None]:
src_errors

{'CASING': 4,
 'GRAMMAR': 28,
 'LOGIC': 1,
 'MISC': 4,
 'PUNCTUATION': 37,
 'STYLE': 10,
 'TYPOGRAPHY': 141,
 'TYPOS': 797}

In [None]:
dst_errors

{'CASING': 4,
 'GRAMMAR': 17,
 'LOGIC': 0,
 'MISC': 6,
 'PUNCTUATION': 13,
 'STYLE': 7,
 'TYPOGRAPHY': 120,
 'TYPOS': 631}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_train_mt)

In [None]:
src_errors

{'CASING': 1297,
 'EXTEND': 62,
 'GRAMMAR': 6611,
 'LOGIC': 362,
 'MISC': 2384,
 'PUNCTUATION': 10517,
 'STYLE': 3211,
 'TYPOGRAPHY': 46870,
 'TYPOS': 235762}

In [None]:
dst_errors

{'CASING': 1019,
 'EXTEND': 25,
 'GRAMMAR': 4987,
 'LOGIC': 252,
 'MISC': 1779,
 'PUNCTUATION': 6954,
 'STYLE': 1981,
 'TYPOGRAPHY': 38058,
 'TYPOS': 184223}

# WikiLarge translated with Google API


### Cosine Similarity between original/simple sentences


In [None]:
df_train_gl, df_val_gl, df_test_gl

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
config = AutoConfig.from_pretrained("DeepPavlov/rubert-base-cased") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
config.output_hidden_states = True

tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("DeepPavlov/rubert-base-cased", config=config)
model.to(device)

In [None]:
calc_cos_sim(df_test_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_val_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_train_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')

In [None]:
df_train_gl.cos_sim_x_y.mean(), df_val_gl.cos_sim_x_y.mean(), df_test_gl.cos_sim_x_y.mean()

(0.8626444426285484, 0.8616611667481872, 0.9632613898956612)

### Cosine Similarity between English sentences and their translations

In [None]:
df_test_gl.src = df_test_gl.src.astype(str)
df_test_gl.dst = df_test_gl.dst.astype(str)
df_test_gl.target_x = df_test_gl.target_x.astype(str)
df_test_gl.target_y = df_test_gl.target_y.astype(str)

df_train_gl.src = df_train_gl.src.astype(str)
df_train_gl.dst = df_train_gl.dst.astype(str)
df_train_gl.target_x = df_train_gl.target_x.astype(str)
df_train_gl.target_y = df_train_gl.target_y.astype(str)

df_val_gl.src = df_val_gl.src.astype(str)
df_val_gl.dst = df_val_gl.dst.astype(str)
df_val_gl.target_x = df_val_gl.target_x.astype(str)
df_val_gl.target_y = df_val_gl.target_y.astype(str)

# calc_cos_sim_ calls .encode(), so it needs a SentenceTransformer;
# `model` currently holds the RuBERT masked LM loaded above.
labse_model = SentenceTransformer('LaBSE')
labse_model.to(device)

calc_cos_sim_(df_test_gl, labse_model)
calc_cos_sim_(df_val_gl, labse_model)
calc_cos_sim_(df_train_gl, labse_model)

In [None]:
df_train_gl.LABSE_orig.mean(), df_val_gl.LABSE_orig.mean(), df_test_gl.LABSE_orig.mean()

(0.8977616297411125, 0.8965500514023006, 0.8895553515382009)

In [None]:
df_train_gl.LABSE_simpl.mean(), df_val_gl.LABSE_simpl.mean(), df_test_gl.LABSE_simpl.mean()

(0.8960233570680219, 0.8958086021399746, 0.8843331897095458)

### Flesch Kincaid Grade Level


In [None]:
a,b = '\n'.join(list(df_train_gl['target_x'].values)), '\n'.join(list(df_train_gl['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 13.511591525677904,
  'avg_syl': 2.784873958058751,
  'c_share': 12.364810921322636,
  'chars': 34122843,
  'index_SMOG_rus': 11.477623874423962,
  'index_ari_rus': 13.366235887323278,
  'index_cl_rus': 12.76482160401882,
  'index_dc_rus': 10.514040115080164,
  'index_fk_rus': 13.341896309599832,
  'letters': 25721593,
  'n_complex_words': 490185,
  'n_sentences': 293404,
  'n_simple_words': 3474170,
  'n_syllabes': 11040229,
  'n_words': 3964355,
  'spaces': 4587774,
  'wsyllabes': {'1': 789445,
   '10': 608,
   '11': 221,
   '12': 65,
   '13': 18,
   '14': 9,
   '15': 5,
   '16': 1,
   '2': 1084592,
   '24': 1,
   '26': 2,
   '3': 1005366,
   '4': 594767,
   '5': 314475,
   '6': 131371,
   '7': 33599,
   '8': 7468,
   '9': 2342}},
 {'avg_slen': 10.390097483991022,
  'avg_syl': 2.7183118297190982,
  'c_share': 10.749788353653758,
  'chars': 26185118,
  'index_SMOG_rus': 9.393696260156942,
  'index_ari_rus': 11.47905559530929,
  'index_cl_rus': 11.097810945241413,
  'inde

In [None]:
a,b = '\n'.join(list(df_val_gl['target_x'].values)), '\n'.join(list(df_val_gl['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 13.996651785714286,
  'avg_syl': 2.811019854876007,
  'c_share': 12.965473247747388,
  'chars': 107527,
  'index_SMOG_rus': 11.960094570513345,
  'index_ari_rus': 13.790950388128728,
  'index_cl_rus': 13.108202695159882,
  'index_dc_rus': 10.97802717025656,
  'index_fk_rus': 13.867417207096437,
  'letters': 81947,
  'n_complex_words': 1626,
  'n_sentences': 896,
  'n_simple_words': 10915,
  'n_syllabes': 35253,
  'n_words': 12541,
  'spaces': 14359,
  'wsyllabes': {'1': 2455,
   '11': 2,
   '2': 3420,
   '3': 3124,
   '4': 1916,
   '5': 1046,
   '6': 423,
   '7': 120,
   '8': 23,
   '9': 12}},
 {'avg_slen': 10.470899470899472,
  'avg_syl': 2.7488630621526022,
  'c_share': 11.177362304194038,
  'chars': 83869,
  'index_SMOG_rus': 9.614683495566927,
  'index_ari_rus': 11.798498115129497,
  'index_cl_rus': 11.384552804446688,
  'index_dc_rus': 9.028459547470664,
  'index_fk_rus': 11.863601514312982,
  'letters': 63071,
  'n_complex_words': 1106,
  'n_sentences': 945,
  'n_si

In [None]:
a,b = '\n'.join(list(df_test_gl['target_x'].values)), '\n'.join(list(df_test_gl['target_y'].values))
src, dst = calc_readability_metrics(a), calc_readability_metrics(b)
src, dst

({'avg_slen': 13.751898734177216,
  'avg_syl': 2.809646539027982,
  'c_share': 12.794550810014728,
  'chars': 45668,
  'index_SMOG_rus': 11.777428630287249,
  'index_ari_rus': 13.800124826159092,
  'index_cl_rus': 13.132076583210605,
  'index_dc_rus': 10.816860401558511,
  'index_fk_rus': 13.770084651665698,
  'letters': 35562,
  'n_complex_words': 695,
  'n_sentences': 395,
  'n_simple_words': 4737,
  'n_syllabes': 15262,
  'n_words': 5432,
  'spaces': 5972,
  'wsyllabes': {'1': 1105,
   '10': 2,
   '11': 2,
   '2': 1430,
   '3': 1369,
   '4': 833,
   '5': 430,
   '6': 177,
   '7': 71,
   '8': 10,
   '9': 3}},
 {'avg_slen': 12.649880095923262,
  'avg_syl': 2.7438862559241706,
  'c_share': 11.639810426540285,
  'chars': 43377,
  'index_SMOG_rus': 10.77815654833856,
  'index_ari_rus': 12.67672359439463,
  'index_cl_rus': 12.194928909952608,
  'index_dc_rus': 9.878592621637289,
  'index_fk_rus': 12.485846704626816,
  'letters': 33848,
  'n_complex_words': 614,
  'n_sentences': 417,
  'n_

In [None]:
df_train_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_train_CosSImFKG.csv')
df_val_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_val_CosSImFKG.csv')
df_test_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_test_CosSImFKG.csv')

### Grammar Checker

In [None]:
tool = language_tool_python.LanguageTool('ru')

In [None]:
def get_mistakes_summary(df):
    """Count LanguageTool matches per category for target_x (src) and target_y (dst)."""
    # Collect every match reported for the two text columns
    matches_src = [m for text in df['target_x'].values for m in tool.check(text)]
    matches_dst = [m for text in df['target_y'].values for m in tool.check(text)]

    # Use one shared category set so both summaries have the same keys
    categories = {m.category for m in matches_src + matches_dst}
    categories_src = {c: 0 for c in categories}
    categories_dst = {c: 0 for c in categories}

    for m in matches_src:
        categories_src[m.category] += 1
    for m in matches_dst:
        categories_dst[m.category] += 1

    return categories_src, categories_dst

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test_gl)


In [None]:
src_errors

{'CASING': 2,
 'GRAMMAR': 2,
 'MISC': 1,
 'PUNCTUATION': 2,
 'STYLE': 6,
 'TYPOGRAPHY': 37,
 'TYPOS': 227}

In [None]:
dst_errors

{'CASING': 1,
 'GRAMMAR': 1,
 'MISC': 1,
 'PUNCTUATION': 5,
 'STYLE': 5,
 'TYPOGRAPHY': 44,
 'TYPOS': 234}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val_gl)


In [None]:
src_errors

{'CASING': 9,
 'EXTEND': 1,
 'GRAMMAR': 5,
 'LOGIC': 2,
 'MISC': 4,
 'PUNCTUATION': 18,
 'STYLE': 8,
 'TYPOGRAPHY': 144,
 'TYPOS': 716}

In [None]:
dst_errors

{'CASING': 10,
 'EXTEND': 0,
 'GRAMMAR': 7,
 'LOGIC': 0,
 'MISC': 8,
 'PUNCTUATION': 10,
 'STYLE': 10,
 'TYPOGRAPHY': 140,
 'TYPOS': 568}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_train_gl)

In [None]:
src_errors

{'CASING': 3605,
 'EXTEND': 33,
 'GRAMMAR': 3296,
 'LOGIC': 305,
 'MISC': 1921,
 'PUNCTUATION': 4863,
 'STYLE': 3391,
 'TYPOGRAPHY': 41504,
 'TYPOS': 221050}

In [None]:
dst_errors

{'CASING': 1663,
 'EXTEND': 19,
 'GRAMMAR': 2328,
 'LOGIC': 244,
 'MISC': 1711,
 'PUNCTUATION': 3447,
 'STYLE': 1959,
 'TYPOGRAPHY': 38674,
 'TYPOS': 173040}
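
The per-category dictionaries above are easier to compare side by side. A minimal sketch, assuming only that both summaries are plain `dict` objects mapping category names to counts; the helper name `compare_counts` is hypothetical, and the numbers are copied from the test-split outputs in this section.

```python
def compare_counts(src_counts, dst_counts):
    """Return {category: (src, dst, dst - src)} over the union of categories."""
    categories = sorted(set(src_counts) | set(dst_counts))
    return {
        c: (src_counts.get(c, 0),
            dst_counts.get(c, 0),
            dst_counts.get(c, 0) - src_counts.get(c, 0))
        for c in categories
    }

# Counts copied from the test-split outputs in this section
src_test = {'CASING': 2, 'GRAMMAR': 2, 'MISC': 1, 'PUNCTUATION': 2,
            'STYLE': 6, 'TYPOGRAPHY': 37, 'TYPOS': 227}
dst_test = {'CASING': 1, 'GRAMMAR': 1, 'MISC': 1, 'PUNCTUATION': 5,
            'STYLE': 5, 'TYPOGRAPHY': 44, 'TYPOS': 234}

for category, (src, dst, diff) in compare_counts(src_test, dst_test).items():
    print(f"{category:<12} src={src:<4} dst={dst:<4} diff={diff:+d}")
```

Raw counts are not directly comparable across splits of different size, so dividing by the number of sentences in each split would be a natural next step.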

In [None]:
# lexical diversity
# https://shravan-kuchkula.github.io/Lexical-Diversity/#normalizing-text-to-understand-vocabulary
# https://github.com/facebookresearch/asset
# https://github.com/ddhruvkr/Edit-Unsup-TS
# https://newtechaudit.ru/trenirovka-nlp-zadachi/
# https://github.com/king-menin/mipt-nlp2021/blob/master/seminars/sem10/Summarization.ipynb