In this notebook, the English WikiLarge dataset, collected from Wikipedia, is explored and translated into Russian with an MT model from Hugging Face. A version of the data translated via the Google Translate API is also considered. All versions of the data are then evaluated.


# Libraries and dependencies

In [None]:
! pip install googletrans
! pip install transformers
! pip install laserembeddings
! pip install sentence_transformers
! pip install language_tool_python
! pip install textstat

In [2]:
! python -m laserembeddings download-models

Downloading models into /usr/local/lib/python3.7/dist-packages/laserembeddings/data

✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab    
✅   Downloaded https://dl.fbaipublicfiles.com/laser/models/bilstm.93langs.2018-12-26.pt    

✨ You're all set!


In [3]:
import pandas as pd
import numpy as np
import torch
import language_tool_python
import re
import textstat
from laserembeddings import Laser
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
from transformers import MarianMTModel, MarianTokenizer
from transformers.hf_api import HfApi
from googletrans import Translator
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

# Data loading...

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# ! tar -xf /content/drive/MyDrive/MT_sentence_simpl/data-simplification.tar.bz2
! gdown https://drive.google.com/uc?id=0B6-YKFW-MnbOYWxUMTBEZ1FBam8
! tar -xf /content/data-simplification.tar.bz2

In [None]:
# train-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.train.src') as f:
 data_train =  f.readlines()
data_train = [i.strip('\n') for i in data_train]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.train.dst') as f:
 data_train_dst =  f.readlines()
data_train_dst = [i.strip('\n') for i in data_train_dst]

df_train = pd.DataFrame.from_records(zip(data_train, data_train_dst), columns=['src', 'dst'])

# valid-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.valid.src') as f:
 data_valid =  f.readlines()
data_valid = [i.strip('\n') for i in data_valid]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.valid.dst') as f:
 data_valid_dst =  f.readlines()
data_valid_dst = [i.strip('\n') for i in data_valid_dst]

df_val = pd.DataFrame.from_records(zip(data_valid, data_valid_dst), columns=['src', 'dst'])

# test-------------------------------------------------------------
with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.test.src') as f:
 data_test =  f.readlines()
data_test = [i.strip('\n') for i in data_test]

with open('/content/data-simplification/wikilarge/wiki.full.aner.ori.test.dst') as f:
 data_test_dst =  f.readlines()
data_test_dst = [i.strip('\n') for i in data_test_dst]

df_test = pd.DataFrame.from_records(zip(data_test, data_test_dst), columns=['src', 'dst'])
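
The three loading blocks above repeat the same steps for every split. A minimal sketch of a shared helper could look as follows; `read_lines` and `load_split` are hypothetical names that are not used elsewhere in this notebook, and the file naming scheme is assumed to be the one shown above.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file and return its lines without the trailing newline."""
    with open(path, encoding='utf-8') as f:
        return [line.rstrip('\n') for line in f]


def load_split(data_dir, split):
    """Return (src, dst) sentence pairs for one split: 'train', 'valid' or 'test'.

    Assumes the WikiLarge naming scheme used above:
    wiki.full.aner.ori.<split>.src and wiki.full.aner.ori.<split>.dst
    """
    base = Path(data_dir) / f'wiki.full.aner.ori.{split}'
    src = read_lines(f'{base}.src')
    dst = read_lines(f'{base}.dst')
    # The two files are parallel, so a length mismatch means broken alignment.
    if len(src) != len(dst):
        raise ValueError(f'{split}: {len(src)} source lines vs {len(dst)} target lines')
    return list(zip(src, dst))
```

The returned list of pairs could then be passed to `pd.DataFrame.from_records(pairs, columns=['src', 'dst'])`, as in the cells above.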

### First look at the data:

In [None]:
df_train.sample(5)

Unnamed: 0,src,dst
113874,Saint-Martin-de-Lerm is a commune in the Gironde department in Aquitaine in south-western France .,It is found in the region Aquitaine in the Gironde department in the southwest of France .
252319,Legislative power is exercised by a bicameral Congress composed of the Senate -LRB- with 32 members -RRB- and the Chamber of Deputies -LRB- with 178 members -RRB- .,"The Congress is divided into two groups : the Senate , with 32 members -LRB- one for every province and one for the National District -RRB- , and the Chamber of Deputies with 178 members ."
28214,Manin is a commune in the Pas-de-Calais department in the Nord-Pas-de-Calais region of France .,It is found in the region Nord-Pas-de-Calais in the Pas-de-Calais department in the north of France .
275990,"According to science fiction writer Robert A. Heinlein , '' a handy short definition of almost all science fiction might read : realistic speculation about possible future events , based solidly on adequate knowledge of the real world , past and present , and on a thorough understanding of the nature and significance of the scientific method . ''",Possible examples of science fiction from the past
53114,Veuilly-la-Poterie is a commune in the Aisne department in Picardie in northern France .,It is found in the region Picardie in the Aisne department in the north of France .


In [None]:
df_test.sample(5)

Unnamed: 0,src,dst
345,"On 15 April 1989 , the ground was the scene of one of the worst sporting tragedies of all time when 94 Liverpool fans -LRB- the final death toll was 96 -RRB- were crushed to death in an FA Cup semi-final in the infamous Hillsborough disaster .",The device can be designed for use in less exact environments .
192,"The park , which receives approximately thirty-five million visitors annually , is the most visited urban park in the United States .",The lawyer Brandon -LRB- Waise Lee -RRB- was his idol as MK Sun grew up to be a lawyer .
147,It was founded in 1440 by King Henry VI as '' The King 's College of Our Lady of Eton besides Wyndsor '' .,"Later , Esperanto speakers started to see the language and culture that had grown up around it as ends in themselves , though Esperanto is never accepted by the United Nations of other international organizations ."
66,"Etymology The Portuguese Man O ' War -LRB- named caravela-portuguesa in Portuguese -RRB- is named for its air bladder , which looks similar to the triangular sails of the Portuguese ship -LRB- man-of-war -RRB- Caravela latina -LRB- two - or three-masted lateen-rigged ship caravel -RRB- , of the 15th and 16th centuries .","Rollo swore fealty to Charles , changed to Christianity , and undertook to standby the northern region of France against the incursions of other Viking groups ."
41,"Sudan is situated in northern Africa , bordering the Red Sea and it has a coastline of 853 km along the Red Sea .",Working Group I : makes note of climate system and climate change


Initial sizes are:

* train: 296402
* dev: 992
* test: 359

However, some of the sentences are of very poor quality, so they are removed before translation.

In [None]:
# dataset sizes

df_train.shape[0], df_val.shape[0], df_test.shape[0]

(296402, 992, 359)

# Basic cleaning

In [None]:
# preprocessing to clean the texts

# train-------------------------------------------------------------
for i, obj  in enumerate(data_train):
  data_train[i] = data_train[i].replace('-LRB-', '(')
  data_train[i] = data_train[i].replace('-RRB-', ')')

for i, obj in enumerate(data_train_dst):
  data_train_dst[i] = data_train_dst[i].replace('-LRB-', '(')
  data_train_dst[i] = data_train_dst[i].replace('-RRB-', ')')

# valid-------------------------------------------------------------
for i, obj  in enumerate(data_valid):
  data_valid[i] = data_valid[i].replace('-LRB-', '(')
  data_valid[i] = data_valid[i].replace('-RRB-', ')')

for i, obj in enumerate(data_valid_dst):
  data_valid_dst[i] = data_valid_dst[i].replace('-LRB-', '(')
  data_valid_dst[i] = data_valid_dst[i].replace('-RRB-', ')')

# test-------------------------------------------------------------
for i, obj  in enumerate(data_test):
  data_test[i] = data_test[i].replace('-LRB-', '(')
  data_test[i] = data_test[i].replace('-RRB-', ')')

for i, obj in enumerate(data_test_dst):
  data_test_dst[i] = data_test_dst[i].replace('-LRB-', '(')
  data_test_dst[i] = data_test_dst[i].replace('-RRB-', ')')
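
The six loops above apply the same two replacements to every list. A compact sketch of the same step is given below; `PTB_BRACKETS` and `restore_brackets` are hypothetical names introduced only for illustration.

```python
# Penn Treebank escape tokens found in WikiLarge and their plain-text form.
PTB_BRACKETS = {'-LRB-': '(', '-RRB-': ')'}


def restore_brackets(sentence):
    """Replace PTB bracket tokens with literal parentheses."""
    for token, char in PTB_BRACKETS.items():
        sentence = sentence.replace(token, char)
    return sentence


example = 'the Senate -LRB- with 32 members -RRB- .'
print(restore_brackets(example))  # the Senate ( with 32 members ) .
```

Applied to a whole split, this would read `data_train = [restore_brackets(s) for s in data_train]`.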


Extra preprocessing to drop malformed sentences and to get rid of bad punctuation

In [None]:
deleted = []
for i, j in enumerate(data_train):
  if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
    deleted.append(j)
    data_train.pop(i)
    data_train_dst.pop(i)
  else:
    if '()' in data_train[i]:
       data_train[i] = data_train[i].replace('()', '')
    
    if '( )' in data_train[i]:
       data_train[i] = data_train[i].replace('( )', '')
    if ',,' in data_train[i]:
      data_train[i] = data_train[i].replace(',,', '')
    if ';)' in data_train[i]:
      data_train[i] = data_train[i].replace(';)', ')')
    if ':)' in data_train[i]:
      data_train[i] = data_train[i].replace(':)', ')')
    if '(:' in data_train[i]:
      data_train[i] = data_train[i].replace('(:', '(')
    if '(;' in data_train[i]:
      data_train[i] = data_train[i].replace('(;', '(')
    if "& ndash ;" in data_train[i]: 
      data_train[i] = data_train[i].replace("& ndash ;", '')
    if "& minus ;" in data_train[i]: 
      data_train[i] = data_train[i].replace("& minus ;", '')
    if "( , ; ; ; )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( , ; ; ; )", '')
    if "( , ; , ; , ; ; )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( , ; , ; , ; ; )", '')
    if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_train[i]: 
      data_train[i] = data_train[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

19493
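
Note that calling `list.pop(i)` inside `for i, j in enumerate(...)` shifts the remaining elements to the left, so the element that follows each removed one is never checked. A minimal sketch of a filter that avoids this by building new lists is shown below. `filter_pairs` and `too_short` are hypothetical names, and `too_short` is only a stand-in for the full condition used in the cell above.

```python
def filter_pairs(src, dst, is_bad):
    """Keep only the (src, dst) pairs whose source sentence passes the check.

    Builds new lists instead of popping during iteration, so no
    element is skipped and both lists stay aligned.
    """
    kept_src, kept_dst, deleted = [], [], []
    for s, d in zip(src, dst):
        if is_bad(s):
            deleted.append(s)
        else:
            kept_src.append(s)
            kept_dst.append(d)
    return kept_src, kept_dst, deleted


# Hypothetical predicate for illustration only: sentences of up to three tokens.
def too_short(sentence):
    return len(sentence.split()) <= 3


src = ['Too short .', 'Also short .', 'This one is long enough .']
dst = ['a', 'b', 'c']
kept_src, kept_dst, deleted = filter_pairs(src, dst, too_short)
print(kept_src, kept_dst, len(deleted))
```

In this toy input two bad sentences are adjacent, which is exactly the case where popping during iteration would let the second one through.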

In [None]:
deleted = []
for i, j in enumerate(data_train_dst):
  if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2  or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
    deleted.append(j)
    data_train_dst.pop(i)
    data_train.pop(i)
  else:
    if '()' in data_train_dst[i]:
       data_train_dst[i] = data_train_dst[i].replace('()', '')
    
    if '( )' in data_train_dst[i]:
       data_train_dst[i] = data_train_dst[i].replace('( )', '')
    if ',,' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(',,', '')
    if ';)' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(';)', ')')
    if ':)' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace(':)', ')')
    if '(:' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace('(:', '(')
    if '(;' in data_train_dst[i]:
      data_train_dst[i] = data_train_dst[i].replace('(;', '(')
    if "& ndash ;" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("& ndash ;", '')
    if "& minus ;" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("& minus ;", '')
    if "( , ; ; ; )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( , ; ; ; )", '')
    if "( , ; , ; , ; ; )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( , ; , ; , ; ; )", '')
    if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_train_dst[i]: 
      data_train_dst[i] = data_train_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_train = pd.DataFrame.from_records(list(zip(data_train, data_train_dst)), columns=['src', 'dst'])
df_train.shape[0]

246993

In [None]:
deleted = []
for i, j in enumerate(data_valid):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in (
    '?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_valid.pop(i)
        data_valid_dst.pop(i)
    else:
        if '()' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('()', '')

        if '( )' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('( )', '')
        if ',,' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(',,', '')
        if ';)' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(';)', ')')
        if ':)' in data_valid[i]:
            data_valid[i] = data_valid[i].replace(':)', ')')
        if '(:' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('(:', '(')
        if '(;' in data_valid[i]:
            data_valid[i] = data_valid[i].replace('(;', '(')
        if "& ndash ;" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("& ndash ;", '')
        if "& minus ;" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_valid[i]:
            data_valid[i] = data_valid[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

62

In [None]:
deleted = []
for i, j in enumerate(data_valid_dst):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']',
                                                                                                        j[
                                                                                                            0]) or re.match(
            '[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_valid_dst.pop(i)
        data_valid.pop(i)
    else:
        if '()' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('()', '')

        if '( )' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('( )', '')
        if ',,' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(',,', '')
        if ';)' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(';)', ')')
        if ':)' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace(':)', ')')
        if '(:' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('(:', '(')
        if '(;' in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace('(;', '(')
        if "& ndash ;" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("& ndash ;", '')
        if "& minus ;" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_valid_dst[i]:
            data_valid_dst[i] = data_valid_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_val = pd.DataFrame.from_records(list(zip(data_valid, data_valid_dst)), columns=['src', 'dst'])
df_val.shape[0]

818

In [None]:
deleted = []
for i, j in enumerate(data_test):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or len(j.split()) == 3 or j[-1] not in (
    '?.\'') or not re.match(r'[A-ZÉ0-9"\']', j[0]) or re.match('[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_test.pop(i)
        data_test_dst.pop(i)
    else:
        if '()' in data_test[i]:
            data_test[i] = data_test[i].replace('()', '')

        if '( )' in data_test[i]:
            data_test[i] = data_test[i].replace('( )', '')
        if ',,' in data_test[i]:
            data_test[i] = data_test[i].replace(',,', '')
        if ';)' in data_test[i]:
            data_test[i] = data_test[i].replace(';)', ')')
        if ':)' in data_test[i]:
            data_test[i] = data_test[i].replace(':)', ')')
        if '(:' in data_test[i]:
            data_test[i] = data_test[i].replace('(:', '(')
        if '(;' in data_test[i]:
            data_test[i] = data_test[i].replace('(;', '(')
        if "& ndash ;" in data_test[i]:
            data_test[i] = data_test[i].replace("& ndash ;", '')
        if "& minus ;" in data_test[i]:
            data_test[i] = data_test[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_test[i]:
            data_test[i] = data_test[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_test[i]:
            data_test[i] = data_test[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_test[i]:
            data_test[i] = data_test[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

0

In [None]:
deleted = []
for i, j in enumerate(data_test_dst):
    if len(j.split()) == 1 or len(j.strip('.?"').split()) == 2 or j[-1] not in ('?.\'') or not re.match(r'[A-ZÉ0-9"\']',
                                                                                                        j[
                                                                                                            0]) or re.match(
            '[\W\w]+ = [\W\w]+[\d]* .', j) or ',.' in j:
        deleted.append(j)
        data_test_dst.pop(i)
        data_test.pop(i)
    else:
        if '()' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('()', '')

        if '( )' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('( )', '')
        if ',,' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(',,', '')
        if ';)' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(';)', ')')
        if ':)' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace(':)', ')')
        if '(:' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('(:', '(')
        if '(;' in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace('(;', '(')
        if "& ndash ;" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("& ndash ;", '')
        if "& minus ;" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("& minus ;", '')
        if "( , ; ; ; )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( , ; ; ; )", '')
        if "( , ; , ; , ; ; )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( , ; , ; , ; ; )", '')
        if "( á 1\/4 Î '' Î Î 1\/2 Î Ï )" in data_test_dst[i]:
            data_test_dst[i] = data_test_dst[i].replace("( á 1\/4 Î '' Î Î 1\/2 Î Ï )", '')

len(deleted)

In [None]:
df_test = pd.DataFrame.from_records(list(zip(data_test, data_test_dst)), columns=['src', 'dst'])
df_test.shape[0]

326

After cleaning, the sizes reported by the cells above are:

* train: 246993
* dev: 818
* test: 326

So, a noticeable share of sentence pairs has been removed from every split (33 of 359 in the test set)

# Translation
There are 2 types of translation:
* Machine translation with the 'Helsinki-NLP/opus-mt-en-ru' model from the transformers library
* Machine translation with the Google Translate API (the data translated this way is downloaded ready-made, because the API is paid)

The first model may also be fine-tuned on Russian parallel corpora:

Possible starting point for fine-tuning: https://github.com/SimonWT/russian-english-nerual-machine-translation/blob/main/experiments/hugging_face_fine_tune.ipynb

https://github.com/simple-ai-pixel/gpt3_ru/blob/main/gpt.ipynb

In [None]:
# helper functions

def batch_generator(
        list_of_sentences,
        size=64
):
    # yield consecutive slices of at most `size` sentences;
    # the last batch may be shorter, and no empty batch is produced
    for start in range(0, len(list_of_sentences), size):
        yield list_of_sentences[start:start + size]


def translate_to_russin(model, tok, src, dst):

  translations_src = []
  for i in tqdm_notebook(batch_generator(src)):
    l = tok(i, return_tensors="pt", padding=True).to(device)
    translated = model.generate(**l)
    translations_src.extend([tok.decode(t, skip_special_tokens=True) for t in translated])

  translations_dst = []
  for i in tqdm_notebook(batch_generator(dst)):
    l = tok(i, return_tensors="pt", padding=True).to(device)
    translated = model.generate(**l)
    translations_dst.extend([tok.decode(t, skip_special_tokens=True) for t in translated])

  return translations_src, translations_dst


In [None]:
# import en-ru model from transformers
model_name = 'Helsinki-NLP/opus-mt-en-ru'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
model.to(device)

### Translation in parts because of GPU limitations

In [None]:
# val--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_valid, data_valid_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_val['target_x'] = src
df_val['target_y'] = dst
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_val_mtt.csv')

In [None]:
# test--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_test, data_test_dst)
for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_test['target_x'] = src
df_test['target_y'] = dst
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_mtt.csv')

In [None]:
# train--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_train[232000:], data_train_dst[232000:])

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_train_22000 = pd.DataFrame(list(zip(data_train[232000:], data_train_dst[232000:], src, dst)), columns=['src', 'dst', 'target_x', 'target_y'])
df_train_22000.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt_232-246k.csv')


### Code for translating the whole dataset in one run

Some cleaning is applied to remove common translation artifacts
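
The artifact removal is repeated for every split and for both columns. A minimal sketch of the same step as a reusable function is given below; `ARTIFACTS`, `clean_translation` and `clean_all` are hypothetical names, and the list of substrings is simply the one used in the cells of this notebook.

```python
# Substrings stripped from the model output in the cells of this notebook.
ARTIFACTS = ('&gt;', '&lt;', '()', '&quot;')


def clean_translation(text):
    """Remove leftover HTML entities and empty brackets from one translation."""
    for artifact in ARTIFACTS:
        text = text.replace(artifact, '')
    return text


def clean_all(translations):
    """Apply clean_translation to every sentence and return a new list."""
    return [clean_translation(t) for t in translations]
```

With such a helper, each pair of loops would reduce to `src, dst = clean_all(src), clean_all(dst)`.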

In [None]:
# train--------------------------------------------------
# translate the full training set so that the result matches the length of df_train
src, dst = translate_to_russin(model, tokenizer, data_train, data_train_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_train['target_x'] = src
df_train['target_y'] = dst

# val--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_valid, data_valid_dst)

for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_val['target_x'] = src
df_val['target_y'] = dst

# test--------------------------------------------------
src, dst = translate_to_russin(model, tokenizer, data_test, data_test_dst)
for i, obj  in enumerate(src):
  src[i] = src[i].replace('&gt;', '')
  #src[i] = re.sub(r'[&gt;{1,}]', " ", src[i])
  src[i] = src[i].replace('&lt;', '')
  src[i] = src[i].replace('()', '')
  src[i] = src[i].replace('&quot;', '')
for i, obj in enumerate(dst):
  dst[i] = dst[i].replace('&gt;', '')
  dst[i] = dst[i].replace('&lt;', '')
  dst[i] = dst[i].replace('()', '')
  dst[i] = dst[i].replace('&quot;', '')
df_test['target_x'] = src
df_test['target_y'] = dst

df_train.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt.csv')
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_val_mtt.csv')
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_test_mtt.csv')

### Assembling the whole training set from the pieces

In [None]:
# 1-2
! gdown https://drive.google.com/uc?id=1-0oww8dmt7jRiE4uMVVd1GqrU51rpWNm
#https://drive.google.com/file/d/1-0oww8dmt7jRiE4uMVVd1GqrU51rpWNm/view?usp=sharing

# 2 - 22
! gdown https://drive.google.com/uc?id=1-2CTWEuaCaTeOSWVHWVD9hytmnsT-xov
# https://drive.google.com/file/d/1-2CTWEuaCaTeOSWVHWVD9hytmnsT-xov/view?usp=sharing

# 22-42
! gdown https://drive.google.com/uc?id=1-4tocDbuVFdcEfWnGsXJYl6UNUigT2O6
# https://drive.google.com/file/d/1-4tocDbuVFdcEfWnGsXJYl6UNUigT2O6/view?usp=sharing

# 42-62
! gdown https://drive.google.com/uc?id=1-9Yxy9dIQT_jtih9cgkAIyQibeZ1iPAF
# https://drive.google.com/file/d/1-9Yxy9dIQT_jtih9cgkAIyQibeZ1iPAF/view?usp=sharing

# 62-82
! gdown https://drive.google.com/uc?id=15HE_G_1iRvLsfUHKOOa0bvuYYtAJ503P
# https://drive.google.com/file/d/15HE_G_1iRvLsfUHKOOa0bvuYYtAJ503P/view?usp=sharing

# 82 - 112
! gdown https://drive.google.com/uc?id=1-1zRK3DO25xazojKj31he1U9cUmtqovg
# https://drive.google.com/file/d/1-1zRK3DO25xazojKj31he1U9cUmtqovg/view?usp=sharing

# 112-132
! gdown https://drive.google.com/uc?id=1-2o-bgldNJWYvEtY6uN0w5fae6KbajdR
# https://drive.google.com/file/d/1-2o-bgldNJWYvEtY6uN0w5fae6KbajdR/view?usp=sharing

# 132 -152
! gdown https://drive.google.com/uc?id=1e7sPM2AL03EBlJuNzCQgKHL_Zq8Tc5UP
#https://drive.google.com/file/d/1e7sPM2AL03EBlJuNzCQgKHL_Zq8Tc5UP/view?usp=sharing

# 152 - 182
! gdown https://drive.google.com/uc?id=1--lcK4KBaK_hkK4lVBbf-ze7lSdxSt2k
# https://drive.google.com/file/d/1--lcK4KBaK_hkK4lVBbf-ze7lSdxSt2k/view?usp=sharing

# 182 - 212
! gdown https://drive.google.com/uc?id=1-0loBcmLIYdxtMIBWxyyAHQyPCp1mJF9
# https://drive.google.com/file/d/1-0loBcmLIYdxtMIBWxyyAHQyPCp1mJF9/view?usp=sharing

# 212 - 232
! gdown https://drive.google.com/uc?id=1-4qPowP66ximyi7Q6lqsit9BKmqAinzz
# https://drive.google.com/file/d/1-4qPowP66ximyi7Q6lqsit9BKmqAinzz/view?usp=sharing

# 232 - 246 
! gdown https://drive.google.com/uc?id=1-63gYL7NMFB4Tvh0y24CkDd370ZOsbTV
# https://drive.google.com/file/d/1-63gYL7NMFB4Tvh0y24CkDd370ZOsbTV/view?usp=sharing

In [None]:
# from pathlib import Path
# for i in p.glob('./wiki_train_mtt*.csv'):
#   print(p.cwd().joinpath(i))

In [None]:
paths = ['/content/wiki_train_mtt_1-2k.csv',
         '/content/wiki_train_mtt_2-22k.csv',
         '/content/wiki_train_mtt_22-42k.csv',
         '/content/wiki_train_mtt_42-62k.csv',
         '/content/wiki_train_mtt_62-82k.csv',
         '/content/wiki_train_mtt_82-112k.csv',
         '/content/wiki_train_mtt_112-132k.csv',
         '/content/wiki_train_mtt_132-152k.csv',
         '/content/wiki_train_mtt_152-182k.csv',
         '/content/wiki_train_mtt_182-212k.csv', 
         '/content/wiki_train_mtt_212-232k.csv',
         '/content/wiki_train_mtt_232-246k.csv'
         ]
# read every piece and concatenate them in order into a single DataFrame
train_full = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)

train_full.to_csv('/content/drive/MyDrive/MT_sentence_simpl/wiki_train_mtt.csv', index=False)

Load the translations produced by the transformer (Helsinki-NLP) model

In [4]:
! gdown https://drive.google.com/uc?id=1--3BHQFq5hKy5RiI4G1nfFG6Erj9iN4s
! gdown https://drive.google.com/uc?id=17dwqCJc7hWpG39LrJMFmsqbTMiFt7KtD
! gdown https://drive.google.com/uc?id=1URMCB5ocu_Xeco_df3ol4XnfY1k6LpmU

df_train_mt = pd.read_csv('/content/wiki_train_mtt.csv')
df_val_mt = pd.read_csv('/content/wiki_val_mtt.csv')
df_test_mt = pd.read_csv('/content/wiki_test_mtt.csv')

Downloading...
From: https://drive.google.com/uc?id=1--3BHQFq5hKy5RiI4G1nfFG6Erj9iN4s
To: /content/wiki_test_mtt.csv
100% 220k/220k [00:00<00:00, 7.09MB/s]
Downloading...
From: https://drive.google.com/uc?id=17dwqCJc7hWpG39LrJMFmsqbTMiFt7KtD
To: /content/wiki_val_mtt.csv
100% 563k/563k [00:00<00:00, 8.80MB/s]
Downloading...
From: https://drive.google.com/uc?id=1URMCB5ocu_Xeco_df3ol4XnfY1k6LpmU
To: /content/wiki_train_mtt.csv
167MB [00:03, 52.5MB/s]


Load the better translations obtained via the Google Translate API

In [7]:
! gdown https://drive.google.com/uc?id=1dB3X-Wx8qU_5nDG_pxAmLvo5H_sgnHrE
! gdown https://drive.google.com/uc?id=1bJo8TagTGKa0uyppQRqsHrKHyYO5tcZc
! gdown https://drive.google.com/uc?id=11lqipq6ggrgCk8bVxQ4-uuPVMCKN5ebU

df_val_gl = pd.read_csv('/content/wiki_dev_cleaned_translated_sd.csv')
df_test_gl = pd.read_csv('/content/wiki_test_cleaned_translated_sd.csv')
df_train_gl = pd.read_csv('/content/wiki_train_cleaned_translated_sd.csv')

Downloading...
From: https://drive.google.com/uc?id=1dB3X-Wx8qU_5nDG_pxAmLvo5H_sgnHrE
To: /content/wiki_train_cleaned_translated_sd.csv
172MB [00:03, 53.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1bJo8TagTGKa0uyppQRqsHrKHyYO5tcZc
To: /content/wiki_dev_cleaned_translated_sd.csv
100% 545k/545k [00:00<00:00, 8.52MB/s]
Downloading...
From: https://drive.google.com/uc?id=11lqipq6ggrgCk8bVxQ4-uuPVMCKN5ebU
To: /content/wiki_test_cleaned_translated_sd.csv
100% 254k/254k [00:00<00:00, 8.14MB/s]


In [None]:
df_val_gl.sample(3)

Unnamed: 0.1,Unnamed: 0,src,dst,target_x,target_y
208,208,"Additionally , since this AO system provides an excellent and stable correction ( angular resolution of 0.060 arcsec in K band ) , a 15-km moonlet at 1000 km of Hektor 's primary was detected .","Additionally , since this AO system provides an excellent and stable correction ( angular resolution of 0.060 arcsec in K band ) , a 15-km moon at 1000 km from Hektor was found .","Кроме того, поскольку эта система AO обеспечивает отличную и стабильную коррекцию (угловое разрешение 0,060 угловой секунды в диапазоне K), была обнаружена луна длиной 15 км на 1000 км от главной звезды Гектора.","Кроме того, поскольку эта система AO обеспечивает отличную и стабильную коррекцию (угловое разрешение 0,060 угловой секунды в диапазоне K), была обнаружена 15-километровая луна в 1000 км от Гектора."
29,29,Class 316 and Class 457 were TOPS classifications assigned to a single electric multiple unit ( EMU ) at different stages of its use as a prototype for the Networker series .,Class 316 and Class 457 were two suggested TOPS classifications . They were given to a single electric multiple unit ( EMU ) at different stages of its use as a prototype for the Networker series .,"Классы 316 и 457 были классификациями TOPS, присвоенными одиночному электрическому блоку (EMU) на разных этапах его использования в качестве прототипа для серии Networker.",Класс 316 и класс 457 были двумя предложенными классификациями TOPS. Они были переданы в единый электрический многоканальный блок (EMU) на разных этапах его использования в качестве прототипа для серии Networker.
51,51,"Mount Batur ( Gunung Batur ) is an active volcano located at the center of two concentric calderas north west of Mount Agung , Bali , Indonesia .",Mount Batur or Gunung Batur is a volcano on Bali .,"Гора Батур (Гунунг Батур) - действующий вулкан, расположенный в центре двух концентрических кальдер к северо-западу от горы Агунг, Бали, Индонезия.",Гора Батур или Гунунг Батур - вулкан на Бали.


# Evaluation of all the data

### WikiLarge

* Cosine Similarity between original/simple sentences
* Flesch Kincaid Grade Level
* Grammar Check

### WikiLarge MT Helsinki OPUS and Google API translations + original dataset (dev + test parts) collected via Toloka

* Cosine Similarity between original/simple sentences
* Cosine Similarity between original sentences and their translation, simple sentences and their translations
* Flesch Kincaid Grade Level
* Grammar Check



# Evaluation of WikiLarge

### Cosine Similarity between original/simple sentences

In [None]:
from transformers import RobertaTokenizer, RobertaModel, AutoConfig, AutoTokenizer, AutoModelForMaskedLM
device = "cuda" if torch.cuda.is_available() else "cpu"
config = AutoConfig.from_pretrained("roberta-base") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
config.output_hidden_states = True

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", config=config)
model.to(device)

# from sklearn.metrics.pairwise import cosine_similarity

# each sentence embedding should have shape [1, hidden_size] (e.g. 768)

# import numpy as np
# def cs(a, b):
#   return (a @ b.T)/(np.linalg.norm(a)*np.linalg.norm(b))

def calc_cos_sim(df, model, tok, x, y, column_name):
    """Store the cosine similarity between columns x and y of df in df[column_name]."""

    def embed(text):
        # Pad/truncate to 50 tokens. Padding positions are included in the mean below.
        ids = tok.encode(text, padding='max_length', max_length=50, truncation=True, return_tensors='pt').to(device)
        with torch.no_grad():
            output = model(ids)
        # output[-1] holds the hidden states; index 0 is the embedding-layer output.
        sent_emb = output[-1][0]
        return sent_emb.mean(axis=1).cpu().numpy()  # shape [1, hidden_size]

    cos_sim = []
    for _, row in df.iterrows():
        emb_source = embed(row[x])
        emb_target = embed(row[y])
        cos_sim.append(cosine_similarity(emb_source, emb_target)[0][0])
    df[column_name] = cos_sim
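
The function above averages over all 50 positions, padding included. A minimal sketch of a padding-aware alternative follows, written in plain Python for illustration only. `masked_mean` and `cosine` are hypothetical helper names that do not appear in this notebook, and the toy vectors are made up.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two real tokens followed by one padding position (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)  # the padding vector is ignored
print(pooled, cosine(pooled, [1.0, 1.0]))
```

With real models the same idea would use the tokenizer's attention mask. Scores computed that way would not be comparable with the numbers reported in this notebook.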

In [None]:
calc_cos_sim(df_test, model, tok, 'src', 'dst', 'cos_sim_src_dst')
calc_cos_sim(df_val, model, tok, 'src', 'dst', 'cos_sim_src_dst')
calc_cos_sim(df_train, model, tok, 'src', 'dst', 'cos_sim_src_dst')

In [None]:
df_train.cos_sim_src_dst.mean(), df_val.cos_sim_src_dst.mean(), df_test.cos_sim_src_dst.mean()

(0.8546753056345266, 0.8547325381886187, 0.9688304633450654)

### Flesch Kincaid Grade Level

In [None]:
textstat.set_lang('en')

In [None]:
df_train['fkg_src'] = df_train['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_train['fkg_dst'] = df_train['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_val['fkg_src'] = df_val['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_val['fkg_dst'] = df_val['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_test['fkg_src'] = df_test['src'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_test['fkg_dst'] = df_test['dst'].apply(lambda x: textstat.flesch_kincaid_grade(x))
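
For reference, the standard Flesch-Kincaid grade formula combines average sentence length with average syllables per word. The sketch below spells it out from raw counts; `fk_grade` is a hypothetical name and the counts are invented. textstat's own counting of sentences and syllables may differ in detail, so this is not expected to reproduce its scores exactly.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard English constants)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Invented counts: 100 words in 5 sentences with 150 syllables.
grade = fk_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))  # 9.91
```

Shorter sentences and shorter words both lower the grade, which is why a simplified sentence is expected to score below its source.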

In [None]:
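# Note: despite the names, 'asl_*' holds the syllable count per row and 'aws_*' the word count per row.
df_train['asl_src'] = df_train['src'].apply(lambda x: textstat.syllable_count(x))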
df_train['asl_dst'] = df_train['dst'].apply(lambda x: textstat.syllable_count(x))

df_val['asl_src'] = df_val['src'].apply(lambda x: textstat.syllable_count(x))
df_val['asl_dst'] = df_val['dst'].apply(lambda x: textstat.syllable_count(x))

df_test['asl_src'] = df_test['src'].apply(lambda x: textstat.syllable_count(x))
df_test['asl_dst'] = df_test['dst'].apply(lambda x: textstat.syllable_count(x))



df_train['aws_src'] = df_train['src'].apply(lambda x: textstat.lexicon_count(x))
df_train['aws_dst'] = df_train['dst'].apply(lambda x: textstat.lexicon_count(x))

df_val['aws_src'] = df_val['src'].apply(lambda x: textstat.lexicon_count(x))
df_val['aws_dst'] = df_val['dst'].apply(lambda x: textstat.lexicon_count(x))

df_test['aws_src'] = df_test['src'].apply(lambda x: textstat.lexicon_count(x))
df_test['aws_dst'] = df_test['dst'].apply(lambda x: textstat.lexicon_count(x))

In [None]:
df_train.fkg_src.mean(), df_train.fkg_dst.mean(), df_train.fkg_src.mean() - df_train.fkg_dst.mean()

(11.971624297044338, 9.28312057426604, 2.6885037227782984)

In [None]:
df_val.fkg_src.mean(), df_val.fkg_dst.mean(), df_val.fkg_src.mean() - df_val.fkg_dst.mean()

(12.096454767726172, 9.439119804400974, 2.657334963325198)

In [None]:
df_test.fkg_src.mean(), df_test.fkg_dst.mean(), df_test.fkg_src.mean() - df_test.fkg_dst.mean()

(11.017484662576688, 9.821779141104283, 1.195705521472405)
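
The three cells above repeat the same pattern: source mean, simplified mean, and their difference. On the reported numbers, the gap is about 2.69 on train and 2.66 on val but only about 1.20 on test. A small sketch of a reusable helper follows; `readability_gap` is a hypothetical name and the scores in the example are made up.

```python
from statistics import mean

def readability_gap(src_scores, dst_scores):
    """Return (mean source score, mean simplified score, difference)."""
    src_mean = mean(src_scores)
    dst_mean = mean(dst_scores)
    return src_mean, dst_mean, src_mean - dst_mean

# Made-up grade levels for two sentence pairs.
print(readability_gap([12.0, 10.0], [9.0, 7.0]))
```

In the notebook this could be called as `readability_gap(df_train.fkg_src, df_train.fkg_dst)`, assuming both columns are numeric.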

In [None]:
df_train.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_train_CosSImFKG.csv')
df_val.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_val_CosSImFKG.csv')
df_test.to_csv('/content/drive/MyDrive/MT_sentence_simpl/WikiLarge_test_CosSImFKG.csv')


### Grammar Checker

In [7]:
tool = language_tool_python.LanguageTool('en')

Downloading LanguageTool: 100%|██████████| 190M/190M [00:15<00:00, 12.0MB/s]
Unzipping /tmp/tmpz7h9vuh4.zip to /root/.cache/language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.2.zip to /root/.cache/language_tool_python.


In [None]:
def get_mistakes_summary(df):
    """Count LanguageTool matches per category for the 'src' and 'dst' columns."""
    # Collect every match reported for each sentence.
    matches_src = []
    for sentence in df['src'].values:
        matches_src.extend(tool.check(sentence))

    matches_dst = []
    for sentence in df['dst'].values:
        matches_dst.extend(tool.check(sentence))

    # Use one shared key set so both dicts are directly comparable.
    categories = set(m.category for m in matches_src + matches_dst)
    categories_src = {c: 0 for c in categories}
    categories_dst = {c: 0 for c in categories}

    for m in matches_src:
        categories_src[m.category] += 1
    for m in matches_dst:
        categories_dst[m.category] += 1

    return categories_src, categories_dst
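
The counting logic above can be tried without starting LanguageTool. The sketch below uses a stand-in match object with only a `category` attribute. `FakeMatch`, `count_categories` and `align` are hypothetical names for illustration, not part of `language_tool_python`.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FakeMatch:
    """Stand-in for a grammar-checker match; only the category is modelled."""
    category: str

def count_categories(matches):
    """Count how many matches fall into each category."""
    return Counter(m.category for m in matches)

def align(src_counts, dst_counts):
    """Give both summaries the same keys, filling missing categories with 0."""
    keys = sorted(set(src_counts) | set(dst_counts))
    return ({k: src_counts[k] for k in keys},
            {k: dst_counts[k] for k in keys})

src = count_categories([FakeMatch("TYPOS"), FakeMatch("TYPOS"), FakeMatch("GRAMMAR")])
dst = count_categories([FakeMatch("STYLE")])
print(align(src, dst))
```

Sharing the key set is what makes a category with zero hits on one side still show up in that side's summary.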

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test)

In [None]:
src_errors

{'CASING': 3,
 'COLLOCATIONS': 1,
 'GRAMMAR': 3,
 'MISC': 2,
 'PUNCTUATION': 13,
 'REDUNDANCY': 0,
 'TYPOGRAPHY': 826,
 'TYPOS': 2}

In [None]:
dst_errors

{'CASING': 4,
 'COLLOCATIONS': 0,
 'GRAMMAR': 11,
 'MISC': 2,
 'PUNCTUATION': 14,
 'REDUNDANCY': 1,
 'TYPOGRAPHY': 744,
 'TYPOS': 4}
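
TYPOGRAPHY dominates both test-split summaries. One possible cause, not verified here, is that WikiLarge text is tokenized with spaces around punctuation. A short sketch of how the two summaries could be compared category by category follows, using the test-split counts printed above; `category_delta` is a hypothetical helper name.

```python
# Test-split counts copied from the two outputs above.
src_errors = {'CASING': 3, 'COLLOCATIONS': 1, 'GRAMMAR': 3, 'MISC': 2,
              'PUNCTUATION': 13, 'REDUNDANCY': 0, 'TYPOGRAPHY': 826, 'TYPOS': 2}
dst_errors = {'CASING': 4, 'COLLOCATIONS': 0, 'GRAMMAR': 11, 'MISC': 2,
              'PUNCTUATION': 14, 'REDUNDANCY': 1, 'TYPOGRAPHY': 744, 'TYPOS': 4}

def category_delta(src, dst):
    """Per-category change from source to simplified (dst minus src)."""
    keys = sorted(set(src) | set(dst))
    return {k: dst.get(k, 0) - src.get(k, 0) for k in keys}

delta = category_delta(src_errors, dst_errors)
for category, change in delta.items():
    print(f"{category:<14}{change:+d}")
```

Raw counts are not normalized by sentence count or length, so differences between splits of different sizes should be read with care.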

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val)

In [None]:
src_errors

{'CASING': 1,
 'COLLOCATIONS': 3,
 'CONFUSED_WORDS': 1,
 'GRAMMAR': 12,
 'MISC': 7,
 'NONSTANDARD_PHRASES': 0,
 'PUNCTUATION': 21,
 'REDUNDANCY': 2,
 'SEMANTICS': 3,
 'STYLE': 3,
 'TYPOGRAPHY': 2476,
 'TYPOS': 6}

In [None]:
dst_errors

{'CASING': 2,
 'COLLOCATIONS': 2,
 'CONFUSED_WORDS': 2,
 'GRAMMAR': 12,
 'MISC': 15,
 'NONSTANDARD_PHRASES': 2,
 'PUNCTUATION': 43,
 'REDUNDANCY': 3,
 'SEMANTICS': 1,
 'STYLE': 4,
 'TYPOGRAPHY': 2019,
 'TYPOS': 9}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_train)

In [None]:
src_errors

{'CASING': 1069,
 'COLLOCATIONS': 230,
 'COMPOUNDING': 20,
 'CONFUSED_WORDS': 222,
 'GRAMMAR': 3288,
 'MISC': 2755,
 'NONSTANDARD_PHRASES': 57,
 'PUNCTUATION': 9216,
 'REDUNDANCY': 1495,
 'SEMANTICS': 127,
 'STYLE': 539,
 'TYPOGRAPHY': 763452,
 'TYPOS': 2092}

In [None]:
dst_errors

{'CASING': 1113,
 'COLLOCATIONS': 231,
 'COMPOUNDING': 17,
 'CONFUSED_WORDS': 238,
 'GRAMMAR': 4364,
 'MISC': 3319,
 'NONSTANDARD_PHRASES': 182,
 'PUNCTUATION': 14757,
 'REDUNDANCY': 1418,
 'SEMANTICS': 98,
 'STYLE': 524,
 'TYPOGRAPHY': 609128,
 'TYPOS': 2513}

# WikiLarge translated with the Helsinki OPUS MT model from transformers

### Cosine Similarity between original/simple sentences

In [None]:
# df_train_mt, df_val_mt ,df_test_mt

In [None]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# config = AutoConfig.from_pretrained("DeepPavlov/rubert-base-cased") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
# config.output_hidden_states = True

# tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
# model = AutoModelForMaskedLM.from_pretrained("DeepPavlov/rubert-base-cased", config=config)
# model.to(device)

In [None]:
calc_cos_sim(df_test_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_val_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_train_mt, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')


In [None]:
df_train_mt.cos_sim_x_y.mean(), df_val_mt.cos_sim_x_y.mean(), df_test_mt.cos_sim_x_y.mean()

0.9468623274033268 0.9478329893174264 0.9841433672085862

### Cosine Similarity between English sentences and their translations


In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = SentenceTransformer('LaBSE')
model.to(device)

def calc_cos_sim_(df, model):
    """Add LaBSE cosine similarity between each English sentence and its translation."""
    LABSE_orig = []
    LABSE_simpl = []
    for index, row in df.iterrows():
        # original sentence vs. its translation
        emb_source = model.encode(row['src'])
        emb_target = model.encode(row['target_x'])
        cos_val = cosine_similarity(emb_source.reshape(1, -1), emb_target.reshape(1, -1))[0][0]
        LABSE_orig.append(cos_val)

        # simplified sentence vs. its translation
        emb_source = model.encode(row['dst'])
        emb_target = model.encode(row['target_y'])
        cos_val = cosine_similarity(emb_source.reshape(1, -1), emb_target.reshape(1, -1))[0][0]
        LABSE_simpl.append(cos_val)

    df['LABSE_orig'] = LABSE_orig
    df['LABSE_simpl'] = LABSE_simpl

calc_cos_sim_(df_test_mt, model)
calc_cos_sim_(df_val_mt, model)
calc_cos_sim_(df_train_mt, model)

In [8]:
df_train_mt.LABSE_orig.mean(), df_val_mt.LABSE_orig.mean(), df_test_mt.LABSE_orig.mean()

(0.8862167089812055, 0.8850720814067169, 0.8860442356829263)

In [9]:
df_train_mt.LABSE_simpl.mean(), df_val_mt.LABSE_simpl.mean(), df_test_mt.LABSE_simpl.mean()

(0.8818510114127176, 0.8806601700653074, 0.8815181845901934)

### Flesch Kincaid Grade Level

In [8]:
textstat.set_lang('ru')

In [None]:
df_train_mt['fkg_src'] = df_train_mt['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_train_mt['fkg_dst'] = df_train_mt['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_val_mt['fkg_src'] = df_val_mt['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_val_mt['fkg_dst'] = df_val_mt['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_test_mt['fkg_src'] = df_test_mt['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_test_mt['fkg_dst'] = df_test_mt['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))
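
The grade-level constants were fitted to English text, so the Russian scores are probably best read relative to each other rather than as school grades. Syllable counting is the language-dependent part: in Russian, each syllable contains exactly one vowel, so counting vowel letters gives the syllable count. The sketch below shows that idea; `count_syllables_ru` is a hypothetical helper, and textstat may count syllables differently.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")

def count_syllables_ru(text):
    """Count syllables in Russian text as the number of vowel letters."""
    return sum(1 for ch in text.lower() if ch in RUSSIAN_VOWELS)

print(count_syllables_ru("Гора Батур"))  # 4: Го-ра Ба-тур
```

Latin-script tokens such as "TOPS" or "Networker" contribute nothing under this rule, which is one reason such a count can diverge from a hyphenation-based one.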


In [28]:
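# Note: the 'asl_*' and 'aws_*' columns are computed on the English 'src'/'dst' columns, not on the translations 'target_x'/'target_y'.
df_train_mt['asl_src'] = df_train_mt['src'].apply(lambda x: textstat.syllable_count(x))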
df_train_mt['asl_dst'] = df_train_mt['dst'].apply(lambda x: textstat.syllable_count(x))

df_val_mt['asl_src'] = df_val_mt['src'].apply(lambda x: textstat.syllable_count(x))
df_val_mt['asl_dst'] = df_val_mt['dst'].apply(lambda x: textstat.syllable_count(x))

df_test_mt['asl_src'] = df_test_mt['src'].apply(lambda x: textstat.syllable_count(x))
df_test_mt['asl_dst'] = df_test_mt['dst'].apply(lambda x: textstat.syllable_count(x))



df_train_mt['aws_src'] = df_train_mt['src'].apply(lambda x: textstat.lexicon_count(x))
df_train_mt['aws_dst'] = df_train_mt['dst'].apply(lambda x: textstat.lexicon_count(x))

df_val_mt['aws_src'] = df_val_mt['src'].apply(lambda x: textstat.lexicon_count(x))
df_val_mt['aws_dst'] = df_val_mt['dst'].apply(lambda x: textstat.lexicon_count(x))

df_test_mt['aws_src'] = df_test_mt['src'].apply(lambda x: textstat.lexicon_count(x))
df_test_mt['aws_dst'] = df_test_mt['dst'].apply(lambda x: textstat.lexicon_count(x))

In [None]:
df_train_mt.fkg_src.mean(), df_train_mt.fkg_dst.mean(), df_train_mt.fkg_src.mean() - df_train_mt.fkg_dst.mean()

(18.980026559457052, 16.78072819877633, 2.1992983606807215)

In [None]:
df_val_mt.fkg_src.mean(), df_val_mt.fkg_dst.mean(), df_val_mt.fkg_src.mean() - df_val_mt.fkg_dst.mean()

(19.49914425427874, 16.874449877750607, 2.624694376528133)

In [None]:
df_test_mt.fkg_src.mean(), df_test_mt.fkg_dst.mean(), df_test_mt.fkg_src.mean() - df_test_mt.fkg_dst.mean()

(19.190490797546026, 18.366871165644167, 0.8236196319018596)

In [29]:
df_train_mt['asl_src'].mean(), df_train_mt['asl_dst'].mean(), df_train_mt['asl_src'].mean() - df_train_mt['asl_dst'].mean()


(25.965731822359338, 20.707311543242117, 5.258420279117221)

In [30]:
df_val_mt['asl_src'].mean(), df_val_mt['asl_dst'].mean(), df_val_mt['asl_src'].mean() - df_val_mt['asl_dst'].mean()


(26.251833740831295, 21.20171149144254, 5.050122249388753)

In [31]:
df_test_mt['asl_src'].mean(), df_test_mt['asl_dst'].mean(), df_test_mt['asl_src'].mean() - df_test_mt['asl_dst'].mean()


(22.573619631901842, 21.782208588957054, 0.7914110429447874)

In [32]:
df_train_mt['aws_src'].mean(), df_train_mt['aws_dst'].mean(), df_train_mt['aws_src'].mean() - df_train_mt['aws_dst'].mean()


(22.45734899369618, 17.92156862745098, 4.5357803662451985)

In [33]:
df_val_mt['aws_src'].mean(), df_val_mt['aws_dst'].mean(), df_val_mt['aws_src'].mean() - df_val_mt['aws_dst'].mean()


(22.778728606356967, 18.377750611246945, 4.400977995110022)

In [34]:
df_test_mt['aws_src'].mean(), df_test_mt['aws_dst'].mean(), df_test_mt['aws_src'].mean() - df_test_mt['aws_dst'].mean()

(19.83128834355828, 19.346625766871167, 0.4846625766871142)

In [None]:
df_train_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_train_CosSImFKG.csv')
df_val_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_val_CosSImFKG.csv')
df_test_mt.to_csv('/content/drive/MyDrive/MT_sentence_simpl/MT_WikiLarge_test_CosSImFKG.csv')


### Grammar Checker

In [5]:
tool = language_tool_python.LanguageTool('ru')

Downloading LanguageTool: 100%|██████████| 190M/190M [00:10<00:00, 18.8MB/s]
Unzipping /tmp/tmpytc6_ady.zip to /root/.cache/language_tool_python.
Downloaded https://www.languagetool.org/download/LanguageTool-5.2.zip to /root/.cache/language_tool_python.


In [6]:
def get_mistakes_summary(df):
    """Count LanguageTool matches per category for the 'target_x' and 'target_y' columns."""
    # Collect every match reported for each translated sentence.
    matches_src = []
    for sentence in df['target_x'].values:
        matches_src.extend(tool.check(sentence))

    matches_dst = []
    for sentence in df['target_y'].values:
        matches_dst.extend(tool.check(sentence))

    # Use one shared key set so both dicts are directly comparable.
    categories = set(m.category for m in matches_src + matches_dst)
    categories_src = {c: 0 for c in categories}
    categories_dst = {c: 0 for c in categories}

    for m in matches_src:
        categories_src[m.category] += 1
    for m in matches_dst:
        categories_dst[m.category] += 1

    return categories_src, categories_dst

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test_mt)

In [None]:
src_errors

{'CASING': 1,
 'GRAMMAR': 2,
 'LOGIC': 1,
 'PUNCTUATION': 11,
 'STYLE': 6,
 'TYPOGRAPHY': 62,
 'TYPOS': 237}

In [None]:
dst_errors

{'CASING': 1,
 'GRAMMAR': 3,
 'LOGIC': 0,
 'PUNCTUATION': 10,
 'STYLE': 6,
 'TYPOGRAPHY': 49,
 'TYPOS': 242}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val_mt)

In [None]:
src_errors

{'CASING': 4,
 'GRAMMAR': 28,
 'LOGIC': 1,
 'MISC': 4,
 'PUNCTUATION': 37,
 'STYLE': 10,
 'TYPOGRAPHY': 141,
 'TYPOS': 797}

In [None]:
dst_errors

{'CASING': 4,
 'GRAMMAR': 17,
 'LOGIC': 0,
 'MISC': 6,
 'PUNCTUATION': 13,
 'STYLE': 7,
 'TYPOGRAPHY': 120,
 'TYPOS': 631}

In [11]:
src_errors, dst_errors = get_mistakes_summary(df_train_mt)

In [12]:
src_errors

{'CASING': 1297,
 'EXTEND': 62,
 'GRAMMAR': 6611,
 'LOGIC': 362,
 'MISC': 2384,
 'PUNCTUATION': 10517,
 'STYLE': 3211,
 'TYPOGRAPHY': 46870,
 'TYPOS': 235762}

In [13]:
dst_errors

{'CASING': 1019,
 'EXTEND': 25,
 'GRAMMAR': 4987,
 'LOGIC': 252,
 'MISC': 1779,
 'PUNCTUATION': 6954,
 'STYLE': 1981,
 'TYPOGRAPHY': 38058,
 'TYPOS': 184223}

# WikiLarge translated with Google API


### Cosine Similarity between original/simple sentences


In [None]:
df_train_gl, df_val_gl, df_test_gl

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
config = AutoConfig.from_pretrained("DeepPavlov/rubert-base-cased") # "roberta-base" 'xlm-mlm-100-1280' 'xlm-roberta-base' 'bert-base-multilingual-cased'
config.output_hidden_states = True

tok = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("DeepPavlov/rubert-base-cased", config=config)
model.to(device)

In [None]:
calc_cos_sim(df_test_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_val_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')
calc_cos_sim(df_train_gl, model, tok, 'target_x', 'target_y', 'cos_sim_x_y')

In [None]:
df_train_gl.cos_sim_x_y.mean(), df_val_gl.cos_sim_x_y.mean(), df_test_gl.cos_sim_x_y.mean()

(0.8626444426285484, 0.8616611667481872, 0.9632613898956612)

### Cosine Similarity between English sentences and their translations

In [14]:
# Cast all text columns to str so that missing values do not break encoding.
for df in (df_test_gl, df_train_gl, df_val_gl):
    for col in ('src', 'dst', 'target_x', 'target_y'):
        df[col] = df[col].astype(str)

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model = SentenceTransformer('LaBSE')
# model.to(device)

calc_cos_sim_(df_test_gl, model)
calc_cos_sim_(df_val_gl, model)
calc_cos_sim_(df_train_gl, model)

In [15]:
df_train_gl.LABSE_orig.mean(), df_val_gl.LABSE_orig.mean(), df_test_gl.LABSE_orig.mean()

(0.8977616297411125, 0.8965500514023006, 0.8895553515382009)

In [17]:
df_train_gl.LABSE_simpl.mean(), df_val_gl.LABSE_simpl.mean(), df_test_gl.LABSE_simpl.mean()

(0.8960233570680219, 0.8958086021399746, 0.8843331897095458)
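
To compare the two translation systems, the LaBSE means for the original sentences can be put side by side. The sketch below hard-codes the values reported in this notebook, rounded to four decimals; the dictionary layout and the name `labse_advantage` are assumptions made for illustration.

```python
# Mean LaBSE similarity (English original vs. Russian translation), rounded
# from the outputs reported in the Helsinki and Google sections.
labse_orig = {
    "helsinki": {"train": 0.8862, "val": 0.8851, "test": 0.8860},
    "google": {"train": 0.8978, "val": 0.8966, "test": 0.8896},
}

def labse_advantage(scores, system_a, system_b):
    """Per-split difference system_a minus system_b."""
    return {split: scores[system_a][split] - scores[system_b][split]
            for split in scores[system_a]}

diff = labse_advantage(labse_orig, "google", "helsinki")
for split, value in diff.items():
    print(f"{split:<6}{value:+.4f}")
```

On these means the Google translations score slightly higher on every split. The differences are small, and no significance test is applied here.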

### Flesch Kincaid Grade Level


In [18]:
textstat.set_lang('ru')

In [None]:
df_train_gl['fkg_src'] = df_train_gl['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_train_gl['fkg_dst'] = df_train_gl['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_val_gl['fkg_src'] = df_val_gl['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_val_gl['fkg_dst'] = df_val_gl['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))

df_test_gl['fkg_src'] = df_test_gl['target_x'].apply(lambda x: textstat.flesch_kincaid_grade(x))
df_test_gl['fkg_dst'] = df_test_gl['target_y'].apply(lambda x: textstat.flesch_kincaid_grade(x))


In [19]:
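# Note: the 'asl_*' and 'aws_*' columns are computed on the English 'src'/'dst' columns, not on the translations 'target_x'/'target_y'.
df_train_gl['asl_src'] = df_train_gl['src'].apply(lambda x: textstat.syllable_count(x))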
df_train_gl['asl_dst'] = df_train_gl['dst'].apply(lambda x: textstat.syllable_count(x))

df_val_gl['asl_src'] = df_val_gl['src'].apply(lambda x: textstat.syllable_count(x))
df_val_gl['asl_dst'] = df_val_gl['dst'].apply(lambda x: textstat.syllable_count(x))

df_test_gl['asl_src'] = df_test_gl['src'].apply(lambda x: textstat.syllable_count(x))
df_test_gl['asl_dst'] = df_test_gl['dst'].apply(lambda x: textstat.syllable_count(x))



df_train_gl['aws_src'] = df_train_gl['src'].apply(lambda x: textstat.lexicon_count(x))
df_train_gl['aws_dst'] = df_train_gl['dst'].apply(lambda x: textstat.lexicon_count(x))

df_val_gl['aws_src'] = df_val_gl['src'].apply(lambda x: textstat.lexicon_count(x))
df_val_gl['aws_dst'] = df_val_gl['dst'].apply(lambda x: textstat.lexicon_count(x))

df_test_gl['aws_src'] = df_test_gl['src'].apply(lambda x: textstat.lexicon_count(x))
df_test_gl['aws_dst'] = df_test_gl['dst'].apply(lambda x: textstat.lexicon_count(x))

In [None]:
df_train_gl.fkg_src.mean(), df_train_gl.fkg_dst.mean(), df_train_gl.fkg_src.mean() - df_train_gl.fkg_dst.mean()

(19.80360113046588, 17.572182137681914, 2.231418992783965)

In [None]:
df_val_gl.fkg_src.mean(), df_val_gl.fkg_dst.mean(), df_val_gl.fkg_src.mean() - df_val_gl.fkg_dst.mean()


(20.149479166666662, 17.943750000000048, 2.2057291666666146)

In [None]:
df_test_gl.fkg_src.mean(), df_test_gl.fkg_dst.mean(), df_test_gl.fkg_src.mean() - df_test_gl.fkg_dst.mean()


(19.73753424657536, 19.098630136986294, 0.6389041095890668)

In [21]:
df_train_gl['asl_src'].mean(), df_train_gl['asl_dst'].mean(), df_train_gl['asl_src'].mean() - df_train_gl['asl_dst'].mean()

(25.965810719983157, 20.70780393395363, 5.258006786029526)

In [22]:
df_val_gl['asl_src'].mean(), df_val_gl['asl_dst'].mean(), df_val_gl['asl_src'].mean() - df_val_gl['asl_dst'].mean()

(26.22265625, 21.311197916666668, 4.911458333333332)

In [23]:
df_test_gl['asl_src'].mean(), df_test_gl['asl_dst'].mean(), df_test_gl['asl_src'].mean() - df_test_gl['asl_dst'].mean()

(22.75890410958904, 21.964383561643835, 0.7945205479452042)

In [25]:
df_train_gl['aws_src'].mean(), df_train_gl['aws_dst'].mean(), df_train_gl['aws_src'].mean() - df_train_gl['aws_dst'].mean()

(22.45742130878054, 17.921976856238206, 4.5354444525423325)

In [26]:
df_val_gl['aws_src'].mean(), df_val_gl['aws_dst'].mean(), df_val_gl['aws_src'].mean() - df_val_gl['aws_dst'].mean()

(22.725260416666668, 18.451822916666668, 4.2734375)

In [27]:
df_test_gl['aws_src'].mean(), df_test_gl['aws_dst'].mean(), df_test_gl['aws_src'].mean() - df_test_gl['aws_dst'].mean()

(19.980821917808218, 19.476712328767125, 0.5041095890410929)

In [None]:
df_train_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_train_CosSImFKG.csv')
df_val_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_val_CosSImFKG.csv')
df_test_gl.to_csv('/content/drive/MyDrive/MT_sentence_simpl/Google_WikiLarge_test_CosSImFKG.csv')

### Grammar Checker

In [None]:
tool = language_tool_python.LanguageTool('ru')

In [None]:
# def get_mistakes_summary(df_test):
#     src_test = list(df_test['target_x'].values)
#     dst_test =list(df_test['target_y'].values)
#     matches_src = []
#     for i in src_test:
#       matches_src.extend(tool.check(i))
#     matches_src

#     matches_dst = []
#     for i in dst_test:
#       matches_dst.extend(tool.check(i))
#     matches_dst

#     categories = set([i.category for i in matches_src+matches_dst])

#     categories_src = {i:0 for i in categories}
#     categories_dst = {i:0 for i in categories}

#     for i in matches_src:
#       categories_src[i.category]+=1

#     for i in matches_dst:
#       categories_dst[i.category]+=1
      
#     return categories_src, categories_dst

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_test_gl)


In [None]:
src_errors

{'CASING': 2,
 'GRAMMAR': 2,
 'MISC': 1,
 'PUNCTUATION': 2,
 'STYLE': 6,
 'TYPOGRAPHY': 37,
 'TYPOS': 227}

In [None]:
dst_errors

{'CASING': 1,
 'GRAMMAR': 1,
 'MISC': 1,
 'PUNCTUATION': 5,
 'STYLE': 5,
 'TYPOGRAPHY': 44,
 'TYPOS': 234}

In [None]:
src_errors, dst_errors = get_mistakes_summary(df_val_gl)


In [None]:
src_errors

{'CASING': 9,
 'EXTEND': 1,
 'GRAMMAR': 5,
 'LOGIC': 2,
 'MISC': 4,
 'PUNCTUATION': 18,
 'STYLE': 8,
 'TYPOGRAPHY': 144,
 'TYPOS': 716}

In [None]:
dst_errors

{'CASING': 10,
 'EXTEND': 0,
 'GRAMMAR': 7,
 'LOGIC': 0,
 'MISC': 8,
 'PUNCTUATION': 10,
 'STYLE': 10,
 'TYPOGRAPHY': 140,
 'TYPOS': 568}

In [8]:
src_errors, dst_errors = get_mistakes_summary(df_train_gl)

In [9]:
src_errors

{'CASING': 3605,
 'EXTEND': 33,
 'GRAMMAR': 3296,
 'LOGIC': 305,
 'MISC': 1921,
 'PUNCTUATION': 4863,
 'STYLE': 3391,
 'TYPOGRAPHY': 41504,
 'TYPOS': 221050}

In [10]:
dst_errors

{'CASING': 1663,
 'EXTEND': 19,
 'GRAMMAR': 2328,
 'LOGIC': 244,
 'MISC': 1711,
 'PUNCTUATION': 3447,
 'STYLE': 1959,
 'TYPOGRAPHY': 38674,
 'TYPOS': 173040}

In [None]:
# lexical diversity
# https://shravan-kuchkula.github.io/Lexical-Diversity/#normalizing-text-to-understand-vocabulary
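
The lexical-diversity note above is left open in this notebook. One common measure is the type-token ratio: distinct words divided by total words. A minimal sketch under that assumption follows; `type_token_ratio` is a hypothetical helper, and the regex tokenizer is a simplification.

```python
import re

def type_token_ratio(text):
    """Distinct word tokens divided by total word tokens (case-insensitive)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# A simplified sentence from the dataset sample: 10 tokens, 9 distinct.
print(type_token_ratio("Mount Batur or Gunung Batur is a volcano on Bali ."))
```

The ratio falls as texts get longer, so it is only comparable between texts of similar length. Python's `\w` also matches Cyrillic letters, so the same function can be applied to the translations.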