<a href="https://colab.research.google.com/github/dumaaan/nlp_projects/blob/master/Simple_Translator_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Translator From Ukrainian to Russian
## Part 1: Bilingual dictionary induction and unsupervised embedding-based MT
*Note: this is based on materials from yandexdataschool [NLP course](https://github.com/yandexdataschool/nlp_course/). Feel free to check this awesome course if you wish to dig deeper.*

## Data

In [0]:
import gensim
import numpy as np
from gensim.models import KeyedVectors

We're going to use pretrained FastText word vectors ([original paper](https://arxiv.org/abs/1607.04606)).

You can download them from the official [website](https://fasttext.cc/docs/en/crawl-vectors.html). We need the embeddings for Russian and Ukrainian.

In [4]:
uk_emb = KeyedVectors.load_word2vec_format("https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz")

In [5]:
ru_emb = KeyedVectors.load_word2vec_format("https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.vec.gz")

In [6]:
ru_emb.most_similar([ru_emb["август"]], topn=10) #august

[('август', 1.0),
 ('июль', 0.9383153915405273),
 ('сентябрь', 0.9240028858184814),
 ('июнь', 0.9222575426101685),
 ('октябрь', 0.9095538854598999),
 ('ноябрь', 0.8930036425590515),
 ('апрель', 0.8729087114334106),
 ('декабрь', 0.8652557730674744),
 ('март', 0.8545796275138855),
 ('февраль', 0.8401416540145874)]

In [7]:
uk_emb.most_similar([uk_emb["серпень"]])

[('серпень', 0.9999999403953552),
 ('липень', 0.9096440076828003),
 ('вересень', 0.901697039604187),
 ('червень', 0.8992519378662109),
 ('жовтень', 0.8810408711433411),
 ('листопад', 0.8787633776664734),
 ('квітень', 0.8592804670333862),
 ('грудень', 0.8586863279342651),
 ('травень', 0.8408110737800598),
 ('лютий', 0.8256431818008423)]

In [8]:
ru_emb.most_similar([uk_emb["серпень"]])

[('Stepashka.com', 0.2757962942123413),
 ('ЖИЗНИВадим', 0.25203436613082886),
 ('2Дмитрий', 0.25048112869262695),
 ('2012Дмитрий', 0.24829231202602386),
 ('Ведущий-Алексей', 0.2443869560956955),
 ('Недопустимость', 0.24435284733772278),
 ('2Михаил', 0.23981399834156036),
 ('лексей', 0.23740756511688232),
 ('комплексн', 0.23695150017738342),
 ('персональ', 0.2368222028017044)]

Next we load two small dictionaries of corresponding word pairs, one as the training set and one as the test set.

In [0]:
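def load_word_pairs(filename):
    """Read tab-separated (Ukrainian, Russian) pairs and look up their embeddings."""
    uk_ru_pairs = []
    uk_vectors = []
    ru_vectors = []
    # The dictionaries contain Cyrillic text, so read them explicitly as UTF-8
    with open(filename, "r", encoding="utf-8") as inpf:
        for line in inpf:
            uk, ru = line.rstrip().split("\t")
            # Skip pairs where either word has no pretrained embedding
            if uk not in uk_emb or ru not in ru_emb:
                continue
            uk_ru_pairs.append((uk, ru))
            uk_vectors.append(uk_emb[uk])
            ru_vectors.append(ru_emb[ru])
    return uk_ru_pairs, np.array(uk_vectors), np.array(ru_vectors)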

In [10]:
!wget -O ukr_rus.train.txt http://tiny.cc/jfgecz

--2020-05-14 08:01:29--  http://tiny.cc/jfgecz
Resolving tiny.cc (tiny.cc)... 192.241.240.89
Connecting to tiny.cc (tiny.cc)|192.241.240.89|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://tiny.cc/jfgecz [following]
--2020-05-14 08:01:29--  https://tiny.cc/jfgecz
Connecting to tiny.cc (tiny.cc)|192.241.240.89|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week01_embeddings/ukr_rus.train.txt [following]
--2020-05-14 08:01:29--  https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week01_embeddings/ukr_rus.train.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59351 (58K) [text/plain]
Saving to:

In [11]:
!wget -O ukr_rus.test.txt http://tiny.cc/6zoeez

--2020-05-14 08:01:32--  http://tiny.cc/6zoeez
Resolving tiny.cc (tiny.cc)... 192.241.240.89
Connecting to tiny.cc (tiny.cc)|192.241.240.89|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://tiny.cc/6zoeez [following]
--2020-05-14 08:01:32--  https://tiny.cc/6zoeez
Connecting to tiny.cc (tiny.cc)|192.241.240.89|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week01_embeddings/ukr_rus.test.txt [following]
--2020-05-14 08:01:32--  https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/week01_embeddings/ukr_rus.test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12188 (12K) [text/plain]
Saving to: ‘

In [0]:
uk_ru_train, X_train, Y_train = load_word_pairs("ukr_rus.train.txt")

In [0]:
uk_ru_test, X_test, Y_test = load_word_pairs("ukr_rus.test.txt")

## Embedding space mapping

Let $x_i \in \mathbb{R}^d$ be the distributed representation of word $i$ in the source language, and let $y_i \in \mathbb{R}^d$ be the vector representation of its translation. Our goal is to learn a linear transform $W$ that minimizes the Euclidean distance between $Wx_i$ and $y_i$ over some subset of word embeddings. This gives the so-called Procrustes problem:

$$W^*= \arg\min_W \sum_{i=1}^n\|Wx_i - y_i\|_2^2$$
or, since $\|WX - Y\|_F^2 = \sum_{i=1}^n\|Wx_i - y_i\|_2^2$, equivalently
$$W^*= \arg\min_W \|WX - Y\|_F$$

where $\|\cdot\|_F$ is the Frobenius norm and the columns of $X$ and $Y$ are the vectors $x_i$ and $y_i$.

$W^*= \arg\min_W \sum_{i=1}^n\|Wx_i - y_i\|_2^2$ is an ordinary multiple linear regression (fitted without an intercept).

In [0]:
from sklearn.linear_model import LinearRegression
mapping = LinearRegression(fit_intercept=False).fit(X_train, Y_train)
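
Because the objective is plain least squares, the fit can be cross-checked against a closed-form solver. The sketch below is only an illustration on small synthetic matrices (`X_toy`, `Y_toy` and `W_true` are made-up names and data, not the FastText vectors). It uses `np.linalg.lstsq` with embeddings stored as rows, so the mapping is applied as `X_toy @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 "word pairs" in a 4-dimensional space.
n_pairs, dim = 50, 4
W_true = rng.normal(size=(dim, dim))
X_toy = rng.normal(size=(n_pairs, dim))
Y_toy = X_toy @ W_true + 0.01 * rng.normal(size=(n_pairs, dim))

# Least squares without an intercept: minimise ||X_toy @ W - Y_toy||_F.
W_lstsq, _, _, _ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

# The recovered matrix should be close to the one that generated the data.
max_abs_error = np.abs(W_lstsq - W_true).max()
print(max_abs_error)
```

With scikit-learn, the corresponding matrix is stored in `coef_` with shape `(n_targets, n_features)`, which is the transpose of the `W_lstsq` layout used here.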

Let's take a look at the neighbours of the vector for the word _"серпень"_ (_"август"_ in Russian) after the linear transform.

In [15]:
august = mapping.predict(uk_emb["серпень"].reshape(1, -1))
ru_emb.most_similar(august)

[('апрель', 0.8531432747840881),
 ('июнь', 0.8402522802352905),
 ('март', 0.8385884165763855),
 ('сентябрь', 0.8331484794616699),
 ('февраль', 0.8311208486557007),
 ('октябрь', 0.8278019428253174),
 ('ноябрь', 0.8243728280067444),
 ('июль', 0.8229618072509766),
 ('август', 0.8112280368804932),
 ('январь', 0.8022986650466919)]

We can see that the neighbourhood of this embedding consists of different months, but the correct translation is only in ninth place.

As a quality measure we will use precision at top-1, top-5 and top-10: for each transformed Ukrainian embedding we check whether the correct Russian translation is among its top N nearest neighbours in the Russian embedding space, and report the fraction of words for which it is.

In [0]:
def precision(pairs, mapped_vectors, topn=1):
    """
    :args:
        pairs = list of right word pairs [(uk_word_0, ru_word_0), ...]
        mapped_vectors = list of embeddings after mapping from source embedding space to destination embedding space
        topn = the number of nearest neighbours in destination embedding space to choose from
    :returns:
        precision_val, float number, fraction of words whose right translation is found among the top-n neighbours.
    """
    assert len(pairs) == len(mapped_vectors)
    num_matches = 0
    for (_, ru), mapped in zip(pairs, mapped_vectors):
        # Use the vector that was passed in, so any mapping can be evaluated
        neighbours = ru_emb.most_similar([mapped], topn=topn)
        close_words = [word.lower() for word, _score in neighbours]
        if ru in close_words:
            num_matches += 1
    precision_val = num_matches / len(pairs)
    return precision_val
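
To make the metric concrete without loading the full embeddings, here is a minimal NumPy sketch of precision at top-k on toy data. The function name `precision_at_k`, the two-dimensional vectors and the index-based gold labels are all assumptions made for illustration. It ranks targets by cosine similarity, which is the measure behind the `most_similar` scores shown earlier.

```python
import numpy as np

def precision_at_k(gold_indices, mapped_vectors, target_matrix, k=1):
    """Fraction of queries whose gold target is among the k most cosine-similar targets."""
    # Normalise rows so that a dot product equals cosine similarity.
    mapped = mapped_vectors / np.linalg.norm(mapped_vectors, axis=1, keepdims=True)
    targets = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    sims = mapped @ targets.T                  # shape: (n_queries, n_targets)
    top_k = np.argsort(-sims, axis=1)[:, :k]   # indices of the k most similar targets
    hits = [gold in row for gold, row in zip(gold_indices, top_k)]
    return float(np.mean(hits))

# Hypothetical 2-d target vocabulary with three words.
targets = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Two mapped source vectors; their correct translations are targets 0 and 1.
mapped = np.array([[0.9, 0.1], [0.6, 0.8]])
gold = [0, 1]

p1 = precision_at_k(gold, mapped, targets, k=1)
p2 = precision_at_k(gold, mapped, targets, k=2)
print(p1, p2)  # 0.5 1.0
```

The second query is closest to the third target, so it counts as a miss at top-1 and as a hit at top-2.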

In [17]:
precision([("серпень", "август")],august,topn=10)

1.0

## Making it better (orthogonal Procrustean problem)

It can be shown (see original paper) that a self-consistent linear mapping between semantic spaces should be orthogonal.
We can therefore restrict the transform $W$ to be orthogonal and solve the following problem:

$$W^*= \arg\min_W ||WX - Y||_F \text{, where: } W^TW = I$$

where $I$ is the identity matrix.

Instead of fitting yet another regression, we can find the optimal orthogonal transformation with a singular value decomposition (SVD). The optimal transformation $W^*$ can be expressed via the SVD components:
$$X^TY=U\Sigma V^T$$
$$W^*=UV^T$$

In this SVD form the rows of $X$ and $Y$ hold the embeddings, as in `X_train` and `Y_train`, so the mapping is applied to a row vector as $xW^*$.

In [0]:
import numpy as np

In [0]:
def learn_transform(X_train, Y_train):
    """ 
    :returns: W* : float matrix[emb_dim x emb_dim] as defined in formulae above
    """
    u, s, vt = np.linalg.svd(np.matmul(X_train.T,Y_train))
    mapping = np.matmul(u,vt)
    return mapping
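
As a sanity check of the SVD recipe, the sketch below applies it to synthetic data where the answer is known. `X_toy`, `Y_toy` and `Q_true` are hypothetical names and data invented for this illustration: `Y_toy` is an exact rotation of `X_toy`, so the recovered matrix should be orthogonal, should match that rotation, and should preserve vector norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: Y_toy is an exact orthogonal transform of X_toy.
dim = 5
Q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # a random orthogonal matrix
X_toy = rng.normal(size=(100, dim))
Y_toy = X_toy @ Q_true

# Same recipe as in the text: SVD of X^T Y, then W = U V^T.
u, _, vt = np.linalg.svd(X_toy.T @ Y_toy)
W_toy = u @ vt

is_orthogonal = np.allclose(W_toy.T @ W_toy, np.eye(dim))
recovers_rotation = np.allclose(W_toy, Q_true)
norms_preserved = np.allclose(
    np.linalg.norm(X_toy @ W_toy, axis=1), np.linalg.norm(X_toy, axis=1)
)
print(is_orthogonal, recovers_rotation, norms_preserved)
```

On real embeddings the two spaces are not exact rotations of each other, so only the orthogonality and norm checks are expected to hold there.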

In [0]:
W = learn_transform(X_train, Y_train)

In [22]:
ru_emb.most_similar([np.matmul(uk_emb["серпень"], W)])

[('апрель', 0.8245131373405457),
 ('июнь', 0.8056631088256836),
 ('сентябрь', 0.8055763244628906),
 ('март', 0.8032934069633484),
 ('октябрь', 0.798710286617279),
 ('июль', 0.7946796417236328),
 ('ноябрь', 0.7939636707305908),
 ('август', 0.7938191294670105),
 ('февраль', 0.7923860549926758),
 ('декабрь', 0.7715376615524292)]

## Unsupervised embedding-based MT

Now, let's build our word embeddings-based translator!

We will use the OPUS Tatoeba corpus.

In [23]:
!wget https://object.pouta.csc.fi/OPUS-Tatoeba/v20190709/mono/uk.txt.gz

--2020-05-14 08:14:15--  https://object.pouta.csc.fi/OPUS-Tatoeba/v20190709/mono/uk.txt.gz
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1819128 (1.7M) [application/gzip]
Saving to: ‘uk.txt.gz’


2020-05-14 08:14:16 (7.75 MB/s) - ‘uk.txt.gz’ saved [1819128/1819128]



In [0]:
!gzip -d ./uk.txt.gz

In [0]:
# The corpus is Ukrainian text, so read it explicitly as UTF-8
with open('./uk.txt', 'r', encoding='utf-8') as f:
    uk_corpus = f.readlines()

In [0]:
# To save time and CPU, we use only the first 1000 sentences of the corpus
uk_corpus = uk_corpus[:1000]

In [0]:
uk_corpus = [string.lower() for string in uk_corpus]

In [0]:
to_remove = ['\n',',','.','!','?']
for char in to_remove:
  uk_corpus = [string.replace(char,'') for string in uk_corpus]

In [29]:
uk_corpus[:11]

['я вже закінчу коледж коли ви вернетеся з америки',
 'він наказав мені негайно вийти з кімнати',
 'як би ти не намагався ти не вивчиш англійську за два-три місяці',
 'поки я не подзвонив він не прийшов',
 'у всесвіті багато галактик',
 'вона приймає душ щоранку',
 'неслухняний хлопчик заблукав й оглядався по сторонах',
 'вона повільно зникала в туманному лісі',
 'наш літак летів понад хмарами',
 'у майка є декілька друзів у флориді',
 'місто бомбардували ворожі літаки']

In [0]:
def translate(sentence):
    """
    :args:
        sentence - sentence in Ukrainian (str)
    :returns:
        translation - sentence in Russian (str)

    * find ukrainian embedding for each word in sentence
    * transform ukrainian embedding vector
    * find nearest russian word and replace
    """
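    # split() without arguments also copes with repeated spaces
    uk_words = sentence.split()
    # Map each Ukrainian vector into the Russian space and take its single nearest neighbour
    translated = [ru_emb.most_similar([np.matmul(uk_emb[uk_word], W)], topn=1)[0][0] for uk_word in uk_words]
    return " ".join(translated)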

In [31]:
translate('місто бомбардували ворожі літаки')

'город бомбили враждебные самолеты'

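Now we can play with the model and try to get translations that are as accurate as possible. **Note**: one big issue is out-of-vocabulary words. For now, this simple model does not handle them: any sentence containing a word without an embedding raises a `KeyError` and is skipped.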
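
One simple way to soften the out-of-vocabulary problem is to copy unknown words through unchanged instead of dropping the whole sentence. The sketch below is a possible approach rather than part of the original notebook: `translate_with_fallback`, the two-dimensional toy vectors and the identity mapping `W_toy` are all made up for illustration. With the real models, the membership test would be `uk_word in uk_emb`, as in `load_word_pairs`.

```python
import numpy as np

def translate_with_fallback(sentence, src_vectors, tgt_words, tgt_matrix, W):
    """Translate word by word; copy words that have no source embedding."""
    out = []
    for word in sentence.split():
        if word not in src_vectors:
            out.append(word)  # out-of-vocabulary: keep the source word unchanged
            continue
        mapped = src_vectors[word] @ W
        # Cosine similarity between the mapped vector and every target vector.
        sims = tgt_matrix @ mapped / (
            np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped)
        )
        out.append(tgt_words[int(np.argmax(sims))])
    return " ".join(out)

# Hypothetical 2-d embeddings; the identity matrix stands in for a learned mapping.
src_vectors = {"місто": np.array([1.0, 0.0]), "літаки": np.array([0.0, 1.0])}
tgt_words = ["город", "самолеты"]
tgt_matrix = np.array([[0.9, 0.1], [0.2, 0.8]])
W_toy = np.eye(2)

result = translate_with_fallback(
    "місто бомбардували літаки", src_vectors, tgt_words, tgt_matrix, W_toy
)
print(result)  # город бомбардували самолеты
```

Copying the source word is a reasonable default for closely related languages such as Ukrainian and Russian, where names and many cognates are spelled identically or nearly so.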

In [32]:
for sent in uk_corpus[:100]:
    try:
      print(translate(sent))
    except KeyError:
      continue

мной уже закончу колледж когда мы прибежишь со америки
он велел мне немедленно выйти со комнаты
как бы ты не пытался ты не выучишь английский за два-три месяца
пока мной не позвонил он не пришел
во вселенной много галактик
она принимает души утрам
непослушный мальчик заблудился и проглядывался по сторонам
она медленно исчезала во туманном лесу
наш самолет летел свыше облаками
город бомбили враждебные самолеты
мной встретиться со тобой во субботу 00 третий
для финансирование войны было издан облигации
откуда принимают начало олимпийские игры
мы собирались пробыть там возле двух недель
как по меня то сейчас промолчу
мой дядя вчера умер от рака желудке
кажется дети устали от плавание
нет любовь без ревности
возможно мной антисоциальный конечно это не означает что мной не общаюсь со людьми
мной не знаю что ещe можно сделать
мной научился жить без неё
действительно
мне всегда больше нравились загадочные персонажи
мной могу только ждать
тебе лучше поспать
обдумать это
например тебе нравиться