# Libraries

In [1]:
import numpy as np

import unicodedata
import re

# Data

Anki (https://www.manythings.org/anki/) is an English-centric collection of parallel texts taken from the Tatoeba project. It covers 86 language pairs.

## Loading

Link to the archive: https://www.manythings.org/anki/rus-eng.zip

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
path = '/content/drive/My Drive/rus.txt'

# Read the whole file; the context manager closes the handle afterwards.
with open(path, encoding='UTF-8') as f:
    lines = f.read().strip().split('\n')

## Dataset analysis

In [4]:
print(len(lines))

496059


There are 496059 sentence pairs in total, sorted by increasing length.

In [5]:
print(lines[5])
print(lines[50])
print(lines[500])
print(lines[5000])
print(lines[50000])

Hi.	Хай.	CC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #467233 (timsa)
Wait.	Подождите.	CC-BY 2.0 (France) Attribution: tatoeba.org #3048304 (camilozeta) & #5707977 (odexed)
Wake up!	Проснитесь.	CC-BY 2.0 (France) Attribution: tatoeba.org #323780 (CK) & #10710636 (marafon)
Is Tom rich?	Том богатый?	CC-BY 2.0 (France) Attribution: tatoeba.org #7901240 (Kamilla) & #10849612 (marafon)
Take off your tie.	Снимите галстук.	CC-BY 2.0 (France) Attribution: tatoeba.org #3732849 (CK) & #3972285 (odexed)


Each line has the form: English + TAB + The Other Language + TAB + Attribution.
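A minimal sketch of splitting one such line into its three fields. The helper name `parse_line` is hypothetical and not part of the notebook; the sample line is the first one printed above.

```python
def parse_line(line):
    """Split one tab-separated line into (english, russian, attribution)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 3:
        # Fail loudly instead of silently misaligning the columns.
        raise ValueError(f'expected 3 tab-separated fields, got {len(fields)}')
    eng, rus, attribution = fields
    return eng, rus, attribution


sample = 'Hi.\tХай.\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #467233 (timsa)'
eng, rus, attribution = parse_line(sample)
print(eng)  # Hi.
print(rus)  # Хай.
```

Checking the field count up front makes a malformed line easy to locate, whereas unpacking in a list comprehension only reports that some line was wrong.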

In [6]:
print(len(lines[0].split('\t')[0]), len(lines[-1].split('\t')[0]))

3 537


The first (shortest) English sentence is 3 characters long; the last (longest) is 537.
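The check above relies on the file being sorted by length. A sketch that measures the range directly and counts places where the order is broken may be a safer way to confirm this; both function names are hypothetical, and the toy list below uses shortened attribution fields.

```python
def english_lengths(lines):
    """Character length of the English field of every line."""
    return [len(line.split('\t')[0]) for line in lines]


def english_length_range(lines):
    """Return (shortest, longest) English sentence length in characters."""
    lengths = english_lengths(lines)
    return min(lengths), max(lengths)


def count_order_violations(lines):
    """Count adjacent pairs where a longer sentence precedes a shorter one."""
    lengths = english_lengths(lines)
    return sum(1 for a, b in zip(lengths, lengths[1:]) if a > b)


# Toy data in the same TAB-separated layout (attribution shortened to 'attr').
toy = ['Hi.\tХай.\tattr', 'Wait.\tПодождите.\tattr', 'Wake up!\tПроснитесь.\tattr']
print(english_length_range(toy))    # (3, 8)
print(count_order_violations(toy))  # 0
```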

## Sentence preprocessing



*   Strip diacritics (Unicode combining marks); Cyrillic letters are kept, so the result is not pure ASCII
*   Lowercase all characters
*   Remove unimportant punctuation
*   Separate punctuation from the text
*   Add start-of-sequence and end-of-sequence tokens



In [7]:
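def unicode2ascii(s):
    # Decompose characters (NFD) and drop combining marks (category 'Mn').
    # Note: this also turns 'й' into 'и' and 'ё' into 'е'.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_sentence(s):
    s = unicode2ascii(s.lower().strip())
    # Surround the kept punctuation with spaces so each mark is its own token.
    s = re.sub(r"([.!?,])", r" \1 ", s)
    # Replace every run of other characters (digits, quotes, apostrophes,
    # repeated spaces) with a single space.
    s = re.sub(r"[^a-zA-Zа-яА-Я.!?,]+", " ", s)
    s = s.strip()
    s = '[START] ' + s + ' [END]'
    return s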

In [8]:
split_lines = [line.split('\t') for line in lines]

eng_sentences = [eng for eng, _, _ in split_lines]
rus_sentences = [rus for _, rus, _ in split_lines]

In [9]:
print(eng_sentences[-50])
print(rus_sentences[-50])

Oh, sure, I studied English in my school days. But it wasn't until two or three years ago that I really started taking it seriously.
О, конечно я учил английский язык в школе. Но только два или три года назад я принялся за него всерьёз.


In [10]:
eng_sentences = [normalize_sentence(sentence) for sentence in eng_sentences]
rus_sentences = [normalize_sentence(sentence) for sentence in rus_sentences]

In [11]:
print(eng_sentences[-50])
print(rus_sentences[-50])

[START] oh , sure , i studied english in my school days . but it wasn t until two or three years ago that i really started taking it seriously . [END]
[START] о , конечно я учил англиискии язык в школе . но только два или три года назад я принялся за него всерьез . [END]
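The Russian output above reads "англиискии" and "всерьез": under NFD, "й" decomposes into "и" plus a combining breve and "ё" into "е" plus a combining diaeresis, so `unicode2ascii` removes those marks too. If that is not wanted, one possible alternative is to strip marks only from non-Cyrillic characters. The helper `strip_latin_diacritics` below is a hypothetical sketch, not the notebook's method.

```python
import unicodedata


def strip_latin_diacritics(s):
    """Remove combining marks from non-Cyrillic characters only."""
    out = []
    # NFC first, so a decomposed 'и' + breve is treated as the single letter 'й'.
    for ch in unicodedata.normalize('NFC', s):
        if 'CYRILLIC' in unicodedata.name(ch, ''):
            out.append(ch)  # keep Cyrillic letters such as 'й' and 'ё' intact
        else:
            out.append(''.join(
                c for c in unicodedata.normalize('NFD', ch)
                if unicodedata.category(c) != 'Mn'
            ))
    return ''.join(out)


print(strip_latin_diacritics('café'))        # cafe
print(strip_latin_diacritics('английский'))  # английский
print(strip_latin_diacritics('всерьёз'))     # всерьёз
```

If a variant like this were used, the character class in `normalize_sentence` would also need to allow "ё", because that letter lies outside the `а-я` range.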


## Vectorization

In [12]:
eng_sentences = eng_sentences[:15000]
rus_sentences = rus_sentences[:15000]

In [13]:
print(eng_sentences[-5])
print(rus_sentences[-5])

[START] were you busy ? [END]
[START] вы были заняты ? [END]


In [14]:
class Vocab():
    def __init__(self):
        self.word2index = {'[PAD]': 0, '[START]': 1, '[END]': 2}
        self.index2word = {0: '[PAD]', 1: '[START]', 2: '[END]'}
        self.vocab_size = 3

    def add_words(self, texts):
        for text in texts:
            for word in text.split(' '):
                if word not in self.word2index:
                    self.word2index[word] = self.vocab_size
                    self.index2word[self.vocab_size] = word
                    self.vocab_size += 1

    def seq2vector(self, seq):
        vector = [self.word2index[word] for word in seq.split(' ')]
        return vector

    def vector2tokens(self, vector):
        tokens = []
        for index in vector:
            if index not in [0, 1, 2]:
                tokens.append(self.index2word[index])
        return tokens

In [15]:
eng_vocab = Vocab()
eng_vocab.add_words(eng_sentences)

rus_vocab = Vocab()
rus_vocab.add_words(rus_sentences)

In [16]:
def vectorization(data, vocab):
    return np.array([vocab.seq2vector(seq) for seq in data], dtype=object)

In [17]:
eng_vectors = vectorization(eng_sentences, eng_vocab)
rus_vectors = vectorization(rus_sentences, rus_vocab)

In [18]:
print(eng_vectors[-5])
print(rus_vectors[-5])

[1, 1209, 148, 188, 9, 2]
[1, 415, 1416, 2158, 17, 2]
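To make the mapping concrete, here is a toy vocabulary in the same spirit as `Vocab`. `Vocab.seq2vector` raises `KeyError` for a word it has not seen, so this sketch adds an `[UNK]` token as one possible way to handle that. The class `ToyVocab` and the `[UNK]` index are assumptions for illustration only, and its indices differ from the ones printed above.

```python
class ToyVocab:
    """Minimal word/index mapping with an unknown-word token (illustrative)."""

    def __init__(self):
        self.word2index = {'[PAD]': 0, '[START]': 1, '[END]': 2, '[UNK]': 3}
        self.index2word = {i: w for w, i in self.word2index.items()}

    def add_words(self, texts):
        for text in texts:
            for word in text.split():
                if word not in self.word2index:
                    index = len(self.word2index)  # next free index
                    self.word2index[word] = index
                    self.index2word[index] = word

    def encode(self, seq):
        unk = self.word2index['[UNK]']
        # Unseen words map to [UNK] instead of raising KeyError.
        return [self.word2index.get(word, unk) for word in seq.split()]


vocab = ToyVocab()
vocab.add_words(['[START] were you busy ? [END]'])
print(vocab.encode('[START] were you busy ? [END]'))  # [1, 4, 5, 6, 7, 2]
print(vocab.encode('[START] were you late ? [END]'))  # [1, 4, 5, 3, 7, 2]
```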


In [19]:
def padding(data, max_length):
    return np.array([[0]*(max_length - len(vec)) + vec for vec in data])

In [20]:
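# The last pair is not necessarily the longest in tokens, so scan every vector.
longest = max(len(vec) for vec in [*eng_vectors, *rus_vectors])

max_length = 12  # fixed width shared by both languages
assert longest <= max_length, 'max_length is too small for this subset'

eng_vectors = padding(eng_vectors, max_length)
rus_vectors = padding(rus_vectors, max_length)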

In [21]:
print(eng_vectors[-5])
print(rus_vectors[-5])

[   0    0    0    0    0    0    1 1209  148  188    9    2]
[   0    0    0    0    0    0    1  415 1416 2158   17    2]
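The zeros go on the left, so every sequence ends at the last column. `padding` does not check for vectors longer than `max_length`; in that case `[0] * (negative number)` is an empty list and the rows end up with different lengths. A hedged sketch of a stricter variant follows; the name `pad_left` and the `pad_index` parameter are hypothetical.

```python
import numpy as np


def pad_left(vectors, max_length, pad_index=0):
    """Left-pad every vector with pad_index up to max_length."""
    rows = []
    for vec in vectors:
        if len(vec) > max_length:
            # Refuse to build a ragged array.
            raise ValueError(
                f'vector of length {len(vec)} exceeds max_length={max_length}'
            )
        rows.append([pad_index] * (max_length - len(vec)) + list(vec))
    return np.array(rows, dtype=np.int64)


batch = pad_left([[1, 1209, 148, 188, 9, 2], [1, 415, 2]], max_length=8)
print(batch.shape)        # (2, 8)
print(batch[1].tolist())  # [0, 0, 0, 0, 0, 1, 415, 2]
```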
