
# Machine translation with OpenNMT for the Duolingo STAPLE competition

## Loading the data

The data provided by the organizers is clearly not enough to train a full-fledged language model, so parallel subtitle corpora were downloaded for each of the 5 languages (http://opus.nlpl.eu/OpenSubtitles-v2016.php). The data can also be found on the nlp1 server in the `voronov/data/OpenSubtitles` folder.

In [0]:
# this is for Colab; skip it if you don't need to connect to Drive
import os
from google.colab import drive
drive.mount('/content/gdrive')

course = 'en_vi'
path_to_corpora = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/corpora/OpenSubtitles/', course)
path_to_duolingo = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/duolingo/data/', course)
path_to_model = os.path.join('/content/gdrive/My Drive/data/work/Panchenko/language_models', course)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
os.chdir(path_to_corpora) 

In [0]:
!wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/moses/en-ko.txt.zip
!unzip  'download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip'

--2020-03-25 14:12:00--  http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/moses/en-ko.txt.zip
Resolving opus.nlpl.eu (opus.nlpl.eu)... 193.166.25.9
Connecting to opus.nlpl.eu (opus.nlpl.eu)|193.166.25.9|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/moses/en-ko.txt.zip [following]
--2020-03-25 14:12:00--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2016/moses/en-ko.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12433629 (12M) [application/zip]
Saving to: ‘download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip’


2020-03-25 14:12:01 (26.1 MB/s) - ‘download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip’ saved [12433629/12433629]

Archive:  download.php?f=OpenSubtitles%2Fv2016%2Fmoses%2Fen-ko.txt.zip
 

To run OpenNMT properly, the data has to be split into a training part and a validation part. I decided to use 1 million sentence pairs for training (for Korean only about 370 thousand sentence pairs were available in total) and 5,000 pairs for validation. This is not validation in the strict sense: it is only needed for the OpenNMT scripts to run correctly, and their guide says that 5,000 is enough.

In [0]:
!wc -l OpenSubtitles.en-ko.en

370702 OpenSubtitles.en-ko.en


In [0]:
!head -n 365000 'OpenSubtitles.en-ko.en' > 'train_subtitles_1m_en.txt'
!head -n 365000 'OpenSubtitles.en-ko.ko' > 'train_subtitles_1m_ko.txt'
!tail -n 5000 'OpenSubtitles.en-ko.en' > 'dev_subtitles_en.txt'
!tail -n 5000 'OpenSubtitles.en-ko.ko' > 'dev_subtitles_ko.txt'
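
The same head/tail split can be sketched in Python, which makes it easy to check that the training and validation slices never overlap (here 365,000 + 5,000 is less than the 370,702 available lines). The function name `split_parallel` and the toy sentences are illustrative assumptions, not part of the original pipeline.

```python
def split_parallel(src_lines, tgt_lines, n_train, n_dev):
    """Return (train_pairs, dev_pairs): the first n_train and the last n_dev pairs."""
    pairs = list(zip(src_lines, tgt_lines))
    if n_train + n_dev > len(pairs):
        raise ValueError("train and dev slices would overlap")
    return pairs[:n_train], pairs[len(pairs) - n_dev:]

# Toy data standing in for OpenSubtitles.en-ko.en / OpenSubtitles.en-ko.ko
src = [f"en sentence {i}" for i in range(10)]
tgt = [f"ko sentence {i}" for i in range(10)]

train, dev = split_parallel(src, tgt, n_train=7, n_dev=2)
print(len(train), len(dev))
```

Keeping source and target lines zipped together guarantees that both files are cut at the same positions, which the four separate shell commands only ensure if the line counts are typed consistently.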

## Duolingo data

To build predictions for the Duolingo competition, I took the `*.aws_baseline.pred.txt` file, which contains one reference translation from Amazon for each sentence. I preprocessed it (stripped the line prefixes, joined each "prompt-translation" pair of lines into a single tab-separated line, removed blank lines) and split it into two files, one for the source language and one for the target language. To make sure the model meets no unknown words at prediction time, I added the resulting files to the training set.

In [0]:
os.chdir(os.path.join(path_to_duolingo, 'train'))
# strip the line prefixes | drop blank lines | join every two lines into one
#!sed 's/.*|//g' train.en_hu.aws_baseline.pred.txt | grep . | paste -d "\t"  - - > train_duolingo_en_hu.txt
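
For readers who prefer to stay in Python, the `sed | grep | paste` pipeline can be approximated as below. This is a minimal sketch: the `prompt_001|...` prefix format and the toy translations are assumptions made for illustration, and `pair_prompts` is a hypothetical helper name.

```python
def pair_prompts(lines):
    # sed 's/.*|//g': drop everything up to and including the last '|'
    cleaned = [line.rstrip("\n").rsplit("|", 1)[-1] for line in lines]
    # grep . : keep only non-empty lines
    cleaned = [line for line in cleaned if line]
    # paste - - : join consecutive lines pairwise with a tab
    return ["\t".join(cleaned[i:i + 2]) for i in range(0, len(cleaned), 2)]

# Made-up sample in the assumed "prefix|prompt / translation / blank line" layout
sample = [
    "prompt_001|which floor is it?\n",
    "toy translation 1\n",
    "\n",
    "prompt_002|i count.\n",
    "toy translation 2\n",
]

for row in pair_prompts(sample):
    print(row)
```

Unlike `paste`, this sketch does not append a trailing tab when the number of lines is odd, so such a case is worth checking separately.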

In [0]:
# Split the tab-separated pairs into one file per language.
with open('train_duolingo_en_ko.txt') as f, \
        open('train_duolingo_en.txt', 'w') as file_en, \
        open('train_duolingo_ko.txt', 'w') as file_ko:
    for line in f:
        pair = line.strip().split('\t')
        file_en.write(pair[0] + '\n')
        file_ko.write(pair[1] + '\n')

In [0]:
!tail -5 ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt
!cat train_duolingo_en.txt >> ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt
!cat train_duolingo_ko.txt >> ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_ko.txt
!tail -5 ../../../../corpora/OpenSubtitles/en_ko/train_subtitles_1m_en.txt

Again, nothing witchy.
Ghost?
Hard to say.
I mean, there's EMF in the church, but it's built on a burial ground.
You know that all the victims recently went to confession?
english is an international language.
which floor is it?
the waiter has asked everybody.
it is a beautiful bird.
i count.


## Training

In [0]:
!pip install OpenNMT-py



In [0]:
os.chdir(os.path.join(path_to_model, 'openNMT'))

In [0]:
!onmt_preprocess -train_src ../../../corpora/OpenSubtitles/en_vi/train_subtitles_1m_en.txt \
-train_tgt ../../../corpora/OpenSubtitles/en_vi/train_subtitles_1m_vi.txt \
-valid_src ../../../corpora/OpenSubtitles/en_vi/dev_subtitles_en.txt \
-valid_tgt ../../../corpora/OpenSubtitles/en_vi/dev_subtitles_vi.txt \
-tgt_vocab_size 50000 -src_vocab_size 50000 --src_seq_length 25 --tgt_seq_length 25 \
-save_data nmt_subs_en_vi

[2020-03-26 12:23:25,754 INFO] Extracting features...
[2020-03-26 12:23:26,631 INFO]  * number of source features: 0.
[2020-03-26 12:23:26,632 INFO]  * number of target features: 0.
[2020-03-26 12:23:26,632 INFO] Building `Fields` object...
[2020-03-26 12:23:26,633 INFO] Building & saving training data...
[2020-03-26 12:23:29,506 INFO] Building shard 0.
[2020-03-26 12:24:09,090 INFO]  * saving 0th train data shard to nmt_subs_en_vi.train.0.pt.
[2020-03-26 12:24:32,843 INFO] Building shard 1.
[2020-03-26 12:24:32,987 INFO]  * saving 1th train data shard to nmt_subs_en_vi.train.1.pt.
[2020-03-26 12:24:33,406 INFO]  * tgt vocab size: 50004.
[2020-03-26 12:24:33,731 INFO]  * src vocab size: 50002.
[2020-03-26 12:24:34,575 INFO] Building & saving validation data...
[2020-03-26 12:24:35,621 INFO] Building shard 0.
[2020-03-26 12:24:35,778 INFO]  * saving 0th valid data shard to nmt_subs_en_vi.valid.0.pt.


This is a lightweight Transformer model, chosen so that training takes a reasonable time (~12 hours). I halved `rnn_size`, `word_vec_size`, `heads`, and `train_steps`. Everything else is left as in the OpenNMT recommendations, which supposedly replicate Google's original setup.

In [0]:
!onmt_train -data nmt_subs_en_vi -save_model transformer \
        -layers 6 -rnn_size 256 -word_vec_size 256 -transformer_ff 2048 -heads 4  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 100000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 5000 \
        -world_size 1 -gpu_ranks 0
# I omitted the training-statistics output for readability

[2020-03-26 12:25:40,102 INFO]  * src vocab size = 50002
[2020-03-26 12:25:40,102 INFO]  * tgt vocab size = 50004
[2020-03-26 12:25:40,103 INFO] Building model...
[2020-03-26 12:25:49,045 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50002, 256, padding_idx=1)
        )
        (pe): PositionalEncoding(
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=256, out_features=256, bias=True)
          (linear_values): Linear(in_features=256, out_features=256, bias=True)
          (linear_query): Linear(in_features=256, out_features=256, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=256, out_feat
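
To get a feel for what `-decay_method noam -warmup_steps 8000 -learning_rate 2` means for this smaller model, the schedule can be sketched as follows. This assumes the standard formula from the Transformer paper, scaled by the `-learning_rate` value, with the model size equal to `rnn_size` (256); the function name `noam_lr` is illustrative.

```python
def noam_lr(step, model_size=256, warmup_steps=8000, learning_rate=2.0):
    """Learning rate at a given step (step >= 1) under the noam schedule."""
    # Linear warm-up until warmup_steps, inverse square-root decay afterwards.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_size ** -0.5 * scale

for step in (1, 4000, 8000, 100000):
    print(step, f"{noam_lr(step):.6f}")
```

Under this formula the rate peaks at `warmup_steps`, and halving the model size from 512 to 256 raises the whole curve by a factor of the square root of 2.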

We now have a translation model, the source text to translate, and its reference translation from Amazon. As an example, let's output the 10 best translations of a sentence (automatically choosing the right number of translations for each sentence is left for later).

In [0]:
!onmt_translate -model transformer_step_70000.pt -src ../../../duolingo/data/en_pt/train/train_duolingo_en.txt \
-output train_duolingo_pred10_pt.txt -n_best 10 -replace_unk

[2020-03-18 15:51:10,744 INFO] Translating shard 0.
PRED AVG SCORE: -0.5227, PRED PPL: 1.6865


In [0]:
!head -n 20 train_duolingo_pred10_pt.txt

Se video com o video
Se importaria se video o video
Se importaria de trocar esse colar para mim?
Se importaria se eu video esse video
Se importaria se video com o video
Se importaria se video com o segundo para mim?
Se importaria se video com o segundo melhor para mim?
Se importaria se video com o segundo melhor video
Se importaria se video com o segundo melhor passeio para mim?
Se importaria se video com o segundo melhor passeio por mim?
A livraria não está nesta rua.
A livraria não está nesta estrada.
A livraria não está nessa rua.
- A livraria não está nesta rua.
A livraria não está aqui.
- A livraria não está nesta estrada.
- A livraria não está nesta rua. isn't
- A livraria não está nesta rua. - Sim.
- A livraria não está nesta rua. Não está aqui.
- A livraria não está nesta rua. - Está na rua.
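
Since the prediction file is flat, the hypotheses have to be regrouped per source sentence before any further processing. The sketch below assumes that the file contains exactly `n_best` consecutive lines for every source sentence, as the output above suggests; `group_n_best` is a hypothetical helper name.

```python
def group_n_best(lines, n_best=10):
    """Split a flat list of hypotheses into one list per source sentence."""
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % n_best:
        raise ValueError("number of lines is not a multiple of n_best")
    return [lines[i:i + n_best] for i in range(0, len(lines), n_best)]

# Toy example with n_best=2; for the real file one would pass the lines of
# train_duolingo_pred10_pt.txt and keep the default n_best=10.
flat = ["hyp a1\n", "hyp a2\n", "hyp b1\n", "hyp b2\n"]
groups = group_n_best(flat, n_best=2)
print(groups)
```

Each inner list is ordered from the best-scoring hypothesis to the worst, so truncating it is one simple way to pick a per-sentence number of translations later.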


## TODO: Evaluation



In [0]:
from nltk.translate.bleu_score import corpus_bleu
import re

def tokenize(sentence):
    # Lowercase, split punctuation off into separate tokens, split on whitespace.
    s = ' '.join(re.split(r'([^0-9a-zÀ-ÿ\s])', sentence.lower()))
    return s.split()

with open('../../duolingo/data/en-pt/train/train_duolingo_pt.txt') as gold:
    references = [tokenize(sentence) for sentence in gold]
with open('openNMT/train_duolingo_pred_pt.txt') as preds:
    predictions = [tokenize(sentence) for sentence in preds]
print(predictions[1])
print(references[1])

['a', 'livraria', 'não', 'está', 'nesta', 'rua', '.']
['a', 'livraria', 'não', 'é', 'nesta', 'rua', '.']


In [0]:
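# corpus_bleu expects a list of references per hypothesis, hence the extra nesting
# corpus_bleu([[ref] for ref in references], predictions)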

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


0.6902782199046656
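
The warning above recommends `SmoothingFunction()`, and NLTK's `corpus_bleu` also expects a list of references for every hypothesis, even when there is only one reference. A minimal sketch of how the inputs could be shaped is given below; the helper names `bleu_inputs` and `smoothed_corpus_bleu` are assumptions, and the choice of `method1` is only one of the available smoothing options.

```python
import re

def tokenize(sentence):
    """Lowercase and split punctuation off into separate tokens."""
    spaced = " ".join(re.split(r"([^0-9a-zÀ-ÿ\s])", sentence.lower()))
    return spaced.split()

def bleu_inputs(reference_sentences, predicted_sentences):
    """Shape raw sentences the way corpus_bleu expects them."""
    # One list of references per hypothesis, even with a single reference.
    list_of_references = [[tokenize(ref)] for ref in reference_sentences]
    hypotheses = [tokenize(pred) for pred in predicted_sentences]
    return list_of_references, hypotheses

def smoothed_corpus_bleu(reference_sentences, predicted_sentences):
    """Corpus-level BLEU with smoothing; requires NLTK to be installed."""
    from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu
    refs, hyps = bleu_inputs(reference_sentences, predicted_sentences)
    return corpus_bleu(refs, hyps, smoothing_function=SmoothingFunction().method1)

refs, hyps = bleu_inputs(["A livraria não é nesta rua."],
                         ["A livraria não está nesta rua."])
print(refs[0])
print(hyps[0])
```

With NLTK available, `smoothed_corpus_bleu(gold_sentences, predicted_sentences)` returns a single corpus-level score; the two argument lists must be aligned sentence by sentence.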

In [0]:
!less 'train_subtitles_1m_en.txt'

Male narrator: Next, on Ice Road Truckers...
Welcome to the twilight zone
Again.
Narrator: The pressure is on
For lisa's most important run of
The season.
I think I'm ready to kind of Step up the game a little bit.
Narrator: Two ice road Veterans get ready to throw
Down. I'm not here to make friends,
I'm here to ke money. Narrator: And a monster storm
Smashes into atigun pass. I'm gonna get this tank there
Tonight, storm or no storm. Narrator: Turning the dash
For the cash into a fight for
Survival.
We're at the eye of the Storm, the mouth of the beast. The way it looks here, we
Narrator: 50 miles outside
Of prudhoe bay...
Alex debogorski is headed south On the vast, open tundra of the
North slope.
Weather changes quickly across The flat planes. And, for the second week in a
Row, alex finds himself in the
Middle of a blinding snowstorm.
Yep, she's starting to get
Worse in here.
Well, you can see the fingers of Snow, the little snowdrifts
I

In [0]:
# Subtitle lines are often broken mid-sentence: accumulate lines and flush
# the buffer whenever a line ends with a period.
temp = ''
with open('train_subtitles_1m_en.txt') as source, open('my_version_en.txt', 'w') as target:
    for i, line in enumerate(source):
        if i == 50:  # only process the first 50 lines while experimenting
            break
        line = line.strip().lower()
        temp += line + ' '
        if line.endswith('.'):
            target.write(temp + '\n')
            temp = ''

In [0]:
!head 'my_version_en.txt'

male narrator: next, on ice road truckers... 
welcome to the twilight zone again. 
narrator: the pressure is on for lisa's most important run of the season. 
i think i'm ready to kind of step up the game a little bit. 
narrator: two ice road veterans get ready to throw down. i'm not here to make friends, i'm here to ke money. narrator: and a monster storm smashes into atigun pass. i'm gonna get this tank there tonight, storm or no storm. narrator: turning the dash for the cash into a fight for survival. 
we're at the eye of the storm, the mouth of the beast. the way it looks here, we narrator: 50 miles outside of prudhoe bay... 
alex debogorski is headed south on the vast, open tundra of the north slope. 
weather changes quickly across the flat planes. and, for the second week in a row, alex finds himself in the middle of a blinding snowstorm. 
yep, she's starting to get worse in here. 
well, you can see the fingers of snow, the little snowdrifts into the road. 
