# CIS 530 Final Project: Low-Resource Machine Translation of Uyghur

**Project by Francesca Marini & Efe Ayhan**

CIS 530: Computational Linguistics

Spring 2021

Professor Mark Yatskar

Readers are welcome to follow along in this notebook to replicate our experiments.

(This is a fairly long notebook, so feel free to collapse sections you are not viewing.)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd drive/My\ Drive/CIS\ 530/Project/Milestone4

/content/drive/My Drive/CIS 530/Project/Milestone4


## Data and Preprocessing

This section contains almost all of the dataset preprocessing code used in this project. Not all of this data was used in every step of the project, but preprocessing the data, exploring it, and determining which data suited the project took considerable effort, so we include most of the code here to illustrate that effort.

Old Data Preprocess contains most of the preprocessing code used in the previous milestones with the old data. New Data Preprocess contains everything needed for the final milestone.

Running the preprocessing code should not be necessary to replicate our results, since the preprocessed data files are included with the notebook; we include it for the sake of completeness. In fact, it is better not to run it, since some files might be missing and directory structures may have changed.

For more information about the data used, please refer to the final project report. 

### Old Data Preprocess

In [None]:
import json
import os
import re

In [None]:
train_file_maps = {}
dev_file_maps = {}
test_file_maps = {}
#train
for ug_fname in os.listdir("data/uig-train/"):
    train_file_maps[os.path.join("data/uig-train/", ug_fname)] = []
    for en_fname in os.listdir("data/eng-train/"):
        if ug_fname in en_fname:
            train_file_maps[os.path.join("data/uig-train/", ug_fname)].append(os.path.join("data/eng-train/", en_fname))
#dev
for ug_fname in os.listdir("data/uig-dev/"):
    dev_file_maps[os.path.join("data/uig-dev/", ug_fname)] = []
    for en_fname in os.listdir("data/eng-dev/"):
        if ug_fname in en_fname:
            dev_file_maps[os.path.join("data/uig-dev/", ug_fname)].append(os.path.join("data/eng-dev/", en_fname))
#test
for ug_fname in os.listdir("data/uig-test/"):
    test_file_maps[os.path.join("data/uig-test/", ug_fname)] = []
    for en_fname in os.listdir("data/eng-test/"):
        if ug_fname in en_fname:
            test_file_maps[os.path.join("data/uig-test/", ug_fname)].append(os.path.join("data/eng-test/", en_fname))
#everything
all_files = {**train_file_maps, **dev_file_maps, **test_file_maps}
total_eng = len(os.listdir("data/eng-test/")) + len(os.listdir("data/eng-train/")) + len(os.listdir("data/eng-dev/"))
map_count = 0
for key in all_files:
    map_count += len(all_files[key])
skip_count = total_eng - map_count
print("english skipped: " + str(skip_count) + " out of: " + str(total_eng))
# all Uyghur files should have either 1 or 2 parallel reference files
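
The three loops above repeat the same pairing logic for the train, dev, and test directories. A minimal sketch of that logic as a single helper is shown below. `build_file_map` is a hypothetical name that does not appear in the original notebook, the file names in the demonstration are made up, and the sketch assumes (as the loops above do) that an English file belongs to a Uyghur file when the Uyghur file name is a substring of the English file name.

```python
import os
import tempfile


def build_file_map(src_dir, tgt_dir):
    """Map each source file path to the target paths whose names contain the source file name."""
    tgt_names = sorted(os.listdir(tgt_dir))
    file_map = {}
    for src_name in sorted(os.listdir(src_dir)):
        matches = [os.path.join(tgt_dir, t) for t in tgt_names if src_name in t]
        file_map[os.path.join(src_dir, src_name)] = matches
    return file_map


# Toy demonstration on throwaway directories (all file names here are invented).
with tempfile.TemporaryDirectory() as root:
    src_dir = os.path.join(root, "uig")
    tgt_dir = os.path.join(root, "eng")
    os.makedirs(src_dir)
    os.makedirs(tgt_dir)
    for name in ["doc1.json", "doc2.json"]:
        open(os.path.join(src_dir, name), "w").close()
    for name in ["doc1.json.ref1", "doc1.json.ref2", "doc2.json.ref1", "other.json"]:
        open(os.path.join(tgt_dir, name), "w").close()

    demo_map = build_file_map(src_dir, tgt_dir)
    # Number of reference files found for each source file.
    demo_counts = {os.path.basename(k): len(v) for k, v in demo_map.items()}
    print(demo_counts)
```

With a helper like this, each split would need only one call, and the number of skipped English files could be computed from the resulting maps in the same way as above.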

In [None]:
uz_file_maps = {}
for uz_fname in os.listdir("data/uzbek-english-mt/from_uzb/uzb/"):
    uz_file_maps[os.path.join("data/uzbek-english-mt/from_uzb/uzb/", uz_fname)] = []
    for en_fname in os.listdir("data/uzbek-english-mt/from_uzb/eng/"):
        if uz_fname in en_fname:
            uz_file_maps[os.path.join("data/uzbek-english-mt/from_uzb/uzb/", uz_fname)].append(os.path.join("data/uzbek-english-mt/from_uzb/eng/", en_fname))

tr_file_maps = {}
for tr_fname in os.listdir("data/turkish-english-mt/from_tur/tur/"):
    tr_file_maps[os.path.join("data/turkish-english-mt/from_tur/tur/", tr_fname)] = []
    for en_fname in os.listdir("data/turkish-english-mt/from_tur/eng/"):
        if tr_fname in en_fname:
            tr_file_maps[os.path.join("data/turkish-english-mt/from_tur/tur/", tr_fname)].append(os.path.join("data/turkish-english-mt/from_tur/eng/", en_fname))

In [None]:
# transliteration
word_mappings = {}
char_mappings = {}
word_pairs_train = open("data/train.ug", "r", errors="ignore").readlines()
word_pairs_test = open("data/test.ug", "r", errors="ignore").readlines()
char_pairs = open("data/mappings.txt", "r", errors="ignore").readlines()
for pair in word_pairs_train:
    pair = pair.split()
    # skip blank or malformed lines, which would otherwise raise an IndexError
    if len(pair) < 2:
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
for pair in word_pairs_test:
    pair = pair.split()
    if len(pair) < 2:
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
for pair in char_pairs:
    pair = pair.split()
    if len(pair) < 2:
        continue
    if pair[0] not in char_mappings:
        char_mappings[pair[0]] = pair[1]
# function that performs transliteration of a single token
def transliterate(token):
    if token in word_mappings:
        return word_mappings[token]
    else:
        tok = ""
        for c in token:
            if c not in char_mappings:
                tok = tok + c
            else:
                tok = tok + char_mappings[c]
        return tok
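
To make the two-level lookup in `transliterate` concrete, here is a small self-contained sketch. The mappings are toy values invented for illustration (the real ones are loaded from the data files above), and `transliterate_with` is a hypothetical variant that takes the mappings as arguments instead of reading the notebook's globals.

```python
# Toy mappings for illustration only; the real ones come from the data files.
toy_word_mappings = {"abc": "WORD"}
toy_char_mappings = {"a": "1", "b": "2"}


def transliterate_with(token, word_map, char_map):
    """Return the whole-word mapping if present, else map character by character."""
    if token in word_map:
        return word_map[token]
    # Characters without a mapping (digits, punctuation, ...) pass through unchanged.
    return "".join(char_map.get(c, c) for c in token)


whole_word = transliterate_with("abc", toy_word_mappings, toy_char_mappings)
fallback = transliterate_with("ab!", toy_word_mappings, toy_char_mappings)
print(whole_word)  # WORD
print(fallback)    # 12!
```

The whole-word table takes priority, and the character table is only a fallback for tokens that the word table does not cover.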

In [None]:
# extracting the parallel sentences from the documents
def extract_sentences(ug_fname, en_fname):
    en_doc = {}
    with open(en_fname, "r", errors="ignore") as f:
       en_doc = json.load(f)
    ug_doc = {}
    with open(ug_fname, "r", errors="ignore") as f:
       ug_doc = json.load(f)

    ug_toks = ug_doc["tokens"]
    #all_lines = True
    #for tok in ug_toks:
    #    if not re.match('_+', tok):
    #        all_lines = False
    #print(all_lines)
    #if all_lines:
    #    print("hi")
    #    raise BaseException("Found Bad File. Boo.")
    en_toks = en_doc["tokens"]

    ug_sentence_indices = ug_doc["sentences"]["sentenceEndPositions"]
    ug_sents = []
    old_idx = 0
    for idx in ug_sentence_indices:
        ug_sents.append(" ".join(ug_toks[old_idx:idx]))
        old_idx = idx
    ug_sents.append(" ".join(ug_toks[old_idx:]))

    en_sentence_indices = en_doc["sentences"]["sentenceEndPositions"]
    en_sents = []
    old_idx = 0
    for idx in en_sentence_indices:
        en_sents.append(" ".join(en_toks[old_idx:idx]))
        old_idx = idx
    en_sents.append(" ".join(en_toks[old_idx:]))

    ug_sents_translit = []
    for sent in ug_sents:
        sent_list_translit = []
        toks = sent.split()
        for tok in toks:
            new_tok = transliterate(tok)
            sent_list_translit.append(new_tok)
        ug_sents_translit.append(" ".join(sent_list_translit))

    return ug_sents, ug_sents_translit, en_sents
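
The sentence-splitting step inside `extract_sentences` can be sketched on its own as follows. This assumes, as the slicing above does, that each entry of `sentenceEndPositions` is an exclusive end index into the token list; the actual convention of the JSON files is not confirmed here. `split_sentences` is a hypothetical helper, and unlike the code above it skips the trailing sentence when no tokens remain after the last boundary.

```python
def split_sentences(tokens, end_positions):
    """Slice a flat token list into sentences at the given end positions."""
    sentences = []
    start = 0
    for end in end_positions:
        sentences.append(" ".join(tokens[start:end]))
        start = end
    # Tokens after the last boundary form one more sentence; skip it when empty.
    if start < len(tokens):
        sentences.append(" ".join(tokens[start:]))
    return sentences


toy_tokens = ["I", "see", ".", "You", "run", "."]
with_final_boundary = split_sentences(toy_tokens, [3, 6])
without_final_boundary = split_sentences(toy_tokens, [3])
print(with_final_boundary)
print(without_final_boundary)
```

Both calls give the same two sentences, so a file whose last boundary coincides with the end of the token list would not produce an empty line.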

In [None]:
# write the parallel sentences out to file
# train -- lorelei
ug_sentences = []
ug_sentences_translit = []
en_sentences = []
for ug_fname in train_file_maps:
    for en_fname in train_file_maps[ug_fname]:
        try:
            ug_sents, ug_sents_translit, en_sents = extract_sentences(ug_fname, en_fname)
        except json.JSONDecodeError:
            continue
        ug_sentences.extend(ug_sents)
        ug_sentences_translit.extend(ug_sents_translit)
        en_sentences.extend(en_sents)
with open("data/src-train.txt", "w") as f:
    f.write("\n".join(ug_sentences))
with open("data/tgt-train.txt", "w") as f:
    f.write("\n".join(en_sentences))
with open("data/src-train-translit.txt", "w") as f:
    f.write("\n".join(ug_sentences_translit))

# dev -- lorelei
ug_sentences = []
ug_sentences_translit = []
en_sentences = []
for ug_fname in dev_file_maps:
    for en_fname in dev_file_maps[ug_fname]:
        try:
            ug_sents, ug_sents_translit, en_sents = extract_sentences(ug_fname, en_fname)
        except json.JSONDecodeError:
            continue
        ug_sentences.extend(ug_sents)
        ug_sentences_translit.extend(ug_sents_translit)
        en_sentences.extend(en_sents)
with open("data/src-dev.txt", "w") as f:
    f.write("\n".join(ug_sentences))
with open("data/tgt-dev.txt", "w") as f:
    f.write("\n".join(en_sentences))
with open("data/src-dev-translit.txt", "w") as f:
    f.write("\n".join(ug_sentences_translit))

# test -- lorelei
ug_sentences = []
ug_sentences_translit = []
en_sentences = []
for ug_fname in test_file_maps:
    for en_fname in test_file_maps[ug_fname]:
        try:
            ug_sents, ug_sents_translit, en_sents = extract_sentences(ug_fname, en_fname)
        except json.JSONDecodeError:
            continue
        ug_sentences.extend(ug_sents)
        ug_sentences_translit.extend(ug_sents_translit)
        en_sentences.extend(en_sents)
with open("data/src-test.txt", "w") as f:
    f.write("\n".join(ug_sentences))
with open("data/tgt-test.txt", "w") as f:
    f.write("\n".join(en_sentences))
with open("data/src-test-translit.txt", "w") as f:
    f.write("\n".join(ug_sentences_translit))

uz_sentences = []
uz_sentences_translit = []
en_sentences = []
for uz_fname in uz_file_maps:
    for en_fname in uz_file_maps[uz_fname]:
        try:
            uz_sents, uz_sents_translit, en_sents = extract_sentences(uz_fname, en_fname)
        except json.JSONDecodeError:
            continue
        uz_sentences.extend(uz_sents)
        uz_sentences_translit.extend(uz_sents_translit)
        en_sentences.extend(en_sents)
with open("data/src-train-uz.txt", "w") as f:
    f.write("\n".join(uz_sentences))
with open("data/tgt-train-uz.txt", "w") as f:
    f.write("\n".join(en_sentences))
with open("data/src-train-translit-uz.txt", "w") as f:
    f.write("\n".join(uz_sentences_translit))

tr_sentences = []
tr_sentences_translit = []
en_sentences = []
for tr_fname in tr_file_maps:
    for en_fname in tr_file_maps[tr_fname]:
        try:
            tr_sents, tr_sents_translit, en_sents = extract_sentences(tr_fname, en_fname)
        except json.JSONDecodeError:
            continue
        tr_sentences.extend(tr_sents)
        tr_sentences_translit.extend(tr_sents_translit)
        en_sentences.extend(en_sents)
with open("data/src-train-tr.txt", "w") as f:
    f.write("\n".join(tr_sentences))
with open("data/tgt-train-tr.txt", "w") as f:
    f.write("\n".join(en_sentences))
with open("data/src-train-translit-tr.txt", "w") as f:
    f.write("\n".join(tr_sentences_translit))

In [None]:
# tatoeba data preprocess
import csv

ug_tat_sents = []
ug_tat_sents_translit = []
ug_tat_sents_en = []
with open("data/Uyghur-English.tsv", "r") as f:
    rd = csv.reader(f, delimiter="\t", quotechar='"')
    for row in rd:
        ug_tat_sents.append(row[1])
        translit = []
        toks = row[1].split()
        for tok in toks:
            new_tok = transliterate(tok)
            translit.append(new_tok)
        ug_tat_sents_translit.append(" ".join(translit))
        ug_tat_sents_en.append(row[3])

uz_tat_sents = []
uz_tat_sents_en = []
with open("data/Uzbek-English.tsv", "r") as f:
    rd = csv.reader(f, delimiter="\t", quotechar='"')
    for row in rd:
        uz_tat_sents.append(row[1])
        uz_tat_sents_en.append(row[3])

print(len(uz_tat_sents))
print(len(ug_tat_sents))
count = 0
for sent in ug_tat_sents:
    toks = sent.split()
    for tok in toks:
        count +=1 
print(count)

with open("data/src-tatoeba-ug.txt", "w") as f:
    f.write("\n".join(ug_tat_sents))
with open("data/tgt-tatoeba-ug.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_en))
with open("data/src-tatoeba-ug-translit.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_translit))
with open("data/src-tatoeba-uz.txt", "w") as f:
    f.write("\n".join(uz_tat_sents))
with open("data/tgt-tatoeba-uz.txt", "w") as f:
    f.write("\n".join(uz_tat_sents_en))

In [None]:
lorelei = open("data/src-train-translit.txt", "r").readlines()
tatoeba = open("data/src-tatoeba-ug-translit.txt", "r").readlines()
lor_en = open("data/tgt-train.txt", "r").readlines()
tat_en = open("data/tgt-tatoeba-ug.txt", "r").readlines()
lorelei.extend(tatoeba)
lor_en.extend(tat_en)
with open("data/src-train-augmented.txt", "w") as f:
    f.write("\n".join(lorelei))
with open("data/tgt-train-augmented.txt", "w") as f:
    f.write("\n".join(lor_en))

In [None]:
ug_sentences_translit = open("data/src-tatoeba-ug-translit.txt", "r").readlines()
en_sentences = open("data/tgt-tatoeba-ug.txt", "r").readlines()

print(len(ug_sentences_translit))
import random
sampled_indices = random.sample(range(0, 4047), 808)
dev_indices = sampled_indices[:404]
test_indices = sampled_indices[404:]

#train -- tatoeba
train_sents = []
train_sents_en = []
for i in range(4047):
    if i not in sampled_indices:
        train_sents.append(ug_sentences_translit[i])
        train_sents_en.append(en_sentences[i])
with open("data/src-train-tat.txt", "w") as f:
    f.write("\n".join(train_sents))
with open("data/tgt-train-tat.txt", "w") as f:
    f.write("\n".join(train_sents_en))

#dev -- tatoeba
dev_sents = []
dev_sents_en = []
for i in dev_indices:
    dev_sents.append(ug_sentences_translit[i])
    dev_sents_en.append(en_sentences[i])
with open("data/src-dev-tat.txt", "w") as f:
    f.write("\n".join(dev_sents))
with open("data/tgt-dev-tat.txt", "w") as f:
    f.write("\n".join(dev_sents_en))

#test -- tatoeba
test_sents = []
test_sents_en = []
for i in test_indices:
    test_sents.append(ug_sentences_translit[i])
    test_sents_en.append(en_sentences[i])
with open("data/src-test-tat.txt", "w") as f:
    f.write("\n".join(test_sents))
with open("data/tgt-test-tat.txt", "w") as f:
    f.write("\n".join(test_sents_en))
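
The split above draws 808 held-out indices from a hard-coded total of 4047 and gives half to dev and half to test. A minimal sketch of the same idea with a fixed seed and a length taken from the data is shown below. `split_indices` and the seed value are assumptions made for illustration; the original notebook does not set a seed.

```python
import random


def split_indices(n_items, n_heldout, seed=0):
    """Return (train, dev, test) index lists; dev and test each get half of n_heldout."""
    rng = random.Random(seed)  # local generator, so the split is repeatable
    heldout = rng.sample(range(n_items), n_heldout)
    half = n_heldout // 2
    dev = heldout[:half]
    test = heldout[half:]
    heldout_set = set(heldout)  # set membership is O(1) per lookup
    train = [i for i in range(n_items) if i not in heldout_set]
    return train, dev, test


train_idx, dev_idx, test_idx = split_indices(n_items=20, n_heldout=4, seed=530)
print(len(train_idx), len(dev_idx), len(test_idx))  # 16 2 2
```

In the notebook, `n_items` would be `len(ug_sentences_translit)`, which avoids an `IndexError` if the file length ever differs from the hard-coded value.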

In [None]:
# tatoeba data preprocess - english to uyghur data
import csv

ug_tat_sents = []
ug_tat_sents_translit = []
ug_tat_sents_en = []
with open("data/English-Uyghur.tsv", "r") as f:
    rd = csv.reader(f, delimiter="\t", quotechar='"')
    for row in rd:
        ug_tat_sents.append(row[3])
        translit = []
        toks = row[3].split()
        for tok in toks:
            new_tok = transliterate(tok)
            translit.append(new_tok)
        ug_tat_sents_translit.append(" ".join(translit))
        ug_tat_sents_en.append(row[1])

print(len(ug_tat_sents))
count = 0
for sent in ug_tat_sents:
    toks = sent.split()
    for tok in toks:
        count +=1 
print(count)

with open("data/src-tatoeba-en-ug.txt", "w") as f:
    f.write("\n".join(ug_tat_sents))
with open("data/tgt-tatoeba-en-ug.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_en))
with open("data/src-tatoeba-en-ug-translit.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_translit))

In [None]:
# remove the empty lines from the files because they're a pain in the butt
!sed -i '/^$/d' data/src-train-translit.txt
!sed -i '/^$/d' data/tgt-train.txt
!sed -i '/^$/d' data/src-train.txt
!sed -i '/^$/d' data/src-dev-translit.txt
!sed -i '/^$/d' data/tgt-dev.txt
!sed -i '/^$/d' data/src-dev.txt
!sed -i '/^$/d' data/src-test-translit.txt
!sed -i '/^$/d' data/tgt-test.txt
!sed -i '/^$/d' data/src-test.txt
!sed -i '/^$/d' data/tgt-train-uz.txt
!sed -i '/^$/d' data/src-train-uz.txt
!sed -i '/^$/d' data/tgt-train-tr.txt
!sed -i '/^$/d' data/src-train-tr.txt
!sed -i '/^$/d' data/src-train-augmented.txt
!sed -i '/^$/d' data/tgt-train-augmented.txt
!sed -i '/^$/d' data/src-tatoeba-ug.txt
!sed -i '/^$/d' data/src-tatoeba-ug-translit.txt
!sed -i '/^$/d' data/tgt-tatoeba-ug.txt
!sed -i '/^$/d' data/src-tatoeba-uz.txt
!sed -i '/^$/d' data/tgt-tatoeba-uz.txt

!sed -i '/^$/d' data/src-train-tat.txt
!sed -i '/^$/d' data/tgt-train-tat.txt
!sed -i '/^$/d' data/src-dev-tat.txt
!sed -i '/^$/d' data/tgt-dev-tat.txt
!sed -i '/^$/d' data/src-test-tat.txt
!sed -i '/^$/d' data/tgt-test-tat.txt

In [None]:
# remove more empty lines from files here
!sed -i '/^$/d' data/src-tatoeba-en-ug.txt
!sed -i '/^$/d' data/src-tatoeba-en-ug-translit.txt
!sed -i '/^$/d' data/tgt-tatoeba-en-ug.txt
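
Many of the blank lines arise because `readlines()` keeps each line's trailing newline, so a later `"\n".join(...)` writes two newlines per sentence. The `sed` commands above then delete blank lines from each file separately; if a blank line appears on only one side of a parallel pair, that can shift the two files out of alignment. The sketch below shows one possible alternative that filters source and target together. `clean_parallel` is a hypothetical helper, the toy strings are made up, and the sketch assumes the two lists are already line-aligned.

```python
def clean_parallel(src_lines, tgt_lines):
    """Strip line endings and keep only pairs where both sides are non-empty."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Lines as readlines() would return them, including one pair with an empty source.
toy_src = ["s one\n", "\n", "s two\n"]
toy_tgt = ["t one\n", "t orphan\n", "t two\n"]
toy_pairs = clean_parallel(toy_src, toy_tgt)
print(toy_pairs)
```

Writing the cleaned pairs back with `"\n".join(...)` would then produce files without blank lines, so the `sed` pass would not be needed for them.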

In [None]:
# tatoeba stats
train = open("data/src-train-tat.txt", "r").readlines()
dev = open("data/src-dev-tat.txt", "r").readlines()
test = open("data/src-test-tat.txt", "r").readlines()

print("Train Sentences: " + str(len(train)))
print("Dev Sentences: " + str(len(dev)))
print("Test Sentences: " + str(len(test)))

unique_train_toks = {}
unique_dev_toks = {}
unique_test_toks = {}

count = 0
for sent in train:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_train_toks[tok] = 0  # record every token, not only the last one in each sentence
print("Train Tokens: " + str(count))
print("Unique Train Tokens: " + str(len(unique_train_toks)))

count = 0
for sent in dev:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_dev_toks[tok] = 0  # record every token, not only the last one in each sentence
print("Dev Tokens: " + str(count))
print("Unique Dev Tokens: " + str(len(unique_dev_toks)))

count = 0
for sent in test:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_test_toks[tok] = 0  # record every token, not only the last one in each sentence
print("Test Tokens: " + str(count))
print("Unique Test Tokens: " + str(len(unique_test_toks)))
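
The total and distinct token counts reported by the statistics cells can also be computed with `collections.Counter`. The sketch below is one compact way to do it, assuming whitespace tokenization as in the cell above. `token_stats` is a hypothetical helper and the toy lines are placeholders, not project data.

```python
from collections import Counter


def token_stats(lines):
    """Return (total token count, number of distinct tokens) for whitespace-tokenized lines."""
    counts = Counter(tok for line in lines for tok in line.split())
    return sum(counts.values()), len(counts)


toy_lines = ["x y\n", "x z\n", "\n"]
toy_stats = token_stats(toy_lines)
print(toy_stats)  # (4, 3)
```

Because every token in every line passes through the counter, blank lines contribute nothing and no token is missed.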

### New Data Preprocess

In [None]:
import json
import os
import re

In [None]:
# transliteration
word_mappings = {}
char_mappings = {}
word_pairs_train = open("data/train.ug", "r", errors="ignore").readlines()
word_pairs_test = open("data/test.ug", "r", errors="ignore").readlines()
char_pairs = open("data/mappings.txt", "r", errors="ignore").readlines()
for pair in word_pairs_train:
    pair = pair.split()
    # skip blank or malformed lines, which would otherwise raise an IndexError
    if len(pair) < 2:
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
for pair in word_pairs_test:
    pair = pair.split()
    if len(pair) < 2:
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
for pair in char_pairs:
    pair = pair.split()
    if len(pair) < 2:
        continue
    if pair[0] not in char_mappings:
        char_mappings[pair[0]] = pair[1]
# function that performs transliteration of a single token
def transliterate(token):
    if token in word_mappings:
        return word_mappings[token]
    else:
        tok = ""
        for c in token:
            if c not in char_mappings:
                tok = tok + c
            else:
                tok = tok + char_mappings[c]
        return tok

In [None]:
# Redo of the Tatoeba dataset preprocess for the fourth milestone, since we are adding some additional parallel data here
# this is also here to isolate what is actually relevant to the final finished product, since there was a lot of going back and forth with the data
import csv

# English to Uyghur data is 1
ug_tat_sents = []
ug_tat_sents_translit = []
ug_tat_sents_en = []
with open("data/English-Uyghur.tsv", "r") as f:
    rd = csv.reader(f, delimiter="\t", quotechar='"')
    for row in rd:
        ug_tat_sents.append(row[3])
        translit = []
        toks = row[3].split()
        for tok in toks:
            new_tok = transliterate(tok)
            translit.append(new_tok)
        ug_tat_sents_translit.append(" ".join(translit))
        ug_tat_sents_en.append(row[1])

#print(len(ug_tat_sents))
#count = 0
#for sent in ug_tat_sents:
#    toks = sent.split()
#    for tok in toks:
#        count +=1 
#print(count)

with open("data/src-tatoeba-1.txt", "w") as f:
    f.write("\n".join(ug_tat_sents))
with open("data/tgt-tatoeba-1.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_en))
with open("data/src-tatoeba-1-translit.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_translit))

# Uyghur to English data is 2
ug_tat_sents = []
ug_tat_sents_translit = []
ug_tat_sents_en = []
with open("data/Uyghur-English.tsv", "r") as f:
    rd = csv.reader(f, delimiter="\t", quotechar='"')
    for row in rd:
        ug_tat_sents.append(row[1])
        translit = []
        toks = row[1].split()
        for tok in toks:
            new_tok = transliterate(tok)
            translit.append(new_tok)
        ug_tat_sents_translit.append(" ".join(translit))
        ug_tat_sents_en.append(row[3])

#print(len(ug_tat_sents))
#count = 0
#for sent in ug_tat_sents:
#    toks = sent.split()
#    for tok in toks:
#        count +=1 
#print(count)

with open("data/src-tatoeba-2.txt", "w") as f:
    f.write("\n".join(ug_tat_sents))
with open("data/tgt-tatoeba-2.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_en))
with open("data/src-tatoeba-2-translit.txt", "w") as f:
    f.write("\n".join(ug_tat_sents_translit))

In [None]:
# master docs of the data (pre-split)
ug_1 = open("data/src-tatoeba-1.txt", "r").readlines()
ug_2 = open("data/src-tatoeba-2.txt", "r").readlines()
ug_1_translit = open("data/src-tatoeba-1-translit.txt", "r").readlines()
ug_2_translit = open("data/src-tatoeba-2-translit.txt", "r").readlines()
en_1 = open("data/tgt-tatoeba-1.txt", "r").readlines()
en_2 = open("data/tgt-tatoeba-2.txt", "r").readlines()

ug_1.extend(ug_2)
ug_1_translit.extend(ug_2_translit)
en_1.extend(en_2)

with open("data/src-tatoeba-master.txt", "w") as f:
    f.write("\n".join(ug_1))
with open("data/tgt-tatoeba-master.txt", "w") as f:
    f.write("\n".join(en_1))
with open("data/src-tatoeba-master-translit.txt", "w") as f:
    f.write("\n".join(ug_1_translit))

In [None]:
# remove empty lines from the files
!sed -i '/^$/d' data/src-tatoeba-master.txt
!sed -i '/^$/d' data/src-tatoeba-master-translit.txt
!sed -i '/^$/d' data/tgt-tatoeba-master.txt

In [None]:
# stats of the overall data (pre-split)
ug = open("data/src-tatoeba-master.txt", "r").readlines()
ug_translit = open("data/src-tatoeba-master-translit.txt", "r").readlines()
en = open("data/tgt-tatoeba-master.txt", "r").readlines()

print("Total Sentences: " + str(len(ug)))

unique_ug_toks = {}
unique_ug_translit_toks = {} # should be the same but we shall see
unique_en_toks = {}

count = 0
for sent in ug:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_ug_toks[tok] = 0  # record every token, not only the last one in each sentence
print("Uyghur Arabic Tokens: " + str(count))
print("Unique Uyghur Arabic Tokens: " + str(len(unique_ug_toks)))

count = 0
for sent in ug_translit:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_ug_translit_toks[tok] = 0  # record every token, not only the last one in each sentence
print("Uyghur Latin Tokens: " + str(count))
print("Unique Uyghur Latin Tokens: " + str(len(unique_ug_translit_toks)))

count = 0
for sent in en:
    toks = sent.split()
    for tok in toks:
        count +=1 
        unique_en_toks[tok] = 0  # record every token, not only the last one in each sentence
print("English Tokens: " + str(count))
print("Unique English Tokens: " + str(len(unique_en_toks)))

In [None]:
# make the train - dev - test splits
import random

ug = open("data/src-tatoeba-master.txt", "r").readlines()
ug_translit = open("data/src-tatoeba-master-translit.txt", "r").readlines()
en = open("data/tgt-tatoeba-master.txt", "r").readlines()

indices = random.sample(range(0, 8094), 1618)
dev_indices = indices[:809]
test_indices = indices[809:]

# train
#train_sents = []
#train_sents_translit = []
#train_sents_en = []
#for i in range(8094):
#    if i not in indices:
#        train_sents.append(ug[i])
#        train_sents_translit.append(ug_translit[i])
#        train_sents_en.append(en[i])
#with open("data/src-train-new.txt", "w") as f:
#    f.write("\n".join(train_sents))
#with open("data/src-train-translit-new.txt", "w") as f:
#    f.write("\n".join(train_sents_translit))
#with open("data/tgt-train-new.txt", "w") as f:
#    f.write("\n".join(train_sents_en))

# dev
dev_sents = []
dev_sents_translit = []
dev_sents_en = []
for i in dev_indices:
    dev_sents.append(ug[i])
    dev_sents_translit.append(ug_translit[i])
    dev_sents_en.append(en[i])
with open("data/src-dev-new1.txt", "w") as f:
    f.write("\n".join(dev_sents))
with open("data/src-dev-translit-new1.txt", "w") as f:
    f.write("\n".join(dev_sents_translit))
with open("data/tgt-dev-new1.txt", "w") as f:
    f.write("\n".join(dev_sents_en))

# test
#test_sents = []
#test_sents_translit = []
#test_sents_en = []
#for i in test_indices:
#    test_sents.append(ug[i])
#    test_sents_translit.append(ug_translit[i])
#    test_sents_en.append(en[i])
#with open("data/src-test-new.txt", "w") as f:
#    f.write("\n".join(test_sents))
#with open("data/src-test-translit-new.txt", "w") as f:
#    f.write("\n".join(test_sents_translit))
#with open("data/tgt-test-new.txt", "w") as f:
#    f.write("\n".join(test_sents_en))

In [None]:
# remove the empty lines from these files
!sed -i '/^$/d' data/src-train-new.txt
!sed -i '/^$/d' data/src-train-translit-new.txt
!sed -i '/^$/d' data/tgt-train-new.txt

!sed -i '/^$/d' data/src-dev-new1.txt
!sed -i '/^$/d' data/src-dev-translit-new1.txt
!sed -i '/^$/d' data/tgt-dev-new1.txt

!sed -i '/^$/d' data/src-test-new.txt
!sed -i '/^$/d' data/src-test-translit-new.txt
!sed -i '/^$/d' data/tgt-test-new.txt

In [7]:
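# stats on the train - dev - test splits
# note: each unique_*_toks[tok] = 0 assignment below runs once per sentence, after the inner loop,
# so the "Unique" counts printed by this cell count distinct sentence-final tokens, not the full vocabulary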

# train
train_ug = open("data/src-train-new.txt", "r").readlines()
train_ug_translit = open("data/src-train-translit-new.txt", "r").readlines()
train_en = open("data/tgt-train-new.txt", "r").readlines()

print("Total Train Sentences: " + str(len(train_ug)))

unique_ug_toks = {}
unique_ug_translit_toks = {}
unique_en_toks = {}

count = 0
for sent in train_ug:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_toks[tok] = 0
print("Train Uyghur Arabic Tokens: " + str(count))
print("Train Unique Uyghur Arabic Tokens: " + str(len(unique_ug_toks)))

count = 0
for sent in train_ug_translit:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_translit_toks[tok] = 0
print("Train Uyghur Latin Tokens: " + str(count))
print("Train Unique Uyghur Latin Tokens: " + str(len(unique_ug_translit_toks)))

count = 0
for sent in train_en:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_en_toks[tok] = 0
print("Train English Tokens: " + str(count))
print("Train Unique English Tokens: " + str(len(unique_en_toks)))

print("\n")

# dev
dev_ug = open("data/src-dev-new.txt", "r").readlines()
dev_ug_translit = open("data/src-dev-translit-new.txt", "r").readlines()
dev_en = open("data/tgt-dev-new.txt", "r").readlines()

print("Total Development Sentences: " + str(len(dev_ug)))

unique_ug_toks = {}
unique_ug_translit_toks = {} 
unique_en_toks = {}

count = 0
for sent in dev_ug:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_toks[tok] = 0
print("Development Uyghur Arabic Tokens: " + str(count))
print("Development Unique Uyghur Arabic Tokens: " + str(len(unique_ug_toks)))

count = 0
for sent in dev_ug_translit:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_translit_toks[tok] = 0
print("Development Uyghur Latin Tokens: " + str(count))
print("Development Unique Uyghur Latin Tokens: " + str(len(unique_ug_translit_toks)))

count = 0
for sent in dev_en:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_en_toks[tok] = 0
print("Development English Tokens: " + str(count))
print("Development Unique English Tokens: " + str(len(unique_en_toks)))

print("\n")

# test
test_ug = open("data/src-test-new.txt", "r").readlines()
test_ug_translit = open("data/src-test-translit-new.txt", "r").readlines()
test_en = open("data/tgt-test-new.txt", "r").readlines()

print("Total Test Sentences: " + str(len(test_ug)))

unique_ug_toks = {}
unique_ug_translit_toks = {} 
unique_en_toks = {}

count = 0
for sent in test_ug:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_toks[tok] = 0
print("Test Uyghur Arabic Tokens: " + str(count))
print("Test Unique Uyghur Arabic Tokens: " + str(len(unique_ug_toks)))

count = 0
for sent in test_ug_translit:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_ug_translit_toks[tok] = 0
print("Test Uyghur Latin Tokens: " + str(count))
print("Test Unique Uyghur Latin Tokens: " + str(len(unique_ug_translit_toks)))

count = 0
for sent in test_en:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_en_toks[tok] = 0
print("Test English Tokens: " + str(count))
print("Test Unique English Tokens: " + str(len(unique_en_toks)))

Total Train Sentences: 6476
Train Uyghur Arabic Tokens: 27969
Train Unique Uyghur Arabic Tokens: 1829
Train Uyghur Latin Tokens: 27969
Train Unique Uyghur Latin Tokens: 1829
Train English Tokens: 41045
Train Unique English Tokens: 1524


Total Development Sentences: 484
Development Uyghur Arabic Tokens: 2045
Development Unique Uyghur Arabic Tokens: 88
Development Uyghur Latin Tokens: 2045
Development Unique Uyghur Latin Tokens: 88
Development English Tokens: 2802
Development Unique English Tokens: 89


Total Test Sentences: 571
Test Uyghur Arabic Tokens: 2421
Test Unique Uyghur Arabic Tokens: 96
Test Uyghur Latin Tokens: 2421
Test Unique Uyghur Latin Tokens: 96
Test English Tokens: 3444
Test Unique English Tokens: 104


In [None]:
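# note: this cell is out of order. It removes any dev/test sentences that also appear in the
# training data, keeping the source, target, and transliterated files aligned.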
import random
test = open("../data/src-test-new.txt", "r").readlines()
dev = open("../data/src-dev-new.txt", "r").readlines()
train = open("../data/src-train-new.txt", "r").readlines()

test2 = open("../data/tgt-test-new.txt", "r").readlines()
dev2 = open("../data/tgt-dev-new.txt", "r").readlines()

test3 = open("../data/src-test-translit-new.txt", "r").readlines()
dev3 = open("../data/src-dev-translit-new.txt", "r").readlines()

new_test_idx = []
new_dev_idx = []

# a set makes the "also in train?" check O(1) per line
train_set = set(train)

for idx, line in enumerate(test):
    if line not in train_set:
        new_test_idx.append(idx)

for idx, line in enumerate(dev):
    if line not in train_set:
        new_dev_idx.append(idx)

new_test1 = []
new_dev1 = []

new_test2 = []
new_dev2 = []

new_test3 = []
new_dev3 = []

for idx in new_test_idx:
    new_test1.append(test[idx])
    new_test2.append(test2[idx])
    new_test3.append(test3[idx])

for idx in new_dev_idx:
    new_dev1.append(dev[idx])
    new_dev2.append(dev2[idx])
    new_dev3.append(dev3[idx])
 
with open("../data/src-test-new.txt", "w") as f:
    f.write("\n".join(new_test1))
with open("../data/src-dev-new.txt", "w") as f:
    f.write("\n".join(new_dev1))
with open("../data/src-test-translit-new.txt", "w") as f:
    f.write("\n".join(new_test3))
with open("../data/src-dev-translit-new.txt", "w") as f:
    f.write("\n".join(new_dev3))
with open("../data/tgt-test-new.txt", "w") as f:
    f.write("\n".join(new_test2))
with open("../data/tgt-dev-new.txt", "w") as f:
    f.write("\n".join(new_dev2))
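
The overlap removal performed by the cell above can be sketched as a single helper that filters every parallel list with the same set of kept indices. `drop_train_overlap` is a hypothetical name and the toy lines are made up; the logic mirrors the cell above, which keeps a dev or test row only when its source line does not occur in the training source file.

```python
def drop_train_overlap(train_src, heldout_src, *heldout_parallel):
    """Drop held-out rows whose source line also occurs in train, in every parallel list."""
    train_set = set(train_src)
    keep = [i for i, line in enumerate(heldout_src) if line not in train_set]
    # Apply the same kept indices to the source list and to each parallel list.
    return [[lines[i] for i in keep] for lines in (heldout_src, *heldout_parallel)]


toy_train = ["a\n", "b\n"]
toy_dev_src = ["a\n", "c\n", "d\n"]
toy_dev_tgt = ["A\n", "C\n", "D\n"]
new_src, new_tgt = drop_train_overlap(toy_train, toy_dev_src, toy_dev_tgt)
print(new_src, new_tgt)
```

Filtering by index rather than by value keeps the target and transliterated files in step with the source file.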

In [6]:
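# Uzbek stats (test_en and unique_en_toks are reused here to hold Uzbek data)
# note: unique_en_toks[tok] = 0 runs once per sentence, after the inner loop, so the
# "Unique Uzbek Tokens" figure counts distinct sentence-final tokens, not the full vocabulary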
unique_ug_toks = {}
unique_ug_translit_toks = {} 
unique_en_toks = {}
test_en = open("data/src-tatoeba-uz.txt", "r").readlines()
print("Uzbek Sentences: " + str(len(test_en)))
count = 0
for sent in test_en:
    toks = sent.split()
    for tok in toks:
        count +=1 
    unique_en_toks[tok] = 0
print("Uzbek Tokens: " + str(count))
print("Unique Uzbek Tokens: " + str(len(unique_en_toks)))

Uzbek Sentences: 494
Uzbek Tokens: 1652
Unique Uzbek Tokens: 341


## Simple Baseline

Our old simple baseline was a bit hacky: we naively used Google Translate on a token-by-token basis, and it performed extremely poorly on our data.

We decided that a model more in the spirit of a simple baseline would be to take the most common word in each language and translate every token to the most common word of the target language (similar to predicting the most common class in a multi-class classification problem).

Code for the old simple baseline can be found in our previous milestone submissions; however, we feel that the baseline below reflects the task more accurately, and we recommend running it instead. It is the baseline we primarily report on in the final report.

In English, the most common word is "the"; in Uyghur, it is "ئۇ" ("u" in Latin script).

For more information about the simple baseline, please refer to the final project report. 
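
As a rough illustration of the idea, the sketch below finds the most frequent token in some target-language text and emits it once per source token. The helper names and toy sentences are assumptions made for illustration, and the sketch takes the most common word from target-side training text; the notebook cells that follow remain the code we actually ran.

```python
from collections import Counter


def most_common_token(lines):
    """Return the most frequent whitespace-separated token in a list of sentences."""
    counts = Counter(tok for line in lines for tok in line.split())
    return counts.most_common(1)[0][0]


def constant_baseline(source_lines, word):
    """Predict `word` once per source token, so each output matches its source length."""
    return [" ".join(word for _ in line.split()) for line in source_lines]


toy_target_train = ["the cat saw the dog", "a dog saw the cat"]
toy_source = ["x y z", "x y"]
toy_word = most_common_token(toy_target_train)
toy_preds = constant_baseline(toy_source, toy_word)
print(toy_word)   # the
print(toy_preds)  # ['the the the', 'the the']
```

Like the majority-class baseline in classification, this gives a floor that any trained translation model should beat.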

In [None]:
# getting the most common Uyghur word
from collections import Counter

with open("data/src-train-new.txt", "r") as f:
    train_ug = f.readlines()
with open("data/src-train-translit-new.txt", "r") as f:
    train_ug_translit = f.readlines()

# count whitespace-delimited tokens across the Arabic-script training set
tok_dict = Counter(tok for line in train_ug for tok in line.split())

most_common_word = tok_dict.most_common(1)[0][0]
# transliterate() is the helper defined earlier in this notebook
most_common_word_translit = transliterate(most_common_word)
print("Most Common Uyghur Word Arabic Script: " + str(most_common_word))
print("Most Common Uyghur Word Latin Script: " + str(most_common_word_translit))

In [None]:
# Uyghur to English simple baseline
train_ug = open("data/src-train-new.txt", "r").readlines()
dev_ug = open("data/src-dev-new.txt", "r").readlines()
test_ug = open("data/src-test-new.txt", "r").readlines()

train_en = open("data/tgt-train-new.txt", "r").readlines()
dev_en = open("data/tgt-dev-new.txt", "r").readlines()
test_en = open("data/tgt-test-new.txt", "r").readlines()

# simple train predictions
preds = []
for sent in train_ug:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "the "
    preds.append(pred_sent)

with open("data/predictions/train-simple-baseline-preds-ugen.txt", "w") as f:
    f.write("\n".join(preds))

# simple dev predictions 
preds = []
for sent in dev_ug:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "the "
    preds.append(pred_sent)

with open("data/predictions/dev-simple-baseline-preds-ugen.txt", "w") as f:
    f.write("\n".join(preds))

# simple test predictions
preds = []
for sent in test_ug:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "the "
    preds.append(pred_sent)

with open("data/predictions/test-simple-baseline-preds-ugen.txt", "w") as f:
    f.write("\n".join(preds))

In [None]:
# English to Uyghur simple baseline
train_ug = open("data/src-train-new.txt", "r").readlines()
dev_ug = open("data/src-dev-new.txt", "r").readlines()
test_ug = open("data/src-test-new.txt", "r").readlines()

train_en = open("data/tgt-train-new.txt", "r").readlines()
dev_en = open("data/tgt-dev-new.txt", "r").readlines()
test_en = open("data/tgt-test-new.txt", "r").readlines()

# simple train predictions
preds = []
for sent in train_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "ئۇ" + " "
    preds.append(pred_sent)

with open("data/predictions/train-simple-baseline-preds-enug.txt", "w") as f:
    f.write("\n".join(preds))

# simple dev predictions
preds = []
for sent in dev_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "ئۇ" + " "
    preds.append(pred_sent)

with open("data/predictions/dev-simple-baseline-preds-enug.txt", "w") as f:
    f.write("\n".join(preds))

# simple test predictions
preds = []
for sent in test_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "ئۇ" + " "
    preds.append(pred_sent)

with open("data/predictions/test-simple-baseline-preds-enug.txt", "w") as f:
    f.write("\n".join(preds))

In [None]:
from nltk.translate.bleu_score import corpus_bleu

In [None]:
# method to evaluate the simple baseline
def evaluate(preds, golds):
    hypotheses = [pred.split() for pred in preds]
    references = [[gold.split()] for gold in golds]
    score = corpus_bleu(hypotheses=hypotheses, list_of_references=references)
    return score
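
To see what `evaluate` can be expected to report for a constant-token baseline, the sketch below computes clipped (modified) n-gram precision, the quantity BLEU is built from, in plain Python. It is a simplified illustration on made-up sentences and not NLTK's implementation; `ngrams` and `modified_precision` are hypothetical local helpers.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of one hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # each hypothesis n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total

# Toy example (hypothetical sentences).
hyp = "the the the the the the".split()
ref = "the cat sat on the mat".split()

p1 = modified_precision(hyp, ref, 1)  # 2 of 6 unigrams are credited
p2 = modified_precision(hyp, ref, 2)  # no ('the', 'the') bigram in the reference
print(round(p1, 3), p2)
```

Because every hypothesis bigram is `('the', 'the')`, the higher-order precisions are zero, so unsmoothed BLEU for this kind of baseline should be at or very near zero.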

In [None]:
# evaluate the Uyghur to English simple baseline

# train
train_preds = open("data/predictions/train-simple-baseline-preds-ugen.txt", "r").readlines()
train_gold = open("data/tgt-train-new.txt", "r").readlines()
score = evaluate(train_preds, train_gold)
print("Train Simple Baseline BLEU Score: " + str(100 * score))

# dev
dev_preds = open("data/predictions/dev-simple-baseline-preds-ugen.txt", "r").readlines()
dev_gold = open("data/tgt-dev-new.txt", "r").readlines()
score = evaluate(dev_preds, dev_gold)
print("Development Simple Baseline BLEU Score: " + str(100 * score))

# test
test_preds = open("data/predictions/test-simple-baseline-preds-ugen.txt", "r").readlines()
test_gold = open("data/tgt-test-new.txt", "r").readlines()
score = evaluate(test_preds, test_gold)
print("Test Simple Baseline BLEU Score: " + str(100 * score))

In [None]:
# evaluate the English to Uyghur simple baseline

# train
train_preds = open("data/predictions/train-simple-baseline-preds-enug.txt", "r").readlines()
train_gold = open("data/src-train-new.txt", "r").readlines()
score = evaluate(train_preds, train_gold)
print("Train Simple Baseline BLEU Score: " + str(100 * score))

# dev
dev_preds = open("data/predictions/dev-simple-baseline-preds-enug.txt", "r").readlines()
dev_gold = open("data/src-dev-new.txt", "r").readlines()
score = evaluate(dev_preds, dev_gold)
print("Development Simple Baseline BLEU Score: " + str(100 * score))

# test
test_preds = open("data/predictions/test-simple-baseline-preds-enug.txt", "r").readlines()
test_gold = open("data/src-test-new.txt", "r").readlines()
score = evaluate(test_preds, test_gold)
print("Test Simple Baseline BLEU Score: " + str(100 * score))

In [None]:
# for completeness, can also do this with the transliterated version of Uyghur
train_ug = open("data/src-train-translit-new.txt", "r").readlines()
dev_ug = open("data/src-dev-translit-new.txt", "r").readlines()
test_ug = open("data/src-test-translit-new.txt", "r").readlines()

train_en = open("data/tgt-train-new.txt", "r").readlines()
dev_en = open("data/tgt-dev-new.txt", "r").readlines()
test_en = open("data/tgt-test-new.txt", "r").readlines()

# simple train predictions
preds = []
for sent in train_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "u" + " "
    preds.append(pred_sent)

with open("data/predictions/train-simple-baseline-preds-enug-translit.txt", "w") as f:
    f.write("\n".join(preds))

# simple dev predictions
preds = []
for sent in dev_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "u" + " "
    preds.append(pred_sent)

with open("data/predictions/dev-simple-baseline-preds-enug-translit.txt", "w") as f:
    f.write("\n".join(preds))

# simple test predictions
preds = []
for sent in test_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "u" + " "
    preds.append(pred_sent)

with open("data/predictions/test-simple-baseline-preds-enug-translit.txt", "w") as f:
    f.write("\n".join(preds))

# train
train_preds = open("data/predictions/train-simple-baseline-preds-enug-translit.txt", "r").readlines()
train_gold = open("data/src-train-translit-new.txt", "r").readlines()
score = evaluate(train_preds, train_gold)
print("Train Transliterated Simple Baseline BLEU Score: " + str(100 * score))

# dev
dev_preds = open("data/predictions/dev-simple-baseline-preds-enug-translit.txt", "r").readlines()
dev_gold = open("data/src-dev-translit-new.txt", "r").readlines()
score = evaluate(dev_preds, dev_gold)
print("Development Transliterated Simple Baseline BLEU Score: " + str(100 * score))

# test
test_preds = open("data/predictions/test-simple-baseline-preds-enug-translit.txt", "r").readlines()
test_gold = open("data/src-test-translit-new.txt", "r").readlines()
score = evaluate(test_preds, test_gold)
print("Test Transliterated Simple Baseline BLEU Score: " + str(100 * score))

## Published Baseline

This section runs our model code on the Uyghur data to implement our published baseline.

The code used to train and evaluate these models is inspired by the code found [here](https://github.com/snnclsr/nmt/).

We include some sanity-check models, test models, and then all of the models trained during the hyperparameter search for this part of the assignment. The results of all of these experiments are reported in the final report.
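
When collecting scores from many runs, it can help to pull the BLEU values out of captured output programmatically. The sketch below assumes the `Corpus BLEU: <number>` line format that appears in the evaluation outputs in this notebook; the helper name `extract_bleu` and the way the output text is captured are hypothetical.

```python
import re

# Matches lines such as "Corpus BLEU: 12.233940556113389".
BLEU_LINE = re.compile(r"^Corpus BLEU:\s*([0-9]+(?:\.[0-9]+)?)\s*$", re.MULTILINE)

def extract_bleu(output_text):
    """Return every corpus-level BLEU score found in captured evaluation output."""
    return [float(value) for value in BLEU_LINE.findall(output_text)]

# Sample text copied from one of the evaluation outputs in this notebook.
sample = """100% 484/484 [00:09<00:00, 52.17it/s]
Corpus BLEU: 12.233940556113389
"""
print(extract_bleu(sample))
```

Lines where the score is cut off after the decimal point (as in some truncated outputs) are deliberately not matched.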

In [4]:
%cd code

/content/drive/My Drive/CIS 530/Project/Milestone4/code


### Sanity Checks

In [None]:
# sanity check arabic script
!python train.py --train_data ../data/src-tatoeba-master.txt ../data/tgt-tatoeba-master.txt --valid_data ../data/src-tatoeba-master.txt ../data/tgt-tatoeba-master.txt --n_epochs 10 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 #--device cuda

In [None]:
# sanity check latin script
!python train.py --train_data ../data/src-tatoeba-master-translit.txt ../data/tgt-tatoeba-master.txt --valid_data ../data/src-tatoeba-master-translit.txt ../data/tgt-tatoeba-master.txt --n_epochs 10 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 #--device cuda

In [None]:
# test arabic script model
!python test.py --model_file experiments/05_04_2021_17_51_03/model.bin --valid_data ../data/src-tatoeba-master.txt ../data/tgt-tatoeba-master.txt

In [None]:
# test latin script model
!python test.py --model_file experiments/05_04_2021_18_20_04/model.bin --valid_data ../data/src-tatoeba-master-translit.txt ../data/tgt-tatoeba-master.txt

In [None]:
# can also try reverse direction and see if this performs better
# sanity check arabic script reverse
!python train.py --train_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master.txt --valid_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master.txt --n_epochs 10 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 #--device cuda

In [None]:
# sanity check latin script reverse
!python train.py --train_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master-translit.txt --valid_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master-translit.txt --n_epochs 10 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 #--device cuda

In [None]:
# test arabic script model (reverse direction)
!python test.py --model_file experiments/05_04_2021_18_49_29/model.bin --valid_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master.txt

In [None]:
# test latin script model (reverse direction)
!python test.py --model_file experiments/05_04_2021_19_07_07/model.bin --valid_data ../data/tgt-tatoeba-master.txt ../data/src-tatoeba-master-translit.txt

### Published Baseline Models

#### Arabic Script Uyghur to English

In [4]:
# train a model on Uyghur Arabic data
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/src-train-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 20:36:39 - INFO - __main__ - Total train sentences: 6475
05/06/2021 20:36:39 - INFO - __main__ - Total valid sentences: 484
05/06/2021 20:36:39 - INFO - __main__ - Random samples from training data
Source Language Sentence:  پاكىز تەخسە يوق.
Target Language Sentence:  There are no clean plates.
Source Language Sentence:  ئاپام مېنى دورا ئىچكۈزدى.
Target Language Sentence:  My mother made me take some medicine.
Source Language Sentence:  مەن بىلەن ئۇ بىر ھەپتە بىر قېتىم كۆرۈشىمىز.
Target Language Sentence:  I meet her once a week.
05/06/2021 20:36:39 

In [5]:
# evaluate
!python test.py --model_file experiments/05_06_2021_20_36_36/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_20_36_36/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_20_36_36/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:09<00:00, 52.17it/s]
Corpus BLEU: 12.233940556113389
{'model_file': 'experiments/05_06_2021_20_36_36/model.bin', 'valid_data': ['../data/src-test-new.

#### Latin Script Uyghur to English

In [5]:
# train a model on Uyghur Latin data
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 20:39:42 - INFO - __main__ - Total train sentences: 6475
05/06/2021 20:39:42 - INFO - __main__ - Total valid sentences: 484
05/06/2021 20:39:42 - INFO - __main__ - Random samples from training data
Source Language Sentence:  sorap bilish — eyib aemes.
Target Language Sentence:  There is nothing wrong with knowledge obtained by asking.
Source Language Sentence:  biz bilen chüshlük tamaq yémemsen?
Target Language Sentence:  Won't you go out to lunch with us?
Source Language Sentence:  men yardemge mohtaj.
Target Language Sentence:  I 

In [6]:
# evaluate
!python test.py --model_file experiments/05_06_2021_20_39_42/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_20_39_42/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_20_39_42/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 58.72it/s]
Corpus BLEU: 16.050991677607918
{'model_file': 'experiments/05_06_2021_20_39_42/model.bin', 'valid_data': ['../data/src-

#### English to Uyghur (Arabic Script)

In [6]:
# train an English to Uyghur (Arabic script) model
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 20:41:51 - INFO - __main__ - Total train sentences: 6475
05/06/2021 20:41:51 - INFO - __main__ - Total valid sentences: 484
05/06/2021 20:41:51 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Also, I've answered the question that you asked.
Target Language Sentence:  ھەمدە سورىغان سوئالىڭغا جاۋاب بېرىپ قويدۇم.
Source Language Sentence:  Is there any cold water?
Target Language Sentence:  سوغۇق سۇ بارمۇ؟
Source Language Sentence:  If you just listen to what the teacher says, you'll be able to become a good student.
Tar

In [7]:
# evaluate
!python test.py --model_file experiments/05_06_2021_20_41_51/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_20_41_51/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_20_41_51/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:07<00:00, 61.79it/s]
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus BLEU: 16.

#### English to Uyghur (Latin Script)

In [7]:
# train an English to Uyghur (Latin script) model
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 20:45:02 - INFO - __main__ - Total train sentences: 6475
05/06/2021 20:45:02 - INFO - __main__ - Total valid sentences: 484
05/06/2021 20:45:02 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Who organized that meeting?
Target Language Sentence:  u yighinni kim auyushturghan?
Source Language Sentence:  I hope you have a good trip.
Target Language Sentence:  seper ongushluq bolsun.
Source Language Sentence:  That was a lie.
Target Language Sentence:  u yalghan gep aidi.
05/06/2021 20:45:02 - INFO - __

In [8]:
# evaluate
!python test.py --model_file experiments/05_06_2021_20_45_02/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_20_45_02/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_20_45_02/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:07<00:00, 67.04it/s]
Corpus BLEU: 7.866607630718258
{'model_file': 'experiments/05_06_2021_20_45_02/model.bin', 'valid_data': ['../data/tgt-t

### Hyperparameter Search for Models (Developed with Combined Dev Set) -- Extension

The training process implements early stopping, and the learning rate is decreased automatically during training if progress stagnates. We train with the Adam optimizer. The hyperparameters we can tune include dropout, number of layers, and batch size.
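
The interaction between learning-rate decay and early stopping can be sketched as follows. The parameter names `initial_lr`, `lr_decay`, `patience`, and `max_trial` and their default values are taken from the configuration printed in the training logs; how `train.py` actually combines them is an assumption here, as is the use of a "higher is better" validation score. `run_schedule` is a hypothetical helper that only simulates the control logic.

```python
def run_schedule(valid_scores, initial_lr=0.001, lr_decay=0.5, patience=5, max_trial=5):
    """Simulate plateau-based learning-rate decay with early stopping.

    valid_scores: one validation score per epoch (higher is better).
    Returns (epochs_run, final_lr).
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0  # epochs since the last improvement
    trials = 0      # number of times patience has run out
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best:
            best = score
            bad_epochs = 0
            continue
        bad_epochs += 1
        if bad_epochs == patience:
            trials += 1
            if trials == max_trial:
                return epoch, lr  # early stop
            lr *= lr_decay        # decay and give the model another chance
            bad_epochs = 0
    return len(valid_scores), lr

# Toy validation curve: improves for 10 epochs, then stays flat.
scores = list(range(1, 11)) + [10] * 90
epochs_run, final_lr = run_schedule(scores)
print(epochs_run, final_lr)
```

Under these assumptions, a run budgeted for 100 epochs stops well before the limit once the validation score plateaus.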

The initial models implemented above were trained for up to 100 epochs (though with early stopping, most halted at around 50 epochs, give or take a couple). They used a dropout of 0.1, 2 layers, a batch size of 48, and an initial learning rate of 0.001.

In theory, we could also make the model unidirectional, but this does not seem likely to help us.

We keep the maximum number of training epochs at 100, knowing that early stopping will halt training once the model reaches what appears to be an appropriate minimum. We test dropout probabilities of 0.1, 0.2, 0.3, and 0.5, and layer counts of 1, 2, 5, and 10.
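
The grid described above can be enumerated programmatically. The sketch below only builds the command strings and does not run anything; the flags are the ones used in the `train.py` commands in this notebook, while `build_command` and the constant names are hypothetical.

```python
from itertools import product

DROPOUTS = [0.1, 0.2, 0.3, 0.5]
NUM_LAYERS = [1, 2, 5, 10]

def build_command(src_train, tgt_train, src_dev, tgt_dev, num_layers, dropout_p):
    """Build one train.py command line with the flags used in this notebook."""
    return (
        f"python train.py --train_data {src_train} {tgt_train} "
        f"--valid_data {src_dev} {tgt_dev} "
        f"--n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 "
        f"--num_layers {num_layers} --bidirectional --dropout_p {dropout_p} --device cuda"
    )

# One command per (dropout, num_layers) pair for the Arabic-script Uyghur to English direction.
commands = [
    build_command("../data/src-train-new.txt", "../data/tgt-train-new.txt",
                  "../data/src-dev-new.txt", "../data/tgt-dev-new.txt",
                  num_layers, dropout_p)
    for dropout_p, num_layers in product(DROPOUTS, NUM_LAYERS)
]

print(len(commands))  # 16
print(commands[0])
```

Swapping the source and target file arguments would give the grid for the reverse direction.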

Note: the file paths for the dev sets have since changed, so disregard the dev files currently listed in the commands; running them as written will produce different results.

#### Hyperparameter Search for Uyghur (Arabic Script) to English Models

In [None]:
# dropout 0.1
# num layers 1
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.1 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_03_09_35/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_03_09_35/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_03_09_35/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 60.96it/s]
Corpus BLEU: 19.607626164326813
{'model_file': 'experiments/05_05_2021_03_09_35/model.bin', 'valid_data': ['../data/src-test-new.txt', '../data

In [None]:
# dropout 0.1
# num layers 2 
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

In [None]:
# evaluate 
!python test.py --model_file experiments/05_05_2021_03_16_01/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_03_16_01/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_03_16_01/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 61.29it/s]
Corpus BLEU: 20.331390150813046
{'model_file': 'experiments/05_05_2021_03_16_01/model.bin', 'valid_data': ['../data/src-test-new.

In [None]:
# dropout 0.1
# num layers 5
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 5 --bidirectional --dropout_p 0.1 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_03_22_38/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_03_22_38/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_03_22_38/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=5, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:11<00:00, 42.71it/s]
Corpus BLEU: 15.115842096612248
{'model_file': 'experiments/05_05_2021_03_22_38/model.bin', 'valid_data': ['../data/src-test-new.

In [None]:
# dropout 0.1
# num layers 10
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 10 --bidirectional --dropout_p 0.1 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_03_41_43/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_03_41_43/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_03_41_43/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=10, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:15<00:00, 30.36it/s]
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus BLEU: 10

In [None]:
# dropout 0.2
# num layers 1
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_04_12_18/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_04_12_18/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_04_12_18/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 55.62it/s]
Corpus BLEU: 21.786108714834608
{'model_file': 'experiments/05_05_2021_04_12_18/model.bin', 'valid_data': ['../data/src-test-new.txt', '../data

In [None]:
# dropout 0.2
# num layers 2
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.2 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_04_17_38/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_04_17_38/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_04_17_38/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:11<00:00, 43.15it/s]
Corpus BLEU: 17.38299482997936
{'model_file': 'experiments/05_05_2021_04_17_38/model.bin', 'valid_data': ['../data/src-test-new.t

In [None]:
# dropout 0.2
# num layers 5
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 5 --bidirectional --dropout_p 0.2 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_04_24_32/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_04_24_32/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_04_24_32/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=5, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:11<00:00, 41.71it/s]
Corpus BLEU: 11.028092893896394
{'model_file': 'experiments/05_05_2021_04_24_32/model.bin', 'valid_data': ['../data/src-test-new.

In [None]:
# dropout 0.2
# num layers 10
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 10 --bidirectional --dropout_p 0.2 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_04_38_40/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_04_38_40/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_04_38_40/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=10, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:15<00:00, 30.58it/s]
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus BLEU: 3.
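
The warning above suggests smoothing because no hypothesis 4-gram matched a reference 4-gram, and unsmoothed BLEU is a geometric mean that collapses when any n-gram precision is zero. The following is a minimal pure-Python sketch of that effect, not the scoring code in `test.py`. The toy sentences, the function names, and the `epsilon` parameter are assumptions made for illustration; the epsilon replacement is only meant to be similar in spirit to the smoothing the warning recommends, and the exact library behaviour should be checked against its documentation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Return (clipped n-gram matches, total hypothesis n-grams) for one reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


def toy_bleu(reference, hypothesis, epsilon=0.0, max_n=4):
    """Sentence-level BLEU with uniform weights over orders 1..max_n.

    With epsilon == 0 the score is 0 as soon as one order has no match.
    With epsilon > 0 a zero match count is replaced by epsilon (assumed smoothing).
    """
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matches, total = modified_precision(reference, hypothesis, n)
        if total == 0:
            return 0.0  # hypothesis too short to contain any n-gram of this order
        if matches == 0:
            if epsilon == 0.0:
                return 0.0
            matches = epsilon
        log_sum += math.log(matches / total) / max_n
    # Brevity penalty: penalise hypotheses shorter than the reference.
    ref_len, hyp_len = len(reference), len(hypothesis)
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_sum)


# Invented toy pair: unigram, bigram and trigram matches exist, but no 4-gram matches.
reference = "i have already come here before".split()
hypothesis = "i have already been here before".split()

unsmoothed = toy_bleu(reference, hypothesis)
smoothed = toy_bleu(reference, hypothesis, epsilon=0.1)
print("unsmoothed:", unsmoothed)
print("smoothed:  ", smoothed)
```

Smoothing matters most at the sentence level; at the corpus level the warning mainly signals that the model output shares almost no long phrases with the references.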

In [None]:
# dropout 0.3
# num layers 1
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.3 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_05_10_13/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_05_10_13/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_05_10_13/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:06<00:00, 71.05it/s]
Corpus BLEU: 20.783383119491674
{'model_file': 'experiments/05_05_2021_05_10_13/model.bin', 'valid_data': ['../data/src-test-new.txt', '../data

In [None]:
# dropout 0.3
# num layers 2
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_05_15_47/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_05_15_47/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_05_15_47/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 63.53it/s]
Corpus BLEU: 20.754259090146462
{'model_file': 'experiments/05_05_2021_05_15_47/model.bin', 'valid_data': ['../data/src-test-new.

In [None]:
# dropout 0.3
# num layers 5
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 5 --bidirectional --dropout_p 0.3 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_05_23_36/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_05_23_36/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_05_23_36/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=5, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:11<00:00, 43.81it/s]
Corpus BLEU: 13.980573684753614
{'model_file': 'experiments/05_05_2021_05_23_36/model.bin', 'valid_data': ['../data/src-test-new.

In [None]:
# dropout 0.3
# num layers 10
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 10 --bidirectional --dropout_p 0.3 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_05_44_38/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_05_44_38/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_05_44_38/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=10, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:15<00:00, 32.26it/s]
Corpus BLEU: 10.050280652559481
{'model_file': 'experiments/05_05_2021_05_44_38/model.bin', 'valid_data': ['../data/src-test-new

In [None]:
# dropout 0.5
# num layers 1
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.5 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_06_19_39/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_06_19_39/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_06_19_39/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 68.28it/s]
Corpus BLEU: 21.335547491818915
{'model_file': 'experiments/05_05_2021_06_19_39/model.bin', 'valid_data': ['../data/src-test-new.txt', '../data

In [None]:
# dropout 0.5
# num layers 2
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.5 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_06_26_57/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_06_26_57/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_06_26_57/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 57.69it/s]
Corpus BLEU: 19.19884396422932
{'model_file': 'experiments/05_05_2021_06_26_57/model.bin', 'valid_data': ['../data/src-test-new.t

In [None]:
# dropout 0.5
# num layers 5
!python train.py --train_data ../data/src-train-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 5 --bidirectional --dropout_p 0.5 --device cuda

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_06_34_38/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_06_34_38/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_06_34_38/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=5, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:11<00:00, 41.16it/s]
Corpus BLEU: 10.868561569945781
{'model_file': 'experiments/05_05_2021_06_34_38/model.bin', 'valid_data': ['../data/src-test-new.
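
To make the runs above easier to compare, the sketch below collects the dev-set corpus BLEU values printed by the evaluation cells and ranks them. It is a convenience summary only, not part of the original pipeline. It lists only the runs whose settings and complete score are visible in this part of the notebook, so the 10-layer run at dropout 0.2 (whose score is cut off) is left out, and the variable names are assumptions.

```python
# Dev-set corpus BLEU copied from the evaluation outputs above,
# keyed by (dropout, num_layers).
dev_bleu = {
    (0.2, 5): 11.028092893896394,
    (0.3, 1): 20.783383119491674,
    (0.3, 2): 20.754259090146462,
    (0.3, 5): 13.980573684753614,
    (0.3, 10): 10.050280652559481,
    (0.5, 1): 21.335547491818915,
    (0.5, 2): 19.19884396422932,
    (0.5, 5): 10.868561569945781,
}

# Sort configurations from highest to lowest dev BLEU.
ranked = sorted(dev_bleu.items(), key=lambda item: item[1], reverse=True)
for (dropout, layers), bleu in ranked:
    print(f"dropout={dropout:<4} layers={layers:<3} dev BLEU={bleu:6.2f}")

best_config, best_bleu = ranked[0]
print("best of the listed runs:", best_config)
```

Among the listed runs, the shallow models score highest and the score drops as layers are added, which is consistent with the small training set (a few thousand sentence pairs).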

#### Hyperparameter Search for Uyghur (Latin Script) to English Models
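
The cells in this subsection repeat one training command while varying only `--dropout_p` and `--num_layers`. As a hedged sketch, the same command strings could be generated in a loop; the grid values below are the ones visible in this section's cell comments, the variable names are assumptions, and the snippet only prints the commands rather than launching them.

```python
from itertools import product

# Grid values taken from the cell comments in this subsection.
dropouts = [0.1, 0.2, 0.3, 0.5]
layer_counts = [1, 2]

# Flags shared by every run, copied from the training commands.
base = (
    "python train.py"
    " --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt"
    " --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt"
    " --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256"
)

# One command per (dropout, num_layers) pair.
commands = [
    f"{base} --num_layers {layers} --bidirectional --dropout_p {dropout} --device cuda"
    for dropout, layers in product(dropouts, layer_counts)
]

for command in commands:
    print(command)
```

Each run writes its checkpoint to a timestamped directory under `experiments/`, so the matching evaluation command still needs the directory name reported by that run.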

In [None]:
# dropout 0.1
# num layers 1
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:22:05 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:22:05 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:22:05 - INFO - __main__ - Random samples from training data
Source Language Sentence:  aalliburun bu yerge kelgenidim.
Target Language Sentence:  I've already come here before.
Source Language Sentence:  emdi razi boldungmu?
Target Language Sentence:  Are you happy now?
Source Language Sentence:  men kitab yéziwatimen.
Target Language Sentence:  I'm writing a book.
05/05/2021 22:22:05 - INFO - __m

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_22_02/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_22_02/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_22_02/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:06<00:00, 69.20it/s]
Corpus BLEU: 22.02157066874959
{'model_file': 'experiments/05_05_2021_22_22_02/model.bin', 'valid_data': ['../data/src-test-translit-n

In [None]:
# dropout 0.1
# num layers 2
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:28:14 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:28:14 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:28:14 - INFO - __main__ - Random samples from training data
Source Language Sentence:  men auninggha tayandim.
Target Language Sentence:  I relied on him.
Source Language Sentence:  bu aishni qiz dostumgha aéytip yürme, yene!
Target Language Sentence:  You absolutely must not tell my girlfriend about this!
Source Language Sentence:  auning gépige diqqet qilishingiz kérek.
Target Language Sentence:

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_28_14/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_28_14/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_28_14/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 62.08it/s]
Corpus BLEU: 19.451085582755088
{'model_file': 'experiments/05_05_2021_22_28_14/model.bin', 'valid_data': ['../data/src-

In [None]:
# dropout 0.2
# num layers 1
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:35:53 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:35:53 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:35:53 - INFO - __main__ - Random samples from training data
Source Language Sentence:  u birdemdila aishni tügetti.
Target Language Sentence:  He finished the job in an instant.
Source Language Sentence:  gül sughiring.
Target Language Sentence:  Please water the flowers.
Source Language Sentence:  zukam bolup qaldim.
Target Language Sentence:  I have a cold.
05/05/2021 22:35:53 - INFO - __main__ 

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_35_53/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_35_53/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_35_53/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:06<00:00, 70.42it/s]
Corpus BLEU: 23.07112631646015
{'model_file': 'experiments/05_05_2021_22_35_53/model.bin', 'valid_data': ['../data/src-test-translit-n

In [None]:
# dropout 0.2
# num layers 2
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:39:06 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:39:06 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:39:06 - INFO - __main__ - Random samples from training data
Source Language Sentence:  dadam doxtur.
Target Language Sentence:  My father is a doctor.
Source Language Sentence:  bu kitabni sizze yézipsize!
Target Language Sentence:  As if you actually wrote this book!
Source Language Sentence:  u muzéy nahayiti chong aiken.
Target Language Sentence:  That museum turned out to be huge.
05/05/2021 2

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_39_06/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_39_06/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_39_06/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 59.81it/s]
Corpus BLEU: 21.057671274083724
{'model_file': 'experiments/05_05_2021_22_39_06/model.bin', 'valid_data': ['../data/src-

In [None]:
# dropout 0.3
# num layers 1
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:43:04 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:43:04 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:43:04 - INFO - __main__ - Random samples from training data
Source Language Sentence:  bu méhmanxanida turimen.
Target Language Sentence:  I'm staying at this hotel.
Source Language Sentence:  téxiche héchnime aözgergini yoq.
Target Language Sentence:  Nothing's changed yet.
Source Language Sentence:  auning aüch akisi bar.
Target Language Sentence:  He has three brothers.
05/05/2021 22:43:04 - IN

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_43_04/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_43_04/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_43_04/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 66.93it/s]
Corpus BLEU: 23.442011516497185
{'model_file': 'experiments/05_05_2021_22_43_04/model.bin', 'valid_data': ['../data/src-test-translit-

In [None]:
# dropout 0.3
# num layers 2
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:46:00 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:46:00 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:46:00 - INFO - __main__ - Random samples from training data
Source Language Sentence:  300 som bilen qandaq tijaret qilisiz?
Target Language Sentence:  How do you do business with 300 yuan?
Source Language Sentence:  shamal köp chiqti.
Target Language Sentence:  It was really windy.
Source Language Sentence:  biz pat_ pat shahmat aoynaymiz.
Target Language Sentence:  We often play chess.
05/05/202

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_46_00/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_46_00/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_46_00/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 58.12it/s]
Corpus BLEU: 18.958457522482462
{'model_file': 'experiments/05_05_2021_22_46_00/model.bin', 'valid_data': ['../data/src-

In [None]:
# dropout 0.5
# num layers 1
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:50:13 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:50:13 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:50:13 - INFO - __main__ - Random samples from training data
Source Language Sentence:  shwétsariyide menzire köp.
Target Language Sentence:  Switzerland boasts many sights.
Source Language Sentence:  men auninggha bir qanche qétim téléfon qildim, lékin u qayturmidi.
Target Language Sentence:  I called him a few times, but he hasn't called back.
Source Language Sentence:  nahayiti shepqetlik we méh

In [None]:
# evaluate
!python test.py --model_file experiments/05_05_2021_22_50_13/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_50_13/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_50_13/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:07<00:00, 65.68it/s]
Corpus BLEU: 20.44456782613084
{'model_file': 'experiments/05_05_2021_22_50_13/model.bin', 'valid_data': ['../data/src-test-translit-n

In [None]:
# dropout 0.5
# num layers 2 
!python train.py --train_data ../data/src-train-translit-new.txt ../data/tgt-train-new.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/src-train-translit-new.txt', '../data/tgt-train-new.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/05/2021 22:59:05 - INFO - __main__ - Total train sentences: 6475
05/05/2021 22:59:05 - INFO - __main__ - Total valid sentences: 809
05/05/2021 22:59:05 - INFO - __main__ - Random samples from training data
Source Language Sentence:  bizge bu yerde bir dawalash aetridi lazim!
Target Language Sentence:  We need a medical team here!
Source Language Sentence:  aexmeq bolushqa aurunghanlar aexmeqtur.
Target Language Sentence:  Those who try to be foolish are foolish.
Source Language Sentence:  aöz-aözüngge qilding.
Target Language Sentence:  You

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_05_2021_22_59_05/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_05_2021_22_59_05/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_05_2021_22_59_05/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4344, bias=False)
  )
)
100% 484/484 [00:08<00:00, 58.71it/s]
Corpus BLEU: 20.08517797502676
{'model_file': 'experiments/05_05_2021_22_59_05/model.bin', 'valid_data': ['../data/src-t

#### Hyperparameter Search for English to Uyghur (Arabic Script) Models
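
The cells in this section sweep `--dropout_p` over 0.1, 0.2, 0.3, and 0.5 and `--num_layers` over 1 and 2, with every other flag held fixed. As a minimal sketch, the same eight `train.py` calls could be generated programmatically instead of being typed out by hand. The helper `build_train_command` and the constant names are hypothetical (they are not part of `train.py`); only the flags and file paths are taken from the cells below.

```python
from itertools import product

# Values swept in this section (copied from the training cells).
DROPOUTS = [0.1, 0.2, 0.3, 0.5]
NUM_LAYERS = [1, 2]

# English is the source side here, so the "tgt" files come first.
TRAIN_PAIR = ("../data/tgt-train-new.txt", "../data/src-train-new.txt")
VALID_PAIR = ("../data/tgt-dev-new.txt", "../data/src-dev-new.txt")


def build_train_command(train_pair, valid_pair, dropout_p, num_layers):
    """Hypothetical helper: assemble one train.py call as a shell string."""
    return (
        f"python train.py --train_data {train_pair[0]} {train_pair[1]} "
        f"--valid_data {valid_pair[0]} {valid_pair[1]} "
        "--n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 "
        f"--num_layers {num_layers} --bidirectional "
        f"--dropout_p {dropout_p} --device cuda"
    )


# One command per (dropout, layers) pair, in the same order as the cells below.
commands = [
    build_train_command(TRAIN_PAIR, VALID_PAIR, dropout_p, num_layers)
    for dropout_p, num_layers in product(DROPOUTS, NUM_LAYERS)
]

for command in commands:
    print(command)
```

In a notebook, each printed string could be pasted after `!` in its own cell. We kept the runs in separate cells so that each training log stays next to its evaluation output.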

In [None]:
# dropout 0.1
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:10:16 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:10:16 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:10:16 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Why did you go there?
Target Language Sentence:  ئۇ يەرگە نېمە ئۈچۈن باردىڭلار؟
Source Language Sentence:  He has children.
Target Language Sentence:  ئۇنىڭ بالىلىرى بار.
Source Language Sentence:  The hawk caught a mouse.
Target Language Sentence:  بۈركۈت بىر چاشقاننى تۇتىۋالدى.
05/06/2021 01:10:16 - INFO - __main__ - Random sam

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_10_13/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_10_13/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_10_13/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:06<00:00, 72.00it/s]
Corpus BLEU: 9.934066426510482
{'model_file': 'experiments/05_06_2021_01_10_13/model.bin', 'valid_data': ['../data/tgt-test-new.txt', '../data/

In [None]:
# dropout 0.1
# num layers 2 
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:16:29 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:16:29 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:16:29 - INFO - __main__ - Random samples from training data
Source Language Sentence:  I saw a fight.
Target Language Sentence:  مەن ئۇرۇشنى كۆردۈم.
Source Language Sentence:  Would you care for another cup of tea?
Target Language Sentence:  يەنە بىر ئىستاكان چاي ئىچەمسىز؟
Source Language Sentence:  He went into the bazaar. I don't know what's keeping him there.
Target Language Sentence:  ئۇ بازارغا كىرىپ كەتكەن، چ

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_16_29/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_16_29/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_16_29/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:07<00:00, 62.47it/s]
Corpus BLEU: 9.331117293974678
{'model_file': 'experiments/05_06_2021_01_16_29/model.bin', 'valid_data': ['../data/tgt-test-new.t

In [None]:
# dropout 0.2
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:20:10 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:20:10 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:20:10 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Come!
Target Language Sentence:  كېلىڭ!
Source Language Sentence:  The cadres inspected our school.
Target Language Sentence:  كادىرلار مەكتىپىمىزنى كۆزدىن كەچۈردى.
Source Language Sentence:  There'll be a problem.
Target Language Sentence:  مەسىلە كۆرۈلىدۇ.
05/06/2021 01:20:10 - INFO - __main__ - Random samples from validation d

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_20_10/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_20_10/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_20_10/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:06<00:00, 72.13it/s]
Corpus BLEU: 11.46000380964711
{'model_file': 'experiments/05_06_2021_01_20_10/model.bin', 'valid_data': ['../data/tgt-test-new.txt', '../data/

In [None]:
# dropout 0.2
# num layers 2 
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:27:14 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:27:14 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:27:14 - INFO - __main__ - Random samples from training data
Source Language Sentence:  To tell the truth, I don't like him.
Target Language Sentence:  گەپنىڭ راستى دېسە، مەن ئۇنى ياخشى كۆرمەيمەن.
Source Language Sentence:  I have to prepare for the English test.
Target Language Sentence:  ئىنگلىزچە سىنىقىغا تەييارلېنىشىم كېرەك.
Source Language Sentence:  All right, that's as much as I'm going to get done today.
Tar

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_27_14/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_27_14/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_27_14/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:08<00:00, 60.12it/s]
Corpus BLEU: 9.46359488316955
{'model_file': 'experiments/05_06_2021_01_27_14/model.bin', 'valid_data': ['../data/tgt-test-new.tx

In [None]:
# dropout 0.3
# num layers 1 
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:42:22 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:42:22 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:42:22 - INFO - __main__ - Random samples from training data
Source Language Sentence:  This is against the law.
Target Language Sentence:  بۇ قانۇنغا خىلاپ.
Source Language Sentence:  The chance is gone.
Target Language Sentence:  پەيتنى قولدىن بەردىڭ.
Source Language Sentence:  Every little bit counts.
Target Language Sentence:  قۇشقاچ بولسىمۇ گۆش.
05/06/2021 01:42:22 - INFO - __main__ - Random samples from valida

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_42_22/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_42_22/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_42_22/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:07<00:00, 65.74it/s]
Corpus BLEU: 9.588876885842405
{'model_file': 'experiments/05_06_2021_01_42_22/model.bin', 'valid_data': ['../data/tgt-test-new.txt', '../data/

In [None]:
# dropout 0.3
# num layers 2 
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 01:56:14 - INFO - __main__ - Total train sentences: 6475
05/06/2021 01:56:14 - INFO - __main__ - Total valid sentences: 809
05/06/2021 01:56:14 - INFO - __main__ - Random samples from training data
Source Language Sentence:  They often help each other.
Target Language Sentence:  ئۇلار دائىم بىرسى-بىرسىگە ياردەم قىلىدۇ.
Source Language Sentence:  I know that Nancy likes music.
Target Language Sentence:  نېنسىنىڭ مۇزىكىنى ياخشى كۆرىدىغانلىقىنى بىلىمەن.
Source Language Sentence:  You're not obliged to thank me.
Target Language Sentence:  ماڭا رەھمەتنىڭ 

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_01_56_14/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_01_56_14/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_01_56_14/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:08<00:00, 56.25it/s]
Corpus BLEU: 9.879672179719533
{'model_file': 'experiments/05_06_2021_01_56_14/model.bin', 'valid_data': ['../data/tgt-test-new.t

In [None]:
# dropout 0.5
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 02:38:00 - INFO - __main__ - Total train sentences: 6475
05/06/2021 02:38:00 - INFO - __main__ - Total valid sentences: 809
05/06/2021 02:38:00 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Everyone likes her.
Target Language Sentence:  ئادەملەرنىڭ ھەممىسى ئۇنى ياخشى كۆرىدۇ.
Source Language Sentence:  He left the office just now.
Target Language Sentence:  ئۇ ئەمدى ئىشخانىدىن كەتتى.
Source Language Sentence:  I was very satisfied with this.
Target Language Sentence:  بۇنىڭدىن ناھايىتى رازى بولدۇم.
05/06/2021 02:38:0

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_02_38_00/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_02_38_00/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_02_38_00/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:07<00:00, 69.07it/s]
Corpus BLEU: 9.491247997539508
{'model_file': 'experiments/05_06_2021_02_38_00/model.bin', 'valid_data': ['../data/tgt-test-new.txt', '../data/

In [None]:
# dropout 0.5
# num layers 2 
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 02:40:41 - INFO - __main__ - Total train sentences: 6475
05/06/2021 02:40:41 - INFO - __main__ - Total valid sentences: 809
05/06/2021 02:40:41 - INFO - __main__ - Random samples from training data
Source Language Sentence:  I am very tired after a class.
Target Language Sentence:  دەرستىن بەك ھېرىپ كەتكەنىدىم.
Source Language Sentence:  One needs to work hard to get a good score.
Target Language Sentence:  ياخشى نەتىجىنى قولغا كەلتۈرۈش ئۈچۈن، تىرىشىش شەرت.
Source Language Sentence:  What's the matter?
Target Language Sentence:  نېمە بولدى؟
05/06/202

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_02_40_41/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-new.txt 
!python test.py --model_file experiments/05_06_2021_02_40_41/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-new.txt 

{'model_file': 'experiments/05_06_2021_02_40_41/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6166, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6166, bias=False)
  )
)
100% 484/484 [00:08<00:00, 59.84it/s]
Corpus BLEU: 9.539760433309619
{'model_file': 'experiments/05_06_2021_02_40_41/model.bin', 'valid_data': ['../data/tgt-test-new.t
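
To compare the eight English to Uyghur (Arabic script) runs above at a glance, the dev-set `Corpus BLEU` values can be collected and ranked. The sketch below is illustrative only: `extract_bleu` is a hypothetical helper that assumes the `Corpus BLEU: <number>` line format seen in the `test.py` output above, and the dictionary holds the dev-set scores copied from those outputs, keyed by (dropout, number of layers).

```python
import re

# Matches lines such as "Corpus BLEU: 9.934066426510482".
BLEU_PATTERN = re.compile(r"Corpus BLEU:\s*([0-9]+(?:\.[0-9]+)?)")


def extract_bleu(output_text):
    """Hypothetical helper: return every Corpus BLEU value in the text, in order."""
    return [float(value) for value in BLEU_PATTERN.findall(output_text)]


# Dev-set BLEU copied from the evaluation outputs above,
# keyed by (dropout_p, num_layers).
dev_bleu = {
    (0.1, 1): 9.934066426510482,
    (0.1, 2): 9.331117293974678,
    (0.2, 1): 11.46000380964711,
    (0.2, 2): 9.46359488316955,
    (0.3, 1): 9.588876885842405,
    (0.3, 2): 9.879672179719533,
    (0.5, 1): 9.491247997539508,
    (0.5, 2): 9.539760433309619,
}

# Rank configurations from highest to lowest dev BLEU.
ranked = sorted(dev_bleu.items(), key=lambda item: item[1], reverse=True)
best_config, best_score = ranked[0]

for (dropout_p, num_layers), score in ranked:
    print(f"dropout={dropout_p} layers={num_layers} dev BLEU={score:.2f}")
```

On these dev scores, the single-layer model with dropout 0.2 ranks first; the test-set scores should be read from the second `test.py` call in each evaluation cell.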

#### Hyperparameter Search for English to Uyghur (Latin Script) Models

In [None]:
# dropout 0.1
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:05:25 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:05:25 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:05:25 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Traffic on this road has been disrupted due to flooding.
Target Language Sentence:  mushu yolda kelkün seweblik qatnash aüzülüp qaldi.
Source Language Sentence:  They most certainly know.
Target Language Sentence:  aularghu choqum bilidu.
Source Language Sentence:  Tatoeba is the most beautiful place in the onli

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_03_05_23/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_05_23/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_05_23/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:07<00:00, 67.88it/s]
Corpus BLEU: 11.46888742464364
{'model_file': 'experiments/05_06_2021_03_05_23/model.bin', 'valid_data': ['../data/tgt-test-new.txt', 

In [None]:
# dropout 0.1
# num layers 2
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.1 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.1, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:09:13 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:09:13 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:09:13 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Do you like apples?
Target Language Sentence:  alma yaxshi köremsen?
Source Language Sentence:  I didn't do it on purpose.
Target Language Sentence:  qesten qilghinim yoq.
Source Language Sentence:  Sorry, I couldn't help it.
Target Language Sentence:  kechürüng, aözümni tutuwalalmidim.
05/06/2021 03:09:13 - INF

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_03_09_13/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_09_13/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_09_13/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:08<00:00, 58.89it/s]
Corpus BLEU: 10.457954146910325
{'model_file': 'experiments/05_06_2021_03_09_13/model.bin', 'valid_data': ['../data/tgt-

In [None]:
# dropout 0.2
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:14:05 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:14:05 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:14:05 - INFO - __main__ - Random samples from training data
Source Language Sentence:  In any case, I've finished writing the article.
Target Language Sentence:  qandaq bolmisun, maqalini yézip boldum.
Source Language Sentence:  It wasn't easy for me to write this letter in French.
Target Language Sentence:  bu xetni fransuzche yézishim aasangha toxtimidi.
Source Language Sentence:  Get me a towel

In [None]:
# Evaluate the trained model on the dev set, then on the test set
!python test.py --model_file experiments/05_06_2021_03_14_05/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_14_05/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_14_05/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:07<00:00, 66.85it/s]
Corpus BLEU: 9.736401927429956
{'model_file': 'experiments/05_06_2021_03_14_05/model.bin', 'valid_data': ['../data/tgt-test-new.txt', 

In [None]:
# dropout 0.2
# num layers 2
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:18:29 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:18:29 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:18:29 - INFO - __main__ - Random samples from training data
Source Language Sentence:  If it should rain, he will not come.
Target Language Sentence:  yamghur yaghsa, u kelmeydu.
Source Language Sentence:  This sanza is really good!
Target Language Sentence:  bu sangza bek aoxshaptu!
Source Language Sentence:  This is a really beautiful city!
Target Language Sentence:  bu xoymu chirayliq sheher ai

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_03_18_29/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_18_29/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_18_29/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:08<00:00, 59.68it/s]
Corpus BLEU: 11.653402816742787
{'model_file': 'experiments/05_06_2021_03_18_29/model.bin', 'valid_data': ['../data/tgt-

In [None]:
# dropout 0.3
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:23:07 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:23:07 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:23:07 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Hurry up, or you'll be late for school.
Target Language Sentence:  téz bol, bolmisa derske kéchikip qalisen.
Source Language Sentence:  Did you see my daughter?
Target Language Sentence:  qizimni kördingizmu?
Source Language Sentence:  He went to the shop.
Target Language Sentence:  u dukangha bardi.
05/06/2021 

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_03_23_07/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_23_07/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_23_07/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:07<00:00, 68.06it/s]
Corpus BLEU: 9.662793408486472
{'model_file': 'experiments/05_06_2021_03_23_07/model.bin', 'valid_data': ['../data/tgt-test-new.txt', 

In [None]:
# dropout 0.3
# num layers 2
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.3 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.3, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:30:27 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:30:27 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:30:27 - INFO - __main__ - Random samples from training data
Source Language Sentence:  The bus will arrive shortly. Please wait a bit.
Target Language Sentence:  aaptobus hazirla kélidu, biraaz saqlap turung.
Source Language Sentence:  They believe in Marxism and don't believe in religion.
Target Language Sentence:  aular marksizmgha aishinidu, dingha aishenmeydu.
Source Language Sentence:  I walk

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_03_30_27/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_30_27/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_30_27/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:08<00:00, 59.15it/s]
Corpus BLEU: 9.82567695909928
{'model_file': 'experiments/05_06_2021_03_30_27/model.bin', 'valid_data': ['../data/tgt-te

In [None]:
# dropout 0.5
# num layers 1
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:34:07 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:34:07 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:34:07 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Is this your bicycle?
Target Language Sentence:  bu séning wélisipiting?
Source Language Sentence:  How long did it take you to drive from here to Tokyo?
Target Language Sentence:  bu yerdin tokyogha heydishingizge qanchilik waqit lazim aidi?
Source Language Sentence:  I don't have as much money as you think.
Ta

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_03_34_07/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_34_07/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_34_07/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:07<00:00, 64.80it/s]
Corpus BLEU: 10.025026942104203
{'model_file': 'experiments/05_06_2021_03_34_07/model.bin', 'valid_data': ['../data/tgt-test-new.txt',

In [None]:
# dropout 0.5
# num layers 2
!python train.py --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 2 --bidirectional --dropout_p 0.5 --device cuda

{'train_data': ['../data/tgt-train-new.txt', '../data/src-train-translit-new.txt'], 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 2, 'bidirectional': True, 'dropout_p': 0.5, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 03:37:43 - INFO - __main__ - Total train sentences: 6475
05/06/2021 03:37:43 - INFO - __main__ - Total valid sentences: 809
05/06/2021 03:37:43 - INFO - __main__ - Random samples from training data
Source Language Sentence:  Fashion is not my specialty.
Target Language Sentence:  moda - méning talantim aemes.
Source Language Sentence:  Won't you have some more tea?
Target Language Sentence:  chayni köprek aichmemsiler?
Source Language Sentence:  I didn't pay anything - he treated me.
Target Language Sentence:  men héchqanche pul töl

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_03_37_43/model.bin --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt 
!python test.py --model_file experiments/05_06_2021_03_37_43/model.bin --valid_data ../data/tgt-test-new.txt ../data/src-test-translit-new.txt 

{'model_file': 'experiments/05_06_2021_03_37_43/model.bin', 'valid_data': ['../data/tgt-dev-new.txt', '../data/src-dev-translit-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4344, 256, padding_idx=0)
    (lstm): LSTM(256, 256, num_layers=2, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(6147, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=6147, bias=False)
  )
)
100% 484/484 [00:08<00:00, 57.98it/s]
Corpus BLEU: 9.602090278954162
{'model_file': 'experiments/05_06_2021_03_37_43/model.bin', 'valid_data': ['../data/tgt-t
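
The training cells above differ only in `--dropout_p` and `--num_layers`. As a rough sketch (this is not how the runs were actually launched), the same grid could be generated programmatically instead of being copied by hand. The names `BASE_COMMAND` and `sweep_commands` are hypothetical; the flags and file paths are copied from the commands above, and the flag order differs slightly, which should not matter to an argument parser.

```python
from itertools import product

# Flags shared by every run in the sweep, copied from the training cells above.
BASE_COMMAND = (
    "python train.py"
    " --train_data ../data/tgt-train-new.txt ../data/src-train-translit-new.txt"
    " --valid_data ../data/tgt-dev-new.txt ../data/src-dev-translit-new.txt"
    " --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256"
    " --bidirectional --device cuda"
)


def sweep_commands(dropouts=(0.2, 0.3, 0.5), layer_counts=(1, 2)):
    """Return one train.py command line per (dropout, num_layers) pair."""
    return [
        f"{BASE_COMMAND} --num_layers {layers} --dropout_p {dropout}"
        for dropout, layers in product(dropouts, layer_counts)
    ]


commands = sweep_commands()
for command in commands:
    print(command)
```

In a notebook, each generated string could then be run with `!{command}`, followed by the matching `test.py` call on the resulting experiment directory.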

## Extension: Related Languages

Uyghur is a low-resource language, while Uzbek, its most closely related language, is comparatively higher-resource. We therefore augmented our training data with parallel Uzbek-English data to see whether this improves model performance.

For more information about our extension, see the final project report. 

In [None]:
%cd code

/content/drive/My Drive/CIS 530/Project/Milestone4/code


### Preprocessing for Extension

In [None]:
# append Uzbek data to the train data and write out to new files
def read_lines(path):
    # read a file as a list of lines (each line keeps its trailing newline)
    with open(path, "r") as f:
        return f.readlines()

uz_data = read_lines("../data/src-tatoeba-uz.txt")
en_data = read_lines("../data/tgt-tatoeba-uz.txt")

train_data_ug = read_lines("../data/src-train-new.txt")
train_translit_data_ug = read_lines("../data/src-train-translit-new.txt")
train_data_en = read_lines("../data/tgt-train-new.txt")

# the same Uzbek text is appended to both the Arabic-script and the transliterated source data
train_data_ug.extend(uz_data)
train_translit_data_ug.extend(uz_data)
train_data_en.extend(en_data)

# note: the lines still end in "\n", so joining on "\n" leaves a blank line after each
# sentence; those blank lines are removed by the sed commands in the next cell
with open("../data/src-train-augment.txt", "w") as f:
    f.write("\n".join(train_data_ug))
with open("../data/src-train-translit-augment.txt", "w") as f:
    f.write("\n".join(train_translit_data_ug))
with open("../data/tgt-train-augment.txt", "w") as f:
    f.write("\n".join(train_data_en))

In [None]:
!sed -i '/^$/d' ../data/src-train-augment.txt
!sed -i '/^$/d' ../data/src-train-translit-augment.txt
!sed -i '/^$/d' ../data/tgt-train-augment.txt
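
The `sed` cleanup is needed because `readlines()` keeps each line's trailing newline, so joining on `"\n"` doubles every line break. A minimal sketch of an alternative, using toy data rather than the project files: strip the newlines before joining and check that the two sides of the parallel corpus stay aligned. The helper name `merge_parallel` and the placeholder Uzbek pair are assumptions made for illustration.

```python
def merge_parallel(src_lists, tgt_lists):
    """Concatenate parallel line lists, stripping newlines and checking alignment."""
    src = [line.rstrip("\n") for lines in src_lists for line in lines]
    tgt = [line.rstrip("\n") for lines in tgt_lists for line in lines]
    if len(src) != len(tgt):
        raise ValueError(f"misaligned corpus: {len(src)} source vs {len(tgt)} target lines")
    return src, tgt


# toy stand-ins for what readlines() returns; the Uzbek pair is a placeholder
ug_lines = ["menmu bardim.\n", "men auyghur\n"]
ug_en_lines = ["I also went.\n", "I am an Uyghur.\n"]
uz_lines = ["<uzbek sentence>\n"]
uz_en_lines = ["<english sentence>\n"]

# joining lines that still end in "\n" doubles every line break
naive = "\n".join(ug_lines + uz_lines)

# stripping first gives exactly one sentence per line
src, tgt = merge_parallel([ug_lines, uz_lines], [ug_en_lines, uz_en_lines])
clean = "\n".join(src) + "\n"
```

The length check matters because the training script pairs source and target files line by line, so a single extra or missing line would shift every later sentence pair.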

### Extension Models

#### Uyghur (Arabic Script) to English

In [None]:
# with disjoint dev set 
!python train.py --train_data ../data/src-train-augment.txt ../data/tgt-train-augment.txt --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-augment.txt', '../data/tgt-train-augment.txt'], 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 17:55:02 - INFO - __main__ - Total train sentences: 6969
05/06/2021 17:55:02 - INFO - __main__ - Total valid sentences: 484
05/06/2021 17:55:02 - INFO - __main__ - Random samples from training data
Source Language Sentence:  جۈملىنى سۆزمۇسۆز تەرجىمە قىلىشقا بولمايدۇ.
Target Language Sentence:  You cannot translate the sentence word-for-word.
Source Language Sentence:  ماۋۇنى ياخشى كۆرمەيمەن.
Target Language Sentence:  I don't like this one.
Source Language Sentence:  ھەئە، كېلىڭلار، مەر ھەمەت.
Target Language Sentence:  Yes, please come.
05/0

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_17_55_00/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_17_55_00/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_17_55_00/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7237, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4534, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4534, bias=False)
  )
)
100% 484/484 [00:11<00:00, 43.98it/s]
Corpus BLEU: 11.287928862348833
{'model_file': 'experiments/05_06_2021_17_55_00/model.bin', 'valid_data': ['../data/src-test-new.txt', '../data

In [None]:
# with combined dev set
!python train.py --train_data ../data/src-train-augment.txt ../data/tgt-train-augment.txt --valid_data ../data/src-dev-new1.txt ../data/tgt-dev-new1.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-augment.txt', '../data/tgt-train-augment.txt'], 'valid_data': ['../data/src-dev-new1.txt', '../data/tgt-dev-new1.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 17:57:46 - INFO - __main__ - Total train sentences: 6969
05/06/2021 17:57:46 - INFO - __main__ - Total valid sentences: 809
05/06/2021 17:57:46 - INFO - __main__ - Random samples from training data
Source Language Sentence:  ماڭا ئېيتما.
Target Language Sentence:  Don't tell me.
Source Language Sentence:  ھەئە. بولىمەن. سىز ناكانو ئەپەندىمۇ؟
Target Language Sentence:  Yes, I am. Are you Mr Nakano?
Source Language Sentence:  «pretty» قانداق يازىدۇ؟
Target Language Sentence:  How do you spell "pretty"?
05/06/2021 17:57:46 - INFO - __main__ - 

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_17_57_45/model.bin --valid_data ../data/src-dev-new1.txt ../data/tgt-dev-new1.txt
!python test.py --model_file experiments/05_06_2021_17_57_45/model.bin --valid_data ../data/src-dev-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_17_57_45/model.bin --valid_data ../data/src-test-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_17_57_45/model.bin', 'valid_data': ['../data/src-dev-new1.txt', '../data/tgt-dev-new1.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7237, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4534, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4534, bias=False)
  )
)
100% 809/809 [00:16<00:00, 49.48it/s]
Corpus BLEU: 94.40528322128232
{'model_file': 'experiments/05_06_2021_17_57_45/model.bin', 'valid_data': ['../data/src-dev-new.txt', '../data

#### Uyghur (Latin Script) to English

In [None]:
# with disjoint dev set
!python train.py --train_data ../data/src-train-translit-augment.txt ../data/tgt-train-augment.txt --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-translit-augment.txt', '../data/tgt-train-augment.txt'], 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 18:08:50 - INFO - __main__ - Total train sentences: 6969
05/06/2021 18:08:50 - INFO - __main__ - Total valid sentences: 484
05/06/2021 18:08:50 - INFO - __main__ - Random samples from training data
Source Language Sentence:  menmu bardim.
Target Language Sentence:  I also went.
Source Language Sentence:  men auyghur
Target Language Sentence:  I am an Uyghur.
Source Language Sentence:  yaq. u suni yaxshi körmeydu!
Target Language Sentence:  No. He doesn't like water!
05/06/2021 18:08:50 - INFO - __main__ - Random samples from

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_18_08_49/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_18_08_49/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_18_08_49/model.bin', 'valid_data': ['../data/src-dev-translit-new.txt', '../data/tgt-dev-new.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7174, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4534, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4534, bias=False)
  )
)
100% 484/484 [00:09<00:00, 50.99it/s]
Corpus BLEU: 13.769165368135356
{'model_file': 'experiments/05_06_2021_18_08_49/model.bin', 'valid_data': ['../data/src-test-translit-

In [None]:
# with combined dev set
!python train.py --train_data ../data/src-train-translit-augment.txt ../data/tgt-train-augment.txt --valid_data ../data/src-dev-translit-new1.txt ../data/tgt-dev-new1.txt --n_epochs 100 --batch_size 48 --embedding_dim 256 --hidden_size 256 --num_layers 1 --bidirectional --dropout_p 0.2 --device cuda

{'train_data': ['../data/src-train-translit-augment.txt', '../data/tgt-train-augment.txt'], 'valid_data': ['../data/src-dev-translit-new1.txt', '../data/tgt-dev-new1.txt'], 'n_epochs': 100, 'batch_size': 48, 'embedding_dim': 256, 'hidden_size': 256, 'num_layers': 1, 'bidirectional': True, 'dropout_p': 0.2, 'initial_lr': 0.001, 'uniform_init': 0.0, 'clip_grad': 5.0, 'lr_decay': 0.5, 'patience': 5, 'max_trial': 5, 'device': 'cuda', 'model_name': 'model.bin'}
05/06/2021 18:11:47 - INFO - __main__ - Total train sentences: 6969
05/06/2021 18:11:47 - INFO - __main__ - Total valid sentences: 809
05/06/2021 18:11:47 - INFO - __main__ - Random samples from training data
Source Language Sentence:  bu réstoranni aintayin yaxshi körimen.
Target Language Sentence:  I really like this restaurant.
Source Language Sentence:  u daaim tamiqigha biliq yeydu.
Target Language Sentence:  He often eats fish for dinner.
Source Language Sentence:  yenimu tirishqin!
Target Language Sentence:  Keep working hard!

In [None]:
# evaluate
!python test.py --model_file experiments/05_06_2021_18_11_47/model.bin --valid_data ../data/src-dev-translit-new1.txt ../data/tgt-dev-new1.txt
!python test.py --model_file experiments/05_06_2021_18_11_47/model.bin --valid_data ../data/src-dev-translit-new.txt ../data/tgt-dev-new.txt
!python test.py --model_file experiments/05_06_2021_18_11_47/model.bin --valid_data ../data/src-test-translit-new.txt ../data/tgt-test-new.txt

{'model_file': 'experiments/05_06_2021_18_11_47/model.bin', 'valid_data': ['../data/src-dev-translit-new1.txt', '../data/tgt-dev-new1.txt']}
Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7174, 256, padding_idx=0)
    (lstm): LSTM(256, 256, bidirectional=True)
    (h_projection): Linear(in_features=512, out_features=256, bias=False)
    (c_projection): Linear(in_features=512, out_features=256, bias=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(4534, 256, padding_idx=0)
    (attn_projection): Linear(in_features=512, out_features=256, bias=False)
    (lstm_cell): LSTMCell(512, 256)
    (combined_output_projection): Linear(in_features=768, out_features=256, bias=False)
    (dropout): Dropout(p=0.2, inplace=False)
    (vocab_projection): Linear(in_features=256, out_features=4534, bias=False)
  )
)
100% 809/809 [00:16<00:00, 47.91it/s]
Corpus BLEU: 94.58710310830438
{'model_file': 'experiments/05_06_2021_18_11_47/model.bin', 'valid_data': ['../data/src-dev-translit-

## Translate with Best Models

The code below loads a given model and translates a list of input sentences, either printing the translations or writing them out to a file.

In [None]:
%cd code

### Code to Translate with the Models

In [10]:
import pickle
from vocab import Vocab, Vocabularies
from models import Seq2Seq
from utils import add_start_end_tokens, beam_search
import numpy as np
import torch

In [11]:
# a copy of the transliterate function from preprocessing
word_mappings = {}
char_mappings = {}
word_mappings_reverse = {}
char_mappings_reverse = {}
word_pairs_train = open("../data/train.ug", "r", errors="ignore").readlines()
word_pairs_test = open("../data/test.ug", "r", errors="ignore").readlines()
char_pairs = open("../data/mappings.txt", "r", errors="ignore").readlines()

# each line holds a whitespace-separated pair; keep the first mapping seen in each direction
for pair in word_pairs_train:
    pair = pair.split()
    if len(pair) < 2:  # skip blank or incomplete lines
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
    if pair[1] not in word_mappings_reverse:
        word_mappings_reverse[pair[1]] = pair[0]
for pair in word_pairs_test:
    pair = pair.split()
    if len(pair) < 2:  # skip blank or incomplete lines
        continue
    if pair[0] not in word_mappings:
        word_mappings[pair[0]] = pair[1]
    if pair[1] not in word_mappings_reverse:
        word_mappings_reverse[pair[1]] = pair[0]
for pair in char_pairs:
    pair = pair.split()
    if len(pair) < 2:  # skip blank or incomplete lines
        continue
    if pair[0] not in char_mappings:
        char_mappings[pair[0]] = pair[1]
    if pair[1] not in char_mappings_reverse:
        char_mappings_reverse[pair[1]] = pair[0]

# function that performs transliteration of a single token
def transliterate(token):
    if token in word_mappings:
        return word_mappings[token]
    else:
        tok = ""
        for c in token:
            if c not in char_mappings:
                tok = tok + c
            else:
                tok = tok + char_mappings[c]
        return tok

# function that reverses the transliteration of a single token
def reverse_transliterate(token):
    if token in word_mappings_reverse:
        return word_mappings_reverse[token]
    else:
        tok = ""
        for c in token:
            if c not in char_mappings_reverse:
                tok = tok + c
            else:
                tok = tok + char_mappings_reverse[c]
        return tok
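
To make the lookup order concrete, here is a small self-contained sketch of the same idea: try a whole-token mapping first, then fall back to character-by-character substitution, leaving unmapped characters (such as punctuation) unchanged. The toy mappings below are assumptions chosen for illustration and are not taken from `mappings.txt` or the word-pair files; the function names are hypothetical as well.

```python
# toy character map for three Arabic-script letters (beh, alef, reh)
toy_char_map = {"\u0628": "b", "\u0627": "a", "\u0631": "r"}

# the token spelled beh + alef + reh
token = "\u0628\u0627\u0631"

# toy whole-token entry, upper-cased only so the word-level lookup is visible
toy_word_map = {token: "BAR"}


def transliterate_token(tok, word_map, char_map):
    """Whole-token lookup first, then per-character fallback."""
    if tok in word_map:
        return word_map[tok]
    return "".join(char_map.get(c, c) for c in tok)


def transliterate_sentence(sentence, word_map, char_map):
    """Transliterate each whitespace-separated token of a sentence."""
    return " ".join(transliterate_token(t, word_map, char_map) for t in sentence.split())


whole = transliterate_token(token, toy_word_map, toy_char_map)           # word-level hit
fallback = transliterate_token(token + ".", toy_word_map, toy_char_map)  # character fallback
sentence = transliterate_sentence(token + " " + token + ".", toy_word_map, toy_char_map)
print(whole, fallback, sentence)
```

One consequence of the per-character fallback in `reverse_transliterate` is that it looks up one Latin character at a time, so it can only invert mappings whose Latin side is a single character; any multi-character Latin values would need the word-level table or a longest-match scan.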

In [16]:
# this method will be used to translate sentences using a given model
# expects data inputted as a list of sentences to be translated
def translate(model_path, data, transliterate_input=False, transliterate_output=False, write_to_file=False, output_path=None):

    # set up the model
    model = Seq2Seq.load(model_path)
    model.device = "cpu"

    # get the input data into the right format for translation
    if transliterate_input:
        transliterated_data = []
        for sent in data:
            new_sent = ""
            toks = sent.split()
            for tok in toks:
                new_tok = transliterate(tok)
                new_sent = new_sent + new_tok + " "
            transliterated_data.append(new_sent)
        data = transliterated_data
    data_dup = zip(data, data)
    data, _ = add_start_end_tokens(data_dup)

    # predict using the given model
    hypotheses = beam_search(model, data, beam_size=3, max_decoding_time_step=70)
    top_hypotheses = [" ".join(hyps[0].value) for hyps in hypotheses]

    if transliterate_output:
        transliterated_data = []
        for sent in top_hypotheses:
            new_sent = ""
            toks = sent.split()
            for tok in toks:
                new_tok = reverse_transliterate(tok)
                new_sent = new_sent + new_tok + " "
            transliterated_data.append(new_sent)
        top_hypotheses = transliterated_data

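    # pair each input with its top hypothesis (zip() is a one-shot iterator, so the
    # returned object is already consumed if the pairs were printed below)
    sentences = zip(data, top_hypotheses)
    # if we are supposed to write out to file, write to file here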
    if write_to_file:
        if output_path is not None:
            with open(output_path, "w") as f:
                f.write("\n".join(top_hypotheses))
        else:
            with open("../output/pred.txt", "w") as f:
                f.write("\n".join(top_hypotheses))
    else: # if not, then just print the translated value
        for x, y in sentences:
            print("Input Sentence: " + str(" ".join(x)))
            print("Output Sentence: " + str(y))
    return sentences
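
One detail of the return value is worth knowing: when `write_to_file` is `False`, `translate` loops over `sentences` to print the pairs and then returns that same `zip` object, so a caller who iterates over the result gets nothing. A minimal standalone illustration with toy data (not project code):

```python
inputs = ["menmu bardim."]
outputs = ["I also went."]

pairs = zip(inputs, outputs)
printed = [f"{x} -> {y}" for x, y in pairs]  # the first pass consumes the iterator
leftover = list(pairs)                       # nothing is left for a second pass

reusable = list(zip(inputs, outputs))        # a list can be iterated repeatedly
print(printed, leftover, reusable)
```

If the returned pairs are needed downstream, wrapping the `zip` in `list(...)` inside `translate` would be one way to keep them available.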

### Testing the Translate Function

Here, we test the translate function on an example sentence from the training data, once for each of the four translation directions.

In [None]:
# sanity check - example from the training data below
ug_sentence = ["ھەركىمنىڭ ئارتۇقچىلىقىمۇ، ئاجىزلىقىمۇ بار."]
ug_translit_sentence = ["herkimning aartuqchiliqimu, aajizliqimu bar."]
en_sentence = ["Everyone has strengths and weaknesses."]

# note -- these are not necessarily the best models - this was just to test this functionality here
en2ug_model_path = "experiments/05_06_2021_01_27_14/model.bin"
en2ugtranslit_model_path = "experiments/05_06_2021_03_23_07/model.bin"
ug2en_model_path = "experiments/05_05_2021_03_16_01/model.bin"
ugtranslit2en_model_path = "experiments/05_05_2021_22_59_05/model.bin"

# English to Uyghur Arabic
print("\nEnglish to Uyghur Arabic:")
translate(en2ug_model_path, en_sentence)

# English to Uyghur Latin
print("\nEnglish to Uyghur Latin:")
translate(en2ugtranslit_model_path, en_sentence)

# Uyghur Arabic to English
print("\nUyghur Arabic to English:")
translate(ug2en_model_path, ug_sentence)

# Uyghur Latin to English
print("\nUyghur Latin to English:")
translate(ugtranslit2en_model_path, ug_translit_sentence)

100%|██████████| 1/1 [00:00<00:00, 52.57it/s]


English to Uyghur Arabic:



100%|██████████| 1/1 [00:00<00:00, 46.21it/s]


Input Sentence: Everyone has strengths and weaknesses.
Output Sentence: ھەركىمنىڭ ئارتۇقچىلىقىمۇ، ئاجىزلىقىمۇ بار.

English to Uyghur Latin:
Input Sentence: Everyone has strengths and weaknesses.
Output Sentence: herkimning aartuqchiliqimu, aajizliqimu bar.

Uyghur Arabic to English:


100%|██████████| 1/1 [00:00<00:00, 55.23it/s]
100%|██████████| 1/1 [00:00<00:00, 58.36it/s]

Input Sentence: ھەركىمنىڭ ئارتۇقچىلىقىمۇ، ئاجىزلىقىمۇ بار.
Output Sentence: Everyone has strengths and weaknesses.

Uyghur Latin to English:
Input Sentence: herkimning aartuqchiliqimu, aajizliqimu bar.
Output Sentence: Everyone has strengths and weaknesses.





<zip at 0x7f6bb9ac4af0>

### Writing Predictions to File

Here we write the predictions of our best models out to file.

In [13]:
test_ug = open("../data/src-test-new.txt", "r").readlines()
test_ug_translit = open("../data/src-test-translit-new.txt", "r").readlines()
test_en = open("../data/tgt-test-new.txt", "r").readlines()

In [14]:
# writing the simple baseline predictions
en_path = "../output/test-simple-baseline/en-pred.txt"
ug_path = "../output/test-simple-baseline/ug-pred.txt"
ug_translit_path = "../output/test-simple-baseline/ug-translit-pred.txt"

# same logic as the simple baseline section above: emit one fixed token per source token
preds = []
for sent in test_ug:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "the "
    preds.append(pred_sent)

with open(en_path, "w") as f:
    f.write("\n".join(preds))

preds = []
for sent in test_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "ئۇ" + " "
    preds.append(pred_sent)

with open(ug_path, "w") as f:
    f.write("\n".join(preds))

preds = []
for sent in test_en:
    toks = sent.split()
    pred_sent = ""
    for tok in toks:
        pred_sent = pred_sent + "u" + " "
    preds.append(pred_sent)

with open(ug_translit_path, "w") as f:
    f.write("\n".join(preds))
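The three loops above follow one pattern: emit a fixed token once per whitespace-separated source token. Here is a minimal sketch of that pattern as a single helper. The names `constant_token_baseline` and `write_predictions` are hypothetical and do not appear in the original notebook. Unlike the cell above, this version joins tokens with single spaces, so it leaves no trailing space on each line.

```python
def constant_token_baseline(sentences, token):
    """Return one prediction per sentence: `token` repeated once per source token."""
    return [" ".join(token for _ in sent.split()) for sent in sentences]


def write_predictions(path, preds):
    """Write one prediction per line, encoded as UTF-8 (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(preds))


# Illustrative input only; the notebook itself reads the real test files.
example_en = ["Are you happy?\n", "Answer the question.\n"]
example_preds = constant_token_baseline(example_en, "u")
```

With this helper, each of the three baselines would reduce to one call with a different source list and token (`"the"`, `"ئۇ"`, or `"u"`).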

In [17]:
# writing the published baseline predictions
en2ug_path = "../output/test-published-baseline/en2ug-pred.txt"
en2ug_translit_path = "../output/test-published-baseline/en2ug-translit-pred.txt"
ug2en_path = "../output/test-published-baseline/ug2en-pred.txt"
ug_translit2en_path = "../output/test-published-baseline/ug-translit2en-pred.txt"

en2ug_model = "experiments/05_06_2021_20_41_51/model.bin"
en2ug_translit_model = "experiments/05_06_2021_20_45_02/model.bin"
ug2en_model = "experiments/05_06_2021_20_36_36/model.bin"
ug_translit2en_model = "experiments/05_06_2021_20_39_42/model.bin"

translate(en2ug_model, test_en, write_to_file=True, output_path=en2ug_path)
translate(en2ug_translit_model, test_en, write_to_file=True, output_path=en2ug_translit_path)
translate(ug2en_model, test_ug, write_to_file=True, output_path=ug2en_path)
translate(ug_translit2en_model, test_ug_translit, write_to_file=True, output_path=ug_translit2en_path)

100%|██████████| 571/571 [00:10<00:00, 54.92it/s]
100%|██████████| 571/571 [00:09<00:00, 62.11it/s]
100%|██████████| 571/571 [00:11<00:00, 49.89it/s]
100%|██████████| 571/571 [00:10<00:00, 56.76it/s]


<zip at 0x7fa964532280>

In [19]:
# writing the combined dev extension predictions
en2ug_path = "../output/test-combined-dev/en2ug-pred.txt"
en2ug_translit_path = "../output/test-combined-dev/en2ug-translit-pred.txt"
ug2en_path = "../output/test-combined-dev/ug2en-pred.txt"
ug_translit2en_path = "../output/test-combined-dev/ug-translit2en-pred.txt"

en2ug_model = "experiments/05_06_2021_01_16_29/model.bin"
en2ug_translit_model = "experiments/05_06_2021_03_18_29/model.bin"
ug2en_model = "experiments/05_05_2021_04_12_18/model.bin"
ug_translit2en_model = "experiments/05_05_2021_22_35_53/model.bin"

translate(en2ug_model, test_en, write_to_file=True, output_path=en2ug_path)
translate(en2ug_translit_model, test_en, write_to_file=True, output_path=en2ug_translit_path)
translate(ug2en_model, test_ug, write_to_file=True, output_path=ug2en_path)
translate(ug_translit2en_model, test_ug_translit, write_to_file=True, output_path=ug_translit2en_path)

100%|██████████| 571/571 [00:10<00:00, 55.80it/s]
100%|██████████| 571/571 [00:09<00:00, 57.18it/s]
100%|██████████| 571/571 [00:09<00:00, 60.72it/s]
100%|██████████| 571/571 [00:09<00:00, 61.10it/s]


<zip at 0x7fa9643ef910>

In [18]:
# writing the related languages extension predictions
ug2en_path = "../output/test-related-languages/ug2en-pred.txt"
ug_translit2en_path = "../output/test-related-languages/ug-translit2en-pred.txt"

ug2en_model = "experiments/05_06_2021_17_57_45/model.bin"
ug_translit2en_model = "experiments/05_06_2021_18_11_47/model.bin"

translate(ug2en_model, test_ug, write_to_file=True, output_path=ug2en_path)
translate(ug_translit2en_model, test_ug_translit, write_to_file=True, output_path=ug_translit2en_path)

100%|██████████| 571/571 [00:09<00:00, 58.75it/s]
100%|██████████| 571/571 [00:10<00:00, 56.77it/s]


<zip at 0x7fa9643b7410>

## Error Analysis

The code below performs a basic error analysis of our best models by scoring each test prediction against its gold translation with sentence-level BLEU.

For more information about our error analysis, please refer to the final project report. 
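The cells in this section store scores in a dict keyed by `(pred, gold)`, so duplicate pairs collapse into one entry and the printout is ordered by key rather than by score. Here is a minimal sketch of an alternative that keeps every pair and sorts from worst to best. Both `rank_by_score` and the stand-in scorer `unigram_precision` are hypothetical and not part of the original notebook; any function that takes `(pred, gold)` strings and returns a number could be passed in instead.

```python
from collections import Counter


def unigram_precision(pred, gold):
    """Stand-in scorer: fraction of predicted tokens that also occur in gold (clipped)."""
    pred_counts = Counter(pred.split())
    gold_counts = Counter(gold.split())
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, gold_counts[tok]) for tok, c in pred_counts.items())
    return matched / total


def rank_by_score(preds, golds, score_fn):
    """Return (score, pred, gold) triples sorted from lowest to highest score."""
    triples = [(score_fn(p, g), p.strip(), g.strip()) for p, g in zip(preds, golds)]
    return sorted(triples, key=lambda t: t[0])


# Illustrative pairs taken from the printed output in this section.
preds = ["Are you happy?\n", "Do you have something\n"]
golds = ["Are you happy?\n", "Do you have paper?\n"]
ranked = rank_by_score(preds, golds, unigram_precision)
worst_score, worst_pred, worst_gold = ranked[0]
```

Sorting a list of triples makes the lowest-scoring predictions easy to inspect first, which is usually where an error analysis starts.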

In [5]:
%cd code

/content/drive/My Drive/CIS 530/Project/Milestone4/code


In [7]:
from nltk.translate.bleu_score import sentence_bleu
import pprint

In [6]:
# load the gold test data
with open("../data/src-test-new.txt", "r", encoding="utf-8") as f:
    test_ug = f.readlines()
with open("../data/src-test-translit-new.txt", "r", encoding="utf-8") as f:
    test_ug_translit = f.readlines()
with open("../data/tgt-test-new.txt", "r", encoding="utf-8") as f:
    test_en = f.readlines()

In [8]:
# Uyghur (Arabic script) to English -- combined dev
pred_en = open("../output/test-combined-dev/ug2en-pred.txt", "r").readlines()
pairs = zip(pred_en, test_en)

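# NOTE: pred and gold are raw strings, so sentence_bleu compares
# character n-grams here rather than word n-grams
d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score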

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

# Uyghur (Arabic script) to English -- published baseline
pred_en = open("../output/test-published-baseline/ug2en-pred.txt", "r").readlines()
pairs = zip(pred_en, test_en)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


{   ('Answer the question.\n', 'Answer the question.\n'): 1.0,
    ('Are you happy?\n', 'Are you happy?\n'): 1.0,
    ('Are you studying?\n', 'Are you studying?\n'): 1.0,
    ('Did you call me up last night?\n', 'Did you call me up last night?\n'): 1.0,
    ('Do you have something\n', 'Do you have paper?\n'): 0.5300714512917181,
    ('Everybody is waiting\n', 'Everybody desires happiness.\n'): 0.355012007498269,
    ('Everyone loves that place.\n', "He's a singer that's loved by everyone.\n"): 0.3375865664312471,
    ('For me, here is a little bit\n', 'Hi, is this you? "Yes, this is me."\n'): 0.12294292420905262,
    ('Give me a cigarette.\n', 'Give me a cigarette.\n'): 1.0,
    ('He became in English.\n', "That won't happen.\n"): 0.3735942070746957,
    ('He had a translation of the meeting.\n', 'He borrowed one hundred bucks from me.\n'): 0.3304409240715205,
    ('He has a lot of women.\n', 'There are a lot of roses in this garden.\n'): 0.2605130087716185,
    ('He has gone to London

In [9]:
# Uyghur (Latin script) to English -- combined dev
pred_en = open("../output/test-combined-dev/ug-translit2en-pred.txt", "r").readlines()
pairs = zip(pred_en, test_en)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

# Uyghur (Latin script) to English -- published baseline
pred_en = open("../output/test-published-baseline/ug-translit2en-pred.txt", "r").readlines()
pairs = zip(pred_en, test_en)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


{   ('A father is an American\n', 'Venice is a city on water.\n'): 0.20988938070492225,
    ('Answer the question.\n', 'Answer the question.\n'): 1.0,
    ('Are you an Uighur?\n', 'Mind your own business!\n'): 0.1456661149805101,
    ('Are you from Kashgar\n', 'Get a hold of yourself.\n'): 0.13480590103590676,
    ('Are you happy now?\n', "We'll soon be leaving.\n"): 0.32630496621343463,
    ('Are you happy?\n', 'Are you happy?\n'): 1.0,
    ('Are you in a brothel.\n', "There's a storm coming.\n"): 0.29152009640844334,
    ('Are you studying?\n', 'Are you studying?\n'): 1.0,
    ('Are you studying?\n', 'Everybody desires happiness.\n'): 0.31786527420836924,
    ('Are you studying?\n', 'The dog is dying.\n'): 0.25890790939055336,
    ('Are you studying?\n', "You were hurt, weren't you?\n"): 0.14921996717912747,
    ('Are you very much.\n', 'He graduated from Tokyo University.\n'): 0.13774228579562914,
    ('Are you very much.\n', 'I am ready for death.\n'): 0.26260879452593483,
    ('Ba

In [10]:
# English to Uyghur (Arabic script) -- published baseline (better model)
pred_ug = open("../output/test-published-baseline/en2ug-pred.txt", "r").readlines()
pairs = zip(pred_ug, test_ug)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

# English to Uyghur (Arabic script) -- combined dev
pred_ug = open("../output/test-combined-dev/en2ug-pred.txt", "r").readlines()
pairs = zip(pred_ug, test_ug)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


{   ('ئاللاھ نەدە؟\n', 'ئىشىڭىزنى قىلىڭ!\n'): 0.5789320481820723,
    ('ئىسمىڭىز نېمە؟\n', 'سىلەرنىڭ پىكرىڭلار قانداق؟\n'): 0.2496129366529591,
    ('ئىسمىڭىز نېمە؟\n', 'پىكرىڭلار قانداق؟\n'): 0.41602390756021224,
    ('ئىككىمىز بىر بوتىلاق\n', 'مەرىيە ئەينەكتىن ئۆزىگە قارىدى.\n'): 0.2616556321017192,
    ('ئۇ ئۇ ئۆيدە تۇرىدۇ.\n', 'Токио университетини түгәтти.\u200e\n'): 0.3858295921499355,
    ('ئۇ ئۇ ئۆيدە تۇرىدۇ.\n', 'ئۇ توكيو ئۇنىۋېرسىتېتىنى پۈتتۈرگەن.\n'): 0.17788024112296172,
    ('ئۇ ئۇ ئۇ ماڭا ماڭا كەتتى.\n', 'ئۇ مېنىڭدىن بىر يۈز كوي ئارىيەت ئالدى.\n'): 0.11794213678678349,
    ('ئۇ ئۇ بىر بىر بىر بىر بىر بىر بىر بىر بىر ياخشى كۆرىدۇ.\n', 'ئۇ دوختۇر ئەمەس، ئوقۇتقۇچى.\n'): 0.14323145079400493,
    ('ئۇ ئۇ ماڭا بىر يەرگە\n', 'ئۇ ھەممە ئادەم ياخشى كۆرىدىغان ناخشىچى.\n'): 0.1210908054549626,
    ('ئۇ ئۇ يەرگە ئۇ يەرگە\n', 'يەكشەنبە كۈنى ئۇ راستىنلا سىرتقا چىقامدۇ؟\n'): 0.06500079642737754,
    ('ئۇ قەشقەرگە بارغۇدەك.\n', 'تۈنۈگۈن ھاۋا ئىسسىق ئىدى.\n'): 0.3114852603245108,
    ('ئۇ

In [11]:
# English to Uyghur (Latin script) -- combined dev
pred_ug_translit = open("../output/test-combined-dev/en2ug-translit-pred.txt", "r").readlines()
pairs = zip(pred_ug_translit, test_ug_translit)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

# English to Uyghur (Latin script) -- published baseline
pred_ug_translit = open("../output/test-published-baseline/en2ug-translit-pred.txt", "r").readlines()
pairs = zip(pred_ug_translit, test_ug_translit)

d = {}
for pred, gold in pairs:
    score = sentence_bleu([gold], pred)
    d[(pred, gold)] = score

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(d)

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


{   ('aaailemde balilar qilish adem aötti.\n', 'meydangha nechche yüzligen adem chiqti.\n'): 0.22400561610036246,
    ('aaailemde u nerse yoq.\n', 'billi tenterbiyige bek austa aiken.\n'): 0.37404381511105966,
    ('aachquchni nedin tépip keldingiz?\n', 'aachquchni nedin tépip keldinglar?\n'): 0.8725669468916186,
    ('aadette saaet aaltide aaltide qaytip kélidu.\n', 'aemet méning yatiqimda on saaet aolturdi.\n'): 0.2331150139507996,
    ('aaltunning bahasi chiqti.\n', 'shamal chiqti.\n'): 0.35779249194596663,
    ('aamal bar.\n', 'boran chiqiwatidu.\n'): 0.24270336986255364,
    ('aarimizda qiz yoq.\n', 'bizning aarimizda qiz yoq.\n'): 0.6563555554708402,
    ('aeger sizning aishenchingiz u sizning meyli.\n', 'aeger peqet aöz xilingiz bilen aarilashsingiz, xiyalingiz hergiz kéngeymeydu.\n'): 0.17920699377238014,
    ('aempi aüch kömür bar.\n', 'qizziqraq birnerse aichkim bar.\n'): 0.19062026814563665,
    ('aerkinning yaxshiliqighu bir yaman dangliq aiken.\n', 'aattin aégiz, aittin pe

## Running Evaluation Script on Model Output Files

We already ran and evaluated the simple and published baselines above. As required by the project description, we also packaged the evaluation code from this notebook into a standalone script, `score.py`, which operates on model output files. Here we run it.

For more information on our chosen evaluation metric, please refer to the final project report. 
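`score.py` itself is not shown in this notebook. As a rough illustration of what a corpus-level BLEU computation over a gold file and a prediction file can look like, here is a minimal pure-Python sketch. It is an assumption, not the project's script: the function names, the whitespace tokenization, the lack of smoothing, and the 0 to 100 scaling are all choices made for this example, so its numbers need not match the score reported below.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(golds, preds, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    gold_len = pred_len = 0
    for gold, pred in zip(golds, preds):
        g_toks, p_toks = gold.split(), pred.split()
        gold_len += len(g_toks)
        pred_len += len(p_toks)
        for n in range(1, max_n + 1):
            g_counts, p_counts = ngram_counts(g_toks, n), ngram_counts(p_toks, n)
            # clip each predicted n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, g_counts[ng]) for ng, c in p_counts.items())
            totals[n - 1] += sum(p_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # brevity penalty: penalize predictions shorter than the references
    bp = 1.0 if pred_len > gold_len else math.exp(1 - gold_len / pred_len)
    return 100 * bp * math.exp(log_precision)


def score_files(gold_path, pred_path):
    """Read one sentence per line from each file and return corpus BLEU."""
    with open(gold_path, encoding="utf-8") as g, open(pred_path, encoding="utf-8") as p:
        return corpus_bleu(g.readlines(), p.readlines())
```

Under these assumptions, `score_files("../output/test-gold/tgt-test-new.txt", "../output/test-combined-dev/ug2en-pred.txt")` would mirror the command-line call in the cell below, with the gold file first and the prediction file second.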

In [3]:
%cd code

/content/drive/My Drive/CIS 530/Project/Milestone4/code


In [36]:
# first input is the gold file and second input is the prediction file
!python score.py ../output/test-gold/tgt-test-new.txt ../output/test-combined-dev/ug2en-pred.txt

BLEU Score: 25.15583690087786


## Thank you! :)