<h1 id="tocheading">Spring 2018 NLP Class Project: Neural Machine Translation</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In [5]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random
import pandas as pd
import spacy
import pdb
import os
import jieba
import numpy as np

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Part 0: Project Overview

The goal of this project is to build a neural machine translation system and to experience first-hand how recent advances have made their way into such systems. Each team will build the following sequence of neural translation systems for two language pairs, __Vietnamese (Vi)→English (En)__ and __Chinese (Zh)→En__ (prepared corpora are provided):

1. Recurrent neural network based encoder-decoder without attention
2. Recurrent neural network based encoder-decoder with attention
3. Replace the recurrent encoder with either a convolutional or a self-attention-based encoder.
4. [Optional] Build a fully self-attention translation system, a multilingual translation system, or both.

## Part 1: Data Upload & Preprocessing

In [10]:
# start of sentence
SOS_token = 0
# end of sentence
EOS_token = 1

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
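
As a quick illustration of how `Lang` grows its vocabulary, the sketch below repeats the class so that it runs on its own and feeds it a single sentence. The sentence and the `demo_lang` name are made up for this example.

```python
# Copy of the Lang class above so this example is self-contained.
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # SOS and EOS occupy indices 0 and 1

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


demo_lang = Lang("en")
demo_lang.addSentence("the cat saw the dog")  # made-up sentence

print(demo_lang.word2index)  # new words receive indices starting at 2
print(demo_lang.word2count)  # repeated words only increase their count
print(demo_lang.n_words)     # vocabulary size including SOS and EOS
```

Because `addSentence` splits on a single space, a sentence with leading or doubled spaces adds the empty string as a vocabulary entry; calling `sentence.split()` without an argument would avoid that.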

In [35]:
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    """About "NFC" and "NFD": 
    
    For each character, there are two normal forms: normal form C 
    and normal form D. Normal form D (NFD) is also known as canonical 
    decomposition, and translates each character into its decomposed form. 
    Normal form C (NFC) first applies a canonical decomposition, then composes 
    pre-combined characters again.
    
    About unicodedata.category: 
    
    Returns the general category assigned to the Unicode character 
    unichr as string."""
    
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

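# Strip diacritics, put a space before each . ! or ?, and collapse every run
# of characters outside [a-zA-Z.!?] into a single space.
# Caveat: this discards all non-Latin characters, so Chinese text is reduced to
# whitespace and Vietnamese "đ" is dropped.
def normalizeString(s):
    s = unicodeToAscii(s.strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s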
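
Because the final regular expression in `normalizeString` keeps only ASCII letters and `.!?`, it is worth checking what happens to Vietnamese and Chinese input. The sketch below reproduces the notebook's behavior under the name `normalize_ascii_only` and compares it with `normalize_keep_unicode`, a hypothetical alternative (not part of the notebook) that keeps letters from any script. The sample sentences are illustrative.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose characters and drop combining marks (category 'Mn').
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalize_ascii_only(s):
    # Same steps as normalizeString in the notebook.
    s = unicode_to_ascii(s.strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


def normalize_keep_unicode(s):
    # Hypothetical alternative: compose characters (NFC) so accented letters
    # stay single code points, then keep any Unicode word character.
    s = unicodedata.normalize('NFC', s.strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^\w.!?]+", " ", s)
    return s.strip()


vi_sentence = "Và Madhav đã nói ?"
zh_sentence = "我们 将 用"

ascii_vi = normalize_ascii_only(vi_sentence)
ascii_zh = normalize_ascii_only(zh_sentence)
kept_vi = normalize_keep_unicode(vi_sentence)
kept_zh = normalize_keep_unicode(zh_sentence)

print(repr(ascii_vi), repr(kept_vi))
print(repr(ascii_zh), repr(kept_zh))
```

Whether diacritics and non-Latin characters should be preserved is a modeling decision; for the Chinese source side, an ASCII-only filter leaves almost nothing to translate from.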

In [44]:
def readLangs(lang1, lang2, reverse=False,
             dataset="train"):
    
    """Takes as input:
    - lang1, lang2: either (vi, en) or (zh, en)
    - dataset: one of ("train", "dev", "test")"""
    print("Reading lines...")

    path_template = "../data/iwslt-%s-%s/%s.tok.%s"
    # Read the lang1 file and split into lines
    with open(path_template % (lang1, lang2, dataset, lang1), encoding="utf-8") as f:
        lang1_lines = f.read().strip().split("\n")
    # Read the lang2 file and split into lines
    with open(path_template % (lang1, lang2, dataset, lang2), encoding="utf-8") as f:
        lang2_lines = f.read().strip().split("\n")
    
    # create sentence pairs (lists of length 2 that consist of string pairs)
    # e.g. for (zh, en):
    #      ["我们 将 用 一些 影片 来讲 讲述 一些 深海 海里 的 故事  ",
    #       "And we &apos;re going to tell you some stories from the sea here in video ."]
    # check if there are the same number of sentences in each set
    assert len(lang1_lines) == len(lang2_lines), (
        "Two languages must have the same number of sentences. "
        "%d sentences were passed for %s. %d sentences were passed for %s."
        % (len(lang1_lines), lang1, len(lang2_lines), lang2))
    # normalize
    lang1_lines = [normalizeString(s) for s in lang1_lines]
    lang2_lines = [normalizeString(s) for s in lang2_lines]
    # construct pairs
    pair_ran = range(len(lang1_lines))
    pairs = [[lang1_lines[i]] + [lang2_lines[i]] for i in pair_ran]
    
#     # Split every line into pairs and normalize
#     pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs

In [45]:
def prepareData(lang1, lang2, reverse=False, dataset="train"):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse, dataset=dataset)
    print("Read %s sentence pairs" % len(pairs))
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs

input_lang, output_lang, pairs = prepareData('vi', 'en', False, dataset="train")
print(random.choice(pairs))

Reading lines...
Read 133317 sentence pairs
Trimmed to 133317 sentence pairs
Counting words...
Counted words:
en 47568
vi 16144
['And Madhav said quot Why do we have to do it that way ?', 'Va Madhav a noi rang Tai sao chung ta phai lam theo cach o ?']
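
The "Trimmed to" line reports the same count as "Read" because `prepareData` applies no filter to the pairs. One common option, shown here purely as an assumption about what trimming could mean, is to drop pairs in which either sentence exceeds a token limit. `MAX_LENGTH`, `filter_pair`, and `filter_pairs` are hypothetical names, and the sample pairs are made up.

```python
MAX_LENGTH = 5  # hypothetical limit, kept small for the demo


def filter_pair(pair, max_length=MAX_LENGTH):
    """Keep a pair only if both sentences have at most max_length tokens."""
    return all(len(sentence.split()) <= max_length for sentence in pair)


def filter_pairs(pairs, max_length=MAX_LENGTH):
    """Return the pairs that pass filter_pair, preserving their order."""
    return [pair for pair in pairs if filter_pair(pair, max_length)]


sample_pairs = [
    ["toi yeu ban", "I love you"],          # 3 and 3 tokens: kept
    ["a b c d e f g", "one two three"],     # 7 tokens on one side: dropped
]
kept = filter_pairs(sample_pairs)

print("Read %s sentence pairs" % len(sample_pairs))
print("Trimmed to %s sentence pairs" % len(kept))
```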


### 1.1 Vietnamese to English

In [53]:
# Format: languagepair_language_dataset
# Train 
vien_vi_train, vien_en_train, vi_en_train_pairs = prepareData('vi', 'en', False, dataset="train")
# Dev 
vien_vi_dev, vien_en_dev, vi_en_dev_pairs = prepareData('vi', 'en', False, dataset="dev")
# Test
vien_vi_test, vien_en_test, vi_en_test_pairs = prepareData('vi', 'en', False, dataset="test")

Reading lines...
Read 133317 sentence pairs
Trimmed to 133317 sentence pairs
Counting words...
Counted words:
vi 16144
en 47568
Reading lines...
Read 1268 sentence pairs
Trimmed to 1268 sentence pairs
Counting words...
Counted words:
vi 1370
en 3816
Reading lines...
Read 1553 sentence pairs
Trimmed to 1553 sentence pairs
Counting words...
Counted words:
vi 1325
en 3619


### 1.2 Chinese to English

In [11]:
# Please find the original tokenizing code provided by Elman Mansimov in the following link:
# 

def tokenize_zh(f_names, f_out_names):
    for f_name, f_out_name in zip(f_names, f_out_names):
        with open(f_name, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        with open(f_out_name, 'w', encoding='utf-8') as tok_lines:
            for i, sentence in enumerate(lines):
                if i > 0 and i % 100 == 0:
                    print(f_name.split('/')[-1], i, len(lines))
                # cut_all=True is jieba's "full mode", which emits every word it
                # finds, so the output can contain overlapping segments.
                tok_lines.write(' '.join(jieba.cut(sentence, cut_all=True)))

def tokenize_en(f_names, f_out_names):
    tokenizer = spacy.load('en_core_web_sm')

    for f_name, f_out_name in zip(f_names, f_out_names):
        with open(f_name, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        with open(f_out_name, 'w', encoding='utf-8') as tok_lines:
            for i, sentence in enumerate(lines):
                if i > 0 and i % 10000 == 0:
                    print(f_name.split('/')[-1], i, len(lines))
                # spaCy yields Token objects; str.join needs their .text strings
                tok_lines.write(' '.join(tok.text for tok in tokenizer(sentence.strip())) + '\n')


if __name__ == "__main__":
    root = '../data/tokens_and_preprocessing_em/pretokenized_data/iwslt-zh-en-processed/'
#     tokenize_zh([os.path.join(root, 'dev.zh'), os.path.join(root, 'test.zh'), os.path.join(root, 'train.zh')],\
#                 [os.path.join(root, 'dev.tok.zh'), os.path.join(root, 'test.tok.zh'), os.path.join(root, 'train.tok.zh')])

    tokenize_en([os.path.join(root, 'dev.en'), os.path.join(root, 'test.en'), os.path.join(root, 'train.en')],\
               [os.path.join(root, 'dev.tok.en'), os.path.join(root, 'test.tok.en'), os.path.join(root, 'train.tok.en')])

In [54]:
# Format: languagepair_language_dataset
# Train 
zhen_zh_train, zhen_en_train, zh_en_train_pairs = prepareData('zh', 'en', False, dataset="train")
# Dev 
zhen_zh_dev, zhen_en_dev, zh_en_dev_pairs = prepareData('zh', 'en', False, dataset="dev")
# Test
zhen_zh_test, zhen_en_test, zh_en_test_pairs = prepareData('zh', 'en', False, dataset="test")

Reading lines...
Read 213376 sentence pairs
Trimmed to 213376 sentence pairs
Counting words...
Counted words:
zh 8006
en 59329
Reading lines...
Read 1261 sentence pairs
Trimmed to 1261 sentence pairs
Counting words...
Counted words:
zh 92
en 3916
Reading lines...
Read 1397 sentence pairs
Trimmed to 1397 sentence pairs
Counting words...
Counted words:
zh 66
en 3423
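
The cells above build a separate `Lang` object for each split, so the same word can receive different indices in train, dev, and test. A model trained on the training indices needs dev and test sentences encoded with the training vocabulary. The sketch below shows one way to do that, mapping unseen words to an unknown-word index. `UNK_token`, `build_vocab`, and `indexes_from_sentence` are hypothetical names that do not appear in the notebook, and the sentences are made up.

```python
SOS_token = 0
EOS_token = 1
UNK_token = 2  # hypothetical extra index, not part of the notebook's Lang class


def build_vocab(sentences):
    """Map each training word to an index, reserving 0-2 for SOS, EOS, UNK."""
    word2index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word2index:
                word2index[word] = len(word2index) + 3
    return word2index


def indexes_from_sentence(word2index, sentence):
    """Encode a sentence with the given vocabulary and append EOS."""
    ids = [word2index.get(word, UNK_token) for word in sentence.split()]
    return ids + [EOS_token]


train_vocab = build_vocab(["we tell stories", "we see the sea"])
seen_ids = indexes_from_sentence(train_vocab, "we tell stories")
unseen_ids = indexes_from_sentence(train_vocab, "we tell jokes")

print(seen_ids)
print(unseen_ids)  # "jokes" is not in the training vocabulary
```

The resulting lists can be wrapped with `torch.tensor(ids, dtype=torch.long, device=device)` before they are passed to a model.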


## Part 2: Model

1. Recurrent neural network based encoder-decoder without attention
2. Recurrent neural network based encoder-decoder with attention
3. Replace the recurrent encoder with either a convolutional or a self-attention-based encoder.

### 2.1 RNN-based Encoder-Decoder without Attention

### 2.2 RNN-based Encoder-Decoder with Attention
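
The project description does not fix an attention variant, so the following is only a minimal sketch that assumes dot-product scoring. At each decoder step, the decoder state is scored against every encoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of the encoder states. Plain Python lists are used for readability; in the notebook this would operate on `torch` tensors. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for a single decoder step."""
    # One score per source position: dot product with the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return context, weights


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three source positions
decoder_state = [2.0, 0.0]
context, weights = dot_product_attention(decoder_state, encoder_states)

print(weights)
print(context)
```

Without attention, the decoder sees only the final encoder state; with attention, it receives a different context vector at every step, weighted toward the source positions most similar to its current state.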

### 2.3 Encoder Replacement with Convolutional or Self-attention-based Encoder

### 2.4 Fully self-attention Translation System

### 2.5 Multilingual Translation System