# Seq2Seq Translation (with attention)

The character-level RNN was a great intro to RNNs and PyTorch in general, but it's not really where we'd like to go.

There were a few key (albeit somewhat obvious) problems with it for our application.

  1. We're operating at too low a granularity - the fundamental unit of language is the word, not the character (at least in English)
  2. We have no control over the content we generated beyond the category

Our aim here will be to implement a basic version of the Machine Translation task: converting a sentence from French to English.

In [1]:
# Some Imports!
from __future__ import unicode_literals, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

# this will end up being false for everyone except duke
use_cuda = torch.cuda.is_available()

# Data representation


While representing every letter in a language was feasible, representing every word is far from it. While there are sophisticated ways to solve this problem, for the moment we'll just use one-hot vectors over the few thousand most common words in the language to represent the complete possible vocabulary.


In the char-rnn example we used one-hot vectors to represent each letter in a word. Since we're changing our language unit (from chars to words), we're going to need a new representation.

However, before we can even worry about how we represent each word, we need to be able to uniquely identify each word by some numeric value.

To do this in a clean way (since we now have more than one set of language units: English vs French words), we'll create a class to handle language operations.

In [2]:
SOS_token = 0 # Start Of Sentence
EOS_token = 1 # End   "      "

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.num_words = 2  # SOS and EOS are already in the vocabulary
 
    def add_sentence(self, sentence):
        for word in sentence.split(' '): # split on space char
            self.add_word(word)
            
    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words # next largest number
            self.word2count[word] = 1 # first occurrence
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1
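
To make the word-to-number idea concrete, here is a minimal sketch of how the class could be used. It repeats a trimmed-down copy of `Lang` so that it runs on its own; `sentence_to_indexes` and `one_hot` are hypothetical helpers added only for illustration, and the sentences are toy examples.

```python
# Minimal sketch: a trimmed-down copy of the Lang class above, plus two
# hypothetical helpers that are not part of the original notebook.
SOS_token = 0  # Start Of Sentence
EOS_token = 1  # End Of Sentence


class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.num_words = 2  # SOS and EOS are already counted

    def add_sentence(self, sentence):
        for word in sentence.split(' '):
            if word not in self.word2index:
                self.word2index[word] = self.num_words
                self.index2word[self.num_words] = word
                self.num_words += 1


def sentence_to_indexes(lang, sentence):
    """Hypothetical helper: map each word to its id and append EOS."""
    return [lang.word2index[w] for w in sentence.split(' ')] + [EOS_token]


def one_hot(index, size):
    """Hypothetical helper: a list of zeros with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]


eng = Lang("eng")
eng.add_sentence("he s a freshman .")
eng.add_sentence("he s a teacher .")

indexes = sentence_to_indexes(eng, "he s a teacher .")
print(indexes)  # [2, 3, 4, 7, 6, 1]
print(one_hot(eng.word2index["he"], eng.num_words))  # [0, 0, 1, 0, 0, 0, 0, 0]
```

Ids are handed out in order of first appearance, so repeated words ("he", "s", "a", ".") keep the id they got the first time and only "teacher" receives a new one.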
    
            

## Same data cleaning as last time

In [3]:
# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
def unicode_2_ascii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def clean_string(s):
    s = unicode_2_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
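
A quick way to see what the cleaning does is to run it on a couple of strings. The sketch below repeats the two functions so it runs on its own; the sample sentences are made up for illustration.

```python
import re
import unicodedata


def unicode_2_ascii(s):
    # NFD splits an accented letter into the base letter plus a combining
    # mark (category 'Mn'); dropping the marks leaves plain ASCII letters.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def clean_string(s):
    s = unicode_2_ascii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # any other run becomes one space
    return s


# Made-up sample sentences, for illustration only.
print(clean_string("Je suis étudiant!"))  # je suis etudiant !
print(clean_string("I'm cold."))          # i m cold .
```

Note that apostrophes are replaced by spaces, so contractions such as "I'm" come out as "i m". That is why the English prefixes used for filtering later on are written as "i m ", "he s ", and so on.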

## Data Reading

In [4]:
def read_langs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and clean
    pairs = [[clean_string(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
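
The splitting and reversing steps can be tried without the data file. This is a minimal sketch on two made-up lines, assuming the tab-separated "English, then French" layout that `read_langs('eng', 'fra', ...)` reads; cleaning is left out to keep it short.

```python
# Two made-up lines in the assumed "english<TAB>french" layout.
raw = "I am cold.\tJ'ai froid.\nHe is tall.\tIl est grand.\n"

# Same steps as in read_langs: strip, split into lines, split on tabs.
lines = raw.strip().split('\n')
pairs = [line.split('\t') for line in lines]
print(pairs[0])  # ['I am cold.', "J'ai froid."]

# reverse=True flips each pair so that French becomes the input language.
reversed_pairs = [list(reversed(p)) for p in pairs]
print(reversed_pairs[0])  # ["J'ai froid.", 'I am cold.']
```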

# Data set trimming
Sometimes we are either limited by computation or we only care about a subproblem within a data set - at times like this it's important to intelligently limit the data we operate on.

In the computation space, if you're trying to see if an approach is viable you could just arbitrarily limit yourself to only look at 10k examples or something, but you'd just get a poor model that potentially wouldn't even validate your approach as reasonable.

A smarter trick (and the one we'll apply here) is to cherry pick data that has something in common, and just try to learn the relationship expressed within that subset. In this example we're gonna choose to train a model that knows how to translate sentences of the form "He is..." or "We are...".

In [5]:
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

def filter_pair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)

def filter_pairs(pairs):
    return [p for p in pairs if filter_pair(p)]
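
Here is a small sketch of the filter in action. It uses a shortened prefix tuple and three made-up, already-cleaned `[french, english]` pairs, so the names and data below are illustrative only.

```python
MAX_LENGTH = 10

# A shortened prefix tuple for illustration; the notebook's tuple is longer.
eng_prefixes = ("i am ", "i m ", "he is", "he s ")


def filter_pair(p):
    # p is [french, english]: both sides must have fewer than MAX_LENGTH
    # tokens and the English side must start with one of the prefixes.
    return (len(p[0].split(' ')) < MAX_LENGTH
            and len(p[1].split(' ')) < MAX_LENGTH
            and p[1].startswith(eng_prefixes))  # startswith accepts a tuple


# Made-up pairs, already in cleaned form.
sample = [
    ["je suis fatigue .", "i m tired ."],
    ["le chat dort .", "the cat is sleeping ."],
    ["il est tres tres tres tres tres tres grand .", "he is very tall ."],
]
kept = [p for p in sample if filter_pair(p)]
print(kept)  # [['je suis fatigue .', 'i m tired .']]
```

The second pair is dropped because its English side does not start with a listed prefix, and the third because its French side has 10 tokens, which is not below `MAX_LENGTH`.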

# Extract Data

In [8]:
def prepare_data(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = read_langs(lang1, lang2, reverse)
    print("Read %s sentences" % len(pairs))
    pairs = filter_pairs(pairs)
    print("Trimed to %s sens" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.add_sentence(pair[0])
        output_lang.add_sentence(pair[1])
    print(input_lang.name, input_lang.num_words)
    print(output_lang.name, output_lang.num_words)
    return input_lang, output_lang, pairs


input_lang, output_lang, pairs = prepare_data('eng', 'fra', True)
print(random.choice(pairs))

Reading lines...
Read 135842 sentences
Trimed to 10853 sens
Counting words...
fra 4489
eng 2925
['c est un bleu .', 'he s a freshman .']


# The Network

## Sequence to Sequence Networks

So we have an idea of what RNNs are from the char-rnn. What's the difference between those and these sequence to sequence models?

A seq2seq network is actually a joint model that combines two recurrent networks in an encoder-decoder architecture...

![alt text](encdec.jpg "Logo Title Text 1")

where both the encoder and the decoder are RNNs.

In such an architecture the encoder network converts a sentence into a "meaning" vector (think semantic embedding) and this vector is then passed to the decoder, which converts the embedding to its translation in the chosen language.

A major advantage of this sort of architecture is that it saves us from having to translate word by word, which would force the output to keep the same word order and sentence length as the input.

## The Encoder

The encoder is an RNN that generates the "meaning" or context vector to feed the decoder by reading each element in the input sequence.

Given the seq "This is a cat" (A B C D) and a goal seq "yeh ek billi hai" (W X Y Z):

![alt text](input_output.jpg "Logo Title Text 1")



the encoder will start with some default initialized hidden state and then update it at each timestep of the input sequence. In the simplest version of the seq2seq model we will just use the final vector produced by the encoder to fully specify the context.


![alt text](simple_enc_dec.jpg "Logo Title Text 2")
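
The "start from a default hidden state, update once per word, keep the last state" loop can be sketched in a few lines of plain Python. This is a toy illustration only: it assumes a plain tanh RNN cell with small random weights, and every name and size in it is made up for the example rather than taken from the encoder built in this notebook.

```python
import math
import random

random.seed(0)
HIDDEN_SIZE = 4  # toy sizes, chosen arbitrarily
VOCAB_SIZE = 6


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]


embedding = rand_matrix(VOCAB_SIZE, HIDDEN_SIZE)  # one vector per word id
W_in = rand_matrix(HIDDEN_SIZE, HIDDEN_SIZE)      # input -> hidden weights
W_hid = rand_matrix(HIDDEN_SIZE, HIDDEN_SIZE)     # hidden -> hidden weights


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x, h):
    # h_new = tanh(W_in x + W_hid h)
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_in, x), matvec(W_hid, h))]


def encode(word_ids):
    hidden = [0.0] * HIDDEN_SIZE  # default initial hidden state
    for idx in word_ids:          # one update per input word
        hidden = rnn_step(embedding[idx], hidden)
    return hidden                 # final hidden state = context vector


context = encode([2, 3, 4, 5, 1])  # toy word ids ending in EOS (id 1)
print(len(context))  # 4
```

Whatever the sentence length, the context vector always has `HIDDEN_SIZE` entries, which is what lets the decoder produce an output of a different length from the input.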

### Encoder Architecture
