## [**NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION**](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#nlp-from-scratch-translation-with-a-sequence-to-sequence-network-and-attention)

This is the third and final tutorial in the “NLP From Scratch” series, in which we write our own classes and functions to preprocess the data for our NLP modeling tasks. After completing this tutorial, we hope you’ll go on to learn how torchtext can handle much of this preprocessing for you in the three tutorials immediately following this one.

#### [KEY: > input, = target, < output]

\> il est en train de peindre un tableau .

= he is painting a picture .

< he is painting a picture .

Translations like the one above are made possible by the simple but powerful idea of the sequence to sequence network, in which two recurrent neural networks work together to transform one sequence into another. An encoder network condenses an input sequence into a vector, and a decoder network unfolds that vector into a new sequence.

![](https://pytorch.org/tutorials/_images/seq2seq.png)
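
The encoder/decoder split described above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the model built later in this tutorial: the class names `TinyEncoder` and `TinyDecoder`, the helper `greedy_decode`, and the vocabulary and hidden sizes are all hypothetical, and the networks are untrained, so the output indices are arbitrary.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical encoder: condenses a sequence of word indices into one vector."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) tensor of word indices
        _, hidden = self.gru(self.embedding(tokens))
        return hidden  # (1, batch, hidden_size): the condensed "context" vector


class TinyDecoder(nn.Module):
    """Hypothetical decoder: unfolds the context vector one word at a time."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):
        # token: (batch, 1) tensor holding the previously emitted word index
        output, hidden = self.gru(self.embedding(token), hidden)
        return self.out(output), hidden  # logits: (batch, 1, vocab_size)


def greedy_decode(encoder, decoder, tokens, sos_token=0, eos_token=1, max_length=10):
    """Encode one sentence, then emit the most likely word until EOS or max_length."""
    result = []
    with torch.no_grad():
        hidden = encoder(tokens)
        token = torch.tensor([[sos_token]])
        for _ in range(max_length):
            logits, hidden = decoder(token, hidden)
            token = logits.argmax(dim=-1)  # (1, 1): index of the best-scoring word
            if token.item() == eos_token:
                break
            result.append(token.item())
    return result


# Untrained networks, so the decoded indices are arbitrary
encoder = TinyEncoder(vocab_size=6, hidden_size=8)
decoder = TinyDecoder(vocab_size=7, hidden_size=8)
print(greedy_decode(encoder, decoder, torch.tensor([[2, 3, 4, 1]])))
```

The only thing passed from encoder to decoder here is the final hidden state, which is the single vector the paragraph above refers to.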

To improve upon this model we’ll use an attention mechanism, which lets the decoder learn to focus over a specific range of the input sequence.
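
To make the idea of "focusing" concrete, here is a small sketch of one simple attention variant, dot-product scoring. It is an assumption chosen for brevity and not necessarily the scoring function this tutorial's decoder uses; the function name `dot_product_attention` and the toy tensors are hypothetical.

```python
import torch
import torch.nn.functional as F


def dot_product_attention(decoder_hidden, encoder_outputs):
    """Weight each encoder output by its similarity to the decoder state.

    decoder_hidden:  (hidden_size,)
    encoder_outputs: (seq_len, hidden_size), one row per input word
    """
    scores = encoder_outputs @ decoder_hidden   # (seq_len,) similarity scores
    weights = F.softmax(scores, dim=0)          # non-negative, sums to 1
    context = weights @ encoder_outputs         # (hidden_size,) weighted average
    return context, weights


# Toy values: three input positions, hidden size 2
encoder_outputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decoder_hidden = torch.tensor([0.0, 2.0])
context, weights = dot_product_attention(decoder_hidden, encoder_outputs)
print(weights)  # the second and third positions receive most of the weight
```

The decoder then consumes `context` alongside its own state, so each output step can draw on a different part of the input instead of relying on one fixed vector.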

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import os
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
url = "https://download.pytorch.org/tutorial/data.zip"

In [19]:
home = os.environ['HOME']
data_dir = f"{home}/torch/"
file_dir = data_dir + "data/"
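
The cells shown here define `url` but do not show how the archive is fetched. One way to do it with the standard library is sketched below. The helper name `download_and_extract` and the `archive_name` parameter are hypothetical, and the sketch assumes the archive unpacks into a `data/` folder, as the `file_dir` path above suggests.

```python
import os
import urllib.request
import zipfile


def download_and_extract(url, dest_dir, archive_name="data.zip"):
    """Download a zip archive into dest_dir (if not already there) and unpack it."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, archive_name)
    # Skip the download when the archive is already on disk
    if not os.path.exists(archive_path):
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
    return archive_path


# Usage with the variables defined above (commented out to avoid a network call):
# download_and_extract(url, data_dir)
```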

In [9]:
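# Assumption: data_file is not defined in the cells shown; here it is taken
# to be the English-French pairs file read later on
data_file = file_dir + "eng-fra.txt"
os.path.exists(data_file)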

True

In [10]:
SOS_token = 0
EOS_token = 1

In [11]:
class Lang:
    def __init__(self, name) -> None:
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: 'SOS', 1: 'EOS'}  # indices reserved for SOS_token and EOS_token
        self.n_words = 2  # count SOS and EOS
    
    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
    
    def addSentence(self, sentence):
        for w in sentence.split(' '):
            self.addWord(w)
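
The word-to-index mapping is what eventually turns a sentence into numbers a network can consume. A minimal sketch of that step follows. The helper name `indexes_from_sentence` is hypothetical, and the dictionary is written by hand to mirror what `Lang.addSentence('he is painting')` would build, with new words numbered from 2 because 0 and 1 are reserved.

```python
EOS_token = 1  # same value as the EOS_token defined earlier


def indexes_from_sentence(word2index, sentence):
    """Map each space-separated word to its index and append the EOS marker."""
    return [word2index[word] for word in sentence.split(' ')] + [EOS_token]


# Hand-written stand-in for Lang.word2index after adding 'he is painting'
word2index = {'he': 2, 'is': 3, 'painting': 4}
print(indexes_from_sentence(word2index, 'he is painting'))  # [2, 3, 4, 1]
```

A word missing from the dictionary raises a `KeyError` in this sketch, so sentences should be added to the vocabulary before they are converted.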


In [12]:
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

In [15]:
# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

In [14]:
normalizeString(' Adf56?')

'adf ?'

In [28]:
def realLangs(lang1, lang2, reverse=False):
    # Read the sentence-pair file and split it into lines
    file_name = file_dir + f"{lang1}-{lang2}.txt"
    with open(file_name, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    print(len(lines))

In [29]:
realLangs('eng', 'fra')

135842
