# Sequence to Sequence

Sequence-to-sequence (seq2seq) models are built from ordinary RNNs, but they combine two separate RNNs that work together. The first is called the `Encoder` and the second the `Decoder`. The encoder reads the input sequence and compresses it into a final vector known as the `Context Vector`. The decoder then takes this context vector as input and decodes it to generate the required output sequence. This architecture has many applications in NLP and machine learning, such as machine translation, speech recognition, and image captioning.

For this task, we will use the `Multi30k` dataset from the torchtext library, which yields pairs of raw source and target sentences.

In [None]:
import torch.nn as nn
from typing import Iterable, List
from torchtext.datasets import Multi30k
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

## Tokenization and Vocabulary Building

In [None]:
SRC_LANG = 'de'
TGT_LANG = 'en'
specials = {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}

tokens = dict()
vocab = dict()

Create the source and target language tokenizers. Make sure to install the dependencies first.

```
pip install -U torchdata
pip install -U spacy
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

In [None]:
tokens[SRC_LANG] = get_tokenizer('spacy', language='de_core_news_sm')
tokens[TGT_LANG] = get_tokenizer('spacy', language='en_core_web_sm')

In [None]:
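def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    # Position of each language within a (source, target) sentence pair
    language_index = {SRC_LANG: 0, TGT_LANG: 1}

    for data_sample in data_iter:
        # Tokenize only the sentence that belongs to the requested language
        yield tokens[language](data_sample[language_index[language]])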

In [None]:
for lang in [SRC_LANG, TGT_LANG]:
    # Fresh training data iterator for each language
    train_iter = Multi30k(split='train', language_pair=(SRC_LANG, TGT_LANG))
    # Pass the special tokens as a list, in the same order as their indices in `specials`
    vocab[lang] = build_vocab_from_iterator(yield_tokens(train_iter, lang), min_freq=1, specials=list(specials), special_first=True)

Set the `<UNK>` token index (0 here) as the default index. This index is returned when a queried token is not found in the vocabulary. If it is not set, looking up an unknown token raises a `RuntimeError`.

In [None]:
for lang in [SRC_LANG, TGT_LANG]:
    vocab[lang].set_default_index(specials['<UNK>'])
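
To make the `<UNK>` fallback described above concrete, here is a minimal pure-Python sketch of how a tokenized sentence might be turned into indices. It is an illustration only: `toy_vocab` is a hypothetical plain dictionary standing in for `vocab[lang]`, and the helpers `numericalize` and `pad_batch` are assumed names that do not come from torchtext. Wrapping each sentence in `<SOS>`/`<EOS>` and padding with `<PAD>` is a common convention, assumed here rather than taken from the notebook.

```python
from typing import Dict, List

SPECIALS = {'<UNK>': 0, '<PAD>': 1, '<SOS>': 2, '<EOS>': 3}

# Hypothetical toy vocabulary; the real indices come from the Multi30k training data.
toy_vocab: Dict[str, int] = {**SPECIALS, 'a': 4, 'dog': 5, 'runs': 6}


def numericalize(sentence_tokens: List[str], token_to_index: Dict[str, int]) -> List[int]:
    """Map tokens to indices, falling back to <UNK> and wrapping in <SOS>/<EOS>."""
    unk = token_to_index['<UNK>']
    ids = [token_to_index.get(tok, unk) for tok in sentence_tokens]
    return [token_to_index['<SOS>']] + ids + [token_to_index['<EOS>']]


def pad_batch(batch: List[List[int]], pad_idx: int) -> List[List[int]]:
    """Right-pad every sequence with pad_idx up to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]


ids = numericalize(['a', 'zebra', 'runs'], toy_vocab)  # 'zebra' is out of vocabulary
print(ids)  # [2, 4, 0, 6, 3]

batch = pad_batch([ids, numericalize(['dog'], toy_vocab)], SPECIALS['<PAD>'])
print(batch)  # [[2, 4, 0, 6, 3], [2, 5, 3, 1, 1]]
```

With the real `vocab[lang]` object, the default index plays the role of the `.get(tok, unk)` fallback used in this sketch.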

## Defining Seq2seq Model

In [None]:
class Encoder(nn.Module):
    pass

In [None]:
class Decoder(nn.Module):
    pass

In [None]:
class Seq2seq(nn.Module):
    pass
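
The stubs above are left empty. As a rough sketch of the encoder, context vector, and decoder interplay described in the introduction, one possible way to fill them in is shown below. Everything here is an assumption made for illustration: the class names `GRUEncoder`, `GRUDecoder` and `Seq2SeqSketch`, the choice of a single-layer GRU (the referenced paper uses LSTMs), the layer sizes, and the use of teacher forcing at every step.

```python
import torch
import torch.nn as nn


class GRUEncoder(nn.Module):
    """Hypothetical encoder: embeds source ids and returns the final hidden state."""

    def __init__(self, vocab_size: int, emb_dim: int, hid_dim: int, pad_idx: int = 1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # src: (batch, src_len) -> context vector: (1, batch, hid_dim)
        _, context = self.rnn(self.embedding(src))
        return context


class GRUDecoder(nn.Module):
    """Hypothetical decoder: predicts the next token from one input token and a hidden state."""

    def __init__(self, vocab_size: int, emb_dim: int, hid_dim: int, pad_idx: int = 1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token: torch.Tensor, hidden: torch.Tensor):
        # token: (batch,), hidden: (1, batch, hid_dim)
        embedded = self.embedding(token.unsqueeze(1))   # (batch, 1, emb_dim)
        output, hidden = self.rnn(embedded, hidden)     # (batch, 1, hid_dim)
        return self.out(output.squeeze(1)), hidden      # logits: (batch, vocab_size)


class Seq2SeqSketch(nn.Module):
    """Hypothetical wrapper: encode once, then decode step by step with teacher forcing."""

    def __init__(self, encoder: GRUEncoder, decoder: GRUDecoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(src)  # the context vector initialises the decoder state
        step_logits = []
        for t in range(tgt.size(1) - 1):
            # Feed the ground-truth token at step t to predict the token at step t + 1
            logits, hidden = self.decoder(tgt[:, t], hidden)
            step_logits.append(logits)
        return torch.stack(step_logits, dim=1)  # (batch, tgt_len - 1, vocab_size)


# Toy usage with random token ids; the vocabulary sizes are made up for the example.
torch.manual_seed(0)
SRC_VOCAB_SIZE, TGT_VOCAB_SIZE = 20, 25
model = Seq2SeqSketch(GRUEncoder(SRC_VOCAB_SIZE, 8, 16), GRUDecoder(TGT_VOCAB_SIZE, 8, 16))
src = torch.randint(4, SRC_VOCAB_SIZE, (3, 7))  # batch of 3 source sentences, length 7
tgt = torch.randint(4, TGT_VOCAB_SIZE, (3, 5))  # batch of 3 target sentences, length 5
logits = model(src, tgt)
print(logits.shape)  # torch.Size([3, 4, 25])
```

In a real setup the vocabulary sizes would be `len(vocab[SRC_LANG])` and `len(vocab[TGT_LANG])`, and the encoder and decoder hidden sizes must match so that the context vector can serve as the decoder's initial state.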

## Training Seq2seq Model

## References

- [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)
- [Language Translation with nn.Transformer and torchtext](https://pytorch.org/tutorials/beginner/translation_transformer.html)