# Module "Special Applications of Computer Science": AI in Robotics

## Project presentation: Sequence models in PyTorch, using a simple LSTM machine translator as an example

## 1. Model architecture

### Sequence-to-sequence model

<img src="documentation/seq2seq.png" alt="Seq2Seq Models" width=800>

Source: https://towardsdatascience.com/nlp-sequence-to-sequence-networks-part-2-seq2seq-model-encoderdecoder-model-6c22e29fd7e1


### Processing: RNN

<img src="documentation/RNN-unrolled.png" alt="Unrolled RNN" width=600 height=400>

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/


In [4]:
### Framework imports
import torch
from torch import optim
import os
import random

### Custom imports 
from model.model import *
from experiment.train_eval import evaluateInput, GreedySearchDecoder, trainIters, eval_batch, plot_training_results
from global_settings import device, FILENAME, SAVE_DIR, PREPRO_DIR, TRAIN_FILE, TEST_FILE, EXPERIMENT_DIR, LOG_FILE
from model.model import EncoderLSTM, DecoderLSTM
from utils.prepro import read_lines, preprocess_pipeline, load_cleaned_data, save_clean_data
from utils.tokenize import build_vocab, batch2TrainData, indexesFromSentence

from global_settings import DATA_DIR
from utils.utils import split_data, filter_pairs, max_length, plot_grad_flow

In [5]:
### Data cleaning
start_root = "."
exp_contraction = True # don't --> do not
file_to_load = "simple_dataset_praesi.txt"
file_name = "simple_dataset_praesi.pkl"


if os.path.isfile(os.path.join(start_root, PREPRO_DIR,file_name)):
    ##load
    print("File exists. Loading cleaned pairs...")
    pairs = load_cleaned_data(PREPRO_DIR, filename=file_name)
else: 
    print("Preprocessing file...")
    ### read lines from file
    pairs = read_lines(os.path.join(start_root,DATA_DIR),file_to_load)
    ### Preprocess file
    pairs, path = preprocess_pipeline(pairs, file_name, exp_contraction, max_len = 0)

File exists. Loading cleaned pairs...


In [7]:
print(random.choice(pairs))
print("Total pairs in the small dataset:")
print(len(pairs))

train_pairs = pairs
src_sents = [pair[0] for pair in pairs]
trg_sents = [pair[1] for pair in pairs]

max_src_l = max_length(src_sents)
max_trg_l = max_length(trg_sents)

print("Max length in source sentences:", max_src_l)
print("Max length in target sentences:", max_trg_l)

['i fell', 'ich fiel']
Total pairs in the small dataset:
100
Max length in source sentences: [3]
Max length in target sentences: [5]


In [8]:
### Getting src and trg sents
src_sents = [item[0] for item in pairs]
trg_sents = [item[1] for item in pairs]
print(random.choice(src_sents))
print(random.choice(trg_sents))

get out
verkruemele dich


In [23]:
### Creating vocabularies
input_lang = build_vocab(src_sents, "eng")
output_lang = build_vocab(trg_sents, "deu")

print("Total source words:", input_lang.num_words)
print("Total target words:", output_lang.num_words)
print("")
print(input_lang.word2index)
print("")
print(output_lang.word2index)
print("")
print("Example of conversion word > index:")
print("Word {} > Index {}".format('hello', input_lang.word2index.get('hello')))
print("Index {} > Word {}".format(20, input_lang.index2word.get(20)))

Total source words: 55
Total target words: 125

{'<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3, 'hi': 4, 'run': 5, 'wow': 6, 'fire': 7, 'help': 8, 'stop': 9, 'wait': 10, 'go': 11, 'on': 12, 'hello': 13, 'i': 14, 'ran': 15, 'see': 16, 'try': 17, 'won': 18, 'smile': 19, 'cheers': 20, 'freeze': 21, 'got': 22, 'it': 23, 'he': 24, 'hop': 25, 'in': 26, 'hug': 27, 'me': 28, 'fell': 29, 'know': 30, 'lied': 31, 'lost': 32, 'paid': 33, 'swim': 34, 'am': 35, 'ok': 36, 'up': 37, 'no': 38, 'way': 39, 'really': 40, 'thanks': 41, 'why': 42, 'ask': 43, 'tom': 44, 'be': 45, 'cool': 46, 'fair': 47, 'nice': 48, 'beat': 49, 'call': 50, 'come': 51, 'get': 52, 'out': 53, 'away': 54}

{'<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3, 'hallo': 4, 'gruess': 5, 'gott': 6, 'lauf': 7, 'potzdonner': 8, 'donnerwetter': 9, 'feuer': 10, 'hilfe': 11, 'zu': 12, 'huelf': 13, 'stopp': 14, 'warte': 15, 'mach': 16, 'weiter': 17, 'ich': 18, 'rannte': 19, 'verstehe': 20, 'aha': 21, 'probiere': 22, 'es': 23, 'hab': 24, 'gewon
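
The vocabulary output above suggests that four special tokens occupy indices 0 to 3 and that every new word receives the next free index in order of first appearance. The following is a minimal sketch of that idea. `VocabSketch` and `build_vocab_sketch` are hypothetical stand-ins written for illustration and are not the project's actual `build_vocab` implementation.

```python
# Hypothetical re-implementation for illustration only; the project's
# real vocabulary class in utils.tokenize may differ.
PAD_TOKEN, SOS_TOKEN, EOS_TOKEN, UNK_TOKEN = 0, 1, 2, 3


class VocabSketch:
    def __init__(self, name):
        self.name = name
        # Special tokens take the first four indices, as in the printed vocabulary.
        self.word2index = {
            "<PAD>": PAD_TOKEN,
            "<SOS>": SOS_TOKEN,
            "<EOS>": EOS_TOKEN,
            "<UNK>": UNK_TOKEN,
        }
        self.index2word = {i: w for w, i in self.word2index.items()}
        self.num_words = len(self.word2index)

    def add_sentence(self, sentence):
        # Each unseen word gets the next free index.
        for word in sentence.split():
            if word not in self.word2index:
                self.word2index[word] = self.num_words
                self.index2word[self.num_words] = word
                self.num_words += 1


def build_vocab_sketch(sentences, name):
    vocab = VocabSketch(name)
    for sentence in sentences:
        vocab.add_sentence(sentence)
    return vocab


demo_vocab = build_vocab_sketch(["hi", "run", "i ran", "i see"], "eng")
print(demo_vocab.word2index)
print("Total words:", demo_vocab.num_words)
```

Keeping both `word2index` and `index2word` makes the conversion cheap in both directions, which is needed when turning model output back into text.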

In [10]:
### Simple conversion sentence to tensor:
random_pair = train_pairs[40]
print(random_pair)

['i paid', 'ich zahlte']


In [11]:
english_sent = indexesFromSentence(input_lang, random_pair[0])
german_sent = indexesFromSentence(output_lang, random_pair[1])

print(english_sent)
print(german_sent)

[14, 33, 2]
[18, 56, 2]
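
The output above shows each sentence as a list of word indices terminated by the `<EOS>` index 2. A minimal sketch of such a conversion follows. The function name `indexes_from_sentence_sketch` is hypothetical, and mapping unknown words to `<UNK>` is an assumption; the project's `indexesFromSentence` may handle them differently.

```python
EOS_TOKEN, UNK_TOKEN = 2, 3  # indices taken from the vocabulary printed earlier


def indexes_from_sentence_sketch(word2index, sentence):
    # Look up every word, fall back to <UNK> (assumption), then append <EOS>.
    return [word2index.get(word, UNK_TOKEN) for word in sentence.split()] + [EOS_TOKEN]


# Small excerpt of the source vocabulary shown earlier.
toy_word2index = {"i": 14, "paid": 33}

print(indexes_from_sentence_sketch(toy_word2index, "i paid"))   # [14, 33, 2]
print(indexes_from_sentence_sketch(toy_word2index, "i slept"))  # [14, 3, 2]
```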


In [12]:
### No splitting for this short presentation :-)
train_pairs = pairs
mini_batch = 5
batch_pair = [random.choice(train_pairs) for _ in range(mini_batch)]
batch_pair.sort(key=lambda x: len(x[0].split(" ")), reverse=True)
for pair in batch_pair:
    print("Source:", pair[0],"Target:", pair[1])    
    print("Src tensor:", indexesFromSentence(input_lang, pair[0]),"Trg tensor:", indexesFromSentence(output_lang, pair[1]))    

Source: got it Target: verstanden
Src tensor: [22, 23, 2] Trg tensor: [34, 2]
Source: got it Target: einverstanden
Src tensor: [22, 23, 2] Trg tensor: [35, 2]
Source: i see Target: ich verstehe
Src tensor: [14, 16, 2] Trg tensor: [18, 20, 2]
Source: i am Target: ich bin jahre alt
Src tensor: [14, 35, 2] Trg tensor: [18, 49, 58, 59, 2]
Source: really Target: im ernst
Src tensor: [40, 2] Trg tensor: [78, 79, 2]


In [13]:
### Creating a simple batch of 5 sentences --> Shape (seq_len, batch_size)
training_batch = batch2TrainData(input_lang, output_lang, batch_pair)

In [14]:
input_tensor, input_lengths, target_tensor, mask, target_max_len, target_lengths = training_batch

In [15]:
print("Length of source sentences:", input_lengths)

Length of source sentences: tensor([3, 3, 3, 3, 2])


In [16]:
print("Tensorized input:")
print(input_tensor)

Tensorized input:
tensor([[22, 22, 14, 14, 40],
        [23, 23, 16, 35,  2],
        [ 2,  2,  2,  2,  0]])


In [17]:
print("Tensorized output:")
print(target_tensor)

Tensorized output:
tensor([[34, 35, 18, 18, 78],
        [ 2,  2, 20, 49, 79],
        [ 0,  0,  2, 58,  2],
        [ 0,  0,  0, 59,  0],
        [ 0,  0,  0,  2,  0]])
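
The tensors above have shape (seq_len, batch_size): shorter sentences are filled with the `<PAD>` index 0, and the batch is transposed so that each row holds one time step. The sketch below reproduces this layout in plain Python using the source batch from this section. The helper names are hypothetical, and the project's `batch2TrainData` may be implemented differently.

```python
from itertools import zip_longest

PAD_TOKEN = 0


def pad_and_transpose(index_seqs, fill=PAD_TOKEN):
    # zip_longest walks all sequences in parallel and pads the short ones,
    # which yields one row per time step: shape (seq_len, batch_size).
    return [list(step) for step in zip_longest(*index_seqs, fillvalue=fill)]


def padding_mask(padded, pad=PAD_TOKEN):
    # True where a real token is present, False on padding positions.
    return [[token != pad for token in step] for step in padded]


src_batch = [[22, 23, 2], [22, 23, 2], [14, 16, 2], [14, 35, 2], [40, 2]]
padded_src = pad_and_transpose(src_batch)
for step in padded_src:
    print(step)
print(padding_mask(padded_src)[-1])
```

A mask of this kind is typically used to exclude padding positions from the loss, so that `<PAD>` tokens do not contribute to the gradient.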


## 2. Encoding - Decoding Verfahren


Both the encoder and the decoder iterate over the first dimension of the tensor. In other words, the inputs are not processed along the batch_size dimension but as a sequence, one time step at a time along the sequence length, as follows:

In [18]:
### This is what the encoder (or decoder) receives at each time step t:
for i, elem in enumerate(input_tensor):
    print("Timestep:", i)
    print("Input:", elem)
    print("Woerter:", [input_lang.index2word[word.item()] for word in elem])


Timestep: 0
Input: tensor([22, 22, 14, 14, 40])
Woerter: ['got', 'got', 'i', 'i', 'really']
Timestep: 1
Input: tensor([23, 23, 16, 35,  2])
Woerter: ['it', 'it', 'see', 'am', '<EOS>']
Timestep: 2
Input: tensor([2, 2, 2, 2, 0])
Woerter: ['<EOS>', '<EOS>', '<EOS>', '<EOS>', '<PAD>']


In [19]:
### Same in the decoder
### This is what the decoder receives at each time step t:
for i, elem in enumerate(target_tensor):
    print("Timestep:", i)
    print("Input:", elem)
    print("Woerter:", [output_lang.index2word[word.item()] for word in elem])

Timestep: 0
Input: tensor([34, 35, 18, 18, 78])
Woerter: ['verstanden', 'einverstanden', 'ich', 'ich', 'im']
Timestep: 1
Input: tensor([ 2,  2, 20, 49, 79])
Woerter: ['<EOS>', '<EOS>', 'verstehe', 'bin', 'ernst']
Timestep: 2
Input: tensor([ 0,  0,  2, 58,  2])
Woerter: ['<PAD>', '<PAD>', '<EOS>', 'jahre', '<EOS>']
Timestep: 3
Input: tensor([ 0,  0,  0, 59,  0])
Woerter: ['<PAD>', '<PAD>', '<PAD>', 'alt', '<PAD>']
Timestep: 4
Input: tensor([0, 0, 0, 2, 0])
Woerter: ['<PAD>', '<PAD>', '<PAD>', '<EOS>', '<PAD>']


## 3. LSTMs 

<img src="documentation/LSTM3-chain.png" alt="LSTM chain" width=800>

Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Gating mechanism:
- Forget gate
- Input gate
- Computation of the candidate cell state (C candidates)
- Output gate

Computation of the final hidden state
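
The steps listed above can be sketched as a single LSTM time step. To keep the example free of dependencies, it assumes an input size and hidden size of 1, so all weights are scalars; the parameter layout and the names `lstm_step` and `demo_params` are hypothetical. In the real model, `torch.nn.LSTM` performs the same computation with weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for scalar input and scalar hidden state.

    params maps each gate name to a tuple (w, u, b): input weight,
    recurrent weight, and bias (illustrative layout, an assumption).
    """
    def pre_activation(gate):
        w, u, b = params[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre_activation("forget"))           # forget gate
    i = sigmoid(pre_activation("input"))            # input gate
    c_tilde = math.tanh(pre_activation("candidate"))  # candidate cell state
    o = sigmoid(pre_activation("output"))           # output gate

    c = f * c_prev + i * c_tilde   # new cell state
    h = o * math.tanh(c)           # final hidden state
    return h, c


demo_params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, -0.2, 0.0),
    "candidate": (0.3, 0.3, 0.0),
    "output": (0.2, 0.1, 0.0),
}

h, c = 0.0, 0.0
for t, x in enumerate([1.0, -1.0, 0.5]):
    h, c = lstm_step(x, h, c, demo_params)
    print("t =", t, "h =", round(h, 4), "c =", round(c, 4))
```

The forget gate scales the previous cell state and the input gate scales the candidate, so the cell state can carry information across many time steps without being overwritten at each one.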


## 4. Translation

In [24]:
from translate import translate

In [25]:
BEST_EXPERIMENT = "experiment/checkpoints/dry_run_simple_nmt_model_full_158544_teacher_1.0_train_voc_adam_lr-0.001-1/deu.txt/2-2_512-512_100"
SECOND_BEST_EXPERIMENT = ""

In [26]:
translate(start_root=".", path=BEST_EXPERIMENT)

Reading experiment information from: 
Starting translation process...
Source > hi
Translation >  hallo
Source > how are you
Translation >  wie gehts dir
Source > I am fine
Translation >  es geht mir gut
Source > How old are you
Translation >  wie alt sind sie
Source > WHere are you
Translation >  wo sind sie
Source > I live in Germany
Translation >  ich lebe in deutschland
Source > I live near you
Translation >  ich wohne in der naehe
Source > I want to go to the supermarket
Translation >  ich will zum supermarkt gehen
Source > I think you are amazing
Translation >  ich finde du bist toll
Source > I am amazing
Translation >  ich bin unglaublich
Source > I need some holiday
Translation >  ich brauche etwas urlaub
Source > I need some cake
Translation >  ich brauche etwas kuchen
Source > I need sweeties
Translation >  ich brauche politisches
Source > I think you should stop screaming
Translation >  ich finde du sollten aufhoeren zu schreien
Source > i think you should stop writing this p