# SS 2021 SEMINAR 10 Reinforcement Learning in der Sprachtechnologie
## Supervised-Learning Based Speech Agents

### Announcements

#### Papers

* New video out from Yannic Kilcher about Decision Transformers and their application in Reinforcement Learning: https://www.youtube.com/watch?v=-buULmf7dec

#### Homework
        
* DEADLINE: JUNE 23th

#### Today

* Finishing the Deep-Q-Learning Example

* Starting the Supvervised-Learning Based Speech Agent Example

***

### A) Paper Presentation I: Bianca

***
# Grounding Natural Language Commands to StarCraft II Game States 
## for Narration-Guided Reinforcement Learning

Source:

Waytowich, Nicholas, et al. "Grounding natural language commands to StarCraft II game states for narration-guided reinforcement learning." Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications. Vol. 11006. International Society for Optics and Photonics, 2019.

(https://arxiv.org/pdf/1906.02671.pdf)

## Aim of the paper
* address the problem of reward sparsity
* apply reward shaping using natural language (NL) narration
* ground NL commands to goal-specific states by learning a mutual embedding space

## The Game
StarCraft II is a complex real-time strategy game 

https://www.youtube.com/watch?v=yaqeZ9Snt4E

#### Why is StarCraft II challenging for RL?
* large action spaces
* environment state is only partially observable 
* long time horizons
* sparse game score

here: focus on the StarCraft II BuildMarines mini-game

goal: build as many marines as possible

to do that: sequences of actions are necessary: 
* build workers
* collect resources
* build supply depots
* build barracks
* train marines

## Reward Sparsity
Sparse reward functions have many zero reward values

$+$ they are easier to specify: 
 * 1 for winning the game, 0 for all other positions
 * no expert knowledge is necessary


$-$ they lead to meagre results: 
 * the agent receives little/no meaningful feedback
 * zero reward leads to no changes in the agent's policy
 * the agent takes many random actions and maybe stumbles into a meaningful goal state 
 
### State Space

* 7 mini-map feature layers (size 64x64)
* 13 screen feature layer maps (size 64x64) for a total of 20 2d images (size 64x64)
* 13 non-spatial features with information such as player resources and build queues


### Action Space

* compound action consisting of 
 * action identifier (action to be run)
 * two spatial actions (x and y), represented as two vectors (length 64)
 
## The Mutual Embedding Model (MEM)

* learns a mapping between game states and NL commands
* this way, the agent can assign contextual meaning to its current game state (compare current state with desired goal state)


### State embedding 

* extraction of visual features from the mini-maps and screens (2 layer CNNs)
* non-spatial features passed through a fully connected layer
* concatenation of mini-map, screen and non-spatial features 
* projection into an embedding space of size 256

### Language embedding

* using pretrained word2vec vectors (vocabulary size: 50,000, embedding size: 128)
* for every word in a command, the word2vec embedding was extracted 
* then, a command-level embedding (size: 256) was trained in an LSTM 


### Mutual embedding

* goal: game states and their corresponding language commands are close in the mutual embedding space 
* minimizing the euclidean distance of matching state and language vectors 
* maximizing the euclidean distance of mismatching state and language vectors 


### Dataset generation (for learning the embeddings): 
* a random agent generated pairs of game states and their matching commands using hand-crafted rules (the rules identify the states that satisfy the commands)
* e.g. the command "build a supply depot" is satisfied when a new supply depot is created, this state is then labelled correspondingly 

* dimensions of the dataset:
 * 50,000 matched pairs (10,000 game states each for the 5 commands) 
 * 50,000 mismatched pairs
 * 50,000 null states (not corresponding to any command) --> to distinguish desirable states from other states
 * --> 150,000 samples
 
#### Splitting of the dataset:
* training: 100,000 samples
* validation: 25,000 samples
* testing: 25,000 samples

#### Training:
* Optimizer: Adam
* Learning rate: 0.0005
* Batch size: 32
* Epochs: 20
* Threshold for the maximum euclidean distance for command and state to be associated: 0.5

#### Results:
* training accuracy: 95.61% 
* validation accuracy: 82.35% 
* test accuracy: 80.40%

### Visualization

t-SNE Clustering (t-distributed stochastic neighbor embedding):

* Point clouds = state examples
* circular symbols = projection of the NL command embedding 

--> MEM learned how to distinguish between the possible goal states and how to recognize if its current state
matches the desired state provided by the NL command

### Discussion

* the MEM can ground natural language goals in an agent’s state space 
* NL commands can be used to indicate desired goal states 
* thanks to shared representation, agent can compare current state with desired goal state
* using human guidance via NL for a learning agent means a great potential:
    * is flexible ("build" and "construct" should lead to same goal)
    * sequential policies can be learned without the need of expert knowledge about reward functions 

$+$ clearly structured paper, well written, easy to understand

$+$ using human guidance in an MEM seems to have great potential to avoid reward sparsity and enable better and faster learning



***

### Discussion



***

## C) Practice

 **RL in Python**


### CartPole Environment

Find the environment here: https://gym.openai.com/envs/CartPole-v0/

For solving the CartPole environment, Q-Learning with two different approaches can be applied. 

The first approach would be to **discretize** the state space of the environment and based on this build a Q-Table following the Q-Learning algorithm. 

The second approach would be to **approximate** the Q-table with a neural network, due to the fact that a continous state space would lead to an infinite q-table. 

In the following there are 2 examples shown. 

Example 1 is an implementation of the Q-Learning algorithm using a discretization of the state space of the cartpole environment. 

Example 2 is an implementation of the Q-Learning algorithm that approximates the (infinite) q-table with a neural network. 

**MY TIPP FOR YOU:** If you are new to reinforcement learning and object oriented programming try to discretize your environments and use a simple, straigh forward implementation of the Q-Learning environment. If you are an advanced programmer and familiar to concepts of RL and ML when dealing with a continous state space of your environment use a neural network for approximating the q-table and build it into an agent which fits the openai interfaces. 

#### Discretization Approach

In [1]:
import numpy as np
import gym 
import random 
import time
from abc import ABC, abstractmethod
from IPython.display import clear_output

In [2]:
# get an instance of the frozen lake gym environment
env = gym.make("CartPole-v0")

In [3]:
# get some information from the environment
action_space = env.action_space
print(action_space)
action_space_size = env.action_space.n
print(action_space_size)
state_space = env.observation_space 
print(state_space)
state_space_size = env.observation_space.n
print(state_space_size)

Discrete(2)
2
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)


AttributeError: 'Box' object has no attribute 'n'

In [4]:
?env.env

[0;31mType:[0m        CartPoleEnv
[0;31mString form:[0m <CartPoleEnv<CartPole-v0>>
[0;31mFile:[0m        /mnt/local/anaconda3/envs/pytorch-env/lib/python3.8/site-packages/gym/envs/classic_control/cartpole.py
[0;31mDocstring:[0m  
Description:
    A pole is attached by an un-actuated joint to a cart, which moves along
    a frictionless track. The pendulum starts upright, and the goal is to
    prevent it from falling over by increasing and reducing the cart's
    velocity.

Source:
    This environment corresponds to the version of the cart-pole problem
    described by Barto, Sutton, and Anderson

Observation:
    Type: Box(4)
    Num     Observation               Min                     Max
    0       Cart Position             -4.8                    4.8
    1       Cart Velocity             -Inf                    Inf
    2       Pole Angle                -0.418 rad (-24 deg)    0.418 rad (24 deg)
    3       Pole Angular Velocity     -Inf                    Inf

Actions:
 

In [5]:
# State Discretizer
def discretize_range(lower_bound, upper_bound, num_bins):
    return np.linspace(lower_bound, upper_bound, num_bins + 1)[1:-1]

# Discretize the continuous state space for each of the 4 features.
num_discretization_bins = 7
state_bins = [
    # Cart position.
    discretize_range(-4.8, 4.8, num_discretization_bins),
    # Cart velocity.
    discretize_range(-3.0, 3.0, num_discretization_bins),
    # Pole angle.
    discretize_range(-0.5, 0.5, num_discretization_bins),
    # Tip velocity.
    discretize_range(-2.0, 2.0, num_discretization_bins)
]
print(state_bins)
max_bins = max(len(bin) for bin in state_bins)
print(max_bins)
num_states = (max_bins + 1) ** len(state_bins)
print(num_states)
# our state space therefore has a size of 7x7x7x7 ~2500
state_space_size = num_states

# lets see how that looks like in practice
state = env.reset()
print(state)
print(state[0])

# discretize a given state into our state model, found here: https://www.statology.org/numpy-digitize/
def discretize_value(value, bins):
    return np.digitize(x=value, bins=bins)

def discretize_state(observation):
        # Discretize the observation features and reduce them to a single integer
        #for i, feature in enumerate(observation):
            #getting the bins
            #print(state_bins[i]) 
            # extending to the max number of bins
            #print(state_bins[i] * ((max_bins + 1)))
            # putting it into the power of the decimal place 
            #print(state_bins[i] * ((max_bins + 1)) ** i)
            # this encoding enables us to reach that: first states encodes one potency, second state encodes ten potency, third state encodes hundred potency, ...
            #print(discretize_value(feature, state_bins[i]) * ((max_bins + 1) ** i))
            
        state = sum(
            discretize_value(feature, state_bins[i]) * ((max_bins + 1) ** i)
            for i, feature in enumerate(observation)
        )
        return state
    
test = discretize_state(state)
print(test)

[array([-3.42857143, -2.05714286, -0.68571429,  0.68571429,  2.05714286,
        3.42857143]), array([-2.14285714, -1.28571429, -0.42857143,  0.42857143,  1.28571429,
        2.14285714]), array([-0.35714286, -0.21428571, -0.07142857,  0.07142857,  0.21428571,
        0.35714286]), array([-1.42857143, -0.85714286, -0.28571429,  0.28571429,  0.85714286,
        1.42857143])]
6
2401
[ 0.01706873  0.00587641  0.04778642 -0.0470228 ]
0.017068731149897334
1200


In [6]:
# Q-TABLE
# build our action-value table | Q-TABLE
# as you already know, the q-table looks like this
# state | action_space

q_table = np.zeros((state_space_size, action_space_size))
#print(q_table)

In [7]:
# Training Parameters
num_episodes = 10000
max_steps_per_episode = 100
num_test_episodes = 1     

# q-learning | update parameters
learning_rate = 0.1
discount_rate = 0.99

# exploration-exploitation trade off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

In [8]:
# Q-LEARNING

# collect rewards somewhere to visualize our learning curve
rewards_of_all_episodes = []

# Training Loop
for episode in range(num_episodes):
    # reset/initialize the environment first
    state = env.reset()
    # discretize state
    state = discretize_state(state)
    # set done back to false at the beginning of an episode
    done = False
    # reset our rewards collector | return for the beginning episode
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        # select an action
        # use our exploration exploitation trade off -> do we explore or exploit in this timestep ?
        exploration_rate_threshold = random.uniform(0,1)
        if(exploration_rate_threshold > exploration_rate):
            action = np.argmax(q_table[state, : ])
        else:
            action = env.action_space.sample()
            
        new_state, reward, done, info = env.step(action)
        
        # discretize the observed state
        new_state = discretize_state(new_state)
        
        # Update Q-Table Q(s,a) using the bellman update  
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, : ]))
        
        # update the state to the new state
        state = new_state
        # collect the reward
        rewards_current_episode += reward
        
        if (done == True):
            break
    
    # after we finish an episode, make sure to update the exploration rate
    # decay the exploration rate the longer the time goes on
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    # append our rewards for this episode for learning curve
    rewards_of_all_episodes.append(rewards_current_episode)

In [9]:
# LEARNING STATISTICS 
# for each episode print the stats of the episode
rewards_per_thousand_episodes = np.split(np.array(rewards_of_all_episodes),num_episodes/1000)
count = 1000
print('*****INFO: average reward per thousand episodes: ***** \n')
for reward in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(reward/1000)))
    count += 1000
    
# print our learned q-table
print("\n\n ***** Q-TABLE ***** \n")
print(q_table)

*****INFO: average reward per thousand episodes: ***** 

1000 :  37.593000000000025
2000 :  68.66300000000015
3000 :  72.87700000000011
4000 :  89.27999999999942
5000 :  94.194999999999
6000 :  95.04699999999899
7000 :  94.62699999999899
8000 :  95.81299999999922
9000 :  98.4559999999988
10000 :  92.45599999999955


 ***** Q-TABLE ***** 

[[0. 0.]
 [0. 0.]
 [0. 0.]
 ...
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [11]:
# EVALUATION | TESTING | watching our agent play
for episode in range(num_test_episodes):
    state = env.reset()
    # discretize state
    state = discretize_state(state)
    done = False
    print("INFO:*****EPISODE ", episode+1, "\n\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        env.render()
        time.sleep(0.1)
        
        action = np.argmax(q_table[state, :])
        new_state, reward, done, info = env.step(action)
        # discretize new state
        new_state = discretize_state(new_state)
        
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1: # check reward from environment for correct display
                print("INFO: ***** agent reached the goal. *****")
                time.sleep(3)
            else:
                print("INFO: ***** agent did not reach the goal.")
                time.sleep(3)
            clear_output(wait=True)
            break
        
        state = new_state

env.close()

NoSuchDisplayException: Cannot connect to "None"

***

#### Deep Q Networks

![alt_text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/04/Screenshot-2019-04-16-at-5.46.01-PM.png)

In [42]:
import torch
from torch.autograd import Variable

In [43]:
# get an instance of the frozen lake gym environment
env = gym.make("CartPole-v0")

In [44]:
# get some information from the environment
action_space = env.action_space
print(action_space)
action_space_size = env.action_space.n
print(action_space_size)
state_space = env.observation_space 
print(state_space)
state_space_size = env.observation_space.shape[0]
print(state_space_size)

Discrete(2)
2
Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
4


In [47]:
# Training Parameters
num_episodes = 1000
max_steps_per_episode = 100
num_test_episodes = 1     

# q-learning | update parameters
discount_rate = 0.99

# exploration-exploitation trade off
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# q network parameters
learning_rate = 0.001
hidden_dim = 64

In [48]:
class DeepQNetwork():
    ''' Deep Q Neural Network Class. '''
    def __init__(self, state_dim, action_dim, hidden_dim=64, lr=0.05):
            self.criterion = torch.nn.MSELoss()
            self.model = torch.nn.Sequential(
                            torch.nn.Linear(state_dim, hidden_dim),
                            torch.nn.LeakyReLU(),
                            torch.nn.Linear(hidden_dim, hidden_dim*2),
                            torch.nn.LeakyReLU(),
                            torch.nn.Linear(hidden_dim*2, action_dim)
                    )
            self.optimizer = torch.optim.Adam(self.model.parameters(), lr)

    def update(self, state, y):
            """Update the weights of the network given a training sample. """
            y_pred = self.model(torch.Tensor(state))
            loss = self.criterion(y_pred, Variable(torch.Tensor(y)))
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

    def predict(self, state):
            """ Compute Q values for all actions using the DeepQNetwork """
            with torch.no_grad():
                return self.model(torch.Tensor(state))

In [49]:
# get our deep q network
deep_q_network = DeepQNetwork(state_space_size, action_space_size, hidden_dim, learning_rate)

In [50]:
# Deep Q-LEARNING

# collect rewards somewhere to visualize our learning curve
rewards_of_all_episodes = []

# Training Loop
for episode in range(num_episodes):
    # reset/initialize the environment first
    state = env.reset()
    # set done back to false at the beginning of an episode
    done = False
    # reset our rewards collector | return for the beginning episode
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode):
        # select an action
        # use our exploration exploitation trade off -> do we explore or exploit in this timestep ?
        exploration_rate_threshold = random.uniform(0,1)
        if(exploration_rate_threshold > exploration_rate):
            q_values = deep_q_network.predict(state)
            action = torch.argmax(q_values).item()
        else:
            action = env.action_space.sample()
            
        new_state, reward, done, info = env.step(action)
        
        # Update our Q Network due to the reward we got
        q_values = deep_q_network.predict(state).tolist() 
        # Update network weights using the last step only
        q_values_next = deep_q_network.predict(new_state)
        q_values[action] = reward + discount_rate * torch.max(q_values_next).item() # BELLMANN EQUATION -> APPROXIMATE THE GOLD ACTION -> find the best possible action in every state -> APPROXIMATION
        deep_q_network.update(state, q_values)
        
        # update the state to the new state
        state = new_state
        # collect the reward
        rewards_current_episode += reward
        
        if (done == True):
            q_values[action] = reward
            # Update network weights
            deep_q_network.update(state, q_values)
            break
    
    # after we finish an episode, make sure to update the exploration rate
    # decay the exploration rate the longer the time goes on
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    # append our rewards for this episode for learning curve
    rewards_of_all_episodes.append(rewards_current_episode)

In [51]:
# LEARNING STATISTICS 
# for each episode print the stats of the episode
rewards_per_hundred_episodes = np.split(np.array(rewards_of_all_episodes),num_episodes/100)
count = 100
print('*****INFO: average reward per thousand episodes: ***** \n')
for reward in rewards_per_hundred_episodes:
    print(count, ": ", str(sum(reward/100)))
    count += 100

*****INFO: average reward per thousand episodes: ***** 

100 :  23.05
200 :  28.470000000000006
300 :  32.730000000000004
400 :  41.63
500 :  53.45999999999999
600 :  67.58
700 :  65.26999999999998
800 :  74.92999999999999
900 :  74.44
1000 :  82.31000000000002


In [52]:
# EVALUATION | TESTING | watching our agent play
for episode in range(num_test_episodes):
    state = env.reset()
    done = False
    print("INFO:*****EPISODE ", episode+1, "\n\n\n")
    time.sleep(1)
    
    for step in range(max_steps_per_episode):
        clear_output(wait=True)
        env.render()
        time.sleep(0.1)
        
        q_values = deep_q_network.predict(state)
        action = torch.argmax(q_values).item()
        new_state, reward, done, info = env.step(action)
        
        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1: # check reward from environment for correct display
                print("INFO: ***** agent reached the goal. *****")
                time.sleep(3)
            else:
                print("INFO: ***** agent did not reach the goal.")
                time.sleep(3)
            clear_output(wait=True)
            break
        
        state = new_state

env.close()

NoSuchDisplayException: Cannot connect to "None"

***

## Supervised Learning Based Speech Agents

A good way to start with supervised learning based speech agents are seq2seq chatbots. 

The general idea behind these bots is the following: 

![alt_text](https://miro.medium.com/max/860/1*vGhoOtfPSuv3gEEWvTP47g.png)

![alt_text](https://ftp.slidegeeks.com/pics/dgm/l/3/3d_chain_sequence_diagram_illustrating_4_steps_make_flowchart_powerpoint_templates_1.jpg)

***

![alt_text](https://miro.medium.com/max/1928/1*CkeGXClZ5Xs0MhBc7xFqSA.png)

You can find an implementation using RNNs here: https://github.com/praeclarumjj3/Chatbot-with-Pytorch

***

![alt_text](https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png)

***

My implementation is based on a transformer model. 

Nevertheless, you can also plug-in an RNN based approach and play around with that. 

Two repositories I considered and I can recommend are: https://github.com/fawazsammani/chatbot-transformer or https://github.com/jfriedson/Seq2Seq-Chatbot

***

There is detailed tutorial on that from pytorch themsleves here: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html or https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/chatbot_tutorial.ipynb

For using pre-trained transformer models you can find a some interesting ideas on that here: https://www.thepythoncode.com/article/conversational-ai-chatbot-with-huggingface-transformers-in-python




In [53]:
# Dataset

# We use the Cornell Movie Dialogue Corpus, which is a question-answer dataset having movies as the topic. 
# A description of the corpus can be found here: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
# You can find it here: http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

# Pre-processing
from collections import Counter
import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset
import torch.utils.data
import math
import torch.nn.functional as F

corpus_movie_conv = 'data/movie_conversations.txt'
corpus_movie_lines = 'data/movie_lines.txt'
max_len = 25

# load conversations
with open(corpus_movie_conv, 'r') as c:
    conv = c.readlines()

# load movie lines
with open(corpus_movie_lines, 'r') as l:
    lines = l.readlines()
    
# split and filter
lines_dic = {}
for line in lines:
    objects = line.split(" +++$+++ ")
    lines_dic[objects[0]] = objects[-1]
    
def remove_punc(string):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*_~'''
    no_punct = ""
    for char in string:
        if char not in punctuations:
            no_punct = no_punct + char  # space is also a character
    return no_punct.lower()

pairs = []
for con in conv:
    ids = eval(con.split(" +++$+++ ")[-1])
    for i in range(len(ids)):
        qa_pairs = []
        if i==len(ids)-1:
            break
        first = remove_punc(lines_dic[ids[i]].strip())      
        second = remove_punc(lines_dic[ids[i+1]].strip())
        qa_pairs.append(first.split()[:max_len])
        qa_pairs.append(second.split()[:max_len])
        pairs.append(qa_pairs)

In [54]:
# create vocabulary
word_freq = Counter()
for pair in pairs:
    word_freq.update(pair[0])
    word_freq.update(pair[1])
    
min_word_freq = 5
words = [w for w in word_freq.keys() if word_freq[w] > min_word_freq]
word_map = {k: v + 1 for v, k in enumerate(words)}
word_map['<unk>'] = len(word_map) + 1
word_map['<start>'] = len(word_map) + 1
word_map['<end>'] = len(word_map) + 1
word_map['<pad>'] = 0

print("Total words are: {}".format(len(word_map)))

# write vocabulary to file
with open('data/WORDMAP_corpus.json', 'w') as j:
    json.dump(word_map, j)

Total words are: 18238


In [55]:
# create dialogue pairs for training
def encode_question(words, word_map):
    enc_c = [word_map.get(word, word_map['<unk>']) for word in words] + [word_map['<pad>']] * (max_len - len(words))
    return enc_c

def encode_reply(words, word_map):
    enc_c = [word_map['<start>']] + [word_map.get(word, word_map['<unk>']) for word in words] + \
    [word_map['<end>']] + [word_map['<pad>']] * (max_len - len(words))
    return enc_c

pairs_encoded = []
for pair in pairs:
    qus = encode_question(pair[0], word_map)
    ans = encode_reply(pair[1], word_map)
    pairs_encoded.append([qus, ans])
    
# write pairs to file
with open('data/pairs_encoded.json', 'w') as p:
    json.dump(pairs_encoded, p)

In [56]:
# create pytorch data set class
class Dataset(Dataset):
    def __init__(self):
        self.pairs = json.load(open('data/pairs_encoded.json'))
        self.dataset_size = len(self.pairs)

    def __getitem__(self, i):
        question = torch.LongTensor(self.pairs[i][0])
        reply = torch.LongTensor(self.pairs[i][1])
        return question, reply

    def __len__(self):
        return self.dataset_size

In [57]:
# create pytorch data loader for training
train_loader = torch.utils.data.DataLoader(Dataset(),
                                           batch_size = 100, 
                                           shuffle=True, 
                                           pin_memory=True)

In [58]:
# Model 
# we are using a standard transformer model in this tutorial, BUT you could apply ANY sequence to sequence model here
def create_masks(question, reply_input, reply_target):
    def subsequent_mask(size):
        mask = torch.triu(torch.ones(size, size)).transpose(0, 1).type(dtype=torch.uint8)
        return mask.unsqueeze(0)
    
    question_mask = question!=0
    question_mask = question_mask.to(device)
    question_mask = question_mask.unsqueeze(1).unsqueeze(1)         # (batch_size, 1, 1, max_words)
    reply_input_mask = reply_input!=0
    reply_input_mask = reply_input_mask.unsqueeze(1)  # (batch_size, 1, max_words)
    reply_input_mask = reply_input_mask & subsequent_mask(reply_input.size(-1)).type_as(reply_input_mask.data) 
    reply_input_mask = reply_input_mask.unsqueeze(1) # (batch_size, 1, max_words, max_words)
    reply_target_mask = reply_target!=0              # (batch_size, max_words)
    return question_mask, reply_input_mask, reply_target_mask


class Embeddings(nn.Module):
    """
    Implements embeddings of the words and adds their positional encodings. 
    """
    def __init__(self, vocab_size, d_model, max_len = 50):
        super(Embeddings, self).__init__()
        self.d_model = d_model
        self.dropout = nn.Dropout(0.1)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = self.create_positinal_encoding(max_len, self.d_model)
        self.dropout = nn.Dropout(0.1)
        
    def create_positinal_encoding(self, max_len, d_model):
        pe = torch.zeros(max_len, d_model).to(device)
        for pos in range(max_len):   # for each position of the word
            for i in range(0, d_model, 2):   # for each dimension of the each position
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
        pe = pe.unsqueeze(0)   # include the batch size
        return pe
        
    def forward(self, encoded_words):
        embedding = self.embed(encoded_words) * math.sqrt(self.d_model)
        embedding += self.pe[:, :embedding.size(1)]   # pe will automatically be expanded with the same batch size as encoded_words
        embedding = self.dropout(embedding)
        return embedding
    
class MultiHeadAttention(nn.Module):
    def __init__(self, heads, d_model):
        super(MultiHeadAttention, self).__init__()
        assert d_model % heads == 0
        self.d_k = d_model // heads
        self.heads = heads
        self.dropout = nn.Dropout(0.1)
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.concat = nn.Linear(d_model, d_model)
        
    def forward(self, query, key, value, mask):
        """
        query, key, value of shape: (batch_size, max_len, 512)
        mask of shape: (batch_size, 1, 1, max_words)
        """
        # (batch_size, max_len, 512)
        query = self.query(query)
        key = self.key(key)        
        value = self.value(value)   
        # (batch_size, max_len, 512) --> (batch_size, max_len, h, d_k) --> (batch_size, h, max_len, d_k)
        query = query.view(query.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)   
        key = key.view(key.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        value = value.view(value.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)  
        # (batch_size, h, max_len, d_k) matmul (batch_size, h, d_k, max_len) --> (batch_size, h, max_len, max_len)
        scores = torch.matmul(query, key.permute(0,1,3,2)) / math.sqrt(query.size(-1))
        scores = scores.masked_fill(mask == 0, -1e9)    # (batch_size, h, max_len, max_len)
        weights = F.softmax(scores, dim = -1)           # (batch_size, h, max_len, max_len)
        weights = self.dropout(weights)
        # (batch_size, h, max_len, max_len) matmul (batch_size, h, max_len, d_k) --> (batch_size, h, max_len, d_k)
        context = torch.matmul(weights, value)
        # (batch_size, h, max_len, d_k) --> (batch_size, max_len, h, d_k) --> (batch_size, max_len, h * d_k)
        context = context.permute(0,2,1,3).contiguous().view(context.shape[0], -1, self.heads * self.d_k)
        # (batch_size, max_len, h * d_k)
        interacted = self.concat(context)
        return interacted
    
class FeedForward(nn.Module):
    def __init__(self, d_model, middle_dim = 2048):
        super(FeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, middle_dim)
        self.fc2 = nn.Linear(middle_dim, d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(self.dropout(out))
        return out
    

class EncoderLayer(nn.Module):
    def __init__(self, d_model, heads):
        super(EncoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)

    def forward(self, embeddings, mask):
        interacted = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, mask))
        interacted = self.layernorm(interacted + embeddings)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        encoded = self.layernorm(feed_forward_out + interacted)
        return encoded
    
class DecoderLayer(nn.Module):
    def __init__(self, d_model, heads):
        super(DecoderLayer, self).__init__()
        self.layernorm = nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadAttention(heads, d_model)
        self.src_multihead = MultiHeadAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model)
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, embeddings, encoded, src_mask, target_mask):
        query = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, target_mask))
        query = self.layernorm(query + embeddings)
        interacted = self.dropout(self.src_multihead(query, encoded, encoded, src_mask))
        interacted = self.layernorm(interacted + query)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        decoded = self.layernorm(feed_forward_out + interacted)
        return decoded
    
class Transformer(nn.Module):
    def __init__(self, d_model, heads, num_layers, word_map):
        super(Transformer, self).__init__()
        self.d_model = d_model
        self.vocab_size = len(word_map)
        self.embed = Embeddings(self.vocab_size, d_model)
        self.encoder = nn.ModuleList([EncoderLayer(d_model, heads) for _ in range(num_layers)])
        self.decoder = nn.ModuleList([DecoderLayer(d_model, heads) for _ in range(num_layers)])
        self.logit = nn.Linear(d_model, self.vocab_size)
        
    def encode(self, src_words, src_mask):
        src_embeddings = self.embed(src_words)
        for layer in self.encoder:
            src_embeddings = layer(src_embeddings, src_mask)
        return src_embeddings
    
    def decode(self, target_words, target_mask, src_embeddings, src_mask):
        tgt_embeddings = self.embed(target_words)
        for layer in self.decoder:
            tgt_embeddings = layer(tgt_embeddings, src_embeddings, src_mask, target_mask)
        return tgt_embeddings
        
    def forward(self, src_words, src_mask, target_words, target_mask):
        encoded = self.encode(src_words, src_mask)
        decoded = self.decode(target_words, target_mask, encoded, src_mask)
        out = F.log_softmax(self.logit(decoded), dim = 2)
        return out

In [59]:
# Training
# transformer specific warmup for quicker weight adaptation in the first epochs
class AdamWarmup:
    def __init__(self, model_size, warmup_steps, optimizer):
        self.model_size = model_size
        self.warmup_steps = warmup_steps
        self.optimizer = optimizer
        self.current_step = 0
        self.lr = 0
        
    def get_lr(self):
        return self.model_size ** (-0.5) * min(self.current_step ** (-0.5), self.current_step * self.warmup_steps ** (-1.5))
        
    def step(self):
        # Increment the number of steps each time we call the step function
        self.current_step += 1
        lr = self.get_lr()
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        # update the learning rate
        self.lr = lr
        self.optimizer.step()

# Loss function
class LossWithLS(nn.Module):
    def __init__(self, size, smooth):
        super(LossWithLS, self).__init__()
        self.criterion = nn.KLDivLoss(size_average=False, reduce=False)
        self.confidence = 1.0 - smooth
        self.smooth = smooth
        self.size = size
        
    def forward(self, prediction, target, mask):
        """
        prediction of shape: (batch_size, max_words, vocab_size)
        target and mask of shape: (batch_size, max_words)
        """
        prediction = prediction.view(-1, prediction.size(-1))   # (batch_size * max_words, vocab_size)
        target = target.contiguous().view(-1)   # (batch_size * max_words)
        mask = mask.float()
        mask = mask.view(-1)       # (batch_size * max_words)
        labels = prediction.data.clone()
        labels.fill_(self.smooth / (self.size - 1))
        labels.scatter_(1, target.data.unsqueeze(1), self.confidence)
        loss = self.criterion(prediction, labels)    # (batch_size * max_words, vocab_size)
        loss = (loss.sum(1) * mask).sum() / mask.sum()
        return loss
        
# hyperparameters
d_model = 512
heads = 8
num_layers = 3
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
epochs = 1

with open('data/WORDMAP_corpus.json', 'r') as j:
    word_map = json.load(j)
    
transformer = Transformer(d_model = d_model, heads = heads, num_layers = num_layers, word_map = word_map)
transformer = transformer.to(device)
adam_optimizer = torch.optim.Adam(transformer.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
transformer_optimizer = AdamWarmup(model_size = d_model, warmup_steps = 4000, optimizer = adam_optimizer)
criterion = LossWithLS(len(word_map), 0.1)

In [60]:
def train(train_loader, transformer, criterion, epoch):    
    transformer.train()
    sum_loss = 0
    count = 0

    for i, (question, reply) in enumerate(train_loader):
        samples = question.shape[0]
        # Move to device
        question = question.to(device)
        reply = reply.to(device)
        # Prepare Target Data
        reply_input = reply[:, :-1]
        reply_target = reply[:, 1:]
        # Create mask and add dimensions
        question_mask, reply_input_mask, reply_target_mask = create_masks(question, reply_input, reply_target)
        # Get the transformer outputs
        out = transformer(question, question_mask, reply_input, reply_input_mask)
        # Compute the loss
        loss = criterion(out, reply_target, reply_target_mask)
        # Backprop
        transformer_optimizer.optimizer.zero_grad()
        loss.backward()
        transformer_optimizer.step()
        sum_loss += loss.item() * samples
        count += samples
        if i % 100 == 0:
            print("Epoch [{}][{}/{}]\tLoss: {:.3f}".format(epoch, i, len(train_loader), sum_loss/count))
            

def evaluate(transformer, question, question_mask, max_len, word_map):
    """
    Performs Greedy Decoding with a batch size of 1
    """
    rev_word_map = {v: k for k, v in word_map.items()}
    transformer.eval()
    start_token = word_map['<start>']
    encoded = transformer.encode(question, question_mask)
    words = torch.LongTensor([[start_token]]).to(device)
    
    for step in range(max_len - 1):
        size = words.shape[1]
        target_mask = torch.triu(torch.ones(size, size)).transpose(0, 1).type(dtype=torch.uint8)
        target_mask = target_mask.to(device).unsqueeze(0).unsqueeze(0)
        decoded = transformer.decode(words, target_mask, encoded, question_mask)
        predictions = transformer.logit(decoded[:, -1])
        _, next_word = torch.max(predictions, dim = 1)
        next_word = next_word.item()
        if next_word == word_map['<end>']:
            break
        words = torch.cat([words, torch.LongTensor([[next_word]]).to(device)], dim = 1)   # (1,step+2)
        
    # Construct Sentence
    if words.dim() == 2:
        words = words.squeeze(0)
        words = words.tolist()
    sen_idx = [w for w in words if w not in {word_map['<start>']}]
    sentence = ' '.join([rev_word_map[sen_idx[k]] for k in range(len(sen_idx))])
    return sentence

In [40]:
# execute training
for epoch in range(epochs):
    train(train_loader, transformer, criterion, epoch)
    state = {'epoch': epoch, 'transformer': transformer, 'transformer_optimizer': transformer_optimizer}
    torch.save(state, 'data/checkpoint_' + str(epoch) + '.pth.tar')

Epoch [0][0/2217]	Loss: 8.694
Epoch [0][100/2217]	Loss: 7.915
Epoch [0][200/2217]	Loss: 7.217
Epoch [0][300/2217]	Loss: 6.678
Epoch [0][400/2217]	Loss: 6.315
Epoch [0][500/2217]	Loss: 6.061
Epoch [0][600/2217]	Loss: 5.876
Epoch [0][700/2217]	Loss: 5.728
Epoch [0][800/2217]	Loss: 5.609
Epoch [0][900/2217]	Loss: 5.511
Epoch [0][1000/2217]	Loss: 5.429
Epoch [0][1100/2217]	Loss: 5.357
Epoch [0][1200/2217]	Loss: 5.294
Epoch [0][1300/2217]	Loss: 5.240
Epoch [0][1400/2217]	Loss: 5.193
Epoch [0][1500/2217]	Loss: 5.152
Epoch [0][1600/2217]	Loss: 5.113
Epoch [0][1700/2217]	Loss: 5.079
Epoch [0][1800/2217]	Loss: 5.048
Epoch [0][1900/2217]	Loss: 5.020
Epoch [0][2000/2217]	Loss: 4.994
Epoch [0][2100/2217]	Loss: 4.970
Epoch [0][2200/2217]	Loss: 4.947
Epoch [1][0/2217]	Loss: 4.387
Epoch [1][100/2217]	Loss: 4.419
Epoch [1][200/2217]	Loss: 4.417
Epoch [1][300/2217]	Loss: 4.413
Epoch [1][400/2217]	Loss: 4.415
Epoch [1][500/2217]	Loss: 4.415
Epoch [1][600/2217]	Loss: 4.417
Epoch [1][700/2217]	Loss: 4.417

In [61]:
# testing / application of the model in chat
checkpoint = torch.load('data/checkpoint_4.pth.tar')
transformer = checkpoint['transformer']

while(1):
    question = input("Question: ") 
    if question == 'quit':
        break
    max_len = input("Maximum Reply Length: ")
    enc_qus = [word_map.get(word, word_map['<unk>']) for word in question.split()]
    question = torch.LongTensor(enc_qus).to(device).unsqueeze(0)
    question_mask = (question!=0).to(device).unsqueeze(1).unsqueeze(1)  
    sentence = evaluate(transformer, question, question_mask, int(max_len), word_map)
    print(sentence)

Question:  how do you feel ? 
Maximum Reply Length:  20


i dont know


Question:  how is the weather today ? 
Maximum Reply Length:  20


i dont know


Question:  quit


# TODO's

1. Send your finished presentations (+ possibly annotated paper) by **Monday 12.00 AM/midnight** via email to henrik.voigt@uni-jena.de

2. Send your little HOMEWORK to henrik.voigt@uni-jena.de by using the naming convention: HOMEWORK_02_FIRSTNAME_LASTNAME.ipynb until **June 23th 12.00 AM/midnight**

***
