# HW9: Beam Search Decoding - News Headline Generation

In this exercise, you are going to learn and implement decoding techniques for sequence generation. Usually, the sequence is generated word-by-word from a model. In each step, the model predicts the most likely word based on the words predicted in previous steps (this is called auto-regressive decoding).

As such, how you decide what to predict at each step is very important, because every following step is conditioned on that prediction. There are two main decoding techniques: **Greedy Decoding** and **Beam Search Decoding**. Greedy Decoding immediately chooses the word with the best score at each step, while Beam Search Decoding focuses on the sequence that gives the best score overall.

To complete this exercise, you will need to complete the decoding methods for a text generation model trained on the [New York Times Comments and Headlines dataset](https://www.kaggle.com/aashita/nyt-comments). The model is trained to predict a news headline given a seed text. You do not need to train any model in this exercise, as we provide both the pretrained model and the dictionary.

This homework does not require you to use Google Cloud as the model is quite small (but you can still use it if you want).

#### Don't forget to shut down your instance on Gcloud when you are not using it

## 1. Preparing model and dictionary

In [None]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout, Reshape, Flatten
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.optimizers import Adam
import keras.utils as ku 

# set seeds for reproducibility
from tensorflow import set_random_seed
from numpy.random import seed
set_random_seed(2)
seed(1)

import pandas as pd
import numpy as np
import string, os 

#### Load dictionary
- Index 0 is empty as it is reserved for unknown words
- Index 1 is "eos", end-of-sentence symbol used for indicating the end of generation

In [None]:
index_to_word = {}
word_to_index = {}

with open("word_list.txt", "r") as word_list_file:
  i = 0
  for line in word_list_file:
    line = line.strip()
    index_to_word[i] = line
    word_to_index[line] = i
    i += 1

total_word_count = len(index_to_word)

In [None]:
print("dict size:", len(index_to_word))
# Sample words
for i in range(10):
  print(index_to_word[i])

#### Load pretrained model
- The provided model is built with only a single feedforward hidden layer. 
- The model takes a sequence of indices of the **5 previously generated words to predict the next one**.
- The sequence is padded with zero.
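As a rough illustration of the padding step, the sketch below left-pads a list of indices to a fixed length and keeps only the most recent ones. The helper `left_pad` is hypothetical and written in plain Python; it mimics the default behaviour of Keras `pad_sequences()` (`padding='pre'`, `truncating='pre'`), on the assumption that the provided model was trained with those defaults.

```python
def left_pad(indices, length, pad_value=0):
    """Keep the last `length` indices and pad on the left with pad_value.

    Hypothetical helper for illustration only; in the notebook you would
    normally call keras' pad_sequences() instead.
    """
    kept = list(indices)[-length:]              # drop the oldest words if too long
    return [pad_value] * (length - len(kept)) + kept


short_input = left_pad([7], 5)                  # a one-word seed
long_input = left_pad([1, 2, 3, 4, 5, 6, 7], 5) # more than 5 generated words
print(short_input)
print(long_input)
```

With a one-word seed the model therefore sees four zeros followed by the seed index, and once more than five words exist only the last five are used.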

In [None]:
input_len = 5

In [None]:
model = Sequential()
model.add(Embedding(total_word_count , 50, input_length=5))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(total_word_count, activation='softmax'))
model.summary()
adam = Adam(lr=0.001)
model.compile(optimizer=adam,  loss='categorical_crossentropy', metrics=['categorical_accuracy'])

In [None]:
from keras.models import load_model
model = load_model('model.h5')

## 2. Decoding

First, we write a function for converting a string to a sequence of indices.

In [None]:
def texts_to_sequences(text, word_to_index):
  text = text.strip().split(" ")
  token_list = [word_to_index[x] for x in text]
  return token_list

### Greedy Decoding

Normally, in a sequence generation task, the model continues generating tokens until an end-of-sequence symbol appears or the maximum length is reached. For this task:
- The end-of-sequence symbol is "Eos" and its index is 1
- Use the maximum generation length of 15

In [None]:
eos_token = "Eos"
eos_index = 1
max_gen_length = 15

### TODO 1:
Now, complete the greedy decoding function below. 

In [None]:
def greedy_decode(seed_text, max_gen_length, model, word_to_index, index_to_word, input_len):
  """Greedy decodes with seed text.

  Args:
    seed_text: The seed string to be used as initial input to the model.
    max_gen_length: Maximum length for generation.
                    The decoding process must terminate when this length is reached
    model: Pretrained keras model for prediction.
    word_to_index: The dictionary for converting word to index.
    index_to_word: The dictionary for converting index to word.
    input_len: A number indicating how many previously generated words will be used as 
               inputs for the model.
  
  Your code should do the following:
      1. Convert current_text to a sequence of indices by calling texts_to_sequences()
      2. Pad the sequence with 0s. You might find the pad_sequences() function useful
      3. Predict the next token using the model (by calling model.predict() or model.predict_classes())
         and choose the token with the highest score as output
      4. Convert the predicted index to a word and concatenate it to current_text
      5. Return the text prediction and a list of the probabilities at each step
      
  You do not need to stop early when the end-of-sequence token is generated; you can continue decoding
  until max_gen_length is reached. We can filter the eos token out later.
      
  The index is converted back to text after every step purely for simplicity. 
  When working on a real problem, you should stick with indices until the decoding is done.
  You can always call a function provided by the library, so there is no need to implement this yourself.
  """
  current_text = seed_text
  probs = []  # probability of the chosen token at each step
  for _ in range(max_gen_length):
    ### YOUR CODE HERE
    ### END YOUR CODE
    
    current_text += " " + output_word
  return current_text.title(), probs

Test decoding with seed "the"

Your output must be 'The Alienist Season 1 Episode 2 Darkness Descends Eos Eos In Fall Of Spies For Us'

In [None]:
# Test decoding with seed "the"
greedy_decode("the", max_gen_length, model, word_to_index, index_to_word, input_len)

You may notice that there are several end-of-sequence tokens in your output sequence. 
### TODO 2:
Complete the following function to clean your output and decode with provided seed texts

In [None]:
def clean_output(text, eos_token):
  """Drop eos_token and every word that follows"""
  text = ""
  pass
  return text

In [None]:
sample_seeds = ["to", "america", "people", "next", "picture", "on", "usa"]
for seed in sample_seeds:
  pass

Your output should be
- To Hell With 1979 
- America Writer And A Laugher 
- People To Work Make Them Healthier 
- Next On The Christie Beat In New Jersey 
- Picture Trump Obstruct Justice 
- On The Whole30 Diet Vowing To Eat Smarter Carbs For More Than 30 Days 
- Usa Gymnastics Still Values Medals More Than Girls 

### Beam Search Decoding

Another well-known decoding method is beam search decoding that focuses more on the overall sequence score.

Instead of greedily choosing the token with the highest score at each step, beam search decoding expands all possible next tokens and keeps the __k__ most likely sequences at each step, where __k__ is a user-specified beam size. A sequence score is also calculated according to a user-specified cal_score() function.
The beam with the highest score after the decoding process is done will be the output.
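To make the expand-and-keep-top-k idea concrete, here is a minimal sketch of a single beam search step on a toy problem. Everything in it is an assumption made for illustration: the probability table `TABLE`, the helper names `expand_and_prune` and `toy_next_probs`, and the use of a plain sum of log-probabilities as the score. It does not use the homework model and leaves out ended beams and length normalization.

```python
import math

# Made-up next-word probabilities keyed by the previous word (not from the real model).
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"x": 0.9, "y": 0.1},
}


def toy_next_probs(tokens):
    """Hypothetical stand-in for model.predict(): look up the last token."""
    return TABLE[tokens[-1]]


def expand_and_prune(beams, next_probs, k):
    """One step: extend every beam by every possible token, then keep the top k.

    beams is a list of (tokens, log_prob) pairs.
    """
    candidates = []
    for tokens, log_prob in beams:
        for token, p in next_probs(tokens).items():
            candidates.append((tokens + [token], log_prob + math.log(p)))
    candidates.sort(key=lambda c: c[1], reverse=True)  # best score first
    return candidates[:k]


def decode(k, steps=2):
    beams = [(["<s>"], 0.0)]
    for _ in range(steps):
        beams = expand_and_prune(beams, toy_next_probs, k)
    return beams


greedy_like = decode(k=1)   # a beam of size 1 behaves like greedy decoding
beam_result = decode(k=2)
print(greedy_like[0][0], math.exp(greedy_like[0][1]))
print(beam_result[0][0], math.exp(beam_result[0][1]))
```

In this toy table, a beam of size 1 commits to "a" because it is the best first word and ends with a sequence probability of about 0.33, while a beam of size 2 keeps "b" alive and finds "b x" with a probability of about 0.36.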

There are a few things that you need to know before implementing a beam search decoder:
- When the eos token is produced, you can stop expanding that beam
- However, the ended beams must be sorted together with the active beams
- Decoding ends when every kept beam has either ended or reached the maximum length, but for this task, you can continue decoding until max_gen_len is reached
- We usually work with probabilities in log scale to avoid numerical underflow. You should use np.log(score) before any calculation
- **As probabilities for some classes will be very small, you must add a very small value to the score before taking the log, e.g., np.log(prob + 0.00000001)**
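The short sketch below illustrates why the log scale matters. The token probabilities are hypothetical, and it uses `math.log` from the standard library instead of `np.log` only to stay dependency-free; the idea is the same.

```python
import math

EPS = 0.00000001  # the small constant suggested above

# Hypothetical per-token probabilities for a long, unlikely sequence.
step_probs = [1e-5] * 80

# Multiplying directly: the true value (1e-400) is smaller than the smallest
# positive float, so the running product collapses to exactly 0.0.
product = 1.0
for p in step_probs:
    product *= p

# Summing logs instead keeps the score finite and comparable.
log_score = sum(math.log(p + EPS) for p in step_probs)

# The epsilon also protects against a probability of exactly zero,
# for which the logarithm is undefined.
log_of_zero_prob = math.log(0.0 + EPS)

print(product, log_score, log_of_zero_prob)
```

Two sequences whose products both underflow to 0.0 can no longer be ranked, whereas their log scores remain distinct finite numbers.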

#### Sequence Score
The naive way to calculate the sequence score is to __multiply all token scores__ together. However, doing so makes the decoder prefer shorter sequences, because the sequence score is multiplied by a value in \[0,1\] for every token in the sequence. Thus, we usually normalize the sequence score by its length, calculating the __geometric mean__ of the token scores instead.
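As a worked sketch, assume the token scores are the per-step probabilities $p_1, \dots, p_n$ of a sequence of length $n$ (this notation is introduced here for illustration). In log scale the product becomes a sum, and the geometric mean becomes an average of logs:

$$\log \prod_{t=1}^{n} p_t = \sum_{t=1}^{n} \log p_t$$

$$\left(\prod_{t=1}^{n} p_t\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{t=1}^{n} \log p_t\right)$$

For example, if every token has probability 0.9, a 2-token sequence has product $0.9^2 = 0.81$ while a 5-token sequence has product $0.9^5 \approx 0.59$, so the unnormalized score favors the shorter one. The geometric mean is 0.9 in both cases, so the normalized score treats them equally.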

### TODO 3:
Complete cal_score() function.
**You should do this in log scale**

In [None]:
def cal_score(score_list, length, normalized=False):
  if normalized:
    pass
  else:
    pass
  return seq_score

Complete beam_search_decode() according to above description.

In [None]:
def beam_search_decode(seed_text, max_gen_len, model, word_to_index, index_to_word, max_sequence_len, beam_size, normalized=False):
  """We will do beam search decoding using seed text in this function.
    
  Output:
    beams: A list of top k beams after the decoding ended, each beam is a list of 
      [seed_text, list of scores, length]

  Your code should do the following:
    1. Loop until max_gen_len is reached.
    2. During each step, loop through each beam and use it to predict the next word.
       If a beam has already ended, continue without expanding it.
    3. Sort all hypotheses according to cal_score().
    4. Keep the top k hypotheses to be used at the next step.
  """
  # For each beam we will store (generated text, list of scores, and current length)
  # Add initial beam
  beams = [[seed_text, [], 0]]
  
  for _ in range(max_gen_len):
    pass

  return beams

### TODO 4 (Coding and Written):
Decode with the provided seed texts with beam_size 5 and max_gen_len 10.
Compare the results between __greedy, normalized, and unnormalized decoding__.

Print a result using greedy decoding and the top 2 results using unnormalized and normalized decoding for each seed text.

Also, print scores of each candidate according to cal_score(). Use normalization for greedy decoding.

In [None]:
sample_seeds = ["to", "america", "people", "next", "picture", "on", "usa"]
for seed in sample_seeds:
  pass


Your outputs should be
```
-Greedy-
To Hell With 1979  0.99
-Unnormalized-
To Hell With 1979  0.98
To Live In A Nation Of Holers  0.00
-Normalized-
To Hell With 1979  0.99
To Live In A Nation Of Holers  0.39

-Greedy-
America Writer And A Laugher  0.52
-Unnormalized-
America Process  0.05
America Bb And Pellet Gun Injuries Pose Serious Risk To Childrens 0.04
-Normalized-
America Bb And Pellet Gun Injuries Pose Serious Risk To Childrens 0.73
America Jacket On Smoking Should Us Even Tougher  0.65

-Greedy-
People To Work Make Them Healthier  0.84
-Unnormalized-
People To Work Make Them Healthier  0.35
People Liberal Democracies Perish  0.03
-Normalized-
People To Work Make Them Healthier  0.84
People To Work Make High Healthier  0.52

-Greedy-
Next On The Christie Beat In New Jersey  0.91
-Unnormalized-
Next On The Christie Beat In New Jersey  0.47
Next On The Whole30 Diet Vowing To Eat Smarter Carbs For 0.33
-Normalized-
Next On The Christie Beat In New Jersey  0.91
Next On The Whole30 Diet Vowing To Eat Smarter Carbs For 0.90

-Greedy-
Picture Trump Obstruct Justice  0.48
-Unnormalized-
Picture Trump Obstruct Justice  0.05
Picture Trump Save American Steel  0.05
-Normalized-
Picture Are Conflict Victims But All Is Not Lost  0.67
Picture To Live In A Nation Of Holers  0.65

-Greedy-
On The Whole30 Diet Vowing To Eat Smarter Carbs For More Than 30 Days  0.90
-Unnormalized-
On Family Farms Little Hands Steer Big Machines  0.46
On The Whole30 Diet Vowing To Eat Smarter Carbs For More 0.25
-Normalized-
On Family Farms Little Hands Steer Big Machines  0.91
On The Whole30 Diet Vowing To Eat Smarter Carbs For More 0.87

-Greedy-
Usa Gymnastics Still Values Medals More Than Girls  1.00
-Unnormalized-
Usa Gymnastics Still Values Medals More Than Girls  0.99
Usa Longterm Cesarean Risks  0.00
-Normalized-
Usa Gymnastics Still Values Medals More Than Girls  1.00
Usa Finds A Match Helping Police Solve An Infamous 1994 Rape 0.49
```

__Q__: From the outputs, what is the effect of using length normalization?

__Ans__:

### Temperature Sampling

Now, you should be able to tell that in greedy decoding, the output will always be the same if you initialize it with the same seed, regardless of how many times you try.

This behaviour provides consistency to the output of your model but, at the same time, limits the ability to explore the output space. For example, you might not want the same news headline every time you start with the word "The".

As such, we will introduce randomness to the model when decoding by using weighted sampling instead of argmax. At every step, we will sample the output using the softmax outputs as the probabilities for each word. One way to implement this is to draw a random number between 0 and 1, then loop through the probabilities of each word while accumulating their sum. When the sum exceeds the sampled number, you select that word as the output at that step.
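The cumulative-sum procedure described above can be sketched as follows. The function name `sample_index` and its `rng` parameter are hypothetical, and the probabilities are made up; the sketch uses only the standard library and assumes the probabilities sum to 1.

```python
import random


def sample_index(probs, rng=random):
    """Sample an index with probability proportional to probs (assumed to sum to 1)."""
    threshold = rng.random()           # uniform number in [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if cumulative > threshold:     # the running sum has passed the random number
            return index
    return len(probs) - 1              # guard against rounding when the sum falls just short of 1


# Draw many samples from a made-up distribution and count how often each index appears.
rng = random.Random(0)                 # fixed seed so the run is repeatable
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_index([0.1, 0.2, 0.7], rng)] += 1
print(counts)
```

Over many draws, each index should appear roughly in proportion to its probability, so the most likely word is still chosen most often but no longer every time.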

However, you might notice that even with the sampling method we just introduced, the output is most likely to be the same as greedy decoding because the probabilities of the words are too different (0.99 vs 0.01). 
Thus, we will use another method called temperature sampling to smooth the probabilities. Before sampling, we rescale each probability by raising it to the power _1/T_, then divide each value by the sum of all values so that they sum to 1 again. 

$$f_T(p_i) = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}} $$

Larger T will make the model more likely to choose unlikely words at each step. If T is close to 0, it will be the same as argmax.
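A small sketch of the rescaling, applied to the 0.99 vs 0.01 example, shows the effect of T. The helper name `apply_temperature` is hypothetical, and plain Python lists are used instead of NumPy arrays for brevity.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to the power 1/T, then renormalize to sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.99, 0.01]
same = apply_temperature(probs, 1.0)   # T = 1 leaves the distribution unchanged
flat = apply_temperature(probs, 5.0)   # large T flattens it (roughly 0.71 vs 0.29)
sharp = apply_temperature(probs, 0.1)  # small T pushes it toward argmax
print(same, flat, sharp)
```

With T = 5 the unlikely word goes from a 1% chance to roughly a 29% chance, while with T = 0.1 its probability becomes negligible.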

### TODO 5:
Implement a greedy decoding function with temperature sampling. This function should be almost identical to your greedy_decode() except that it does not use argmax.

In [4]:
def sample_output(probs, temperature=1.0):
  """
  probs: an array of probabilities
  temperature: temperature
  
  Return: index of the predicted words
  """
  pass
  return 0

In [None]:
def temperature_sampling_decode(seed_text, max_gen_length, model, word_to_index, index_to_word, input_len, temperature):
  """Greedy decodes with seed text using temperature sampling.

  Args:
    seed_text: The seed string to be used as initial input to the model.
    max_gen_length: Maximum length for generation.
                    The decoding process must terminate when this length is reached
    model: Pretrained keras model for prediction.
    word_to_index: The dictionary for converting word to index.
    input_len: A number indicating how many previously generated words will be used as 
               inputs for the model.
    temperature: temperature.

  The returned probs must be the probabilities before rescaling.
  """
  current_text = seed_text
  probs = []
  for _ in range(max_gen_length):
    ### YOUR CODE HERE
    
    ### END YOUR CODE
    current_text += " " + output_word
  return current_text.title(), probs

In [None]:
temperature_sampling_decode("the", max_gen_length, model, word_to_index, index_to_word, input_len, 0.90)

Using the same seed texts as above, compare the output from normal greedy decoding to Temperature Sampling with T=1.5 and T=5.0

In [3]:
pass

#### Don't forget to shut down your instance on Gcloud when you are not using it