<a href="https://colab.research.google.com/github/compi1234/spchlab/blob/master/RNN_LM/rnn_lm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling with Recurrent Neural Networks

In this notebook, we will see how you can train a recurrent neural network language model.

We start by importing TensorFlow, Google's open-source library for machine learning. Next, we explain how to process data for language modeling and introduce the most important classes and functions that we will use. We then give a short introduction to word embeddings, to testing a network, and to the importance of hyperparameters during training.

Dependencies:
- TensorFlow: this notebook uses `tensorflow.compat.v1`. This generates a number of deprecation warnings in Colab. This is normal and no action is required.

## Importing TensorFlow and other requirements

We start by importing TensorFlow and checking whether we are running on a GPU. You will probably be asked for two confirmations; click YES twice. 

In [1]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')


If the code above raised an error, you should make sure that you are using a GPU in the following way: select 'Runtime' in the top bar, then 'Change runtime type' and choose 'GPU' as hardware accelerator. Training neural networks is much faster on a GPU (graphics processing unit) than on a CPU.

Next, we import the other standard Python packages that we need. 

In [None]:
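from __future__ import print_function  # __future__ imports must come first in the cell
import collections, os
import urllib.request  # needed so that urllib.request.urlopen is available
import numpy as np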

In the three cells below, we define some classes and functions that we use throughout the notebook. You have to run the cells to be able to use them, but you do not have to study the code in detail. By default this code is hidden, but you can click on SHOW CODE to view each of them (batchGenerator.py, rnn_lm.py and run_lm.py).

In [None]:
#@title
class batchGenerator(object):
  '''
  This class generates batches for a dataset.
  Input arguments:
    data: list of indices (word ids)
    batch_size: number of sequences in a mini-batch
    num_steps: length of each sequence in the mini-batch
    test: boolean, is True if we are testing; in that case batch_size and num_steps are 1
  '''
  
  def __init__(self, data, batch_size=32, num_steps=50, test=False):
    '''
    Prepares a dataset.
    '''
    self.batch_size = batch_size
    self.num_steps = num_steps
    self.test = test 

    self.data_array = np.array(data)
  
    if not self.test:
      len_batch_instance = int(len(data) / batch_size)

      print(int(batch_size*len_batch_instance))
      data_array = self.data_array[:batch_size*len_batch_instance]

      # divide data in batch_size parts
      self.data_reshaped = np.reshape(data_array, (batch_size, len_batch_instance))

      # number of mini-batches that can be generated
      self.num_batches_in_data = len_batch_instance / num_steps - 1
    
    self.curr_idx = 0
  
  def generate(self):
    '''
    Generates
      input_batch: numpy array (batch_size x num_steps) or None, if the end of the dataset is reached
      target_batch: numpy array (batch_size x num_steps) or None, if the end of the dataset is reached
      end_reached: boolean, True if the end of the dataset is reached
    '''
    
    if self.test:
      if self.curr_idx+1 >= len(self.data_array):
        return None, None, True
      
      input_batch = [[self.data_array[self.curr_idx]]]
      target_batch = [[self.data_array[self.curr_idx+1]]]
      
    else:
      if self.curr_idx >= self.num_batches_in_data:
        return None, None, True

      # input: take slice of size num_steps
      input_batch = self.data_reshaped[:,self.curr_idx*self.num_steps:self.curr_idx*self.num_steps+self.num_steps]

      # target = input shifted 1 time step
      target_batch = self.data_reshaped[:,self.curr_idx*self.num_steps+1:self.curr_idx*self.num_steps+self.num_steps+1]

    self.curr_idx += 1
    
    return input_batch, target_batch, False


In [None]:
#@title
class rnn_lm(object):
  '''
  This is a class to build and execute a recurrent neural network language model.
  Arguments:
    cell: type of RNN cell, 'LSTM' or 'RNN'
    optimizer: 'SGD' or 'Adam'
    lr: learning rate
    vocab_size: size of the vocabulary
    embedding_size: size of continuous embedding that will be input to the RNN
    hidden_size: size of the hidden layer
    dropout_rate: value between 0 and 1, fraction of neurons that will be 
        kept (not dropped) during training, prevents overfitting
    batch_size: number of sequences that will be input simultaneously
    num_steps: length of each input sequence = number of steps in backpropagation through time
    is_training: boolean, True if we want to update the parameters of the model
  '''
  
  def __init__(self,
              cell='LSTM',
              optimizer='Adam',
              lr=0.01,
              vocab_size=10000,
              embedding_size=64,
              hidden_size=128,
              dropout_rate=0.5,
              batch_size=32,
              num_steps = 50,
              is_training=True):
    # hyperparameters that can be changed
    self.which_cell = cell
    self.which_optimizer = optimizer
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size
    self.dropout_rate = dropout_rate
    self.is_training = is_training
    self.lr = lr
    self.batch_size = batch_size
    self.num_steps = num_steps
    
    # hard-coded hyperparameters
    self.max_grad_norm = 5
    
    self.init_graph()
    
    self.output, self.state = self.feed_to_network()
    
    self.loss = self.calc_loss(self.output)
    
    if self.is_training:
      self.update_params(self.loss)
    
    
  def init_graph(self):
    '''
    This function initializes all elements of the network.
    '''
    
    self.inputs = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
    self.targets = tf.placeholder(dtype=tf.int32, shape=[self.batch_size, self.num_steps])
    
    # input embedding weights
    self.embedding = tf.get_variable("embedding", 
                                     [self.vocab_size, self.embedding_size], 
                                     dtype=tf.float32)
    
    # hidden layer
    if self.which_cell == 'LSTM':
      self.basic_cell = tf.nn.rnn_cell.LSTMCell(self.hidden_size)
    elif self.which_cell == 'RNN':
      self.basic_cell = tf.nn.rnn_cell.BasicRNNCell(self.hidden_size)
    else:
      raise ValueError("Specify which type of RNN you want to use: RNN or LSTM.")
      
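    # apply dropout to the cell outputs, but only while training;
    # a keep probability of 1.0 disables dropout for validation and testing
    keep_prob = self.dropout_rate if self.is_training else 1.0
    self.cell = tf.nn.rnn_cell.DropoutWrapper(self.basic_cell, 
                                              output_keep_prob=keep_prob)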
    
    # initial state contains all zeros
    self.initial_state = self.cell.zero_state(self.batch_size, tf.float32)
    
    # output weight matrix and bias
    self.softmax_w = tf.get_variable("softmax_w",
                                     [self.hidden_size, self.vocab_size], 
                                     dtype=tf.float32)
    self.softmax_b = tf.get_variable("softmax_b",
                                     [self.vocab_size], 
                                     dtype=tf.float32)
    
    
    
  def feed_to_network(self):
    '''
    This function feeds the input to the network and returns the output and the state.
   
    '''
    
    # map input indices to continuous input vectors
    inputs = tf.nn.embedding_lookup(self.embedding, self.inputs)

    # use dropout on the input embeddings (only while training)
    if self.is_training:
      inputs = tf.nn.dropout(inputs, self.dropout_rate)
    
    state = self.initial_state
    
    # feed inputs to network: outputs = predictions, state = new hidden state
    outputs, state = tf.nn.dynamic_rnn(self.cell, inputs, sequence_length=None, initial_state=state)
    
    output = tf.reshape(tf.concat(outputs, 1), [-1, self.hidden_size])
    
    return output, state
    
  
  def calc_loss(self, output):
    
    # calculate logits
    # shape of logits = [batch_size*num_steps, vocab_size]
    logits = tf.matmul(output, self.softmax_w) + self.softmax_b
    
    self.softmax = tf.nn.softmax(logits)
      
    # calculate cross entropy loss
    # reshape targets such that it has shape [batch_size*num_steps]
    # loss: contains loss for every time step in every batch
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.reshape(self.targets, [-1]), logits=logits)
      
    # loss summed over all time steps, averaged over the sequences in the batch
    avg_loss = tf.reduce_sum(loss) / self.batch_size
    
    return avg_loss
  
  def update_params(self, loss):
    
    # calculate gradients for all trainable variables 
    # + clip them if their global norm > 5 (prevents exploding gradients)
    grads, _ = tf.clip_by_global_norm(
        tf.gradients(loss, tf.trainable_variables()), self.max_grad_norm)
    
    if self.which_optimizer == 'SGD':
      optimizer = tf.train.GradientDescentOptimizer(self.lr)
    elif self.which_optimizer == 'Adam':
      optimizer = tf.train.AdamOptimizer(self.lr)
    else:
      raise ValueError("Specify which type of optimizer you want to use: SGD or Adam.")
    
    # update the weights
    self.train_op = optimizer.apply_gradients(
        zip(grads, tf.trainable_variables()))

In [None]:
#@title
def run_lm(name='LSTM', cell='LSTM', 
           optimizer='Adam', lr=0.01, 
           vocab_size = 10000, embedding_size=64, 
           hidden_size=128, dropout_rate=0.5, 
           num_steps=50, inspect_emb=False, 
           train_ids=None, valid_ids=None, 
           test_ids=None, test_log_prob=False):
  '''
  Creates training, validation and/or test models,
  trains, validates and/or tests the model.
  Arguments:
    name: name that will be used to save the model
    cell: type of RNN cell, 'LSTM' or 'RNN'
    optimizer: 'SGD' or 'Adam'
    lr: learning rate
    vocab_size: size of the vocabulary
    embedding_size: size of continuous embedding that will be input to the RNN
    hidden_size: size of the hidden layer
    dropout_rate: value between 0 and 1, fraction of neurons that will be 
        kept (not dropped) during training, prevents overfitting
    num_steps: length of each input sequence = number of steps in backpropagation through time
    inspect_emb: boolean, if True we want to return the embedding_matrix
    train_ids: training data
    valid_ids: validation data
    test_ids: test data
    test_log_prob: boolean, if True we only want to test the log probability for a test sentence
  '''
    
  with tf.Graph().as_default() as graph:

      # create the models
      if not test_log_prob:
      
        with tf.variable_scope("Model"):
          rnn_train = rnn_lm(cell=cell,
                             optimizer=optimizer, 
                             lr=lr,
                             vocab_size=vocab_size,
                             embedding_size=embedding_size,
                             hidden_size=hidden_size,
                             dropout_rate=dropout_rate)

          saver = tf.train.Saver()

        with tf.variable_scope("Model", reuse=True):
          rnn_valid = rnn_lm(cell=cell, 
                             optimizer=optimizer,
                             lr=lr,
                             vocab_size=vocab_size, 
                             embedding_size=embedding_size,
                             hidden_size=hidden_size,
                             dropout_rate=dropout_rate,
                             is_training=False)
          
        reuse = True
        
      else:
        reuse = False
               
      with tf.variable_scope("Model", reuse=reuse):
        rnn_test = rnn_lm(cell=cell, 
                           optimizer=optimizer, 
                           lr=lr,
                           vocab_size=vocab_size,
                           embedding_size=embedding_size,
                           hidden_size=hidden_size,
                           dropout_rate=dropout_rate,
                           batch_size=1,
                           num_steps=1,
                           is_training=False)
      

      sv = tf.train.Supervisor(logdir=name)

      with sv.managed_session(config=tf.ConfigProto()) as session:
        
        if not test_log_prob:
        
          for i in range(5):

            print('Epoch {0}'.format(i+1))

            train_ppl = run_epoch(session, rnn_train, train_ids, num_steps=num_steps)
            print('Train perplexity: {0}'.format(train_ppl))

            valid_ppl = run_epoch(session, rnn_valid, valid_ids, num_steps=num_steps, is_training=False)
            print('Validation perplexity: {0}'.format(valid_ppl))

          save_path = saver.save(session, "{0}/rnn.ckpt".format(name))
          print('Saved the model to ',save_path)

        test_ppl = run_epoch(session, rnn_test, test_ids, num_steps=num_steps,
                             is_training=False, is_test=True, 
                             test_log_prob=test_log_prob)
        if not test_log_prob:
          print('Test perplexity: {0}'.format(test_ppl))
        
        if inspect_emb: 
          emb_matrix = tf.get_default_graph().get_tensor_by_name("Model/embedding:0")
          emb_matrix_np = emb_matrix.eval(session=session)

          return emb_matrix_np

        else:

          return None

def run_epoch(session, rnn, data, num_steps=50, is_training=True, is_test=False, test_log_prob=False):
    '''
    This function runs a single epoch (pass) over the data,
    updating the model parameters if we are training,
    and returns the perplexity.
    Input arguments:
      rnn: object of the rnn_lm class
      data: list of word indices
      num_steps: length of each input sequence
      is_training: boolean, True if we are training the model
      is_test: boolean, True if we are testing a trained model
      test_log_prob: boolean, True if we want the log probability
    Returns:
      ppl: float, perplexity of the dataset
    '''
  
    generator = batchGenerator(data, test=is_test)
      
    state = session.run(rnn.initial_state)
    sum_loss = 0.0
    iters = 0
    
    if test_log_prob: 
      sum_log_prob = 0.0
      
    while True:

      input_batch, target_batch, end_reached = generator.generate()
        
      if end_reached:
        break

      feed_dict = {rnn.inputs: input_batch,
                  rnn.targets: target_batch,
                  rnn.initial_state : state}

      fetches = {'loss': rnn.loss,
                'state': rnn.state}
      
      if is_training:
        fetches['train_op'] = rnn.train_op
        
      if test_log_prob:
        fetches['softmax'] = rnn.softmax
        
      result = session.run(fetches, feed_dict)
        
      state = result['state']
      loss = result['loss']
      
      if test_log_prob:
        softmax = result['softmax']
        prob_target = softmax[0][target_batch[0][0]]
        sum_log_prob += np.log(prob_target)

      sum_loss += loss
      # the loss is summed over the time steps in a batch (1 step when testing),
      # so we count time steps to obtain the average loss per word
      if is_test:
        iters += 1
      else:
        iters += num_steps
        
    # calculate perplexity    
    ppl = np.exp(sum_loss / iters)
    
    if test_log_prob:
      print('Log probability: {0}'.format(sum_log_prob))
    
    return ppl


If you have run all cells in this section, you can now start the following section on data processing.

## Data processing

We will train our language models on **Penn TreeBank**, which is a publicly available benchmark dataset. A benchmark dataset makes it easy to compare models, since everyone has access to the same data. Many published papers use Penn TreeBank as a dataset.

Penn TreeBank consists of, among other things, newspaper articles, transcribed telephone conversations and manuals. The training set contains ca. 900,000 words, the validation set ca. 70,000 words and the test set ca. 80,000 words. This is a very small dataset (nowadays language models can be trained on billions of words), but it is large enough for our purposes.

We now download the training, validation and test data:

In [None]:
train_url = 'http://homes.esat.kuleuven.be/~spchlab/data/penntreebank/train.txt'
valid_url = 'http://homes.esat.kuleuven.be/~spchlab/data/penntreebank/valid.txt'
test_url = 'http://homes.esat.kuleuven.be/~spchlab/data/penntreebank/test.txt'
train_file = urllib.request.urlopen(train_url).read()
valid_file = urllib.request.urlopen(valid_url).read()
test_file = urllib.request.urlopen(test_url).read()

The data looks like this:

In [None]:
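# the downloaded file is a bytes object; decode it so that it prints as text
print('{0}...'.format(valid_file[:500].decode('utf-8', errors='replace')))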

The data has been **normalized**: all words not in the vocabulary are mapped to an unknown-word class (<unk\>), all numbers are mapped to the 'N' class, each line contains a single sentence, punctuation has been removed, and so on. 

The purpose of normalization is, among other things, to remove information that is not needed (such as punctuation), to resolve redundancies (for example, the same word can occur with different spellings, such as 'normalisation' and 'normalization', and we want to get rid of such variants), and to help the language model generalize better. An example of the latter is the mapping of all numbers to 'N': in the example above, 'in N years', 'N' can correspond to any number. Assume that in our training data we see 'in 20 years' and 'in 11 years', and in our test data we see 'in 5 years'. If '20', '11' and '5' are not mapped to 'N', we have never seen 'in 5 years' before, and the probability estimate for 'in 5 years' will be worse.
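
To make the idea of normalization concrete, here is a minimal sketch of the two mappings described above. It is not the procedure that was used to prepare Penn TreeBank: the function `normalize_tokens`, the number pattern and the tiny vocabulary are assumptions chosen purely for illustration.

```python
import re

def normalize_tokens(tokens, vocabulary):
    """Toy normalization: numbers -> 'N', out-of-vocabulary words -> '<unk>'."""
    # hypothetical pattern: digits, optionally with '.' or ',' separators
    number_pattern = re.compile(r'^\d+([.,]\d+)*$')
    normalized = []
    for token in tokens:
        token = token.lower()
        if number_pattern.match(token):
            normalized.append('N')
        elif token in vocabulary:
            normalized.append(token)
        else:
            normalized.append('<unk>')
    return normalized

# hypothetical toy vocabulary, not the real Penn TreeBank vocabulary
toy_vocabulary = {'in', 'years', 'the', 'market'}
normalized_example = normalize_tokens(['In', '5', 'years', 'the', 'Zeitgeist'],
                                      toy_vocabulary)
print(normalized_example)
```

After this mapping, 'in 5 years', 'in 11 years' and 'in 20 years' all become the same sequence 'in N years', so their counts are pooled.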
  
We will now read the data, add end-of-sentence symbols <eos\> (since we want to be able to predict the end of a sentence too), and count the frequency of every word in the training data:

In [None]:
# convert the string to a list and replace newlines with the end-of-sentence symbol <eos>
# ignore empty elements ''
train_text = [w for w in train_file.decode(encoding='utf-8', errors='strict').replace('\n',' <eos>').split(' ') if w != '']
valid_text = [w for w in valid_file.decode(encoding='utf-8', errors='strict').replace('\n',' <eos>').split(' ') if w != '']
test_text = [w for w in test_file.decode(encoding='utf-8', errors='strict').replace('\n',' <eos>').split(' ') if w != '']

# count the frequencies of the words in the training data
counter = collections.Counter(train_text)

# sort according to decreasing frequency
count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

In [None]:
# make a list of the 10000 most frequent words; we will use this later
most_frequent_words = []
for i in range(0,10000):
  most_frequent_words.append(count_pairs[i][0])
# just print a selection if you want to
most_frequent_words[100:105]

We can take a look at the counts of the words in the training set, and compare them with the counts of the words in the validation set. Let's take a look at the top 20 most frequent words:

In [None]:
# count the frequencies of the words in the validation data
counter_valid = collections.Counter(valid_text)

# sort according to decreasing frequency
count_pairs_valid = sorted(counter_valid.items(), key=lambda x: (-x[1], x[0]))

fltrain = float(len(train_text))
flvalid = float(len(valid_text))

print('Top 20 most frequent words:')
print('Train (freq.)\t\tValid (freq.)')
# we can take a look at the 20 most frequent words + their frequencies:
for i in range(20):
  freq_train = round((float(count_pairs[i][1]) / fltrain)*100,3)
  freq_valid = round((float(count_pairs_valid[i][1]) / flvalid)*100,3)
  
  print('{0} ({1} - {2}%)\t\t{3} ({4} - {5}%)'.format(
      count_pairs[i][0], count_pairs[i][1], freq_train,
      count_pairs_valid[i][0], count_pairs_valid[i][1], freq_valid))

In parentheses, we print the raw counts followed by the relative frequencies. 

*   Which types of words are the most frequent?
*   How do you explain the difference in raw counts between the training text and validation text?
*   How do you explain the difference in relative frequencies between the training text and validation text?



Let's now take a look at the mid- and low-frequency range. In the cell below, we look at the relative frequencies for a few words starting at begin_idx. You can change this value to inspect other words.

In [None]:
begin_idx = 200 # you can change this value
end_idx = begin_idx + 20

print('Word\t\tTrain freq.\tValid freq.')
for i in range(begin_idx, end_idx):
  entry = count_pairs[i][0]
  
  freq_train = round((float(counter[entry]) / fltrain)*100,3)
  freq_valid = round((float(counter_valid[entry]) / flvalid)*100,3)
  
  print('{0}\t\t{1}%\t\t{2}%'.format(
      entry, freq_train, freq_valid))



*   Are the differences in relative frequencies larger for the mid-frequency range than for the high-frequency range?
*   How do you explain this?



We now create a mapping from words to indices. The real input for the neural network will be indices, because they take up less space and because it makes certain operations easier.

In [None]:
# items = tuple of all the words (in order of decreasing frequency)
items, _ = list(zip(*count_pairs))

# make a dictionary with a mapping from each word to an id; word with highest frequency gets lowest id etc.
item_to_id = dict(zip(items, range(len(items))))
id_to_item = dict(zip(range(len(items)), items))
vocab_size = len(item_to_id)

# convert the words to indices
train_ids_large = [item_to_id[item] for item in train_text]
valid_ids_large = [item_to_id[item] for item in valid_text]
test_ids_large = [item_to_id[item] for item in test_text]

Once the data is converted to ids, it looks like this:

In [None]:
print('Here is an example of words and their indices:')
for i in range(40):
  print('{0}\t{1}'.format(valid_text[i], valid_ids_large[i]))
print('\nAnd this is what the input looks like, a list of indices:')
print(valid_ids_large[:40])

To speed up some experiments, we take a subset of the data.

In [None]:
# take a smaller subset to speed up training
train_ids = train_ids_large[:50000]
valid_ids = valid_ids_large[:10000]
test_ids = test_ids_large[:10000]

#train_ids = [int(i) for i in train_ids]
#valid_ids = [int(i) for i in valid_ids]
#test_ids = [int(i) for i in test_ids]

print('Number of words in small training set: {0}'.format(len(train_ids)))
print('Number of words in small validation set: {0}'.format(len(valid_ids)))
print('Number of words in small test set: {0}'.format(len(test_ids)))

## The classes and functions that we will use

We will now explore the classes and functions that we have defined earlier for training and testing our language models.

The class for an RNN language model is **rnn_lm()**. We will see later which options we can use.

**batchGenerator(<dataset\>)** is a class that iterates over the dataset and creates **mini-batches** that will be the input for the neural network; <dataset\> is a list of word ids. A mini-batch contains several word sequences. Feeding mini-batches instead of a single sentence or a single word to the network speeds up processing and generally also improves the convergence of the model.

The batches are matrices of size **batch_size** x **num_steps**, where batch_size is the number of different sequences in a single batch and num_steps is the length of each sequence.
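
The layout of a mini-batch can be sketched on toy data as follows. This is a simplified illustration that mirrors the reshape-and-slice logic of batchGenerator; the names prefixed with `toy_` and the variable `streams` are assumptions introduced only for this example.

```python
import numpy as np

toy_data = np.arange(20)              # stands in for a list of word ids
toy_batch_size, toy_num_steps = 2, 3

# cut the data into batch_size parallel streams (one row per stream)
stream_length = len(toy_data) // toy_batch_size
streams = toy_data[:toy_batch_size * stream_length].reshape(
    toy_batch_size, stream_length)

# first mini-batch: num_steps consecutive ids from every stream
toy_input = streams[:, 0:toy_num_steps]
# targets: the same window moved one position ahead in the data
toy_target = streams[:, 1:toy_num_steps + 1]

print(toy_input)
print(toy_target)
```

Each row of the batch continues where the same row of the previous batch stopped, which is why the hidden state can be carried over from one mini-batch to the next.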

Here is an example of how batchGenerator can be used. You will notice that the target batch contains the same indices as the input batch, but shifted by one (time) step: each target is the word that follows the corresponding input word.

In [None]:
batch_size = 32
num_steps = 50

generator = batchGenerator(valid_ids, batch_size=batch_size, num_steps=num_steps)
input_batch, target_batch, end_reached = generator.generate()
print('Shape of the mini-batch: {0}'.format(input_batch.shape))
print('This is what an input batch looks like:\n{0}'.format(input_batch))
print('And this is what a target batch looks like:\n{0}'.format(target_batch))

Here is a function which pretty-prints what the mini-batches look like. You can give it a batch as first argument, and as second argument the index that you want to look at. In our case, there are 32 sequences in every mini-batch, so the indices range between 0 and 31 (in Python, indices always start at 0).

In [None]:
def print_batch(batch, idx):
  for i in range(num_steps):
      word = id_to_item[batch[idx][i]]
      if word == '<eos>':
         print()
      else:
        print(word, end=' ')
  print()
  print()

And here are some examples of what the first and fourth sequences of the input and target batches look like. Try it yourself with some new values.

In [None]:
print_batch(input_batch, 0)
print_batch(target_batch, 0)

print_batch(input_batch, 3)
print_batch(target_batch, 3)

# try it yourself:
# print_batch(..., ...)



*   What is the difference between the input batch and the target batch?



**run_lm():** this function can be called to build, train and test models with different parameter settings. 

**run_epoch()**: this is a function that does one pass over the whole dataset. If we are training the model, it will update the parameters and return the perplexity. Otherwise, it will just return the perplexity.
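
Since run_epoch() reports perplexity, a small worked example may help to interpret that number. The sketch below assumes we already have the probability that a model assigned to each target word; the function name `perplexity` and the toy probabilities are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def perplexity(target_probs):
    """Perplexity = exp of the average negative log probability of the targets."""
    target_probs = np.asarray(target_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(target_probs))))

# a model that spreads its probability uniformly over 4 words
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
# a model that is more confident about some of the targets
better_ppl = perplexity([0.5, 0.5, 0.25, 0.25])

print(uniform_ppl, better_ppl)
```

A uniform model over 4 words has a perplexity of 4, so perplexity can be read as the effective number of words the model is choosing between; lower is better.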

## Word embeddings

Start by running the cell below, which will load a large matrix. During loading, which will take some time, you can continue to read the explanation below. 

In [None]:
url_emb_matrix = 'http://homes.esat.kuleuven.be/~spchlab/H02A6/lab/session6/models/emb_matrix_ptb_256h.txt'
emb_matrix = np.loadtxt(urllib.request.urlopen(url_emb_matrix))

print('Size of the embedding matrix: {0}'.format(emb_matrix.shape))

Often the input words for a language model are represented as indices in a vocabulary, or **one-hot vectors** (where all values are 0 except the index of the word, which has value 1). This representation is a discrete representation, just like in n-gram language models. It has the disadvantage that relationships between words (e.g. the syntactic relationship between 'eat' and 'eating', or the semantic relationship between 'eat' and 'drink') cannot be inferred from the word representations. 
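
To make this limitation concrete, here is a small sketch with a toy vocabulary of four words (an assumption for illustration only). The dot product between the one-hot vectors of any two different words is zero, regardless of how related the words are.

```python
import numpy as np

toy_vocab = ['eat', 'eating', 'drink', 'car']
one_hot = np.eye(len(toy_vocab))      # row i is the one-hot vector of word i

eat, eating, car = one_hot[0], one_hot[1], one_hot[3]

# similarity between a related pair and between an unrelated pair
sim_related = float(np.dot(eat, eating))
sim_unrelated = float(np.dot(eat, car))

print(sim_related, sim_unrelated)
```

Both similarities are identical, so the representation itself carries no information about which words are related.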

Neural language models, however, do not use this representation as is, but first map it to a continuous, lower-dimensional vector, also called a **word embedding**. They do this by looking up the index of the word in a weight matrix $\mathbf{E}$, which is often called the embedding matrix. By training the embedding matrix jointly with the rest of the language model, the resulting word embeddings will have some interesting properties: several syntactic and semantic relationships are encoded as vector offsets in the embedding space. A famous example is the vector offset for male - female, which is shown in the example below:

![alt text](https://github.com/lverwimp/RNN_language_modeling/blob/master/kingqueen.png?raw=1)
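
The lookup in the embedding matrix described above can be sketched as follows. The matrix here is random and tiny (an assumption for illustration; a trained matrix would be learned jointly with the language model). The point is that multiplying a one-hot vector with $\mathbf{E}$ gives the same result as selecting a single row of $\mathbf{E}$.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_emb_size = 5, 3
E = rng.normal(size=(toy_vocab_size, toy_emb_size))   # toy embedding matrix

word_id = 2
one_hot = np.zeros(toy_vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ E      # one-hot vector times embedding matrix
via_lookup = E[word_id]       # direct row lookup

print(np.allclose(via_matmul, via_lookup))
```

This is why implementations typically use a row lookup (such as `tf.nn.embedding_lookup` in the rnn_lm class above) rather than an explicit matrix product: the result is the same, but the lookup is much cheaper.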

The embedding matrix that we loaded above contains the embeddings of a large language model trained on Penn TreeBank. We will not train such a large model because the training time is too long (ca. 1h, depending on the hardware). In the next chapter, we will train smaller models on a subset of Penn TreeBank, which can be trained in a couple of minutes.

We will now take a look at what the embeddings look like. Simply looking at their values does not tell us much:

In [None]:
print("This is what the embedding for 'man' looks like: {0}".format(emb_matrix[item_to_id['man']]))
print("This is what the embedding for 'woman' looks like: {0}".format(emb_matrix[item_to_id['woman']]))



*   What do you observe in the embeddings above? 
*   What could you do to make the embedding more interpretable?



In [None]:
# if you want you can look at the embedding values of another word you're interested in
# simply change 'woman' in the print statement below
# !!! note: there will be a KeyError if you try a word that is not in the vocabulary
# print("This is what the embedding for 'woman' looks like: {0}".format(emb_matrix[item_to_id['woman']]))

A common technique to inspect embeddings is **dimensionality reduction**, which reduces the high-dimensional vector (in our case 256 dimensions) to a 2- or 3-dimensional vector that still captures the most important relationships. The simplest dimensionality reduction technique is principal component analysis (PCA). Exactly how PCA works is not important here, but we will use it to map our embeddings to a 2-dimensional space in the code below. We define a function that plots a subset of words in this 2-dimensional space.

In [None]:
# import the libraries that we need
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
    
# perform principal component analysis
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(emb_matrix)

def plot_pca(list_words):
  '''
  Plot all words in 'list_words' on the 2-dimensional PCA space.
  '''
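  if len(list_words) > 10:
    raise IOError("Maximum 10 words can be plotted.")
    
  # keep only the words that are in the vocabulary; we build a new list
  # because removing items from a list while iterating over it skips elements
  known_words = []
  for w in list_words:
    if w in item_to_id:
      known_words.append(w)
    else:
      print('Ignoring {0} because it is not in the vocabulary'.format(w))
  list_words = known_words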

  colors = ['navy','turquoise','darkorange','red','black','blue','yellow','green','purple','pink']

  for color, target_name in zip(colors[:len(list_words)], list_words):
      plt.scatter(principalComponents[item_to_id[target_name], 0], 
                  principalComponents[item_to_id[target_name], 1], 
                  color=color, 
                  label=target_name)
  plt.legend()
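
For readers who are curious about what PCA does, here is a rough sketch using only NumPy. It assumes the standard formulation (center the data, then project onto the directions of largest variance obtained from a singular value decomposition); the function `pca_2d` and the random toy data are illustrative assumptions, and the result may differ from scikit-learn's output in the sign of each component.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

toy_rng = np.random.default_rng(0)
toy_embeddings = toy_rng.normal(size=(50, 8))   # 50 toy "words", 8 dimensions
toy_projection = pca_2d(toy_embeddings)

print(toy_projection.shape)
```

The first coordinate always captures at least as much variance as the second, which is why the two axes of a PCA plot are not equally informative.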

You can use the plot_pca function with a maximum of 10 words, for example:

In [None]:
plot_pca(['man', 'woman', 'king', 'queen','men','women','child','children','boy','girl'])

*   Which relationships do you observe in the plot above?
*   What relationships did you expect? 
*   Are these relationships actually present in the plot? If not, why do you think they are not visible?



Another way to inspect word embeddings is to look at words that are closest to a specific target word. Closeness in a vector space can be calculated by using for example cosine similarity. In the function below, we calculate for a given word the top 10 closest words.

In [None]:
def find_closest_words(word):
  if word not in item_to_id:
    raise IOError('This item is not in the vocabulary')
    
  else:
    id_w = item_to_id[word]
    emb_w = emb_matrix[id_w]
    # normalize the embedding vector to unit length
    norm_emb_w = emb_w / np.linalg.norm(emb_w)
    
    top_10 = {}
    
    # iterate over all words
    for idx in range(emb_matrix.shape[0]):
      # ignore the word itself
      if idx != id_w:
        
        # normalize to unit length
        norm_curr_w = emb_matrix[idx] / np.linalg.norm(emb_matrix[idx])
        
        # cosine similarity = dot product of normalized vectors
        cos_sim = np.dot(norm_emb_w, norm_curr_w)
        
        # keep the 10 largest cosine similarities: once we have 10 entries,
        # replace the current smallest one if the new similarity is larger
        if len(top_10) >= 10:
          min_sim = min(top_10)
          if cos_sim > min_sim:
            del top_10[min_sim]
            top_10[cos_sim] = id_to_item[idx]
        
        else:
          top_10[cos_sim] = id_to_item[idx]
          
        
    print('Words with largest cosine similarity w.r.t. {0}'.format(word))
    # sort the top 10 
    for sim in sorted(top_10, key=float, reverse=True):
      print('{0} ({1})'.format(top_10[sim], sim))
    print()  

The function above can be called as follows:

In [None]:
find_closest_words('man')
find_closest_words('help')
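
As an alternative to the explicit loop, the same ranking can be sketched with matrix operations. The function `top_k_cosine` and the toy matrix below are assumptions for illustration, and the sketch assumes that no embedding is an all-zero vector (otherwise the normalization would divide by zero).

```python
import numpy as np

def top_k_cosine(embeddings, query_idx, k=10):
    """Return (index, similarity) pairs of the k rows closest to row query_idx."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms              # every row now has unit length
    sims = unit @ unit[query_idx]          # cosine similarity with the query
    sims[query_idx] = -np.inf              # exclude the query word itself
    best = np.argsort(-sims)[:k]           # indices sorted by decreasing similarity
    return [(int(i), float(sims[i])) for i in best]

toy_emb = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])
neighbours = top_k_cosine(toy_emb, 0, k=2)
print(neighbours)
```

With the variables of this notebook, it could be called as `top_k_cosine(emb_matrix, item_to_id['man'])`, after which the returned indices can be mapped back to words with `id_to_item`.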

Notice that we did not optimize our embedding space to contain any relationships - this is merely a by-product of training a neural language model. 

*   Which relationships do you see between the target word and its closest words?
*   What kind of relationships do you see: syntactic and/or semantic?
*   Do all closest words have clear relationships with the target word?
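
As an aside, the same ranking can be computed without an explicit Python loop. The sketch below is an illustrative alternative, not the notebook's own function: the name `closest_words_vectorized` and the toy 2-dimensional embeddings are made up for this example, and it assumes an embedding matrix with one row per word plus the two lookup tables as arguments.

```python
import numpy as np

def closest_words_vectorized(word, emb_matrix, item_to_id, id_to_item, k=10):
    """Return the k (word, cosine similarity) pairs closest to `word`."""
    id_w = item_to_id[word]
    # normalize every row to unit length, so a dot product equals cosine similarity
    norms = np.linalg.norm(emb_matrix, axis=1, keepdims=True)
    unit = emb_matrix / norms
    # cosine similarity between the target word and every word in the vocabulary
    sims = unit @ unit[id_w]
    # exclude the target word itself
    sims[id_w] = -np.inf
    # indices of the k largest similarities, in decreasing order
    top = np.argsort(-sims)[:k]
    return [(id_to_item[int(i)], float(sims[i])) for i in top]

# toy example with made-up embeddings (NOT the trained ones)
toy_vocab = ['man', 'woman', 'help', 'car']
toy_item_to_id = {w: i for i, w in enumerate(toy_vocab)}
toy_id_to_item = {i: w for w, i in toy_item_to_id.items()}
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

toy_result = closest_words_vectorized('man', toy_emb, toy_item_to_id, toy_id_to_item, k=2)
print(toy_result)
```

Sorting all similarities at once guarantees that the result really contains the k largest values, and it is usually much faster than looping over the vocabulary.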



## PERSONAL EXERCISE -- PREPARATION FOR THE EXAM

This section contains a personalized variant of the exercise(s) above.   **You need to bring the printout of this COLAB exercise to the exam**, where it will be used to ask you some further personalized questions.

There are 3 code cells in this section (the third one is optional):

**[1]** This cell will generate your personal word list for this exercise.  Enter your KULeuven ID and execute the cell.  
 + it's not so relevant how this happens but don't modify the hidden parameters ! (you can play with them, but the exercise you bring to the exam should be based on the default settings)
 + first your *personal key* is generated; it is a nonsense word using your id as seed and fitting English letter 4gram probabilities.  So it may actually look like an English word, but most likely it isn't
 + then your *personal wordlist* is generated; this consists of the top-3 words from the 2500 most frequent words in the training database that are closest (DTW distance) to your *personal key*.
  
**[2]** This cell runs the find_closest_words() routine on your wordlist and hence returns words that are the nearest neighbours in embedding space for the words in your wordlist. 
For each of your personal words, explain which of the close neighbours in embedding space are plausible in your opinion, and why.  
 
**Bring the full output of this COLAB cell to the exam, including key, wordlist and nearest neighbours**.   

**[3]** (optional) In this cell you can plot via PCA the relationships of the words in your wordlist.  It is rather unlikely that you will see much here. 





In [None]:
import random, re
kul_id = 's0123456' #@param {type:"string"}
vocsize = 2500 
selection = 3 
length = 8 
ngram_length = 4 
# UTILITY: a random word generator made up of frequent letter ngrams
def random_word(length=10,seed='u012345',text='abcdefghjk',ng=5):
  chars = [text[i:i + ng] for i in range(0, len(text)-ng+1) if ' ' not in text[i:i+ng]]
  char_counter = collections.Counter(chars)
  ngrams= ([w for (w,n) in char_counter.most_common(100)])
  random.seed(seed,version=2)
  nwords = length//ng + 1
  ngram_word = "".join(random.sample(ngrams,nwords))
  return(ngram_word[0:length])
######### generate your personal key  #####################
text = train_file.decode(encoding='utf-8', errors='strict').lower().replace('\n',' ').replace('<unk>',' ').replace('$',' ')
my_key = random_word(length=length,seed=kul_id,text=text.replace(' ',''),ng=ngram_length)  
#print("Generated Key: ", my_key)

### dtw utility
def stringdtw(str1,str2):
  s1 = '#'+str1
  s2 = '#'+str2
  # initialize
  Dist = [i for i in range(0,len(s2)) ] 
  Dist[0] = 0
  for i in range(1,len(s1)):
    Dist1 = Dist.copy()
    Dist[0] = Dist1[0] + 1
    for j in range(1,len(s2)):
      Dist[j] =  min(
          Dist1[j-1] + int(s1[i]!=s2[j]) ,
          Dist1[j] + 1.2,
          Dist[j-1] + 1.2)
  return(Dist[len(s2)-1])
#############################################################################

######### generate your personal word list ################
wlist = most_frequent_words[0:vocsize]
dist = np.zeros(vocsize,dtype='float')
for i in range(vocsize):
  dist[i] = stringdtw(wlist[i],my_key)
sorted_wordlist = [wlist[i] for i in np.argsort(dist) if len(wlist[i]) >= length-2]
my_wordlist = sorted_wordlist[0:selection]

In [None]:
######### find the closest words to your personal wordlist according to embeddings  ##############
print("="*70)
print("Your KULeuven ID:", kul_id)
print("Your Personal Key:",my_key)
print("Your Personal Wordlist: ", my_wordlist)
print("="*70)
for w in my_wordlist:
  find_closest_words(w)

In [None]:
# You can look for similarities in your word list, but the words in this list don't have much in common, 
# so you probably won't see much. You can still try.
plot_pca(sorted_wordlist[0:10])

Notice that not all words will have sensible nearest neighbours.  The main reason is that the words in your wordlist might have very little in common (this was not part of the selection criterion).  

More generally, the model is trained on a relatively small dataset. Not all words occur often enough to get a reliable embedding, and the model is not optimized to encode these relationships.  Another anomaly due to the small training dataset is that quite unexpected neighbours pop up in the list. 



## Testing networks

Before going into the details of training a neural language model, we will first show how you can use a trained model. The output of the neural network is given to a **softmax** function, which converts the output values (also called logits) to values between 0 and 1. The sum of those values is 1, and thus the output of the softmax function can be treated as a probability distribution. 

We can then find the probability of a specific word following the current input word by looking up its probability in the output vector of the softmax function. To find the most probable word, we look for the maximum probability. In practice, we usually work with **log probabilities**, because if we are computing the probability of a sequence of words, the product of all probabilities easily becomes very small. Converting the probabilities to the log domain and summing them instead of multiplying alleviates this problem. The log probability of a sentence is then the sum of the log probabilities of every word in the sentence, given their context.
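
To make this concrete, here is a minimal sketch of softmax and sentence log probability in NumPy. The logits and target indices are made up for illustration (a 4-word vocabulary and a 3-word sentence); they do not come from the trained model.

```python
import numpy as np

def softmax(logits):
    # subtracting the maximum improves numerical stability and does not change the result
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# made-up logits over a 4-word vocabulary, one row per time step
toy_logits = np.array([[2.0, 1.0, 0.1, -1.0],
                       [0.5, 0.5, 3.0, 0.0],
                       [1.0, 1.0, 1.0, 1.0]])
# made-up index of the word that actually follows at each time step
toy_targets = [0, 2, 3]

# probability of each target word under the softmax distribution of its time step
word_probs = [softmax(row)[t] for row, t in zip(toy_logits, toy_targets)]

# multiplying probabilities is equivalent to summing log probabilities
sentence_prob = float(np.prod(word_probs))
sentence_log_prob = float(np.sum(np.log(word_probs)))
print(sentence_prob, sentence_log_prob)
```

For a 3-word sentence the product is still easy to represent, but for long sequences it quickly underflows, whereas the sum of log probabilities stays well behaved.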

Let's first train a network:

In [None]:
# first make sure that we start training the model from scratch, by removing the baseline/ directory
!rm -rf baseline

# train the model
run_lm(name='baseline', train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

During training, the **perplexity** of the model is printed rather than the log probabilities. Perplexity is commonly used to measure the quality of a language model, and corresponds to $$e^{-\frac{1}{N}\ln P(x_1 \ldots x_N)}$$ where $N$ is the number of words. You see that it is closely related to the log probability: it is the exponential of the negative average log probability per word. The lower the perplexity, the better. Notice that in the example above the perplexities are quite high because we are training a small model on a small dataset.
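
As a small illustration, perplexity can be computed from per-word log probabilities as the exponential of the negative average log probability per word. The helper name `perplexity` and the input values below are assumptions made for this sketch; the notebook's `run_lm` computes perplexity internally.

```python
import math

def perplexity(word_log_probs):
    """Perplexity from a list of per-word natural-log probabilities."""
    n = len(word_log_probs)
    # exponential of the negative average log probability per word
    return math.exp(-sum(word_log_probs) / n)

# sanity check: a model that assigns every word a uniform probability of 1/10000
uniform = [math.log(1.0 / 10000)] * 5
uniform_ppl = perplexity(uniform)
print(uniform_ppl)
```

A model that spreads its probability uniformly over V words has a perplexity of V, which gives a feeling for the scale: a perplexity of 200 means the model is on average as uncertain as if it had to choose uniformly among 200 words.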

To test the model we define a function that prints the log probability of a sentence. The sentence is converted to indices and then given to the model:

In [None]:
def get_log_prob(test_sent):
  
  # convert words to indices
  test_idx = []
  for w in test_sent.split(' '):
    if w not in item_to_id:
      raise IOError("{0} is not part of the vocabulary".format(w))
    else:
      test_idx.append(item_to_id[w])

  # feed sentence to the model
  run_lm(name='baseline',
                test_ids=test_idx,
                test_log_prob=True)

To get the log probability of a specific sentence, use the following commands:

In [None]:
get_log_prob(test_sent='this is a test <eos>')
get_log_prob(test_sent='test a a a <eos>')



*   Which sentence has the largest probability? Is this what you expected?
*   What would happen if you compare the probability of two sentences with an unequal number of words? Remember that the probability of a sentence is the product of the probabilities of each word, so its log probability is the sum of the log probabilities of each word.



You can test your own sentences here:

In [None]:
get_log_prob('good things happen to good people')
get_log_prob('bad things happen to bad people')

## Training networks

Training neural networks requires a lot of hyperparameter tuning. The hyperparameters of a neural network are, for example, the type of cell, its size, the method that is used for updating its parameters (also called 'optimizer'), the type and strength of regularization, and so on. All these hyperparameters have to be chosen before the network can be built, trained and tested, and they all influence the performance of the model to some extent.

The default arguments for our network are the following:


* cell: 'LSTM'
* optimizer: 'Adam'
* lr: 0.01
* embedding_size: 64
* hidden_size: 128
* dropout_rate: 0.5

We will explain each of these arguments in the following sections.



### Type of cell

Recurrent neural networks are neural networks that take as input a combination of the standard input and the hidden state of the previous time step. The simplest form of recurrent neural network, often called **vanilla recurrent neural network**, looks like this (picture taken from Chris Olah's [blog post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)):

![alt text](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png)

The green blocks in the picture represent the neural network. You see that the inputs to the neural network are the current input word $\mathbf{x}_t$ (this is the word embedding as discussed before) and the state of the previous time step $\mathbf{h}_{t-1}$. The network multiplies both inputs with weights and applies a non-linearity, in this case $\tanh$. 
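
The recurrence described above can be sketched in a few lines of NumPy. This is an illustration only: the names `rnn_step`, `W_x`, `W_h` and `b`, the tiny dimensions and the random inputs are all assumptions of this sketch, not the notebook's TensorFlow implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # multiply both inputs with their weights, add a bias and apply tanh
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
emb_size, hid_size = 4, 3  # tiny sizes, chosen only for readability

# randomly initialized weights (a real network would learn these)
W_x = rng.normal(scale=0.1, size=(hid_size, emb_size))
W_h = rng.normal(scale=0.1, size=(hid_size, hid_size))
b = np.zeros(hid_size)

# the initial hidden state is all zeros
h = np.zeros(hid_size)
# 5 made-up word embeddings standing in for a 5-word input sequence
inputs = rng.normal(size=(5, emb_size))
for x_t in inputs:
    # the new state depends on the current input and on the previous state
    h = rnn_step(x_t, h, W_x, W_h, b)
print(h)
```

Because the same weights are reused at every time step, the final state `h` is a summary of the whole sequence seen so far.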

The network that we trained in the previous section is not a vanilla recurrent neural network, but a **long short-term memory (LSTM)** network, which is a more powerful variant. If you're interested in knowing how the LSTM works, the blog post mentioned above is a great introduction.

For comparison, let's now train a simple RNN instead of an LSTM:

In [None]:
run_lm(name='RNN', cell='RNN', train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

Notice that the simple RNN performs much worse than the LSTM (the perplexities are much higher).

*   How does the simple RNN compare with the LSTM that you trained in the previous section? Are the perplexities lower?
*   Look at the evolution of the train perplexities and compare it with the evolution of the validation perplexities. Does a lower train perplexity automatically correspond to a lower validation perplexity? If this is not the case, why do you think this happens?



### Optimizer

Another important hyperparameter for neural networks is the type of optimizer. Training a neural network means that you give an input to the network, calculate the output and measure how far it is from the expected output; this difference is the **error** or **loss**. To update the parameters based on the error, the **gradient** of the loss with respect to the parameters is calculated. The gradient points in the direction in which the loss increases fastest. In our case, we want to minimize the loss, so we move in the opposite direction (negative gradient). The optimizer then decides how this gradient is used to change the parameters. 

The simplest option is to subtract (a scaled version of) the gradient from the parameters. This optimizer is called **stochastic gradient descent**. In the experiments in the previous section, we used another, more complicated, optimizer called **Adam**.
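
The update rule itself fits in one line. The sketch below applies it to a made-up quadratic loss rather than to a language model; the function name `sgd_step` and the toy loss are assumptions for illustration. In real stochastic gradient descent the gradient is estimated on a small batch of training data at every step, whereas here it is computed exactly.

```python
import numpy as np

def sgd_step(params, grad, lr):
    # subtract a scaled version of the gradient from the parameters
    return params - lr * grad

# toy loss: L(w) = sum((w - target)^2), whose gradient is 2 * (w - target)
target = np.array([1.0, -2.0])
w = np.zeros(2)
for _ in range(100):
    grad = 2.0 * (w - target)
    w = sgd_step(w, grad, lr=0.1)
print(w)
```

The scaling factor `lr` is the learning rate. With `lr=0.1` the distance to the minimum shrinks by a constant factor at every step on this toy loss; with a learning rate that is too large the updates would overshoot and diverge.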

Let's now train a network with stochastic gradient descent instead of Adam:

In [None]:
run_lm(name='SGD', optimizer='SGD', train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)



*   Are the perplexities better than the baseline model?

### Learning rate

Judging from the perplexities above, it seems like the Adam optimizer is the best choice for training our network. However, the interplay between the different hyperparameters of a neural network is complicated, and a specific optimizer may well need a different learning rate. 
Let's try SGD with a learning rate of 1 instead of the default 0.01:

In [None]:
run_lm(name='SGD_1', optimizer='SGD', lr=1.0, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)



*   Does using a larger learning rate improve results for SGD?
*   Is this result better than for the model optimized with Adam?



Maybe using a larger learning rate helps in general? Let's try the same learning rate in combination with Adam:

In [None]:
run_lm(name='Adam_1', lr=1.0, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)



*   Does using a larger learning rate for Adam work better?
*   What can you conclude about the interplay between optimizer and learning rate, based on these experiments?



Let's now reduce the learning rate for Adam, from 0.01 to 0.001:

In [None]:
run_lm(name='Adam_0.001', lr=0.001, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)



*   Is Adam with a learning rate of 0.001 better than with a learning rate of 0.01? 




### Size of the embedding

Let's now take a look at the influence of the size of the LSTM on its performance. By default, we train a model with embeddings of size 64 and a hidden layer of size 128. Let's see what happens if we reduce the size of the embedding:

In [None]:
run_lm(name='emb16', embedding_size=16, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)



*   Is using a smaller embedding better or worse? Is there a large difference?



What about a larger embedding size?

In [None]:
run_lm(name='emb128', embedding_size=128, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

*   Is using a larger embedding better or worse? Is there a large difference?
*   The size of the vocabulary of our language model is 10,000, which is quite small. If we trained a language model with a larger vocabulary size (e.g. 100,000), what do you expect with respect to the relative differences between embedding sizes?

### Size of the hidden layer

We can also change the size of the hidden layer, i.e. the number of neurons in the network. In principle you can also choose to increase the number of layers, but this is mainly useful for large datasets.
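
To get a feeling for what the hidden size means in terms of model capacity, the sketch below counts the trainable parameters of an LSTM layer. It assumes a standard single-layer LSTM with four gates, no peephole connections and no projection layer; the hidden `rnn_lm.py` code may be organized differently, and the function names are made up for this example.

```python
def lstm_param_count(embedding_size, hidden_size):
    # 4 gates, each with weights for [input, previous hidden state] plus a bias vector
    return 4 * ((embedding_size + hidden_size) * hidden_size + hidden_size)

def embedding_param_count(vocab_size, embedding_size):
    # one embedding vector per word in the vocabulary
    return vocab_size * embedding_size

# LSTM parameters for the hidden sizes used in this section (embedding size 64)
for hidden in (64, 128, 256):
    print(hidden, lstm_param_count(64, hidden))

# for comparison: the embedding matrix for a 10,000-word vocabulary
print(embedding_param_count(10000, 64))
```

Under these assumptions the number of LSTM parameters grows roughly quadratically with the hidden size, so doubling the hidden layer more than triples the number of weights that have to be estimated from the same amount of data.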

Let's first test a smaller hidden layer:

In [None]:
run_lm(name='hidden64', hidden_size=64, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

*   Is using a smaller hidden layer better or worse? Is there a large difference?



Let's now train a model with a larger hidden layer:

In [None]:
run_lm(name='hidden256', hidden_size=256, train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

*   Is using a larger hidden layer better or worse? Is there a large difference?


Let's now train a baseline language model (with a hidden size of 128) on the full Penn Treebank data set by using {train/valid/test}_ids_large:


In [None]:
run_lm(name='large', train_ids=train_ids_large, valid_ids=valid_ids_large, test_ids=test_ids_large)



*   Can you directly compare the perplexity of this model with the perplexity of the model trained on the small dataset? Remember how perplexity is calculated. Are the validation and test sets the same for the small and large dataset?



Let's now train a model with a larger hidden size:

In [None]:
run_lm(name='large_hidden256', hidden_size=256, 
              train_ids=train_ids_large, valid_ids=valid_ids_large, test_ids=test_ids_large)



*   Do you see improvement by using a larger hidden size?
*   Is this improvement relatively larger or smaller than the improvement you observed for the small data set?



### Try it yourself

Now try some different values for the hyperparameters that we discussed (cell, optimizer, lr, embedding_size, hidden_size):

In [None]:
# run_lm(name='my_own_model', cell='LSTM', optimizer='Adam', lr=0.01, embedding_size=64, hidden_size=128, 
#        train_ids=train_ids, valid_ids=valid_ids, test_ids=test_ids)

The hyperparameters that we discussed here are only a small subset; there exist many more that can influence the performance of the neural network. Another important class of hyperparameters is related to **regularization**. Since neural networks contain many parameters, they can easily start **overfitting**, which means that the network starts memorizing the training set and no longer generalizes well to new data. Two important regularization methods are dropout (setting a proportion of the neurons to 0 during training) and early stopping (stopping training when the performance on the validation set starts to get worse), but there exist many more.
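
Both methods can be sketched in a few lines. The code below is an illustration under stated assumptions, not the notebook's implementation: the function names, the `patience` parameter and the validation perplexities are made up, and the dropout variant shown is the common "inverted" form, which rescales the surviving activations so that their expected value stays the same.

```python
import numpy as np

def dropout(activations, rate, rng):
    # inverted dropout: zero a fraction `rate` of the values and rescale the rest
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def early_stopping_epoch(valid_perplexities, patience=1):
    """Return the index of the epoch at which training would stop."""
    best = float('inf')
    bad_epochs = 0
    for epoch, ppl in enumerate(valid_perplexities):
        if ppl < best:
            # validation perplexity improved: remember it and reset the counter
            best = ppl
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    # never got worse for `patience` epochs in a row: train until the end
    return len(valid_perplexities) - 1

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), rate=0.5, rng=rng)

# made-up validation perplexities, one value per epoch
toy_valid_ppl = [400.0, 310.0, 290.0, 295.0, 300.0]
stop_epoch = early_stopping_epoch(toy_valid_ppl, patience=1)
print(stop_epoch)
```

In this toy run the validation perplexity starts rising after the third epoch, so training stops there even though the training perplexity might still be going down.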



*   Have you seen examples of overfitting in the networks that we trained in this section? 


