<a href="https://colab.research.google.com/github/cindyellow/DynaLR/blob/main/NLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starter code and data

First, perform the required imports for your code:


In [2]:
import collections
import pickle
import numpy as np
import os
from tqdm import tqdm
import pylab
from six.moves.urllib.request import urlretrieve
from six.moves.urllib.error import HTTPError, URLError
import tarfile
import sys
import itertools
from datetime import datetime

TINY = 1e-30
EPS = 1e-4
nax = np.newaxis

If you're using Colaboratory, the following script creates a folder (here, `CSC413/A1`) in which to download and store the data. If you're not using Colaboratory, set the path to wherever you want the contents stored locally.

You can also manually download and unzip the data from [http://www.cs.toronto.edu/~jba/a1_data.tar.gz] and put them in the same folder as where you store this notebook. 

Feel free to use a different way to access the files *data.pk*, *partially_trained.pk*, and *raw_sentences.txt*.

The file *raw_sentences.txt* contains the sentences that we will be using for this assignment.
These sentences are fairly simple and cover a vocabulary of only 250 words, plus one special `[MASK]` token.





In [3]:
######################################################################
# Setup working directory
######################################################################
# Change this to a local path if running locally
%mkdir -p /content/CSC413/A1/
%cd /content/CSC413/A1

######################################################################
# Helper functions for loading data
######################################################################
# adapted from 
# https://github.com/fchollet/keras/blob/master/keras/datasets/cifar10.py

def get_file(fname,
             origin,
             untar=False,
             extract=False,
             archive_format='auto',
             cache_dir='data'):
    datadir = os.path.join(cache_dir)
    if not os.path.exists(datadir):
        os.makedirs(datadir)

    if untar:
        untar_fpath = os.path.join(datadir, fname)
        fpath = untar_fpath + '.tar.gz'
    else:
        fpath = os.path.join(datadir, fname)
    
    print('File path: %s' % fpath)
    if not os.path.exists(fpath):
        print('Downloading data from', origin)

        error_msg = 'URL fetch failure on {}: {} -- {}'
        try:
            try:
                urlretrieve(origin, fpath)
            # HTTPError is a subclass of URLError, so it must be caught first
            except HTTPError as e:
                raise Exception(error_msg.format(origin, e.code, e.msg))
            except URLError as e:
                raise Exception(error_msg.format(origin, e.errno, e.reason))
        except (Exception, KeyboardInterrupt) as e:
            if os.path.exists(fpath):
                os.remove(fpath)
            raise

    if untar:
        if not os.path.exists(untar_fpath):
            print('Extracting file.')
            with tarfile.open(fpath) as archive:
                archive.extractall(datadir)
        return untar_fpath

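    if extract:
        # Only tar archives are handled here; archive_format is kept for
        # signature compatibility and is otherwise unused.
        with tarfile.open(fpath) as archive:
            archive.extractall(datadir)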

    return fpath

/content/CSC413/A1


In [4]:
# Download the dataset and partially pre-trained model
get_file(fname='a1_data',
         origin='http://www.cs.toronto.edu/~jba/a1_data.tar.gz',
         untar=True)
drive_location = 'data'
PARTIALLY_TRAINED_MODEL = drive_location + '/' + 'partially_trained.pk'
data_location = drive_location + '/' + 'data.pk'

File path: data/a1_data.tar.gz
Downloading data from http://www.cs.toronto.edu/~jba/a1_data.tar.gz
Extracting file.


We have already extracted the 4-grams from this dataset and divided them into training, validation, and test sets.
To inspect this data, run the following:

In [5]:
with open(data_location, 'rb') as f:
    data = pickle.load(f)
print(data['vocab'][0]) # First word in vocab is [MASK] 
print(data['vocab'][1]) 
print(len(data['vocab'])) # Number of words in vocab
print(data['vocab']) # All the words in vocab
print(data['train_inputs'][:10]) # 10 example training instances

[MASK]
all
251
['[MASK]', 'all', 'set', 'just', 'show', 'being', 'money', 'over', 'both', 'years', 'four', 'through', 'during', 'go', 'still', 'children', 'before', 'police', 'office', 'million', 'also', 'less', 'had', ',', 'including', 'should', 'to', 'only', 'going', 'under', 'has', 'might', 'do', 'them', 'good', 'around', 'get', 'very', 'big', 'dr.', 'game', 'every', 'know', 'they', 'not', 'world', 'now', 'him', 'school', 'several', 'like', 'did', 'university', 'companies', 'these', 'she', 'team', 'found', 'where', 'right', 'says', 'people', 'house', 'national', 'some', 'back', 'see', 'street', 'are', 'year', 'home', 'best', 'out', 'even', 'what', 'said', 'for', 'federal', 'since', 'its', 'may', 'state', 'does', 'john', 'between', 'new', ';', 'three', 'public', '?', 'be', 'we', 'after', 'business', 'never', 'use', 'here', 'york', 'members', 'percent', 'put', 'group', 'come', 'by', '$', 'on', 'about', 'last', 'her', 'of', 'could', 'days', 'against', 'times', 'women', 'place', 'think'

In [6]:
data['train_inputs'].shape

(372500, 4)

In [7]:
data['test_inputs'].shape

(46500, 4)

In [8]:
data['valid_inputs'].shape

(46500, 4)

Now `data` is a Python dict which contains the vocabulary, as well as the inputs and targets for all three splits of the data. `data['vocab']` is a list of the 251 words in the dictionary; `data['vocab'][0]` is the word with index 0, and so on. `data['train_inputs']` is a 372,500 x 4 matrix where each row gives the indices of the 4 consecutive context words for one of the 372,500 training cases.
The validation and test sets are handled analogously.
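
To make the index-to-word mapping concrete, here is a minimal sketch that turns rows of word indices back into words. The vocabulary and index rows (`toy_vocab`, `toy_inputs`) and the helper `decode_rows` are hypothetical stand-ins, not part of the starter code; with the real data you would pass `data['vocab']` and a slice of `data['train_inputs']` instead.

```python
import numpy as np

# Hypothetical miniature vocabulary; the real one has 251 entries.
toy_vocab = ['[MASK]', 'all', 'set', 'just', 'show']
# Two hypothetical rows of word indices, shaped like rows of data['train_inputs'].
toy_inputs = np.array([[1, 2, 3, 4],
                       [4, 3, 2, 1]])

def decode_rows(rows, vocab):
    """Map each row of word indices to its list of words."""
    return [[vocab[idx] for idx in row] for row in rows]

decoded = decode_rows(toy_inputs, toy_vocab)
for words in decoded:
    print(' '.join(words))
```

Each printed line is one 4-gram written out in words, which is a quick way to sanity-check that the data loaded correctly.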

Even though you only have to modify two specific locations in the code, you may want to read through this code before starting the assignment. 

# Neural Language Model: Masked Word Prediction

In this part, you will learn to implement and train the neural language model from Figure 1. As described in the previous section, during training we randomly sample one of the $N$ context words and replace it with a `[MASK]` token. The goal is for the network to predict the word that was masked, at the corresponding output word position. In practice, this `[MASK]` token is assigned the index 0 in our dictionary. The weights $W^{(2)}$ = `hid_to_output_weights` now have the shape $NV \times H$, as the output layer has $NV$ neurons, where the first $V$ output units are for predicting the first word, the next $V$ are for predicting the second word, and so on.
We call this *concatenating* output units across all word positions, i.e. the $(v + nV)$-th output unit is for the word $v$ in the vocabulary at the $n$-th output word position.
Note that the softmax is also applied in chunks of $V$, to give a valid probability distribution over the $V$ words for each position. (For simplicity we also include the `[MASK]` token as one of the possible predictions, even though we know the target should never be this token.) Only the output word positions that were masked in the input are included in the cross-entropy loss calculation:

$$C = -\sum_{i}^{B}\sum_{n}^{N}\sum_{v}^{V} m^{(i)}_{n} (t^{(i)}_{v + nV} \log y^{(i)}_{v + nV})$$

Where:
*  $y^{(i)}_{v + nV}$ denotes the output probability prediction from the neural network for the $i$-th training example for the word $v$ in the $n$-th output word. Denoting $z$ as the logits output, we define the output probability $y$ as a softmax on $z$ over contiguous chunks of $V$ units (see also Figure 1): 

$$y^{(i)}_{v + nV} = \frac{e^{z^{(i)}_{v+nV}}}{\sum_{l}^{V} e^{z^{(i)}_{l+nV}}}$$
* $t^{(i)}_{v + nV}  \in \{0,1\}$ is 1 if, for the $i$-th training example, the word $v$ is the $n$-th word in the context, and 0 otherwise
* $m^{(i)}_{n} \in \{0,1\}$ is a mask that is set to 1 if we are predicting the $n$-th word position for the $i$-th example (because we had masked that word in the input), and 0 otherwise
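
As an illustration of these definitions, the following sketch computes the chunked softmax and the masked cross-entropy on a toy batch. The function names `chunked_softmax` and `masked_cross_entropy`, the toy sizes, and the random logits are assumptions made for this example; this is not the assignment's `Model` code.

```python
import numpy as np

def chunked_softmax(z, context_len, vocab_size):
    """Softmax over each contiguous chunk of vocab_size logits; z has shape [B, N*V]."""
    z3 = z.reshape(-1, context_len, vocab_size)
    z3 = z3 - z3.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z3)
    return (e / e.sum(axis=-1, keepdims=True)).reshape(z.shape)

def masked_cross_entropy(y, t, m, vocab_size):
    """C = -sum_i sum_n sum_v m[i, n] * t[i, v + nV] * log y[i, v + nV]."""
    m_expanded = np.repeat(m, vocab_size, axis=1)  # [B, N] -> [B, N*V]
    return -np.sum(m_expanded * t * np.log(y + 1e-30))

# Toy sizes (hypothetical): B=2 examples, N=3 positions, V=4 words.
B, N, V = 2, 3, 4
rng = np.random.default_rng(0)
z = rng.normal(size=(B, N * V))
y = chunked_softmax(z, N, V)

targets = np.array([[1, 2, 3], [3, 1, 2]])  # true word index at each position
t = np.zeros((B, N * V))
for n in range(N):
    t[np.arange(B), targets[:, n] + n * V] = 1.0
m = np.array([[0, 1, 0], [1, 0, 0]])        # one masked position per example

loss = masked_cross_entropy(y, t, m, V)
# Only the masked positions contribute: word 2 at position 1, word 3 at position 0.
expected = -(np.log(y[0, 2 + 1 * V]) + np.log(y[1, 3 + 0 * V]))
```

Because the mask zeroes out every unmasked position, the triple sum collapses to one log-probability per example, which is what `expected` computes by hand.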

There are three classes defined in this part: `Params`, `Activations`, `Model`.
You will make changes to `Model`, but it may help to read through `Params` and `Activations` first.

In [9]:
class Params(object):
    """A class representing the trainable parameters of the model. This class has five fields:
    
           word_embedding_weights, a matrix of size V x D, where V is the number of words in the vocabulary
                   and D is the embedding dimension.
           embed_to_hid_weights, a matrix of size H x ND, where H is the number of hidden units. The first D
                   columns represent connections from the embedding of the first context word, the next D columns
                   for the second context word, and so on. There are N context words.
           hid_bias, a vector of length H
           hid_to_output_weights, a matrix of size NV x H
           output_bias, a vector of length NV"""

    def __init__(self, word_embedding_weights, embed_to_hid_weights, hid_to_output_weights,
                 hid_bias, output_bias):
        self.word_embedding_weights = word_embedding_weights
        self.embed_to_hid_weights = embed_to_hid_weights
        self.hid_to_output_weights = hid_to_output_weights
        self.hid_bias = hid_bias
        self.output_bias = output_bias

    def copy(self):
        return self.__class__(self.word_embedding_weights.copy(), self.embed_to_hid_weights.copy(),
                              self.hid_to_output_weights.copy(), self.hid_bias.copy(), self.output_bias.copy())

    @classmethod
    def zeros(cls, vocab_size, context_len, embedding_dim, num_hid):
        """A constructor which initializes all weights and biases to 0."""
        word_embedding_weights = np.zeros((vocab_size, embedding_dim))
        embed_to_hid_weights = np.zeros((num_hid, context_len * embedding_dim))
        hid_to_output_weights = np.zeros((vocab_size * context_len, num_hid))
        hid_bias = np.zeros(num_hid)
        output_bias = np.zeros(vocab_size * context_len)
        return cls(word_embedding_weights, embed_to_hid_weights, hid_to_output_weights,
                   hid_bias, output_bias)

    @classmethod
    def random_init(cls, init_wt, vocab_size, context_len, embedding_dim, num_hid):
        """A constructor which initializes weights to small random values and biases to 0."""
        word_embedding_weights = np.random.normal(0., init_wt, size=(vocab_size, embedding_dim))
        embed_to_hid_weights = np.random.normal(0., init_wt, size=(num_hid, context_len * embedding_dim))
        hid_to_output_weights = np.random.normal(0., init_wt, size=(vocab_size * context_len, num_hid))
        hid_bias = np.zeros(num_hid)
        output_bias = np.zeros(vocab_size * context_len)
        return cls(word_embedding_weights, embed_to_hid_weights, hid_to_output_weights,
                   hid_bias, output_bias)

    ###### The functions below are Python's somewhat oddball way of overloading operators, so that
    ###### we can do arithmetic on Params instances. You don't need to understand this to do the assignment.

    def __mul__(self, a):
        return self.__class__(a * self.word_embedding_weights,
                              a * self.embed_to_hid_weights,
                              a * self.hid_to_output_weights,
                              a * self.hid_bias,
                              a * self.output_bias)

    def __rmul__(self, a):
        return self * a

    def __add__(self, other):
        return self.__class__(self.word_embedding_weights + other.word_embedding_weights,
                              self.embed_to_hid_weights + other.embed_to_hid_weights,
                              self.hid_to_output_weights + other.hid_to_output_weights,
                              self.hid_bias + other.hid_bias,
                              self.output_bias + other.output_bias)

    def __sub__(self, other):
        return self + -1. * other
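
To see why the overloaded operators are convenient, here is a small sketch of an update step written directly on parameter objects. `TinyParams` is a hypothetical two-field stand-in that copies the operator pattern of `Params`, and the update shown is one common form of gradient descent with momentum; whether the training loop uses exactly this form is an assumption.

```python
import numpy as np

class TinyParams:
    """Hypothetical two-field stand-in for Params, using the same operator pattern."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def __mul__(self, a):
        return self.__class__(a * self.weights, a * self.bias)

    def __rmul__(self, a):
        return self * a

    def __add__(self, other):
        return self.__class__(self.weights + other.weights, self.bias + other.bias)

    def __sub__(self, other):
        return self + -1. * other

params = TinyParams(np.ones((2, 2)), np.zeros(2))
grad = TinyParams(np.full((2, 2), 0.5), np.ones(2))
delta = TinyParams(np.zeros((2, 2)), np.zeros(2))

learning_rate, momentum = 0.1, 0.9  # illustrative values
# One momentum step, applied to every field at once through the operators.
delta = momentum * delta - learning_rate * grad
params = params + delta
```

Without the operators, each of the five fields of `Params` would need its own update line.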

In [10]:
class Activations(object):
    """A class representing the activations of the units in the network. This class has three fields:

        embedding_layer, a B x ND matrix (where B is the batch size, D is the embedding dimension,
                and N is the number of input context words), representing the activations for the embedding
                layer on all the cases in a batch. The first D columns represent the embeddings for the
                first context word, and so on.
        hidden_layer, a B x H matrix representing the hidden layer activations for a batch
        output_layer, a B x NV matrix (V is the vocabulary size) representing the output layer activations for a batch"""

    def __init__(self, embedding_layer, hidden_layer, output_layer):
        self.embedding_layer = embedding_layer
        self.hidden_layer = hidden_layer
        self.output_layer = output_layer

def get_batches(inputs, batch_size, shuffle=True):
    """Divide a dataset (usually the training set) into mini-batches of a given size. This is a
    'generator', i.e. something you can use in a for loop. You don't need to understand how it
    works to do the assignment."""

    if inputs.shape[0] % batch_size != 0:
        raise RuntimeError('The number of data points must be a multiple of the batch size.')
    num_batches = inputs.shape[0] // batch_size

    if shuffle:
        idxs = np.random.permutation(inputs.shape[0])
        inputs = inputs[idxs, :]

    for m in range(num_batches):
        yield inputs[m * batch_size:(m + 1) * batch_size, :]
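
A brief usage sketch for the generator follows. It repeats a compact copy of `get_batches` so that it runs on its own, and `toy_inputs` is a hypothetical array standing in for the real training inputs.

```python
import numpy as np

def get_batches(inputs, batch_size, shuffle=True):
    # Compact copy of the generator above so this snippet runs on its own.
    if inputs.shape[0] % batch_size != 0:
        raise RuntimeError('The number of data points must be a multiple of the batch size.')
    if shuffle:
        inputs = inputs[np.random.permutation(inputs.shape[0]), :]
    for m in range(inputs.shape[0] // batch_size):
        yield inputs[m * batch_size:(m + 1) * batch_size, :]

toy_inputs = np.arange(24).reshape(6, 4)  # 6 hypothetical examples, 4 context words each
batch_shapes = [batch.shape
                for batch in get_batches(toy_inputs, batch_size=2, shuffle=False)]
print(batch_shapes)
```

Note that the divisibility check only runs once iteration starts, since the body of a generator function is not executed until the first item is requested.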

In this part of the assignment, you implement a method which computes the gradient using backpropagation.
To start you out, the *Model* class contains several important methods used in training:


*   `compute_activations` computes the activations of all units on a given input batch
*   `compute_loss_derivative` computes the gradient with respect to the output logits $\frac{\partial C}{\partial z}$
*   `evaluate` computes the average cross-entropy loss for a given set of inputs and targets
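
As a sketch of where the quantity returned by `compute_loss_derivative` comes from, assume each masked position has exactly one target word, so that $\sum_{v}^{V} t^{(i)}_{v+nV} = 1$. Differentiating the log of the chunked softmax with respect to the logit of any word $w$ at the same position $n$ gives

$$\frac{\partial \log y^{(i)}_{v+nV}}{\partial z^{(i)}_{w+nV}} = \delta_{vw} - y^{(i)}_{w+nV}$$

where $\delta_{vw}$ (notation introduced only for this sketch) is 1 when $v = w$ and 0 otherwise. Substituting into the loss $C$ defined earlier:

$$\frac{\partial C}{\partial z^{(i)}_{w+nV}} = -m^{(i)}_{n} \sum_{v}^{V} t^{(i)}_{v+nV} \left(\delta_{vw} - y^{(i)}_{w+nV}\right) = m^{(i)}_{n} \left(y^{(i)}_{w+nV} - t^{(i)}_{w+nV}\right)$$

So, under that assumption, the gradient with respect to the logits is the masked difference between predictions and targets.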

You will need to complete the implementation of two additional methods to complete the training, and print the outputs of the gradients. 

## Implement gradient with respect to parameters
`back_propagate` is the function which computes the gradient of the loss with respect to model parameters using backpropagation.
It uses the derivatives computed by *compute_loss_derivative*.
Some parts are already filled in for you, but you need to compute the matrices of derivatives for `embed_to_hid_weights`, `hid_bias`, `hid_to_output_weights`, and `output_bias`.
These matrices have the same sizes as the parameter matrices (see previous section). Look for the `## YOUR CODE HERE ##` comment for where to complete the code.

In order to implement backpropagation efficiently, you need to express the computations in terms of matrix operations, rather than *for* loops.
You should first work through the derivatives on pencil and paper.
First, apply the chain rule to compute the derivatives with respect to individual units, weights, and biases.
Next, take the formulas you've derived, and express them in matrix form.
You should be able to express all of the required computations using only matrix multiplication, matrix transpose, and elementwise operations --- no *for* loops!
If you want inspiration, read through the code for *Model.compute_activations* and try to understand how the matrix operations correspond to the computations performed by all the units in the network.
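
One way to check a matrix-form derivation is to compare it against finite differences. The sketch below does this for a single sigmoid layer with a toy squared-error loss; the functions `forward`, `loss`, and `gradients` and all sizes are hypothetical, chosen to keep the example short, and are not part of the assignment's model.

```python
import numpy as np

def forward(W, b, X):
    """Sigmoid layer in matrix form: H = sigmoid(X W^T + b)."""
    return 1. / (1. + np.exp(-(X.dot(W.T) + b)))

def loss(W, b, X, T):
    """Toy squared-error loss, chosen only to keep the example short."""
    return 0.5 * np.sum((forward(W, b, X) - T) ** 2)

def gradients(W, b, X, T):
    """Backpropagation using only matrix products and elementwise operations."""
    H = forward(W, b, X)
    dZ = (H - T) * H * (1. - H)    # dC/dZ, one row per example
    return dZ.T.dot(X), dZ.sum(0)  # dC/dW has the shape of W, dC/db the shape of b

rng = np.random.default_rng(0)
X, T = rng.normal(size=(5, 3)), rng.uniform(size=(5, 2))
W, b = rng.normal(size=(2, 3)), np.zeros(2)
dW, db = gradients(W, b, X, T)

# Central finite difference for one weight entry as an independent check.
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(W_plus, b, X, T) - loss(W_minus, b, X, T)) / (2 * eps)
print(abs(numeric - dW[1, 2]))
```

The same pattern (sum over the batch via a transposed matrix product for weights, a column sum for biases) is what the matrix operations in `compute_activations` suggest for the full network.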

*Hint: Your implementations should also be similar to* `hid_to_output_weights_grad` *and* `hid_bias_grad` *in the same function.*

*Hint: To prompt a GPT-like model, you may include only the functions that are relevant to the implementation in your prompt.*
 

In [11]:
class Model(object):
    """A class representing the language model itself. This class contains various methods used in training
    the model and visualizing the learned representations. It has two fields:

        params, a Params instance which contains the model parameters
        vocab, a list containing all the words in the dictionary; vocab[0] is the word with index
               0, and so on."""

    def __init__(self, params, vocab):
        self.params = params
        self.vocab = vocab

        self.vocab_size = len(vocab)
        self.embedding_dim = self.params.word_embedding_weights.shape[1]
        self.embedding_layer_dim = self.params.embed_to_hid_weights.shape[1]
        self.context_len = self.embedding_layer_dim // self.embedding_dim
        self.num_hid = self.params.embed_to_hid_weights.shape[0]

    def copy(self):
        return self.__class__(self.params.copy(), self.vocab[:])

    @classmethod
    def random_init(cls, init_wt, vocab, context_len, embedding_dim, num_hid):
        """Constructor which randomly initializes the weights to Gaussians with standard deviation init_wt
        and initializes the biases to all zeros."""
        params = Params.random_init(init_wt, len(vocab), context_len, embedding_dim, num_hid)
        return Model(params, vocab)

    def indicator_matrix(self, targets, mask_zero_index=True):
        """Construct a matrix where the (v + n*V)th entry of row i is 1 if the n-th target word
         for example i is v, and all other entries are 0.

         Note: if the n-th target word index is 0, this corresponds to the [MASK] token,
               and we set the entry to be 0. 
        """
        batch_size, context_len = targets.shape
        expanded_targets = np.zeros((batch_size, context_len * len(self.vocab)))
        offset = np.repeat((np.arange(context_len) * len(self.vocab))[np.newaxis, :], batch_size, axis=0) # [[0, V, 2V], [0, V, 2V], ...]
        targets_offset = targets + offset

        for c in range(context_len):
          expanded_targets[np.arange(batch_size), targets_offset[:,c]] = 1.
          if mask_zero_index:
            # Note: Set the targets with index 0, V, 2V to be zero since it corresponds to the [MASK] token
            expanded_targets[np.arange(batch_size), offset[:,c]] = 0. 
        return expanded_targets

    def compute_loss_derivative(self, output_activations, expanded_target_batch, target_mask):
        """Compute the gradient of cross-entropy loss wrt output logits z
        
            For example:

         [y_{0} ... y_{V-1}] [y_{V} ... y_{2*V-1}] [y_{2*V} ... y_{3*V-1}] [y_{3*V} ... y_{4*V-1}]
                  
         Where for column v + n*V,

            y_{v + n*V} = e^{z_{v + n*V}} / \sum_{m=0}^{V-1} e^{z_{m + n*V}}, for n=0,...,N-1

        This function should return a dC / dz matrix of size [batch_size x (vocab_size * context_len)],
        where each row i in dC / dz has columns 0 to V-1 containing the gradient for the 1st output
        context word from the i-th training example, then columns vocab_size to 2*vocab_size - 1 for the 2nd
        output context word of the i-th training example, etc.

        C is the loss function summed across all examples as well:

            C = -\sum_{i,j,n} mask_{i,n} (t_{i, j + n*V} log y_{i, j + n*V}), for j=0,...,V-1, and n=0,...,N-1

        where mask_{i,n} = 1 if the i-th training example has n-th context word as the target, 
        otherwise mask_{i,n} = 0.
        
        Args:
          output_activations: A [batch_size x (context_len * vocab_size)] matrix, 
              for the activations of the output layer, i.e. the y_j's.
          expanded_target_batch: A [batch_size x (context_len * vocab_size)] matrix, 
              where expanded_target_batch[i,n*V:(n+1)*V] is the indicator vector for 
              the n-th context target word position, i.e. the (i, j + n*V) entry is 1 if the 
              i'th example, the context word at position n is j, and 0 otherwise.
          target_mask: A [batch_size x context_len x 1] tensor, where target_mask[i,n,0] = 1 
              if for the i'th example the n-th context word is a target position, otherwise 0
        
        Outputs:
          loss_derivative: A [batch_size x (context_len * vocab_size)] matrix,
              where loss_derivative[i,0:vocab_size] contains the gradient
              dC / dz_0 for the i-th training example gradient for 1st output 
              context word, and loss_derivative[i,vocab_size:2*vocab_size] for 
              the 2nd output context word of the i-th training example, etc.
        """
        # Reshape output_activations and expanded_target_batch and use broadcasting
        output_activations_reshape = output_activations.reshape(-1, self.context_len, len(self.vocab))
        expanded_target_batch_reshape = expanded_target_batch.reshape(-1, self.context_len, len(self.vocab))
        gradient_masked_reshape =  target_mask * (output_activations_reshape - expanded_target_batch_reshape)
        gradient_masked = gradient_masked_reshape.reshape(-1, self.context_len * len(self.vocab))
        return gradient_masked

    def compute_loss(self, output_activations, expanded_target_batch, target_mask):
        """Compute the total cross entropy loss over a mini-batch.

        Args:
          output_activations: [batch_size x (context_len * vocab_size)] matrix, 
                for the activations of the output layer, i.e. the y_j's.
          expanded_target_batch: [batch_size x (context_len * vocab_size)] matrix,
                where expanded_target_batch[i,n*V:(n+1)*V] is the indicator vector for
                the n-th context target word position, i.e. the (i, j + n*V) entry is 1 if, for the
                i'th example, the context word at position n is j, and 0 otherwise.
          target_mask: A [batch_size x context_len x 1] tensor, where target_mask[i,n,0] = 1 
                if for the i'th example the n-th context word is a target position, otherwise 0
        
        Returns:
          loss: a scalar for the total cross entropy loss over the batch, 
                defined in Part 3
        """
        ###########################   YOUR CODE HERE  ##############################
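        # Broadcast the per-position mask over each chunk of V output units
        mask = target_mask.reshape(target_mask.shape[0], target_mask.shape[1])
        mask = np.repeat(mask, self.vocab_size, axis=1)
        # TINY guards against log(0) when a predicted probability underflows
        loss = -np.sum(mask * expanded_target_batch * np.log(output_activations + TINY))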
        ############################################################################
        return loss

    def compute_activations(self, inputs):
        """Compute the activations on a batch given the inputs. Returns an Activations instance.
        You should try to read and understand this function, since this will give you clues for
        how to implement back_propagate."""

        batch_size = inputs.shape[0]
        if inputs.shape[1] != self.context_len:
            raise RuntimeError('Dimension of the input vectors should be {}, but is instead {}'.format(
                self.context_len, inputs.shape[1]))

        # Embedding layer
        # Look up the input word indices in the word_embedding_weights matrix
        embedding_layer_state = self.params.word_embedding_weights[inputs.reshape([-1]), :].reshape([batch_size, self.embedding_layer_dim])

        # Hidden layer
        inputs_to_hid = np.dot(embedding_layer_state, self.params.embed_to_hid_weights.T) + \
                        self.params.hid_bias
        # Apply logistic activation function
        hidden_layer_state = 1. / (1. + np.exp(-inputs_to_hid))

        # Output layer
        inputs_to_softmax = np.dot(hidden_layer_state, self.params.hid_to_output_weights.T) + \
                            self.params.output_bias

        # Subtract maximum.
        # Remember that adding or subtracting the same constant from each input to a
        # softmax unit does not affect the outputs. So subtract the maximum to
        # make all inputs <= 0. This prevents overflows when computing their exponents.
        inputs_to_softmax -= inputs_to_softmax.max(1).reshape((-1, 1))

        # Take the softmax over each chunk of V units in the output layer
        output_layer_state = np.exp(inputs_to_softmax)
        output_layer_state_shape = output_layer_state.shape
        output_layer_state = output_layer_state.reshape((-1, self.context_len, len(self.vocab)))
        output_layer_state /= output_layer_state.sum(axis=-1, keepdims=True) # Softmax along vocab of each target word
        output_layer_state = output_layer_state.reshape(output_layer_state_shape) # Flatten back to 2D matrix

        return Activations(embedding_layer_state, hidden_layer_state, output_layer_state)

    def back_propagate(self, input_batch, activations, loss_derivative):
        """Compute the gradient of the loss function with respect to the trainable parameters
        of the model.
        
        Part of this function is already completed, but you need to fill in the derivative
        computations for hid_to_output_weights_grad, output_bias_grad, embed_to_hid_weights_grad,
        and hid_bias_grad. See the documentation for the Params class for a description of what
        these matrices represent.

        Args: 
          input_batch: A [batch_size x context_length] matrix containing the 
              indices of the context words
          activations: an Activations object representing the output of 
              Model.compute_activations
          loss_derivative:  A [batch_size x (context_len * vocab_size)] matrix,
              where loss_derivative[i,0:vocab_size] contains the gradient
              dC / dz_0 for the i-th training example gradient for 1st output 
              context word, and loss_derivative[i,vocab_size:2*vocab_size] for 
              the 2nd output context word of the i-th training example, etc.
              Obtained from calling compute_loss_derivative()
          
        Returns:
          Params object containing the gradient for word_embedding_weights_grad, 
              embed_to_hid_weights_grad, hid_to_output_weights_grad,
              hid_bias_grad, output_bias_grad  
        """

        # The matrix with values dC / dz_j, where z_j is the input to the jth hidden unit,
        # i.e. h_j = 1 / (1 + e^{-z_j})
        hid_deriv = np.dot(loss_derivative, self.params.hid_to_output_weights) \
                    * activations.hidden_layer * (1. - activations.hidden_layer)

        
        hid_to_output_weights_grad = np.dot(loss_derivative.T, activations.hidden_layer)
        
        ###########################   YOUR CODE HERE  ##############################
        output_bias_grad = loss_derivative.sum(0)
        embed_to_hid_weights_grad = np.dot(hid_deriv.T, activations.embedding_layer)
        ############################################################################
        
        hid_bias_grad = hid_deriv.sum(0)

        # The matrix of derivatives for the embedding layer
        embed_deriv = np.dot(hid_deriv, self.params.embed_to_hid_weights)

        # Word Embedding Weights gradient
        word_embedding_weights_grad = np.dot(self.indicator_matrix(input_batch.reshape([-1,1]), mask_zero_index=False).T, 
                                                 embed_deriv.reshape([-1, self.embedding_dim]))

        return Params(word_embedding_weights_grad, embed_to_hid_weights_grad, hid_to_output_weights_grad,
                      hid_bias_grad, output_bias_grad)

    def sample_input_mask(self, batch_size):
        """Samples a binary mask for the inputs of size batch_size x context_len
        For each row, exactly one element will be 1.
        """
        mask_idx = np.random.randint(self.context_len, size=(batch_size,))
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        mask = np.zeros((batch_size, self.context_len), dtype=int)  # Convert to one hot B x N, B batch size, N context len
        mask[np.arange(batch_size), mask_idx] = 1
        return mask
    
    def evaluate(self, inputs, batch_size=100):
        """Compute the average cross-entropy over a dataset.

            inputs: matrix of shape [num_examples x context_len]"""

        ndata = inputs.shape[0]

        total = 0.
        for input_batch in get_batches(inputs, batch_size):
            mask = self.sample_input_mask(batch_size)
            input_batch_masked = input_batch * (1 - mask)
            activations = self.compute_activations(input_batch_masked)
            expanded_target_batch = self.indicator_matrix(input_batch)
            target_mask = np.expand_dims(mask, axis=2)
            cross_entropy = self.compute_loss(activations.output_layer, expanded_target_batch, target_mask)
            total += cross_entropy

        return total / float(ndata)

    def display_nearest_words(self, word, k=10):
        """List the k words nearest to a given word, along with their distances."""

        if word not in self.vocab:
            print('Word "{}" not in vocabulary.'.format(word))
            return

        # Compute distance to every other word.
        idx = self.vocab.index(word)
        word_rep = self.params.word_embedding_weights[idx, :]
        diff = self.params.word_embedding_weights - word_rep.reshape((1, -1))
        distance = np.sqrt(np.sum(diff ** 2, axis=1))

        # Sort by distance.
        order = np.argsort(distance)
        order = order[1:1 + k]  # The nearest word is the query word itself, skip that.
        for i in order:
            print('{}: {}'.format(self.vocab[i], distance[i]))

    def word_distance(self, word1, word2):
        """Compute the distance between the vector representations of two words."""

        if word1 not in self.vocab:
            raise RuntimeError('Word "{}" not in vocabulary.'.format(word1))
        if word2 not in self.vocab:
            raise RuntimeError('Word "{}" not in vocabulary.'.format(word2))

        idx1, idx2 = self.vocab.index(word1), self.vocab.index(word2)
        word_rep1 = self.params.word_embedding_weights[idx1, :]
        word_rep2 = self.params.word_embedding_weights[idx2, :]
        diff = word_rep1 - word_rep2
        return np.sqrt(np.sum(diff ** 2))
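
The nearest-word lookup above can be tried on a toy embedding matrix without training a model. In this sketch, `toy_vocab`, `toy_embeddings`, and `nearest_words` are hypothetical; the distance is the same Euclidean distance used by `display_nearest_words`, but the query word is dropped by index rather than by assuming it sorts first.

```python
import numpy as np

toy_vocab = ['he', 'she', 'money', 'million']  # hypothetical words
toy_embeddings = np.array([[0.0, 1.0],
                           [0.1, 0.9],
                           [2.0, -1.0],
                           [1.8, -1.2]])

def nearest_words(word, vocab, embeddings, k=2):
    """Return the k closest words to `word` with their Euclidean distances."""
    idx = vocab.index(word)
    distance = np.sqrt(np.sum((embeddings - embeddings[idx]) ** 2, axis=1))
    order = np.argsort(distance)
    order = order[order != idx][:k]  # drop the query word itself
    return [(vocab[i], float(distance[i])) for i in order]

result = nearest_words('he', toy_vocab, toy_embeddings, k=2)
for neighbour, dist in result:
    print('{}: {:.4f}'.format(neighbour, dist))
```

With a trained model, words that appear in similar contexts would be expected to end up close together in the same way the hand-picked rows do here.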

## Run model training with different optimizers

Once you've implemented the gradient computation, you'll need to train the model.
The function *train* implements the main training procedure.
It takes two arguments:


*   `embedding_dim`: The number of dimensions in the distributed representation.
*   `num_hid`: The number of hidden units


As the model trains, the script prints out some numbers that tell you how well the training is going.
It shows:


*   The cross entropy on the last 100 mini-batches of the training set. This is shown after every 100 mini-batches.
*   The cross entropy on the entire validation set every 1000 mini-batches of training.

At the end of training, this function shows the cross entropies on the training, validation and test sets.
It will return a *Model* instance.

In [19]:
#### THIS IS THE CODE CHUNK

_train_inputs = None
_train_targets = None
_vocab = None

DEFAULT_TRAINING_CONFIG = {'batch_size': 3725,  # the size of a mini-batch
                           
                           #'learning_rate': 0.1,  # the learning rate
                           'fixed_learning_rates': [0.01, 0.1, 0.5],
                          #  'fixed_learning_rates': [0.1],
                           'momentum': 0.9,  # the decay parameter for the momentum vector
                           'epochs': 50,  # the maximum number of epochs to run
                           'init_wt': 0.01,  # the standard deviation of the initial random weights
                           'context_len': 4,  # the number of context words used
                           'show_training_CE_after': 10,  # measure training error after this many mini-batches
                           'show_validation_CE_after': 10,  # measure validation error after this many mini-batches
                           }


def find_occurrences(word1, word2, word3):
    """Lists all the words that followed a given tri-gram in the training set and the number of
    times each one followed it."""

    # cache the data so we don't keep reloading
    global _train_inputs, _train_targets, _vocab
    if _train_inputs is None:
        with open(data_location, 'rb') as f:
            data_obj = pickle.load(f)
        _vocab = data_obj['vocab']
        _train_inputs, _train_targets = data_obj['train_inputs'], data_obj['train_targets']

    if word1 not in _vocab:
        raise RuntimeError('Word "{}" not in vocabulary.'.format(word1))
    if word2 not in _vocab:
        raise RuntimeError('Word "{}" not in vocabulary.'.format(word2))
    if word3 not in _vocab:
        raise RuntimeError('Word "{}" not in vocabulary.'.format(word3))

    idx1, idx2, idx3 = _vocab.index(word1), _vocab.index(word2), _vocab.index(word3)
    idxs = np.array([idx1, idx2, idx3])

    matches = np.all(_train_inputs == idxs.reshape((1, -1)), 1)

    if np.any(matches):
        counts = collections.defaultdict(int)
        for m in np.where(matches)[0]:
            counts[_vocab[_train_targets[m]]] += 1

        word_counts = sorted(list(counts.items()), key=lambda t: t[1], reverse=True)
        print('The tri-gram "{} {} {}" was followed by the following words in the training set:'.format(
            word1, word2, word3))
        for word, count in word_counts:
            if count > 1:
                print('    {} ({} times)'.format(word, count))
            else:
                print('    {} (1 time)'.format(word))
    else:
        print('The tri-gram "{} {} {}" did not occur in the training set.'.format(word1, word2, word3))


def train(embedding_dim, num_hid, config=DEFAULT_TRAINING_CONFIG):
    """This is the main training routine for the language model. It takes two parameters:

        embedding_dim, the dimension of the embedding space
        num_hid, the number of hidden units."""

    ########################
    # For reproducibility
    np.random.seed(123)
    ########################

    # Load the data
    with open(data_location, 'rb') as f:
        data_obj = pickle.load(f)
    vocab = data_obj['vocab']
    train_inputs = data_obj['train_inputs']
    valid_inputs = data_obj['valid_inputs']
    test_inputs = data_obj['test_inputs']

    # Randomly initialize the trainable parameters

    model = Model.random_init(config['init_wt'], vocab, config['context_len'], embedding_dim, num_hid)

    # Variables used for early stopping
    best_valid_CE = np.inf  # np.infty was removed in NumPy 2.0
    end_training = False

    # Initialize the momentum vector to all zeros
    momentum_delta = Params.zeros(len(vocab), config['context_len'], embedding_dim, num_hid)

    start_time = datetime.now()

    this_chunk_CE = 0.
    batch_count = 0
    for epoch in range(1, config['epochs'] + 1):  
        if end_training:
            break

        print()
        print('Epoch', epoch)

        for m, (input_batch) in enumerate(get_batches(train_inputs, config['batch_size'])):
            batch_count += 1
            print('Batch {}'.format(batch_count))

            # For each example (row in input_batch), select one word to mask out
            mask = model.sample_input_mask(config['batch_size'])
            input_batch_masked = input_batch * (1 - mask) # We only zero out one word per row

            # Forward propagate
            activations = model.compute_activations(input_batch_masked)

            # Compute loss derivative            
            expanded_target_batch = model.indicator_matrix(input_batch)
            loss_derivative = model.compute_loss_derivative(activations.output_layer, expanded_target_batch, mask[:,:, np.newaxis])
            loss_derivative /= config['batch_size']

            # Measure loss function
            cross_entropy = model.compute_loss(activations.output_layer, expanded_target_batch, np.expand_dims(mask, axis=2)) / config['batch_size']
            this_chunk_CE += cross_entropy
            if batch_count % config['show_training_CE_after'] == 0:
                print('Batch {} Train CE {:1.3f}'.format(
                    batch_count, this_chunk_CE / config['show_training_CE_after']))
                this_chunk_CE = 0.

            # Backpropagate
            loss_gradient = model.back_propagate(input_batch, activations, loss_derivative)

            # Update the momentum vector and model parameters
            momentum_delta = config['momentum'] * momentum_delta + loss_gradient
            fixed_delta = loss_gradient

            #####################################
            best_loss = np.inf
            best_params_i = -1
            fixed_learning_rates = config['fixed_learning_rates']

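            # Without momentum: try each learning rate on the plain gradient,
            # score it on the validation set, then undo the trial step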
            for i in range(len(fixed_learning_rates)):
                learning_rate = fixed_learning_rates[i]
                model.params -= learning_rate * fixed_delta
                curr_loss = model.evaluate(valid_inputs)
                if curr_loss < best_loss:
                    best_loss = curr_loss
                    best_params_i = i
                model.params += learning_rate * fixed_delta

            # With momentum: same trial-and-undo search along the momentum vector
            for i in range(len(fixed_learning_rates)):
                learning_rate = fixed_learning_rates[i]
                model.params -= learning_rate * momentum_delta
                curr_loss = model.evaluate(valid_inputs)
                if curr_loss < best_loss:
                    best_loss = curr_loss
                    best_params_i = i + len(fixed_learning_rates)
                model.params += learning_rate * momentum_delta

            # Apply the candidate update that gave the lowest validation loss
            if best_params_i >= len(fixed_learning_rates):
                learning_rate = fixed_learning_rates[best_params_i - len(fixed_learning_rates)]
                model.params -= learning_rate * momentum_delta
            else:
                learning_rate = fixed_learning_rates[best_params_i]
                model.params -= learning_rate * fixed_delta
            #####################################


            # Validate
            if batch_count % config['show_validation_CE_after'] == 0:
                print('Running validation...')
                cross_entropy = model.evaluate(valid_inputs)
                print('Validation cross-entropy: {:1.3f}'.format(cross_entropy))

                if cross_entropy > best_valid_CE:
                    print('Validation error increasing!  Training stopped.')
                    end_training = True
                    break

                best_valid_CE = cross_entropy

    print()
    end_time = datetime.now()
    print('Duration: {}'.format(end_time - start_time))
    train_CE = model.evaluate(train_inputs)
    print('Final training cross-entropy: {:1.3f}'.format(train_CE))
    valid_CE = model.evaluate(valid_inputs)
    print('Final validation cross-entropy: {:1.3f}'.format(valid_CE))
    test_CE = model.evaluate(test_inputs)
    print('Final test cross-entropy: {:1.3f}'.format(test_CE))

    return model
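
The update step in `train` tries every candidate learning rate, once along the raw gradient and once along the momentum vector, and keeps the step with the lowest validation loss. The sketch below isolates that idea on a toy quadratic problem. Everything in it (`CURVATURE`, `toy_loss`, `toy_grad`, `select_step`) is a hypothetical stand-in, with `toy_loss` playing the role of `model.evaluate(valid_inputs)`. It also builds each candidate as a new array instead of applying and undoing the step in place.

```python
import numpy as np

# Hypothetical toy problem: a quadratic bowl with its minimum at w = 3
CURVATURE = np.array([1.0, 10.0])


def toy_loss(w):
    """Stand-in for the validation loss returned by model.evaluate(valid_inputs)."""
    return float(0.5 * np.sum(CURVATURE * (w - 3.0) ** 2))


def toy_grad(w):
    """Gradient of toy_loss with respect to w."""
    return CURVATURE * (w - 3.0)


def select_step(w, velocity, learning_rates, momentum=0.9):
    """Try every (direction, learning rate) pair and keep the lowest-loss update."""
    g = toy_grad(w)
    velocity = momentum * velocity + g  # same momentum update as in train
    best_loss, best_w = np.inf, w
    for direction in (g, velocity):  # without momentum, then with momentum
        for lr in learning_rates:
            candidate = w - lr * direction
            candidate_loss = toy_loss(candidate)
            if candidate_loss < best_loss:
                best_loss, best_w = candidate_loss, candidate
    return best_w, velocity, best_loss


w = np.zeros(2)
velocity = np.zeros(2)
history = []
for _ in range(5):
    w, velocity, current_loss = select_step(w, velocity, [0.01, 0.1, 0.5])
    history.append(current_loss)
print(history)
```

As in `train`, one of the candidates is always applied, even if none of them improves on the current loss. Each step costs one loss evaluation per candidate, which in `train` means six passes over the validation set per mini-batch with the default configuration.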

Run the training.


In [18]:
# Varied learning rate
embedding_dim = 16
num_hid = 128
trained_model = train(embedding_dim, num_hid)


Epoch 1
Batch 1


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  mask = np.zeros((batch_size, self.context_len), dtype=np.int)# Convert to one hot B x N, B batch size, N context len


Batch 2
Batch 3
Batch 4
Batch 5
Batch 6
Batch 7
Batch 8
Batch 9
Batch 10
Batch 10 Train CE 5.067
Running validation...
Validation cross-entropy: 4.725
Batch 11
Batch 12
Batch 13
Batch 14
Batch 15
Batch 16
Batch 17
Batch 18
Batch 19
Batch 20
Batch 20 Train CE 4.677
Running validation...
Validation cross-entropy: 4.630
Batch 21
Batch 22
Batch 23
Batch 24
Batch 25
Batch 26
Batch 27
Batch 28
Batch 29
Batch 30
Batch 30 Train CE 4.609
Running validation...
Validation cross-entropy: 4.596
Batch 31
Batch 32
Batch 33
Batch 34
Batch 35
Batch 36
Batch 37
Batch 38
Batch 39
Batch 40
Batch 40 Train CE 4.599
Running validation...
Validation cross-entropy: 4.588
Batch 41
Batch 42
Batch 43
Batch 44
Batch 45
Batch 46
Batch 47
Batch 48
Batch 49
Batch 50
Batch 50 Train CE 4.584
Running validation...
Validation cross-entropy: 4.586
Batch 51
Batch 52
Batch 53
Batch 54
Batch 55
Batch 56
Batch 57
Batch 58
Batch 59
Batch 60
Batch 60 Train CE 4.572
Running validation...
Validation cross-entropy: 4.576
Batch 61


In [20]:
# SGD without momentum
embedding_dim = 16
num_hid = 128
trained_model = train(embedding_dim, num_hid)


Epoch 1
Batch 1


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  mask = np.zeros((batch_size, self.context_len), dtype=np.int)# Convert to one hot B x N, B batch size, N context len


Batch 2
Batch 3
Batch 4
Batch 5
Batch 6
Batch 7
Batch 8
Batch 9
Batch 10
Batch 10 Train CE 5.458
Running validation...
Validation cross-entropy: 5.378
Batch 11
Batch 12
Batch 13
Batch 14
Batch 15
Batch 16
Batch 17
Batch 18
Batch 19
Batch 20
Batch 20 Train CE 5.317
Running validation...
Validation cross-entropy: 5.248
Batch 21
Batch 22
Batch 23
Batch 24
Batch 25
Batch 26
Batch 27
Batch 28
Batch 29
Batch 30
Batch 30 Train CE 5.203
Running validation...
Validation cross-entropy: 5.153
Batch 31
Batch 32
Batch 33
Batch 34
Batch 35
Batch 36
Batch 37
Batch 38
Batch 39
Batch 40
Batch 40 Train CE 5.121
Running validation...
Validation cross-entropy: 5.090
Batch 41
Batch 42
Batch 43
Batch 44
Batch 45
Batch 46
Batch 47
Batch 48
Batch 49
Batch 50
Batch 50 Train CE 5.057
Running validation...
Validation cross-entropy: 5.027
Batch 51
Batch 52
Batch 53
Batch 54
Batch 55
Batch 56
Batch 57
Batch 58
Batch 59
Batch 60
Batch 60 Train CE 4.997
Running validation...
Validation cross-entropy: 4.972
Batch 61


In [16]:
# SGD with momentum
embedding_dim = 16
num_hid = 128
trained_model = train(embedding_dim, num_hid)


Epoch 1
Batch 1


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  mask = np.zeros((batch_size, self.context_len), dtype=np.int)# Convert to one hot B x N, B batch size, N context len


Batch 2
Batch 3
Batch 4
Batch 5
Batch 6
Batch 7
Batch 8
Batch 9
Batch 10
Batch 10 Train CE 5.345
Running validation...
Validation cross-entropy: 5.076
Batch 11
Batch 12
Batch 13
Batch 14
Batch 15
Batch 16
Batch 17
Batch 18
Batch 19
Batch 20
Batch 20 Train CE 4.908
Running validation...
Validation cross-entropy: 4.773
Batch 21
Batch 22
Batch 23
Batch 24
Batch 25
Batch 26
Batch 27
Batch 28
Batch 29
Batch 30
Batch 30 Train CE 4.719
Running validation...
Validation cross-entropy: 4.674
Batch 31
Batch 32
Batch 33
Batch 34
Batch 35
Batch 36
Batch 37
Batch 38
Batch 39
Batch 40
Batch 40 Train CE 4.652
Running validation...
Validation cross-entropy: 4.640
Batch 41
Batch 42
Batch 43
Batch 44
Batch 45
Batch 46
Batch 47
Batch 48
Batch 49
Batch 50
Batch 50 Train CE 4.627
Running validation...
Validation cross-entropy: 4.611
Batch 51
Batch 52
Batch 53
Batch 54
Batch 55
Batch 56
Batch 57
Batch 58
Batch 59
Batch 60
Batch 60 Train CE 4.601
Running validation...
Validation cross-entropy: 4.597
Batch 61
