# CS 224D Assignment #2
# Part [2]: Recurrent Neural Networks

This notebook provides starter code, testing snippets, and additional guidance for implementing the Recurrent Neural Network Language Model (RNNLM) described in Part 2 of the handout.

Please complete parts (a), (b), and (c) of Part 2 before beginning this notebook.

In [2]:
import sys, os
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
matplotlib.rcParams['savefig.dpi'] = 100

%load_ext autoreload
%autoreload 2

## (e): Implement a Recurrent Neural Network Language Model

Follow the instructions on the handout to implement your model in `rnnlm.py`, then use the code below to test.
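
For intuition about what a gradient check does, here is a minimal sketch that compares an analytic gradient against a centered finite-difference estimate on a toy loss. The helper name `numerical_grad` and the toy loss are illustrative assumptions; this is not the `grad_check` method from the starter code.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Centered-difference estimate of df/dtheta (hypothetical helper)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + eps      # perturb one parameter up
        f_plus = f(theta)
        theta.flat[i] = old - eps      # perturb the same parameter down
        f_minus = f(theta)
        theta.flat[i] = old            # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * eps)
    return grad

# Toy loss J(theta) = 0.5 * ||theta||^2, whose analytic gradient is theta itself
rng = np.random.RandomState(10)
theta = rng.randn(3, 4)
loss = lambda t: 0.5 * np.sum(t ** 2)

analytic = theta.copy()
numeric = numerical_grad(loss, theta)
err = np.linalg.norm(analytic - numeric)
print("error norm = %.3e" % err)
```

Each parameter costs two loss evaluations, which is why checking a full-size model is slow and a toy model is used first.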

In [190]:
from rnnlm2 import RNNLM
# Gradient check on toy data, for speed
random.seed(10)
wv_dummy = random.randn(10,50)
model = RNNLM((20, 10, 10, 20),
              alpha=0.005, rseed=10)
model.grad_check(array([1,2,3,4,5,6,7,8]), array([2,3,4,5,6,7,8,9]), verbose=True)

grad_check: dJ/dH_4 error norm = 2.748e-10 [ok]
    H_4 dims: [10, 10] = 100 elem
grad_check: dJ/dH_2 error norm = 1.954e-10 [ok]
    H_2 dims: [10, 10] = 100 elem
grad_check: dJ/dH_3 error norm = 1.49e-10 [ok]
    H_3 dims: [10, 10] = 100 elem
grad_check: dJ/dH_1 error norm = 1.565e-10 [ok]
    H_1 dims: [10, 10] = 100 elem
grad_check: dJ/db_4 error norm = 3.912e-10 [ok]
    b_4 dims: [10] = 10 elem
grad_check: dJ/dW_4 error norm = 2.734e-10 [ok]
    W_4 dims: [10, 10] = 100 elem
grad_check: dJ/dW_5 error norm = 1.792e-09 [ok]
    W_5 dims: [20, 10] = 200 elem
grad_check: dJ/db_1 error norm = 4.656e-11 [ok]
    b_1 dims: [10] = 10 elem
grad_check: dJ/db_2 error norm = 4.143e-11 [ok]
    b_2 dims: [10] = 10 elem
grad_check: dJ/db_3 error norm = 8.018e-11 [ok]
    b_3 dims: [10] = 10 elem
grad_check: dJ/dW_1 error norm = 1.237e-10 [ok]
    W_1 dims: [10, 20] = 200 elem
grad_check: dJ/db_5 error norm = 2.607e-09 [ok]
    b_5 dims: [20] = 20 elem
grad_check: dJ/dW_3 error norm = 1.792e-10

## Prepare Vocabulary and Load PTB Data

We've prepared a list of the Penn Treebank vocabulary, along with each word's absolute count and unigram frequency. The document loader code below will "canonicalize" words and replace any unknown words with a `"UUUNKKK"` token, then convert the data to lists of indices.

In [31]:
from data_utils import utils as du
import pandas as pd

# Load the vocabulary
vocab = pd.read_table("data/lm/vocab.ptb.txt", header=None, sep=r"\s+",
                     index_col=0, names=['count', 'freq'])

# Choose how many top words to keep
vocabsize = 2000
num_to_word = dict(enumerate(vocab.index[:vocabsize]))
word_to_num = du.invert_dict(num_to_word)
##
# Below needed for 'adj_loss': DO NOT CHANGE
fraction_lost = float(sum([vocab['count'][word] for word in vocab.index
                           if (not word in word_to_num) 
                               and (not word == "UUUNKKK")]))
fraction_lost /= sum([vocab['count'][word] for word in vocab.index
                      if (not word == "UUUNKKK")])
print "Retained %d words from %d (%.02f%% of all tokens)" % (vocabsize, len(vocab),
                                                             100*(1-fraction_lost))

Retained 2000 words from 38444 (84.00% of all tokens)


Load the datasets, using the vocabulary in `word_to_num`. Our starter code handles this for you, and also generates lists of lists `X` and `Y`, corresponding to input words and target words*.

*(Of course, the target words are just the input words, shifted by one position, but it can be cleaner and less error-prone to keep them separate.)*
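
To make the one-position shift concrete, here is a small sketch of how a single index sequence yields an input/target pair. The helper `seq_to_xy` and the word indices are made up for illustration; it is not the starter code's `du.seqs_to_lmXY`, whose exact behavior may differ.

```python
import numpy as np

def seq_to_xy(seq):
    """Split one index sequence into inputs and next-word targets (illustrative helper)."""
    seq = np.asarray(seq)
    return seq[:-1], seq[1:]   # x drops the last token, y drops the first

s = [1, 4, 147, 169, 2]        # made-up word indices for one sentence
x, y = seq_to_xy(s)
print(x)
print(y)
```

Each target `y[t]` is the word that follows `x[t]` in the original sequence, so the two arrays always have the same length.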

In [36]:
# Load the training set
docs = du.load_dataset('data/lm/ptb-train.txt')
S_train = du.docs_to_indices(docs, word_to_num)
X_train, Y_train = du.seqs_to_lmXY(S_train)

# Load the dev set (for tuning hyperparameters)
docs = du.load_dataset('data/lm/ptb-dev.txt')
S_dev = du.docs_to_indices(docs, word_to_num)
X_dev, Y_dev = du.seqs_to_lmXY(S_dev)

# Load the test set (final evaluation only)
docs = du.load_dataset('data/lm/ptb-test.txt')
S_test = du.docs_to_indices(docs, word_to_num)
X_test, Y_test = du.seqs_to_lmXY(S_test)

# Display some sample data (docs now holds the test set)
print " ".join(d[0] for d in docs[7])
print S_test[7]

Big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock , traders say .
[   4  147  169  250 1879    7 1224   64    7    1    3    7  456    1    3
 1024  255   24  378  147    3    6   67    0  255  138    2    5]


## (f): Train and evaluate your model

Once your model passes the gradient check, run it on some real language!

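You should randomly initialize the word vectors as Gaussian noise, i.e. $L_{ij} \sim \mathcal{N}(0,0.1)$ and $U_{ij} \sim \mathcal{N}(0,0.1)$; the function `random.randn` may be helpful here. Note that the cell below scales `random.randn` by 0.1, which treats 0.1 as the standard deviation.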

As in Part 1, you should tune hyperparameters to get a good model.

In [51]:
hdim = 100 # dimension of hidden layer = dimension of word vectors
random.seed(10)
L0 = 0.1 * random.randn(vocabsize, hdim) # replace with random init, 
                              # or do in RNNLM.__init__()
# test parameters; you probably want to change these
model = RNNLM(L0, U0 = L0, alpha=0.1, rseed=10, bptt=1)

# Gradient check is going to take a *long* time here
# since it's quadratic-time in the number of parameters.
# run at your own risk... (but do check this!)
model.grad_check(array([1,2,3]), array([2,3,4]))

grad_check: dJ/dH error norm = 1.007e-09 [ok]
    H dims: [100, 100] = 10000 elem
grad_check: dJ/dU error norm = 4.23e-09 [ok]
    U dims: [2000, 100] = 200000 elem
grad_check: dJ/dL[3] error norm = 1.873e-10 [ok]
    L[3] dims: [100] = 100 elem
grad_check: dJ/dL[2] error norm = 1.809e-10 [ok]
    L[2] dims: [100] = 100 elem
grad_check: dJ/dL[1] error norm = 2.361e-10 [ok]
    L[1] dims: [100] = 100 elem


In [92]:
#### YOUR CODE HERE ####

##
# Pare down to a smaller dataset, for speed
# (optional - recommended to not do this for your final model)
ntrain = 5000  # or len(Y_train) to use the full training set
X = X_train[:ntrain]
Y = Y_train[:ntrain]
model.train_sgd(X, Y, idxiter=model.randomiter(ntrain * 5, ntrain, 5), alphaiter=None, 
                printevery=100, costevery=5000, devidx=None)



#### END YOUR CODE ####

Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 4.517
  Seen 100 in 65.98 s
  Seen 200 in 81.90 s
  Seen 300 in 98.76 s
  Seen 400 in 116.72 s
  Seen 500 in 131.13 s
  Seen 600 in 145.05 s
  Seen 700 in 160.55 s
  Seen 800 in 178.65 s
  Seen 900 in 197.17 s
  Seen 1000 in 215.86 s
  Seen 1100 in 234.50 s
  Seen 1200 in 251.70 s
  Seen 1300 in 266.74 s
  Seen 1400 in 283.57 s
  Seen 1500 in 300.16 s
  Seen 1600 in 318.18 s
  Seen 1700 in 333.07 s
  Seen 1800 in 345.99 s
  Seen 1900 in 359.62 s
  Seen 2000 in 372.47 s
  Seen 2100 in 385.85 s
  Seen 2200 in 398.36 s
  Seen 2300 in 410.98 s
  Seen 2400 in 423.80 s
  Seen 2500 in 436.94 s
  Seen 2600 in 450.42 s
  Seen 2700 in 462.62 s
  Seen 2800 in 475.50 s
  Seen 2900 in 488.44 s
  Seen 3000 in 500.90 s
  Seen 3100 in 513.71 s
  Seen 3200 in 526.18 s
  Seen 3300 in 539.03 s
  Seen 3400 in 552.01 s
  Seen 3500 in 565.02 s
  Seen 3600 in 578.81 s
  Seen 3700 in 592.19 s
  Seen 3800 in 605.14 s
  Seen 3900 in 617.29 s
  Seen 4000 in 637.32

[(0, 4.5170011431387289),
 (5000, 3.7026867506582128),
 (10000, 3.487073333074262),
 (15000, 3.3698461435746432)]

In [93]:
## Evaluate cross-entropy loss on the dev set,
## then convert to perplexity for your writeup
dev_loss = model.compute_mean_loss(X_dev, Y_dev)

The model's performance is skewed somewhat by the large number of `UUUNKKK` tokens: if these make up about 1/6 of the dataset (here, 16% of tokens fall outside the 2000-word vocabulary), that's a sizeable fraction that we're just waving our hands at. Naively, the model gets undeserved credit for these; the formula below roughly removes this contribution from the average loss. Don't worry about how it's derived, but do report both scores, since this helps us compare across models with different vocabulary sizes.
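
As a rough illustration of the conversion, perplexity is the exponential of the mean cross-entropy loss (measured in nats). The sketch below recovers a loss from the unadjusted dev perplexity reported in this notebook and applies the same arithmetic as the `'basic'` mode of `adjust_loss`. The function names `perplexity` and `basic_adjust` are hypothetical, and `f_unk = 0.16` is a rounded value taken from the 84% retention figure printed earlier, so the adjusted result only approximately matches the notebook's.

```python
import numpy as np

def perplexity(mean_loss):
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return np.exp(mean_loss)

def basic_adjust(loss, f_unk):
    """Same arithmetic as the 'basic' mode of adjust_loss in this notebook."""
    return (loss + f_unk * np.log(f_unk)) / (1 - f_unk)

loss = np.log(137.336)   # loss implied by the unadjusted perplexity in this notebook
f_unk = 0.16             # assumed: rounded fraction of tokens mapped to UUUNKKK

ppl = perplexity(loss)
ppl_adj = perplexity(basic_adjust(loss, f_unk))
print("Unadjusted: %.3f" % ppl)
print("Adjusted (approximate): %.3f" % ppl_adj)
```

Since `f_unk * log(f_unk)` is negative, the numerator shrinks, but dividing by `1 - f_unk` spreads the remaining loss over fewer tokens, which here raises the perplexity.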

In [94]:
## DO NOT CHANGE THIS CELL ##
# Report your numbers, after computing dev_loss above.
def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q))

Unadjusted: 137.336
Adjusted for missing vocab: 247.348


### Save Model Parameters

In [95]:
##
# Save to .npy files; should only be a few MB total
assert(min(model.sparams.L.shape) <= 100) # smaller dimension (word-vector size) at most 100
assert(max(model.sparams.L.shape) <= 5000) # larger dimension (vocabulary size) at most 5000
save("rnnlm.L.npy", model.sparams.L)
save("rnnlm.U.npy", model.params.U)
save("rnnlm.H.npy", model.params.H)

## (g): Generating Data

Once you've trained your model to your satisfaction, use it to generate some sentences!

Implement the `generate_sequence` function in `rnnlm.py`, and call it below.
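
The sampling loop itself can be sketched independently of the RNN. The example below swaps in a toy bigram table for the model: it draws each next word from a fixed row of probabilities, stops at the end token or at `maxlen`, and accumulates the total negative log-probability. The function `sample_sequence`, the table `P`, and the 4-word vocabulary are all assumptions for illustration; a real `generate_sequence` would instead compute the next-word distribution from the RNN's hidden state.

```python
import numpy as np

def sample_sequence(P, init, end, maxlen=100, rng=None):
    """Sample indices from a toy bigram table P[prev, next] until `end` or maxlen.

    Returns the sampled sequence and its total negative log-probability.
    """
    rng = rng or np.random.RandomState(0)
    seq, J = [init], 0.0
    while len(seq) < maxlen:
        probs = P[seq[-1]]                        # distribution over the next word
        nxt = int(rng.choice(len(probs), p=probs))
        J -= np.log(probs[nxt])                   # accumulate cross-entropy cost
        seq.append(nxt)
        if nxt == end:
            break
    return seq, J

# Toy 4-word vocabulary: 0 = <s>, 1 = </s>, 2 and 3 are ordinary words
P = np.array([[0.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.4, 0.3, 0.3],
              [0.0, 0.4, 0.3, 0.3]])
seq, J = sample_sequence(P, init=0, end=1, maxlen=20)
print(seq)
print(J)
```

Sampling (rather than always taking the most probable word) is what gives varied sentences from one run to the next.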

In [123]:
def seq_to_words(seq):
    return [num_to_word[s] for s in seq]
    
seq, J = model.generate_sequence(word_to_num["<s>"], 
                                 word_to_num["</s>"], 
                                 maxlen=100)
print J
# print seq
print " ".join(seq_to_words(seq))

21.4488996666
<s> shares fell dec. DG . </s>


**BONUS:** Use the unigram distribution given in the `vocab` table to fill in any `UUUNKKK` tokens in your generated sequences with words that we omitted from the vocabulary. You'll want to use `list(vocab.index)` to get a list of words, and `vocab.freq` to get a list of corresponding frequencies.
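
One possible approach, sketched here on toy data, is to renormalize the frequencies of the omitted words so they sum to 1 and then draw a replacement for each `UUUNKKK`. The function `fill_unknowns_sketch` and the toy word list are hypothetical, and the sketch uses NumPy's `RandomState.choice` rather than the `nn.math` samplers from the starter code. In the notebook, the omitted words and frequencies would presumably be the entries of `list(vocab.index)` and `vocab.freq` from position `vocabsize` onward.

```python
import numpy as np

def fill_unknowns_sketch(words, omitted_words, omitted_freqs, rng):
    """Replace each UUUNKKK with a word sampled from the renormalized
    unigram distribution over omitted words (illustrative helper)."""
    p = np.asarray(omitted_freqs, dtype=float)
    p = p / p.sum()   # renormalize so the probabilities sum to 1
    return [omitted_words[rng.choice(len(p), p=p)] if w == "UUUNKKK" else w
            for w in words]

rng = np.random.RandomState(0)
omitted_words = ["zebra", "quartz", "fjord"]   # toy stand-ins for skipped vocabulary
omitted_freqs = [3e-5, 2e-5, 1e-5]             # toy unigram frequencies
sentence = ["<s>", "the", "UUUNKKK", "fell", "UUUNKKK", "</s>"]
filled = fill_unknowns_sketch(sentence, omitted_words, omitted_freqs, rng)
print(" ".join(filled))
```

Known words pass through unchanged, and each unknown is sampled independently.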

In [20]:
# Replace UUUNKKK with a random unigram,
# drawn from vocab that we skipped
from nn.math import MultinomialSampler, multinomial_sample
def fill_unknowns(words):
    #### YOUR CODE HERE ####
    ret = words # do nothing; replace this
    

    #### END YOUR CODE ####
    return ret
    
print " ".join(fill_unknowns(seq_to_words(seq)))