# Finding Grammar in Neural Language Models

We will investigate a language model trained on the Penn Treebank dataset, using the code provided at https://github.com/pytorch/examples/tree/master/word_language_model. The model consists of an embedding layer (called `encoder` in the code) with 1500-dimensional word embeddings, 2 hidden LSTM layers with 1500 units each, and a linear output layer to which a softmax function is applied. The model was trained for 40 epochs with a dropout factor of 0.65 and reaches a perplexity of 72.30 on the test set. If you are interested in more details about the model, we advise you to look at the repository containing the code.
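
As a reminder of what the perplexity figure means: it is the exponential of the average negative log-probability that the model assigns to each word of the test set. The sketch below illustrates this in plain Python. The function name `perplexity` and the probabilities are made up for illustration and do not come from the model.

```python
import math

def perplexity(word_probs):
    """Perplexity = exp(mean negative log-probability per word)."""
    nll = [-math.log(p) for p in word_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities assigned to four consecutive words.
example_probs = [0.1, 0.02, 0.5, 0.01]
print(perplexity(example_probs))

# Sanity check: always assigning probability 1/k gives perplexity k.
uniform = perplexity([1 / 100] * 20)
print(uniform)
```

Lower perplexity means the model was, on average, less surprised by the words it saw.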

In this notebook, we will walk you through an example of how to compute the probabilities of the next word in a sentence. You can then use this to start your replication of Linzen et al.

In [1]:
# do required imports
import torch
from torch.autograd import Variable
import torch.nn as nn
import pickle

If you downloaded and extracted the zipfile, you should have all data required: the model, and the pickled dictionary mapping words to indices.

Let's start by loading the model. Because the model was trained on a GPU, we need to explicitly specify that it should be loaded on the CPU:

In [2]:
lm = torch.load('model.pt', map_location=lambda storage, loc: storage)

In [3]:
# print a summary of the architecture of your model
print(lm)

RNNModel (
  (drop): Dropout (p = 0.65)
  (encoder): Embedding(10000, 1500)
  (rnn): LSTM(1500, 1500, num_layers=2, dropout=0.65)
  (decoder): Linear (1500 -> 10000)
)


### Evaluating a single sentence

We will give an example of how you can get the probabilities for the next word in a single sentence. We will use the example sentence:

"This is a sentence with seven"

We will print the probabilities of completing this sentence with either 'words', 'characters', 'thursday', 'days' or 'walk'. As the model itself does not include the mapping from words to indices, we need to do this as a preprocessing step. The dictionary that maps words to indices is stored in a pickled file called 'dict'.

In [4]:
# Load dictionary word --> id
with open('dict', 'rb') as f:
    dictionary = pickle.load(f)

# set the maximum sequence length
max_seq_len = 50

# function to transform a sentence into word ids, returned as a torch.LongTensor
# NB Assumes the sentence is already tokenised!
def tokenise(sentence, dictionary):
    words = sentence.split(' ')
    l = len(words)
    assert l <= max_seq_len, "sentence too long"
    ids = torch.LongTensor(l)

    for token, word in enumerate(words):
        try:
            ids[token] = dictionary.word2idx[word]
        except KeyError:
            # unknown word: report it and fall back to the <unk> token
            print('unknown word:', word)
            ids[token] = dictionary.word2idx['<unk>']
    return ids
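
To see the lookup logic in isolation, here is a minimal sketch that does not need the pickled file. `ToyDictionary` and `to_ids` are hypothetical stand-ins written for this illustration; the only thing assumed about the real dictionary is the `word2idx` attribute used above.

```python
class ToyDictionary:
    """Hypothetical stand-in for the pickled dictionary object."""

    def __init__(self, words):
        self.idx2word = list(words)
        self.word2idx = {w: i for i, w in enumerate(self.idx2word)}

    def __len__(self):
        return len(self.idx2word)


def to_ids(sentence, dictionary):
    # Words missing from the vocabulary are mapped to the <unk> index.
    unk = dictionary.word2idx['<unk>']
    return [dictionary.word2idx.get(w, unk) for w in sentence.split(' ')]


toy = ToyDictionary(['<unk>', 'this', 'is', 'a', 'sentence'])
ids = to_ids('this is a zebra', toy)
print(ids)  # [1, 2, 3, 0]
```

Keep in mind that any word replaced by `<unk>` changes what the model actually sees, so it is worth checking that your test sentences are in the vocabulary.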

We now define a function that can be used to evaluate a single sentence and print the probabilities of finishing this sentence with a word from a list of input words.

In [5]:
# load pytorch softmax function
softmax = nn.Softmax()

def evaluate(model, dictionary, sentence, check_words):
    # Turn on evaluation mode which disables dropout.
    model.eval()

    # number of tokens (= output size)
    ntokens = len(dictionary)
    hidden = model.init_hidden(1)

    # tokenise the sentence and reshape it to (seq_len, batch_size=1),
    # the input layout used by the word_language_model code
    test_data = tokenise(sentence, dictionary).view(-1, 1)
    input_data = Variable(test_data, volatile=True)

    # run the model; output has shape (seq_len, 1, ntokens)
    output, hidden = model(input_data, hidden)

    # scores for the word following the last input word, shape (1, ntokens)
    logits = output[-1, :]
    sm = softmax(logits).view(ntokens)

    # get probabilities of certain words by looking up their
    # indices and print them
    def get_prob(word):
        return float(sm.data[dictionary.word2idx[word]])

    print(sentence, '\n')
    print('\n'.join(
        ['%s: %f' % (word, get_prob(word)) for word in check_words]
    ))


In [None]:
# test sentence and words to check
test_sentence = 'this is a sentence with seven'
check_words = ['words', 'characters', 'thursday', 'days', 'walk']

In [None]:
evaluate(lm, dictionary, test_sentence, check_words)
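
The softmax step inside `evaluate` turns the raw scores of the output layer into a probability distribution over the vocabulary. As a rough illustration of what that computation does, here is a plain-Python sketch with three made-up scores (the function below is written for this example and is not PyTorch's implementation):

```python
import math

def softmax(scores):
    # Subtracting the maximum does not change the result
    # but avoids overflow in exp for large scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-word vocabulary.
toy_scores = [2.0, 1.0, 0.1]
toy_probs = softmax(toy_scores)
print(toy_probs)
```

The outputs are positive, sum to one, and preserve the ranking of the scores, which is why the probabilities printed for the candidate words can be compared directly.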

# Now try it yourself

You can now start with your replication of Linzen's paper, for which the first step is to try different inputs with varying distances etc. to get a feeling for what the model is doing and to familiarise yourself with using it.
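
One possible way to organise these first experiments is sketched below. A common criterion for agreement tests of this kind is to count the model as correct when it assigns a higher probability to the verb form that agrees with the subject than to the form that does not. Everything in the sketch is an assumption made for illustration: `next_word_probs` is a hypothetical function that returns a probability per candidate word (you could adapt `evaluate` to return values instead of printing them), and the stub below uses made-up numbers rather than the real model.

```python
def prefers_correct_verb(next_word_probs, prefix, correct, incorrect):
    """True if the correct verb form gets the higher probability."""
    probs = next_word_probs(prefix, [correct, incorrect])
    return probs[correct] > probs[incorrect]


def stub_next_word_probs(prefix, words):
    # Stand-in for the language model: made-up probabilities that
    # ignore the prefix. Replace with a call to the real model.
    made_up = {'is': 0.03, 'are': 0.01}
    return {w: made_up.get(w, 0.0) for w in words}


# (prefix, correct verb form, incorrect verb form)
cases = [
    ('the key', 'is', 'are'),
    ('the key to the cabinets', 'is', 'are'),
]

results = [prefers_correct_verb(stub_next_word_probs, p, c, i)
           for p, c, i in cases]
accuracy = sum(results) / len(results)
print(accuracy)
```

With the real model in place of the stub, you can extend `cases` with sentences in which the distance between subject and verb varies and see how the accuracy changes.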