# LSTM language model handout
## [COSC 7336 Advanced Natural Language Processing](https://fagonzalezo.github.io/dl-tau-2017-2/)

In [1]:
import numpy as np
import urllib
from matplotlib import pyplot as plt
import random
import sys

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
from keras.layers.wrappers import TimeDistributed



Using TensorFlow backend.


In this handout we will build a character-based language model using a Long Short-Term Memory (LSTM) recurrent neural network. The code is based on this Keras example: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.

We will use a text by Nietzsche that is available here: https://s3.amazonaws.com/text-datasets/nietzsche.txt. The text is converted to lowercase and we build two dictionaries to map characters to indices and back.

In [51]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
print("Total number of chars:", len(text))
print("Vocabulary size:", vocab_size)

Total number of chars: 600893
Vocabulary size: 57
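
As a quick illustration of how the two dictionaries are used, the sketch below builds them for a short stand-in string and round-trips a word through its indices. The name `sample_text` is hypothetical and is not the Nietzsche corpus, so the indices differ from those in the real vocabulary.

```python
# Toy stand-in for the corpus; the handout itself uses the Nietzsche text.
sample_text = "the thing in itself"

# Same construction as above: a sorted vocabulary and two lookup tables.
chars = sorted(set(sample_text))
char_indices = {c: i for i, c in enumerate(chars)}
indices_char = {i: c for i, c in enumerate(chars)}

# Encode a word as a list of indices, then decode it back.
encoded = [char_indices[c] for c in "thing"]
decoded = "".join(indices_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```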


The following is an excerpt of the text:

In [25]:
print(text[31000:31500])

ts object purely and simply as "the thing in itself," without any
falsification taking place either on the part of the subject or the
object. i would repeat it, however, a hundred times, that "immediate
certainty," as well as "absolute knowledge" and the "thing in itself,"
involve a contradictio in adjecto; we really ought to free ourselves
from the misleading significance of words! the people on their part may
think that cognition is knowing all about things, but the philosopher
must say to him


## Many to one LSTM model 

We will build a model with this structure:

![many-to-one.jpg](many-to-one.jpg)

The model receives sequences of 40 characters and predicts the character that follows, i.e. the 41st character. The LSTM layer will have 128 units.

In [27]:
maxlen = 40
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, vocab_size), return_sequences=False, name="lstm_1"))
model.add(Dense(vocab_size, name="dense_1"))
model.add(Activation('softmax', name="activation_1"))
model.summary(70)

______________________________________________________________________
Layer (type)                   Output Shape                Param #    
lstm_1 (LSTM)                  (None, 128)                 95232      
______________________________________________________________________
dense_1 (Dense)                (None, 57)                  7353       
______________________________________________________________________
activation_1 (Activation)      (None, 57)                  0          
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
______________________________________________________________________
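
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel and a bias); the helper names `lstm_param_count` and `dense_param_count` are introduced here only for illustration.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units


vocab_size = 57   # vocabulary size reported above
units = 128       # LSTM units used in the model

lstm_params = lstm_param_count(vocab_size, units)
dense_params = dense_param_count(units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The three numbers should match the `Param #` column and the total reported by `model.summary`.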


In order to train this model we need to build the sequences from the input text. Each sequence will have a length of `maxlen = 40` and its label will be the character that follows the sequence, i.e. the 41st character. The sequences overlap, and the `step` variable controls how far apart consecutive sequences start. Each character is represented as a one-hot vector of length `len(chars)`.

In [28]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
# Use the builtin bool: the np.bool alias was removed in recent NumPy releases
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print('Shape X', X.shape)
print('Shape y', y.shape)

nb sequences: 200285
Vectorization...
Shape X (200285, 40, 57)
Shape y (200285, 57)
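
To make the windowing concrete, here is a minimal sketch on a toy string with small, illustrative values of `maxlen` and `step` (the names `toy_text`, `windows` and `targets` are hypothetical). The last line reproduces the number of sequences for the real corpus from its length alone.

```python
toy_text = "abcdefghij"
maxlen, step = 4, 3  # small values chosen only for illustration

# Each window starts `step` characters after the previous one.
starts = range(0, len(toy_text) - maxlen, step)
windows = [toy_text[i: i + maxlen] for i in starts]
targets = [toy_text[i + maxlen] for i in starts]

for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))

# Same counting rule applied to the corpus length reported earlier.
n_sequences = len(range(0, 600893 - 40, 3))
print(n_sequences)
```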


We use a categorical crossentropy loss since the output of the model is a softmax layer:

In [29]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer = optimizer)
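
For a one-hot target, categorical crossentropy reduces to the negative log of the probability assigned to the correct character. The following NumPy sketch illustrates this; it is a simplified stand-in, not the Keras implementation, and the clipping constant `eps` is an assumption made here to avoid `log(0)`.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)


y_true = np.array([0.0, 1.0, 0.0])          # correct class is index 1
confident = np.array([0.05, 0.90, 0.05])    # most mass on the correct class
unsure = np.array([0.40, 0.20, 0.40])       # little mass on the correct class

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the model puts most of its probability on the right character and grows as that probability shrinks.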

Training the model requires at least 20 epochs to get good results, so a GPU is a must. For illustration we will train it for just one epoch on a reduced set of samples:

In [32]:
model.fit(X[:1000,:,:], y[:1000,:], batch_size=128, epochs=1)

Epoch 1/1


<keras.callbacks.History at 0x13f6a9128>

## Many-to-many LSTM

Although the model we trained is a many-to-one model, we can convert it to a many-to-many model, since an LSTM always produces an output for every time step:

![many-to-many.jpg](many-to-many.jpg)

In order to do this we will change the architecture of the model. In particular, we will change the dense output layer so that it produces an output for every time step. This is accomplished by using the `TimeDistributed` class. This doesn't increase the number of parameters since the weights are shared for all the time steps. 

In [33]:
maxlen = 40
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, vocab_size), return_sequences=True, name="lstm_1"))
model.add(TimeDistributed(Dense(vocab_size), name="dense_1"))#Check names to see how to load weights
model.add(Activation('softmax', name="activation_1"))
model.summary(70)

______________________________________________________________________
Layer (type)                   Output Shape                Param #    
lstm_1 (LSTM)                  (None, 40, 128)             95232      
______________________________________________________________________
dense_1 (TimeDistributed)      (None, 40, 57)              7353       
______________________________________________________________________
activation_1 (Activation)      (None, 40, 57)              0          
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
______________________________________________________________________
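
The reason the parameter count stays the same can be sketched in plain NumPy: one kernel and one bias are reused at every time step. The arrays below are random stand-ins for the LSTM outputs and the dense weights, not values taken from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, hidden, vocab = 40, 128, 57

h = rng.normal(size=(timesteps, hidden))  # stand-in for the LSTM outputs of one sequence
W = rng.normal(size=(hidden, vocab))      # a single kernel shared by all time steps
b = rng.normal(size=vocab)                # a single shared bias

# Apply the same dense layer to each time step separately...
per_step = np.stack([h[t] @ W + b for t in range(timesteps)])
# ...which is the same as one matrix product over all time steps.
all_at_once = h @ W + b

print(per_step.shape)
print(np.allclose(per_step, all_at_once))
print(W.size + b.size)
```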


Now we will load a pretrained model. The weights come from the many-to-one model defined before, but since both models have the same parameters we can use them for the many-to-many model.

In [34]:
h5file = 'weights1.00-1.07.hdf5'
optimizer = RMSprop(lr=0.01)
model.load_weights(h5file)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

## Calculating the probability of a text

The model that we trained calculates the probability of a character given the previous characters: $P(c_i | c_{1},\dots, c_{i-1})$. We can use this conditional probability to calculate the joint probability of a given sequence as follows:
$$P(c_1, \dots, c_n) = P(c_1)\prod_{i=2}^{n}\ P(c_i | c_{1},\dots, c_{i-1})$$

This probability is the likelihood of the particular sequence $c_1, \dots, c_n$ being generated by the language model.
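
In practice the product is evaluated as a sum of logarithms, because multiplying many small probabilities underflows in floating point. A minimal sketch with made-up probabilities (2000 steps, each with probability 0.1) shows the difference:

```python
import numpy as np

# Hypothetical per-character probabilities for a long sequence.
probs = np.full(2000, 0.1)

direct = np.prod(probs)           # underflows to zero in float64
log_lik = np.sum(np.log(probs))   # stays finite and usable

print(direct)
print(log_lik)
```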

In the many-to-many model the conditional probabilities are given by the outputs at each time step. Remember that the output of the model is a softmax layer with as many neurons as there are characters in the vocabulary. The following function calculates the log likelihood of a text according to the above formula, leaving out the $P(c_1)$ term since the model only provides probabilities conditioned on at least one previous character:

In [35]:
def log_likelihood(model, text):
    probs = model.predict(parse_text(text, vocab_size, padding=True)).squeeze()
    return sum([np.log(probs[i, char_indices[c]]) 
                 for i,c in enumerate(text[1:]) ]) 

To use it on particular texts, we need to represent them in a form that the neural network model understands. This is done by the following function:

In [36]:
def parse_text(text, vocab_size, padding=False):
    # Use the builtin bool: the np.bool alias was removed in recent NumPy releases
    if padding:
        X = np.zeros((1, maxlen, vocab_size), dtype=bool)
    else:
        X = np.zeros((1, len(text), vocab_size), dtype=bool)
    for t, char in enumerate(text):
        X[0, t, char_indices[char]] = 1
    return X
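
To see how the indexing in `log_likelihood` lines up, and to get a reference point for the numbers that follow, the sketch below replaces the model with a stand-in that predicts a uniform distribution at every step. The names `phrase`, `toy_indices` and `baseline` are hypothetical; the vocabulary size of 57 is the one reported earlier.

```python
import numpy as np

phrase = "the thing in itself"
vocab_size = 57  # vocabulary size reported earlier in the handout

# Stand-in for model.predict: every time step outputs a uniform distribution.
probs = np.full((len(phrase), vocab_size), 1.0 / vocab_size)

# Toy index map over the characters of this phrase only.
toy_indices = {c: i for i, c in enumerate(sorted(set(phrase)))}

# probs[i] is the prediction made after reading phrase[:i + 1],
# so it is scored against the following character, phrase[i + 1].
baseline = sum(np.log(probs[i, toy_indices[c]]) for i, c in enumerate(phrase[1:]))
print(baseline)
```

A uniform predictor gives about -72.8 for this 19-character phrase (18 scored characters), so any log likelihood well above that value indicates the model has learned something about the text.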

The absolute value of the likelihood does not say much by itself. It is more useful if we use it to compare different texts, i.e. if we interpret it in relative terms.

For instance, the log likelihood of the text *"the thing in itself"* is:

In [38]:
print (log_likelihood(model, "the thing in itself"))


-24.6521512464


And the log likelihood of the same text without spaces, *"thethinginitself"*, is:

In [39]:
print (log_likelihood(model, "thethinginitself"))

-59.0525213182


This means that $$P(\text{"the thing in itself"} | model) > P(\text{"thethinginitself"} | model)$$
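
Since these are log likelihoods, their difference is the log of the probability ratio. The sketch below converts the two values printed above into that ratio, expressed in orders of magnitude:

```python
import math

ll_spaced = -24.6521512464   # "the thing in itself", from the output above
ll_joined = -59.0525213182   # "thethinginitself", from the output above

log_ratio = ll_spaced - ll_joined            # natural log of the probability ratio
orders_of_magnitude = log_ratio / math.log(10)

print(log_ratio)
print(orders_of_magnitude)
```

The model considers the spaced version nearly 15 orders of magnitude more probable, even though it is the longer string (19 characters versus 16) and therefore has more terms in its sum.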

The following are other variations of the text

In [40]:
print (log_likelihood(model, "the thingy in itself"))
print (log_likelihood(model, "itself thing the in"))

-30.8013833459
-31.7916916162


This suggests that the model is able to capture different aspects of the text structure, such as the morphology of the words or the syntax.

### Morphological structure

The model can help us identify the most likely (and the most unlikely) arrangements of a set of characters:

In [50]:
from itertools import permutations
char_list = list(u'ywh ')
perms = [''.join(perm) for perm in permutations(char_list)]
# Score every permutation once and sort from most to least likely
ranked = sorted([(log_likelihood(model, text), text) for text in perms], reverse=True)
for p, t in ranked[:5]:
    print(p, t)
print('-'*50)
for p, t in ranked[-5:]:
    print(p, t)

-2.90723706037 y wh
-4.96534752846 why 
-7.63534402847  why
-7.64063081145 hy w
-9.28439575434 y hw
--------------------------------------------------
-18.8574256897  ywh
-19.5259504318 wyh 
-21.4377524853 h wy
-21.4675536156 h yw
-25.4104990959  yhw


### Syntactical structure

We can do the same with words instead of characters:

In [None]:
from itertools import permutations
bow =  ['philosopher', 'kant', 'is', 'a']
perms = [' '+' '.join(perm)+' ' for perm in permutations(bow)]

These are the most likely permutations:

In [53]:
for p, t in sorted([(log_likelihood(model, text), text) for text in perms], reverse = True)[:10]:
    print(p, t)

-28.924826585  is a philosopher kant 
-32.3237102695  kant is a philosopher 
-34.8795643604  a kant is philosopher 
-35.6037164213  kant is philosopher a 
-39.9717190545  a philosopher kant is 
-43.2843809983  is philosopher kant a 
-43.6753551863  is kant a philosopher 
-43.7594629446  a kant philosopher is 
-43.8379937951  is a kant philosopher 
-44.6056019792  a is kant philosopher 


These are the least likely:

In [54]:
for p, t in sorted([(log_likelihood(model, text), text) for text in perms], reverse = True)[-10:]:
    print(p, t)

-50.7616942155  kant a is philosopher 
-51.1649401682  kant philosopher is a 
-53.6018530477  is kant philosopher a 
-58.6275342517  philosopher kant is a 
-65.4431888871  philosopher is a kant 
-66.7852299176  philosopher a kant is 
-69.441136879  kant philosopher a is 
-71.2249834351  philosopher is kant a 
-71.6653220765  philosopher kant a is 
-76.0457286723  philosopher a is kant 


### Generating text

One of the most interesting applications of language models is random text generation. 

Since the model tells us the probability of the next character given the previous ones, we can start from a given text and use the model's conditional probability to pick a good candidate for the next character. We append that character to our sequence and keep generating characters in this way.

To generate a character using the conditional probability calculated by the model, $P(c_t | c_{1}, c_{2},\dots, c_{t-1})$, we will use two main mechanisms: deterministic and stochastic. In the deterministic case we generate the character with the maximum probability. In the stochastic case we sample from the conditional probability distribution given by the model. We combine the two strategies using a process analogous to this one:
  ```python
  for i in [1..n]:
      P = predict_next() 
      bin_var = sample_binomial(temperature)
      if bin_var:
          c_i = sample_multinomial(P) 
      else:
          c_i = P.argmax() 
  ```
The `temperature` parameter determines the balance between stochastic and deterministic behavior. A higher value makes the process more stochastic; a lower value makes it more deterministic.
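
A runnable version of this mixing process could look like the sketch below. It assumes `temperature` lies in $[0, 1]$ so that it can serve as the success probability of the coin flip, and it uses a made-up three-character distribution `P`; the function name `mixed_choice` is hypothetical.

```python
import numpy as np


def mixed_choice(P, temperature, rng):
    """Sample from P with probability `temperature`, otherwise take the argmax."""
    if rng.binomial(1, temperature):
        return int(rng.choice(len(P), p=P))
    return int(np.argmax(P))


rng = np.random.default_rng(0)
P = np.array([0.1, 0.6, 0.3])  # toy next-character distribution

# temperature = 0 never samples, so the most probable index is always chosen.
always_greedy = [mixed_choice(P, 0.0, rng) for _ in range(20)]
# temperature = 0.5 samples about half of the time.
mixed = [mixed_choice(P, 0.5, rng) for _ in range(1000)]

freqs = np.bincount(mixed, minlength=3) / 1000
print(set(always_greedy))
print(freqs)
```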


In [76]:
# Function to sample an index from a probability array:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
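
Note that `sample` does not flip a coin between sampling and taking the argmax. Instead it rescales the distribution by dividing the log probabilities by `temperature` and renormalizing, and then always samples. The sketch below isolates that rescaling step on a made-up distribution to show its effect; `rescale` is a hypothetical helper that mirrors the first lines of `sample`.

```python
import numpy as np


def rescale(preds, temperature):
    # Same transformation as in `sample`, without the final draw.
    logits = np.log(np.asarray(preds, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


P = np.array([0.1, 0.6, 0.3])  # toy next-character distribution

for temperature in (0.2, 1.0, 5.0):
    print(temperature, np.round(rescale(P, temperature), 3))
```

A temperature of 1 leaves the distribution unchanged, values below 1 concentrate the mass on the most probable character (approaching the argmax), and values above 1 flatten the distribution towards uniform.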

The following function feeds the model with an initial text, repeatedly runs the model to get a probability distribution, and uses the `sample` function to generate the next character:

In [89]:
def generate_text(diversity, model, sentence, n_chars, padding=True):
    print()
    print(sentence, end='')
    for i in range(n_chars):
        # One-hot encode the current context, right-aligned in a window of maxlen steps
        x = np.zeros((1, maxlen, vocab_size))
        if padding and len(sentence) < maxlen:
            # Fill the unused leading positions with spaces
            space_array = [" "]*(maxlen-len(sentence))
            for t, char in enumerate(space_array):
                x[0, t, char_indices[char]] = 1.
        for t, char in enumerate(sentence, maxlen-len(sentence)):
            x[0, t, char_indices[char]] = 1.

        # The last time step holds the distribution over the next character
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds[-1], diversity)
        next_char = indices_char[next_index]
        # Slide the context: drop the oldest character and append the new one
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
    return True

Now let's ask the model for *'the meaning of life'*...

In [88]:
generate_text(0.6, model, 'the meaning of life is ', 400)


the meaning of life is the propress of the sexielses of the invertion of huthertarely be" the same "sleptestolle)," and is not tas "in chrisent to himself we may not not in the except of a somm there is end, accord, that there is arbitraons as "freelongence, sworts the word lumbs promp," perhaps mankind" is shame are as so far as it is guen ene who
has there are stands her which is not with why higkn? what must are long

True