# Language Modelling using RNN

In this project, we build a language model with a Recurrent Neural Network (RNN) in Keras and train it on news headline data. We then use it to generate news headlines of our own!

## Setup

Let's import the required libraries and preprocess the training data for the model. We will use Keras with a TensorFlow backend.

In [1]:
from keras.callbacks import LambdaCallback
from keras.layers import Dense, SimpleRNN, Activation
from keras.models import Sequential
from keras.optimizers import RMSprop
from keras.utils import to_categorical

import numpy as np
import random
import re
import string

# Reading in the training data; taking a smaller set to reduce training time
with open("headlines.train", 'r') as f:
    headlines_train = f.readlines()[:100000]

# Removing excess punctuation and newline
regex = re.compile('[%s]' % re.escape(string.punctuation))
headlines_train = [regex.sub('', h.split("\n")[0]) for h in headlines_train]

# Define the unk, start and stop tokens
UNK_TOKEN = "<UNK>" # UNKNOWN - mapping rare words
START_TOKEN = "<START>"
STOP_TOKEN = "<STOP>"

def count_unigrams(text, unigram_dict):
    """
    :param text: A headline, consisting of a string of words
    :param unigram_dict: A dictionary containing unigrams as keys and their respective counts as values
    """
    tokens = [START_TOKEN] + text.split(" ") + [STOP_TOKEN]
    for i in range(len(tokens)):
        unigram = tokens[i]
        if unigram not in unigram_dict:
            unigram_dict[unigram] = 1
        else:
            unigram_dict[unigram] += 1

min_freq = 3 # Words occurring this many times or fewer are replaced with UNK

# The following are used to keep track of and remove infrequent words
low_freq = set()
all_words = {}

def replace_text_train(text):
    return " ".join([UNK_TOKEN if t in low_freq else t for t in text.split()])

# Finding all words with low frequency
for h in headlines_train:
    count_unigrams(h, all_words)
for word, count in all_words.items():
    if count <= min_freq:
        low_freq.add(word)
# Replacing low frequency words from training dataset with UNK
headlines_train_clean = [replace_text_train(h) for h in headlines_train]

# Build vocabulary and make a mapping from index to word for generation
vocab = set([item for sublist in map(lambda x: x.split(" "), headlines_train_clean) for item in sublist])
vocab.add(STOP_TOKEN)
vocab_list = list(vocab)
word_to_index = {vocab_list[i]: i for i in range(len(vocab_list))}
index_to_word = {v: k for k, v in word_to_index.items()}

Using TensorFlow backend.
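
To make the rare-word handling concrete, here is a minimal sketch of the same idea on a toy corpus. The helper `replace_rare_words` and the three headlines are made up for illustration and are not part of the notebook; the threshold uses the same "count <= min_freq" rule as the cell above.

```python
from collections import Counter

UNK_TOKEN = "<UNK>"

def replace_rare_words(headlines, min_freq):
    """Replace every word seen min_freq times or fewer with UNK_TOKEN."""
    counts = Counter(word for h in headlines for word in h.split())
    rare = {word for word, count in counts.items() if count <= min_freq}
    return [" ".join(UNK_TOKEN if w in rare else w for w in h.split())
            for h in headlines]

# Toy corpus (hypothetical, not taken from headlines.train)
toy_headlines = [
    "police investigate fire",
    "police investigate crash",
    "council approves budget",
]
cleaned = replace_rare_words(toy_headlines, min_freq=1)
for line in cleaned:
    print(line)
```

Only "police" and "investigate" occur more than once here, so every other word collapses into the single `<UNK>` token. This keeps the output layer small at the cost of losing rare words.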


For our RNN, we will first convert our text into GloVe word embeddings before giving it as input. We'll also need to define some parameters which will be used in our model.

In [2]:
# Reading in GloVe embeddings and saving them as a dictionary
with open("glove_embeddings.txt", 'r') as f:
    gloves = [t.rstrip().split(" ") for t in f]
    # Store each vector as floats; the raw fields read from the file are strings
    gloves_dict = {t[0]: np.asarray(t[1:], dtype='float32') for t in gloves}

In [3]:
# Parameters to be used for the batch generator and model
vocab_size = len(index_to_word.keys())
sent_len = max([len(h.split(" ")) for h in headlines_train_clean]) + 1
glove_dim = next(iter(gloves_dict.values())).size
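
As a rough illustration of what the embedding lookup does, the sketch below parses two made-up 3-dimensional vectors in the plain-text GloVe layout (a word followed by its space-separated components) and embeds a short token sequence. The names `toy_gloves` and `embed` are hypothetical; words without a vector, such as the start token, are assumed to stay all-zero, which matches how the notebook treats missing words later on.

```python
import io
import numpy as np

# Two made-up vectors standing in for glove_embeddings.txt
toy_glove_file = io.StringIO(
    "police 0.1 0.2 0.3\n"
    "fire 0.4 0.5 0.6\n"
)

toy_gloves = {}
for line in toy_glove_file:
    parts = line.rstrip().split(" ")
    toy_gloves[parts[0]] = np.array([float(x) for x in parts[1:]], dtype="float32")

toy_dim = next(iter(toy_gloves.values())).size

def embed(words, gloves, dim):
    """Stack one vector per word; words without a vector stay all-zero."""
    matrix = np.zeros((len(words), dim), dtype="float32")
    for i, word in enumerate(words):
        if word in gloves:
            matrix[i] = gloves[word]
    return matrix

embedded = embed(["<START>", "police", "fire"], toy_gloves, toy_dim)
print(embedded.shape)
```

The result has one row per token and one column per embedding dimension, which is the per-sample shape `(sent_len, glove_dim)` the model expects.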

## Creating Data Batches
First we'll need to turn the headlines into data samples (where each sample is an output word given the entire history of previous words in the headline). To do this, we will iterate through all the headlines and through each word within the headline to get (history, word) pairs as our inputs and labels.

In [4]:
data = []
for h in headlines_train_clean:
    # Pad the text in the beginning with start tokens
    text = [START_TOKEN for _ in range(sent_len)] + h.split(" ") + [STOP_TOKEN]
    for i in range(len(text) - sent_len):
        data.append((text[i:i+sent_len], text[i+sent_len]))
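
To see what these (history, word) pairs look like, here is a small sketch of the same sliding window on a single toy headline. The function name `make_samples` and the window size of 4 are chosen for illustration only; the notebook uses `sent_len` as the window.

```python
START_TOKEN = "<START>"
STOP_TOKEN = "<STOP>"

def make_samples(headline, window):
    """Return (history, next_word) pairs using a fixed-size history window."""
    text = [START_TOKEN] * window + headline.split(" ") + [STOP_TOKEN]
    return [(text[i:i + window], text[i + window])
            for i in range(len(text) - window)]

samples = make_samples("police investigate fire", window=4)
for history, word in samples:
    print(history, "->", word)
```

A headline of three words yields four samples: one per word plus one whose label is the stop token. Because the padding is as long as the window, the first history consists entirely of start tokens.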

Keras allows batches of data to be fed into the RNN through a generator, so we'll make such a generator to process the data and package it nicely for the model to use during the training steps.

In [5]:
# Parameters for the data generator and model
batch_size = 512
num_batches = -(-len(data) // batch_size)

def sample_generator():
    while True:
        random.shuffle(data)
        for i in range(num_batches):
            # Take the i-th slice of the shuffled data; the last batch may be smaller
            batch = data[i*batch_size:(i+1)*batch_size]
            batch_input = np.zeros((len(batch), sent_len, glove_dim))
            batch_label = np.zeros((len(batch), vocab_size))
            for j, (history, word) in enumerate(batch):
                for k in range(len(history)):
                    # Words without a GloVe vector (e.g. START, UNK) stay all-zero
                    if history[k] in gloves_dict:
                        batch_input[j,k,:] = gloves_dict[history[k]]
                # One-hot encode the target word
                batch_label[j,word_to_index[word]] = 1
            yield batch_input, batch_label
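
The expression `-(-len(data) // batch_size)` is a ceiling division, so the number of batches is rounded up rather than down. The sketch below, with hypothetical helper names `ceil_div` and `batch_slices`, shows how that count partitions a dataset so that every sample is covered exactly once, with the final slice possibly shorter than `batch_size`.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b, using the same trick as -(-a // b)."""
    return -(-a // b)

def batch_slices(num_samples, batch_size):
    """Yield (start, stop) index pairs covering every sample exactly once."""
    for i in range(ceil_div(num_samples, batch_size)):
        start = i * batch_size
        yield start, min(start + batch_size, num_samples)

slices = list(batch_slices(num_samples=10, batch_size=4))
print(slices)  # [(0, 4), (4, 8), (8, 10)]
```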

## Building the model

In [6]:
hidden_neurons = 128
model = Sequential()
model.add(SimpleRNN(hidden_neurons, input_shape=(sent_len, glove_dim)))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
simple_rnn_1 (SimpleRNN)     (None, 128)               54912     
_________________________________________________________________
dense_1 (Dense)              (None, 13550)             1747950   
=================================================================
Total params: 1,802,862
Trainable params: 1,802,862
Non-trainable params: 0
_________________________________________________________________
None
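
The parameter counts in the summary can be checked by hand. The sketch below assumes a GloVe dimension of 300, which is not stated in the notebook but is the value implied by the 54,912 parameters reported for the recurrent layer; the vocabulary size of 13,550 is read off the Dense layer's output shape.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias of a simple recurrent layer."""
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    """Weight matrix + bias of a fully connected layer."""
    return input_dim * units + units

glove_dim = 300        # assumption inferred from the summary, not given in the text
hidden_neurons = 128
vocab_size = 13550     # from the Dense layer's output shape

rnn_count = simple_rnn_params(glove_dim, hidden_neurons)
dense_count = dense_params(hidden_neurons, vocab_size)
print(rnn_count, dense_count, rnn_count + dense_count)
```

Almost all of the parameters sit in the output layer, because it maps the 128-dimensional hidden state onto every word in the vocabulary.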


The next block of code defines a few functions to generate sentences from the RNN.

In [7]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def generate_headline():
    end_sentence = False
    sent = np.zeros((sent_len, glove_dim))
    
    generated = []
    curr_len = 0
    while not end_sentence:
        sent_input = np.expand_dims(sent[-sent_len:sent.shape[0]], axis=0)
        word_probs = model.predict(sent_input, verbose=0)
        next_word = sample(np.squeeze(word_probs, axis=0))
        if next_word == word_to_index[STOP_TOKEN] or curr_len == sent_len:
            end_sentence = True
            print(' '.join(generated))
        else:
            if index_to_word[next_word] in gloves_dict:
                word_embeded = gloves_dict[index_to_word[next_word]]
            else:
                word_embeded = np.zeros(glove_dim)
            sent = np.concatenate((sent, np.expand_dims(word_embeded, axis=0)), axis=0)
            generated.append(index_to_word[next_word])
            curr_len += 1
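
The `temperature` argument of `sample` controls how adventurous the sampling is. The sketch below isolates the re-weighting step in a hypothetical helper, `apply_temperature`, and applies it to a made-up three-word distribution. Note that `generate_headline` calls `sample` with the default temperature of 1.0, which leaves the model's distribution unchanged.

```python
import numpy as np

def apply_temperature(preds, temperature=1.0):
    """Return the re-normalised distribution that sample() would draw from."""
    preds = np.asarray(preds, dtype="float64")
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / np.sum(scaled)

toy_probs = [0.7, 0.2, 0.1]  # made-up probabilities for illustration
sharp = apply_temperature(toy_probs, temperature=0.5)
flat = apply_temperature(toy_probs, temperature=2.0)
print(np.round(sharp, 3))
print(np.round(flat, 3))
```

Temperatures below 1 concentrate probability on the most likely word, while temperatures above 1 flatten the distribution and make unlikely words more probable.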

Before we begin training, let's first look at what kind of headlines an untrained RNN generates.

In [8]:
for _ in range(5):
    generate_headline()

vandal revises tested seminar conservationist face tougher mates slaps cipriani brindabella springsteen midwest brawlers warrant
jailed stewarts ekka kirby christie mossman bride jacobs firies honours banking 18pc bellied milan wmd
putting shocking rust victorians halls direction alarms partnerships expense destruction an penguin nov passion pritchard
share irrigator childrens adani abalone simulation plays opponents toodyay eureka tehran exmouth coldest tally stalked
thousands donors scales abandon humane lnp preparation finalises alcopop ada audit martyn showing aplenty boer


Doesn't make much sense, eh?

## RNN Training
Now that we've constructed our RNN, we can begin training. Training takes quite a while (about 30 minutes).

In [9]:
def on_epoch_end(epoch, logs):
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    for i in range(3):
        generate_headline()
        print()

optimizer = RMSprop(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.fit_generator(sample_generator(), steps_per_epoch=num_batches, epochs=3,
                    callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])

Epoch 1/3

----- Generating text after Epoch: 0
smokers <UNK> man injured service station redevelopment

colombia court <UNK>

govt announces 2m in melbourne first

Epoch 2/3

----- Generating text after Epoch: 1
thunderbirds systemic dairy hope new shire shire

temporary north years palestinians in the bill cherry flowering american olive of mining

company new shire shire

Epoch 3/3

----- Generating text after Epoch: 2
sharon to enrol super bowl burning phoenix

france police investigate

reveals an attempted on the island




We've finally finished training our RNN! Let's see what kind of headlines we can generate now.

In [15]:
for i in range(10):
    generate_headline()

first hand in the downgrades
govt toughens urges progress
more late over attempted to enrol campus
researchers denmark of the <UNK> increases
denmark us to enrol horne horne technology a
bundaberg floods
report more <UNK> <UNK> the increase
<UNK> field <UNK> man forced following to
bushrangers police investigate truck appearance
<UNK> teen late late to blame burning after arrest


That seems much better! Although some of the headlines still don't make a lot of sense, they seem to follow the syntactic structure of a legitimate news headline. With longer training and a more expressive recurrent architecture (for example, an LSTM or GRU layer in place of the SimpleRNN), we could probably do even better!