# Task 3: build a neural network language model



Let's start by importing the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
import io
import random
from tqdm import tqdm

# Note that we will not use the Keras tokenizer but keep using the same NLTK tokenizer as in task 1
from nltk.tokenize import WordPunctTokenizer

# We use Keras here for simplicity. Replace with your neural network of choice.

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.optimizers import RMSprop
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.utils import to_categorical

# dataframe display option
pd.options.display.max_columns = None


Using TensorFlow backend.


# Global parameters

Next we set up the following global parameters.

Choose which type of text to train on: posts, comments or titles, the three categories in the original corpus.
* POSTS_TYPE

Reduce the volume of posts by filtering on the number of tokens in the text and by subsampling.
* MIN_TOKENS_LEN
* MAX_TOKENS_LEN
* DF_SAMPLE_COUNT

Lower limit on token occurrences: rarer tokens are dropped from the vocabulary.
* TOKENS_MIN_COUNT

Finally, define the length of the sequences that act as input and the sliding window used to create the sequences from the list of tokens.

* SEQUENCE_WINDOW: a smaller window generates a larger number of sequences
* SEQUENCE_LEN: longer sequences make the neural network harder to train
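
To get a feel for how SEQUENCE_WINDOW drives the volume of training data, the sketch below counts the sequences produced for a single document. It assumes one sequence per window step, which is how the `generate_sequences` function later in this notebook behaves; the helper name `count_sequences` is hypothetical.

```python
import math

def count_sequences(n_tokens, window):
    # One sequence per step of `window` tokens, plus one for a trailing remainder
    return math.ceil(n_tokens / window)

# A 150-token post with three different window sizes
for window in (1, 4, 8):
    print(window, count_sequences(150, window))
```

Halving the window roughly doubles the number of sequences, and therefore the memory and training time needed downstream.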


In [2]:
# setup variables

POSTS_TYPE = 'post'
MIN_TOKENS_LEN = 100
MAX_TOKENS_LEN = 200
DF_SAMPLE_COUNT = 20000

TOKENS_MIN_COUNT = 10

SEQUENCE_WINDOW = 4
SEQUENCE_LEN = 13

### Load the data
We use the same dataset we created in task 1 and used to train n-gram models in task 2.


In [3]:
# Load data
df_raw = pd.read_csv('../Dataset/stackexchange_812k1.tokenized.csv').sample(frac = 1, random_state = 8).reset_index(drop = True)


### Reduce the dataset

In an ideal world where all the RAM and CPU are available free of charge, we would not have to reduce the size of our dataset.

If we are using Google Colab, the RAM is limited to 25 GB. Reducing the size of the dataset is needed to prevent the notebook from crashing. It also helps to speed up successive runs and experiments.

We want to limit the number of items and reduce the overall number of unique tokens.

Note: if your code crashes for lack of memory, Colab shows you a link to increase the RAM to 25 GB. But then you have to rerun the notebook from the start.


In [4]:

df = df_raw[
            (df_raw.category == POSTS_TYPE) & 
            (df_raw.n_tokens > MIN_TOKENS_LEN)  & 
            (df_raw.n_tokens < MAX_TOKENS_LEN)
        ].sample(DF_SAMPLE_COUNT).reset_index(drop = True)

print("df.shape: ", df.shape)

print(df.text.sample(2).values)


df.shape:  (20000, 7)
['I am attempting to estimate . I have expressions for both the conditional expectation and the probability of a, where the probability is Poisson distributed and the conditional expectation is calculated with a relatively expensive recursive formula. As the Poisson distribution has countably infinite support all non-negative integers I cannot calculate all elements of the sum. As the conditional expectation has only a recursive formula no closed form that I know I cannot calculate an integral. The conditional expectation is also increasing in a. My thought would be to truncate, as the probabilities will be single peaked and decreasing quite quickly, but I was unsure as to if there is a guideline on where to truncate or if there is a better solution. Thank you.'
 "The metric you describe is in fact very common It's mean absolute error, or MAE. In scikit learn you can find it in the metrics submodule . Usually it's used for regression tasks, not for classification,

If you recall, the tokens are stored in the dataframe as whitespace-separated strings. Let's transform the tokens column back into arrays of tokens.

In [5]:
# transform the tokens field from whitespace-separated strings into arrays of tokens
df['tokens'] = df.tokens.apply(lambda t : np.array(t.split()))
print(df.tokens.sample().values)

[array(['is', 'this', 'much', 'different', 'from', 'doing', 'two', 'local',
       'polynomials', 'of', 'degree', ',', 'one', 'for', 'below', 'the',
       'threshold', 'and', 'one', 'for', 'above', 'with', 'smooth', 'at',
       'are', 'the', 'upper', 'and', 'lower', 'limits', 'of', 'the',
       'confidence', 'interval', 'for', 'the', 'smoothed', 'outcome', '.',
       'lpoly', 'lne', 'd', 'if', 'd', 'lt', ',', 'bw', '.', 'deg', 'n',
       'gen', 'x', 's', 'ci', 'se', 'se', 'lpoly', 'lne', 'd', 'if', 'd',
       'gt', ',', 'bw', '.', 'deg', 'n', 'gen', 'x', 's', 'ci', 'se',
       'se', 'get', 'the', 'cis', 'forvalues', 'v', 'gen', 'ul', 'v', "'",
       's', 'v', "'", '.', 'se', 'v', "'", 'gen', 'll', 'v', "'", 's',
       'v', "'", '-', '.', 'se', 'v', "'", 'tw', 'line', 'ul', 'll', 's',
       'x', ',', 'lcolor', 'blue', 'blue', 'blue', 'lpattern', 'dash',
       'dash', 'solid', 'line', 'ul', 'll', 's', 'x', ',', 'lcolor',
       'red', 'red', 'red', 'lpattern', 'dash', 'dash', 

## Vocabulary

Let's now reduce the overall vocabulary size by excluding tokens that appear TOKENS_MIN_COUNT times or fewer in the whole corpus.

The goal is to reduce the number of unique tokens with minimal impact on the original list of tokens.


In [6]:
# generate vocabulary
# filter out words that are too scarce
import itertools
all_tokens = list(itertools.chain.from_iterable(df.tokens))

# filter out least common tokens
from collections import Counter
counter_tokens = Counter(all_tokens)


vocab_size  = len(set(all_tokens))
vocab       = list(set(all_tokens))
print("original number of tokens", len(all_tokens))
print("original vocab_size", vocab_size)


# keep only the tokens that appear more than TOKENS_MIN_COUNT times
fltrd_tokens = [ token for token in all_tokens if counter_tokens[token] > TOKENS_MIN_COUNT ]

print("number of tokens", len(fltrd_tokens))
print("vocab_size", len(set(fltrd_tokens)))

vocab_size  = len(set(fltrd_tokens))
vocab       = list(set(fltrd_tokens))




original number of tokens 2859381
original vocab_size 32864
number of tokens 2796260
vocab_size 7226
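
To quantify the trade-off, the short sketch below computes the share of tokens and of vocabulary entries that survive the filter. The four figures are copied from the cell output above; the variable names are only illustrative.

```python
# Figures copied from the cell output above
n_tokens_before, n_tokens_after = 2859381, 2796260
vocab_before, vocab_after = 32864, 7226

token_coverage = n_tokens_after / n_tokens_before
vocab_kept = vocab_after / vocab_before

print("tokens kept: {:.2%}".format(token_coverage))
print("vocabulary kept: {:.2%}".format(vocab_kept))
```

Roughly 98% of the token occurrences are kept while the vocabulary shrinks to about 22% of its original size, which is the intended effect.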


Just to be sure, let's inspect the tokens that were rejected.

In [7]:
# rejected tokens

rejected_tokens = np.unique([ token for token in all_tokens if counter_tokens[token] <= TOKENS_MIN_COUNT ])


In [8]:
print("len(rejected_tokens): ", len(rejected_tokens))
print(np.random.choice(rejected_tokens, 100, replace = False))


len(rejected_tokens):  25638
['warping' 'ailment' 'foldcv' 'ig' 'faculty' 'galaxies' 'pycm' 'workbook'
 'mushroom' 'propper' 'procedur' 'keys' 'voxel' 'surveyna' 'pointest'
 'remarks' 'glivenko' 'lohr' 'coxtest' 'aces' 'hθ' 'theonull'
 'binarization' 'sharks' 'memspace' 'erratically' 'sizing' 'fevers' 'loaf'
 'chnaged' 'yoy' 'arguable' 'newmod' 'niceness' 'prp' 'eclipse'
 'arrangement' 'ive' 'werksituatie' 'eigenfaces' 'spring' 'dimitriy'
 'vglm' 'hodg' 'emailing' 'discounted' 'islucky' 'preferrably'
 'indifferent' 'minimally' 'guidiance' 'hyvarinen' 'depths' 'fitdiscr'
 'sidebar' 'expectantly' 'bowl' 'mobile' 'cplot' 'pior' 'handicap'
 'erlang' 'oristano' 'constuct' 'balloon' 'tunçel' 'assisting'
 'undersample' 'inlier' 'indefinite' 'bowling' 'vicious' 'shits' 'dietz'
 'flowchart' 'reinforce' 'expicitly' 'wps' 'julie' 'stazionary'
 'overfiting' 'clc' 'carcinus' 'hakan' 'biggiest' 'nonsignificant'
 'killmann' 'bolded' 'binomila' 'determinable' 'accuray' 'vsports'
 'fulfil' 'clerks' 'di

# OOV
Next we need to replace the missing tokens with a dedicated token so that our model knows how to handle out-of-vocabulary (OOV) tokens.

Let's add the token "UNK" to the vocabulary.

In [9]:
vocab.append('UNK')
vocab_size +=1 

## Tokens as vocabulary indexes

At this point, we have sequences of tokens and we need sequences of numbers. 
We replace each token with its index in the vocabulary, making sure that unknown tokens are replaced by the index of the "UNK" token.

This step can take quite a while to run when the dataframe is large.

To speed up this step and still handle OOV tokens, it is best to avoid a membership test in the list comprehension and to use a dictionary lookup with a try / except pattern in a function.


In [10]:
mapping = { w : i for i, w in enumerate(vocab) }

def getidx(token):
    # return the vocabulary index of the token, or the index of 'UNK' if it is unknown
    try:
        return mapping[token]
    except KeyError:
        return mapping['UNK']


df['tokens_idx'] = df.tokens.apply(lambda tokens : np.array([ getidx(token) for token in tokens]))
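
As an alternative to the try / except pattern, a dictionary's `get` method with a default value gives the same result in a single expression. The sketch below uses a toy vocabulary and a hypothetical helper name (`getidx_with_get`); whether it is faster than try / except on this dataset has not been measured here.

```python
# Toy vocabulary; 'UNK' is the catch-all token, as in the notebook
toy_vocab = ['the', 'model', 'is', 'UNK']
toy_mapping = {w: i for i, w in enumerate(toy_vocab)}
unk_idx = toy_mapping['UNK']

def getidx_with_get(token):
    # dict.get returns the fallback index when the token is missing
    return toy_mapping.get(token, unk_idx)

# 'overfits' is not in the toy vocabulary, so it maps to the 'UNK' index
print([getidx_with_get(t) for t in ['the', 'model', 'overfits']])
```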


In [11]:
print(df.tokens_idx.head(2).values)


[array([5356, 3153,  635, 1734, 6917, 7125, 6174, 2198, 1561, 3534, 4479,
       3101, 4085, 4642, 1807, 4642, 3310, 3271, 2881, 4599, 3415, 3101,
       5880, 1807, 3009, 5369, 2900, 6078, 5282, 5708, 5967, 2810, 1962,
       3495, 2623, 3153, 1513, 5349,  788, 1081, 1867, 1203, 3153, 2140,
       5159, 5042, 3153, 2140, 1858, 5967, 5904, 2221, 3747, 3153, 5297,
       2881, 3521, 3239, 3101, 1787, 1962, 6203, 5955, 2521, 1685, 1513,
       5349,  788, 2948, 2175, 6114, 3101, 1136, 1962, 2138, 5369,  839,
       5904, 1081, 3436, 5967, 6885, 3239, 3432, 4640, 5972, 2221, 3747,
       2948, 1081, 3153, 5822, 3239, 5955, 5752, 1391, 3748, 2322, 3101,
       5644, 1174, 2221, 1792, 5469, 2652, 2342, 2948, 5822, 4877, 1787,
       5369, 6792, 5449, 3101, 2175, 1081, 4624, 3473, 2057, 5459, 5297,
       2881, 5822, 4877, 5967, 2623, 3473, 5459, 4801, 4433,  981, 3748,
       3635, 5955, 3741, 1685, 3799, 3239, 2707, 3101, 6917, 3261, 5955,
       3070, 3368, 3101, 2175, 1081, 1014, 7226, 3

## Sequence generation

This is the final step in preparing the corpus as input to the neural network.

We want to train the neural net on a classification task where the input is a sequence of words (as token indexes) and the output is the following word.

From each list of token indexes, we generate K sequences of length SEQUENCE_LEN: we take the tokens up to a cut-off point, keep the last SEQUENCE_LEN of them, and repeatedly move the cut-off forward by SEQUENCE_WINDOW.

For instance, in a sentence of 15 tokens, if we set SEQUENCE_LEN = 6 and SEQUENCE_WINDOW = 3, we generate 5 sequences of length 6. Sequences that are shorter than SEQUENCE_LEN are left-padded with zeros.
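
The 15-token example can be reproduced with a pure-Python sketch. It mimics what Keras `pad_sequences` does with `padding='pre'` and its default left truncation, as far as I understand that function; the name `toy_generate_sequences` and the toy token indexes are assumptions made for illustration.

```python
def toy_generate_sequences(tokens, seq_len, window):
    # Take growing prefixes, keep the last `seq_len` items, left-pad with zeros
    sequences = []
    end = window
    while end < len(tokens) + window:
        chunk = tokens[:end][-seq_len:]
        sequences.append([0] * (seq_len - len(chunk)) + chunk)
        end += window
    return sequences

toy_tokens = list(range(1, 16))  # 15 token indexes: 1..15
for seq in toy_generate_sequences(toy_tokens, seq_len=6, window=3):
    print(seq)
```

Only the first sequence needs padding here: it holds the first three tokens preceded by three zeros. The last element of each row is the word to predict, and the preceding elements are its context.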




In [12]:

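def generate_sequences(sentence):
    sequences = []
    # take growing prefixes of the sentence, moving the end by SEQUENCE_WINDOW
    _end = SEQUENCE_WINDOW
    while _end < len(sentence) + SEQUENCE_WINDOW:
        sequences.append(sentence[:_end])
        _end += SEQUENCE_WINDOW
    # keep the last SEQUENCE_LEN tokens of each prefix, left-padding with zeros
    padded_seqs = pad_sequences(sequences, maxlen=SEQUENCE_LEN, padding='pre')
    return padded_seqs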
    


In [13]:
# Apply the sequence generation 
multi_sequences = df.tokens_idx.apply(generate_sequences)


The code below can be optimized to avoid reallocating memory as the main array is expanded with each new array of sequences. But for now it will suffice.

In [14]:
i = 0
for d in tqdm(multi_sequences.values):
    if i == 0:
        all_sequences = d
    else:
        all_sequences = np.concatenate( ( all_sequences, d )  )
    i +=1
print("\nsequences.shape: ",all_sequences.shape)

100%|████████████████████████████████████████████████████████████████████████████| 20000/20000 [03:49<00:00, 87.29it/s]


sequences.shape:  (722440, 13)
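
The repeated reallocation can be avoided by passing all the per-document arrays to `np.concatenate` at once. The sketch below shows the idea on toy arrays that stand in for `multi_sequences.values`; the toy names are assumptions.

```python
import numpy as np

# Toy stand-in for multi_sequences.values: one 2-D array per document
toy_multi_sequences = [np.zeros((3, 13), dtype=int),
                       np.ones((5, 13), dtype=int),
                       np.full((2, 13), 7)]

# A single call allocates the result once instead of once per document
toy_all_sequences = np.concatenate(toy_multi_sequences)
print(toy_all_sequences.shape)
```

In the notebook this would presumably read `np.concatenate(list(multi_sequences.values))`, which should give the same array as the loop above.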





Depending on your parameters, this may result in a very large number of sequences. Although more data generally helps, too many sequences make the network slow and memory-hungry to train.

The next step is optional but shows how to sample a fraction of the sequences (about 50% here) to reduce the input dataset.



In [15]:
if True:
    # keep each sequence with probability 0.5
    mask = np.random.choice([False, True], len(all_sequences), p=[0.50, 0.50])

    sequences = all_sequences[mask].copy()
else:
    sequences = all_sequences.copy()
print("\nsequences.shape: ",sequences.shape)


sequences.shape:  (361231, 13)


Then we create the predictors and labels for the classification task.

In [16]:
predictors  = sequences[:,:-1]
label       = sequences[:,-1]

print("predictors.shape", predictors.shape)
print("label.shape", label.shape)

# The to_categorical Keras function transforms the vector of n labels into a one-hot encoded matrix of dimension (n, vocab_size)
label_cat       = to_categorical(label, num_classes=vocab_size)

print("label_cat.shape", label_cat.shape)

predictors.shape (361231, 12)
label.shape (361231,)
label_cat.shape (361231, 7227)
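
The one-hot label matrix is by far the largest object created so far. The back-of-the-envelope sketch below estimates its size from the shapes printed above, assuming 4 bytes per value (float32, the default dtype of `to_categorical`).

```python
n_samples, n_classes = 361231, 7227   # label_cat.shape from the output above
bytes_per_value = 4                   # assumes float32, the to_categorical default

one_hot_bytes = n_samples * n_classes * bytes_per_value
print("one-hot labels: {:.1f} GB".format(one_hot_bytes / 1e9))
print("integer labels: {:.1f} MB".format(n_samples * 4 / 1e6))
```

If memory becomes the bottleneck, one option not used in this notebook is to keep the integer labels and train with Keras' `sparse_categorical_crossentropy` loss, which avoids materializing the one-hot matrix.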


# Model

We are now ready to define and train the neural network.

We choose

* an embedding dimension (32, 64, ...); 64 is used here
* 2 LSTM layers, with 128 and 64 units
* followed by a dense layer with softmax activation over the vocabulary
* RMSprop as the optimizer, with a learning rate of 0.01



In [17]:
'''
Define model
'''
embedding_dimension = 64
model = Sequential()
model.add(
    Embedding(vocab_size,
        embedding_dimension,
        input_length=SEQUENCE_LEN -1)
    )
model.add(LSTM(128, return_sequences = True))
model.add(LSTM(64))
model.add(Dense(vocab_size, activation='softmax'))
optimizer = RMSprop(lr=0.01)

model.compile(loss='categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'])

print(model.summary())


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 12, 64)            462528    
_________________________________________________________________
lstm_1 (LSTM)                (None, 12, 128)           98816     
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_1 (Dense)              (None, 7227)              469755    
Total params: 1,080,507
Trainable params: 1,080,507
Non-trainable params: 0
_________________________________________________________________
None
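
As a sanity check, the parameter counts in the summary can be recomputed by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias); the helper name `lstm_params` is hypothetical.

```python
vocab_size, emb_dim, lstm1, lstm2 = 7227, 64, 128, 64

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim + units + 1) * units

embedding = vocab_size * emb_dim       # one vector per vocabulary entry
dense = (lstm2 + 1) * vocab_size       # weights plus bias for each output class
total = embedding + lstm_params(emb_dim, lstm1) + lstm_params(lstm1, lstm2) + dense
print(embedding, lstm_params(emb_dim, lstm1), lstm_params(lstm1, lstm2), dense, total)
```

The embedding and the output layer account for most of the parameters, so the vocabulary size largely determines the size of the model.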


Let's fit the model

In [18]:
'''
Model Fitting!
'''
model.fit(predictors, label_cat, batch_size = 256, epochs=4, verbose=1)



  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
   512/361231 [..............................] - ETA: 2:15:19 - loss: 4.8694 - accuracy: 0.1816





<keras.callbacks.callbacks.History at 0x1d3459cff60>

In [19]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
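
The effect of the `temperature` argument is easier to see without the random draw. The sketch below applies the same rescaling as `sample` to a toy distribution; `apply_temperature` and `toy_probs` are names introduced for illustration only.

```python
import numpy as np

def apply_temperature(preds, temperature):
    # Same rescaling as in `sample`, without the random draw
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

toy_probs = [0.7, 0.2, 0.1]
for t in (0.5, 1.0, 3.0):
    print(t, np.round(apply_temperature(toy_probs, t), 3))
```

A temperature below 1 sharpens the distribution towards the most likely word, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it so that unlikely words are sampled more often.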


In [20]:

tokenizer = WordPunctTokenizer()

def generate_text(nmax, text, temperature):
    n = 0
    tokens = tokenizer.tokenize(text)
    while len(tokens) < nmax:
        n += 1

        # map tokens to vocabulary indexes; unknown words become 'UNK'
        tokens_idx = [ getidx(word) for word in tokens ]
        tokens_list = pad_sequences([tokens_idx], maxlen=SEQUENCE_LEN-1, padding='pre')
        # probability of each vocabulary word being the next token
        probas = model.predict(tokens_list, verbose=0)[0]

        next_word_idx = sample(probas, temperature = temperature)
        next_word = vocab[next_word_idx]

        # skip question marks, append any other sampled word to the text
        if next_word != '?':
            print(next_word, probas[next_word_idx])
            text += ' ' + next_word
        tokens = tokenizer.tokenize(text)
        # safety stop in case the text stops growing
        if n > 200:
            break
    return text


In [21]:
generate_text(15, 'a random variable', 3)

kpss 1.9849795e-06
penalized 1.5628754e-06
ssa 2.096636e-09
given 0.0020274164
aicc 1.5821489e-06
student 0.00035136475
color 7.966885e-06
faces 3.1813456e-06
documents 0.00021886693
e 0.0030902135
high 8.537658e-06
cronbach 9.656214e-08


'a random variable kpss penalized ssa given aicc student color faces documents e high cronbach'

# Perplexity

In the case of the n-gram language model, we used the probability of each n-gram in the input sentence to calculate the perplexity.

Our current model does not rely on n-grams, but on the probability that a sequence of tokens is followed by a given token.

We can adapt the perplexity formula for n-gram language models to sequence-based language models as follows.

Consider a sentence of N tokens:

$$w_{1},\cdots, w_N$$

We can calculate the probability of that sentence as the product of the probabilities of each token given its left-padded preceding subsequence. Let's take the example of a 3-token sentence with a context of 2 tokens.

$$
P(w_{1},w_2, w_3) =  P(w_{3} | w_1, w_2) \times P(w_2 | 0, w_1)  \times P(w_1 | 0 ,0)
$$

In general, for a sentence of N tokens and a sequence length S

$$
P(w_{1},\cdots, w_N) = \prod_{k = 1}^{N} P(w_{k} | \text{padded}_S(w_{1}, \cdots, w_{k-1}))
$$

where $\text{padded}_S$ keeps the last S-1 tokens of the context, left-padding with zeros when fewer are available, and

$$P(w_{k} | \text{padded}_S(w_{1}, \cdots, w_{k-1}))$$ 

is precisely the probability given by the classification model.

We can compute the perplexity of a sentence of length N with 
$$PP(w_{1},\cdots, w_N) = \exp \left[ - \frac{1}{N} \sum_{k = 1}^{N} \log P(w_{k} | \text{padded}_S(w_{1}, \cdots, w_{k-1})) \right]$$
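
To make the formula concrete, the sketch below computes the perplexity directly from a list of per-token probabilities, with no model involved. The function name `perplexity_from_probs` and the toy probabilities are assumptions.

```python
import math

def perplexity_from_probs(probs):
    # exp of the negative mean log-probability of the observed tokens
    logprob = sum(math.log(p) for p in probs)
    return math.exp(-logprob / len(probs))

# A model that assigns 1/100 to every observed token has a perplexity of about 100
print(perplexity_from_probs([0.01] * 5))
# Higher probabilities on the observed tokens give a lower perplexity
print(perplexity_from_probs([0.5, 0.25, 0.125]))
```

Perplexity can be read as the number of equally likely words the model is hesitating between at each step, so lower is better.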




In [22]:
# set the sequence window to 1 to generate all the sub sequences from the original sentence.

SEQUENCE_WINDOW = 1

# and define the perplexity for a sentence

def perplexity(sentence):
    # tokenize
    tokens = tokenizer.tokenize(sentence.lower())
    N = len(tokens)
    # find the indexes of the tokens in the vocabulary; unknown words become 'UNK'
    tokens_idx = [ getidx(word) for word in tokens ]
    # generate a N x SEQUENCE_LEN array of padded sequences 
    sequences = generate_sequences(tokens_idx)
    predictors  = sequences[:,:-1]
    label       = sequences[:,-1]
    # the probabilities of all the words in the vocab given each padded sequence
    probas = model.predict(predictors, verbose=0)
    # add the log of the probability of the label given the padded sequence
    logprob = 0
    for k in range(N):
        p = probas[k,label[k]]
        logprob += np.log( p  )    
    return np.exp(- logprob / N), logprob



Now compare three sentences: the first one directly extracted from the corpus, the second from an old song and the third grammatically invalid.

Although this is not direct proof of the quality of the model, we see that the first sentence has the lowest perplexity and the third the highest. The second sentence, which is grammatically correct but obviously not about data or algorithms, scores in between.



In [23]:
sentence = "In a fixed-effects model only time-varying variables can be used."
print(sentence, perplexity(sentence))

sentence = "I know a pretty little place in Southern California, down San Diego way."
print(sentence, perplexity(sentence))

sentence = "This that is noon but yes apple whatever did regression variable"
print(sentence, perplexity(sentence))


In a fixed-effects model only time-varying variables can be used. (243.19416503047952, -82.40790235996246)
I know a pretty little place in Southern California, down San Diego way. (374.04122297881037, -88.86549019813538)
This that is noon but yes apple whatever did regression variable (3454.791916884255, -89.62269258499146)


## Perplexity on corpus

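Finally, let's calculate the perplexity on a validation set.

We define the validation set as items from the original corpus that the model was not trained on. Here we use the titles; to shorten the run, you can restrict it to N random items, for instance 100 titles with between 10 and 100 tokens.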
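
A minimal sketch of drawing a random sample of titles within a token-count range is shown below. It runs on a toy dataframe with the same column names as `df_raw`; the sample size `N_VALID`, the inclusive bounds and the `random_state` are assumptions, not values taken from the original run.

```python
import pandas as pd

# Toy stand-in for df_raw with the same column names as the notebook
toy_raw = pd.DataFrame({
    'category': ['title'] * 6 + ['post'] * 2,
    'n_tokens': [5, 10, 13, 40, 100, 250, 120, 150],
    'text': ['t{}'.format(i) for i in range(8)],
})

N_VALID = 3  # hypothetical sample size; the text above suggests 100
toy_valid = toy_raw[
    (toy_raw.category == 'title') &
    (toy_raw.n_tokens >= 10) &
    (toy_raw.n_tokens <= 100)
].sample(N_VALID, random_state=8).reset_index(drop=True)
print(toy_valid)
```

Fixing `random_state` keeps the validation set identical across runs, which makes perplexity scores comparable between experiments.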








In [31]:
df_valid = df_raw[df_raw.category.isin(['title'])].copy()
#print("df_valid",df_valid)
print(df_valid.head(2))

    post_id  parent_id  comment_id  \
21   154700        NaN         NaN   
24   160640        NaN         NaN   

                                                 text category  \
21  Are aov with Error same as lmer of lme package...    title   
24  How to compare contingency tables for a specif...    title   

                                               tokens  n_tokens  
21  are aov with error same as lmer of lme package...        13  
24  how to compare contingency tables for a specif...        10  


In [25]:
def corpus_perplexity(corpus):
    # start by calculating the total number of tokens in the corpus
    all_sentences = ' '.join(corpus)
    all_tokens =  tokenizer.tokenize(all_sentences.lower())
    N = len(all_tokens)
    logproba = 0
    perps = []
    for sentence in corpus:
        pp, logp = perplexity(sentence)
        logproba += logp
        perps.append(pp)
        #print ("{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}\t{}".format(pp, np.mean(perps), logp, logproba, np.exp( - logproba / (N  )), sentence  ))

    return np.exp( - logproba / (N))




In [33]:
corpus = df_valid.text.values
#print(corpus)
perplexity_score= corpus_perplexity(corpus)
print(" Corpus perplexity: {:.2f}".format(perplexity_score ))



 Corpus perplexity: 659.78
