# Text Generation with LSTM

<img src="https://github.com/ckenlam/Deep-Learning-Text-Generation-Model/blob/master/bro-code-lstm.png?raw=true" width="700">

Inspired by Janelle Shane's [Pickup Line Generator](https://aiweirdness.com/post/159302925452/the-neural-network-generated-pickup-lines-that-are), I will attempt to train a recurrent neural network that can come up with its own Bro Codes using data from the following sources:
- https://brocode.org/the-code/
- http://www.fanpop.com/clubs/barney-stinson/articles/162623/title/chicks-code

## Loading the Libraries

In [1]:
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.models import Sequential
import keras.utils as ku
from tqdm import tqdm
import re
import requests

Using TensorFlow backend.


## Loading the data

I gathered all 184 Bro Codes and 125 Chick Codes and saved them in a text file. As you can imagine, this is a very small dataset of 309 documents; we shall see what kind of outputs we can get from this dataset. 

In [2]:
url = 'https://raw.githubusercontent.com/ckenlam/Deep-Learning-Text-Generation-Model/master/data.txt'
response = requests.get(url)
data = response.text

## Data Processing

To generate realistic sentences, I need to preserve the punctuation in the dataset. However, Keras' Tokenizer strips all punctuation by default. To get around that, I insert a space before each punctuation mark I want to keep, so that it becomes its own token, and remove those marks from the Tokenizer's `filters` argument.

In [3]:
data = re.sub(r'\d+: ', '', data)  # raw string so that \d is not treated as a string escape
cleanup_dict = {"ARTICLE ":""
               ,".":" ."
               ,",":" ,"
               ,"!":" !"
               ,"?":" ?"
               ,"\r\n":"\n"
               ,"\n\n":"\n"
                }
for from_this, to_this in cleanup_dict.items():
    data = data.replace(from_this, to_this)
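
To make the effect of this cleanup concrete, here is a minimal sketch that applies the same substitutions to a single made-up line. The helper name `space_out_punctuation` and the sample line (including its `ARTICLE 12: ` prefix) are assumptions for illustration only; they are not taken from the dataset.

```python
import re

def space_out_punctuation(text):
    """Apply the same cleanup steps as above to one string (illustrative helper)."""
    text = re.sub(r'\d+: ', '', text)          # drop numeric prefixes such as "12: "
    text = text.replace("ARTICLE ", "")        # drop the leftover "ARTICLE " label
    for mark in ('.', ',', '!', '?'):
        text = text.replace(mark, ' ' + mark)  # "cries," -> "cries ,"
    return text

# Hypothetical input line, not copied from data.txt
sample_line = "ARTICLE 12: A bro never cries, ever!"
cleaned = space_out_punctuation(sample_line)
tokens = cleaned.split()
print(cleaned)
print(tokens)
```

After this step, splitting on whitespace yields the comma and the exclamation mark as separate tokens, which is what lets the Tokenizer learn them as vocabulary entries.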

A text generation model requires each document to be split into n-gram sequences, each paired with the word that follows it. Simply put, the sentence "Bro's do not keep a personal diary" needs to be transformed into the following:
1. Bro's + **do**
2. Bro's do + **not**
3. Bro's do not + **keep**
4. Bro's do not keep + **a**
5. Bro's do not keep a + **personal**
6. Bro's do not keep a personal + **diary**
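
The list above can be reproduced with a few lines of plain Python. This is only a sketch on whitespace-split words; the function name `prefix_pairs` is made up here, and the notebook itself works on integer token IDs rather than strings.

```python
def prefix_pairs(tokens):
    """Return (context, next_word) pairs for every proper prefix of a token list."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "bro's do not keep a personal diary".split()
pairs = prefix_pairs(tokens)

for context, next_word in pairs:
    # Prints lines such as: bro's do + not
    print(" ".join(context), "+", next_word)
```

A sentence of seven words therefore contributes six training examples.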

At its core, a text generation model is a language model that predicts the probability of each word in the vocabulary given an input sequence. The predicted word is then appended to the input sequence, which is fed back in to predict the next word. The cycle repeats an arbitrary number of times.
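
The predict-append-repeat cycle can be sketched without any neural network. In the toy below, a hand-written lookup table (`toy_next_word`, a hypothetical stand-in for a trained model) always returns one next word for the last word seen; a real model would return a probability for every word in the vocabulary instead.

```python
# Toy stand-in for a trained model: maps the last word to a single next word.
# The entries are invented for illustration.
toy_next_word = {"a": "bro", "bro": "never", "never": "cries", "cries": "."}

def generate(seed_text, n_words):
    """Repeatedly predict a word and append it to the running sequence."""
    words = seed_text.split()
    for _ in range(n_words):
        next_word = toy_next_word.get(words[-1])
        if next_word is None:   # the toy model has nothing to say; stop early
            break
        words.append(next_word)
    return " ".join(words)

generated = generate("a bro", 3)
print(generated)
```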

The following data processing function is based on the one found in Shivam Bansal's [article](https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275).

In [4]:
tokenizer = Tokenizer(filters='"#$%&()*+-/:;<=>@[\\]^_`{|}~\t\n')

def data_processing(text_data):
    text = text_data.lower().split('\n')
    tokenizer.fit_on_texts(text)
    words_count = len(tokenizer.word_index)+1
    sequences = []
    
    #For each line in the text data, tokenize each word
    for line in text:
        tokenized_line = tokenizer.texts_to_sequences([line])[0]
        
        #Turn each tokenized line into n-gram sequences and append to the "sequences" list
        for seq in range(1, len(tokenized_line)):
            n_gram_seq = tokenized_line[:seq+1]
            sequences.append(n_gram_seq)
            
    #Find the maximum length in this dataset
    max_sequence_length = max(len(x) for x in sequences)
    
    #Make sure all n-gram sequences are of the same length
    sequences = pad_sequences(sequences,maxlen=max_sequence_length,padding='pre')
    X = sequences[:,:-1]
    y = sequences[:,-1]
    y = ku.to_categorical(y, num_classes=words_count)
    word_index = tokenizer.word_index
    return X,y,max_sequence_length,words_count, word_index

In [5]:
X, y, max_seq_length, total_words_count, word_index = data_processing(data)
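
To illustrate what the padding and the final split do, the sketch below mimics `padding='pre'` on a few invented token-ID sequences using plain lists. The helper `pre_pad` and the ID values are assumptions for illustration; the notebook relies on Keras' `pad_sequences` for the real work.

```python
def pre_pad(seq, length, pad_value=0):
    """Left-pad a sequence with pad_value up to the given length."""
    return [pad_value] * (length - len(seq)) + list(seq)

# Invented n-gram sequences of token IDs (IDs start at 1; 0 is kept for padding)
sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
max_len = max(len(s) for s in sequences)
padded = [pre_pad(s, max_len) for s in sequences]

# Everything except the last token is the input; the last token is the label
X_toy = [row[:-1] for row in padded]
y_toy = [row[-1] for row in padded]

print(padded)
print(X_toy)
print(y_toy)
```

Because index 0 is used for padding and the Tokenizer numbers words from 1, the vocabulary size passed to the model is `len(tokenizer.word_index) + 1`.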

## Defining the Models

For this experiment, I will train a neural network with three stacked bidirectional LSTM layers. The embedding layer is learned from scratch rather than initialized from a pre-trained word embedding.

In [9]:
model = Sequential()
model.add(Embedding(total_words_count, 10, input_length=max_seq_length - 1))
model.add(Bidirectional(LSTM(150, dropout=0.6, recurrent_dropout=0.6,return_sequences=True)))
model.add(Bidirectional(LSTM(150, dropout=0.6, recurrent_dropout=0.6,return_sequences=True)))
model.add(Bidirectional(LSTM(150)))
model.add(Dropout(0.1))
model.add(Dense(total_words_count,activation='softmax'))

model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 136, 10)           16700     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 136, 300)          193200    
_________________________________________________________________
bidirectional_2 (Bidirection (None, 136, 300)          541200    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 300)               541200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1670)              502670    
Total params: 1,794,970
Trainable params: 1,794,970
Non-trainable params: 0
_________________________________________________________________
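
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (an LSTM has four gates, each with input weights, recurrent weights, and a bias; a bidirectional wrapper doubles that). The function names are made up for this check; the sizes are read off the summary above.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # One forward and one backward LSTM
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size, embed_dim, units = 1670, 10, 150

counts = [
    embedding_params(vocab_size, embed_dim),       # embedding_1
    bidirectional_lstm_params(embed_dim, units),   # bidirectional_1
    bidirectional_lstm_params(2 * units, units),   # bidirectional_2
    bidirectional_lstm_params(2 * units, units),   # bidirectional_3
    dense_params(2 * units, vocab_size),           # dense_1
]
total = sum(counts)
print(counts, total)
```

Roughly 1.8 million parameters are being fit to a few thousand training sequences, which is worth keeping in mind when reading the training results.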


## Training the Models

In [10]:
from keras.callbacks import History 
from keras.callbacks import EarlyStopping
batch_size = 50
epochs = 200

In [16]:
from keras.callbacks import ModelCheckpoint
# checkpoint
filepath="weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

history = model.fit(X, y, epochs=epochs, verbose = 1
                      , batch_size=batch_size
                      , callbacks=callbacks_list
                      ,validation_split=0.2
                      #,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)]
                     )
print("Training completed!")
model.save('model.h5') 



Train on 5757 samples, validate on 1440 samples
Epoch 1/200

Epoch 00001: val_acc improved from -inf to 0.06250, saving model to weights-improvement-01-0.06.hdf5
Epoch 2/200

Epoch 00002: val_acc did not improve from 0.06250
Epoch 3/200

Epoch 00003: val_acc improved from 0.06250 to 0.06250, saving model to weights-improvement-03-0.06.hdf5
Epoch 4/200

Epoch 00004: val_acc did not improve from 0.06250
Epoch 5/200

Epoch 00005: val_acc did not improve from 0.06250
Epoch 6/200

Epoch 00006: val_acc did not improve from 0.06250
Epoch 7/200

Epoch 00007: val_acc did not improve from 0.06250
Epoch 8/200

Epoch 00008: val_acc did not improve from 0.06250
Epoch 9/200

Epoch 00009: val_acc did not improve from 0.06250
Epoch 10/200

Epoch 00010: val_acc did not improve from 0.06250
Epoch 11/200

Epoch 00011: val_acc improved from 0.06250 to 0.06458, saving model to weights-improvement-11-0.06.hdf5
Epoch 12/200

Epoch 00012: val_acc did not improve from 0.06458
Epoch 13/200

Epoch 00013: val_acc


Epoch 00200: val_acc did not improve from 0.11458
Training completed!


It took many hours to train for 200 epochs on Google Colab. However, the validation accuracy only improved from 6.25% to 11.46%, while the training accuracy reached 80.81%. This gap means the model is overfitting the training data rather than generalizing. While this is unfortunate, the result is somewhat expected given how small the dataset is. Nonetheless, we might still get interesting outputs from the model. We shall see in the next section.
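
As a side note on where the 5757/1440 split in the log comes from: to my understanding, Keras' `validation_split` holds out the last fraction of the samples, taken before any shuffling, and rounds the split point down. The sketch below reproduces the sample counts under that assumption; `split_sizes` is a made-up helper, not a Keras function.

```python
def split_sizes(n_samples, validation_split):
    """Sizes of the train and validation parts when the tail is held out."""
    split_at = int(n_samples * (1.0 - validation_split))  # rounded down
    return split_at, n_samples - split_at

n_samples = 5757 + 1440   # total number of n-gram sequences reported in the log
train_n, val_n = split_sizes(n_samples, 0.2)
print(train_n, val_n)
```

If that is indeed how the split is made, the validation set consists of the n-grams from the last lines of the text file rather than a random sample, which may be one more reason the validation accuracy stays low.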

## Testing the Models

In [11]:
from keras.models import load_model
model = load_model('model.h5')

The following helper function is taken from Keras's **lstm_text_generation** [example script](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py) on GitHub. For a text generation model, we are not always interested in the single word with the highest probability (i.e. the safest guess); instead, we want the model to be slightly more "creative" in its guesses of the next word. To do so, the helper function introduces a scaling factor called "temperature", which controls how conservative or creative the output is. A lower temperature (e.g. 0.1) leads to safer guesses, while a higher temperature yields "riskier" guesses.

In [12]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
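
To see what the temperature does before any sampling happens, the sketch below applies the same log, divide, exponentiate, and renormalize steps to a small invented distribution using only the standard library. The function name `apply_temperature` and the probabilities are assumptions for illustration.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution the same way sample() does, without sampling."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]                 # invented next-word probabilities
cold = apply_temperature(probs, 0.1)    # sharpens: almost all mass on the top word
same = apply_temperature(probs, 1.0)    # leaves the distribution unchanged
hot = apply_temperature(probs, 4.0)     # flattens: unlikely words become more likely

for name, dist in (("cold", cold), ("same", same), ("hot", hot)):
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 makes the sampler behave almost like `argmax`, while a temperature well above 1 moves the distribution toward uniform, so rare words are picked far more often.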

Once again, the following generate_text function is based on Shivam Bansal's [article](https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275), with a minor modification that incorporates the helper function above.

In [12]:
def generate_text(seed_text, next_words, max_sequence_len, model, temp_low, temp_high):
    for j in range(next_words):
        tokenized_seed_text = tokenizer.texts_to_sequences([seed_text])[0]
        tokenized_seed_text = pad_sequences([tokenized_seed_text], 
                                            maxlen=max_sequence_len-1, 
                                            padding='pre')
        
        #for each prediction, a new "temperature" is randomly chosen to add more randomness to the results
        temperature = random.uniform(temp_low,temp_high)
        
        predicted_proba = model.predict(tokenized_seed_text, verbose=0)[0]
        predicted_word = sample(predicted_proba,temperature = temperature)
        output_word = ""
        for word, index in tokenizer.word_index.items():
            if index == predicted_word:
                output_word = word
                break
        seed_text += " " + output_word

    return seed_text
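
The inner loop in `generate_text` scans the whole `word_index` dictionary for every generated word. One possible alternative, sketched below on an invented vocabulary, is to build the reverse mapping once and look words up directly. The names `index_to_word` and `lookup_word` are hypothetical and the notebook does not use them.

```python
# Invented vocabulary in the same shape as tokenizer.word_index (word -> index)
word_index = {"a": 1, "bro": 2, "never": 3, "cries": 4, ".": 5}

# Build the reverse mapping once, outside the generation loop
index_to_word = {index: word for word, index in word_index.items()}

def lookup_word(predicted_index):
    """Return the word for an index, or "" for unknown indices such as padding (0)."""
    return index_to_word.get(predicted_index, "")

print(lookup_word(3))
print(repr(lookup_word(0)))
```

Returning an empty string for index 0 matches what the loop above does when no word is found.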

To generate new Bro Codes, I will start the sequence with "a bro" (or "when a bro", "if a bro", etc.) and randomly assign a sequence length and temperature. Below is one of the many trials that I ran:

In [117]:
import random
for n in range(10):
    word_count = random.randint(7,15)
    print(generate_text('a bro',word_count, max_seq_length, model, 2, 4))

a bro does not go with front of other bros .
a bro or crosses a birthday of their women and do bro .
a bro never facetimes . a chick hits sometimes even better bro’s pretend not the being anything
a bro must never tamper with another chick's ex , unless the bro must ex it
a bro doesn’t offers another bro drink and drive with if a naked to the the
a bro shall not expected to notice and if she can
a bro is not allowed to notice another bro’s new haircut . unless are drunk
a bro is not required to fail of at all drunk , all
a bro never cries , with bro and only be results the agreed the existence
a bro doesn’t allow another bro to another anything if they or revenge of


## Conclusion

As expected, the model generates the exact sequences from the Bro Codes many times due to overfitting; yet, it still managed to come up with some novel ones:
- 'A bro is unacceptable in sandals.'
- 'A bro is honor bound to drink'
- 'A bro always grow a moustache of armpits'
- 'A bro will not sleep in front of bros. no'
- 'A bro never tickles another bro's muscles'
- 'A bro will order any type of alcoholic drink she wants'
- 'A bro is honour bound to accept all style chick'
- 'A bro does not choose his own wrestling challenges when a chick orders'
- 'A bro never gives another bro’s toothbrush they wouldn’t try them self .'
- 'A bro never willingly relinquishes possession of a bachelor party'
- 'A bro never lets another bro drink with any married bros'
- 'A bro will only be supportive of all decisions of his side chicks .'
- 'A bro is honor bound to his wife .'
- 'A bro is always completely respectful to music in a bar'
- 'When a bro is in doubt , he shall consider the actions of chuck norris'

These make no sense but are still grammatically correct (to a certain extent):
- 'If a bro asks out a guitar at a party , he will leave a quarter'
- 'If a bro asks , a chick may not be tolerated in another bro to toss a banana'
- 'When a bro dies while lifting weights  , he is required to sit him twice before she must honor the weird wait three days'
- 'A bro never facetimes , skypes or uses any other guy and chair for cue .'
- 'A bro never borrows or lends clothes of another bro to eating a banana '


## References

[1] S. Bansal, 'Language Modelling and Text Generation using LSTMs — Deep Learning for NLP', 2018. [Online]. Available: https://medium.com/@shivambansal36/language-modelling-text-generation-using-lstms-deep-learning-for-nlp-ed36b224b275. [Accessed: 21 June 2019]

[2] Keras Team, 'lstm_text_generation.py', 2019. [Online]. Available: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py. [Accessed: 21 June 2019]