# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Seinfeld Script Generator

Notebook 3: Preprocessing & Modeling - Character Level

I strongly believe that one size does not fit all. Brands differ widely in their needs, set-ups and available resources, so they should not be limited to a single option. I therefore decided to explore several models that can address those needs at different levels.

After reading articles and papers by other data scientists, I decided to start with a recurrent neural network (RNN), which has proven to be unreasonably effective in NLP tasks such as text generation, language translation and speech recognition. An RNN is powerful because it feeds its hidden state back into itself at every step, which gives the model a form of memory. However, plain RNNs suffer from vanishing gradients and struggle with long-term dependencies. The Long Short-Term Memory (LSTM) architecture, a special type of RNN, was designed to address these issues and achieves remarkable results. Because I wanted to generate meaningful dialogue between characters, I expected my model to take context into consideration, which involves long-term dependencies. That is why I decided to build my first script generator using LSTMs.

For text generation with LSTMs, one of the best-known projects is by AI researcher Andrej Karpathy. In his blog post [_The Unreasonable Effectiveness of Recurrent Neural Networks_](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), he built a character-based RNN-LSTM model to generate text and achieved impressive results. The text files used in his post ranged from 1MB to 474MB, and the running time ranged from minutes to days. My data is 4.2MB with 3.39 million characters, which is comparable to Karpathy's Shakespeare example. That is a moderate amount of data for a character-level RNN-LSTM, so I decided to try it out.

This notebook was run on Kaggle's GPU notebook and was heavily inspired by the code shared by [Patrick DeKelly](https://www.kaggle.com/valkling/pythonicpythonscript4making-seinfeld-scripts) on Kaggle. I started off using AI Notebook, but it ran out of RAM and crashed immediately after I started training the model. Kaggle was a lot more stable by comparison, though not perfect either: when I first tried to train on the entire dataset, each epoch took 30 minutes and I had planned 30 epochs, so after 9 hours, at epoch 18, Kaggle stopped the session and refused to run my model any further. That is why I decided to follow Patrick's approach and train on only part of the text.

In [3]:
import pandas as pd
import numpy as np
import random

from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, Activation
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

import os

In [14]:
path = "../data/for_train.txt"

# lowercase the text for easier tokenizing; the with-block closes the file afterwards
with open(path, 'r') as f:
    text = f.read().lower()

### Text Preprocessing

The first step is to tokenize the characters. We cannot feed raw text into the model; we have to convert it into a form the model understands, in this case a matrix of numbers. So we map each character to an integer index, which lets us represent the data entirely with numbers.
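As a small illustration of this mapping, the sketch below encodes a toy string and decodes it back. The sample string and the names `char_to_idx` and `idx_to_char` are made up for this example and are not used elsewhere in the notebook.

```python
# Toy example: map every unique character to an integer index and back.
sample = "jerry: hello."

# sorted() makes the index assignment reproducible between runs
vocab = sorted(set(sample))
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# encode the text as a list of integers, then decode it again
encoded = [char_to_idx[c] for c in sample]
decoded = "".join(idx_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

A dictionary lookup like `char_to_idx[c]` takes constant time, whereas `list.index` scans the list on every call, so a dictionary can be the more convenient choice on long texts.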

In [15]:
len(text)

3387170

In [18]:
# As mentioned, in consideration of training time and hardware limitation, I used the first 500,000 characters of my data to train the model

text = text[:500000]

char = list(set(text))
char.sort() 

print(char)
print(f'Unique tokens: {len(char)}')

# np.save('../assets/char_based/charindex.npy', char)

['\n', ' ', '!', '"', '#', '$', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '\\', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¿', 'é']
Unique tokens: 59


There are 59 unique tokens in my text.

### Define Sequence Length

The idea behind my model is to give it a fixed number of characters and train it to predict the character that comes right after. With an RNN/LSTM, I am free to choose that number. Here are the considerations: if the number is too small, the model only sees a very short stretch of text and may not have enough information. If the number is too large, e.g. a window that spans from one episode into the next, there may be too much noise for the model to predict the next character effectively. Considering these, I set the sequence length to 100 characters, which is a little longer than an average line of 13 words.
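To make the windowing idea concrete, here is a minimal sketch on a toy string with a sequence length of 4 instead of 100. The helper `make_windows` and the toy text are assumptions for illustration only.

```python
def make_windows(text, seq_len, step=1):
    """Split text into (input, target) pairs.

    Each input is seq_len consecutive characters and each target
    is the single character that follows that input.
    """
    inputs, targets = [], []
    for i in range(0, len(text) - seq_len, step):
        inputs.append(text[i:i + seq_len])
        targets.append(text[i + seq_len])
    return inputs, targets


toy_inputs, toy_targets = make_windows("jerry: hi.", seq_len=4)

# show every window next to the character the model should predict
for x, y in zip(toy_inputs, toy_targets):
    print(repr(x), "->", repr(y))
```

With a step of 1, a text of `n` characters yields `n - seq_len` training pairs, so almost every character becomes a prediction target once.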

In [19]:
# this is the sequence length
maxlen = 100

# set up feature and label
X_train = []
y_train = []

# turn the data into a list of sequences
for i in range(0, len(text)-maxlen, 1): 
    X = text[i: i + maxlen]
    y = text[i + maxlen]
    
    # map the token with the index value
    X_train.append([char.index(x) for x in X])
    y_train.append(char.index(y))

# reshape the X_train to be ready to fit into the model
X_train = np.reshape(X_train, (len(X_train), maxlen))
# one hot encode the label
y_train = np_utils.to_categorical(y_train)

### Modeling

Karpathy used 3 LSTM layers to train on the Shakespeare text, and so did DeKelly, so I decided to do the same. I also added dropout to the first two LSTM layers to reduce overfitting.

In [24]:
def get_model():
    
    model = Sequential()
    
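    # add Embedding layer; shape must be a tuple, hence the trailing comma
    model.add(Input(shape=(maxlen,)))
    # frozen (untrained) embedding whose output dimension is 100, the value of maxlen
    model.add(Embedding(len(char), maxlen, trainable=False))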
    
    # add 3 stacks of LSTM layers
    model.add(LSTM(512, dropout=0.1, recurrent_dropout=0.1, return_sequences=True))
    model.add(LSTM(512, dropout=0.1, recurrent_dropout=0.1, return_sequences=True))
    model.add(LSTM(512))
    
    # add 2 dense layers to increase the capacity of the model
    model.add(Dense(256, activation="elu"))
    model.add(Dense(128, activation="elu"))
    
    # add output layer
    model.add(Dense(len(char), activation='softmax'))
    
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])

    return model

model = get_model()

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 100)          5900      
_________________________________________________________________
lstm_6 (LSTM)                (None, 100, 512)          1255424   
_________________________________________________________________
lstm_7 (LSTM)                (None, 100, 512)          2099200   
_________________________________________________________________
lstm_8 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dense_6 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_7 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_8 (Dense)              (None, 59)               
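
The parameter counts printed in the summary can be checked by hand. The sketch below uses the usual formulas for embedding, LSTM (four gates, each with input weights, recurrent weights and a bias) and dense layers. The helper names are made up for this check, and only the rows whose counts are visible in the summary are verified.

```python
def embedding_params(vocab_size, output_dim):
    # one vector of length output_dim per token
    return vocab_size * output_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units


print(embedding_params(59, 100))   # embedding layer: 59 tokens, 100 dimensions
print(lstm_params(100, 512))       # first LSTM reads the 100-dim embedding
print(lstm_params(512, 512))       # second and third LSTM read 512-dim states
print(dense_params(512, 256))      # first dense layer
print(dense_params(256, 128))      # second dense layer
```

The three LSTM layers account for the vast majority of the parameters listed, which is also where most of the training time goes.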

In [22]:
checkpoint = ModelCheckpoint('../assets/char_based/model_checkpoint.hdf5',
                             monitor='loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min')

# early stopping to detect when loss stops dropping 
early = EarlyStopping(monitor="loss",
                      mode="min",
                      patience=3)

callbacks = [checkpoint, early]

In [None]:
model.fit(X_train, y_train,
          batch_size=256,
          epochs=30,
          verbose=1,
          callbacks = callbacks)

It took me only 262s to train 1 epoch in this case. So in total my training time was a little over 2 hours. My final epoch returned a loss of 1.0687 and accuracy of 0.6589.

### Script Generation

To generate a script, I built a function that takes the number of characters the model should generate. I didn't format any of the output since this is my first model, so I expected the generated text to be all lowercase.

In [23]:
def generate_text(next_letters):
    
    # randomly pick one training sequence to use as the seed
    x = np.random.randint(0, len(X_train))
    pattern = X_train[x]
    generated = []
    for t in range(next_letters):
        # the model expects a batch dimension: (1, maxlen)
        x = np.reshape(pattern, (1, len(pattern)))
        pred = model.predict(x)
        # greedy decoding: always take the most likely next character
        result = np.argmax(pred)
        generated.append(result)
        # slide the window: append the prediction, drop the oldest character
        pattern = np.append(pattern, result)
        pattern = pattern[1:]
        
    return generated

In [None]:
generated = generate_text(5000)

generated = [char[x] for x in generated]
generated = ''.join(generated)

print(generated)

### Save Model and Scripts

In [None]:
model.save_weights('../assets/char_based/full_train_weights.hdf5')
model.save('../assets/char_based/full_train_model.hdf5')

In [None]:
# the with-block closes the file even if the write fails
with open('../texts/char_level.txt', 'w') as f:
    f.write(generated)

### Evaluation

Note: I generated three different texts and manually concatenated them into the same ```char_level.txt``` file.

Let's take a look at the generated text. In terms of format, except for the first line, which continues from the random starting point chosen by my ```generate_text``` function, all lines follow the format of the training data: the ```character name``` of the speaker comes first, followed by ```: ```, then the line. The model also learned to put a space after punctuation such as ```,``` and ```.```, and no space after ```(```, ```)```. It also learned to start a new line after certain punctuation such as ```.``` and ```?```. There are minor imperfections, such as special characters attached to some words, e.g. ```*"hip*``` and ```*weathes```, but in general I think it does well at learning the formatting.

Grammar-wise, the text mostly makes sense. There are typos here and there, such as ```goldes```, ```anciers``` and ```flie drava```; trouble choosing between singular and plural forms, e.g. ```six hunter```; and tense confusion, e.g. ```i lived everything```. Not every sentence makes semantic sense, but it still reads like English.

One thing that amazes me is that the model does not only predict lines for the four main characters. Every once in a while a character appears who in reality only showed up a couple of times in the show, e.g. ```babu``` or ```vanessa```.

Despite these strengths, the biggest problem is the lack of diversity in the output. In all three 5,000-character texts, "i know" and "i don't know" are repeated many times. It feels like the model is making the safest choice by predicting the most frequent words spoken by the characters, which is what one would expect when ```generate_text``` always takes the ```argmax``` of the prediction.
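
One common way to get more varied output is to sample the next character from the predicted distribution instead of always taking the most likely one, with a temperature that controls how adventurous the sampling is. The sketch below is an assumption about how this could be done, not something used to produce the results in this notebook. The function name `sample_with_temperature` is hypothetical.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from a probability vector after temperature scaling.

    A temperature below 1 sharpens the distribution (closer to argmax),
    a temperature above 1 flattens it (more diverse, more mistakes).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)

    # rescale in log space; the small constant avoids log(0)
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability

    scaled = np.exp(logits)
    scaled /= scaled.sum()  # renormalize so the values sum to 1
    return int(rng.choice(len(scaled), p=scaled))


# toy distribution over three characters
toy_probs = [0.7, 0.2, 0.1]
print([sample_with_temperature(toy_probs, temperature=1.0) for _ in range(10)])
```

In ```generate_text```, the line that computes `np.argmax(pred)` could be swapped for something like `sample_with_temperature(pred[0], temperature=0.8)`. The value 0.8 is only a starting point to experiment with.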

I believe the model would improve with more data and more training time. For a model trained for 2 hours on only 500,000 characters, I think this is an acceptable result. Building this model left me with the following takeaways:

1. Instead of a character level RNN-LSTM model, how about using a word-based one to avoid typos?
2. A batch generator might solve the RAM limitation on my Google Cloud instance.
3. How can I add diversity into the predictions?

With those thoughts, I started building my second model, the word level RNN-LSTM model.
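
For the batch generator idea in the takeaways above, a minimal sketch could look like the following. It builds the windows and one-hot labels one batch at a time instead of materializing all of them up front. The function name `batch_generator` and the toy data are assumptions for illustration, and whether `model.fit` accepts a plain Python generator depends on the Keras version in use.

```python
import numpy as np


def batch_generator(encoded, seq_len, vocab_size, batch_size):
    """Yield (X, y) batches forever, building each batch on demand.

    encoded is the whole text as integer indices. X holds windows of
    seq_len indices and y holds the one-hot encoded next character.
    """
    encoded = np.asarray(encoded)
    n_samples = len(encoded) - seq_len
    while True:
        for start in range(0, n_samples, batch_size):
            idx = np.arange(start, min(start + batch_size, n_samples))
            # stack one window per sample in this batch
            X = np.stack([encoded[i:i + seq_len] for i in idx])
            # one-hot encode the character that follows each window
            y = np.zeros((len(idx), vocab_size), dtype=np.float32)
            y[np.arange(len(idx)), encoded[idx + seq_len]] = 1.0
            yield X, y


# toy data: 10 tokens drawn from a vocabulary of 3
toy_encoded = np.arange(10) % 3
gen = batch_generator(toy_encoded, seq_len=4, vocab_size=3, batch_size=4)
X_batch, y_batch = next(gen)
print(X_batch.shape, y_batch.shape)
```

Because the generator never ends, training code would also need to be told how many batches make up one epoch, i.e. the number of samples divided by the batch size, rounded up.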

In [26]:
generated = open('../texts/char_level.txt', 'r').read()
print(generated)

picks up a paper)
vanessa: jerry, this is great.
jerry: i know, i know.
jerry: well, if i call the airline, a quaker, and i have to get a gardener.
george: you don't know if i should do it.
jerry: well, if that's not it.
elaine: what do you go from your stuff? that's a *weathes.
jerry: it's only six hunter in the eyes.
jerry: i know, i know.
jerry: i know.
george: i keep first.
elaine: i know.
jerry: i know, i know.
jerry: i know.
jerry: i know.
george: i know.
elaine: you know, i lived everything. i told you to let us sleep in there.
jerry: what did you say then?
jerry: i know.
george: i keep forgetting that it's good. that's why we're having dinner with that night is touching.
jerry: i know, i know.
jerry: i know.
george: i can't believe it.
jerry: well, if that's not the test. that's all.
jerry: well, i could do it. i go to the next guy. what do you go down there? (george pulls his seat) (jerry and elaine starts meats and goes to another on the steps off the room on the couch near g