# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Seinfeld Script Generator

Notebook 3: Preprocessing & Modeling - Character Level

I am a strong believer that one size doesn't fit all. There are many different types of brands out there with various needs, setups, and available resources, and they should not be limited to a single option. I therefore decided to explore different models that can address their needs at different levels.

After reading articles and papers by other data scientists, I decided to start with a recurrent neural network (RNN), which has proven to be unreasonably effective in NLP tasks such as text generation, language translation, and speech recognition. An RNN is powerful because it feeds its hidden state back into itself at every step, so the model carries a memory of what it has already seen. The Long Short-Term Memory (LSTM) architecture, more specifically, avoids the long-term dependency problem that a simple RNN has trouble with.

Here is my thought process on the modeling. My data size is 4.2MB, which is a good amount to train on.


Originally, I tried putting both the directions and the dialogue into the text. However, since there are so many directions, this ends up producing a script that is mostly lines like "cut to a picture of a man in the street." or "cut to stock video of a train", etc.

This notebook was originally trained in a Kaggle notebook and was heavily inspired by the code shared by [Patrick DeKelly](https://www.kaggle.com/valkling/pythonicpythonscript4making-seinfeld-scripts) on Kaggle. It takes 30 minutes to train 1 epoch on the full data and 4 minutes to train 1 epoch on 500,000 characters.

In [2]:
import pandas as pd
import numpy as np
import random

from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Input, Embedding, LSTM, Dropout, Dense, Activation
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, Callback

import os

In [15]:
path = "../data/for_train.txt"
# Read the whole script and lowercase it; the with block closes the file afterwards
with open(path, 'r') as f:
    text = f.read().lower()

### Preprocessing

Next we will prepare an index of every unique character in our text. We are only getting rid of capitalization for simplicity, but still keeping all special characters. This will give us an output that retains the punctuation and format of the original. 
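As a minimal sketch of the idea, the index can be built from the sorted set of characters and then used to encode and decode text. The sample string and the names `vocab`, `char_to_idx`, and `idx_to_char` are illustrative assumptions, not part of this notebook:

```python
# Toy stand-in for the script text; the real notebook reads it from a file.
sample = "Jerry: Hello, Newman.".lower()

# Sorted list of unique characters, built the same way as `char` in this notebook.
vocab = sorted(set(sample))

# Hypothetical lookup tables: character -> index and index -> character.
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for i, c in enumerate(vocab)}

# Encode the text as numbers, then decode it back to check nothing is lost.
encoded = [char_to_idx[c] for c in sample]
decoded = "".join(idx_to_char[i] for i in encoded)

print(vocab)
print(encoded)
print(decoded)
```

Because punctuation and spaces are ordinary entries in the index, the decoded text keeps the original formatting; only capitalization is gone.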

As a rule of thumb, anything around 1MB of text or more is great for training this kind of model.

While we could work with every Seinfeld script, that ends up being a lot of data to go through within the time limit. As such, I added an if block to limit the text data to just the first half million characters. Using more text and training longer is a valid way to improve the output if more training time is available.

In [16]:
len(text)

3592591

In [17]:
if len(text) > 500000:
    text = text[:500000]

char = list(set(text))
char.sort() 
print(char)

np.save("charindex.npy", char)

['\n', ' ', '!', '"', '#', '$', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '\\', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¿', 'é']


In [6]:
text[:2000]

'jerry: you know, why we\'re here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about "we should go out"? this is what they\'re talking about...this whole thing, we\'re all out now, no one is home. not one person here is home, we\'re all out! there are people tryin\' to find us, they don\'t know where we are. (imitates one of these people "tryin\' to find us"; pretends his hand is a phone) "did you ring?, i can\'t find him." (imitates other person on phone) "where did he go?" (the first person again) "he didn\'t tell me where he was going". he must have gone out. you wanna go out: you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...there you\'re staring around, whatta you do? you go: "we gotta be getting back". once you\'re out, you wanna get back! you wanna go to sleep, you wanna get up, you wann

# Create Sequences
In a nutshell, this model will look at the last 75 characters in the script and attempt to predict the 76th. Our X variable will be a 75-character sequence and our Y variable will be the 76th character. This block chops the text data into such sequences of characters.

Note that this part also tokenizes the characters, which is to say it replaces each character with a number that corresponds to its index in `char` (saved as charindex.npy). This is why it is important to save a copy of the character index with your model: we will need it to decode our predictions later.
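To make the windowing concrete, here is a small pure-Python sketch on a toy string with a window of 4 instead of 75. The helper names `make_sequences` and `one_hot` are hypothetical; the notebook itself uses NumPy and Keras utilities for the same steps:

```python
def make_sequences(text, vocab, maxlen):
    """Return (inputs, targets): each input is maxlen indices, each target the next index."""
    lookup = {c: i for i, c in enumerate(vocab)}
    inputs, targets = [], []
    for i in range(len(text) - maxlen):
        window = text[i:i + maxlen]
        inputs.append([lookup[c] for c in window])
        targets.append(lookup[text[i + maxlen]])
    return inputs, targets


def one_hot(index, size):
    """One-hot encode a single class index as a list of 0s and 1s."""
    return [1 if i == index else 0 for i in range(size)]


toy_text = "hello world"
toy_vocab = sorted(set(toy_text))   # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

X, y = make_sequences(toy_text, toy_vocab, maxlen=4)

print(len(X))                          # 11 characters - window of 4 = 7 pairs
print(X[0], y[0])                      # "hell" -> [3, 2, 4, 4], next char "o" -> 5
print(one_hot(y[0], len(toy_vocab)))   # [0, 0, 0, 0, 0, 1, 0, 0]
```

With `maxlen = 75` and 500,000 characters, the same loop yields 500,000 - 75 = 499,925 training pairs.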


In [11]:
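maxlen = 75  # length of each input sequence, in characters
X_train = []
y_train = []
for i in range(0, len(text) - maxlen, 1):
    X = text[i:i + maxlen]   # 75-character window
    y = text[i + maxlen]     # the character that follows the window
    X_train.append([char.index(x) for x in X])
    y_train.append(char.index(y))

X_train = np.reshape(X_train, (len(X_train), maxlen))
# One-hot encode the targets; num_classes keeps the width equal to the vocabulary size
y_train = np_utils.to_categorical(y_train, num_classes=len(char))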

# Create the Model
The model uses 3 LSTM layers stacked on top of each other. Adding another LSTM layer, training much longer, or training over multiple sessions will give better results. However, 3 LSTM layers should do fine within 6 hours, and adding the loopbreaker to the generation code later helps even undertrained models give good results. Also note the CuDNNLSTM layer: it is a special LSTM layer made for NVIDIA GPUs. It functions the same as a regular LSTM layer but is automatically optimized for the GPU. You lose some customization, but it works roughly twice as fast as a regular LSTM layer if conditions are right. The code below uses the regular `LSTM` layer; in TensorFlow 2 Keras, that layer selects the cuDNN implementation on its own when it runs on a GPU with its default arguments.


In [8]:
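def get_model():
    model = Sequential()

    # Each input is a sequence of maxlen character indices
    model.add(Input(shape=(maxlen,)))
    # Fixed (non-trainable) 75-dimensional embedding for each character
    model.add(Embedding(len(char), 75, trainable=False))

    # Three stacked LSTM layers; only the last one returns a single vector
    model.add(LSTM(512, return_sequences=True))
    model.add(LSTM(512, return_sequences=True))
    model.add(LSTM(512))

    model.add(Dense(256, activation="elu"))
    model.add(Dense(128, activation="elu"))
    # Probability distribution over the next character
    model.add(Dense(len(char), activation='softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(learning_rate=0.001),
                  metrics=['accuracy'])

    return model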

model = get_model()

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 75, 75)            4425      
_________________________________________________________________
lstm (LSTM)                  (None, 75, 512)           1204224   
_________________________________________________________________
lstm_1 (LSTM)                (None, 75, 512)           2099200   
_________________________________________________________________
lstm_2 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dense (Dense)                (None, 256)               131328    
_________________________________________________________________
dense_1 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_2 (Dense)              (None, 59)               
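
As a sanity check on the summary above, the parameter counts can be reproduced by hand. This is a sketch using the standard parameter formulas for embedding, LSTM, and dense layers; the helper function names are illustrative. The value for dense_2 is computed from the formula, because the summary is cut off before that number:

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per character in the vocabulary
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


vocab_size = 59   # number of unique characters printed earlier
embed_dim = 75

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, 512),
    "lstm_1": lstm_params(512, 512),
    "lstm_2": lstm_params(512, 512),
    "dense": dense_params(512, 256),
    "dense_1": dense_params(256, 128),
    "dense_2": dense_params(128, vocab_size),
}

for name, value in counts.items():
    print(f"{name:<10} {value}")
```

The first LSTM layer is smaller than the other two only because its input is the 75-dimensional embedding rather than a 512-dimensional hidden state.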

# Checkpoints and Custom Callback
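We will use callbacks to monitor training. ModelCheckpoint saves the model whenever the training loss improves, and EarlyStopping halts training once the loss stops improving. A custom TextSample callback, which prints a sample line at the end of every epoch to show how the model is progressing, is also useful, but it is not included in the callback list below. For Kaggle, this is less important, as you have to commit your code to run it long enough to output results.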
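
To make the early-stopping setting concrete, here is a simplified pure-Python model of what `patience=1` on the training loss means. It is a sketch of the behaviour only, not Keras's actual implementation, and the loss values are made up:

```python
def epochs_run(losses, patience=1):
    """Return how many epochs run before an early stop on a 'min'-mode metric."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: stop once we have waited `patience` epochs
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)


made_up_losses = [2.1, 1.6, 1.4, 1.45, 1.3]

print(epochs_run(made_up_losses, patience=1))
print(epochs_run(made_up_losses, patience=2))
```

With `patience=1`, a single epoch where the loss ticks upward ends training, even if a later epoch would have improved. A larger patience tolerates that kind of noise at the cost of extra training time.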

In [9]:
filepath="../assets/char_based/model_checkpoint.hdf5"

checkpoint = ModelCheckpoint(filepath,
                             monitor='loss',
                             verbose=1,
                             save_best_only=True,
                             mode='min')

early = EarlyStopping(monitor="loss",
                      mode="min",
                      patience=1)

callbacks = [checkpoint, early]

Even with a GPU, this can take a while. As is, I'm setting this notebook to take almost the full 6-hour limit. I have played around with training these types of models for 12 or even 24 hours with more layers. However, once the loss gets down to roughly 1.0, the generator is usually good enough to go. Most models can be trained almost indefinitely. We are not *really* worried about overfitting. Hypothetically, if the loss gets too low the model might overfit, which in this case means it just copies the text in the most inefficient way. However, it should take an unrealistically long time to get to that point (or it may just be impossible).

In [None]:
model.fit(X_train, y_train,
          batch_size=256,
          epochs=26,
          verbose=1,
          callbacks = callbacks)
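
A quick back-of-the-envelope check of what this call asks for, using only numbers stated in this notebook (500,000 characters, windows of 75, and the 4 minutes per epoch quoted in the introduction). The time is an upper bound that assumes early stopping never triggers, and the variable names are illustrative:

```python
import math

num_chars = 500_000      # text is truncated to this many characters
maxlen = 75              # window length
batch_size = 256
epochs = 26
minutes_per_epoch = 4    # figure quoted in the introduction, not re-measured

num_sequences = num_chars - maxlen
steps_per_epoch = math.ceil(num_sequences / batch_size)
total_minutes = epochs * minutes_per_epoch

print(num_sequences)
print(steps_per_epoch)
print(total_minutes)
```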

In [None]:
# model = load_model(filepath)
model.save_weights("full_train_weights.hdf5")
model.save("full_train_model.hdf5")

# Generating New Seinfeld Scripts
This block generates `next_letters` characters of new text in the style of the input text. It takes a random seed pattern from the training set, predicts the next character, appends it to the end of the pattern, drops the first character of the pattern, predicts on the new pattern, and so forth.

In effect, this text generator *tries* to accurately duplicate the Seinfeld script but inevitably makes errors, and those errors compound. It is still trained well enough that it ends up making Seinfeld-*like* scripts.


The loopbreaker is a simple trick I came up with while putting this together. Every so many character predictions, the program changes one of the characters in the pattern it predicts on (except the last few, to prevent spelling errors). This causes the model to perceive a slightly different text, which changes its overall predictions slightly too. Without this, even a well-trained model might start to repeat itself at some point and get caught in a loop. The loopbreaker can even counteract overfitting or allow undertrained models to perform much better. Without a loopbreaker like this, models need to be trained for many more hours before they can function without looping in on themselves.

Changing the loopbreaker value up and down is an interesting way to significantly change the output. Setting it high produces more repeated speech. Slightly lower, many lines start the same and then veer off in different directions. Really low gives lots of varied text, but line structure and format might become unstable. Probably keep it somewhere between 1 and 10.
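
The loopbreaker idea can be sketched in pure Python as follows. This is my own minimal reading of the description above, not the original implementation. The function `generate`, its parameters `loopbreaker` and `protect_last`, and the stand-in predictor `copy_oldest` are all hypothetical; in the real notebook the predictor would wrap `model.predict` followed by `np.argmax`:

```python
import random


def generate(predict_next, seed, vocab_size, length,
             loopbreaker=3, protect_last=10, rng=None):
    """Greedy generation that perturbs the pattern every `loopbreaker` predictions.

    predict_next: function mapping a list of character indices to the next index
    protect_last: number of trailing pattern positions that are never altered
    loopbreaker:  0 disables the perturbation entirely
    """
    rng = rng or random.Random()
    pattern = list(seed)
    generated = []
    for t in range(length):
        result = predict_next(pattern)
        generated.append(result)
        # Slide the window: drop the oldest character, append the prediction
        pattern = pattern[1:] + [result]
        # Loopbreaker: overwrite one character outside the protected tail
        if loopbreaker and t % loopbreaker == 0:
            position = rng.randrange(len(pattern) - protect_last)
            pattern[position] = rng.randrange(vocab_size)
    return generated


def copy_oldest(pattern):
    # Stand-in for a trained model: always repeats the oldest character,
    # so without a loopbreaker the output cycles through the seed forever.
    return pattern[0]


seed = list(range(20))
looping = generate(copy_oldest, seed, vocab_size=59, length=60, loopbreaker=0)
varied = generate(copy_oldest, seed, vocab_size=59, length=60,
                  loopbreaker=3, rng=random.Random(0))

print(looping)
print(varied)
```

A smaller `loopbreaker` value perturbs the pattern more often, which matches the advice above: low values give more varied text, high values let the model settle into repetition.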


In [None]:
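next_letters = 5000  # number of characters to generate

# Start from a random 75-character pattern taken from the training set
x = np.random.randint(0, len(X_train)-1)
pattern = X_train[x]
generated = []
for t in range(next_letters):
    if t % 500 == 0:
        print(f"{t / next_letters:.0%} done")

    x = np.reshape(pattern, (1, len(pattern)))
    pred = model.predict(x, verbose=0)
    result = np.argmax(pred)  # greedy choice: the most likely next character
    generated.append(result)
    # Slide the window: append the prediction, drop the oldest character
    pattern = np.append(pattern, result)
    pattern = pattern[1:]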

As you can see, the output is not bad. Text generators like this are pretty good on a line-by-line basis. Some of the lines seem really plausible as Seinfeld dialogue. Plot and scene structure are off, though, and different characters show up talking about irrelevant things. In some ways that works as comedy. Still, more AI structure is needed to keep track of the plot and such. Anyway, this is the extent of most AI text generation these days without more structured custom code.

In [None]:
generated = [char[x] for x in generated]
generated = ''.join(generated)

print(generated)

In [None]:
# Write the generated script to disk; `generated` is the joined string from the cell above
with open("../texts/char_level.txt", "w") as f:
    f.write(generated)