# This notebook trains a text generation model with an LSTM.

I recommend running this notebook in Google Colab with a GPU.


### The alternative is to bring in movies that mainly feature women and use those scripts:

- Bridesmaids
- Ghost World
- Juno
- Martha Marcy May Marlene
- Precious
- Sex and the City
- The Help
- Frozen

I believe that, at around 20,000 words apiece, we are looking at a big enough corpus to train some sort of text generation model for screenplays.

The result wasn't spectacular, but was worth the effort.


Many thanks to Jason Brownlee, whose [article](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/) provided much of the basis for what you see below.

In [38]:
# Bring some mates

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout, Embedding
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import tensorflow as tf
import sys
import re

In [5]:
#from google.colab import drive
#drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### We must create a dataset of the screenplays that we want to use for training.

The screenplays that we are going to use are listed above.

In [27]:
# Save the path to each script in its own variable
bridesmaids = "../data/Scripts/BRIDESMAIDS.TXT"
ghost_world = "../data/Scripts/GHOST WORLD.TXT"
juno = "../data/Scripts/JUNO.TXT"
martha_marcy = "../data/Scripts/MARTHA MARCY MAY MARLENE.TXT"
precious = "../data/Scripts/PRECIOUS.TXT"
sex_city = "../data/Scripts/SEX AND THE CITY: THE MOVIE.TXT"
the_help = "../data/Scripts/THE HELP.TXT"
frozen = "../data/Scripts/FROZEN.TXT"

In [28]:
from pathlib import Path
# Read each script (read_text closes the file afterwards) and join them into one lowercase corpus
script_paths = [bridesmaids, ghost_world, juno, martha_marcy, precious, sex_city, the_help, frozen]
raw_text = "".join(Path(p).read_text(encoding='utf-8') for p in script_paths).lower()

### Char2Vec or Word2Vec

I ran both models in my exploration and found that, for generating a screenplay, I needed Char2Vec (character-level tokens), because only then can the model learn the white space that gives a screenplay its layout.

Word2Vec (word-level tokens) produced more legible text but, as expected, none of the screenplay structure.

Uncomment the `tokens` cells below if you would like to try this with Word2Vec.
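
To make the difference concrete, here is a minimal sketch of the two tokenizations on a made-up snippet. The `sample` text is hypothetical and not taken from the scripts; the word pattern is the same `\w+` regex used in the commented-out `tokens` cell.

```python
import re

# Hypothetical snippet laid out like a screenplay (not from the corpus)
sample = "INT. KITCHEN - DAY\n\n          ANNIE\n     Hello there."
text = sample.lower()

# Character-level: every character, including spaces and newlines, is a token
char_tokens = list(text)

# Word-level: only runs of word characters survive, so the layout is lost
word_tokens = re.findall(r"\w+", text)

print(len(char_tokens), "character tokens, of which",
      char_tokens.count(" "), "are spaces and",
      char_tokens.count("\n"), "are newlines")
print(word_tokens)
```

The character tokens can be joined back into the original text, indentation included, while the word tokens cannot.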

In [113]:
#tokens = re.findall(r"\w+", raw_text)

In [115]:
#tokens = [word for word in tokens if not word.startswith('0')]

In [29]:
# collect the sorted list of unique characters
# Change raw_text to tokens if you want word2Vec
chars = sorted(list(set(raw_text)))

In [30]:
# create mapping of unique characters (or words, if chars was built from tokens) to integers
char_to_int = dict((c, i) for i, c in enumerate(chars))

The cells above are set up for characters. For word tokenization, swap `raw_text` for `tokens` in the cells below.

In [31]:
n_chars = len(raw_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Unique Characters: ", n_vocab)

Total Characters:  1735806
Unique Characters:  60


In [39]:
# prepare the dataset of input to output pairs encoded as integers
# Change raw_text to tokens for word2Vec
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)

Total Patterns:  1735706
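
The sliding window above is easier to see on a tiny input. This is a minimal sketch with a toy string and a window of 3 (both hypothetical, chosen only to keep the output short); the `toy_` names are not part of the notebook.

```python
# Toy corpus and window length (hypothetical values)
toy_text = "fade in"
toy_seq_length = 3

toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

pairs = []
for i in range(len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]   # the window the model sees
    seq_out = toy_text[i + toy_seq_length]    # the character it must predict
    pairs.append(([toy_char_to_int[c] for c in seq_in], toy_char_to_int[seq_out]))
    print(repr(seq_in), "->", repr(seq_out))

print("Total patterns:", len(pairs))
```

The number of patterns is always the text length minus the window length, which matches the output above: 1735806 - 100 = 1735706.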


In [33]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
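
As a rough illustration of what the normalization and one-hot steps do to a single pattern, here is a sketch in plain Python. The vocabulary size, pattern, and target are toy values, and `one_hot` is a hypothetical stand-in for `to_categorical`, not the Keras implementation.

```python
# Toy values (hypothetical): a vocabulary of 5 symbols and one encoded pattern
toy_n_vocab = 5
toy_pattern = [4, 1, 2]
toy_target = 3

# Normalizing divides each index by the vocabulary size, giving values in [0, 1)
normalized = [value / float(toy_n_vocab) for value in toy_pattern]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

encoded_target = one_hot(toy_target, toy_n_vocab)

print(normalized)
print(encoded_target)
```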

In [34]:
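# define the LSTM model
model = Sequential()
# An Embedding layer expects integer indices of shape (samples, seq_length), but X is a
# normalized float array of shape (samples, seq_length, 1), and 12278 does not match the
# 60-character vocabulary, so the layer is left disabled for the character-level run.
#model.add(Embedding(12278, 100, input_length = seq_length))
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))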
model.add(Dropout(0.2))

# Optional extra layer, because computers need to work
# (to stack it, the first LSTM must be created with return_sequences=True)
#model.add(LSTM(256))
#model.add(Dropout(0.2))

model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [36]:
# define the checkpoint - keep the best one
filepath="char-textgen-weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
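
The `{epoch:02d}` and `{loss:.4f}` placeholders in `filepath` use Python's format syntax. The sketch below mimics how the file names come out when only improvements are saved; the `history` values are invented for illustration and are not results from this notebook.

```python
filepath = "char-textgen-weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"

# Hypothetical (epoch, loss) values, invented purely for illustration
history = [(1, 2.91234), (2, 2.70456), (3, 2.75001)]

best_loss = float("inf")
saved = []
for epoch, loss in history:
    # with save_best_only=True and mode='min', a file is written only when the loss improves
    if loss < best_loss:
        best_loss = loss
        saved.append(filepath.format(epoch=epoch, loss=loss))

for name in saved:
    print(name)
```

Epoch 3 produces no file in this toy run because its loss is worse than epoch 2, so the last file written is also the best one.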

### Fit and run!


In [None]:
model.fit(X, y, epochs=5, batch_size=128, callbacks=callbacks_list)

The cell below saves the model so that it can be moved over to Streamlit for deployment.

In [None]:
tf.keras.models.save_model(model, "saved_final_model.hp5", save_format="h5")

The cells below load the saved weights and then generate text from them.

In [135]:
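# load the network weights
# note: this file name comes from a word-level run; for the character model, point this at
# one of the char-textgen-weights-improvement-*.hdf5 checkpoints saved during training
best_model = "../code/word-textgen-weights-improvement-20-6.8268.hdf5"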
model.load_weights(best_model)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [136]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
# pick a random seed
start = np.random.randint(0, len(dataX)-1)
# copy the seed so that appending to it below does not modify dataX
pattern = list(dataX[start])
print ("FADE IN:")
# include a space in the join statement if doing word2Vec
print ("\"", ''.join([int_to_char[value] for value in pattern]), "\"")

# generate characters
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(n_vocab)
	prediction = model.predict(x, verbose=0)
	# greedy decoding: always take the most likely next character
	index = np.argmax(prediction)
	result = int_to_char[index]
	sys.stdout.write(result)
	# slide the window: append the prediction and drop the oldest character
	pattern.append(index)
	pattern = pattern[1:len(pattern)]
print ("\nFADE OUT.")

### Evaluations

The first run of the text generation model had some hilarious outcomes: after a while, the only word it produced was "to". I can't say it's a roaring success, but it has been fun to try! I deployed a char2Vec model on Streamlit, which you can also find in this repo.

The word2Vec model ended up repeating either "the" or "thicker", which is not useful. It only seems to work for around 50 words before it becomes unintelligible, and it doesn't keep the general structure of a screenplay, which is an issue in itself.
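
Repetition like this is a common side effect of always taking the most likely token (`np.argmax`). One alternative is to sample from the predicted distribution with a temperature. The sketch below is a different technique from the one used in this notebook and has not been run against the trained model; `sample_with_temperature` and `toy_prediction` are hypothetical names and values.

```python
import math
import random

def sample_with_temperature(probabilities, temperature=1.0, rng=random):
    """Draw an index from `probabilities` after rescaling by `temperature`.

    Lower temperatures approach argmax; higher ones flatten the distribution.
    """
    # Rescale log-probabilities; the small constant avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probabilities]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    rescaled = [w / total for w in weights]
    # random.choices draws one index in proportion to the rescaled probabilities
    return rng.choices(range(len(rescaled)), weights=rescaled, k=1)[0]

# Hypothetical softmax output over a 4-symbol vocabulary
toy_prediction = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
draws = [sample_with_temperature(toy_prediction, temperature=0.8, rng=rng) for _ in range(20)]
print(draws)
```

In the generation loop, this would mean replacing `np.argmax(prediction)` with something like `sample_with_temperature(prediction[0], temperature=0.8)`, where 0.8 is only a starting guess to tune by eye.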