# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Capstone Project

Notebook 3: Preprocessing & Modeling

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [6]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, SimpleRNN, GRU
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential
import tensorflow.keras.utils as ku 

In [7]:
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [42]:
df = pd.read_csv('../data/for_train.csv')
df.head()

Unnamed: 0,char_line
0,"jerry: you know, why we're here? to be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about ""we should go out""? this is what they're talking about...this whole thing, we're all out now, no one is home. not one person here is home, we're all out! there are people tryin' to find us, they don't know where we are. (imitates one of these people ""tryin' to find us""; pretends his hand is a phone) ""did you ring?, i can't find him."" (imitates other person on phone) ""where did he go?"" (the first person again) ""he didn't tell me where he was going"". he must have gone out. you wanna go out: you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...there you're staring around, whatta you do? you go: ""we gotta be getting back"". once you're out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, it's my feeling, you've gotta go. (pete's luncheonette. jerry and george are sitting at a table.)"
1,"jerry: seems to me, that button is in the worst possible spot. (talking about george's shirt) the second button literally makes or breaks the shirt, look at it: it's too high! it's in no-man's-land, you look like you live with your mother."
2,george: are you through? (kind of irritated)
3,"jerry: you do of course try on, when you buy?"
4,"george: yes, it was purple, i liked it, i don't actually recall considering the buttons."


### Separating Punctuation

In [9]:
def separate_punct(text):
    # Pad each punctuation mark with spaces so it splits off from neighboring words.
    punct = ['...', '.', '[', ']', '(', ')', ';', ':', "'", '/', '"', ',', '?', '*', '!', '-', '$', '%', '&', '\n']
    for mark in punct:
        text = text.replace(mark, ' ' + mark + ' ')
    return text

df['char_line'] = df['char_line'].map(separate_punct)
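
One detail of the replace loop above: `'...'` is padded first, but the later `'.'` pass pads each of its three dots again, so an ellipsis ends up as three separate periods. If keeping the ellipsis as one token matters, a single regex pass is one option. The sketch below is only an illustration; `separate_punct_regex` is a hypothetical name that does not appear in the notebook, and it is not what produced the outputs shown later.

```python
import re

# Hypothetical alternative, not the notebook's function: one regex pass,
# with '...' listed first so it is matched before a single '.'.
_PUNCT = re.compile(r"""(\.\.\.|[.\[\]();:'/",?*!\-$%&\n])""")

def separate_punct_regex(text):
    # Surround every match with spaces; the ellipsis stays one token.
    return _PUNCT.sub(r" \1 ", text)

sample = "wait...what? (pause)"
print(separate_punct_regex(sample).split())
# ['wait', '...', 'what', '?', '(', 'pause', ')']
```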

In [10]:
texts = df['char_line'].tolist()

### Tokenizing

In [11]:
maxlen = 200
max_words = 10000
batch_size = 32

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens')

data = pad_sequences(sequences, maxlen=maxlen)
print(f'Shape of data tensor: {data.shape}')

Found 18647 unique tokens
Shape of data tensor: (51292, 200)
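
The tokenizer reports 18647 unique tokens, yet `num_words=max_words` means only indices below 10000 appear in `sequences`; `word_index` itself is not truncated. The pure-Python sketch below imitates that cap on a toy corpus. It is a simplification (no lowercasing or character filtering), and `build_word_index` and `encode` are hypothetical helpers, not Keras functions.

```python
from collections import Counter

def build_word_index(texts):
    # Rank words by frequency: index 1 is the most frequent, 0 is left for padding.
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, num_words):
    # Keep only words whose index is below num_words; rarer words are dropped.
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

toy_texts = ["yeah yeah yeah jerry", "jerry george", "yeah kramer"]
toy_index = build_word_index(toy_texts)
print(len(toy_index))                                               # 4
print(encode("yeah jerry george kramer", toy_index, num_words=3))   # [1, 2]
```

With `num_words=3` only the two most frequent words survive, which is the same off-by-one as in the notebook: `max_words = 10000` keeps at most 9999 distinct words.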


In [12]:
sequences[1]

[5, 710, 8, 23, 15, 819, 19, 18, 3, 1072, 987, 907, 170, 59, 9, 1, 6, 388, 3,
 249, 819, 2641, 324, 169, 1339, 3, 388, 79, 39, 10, 10, 1, 6, 134, 557, 10, 1,
 6, 18, 24, 112, 1, 6, 2254, 4, 79, 50, 4, 423, 28, 55, 379]

In [44]:
data

array([[  16,   46,    1, ...,   39,    7,  191],
       [   0,    0,    0, ...,   28,   55,  379],
       [   0,    0,    0, ...,  220,   17, 1968],
       ...,
       [   0,    0,    0, ...,   17,   30, 1934],
       [   0,    0,    0, ...,   17,   55,  219],
       [   0,    0,    0, ...,   64,   88,   21]], dtype=int32)
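
Most rows of `data` start with zeros because `pad_sequences` pads at the front by default, and it also truncates at the front, which is why the first row (a line longer than 200 tokens) has no zeros at all. A minimal pure-Python sketch of that default behavior, assuming `maxlen >= 1`; `pad_pre` is a hypothetical helper for illustration only.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep the last `maxlen` tokens, then left-pad with `value` up to maxlen.
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([5, 710, 8], maxlen=5))            # [0, 0, 5, 710, 8]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```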

In [14]:
texts[1]

"jerry :  seems to me ,  that button is in the worst possible spot .   ( talking about george ' s shirt )  the second button literally makes or breaks the shirt ,  look at it :  it ' s too high !  it ' s in no - man ' s - land ,  you look like you live with your mother . "

### Model 1 - Simple model to test the waters
Credit: https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms
and François Chollet

In [15]:
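def get_training_data(data, maxlen):
    # Inputs: every token except the last; target: the last token of each line.
    # maxlen is kept in the signature but is not used here.
    X, y = data[:,:-1], data[:,-1]
    # One-hot encode the targets over the capped vocabulary (global max_words).
    y = ku.to_categorical(y, num_classes=max_words)

    return X, y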

In [16]:
X, y = get_training_data(data, maxlen)
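
`get_training_data` yields one training pair per line: everything before the last token as input, the last token as target. A common alternative for next-word models is to expand each line into all of its prefixes, so every position contributes an example. The sketch below shows that expansion on plain token lists; `make_prefix_examples` is a hypothetical helper, and the resulting contexts would still need padding before they could be fed to the model.

```python
def make_prefix_examples(sequences, min_len=2):
    # For each tokenized line, emit (context, target) for every prefix of at
    # least `min_len` tokens: the context is all but the prefix's last token.
    examples = []
    for seq in sequences:
        for end in range(min_len, len(seq) + 1):
            prefix = seq[:end]
            examples.append((prefix[:-1], prefix[-1]))
    return examples

toy = [[5, 710, 8, 23]]
for context, target in make_prefix_examples(toy):
    print(context, "->", target)
# [5] -> 710
# [5, 710] -> 8
# [5, 710, 8] -> 23
```

Expanding to prefixes multiplies the number of rows, so one-hot targets grow accordingly; keeping integer targets and compiling with `sparse_categorical_crossentropy` is one way to avoid building that matrix.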

In [35]:
model = Sequential()

model.add(Embedding(max_words, output_dim=maxlen))
model.add(SimpleRNN(32))
model.add(Dense(max_words, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

history = model.fit(X, y,
                   epochs=30,
                   verbose=1)

Epoch 1/30
...
Epoch 30/30


In [36]:
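def generate_text(seed_text, next_words, model, maxlen):
    for _ in range(next_words):
        # Encode the text generated so far and left-pad it to the input length.
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=maxlen-1, padding='pre')
        # predict_classes was removed in newer TensorFlow releases; argmax over
        # the softmax output gives the same greedy prediction.
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        # index_word is the tokenizer's reverse lookup (index -> word).
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text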

In [39]:
generate_text('soup nazi', 100, model, 400)

'soup nazi you know that again you going on here again about it is you know george yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah'

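For this model I used the Keras Tokenizer to tokenize the data and a SimpleRNN as a first test. The result is a short opening phrase ("you know that again you going on here again about it is you know george") followed by "yeah" repeated for the rest of the output.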
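
The run of "yeah" in the output above is typical of greedy decoding: the single most probable word is appended at every step, and once the model reaches a word that predicts itself, it loops. One common remedy is to sample from the predicted distribution, optionally rescaled by a temperature. The sketch below uses only the standard library and a toy probability vector; `sample_with_temperature` is a hypothetical helper, and wiring it into `generate_text` in place of the argmax step is left as an assumption.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    # Rescale log-probabilities by 1/temperature, renormalize, then draw one
    # index at random. A low temperature approaches argmax; a high one
    # flattens the distribution.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.6, 0.3, 0.1]  # toy softmax output, not from the trained model
rng = random.Random(0)
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(3)})
```

At `temperature=1.0` the draws follow the original probabilities, so less likely words still appear some of the time, which breaks the repetition at the cost of occasional odd word choices.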

### Model 2 - Manual tokenization
Credit: https://medium.com/coloredfeather/generating-a-tv-script-using-recurrent-neural-networks-dd0a645e97e7

### Model 3 - Keras LSTM text generation tutorial (TannerGilbert)
https://github.com/TannerGilbert/Tutorials/blob/master/Keras-Tutorials/4.%20LSTM%20Text%20Generation/Keras%20LSTM%20Text%20Generation.ipynb

### Model 4 - Use a generator
From François Chollet's textbook

### Model 5 - GPT-2

Shared by Hilary
https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce#scrollTo=LdpZQXknFNY3

Script Buddy V2
https://github.com/cdpierse/script_buddy_v2/blob/master/script_buddy/script_generation.ipynb

Teaching GPT-2 a sense of humor
https://github.com/mf1024/Transformers/blob/master/Teaching%20GPT-2%20a%20sense%20of%20humor.ipynb