## Text Generation using LSTMs

Recurrent Neural Networks (RNNs) are networks that use the output from the previous step as an input to the current step. There are multiple types:
1. Sequence to Sequence - a conventional output at every step
2. Sequence to Vector - a single output at the end of all steps
3. Vector to Sequence - a single input at the beginning, followed by zero inputs
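
The three types above can be illustrated with a tiny NumPy recurrent layer. This is only a sketch: the function `simple_rnn`, the weight names, and the random toy data are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

def simple_rnn(inputs, W_x, W_h, b):
    """Run a basic tanh RNN over `inputs` and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        # The previous output h is fed back in together with the current input
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
steps, input_dim, hidden = 4, 3, 2
W_x = rng.normal(size=(hidden, input_dim))
W_h = rng.normal(size=(hidden, hidden))
b = np.zeros(hidden)
inputs = rng.normal(size=(steps, input_dim))

# 1. Sequence to Sequence: keep the output of every step
seq_to_seq = simple_rnn(inputs, W_x, W_h, b)

# 2. Sequence to Vector: keep only the output of the last step
seq_to_vec = seq_to_seq[-1]

# 3. Vector to Sequence: one real input at the first step, zeros afterwards
vec_inputs = np.zeros((steps, input_dim))
vec_inputs[0] = inputs[0]
vec_to_seq = simple_rnn(vec_inputs, W_x, W_h, b)

print(seq_to_seq.shape, seq_to_vec.shape, vec_to_seq.shape)
```

In Keras, the first two cases correspond to a recurrent layer with `return_sequences=True` and `return_sequences=False` respectively.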



### LSTMs

An issue with plain RNNs is that after a while the network begins to forget the earliest inputs, so we need some form of long-term memory. The LSTM (Long Short-Term Memory) cell, introduced in 1997, was designed to address this.

An LSTM cell has three inputs:
1. C(t-1) - Cell state from the previous time step
2. h(t-1) - Output (hidden state) from the previous time step
3. x(t) - Input to the current cell



An LSTM cell has three gate layers:
1. Forget-Gate Layer - Determines what to forget or keep from the previous cell state. It uses a sigmoid, so its output is between 0 and 1, where 0 means forget and 1 means keep.
2. Input-Gate Layer - Decides which values to update and, together with a tanh layer, creates the new candidate values that are added to the cell state.
3. Output Layer - Determines the output h(t), a filtered version of the updated cell state.
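
The forget, input, and output steps above can be written out for a single time step. The following is a minimal NumPy sketch of the standard LSTM equations, not the Keras implementation; the function name `lstm_step`, the stacked weight layout `W`, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input_dim) and stacks the weights of the
    forget gate, input gate, candidate layer, and output gate (assumed layout).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: 0 = forget, 1 = keep
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: which values to update
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate
    c_t = f * c_prev + i * g                # new cell state C(t)
    h_t = o * np.tanh(c_t)                  # new output h(t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + input_dim))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 steps
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The cell state `c` is only scaled by the forget gate and added to, which is what lets information survive across many steps.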

One variation is the LSTM cell with 'peepholes', in which the gate layers can also look at the cell state. Another variation is the GRU (Gated Recurrent Unit), which is simpler than a standard LSTM.

### Code

In [28]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1200000 # Raise spaCy's maximum text length (in characters)

In [29]:
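def read_file(file_path):
    # Read the whole file into a single string
    with open(file_path, 'r') as f:
        text = f.read()
    return text

def seperate_func(doc):
    # Tokenize with spaCy, lowercase each token, and drop tokens that are
    # punctuation or whitespace (substring test against the string below)
    return [t.text.lower() for t in nlp(doc) if t.text not in '\n\n \n\n\n!"#$%&()*+,--./:;<=>?@[\\]^_`{|}~\t\n']

d = read_file('moby_dick_four_chapters.txt')
tokens = seperate_func(d)
tokens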

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'a',
 'way',
 'i',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'whenever',
 'i',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever',
 'it',
 'is',
 'a',
 'damp',
 'drizzly',
 'november',
 'in',
 'my',
 'soul',
 'whenever',
 'i',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'and',
 'bringing',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funeral',
 'i',
 'meet',
 'and',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 ...]

In [30]:
len(tokens)

11338

In [31]:
# Pass 25 words to the neural network and predict the 26th word

train_len = 25 + 1  # 25 input words + 1 target word
text_sequences = []

for i in range(train_len,len(tokens)):
    text_sequences.append(tokens[i-train_len:i])

In [32]:
# Sliding text sequences
print(' '.join(text_sequences[0]))
print('\n',' '.join(text_sequences[1]))
print('\n',' '.join(text_sequences[2]))

call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on

 me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore

 ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i
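
The sliding-window construction used above is easier to see on a toy list. This is a small sketch; the helper `sliding_windows` and the toy data are made up for illustration.

```python
def sliding_windows(items, window):
    """Return every run of `window` consecutive items, shifted by one each time."""
    return [items[i - window:i] for i in range(window, len(items) + 1)]

toy_tokens = ['a', 'b', 'c', 'd', 'e']
windows = sliding_windows(toy_tokens, 3)

for w in windows:
    print(' '.join(w))
```

Note that this sketch iterates up to `len(items) + 1` so that the final window is included. The notebook's loop stops at `len(tokens)`, so it produces one window fewer, which is why 11338 tokens yield 11312 sequences.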


In [33]:
# Preprocessing
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences) # fit_on_texts() is used to fit the tokenizer on the text sequences

sequences = tokenizer.texts_to_sequences(text_sequences)
print(sequences[0]) # Each number represents a word
print(sequences[1])

print('\n',tokenizer.index_word) # Dictionary mapping each unique index to its word
vocab_size = len(tokenizer.index_word)
print('Vocabulary size:',vocab_size)

[956, 14, 263, 51, 261, 408, 87, 219, 129, 111, 954, 260, 50, 43, 38, 314, 7, 23, 546, 3, 150, 259, 6, 2713, 14, 24]
[14, 263, 51, 261, 408, 87, 219, 129, 111, 954, 260, 50, 43, 38, 314, 7, 23, 546, 3, 150, 259, 6, 2713, 14, 24, 957]

 {1: 'the', 2: 'a', 3: 'and', 4: 'of', 5: 'i', 6: 'to', 7: 'in', 8: 'it', 9: 'that', 10: 'he', 11: 'his', 12: 'was', 13: 'but', 14: 'me', 15: 'with', 16: 'as', 17: 'at', 18: 'this', 19: 'you', 20: 'is', 21: 'all', 22: 'for', 23: 'my', 24: 'on', 25: 'be', 26: "'s", 27: 'not', 28: 'from', 29: 'there', 30: 'one', 31: 'up', 32: 'what', 33: 'him', 34: 'so', 35: 'bed', 36: 'now', 37: 'about', 38: 'no', 39: 'into', 40: 'by', 41: 'were', 42: 'out', 43: 'or', 44: 'harpooneer', 45: 'had', 46: 'then', 47: 'have', 48: 'an', 49: 'upon', 50: 'little', 51: 'some', 52: 'old', 53: 'like', 54: 'if', 55: 'they', 56: 'would', 57: 'do', 58: 'over', 59: 'landlord', 60: 'thought', 61: 'room', 62: 'when', 63: 'could', 64: "n't", 65: 'night', 66: 'here', 67: 'head', 68: 'such', 6
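
As the output suggests, the most frequent words receive the lowest indexes, starting at 1. The sketch below mimics that behaviour in plain Python to show the idea; it is not the Keras `Tokenizer` implementation, and `build_index`, `encode`, and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def build_index(sequences):
    """Map each word to an integer, starting at 1, most frequent word first."""
    counts = Counter(word for seq in sequences for word in seq)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sequences, word_index):
    """Replace every word by its integer index."""
    return [[word_index[word] for word in seq] for seq in sequences]

toy_sequences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'end']]
word_index = build_index(toy_sequences)
encoded = encode(toy_sequences, word_index)

print(word_index)
print(encoded)
```

Index 0 is never assigned to a word, which matters later when the number of output classes is chosen.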

In [34]:
import numpy as np
sequences = np.array(sequences)
sequences

array([[ 956,   14,  263, ..., 2713,   14,   24],
       [  14,  263,   51, ...,   14,   24,  957],
       [ 263,   51,  261, ...,   24,  957,    5],
       ...,
       [ 952,   12,  166, ...,  262,   53,    2],
       [  12,  166, 2712, ...,   53,    2, 2718],
       [ 166, 2712,    3, ...,    2, 2718,   26]])

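In each row, the first 25 indexes are the input and the last index is the target. In the first row the inputs run from 956 to 14 and the target is 24; in the second row the inputs run from 14 to 24 and the target is 957. Our dataset is now ready.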

#### Implementing LSTM

In [40]:
from tensorflow.keras.utils import to_categorical
X = sequences[:,:-1] # All word indexes except the last one (the 25 input words)
y = sequences[:,-1] # The last word index (the target word)

# Word indexes start at 1, so index 0 is unused; hence vocab_size+1 classes
y = to_categorical(y, num_classes=vocab_size+1) # Convert to one-hot encoding
print(y.shape)
print(X.shape)

seq_len = X.shape[1]

(11312, 2719)
(11312, 25)
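
The split into inputs and one-hot targets can be checked on a toy array. In this sketch, indexing `np.eye` stands in for `to_categorical` (for integer labels the result is the same one-hot layout); the toy values are made up.

```python
import numpy as np

# Three toy sequences of length 3: two input indexes followed by one target index
toy = np.array([[1, 2, 3],
                [2, 3, 4],
                [3, 4, 1]])

X_toy = toy[:, :-1]   # every column except the last
y_toy = toy[:, -1]    # only the last column

# Highest index is 4 and index 0 is unused, so 5 classes are needed
num_classes = int(toy.max()) + 1
y_onehot = np.eye(num_classes)[y_toy]

print(X_toy.shape, y_onehot.shape)
print(y_onehot)
```

Each row of `y_onehot` contains a single 1 at the position of the target word index.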


In [41]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

def create_model(size,seq_len):
    model = Sequential()
    # size - input_dim (vocabulary size + 1), seq_len - output_dim (embedding size, 25 here)
    model.add(Embedding(size, seq_len, input_length=seq_len))
    # return_sequences=True passes the full sequence on to the next LSTM layer
    model.add(LSTM(seq_len*2, return_sequences=True))
    model.add(LSTM(50)) # 50 units (2 * seq_len); the unit count is a tunable choice
    model.add(Dense(50, activation='relu'))
    # One softmax probability per word index
    model.add(Dense(size, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
    


In [42]:
model = create_model(vocab_size+1,seq_len)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 25)            67975     
                                                                 
 lstm (LSTM)                 (None, 25, 50)            15200     
                                                                 
 lstm_1 (LSTM)               (None, 50)                20200     
                                                                 
 dense (Dense)               (None, 50)                2550      
                                                                 
 dense_1 (Dense)             (None, 2719)              138669    
                                                                 
Total params: 244,594
Trainable params: 244,594
Non-trainable params: 0
_________________________________________________________________
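
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layer definitions (an LSTM has four weight blocks, each with input weights, recurrent weights, and a bias); the helper function names are made up for illustration.

```python
def embedding_params(vocab, dim):
    # One vector of length `dim` per word index
    return vocab * dim

def lstm_params(input_dim, units):
    # Four blocks (forget, input, candidate, output), each with
    # input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

vocab, seq_len = 2719, 25   # vocab_size + 1 and sequence length from the notebook

counts = {
    'embedding': embedding_params(vocab, seq_len),
    'lstm': lstm_params(seq_len, 50),
    'lstm_1': lstm_params(50, 50),
    'dense': dense_params(50, 50),
    'dense_1': dense_params(50, vocab),
}
total = sum(counts.values())

print(counts)
print('Total params:', total)
```

Most of the parameters sit in the embedding and the final softmax layer, both of which scale with the vocabulary size.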


In [44]:
from pickle import dump,load

model.fit(X,y,batch_size=128,epochs=100,verbose=1)


Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x7febac3727d0>