<a id="1.1"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">What are Recurrent Neural Networks?</h3>

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>A recurrent neural network (RNN) is a type of artificial neural network designed to work with sequential data. Let's consider an example:
Assume we have a sequence of data: ["1st day sale", "2nd day sale", "3rd day sale", "4th day sale"].</span></p>

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>An RNN is useful for predicting what the sale is likely to be on the 5th day, i.e., predicting the sequence shifted one day ahead. The most common uses of RNNs are in:</span></p>
<p>
    <span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>
        <li>Time Series</li>
        <li>Automobile Trajectories (left,right,back,forward)</li>
        <li>Sound/Speech (sequence of sounds)</li>
        <li>Music</li>
    </span>
</p>

More information here: [Technical definition and knowledge about RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network)

<a id="1.2"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Working of a neuron in Feed Forward Network</h3>

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>To solve this type of problem with an RNN, let's first understand how a simple neuron works in a Feed Forward Network!</span></p>

![Working of a simple neuron](https://learnopencv.com/wp-content/uploads/2017/10/neuron-diagram.jpg)

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>A single neuron takes in some inputs, aggregates them, passes the result through an activation function (like 'relu', 'sigmoid', 'tanh', etc.), and generates an output.</span></p>
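As a rough illustration of the paragraph above, here is a minimal sketch of a single neuron in plain Python. The weights, bias, and input values are made-up numbers chosen only for the example; they do not come from any trained model.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=math.tanh):
    """Aggregate the inputs as a weighted sum, then apply the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (assumed, not taken from the notebook).
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

# The same weighted sum passed through three different activations.
out_relu = neuron(inputs, weights, bias, activation=relu)
out_sigmoid = neuron(inputs, weights, bias, activation=sigmoid)
out_tanh = neuron(inputs, weights, bias, activation=math.tanh)
print(out_relu, out_sigmoid, out_tanh)
```

The weighted sum here is negative, so `relu` clips it to zero while `sigmoid` and `tanh` squash it into their respective ranges.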

![RNN](https://wiki.tum.de/download/attachments/22578349/RNN0.JPG?version=1&modificationDate=1485263911757&api=v2)

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>In an RNN, the output produced by the activation function is fed back into the input of the same neuron!</span></p>

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>Here, a neuron receives input from the previous time step as well as the current time step. Because this lets it retain information from earlier in the sequence, such neurons are also known as memory cells.</span></p>

<p><span style='font-family: "Trebuchet MS", Times, serif; font-size: 18px;'>RNNs are flexible about inputs and outputs: each can be either a sequence or a single vector. Creating an RNN layer is also straightforward.</span></p>
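The feedback loop described above can be sketched with a single recurrent neuron in plain Python. This is a simplified illustration, not the LSTM used later in the notebook: the scalar weights `w_x`, `w_h`, the bias `b`, and the toy `sales` values are all assumptions made for the example.

```python
import math

def run_recurrent_neuron(xs, w_x, w_h, b, h0=0.0):
    """Feed a sequence through one recurrent neuron and return every hidden state."""
    h = h0
    states = []
    for x in xs:
        # The previous output h is fed back in alongside the current input x.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Hypothetical, already-scaled daily sales figures.
sales = [0.1, 0.2, 0.3, 0.4]

with_memory = run_recurrent_neuron(sales, w_x=1.0, w_h=0.5, b=0.0)
without_memory = run_recurrent_neuron(sales, w_x=1.0, w_h=0.0, b=0.0)
print(with_memory)
print(without_memory)
```

Setting `w_h` to zero removes the feedback, so the neuron behaves like the feed-forward neuron and each output depends only on the current input.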


<a id="1.1"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Text generation and Recurrent Neural Networks</h3>

**<font size="3"><a href="#chap1">1. Text Processing</a></font>**
**<br><font size="3"><a href="#chap2">2. Keras Tokenization</a></font>**
**<br><font size="3"><a href="#chap3">3. LSTM</a></font>**
**<br><font size="3"><a href="#chap4">4. Split the Data into Training and Test</a></font>**
**<br><font size="3"><a href="#chap5">5. Training the model</a></font>**
**<br><font size="3"><a href="#chap6">6. New text generation</a></font>**
**<br><font size="3"><a href="#chap7">7. Exploring</a></font>**

<a id="chap1"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Text Processing</h3>

In [1]:
## Convenience function for reading a file
def read_file(filepath):
    
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

In [2]:
# read_file("../input/the-great-gatsby/The Great Gatsby.txt")

In [3]:
len(read_file("../input/the-great-gatsby/The Great Gatsby.txt"))

273736

`# Tokenizing and cleaning text`

In [4]:
import spacy

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner']) ## since I only want to use `tokenizer`, I can disable the other ones

In [5]:
nlp.max_length = 1198623 # Increase spaCy's max_length limit to avoid errors like the one shown below

> [E088] Text of length 1029371 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [6]:
## Separate punctuation from the text, since we don't want our NN to train on it
def separate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
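One detail of `separate_punc` worth noting: `token.text not in '...'` is a substring test against the filter string, not a per-character test. The sketch below mirrors that check on a few sample tokens and compares it with a hypothetical alternative, `kept_by_charset_test`, that drops a token when every character is punctuation or whitespace. The alternative and its `DROP_CHARS` set are assumptions for illustration and are not part of the notebook.

```python
import string

# Same filter string as in separate_punc above.
FILTER = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

def kept_by_substring_test(token_text):
    """Mirror the notebook's check: keep the token unless it is a substring of FILTER."""
    return token_text not in FILTER

# Hypothetical alternative: characters treated as droppable.
DROP_CHARS = set(string.punctuation) | set(string.whitespace) | set('‘’“”')

def kept_by_charset_test(token_text):
    """Keep the token unless every character is punctuation or whitespace."""
    return not all(ch in DROP_CHARS for ch in token_text)

examples = [',', 'gatsby', '\n\x0c', '“', '?!']
substring_result = {t: kept_by_substring_test(t) for t in examples}
charset_result = {t: kept_by_charset_test(t) for t in examples}

for t in examples:
    print(repr(t), substring_result[t], charset_result[t])
```

This is why tokens such as `'\n\x0c'` and the curly quotes still show up in the sequences printed later: they are not substrings of the filter string, so the substring test keeps them.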

In [7]:
doc = read_file("../input/the-great-gatsby/The Great Gatsby.txt")
tokens = separate_punc(doc)

In [8]:
# tokens

In [9]:
len(tokens)

54609

`## Goal: given the first 25 words of a sequence, the model should predict the next word, i.e. word 26`

In [10]:
# organize into sequences of tokens
train_len = 25+1 # 25 training words, then one target word

# Empty list of sequences
text_sequences = []

for i in range(train_len, len(tokens)):
    
    # Grab a window of train_len tokens ending just before position i
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)
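To make the windowing loop above easier to follow, here is a small sketch of the same slicing on a toy token list. The helper name `sliding_windows` and the `include_last` flag are hypothetical and only for illustration. With the loop as written, `range(train_len, len(tokens))` stops one step early, so the very last token never appears in a window; `include_last=True` shows what extending the range by one would do.

```python
def sliding_windows(tokens, train_len, include_last=False):
    """Return every run of train_len consecutive tokens, as in the loop above."""
    stop = len(tokens) + 1 if include_last else len(tokens)
    return [tokens[i - train_len:i] for i in range(train_len, stop)]

toy_tokens = ['in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years']

# Same behaviour as the notebook loop: 7 tokens, windows of 4 -> 3 windows.
windows = sliding_windows(toy_tokens, train_len=4)

# Extending the range by one also yields the window that ends on the last token.
windows_all = sliding_windows(toy_tokens, train_len=4, include_last=True)

for w in windows_all:
    print(w)
```

Each window overlaps the previous one by all but one token, which is why `text_sequences[1]` and `text_sequences[2]` look almost identical.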

In [11]:
type(text_sequences)

list

In [12]:
text_sequences[1]

['great',
 'gatsby',
 'by',
 'fe',
 'scott',
 'fitzgerald',
 'a',
 'sa',
 'ie',
 'el',
 'ee',
 'lee',
 '\n\x0c',
 '‘',
 'then',
 'wear',
 'the',
 'gold',
 'hat',
 'if',
 'that',
 'will',
 'move',
 'her',
 'if',
 'you']

In [13]:
" ".join(text_sequences[1])

'great gatsby by fe scott fitzgerald a sa ie el ee lee \n\x0c ‘ then wear the gold hat if that will move her if you'

In [14]:
" ".join(text_sequences[2])

'gatsby by fe scott fitzgerald a sa ie el ee lee \n\x0c ‘ then wear the gold hat if that will move her if you can'

In [15]:
len(text_sequences)

54583

<a id="chap2"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Tokenization using Keras</h3>

In [16]:
from keras.preprocessing.text import Tokenizer

In [17]:
# integer-encode the sequences of words
## i.e. replace each word in the sequences shown above with a number
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)

In [18]:
sequences[1] ## Each of these numbers is an ID for a particular word

[61,
 25,
 59,
 6645,
 6644,
 6643,
 3,
 6642,
 6641,
 1869,
 1418,
 6640,
 40,
 9,
 54,
 1868,
 1,
 625,
 1150,
 57,
 13,
 296,
 624,
 23,
 57,
 15]
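The mapping from words to IDs can be sketched without Keras. The Keras `Tokenizer` numbers words by descending frequency, starting at 1 (index 0 is reserved), which is why `'the'` maps to 1 above. The helper `build_word_index` below is a hypothetical simplification that skips the lowercasing and character filtering the real tokenizer applies to raw strings.

```python
from collections import Counter

def build_word_index(sequences_of_words):
    """Assign IDs by descending frequency, starting at 1 (0 is left unused)."""
    counts = Counter(word for seq in sequences_of_words for word in seq)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

toy = [['the', 'gold', 'hat'], ['the', 'hat', 'the']]

word_index = build_word_index(toy)
encoded = [[word_index[w] for w in seq] for seq in toy]

print(word_index)
print(encoded)
```

Because index 0 is never assigned to a word, the largest ID equals the vocabulary size, which matters later when sizing the output layer.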

In [19]:
## To check the relationship mapping between each word/sequence_number
# for i in sequences[0]:
#     print(f'{i} : {tokenizer.index_word[i]}')

# tokenizer.index_word

In [20]:
## Let's count how many times a word shows up
# tokenizer.word_counts

In [21]:
vocabulary_size = len(tokenizer.word_counts)

In [22]:
vocabulary_size ## size of the vocabulary

6646

`# Convert sequences from a Python list into a NumPy matrix (ndarray)`

In [23]:
import numpy as np

In [24]:
sequences = np.array(sequences)

In [25]:
type(sequences)

numpy.ndarray

In [26]:
sequences

array([[   1,   61,   25, ...,  624,   23,   57],
       [  61,   25,   59, ...,   23,   57,   15],
       [  25,   59, 6645, ...,   57,   15,  229],
       ...,
       [   2,   46,  537, ...,   79,   16,   80],
       [  46,  537,  238, ...,   16,   80,   81],
       [ 537,  238,   18, ...,   80,   81, 6646]])

<a id="chap3"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">LSTM CELL</h3>

In [27]:
import keras
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding

In [28]:
def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, 25, input_length=seq_len))
    model.add(LSTM(150, return_sequences=True))
    model.add(LSTM(150))
    model.add(Dense(150, activation='relu'))

    model.add(Dense(vocabulary_size, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
   
    model.summary()
    
    return model

<a id="chap4"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Splitting the dataset</h3>

In [29]:
from keras.utils import to_categorical

In [30]:
sequences

array([[   1,   61,   25, ...,  624,   23,   57],
       [  61,   25,   59, ...,   23,   57,   15],
       [  25,   59, 6645, ...,   57,   15,  229],
       ...,
       [   2,   46,  537, ...,   79,   16,   80],
       [  46,  537,  238, ...,   16,   80,   81],
       [ 537,  238,   18, ...,   80,   81, 6646]])

In [31]:
# First 25 words of each sequence (every column except the last)
sequences[:,:-1]

array([[   1,   61,   25, ...,  296,  624,   23],
       [  61,   25,   59, ...,  624,   23,   57],
       [  25,   59, 6645, ...,   23,   57,   15],
       ...,
       [   2,   46,  537, ...,   76,   79,   16],
       [  46,  537,  238, ...,   79,   16,   80],
       [ 537,  238,   18, ...,   16,   80,   81]])

In [32]:
X = sequences[:,:-1]

In [33]:
y = sequences[:,-1]
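The two slices above separate each row into input words and a target word. A minimal sketch with plain Python lists, using made-up word IDs and a shorter window than the notebook's 25, shows the same idea:

```python
# Toy encoded sequences: each row has 3 input word IDs followed by 1 target ID.
# The IDs are illustrative only.
toy_sequences = [
    [1, 61, 25, 59],
    [61, 25, 59, 7],
    [25, 59, 7, 3],
]

X_toy = [row[:-1] for row in toy_sequences]  # same idea as sequences[:, :-1]
y_toy = [row[-1] for row in toy_sequences]   # same idea as sequences[:, -1]

print(X_toy)
print(y_toy)
```

Note that this separates features from labels; no rows are held out as a test set at this step.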

In [34]:
y = to_categorical(y, num_classes=vocabulary_size+1)
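The `+1` in `num_classes=vocabulary_size+1` follows from the word IDs starting at 1: the largest ID equals the vocabulary size, so the one-hot vector needs one extra slot for the unused index 0. The sketch below uses a hypothetical pure-Python `one_hot` helper in place of `to_categorical` to show what happens without the extra slot.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_vocabulary_size = 3   # word IDs are 1, 2, 3
targets = [1, 3, 2]

# With the extra slot, every ID fits.
encoded = [one_hot(t, toy_vocabulary_size + 1) for t in targets]
print(encoded)

# Without it, the largest ID falls outside the vector.
try:
    one_hot(3, toy_vocabulary_size)
    overflowed = False
except IndexError:
    overflowed = True
print(overflowed)
```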

In [35]:
seq_len = X.shape[1]

In [36]:
seq_len

25

<a id="chap5"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Model Training</h3>

In [37]:
# define model
model = create_model(vocabulary_size+1, seq_len)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 25)            166175    
_________________________________________________________________
lstm (LSTM)                  (None, 25, 150)           105600    
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense (Dense)                (None, 150)               22650     
_________________________________________________________________
dense_1 (Dense)              (None, 6647)              1003697   
Total params: 1,478,722
Trainable params: 1,478,722
Non-trainable params: 0
_________________________________________________________________
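The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical and the layer sizes are read off `create_model`.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab = 6646 + 1  # vocabulary_size + 1, as passed to create_model

layer_params = [
    embedding_params(vocab, 25),   # embedding
    lstm_params(25, 150),          # lstm
    lstm_params(150, 150),         # lstm_1
    dense_params(150, 150),        # dense
    dense_params(150, vocab),      # dense_1
]
total = sum(layer_params)
print(layer_params, total)
```

Roughly two thirds of the parameters sit in the final softmax layer, since it needs one output per word in the vocabulary.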


In [38]:
from pickle import dump,load

In [39]:
# fit model
model.fit(X, y, batch_size=128, epochs=400,verbose=1)

Epoch 1/400
Epoch 2/400
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f42f87362d0>

In [40]:
# save the model to file
model.save('epochBIG.h5')
# save the tokenizer (the context manager closes the file handle)
with open('epochBIG', 'wb') as f:
    dump(tokenizer, f)

<a id="chap6"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Generating new text</h3>

In [41]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [42]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial Seed Sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad or truncate to the trained sequence length (seq_len)
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
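        # Predict the index of the most likely next word
        # (argmax over the softmax output; predict_classes is not available in newer Keras releases)
        pred_word_ind = np.argmax(model.predict(pad_encoded, verbose=0), axis=-1)[0]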
        
        # Grab word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        input_text += ' ' + pred_word
        
        output_text.append(pred_word)
        
    # Make it look like a sentence.
    return ' '.join(output_text)
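Inside `generate_text`, `pad_sequences(..., maxlen=seq_len, truncating='pre')` keeps the input at a fixed length as new words are appended. A minimal pure-Python sketch of that behaviour, assuming the default left padding with zeros, is shown below; `pad_pre` is a hypothetical stand-in and not the Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items; left-pad with value if the sequence is shorter."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# Too long: the oldest IDs at the front are dropped.
too_long = pad_pre([1, 2, 3, 4, 5], maxlen=3)

# Too short: zeros are added at the front.
too_short = pad_pre([7, 8], maxlen=4)

print(too_long)
print(too_short)
```

In the generation loop this means the model always sees the most recent `seq_len` word IDs, so each newly predicted word pushes the oldest one out of the window.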

`# Grab a random seed sequence`

In [43]:
text_sequences[0]

['the',
 'great',
 'gatsby',
 'by',
 'fe',
 'scott',
 'fitzgerald',
 'a',
 'sa',
 'ie',
 'el',
 'ee',
 'lee',
 '\n\x0c',
 '‘',
 'then',
 'wear',
 'the',
 'gold',
 'hat',
 'if',
 'that',
 'will',
 'move',
 'her',
 'if']

In [44]:
import random
random.seed(101)
random_pick = random.randint(0, len(text_sequences) - 1)  # randint includes both endpoints

In [45]:
random_seed_text = text_sequences[random_pick]

In [46]:
random_seed_text

['music',
 'had',
 'died',
 'down',
 'as',
 'the',
 'ceremony',
 'began',
 'and',
 'now',
 'a',
 'long',
 'cheer',
 'floated',
 'in',
 'at',
 'the',
 'window',
 'followed',
 'by',
 'in-',
 'termittent',
 'cries',
 'of',
 '“',
 '‘']

In [47]:
seed_text = ' '.join(random_seed_text)

In [48]:
seed_text

'music had died down as the ceremony began and now a long cheer floated in at the window followed by in- termittent cries of “ ‘'

In [49]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=50)



'yea — ea — ea ’ and finally by a burst of cuff let ebooks at planet ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com ebook.com'
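The run of repeated `ebook.com` tokens above is consistent with always taking the single most likely next word, which can lock generation into a loop. One possible alternative, not used in this notebook, is to sample the next word from the predicted probabilities with a temperature. The function below is a sketch under that assumption; `sample_with_temperature` and the toy probability vector are hypothetical.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature.

    Low temperatures approach greedy argmax; higher ones add variety.
    """
    # Rescale log-probabilities, then renormalize with a stable softmax.
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    max_logit = max(logits)
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    scaled = [w / total for w in weights]
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

toy_probs = [0.1, 0.7, 0.2]   # illustrative softmax output over 3 words
rng = random.Random(0)        # seeded for repeatability

cold_picks = [sample_with_temperature(toy_probs, 0.01, rng) for _ in range(100)]
warm_picks = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(200)]
print(set(cold_picks), set(warm_picks))
```

To try this in `generate_text`, the index returned here would take the place of the argmax over `model.predict(...)`; whether it improves the output for this model is untested.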

<a id="chap7"></a>
<h3 style="background-color:gold;font-family:newtimeroman;font-size:200%;text-align:center">Exploring more!</h3>

In [50]:
full_text = read_file('../input/the-great-gatsby/The Great Gatsby.txt')

In [51]:
words = full_text.split()  # split once instead of on every loop iteration
for i, word in enumerate(words):
    if word == 'Great':
        # print a 40-word window around the match
        print(' '.join(words[i-20:i+20]))
        print('\n')

a century after my father, and a little later I participated in that delayed Teutonic mi- gration known as the Great War. I enjoyed the counter-raid so thoroughly that I came back restless. Instead of being the warm center of


