<a href="https://colab.research.google.com/github/aryatomarAI/Natural-Language-Processing-with-python/blob/main/Deep-Learning-(-NLP-Project-)/Text_Generation_with_Neural_Networks_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with Neural Networks
We will use a recurrent neural network (RNN), specifically an LSTM, to generate text one word at a time.

## RNN
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.
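
To make the idea of an internal state concrete, here is a minimal sketch of a scalar vanilla RNN in plain Python. The weights `w_x`, `w_h` and `b` are arbitrary illustrative values, not anything learned in this notebook, and a real layer works on vectors rather than single numbers.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar vanilla RNN: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence of any length through the same cell, carrying the state forward."""
    h = 0.0  # initial hidden state (the "memory")
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The same cell handles sequences of different lengths.
short_states = run_rnn([1.0, 0.0])
long_states = run_rnn([1.0, 0.0, 0.0, 0.0])
print(short_states)
print(long_states)
```

The second state is non-zero even though the second input is zero: the first input is still present in the hidden state, which is what lets the network use earlier context.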

## LSTM
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This is a behavior required in complex problem domains like machine translation, speech recognition, and more.


## DATA 
We use the first four chapters of Moby Dick, a novel by Herman Melville, as the training text.

### Import Tools and load data

In [1]:
import numpy as np
import pandas as pd
import spacy
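# spaCy 3+ no longer accepts the "en" shortcut; load the model by its full name
# (assumes it is installed: python -m spacy download en_core_web_sm)
nlp=spacy.load("en_core_web_sm",disable=['parser','tagger','ner'])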

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
nlp.max_length

1000000

In [4]:
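## create a function which tokenizes the text, lowercases it and drops punctuation and whitespace tokens
## (note: `in` on a string is a substring test, so any token that occurs inside the string below is dropped)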
def seperate_punc(doc_file):
    return [token.text.lower() for token in nlp(doc_file) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
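
The filter above relies on Python's substring test, which is worth seeing in isolation. The sketch below uses no spaCy; the set-based helper `is_punct_or_space` is a hypothetical alternative for comparison, not part of the notebook, and it treats tokens slightly differently from the original filter.

```python
import string

punct_string = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

# `in` on a string tests for substrings, so multi-character runs match too.
print('--' in punct_string)   # '--' occurs in the string
print(',-' in punct_string)   # an accidental match on two adjacent characters

# Hypothetical alternative: exact matches against a set of symbols.
punct_set = set(string.punctuation) | {'--'}

def is_punct_or_space(text):
    """True when the token is whitespace-only or exactly one of the listed symbols."""
    return text.strip() == '' or text in punct_set

tokens = ['Call', 'me', 'Ishmael', '.', '--', '\n\n', ',-']
kept = [t.lower() for t in tokens if not is_punct_or_space(t)]
print(kept)
```

With the substring test, `',-'` would be dropped because those two characters happen to sit next to each other in the string; with the set, only exact matches are dropped.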

In [5]:
with open("drive/MyDrive/moby_dick_four_chapters.txt") as f:
    doc=f.read()

In [6]:
tokens=seperate_punc(doc)

In [7]:
len(tokens)

11338

In [8]:
tokens[:15]

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no']

## Create Sequences of tokens

In [9]:
# organise into sequences of tokens: 25 input words + 1 target word
train_len=25+1

# empty list of sequences
text_sequences=[]

for i in range(train_len,len(tokens)):
    seq=tokens[i-train_len:i]
    
    text_sequences.append(seq)
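
The same sliding-window loop on a toy token list shows what each sequence contains. `make_windows` and `toy_tokens` are illustrative names, not part of the notebook.

```python
def make_windows(tokens, window_len):
    """Return windows of `window_len` consecutive tokens, using the same loop as above."""
    windows = []
    for i in range(window_len, len(tokens)):
        windows.append(tokens[i - window_len:i])
    return windows

toy_tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
windows = make_windows(toy_tokens, window_len=4)

# The last word of each window is the word the model will learn to predict.
for w in windows:
    print(w[:-1], '->', w[-1])

print(len(windows), len(toy_tokens) - 4)
```

The loop yields `len(tokens) - window_len` windows, which is why 11338 tokens with `train_len=26` give 11312 sequences. Because `range` stops before `len(tokens)`, the window that would end on the final token is not produced.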

In [10]:
type(text_sequences)

list

In [11]:
text_sequences[0:2]

[['call',
  'me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on'],
 ['me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on',
  'shore']]

In [12]:
' '.join(text_sequences[0])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on'

In [13]:
" ".join(text_sequences[1])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore'

In [14]:
len(text_sequences)

11312

## Keras Tokenization
The `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf.

In [15]:
from keras.preprocessing.text import Tokenizer

In [16]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences=tokenizer.texts_to_sequences(text_sequences)

**fit_on_texts** updates the internal vocabulary from a list of texts and builds the word index by word frequency. Given something like `"The cat sat on the mat."`, it creates a dictionary such that `word_index["the"] = 1` and `word_index["cat"] = 2`. This is a word -> index dictionary, so every word gets a unique integer. Index 0 is reserved for padding. A lower integer means a more frequent word (the first few are often stop words because they appear so often). Here we pass lists of tokens rather than raw strings, so each list entry is treated as one token.

**texts_to_sequences** transforms each text into a sequence of integers: it replaces every word with its integer from the `word_index` dictionary. Nothing more, nothing less, certainly no magic involved.
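
The two steps can be sketched in plain Python. This is a simplified stand-in that assumes the input is already tokenized; `fit_word_index` and `to_sequences` are hypothetical helper names, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent, 0 stays unused."""
    counts = Counter(word for text in texts for word in text)
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text if w in word_index] for text in texts]

toy_texts = [['the', 'cat', 'sat', 'on', 'the', 'mat']]
word_index = fit_word_index(toy_texts)
print(word_index)
print(to_sequences(toy_texts, word_index))
```

`'the'` appears twice, so it gets index 1; the remaining words each appear once and are numbered in the order they were first seen.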

In [17]:
sequences[0]

[956,
 14,
 263,
 51,
 261,
 408,
 87,
 219,
 129,
 111,
 954,
 260,
 50,
 43,
 38,
 315,
 7,
 23,
 546,
 3,
 150,
 259,
 6,
 2712,
 14,
 24]

In [18]:
len(tokenizer.index_word)

2717

In [19]:
list(tokenizer.index_word.items())[:50]

[(1, 'the'),
 (2, 'a'),
 (3, 'and'),
 (4, 'of'),
 (5, 'i'),
 (6, 'to'),
 (7, 'in'),
 (8, 'it'),
 (9, 'that'),
 (10, 'he'),
 (11, 'his'),
 (12, 'was'),
 (13, 'but'),
 (14, 'me'),
 (15, 'with'),
 (16, 'as'),
 (17, 'at'),
 (18, 'this'),
 (19, 'you'),
 (20, 'is'),
 (21, 'all'),
 (22, 'for'),
 (23, 'my'),
 (24, 'on'),
 (25, 'be'),
 (26, "'s"),
 (27, 'not'),
 (28, 'from'),
 (29, 'there'),
 (30, 'one'),
 (31, 'up'),
 (32, 'what'),
 (33, 'him'),
 (34, 'so'),
 (35, 'bed'),
 (36, 'now'),
 (37, 'about'),
 (38, 'no'),
 (39, 'into'),
 (40, 'by'),
 (41, 'were'),
 (42, 'out'),
 (43, 'or'),
 (44, 'harpooneer'),
 (45, 'had'),
 (46, 'then'),
 (47, 'have'),
 (48, 'an'),
 (49, 'upon'),
 (50, 'little')]

In [20]:
len(sequences)

11312

In [21]:
tokenizer.word_counts

OrderedDict([('call', 27),
             ('me', 2471),
             ('ishmael', 133),
             ('some', 758),
             ('years', 135),
             ('ago', 84),
             ('never', 449),
             ('mind', 164),
             ('how', 321),
             ('long', 374),
             ('precisely', 37),
             ('having', 142),
             ('little', 767),
             ('or', 950),
             ('no', 1003),
             ('money', 120),
             ('in', 5647),
             ('my', 1786),
             ('purse', 71),
             ('and', 9646),
             ('nothing', 281),
             ('particular', 152),
             ('to', 6497),
             ('interest', 24),
             ('on', 1716),
             ('shore', 26),
             ('i', 7150),
             ('thought', 676),
             ('would', 702),
             ('sail', 104),
             ('about', 1014),
             ('a', 10377),
             ('see', 416),
             ('the', 15540),
             ('watery', 26),
  

In [22]:
len(tokenizer.word_counts)

2717

In [23]:
vocabulary_size=len(tokenizer.index_word)

In [24]:
vocabulary_size

2717

In [25]:
type(sequences)

list

**Convert the sequences from a list to a NumPy array**

In [26]:
sequences=np.array(sequences)

In [27]:
sequences[0:3]

array([[ 956,   14,  263,   51,  261,  408,   87,  219,  129,  111,  954,
         260,   50,   43,   38,  315,    7,   23,  546,    3,  150,  259,
           6, 2712,   14,   24],
       [  14,  263,   51,  261,  408,   87,  219,  129,  111,  954,  260,
          50,   43,   38,  315,    7,   23,  546,    3,  150,  259,    6,
        2712,   14,   24,  957],
       [ 263,   51,  261,  408,   87,  219,  129,  111,  954,  260,   50,
          43,   38,  315,    7,   23,  546,    3,  150,  259,    6, 2712,
          14,   24,  957,    5]])

In [28]:
type(sequences)

numpy.ndarray

## Now let's split the data into features and labels

In [29]:
sequences[:,:-1][:4]

array([[ 956,   14,  263,   51,  261,  408,   87,  219,  129,  111,  954,
         260,   50,   43,   38,  315,    7,   23,  546,    3,  150,  259,
           6, 2712,   14],
       [  14,  263,   51,  261,  408,   87,  219,  129,  111,  954,  260,
          50,   43,   38,  315,    7,   23,  546,    3,  150,  259,    6,
        2712,   14,   24],
       [ 263,   51,  261,  408,   87,  219,  129,  111,  954,  260,   50,
          43,   38,  315,    7,   23,  546,    3,  150,  259,    6, 2712,
          14,   24,  957],
       [  51,  261,  408,   87,  219,  129,  111,  954,  260,   50,   43,
          38,  315,    7,   23,  546,    3,  150,  259,    6, 2712,   14,
          24,  957,    5]])

In [30]:
sequences[:,-1][:4]

array([ 24, 957,   5,  60])

In [31]:
X=sequences[:,:-1]
y=sequences[:,-1]

In [32]:
from tensorflow.keras.utils import to_categorical

In [33]:
y=to_categorical(y,num_classes=vocabulary_size+1)

In [34]:
seq_len=X.shape[1]

In [35]:
y.shape

(11312, 2718)

In [36]:
seq_len

25

In [37]:
X.shape

(11312, 25)
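
The split and the one-hot encoding can be illustrated on a single toy sequence without NumPy or Keras. `one_hot` and `split_features_label` are illustrative helpers, and the toy vocabulary size of 5 is an assumption made for the example.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

def split_features_label(sequence):
    """All tokens but the last are the features; the last token is the label."""
    return sequence[:-1], sequence[-1]

toy_vocabulary_size = 5       # word indices run from 1 to 5
toy_sequence = [3, 1, 4, 5]   # three context words, then the target word

features, label = split_features_label(toy_sequence)
target = one_hot(label, num_classes=toy_vocabulary_size + 1)
print(features, label, target)
```

Word indices start at 1, so the largest index equals the vocabulary size. The one-hot vector therefore needs `vocabulary_size + 1` positions, which is why `y` has 2718 columns for a vocabulary of 2717 words.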

## Build a model

In [46]:
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding,Dropout

In [47]:
# create a function for our model
def create_model(vocabulary_size,seq_len):
    model=Sequential()
    model.add(Embedding(vocabulary_size,seq_len,input_length=seq_len))
    model.add(LSTM(250,return_sequences=True))
    model.add(Dropout(0.35))
    model.add(LSTM(250))
    model.add(Dropout(0.35))
    model.add(Dense(150,activation='relu'))
    model.add(Dense(vocabulary_size,activation="softmax"))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    model.summary()
    
    return model

In [48]:
model=create_model(vocabulary_size+1,seq_len)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 25, 25)            67950     
_________________________________________________________________
lstm (LSTM)                  (None, 25, 250)           276000    
_________________________________________________________________
dropout (Dropout)            (None, 25, 250)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 250)               501000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense (Dense)                (None, 150)               37650     
_________________________________________________________________
dense_1 (Dense)              (None, 2718)              4
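
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names are illustrative, and only the rows whose counts are fully visible above are recomputed.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector of length `embed_dim` per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Three gates plus the candidate cell state, each with input weights,
    recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units

print(embedding_params(2718, 25))   # embedding
print(lstm_params(25, 250))         # lstm
print(lstm_params(250, 250))        # lstm_1
print(dense_params(250, 150))       # dense
```

Note that the embedding dimension is 25 only because `seq_len` is passed as the second argument to `Embedding`; the two values do not have to be equal.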

## Train the model

In [49]:
model.fit(X,y,batch_size=32,epochs=350,verbose=2)

Epoch 1/350
354/354 - 27s - loss: 6.6345 - accuracy: 0.0480
Epoch 2/350
354/354 - 3s - loss: 6.2764 - accuracy: 0.0529
Epoch 3/350
354/354 - 3s - loss: 6.1677 - accuracy: 0.0526
Epoch 4/350
354/354 - 3s - loss: 6.0687 - accuracy: 0.0617
Epoch 5/350
354/354 - 3s - loss: 5.9275 - accuracy: 0.0655
Epoch 6/350
354/354 - 3s - loss: 5.8148 - accuracy: 0.0679
Epoch 7/350
354/354 - 3s - loss: 5.9010 - accuracy: 0.0650
Epoch 8/350
354/354 - 3s - loss: 5.7382 - accuracy: 0.0677
Epoch 9/350
354/354 - 3s - loss: 5.6463 - accuracy: 0.0706
Epoch 10/350
354/354 - 3s - loss: 5.5677 - accuracy: 0.0723
Epoch 11/350
354/354 - 3s - loss: 5.4936 - accuracy: 0.0728
Epoch 12/350
354/354 - 3s - loss: 5.4125 - accuracy: 0.0753
Epoch 13/350
354/354 - 3s - loss: 5.3375 - accuracy: 0.0769
Epoch 14/350
354/354 - 3s - loss: 5.2685 - accuracy: 0.0773
Epoch 15/350
354/354 - 3s - loss: 5.1950 - accuracy: 0.0794
Epoch 16/350
354/354 - 3s - loss: 5.1218 - accuracy: 0.0829
Epoch 17/350
354/354 - 3s - loss: 5.0618 - accur

<keras.callbacks.History at 0x7f1bf0d83650>
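
Two numbers in the training log can be sanity-checked with a little arithmetic: the 354 steps per epoch, and how the first-epoch loss compares with a model that guesses uniformly over all classes. This is a back-of-the-envelope sketch using the shapes reported earlier.

```python
import math

num_samples = 11312   # rows in X
batch_size = 32
num_classes = 2718    # width of the softmax layer

# The last batch is smaller than 32, so round up.
steps_per_epoch = math.ceil(num_samples / batch_size)

# Categorical cross-entropy of a uniform guess, in nats.
uniform_loss = math.log(num_classes)

print(steps_per_epoch)
print(round(uniform_loss, 2))
```

A uniform guess would score a loss of about 7.9, so the epoch 1 loss of 6.63 already reflects some learning, most likely of the very frequent words.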

In [50]:
model.save("drive/MyDrive/Model/Text-Generation-Model-350.h5")

In [51]:
from pickle import load,dump
# use a context manager so the file is closed once the tokenizer is written
with open('drive/MyDrive/Model/tokenizer', 'wb') as f:
    dump(tokenizer, f)

## Load the saved Model

In [53]:
from keras.models import load_model
my_model=load_model('drive/MyDrive/Model/Text-Generation-Model-350.h5')

## Generating New Text

In [54]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [66]:
def gen_text(model, tokenizer, seq_len, seed_text, num_gen_words):

  '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
  '''
  # Final Output
  output_text=[]

  # Initial Seed Sequence
  input_text=seed_text

  # Create num_gen_words
  for i in range(num_gen_words):
    #take in the input seed text and encode it to a sequence
    encoded_text=tokenizer.texts_to_sequences([input_text])[0]

    # pad or truncate (from the front) to the sequence length used in training
    pad_encoded=pad_sequences([encoded_text],maxlen=seq_len,truncating='pre')

    # predict class probabilities and take the most likely word index
    # (predict_classes was removed from recent Keras/TensorFlow releases)
    predicted_index=int(np.argmax(model.predict(pad_encoded,verbose=0),axis=-1)[0])

    # grab the word 
    pred_word=tokenizer.index_word[predicted_index]

    # Update the sequence of input text (shifting one over with the new word)
    input_text=input_text+ " " + pred_word

    #append the new word to output
    output_text.append(pred_word)

  return " ".join(output_text)
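
The padding step is what keeps the input at a fixed length while the seed text grows by one word per iteration. A plain-Python sketch of the `'pre'` behaviour used above (`pad_pre` is a hypothetical stand-in for `pad_sequences` on a single sequence):

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad with `value` when the input is shorter."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_pre([7, 8, 9, 10, 11], maxlen=3))  # too long: the oldest items are dropped
print(pad_pre([7, 8], maxlen=3))             # too short: zeros are added at the front
```

Because truncation removes items from the front, the model always sees the most recent `seq_len` words. This is also why a 26-word seed works with a model trained on 25-word inputs.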


## Grab a random seed Sequence

In [56]:
text_sequences[0]

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on']

In [57]:
len(text_sequences[0])

26

In [58]:
import random
random.seed(101)
# randint includes both endpoints, so the upper bound must be the last valid index
random_pick=random.randint(0,len(text_sequences)-1)
random_seed_text=text_sequences[random_pick]

In [59]:
random_seed_text

['thought',
 'i',
 'to',
 'myself',
 'the',
 'man',
 "'s",
 'a',
 'human',
 'being',
 'just',
 'as',
 'i',
 'am',
 'he',
 'has',
 'just',
 'as',
 'much',
 'reason',
 'to',
 'fear',
 'me',
 'as',
 'i',
 'have']

In [62]:
seed_text=' '.join(random_seed_text)
seed_text

"thought i to myself the man 's a human being just as i am he has just as much reason to fear me as i have"

In [68]:
gen_text(my_model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=30)



'to be afraid of him better sleep with a sober cannibal than a drunken christian landlord said i tell him to stash his tomahawk there or pipe or whatever you'

## Let's predict on a selected sequence and compare it to the real text 

In [73]:
seed_text1=text_sequences[5]
seed_text1=" ".join(seed_text1)
seed_text1

'ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought i would'

In [74]:
gen_text(my_model,tokenizer,seq_len,seed_text=seed_text1,num_gen_words=30)



'sail about a little and see the watery part of the world it is a way i have of driving off the spleen and regulating the circulation whenever i find'

## Exploring generated sequence

In [75]:
with open("drive/MyDrive/moby_dick_four_chapters.txt") as f:
  full_text=f.read()

In [77]:

print(" ".join(full_text.split()[5:65]))


ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth;
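
The comparison can also be made word by word rather than by eye. The sketch below copies the seed, the generated text and the original passage from the outputs above; the `words` helper is an illustrative normalizer that only approximates the spaCy-based tokenization used earlier.

```python
import re

def words(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

seed = ("ago never mind how long precisely having little or no money in my purse "
        "and nothing particular to interest me on shore i thought i would")
generated = ("sail about a little and see the watery part of the world it is a way "
             "i have of driving off the spleen and regulating the circulation "
             "whenever i find")
original = ("ago--never mind how long precisely--having little or no money in my "
            "purse, and nothing particular to interest me on shore, I thought I "
            "would sail about a little and see the watery part of the world. It is "
            "a way I have of driving off the spleen and regulating the circulation. "
            "Whenever I find myself growing grim about the mouth;")

generated_words = generated.split()
start = len(seed.split())  # the generated text continues where the seed ends
expected_words = words(original)[start:start + len(generated_words)]

matches = sum(g == e for g, e in zip(generated_words, expected_words))
print(f"{matches} of {len(generated_words)} generated words match the original")
```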


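## So our model reproduces the original passage word for word

Note that this seed comes from the training text, so the match shows how well the model has memorised the four chapters rather than how well it generalises to unseen text.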