# Text Generation with Neural Networks
We will be using recurrent neural networks (RNNs).

## RNN
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs.
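To make the idea of an internal state concrete, here is a minimal NumPy sketch of a plain RNN forward pass. The weight names (`W_xh`, `W_hh`, `b_h`), the layer sizes, and the random inputs are illustrative assumptions, not part of this notebook's model. The same weights process sequences of different lengths because the hidden state `h` is carried from one time step to the next.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a plain RNN over one sequence and return the final hidden state."""
    h = np.zeros(W_hh.shape[0])           # internal state (memory) starts at zero
    for x_t in inputs:                    # one update per time step, any length
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3              # illustrative sizes (assumption)
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

short_seq = rng.normal(size=(2, input_dim))   # 2 time steps
long_seq = rng.normal(size=(7, input_dim))    # 7 time steps

h_short = rnn_forward(short_seq, W_xh, W_hh, b_h)
h_long = rnn_forward(long_seq, W_xh, W_hh, b_h)
print(h_short.shape, h_long.shape)        # both have length hidden_dim
```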

## LSTM
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This is a behavior required in complex problem domains like machine translation, speech recognition, and more.
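As a rough illustration of how an LSTM keeps information across time steps, the sketch below implements one LSTM step in NumPy using the standard gate equations. The stacked weight layout (`W`, `U`, `b` holding the input, forget, candidate, and output gates in that order), the sizes, and the random data are assumptions made for this example; the Keras layer used later handles all of this internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the four gates along the first axis."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # shape (4 * n,)
    i = sigmoid(z[:n])                    # input gate
    f = sigmoid(z[n:2 * n])               # forget gate
    g = np.tanh(z[2 * n:3 * n])           # candidate cell values
    o = sigmoid(z[3 * n:])                # output gate
    c = f * c_prev + i * g                # cell state: keep some old, add some new
    h = o * np.tanh(c)                    # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 5, 4                   # illustrative sizes (assumption)
W = rng.normal(scale=0.1, size=(4 * units, input_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):   # a sequence of 6 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```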


## Data
We use the first four chapters of Moby Dick for text generation. Moby Dick is a novel by Herman Melville.

### Import Tools and load data

In [2]:
import numpy as np
import pandas as pd
import spacy
nlp=spacy.load("en_core_web_lg",disable=['parser','tagger','ner'])


In [7]:
nlp.max_length

1000000

In [11]:
## Create a function that tokenizes the text and drops punctuation and whitespace tokens
def seperate_punc(doc_file):
    return [token.text.lower() for token in nlp(doc_file) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
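One detail of the filter above is worth spelling out: `token.text not in '...'` is a substring test against one long string, not a membership test over a set of characters. The sketch below reuses the same string with a hand-written token list standing in for spaCy output (the sample tokens and the helper name `keep` are assumptions for illustration).

```python
# Same filter string as in the cell above
PUNCT = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

def keep(token_text):
    # Substring test: True only if token_text does not occur anywhere inside PUNCT
    return token_text not in PUNCT

# Hand-written stand-ins for spaCy tokens (assumption)
sample = ["Call", "me", "Ishmael", ".", "--", "\n\n", ",", "?@"]
kept = [t.lower() for t in sample if keep(t)]
print(kept)  # ['call', 'me', 'ishmael']

# A multi-character token is dropped only if it appears contiguously in PUNCT
print(keep("?@"))  # False: "?@" occurs in PUNCT
print(keep("!?"))  # True: "!" and "?" are not adjacent in PUNCT
print(keep("'s"))  # True: the apostrophe is not in PUNCT at all
```

This is consistent with `'s` surviving as a token in the vocabulary listed later in the notebook.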

In [12]:
with open("moby_dick_four_chapters.txt") as f:
    doc=f.read()

In [13]:
tokens=seperate_punc(doc)

In [14]:
len(tokens)

11338

In [15]:
tokens[:15]

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no']

## Create Sequences of tokens

In [16]:
# organise into sequences of tokens: 25 input words plus 1 target word
train_len=25+1

# empty list of sequences
text_sequences=[]

for i in range(train_len,len(tokens)):
    seq=tokens[i-train_len:i]
    
    text_sequences.append(seq)
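The loop above is a sliding window over the token list. Here is a small sketch of the same logic on a six-word toy list with a window of 3; the helper name `make_windows` and the toy data are assumptions for illustration.

```python
def make_windows(tokens, train_len):
    # Same indexing as the loop above: each window holds train_len consecutive tokens
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens))]

toy_tokens = ["call", "me", "ishmael", "some", "years", "ago"]
windows = make_windows(toy_tokens, train_len=3)

for w in windows:
    # All but the last token are the input; the last token is the word to predict
    print(w[:-1], "->", w[-1])

print(len(windows))  # 3, i.e. len(toy_tokens) - train_len
```

With this `range`, the number of windows is `len(tokens) - train_len`, so the final window ending at the very last token (`['some', 'years', 'ago']` in the toy list) is not produced. For the notebook's data that gives 11338 - 26 = 11312 sequences.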

In [17]:
type(text_sequences)

list

In [18]:
text_sequences[0:2]

[['call',
  'me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on'],
 ['me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on',
  'shore']]

In [19]:
' '.join(text_sequences[0])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on'

In [20]:
" ".join(text_sequences[1])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore'

In [21]:
len(text_sequences)

11312

## Keras Tokenization
The Keras `Tokenizer` class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf.

In [22]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [23]:
tokenizer=Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences=tokenizer.texts_to_sequences(text_sequences)

**fit_on_texts** updates the internal vocabulary based on a list of texts. It builds the vocabulary index from word frequency: given something like `"The cat sat on the mat."`, it creates a dictionary such that `word_index["the"] = 1` and `word_index["cat"] = 2`. This is a word -> index dictionary, so every word gets a unique integer value, and 0 is reserved for padding. A lower integer therefore means a more frequent word (the first few are often stop words because they appear a lot).

**texts_to_sequences** Transforms each text in texts to a sequence of integers. So it basically takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Nothing more, nothing less, certainly no magic involved.
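The two methods can be approximated in a few lines of plain Python. This is a simplified stand-in, not the Keras implementation: it assumes the texts are already lists of lowercase tokens and ignores filtering, out-of-vocabulary handling, and `num_words`. The function names are made up for this sketch.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 stays unused)."""
    counts = Counter(word for text in texts for word in text)
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace every word by its integer index."""
    return [[word_index[word] for word in text] for text in texts]

toy_texts = [["the", "cat", "sat", "on", "the", "mat"]]
toy_index = fit_word_index(toy_texts)
print(toy_index)                           # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(to_sequences(toy_texts, toy_index))  # [[1, 2, 3, 4, 1, 5]]
```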

In [24]:
sequences[0]

[956,
 14,
 263,
 51,
 261,
 408,
 87,
 219,
 129,
 111,
 954,
 260,
 50,
 43,
 38,
 314,
 7,
 23,
 546,
 3,
 150,
 259,
 6,
 2713,
 14,
 24]

In [25]:
len(tokenizer.index_word)

2718

In [26]:
list(tokenizer.index_word.items())[:50]

[(1, 'the'),
 (2, 'a'),
 (3, 'and'),
 (4, 'of'),
 (5, 'i'),
 (6, 'to'),
 (7, 'in'),
 (8, 'it'),
 (9, 'that'),
 (10, 'he'),
 (11, 'his'),
 (12, 'was'),
 (13, 'but'),
 (14, 'me'),
 (15, 'with'),
 (16, 'as'),
 (17, 'at'),
 (18, 'this'),
 (19, 'you'),
 (20, 'is'),
 (21, 'all'),
 (22, 'for'),
 (23, 'my'),
 (24, 'on'),
 (25, 'be'),
 (26, "'s"),
 (27, 'not'),
 (28, 'from'),
 (29, 'there'),
 (30, 'one'),
 (31, 'up'),
 (32, 'what'),
 (33, 'him'),
 (34, 'so'),
 (35, 'bed'),
 (36, 'now'),
 (37, 'about'),
 (38, 'no'),
 (39, 'into'),
 (40, 'by'),
 (41, 'were'),
 (42, 'out'),
 (43, 'or'),
 (44, 'harpooneer'),
 (45, 'had'),
 (46, 'then'),
 (47, 'have'),
 (48, 'an'),
 (49, 'upon'),
 (50, 'little')]

In [27]:
len(sequences)

11312

In [28]:
tokenizer.word_counts

OrderedDict([('call', 27),
             ('me', 2471),
             ('ishmael', 133),
             ('some', 758),
             ('years', 135),
             ('ago', 84),
             ('never', 449),
             ('mind', 164),
             ('how', 321),
             ('long', 374),
             ('precisely', 37),
             ('having', 142),
             ('little', 767),
             ('or', 950),
             ('no', 1003),
             ('money', 120),
             ('in', 5647),
             ('my', 1786),
             ('purse', 71),
             ('and', 9646),
             ('nothing', 281),
             ('particular', 152),
             ('to', 6497),
             ('interest', 24),
             ('on', 1716),
             ('shore', 26),
             ('i', 7150),
             ('thought', 676),
             ('would', 702),
             ('sail', 104),
             ('about', 1014),
             ('a', 10377),
             ('see', 416),
             ('the', 15540),
             ('watery', 26),
  

In [29]:
len(tokenizer.word_counts)

2718

In [30]:
vocabulary_size=len(tokenizer.index_word)

In [31]:
vocabulary_size

2718

In [32]:
type(sequences)

list

**Convert the sequences from a list to a NumPy array**

In [33]:
sequences=np.array(sequences)

In [34]:
sequences[0:3]

array([[ 956,   14,  263,   51,  261,  408,   87,  219,  129,  111,  954,
         260,   50,   43,   38,  314,    7,   23,  546,    3,  150,  259,
           6, 2713,   14,   24],
       [  14,  263,   51,  261,  408,   87,  219,  129,  111,  954,  260,
          50,   43,   38,  314,    7,   23,  546,    3,  150,  259,    6,
        2713,   14,   24,  957],
       [ 263,   51,  261,  408,   87,  219,  129,  111,  954,  260,   50,
          43,   38,  314,    7,   23,  546,    3,  150,  259,    6, 2713,
          14,   24,  957,    5]])

In [35]:
type(sequences)

numpy.ndarray

## Now let's split the data into features and labels

In [36]:
sequences[:,:-1][:4]

array([[ 956,   14,  263,   51,  261,  408,   87,  219,  129,  111,  954,
         260,   50,   43,   38,  314,    7,   23,  546,    3,  150,  259,
           6, 2713,   14],
       [  14,  263,   51,  261,  408,   87,  219,  129,  111,  954,  260,
          50,   43,   38,  314,    7,   23,  546,    3,  150,  259,    6,
        2713,   14,   24],
       [ 263,   51,  261,  408,   87,  219,  129,  111,  954,  260,   50,
          43,   38,  314,    7,   23,  546,    3,  150,  259,    6, 2713,
          14,   24,  957],
       [  51,  261,  408,   87,  219,  129,  111,  954,  260,   50,   43,
          38,  314,    7,   23,  546,    3,  150,  259,    6, 2713,   14,
          24,  957,    5]])

In [37]:
sequences[:,-1][:4]

array([ 24, 957,   5,  60])

In [38]:
X=sequences[:,:-1]
y=sequences[:,-1]

In [39]:
from keras.utils import to_categorical

In [40]:
y=to_categorical(y,num_classes=vocabulary_size+1)
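The split and the one-hot encoding can be reproduced on a toy array with NumPy alone. The sketch below uses `np.eye` as a stand-in for `to_categorical`; the toy sequences and the vocabulary size of 5 are assumptions. It also shows why the number of classes is `vocabulary_size+1`: word indices start at 1, so the largest index equals the vocabulary size and needs a column of its own.

```python
import numpy as np

# Toy integer sequences; word indices run from 1 to 5, index 0 is unused (assumption)
toy_sequences = np.array([[3, 1, 2, 4],
                          [1, 2, 4, 5],
                          [2, 4, 5, 1]])
toy_vocab_size = 5

X_toy = toy_sequences[:, :-1]             # every column except the last: the input words
y_toy = toy_sequences[:, -1]              # last column: the word to predict

# One row per label, one column per class 0..toy_vocab_size
one_hot = np.eye(toy_vocab_size + 1)[y_toy]

print(X_toy.shape)    # (3, 3)
print(y_toy)          # [4 5 1]
print(one_hot.shape)  # (3, 6)
print(one_hot[1])     # [0. 0. 0. 0. 0. 1.]
```

With only `toy_vocab_size` columns, the label 5 would fall outside the valid column range 0..4.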

In [41]:
seq_len=X.shape[1]

In [42]:
y.shape

(11312, 2719)

In [43]:
seq_len

25

In [44]:
X.shape

(11312, 25)

## Build a model

In [66]:
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding,Dropout

In [68]:
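# Create a function that builds and compiles the model
def create_model(vocabulary_size,seq_len):
    model=Sequential()
    # Map each word index to a dense vector; the embedding size is set equal to seq_len here
    model.add(Embedding(vocabulary_size,seq_len,input_length=seq_len))
    # The first LSTM returns the full sequence so that the second LSTM can consume it
    model.add(LSTM(250,return_sequences=True))
    model.add(Dropout(0.35))
    # The second LSTM returns only its final output
    model.add(LSTM(250))
    model.add(Dropout(0.35))
    model.add(Dense(150,activation='relu'))
    # Softmax over the vocabulary gives a probability for each candidate next word
    model.add(Dense(vocabulary_size,activation="softmax"))
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    model.summary()
    
    return model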

In [71]:
model=create_model(vocabulary_size+1,seq_len)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 25, 25)            67975     
_________________________________________________________________
lstm_3 (LSTM)                (None, 25, 250)           276000    
_________________________________________________________________
dropout_3 (Dropout)          (None, 25, 250)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 250)               501000    
_________________________________________________________________
dropout_4 (Dropout)          (None, 250)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 150)               37650     
_________________________________________________________________
dense_4 (Dense)              (None, 2719)             
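The parameter counts in the summary can be checked by hand with the usual formulas for embedding, LSTM, and dense layers. The helper functions below are written for this check and are not part of Keras. The last value is computed from the formula, not read from the summary, whose final row is cut off above.

```python
def embedding_params(vocab, dim):
    # One vector of length dim per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab, seq_len = 2719, 25                 # vocabulary_size + 1 and seq_len from the notebook

print(embedding_params(vocab, seq_len))   # 67975
print(lstm_params(25, 250))               # 276000
print(lstm_params(250, 250))              # 501000
print(dense_params(250, 150))             # 37650
print(dense_params(150, vocab))           # 410569 (computed, not shown in the summary)
```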

## Train the model

In [None]:
model.fit(X,y,batch_size=32,epochs=100,verbose=2)

Epoch 1/100
 - 94s - loss: 6.6600 - accuracy: 0.0490
Epoch 2/100
 - 93s - loss: 6.2974 - accuracy: 0.0523
Epoch 3/100
 - 80s - loss: 6.2387 - accuracy: 0.0514
Epoch 4/100
 - 82s - loss: 6.1071 - accuracy: 0.0530
Epoch 5/100
 - 88s - loss: 5.9940 - accuracy: 0.0618
Epoch 6/100
 - 83s - loss: 5.8876 - accuracy: 0.0666
Epoch 7/100
 - 92s - loss: 5.8634 - accuracy: 0.0681
Epoch 8/100
 - 91s - loss: 5.7571 - accuracy: 0.0690
Epoch 9/100
 - 86s - loss: 5.6731 - accuracy: 0.0718
Epoch 10/100
 - 84s - loss: 5.5976 - accuracy: 0.0749
Epoch 11/100
 - 85s - loss: 5.5321 - accuracy: 0.0746
Epoch 12/100
 - 99s - loss: 5.4621 - accuracy: 0.0771
Epoch 13/100
 - 107s - loss: 5.3987 - accuracy: 0.0759
Epoch 14/100
 - 84s - loss: 5.3422 - accuracy: 0.0796
Epoch 15/100
 - 83s - loss: 5.2889 - accuracy: 0.0795
Epoch 16/100
 - 83s - loss: 5.2419 - accuracy: 0.0814
Epoch 17/100
 - 82s - loss: 5.1910 - accuracy: 0.0812
Epoch 18/100
 - 81s - loss: 5.2240 - accur

**Training this model without a GPU takes too much time, so let's train it on Google Colab; the code is in another notebook.**