# **Character Level Language Model using Keras**

> In this work, an LSTM RNN is used as a generative model to produce entirely new, plausible sequences of text in the style of Prime Minister Narendra Modi. Generative models are useful not only for checking how well a model has been trained but also for learning more about the problem domain itself. Since an LSTM RNN takes a long time to train, Google Colab is used to train the model on a GPU.



# **What is an RNN?**

> A Recurrent Neural Network (RNN) is a type of deep neural network in which connections between units form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior for a time sequence.

# **Why RNN and why not other models?**

> Text is sequential. Unlike feedforward neural networks, which do not share features learned across different positions of the text, RNNs can use their internal state (memory) to process sequences of inputs, which makes them more suitable.

# **What is LSTM ?**

> LSTM stands for Long Short-Term Memory. At every time step, an LSTM tracks and updates a "cell state" or memory variable $c^{\langle t \rangle}$, which can be different from the hidden activation $a^{\langle t \rangle}$. It uses three gates, namely the forget gate, the update gate and the output gate, to keep track of information.

> **Forget gate**

> Let's assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of the previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this.

> **Update gate**

> Once we forget that the subject being discussed is singular, we need a way to update the memory to reflect that the new subject is plural. The update gate decides how much of the new candidate value is written into the cell state.

> **Output gate**

> The output gate decides which parts of the cell state are exposed as the output at the current time step.
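> The interplay of the three gates can be sketched as one forward step of a deliberately tiny, single-unit LSTM in plain Python. This is an illustrative sketch only, not what Keras runs internally: the function `lstm_step`, the `params` layout and the hand-picked weights are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One forward step of a single-unit LSTM (every value is a scalar)."""
    def affine(name):
        w_a, w_x, b = params[name]
        return w_a * a_prev + w_x * x_t + b

    forget_gate = sigmoid(affine("forget"))     # how much old memory to keep
    update_gate = sigmoid(affine("update"))     # how much of the candidate to write
    output_gate = sigmoid(affine("output"))     # how much memory to expose
    candidate = math.tanh(affine("candidate"))  # proposed new memory value

    c_t = forget_gate * c_prev + update_gate * candidate  # new cell state
    a_t = output_gate * math.tanh(c_t)                    # new hidden activation
    return a_t, c_t

# Hypothetical weights (w_a, w_x, b), chosen by hand purely for illustration.
params = {
    "forget": (0.0, 0.0, -10.0),   # gate close to 0: drop the old memory
    "update": (0.0, 0.0, 10.0),    # gate close to 1: write the candidate
    "output": (0.0, 0.0, 10.0),    # gate close to 1: expose the memory
    "candidate": (0.0, 2.0, 0.0),
}

a_t, c_t = lstm_step(x_t=1.0, a_prev=0.0, c_prev=-0.9, params=params)
print("cell state:", c_t, "activation:", a_t)
```

> With the forget gate near 0 and the update gate near 1, the old memory value (-0.9) is almost entirely replaced by the new candidate, which mirrors the singular-to-plural example above.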
















# **Our Steps**


> Install Dependencies

> Upload our Dataset from Google Drive

> Preprocess the Data

> Build a Simple LSTM Model

> Generate new texts

> Build a Larger LSTM Model

> Generate new texts


























# **Installing and uploading dataset**

In [0]:
# 1. Authenticate and create the PyDrive client.

!pip install -U -q PyDrive
 
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
 

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [0]:
#List the Files in the specified folder in Google Drive

file_list = drive.ListFile({'q': "'1fiRsurTg6CvaL2Dx6Xd-DsEKYhh76Crm' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))

In [5]:
#importing numpy
import numpy as np

#import Keras 
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [0]:
# Download the dataset from Google Drive into the Colab runtime
train_downloaded = drive.CreateFile({'id': '1omHsR2nv3TQiT9UT0Ps-a97_BijUENRc'})
train_downloaded.GetContentFile('Speeches.txt')

# 1. PROBLEM STATEMENT

# **1.1 Dataset and Preprocessing**

In [16]:
# Read the text file (the with-block closes the file afterwards)
with open("Speeches.txt", 'r') as f:
    data = f.read()

characterset1 = sorted(list(set(data)))
data_size, vocab_size = len(data), len(characterset1)
print("Total number of characters ",data_size)
print("Number of uniques characters",vocab_size)


Total number of characters  48222
Number of uniques characters 78


    
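> We have 78 unique characters, so the dataset needs some cleaning. Let us keep only letters and spaces (dropping punctuation, digits and line breaks), collapse repeated spaces, and convert everything to lowercase.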



In [17]:
import re

# Keep only letters and spaces; every other character is replaced by a space
delete = re.compile('[^a-zA-Z ]')

# sub(replacement, input_string) returns the cleaned string
data = delete.sub(' ', data)

# Collapse runs of spaces into a single space, then lowercase everything
data = re.sub(' +', ' ', data)
data = data.lower()

characterset = sorted(list(set(data)))
data_size, vocab_size = len(data), len(characterset)
print("Size of the dataset",data_size)
print("Number of uniques characters",vocab_size)



Size of the dataset 46625
Number of uniques characters 27


In [18]:
# Map each character to an integer so the text can be fed to the LSTM
char_to_int = dict((c, i) for i, c in enumerate(characterset))
print(char_to_int)

{' ': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}


# 2. SIMPLE LSTM MODEL

# **2.1 MODEL OVERVIEW**
> Each training example maps a sequence of T_x characters to the character that follows it. T_x acts as a sliding window that moves over the text one character at a time, so every character (except the first T_x) gets a chance to be learned from the T_x characters that precede it.
  
  
> **ARGUMENTS :**

> n_x = Number of training examples

> T_x = Length of the input sequence, which acts as a sliding window.

> X_train = list of n_x windows; each window is a list of T_x integers, where each integer maps to a character in the vocabulary.

> Y_train = list of n_x integers; each entry is the character that immediately follows the corresponding window, i.e. the text shifted T_x positions to the left.

> X = Training set of shape [n_x, T_x, 1], scaled by dividing by the vocabulary size.

> y = One-hot encoding of Y_train. Each y value is converted into a sparse vector of length 27, full of zeros except for a 1 in the column of the character that it represents.
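> The sliding-window construction described above can be sketched on a toy string. The helper `make_windows` and the `toy_*` names are made up for this illustration and use a window of 4 instead of the notebook's T_x.

```python
def make_windows(encoded, window):
    """Pair every run of `window` integers with the integer that follows it."""
    inputs, targets = [], []
    for t in range(len(encoded) - window):
        inputs.append(encoded[t:t + window])   # characters t .. t+window-1
        targets.append(encoded[t + window])    # the character right after them
    return inputs, targets

toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_vocab)}
toy_encoded = [toy_char_to_int[c] for c in toy_text]

toy_X, toy_Y = make_windows(toy_encoded, window=4)
print(len(toy_X), "examples")
print("first window:", toy_text[0:4], "-> target:", toy_text[4])
```

> A text of length N yields N - window examples, and the targets are simply the encoded text with its first `window` entries removed.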


> **Why do we need one-hot encoding?**

> Without one-hot encoding, we would be forcing the RNN to predict the next character as a single exact number. One-hot encoding lets us use a softmax activation function instead, which predicts the probability of occurrence of each of the 27 possible characters; this is a much easier target to learn.
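> To make the link between one-hot targets, softmax and the cross-entropy loss concrete, here is a small plain-Python sketch. The helpers `one_hot`, `softmax` and `cross_entropy` are written for this illustration (they are not the Keras implementations) and the scores are made-up numbers.

```python
import math

def one_hot(index, num_classes):
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def softmax(scores):
    m = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    # Only the position holding the 1 contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

toy_target = one_hot(2, num_classes=4)
toy_probs = softmax([0.5, 0.1, 2.0, -1.0])       # made-up network scores
toy_loss = cross_entropy(toy_target, toy_probs)

# A model that guesses uniformly over 27 characters has loss ln(27).
uniform_loss = cross_entropy(one_hot(0, 27), [1.0 / 27] * 27)
print(toy_target, toy_loss, uniform_loss)
```

> The uniform-guess loss, ln(27) ≈ 3.296, is a useful reference point when reading the training logs: any loss below it means the model has learned something about the character distribution.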







In [19]:

T_x = 50
X_train = []
Y_train = []
for t in range(0, data_size - T_x, 1):
	inp = data[t:t + T_x]
	out = data[t + T_x]
	X_train.append([char_to_int[char] for char in inp])
	Y_train.append(char_to_int[out])
n_x = len(X_train)
print("Training set size ",n_x)

Training set size  46575


In [0]:
# reshape X to be [samples, time steps, features]
X = np.reshape(X_train, (n_x, T_x, 1))
# normalize
X = X / float(vocab_size)
# one hot encode the output variable
# num_classes is passed explicitly so y always has vocab_size columns
y = np_utils.to_categorical(Y_train, num_classes=vocab_size)



# **2.2 Defining the LSTM model**


>*  Create an LSTM layer with 256 memory units.

>*  Add Dropout with a probability of 50 percent.

>*  Add a Dense layer with a softmax activation function.

>*  Since this is a multiclass classification problem with 27 classes, compile the model with the categorical_crossentropy loss function.

>*  Use the Adam optimizer, a combination of RMSProp and momentum, to speed up training.
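> As a rough sanity check on the size of this model, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM parameterisation (four weight blocks, each with input weights, recurrent weights and a bias, and no peephole connections); the helper names are made up for this example.

```python
def lstm_param_count(units, input_dim):
    # Four blocks (forget, update, candidate, output), each with
    # input weights, recurrent weights and one bias per unit.
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(inputs, outputs):
    return inputs * outputs + outputs            # weights plus biases

lstm_params = lstm_param_count(units=256, input_dim=1)
dense_params = dense_param_count(inputs=256, outputs=27)

print("LSTM :", lstm_params)                     # 264192
print("Dense:", dense_params)                    # 6939
print("Total:", lstm_params + dense_params)      # 271131 (Dropout adds none)
```

> Almost all of the parameters sit in the recurrent weights of the LSTM layer, because each time step feeds in only a single input feature.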












In [0]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.5))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [0]:
# define the checkpoint
filepath="parameters-updates-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [23]:
modelfit = model.fit(X, y, epochs=12, batch_size=128, callbacks=callbacks_list)



Epoch 1/12

Epoch 00001: loss improved from inf to 2.87369, saving model to parameters-updates-01-2.8737.hdf5
Epoch 2/12

Epoch 00002: loss improved from 2.87369 to 2.83272, saving model to parameters-updates-02-2.8327.hdf5
Epoch 3/12

Epoch 00003: loss improved from 2.83272 to 2.77980, saving model to parameters-updates-03-2.7798.hdf5
Epoch 4/12

Epoch 00004: loss improved from 2.77980 to 2.74651, saving model to parameters-updates-04-2.7465.hdf5
Epoch 5/12

Epoch 00005: loss improved from 2.74651 to 2.70986, saving model to parameters-updates-05-2.7099.hdf5
Epoch 6/12

Epoch 00006: loss improved from 2.70986 to 2.68831, saving model to parameters-updates-06-2.6883.hdf5
Epoch 7/12

Epoch 00007: loss improved from 2.68831 to 2.67023, saving model to parameters-updates-07-2.6702.hdf5
Epoch 8/12

Epoch 00008: loss improved from 2.67023 to 2.65606, saving model to parameters-updates-08-2.6561.hdf5
Epoch 9/12

Epoch 00009: loss improved from 2.65606 to 2.63817, saving model to parameters-u

# GENERATING TEXT WITH LSTM NETWORK



> Generating text with the trained LSTM is straightforward: load the parameters from the checkpoint file with the lowest loss, then run forward propagation one step at a time. Instead of feeding the actual next character into the next step, we feed back the character that the model predicted at the previous step (in the cells below, the most probable one, chosen with argmax).
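> The feedback loop described above can be sketched without Keras. In this illustrative sketch, `toy_next_probs` is a made-up stand-in for `model.predict`, and `generate` and `sample_index` are hypothetical helpers. Besides the greedy choice, it shows temperature sampling, a commonly used alternative that draws the next character at random from the predicted distribution.

```python
import math
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Draw an index from `probs` after sharpening or flattening it."""
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

def generate(next_probs, seed, length, choose):
    """Slide a fixed-size window forward, feeding each prediction back in."""
    window = list(seed)                  # copy, so the seed is not modified
    output = []
    for _ in range(length):
        index = choose(next_probs(window))
        output.append(index)
        window = window[1:] + [index]    # drop the oldest, append the newest
    return output

# Made-up stand-in for model.predict: favours (last index + 1) modulo 3.
def toy_next_probs(window):
    probs = [0.1, 0.1, 0.1]
    probs[(window[-1] + 1) % 3] = 0.8
    return probs

greedy = generate(toy_next_probs, seed=[0, 1], length=5,
                  choose=lambda p: p.index(max(p)))
toy_rng = random.Random(0)
sampled = generate(toy_next_probs, seed=[0, 1], length=5,
                   choose=lambda p: sample_index(p, 0.8, toy_rng))
print("greedy :", greedy)
print("sampled:", sampled)
```

> Greedy decoding always follows the single most probable path, so it is deterministic. Sampling trades some accuracy for variety, and a lower temperature moves it closer to the greedy choice.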








In [0]:
# load the network weights
filename = "parameters-updates-12-2.5898.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(characterset))

In [0]:
import sys

In [64]:
# pick a random seed window (copied, so X_train itself is not modified)
start = np.random.randint(0, len(X_train))
pattern = list(X_train[start])
print("Sample Text:")
print("\"", ''.join([int_to_char[value] for value in pattern]))
# generate 50 characters, feeding each prediction back into the window
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(vocab_size)
	prediction = model.predict(x, verbose=0)
	index = int(np.argmax(prediction))  # greedy choice: most probable character
	sys.stdout.write(int_to_char[index])
	pattern.append(index)
	pattern = pattern[1:]  # drop the oldest character to keep the window at T_x


Sample Text:
" often there are families in the village who celebr
the say that is is a soace of the vay that is is a soace of the vay that is soaces 

The output contains spelling mistakes, and some phrases are repeated. The result is far from perfect, so let us try to improve the model by building a larger LSTM network.

# **Building Larger LSTM Network**

To improve the quality, we build a larger network: two stacked LSTM layers with the same 256 units each, and the dropout probability reduced from 0.5 to 0.2.

In [0]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [0]:
# define the checkpoint
filepath="parameters2-updates-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [40]:
modelfit = model.fit(X, y, epochs=12, batch_size=128, callbacks=callbacks_list)

Epoch 1/12

Epoch 00001: loss improved from inf to 2.85015, saving model to parameters2-updates-01-2.8502.hdf5
Epoch 2/12

Epoch 00002: loss improved from 2.85015 to 2.74079, saving model to parameters2-updates-02-2.7408.hdf5
Epoch 3/12

Epoch 00003: loss improved from 2.74079 to 2.62584, saving model to parameters2-updates-03-2.6258.hdf5
Epoch 4/12

Epoch 00004: loss improved from 2.62584 to 2.51091, saving model to parameters2-updates-04-2.5109.hdf5
Epoch 5/12

Epoch 00005: loss improved from 2.51091 to 2.39205, saving model to parameters2-updates-05-2.3920.hdf5
Epoch 6/12

Epoch 00006: loss improved from 2.39205 to 2.28054, saving model to parameters2-updates-06-2.2805.hdf5
Epoch 7/12

Epoch 00007: loss improved from 2.28054 to 2.17745, saving model to parameters2-updates-07-2.1774.hdf5
Epoch 8/12

Epoch 00008: loss improved from 2.17745 to 2.07481, saving model to parameters2-updates-08-2.0748.hdf5
Epoch 9/12

Epoch 00009: loss improved from 2.07481 to 1.99231, saving model to para

# Generating Texts

In [0]:
# load the network weights
filename = "parameters2-updates-12-1.7781.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [0]:
int_to_char = dict((i, c) for i, c in enumerate(characterset))

In [65]:
# pick a random seed window (copied, so X_train itself is not modified)
start = np.random.randint(0, len(X_train))
pattern = list(X_train[start])
print("Sample Text:")
print("\"", ''.join([int_to_char[value] for value in pattern]))
# generate 50 characters, feeding each prediction back into the window
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(vocab_size)
	prediction = model.predict(x, verbose=0)
	index = int(np.argmax(prediction))  # greedy choice: most probable character
	sys.stdout.write(int_to_char[index])
	pattern.append(index)
	pattern = pattern[1:]  # drop the oldest character to keep the window at T_x

Sample Text:
" seeing a good time namaste my greetings to all those 
the seadh of the sarliament rian i am sake the sas

In [67]:
#@title Output 2
# pick a random seed window (copied, so X_train itself is not modified)
start = np.random.randint(0, len(X_train))
pattern = list(X_train[start])
print("Sample Text:")
print("\"", ''.join([int_to_char[value] for value in pattern]))
# generate 50 characters, feeding each prediction back into the window
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(vocab_size)
	prediction = model.predict(x, verbose=0)
	index = int(np.argmax(prediction))  # greedy choice: most probable character
	sys.stdout.write(int_to_char[index])
	pattern.append(index)
	pattern = pattern[1:]  # drop the oldest character to keep the window at T_x

Sample Text:
"  prakash narayan and his wife prabha devi based on
 the sarling and the say that is is a soace of the

In [69]:
#@title Output 3
# pick a random seed window (copied, so X_train itself is not modified)
start = np.random.randint(0, len(X_train))
pattern = list(X_train[start])
print("Sample Text:")
print("\"", ''.join([int_to_char[value] for value in pattern]))
# generate 50 characters, feeding each prediction back into the window
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(vocab_size)
	prediction = model.predict(x, verbose=0)
	index = int(np.argmax(prediction))  # greedy choice: most probable character
	sys.stdout.write(int_to_char[index])
	pattern.append(index)
	pattern = pattern[1:]  # drop the oldest character to keep the window at T_x

Sample Text:
" our duty to serve mother india by keeping the coun
try te siould be and the say the seople in the sar

In [72]:
#@title Output 4
# pick a random seed window (copied, so X_train itself is not modified)
start = np.random.randint(0, len(X_train))
pattern = list(X_train[start])
print("Sample Text:")
print("\"", ''.join([int_to_char[value] for value in pattern]))
# generate 50 characters, feeding each prediction back into the window
for i in range(50):
	x = np.reshape(pattern, (1, len(pattern), 1))
	x = x / float(vocab_size)
	prediction = model.predict(x, verbose=0)
	index = int(np.argmax(prediction))  # greedy choice: most probable character
	sys.stdout.write(int_to_char[index])
	pattern.append(index)
	pattern = pattern[1:]  # drop the oldest character to keep the window at T_x

Sample Text:
" ng live lal bahadur shastri my colleagues from cen
tring the saslon i am surength in the sarliament r

We can still see some funny text, but there is definitely an improvement compared to the smaller LSTM. The way the neural network learnt words such as "*Namaste*", "*Lal Bahadur Shastri*" and "*India*" is fascinating.
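One way to put a number on the improvement is to convert the final training losses into perplexities. The sketch below takes the two losses from the checkpoint filenames loaded in this notebook and assumes the loss is measured in nats, which is the case for categorical cross-entropy computed with the natural logarithm. These are training losses, so they say nothing about performance on unseen text.

```python
import math

# Final training losses, read off the checkpoint filenames used above.
small_lstm_loss = 2.5898    # parameters-updates-12-2.5898.hdf5
large_lstm_loss = 1.7781    # parameters2-updates-12-1.7781.hdf5
assumed_vocab_size = 27     # 26 letters plus the space character

def perplexity(loss_in_nats):
    """exp(loss): the effective number of equally likely next characters."""
    return math.exp(loss_in_nats)

for name, loss in [("uniform guess", math.log(assumed_vocab_size)),
                   ("small LSTM", small_lstm_loss),
                   ("large LSTM", large_lstm_loss)]:
    print(f"{name:>13}: loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
```

Read this way, the larger network is choosing among roughly half as many plausible next characters as the smaller one.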

# Performance improvement plans



> Adding more memory units

> Padding input sequence 

> Tuning Dropouts and Batch size

> Changing the LSTM layers to be “stateful” to maintain state across batches.








