<a href="https://colab.research.google.com/github/aayushkubb/Deep_Learning_Tutorial/blob/master/Next_Word_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Deep Learning for NLP
--

Next Word Prediction
--


Autofill, which suggests the words likely to come next, saves a lot of
time when writing emails and makes a product more pleasant to use.

Problem
--
You want to build a model that predicts (suggests) the next word from the
preceding sequence of words, using email data.

In the image below, for example, "language" is suggested as the
next word.

<img src="https://drive.google.com/uc?id=1xQcV7rei1McMMeVS76O2a7T5sVKt2J03"  />

Solution
--
In this section, we will build an LSTM model that learns sequences of words
from message text, and then use it to predict the next word. The notebook
below uses an SMS message dataset (`spam.csv`) as the text corpus.



# Import Data

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import re

# Tokenizer
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()

import collections
from keras.utils import np_utils

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint

# Read Data

In [None]:
sms_data = pd.read_csv("/content/drive/MyDrive/Case studies/spam.csv",encoding="ISO-8859-1")
sms_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
list_sms=sms_data['v2'].tolist()

# Data Pre Processing

In [None]:
# Processing the data

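# Flatten the list of messages and join it into one string
# (flatten is a general-purpose helper, not specific to this dataset)
from collections.abc import Iterable

def flatten(items):
    """Yield items from any nested iterable."""
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            # Recurse into nested iterables; strings and bytes count as leaves
            yield from flatten(x)
        else:
            yield x  # generator: values are produced lazily, one at a time

TextData = list(flatten(list_sms))
# Note: joining with '' fuses the last word of each message with the first
# word of the next; ' '.join(TextData) would keep them apart, but would
# change the token counts printed below.
TextData = ''.join(TextData)
print(TextData[:50])  # sanity check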


Go until jurong point, crazy.. Available only in b


# Data Cleaning

In [None]:

# Remove newlines and convert to lower case
TextData = TextData.replace('\n','')
TextData = TextData.lower()

# Keep only letters, digits and whitespace
pattern = r'[^a-zA-Z0-9\s]'
TextData = re.sub(pattern, '', TextData)

# Tokenizing
tokens = tokenizer.tokenize(TextData)   # the long string is broken into tokens
tokens = [token.strip() for token in tokens]  # optional

# Count how often each distinct word occurs
word_counts = collections.Counter(tokens)
corpusLen = len(tokens)
word_c = len(word_counts)

print(corpusLen)  # total number of tokens in the corpus
print(word_c)     # number of distinct words (vocabulary size)
print(word_counts)

79788
13232


In [None]:
# Distinct words, ordered from most to least frequent
distinct_words_sorted = [x[0] for x in word_counts.most_common()]

# Map each word to an integer index (0 = most frequent word)
word_index = {x: i for i, x in enumerate(distinct_words_sorted)}

print(word_index)
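
To see what the frequency-ordered index looks like, here is the same construction on a small made-up token list. The names `toy_tokens`, `toy_word_index` and `toy_index_words` are illustrative assumptions, not variables from the notebook.

```python
import collections

toy_tokens = ["call", "me", "now", "call", "me", "call"]
toy_counts = collections.Counter(toy_tokens)

# most_common() orders words by descending count,
# so the most frequent words receive the smallest indexes
toy_word_index = {w: i for i, (w, _) in enumerate(toy_counts.most_common())}

# Reverse mapping, used later to turn predicted indexes back into words
toy_index_words = {i: w for w, i in toy_word_index.items()}

print(toy_word_index)
print(toy_index_words)
```

Here "call" occurs most often, so it receives index 0.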



# Next step: Data preparation for modeling

Here we divide the text into word sequences with a fixed length of 5 words
(you can choose any length, depending on the business problem and the
available computing power). To create these sequences, we slide a window
over the whole corpus one word at a time: each window of 5 words is an
input, and the word that immediately follows it is the target the model
learns to predict.
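
As a minimal sketch of this sliding window, the snippet below applies it to a made-up six-word sentence with a window of 3 words. The helper name `make_windows` and the toy sentence are illustrative assumptions, not part of the notebook.

```python
def make_windows(tokens, window):
    """Return (input_words, next_word) pairs from a sliding window."""
    pairs = []
    for i in range(len(tokens) - window):
        context = tokens[i:i + window]   # the words the model sees
        target = tokens[i + window]      # the word it should predict
        pairs.append((context, target))
    return pairs

toy_tokens = "see you at the office tomorrow".split()
toy_pairs = make_windows(toy_tokens, window=3)

for context, target in toy_pairs:
    print(context, "->", target)
```

Six tokens and a window of 3 give three input/target pairs, because the last 3 words have no following word to predict.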

In [None]:
# Prepare the dataset of input/output pairs, encoded as integers
# InputData  = word indexes of each input sequence
# OutputData = word index of the word that follows each sequence

InputData = []
OutputData = []

# Number of words in each input sequence
sentence_length = 5

for i in range(0, corpusLen - sentence_length):
    X = tokens[i : i + sentence_length]
    Y = tokens[i + sentence_length]
    InputData.append([word_index[word] for word in X])
    OutputData.append(word_index[Y])
 
print(InputData[0])
print ("\n")
print(OutputData[0])   

[41, 379, 4033, 863, 680]


681


In [None]:
# reverse the dictionary
index_words={j:i for i,j in word_index.items()}

In [None]:
print(" ".join([index_words.get(i) for i in InputData[0]]))
print(index_words.get(OutputData[0]))

go until jurong point crazy
available


In [None]:
# Generate X
X = np.reshape(InputData, (len(InputData), sentence_length, 1))
print(X.shape[0], X.shape[1], X.shape[2])
print("----------------------------------")
print(X[0])
print("----------------------------------")
print(len(X[0]))
print("----------------------------------")

# One hot encode the output variable
Y = np_utils.to_categorical(OutputData)
print(Y)
print("----------------------------------")
print(len(Y[0]))
print("----------------------------------")
print(Y.shape[0], Y.shape[1])

79783 5 1
----------------------------------
[[  41]
 [ 379]
 [4033]
 [ 863]
 [ 680]]
----------------------------------
5
----------------------------------
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
----------------------------------
13232
----------------------------------
79783 13232
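
The one-hot step can be illustrated on a tiny example. This sketch uses plain NumPy (`np.eye`) rather than the Keras helper, with a made-up vocabulary of 4 words and three made-up target indexes.

```python
import numpy as np

toy_targets = [2, 0, 3]   # hypothetical word indexes of three target words
vocab_size = 4            # hypothetical vocabulary size

# Row i of the identity matrix is the one-hot vector for index i
toy_one_hot = np.eye(vocab_size)[toy_targets]

print(toy_one_hot)
print(toy_one_hot.shape)
```

Each row contains a single 1 in the column of its word index, which is why `Y` above has one column per distinct word.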


# Next step: Model building
    
We will now define the LSTM model: a single hidden LSTM layer with 256
memory units, followed by dropout with a rate of 0.2. The output layer
uses the softmax activation function, so it produces a probability for
every word in the vocabulary. We use the Adam optimizer.
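
To get a feel for the size of this network, its trainable parameters can be counted by hand. The sketch below assumes the standard LSTM formulation with four gates and bias terms, and takes the input width (1) and the vocabulary size (13232) from the shapes printed in this notebook. Calling `model.summary()` on the built model is the way to confirm the numbers on your own run.

```python
units = 256          # LSTM memory units
input_dim = 1        # one integer word index per time step
vocab_size = 13232   # number of distinct words (width of Y)

# Each of the 4 LSTM gates has input weights, recurrent weights and a bias
lstm_params = 4 * (units * input_dim + units * units + units)

# Dense softmax layer: one weight per (unit, word) pair plus one bias per word
dense_params = units * vocab_size + vocab_size

# Dropout has no trainable parameters
total_params = lstm_params + dense_params
print(lstm_params, dense_params, total_params)
```

Under these assumptions, most of the parameters sit in the output layer, because it needs one set of weights for every word in the vocabulary.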

In [None]:
X.shape[1], X.shape[2],Y.shape[1]

(5, 1, 13232)

In [None]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/


# define the checkpoint: save the weights whenever the training loss improves
file_name_path="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file_name_path, monitor='loss', verbose=1, save_best_only=True, mode='min')
# https://machinelearningmastery.com/check-point-deep-learning-models-keras/

callbacks = [checkpoint]

We can now fit the model to the data. Here we use 5 epochs and a
batch size of 128 patterns. For better results, you can train for more
epochs, such as 50 or 100, and on more data.

In [None]:

#fit the model
model.fit(X, Y, epochs=5, batch_size=128, callbacks=callbacks)


Epoch 1/5

Epoch 00001: loss improved from inf to 7.41367, saving model to weights-improvement-01-7.4137.hdf5
Epoch 2/5

Epoch 00002: loss improved from 7.41367 to 7.06721, saving model to weights-improvement-02-7.0672.hdf5
Epoch 3/5

> Note: We have not split the data into training and testing sets,
because we are not aiming for an accurate model here. Deep learning
models need a lot of data and take a long time to train, so we use a
model checkpoint to save the model weights to a file whenever the loss
improves. We will use the best set of weights for our prediction.

After running the code above, you will have weight checkpoint files
in your working directory. Pick the file with the smallest loss. For example, when we ran this example, the checkpoint below had the smallest loss that we achieved with 5 epochs.
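
One way to read the loss values in the checkpoint file names: with categorical cross-entropy, a model that assigns equal probability to every word would score the natural log of the vocabulary size. The sketch below computes that reference point, assuming the vocabulary size of 13232 reported earlier. The uniform-guess baseline and the `perplexity` helper are added points of comparison, not something the notebook computes.

```python
import math

vocab_size = 13232  # number of distinct words in this notebook's corpus

# Cross-entropy of a uniform prediction: -ln(1 / vocab_size)
uniform_loss = math.log(vocab_size)

def perplexity(loss):
    """exp(loss): the effective number of equally likely word choices."""
    return math.exp(loss)

print(round(uniform_loss, 2))
print(round(perplexity(6.6146)))  # loss taken from the checkpoint file name
```

A training loss below this reference point indicates that the model has learned something beyond guessing every word with equal probability.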

In [None]:
# load the network weights
file_name = "/content/weights-improvement-05-6.6146.hdf5"
model.load_weights(file_name)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Last step: Predicting the next word

We will pick a random sequence of words from the input data, feed it to
the model, and see what it predicts.

In [None]:
# Pick a random input sequence
start = np.random.randint(0, len(InputData))  # random index in [0, len(InputData))
input_sent = InputData[start]  # word indexes of the sequence (sentence_length = 5 words)

# Predict the index of the next word
X = np.reshape(input_sent, (1, len(input_sent), 1))
predict_word = model.predict(X, verbose=0)
print(predict_word)  # holds the probabilities of all the words

print(len(predict_word[0]))  # one probability per distinct word (13232 here)

index = np.argmax(predict_word)  # find the index of the highest probability word
# Must read : https://stackoverflow.com/questions/28697993/numpy-what-is-the-logic-of-the-argmin-and-argmax-functions

print(input_sent)
print ("\n")
print(index)

In [None]:
# Convert the indexes back to words using the reversed dictionary
ans = index_words[index]
sentence = [index_words[i] for i in input_sent]

print(" ".join(sentence))
print()
print(ans)
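
An autofill feature usually offers several candidates rather than one. A minimal sketch of picking the top few suggestions from a probability vector follows. The probabilities, the tiny vocabulary, and the helper name `top_k_words` are made up for illustration; they stand in for `predict_word[0]` and `index_words`.

```python
import numpy as np

def top_k_words(probs, index_to_word, k=3):
    """Return the k most probable words, most likely first."""
    best = np.argsort(probs)[::-1][:k]   # indexes sorted by descending probability
    return [index_to_word[int(i)] for i in best]

# Hypothetical softmax output over a 5-word vocabulary
toy_probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
toy_index_words = {0: "to", 1: "you", 2: "the", 3: "me", 4: "it"}

suggestions = top_k_words(toy_probs, toy_index_words, k=3)
print(suggestions)
```

With the notebook's own variables, the equivalent call would be `top_k_words(predict_word[0], index_words)`.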