Next-Word Prediction in a Sentence Using an RNN

In this notebook we will go through the process of reading a text file and building the inputs and outputs for an RNN. Next time, we will see how to use the model to predict the next word in a sentence.

I will be using [this tutorial](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/).

### Step I: Load the text

In [1]:
import numpy as np

In [2]:
def load_document(path):
    """
    Load the document at the given path and return its text.
    """
    with open(path, 'r') as file:
        text = file.read()
    return text

In [3]:
doc = load_document('./data/republic_clean.txt')
print(doc[:201])

﻿BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### Step II: Clean the text

Cleaning the text is one of the most important parts of any NLP task. It involves splitting the text into sentences and sentences into tokens, and removing punctuation and stop words. The NLTK library is good at this, but let's use raw Python and string methods.

We use a translation table to remove punctuation from each token.
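As a quick illustration of what the translation table does, here is a minimal sketch on a handful of toy tokens. The tokens are made up for this example and are not taken from the cleaned file.

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

# Toy tokens, invented for this illustration.
toy_tokens = ['Piraeus,', '(Bendis,', "don't", 'Artemis.);', '1901']
stripped = [w.translate(table) for w in toy_tokens]
print(stripped)

# Keeping only alphabetic tokens then drops the number.
alpha_only = [w for w in stripped if w.isalpha()]
print(alpha_only)
```

Note that the apostrophe in `don't` is removed too, so contractions are merged into a single token.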

In [4]:
import string

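def clean_document(doc):
    """
    Clean the document passed as a parameter and return a list of tokens.
    """
    # replace double hyphens with a space so the joined words are split
    doc = doc.replace('--', ' ')
    # split on whitespace
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep only purely alphabetic tokens
    tokens = [word for word in tokens if word.isalpha()]
    # lowercase everything
    tokens = [word.lower() for word in tokens]
    return tokens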

In [5]:
tokens = clean_document(doc)
print(tokens[:201])
print('Total Tokens: {}'.format(len(tokens)))
print('Unique Tokens: {}'.format( len(set(tokens))))

['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', '

At this point the whole text is split into a list of 118683 tokens. We now need to split it into sequences of 51 tokens each. Why 51?

The first 50 tokens will be our input and the last one will be our output.

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 51 tokens (50 inputs plus 1 output) as a sequence, then repeating this process to the end of the list of tokens.

In [6]:
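length = 50 + 1  # 50 input tokens plus 1 output token
sequences = list()
# TODO: this is inefficient
# Note: range() stops at len(tokens) - 1, so the window ending at the last token is not included
for i in range(length, len(tokens)):
    sequence = tokens[i-length:i]
    line = ' '.join(sequence)
    sequences.append(line)
print("total sentences is {}".format(len(sequences)))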

total sentences is 118632
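To make the windowing easier to see, here is a small sketch on a toy list with a window of 3 instead of 51. The helper name `sliding_windows` is hypothetical and not part of the notebook. Unlike the loop above, this version also includes the window that ends at the last token, so it yields `len(tokens) - length + 1` sequences.

```python
def sliding_windows(tokens, length):
    """Return every contiguous window of `length` tokens, joined by spaces."""
    # The range stops at len(tokens) - length + 1 so the final window is included.
    return [' '.join(tokens[i:i + length])
            for i in range(len(tokens) - length + 1)]

toy = ['a', 'b', 'c', 'd', 'e']
windows = sliding_windows(toy, 3)
print(windows)
```

Each window overlaps the previous one by all but one token, which is why the saved file is roughly 51 times larger than the cleaned text.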


In [7]:
sequences[:51]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha

Let's save everything to a file for later use.

In [8]:
def save_document(lines, path):
    """
    Save the lines to the given path, one line per row.
    """
    data = '\n'.join(lines)
    with open(path, 'w') as file:
        file.write(data)

In [9]:
save_document(sequences, 'data/republic_sentences.txt')

In [10]:
len(sequences)

118632

Let's reload our input data.

In [11]:

doc = load_document('data/republic_sentences.txt')
lines = doc.split('\n')

In [12]:
lines[:5]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha
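The save and reload steps can be checked with a small round trip. This is only a sketch: it writes two toy lines to a temporary directory (the file name is made up) rather than touching the real data file.

```python
import os
import tempfile

# Toy lines, invented for this illustration.
toy_lines = ['i went down', 'went down yesterday']

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_sentences.txt')
    # Same approach as save_document: join with newlines, no trailing newline.
    with open(path, 'w') as f:
        f.write('\n'.join(toy_lines))
    # Same approach as the reload cell: read everything, split on newlines.
    with open(path, 'r') as f:
        reloaded = f.read().split('\n')

print(reloaded == toy_lines)
```

Because the file is written without a trailing newline, splitting on `'\n'` does not produce an empty last element.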

### Step III: Prepare the text for the model

We have our sentences: 118632 different sequences of 51 tokens each.

We can now move on to preparing our dataset for machine learning.

We are predicting the next word in a sentence, so our X is a sentence and Y is the same sentence shifted by one token.

In [13]:
lines[0]

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted'

If the line above is X, our Y will be the same line shifted by one token.

In [14]:
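# Note: this loop only rebinds the local name `line`;
# it does not modify the strings stored in `lines`.
for line in lines:
    line = line.split(' ')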

In [15]:
lines[0]

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted'

Transforming the sentences into lists of tokens:

In [16]:
sentences = []
for line in lines:
    sentences.append(line.split())

In [17]:
sentences = np.array(sentences)

This is what X looks like:

In [18]:
sentences[0][:-1]

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U19')

And this is what Y looks like:

In [19]:
sentences[0][1:]

array(['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with',
       'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might',
       'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis',
       'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i',
       'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would',
       'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new',
       'thing', 'i', 'was', 'delighted'], dtype='<U19')

In [20]:
X , Y = sentences[:, :-1], sentences[:, 1:]

In [21]:
X[0]

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U19')

In [22]:
Y[0].shape

(50,)
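The two slices are easier to follow on a tiny array. A minimal sketch with made-up tokens and sequences of length 4 instead of 51:

```python
import numpy as np

# Two toy sequences of 4 tokens each (invented for this illustration).
toy = np.array([['a', 'b', 'c', 'd'],
                ['b', 'c', 'd', 'e']])

# X drops the last token of each row, Y drops the first one.
X_toy, Y_toy = toy[:, :-1], toy[:, 1:]
print(X_toy[0], Y_toy[0])
```

At every position `t`, `Y_toy[:, t]` is the token that follows `X_toy[:, t]` in the original sequence.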

### One-hot encoding the tokens

Given our vocabulary, we can replace each word in a sentence with its position in our dictionary.
Based on that position, we can convert our numbers into one-hot vectors.

In [23]:
vocab = list(set(tokens))

In [24]:
len(vocab)

7409

Our vocabulary has 7409 words

In [25]:
# map each word to a number

In [26]:
word_to_number = {word: index for index, word in enumerate(vocab)}

In [27]:
number_to_word = {number: word for word, number in word_to_number.items()}

Convert the arrays of tokens to arrays of integers:

In [28]:
X_number = np.vectorize(word_to_number.__getitem__)(X)
Y_number = np.vectorize(word_to_number.__getitem__)(Y)

Check that the arrays are equal after converting back:

In [29]:
np.testing.assert_array_equal(X, np.vectorize(number_to_word.__getitem__)(X_number))

In [30]:
np.testing.assert_array_equal(Y, np.vectorize(number_to_word.__getitem__)(Y_number))
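One detail worth noting: the iteration order of a `set` of strings can differ between Python runs, so the word indices built from `list(set(tokens))` are not guaranteed to be the same each time the notebook is executed. A possible variation, sketched here on toy tokens with hypothetical `toy_` names, is to sort the vocabulary first so the mapping is reproducible.

```python
# Toy tokens, invented for this illustration.
toy_tokens = ['the', 'son', 'of', 'the', 'goddess']

# Sorting the set gives the same word order on every run.
toy_vocab = sorted(set(toy_tokens))
toy_word_to_number = {word: index for index, word in enumerate(toy_vocab)}
toy_number_to_word = {index: word for word, index in toy_word_to_number.items()}

encoded = [toy_word_to_number[word] for word in toy_tokens]
decoded = [toy_number_to_word[index] for index in encoded]
print(encoded)
print(decoded)
```

A stable mapping matters as soon as the encoded arrays or a trained model are saved and reused in a later session.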

Now we one-hot encode the variables. See this Stack Overflow answer for a [numpy solution](https://stackoverflow.com/a/36960495/4683950).

In [31]:
np.arange(X_number.max()+1)

array([   0,    1,    2, ..., 7406, 7407, 7408])

In [32]:
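def all_index(idx, axis):
    """
    Helper for indexing: build an index tuple that places `idx` on the given axis.
    """
    # list() keeps insert() working whether np.ogrid returns a list or a tuple
    grid = list(np.ogrid[tuple(map(slice, idx.shape))])
    grid.insert(axis, idx)
    return tuple(grid)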

In [33]:
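def one_hot_initialization(a):
    """
    One-hot encode a 2-D integer array, adding a last axis of size a.max() + 1.
    """
    ncols = a.max()+1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    # set a 1 at the position given by each integer along the new last axis
    out[all_index(a, axis=2)] = 1
    return out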

In [34]:
X = one_hot_initialization(X_number)
Y = one_hot_initialization(Y_number)
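For comparison, here is a sketch of another common way to one-hot encode with NumPy: indexing the rows of an identity matrix. The function `one_hot_eye` and its `num_classes` parameter are hypothetical names for this illustration, not part of the notebook. Passing the number of classes explicitly keeps the last axis the same size for X and Y, even if the largest index happens to be missing from one of them.

```python
import numpy as np

def one_hot_eye(a, num_classes):
    """One-hot encode an integer array by indexing rows of an identity matrix."""
    # np.eye(...)[a] picks row a[i, j] for every entry, adding a last axis.
    return np.eye(num_classes, dtype=np.uint8)[a]

# Toy integer array, invented for this illustration.
toy = np.array([[0, 2, 1],
                [1, 1, 0]])
encoded = one_hot_eye(toy, num_classes=3)
print(encoded.shape)
```

Taking `np.argmax(encoded, axis=-1)` recovers the original integers, which is the same check used later in this notebook.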

In [38]:
X[0].shape

(50, 7409)

In [40]:
Y[0].shape

(50, 7409)

In [41]:
X.shape

(118632, 50, 7409)

In [42]:
Y.shape

(118632, 50, 7409)
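An array of this shape is large, so it is worth estimating its size. The sketch below is only back-of-the-envelope arithmetic on the shape printed above. It assumes that `dtype=int` maps to a 64-bit integer, which is the case on most 64-bit platforms, and compares it with an 8-bit dtype.

```python
import numpy as np

# Shape taken from the output above.
n_sequences, seq_len, vocab_size = 118632, 50, 7409
n_elements = n_sequences * seq_len * vocab_size

# Bytes needed for one dense array at each dtype.
bytes_int64 = n_elements * np.dtype(np.int64).itemsize
bytes_uint8 = n_elements * np.dtype(np.uint8).itemsize

print(f"elements: {n_elements}")
print(f"int64: {bytes_int64 / 1e9:.1f} GB, uint8: {bytes_uint8 / 1e9:.1f} GB")
```

Since a dense one-hot array grows with the vocabulary size, working on a subset of the sequences or keeping the integer arrays `X_number` and `Y_number` around may be more practical.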

In [39]:
np.testing.assert_array_equal(np.argmax(X[0], axis=1), X_number[0])

In [None]:
np.argmax(X[0], axis=1)

array([1951, 1951, 1395, 6504, 4319, 2905, 2986, 1158, 4285, 5633, 2986,
       3028, 6859, 1352, 7317, 1951, 3791, 1954, 5836, 1502, 3295, 2905,
       2986, 7392, 1583, 2986, 7205, 2680,  399, 5632, 1768, 1951, 4453,
       2905, 5039, 7034, 5087, 5419,  866,  588, 3189, 2986, 5288, 5441,
       1179, 5180, 4893, 2424, 1951, 1179])

In [42]:
def array_to_word(array, number_to_word=number_to_word):
    """
    Convert a one-hot encoded array back to an array of words.
    """
    # the index of the 1 in each row is the word's number
    x = np.argmax(array, axis=1)
    return np.vectorize(number_to_word.__getitem__)(x)

In [43]:
array_to_word(X[0])

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U9')

In [44]:
array_to_word(Y[0])

array(['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with',
       'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might',
       'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis',
       'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i',
       'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would',
       'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new',
       'thing', 'i', 'was', 'delighted'], dtype='<U9')

### Step IV: Build the model

In the following part we will try to implement the network from scratch and see how it goes.
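As a starting point, here is a minimal sketch of the forward pass of a vanilla RNN in NumPy. It is an assumption about what "from scratch" could look like, not the model from the tutorial: the weight names (`Wxh`, `Whh`, `Why`), the hidden size, and the toy vocabulary size are all chosen for illustration only, and no training is done.

```python
import numpy as np

def rnn_forward(x_seq, Wxh, Whh, Why, bh, by):
    """Run a vanilla RNN over one sequence of one-hot rows.

    Returns an array of shape (seq_len, vocab_size) holding, for each step,
    a softmax distribution over the next word.
    """
    h = np.zeros(Whh.shape[0])  # initial hidden state
    probs = []
    for x_t in x_seq:
        # new hidden state from the current input and the previous state
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)
        logits = Why @ h + by
        # numerically stable softmax
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return np.stack(probs)

# Toy sizes, far smaller than the real vocabulary (illustrative values).
rng = np.random.default_rng(0)
vocab_size, hidden_size, seq_len = 10, 8, 5

Wxh = rng.normal(0, 0.01, (hidden_size, vocab_size))
Whh = rng.normal(0, 0.01, (hidden_size, hidden_size))
Why = rng.normal(0, 0.01, (vocab_size, hidden_size))
bh = np.zeros(hidden_size)
by = np.zeros(vocab_size)

# A random one-hot encoded toy sequence.
x_seq = np.eye(vocab_size)[rng.integers(0, vocab_size, seq_len)]
probs = rnn_forward(x_seq, Wxh, Whh, Why, bh, by)
print(probs.shape)
```

With untrained weights the predicted distributions are close to uniform. Training would compare each row of `probs` with the corresponding one-hot row of Y.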

In [29]:
Y.shape

(118632, 7410)

In [31]:
X[0].shape

(50, 7410)

In [36]:
Y[0]

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)