Next Word in a sentence prediction Using RNN

In this notebook we will go to the process of reading a text file and come up with input and output for a RNN network, next day we will see how to use the model to predict the next word in the sentence.

I will be using [this tutorial](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/).

### I step : Load the text

In [197]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def load_document(path):
    """
    load the document at the given path
    """
    with open(path, 'r') as file:
        text = file.read()
        return text      

In [3]:
doc = load_document('./data/republic_clean.txt')
print(doc[:201])

﻿BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### II Step clean the text : 

Cleaning the text is the most important part of any NLP task, is to clean the text, it envolves spliting the text into sentence and sentences into tokens,removing puncuactions and stop word., NLTK library is good at this but let use raw pyhton and string method.

We use a translation table to remove punctuaction from each token.

In [4]:
import string

def clean_document(doc):
    """
    Clean the document pass in parameter.
    """
    doc = doc.replace('--', ' ')
    tokens = doc.split()
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.lower() for word in tokens]
    return tokens

In [5]:
tokens = clean_document(doc)
print(tokens[:201])
print('Total Tokens: {}'.format(len(tokens)))
print('Unique Tokens: {}'.format( len(set(tokens))))

['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', '

At this point we have the whole text splitted into an array of token ,  like 118683 tokens , we need to split it into an array of 51 tokens each why 51 tokens?

The first 50 tokens will be our input and the last will be our output....

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.

In [6]:
length = 50 + 1
sequences = list()
#TODO:  but this is ineficient 
for i in range(length, len(tokens)):
    sequence = tokens[i-length:i]
    line = ' '.join(sequence)
    sequences.append(line)
print("total sentences is {}".format(len(sequences)))

total sentences is 118632


In [7]:
sequences[:51]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha

Let save everything to a file for later use 

In [8]:
def save_document(lines, path):
    """
    save the document to the given path
    """
    data = '\n'.join(lines)
    with open(path , 'w') as file :
        file.write(data)

In [9]:
save_document(sequences, 'data/republic_sentences.txt')

In [10]:
len(sequences)

118632

Let reload our input data again

In [11]:

doc = load_document('data/republic_sentences.txt')
lines = doc.split('\n')

In [12]:
lines[:5]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha

### II. Preparing the text for the model 

We have our sentences , 118632 differents sentences... w

we can now move to the preparation of our dataset for Machine learning.

We are predicting the next word in a sentence , so our X is a sentence , and Y is the same sentence sifted by one.

In [13]:
lines[0]

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted'

If the line above is x , our Y will be....

In [14]:
for line in lines:
    line = line.split(' ')

In [15]:
lines[0]

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted'

Transforming sentences into list of tokens...

In [16]:
sentences = []
for line in lines:
    sentences.append(line.split())

In [17]:
sentences = np.array(sentences)

This is how X looks like

In [18]:
sentences[0][:-1]

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U19')

And Y looks like

In [19]:
sentences[0][1:]

array(['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with',
       'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might',
       'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis',
       'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i',
       'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would',
       'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new',
       'thing', 'i', 'was', 'delighted'], dtype='<U19')

In [20]:
X , Y = sentences[:, :-1], sentences[:, 1:]

In [21]:
X[0]

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U19')

In [22]:
Y[0].shape

(50,)

### One Hot encoding the token

Given our vocabulary we can replace each word in a sentence denoting his possition in our dictionary.
based on that possition we can convert hours number into one hoted vector.m

In [23]:
vocab = list(set(tokens))

In [24]:
len(vocab)

7409

Our vocabulary has 7409 words

In [25]:
# each word to an number

In [26]:
word_to_number = {}
for x in range(len(vocab)):
    word_to_number[vocab[x]] = x

In [27]:
number_to_word = {}
for word, number in word_to_number.items():
    number_to_word[number] = word

Convert array of token to array of integer

In [28]:
X_number = np.vectorize(word_to_number.__getitem__)(X)
Y_number = np.vectorize(word_to_number.__getitem__)(Y)

Checking if the arrays are equals after conversion

In [29]:
np.testing.assert_array_equal(X, np.vectorize(number_to_word.__getitem__)(X_number))

In [30]:
np.testing.assert_array_equal(Y, np.vectorize(number_to_word.__getitem__)(Y_number))

One hot encoding the variables check here for a [numpy solution](https://stackoverflow.com/a/36960495/4683950)

In [31]:
np.arange(X_number.max()+1)

array([   0,    1,    2, ..., 7406, 7407, 7408])

In [32]:
def all_index(idx, axis):
    """
    helper function for indexing
    """
    grid = np.ogrid[tuple(map(slice, idx.shape))]
    grid.insert(axis, idx)
    return tuple(grid)   

In [33]:
def one_hot_initialization(a):
    ncols = a.max()+1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    out[all_index(a, axis=2)] = 1
    return out

In [34]:
X = one_hot_initialization(X_number)
Y = one_hot_initialization(Y_number)

In [38]:
X[0].shape

(50, 7409)

In [40]:
Y[0].shape

(50, 7409)

In [41]:
X.shape

(118632, 50, 7409)

In [42]:
Y.shape

(118632, 50, 7409)

array([1951, 1951, 1395, 6504, 4319, 2905, 2986, 1158, 4285, 5633, 2986,
       3028, 6859, 1352, 7317, 1951, 3791, 1954, 5836, 1502, 3295, 2905,
       2986, 7392, 1583, 2986, 7205, 2680,  399, 5632, 1768, 1951, 4453,
       2905, 5039, 7034, 5087, 5419,  866,  588, 3189, 2986, 5288, 5441,
       1179, 5180, 4893, 2424, 1951, 1179])

In [39]:
np.testing.assert_array_equal(np.argmax(X[0], axis=1), X_number[0])

In [None]:
np.argmax(X[0], axis=1)

In [42]:
def array_to_word(array, number_to_word=number_to_word):
    """
    convert a one hotte encoded array to word.
    """
    x = np.argmax(array, axis=1)
    return np.vectorize(number_to_word.__getitem__)(x)

In [43]:
array_to_word(X[0])

array(['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus',
       'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i',
       'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess',
       'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because',
       'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they',
       'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a',
       'new', 'thing', 'i', 'was'], dtype='<U9')

In [44]:
array_to_word(Y[0])

array(['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with',
       'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might',
       'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis',
       'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i',
       'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would',
       'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new',
       'thing', 'i', 'was', 'delighted'], dtype='<U9')

#### II. Building the model.

In the following part we will try to implement the network from scratch ans see how it goes.

In [47]:
# %load scripts/functions.py
import numpy as np


def sigmoid(x, derivate=False):
    """
    Compute the element wise sigmoid for the array x
    Args:
        x ([type]): [description]
        derivate (bool, optional): [description]. Defaults to False.
    """
    x = x + 1e-12  # why are we doing this?
    f = 1 / (1 + np.exp(x))
    if derivate:
        f = f * (1 - f)
    else:
        return f


def tanh(x, derivate=False):
    """
    Compute tanh function

    Args:
        x ([type]): [description]
        derivate (bool, optional): [description]. Defaults to False.
    """
    x = x + 1e-12

    f = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
    if derivate:
        return 1 - f**2
    else:
        return f


def softmax(x, derivate=False):
    """
    Compute softmax derivate of x

    Args:
        f ([type]): [description]
        derivate (bool, optional): [description]. Defaults to False.
    """
    x = x + 1e-12

    f = np.exp(x) / np.sum(np.exp(x))

    if derivate:
        pass
    else:
        return f


def cross_entropy(predictions, targets, epsilon=1e-12):
    """
    Computes cross entropy between targets (encoded as one-hot vectors)
    and predictions.
    Input: predictions (N, k)
           targets (N, k)
    Returns: scalar
    """
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    cross_entropy = -np.mean(targets * np.log(predictions + 1e-9))
    return cross_entropy


def log_loss(y_predicted, y, derivate=False):
    """
    Compute the log loss between the predicted output and the real output
    Args:
        y_predicted ([type]): [description]
        y ([type]): [description]
        derivate (bool, optional): [description]. Defaults to False.
    """
    f = - y * np.log(y_predicted)

    if derivate:
        return - y / y_predicted
    else:
        return f


In [190]:
# %load scripts/rnn.py
import sys
import numpy as np
from datetime import datetime


class RNNumpy:
    def __init__(self, hidden_size, vocab_size, bptt_truncate=4):
        """
        function help us to return  initialize the neural networks parameters
        for text prediction the the number of input and
        the number of output is equals to the vocabulary size, as well

        Args:
            hidden_size ([type]): [description]
            vocab_size ([type]): [description]
        """
        num_inputs = num_outputs = vocab_size

        def normal(shape):
            """
            Generate normal distribution but a lot can be done here

            Args:
                shape ([type]): [description]

            Returns:
                [type]: [description]
            """
            return np.random.normal(scale=0.01, size=shape)

        self.U = normal((hidden_size, num_inputs))
        self.W = normal((hidden_size, hidden_size))
        # Output layer parameters
        self.V = normal((num_outputs, hidden_size))
        self.b_hidden = np.zeros((hidden_size, 1))
        self.b_out = np.zeros((vocab_size, 1))
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size

        # TODO: I don't know the meaning of this yet but it will be clear soon
        self.bptt_truncate = bptt_truncate

    def forward(self, x):
        """
        X denoting one training sample or a sentence

        Args:
            x ([type]): [description]
        """
        T = len(x)
        # we are saving all s in an numpy array
        s = np.zeros((T, self.hidden_size))
        s[-1] = np.zeros(self.hidden_size)  # we initialize it with zeros
        o = np.zeros((T, self.vocab_size))
        for t in np.arange(T):
            s[t] = tanh(np.dot(self.U, x[t]) + self.W.dot(s[t - 1]))
            o[t] = softmax(self.V.dot(s[t]))
        return o, s

    def predict(self, x):
        o, s = self.forward(x)
        return np.argmax(o, axis=1)

    def calculate_loss(self, x, y):
        """
        this will calculate the loss for one training example
        we calculate the loss of y_1, y2,y3     and  be find the mean of it
        TODO : is this the right approach?
        """
        y_predicted, _ = self.forward(x)
        log_loss = y * np.log(y_predicted)
        return -np.mean(np.sum(log_loss, axis=1))

    def calculate_total_loss(self, X, Y):
        loss = 0.0
        for i in range(len(Y)):
            loss += self.calculate_loss(X[i], Y[i])
        return loss / float(len(Y))
    
    def back_propagation_trough_time(self, x, y):
        T = len(y)
        o, s = self.forward(x)
        dl_dU = np.zeros(self.U.shape)
        dl_dV = np.zeros(self.V.shape)
        dl_dW = np.zeros(self.W.shape)
        # TODO: should implement o-Y
        # seems to understand this but , I can improve it and make it readble o_t - y_t
        delta_o = o[-1] - y[-1]
        for t in np.arange(T):
            dl_dV += np.outer(delta_o, s[t])
            delta_t = self.V.T.dot(delta_o) * (1 - (np.power(s[t], 2)))

            # TODO this part is not well understood, will improve it
            for bptt_step in np.arange(
                    max(0, t - self.bptt_truncate), t + 1)[::-1]:
                dl_dW += np.outer(delta_t, s[bptt_step - 1])
                dl_dU += np.outer(delta_t, x[t])
                delta_t = self.W.T.dot(delta_t) * \
                    (1 - (np.power(s[bptt_step - 1], 2)))
        return dl_dV, dl_dU, dl_dW

    def numpy_sgd_step(self, x, y, learning_rate):
        dL_dV, dL_dU, dL_dW = self.back_propagation_trough_time(x, y)
        self.U -= learning_rate * dL_dU
        self.V -= learning_rate * dL_dV
        self.W -= learning_rate * dL_dW


def train_with_sgd(
        model,
        x_train,
        y_train,
        learning_rate=0.005,
        nepoch=100,
        evaluate_loss_after=5):
    """
    Train with sgd

    Args:
        model ([type]): [description]
        x_train ([type]): [description]
        y_train ([type]): [description]
        learning_rate (float, optional): [description]. Defaults to 0.005.
        nepoch (int, optional): [description]. Defaults to 100.
        evaluate_loss_after (int, optional): [description]. Defaults to 5.
    """
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_total_loss(x_train, y_train)
            losses.append((epoch, num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print('{} loss after number of example seen = {} epoch = {}: {}'.format(time, num_examples_seen, epoch, loss))
            
            # setting the learning rate if it's increasing
            if(len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print(f"setting learning rate to {learning_rate}")
            sys.stdout.flush()
        for i in range(len(y_train)):
            model.numpy_sgd_step(x_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    return losses

In [191]:
rnn = RNNumpy(50, vocab_size=len(vocab))

In [192]:
rnn.calculate_loss(X[0], Y[0])

8.910437361288848

In [193]:
output_o[-1] - Y[0][-1]

array([0.00013481, 0.00013504, 0.00013507, ..., 0.00013507, 0.00013501,
       0.00013505])

In [194]:
Y[0][-1]

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
x = [p[0] for p in losses]
y = [p[2] for p in losses]
plt.plot(x, y)
plt.title('loss of each epoch')
plt.xlabel('epoch')

In [195]:
np.random.seed(10)
losses = train_with_sgd(rnn, X[:400], Y[:400], nepoch = 10, evaluate_loss_after = 1)

2020-04-13 21:14:29 loss after number of example seen = 0 epoch = 0: 8.910471836450384
2020-04-13 21:16:56 loss after number of example seen = 400 epoch = 1: 8.909616987357108
setting learning rate to 0.0025
2020-04-13 21:19:15 loss after number of example seen = 800 epoch = 2: 8.905731341089929
setting learning rate to 0.00125
2020-04-13 21:21:29 loss after number of example seen = 1200 epoch = 3: 8.862861658845667
setting learning rate to 0.000625
2020-04-13 21:23:57 loss after number of example seen = 1600 epoch = 4: 6.239898255236687
setting learning rate to 0.0003125
2020-04-13 21:26:37 loss after number of example seen = 2000 epoch = 5: 6.011202003567245
setting learning rate to 0.00015625
2020-04-13 21:29:02 loss after number of example seen = 2400 epoch = 6: 5.83136170914172
setting learning rate to 7.8125e-05
2020-04-13 21:31:29 loss after number of example seen = 2800 epoch = 7: 5.767273609825919
setting learning rate to 3.90625e-05
2020-04-13 21:33:52 loss after number of ex

NameError: name 'plt' is not defined

In [202]:
rnn.predict(X[4])

array([2986, 2986, 2986,  399,  399,  399,  399,  399,  399,  399,  399,
        399,  399,  399,  399,  399,  399,  399,  399,  399,  399,  399,
        399,  399,  399,  399,  399,  399,  399,  399,  399,  399,  399,
        399,  399,  399,  399,  399,  399,  399,  399,  399,  399,  399,
        399,  399,  399,  399,  399,  399])

In [211]:
array_to_word(Y[4])

array(['to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of',
       'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers',
       'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis',
       'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what',
       'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which',
       'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the',
       'procession', 'of'], dtype='<U10')

In [210]:
Y[4]

array([0, 0, 0, ..., 0, 0, 0])

In [216]:
np.vectorize(number_to_word.__getitem__)(rnn.predict(X[1186]))

array(['inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably', 'inevitably', 'inevitably',
       'inevitably', 'inevitably'], dtype='<U10')