Text preprocessing for a next-word prediction RNN

In this notebook we will go through the process of reading a text file and building the inputs and outputs for an RNN. Next time we will see how to use the model to predict the next word in a sentence.

I will be using [this tutorial](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/).

### Step I: Load the text

In [1]:
import numpy as np

In [2]:
def load_document(path):
    """
    Load the document at the given path and return its text.
    """
    with open(path, 'r') as file:
        return file.read()

In [3]:
doc = load_document('./data/republic_clean.txt')
print(doc[:201])

﻿BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


### Step II: Clean the text

Cleaning the text is one of the most important parts of any NLP task. It usually involves splitting the text into sentences and the sentences into tokens, then removing punctuation and stop words. The NLTK library is good at this, but let's use plain Python and its string methods.

We use a translation table to remove punctuation from each token.

In [4]:
import string

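def clean_document(doc):
    """
    Clean the document passed as a parameter and return a list of tokens.
    """
    # Replace double hyphens with a space so the words they join are split apart.
    doc = doc.replace('--', ' ')
    # Split on whitespace.
    tokens = doc.split()
    # Remove punctuation from each token.
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # Keep only purely alphabetic tokens.
    tokens = [word for word in tokens if word.isalpha()]
    # Normalize to lowercase.
    tokens = [word.lower() for word in tokens]
    return tokens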
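
To see what each cleaning step does, here is a minimal sketch that applies the same operations to a short made-up sentence. The sample string is an assumption for illustration and does not come from the book.

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)

# Hypothetical sample sentence, not taken from the corpus.
sample = "Don't stop--the 2nd book, he said."

tokens = sample.replace('--', ' ').split()          # split on whitespace
stripped = [w.translate(table) for w in tokens]     # "Don't" -> "Dont", "book," -> "book"
cleaned = [w.lower() for w in stripped if w.isalpha()]  # "2nd" is dropped

print(cleaned)
# ['dont', 'stop', 'the', 'book', 'he', 'said']
```

Note that apostrophes are removed too, so contractions collapse into a single token, and any token containing a digit is discarded entirely.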

In [5]:
tokens = clean_document(doc)
print(tokens[:201])
print('Total Tokens: {}'.format(len(tokens)))
print('Unique Tokens: {}'.format(len(set(tokens))))

['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', '

At this point the whole text is split into a list of 118683 tokens. We now need to split it into sequences of 51 tokens each. Why 51?

The first 50 tokens of each sequence will be our input and the last token will be our output.

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 51 tokens as a sequence, then repeating this process to the end of the list of tokens.

In [6]:
length = 50 + 1
sequences = list()
# Note: this is inefficient
for i in range(length, len(tokens)):
    sequence = tokens[i-length:i]
    line = ' '.join(sequence)
    sequences.append(line)
print("total sentences is {}".format(len(sequences)))

total sentences is 118632
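
The sliding-window idea is easier to follow on a toy list. The sketch below uses a hypothetical helper, `sliding_windows`, and a window of 3 instead of 51. It differs from the loop above in one respect: it iterates up to `len(tokens) + 1`, so it also yields the window that ends on the very last token, which `range(length, len(tokens))` leaves out.

```python
def sliding_windows(tokens, length):
    """Return every run of `length` consecutive tokens (hypothetical helper)."""
    # i is the exclusive end of each window; +1 so the final window is included.
    return [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

toy_tokens = ['a', 'b', 'c', 'd', 'e']
windows = sliding_windows(toy_tokens, 3)

for window in windows:
    # The last token of each window plays the role of the output word.
    print(window[:-1], '->', window[-1])
# ['a', 'b'] -> c
# ['b', 'c'] -> d
# ['c', 'd'] -> e
```

A list of n tokens yields n - length + 1 windows this way, one more than the loop in the notebook.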


In [7]:
sequences[:51]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha

Let's save everything to a file for later use.

In [8]:
def save_document(lines, path):
    """
    Save the lines to the given path, one per line.
    """
    data = '\n'.join(lines)
    with open(path, 'w') as file:
        file.write(data)

In [9]:
save_document(sequences, 'data/republic_sentences.txt')

In [10]:
len(sequences)

118632

Let's reload our input data.

In [11]:

doc = load_document('data/republic_sentences.txt')
lines = doc.split('\n')

In [12]:
lines[:5]

['i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with the',
 'down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in wha

### Step III: Prepare the text for the model

Let's convert our text into sequences of integer token indices.

In [13]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

Using TensorFlow backend.


Basically, what we did was build a list of all the words in our corpus and give every word an index.
Now, given a word, we can look up its index, and given an index, we can recover the word.
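
As a rough illustration of the word-to-index idea, here is a pure-Python sketch. It is not the Keras implementation: it only mimics the behaviour of numbering words from 1 in order of decreasing frequency, and the two example lines are made up.

```python
from collections import Counter

# Hypothetical mini-corpus, standing in for the lines of the book.
toy_lines = ["the cat sat on the mat", "the dog sat"]

# Count how often each word occurs across all lines.
counts = Counter(word for line in toy_lines for word in line.split())

# Most frequent word gets index 1; index 0 is left unused.
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}
index_word = {i: word for word, i in word_index.items()}

# Encode each line as a list of integers, then decode the first one back.
encoded = [[word_index[w] for w in line.split()] for line in toy_lines]
decoded = ' '.join(index_word[i] for i in encoded[0])

print(word_index['the'], word_index['sat'])  # 1 2
print(encoded[1])
print(decoded)                               # the cat sat on the mat
```

Leaving index 0 unused is why the vocabulary size is later computed as the number of words plus one.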

In [14]:
' '.join([tokenizer.index_word.get(index) for index in sequences[0]])

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted'

In [15]:
sequences = np.array(sequences)

In [16]:
sequences.shape

(118632, 51)

We now have our sequences as a NumPy array of 118632 sequences with 51 words each.
The next question is how to split this into X and Y.

Remember that the output is the last word of each sequence.

In [17]:
X, Y = sequences[:, :-1], sequences[:, -1]
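
The same slicing on a small made-up array shows what ends up where. The array values and the names `toy`, `X_toy`, and `y_toy` are assumptions for illustration.

```python
import numpy as np

# Three hypothetical sequences of four word indices each.
toy = np.array([[1, 2, 3, 4],
                [2, 3, 4, 5],
                [3, 4, 5, 6]])

# Every column except the last is input; the last column is the target.
X_toy, y_toy = toy[:, :-1], toy[:, -1]

print(X_toy.shape)  # (3, 3)
print(y_toy)        # [4 5 6]
```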

Input:

In [18]:
' '.join([tokenizer.index_word.get(index) for index in X[0]])

'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was'

In [19]:
Y

array([1147,   35,    1, ...,   21,   23,   85])

Output:

In [20]:
tokenizer.index_word.get(Y[0])

'delighted'

In [21]:
from keras.utils import to_categorical

In [22]:
vocab_size = len(tokenizer.word_index) + 1

In [23]:
Y = to_categorical(Y, num_classes=vocab_size)
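
One-hot encoding turns each integer index into a vector that is all zeros except for a 1 at that index. The sketch below shows the idea with plain NumPy. The helper name `one_hot` is hypothetical and this is not how Keras implements `to_categorical`; it only produces the same kind of array on a toy example.

```python
import numpy as np

def one_hot(indices, num_classes):
    """One-hot encode an integer array (hypothetical helper)."""
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing the identity with the indices does the encoding.
    return np.eye(num_classes, dtype=np.float32)[np.asarray(indices)]

y_toy = one_hot([2, 0, 3], num_classes=4)
print(y_toy)
# [[0. 0. 1. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]

# A 2-D array of indices gains one extra axis of size num_classes.
x_toy = one_hot([[1, 2, 3], [2, 3, 0]], num_classes=4)
print(x_toy.shape)  # (2, 3, 4)
```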

So far we have X, our array of word indices for the input, and Y, the one-hot encoded output word for each sequence.

In [24]:
X = to_categorical(X, num_classes=vocab_size)

In [25]:
X.shape

(118632, 50, 7410)

X is a 3D array: the first dimension is the number of input sequences, the second is the number of words in each input sequence, and the third is the size of our vocabulary.
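
It is worth estimating how much memory an array of this shape needs. The back-of-the-envelope calculation below assumes 4 bytes per element (float32) for the one-hot array and 8 bytes per element (int64) for the integer-index version; the actual dtypes in your environment may differ.

```python
# Shape reported by X.shape above.
n_sequences, seq_len, vocab_size = 118632, 50, 7410

# Assumption: float32 (4 bytes) for one-hot values, int64 (8 bytes) for indices.
one_hot_bytes = n_sequences * seq_len * vocab_size * 4
index_bytes = n_sequences * seq_len * 8

print('one-hot X: {:.1f} GB'.format(one_hot_bytes / 1e9))  # one-hot X: 175.8 GB
print('index X:   {:.1f} MB'.format(index_bytes / 1e6))    # index X:   47.5 MB
```

Under these assumptions the one-hot input is several thousand times larger than the integer-index input. If that is a problem, one option is to keep X as integer indices and let an embedding layer map them to vectors.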

The next step is to learn about word embeddings and how they work, and then to train our model.

II. Training the model