## **RNN (Recurrent Neural Network) - Language Model**

Language modeling (LM) is one of the foundational tasks in natural language processing (NLP). At a high level, the goal is to predict the (n + 1)-th token in a sequence given the n tokens preceding it. Well-trained language models are used in applications such as machine translation.
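
As a minimal sketch of this framing, each training example pairs a window of n consecutive tokens with the token that follows it. The toy sentence and the helper name `make_training_pairs` are illustrative assumptions and are not part of the notebook.

```python
def make_training_pairs(tokens, n):
    """Return (context, target) pairs: n consecutive tokens and the token after them."""
    pairs = []
    for i in range(len(tokens) - n):
        context = tokens[i:i + n]   # the n preceding tokens
        target = tokens[i + n]      # the (n + 1)-th token to predict
        pairs.append((context, target))
    return pairs


# Toy example (hypothetical data, not the notebook's corpus)
toy_tokens = "i went down yesterday to the piraeus".split()
toy_pairs = make_training_pairs(toy_tokens, 3)
for context, target in toy_pairs:
    print(context, "->", target)
```

The notebook below uses the same idea with n = 50, storing each window of 51 words as one line of text.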

### **Data Preparation**

In [None]:
import string
import re
# load a text file and return its contents as a single string
def loading_file(filename):
    # open the file read-only; the context manager closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

In [None]:
# We split the text into words/tokens on whitespace. The first few words show
# that some words are joined by "--", so we replace that with a space first.
# We then remove punctuation and keep only alphabetic words.
def clean_txt(txt):
    # replace '--' with a space ' '
    doc = txt.replace('--', ' ')
    # split into tokens by white space (use doc, so the replacement above takes effect)
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

In [None]:
# save sequences to file, one sequence per line
def save_txt(lines, filename):
    data = '\n'.join(lines)
    # the context manager closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)

In [None]:
# load document
in_filename = '/content/republic_clean.txt'
doc = loading_file(in_filename)
print(doc[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


In [None]:
# clean document
tokens = clean_txt(doc)
print(tokens[:100])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
# organize into sequences of tokens: 50 input words plus 1 output word
length = 50 + 1
sequences = list()

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight']
Total Tokens: 118214
Unique Tokens: 7837


In [None]:
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))
# save sequences to file
out_filename = '/content/republic_clean_seq.txt'
save_txt(sequences, out_filename)

Total Sequences: 118163




---


## **Build Model LSTM**

In [None]:
from numpy import array
from pickle import dump, load
# import everything from tensorflow.keras to avoid mixing it with standalone keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, GRU, Bidirectional

In [None]:
# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(128, return_sequences=True))
    model.add(Bidirectional(GRU(64)))
    model.add(Dense(300, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [None]:
# load
in_filename = '/content/republic_clean_seq.txt'
doc = loading_file(in_filename)
lines = doc.split('\n')
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
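
To make the encoding step concrete, here is a small pure-Python sketch of what the cell above does: map words to integers, split each sequence into input words and a final target word, and one-hot encode the target. The toy lines and the `toy_` names are assumptions for illustration. Unlike the Keras `Tokenizer`, which orders indices by word frequency, this sketch numbers words in order of first appearance.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


# Hypothetical toy data: two overlapping sequences of 4 words each
toy_lines = ["the son of ariston", "son of ariston that"]

# Assign integer ids starting at 1; index 0 is left unused (reserved for padding)
toy_word_index = {}
for line in toy_lines:
    for word in line.split():
        toy_word_index.setdefault(word, len(toy_word_index) + 1)

toy_vocab_size = len(toy_word_index) + 1  # +1 for the reserved index 0
toy_encoded = [[toy_word_index[w] for w in line.split()] for line in toy_lines]

# All but the last id are the input; the last id is the target class
toy_X = [seq[:-1] for seq in toy_encoded]
toy_y = [one_hot(seq[-1], toy_vocab_size) for seq in toy_encoded]

print(toy_X)
print(toy_y)
```

This is why `vocab_size` in the notebook is `len(tokenizer.word_index) + 1`: the one-hot vectors need a slot for index 0 even though no word uses it.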

In [None]:

# define model
model = define_model(vocab_size, seq_length)
# fit model
model.fit(X, y, batch_size=300, epochs=150)
# save the model to file
model.save('model.h5')
# save the tokenizer (the context manager closes the file after writing)
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            391900    
                                                                 
 lstm (LSTM)                 (None, 50, 128)           91648     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              74496     
 l)                                                              
                                                                 
 dense (Dense)               (None, 300)               38700     
                                                                 
 dense_1 (Dense)             (None, 7838)              2359238   
                                                                 
Total params: 2,955,982
Trainable params: 2,955,982
Non-trainable params: 0
______________________________________________
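
The parameter counts in the summary above can be checked by hand. The sketch below is only an arithmetic check, not Keras code: the helper names are illustrative, and the GRU formula assumes the variant with two bias vectors per gate (`reset_after=True`), which is what the reported 74,496 is consistent with.

```python
def embedding_params(vocab, dim):
    # one dim-sized vector per vocabulary entry
    return vocab * dim


def dense_params(n_in, n_out):
    # weight matrix plus one bias per output unit
    return n_in * n_out + n_out


def lstm_params(n_in, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + units) * units + units)


def gru_params(n_in, units):
    # 3 gates; assumes two bias vectors per gate (reset_after=True)
    return 3 * ((n_in + units) * units + 2 * units)


VOCAB, EMBED_DIM = 7838, 50
layer_params = {
    "embedding": embedding_params(VOCAB, EMBED_DIM),
    "lstm": lstm_params(EMBED_DIM, 128),
    "bidirectional": 2 * gru_params(128, 64),  # forward and backward GRU
    "dense": dense_params(2 * 64, 300),        # bidirectional output is 2 * 64 wide
    "dense_1": dense_params(300, VOCAB),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:14s} {count:>10,}")
print(f"{'total':14s} {total_params:>10,}")
```

Most of the parameters (about 2.36 million of 2.96 million) sit in the final softmax layer, because it has one output per vocabulary word.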

## **Prediction**

In [None]:
import numpy as np
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        predict_x = model.predict(encoded, verbose=0)
        # index of the most probable next word (greedy choice), as a plain int
        yhat = int(np.argmax(predict_x, axis=1)[0])
        # map predicted word index to word ('' if the index is not in the vocabulary)
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
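
The `pad_sequences(..., maxlen=seq_length, truncating='pre')` call is what keeps the model input at a fixed length while the generated text grows. As a rough pure-Python sketch of that behavior (the helper name `pad_or_truncate_pre` is hypothetical, and it assumes `maxlen > 0` and the Keras defaults of left-padding with 0):

```python
def pad_or_truncate_pre(ids, maxlen, pad_value=0):
    """Keep the last `maxlen` ids; left-pad with `pad_value` if there are fewer."""
    ids = list(ids)[-maxlen:]                       # drop the oldest ids from the front
    return [pad_value] * (maxlen - len(ids)) + ids  # pad on the left if too short


long_input = pad_or_truncate_pre([1, 2, 3, 4, 5, 6], 4)
short_input = pad_or_truncate_pre([7, 8], 4)
print(long_input)
print(short_input)
```

In `generate_seq`, this means that each newly generated word pushes the oldest word of the seed text out of the 50-word window.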

In [None]:
from random import randint
from pickle import load
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
# load cleaned text sequences
in_filename = '/content/republic_clean_seq.txt'
doc = loading_file(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1
# load the model
model = load_model('model.h5')
# load the tokenizer (the context manager closes the file after reading)
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = load(f)

# select a seed text (randint is inclusive at both ends, so the upper bound is len(lines) - 1)
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')


act as to give the man within him in some way or other the most complete mastery over the entire human creature he should watch over the manyheaded monster like a good husbandman fostering and cultivating the gentle qualities and preventing the wild ones from growing he should be making the



In [None]:
# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)

lionheart his ally and in common care of them all should be uniting the several parts with one another and with himself yes he said that is quite what the maintainer of justice say and so from every point of view whether of pleasure honour or advantage the approver of


## **Build Model GRU**

In [None]:
# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(GRU(128, return_sequences=True))
    model.add(Bidirectional(GRU(64)))
    model.add(Dense(300, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

In [None]:
# define model
model2 = define_model(vocab_size, seq_length)
# fit model
model2.fit(X, y, batch_size=300, epochs=150)
# save the model under its own name so the earlier 'model.h5' is not overwritten
model2.save('model_gru.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 50)            391900    
                                                                 
 gru (GRU)                   (None, 50, 128)           69120     
                                                                 
 bidirectional (Bidirectiona  (None, 128)              74496     
 l)                                                              
                                                                 
 dense (Dense)               (None, 300)               38700     
                                                                 
 dense_1 (Dense)             (None, 7838)              2359238   
                                                                 
Total params: 2,933,454
Trainable params: 2,933,454
Non-trainable params: 0
______________________________________________

### **Generated Sequence for the GRU Model**

In [None]:
generated = generate_seq(model2, tokenizer, seq_length, seed_text, 50)
print(generated)

lionheart his ally and in common care of them all should be uniting the several parts with one another and with himself yes he said that is quite what the maintainer of justice say and so from every point of view whether of pleasure honour or advantage the approver of


## **Build Model RNN**

In [None]:
import tensorflow.keras.models as models
import tensorflow.keras.layers as layers
# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(layers.SimpleRNN(128, return_sequences=True))
    model.add(layers.SimpleRNN(64, return_sequences=True))
    model.add(layers.SimpleRNN(64))
    model.add(Dense(300, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model1.png', show_shapes=True)
    return model

In [None]:
# define model
model1 = define_model(vocab_size, seq_length)
# fit model
model1.fit(X, y, batch_size=512, epochs=100)
# save the model under its own name so the earlier 'model.h5' is not overwritten
model1.save('model_rnn.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 50, 50)            391900    
                                                                 
 simple_rnn (SimpleRNN)      (None, 50, 128)           22912     
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 50, 64)            12352     
                                                                 
 simple_rnn_2 (SimpleRNN)    (None, 64)                8256      
                                                                 
 dense_2 (Dense)             (None, 300)               19500     
                                                                 
 dense_3 (Dense)             (None, 7838)              2359238   
                                                                 
Total params: 2,814,158
Trainable params: 2,814,158
Non-trainable params: 0

### **Generated Sequence for the RNN Model**

In [None]:
generated = generate_seq(model1, tokenizer, seq_length, seed_text, 50)
print(generated)

lionheart his ally and in common marriage and where the state whereas the number having an march upwards is to be the reflection and the unjust is wise or right than the style certainly he said i agree he said they are to be told and other cities should agree
