## Import Main Libraries

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from keras.preprocessing.text import Tokenizer
from numpy import array
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU, Dense
from pickle import dump
from keras.models import load_model
from random import randint
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Data Preparation

## Load and Read Text Data

In [None]:
with open('republic_text.txt') as file:
    contents = file.read()
    print(contents)

The Project Gutenberg EBook of The Republic, by Plato

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: The Republic

Author: Plato

Translator: B. Jowett

Posting Date: August 27, 2008 [EBook #1497]
Release Date: October, 1998
Last Updated: June 22, 2016

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE REPUBLIC ***




Produced by Sue Asscher





THE REPUBLIC

By Plato


Translated by Benjamin Jowett


Note: The Republic by Plato, Jowett, etext #150




INTRODUCTION AND ANALYSIS.

The Republic of Plato is the longest of his works with the exception
of the Laws, and is certainly the greatest of them. There are nearer
approaches to modern metaphysics in the Philebus and in the Sophist; the
Politicus or Statesman is more ideal; the form and institutions of
the Sta

## Clean The Text Data

The raw text must be converted into a sequence of tokens (words) that can be used to train the model.

Here we apply a few text preprocessing steps:

1- Replace '--' with a white space so we can split words better.

2- Remove all non-alphabetic characters (numbers and punctuation).

3- Convert all characters to lowercase.

4- Split the text into word tokens.


In [None]:
nltk.download('punkt')
nltk.download('wordnet')

# function to preprocess the raw text
def clean_text(contents):
    # replace '--' with a space ' '
    contents = contents.replace('--', ' ')

    # replace every non-alphabetic character (digits, punctuation) with a space
    contents = re.sub('[^a-zA-Z]', ' ', contents)

    # convert all words to lowercase
    contents = contents.lower()

    # split the text into word tokens
    contents = word_tokenize(contents)

    return contents  # return the list of tokens

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
# clean document
tokens = clean_text(contents)
print(tokens[:10])  # the tokens look cleaner than the raw text
print('Total number of Tokens: %d' % len(tokens))  # number of words in the text after preprocessing
print('Total number of Unique Tokens: %d' % len(set(tokens)))  # vocabulary size (unique words) in the text

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'republic', 'by', 'plato', 'this']
Total number of Tokens: 217708
Total number of Unique Tokens: 10243


We can organize the long list of tokens into sequences of 50 input words and 1 output word.
That is, sequences of 51 words.

This can be accomplished by iterating over the list of tokens from token 51 onwards and recording the 51 tokens that precede each position as one sequence (50 input words plus 1 output word), continuing until the list of tokens is exhausted.
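To make the windowing concrete, here is a minimal sketch on a toy token list with 3 input words plus 1 output word instead of 50 + 1. The names `toy_tokens` and `make_windows` are illustrative only and are not part of the notebook. The loop uses the same bounds as the notebook's loop, so it produces `len(tokens) - window` sequences.

```python
# Toy illustration of the sliding window (sizes are illustrative, not the notebook's 50 + 1).
toy_tokens = ['the', 'republic', 'of', 'plato', 'is', 'the', 'longest']
window = 3 + 1  # 3 input words + 1 output word

def make_windows(tokens, window):
    """Return runs of `window` consecutive tokens as space-separated strings."""
    windows = []
    # i is the exclusive end index of each slice, as in the notebook loop.
    for i in range(window, len(tokens)):
        windows.append(' '.join(tokens[i - window:i]))
    return windows

toy_sequences = make_windows(toy_tokens, window)
for line in toy_sequences:
    print(line)
```

Because the loop stops at `len(tokens) - 1`, the very last window ("plato is the longest" here) is not emitted; `range(window, len(tokens) + 1)` would include it.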

Here we split the list of clean tokens into sequences with a length of 51 tokens

In [None]:
# define the length of each sequence: 50 input words + 1 output word
length_of_seq = 50+1
sequences = list()
for i in range(length_of_seq, len(tokens)):
    seq = tokens[i-length_of_seq:i]

    # store each sequence as one space-separated line so it can be saved to a file
    line = ' '.join(seq)
    sequences.append(line)

print(sequences[:1])  # check that the first sequence has 51 words (50 input + 1 output)
print("----------------------------------------------")
print('Total number of Sequences: %d' % len(sequences))

['the project gutenberg ebook of the republic by plato this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever you may copy it give it away or re use it under the terms of the project gutenberg license included with this ebook or']
----------------------------------------------
Total number of Sequences: 217657


Here we can see that we have exactly 217657 training patterns to fit our model.

Now we can save the sequences to a separate file and load them later.

In [None]:
# function for saving lines of text to a file, one sequence per line
def save_seq(lines, filename):
    data = '\n'.join(lines)
    # the with-block closes the file even if the write fails
    with open(filename, 'w') as f:
        f.write(data)

In [None]:
# Call save_seq function and save our training sequences to the file 'republic_sequences.txt'
sequences_file = 'republic_sequences.txt'
save_seq(sequences, sequences_file)

Now we have the training data stored in the file 'republic_sequences.txt' in our current working directory.

In 'republic_sequences.txt', each line consists of 51 words: 50 input words followed by 1 output word.

So, let's go on to fitting a language model to this data.

# Train The Language Model

We train a statistical language model with a recurrent architecture on the prepared data. The model:

a. uses a distributed representation for words so that different words with similar meanings will have a similar representation.

b. learns the representation at the same time as learning the model.

c. learns to predict the probability for the next word using the context of the last 50 words.

## First, Load Training Data (Load Sequences)

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    # split data into separate training sequences by splitting based on new lines.
    new_lines = contents.split('\n') 

## Next, Encode the Training Data (Encode Sequences)

The word embedding layer expects input sequences made up of integers.

We can encode our input sequences by mapping each word to a unique integer using the Tokenizer class in Keras.
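The mapping idea can be sketched without Keras. The following is a simplified stand-in, not the Tokenizer implementation: it assumes the text is already cleaned and whitespace-separated, ranks words by frequency, and starts numbering at 1, which is how the Keras Tokenizer assigns indices. The helper names `build_word_index` and `encode` are hypothetical.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() sorts by descending count; index 0 is left unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, word_index):
    """Replace every word with its integer index."""
    return [[word_index[w] for w in line.split()] for line in lines]

toy_lines = ['the state and the soul', 'the soul of the state']
toy_index = build_word_index(toy_lines)
toy_encoded = encode(toy_lines, toy_index)
print(toy_index)
print(toy_encoded)
```

The most frequent word ("the" in this toy input) receives index 1, and no word is ever mapped to 0.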

In [None]:
tokenizer = Tokenizer()
# fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.
tokenizer.fit_on_texts(new_lines) 
sequences = tokenizer.texts_to_sequences(new_lines)

## Separate Sequences into Inputs and Output

Here we separate the sequences into input (X) and output (y) using array slicing.

In [None]:
# separate sequences into input and output
sequences = array(sequences)
X = sequences[:,:-1]
y = sequences[:,-1]

## Define size of vocabulary and sequence size for the model

We need to know the size of the vocabulary to define the embedding layer later. We can determine it from the size of the mapping dictionary.

The word_index attribute of the Tokenizer object gives us the mapping of words to integers. These integers start at 1, so we add 1 to the dictionary size to make room for index 0.


In [None]:
# vocabulary size for the embedding layer (+1 because word indices start at 1)
size_of_vacab = len(tokenizer.word_index) + 1
size_of_vacab

10244

The Embedding layer also needs the length of the input sequences. Rather than hard-coding 50, we read it from the second dimension (number of columns) of the input array, so this code does not need to change if the sequence length changes during data preparation.

In [None]:
# the length of the input sequences must be passed to the Embedding layer (50 words in each sequence)
seq_length = X.shape[1]
seq_length

50

### Encode the Output Words

After separating, we need to one-hot encode the output word of each input-output pair using to_categorical(). This converts the word from an integer to a vector with one entry per vocabulary word: 0 everywhere except for a 1 at the index given by the word's integer value.

This way the model learns to predict a probability distribution over the next word, and the ground truth it learns from is 0 for all words except the word that actually comes next.
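As a rough illustration of what one-hot encoding produces and what it costs, here is a pure-Python sketch. The helper `one_hot` is hypothetical and only mimics the idea behind to_categorical(); the memory estimate assumes 4-byte float32 entries, which is the default dtype of to_categorical().

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

print(one_hot(2, 5))  # toy vocabulary of 5 classes

# Rough memory estimate for the full one-hot target matrix,
# assuming 4-byte float32 entries.
n_sequences = 217657   # number of training sequences reported earlier
vocab_size = 10244     # size_of_vacab reported earlier
bytes_needed = n_sequences * vocab_size * 4
print(bytes_needed)
```

The estimate is about 8.9 GB, the same size as the allocation TensorFlow warns about in the training logs of this notebook. If memory is tight, a common alternative is to keep y as integers and train with the `sparse_categorical_crossentropy` loss.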

In [None]:
from keras.utils import to_categorical
y = to_categorical(y, num_classes=size_of_vacab)

# Trial_1 using Embedding layer, 2 LSTM layers and 2 Dense layers + output

## Build The Language Model

We use an Embedding layer to learn the representation of words, and recurrent layers to learn to predict words based on their context: Long Short-Term Memory (LSTM) layers in this trial, and Gated Recurrent Unit (GRU) layers in some of the later trials.

Dense fully connected layers follow the LSTM layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary, with a probability for each word in the vocabulary.

In [None]:
# define model

model1 = Sequential()
model1.add(Embedding(size_of_vacab,50,input_length=seq_length))

model1.add(LSTM(200, return_sequences=True))
model1.add(LSTM(200))

model1.add(Dense(2000, activation='relu'))
model1.add(Dense(1500, activation='relu'))

model1.add(Dense(size_of_vacab, activation='softmax'))
print(model1.summary())


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            512200    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 200)           200800    
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               320800    
_________________________________________________________________
dense (Dense)                (None, 2000)              402000    
_________________________________________________________________
dense_1 (Dense)              (None, 1500)              3001500   
_________________________________________________________________
dense_2 (Dense)              (None, 10244)             15376244  
Total params: 19,813,544
Trainable params: 19,813,544
Non-trainable params: 0
____________________________________________
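The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM and Dense layers; the helper names are hypothetical, and the layer sizes are taken from the model definition above.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 10244
layer_params = [
    embedding_params(vocab_size, 50),   # embedding
    lstm_params(50, 200),               # lstm
    lstm_params(200, 200),              # lstm_1
    dense_params(200, 2000),            # dense
    dense_params(2000, 1500),           # dense_1
    dense_params(1500, vocab_size),     # dense_2
]
print(layer_params)
print(sum(layer_params))
```

Most of the parameters sit in the final Dense layer, because its width equals the vocabulary size.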

The model is compiled with the categorical cross-entropy loss because it is learning a multi-class classification problem (one class per vocabulary word), and this is the suitable loss function for that type of problem.

We use the efficient Adam implementation of mini-batch gradient descent and track accuracy while training.
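For intuition about the loss, here is a small pure-Python sketch of categorical cross-entropy for a single one-hot target. It is an illustration of the formula, not the Keras implementation, and the 4-class distributions are made up for the example.

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 0.0, 1.0, 0.0]          # the actual next word is class 2
confident = [0.05, 0.05, 0.85, 0.05]   # most probability on the right word
unsure = [0.25, 0.25, 0.25, 0.25]      # uniform guess

loss_confident = categorical_cross_entropy(y_true, confident)
loss_unsure = categorical_cross_entropy(y_true, unsure)
print(loss_confident)
print(loss_unsure)
```

With a one-hot target the loss reduces to the negative log of the probability assigned to the correct word, so a uniform guess over this notebook's 10244-word vocabulary would score ln(10244), roughly 9.23.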

In [None]:
# compile model
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model1.fit(X, y, batch_size=128, epochs=60)

2022-04-05 17:09:45.836310: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 8918713232 exceeds 10% of free system memory.
2022-04-05 17:09:56.599871: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 8918713232 exceeds 10% of free system memory.
2022-04-05 17:10:03.155936: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)


Epoch 1/60


2022-04-05 17:10:06.421275: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005


...
Epoch 60/60


<keras.callbacks.History at 0x7f83b402e5d0>

## Save Model

The model is saved to 'language model1.h' in the current working directory using the Keras model API.

We'll need the mapping of words to integers when we load the model to make predictions.

This mapping is stored in the Tokenizer object, which we can also save using pickle.

In [None]:
# save the model to a file so it can be reused when generating text
model1.save('language model1.h')
# save the tokenizer; the with-block closes the file after writing
with open('tokenizer1.pkl', 'wb') as f:
    dump(tokenizer, f)

2022-04-05 17:48:31.248998: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
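A minimal sketch of a full pickle round trip, assuming a plain dict as a stand-in for the fitted Tokenizer and a temporary file name chosen for this example. Saving uses `pickle.dump` with the file opened in 'wb' mode; loading is the mirror image, `pickle.load` with the file opened in 'rb' mode.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted Tokenizer in this sketch.
stand_in_tokenizer = {'the': 1, 'of': 2, 'and': 3}

path = os.path.join(tempfile.mkdtemp(), 'tokenizer_demo.pkl')

# Write in binary mode; the with-block closes the file after writing.
with open(path, 'wb') as f:
    pickle.dump(stand_in_tokenizer, f)

# Read it back in binary mode with pickle.load.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_tokenizer)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.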


## Use The Language Model

Now we can use the trained model to generate new sequences of text that have the same statistical properties as the source text.

### Load Data

We need the text so that we can choose a seed sequence to feed into the model when generating new text.

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    lines = contents.split('\n')

We will need to specify the expected length of the input. We can determine this from the loaded data by counting the words in one line and subtracting 1 for the output word that is on the same line.

In [None]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### Load Model

In [None]:
model1 = load_model('language model1.h')

In [None]:
# load the tokenizer that was saved alongside the model
from pickle import load
with open('tokenizer1.pkl', 'rb') as f:
    tokenizer = load(f)

### Generate Text

Here we select a random line from the input text to use as the seed for generating text.

In [None]:
# select a random line of the text data (randint includes both endpoints)
_text = lines[randint(0, len(lines) - 1)]
print(_text + '\n')

or intruder very true suppose i said the study of philosophy to take the place of gymnastics and to be continued diligently and earnestly and exclusively for twice the number of years which were passed in bodily exercise will that be enough would you say six or four years he asked



In [None]:
# function to generate a sequence from a language model
def generate_new_seq(model, tokenizer, seq_length, _text, n_words):
    result = list()
    in_text = _text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers with the same tokenizer used during training
        encoded = tokenizer.texts_to_sequences([in_text])[0]

        # keep only the last seq_length tokens (left-pad if the text is shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

        # predict a probability for every word, then take the index of the most likely one
        yhat = model.predict(encoded, verbose=0)
        yhat = int(np.argmax(yhat, axis=1)[0])

        # map the predicted index back to its word using the tokenizer's mapping
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append the predicted word so it becomes context for the next step
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

In [None]:
# generate new text
generated = generate_new_seq(model1, tokenizer, seq_length, _text, 100)
print(generated)

say five years i replied at the end of the time they must be sent down again into the den and compelled to hold any military or other office which young men are qualified to hold in this way they will call their own advantage or the good the human creature would be as far as he can be into one another the only original and in this other sphere we acknowledge that we could not suppose that a man is profited by persuasion and this he is afraid to be a debt which he had seen in his own
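The word lookup inside generate_new_seq scans the vocabulary for every generated word. One possible alternative, sketched here with a hypothetical stand-in for `tokenizer.word_index`, is to build the reverse mapping once and look words up directly.

```python
# Hypothetical stand-in for tokenizer.word_index (word -> integer).
word_index = {'the': 1, 'of': 2, 'and': 3, 'state': 4}

# Build the reverse mapping once (integer -> word).
index_to_word = {index: word for word, index in word_index.items()}

def lookup(index):
    """Return the word for a predicted index, or '' if the index is unknown."""
    return index_to_word.get(index, '')

print(lookup(4))
print(repr(lookup(0)))  # index 0 is reserved for padding and maps to no word
```

The result is the same as the loop in the function; the dictionary only avoids repeating the scan on every step.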


# Trial_2 using Embedding layer, 1 LSTM layer and 2 Dense layers + output

### Build The Language Model

In [None]:
# define model
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model2 = Sequential()
model2.add(Embedding(size_of_vacab, 50, input_length=seq_length))
model2.add(LSTM(256))
model2.add(Dense(2500, activation='relu'))
model2.add(Dense(2000, activation='relu'))
model2.add(Dense(size_of_vacab, activation='softmax'))
print(model2.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            512200    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_3 (Dense)              (None, 2500)              642500    
_________________________________________________________________
dense_4 (Dense)              (None, 2000)              5002000   
_________________________________________________________________
dense_5 (Dense)              (None, 10244)             20498244  
Total params: 26,969,312
Trainable params: 26,969,312
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
# compile model
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model2.fit(X, y, batch_size=128, epochs=60)

2022-04-05 17:48:45.088857: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 8918713232 exceeds 10% of free system memory.
2022-04-05 17:48:53.363495: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 8918713232 exceeds 10% of free system memory.


Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x7f80be750ed0>

### Save the model

In [None]:
from pickle import dump

# save the model to a file so it can be reused when generating text
model2.save('language model2.h')
# save the tokenizer; the with-block closes the file after writing
with open('tokenizer2.pkl', 'wb') as f:
    dump(tokenizer, f)

## Use The Language Model

### Load Data

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    lines = contents.split('\n')

In [None]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### Load Model

In [None]:
# load model
from keras.models import load_model
model2 = load_model('language model2.h')

In [None]:
# load the tokenizer that was saved alongside the model
from pickle import load
with open('tokenizer2.pkl', 'rb') as f:
    tokenizer = load(f)

### Generate Text

In [None]:
from random import randint
# select a random line of the text data (randint includes both endpoints)
_text = lines[randint(0, len(lines) - 1)]
print(_text + '\n')

discussion they are found to have sustained a mighty overthrow and all their former notions appear to be turned upside down and as unskilful players of draughts are at last shut up by their more skilful adversaries and have no piece to move so they too find themselves shut up at



In [None]:
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# function to generate a sequence from a language model
def generate_new_seq(model, tokenizer, seq_length, _text, n_words):
    result = list()
    in_text = _text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers with the same tokenizer used during training
        encoded = tokenizer.texts_to_sequences([in_text])[0]

        # keep only the last seq_length tokens (left-pad if the text is shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

        # predict a probability for every word, then take the index of the most likely one
        yhat = model.predict(encoded, verbose=0)
        yhat = int(np.argmax(yhat, axis=1)[0])

        # map the predicted index back to its word using the tokenizer's mapping
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append the predicted word so it becomes context for the next step
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

In [None]:
# generate new text
generated = generate_new_seq(model2, tokenizer, seq_length, _text, 100)
print(generated)

last for they have nothing to say in this new game of which words are the counters and yet all the time they are in the right the observation is suggested to me by what is now occurring for any one of us might say that although in words he is not able to meet you at each step of the argument he sees as a fact that the votaries of philosophy when they carry on the study not only in youth as a part of education but as the pursuit of their maturer years most of them become strange
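generate_new_seq relies on pad_sequences with `truncating='pre'` to keep the model input at exactly seq_length tokens while the text keeps growing. The following pure-Python sketch mimics that behaviour for a single sequence, assuming the default left padding with zeros; `pad_or_truncate_pre` is a hypothetical helper, not a Keras function.

```python
def pad_or_truncate_pre(encoded, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad with `value` if there are fewer."""
    if len(encoded) >= maxlen:
        return encoded[-maxlen:]            # drop the oldest tokens
    return [value] * (maxlen - len(encoded)) + encoded

longer = pad_or_truncate_pre([1, 2, 3, 4, 5, 6], maxlen=4)
shorter = pad_or_truncate_pre([7, 8], maxlen=4)
print(longer)
print(shorter)
```

Dropping tokens from the front means the model always sees the most recent words, including the ones it has just generated.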


# Trial_3 using Embedding layer, 1 GRU layer and 5 Dense layers + output

### Build The Language Model 

In [None]:
# define model
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, GRU

model3 = Sequential()
model3.add(Embedding(size_of_vacab, 50, input_length=seq_length))

model3.add(GRU(128))

model3.add(Dense(3000, activation='relu'))
model3.add(Dense(2000, activation='relu'))
model3.add(Dense(1000, activation='relu'))
model3.add(Dense(500, activation='relu'))
model3.add(Dense(200, activation='relu'))

model3.add(Dense(size_of_vacab, activation='softmax'))
print(model3.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 50, 50)            512200    
_________________________________________________________________
gru (GRU)                    (None, 128)               69120     
_________________________________________________________________
dense_6 (Dense)              (None, 3000)              387000    
_________________________________________________________________
dense_7 (Dense)              (None, 2000)              6002000   
_________________________________________________________________
dense_8 (Dense)              (None, 1000)              2001000   
_________________________________________________________________
dense_9 (Dense)              (None, 500)               500500    
_________________________________________________________________
dense_10 (Dense)             (None, 200)              

In [None]:
# compile model
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model3.fit(X, y, batch_size=128, epochs=60)

2022-04-05 18:20:37.343225: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 8918713232 exceeds 10% of free system memory.


Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x7f83b3524750>

### Save the model

In [None]:
from pickle import dump

# save the model to a file so it can be reused when generating text
model3.save('language model3.h')
# save the tokenizer; the with-block closes the file after writing
with open('tokenizer3.pkl', 'wb') as f:
    dump(tokenizer, f)

## Use The Language Model

### Load Data

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    lines = contents.split('\n')

In [None]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### Load Model

In [None]:
from keras.models import load_model
# load model
model3 = load_model('language model3.h')

In [None]:
# load the tokenizer that was saved alongside the model
from pickle import load
with open('tokenizer3.pkl', 'rb') as f:
    tokenizer = load(f)

### Generate Text

In [None]:
from random import randint
# select a random line of the text data (randint includes both endpoints)
_text = lines[randint(0, len(lines) - 1)]
print(_text + '\n')

differ from him whom i have been describing for when a man consorts with the many and exhibits to them his poem or other work of art or the service which he has done the state making them his judges when he is not obliged the so called necessity of diomede



In [None]:
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# function to generate a sequence from a language model
def generate_new_seq(model, tokenizer, seq_length, _text, n_words):
    result = list()
    in_text = _text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers with the same tokenizer used during training
        encoded = tokenizer.texts_to_sequences([in_text])[0]

        # keep only the last seq_length tokens (left-pad if the text is shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

        # predict a probability for every word, then take the index of the most likely one
        yhat = model.predict(encoded, verbose=0)
        yhat = int(np.argmax(yhat, axis=1)[0])

        # map the predicted index back to its word using the tokenizer's mapping
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append the predicted word so it becomes context for the next step
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

In [None]:
# generate new text
generated = generate_new_seq(model3, tokenizer, seq_length, _text, 100)
print(generated)

are deserving of progress friend socrates the dorian life is the best and the true method of the body only and in which a certain character is unknown by plato and then in a state which is ordered states neither can these be found in the city you say that the philosopher was mistaken yes and there is an endless purgation of the body very true then is the name which you have been describing by all the poets are both hard and educators of the sake of appearances is great in nursing up in institutions and remaining only a


# Trial_4 using Embedding layer, 2 GRU layers and 3 Dense layers + output

### Build The Language Model

In [None]:
# define model
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, GRU

model4 = Sequential()
model4.add(Embedding(size_of_vacab, 50, input_length=seq_length))
model4.add(GRU(265, return_sequences=True))
model4.add(GRU(128))

model4.add(Dense(2500, activation='relu'))
model4.add(Dense(2000, activation='relu'))
model4.add(Dense(1000, activation='relu'))

model4.add(Dense(size_of_vacab, activation='softmax'))
print(model4.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 50)            512200    
_________________________________________________________________
gru_1 (GRU)                  (None, 50, 265)           252015    
_________________________________________________________________
gru_2 (GRU)                  (None, 128)               151680    
_________________________________________________________________
dense_12 (Dense)             (None, 2500)              322500    
_________________________________________________________________
dense_13 (Dense)             (None, 2000)              5002000   
_________________________________________________________________
dense_14 (Dense)             (None, 1000)              2001000   
_________________________________________________________________
dense_15 (Dense)             (None, 10244)            
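The GRU parameter counts shown in the Trial_3 and Trial_4 summaries can be reproduced by hand. This sketch assumes the `reset_after=True` layout of the Keras GRU, which keeps separate input and recurrent bias vectors for each of the three gates; the helper name `gru_params` is hypothetical.

```python
def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and two bias vectors
    # (assumes the reset_after=True layout)
    return 3 * units * (input_dim + units + 2)

gru_trial3 = gru_params(50, 128)     # gru   in Trial_3
gru1_trial4 = gru_params(50, 265)    # gru_1 in Trial_4
gru2_trial4 = gru_params(265, 128)   # gru_2 in Trial_4
print(gru_trial3, gru1_trial4, gru2_trial4)
```

A GRU has three gates where an LSTM has four, so for the same input size and number of units it needs roughly three quarters of the parameters.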

In [None]:
# compile model
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model4.fit(X, y, batch_size=128, epochs=60)

Epoch 1/60
...
Epoch 60/60


<keras.callbacks.History at 0x7f80d4e12bd0>

## Save the Language Model

In [None]:
from pickle import dump

# save the model to a file so it can be reused when generating text
model4.save('language model4.h')
# save the tokenizer; the with-block closes the file after writing
with open('tokenizer4.pkl', 'wb') as f:
    dump(tokenizer, f)

## Use The Language Model

### Load Data

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    lines = contents.split('\n')

In [None]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### Load Model

In [None]:
from keras.models import load_model
model4 = load_model('language model4.h')

In [None]:
# load the tokenizer that was saved alongside the model
from pickle import load
with open('tokenizer4.pkl', 'rb') as f:
    tokenizer = load(f)

### Generate Text

In [None]:
from random import randint
# select a random line of the text data (randint includes both endpoints)
_text = lines[randint(0, len(lines) - 1)]
print(_text + '\n')

the quantitative differences of physical phenomena but while acknowledging their value in education he sees also that they have no connexion with our higher moral and intellectual ideas in the attempt which plato makes to connect them we easily trace the influences of ancient pythagorean notions there is no reason to



In [None]:
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# function to generate a sequence of words from a language model
def generate_new_seq(model, tokenizer, seq_length, _text, n_words):
    result = list()
    in_text = _text
    # generate a fixed number of words
    for _ in range(n_words):

        # encode the text to integers with the same tokenizer that was used to train the model
        encoded = tokenizer.texts_to_sequences([in_text])[0]

        # keep only the last seq_length tokens (shorter inputs are left-padded with zeros)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

        # predict the next-word probabilities and take the index of the most probable word
        yhat = model.predict(encoded, verbose=0)
        yhat = int(np.argmax(yhat, axis=1)[0])

        # map the predicted index back to its word by looking it up in the tokenizer's mapping
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append the predicted word to the input for the next step
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
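
Because the input text grows by one word per iteration, the `pad_sequences` call with `truncating='pre'` is what keeps the model input at exactly `seq_length` tokens: it drops tokens from the front and, for inputs that are too short, pads zeros on the left. The sketch below imitates that behaviour for a single sequence in plain Python. `pre_pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
def pre_pad_or_truncate(seq, maxlen):
    """Hypothetical helper imitating pad_sequences with padding='pre'
    and truncating='pre' for one sequence of token ids."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Too long: keep only the most recent maxlen tokens.
        return seq[-maxlen:]
    # Too short: fill the front with zeros (the padding index).
    return [0] * (maxlen - len(seq)) + seq

print(pre_pad_or_truncate([4, 8, 15, 16, 23, 42], 4))
print(pre_pad_or_truncate([4, 8], 4))
```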

In [None]:
# generate new text
generated = generate_new_seq(model4, tokenizer, seq_length, _text, 100)
print(generated)

discuss how many there are some elementary artist of as thrasymachus may judge only each of them not in modern times we sometimes need to imply that he is like a man who tells us that he is a good man who is the greatest one of them that is the inference and when you want to keep a pruning hook safe then justice is useful to the individual and to the state but when you want to use it then the art of payment begins by his right men and not to lose their plan in the case he


# Trial_5 using Embedding layer, 1 GRU layer, 1 LSTM layer and 3 Dense layers + output

## Build The Language Model

In [None]:
# define model
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, GRU

model5 = Sequential()
model5.add(Embedding(size_of_vacab, 50, input_length=seq_length))

model5.add(GRU(256, return_sequences=True))
model5.add(LSTM(256))

model5.add(Dense(2500, activation='relu'))
model5.add(Dense(2000, activation='relu'))
model5.add(Dense(1000, activation='relu'))

model5.add(Dense(size_of_vacab, activation='softmax'))
print(model5.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 50, 50)            512200    
_________________________________________________________________
gru_3 (GRU)                  (None, 50, 256)           236544    
_________________________________________________________________
lstm_3 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_16 (Dense)             (None, 2500)              642500    
_________________________________________________________________
dense_17 (Dense)             (None, 2000)              5002000   
_________________________________________________________________
dense_18 (Dense)             (None, 1000)              2001000   
_________________________________________________________________
dense_19 (Dense)             (None, 10244)            
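
The parameter counts printed in the summary can be checked by hand. The sketch below recomputes them for the layers whose counts are shown, taking the vocabulary size of 10244 from the output shape of the last layer. The GRU formula assumes two bias vectors per gate (the `reset_after=True` layout), an assumption that matches the 236544 reported above.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def gru_params(input_dim, units):
    # 3 gates, each with input weights, recurrent weights and
    # two bias vectors (assumed reset_after=True layout).
    return 3 * (units * (input_dim + units) + 2 * units)

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

print(embedding_params(10244, 50))
print(gru_params(50, 256))
print(lstm_params(256, 256))
print(dense_params(256, 2500))
print(dense_params(2500, 2000))
print(dense_params(2000, 1000))
```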

### Compile & Fit Model

In [None]:
# compile model
model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model5.fit(X, y, batch_size=128, epochs=60)

Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 56/60
Epoch 57/60
Epoch 58/60
Epoch 59/60
Epoch 60/60


<keras.callbacks.History at 0x7f7cd23d5890>

## Save the model

In [None]:
from pickle import dump

# save the model to a file so it can be reloaded when generating text
model5.save('language model5.h')

# save the tokenizer
with open('tokenizer5.pkl', 'wb') as f:
    dump(tokenizer, f)

## Use The Language Model

### Load Data

In [None]:
with open('republic_sequences.txt') as f:
    contents = f.read()
    lines = contents.split('\n')

In [None]:
seq_length = len(lines[0].split()) - 1
seq_length

50

### Load Model

In [None]:
from keras.models import load_model
# load model
model5 = load_model('language model5.h')

In [None]:
# load the tokenizer that was saved alongside the model
from pickle import load
with open('tokenizer5.pkl', 'rb') as f:
    tokenizer = load(f)

### Generate Text

In [None]:
from random import randint
# select a random line of the text data as the seed
# (randint is inclusive at both ends, so the upper bound is len(lines) - 1)
_text = lines[randint(0, len(lines) - 1)]
print(_text + '\n')

qualities although sometimes like the jewish prophets we may dash away these figures of speech and describe the nature of god only in negatives these again by degrees acquire a positive meaning it would be well if when meditating on the higher truths either of philosophy or religion we sometimes substituted



In [None]:
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# function to generate a sequence of words from a language model
def generate_new_seq(model, tokenizer, seq_length, _text, n_words):
    result = list()
    in_text = _text
    # generate a fixed number of words
    for _ in range(n_words):

        # encode the text to integers with the same tokenizer that was used to train the model
        encoded = tokenizer.texts_to_sequences([in_text])[0]

        # keep only the last seq_length tokens (shorter inputs are left-padded with zeros)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

        # predict the next-word probabilities and take the index of the most probable word
        yhat = model.predict(encoded, verbose=0)
        yhat = int(np.argmax(yhat, axis=1)[0])

        # map the predicted index back to its word by looking it up in the tokenizer's mapping
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break

        # append the predicted word to the input for the next step
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
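
The loop over `tokenizer.word_index.items()` scans the vocabulary once for every generated word. One possible alternative is to invert the mapping a single time and then look indices up directly. A sketch of that idea, using a hypothetical toy `word_index` in place of the real tokenizer's mapping:

```python
# Hypothetical toy mapping standing in for tokenizer.word_index.
word_index = {"justice": 1, "state": 2, "man": 3}

# Invert once: index -> word.
index_to_word = {index: word for word, index in word_index.items()}

# Direct lookup; index 0 is the padding index and has no word,
# so fall back to an empty string as the original loop does.
print(index_to_word.get(2, ''))
print(repr(index_to_word.get(0, '')))
```

Building the dictionary outside the generation loop keeps each lookup constant-time, which matters more as the vocabulary grows.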

In [None]:
# generate new text
generated = generate_new_seq(model5, tokenizer, seq_length, _text, 100)
print(generated)

one form of style may be hard against those which you were going to receive the ideal polity though imperfectly that the just man seeks to have a share in the government of the other or again about death the same person is acknowledging the size of the state but no city must be treated or not by external effect at last too will be made the most quick while in his ideal state will hereafter be called a man very true he said and i think that you mean for the art of justice when deprived of their subjects
