<center>
    <h1>Generating Text with NLP</h1>
    <img width="500px" src="https://1000wordphilosophy.files.wordpress.com/2018/02/plato.jpg?w=656&h=458">
</center>

# Introduction 

A language model predicts the probability of the next word in a sequence from the words already observed in that sequence. Neural network models are a preferred method for developing statistical language models for two reasons: they can use a distributed representation, in which words with similar meanings have similar representations, and they can condition on a large context of recently observed words when making predictions.
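To make "the probability of the next word" concrete, here is a minimal sketch of a count-based bigram model, which is a much simpler technique than the neural model built in this notebook. The function name `bigram_probabilities` and the toy sentence are my own illustrative assumptions, not part of the dataset or the later code.

```python
from collections import Counter, defaultdict

def bigram_probabilities(tokens):
    """Estimate P(next word | current word) from raw bigram counts."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    # normalize each row of counts into a probability distribution
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

# hypothetical toy corpus, only for illustration
toy_tokens = "the just man is happy and the unjust man is miserable".split()
probs = bigram_probabilities(toy_tokens)

print(probs["the"])  # "just" and "unjust" each follow "the" once
print(probs["man"])  # "is" always follows "man" in this toy corpus
```

A bigram model only looks one word back. The LSTM model in this notebook is meant to condition on a much longer context (50 words).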

# Goals

    1. Prepare text for developing a word-based language model.
    2. Design and fit a neural language model with a learned embedding and an LSTM hidden layer.
    3. Use the learned language model to generate new text with statistical properties similar to those of the source text.

# The Dataset
The Republic is the most famous work of the classical Greek philosopher Plato. It is structured as a dialogue on the topic of order and justice within a city-state. I downloaded the text file from the Project Gutenberg website. <a href="http://www.gutenberg.org/cache/epub/1497/pg1497.txt">Link</a> to the dataset.

# Overview

    1. The Data
    2. Data Preparation
    3. Train the Language Model
    4. Use the Language Model


## Data Preparation

The data contains:
    - chapter headings (e.g. BOOK I)
    - many punctuation marks (e.g. -, ;, ?, :)
    - long monologues
    - quoted dialogue

In [3]:
# import libraries
import numpy as np
import pandas as pd
import os
import re
import string
from random import randint

from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras_preprocessing.sequence import pad_sequences

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Embedding, Dense, Dropout

from keras.utils import plot_model

In [9]:
# load the text file into memory
def load_doc(file_name):
    # open the file read-only; the with block closes it automatically
    with open(file_name, "r", encoding="utf8") as file:
        # read all text
        text = file.read()
    return text

In [13]:
# load document
in_filename = 'pg1497.txt'
doc = load_doc(in_filename)
print(doc[:800])

﻿The Project Gutenberg eBook of The Republic, by Plato

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Republic

Author: Plato

Translator: B. Jowett

Release Date: October, 1998 [eBook #1497]
[Most recently updated: September 11, 2021]

Language: English


Produced by: Sue Asscher and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK THE REPUBLIC ***




THE REPUBLIC

By Plato

Trans


In [15]:
# reference: https://pynative.com/python-regex-findall-finditer/
# find the beginning of the book: "BOOK I." occurs more than once,
# and the last match is where the main text starts
[m.start() for m in re.finditer(r'BOOK I\.', doc)]

[967, 38188, 553671]

In [20]:
# find the end of the book
[m.start() for m in re.finditer('years which we have been describing', doc)]

[1195178]

In [25]:
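# keep the text from the last "BOOK I." match up to the closing phrase
# (the slice ends where that phrase begins, so the phrase itself is excluded)
doc = doc[553671:1195178]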

print(doc[:200])

BOOK I.


I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in wha
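The hard-coded offsets above only work for this exact copy of the file. As a sketch of an alternative, the helper below finds the markers at run time. The name `extract_body` and the toy string are hypothetical, and the helper assumes the last match of the start pattern is the real beginning, as it is here. Unlike the slice above, it keeps the closing phrase, so token counts on the real file would differ slightly.

```python
import re

def extract_body(text, start_pattern, end_phrase):
    """Return text from the last start_pattern match through end_phrase."""
    starts = [m.start() for m in re.finditer(start_pattern, text)]
    end = text.rfind(end_phrase)
    if not starts or end == -1:
        raise ValueError("start or end marker not found")
    # slice from the last start marker to the end of the closing phrase
    return text[starts[-1]:end + len(end_phrase)]

# hypothetical miniature "book" with the start marker repeated earlier
toy = (
    "Contents: BOOK I. BOOK II.\n"
    "Intro mentions BOOK I. too.\n"
    "BOOK I.\n"
    "I went down. The end we describe.\n"
    "License text"
)
body = extract_body(toy, r"BOOK I\.", "The end we describe.")
print(body)
```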


## Cleaning the Text

In [29]:
# turn the doc into clean tokens
def clean_doc(doc):
    # replace "--" with a space " "
    doc = doc.replace("--", " ")
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile("[%s]" % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub("", w) for w in tokens]
    # remove the remaining tokens, which are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

In [30]:
# clean doc
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',
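To see what each cleaning step does, the sketch below applies the same steps as clean_doc one at a time to a made-up sample string (the sample is my own, not from the book). Note two side effects: apostrophes are stripped, so "don't" becomes "dont", and purely numeric tokens are dropped.

```python
import re
import string

# hypothetical sample text, chosen to exercise each cleaning step
sample = 'Certainly not--"BOOK II." he said; don\'t you agree, in 399 B.C.?'

# replace "--" with a space, then split on white space
tokens = sample.replace("--", " ").split()
# strip every punctuation character from each token
re_punc = re.compile("[%s]" % re.escape(string.punctuation))
tokens = [re_punc.sub("", w) for w in tokens]
# keep alphabetic tokens only and lower-case them
tokens = [w.lower() for w in tokens if w.isalpha()]

print(tokens)
```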

### Save the Cleaned Text
Organize the long list of tokens into sequences of 50 input words and 1 output word, that is, sequences of 51 words. One way to do this is to iterate over the token list starting at index 51 and, at each position, take the preceding 51 tokens as one sequence (50 inputs followed by 1 output), repeating this to the end of the list. Each sequence is then joined into a space-separated string for later storage in a file.

In [31]:
# organize into sequence of tokens
length = 50 + 1
seq = list()

for i in range(length, len(tokens)):
    # select sequence of tokens
    s = tokens[i-length:i]
    # convert into a line
    line = " ".join(s)
    # store
    seq.append(line)
print("Total Sequences: %d" % len(seq))

Total Sequences: 117285


Running the cell above creates a long list of lines. The printed count shows that I have exactly 117,285 training patterns for fitting the model later.
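The sliding window is easier to picture on a tiny example. The sketch below uses the same loop bounds as the cell above, wrapped in a hypothetical helper `make_sequences`, with a window of 3 + 1 words instead of 50 + 1.

```python
def make_sequences(tokens, length):
    """Slide a window of `length` tokens over the list, one step at a time."""
    return [" ".join(tokens[i - length:i]) for i in range(length, len(tokens))]

# toy token list, for illustration only
toy_tokens = "i went down yesterday to the piraeus".split()

for line in make_sequences(toy_tokens, length=3 + 1):
    print(line)
```

With these loop bounds, the window ending at the very last token ("yesterday to the piraeus") is not produced, so the loop yields `len(tokens) - length` sequences. Using `range(length, len(tokens) + 1)` would include it.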


Next, I save the sequences to a new file for later loading. The function save_doc(), listed below, takes a list of lines and a filename and writes the lines to the file, one per line, as plain text (in the platform's default encoding, since none is specified).

In [32]:
# save sequences to a file, one sequence per line
def save_doc(lines, filename):
    data = "\n".join(lines)
    # the with block closes the file automatically
    with open(filename, "w") as file:
        file.write(data)

In [33]:
# save sequences to file
out_filename = 'sequences.txt'
save_doc(seq, out_filename)

## Prepare the Model for Training

In [39]:
# load doc into memory
def load_doc(filename):
    # open the file read-only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

In [40]:
# load
in_filename = 'sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
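After reloading, it can be worth checking that every line still has the expected number of words before training. The sketch below is a self-contained round trip on a temporary toy file. The names `toy_lines` and `expected_len` are assumptions for illustration; for the real sequences.txt the expected length would be 50 + 1.

```python
import os
import tempfile

# toy stand-in for the saved sequences, 3 words per line
toy_lines = ["i went down", "went down yesterday", "down yesterday to"]
expected_len = 3  # would be 50 + 1 for the real file

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_sequences.txt")
    # write one sequence per line, as save_doc does
    with open(path, "w") as f:
        f.write("\n".join(toy_lines))
    # read the file back and split into lines, as in the cell above
    with open(path, "r") as f:
        loaded = f.read().split("\n")

# collect any line whose word count differs from the expected length
malformed = [ln for ln in loaded if len(ln.split()) != expected_len]
print("Lines: %d, malformed: %d" % (len(loaded), len(malformed)))
```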