<a href="https://colab.research.google.com/github/chambai/Deep_Learning_Course/blob/main/Week%204%20DL%20NLP%202/AustenWordLevelTextGen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation in the style of Jane Austen
Adapted from: https://stackabuse.com/python-for-nlp-deep-learning-text-generation-with-keras/

In [None]:
# import python libraries
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.utils import to_categorical
from random import randint
import re

Use the Natural Language Toolkit (NLTK) library to download the dataset. We are using NLTK's **Gutenberg corpus**, a small selection of texts from the Project Gutenberg archive. The list of file names printed below shows the 18 texts it contains.

In [None]:
import nltk   # natural language tool kit library
nltk.download('gutenberg')  # downloads the Gutenberg corpus files

from nltk.corpus import gutenberg as gut  # imports the reader for the Gutenberg corpus
print(gut.fileids())    # prints the names of the files in the corpus

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


The file `austen-sense.txt` contains the raw text of the novel *Sense and Sensibility* by Jane Austen.

In [None]:
# get the book text
book_text = nltk.corpus.gutenberg.raw('austen-sense.txt')

In [None]:
# print the first 500 characters of the text so we can look at it
print(book_text[:500])

[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion an


Raw text often contains special characters and numbers. This text isn't actually too bad, but it still needs cleaning to remove numbers, punctuation and extra whitespace (such as newlines).
## Data preprocessing
To remove the punctuation and special characters, we will define a function called `preprocess_text()`. This uses regular expressions to search for and replace patterns in the text, using the Python library `re`. There are many tutorials for this on the web, e.g. the [w3schools regex tutorial](https://www.w3schools.com/python/python_regex.asp).

The `preprocess_text()` function accepts a text string as a parameter and returns a cleaned text string in lower case.

Stop words are not removed here. Stop word removal needs to be done with an NLP library like NLTK, e.g. https://stackabuse.com/removing-stop-words-from-strings-in-python/, as the Keras tokenizer is limited in its capabilities.

In [None]:
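def preprocess_text(sen):
    # Replace anything that is not a letter (punctuation, numbers, newlines) with a space
    sentence = re.sub(r'[^a-zA-Z]', ' ', sen)

    # Single character removal (this also removes one-letter words such as 'a' and 'I')
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Collapse multiple whitespace characters into a single space
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence.lower()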

Call the `preprocess_text()` function to clean the data and display the first 500 characters of the cleaned text:

In [None]:
book_text = preprocess_text(book_text)
book_text[:500]

' sense and sensibility by jane austen chapter the family of dashwood had long been settled in sussex their estate was large and their residence was at norland park in the centre of their property where for many generations they had lived in so respectable manner as to engage the general good opinion of their surrounding acquaintance the late owner of this estate was single man who lived to very advanced age and who for many years of his life had constant companion and housekeeper in his sister b'
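
To see what each of the three substitutions in `preprocess_text()` does, here is a small sketch that applies them one at a time. The sentence and the variable names are made up for illustration and are not part of the notebook. Note how the single-character step drops the word "a", which is why the cleaned text above reads "in so respectable manner".

```python
import re

sample = "It was a fine day, in 1811!"   # made-up example sentence

step1 = re.sub(r'[^a-zA-Z]', ' ', sample)        # every non-letter becomes a space
step2 = re.sub(r'\s+[a-zA-Z]\s+', ' ', step1)    # lone single letters are dropped
step3 = re.sub(r'\s+', ' ', step2)               # runs of whitespace collapse to one space
cleaned = step3.lower()                          # finally convert to lower case

# repr() makes the leading and trailing spaces visible
print(repr(step1))
print(repr(step2))
print(repr(cleaned))
```

The trailing space in the result comes from the punctuation at the end of the sentence. The notebook's function behaves the same way and does not strip it.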

In [None]:
# limit the text to 20,000 characters
print(len(book_text))
book_text = book_text[:20000]
print(len(book_text))

636106
20000


## Convert words to numbers
We are using a simple approach to convert words into single integers. Before we do this we need to tokenize the text into individual words. To do this we can use the `word_tokenize()` function from the `nltk.tokenize` module.

The following code tokenizes the text in the dataset and prints out the total number of words in the dataset, as well as the total number of unique words in that dataset.

In [None]:
from nltk.tokenize import word_tokenize
# punkt is a sentence tokenizer that nltk requires. 
# It divides a text into a list of sentences, by using an unsupervised algorithm 
# to build a model for abbreviation words, collocations, and words that start sentences
nltk.download('punkt')

book_text_words = (word_tokenize(book_text))
n_words = len(book_text_words)
unique_words = len(set(book_text_words))

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Total Words: 3662
Unique Words: 900


To convert the tokenized words to numbers, we use the `Tokenizer` class from the `keras.preprocessing.text` module. We call its `fit_on_texts` method and pass in the list of words. This creates a dictionary in which the keys are the words and the values are the integers assigned to them.

In [None]:
# convert words to numbers
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=115944)
tokenizer.fit_on_texts(book_text_words)

In [None]:
vocab_size = len(tokenizer.word_index) + 1    # word_index is the dictionary. vocab_size is the number of unique words + 1, because index 0 is never assigned to a word
word_2_index = tokenizer.word_index           # store the dictionary in the variable called word_2_index

In [None]:
# just for exploration, let's print the word at position 500 in the text and its index in the dictionary
print(book_text_words[500])
print(word_2_index[book_text_words[500]])

far
125


Here, the word 'far' is assigned the integer 125
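
To get a feel for how such a word-to-integer dictionary comes about, the sketch below builds one by hand for a toy sentence. It only mimics the general idea (more frequent words get smaller integers, numbering starts at 1 and 0 is left unused) and is not the Keras implementation. The names `toy_word_2_index` and `toy_vocab_size` are made up for this example.

```python
from collections import Counter

words = "the cat sat on the mat the end".split()   # toy token list

counts = Counter(words)   # how often each word occurs

# most_common() lists the words by descending count.
# Numbering starts at 1 so that index 0 stays unused.
toy_word_2_index = {
    word: index
    for index, (word, _) in enumerate(counts.most_common(), start=1)
}
toy_vocab_size = len(toy_word_2_index) + 1   # unique words + 1 for the unused index 0

print(toy_word_2_index)
print(toy_vocab_size)
```

Here "the" occurs three times, so it receives the smallest integer, and the six unique words give a vocabulary size of 7.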

## Creating the input sequences
The following code splits the text into overlapping sequences, each consisting of 100 words, and records the word that follows each sequence.

In the first iteration of the loop:
> The first 100 words from the beginning of the text are added to the `input_sequence` list
> The 101st word is appended to the `output_words` list

In the second iteration of the loop:
> The 100 words starting from the second word of the text are added to the `input_sequence` list
> The 102nd word is appended to the `output_words` list

And so on...

In [None]:
input_sequence_words = []  # input sequences in words (used for metric evaluation later on)
input_sequence = []   # empty list to hold the sequences that will be input into our model
output_words = []     # empty list to hold the output words
input_seq_length = 100  # length of the input sequence

# form the input sequence list and the output words list
for i in range(0, n_words - input_seq_length , 1):
    in_seq = book_text_words[i:i + input_seq_length]
    input_sequence_words.append(in_seq)
    out_seq = book_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])

In [None]:
# print the first sequence to see what it looks like - a list of 100 integers that represent the first observation of words
print(len(input_sequence))      # print the number of input sequences
print(input_sequence[0])        # print the first input sequence
print(len(input_sequence[0]))   # print the length of the first input sequence

3562
[177, 3, 236, 19, 389, 390, 178, 4, 116, 1, 25, 22, 237, 54, 391, 10, 392, 15, 87, 6, 88, 3, 15, 393, 6, 30, 64, 394, 10, 4, 395, 1, 15, 238, 239, 9, 102, 396, 23, 22, 240, 10, 32, 179, 397, 13, 2, 398, 4, 180, 139, 241, 1, 15, 399, 400, 4, 242, 401, 1, 69, 87, 6, 402, 140, 103, 240, 2, 45, 243, 181, 3, 103, 9, 102, 70, 1, 5, 117, 22, 244, 403, 3, 404, 10, 5, 141, 16, 8, 245, 36, 246, 118, 70, 247, 5, 49, 248, 89, 405]
100
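
The windowing is easier to follow on a tiny example. The sketch below uses a made-up list of six tokens and a window length of 3 instead of 100. The names `tokens`, `windows` and `targets` are illustrative only.

```python
tokens = ['a', 'b', 'c', 'd', 'e', 'f']   # toy token list
seq_length = 3                            # toy window length (the notebook uses 100)

windows = []   # input sequences
targets = []   # the word that follows each input sequence

# slide the window along the text one token at a time
for i in range(len(tokens) - seq_length):
    windows.append(tokens[i:i + seq_length])
    targets.append(tokens[i + seq_length])

for window, target in zip(windows, targets):
    print(window, '->', target)
```

The number of windows is the number of tokens minus the window length (6 - 3 = 3 here), in the same way that 3662 words and a window of 100 give the 3562 sequences printed above.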


In [None]:
# reshape the input sequences to be 3-dimensional:
# (number of input sequences, length of each sequence, features per time-step)
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))

# Normalize the data by dividing by the vocab size (number of unique words + 1)
X = X / float(vocab_size)

# one-hot encode the output words so that they can be used by the model (gives a 2-dimensional array)
y = to_categorical(output_words)

In [None]:
print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (3562, 100, 1)
y shape: (3562, 901)
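
As a rough illustration of what one-hot encoding produces, the sketch below encodes three made-up output indices in plain Python. It is not the Keras code: `one_hot` is a hypothetical helper written for this example, and it copies only the behaviour of `to_categorical` that, by default, the number of columns is the largest index plus one.

```python
def one_hot(labels, num_classes=None):
    """Return one row per label, with 1.0 in the column given by the label."""
    if num_classes is None:
        num_classes = max(labels) + 1   # default: largest label + 1 columns
    return [
        [1.0 if column == label else 0.0 for column in range(num_classes)]
        for label in labels
    ]

toy_outputs = [2, 0, 3]   # made-up word indices
encoded = one_hot(toy_outputs)

for row in encoded:
    print(row)
```

This is why `y` above has 901 columns: the largest word index among the outputs is 900, and column 0 exists even though no word is assigned to it.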


## Create, compile and fit the model

In [None]:
model = Sequential()
# Three stacked LSTM layers, each with 800 neurons (units). The input shape is (100, 1):
# (number of time-steps, i.e. words in a sequence; features per time-step)
# return_sequences=True passes the output at every time-step on to the next LSTM layer
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()

# the output word can be any one of the unique words in the vocabulary
# This means it is a multi-class classification problem and we use the categorical crossentropy loss function
model.compile(loss='categorical_crossentropy', optimizer='adam')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 100, 800)          2566400   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 800)          5123200   
_________________________________________________________________
lstm_2 (LSTM)                (None, 800)               5123200   
_________________________________________________________________
dense (Dense)                (None, 901)               721701    
Total params: 13,534,501
Trainable params: 13,534,501
Non-trainable params: 0
_________________________________________________________________
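
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counts for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer. The helper names `lstm_params` and `dense_params` are made up for this example.

```python
def lstm_params(units, input_dim):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # one weight per input per unit, plus one bias per unit
    return input_dim * units + units

layer_params = [
    lstm_params(800, 1),      # first LSTM: 1 feature per time-step
    lstm_params(800, 800),    # second LSTM: fed by the 800 outputs of the first
    lstm_params(800, 800),    # third LSTM
    dense_params(901, 800),   # softmax layer over the 901 classes
]

print(layer_params)
print(sum(layer_params))
```

The four values match the `Param #` column above, and their sum matches the reported total of 13,534,501.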


In [None]:
model.fit(X, y, batch_size=64, epochs=2, verbose=1)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7ff75630ce90>

## Making predictions

In [None]:
# randomly select a sequence of integers from the input sequences
random_seq_index = np.random.randint(0, len(input_sequence))   # the upper bound is exclusive
random_seq = list(input_sequence[random_seq_index])            # take a copy so the training data is never modified

# convert the integer sequence to its words
# word_2_index is a dictionary of the format word: index (word is the key, index is the value)
# the next line reverses this to index: word (index is now the key, word is now the value)
# this reversed dictionary can now be used by supplying an index to it, and the word will be returned
index_2_word = {index: word for word, index in word_2_index.items()}   # swaps keys with values
# use a list comprehension to get the words that correspond to the integers in the randomly picked sequence
word_sequence = [index_2_word[value] for value in random_seq]

# join the words in the list and print the sequence of words
print(' '.join(word_sequence))

between her and her mother was strikingly great elinor saw with concern the excess of her sister sensibility but by mrs dashwood it was valued and cherished they encouraged each other now in the violence of their affliction the agony of grief which overpowered them at first was voluntarily renewed was sought for was created again and again they gave themselves up wholly to their sorrow seeking increase of wretchedness in every reflection that could afford it and resolved against ever admitting consolation in future elinor too was deeply afflicted but still she could struggle she could exert herself she


In [None]:
# this code predicts the next 100 words that follow the randomly picked sequence above
# we loop round, making 100 predictions
for i in range(100):
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))    # reshape to 3-D input (1 sequence, length of the sequence, 1 feature per time-step)
    int_sample = int_sample / float(vocab_size)                     # normalise in the same way as the training data

    predicted_probs = model.predict(int_sample, verbose=0)          # predict a probability for every word in the vocabulary

    predicted_word_id = int(np.argmax(predicted_probs))             # index of the most probable word

    word_sequence.append(index_2_word[predicted_word_id])           # look up the predicted word and add it to the text

    # slide the window: drop the oldest word and add the predicted one, so that the
    # next prediction uses the newly generated word (otherwise every prediction is identical)
    random_seq = random_seq[1:] + [predicted_word_id]
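
Text generation of this kind is usually written as a sliding window: predict one word, add it to the end of the input, drop the oldest word, and repeat. The sketch below shows only that bookkeeping. `toy_predict_next` is a made-up stand-in for `model.predict` followed by `np.argmax`, and `generate` is a hypothetical helper, so nothing here depends on the trained model.

```python
def generate(seed, predict_next, n_words):
    """Generate n_words ids, feeding each prediction back into the input window."""
    window = list(seed)   # copy, so the seed itself is left untouched
    generated = []
    for _ in range(n_words):
        next_id = predict_next(window)     # predict from the current window
        generated.append(next_id)
        window = window[1:] + [next_id]    # drop the oldest id, add the newest
    return generated

def toy_predict_next(window):
    # stand-in for the model: the next id is the last id + 1, wrapping round at 5
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
generated = generate(seed, toy_predict_next, 4)
print(generated)
```

If the window were never updated, `predict_next` would receive the same input every time and return the same word on every iteration.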

In [None]:
# loop round the seed words and the predicted words and join them into one string for our final output
final_output = ""
for word in word_sequence:
    final_output = final_output + " " + word

print(final_output)

 between her and her mother was strikingly great elinor saw with concern the excess of her sister sensibility but by mrs dashwood it was valued and cherished they encouraged each other now in the violence of their affliction the agony of grief which overpowered them at first was voluntarily renewed was sought for was created again and again they gave themselves up wholly to their sorrow seeking increase of wretchedness in every reflection that could afford it and resolved against ever admitting consolation in future elinor too was deeply afflicted but still she could struggle she could exert herself she to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to to


The model just repeats the same word after the seed text, so this is not a very good model yet.

## Metrics - The BLEU Score

In [None]:
# BLEU score
print(input_sequence_words[:2])       # print out the first 2 elements of the input sequence words list
print(word_sequence)                  # print out the seed words followed by the predicted words

from nltk.translate.bleu_score import sentence_bleu
reference = input_sequence_words      # list of reference sentences (each one a list of words)
candidate = word_sequence             # the sentence to be scored against the references

score = sentence_bleu(reference, candidate)
print(score)

[['sense', 'and', 'sensibility', 'by', 'jane', 'austen', 'chapter', 'the', 'family', 'of', 'dashwood', 'had', 'long', 'been', 'settled', 'in', 'sussex', 'their', 'estate', 'was', 'large', 'and', 'their', 'residence', 'was', 'at', 'norland', 'park', 'in', 'the', 'centre', 'of', 'their', 'property', 'where', 'for', 'many', 'generations', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', 'the', 'late', 'owner', 'of', 'this', 'estate', 'was', 'single', 'man', 'who', 'lived', 'to', 'very', 'advanced', 'age', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', 'had', 'constant', 'companion', 'and', 'housekeeper', 'in', 'his', 'sister', 'but', 'her', 'death', 'which', 'happened', 'ten', 'years', 'before', 'his', 'own', 'produced', 'great', 'alteration'], ['and', 'sensibility', 'by', 'jane', 'austen', 'chapter', 'the', 'family', 'of', 'dashwood', 'had', 'long', 'been', 'se

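The BLEU score is about 0.5, which is not great, particularly as the first 100 words of the candidate are the seed text taken straight from the book.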

In [None]:
# BLEU score
print(input_sequence_words[:2])       # print out the first 2 elements of the input sequence words list
print(word_sequence)                  # print out the seed words followed by the predicted words

# remove the word 'of' as a little exercise on the BLEU score
word_sequence_no_of = [word for word in word_sequence if word != 'of']  # list comprehension that keeps every word except 'of'
print(word_sequence_no_of)                                              # print the modified output

# set the reference words and the candidate words that are to be compared to the reference
from nltk.translate.bleu_score import sentence_bleu
reference = input_sequence_words
candidate = word_sequence_no_of

# calculate the BLEU score
score = sentence_bleu(reference, candidate)
print(score)

[['sense', 'and', 'sensibility', 'by', 'jane', 'austen', 'chapter', 'the', 'family', 'of', 'dashwood', 'had', 'long', 'been', 'settled', 'in', 'sussex', 'their', 'estate', 'was', 'large', 'and', 'their', 'residence', 'was', 'at', 'norland', 'park', 'in', 'the', 'centre', 'of', 'their', 'property', 'where', 'for', 'many', 'generations', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', 'the', 'late', 'owner', 'of', 'this', 'estate', 'was', 'single', 'man', 'who', 'lived', 'to', 'very', 'advanced', 'age', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', 'had', 'constant', 'companion', 'and', 'housekeeper', 'in', 'his', 'sister', 'but', 'her', 'death', 'which', 'happened', 'ten', 'years', 'before', 'his', 'own', 'produced', 'great', 'alteration'], ['and', 'sensibility', 'by', 'jane', 'austen', 'chapter', 'the', 'family', 'of', 'dashwood', 'had', 'long', 'been', 'se

The BLEU score is higher (about 0.8).  **The higher the BLEU score the better**.

Of course, you can't just remove words to adjust the score. This was only a test to understand the BLEU score. To improve the BLEU score, we need to improve our model.
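
One ingredient of BLEU is a "clipped" n-gram precision: a candidate word only counts as a match as many times as it appears in the reference. The sketch below computes this for single words against one reference. It is a simplified illustration and not the full BLEU score, which also combines longer n-grams, takes the best count over several references and applies a brevity penalty. All the names and sentences are made up for this example.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words found in the reference, with counts clipped."""
    candidate_counts = Counter(candidate)
    reference_counts = Counter(reference)
    # each candidate word is credited at most as often as it occurs in the reference
    matched = sum(min(count, reference_counts[word])
                  for word, count in candidate_counts.items())
    return matched / len(candidate)

toy_reference = "she could exert herself to help".split()
repeated = "to to to to".split()                 # the same word over and over
varied = "she could help herself".split()        # different words from the reference

print(clipped_unigram_precision(repeated, toy_reference))
print(clipped_unigram_precision(varied, toy_reference))
```

Because "to" appears only once in the toy reference, repeating it four times earns credit for just one match out of four, whereas the varied candidate matches on every word.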

## **Exercises**
1. We limited the amount of data in this exercise.  Increase the amount of data and see if it improves the score.   
2. Change the hyperparameters, including the size and number of LSTM layers and the number of epochs, to see if you get better results.
3. Try adding dropout after the LSTM layers and Dense layers.

