
GEN AI Assignment 5

Brittany Klose

11/14/24

In [1]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

In [2]:
#@title 0. Parameters

VOCAB_SIZE = 1000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 50

In [3]:
#@title 1. Load the data
%pwd

'/content'

In [4]:
import requests
urls = [
     "https://www.gutenberg.org/cache/epub/158/pg158.txt",   # Emma
     "https://www.gutenberg.org/cache/epub/161/pg161.txt",   # Sense & Sensibility
     "https://www.gutenberg.org/cache/epub/105/pg105.txt",    # Persuasion
     "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"   #Pride & Prejudice
]



In [5]:
# Clean an individual text by removing the Project Gutenberg header and footer.
# The goal is for each text to start at Chapter 1 and end after the last chapter.

def clean_text(text):

  #text = re.sub(r'(?i)(Table of Contents|Contents)(.*?)(?=\n)(?=\n)', '', text, flags=re.DOTALL)
  # Slice from the author line up to the Gutenberg end marker
  header = text.find("by Jane Austen")
  footer = text.find("*** END OF THE PROJECT GUTENBERG EBOOK ")
  text = text[header:footer]
  chapter1_header = re.search(r"CHAPTER I ", text)
  if chapter1_header:
    # Keep everything from the first chapter heading onward
    text = text[chapter1_header.start():]

  text = re.sub(r'(Preface).*?(VOLUME.\.|volume\.).*?(CHAPTER I.\.|Chapter I\.)', r'\2', text, flags=re.DOTALL|re.IGNORECASE)
  text = re.sub(r'(?i)(Table of Contents|Contents)(.*?)(?=\n)(?=\n)', '', text, flags=re.DOTALL)

  return text
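
One fragile spot in a marker-based cleaner is that `str.find` returns -1 when a marker is missing, which silently produces a wrong slice. Below is a minimal sketch of a guarded alternative. The function name `slice_between_markers` and the `*** START OF THE PROJECT GUTENBERG EBOOK` marker are assumptions that do not appear in this notebook (only the END marker does), so treat it as an illustration rather than the cleaning used for the results here.

```python
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"  # assumed marker text
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"      # same marker clean_text looks for


def slice_between_markers(text, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the two markers, tolerating missing markers."""
    start = text.find(start_marker)
    if start == -1:
        start = 0  # marker missing: keep the text from the beginning
    else:
        # Skip to the end of the marker line so the marker itself is dropped
        newline = text.find("\n", start)
        start = len(text) if newline == -1 else newline + 1

    end = text.find(end_marker, start)
    if end == -1:
        end = len(text)  # marker missing: keep the text to the end

    return text[start:end].strip()


# Tiny made-up sample standing in for a downloaded book
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "CHAPTER I\nEmma Woodhouse, handsome, clever, and rich\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "License footer\n"
)
body = slice_between_markers(sample)
print(body)
```

With both markers missing, the function returns the input unchanged (apart from stripped whitespace) instead of an empty or truncated string.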

In [6]:
# Save all cleaned files to a single file
def append_cleantxt(urls):

  # Initialize an empty string to hold all text
  all_text = ""

  # Download each text file and append to all_text
  for url in urls:
    response = requests.get(url)
    og_text = response.text

    final_text=clean_text(og_text)

    all_text += final_text + "\n\n"  # Separate texts by newlines

  return all_text

In [7]:
# Call functions to clean and merge all texts into a single cleaned file

all_cleaned_text = append_cleantxt(urls)

# Save the cleaned text to a file
with open('jane_works_cleaned.txt', 'w', encoding='utf-8') as file:
    file.write(all_cleaned_text)


In [8]:
# Count the words of text
with open('jane_works_cleaned.txt', "r", encoding="utf-8") as file:
  file_content = file.read()
  words = file_content.split()

  n_words = len(words)
print(f"{n_words} words loaded")

359700 words loaded


In [9]:
preview_text = words[:200]
print(f"200 words of Jane: {preview_text}")

200 words of Jane: ['by', 'Jane', 'Austen', 'VOLUME', 'I.', 'CHAPTER', 'I.', 'CHAPTER', 'II.', 'CHAPTER', 'III.', 'CHAPTER', 'IV.', 'CHAPTER', 'V.', 'CHAPTER', 'VI.', 'CHAPTER', 'VII.', 'CHAPTER', 'VIII.', 'CHAPTER', 'IX.', 'CHAPTER', 'X.', 'CHAPTER', 'XI.', 'CHAPTER', 'XII.', 'CHAPTER', 'XIII.', 'CHAPTER', 'XIV.', 'CHAPTER', 'XV.', 'CHAPTER', 'XVI.', 'CHAPTER', 'XVII.', 'CHAPTER', 'XVIII.', 'VOLUME', 'II.', 'CHAPTER', 'I.', 'CHAPTER', 'II.', 'CHAPTER', 'III.', 'CHAPTER', 'IV.', 'CHAPTER', 'V.', 'CHAPTER', 'VI.', 'CHAPTER', 'VII.', 'CHAPTER', 'VIII.', 'CHAPTER', 'IX.', 'CHAPTER', 'X.', 'CHAPTER', 'XI.', 'CHAPTER', 'XII.', 'CHAPTER', 'XIII.', 'CHAPTER', 'XIV.', 'CHAPTER', 'XV.', 'CHAPTER', 'XVI.', 'CHAPTER', 'XVII.', 'CHAPTER', 'XVIII.', 'VOLUME', 'III.', 'CHAPTER', 'I.', 'CHAPTER', 'II.', 'CHAPTER', 'III.', 'CHAPTER', 'IV.', 'CHAPTER', 'V.', 'CHAPTER', 'VI.', 'CHAPTER', 'VII.', 'CHAPTER', 'VIII.', 'CHAPTER', 'IX.', 'CHAPTER', 'X.', 'CHAPTER', 'XI.', 'CHAPTER', 'XII.', 'CHAPTER', 'XIII.

In [10]:
#@title 2. Tokenize the data
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s
with open("jane_works_cleaned.txt", "r", encoding="utf-8") as file:
    text_data = [pad_punctuation(line) for line in file]


In [11]:

# Display an example of a padded line
example_data = text_data[100]
example_data

'rather too much her own way , and a disposition to think a little too\n'

In [12]:
# Convert to a Tensorflow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [13]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [14]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [15]:
# Display some token:word mappings
for i, word in enumerate(vocab[:20]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: ,
3: .
4: the
5: to
6: and
7: of
8: a
9: her
10: was
11: in
12: i
13: ;
14: it
15: she
16: not
17: be
18: ”
19: that


In [17]:
# Display the start of the cleaned corpus converted to ints (truncated to MAX_LEN + 1 tokens)
example_tokenised = vectorize_layer(all_cleaned_text)
print(example_tokenised.numpy())

[ 38 189   1   1   1 192   1 192   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1   1   1 192   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1 192   1   1   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192   1 192
   1 192   1 192   1 192   1 192   1 192   1   1  12 192  12  79   1   1
   1   6   1  25   8 563 180   6 179   1 154   5   1  99   7   4 315   1
   7   1   6  20 653   1   1 313  11   4 240  25  31  87   5 710  60   1
   1  15  10   4   1   7   4 124 983   7   8 109   1   1   1   6   1  11
 562   7   9 616   1  39   1   7  26 146  49   8  31 517   1   9 166  20
   1 101 147]


In [18]:
#@title 3. Create the training set of text and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


train_ds = text_ds.map(prepare_inputs)
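
The shift in `prepare_inputs` can be seen on a single short sequence. This is a plain-Python sketch with made-up token ids (not output from `vectorize_layer`): at each position the target is simply the token that follows the input.

```python
# Hypothetical token ids for one padded sequence (0 is the padding id)
tokens = [38, 189, 12, 79, 6, 0, 0]

x = tokens[:-1]  # inputs: every token except the last
y = tokens[1:]   # targets: the same sequence shifted left by one

for inp, target in zip(x, y):
    print(f"input {inp:>3} -> target {target:>3}")
```

This is why the vectorisation layer is given `MAX_LEN + 1` as its output length: dropping one token from each end leaves inputs and targets of length `MAX_LEN`.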

In [19]:
#@title 4. Build the LSTM

inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
# Last LSTM Layer if needed
#x = layers.LSTM(N_UNITS)(x)      # return_sequences=False by default
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()

In [20]:
#@title 5. Train the LSTM
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)

In [21]:
# Create a TextGenerator callback that prints sample text after each epoch
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }  # <1>

    def sample_from(self, probs, temperature):  # <2>
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]  # <3>
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:  # <4>
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)  # <5>
            sample_token, probs = self.sample_from(y[0][-1], temperature)  # <6>
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)  # <7>
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("Jane Quote:", max_tokens=100, temperature=1.0)

In [22]:
# Instantiate the text generator callback with the vocabulary

text_generator = TextGenerator(vocab)

In [23]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/50
1173/1173 ━━━━━━━━━━━━━━━━━━━━ 0s 468ms/step - loss: 0.6135
generated text:
Jane Quote: all call might [UNK] [UNK] there were . 

1173/1173 ━━━━━━━━━━━━━━━━━━━━ 553s 469ms/step - loss: 0.6133
Epoch 2/50
1173/1173 ━━━━━━━━━━━━━━━━━━━━ 0s 467ms/step - loss: 0.2503
generated text:
Jane Quote: of . my few [UNK] , ” said the [UNK] , [UNK] under least 

1173/1173 ━━━━━━━━━━━━━━━━━━━━ 548s 468ms/step - loss: 0.2503
Epoch 3/50
1173/1173 ━━━━━━━━━━━━━━━━━━━━ 0s 468ms/step - loss: 0.2231
generated text:
Jane Quote: , to be that she was wanted in expected [UNK] , . all [UNK] has 

1173/1173 ━━━━━━━━━━━━━━━━━━━━ 550s 469ms/step - loss: 0.2231
Epoch 4/50
1173/1173 ━━━━━━━━━━━━━━━━━━━━ 0s 467ms/step - loss: 0.2147
generated text:
Jane Quote: , what his ev

<keras.src.callbacks.history.History at 0x7b4c670660e0>

In [24]:
#@title Generate text using the LSTM
def print_probs(info, vocab, top_k=5):
    # Print the top_k most probable next words for each generation step
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, idx in zip(p_sorted, i_sorted):
            print(f"{vocab[idx]}:   \t{np.round(100*p,2)}%")
        print("--------\n")

In [25]:
info = text_generator.generate(
    "You pierce my soul. I am half agony, half hope. Tell me not that I am too", max_tokens=10, temperature=1.0
)


generated text:
You pierce my soul. I am half agony, half hope. Tell me not that I am too



In [26]:
print_probs(info, vocab)

In [27]:
info = text_generator.generate(
    "You pierce my soul. I am half agony, half hope. Tell me not that I am too", max_tokens=10, temperature=0.1
)


generated text:
You pierce my soul. I am half agony, half hope. Tell me not that I am too



In [28]:
print_probs(info, vocab)

In [29]:
info = text_generator.generate(
    "To wish was to hope, and to hope was to expect", max_tokens=7, temperature=1.0
)
print_probs(info, vocab)


generated text:
To wish was to hope, and to hope was to expect



In [30]:
info = text_generator.generate(
    "To wish was to hope, and to hope was to expect", max_tokens=50, temperature=0.2
)
print_probs(info, vocab)


generated text:
To wish was to hope, and to hope was to expect to [UNK] 


PROMPT: To wish was to hope, and to hope was to expect
to:   	94.07%
the:   	4.86%
.:   	0.4%
that:   	0.19%
herself:   	0.16%
--------


PROMPT: To wish was to hope, and to hope was to expect to
[UNK]:   	99.99%
any:   	0.01%
marry:   	0.0%
be:   	0.0%
a:   	0.0%
--------


PROMPT: To wish was to hope, and to hope was to expect to [UNK]
:   	99.8%
,:   	0.11%
;:   	0.04%
.:   	0.04%
in:   	0.01%
--------



Discussion & Analysis

3. Experiment with Model Complexity

* Increase the number of LSTM layers
* Adjust the number of units in each LSTM layer

Results:

Adding LSTM layers helped, though mostly by producing more words rather than clearer output. I did not go beyond 3 layers because I was concerned about overfitting. I then tried adjusting the number of units per layer, using two layers with 64 and 128 units. This produced slightly better output, but not by much. When I first added the extra layers, I also hit errors at around 18 epochs that stopped execution, and I had to update some of my definitions to match the complexity of the model. For this last run I went back to my original number of layers and increased the epochs from 25 to 50 to see whether longer training would help more than extra layers. By then I had reached the usage limit for the faster runtime, so the model took an especially long time to train, and in the end the results seemed just as bad as the first run with 25 epochs. This may also be because I lowered the vocab size back down to 1,000 (the VOCAB_SIZE used in this run).
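
To put the layer and unit experiments in perspective, the trainable parameter count of an LSTM layer is 4 × ((input_dim + units + 1) × units): one block of input weights, recurrent weights and biases for each of the three gates and the cell candidate. The sketch below compares the single 128-unit layer in this notebook with a 64-then-128 stack. The helper names are illustrative, the layer ordering of the stack is an assumption, and the figures assume the standard LSTM layout with biases enabled.

```python
def lstm_params(input_dim, units):
    # Four blocks (input, forget, output gates and cell candidate), each with
    # input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units + 1) * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


VOCAB_SIZE, EMBEDDING_DIM = 1000, 100  # values from the Parameters cell

embedding = VOCAB_SIZE * EMBEDDING_DIM
single = embedding + lstm_params(EMBEDDING_DIM, 128) + dense_params(128, VOCAB_SIZE)
stacked = (
    embedding
    + lstm_params(EMBEDDING_DIM, 64)
    + lstm_params(64, 128)
    + dense_params(128, VOCAB_SIZE)
)
print(f"one LSTM(128):           {single:,} parameters")
print(f"LSTM(64) then LSTM(128): {stacked:,} parameters")
```

Under these assumptions the stack adds only a modest number of parameters, and most of the model's size sits in the embedding and output layers, which scale with the vocab size.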

4. Temperature and Prompt Variations

* Test various seed prompts to generate text.
* Experiment with different temperature settings
* Analyze the generated outputs for each prompt and temperature combination.

Results:

For this program I tested three setups: the same prompt 4 times, 4 different prompts, and, as seen in my last test run, the same 2 prompts twice each. Using 4 different prompts introduced more diversity and flexibility in the results, though they were less consistent with Jane Austen’s iconic dramatic flair and longing, agonizing tone. In contrast, repeating the same prompt 4 times gave much less diversity but a more consistent tone. Using 2 prompts twice each fell between the two in both diversity and consistency. In this last run I made the quotes slightly longer to see if that would produce better results, but I forgot to increase max_tokens, so all but the last call returned nothing. As mentioned earlier, this run took a very long time to process so I did not rerun it, but if I did it again I would raise max_tokens so that every prompt has room to generate.
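
The empty outputs follow from the loop condition in `generate`: `max_tokens` counts the prompt's own tokens, so a prompt that is already at least `max_tokens` words long leaves no room for new words. A small sketch of that arithmetic, using the prompts from this run (`new_token_budget` is an illustrative helper, not part of the notebook):

```python
def new_token_budget(prompt, max_tokens):
    """Upper bound on how many tokens the generate loop can append."""
    prompt_len = len(prompt.split())  # generate() splits the prompt on whitespace
    return max(0, max_tokens - prompt_len)


long_prompt = "You pierce my soul. I am half agony, half hope. Tell me not that I am too"
short_prompt = "To wish was to hope, and to hope was to expect"

print(new_token_budget(long_prompt, 10))   # 17-word prompt, limit 10
print(new_token_budget(short_prompt, 7))   # 11-word prompt, limit 7
print(new_token_budget(short_prompt, 50))  # 11-word prompt, limit 50
```

Only the last call leaves a positive budget, which matches it being the only call that produced new words. The budget is an upper bound, since generation also stops early when the padding token (index 0) is sampled.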

Regarding temperature, lower temperatures produced very little output and very little creativity, usually a random word or two that was too mundane to be poetic or dramatic enough for a Jane Austen novel. I hoped that increasing the temperature would make the program add more words and more detail, which it did, but the added detail did not finish the thought in the prompt the way I intended. Instead it made some parts too long and missed the theme. To elaborate, all the prompts I chose related to agonizing over love in some way, so my goal was for the model to predict words related to love and tragedy. The model did not often catch this, though it seemed closer when the temperature was lower. That may only be because fewer words were generated, so the sentences did not seem as long and unfocused. Unfortunately this can’t be seen well in this last run. Looking at the last prompt with temperature 0.2, though, the output added only a word or two, yet the unfinished sentence seemed closer to Austen’s tortured language of longing, without quite reaching it, than a bunch of extra detail would have been.
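
The effect of temperature can be checked on a toy distribution. The sketch below applies the same rescaling as `sample_from` (raise each probability to the power 1/temperature, then renormalise) in plain Python. The probabilities are made up for illustration and `apply_temperature` is a hypothetical helper name.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability distribution the way sample_from does."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-word probabilities

for t in (0.1, 0.5, 1.0, 2.0):
    rescaled = apply_temperature(toy_probs, t)
    print(t, [round(p, 3) for p in rescaled])
```

At a temperature of 0.1 almost all of the probability mass moves to the most likely word, while a temperature above 1 flattens the distribution. This is consistent with the near-100% entries in the temperature 0.2 output above.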

5. Evaluation of Generated Text

Results:

The quality of the generated text wasn’t as good as one would hope, but it wasn’t any worse than I expected going in. Given the complexity of the model and the amount of training, I thought the results were not bad. Relevance and stylistic accuracy were somewhat low overall, which did not surprise me. When I increased the complexity and vocab size the output was definitely better, and I could see it improving further if I added more prompts for testing. Coherence was fairly decent, especially at lower temperatures, where only a few words were added. All in all, this was a really cool and fun assignment and I look forward to doing more testing with this kind of model going forward.