<a href="https://colab.research.google.com/github/chrisfinan/GenAI/blob/main/HW5/Problem1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Chris Finan
> HW5

>11/22/24

#Description
Text Generation Using LSTM on Project Gutenberg Training Data:
* In this project you will develop an LSTM (Long Short-Term Memory) model to generate text.
* By training the model on various works, you will try to produce coherent and stylistically relevant text based on seed phrases (prompts).
* To improve the model's performance, you will explore the use of multiple training data, additional LSTM layers, and other optimizations.
* The quantity and quality of training data are crucial for achieving meaningful text generation; a larger and more diverse dataset allows the model to better capture the nuances and patterns of written text.
* I will be creating 2 different LSTM models and using them to generate text based on 5 different Shakespeare stories.

In [127]:
import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

In [2]:
VOCAB_SIZE = 20000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 25

## 1. Load the data <a name="load"></a>

In [3]:
import requests

# List of URLs for additional texts (e.g., different Shakespeare plays)
urls = [
  "https://www.gutenberg.org/files/1041/1041-0.txt",  # The Sonnets
  "https://www.gutenberg.org/files/1112/1112-0.txt" ,  # Romeo and Juliet
  "https://www.gutenberg.org/cache/epub/1531/pg1531.txt" , # Othello
  "https://www.gutenberg.org/cache/epub/1533/pg1533.txt",   # Macbeth
  "https://www.gutenberg.org/cache/epub/1514/pg1514.txt" # A Midsummer Night's Dream
]

# Initialize an empty string to hold all text
all_text = ""

# Download each text file and append to all_text
for url in urls:
  response = requests.get(url)
  text = response.text
  all_text += text + "\n\n"  # Separate texts by newlines

# Save combined text to a single file
with open("combined_shakespeare.txt", "w", encoding="utf-8") as file:
  file.write(all_text)

In [4]:
# Function to clean a single story
def clean_story(text):
    # After checking all the links, these statements start/end the stories:
    start_marker = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*"
    end_marker = r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK.*?\*\*\*"

    # Keep only the text after the start marker
    start_match = re.search(start_marker, text, flags=re.DOTALL)
    if start_match:
        text = text[start_match.end():]  # Trim everything before the marker

    # Remove everything after the end marker
    end_match = re.search(end_marker, text, flags=re.DOTALL)
    if end_match:
        text = text[:end_match.start()]  # Trim everything after the marker

    # Convert text to lowercase
    text = text.lower()

    # Replace multiple spaces, newlines, and tabs with a single space
    text = re.sub(r"\s+", " ", text).strip()

    return text

# Initialize an empty string to hold all cleaned text
cleaned_text = ""

# Download each text file, clean, and append to cleaned_text
for url in urls:
    response = requests.get(url)
    text = response.text
    cleaned = clean_story(text)
    cleaned_text += cleaned + "\n\n"  # Separate texts by newlines

# There is too much punctuation in weird places because the texts are mostly plays - need to remove to output better results
cleaned_text = cleaned_text.translate(str.maketrans("", "", string.punctuation))
cleaned_text = re.sub(r"[{}‘’“”\"]".format(re.escape(string.punctuation)), "", cleaned_text)

# Save the cleaned combined text to a single file - I used this to verify the text was cleaned of noise
with open("cleaned_shakespeare.txt", "w", encoding="utf-8") as file:
    file.write(cleaned_text)

# Preview the first 1000 characters of cleaned text
print(cleaned_text[:1000])

the sonnets by william shakespeare i from fairest creatures we desire increase that thereby beautys rose might never die but as the riper should by time decease his tender heir might bear his memory but thou contracted to thine own bright eyes feedst thy lights flame with selfsubstantial fuel making a famine where abundance lies thyself thy foe to thy sweet self too cruel thou that art now the worlds fresh ornament and only herald to the gaudy spring within thine own bud buriest thy content and tender churl makst waste in niggarding pity the world or else this glutton be to eat the worlds due by the grave and thee ii when forty winters shall besiege thy brow and dig deep trenches in thy beautys field thy youths proud livery so gazed on now will be a tatterd weed of small worth held then being asked where all thy beauty lies where all the treasure of thy lusty days to say within thine own deep sunken eyes were an alleating shame and thriftless praise how much more praise deservd thy bea

## 2. Tokenise the data

In [5]:
# There is a lot of punctuation and they were basically treated as normal words, so I removed them above
# I left this statement here because I was getting errors when I removed it
def pad_punctuation(s):
    # Pad punctuation to treat them as separate tokens
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s

# Pad punctuation in the cleaned text
cleaned_text = [pad_punctuation(cleaned_text)]

In [6]:
sequence_length = 50
sequences = []
# Access the string within the list using cleaned_text[0]
for i in range(0, len(cleaned_text[0].split()) - sequence_length):
    seq = cleaned_text[0].split()[i: i + sequence_length]
    sequences.append(" ".join(seq))

# Need to split into 2 different variables for data:
x_data = [seq[:-1] for seq in sequences]  # All but the last token
y_data = [seq[-1] for seq in sequences]  # Only the last token


In [7]:
# Convert to a Tensorflow Dataset
text_ds = tf.data.Dataset.from_tensor_slices((x_data, y_data))

# Shuffle and batch
text_ds = text_ds.shuffle(1000).batch(BATCH_SIZE, drop_remainder=True)

In [8]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [9]:
# Adapt the layer to the training set
vectorize_layer.adapt(x_data)
vocab = vectorize_layer.get_vocabulary()

In [10]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: the
3: and
4: i
5: to
6: of
7: a
8: my
9: in


## 3. Create the Training Set

In [42]:
# Create the training set of recipes and the same text shifted by one word
def prepare_inputs(text, _):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]  # All but the last token
    y = tokenized_sentences[:, 1:]  # Shifted sequence (the next token)
    return x, y[:, 0]  # Keep only the first token for y

train_ds = text_ds.map(prepare_inputs)

train_ds = train_ds.shuffle(buffer_size=1000)


## 4. Build the LSTM <a name="build"></a>

In [43]:
# Only 1 LSTM layer, so return_sequences is false
inputs = layers.Input(shape=(sequence_length,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=False)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()

## 5. Train the LSTM <a name="train"></a>

In [44]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)

In [170]:
# Had to make quite a few changes to this class, especially the sample_from and generate functions
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10, sequence_length=100):
        self.index_to_word = index_to_word
        self.word_to_index = {word: index for index, word in enumerate(index_to_word)}
        self.sequence_length = sequence_length

    def sample_from(self, probs, temperature, top_k=10, top_p=0.9):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)

        if top_k > 0:
            top_k_indices = np.argsort(probs)[-top_k:]
            filtered_probs = np.zeros_like(probs)
            filtered_probs[top_k_indices] = probs[top_k_indices]
            probs = filtered_probs / np.sum(filtered_probs)

        elif top_p > 0:
            sorted_indices = np.argsort(probs)[::-1]
            cumulative_probs = np.cumsum(probs[sorted_indices])
            cutoff = np.argmax(cumulative_probs > top_p)
            valid_indices = sorted_indices[:cutoff + 1]
            filtered_probs = np.zeros_like(probs)
            filtered_probs[valid_indices] = probs[valid_indices]
            probs = filtered_probs / np.sum(filtered_probs)

        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [self.word_to_index.get(x, 1) for x in start_prompt.split()]
        sample_token = None
        info = []

        while len(start_tokens) < max_tokens and (sample_token is None or sample_token != 0):
            x = np.array([start_tokens[-self.sequence_length:]])
            if len(x[0]) < self.sequence_length:
                x = np.pad(x, [(0, 0), (self.sequence_length - len(x[0]), 0)], constant_values=0)

            y = self.model.predict(x, verbose=0)
            probs = y[0]

            for token in start_tokens[-15:]:  # Increase the lookback
                probs[token] *= 0.5  # Apply penalty for repetition
            probs = probs / np.sum(probs)

            sample_token, probs = self.sample_from(probs, temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)

            if 0 <= sample_token < len(self.index_to_word):
                start_prompt += " " + self.index_to_word[sample_token]
            else:
                start_prompt += " <UNK>"
                break

        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("Out out brief candle", max_tokens=100, temperature=1.0)


In [112]:
# Tokenize starting prompt

text_generator = TextGenerator(vocab)


In [113]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)

Epoch 1/25
[1m3322/3324[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 12ms/step - loss: 3.0088
generated text:
Out out brief candle cunning norways and on iron on peascod much pox hems span fated necessary misheathed heal enfetterd essence span isles heal span isles taket yee necessary postposthaste unguarded suggestion unguarded certainly shorter isles coold outcast moraler span imports banish saluted taket acknown taket yee suggestion yee taket circled unguarded flaming certainly jesses sayling owdst reuels unguarded assassination taket greeting—speak peerless taket cabind screw fearing burst unguarded listning benedecite certainly span appetites suggestion greeting—speak suggestion predecessors saluted disasters unguarded taket welltuned breechd unguarded certainly afloat greeting—speak suggestion her—lie gouts suggestion suggestion suggestion predecessors suggestion taket greeting—speak using disasters

[1m3324/3324[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 1

<keras.src.callbacks.history.History at 0x7b44c15728c0>

After running 25 epochs, you can clearly see that the text is in Shakespearian English. Unfortunately though, there isn't much coherence the further you go down the line. There is also an interesting repetition for some words, like 'span' 4 times in a row at epoch 25. Epoch 3 also repeats 'haggard' several times. I spent several hours trying to debug the TextGenerator class and this seemed to be the best generation after multiple run-throughs.

## 6. Generate text using the LSTM

In [119]:
# For some reason I got an error on only one block of code with this, so I had to slightly change
def print_probs(info, vocab, top_k=5):
    for i in info:
        print(f"\nPROMPT: {i['prompt']}")
        word_probs = i["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, i in zip(p_sorted, i_sorted):
            # Check if index is within vocab bounds before accessing
            if i < len(vocab):
                print(f"{vocab[i]}:   \t{np.round(100*p,2)}%")
            else:
                print(f"Unknown token ID: {i} (out of vocabulary) \t{np.round(100*p,2)}%")  # Indicate out-of-vocabulary tokens
        print("--------\n")

This first quote is from Macbeth and the generation doesn't really make sense, but it is also only 10 words. The word 'iustly' seems like a mistake or a typo - could be 'justly.' Then this one would kind of make sense, but 99% for that word is weird. Using 0.5 for temperature seems balanced and it made decent decisions overall.
> "By the pricking of my thumbs something wicked this way comes"

In [115]:
info = text_generator.generate(
    "By the pricking of my thumbs", max_tokens=10, temperature=0.5
)

print_probs(info, vocab)


generated text:
By the pricking of my thumbs fear—how wear iustly twisted


PROMPT: By the pricking of my thumbs
most:   	60.24%
fear—how:   	28.81%
propriety:   	2.53%
conveniences:   	2.44%
languishes:   	2.32%
--------


PROMPT: By the pricking of my thumbs fear—how
returned:   	35.23%
perspective:   	19.77%
shifted:   	12.82%
grows:   	9.62%
fear—how:   	9.6%
--------


PROMPT: By the pricking of my thumbs fear—how wear
iustly:   	99.99%
more:   	0.01%
miscalld:   	0.0%
from:   	0.0%
gazeth:   	0.0%
--------


PROMPT: By the pricking of my thumbs fear—how wear iustly
shifted:   	27.09%
twisted:   	24.81%
span:   	14.98%
unmoved:   	9.26%
fear—how:   	5.59%
--------



This one was really weird and didn't make sense. After perspective, it looks like all coherence is thrown out the window. A lower temperature should make the safer plays and ideally sound more coherent, but there is a problem here. I tried to penalize repeated words, but that fell through on some occaisions. Here the word 'he' is generated 12 times in a row at 100%! This was also the one that originally created an error for a token out of bounds. Maybe in one of the plays it repeats 'he' to mimic a laugh, but there is clearly something wrong here.
> "But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun."

In [120]:
info = text_generator.generate(
    "What light through yonder window breaks", max_tokens=20, temperature=0.2
)

print_probs(info, vocab)


generated text:
What light through yonder window breaks chancd perspective he he he he he he he he he he he he


PROMPT: What light through yonder window breaks
chancd:   	94.63%
franticmad:   	3.02%
returned:   	2.24%
bell—diablo:   	0.09%
shifted:   	0.01%
--------


PROMPT: What light through yonder window breaks chancd
perspective:   	100.0%
clip:   	0.0%
circled:   	0.0%
lacking:   	0.0%
bonds:   	0.0%
--------


PROMPT: What light through yonder window breaks chancd perspective
he:   	100.0%
Unknown token ID: 19999 (out of vocabulary) 	0.0%
solemnities:   	0.0%
sorrie:   	0.0%
sores:   	0.0%
--------


PROMPT: What light through yonder window breaks chancd perspective he
he:   	100.0%
instruction:   	0.0%
more:   	0.0%
fees:   	0.0%
iustly:   	0.0%
--------


PROMPT: What light through yonder window breaks chancd perspective he he
he:   	100.0%
instruction:   	0.0%
scan:   	0.0%
horned:   	0.0%
fees:   	0.0%
--------


PROMPT: What light through yonder window breaks chancd persp

This one has an interesting word choice altogether and repeats miscalld and twisted many times. With a higher temperature of 1.0, the structure and word choice seems more random, however these words are repeated too many times. Words like miscalld are disproportionately higher percentage, but are not 100% like the last one.
> "Deny thy father and refuse thy name.
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet."

In [118]:
info = text_generator.generate(
    "Deny thy father and refuse thy name", max_tokens=40, temperature=1.0
)
print_probs(info, vocab)


generated text:
Deny thy father and refuse thy name fear—how shifted instruction sir—o pontic reuels assassination scan twisted miscalld twisted miscalld unmoved miscalld twisted miscalld twisted iustly fear—how shifted gazeth returned returned dunnest isles scan twisted miscalld twisted yew span shifted miscalld


PROMPT: Deny thy father and refuse thy name
fear—how:   	79.33%
pontic:   	4.35%
sorted:   	3.36%
bell—diablo:   	3.18%
returned:   	2.47%
--------


PROMPT: Deny thy father and refuse thy name fear—how
bell—diablo:   	30.45%
horned:   	10.66%
shifted:   	10.36%
isles:   	10.05%
sir—o:   	8.79%
--------


PROMPT: Deny thy father and refuse thy name fear—how shifted
instruction:   	47.88%
iustly:   	25.75%
them—list:   	5.78%
assassination:   	3.89%
marshallst:   	3.77%
--------


PROMPT: Deny thy father and refuse thy name fear—how shifted instruction
franticmad:   	30.22%
miscalld:   	24.09%
reuels:   	15.34%
sir—o:   	12.29%
now—whats:   	4.22%
--------


PROMPT: Deny thy

This one almost seems readable and sounds like Shakespeare before the gibberish is thrown in at the end. We see miscalld and twisted repeated again but not as frequently. This one specifically doesn't have any 'normal' words. Most of them are all words you would see when reading Shakespeare, but none of them match anything in the original quote from Othello.
> "If after every tempest come such calms,
May the winds blow till they have wakened death,
And let the laboring bark climb hills of seas
Olympus-high, and duck again as low
As hell's from heaven! If it were now to die,
'Twere now to be most happy, for I fear
My soul hath her content so absolute
That not another comfort like to this
Succeeds in unknown fate."

In [121]:
info = text_generator.generate(
    "If after every tempest comes such calms", max_tokens=50, temperature=0.4
)
print_probs(info, vocab)


generated text:
If after every tempest comes such calms twisted greeneyd unmoved tuggd iustly now—whats shifted assassination miscalld twisted miscalld unmoved miscalld twisted gazeth returned isles miscalld twisted reuels enfeebled franticmad horned shifted miscalld twisted reuels miscalld unmoved miscalld twisted gazeth returned isles miscalld twisted reuels gazeth now—whats franticmad isles miscalld twisted


PROMPT: If after every tempest comes such calms
twisted:   	99.63%
unmoved:   	0.36%
isles:   	0.01%
returned:   	0.0%
shifted:   	0.0%
--------


PROMPT: If after every tempest comes such calms twisted
instruction:   	26.81%
iustly:   	20.33%
them—list:   	17.61%
assassination:   	11.6%
greeneyd:   	10.67%
--------


PROMPT: If after every tempest comes such calms twisted greeneyd
unmoved:   	98.28%
twisted:   	1.36%
franticmad:   	0.1%
pontic:   	0.09%
fear—how:   	0.05%
--------


PROMPT: If after every tempest comes such calms twisted greeneyd unmoved
scan:   	40.3%
gazeth

## 7. Build/Train the LSTM (Part 2: Multiple layers)

In [165]:
#I decided to try with 3 LSTM layers here because I heard of Google crashing or timing out with 4 LSTM layers
inputs2 = layers.Input(shape=(100,), dtype="int32")
x2 = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs2)
x2 = layers.LSTM(64, return_sequences=True)(x2)
x2 = layers.LSTM(128, return_sequences=True)(x2)
x2 = layers.LSTM(256)(x2)
outputs2 = layers.Dense(VOCAB_SIZE, activation="softmax")(x2)
lstm2 = models.Model(inputs2, outputs2)
lstm2.summary()


In [171]:
lstm2.compile("adam", loss_fn)

In [172]:
# Tokenize starting prompt

text_generator2 = TextGenerator(vocab)

In [173]:
lstm2.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator2],
)

Epoch 1/25
[1m3324/3324[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - loss: 6.9241
generated text:
Out out brief candle i to the in and of it you the and a with that my the i be is your i and me in of a to a it to that my the so not of your the and is me the i and it in i to the that you be your to but a and a not my it is and the the to you i of that in to of and i your my it me be the and a the of i but to with is to not that and my in your

[1m3324/3324[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m109s[0m 32ms/step - loss: 6.9241
Epoch 2/25
[1m3324/3324[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step - loss: 6.8166
generated text:
Out out brief candle and i of to i is you my with the the and me not a of it that a be thou my is you in this to for the not the the and a i a i of with of you is to to in that and the thou this not the a it my of of the for in you be and is to i that i the to not with this me the in my in be and the is and i for t

<keras.src.callbacks.history.History at 0x7b443ef68160>

The thing that is really interesting about the 3-layer model is that it basically only uses the most common words whereas the single layer model used very obsure words and often avoided the most common ones. All of the epochs tend to use 'i, you, thou' quite frequently. None of these are coherent either. I am unsure why this one and the previous one are so different with word choices.This one is hard to read as there is a lot of repetition of consecutive pronouns. Varying the number of units in each LSTM layer creates a hierarchical learning structure for the model. I think I may have started too small and created a bottleneck issue. That could be why mainly the simplest words were always used. Smaller units are faster and won't overfit, but won't generate as high quality text. For a larger dataset like this one, I should have gone with higher numbers.

## 8. Generate text using the Multi-Layer LSTM

Looking at the percentages, it looks like the only possibilities are the most common words. The percentages are also very low for each one. Even though this one is short, it does not make sense at all and actually seems worse than the first one.
> "By the pricking of my thumbs something wicked this way comes"

In [174]:
info = text_generator2.generate(
    "By the pricking of my thumbs", max_tokens=10, temperature=0.5
)
print_probs(info, vocab)


generated text:
By the pricking of my thumbs i the with your


PROMPT: By the pricking of my thumbs
the:   	32.48%
thou:   	17.59%
in:   	10.61%
i:   	9.39%
as:   	6.28%
--------


PROMPT: By the pricking of my thumbs i
the:   	34.09%
thou:   	18.51%
in:   	11.21%
as:   	6.67%
your:   	5.75%
--------


PROMPT: By the pricking of my thumbs i the
thou:   	25.02%
in:   	15.01%
the:   	11.45%
as:   	8.92%
your:   	7.74%
--------


PROMPT: By the pricking of my thumbs i the with
thou:   	25.52%
in:   	15.15%
the:   	11.6%
as:   	8.94%
your:   	7.56%
--------



Somehow, with all of the common words being used, none of them match up with the actual quote. There is really not much to say about this one other than the fact that it doesn't make sense. I can see the the low temperature is somewhat trying with 'thou in i with my.' It rhymes and is somewhat readable.
> "But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun."

In [175]:
info = text_generator2.generate(
    "What light through yonder window breaks", max_tokens=20, temperature=0.2
)
print_probs(info, vocab)


generated text:
What light through yonder window breaks the the thou in i with my the this in thou a and your


PROMPT: What light through yonder window breaks
the:   	98.59%
thou:   	0.67%
my:   	0.27%
in:   	0.19%
i:   	0.14%
--------


PROMPT: What light through yonder window breaks the
the:   	67.84%
thou:   	15.29%
my:   	5.99%
in:   	4.23%
i:   	3.1%
--------


PROMPT: What light through yonder window breaks the the
thou:   	45.44%
my:   	17.39%
in:   	12.13%
i:   	8.77%
the:   	6.1%
--------


PROMPT: What light through yonder window breaks the the thou
my:   	31.22%
in:   	21.62%
i:   	15.73%
the:   	10.93%
as:   	5.85%
--------


PROMPT: What light through yonder window breaks the the thou in
my:   	38.57%
i:   	19.77%
the:   	13.35%
as:   	7.31%
your:   	4.91%
--------


PROMPT: What light through yonder window breaks the the thou in i
my:   	46.04%
the:   	16.21%
as:   	8.97%
your:   	6.3%
it:   	4.85%
--------


PROMPT: What light through yonder window breaks the the thou 

This one hurts to read. It seems too random and not coherent in the slightest due to the high temp. The percents are very close to each other here, so it would seem to be even more random.
> "Deny thy father and refuse thy name. Or, if thou wilt not, be but sworn my love, And I'll no longer be a Capulet."

In [176]:
info = text_generator2.generate(
    "Deny thy father and refuse thy name", max_tokens=40, temperature=1.0
)
print_probs(info, vocab)


generated text:
Deny thy father and refuse thy name the as it your i that in my a the with thou this i and we thou our the as thy of it in the my that in the my with thou a


PROMPT: Deny thy father and refuse thy name
the:   	31.01%
thou:   	11.46%
my:   	9.55%
in:   	8.89%
i:   	8.38%
--------


PROMPT: Deny thy father and refuse thy name the
the:   	18.28%
thou:   	13.57%
my:   	11.27%
in:   	10.51%
i:   	9.88%
--------


PROMPT: Deny thy father and refuse thy name the as
the:   	18.68%
thou:   	13.85%
my:   	11.45%
in:   	10.66%
i:   	9.99%
--------


PROMPT: Deny thy father and refuse thy name the as it
the:   	18.71%
thou:   	13.94%
my:   	11.57%
in:   	10.78%
i:   	10.13%
--------


PROMPT: Deny thy father and refuse thy name the as it your
the:   	19.16%
thou:   	14.26%
my:   	11.79%
in:   	10.97%
i:   	10.29%
--------


PROMPT: Deny thy father and refuse thy name the as it your i
the:   	20.17%
thou:   	14.93%
my:   	12.45%
in:   	11.56%
this:   	7.66%
--------


PROMPT: Deny

This one gave me hope after the first few words, but completely fell apart. I feel like a better model using 0.4-0.5 for temperature will be the sweet spot as I have noticed they have been the best for these tests.There is still too much repetition and nearly no variety in words like all of the other ones generated by the second lstm model.
> "If after every tempest come such calms, May the winds blow till they have wakened death, And let the laboring bark climb hills of seas Olympus-high, and duck again as low As hell's from heaven! If it were now to die, 'Twere now to be most happy, for I fear My soul hath her content so absolute That not another comfort like to this Succeeds in unknown fate."

In [177]:
info = text_generator2.generate(
    "If after every tempest comes such calms", max_tokens=50, temperature=0.4
)
print_probs(info, vocab)


generated text:
If after every tempest comes such calms the and thou with the we the as this no i a it your the in in my thou thy with if the we i and as a by it that the my the thou your in i in the with thy my


PROMPT: If after every tempest comes such calms
the:   	76.49%
thou:   	6.42%
my:   	4.04%
in:   	3.36%
i:   	2.88%
--------


PROMPT: If after every tempest comes such calms the
the:   	36.48%
thou:   	17.41%
my:   	10.86%
in:   	9.09%
i:   	7.76%
--------


PROMPT: If after every tempest comes such calms the and
the:   	36.67%
thou:   	17.34%
my:   	10.9%
in:   	9.08%
i:   	7.77%
--------


PROMPT: If after every tempest comes such calms the and thou
the:   	42.75%
my:   	12.69%
in:   	10.62%
i:   	9.11%
as:   	5.51%
--------


PROMPT: If after every tempest comes such calms the and thou with
the:   	42.78%
my:   	12.7%
in:   	10.61%
i:   	9.11%
as:   	5.51%
--------


PROMPT: If after every tempest comes such calms the and thou with the
my:   	19.49%
in:   	16.36%
i:   	1

##Conclusion

> I was honestly disappointed when reading the text that was generated after spending quite some time working on this assignment over the past week. When I originally generated text, I ran into the issue of punctuation being repeated like normal words. In Shakespeare, there are a lot of apostrophes and dashes and colons. The line breaks are also different because most of the texts I used were plays and there would be a character name, a colon, and a sentence. After trying to deal with them, I couldn't really find a good solution, so I removed punctuation altogether. After attempt 1, everything looked good. Then, I got to the text generation and "overthrow" was being repeated like 10-20 times in a row on each prompt. So I went back to the drawing board and focused lots of time on the TextGenerator. Some edits got me closer and some pushed me farther away from a good solution. After hours of trying to fix this, I reverted to the one that gave the best results for the first LSTM. I am thankful that I was able to stay connected to my runtime for the entire time I was testing and debugging. The second LSTM model took around an hour to run and might honestly be worse than the first model that took 20 mins to run. Unfortunately, there was a big disconnect between the 2 models and each was missing something the other had. As much as I would love to spend 15 hours fixing my models and training them again, I unfortunately cannot and had to settle with what I generated here.

##Reference:
* https://github.com/bforoura/GenAI/blob/main/Module5/recipe_lstm.ipynb