<a href="https://colab.research.google.com/github/brubee/nlp-project/blob/master/poem_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Poem Generator

An NLP project for the 2024 Introduction to NLP class, made by yours truly (Jurás Dorka, IZZGAY).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install transformers
!pip install numpy
!pip install pandas
!pip install tensorflow

### Imports

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer
from transformers import WEIGHTS_NAME, CONFIG_NAME
import os

### Data

My plan was to make a poem generator, so I first had to find a suitable dataset of poems.

The dataset I ended up using was [this one](https://www.kaggle.com/datasets/johnhallman/complete-poetryfoundationorg-dataset) from Kaggle.

In [3]:
poems_data = pd.read_csv("/content/drive/MyDrive/uni/kaggle_poem_dataset.csv")
poems_data.head(2)

Unnamed: 0.1,Unnamed: 0,Author,Title,Poetry Foundation ID,Content
0,0,Wendy Videlock,!,55489,"Dear Writers, I’m compiling the first in what ..."
1,1,Hailey Leithauser,0,41729,"Philosophic\nin its complex, ovoid emptiness,\..."


Then I grouped the poems by author to see who had the most poems in the dataset. Since that was William Shakespeare, whose work I was somewhat familiar with, I decided to use his poems for this project.

In [4]:
poems_data.groupby("Author").agg({"Content": "count"}).sort_values("Content", ascending=False).head(5)

Author,Content
William Shakespeare,85
Anonymous,82
"Alfred, Lord Tennyson",78
Rae Armantrout,62
William Wordsworth,59


Here I keep only Shakespeare's poems, drop rows with missing values, and reduce the training data to the lowercased poem text, without the other columns.

In [5]:
shakespeare_poems = poems_data[poems_data["Author"] == "William Shakespeare"]
shakespeare_poems = shakespeare_poems.dropna()
shakespeare_poems = shakespeare_poems["Content"].str.lower()

Then I save the filtered data into a txt file.

In [6]:
# Concatenate all poems into one string, marking the end of each with "</s>"
string = ''
for x in shakespeare_poems:
    string += x + "</s>"

In [7]:
data_location = "data"
# Create the output directory if it does not exist yet
os.makedirs(data_location, exist_ok=True)

In [8]:
with open('./data/poetry.txt', 'w', encoding='utf-8') as f:
    f.write(string)
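
The cells above join the poems with a literal `</s>` marker and write them to disk. A minimal sketch of the same round trip, using two made-up poem snippets and a temporary directory (both are stand-ins, not the real data), shows that the marker lets the poems be split apart again. One caveat worth keeping in mind: the pretrained GPT-2 tokenizer uses `<|endoftext|>` as its end-of-text token, so `</s>` is most likely tokenized as ordinary text rather than as a single special token.

```python
import os
import tempfile

SEPARATOR = "</s>"  # same marker as in the notebook

# Hypothetical stand-ins for the Shakespeare poems
toy_poems = ["shall i compare thee", "when in disgrace with fortune"]

# Append the separator after every poem, like the loop above
joined = "".join(poem + SEPARATOR for poem in toy_poems)

# Write and re-read the text, using a temporary directory instead of ./data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "poetry.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(joined)
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

# Splitting on the separator recovers the poems; the trailing
# separator leaves one empty string, which is filtered out
recovered = [p for p in loaded.split(SEPARATOR) if p]
print(recovered)
```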

### Load Tokenizer and Model

In [21]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [22]:
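# Build GPT-2 from the configuration alone: the weights are randomly
# initialized, so the model is trained from scratch rather than fine-tuned
configuration = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
model = TFGPT2LMHeadModel(configuration)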

### Load the Dataset

In [23]:
# Read the whole corpus back; the context manager closes the file automatically
with open("./data/poetry.txt", "r", encoding='utf-8') as textfile:
    text = textfile.read()

### Tokenize the Data

In [24]:
string_tokenized = tokenizer.encode(text)
print("Done tokenizing")

Token indices sequence length is longer than the specified maximum sequence length for this model (32802 > 1024). Running this sequence through the model will result in indexing errors


Done tokenizing


### Create the dataset

In [25]:
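examples = []
block_size = 100    # tokens per training example
BATCH_SIZE = 12
BUFFER_SIZE = 1000  # shuffle buffer size
# Split the token ids into non-overlapping blocks; leftover tokens at the end are dropped
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])
inputs, labels = [], []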

In [26]:
# Shift by one token: the label at each position is the next token of the input
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print("Done creating dataset")

Done creating dataset
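
To make the chunking and the input/label shift concrete, here is a small sketch on a toy token list. The helper name `make_examples` and the toy ids are made up for illustration; the slicing logic mirrors the cells above.

```python
def make_examples(token_ids, block_size):
    """Cut token_ids into non-overlapping blocks; leftover tokens are dropped."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_tokens = list(range(10))  # hypothetical token ids 0..9
toy_examples = make_examples(toy_tokens, block_size=4)

# Inputs drop the last token of each block, labels drop the first,
# so labels[k] is always the token that follows inputs[k]
toy_inputs = [ex[:-1] for ex in toy_examples]
toy_labels = [ex[1:] for ex in toy_examples]

print(toy_examples)  # two full blocks; ids 8 and 9 do not fill a block
print(toy_inputs)
print(toy_labels)
```

With `block_size = 100` as in the notebook, each training pair therefore has 99 tokens, which is also why the tokenizer's warning about the 1024-token limit does not matter here: the model never sees the full sequence at once.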


### Train the Model

In [27]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [28]:
model.compile(optimizer=optimizer, loss=loss)
model.fit(dataset, epochs=50)

Epoch 1/50
...
Epoch 50/50


<keras.src.callbacks.History at 0x7911ee83a410>

I tried training for different numbers of epochs to see the difference between them. I started out with only 10, but that soon turned out to be too few. At 20 it looked better, but the test poem at the end was still filled with nonsense. At 30 epochs it still doesn't make sense, but it does look like a poem!

I am thinking of further trying out 40 and 50 epochs, to see if I can bring the loss even lower.
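
For a sense of how much training one epoch represents, the number of optimizer steps can be worked out from values that appear earlier in the notebook: the 32802 tokens reported in the tokenizer warning, `block_size = 100`, `BATCH_SIZE = 12` with `drop_remainder=True`, and `epochs=50`. This is only back-of-the-envelope arithmetic, assuming the token count in the warning is the full corpus.

```python
n_tokens = 32802   # from the tokenizer warning above
block_size = 100
batch_size = 12
epochs = 50

# Same range as the chunking loop: only full blocks are kept
n_examples = len(range(0, n_tokens - block_size + 1, block_size))

# drop_remainder=True discards the last incomplete batch each epoch
steps_per_epoch = n_examples // batch_size
dropped_per_epoch = n_examples % batch_size
total_steps = steps_per_epoch * epochs

print(n_examples, steps_per_epoch, dropped_per_epoch, total_steps)
```

So the model sees only a few hundred short examples per epoch, which may help explain why many epochs are needed before the output starts to resemble a poem.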

After creating and training the model, I save it.

In [29]:
save_location = "./models"
# Create the output directory if it does not exist yet
os.makedirs(save_location, exist_ok=True)
model.save_pretrained(save_location)
tokenizer.save_pretrained(save_location)

('./models/tokenizer_config.json',
 './models/special_tokens_map.json',
 './models/vocab.json',
 './models/merges.txt',
 './models/added_tokens.json')

And then all that's left is testing it out!

In [35]:
text = "I wrote a poem for him"
input_ids = tokenizer.encode(text, return_tensors='tf')

beam_output = model.generate(
    input_ids,
    max_length=80,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
    num_return_sequences=5
)

# generate() returns 5 sequences here (num_return_sequences=5); decode only the first
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I wrote a poem for him.

so, as the one, and doth will more;
the earth, that in her eye,
that all the boar they have still from the world’d
and all-d to make thee of her heart were:
in on thee, yet not for their sight, where they are all to her head
her lips;


The poem doesn't make sense, but it is still better than it was at 10 epochs (for obvious reasons). I also increased `max_length` because at 50 the output kept getting cut off in the middle of the poem. It still does that at 80, but I think it's somewhat better.
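
A side note on the `temperature=0.7` argument used in the generation call: temperature divides the logits before the softmax, so values below 1 sharpen the distribution towards the most likely tokens. The sketch below shows this on made-up logits in plain Python; the function name and the numbers are illustrative only. As far as I know, `generate` applies temperature only when sampling is enabled (`do_sample=True`), so with plain beam search as used above it probably has no effect.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens

probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_sharp = softmax_with_temperature(toy_logits, 0.7)

# The top token gets more probability mass at temperature 0.7
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
```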