
## Poem Generator

An NLP project for the Introduction to NLP class, 2024, by Jurás Dorka (IZZGAY).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install transformers
!pip install numpy
!pip install pandas
!pip install tensorflow



### Imports

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer
from transformers import WEIGHTS_NAME, CONFIG_NAME
import os

### Data

In [4]:
poems_data = pd.read_csv("/content/drive/MyDrive/uni/kaggle_poem_dataset.csv")
poems_data.head(2)

Unnamed: 0.1,Unnamed: 0,Author,Title,Poetry Foundation ID,Content
0,0,Wendy Videlock,!,55489,"Dear Writers, I’m compiling the first in what ..."
1,1,Hailey Leithauser,0,41729,"Philosophic\nin its complex, ovoid emptiness,\..."


In [5]:
poems_data.groupby("Author").agg({"Content": "count"}).sort_values("Content", ascending=False).head(5)

Unnamed: 0_level_0,Content
Author,Unnamed: 1_level_1
William Shakespeare,85
Anonymous,82
"Alfred, Lord Tennyson",78
Rae Armantrout,62
William Wordsworth,59


In [6]:
shakespeare_poems = poems_data[poems_data["Author"] == "William Shakespeare"]
shakespeare_poems = shakespeare_poems.dropna()
shakespeare_poems = shakespeare_poems["Content"].str.lower()

In [7]:
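# Join the poems into one training string, marking the end of each poem.
# Note: "</s>" is not GPT-2's end-of-text token ("<|endoftext|>"), so the
# tokenizer splits it into ordinary tokens and it can show up literally
# in generated text.
string = ''
for x in shakespeare_poems:
    string += x + "</s>"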

In [8]:
data_location = "data"
os.makedirs(data_location, exist_ok=True)

In [9]:
with open('./data/poetry.txt', 'w', encoding='utf-8') as f:
    f.write(string)

### Load Tokenizer and Model

In [31]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

In [32]:
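# Build a GPT-2 model from a fresh configuration. The weights are randomly
# initialized here; only the tokenizer is loaded from the pretrained "gpt2".
configuration = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)
model = TFGPT2LMHeadModel(configuration)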

### Load the Dataset

In [33]:
with open("./data/poetry.txt", "r", encoding='utf-8') as textfile:
    text = textfile.read()

### Tokenize the Data

In [34]:
string_tokenized = tokenizer.encode(text)
print("Done tokenizing")

Token indices sequence length is longer than the specified maximum sequence length for this model (32802 > 1024). Running this sequence through the model will result in indexing errors


Done tokenizing


### Create the dataset

In [35]:
examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 1000
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])
inputs, labels = [], []

In [36]:
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print("Done creating dataset")

Done creating dataset
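
The chunking and shifting done in the two cells above can be checked on a toy sequence. This is a minimal sketch in plain Python: `make_examples`, `shift`, and `toy_tokens` are hypothetical names used only for illustration, and the token count 32802 is taken from the tokenizer warning printed earlier.

```python
def make_examples(token_ids, block_size):
    """Split a flat token list into non-overlapping blocks, dropping the tail."""
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]


def shift(example):
    """Inputs are all tokens but the last; labels are all tokens but the first."""
    return example[:-1], example[1:]


toy_tokens = list(range(10))  # stand-in for string_tokenized
toy_examples = make_examples(toy_tokens, block_size=4)
toy_pairs = [shift(ex) for ex in toy_examples]
print(toy_examples)
print(toy_pairs)

# The same arithmetic for the sizes used in this notebook.
n_tokens, block_size, batch_size = 32802, 100, 12
n_examples = len(range(0, n_tokens - block_size + 1, block_size))
# drop_remainder=True discards the final, partial batch.
n_batches = n_examples // batch_size
print(n_examples, n_batches)
```

Each label is the token that follows the corresponding input position, so every block of 100 tokens yields 99 input/label pairs. Under these assumptions, one epoch covers 27 full batches.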


### Train the Model

In [37]:
optimizer = tf.keras.optimizers.Adam(
    learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
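
To make the loss concrete, here is a small sketch of what sparse categorical cross-entropy with `from_logits=True` computes for a single position: the negative log of the softmax probability assigned to the true token. It uses only the standard library; `sparse_ce_from_logits` and `toy_logits` are hypothetical names, not part of Keras.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


toy_logits = [2.0, 0.5, -1.0]  # made-up scores over a 3-token vocabulary
loss_correct = sparse_ce_from_logits(toy_logits, 0)  # true token has the top score
loss_wrong = sparse_ce_from_logits(toy_logits, 2)    # true token has the lowest score
uniform = sparse_ce_from_logits([0.0, 0.0, 0.0], 1)  # equals log(3)
print(loss_correct, loss_wrong, uniform)
```

With uniform logits the loss equals the log of the vocabulary size, so a model that predicts roughly uniformly over GPT-2's 50257 tokens would start near `math.log(50257)`, about 10.8.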

In [38]:
model.compile(optimizer=optimizer, loss=loss)
model.fit(dataset, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.src.callbacks.History at 0x78c390db4d90>

In [39]:
save_location = "./models"
os.makedirs(save_location, exist_ok=True)
model.save_pretrained(save_location)
tokenizer.save_pretrained(save_location)

('./models/tokenizer_config.json',
 './models/special_tokens_map.json',
 './models/vocab.json',
 './models/merges.txt',
 './models/added_tokens.json')

In [42]:
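text = "To Juliet"
input_ids = tokenizer.encode(text, return_tensors='tf')

# Beam search with 5 beams, returning 5 sequences; only the first is printed.
# no_repeat_ngram_size=2 forbids any token bigram from occurring twice.
# Note: temperature only affects sampling (do_sample=True), so it has no
# effect on this pure beam-search call.
beam_output = model.generate(
    input_ids,
    max_length=80,
    num_beams=5,
    temperature=0.7,
    no_repeat_ngram_size=2,
    num_return_sequences=5
)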

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


To Juliet;
to the more of a day.

‘o! this she,’ quoth she with the boar,
and, and all her love is not with her eyes; for me:
that thou wilt, ‘to thee.</s>when i have i am i do not love as one doth my heart of the world,--

