# Edgar Allan Poet - Generative Poetry with Transformers

_Authors: Alessia SARRITZU, Alberto MARTINELLI_

This project implements a text generation model based on the Transformer decoder architecture (GPT-style) to generate poetry from natural language prompts.

The goal is to train a model capable of producing original poems with appropriate titles in catchy poetic language.

### **Usage**
The user simply writes a prompt such as:

> Write a poem about silence and the sea

And the model responds with something similar to the following:

> _A Still Horizon_ <br>
> The silence pours in waves of light...


## Dataset

The dataset is sourced from the [Poetry Foundation Kaggle dataset](https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems), which includes thousands of English-language poems from a wide range of poets and periods.

In [1]:
import kagglehub
import os
import pandas as pd
import re

path = kagglehub.dataset_download("tgdivy/poetry-foundation-poems")
csv_file = os.path.join(path, "PoetryFoundationData.csv")

df = pd.read_csv(csv_file, index_col=False)
df = df[["Poet", "Title", "Poem"]].dropna()

df.to_csv("edgar_allan_poet_cleaned.csv", index=False)  # skip the row index so no extra column is written

df.head()

Unnamed: 0,Poet,Title,Poem
0,Michelle Menting,\r\r\n Objects Used to Prop...,"\r\r\nDog bone, stapler,\r\r\ncribbage board, ..."
1,Lucia Cherciu,\r\r\n The New Church\r\r\n...,"\r\r\nThe old cupola glinted above the clouds,..."
2,Ted Kooser,\r\r\n Look for Me\r\r\n ...,\r\r\nLook for me under the hood\r\r\nof that ...
3,Grace Cavalieri,\r\r\n Wild Life\r\r\n ...,"\r\r\nBehind the silo, the Mother Rabbit\r\r\n..."
4,Connie Wanek,\r\r\n Umbrella\r\r\n ...,\r\r\nWhen I push your button\r\r\nyou fly off...


## Dataset Formatting & Tokenization

We define a custom `PoetryDataset` class to format the poetry data for training.

Each training sample is built as a **prompt–response pair**:
- **Prompt:** `"Write a poem about <Title>."`
- **Target:** `"<Title>\n<Poem>"`

This structure encourages the model to learn both:
- How to generate a title based on the prompt
- How to continue with a coherent, structured poem
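
As a minimal sketch of how one prompt–response pair is assembled, the hypothetical helper `build_sample` below joins the two parts into a single string. The whitespace cleanup is an assumption added for illustration (the raw fields shown in the table above carry `\r\r\n` padding, and the notebook passes them through unchanged); the inputs are a shortened excerpt of the "Umbrella" row.

```python
def build_sample(title: str, poem: str) -> str:
    """Join prompt and target into one training string (hypothetical helper)."""
    # Collapse the "\r\r\n" padding and extra spaces seen in the raw CSV fields.
    clean_title = " ".join(title.split())
    # Keep the poem's line breaks, but drop empty lines and edge whitespace.
    clean_poem = "\n".join(
        line.strip() for line in poem.splitlines() if line.strip()
    )
    prompt = f"Write a poem about {clean_title}."
    target = f"{clean_title}\n{clean_poem}"
    return prompt + "\n" + target


sample = build_sample(
    "\r\r\n    Umbrella\r\r\n",
    "\r\r\nWhen I push your button\r\r\nyou fly off",
)
print(sample)
```

Cleaning the title before building the prompt keeps the training prompts in the same shape as the prompts a user would type at inference time.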

We tokenize the combined prompt + target using Hugging Face's `GPT2TokenizerFast`.  
Key tokenization settings include:
- `truncation=True`: ensures sequences don't exceed max length
- `padding='max_length'`: aligns input sizes for batching
- `max_length=512`: controls the maximum number of tokens per example

We then wrap the dataset into a `DataLoader` to enable efficient batched training with shuffling.


In [2]:
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
from torch.optim import AdamW
import torch.nn.functional as F
import random
import shutil

# Load and Format the Dataset
class PoetryDataset(Dataset):
    def __init__(self, csv_path, tokenizer, max_length=512):
        self.data = pd.read_csv(csv_path).dropna()
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.samples = []
        for _, row in self.data.iterrows():
            prompt = f"Write a poem about {row['Title']}."
            target = f"{row['Title']}\n{row['Poem']}"
            full_input = prompt + "\n" + target
            tokenized = tokenizer(full_input, truncation=True, max_length=max_length, padding="max_length")
            self.samples.append({
                'input_ids': torch.tensor(tokenized['input_ids']),
                'attention_mask': torch.tensor(tokenized['attention_mask'])
            })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

#  Tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Load Dataset
poetry_dataset = PoetryDataset("edgar_allan_poet_cleaned.csv", tokenizer)
data_loader = DataLoader(poetry_dataset, batch_size=4, shuffle=True)


## Model Loading and Conditional Training

To avoid retraining the model on every run, we implement a **conditional training strategy**:

- If a previously trained model exists (as a `.zip` file), we simply:
  - Unpack the archive
  - Load both the model and tokenizer from disk
  - Skip training entirely

- If no saved model is found:
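  - We load the pretrained GPT-2 weights (`gpt2`)
  - Add a custom `[PAD]` token, since GPT-2 ships without one (the training batches were already padded with the EOS token in the previous cell, so `[PAD]` only comes into play at generation time)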
  - Train it on the poetry dataset for 5 epochs
  - Save both the model and tokenizer to disk
  - Compress the folder into a `.zip` archive for future reuse
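
The save-and-restore round trip can be sketched with the standard library alone. The folder name `demo_model` and the placeholder `config.json` are assumptions standing in for the real model directory and the files written by `save_pretrained()`.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "demo_model")  # hypothetical folder name
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as fh:
        fh.write("{}")  # stand-in for the files save_pretrained() writes

    # Zip the folder's contents: creates demo_model.zip next to the folder.
    zip_path = shutil.make_archive(model_dir, "zip", model_dir)

    # Simulate a fresh session: the folder is gone, only the archive remains.
    shutil.rmtree(model_dir)

    # Unpack into the same folder name, restoring the original layout.
    shutil.unpack_archive(zip_path, model_dir)
    restored = sorted(os.listdir(model_dir))
    print(restored)
```

Because the archive stores paths relative to the folder itself, unpacking into a directory of the same name puts every file back where `from_pretrained()` expects it.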

#### **Model Architecture and Training Procedure**

We use a pretrained GPT-2 model (`GPT2LMHeadModel`) from Hugging Face as the foundation for Edgar Allan Poet. This architecture is based on the Transformer decoder, which applies masked self-attention for causal language modeling.

#### **Training Strategy**

The model is trained to generate poetic text conditioned on natural language prompts. Each training sample is constructed in the following format:

```
Write a poem about <Title>.
<Title>
<Poem body>
```

This format enables the model to learn both how to generate a meaningful title and how to compose a thematically consistent poem.

#### **Training Configuration**

- **Optimizer:** AdamW  
  Chosen for its efficiency and stability in training Transformer-based models by decoupling weight decay from the gradient update.

- **Learning rate:** 5e-5  
  A moderate learning rate suitable for fine-tuning pretrained language models without overwriting learned weights too aggressively.

- **Loss function:** CrossEntropyLoss  
  Used implicitly by `GPT2LMHeadModel` for next-token prediction, appropriate for autoregressive generation tasks.

- **Epochs:** 5  
  Gives the model several passes over the data to learn poetic structure and thematic consistency.

- **Max sequence length:** 512 tokens  
  Long enough to cover most of a poem's structure while staying within GPU memory limits; longer poems are truncated.

- **Batch size:** 4  
  Kept small so that training fits into Colab’s GPU memory.
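
One detail of the loss is worth illustrating. The training loop below passes `labels=input_ids` for the full padded sequence, so padded positions also count toward the loss. A common alternative, not used in this notebook, is to set the labels at padded positions to `-100`, the value that PyTorch's cross-entropy ignores by default. The sketch uses plain lists and a hypothetical helper `mask_padding_labels`; the token ids are arbitrary apart from 50256, GPT-2's EOS id.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Toy sequence: three real tokens followed by two pads (50256 is GPT-2's EOS id).
input_ids = [8645, 257, 21247, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [8645, 257, 21247, -100, -100]

# With tensors, the equivalent would be:
#   labels = input_ids.clone(); labels[attention_mask == 0] = -100
```

With masking, the reported loss would reflect only real poem tokens, which makes it easier to compare across runs with different padding lengths.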

In [3]:
model_path = "edgar_allan_poet_model"
model_zip = f"{model_path}.zip"

# Load or Train
if os.path.exists(model_zip):
    print("Found trained model ZIP, loading...")

    # Unzip the model directory
    shutil.unpack_archive(model_zip, model_path)

    # Load model and tokenizer
    tokenizer = GPT2TokenizerFast.from_pretrained(model_path)
    model = GPT2LMHeadModel.from_pretrained(model_path)
    model = model.to("cuda" if torch.cuda.is_available() else "cpu")

else:
    print("No saved model found — training from scratch...")

    # Load tokenizer & model
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Add [PAD] token
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        model.resize_token_embeddings(len(tokenizer))

    model = model.to("cuda" if torch.cuda.is_available() else "cpu")

    # Train
    optimizer = AdamW(model.parameters(), lr=5e-5)
    epochs = 5  # ⬅️ More training for stronger thematic alignment

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in data_loader:
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)

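            # Passing labels makes the model compute the shifted next-token
            # cross-entropy loss internally; padded positions are included here.
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)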
            loss = outputs.loss
            total_loss += loss.item()

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        print(f"Epoch {epoch+1} | Loss: {total_loss / len(data_loader):.4f}")

    # Save and zip
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)
    shutil.make_archive(model_path, 'zip', model_path)
    print("Model trained, saved, and zipped.")


No saved model found — training from scratch...




Epoch 1 | Loss: 1.9630
Epoch 2 | Loss: 1.8413
Epoch 3 | Loss: 1.7817
Epoch 4 | Loss: 1.7318
Epoch 5 | Loss: 1.6862
Model trained, saved, and zipped.
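
To make the loss values above easier to read, they can be converted to perplexity, which is the exponential of the mean cross-entropy. This is only a rough sketch: the losses here are averaged over padded positions as well, so the resulting numbers are not directly comparable to perplexities reported on unpadded text.

```python
import math

# Per-epoch mean losses copied from the training log above.
epoch_losses = [1.9630, 1.8413, 1.7817, 1.7318, 1.6862]

# Perplexity is exp(mean cross-entropy), measured in nats.
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities), start=1):
    print(f"Epoch {epoch} | loss {loss:.4f} | perplexity {ppl:.2f}")
```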


## Inference: Prompt-Based Poem Generation

After training, we use the model in **inference mode** to generate new poems from user-provided prompts. The function `generate_poem(prompt)` takes a natural language description (e.g., _"the loneliness of space"_) and returns a generated poetic text, split into a title and a body.

#### **Decoding Configuration**
We use Hugging Face’s `generate()` method with the following parameters:
- **`temperature=0.8`**: Values below 1 sharpen the next-token distribution, so the output stays varied but is more focused than at the default of 1.0.
- **`top_k=30`**: Restricts sampling to the 30 most likely tokens at each step, cutting off the low-probability tail while leaving room for variety.
- **`do_sample=True`**: Enables sampling instead of greedy decoding, which encourages less repetitive, more expressive output.
- **`max_length=150`**: Caps the total sequence (prompt plus generated tokens) at 150 tokens, which keeps poems short but can cut them off mid-line.
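
The effect of temperature and top-k on a single decoding step can be sketched in plain Python. This is an illustration of the idea, not the library's internal implementation; the helper names and the toy logits are made up for the example.

```python
import math
import random


def top_k_probs(logits, temperature=0.8, top_k=30):
    """Return the sampling distribution after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # Indices of the k largest scaled logits; everything else gets probability 0.
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(scaled))]


def sample_token(logits, temperature=0.8, top_k=30, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    candidates = [i for i, p in enumerate(probs) if p > 0]
    return rng.choices(candidates, weights=[probs[i] for i in candidates], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
probs = top_k_probs(toy_logits, temperature=0.8, top_k=3)
token = sample_token(toy_logits, temperature=0.8, top_k=3)
print(probs, token)
```

Lowering the temperature shifts more probability onto the top-ranked token, while a smaller `top_k` removes unlikely tokens from consideration entirely.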

The prompt is structured as:
```
Write a poem about <your prompt>.
```

After generation, the output is post-processed to:
- Remove excessive whitespace, punctuation artifacts, or formatting issues
- Separate the first line as the poem title
- Return a clean and readable poetic structure

In [15]:
def split_title_and_poem(text, max_lines=12):
    lines = text.strip().splitlines()
    lines = [line.strip() for line in lines if line.strip()]

    if not lines:
        return "Untitled", ""

    title = lines[0]

    filtered = []
    for line in lines[1:]:
        if line.lower() == title.lower():
            continue
        if line.strip() in {".", ":", "…"} or line.isspace():
            continue
        filtered.append(line)

    poem_body = "\n".join(filtered[:max_lines])
    return title, poem_body


def generate_poem(prompt, max_length=150, temperature=0.8, top_k=30):
    input_text = f"Write a poem about {prompt}.\n"

    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to(model.device)

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        num_return_sequences=1
    )

    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    raw_poem = generated_text[len(input_text):].strip()

    title, poem = split_title_and_poem(raw_poem)
    return title.strip(), poem.strip()


## Analysis of Results


In [16]:
test_prompts = [
    "the loneliness of space",
    "a storm over the mountains",
    "the joy of childhood",
    "a memory of lost love",
    "a machine that writes poems",
    "a forgotten city under the sea",
    "a candle burning in silence",
    "the end of the world and the beginning of something new",
    "a painter dreaming in colors",
    "an old library full of secrets"
]

print("Generating poems...\n" + "-"*60 + "\n")
for prompt in test_prompts:
    title, poem = generate_poem(prompt)
    line_count = poem.count('\n') + 1
    word_count = len(poem.split())

    # Print result
    print(f"Prompt: {prompt}")
    print(f"Title: {title}")
    print(f"Poem:\n{poem}")
    print(f"Length: {line_count} lines, {word_count} words")
    print("-" * 60)

Generating poems...
------------------------------------------------------------

Prompt: the loneliness of space
Title: The Last Man
Poem:
He was the last man, the last one who was not here,
The last man to die, the man to live on the last day,
The man to be driven from heaven, and the one to die
With
Length: 4 lines, 38 words
------------------------------------------------------------
Prompt: a storm over the mountains
Title: Winter’s anachalesis
Poem:
Snow falls on the grasses of my yard
like a comet. It shines
and scatters as the sun.
The sun was just a cloud.
All those years
Length: 5 lines, 27 words
------------------------------------------------------------
Prompt: the joy of childhood
Title: Odes
Poem:
Odes are a child's delight,—the boy-hood, the spring of youth. These odes are the delight of youth, the joys of youth. All odes are the joys of youth, all odes are the joys of youth. O
Length: 1 lines, 36 words
------------------------------------------------------------
Prompt

## Final Result Analysis

In this test round, **Edgar Allan Poet** generated poems for 10 creative prompts, including abstract concepts, emotional memories, and surreal imagery. Most outputs are coherent and poetic in tone, though several are cut short and one has no poem body.

### Strengths

**Thematic Relevance**

Several poems reflect their prompt, either metaphorically or literally:
- _"The Last Man"_ for “the loneliness of space” evokes a post-apocalyptic atmosphere.
- _"Psalm for the Dead"_ reflects solemnity and repetition, matching the tone of “a candle burning in silence”.
- _"The Day of the Dove"_ for “a memory of lost love” includes soft imagery and narrative recall.

**Title Quality**

Most titles are well-formed and poetic, though not always closely tied to the prompt. Examples include:
- _"Winter’s anachalesis"_
- _"A Poem about a Poet’s Voice"_
- _"Luna, the Moon"_

One prompt produced `"."` as its title, which points to occasional instability; all other titles were meaningful.


### Minor Issues

| Issue | Description |
|-------|-------------|
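| **Empty output** | “A forgotten city under the sea” generated a title but no poem body, possibly because generation stopped right after the title or because the remaining lines were filtered out in post-processing. The reported line count of 1 comes from `poem.count('\n') + 1`, which returns 1 for an empty string. |
| **Looping in abstract prompts** | _“Odes are the joys of youth…”_ for “joy of childhood” repeats the same phrase several times. This looks more like the repetition loops that sampled text from small language models often falls into than a deliberate stylistic choice. |
| **Abrupt endings** | Several poems end mid-thought or mid-sentence (the first one stops on “With”), likely because the 150-token `max_length` budget, which also covers the prompt and any whitespace tokens, runs out. |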

<br>

**Summary Table**

| Prompt                                 | Title                            | Lines | Words |
|----------------------------------------|----------------------------------|-------|-------|
| the loneliness of space                | The Last Man                     | 4     | 38    |
| a storm over the mountains             | Winter’s anachalesis             | 5     | 27    |
| the joy of childhood                   | Odes                             | 1     | 36    |
| a memory of lost love                  | The Day of the Dove              | 4     | 33    |
| a machine that writes poems            | The Little House                 | 2     | 33    |
| a forgotten city under the sea         | A Man with a Gun                 | 1     | 0     |
| a candle burning in silence            | Psalm for the Dead               | 3     | 33    |
| the end of the world and...            | .                                | 3     | 24    |
| a painter dreaming in colors           | A Poem about a Poet’s Voice      | 7     | 26    |
| an old library full of secrets         | “Luna, the Moon”                 | 2     | 11    |


<br>

#### **Interpretation**

This test suggests that **Edgar Allan Poet** largely meets the project’s objective: generating structured, meaningful poetry from abstract and stylistic prompts. At its best, the model displays:

- A consistent poetic tone and rhythm  
- Thematic fluency and imagination  
- Appropriate format and title separation  

We believe that with additional post-processing and further fine-tuning, it could be extended into a user-facing poetry generation tool.

