# Exploratory Analysis of the Haiku DPO Dataset

This notebook performs exploratory analysis on the Hugging Face dataset [davanstrien/haiku_dpo](https://huggingface.co/datasets/davanstrien/haiku_dpo). We will analyze the 'question' and 'chosen' columns to answer specific questions about the data.

## 1. Import Required Libraries
Import pandas for data manipulation, the Hugging Face `datasets` library for loading the data, and `re` for regular expressions.

In [2]:
import pandas as pd
from datasets import load_dataset
import re


## 2. Load the Dataset
Use the datasets library to load the Hugging Face dataset `davanstrien/haiku_dpo`.

In [3]:
# Load the dataset from Hugging Face
dataset = load_dataset('davanstrien/haiku_dpo', split='train')
df = pd.DataFrame(dataset)
df.head()

Unnamed: 0,question,generation_model,generation_prompt,generations,scores,chosen,chosen_score,rejected,rejected_score,tie,difference_in_score,system
0,Can you compose a haiku about the serenity of ...,"[TheBloke/OpenHermes-2.5-Mistral-7B-AWQ, TheBl...",[<|im_start|>system\nYou are a poet specialisi...,"[Majestic peaks so tall,\nSilent whispers of t...","[3, 4, 4, 2, 3, 3, 2, 2, 3, 2, 3, 4, 3, 4, 3, ...","Majestic peaks rise,\nSilent, still, in purest...",4,"Majestic peaks soar,\nSilence speaks, soul at ...",2,False,2,You are a poet specialising in creating Haiku....
1,Write a haiku that captures the grandeur of mo...,"[TheBloke/OpenHermes-2.5-Mistral-7B-AWQ, TheBl...",[<|im_start|>system\nYou are a poet specialisi...,"[Snow-capped peaks so high,\nSilent, majestic,...","[4, 3, 3, 4, 2, 3, 4, 3, 2, 2, 2, 3, 2, 4, 3, ...","Snow-capped peaks so high,\nSilent, majestic, ...",4,"Majestic mountains rise,\nPeaks piercing the a...",2,False,2,You are a poet specialising in creating Haiku....
2,Please create a haiku that describes the solit...,"[TheBloke/OpenHermes-2.5-Mistral-7B-AWQ, TheBl...",[<|im_start|>system\nYou are a poet specialisi...,"[Majestic peaks alone,\nSilence echos in vast ...","[1, 3, 2, 3, 4, 1, 1, 2, 3, 3, 3, 3, 2, 3, 2, ...","Majestic peaks rise,\nSilence echoes, souls en...",4,"Majestic peaks alone,\nSilent whispers of ston...",1,False,3,You are a poet specialising in creating Haiku....
3,Can you write a haiku about the mystery of mou...,"[TheBloke/OpenHermes-2.5-Mistral-7B-AWQ, TheBl...",[<|im_start|>system\nYou are a poet specialisi...,"[Majestic peaks rise,\nSilent secrets they hol...","[3, 3, 2, 3, 2, 2, 2, 0, 0, 2, 1, 4, 1, 3, 3, ...","Majestic peaks rise,\nSecrets hidden in the sn...",4,"Majestic peaks, so high,\nSecrets held in snow...",0,False,4,You are a poet specialising in creating Haiku....
4,Please create a haiku that portrays the tranqu...,"[TheBloke/OpenHermes-2.5-Mistral-7B-AWQ, TheBl...",[<|im_start|>system\nYou are a poet specialisi...,"[Majestic peaks, so high,\nSilent and still, t...","[3, 2, 3, 3, 4, 3, 3, 4, 2, 4, 4, 3, 3, 3, 2, ...","Snow-capped peaks rise high,\nSilence whispers...",4,"Snow-capped peaks rise high,\nQuietly standing...",2,False,2,You are a poet specialising in creating Haiku....


## 3. Analyze "question" Column
Perform analysis on the 'question' column to count rows containing 'haiku', 'a haiku', and neither. Provide examples of rows containing neither.

### 4. Count Rows Containing "haiku"
Count rows where the 'question' column contains the substring 'haiku', ignoring case. Because this is a substring match, it also counts longer forms such as 'haikus'.

In [4]:
# Count rows where 'question' contains 'haiku' (case-insensitive)
count_haiku = df['question'].str.contains('haiku', case=False, na=False).sum()
print(f"Rows with 'haiku' in question: {count_haiku}")

Rows with 'haiku' in question: 3873


### 5. Count Rows Containing "a haiku"
Filter the dataset to count rows where the 'question' column contains the phrase 'a haiku'.

In [5]:
# Count rows where 'question' contains 'a haiku' (case-insensitive)
count_a_haiku = df['question'].str.contains('a haiku', case=False, na=False).sum()
print(f"Rows with 'a haiku' in question: {count_a_haiku}")

Rows with 'a haiku' in question: 3840


### 6. Count Rows Containing Neither "haiku" Nor "a haiku"
Count rows where the 'question' column contains neither 'haiku' nor 'a haiku'. Every string that contains 'a haiku' also contains 'haiku', so negating the 'haiku' match alone is sufficient.

In [6]:
# Count rows where 'question' contains neither 'haiku' nor 'a haiku'
mask_neither = ~df['question'].str.contains('haiku', case=False, na=False)
count_neither = mask_neither.sum()
print(f"Rows with neither 'haiku' nor 'a haiku' in question: {count_neither}")

Rows with neither 'haiku' nor 'a haiku' in question: 250
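
The three counts above overlap: every 'a haiku' row is also a 'haiku' row. From the reported numbers, 3873 - 3840 = 33 questions mention 'haiku' without the article, and 3873 + 250 = 4123 rows are covered in total. A minimal sketch of the same split into disjoint buckets, using made-up questions rather than the dataset itself (`categorize` and `toy_questions` are hypothetical names introduced for illustration):

```python
from collections import Counter

def categorize(question: str) -> str:
    """Assign a question to exactly one of three disjoint buckets."""
    q = question.lower()
    if "a haiku" in q:
        return "a haiku"
    if "haiku" in q:
        return "haiku without 'a'"
    return "neither"

# Toy questions, made up for illustration (not rows from the dataset)
toy_questions = [
    "Can you write a haiku about rain?",
    "Compose haiku on the sea.",
    "Describe a falling leaf.",
    "Please",
]

bucket_counts = Counter(categorize(q) for q in toy_questions)
print(bucket_counts)

# The buckets are disjoint, so their sizes add up to the number of questions
assert sum(bucket_counts.values()) == len(toy_questions)
```

Checking that the bucket sizes sum to the row count is a cheap guard against a mask that silently drops or double-counts rows.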


### 7. Example Rows Containing Neither
Provide example rows from the dataset where the 'question' column contains neither 'haiku' nor 'a haiku'.

In [8]:
# Show example rows where 'question' contains neither 'haiku' nor 'a haiku'
examples_neither = df[mask_neither].head(5)
# print() renders the table as text instead of echoing the escaped string repr
print(examples_neither[['question', 'chosen']].to_string(index=False))

                              question                                                                                  chosen
                       I'd like you to           Cherry blossoms dance,\nIn a gentle springtime breeze,\nNature's soft ballet.
                                     7           Crisp autumn mornings,\nLeaves fall gently to the ground,\nNature's symphony.
                                    I'           Soft wind rustling leaves,\nWhispering secrets untold,\nAutumn's gentle sigh.
Describe the beauty of a falling leaf. Leaf falls gently, twirls\nIn autumn's crisp, cool embrace\nNature's dance, brief spell
                                Please     Cherry blossoms bloom,\nSoft petals dance in the breeze,\nSpring's fleeting moment.

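Several questions in the output above ("7", "I'", "Please") look like truncated fragments rather than complete prompts. A rough way to flag such rows is sketched here, assuming that a complete question has at least three words and ends in terminal punctuation. Both criteria are assumptions made for illustration, not documented properties of the dataset, and `looks_truncated` and `min_words` are hypothetical names.

```python
def looks_truncated(question: str, min_words: int = 3) -> bool:
    """Heuristic: too few words, or no terminal punctuation (assumed criteria)."""
    q = question.strip()
    too_short = len(q.split()) < min_words
    no_terminal_punctuation = not q.endswith((".", "?", "!"))
    return too_short or no_terminal_punctuation

# The five 'question' values shown in the example output above
sample_questions = [
    "I'd like you to",
    "7",
    "I'",
    "Describe the beauty of a falling leaf.",
    "Please",
]

flags = [looks_truncated(q) for q in sample_questions]
for q, flag in zip(sample_questions, flags):
    print(f"{q!r}: truncated={flag}")
```

On a DataFrame, the same function could be applied with `df['question'].apply(looks_truncated)` to build a mask.
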
## 8. Analyze "chosen" Column
Analyze the 'chosen' column to flag rows whose chosen text does not pass an approximate 5-7-5 haiku check.

### 9. Count Rows Where "chosen" Haikus Are Not Haikus
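Count rows whose 'chosen' text fails an approximate 5-7-5 check: exactly three non-empty lines, each within one syllable of its target, with syllables estimated as runs of consecutive vowels. The estimate miscounts silent vowels (for example, it scores "Nature's masterpiece." as 7 rather than 5), so the count includes false positives and should not be read as the number of genuinely invalid haikus.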

In [9]:
# Approximate 5-7-5 check. Syllables are estimated by counting runs of vowels,
# which miscounts silent vowels (e.g. the final "e" in "masterpiece").
def is_haiku(text):
    # Keep only the non-empty lines
    lines = [line.strip() for line in text.strip().split('\n') if line.strip()]
    if len(lines) != 3:
        return False
    # Estimate syllables as the number of consecutive-vowel runs ('y' counts as a vowel)
    def count_syllables(line):
        return len(re.findall(r'[aeiouy]+', line.lower()))
    syllable_pattern = [5, 7, 5]
    # Allow each line to be off by one syllable to absorb some estimation error
    return all(abs(count_syllables(line) - target) <= 1 for line, target in zip(lines, syllable_pattern))

# Apply to 'chosen' column
not_haiku_mask = ~df['chosen'].apply(is_haiku)
count_not_haiku = not_haiku_mask.sum()
print(f"Rows where 'chosen' is not a haiku: {count_not_haiku}")

Rows where 'chosen' is not a haiku: 1817
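
To see why a particular text fails the check, it can help to print the per-line estimates instead of a single boolean. The sketch below reuses the same vowel-run estimate on one of the chosen haikus from the dataset; `diagnose` is a hypothetical helper introduced here, not part of the notebook.

```python
import re

def diagnose(text: str) -> dict:
    """Return the line count and per-line vowel-run estimates for a text."""
    lines = [line.strip() for line in text.strip().split("\n") if line.strip()]
    estimates = [len(re.findall(r"[aeiouy]+", line.lower())) for line in lines]
    return {"n_lines": len(lines), "syllables": estimates}

sample = (
    "Snow-capped peaks so high,\n"
    "Silent, majestic, standing,\n"
    "Nature's masterpiece."
)

report = diagnose(sample)
print(report)  # {'n_lines': 3, 'syllables': [6, 7, 7]}
```

The third line is estimated at 7 syllables against a target of 5, which exceeds the one-syllable tolerance, so the text is rejected even though it reads as 5-7-5 when spoken.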


### 10. Example Rows Where "chosen" Are Not Haikus
Show example rows where the 'chosen' column is flagged by the approximate check. Flagged rows are not necessarily invalid haikus, because the syllable estimate is imprecise.

In [10]:
# Show example rows where 'chosen' is flagged as not a haiku
examples_not_haiku = df[not_haiku_mask].head(5)
# print() renders the table as text instead of echoing the escaped string repr
print(examples_not_haiku[['question', 'chosen']].to_string(index=False))

                                                              question                                                                            chosen
           Write a haiku that captures the grandeur of mountain peaks.    Snow-capped peaks so high,\nSilent, majestic, standing,\nNature's masterpiece.
  Please create a haiku that describes the solitude of mountain peaks.          Majestic peaks rise,\nSilence echoes, souls entwine,\nNature's solitude.
Please create a haiku that portrays the tranquility of mountain peaks.  Snow-capped peaks rise high,\nSilence whispers in the air,\nTranquility thrives.
             Can you write a haiku about moss growing on a tree trunk? Moss adheres to bark,\nGrowing slowly with time's grace,\nTree's tale in silence.
            Please compose a haiku that captures the serenity of moss.         Gentle moss below,\nSilent whispers in the shade,\nNature's soft embrace.