# Building our preference dataset

What we want is a dataset we can use for Direct Preference Optimization (DPO). To build one we need:

- a prompt/input from a user: in our case these are the prompts requesting haiku that we created in the first notebook
- a set of possible outputs: in our case these are the haiku we generated in the previous notebook
- a ranking of preferences: for DPO we want one "chosen/accepted" answer and one "rejected" answer
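As a rough sketch, a single preference record built from these three ingredients might look like the following. The field names `prompt`, `chosen` and `rejected` are an assumption here (they are a common convention for DPO training data), and the pairing is hand-picked from the generations shown later in this notebook rather than being an actual row of the final dataset.

```python
# Illustrative only: the field names are an assumed convention, and this pairing
# is hand-picked for the example, not taken from the final dataset.
preference_example = {
    "prompt": "Can you compose a haiku about the serenity of mountain peaks?",
    "chosen": "Mountaintop so high,\nSilence wraps around in peace,\nNature's breath in sighs.",
    "rejected": "Majestic peaks so tall,\nSilent whispers of time's grace,\nNature's cathedral.",
}

for role in ("chosen", "rejected"):
    print(f"--- {role} ---")
    print(preference_example[role])
```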

## What do we prefer?

So how do we decide which of our 20 generated haiku is "best"? Most preference datasets are aimed at the much broader task of "alignment". As a result, responses are usually compared using criteria such as how helpful, honest, and harmless they are. These criteria are not as applicable to our task: the category of 'honesty', for example, doesn't make a huge amount of sense when evaluating a haiku. 

### How do we rank?

Because most preference datasets have a broader scope, and therefore more complicated ranking criteria, the usual options for creating preference rankings are to rely either on human evaluation or on a language model to judge the quality of generations. Both approaches end up being rather expensive or difficult to organize. In our case, however, haiku have a nice feature that allows us to score them in another way. 

### Syllable Count

> Traditional Japanese haiku consist of three phrases composed of 17 phonetic units (called on in Japanese, which are similar to syllables) in a 5, 7, 5 pattern
> [Wikipedia](https://en.wikipedia.org/wiki/Haiku)

This feature of the haiku offers us a different way of evaluating a haiku: based on its syllable count. We can use this to rank the haiku we generated via a relatively simple Python function, which makes scoring/ranking our haiku almost free. This *might* also allow us to steer an LLM to write more 'technically correct' haiku once we have generated this data. 


We'll start by loading the `raw-haikus` config of our dataset which was previously used to generate the haiku we'll be ranking. 

In [5]:
from toolz import frequencies, valmap
from datasets import load_dataset

In [6]:
dataset = load_dataset("davanstrien/haiku_dpo","raw-haikus",split="train")

Downloading readme:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/6.33M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/4303 [00:00<?, ? examples/s]

We can see that this is a Hugging Face `datasets.Dataset`

In [3]:
dataset

Dataset({
    features: ['input', 'generation_model', 'generation_prompt', 'raw_generation_responses', 'generations'],
    num_rows: 4303
})

Let's take a look at an example row

In [4]:
dataset[0]

{'input': 'Can you compose a haiku about the serenity of mountain peaks?',
 'generation_model': ['TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ'],
 'generation_prompt': ['

We can see that each row currently contains the prompt, along with all the generated haiku we created previously i.e. 20 haiku per row for a given prompt. 

In [5]:
dataset[0]['input']

'Can you compose a haiku about the serenity of mountain peaks?'

In [6]:
dataset[0]['generations']

["Majestic peaks so tall,\nSilent whispers of time's grace,\nNature's cathedral.",
 "Mountaintop so high,\nSilence wraps around in peace,\nNature's breath in sighs.",
 "Majestic peaks soar,\nSilent wisdom in their heights,\nNature's calming song.",
 "Majestic mountain,  \nSerenity in every peak,  \nNature's quiet song.",
 "Majestic peaks soar,\nSilent whispers of stone and snow,\nNature's cathedral.",
 "Majestic peak so high,\nIn silence, nature's beauty,\nEternal peace reigns.",
 "Majestic peaks soar,\nSilence speaks, soul at ease,\nNature's grand cathedral.",
 "Majestic peaks soar,\nSilent whispers of stone and sky,\nNature's grand cathedral.",
 "Majestic peaks so high,\nSilent sentinels of time,\nNature's cathedral.",
 "Majestic peaks ascend,\nSilent, stoic, standing tall,\nNature's grand cathedral.",
 "Majestic peaks so high,\nSilent whispers of the sky,\nNature's crown of pride.",
 "Majestic peaks rise,\nSilent strength in snow and stone,\nNature's cathedral.",
 "Majestic peaks ri

We'll remove a few columns to keep things a little bit cleaner. 

In [7]:
columns_to_keep = {'input', 'generation_model', 'generation_prompt', 'generations'}

In [8]:
dataset = dataset.remove_columns([c for c in dataset.column_names if c not in columns_to_keep])

In [9]:
dataset

Dataset({
    features: ['input', 'generation_model', 'generation_prompt', 'generations'],
    num_rows: 4303
})

## Defining our rules 

We want to define some rules for deciding the quality of our haiku. In this notebook, we're really focused on 'technical' features of the haiku, rather than the content or aesthetics. We can return to those later. 
One of the first things we can get a sense of is the number of lines in our generations. We'd like all of them to be 3 lines long. We can define a simple Python function to check this.

In [10]:
def check_is_3_lines(haiku):
    lines = haiku.split("\n")
    return len(lines) == 3

Let's get a simple count of how often a haiku is 3 lines long.

In [11]:
is_three = (check_is_3_lines(haiku) for row in dataset['generations'] for haiku in row)
frequencies(is_three)

{True: 85123, False: 937}

We can see that haiku that don't match the three-line criterion are relatively rare (937 of 86,060, roughly 1%), so we can probably rank these harshly, i.e. as zero. 
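One caveat: `check_is_3_lines` counts every newline, so a haiku with a trailing newline or a blank line between verses would fail the check. A more forgiving variant, sketched below under the assumption that such whitespace should not count against a haiku, ignores empty lines. `check_is_3_nonblank_lines` is a hypothetical helper and is not used in the rest of this notebook.

```python
def check_is_3_nonblank_lines(haiku: str) -> bool:
    """Return True if the haiku has exactly three non-empty lines."""
    # splitlines() handles both \n and \r\n; strip() filters whitespace-only lines
    lines = [line for line in haiku.splitlines() if line.strip()]
    return len(lines) == 3


print(check_is_3_nonblank_lines("one\ntwo\nthree"))    # passes the strict check too
print(check_is_3_nonblank_lines("one\ntwo\nthree\n"))  # trailing newline is tolerated
print(check_is_3_nonblank_lines("one\ntwo"))           # two lines are still rejected
```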

### Syllables

Haikus are supposed to have a 5-7-5 syllable structure. Whilst not perfect, we should be able to computationally check whether a haiku has this structure. This will allow us to score haikus based on their syllable structure. We can then use this score to rate a haiku as 'technically' correct or not, and rank the generations so that technically correct haiku are preferred over lower-scoring ones. 

`cmudict` is a dictionary of words and their phonetic pronunciations. We can use this to count the number of syllables in a haiku.

In [12]:
import cmudict

DICT = cmudict.dict()

In [13]:
DICT['hello']

[['HH', 'AH0', 'L', 'OW1'], ['HH', 'EH0', 'L', 'OW1']]

We'll import a few helpers for handling punctuation and load the CMUDict data via `nltk`. 

In [14]:
import nltk
import string
import re
import random 

punctuation_pattern = r"[-_. ]+"


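# Download the CMUDict data (only needed once)
nltk.download('cmudict')

# Note: this import rebinds the name `cmudict`, shadowing the standalone
# `cmudict` package imported earlier
from nltk.corpus import cmudict
d = cmudict.dict()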

During experimentation with these approaches I found some words were missing from the CMUDict. Instead of ignoring those syllables, I decided to randomly assign them a syllable count. Rather than just being totally random, I used a rough approximation of the distribution of syllable counts for words in the English language. It wasn't that easy to find any good references for this data so I'm basing this on a Reddit post 🤷

In [15]:
def random_syllable_count():
    # This is a rough approximation of the syllable distribution of English words
    # source is obviously reddit: https://www.reddit.com/r/dataisbeautiful/comments/qujsoc/oc_most_common_word_lengths_as_measured_in/
    values = [1, 2, 3, 4, 5]
    probabilities = [0.20, 0.41, 0.24, 0.10, 0.05]
    return random.choices(values, probabilities, k=1)[0]
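Because this fallback is random, the score of a haiku containing an unknown word can change from run to run. If reproducibility matters, one option (an assumption, not something done in this notebook) is to draw from a dedicated, seeded `random.Random` instance. The name `seeded_syllable_count` and the seed value are hypothetical.

```python
import random

# Same rough distribution as random_syllable_count above
VALUES = [1, 2, 3, 4, 5]
PROBABILITIES = [0.20, 0.41, 0.24, 0.10, 0.05]


def seeded_syllable_count(rng: random.Random) -> int:
    """Draw a syllable count from the approximate distribution using the given RNG."""
    return rng.choices(VALUES, weights=PROBABILITIES, k=1)[0]


# Two generators created with the same (arbitrary) seed produce the same sequence
rng_a = random.Random(42)
rng_b = random.Random(42)
draws_a = [seeded_syllable_count(rng_a) for _ in range(10)]
draws_b = [seeded_syllable_count(rng_b) for _ in range(10)]
print(draws_a == draws_b)
```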

We now create a function for counting the syllables in a word and then a function for counting the syllables in a haiku.

In [22]:
d = cmudict.dict()

def count_syllables(word):
    """Count the number of syllables in a word using CMUDict, handling punctuation."""
    # Normalise case and remove leading/trailing punctuation
    word = word.lower().strip(string.punctuation)

    if word not in d:
        # Try splitting on internal punctuation (e.g. hyphenated words)
        split_word = re.split(r"[^\w']+", word)
        if len(split_word) > 1:
            return sum(count_syllables(w) for w in split_word)
        # Note: rstrip("'s") removes any trailing run of "'" and "s" characters,
        # not only a possessive "'s" suffix
        no_apostrophe_s = word.rstrip("'s")
        if no_apostrophe_s in d:
            return count_syllables(no_apostrophe_s)
        else:
            # Unknown word (this includes an empty string): use a random count
            return random_syllable_count()

    # Count the vowel phonemes: in CMUDict each vowel ends in a stress digit (0, 1 or 2)
    syllables = [phoneme for phoneme in d[word][0] if phoneme[-1].isdigit()]
    return len(syllables)
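A detail worth knowing about the possessive handling: `str.rstrip("'s")` treats its argument as a set of characters, so it removes any trailing run of `'` and `s`, not only a literal `'s` suffix. The snippet below shows the difference. `strip_possessive` is a hypothetical helper and is not part of the code used to build this dataset.

```python
def strip_possessive(word: str) -> str:
    """Remove a single trailing "'s" suffix, leaving other words untouched."""
    if word.endswith("'s"):
        return word[:-2]
    return word


# Compare the character-set behaviour of rstrip with a true suffix removal
for word in ["nature's", "glass", "glass's"]:
    print(word, "->", repr(word.rstrip("'s")), "vs", repr(strip_possessive(word)))
```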


Now that we have a function for counting the syllables in a word, we want to define some way of using it to rank our generations. For now I'll score out of 4, and knock off 1 point for each line with a syllable count that doesn't match the 5-7-5 structure. A haiku that doesn't have exactly three lines scores 0. We could make this more complex, e.g. by giving a harsher score to haiku whose syllable counts are further away from the 5-7-5 structure.

In [23]:
def score_haiku_syllables(haiku):
    """
    Scores a haiku based on whether the syllable counts are 5, 7, 5
    If all syllable counts are correct, returns 4
    If one syllable count is incorrect, returns 3
    If two syllable counts are incorrect, returns 2
    If all syllable counts are incorrect, returns 1
    print haikus with a score less than 4 with the syllable counts
    """
    expected_counts = [5, 7, 5]
    lines = haiku.split("\n")
    if len(lines) != 3:
        return 0
    counts = []
    for line in lines:
        line_count = sum(count_syllables(word) for word in line.split(" "))
        counts.append(line_count)
    score = 4 - sum(
        1
        for expected, actual in zip(expected_counts, counts)
        if expected != actual
    )
    if score < 4:
        print(haiku) # you can comment this out if you don't want to see the haiku printed
        print(counts) # you can comment this out if you don't want to see the syllable counts printed
    return score
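One thing to be aware of in the function above: `line.split(" ")` splits on every single space, so a line with trailing or doubled spaces yields empty strings. Those empty strings are not in CMUDict, so `count_syllables` gives each of them a random syllable count. Calling `split()` with no argument avoids this. A small demonstration, using a line from the first row of the dataset:

```python
line = "Majestic mountain,  "  # note the two trailing spaces

print(line.split(" "))  # keeps an empty string for each extra space
print(line.split())     # splits on runs of whitespace and drops empty strings
```

This is the likely reason some haiku with trailing spaces end up with implausibly high line counts.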


In [24]:
score_haiku_syllables(dataset[0]['generations'][0])

Majestic peaks so tall,
Silent whispers of time's grace,
Nature's cathedral.
[6, 7, 5]


3

In [25]:
score_haiku_syllables("Morning light ascends,\nSilent peaks embrace dawn,\nNature's breath whispers")

Morning light ascends,
Silent peaks embrace dawn,
Nature's breath whispers
[5, 6, 5]


3
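The harsher, distance-based scoring mentioned above could be sketched as follows. This is one possible design, not the scoring used to build this dataset: `score_by_distance` is a hypothetical helper that takes already-computed per-line syllable counts and subtracts the total absolute deviation from 5-7-5, with a floor of 1 so that 0 stays reserved for haiku with the wrong number of lines.

```python
EXPECTED_COUNTS = (5, 7, 5)


def score_by_distance(counts, max_score=4):
    """Score per-line syllable counts by how far they are from 5-7-5."""
    if len(counts) != len(EXPECTED_COUNTS):
        return 0  # wrong number of lines, as in score_haiku_syllables
    # Total number of syllables by which the lines miss their targets
    distance = sum(abs(e - a) for e, a in zip(EXPECTED_COUNTS, counts))
    # Lose one point per syllable of deviation, but never drop below 1
    return max(1, max_score - distance)


print(score_by_distance([5, 7, 5]))    # perfect haiku
print(score_by_distance([6, 7, 5]))    # one syllable off
print(score_by_distance([11, 17, 5]))  # far off
```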

We can use the `map` method to add a new column to our dataset with the score for each generated haiku.

In [26]:
def score_haikus(row):
    haikus = row['generations']
    return {"scores": [score_haiku_syllables(haiku) for haiku in haikus]}

In [27]:
scored = dataset.map(score_haikus)

Map:   0%|          | 0/4303 [00:00<?, ? examples/s]

Majestic peaks so tall,
Silent whispers of time's grace,
Nature's cathedral.
[6, 7, 5]
Majestic mountain,  
Serenity in every peak,  
Nature's quiet song.
[11, 17, 5]
Majestic peaks soar,
Silent whispers of stone and snow,
Nature's cathedral.
[5, 8, 5]
Majestic peak so high,
In silence, nature's beauty,
Eternal peace reigns.
[6, 7, 5]
Majestic peaks soar,
Silence speaks, soul at ease,
Nature's grand cathedral.
[5, 6, 6]
Majestic peaks soar,
Silent whispers of stone and sky,
Nature's grand cathedral.
[5, 8, 6]
Majestic peaks so high,
Silent sentinels of time,
Nature's cathedral.
[6, 7, 5]
Majestic peaks ascend,
Silent, stoic, standing tall,
Nature's grand cathedral.
[6, 7, 6]
Majestic peaks so high,
Silent whispers of the sky,
Nature's crown of pride.
[6, 7, 5]
Majestic peaks rise,
Silent sentinels of time,
Nature's grand cathedral.
[5, 7, 6]
Majestic peaks soar,
Silent whispers of stone and ice,
Nature's throne of peace.
[5, 8, 5]
Majestic peaks so high,
Snow-capped summits touch the s

In [24]:
scored[0]

{'input': 'Can you compose a haiku about the serenity of mountain peaks?',
 'generation_model': ['TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ'],
 'generation_prompt': ['

Let's begin to explore these results. We can (and probably should) do a more in depth analysis of these scores but to get this project moving we'll do some quick and dirty analysis.

In [26]:
list1, list2, list3, list4, *_ = map(list, zip(*scored['scores']))

We'll start by looking at the frequency of scores for the first four generations of each prompt.

In [27]:
# Print the score counts for each of the four lists, most frequent score first
for scores in (list1, list2, list3, list4):
    print(dict(sorted(frequencies(scores).items(), key=lambda item: item[1], reverse=True)))

{3: 1647, 4: 1250, 2: 1089, 1: 265, 0: 52}
{3: 1709, 4: 1218, 2: 1067, 1: 262, 0: 47}
{3: 1739, 4: 1200, 2: 1057, 1: 264, 0: 43}
{3: 1714, 4: 1218, 2: 1011, 1: 318, 0: 42}
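The `zip(*scored['scores'])` idiom used above transposes the data: instead of one list of 20 scores per prompt, we get one list per generation position, each holding one score per prompt. A toy illustration with made-up scores, using `collections.Counter` from the standard library in place of `toolz.frequencies`:

```python
from collections import Counter

# Made-up scores: 3 prompts x 4 generations each
scores_per_prompt = [
    [4, 3, 2, 4],
    [3, 3, 4, 0],
    [4, 2, 4, 1],
]

# Transpose so that each list holds the scores for one generation position
first, second, *rest = map(list, zip(*scores_per_prompt))
print(first)           # scores of the first generation for every prompt
print(Counter(first))  # how often each score occurs in that position
```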


We can also look at the spread of scores for each prompt, i.e. the difference between the highest and lowest score among its generations.

In [28]:
differences = []
for scores in scored['scores']:
    max_score = max(scores)
    min_score = min(scores)
    differences.append(max_score - min_score)

In [29]:
freqs = frequencies(differences)

Let's look at how often each max-min difference occurs, as a percentage of rows. We want to see how often these rules produce different scores for haiku generated from the same prompt, to make sure there is enough variation in the scores. If all haikus were already technically correct, we would have a hard time training a model to improve them.

In [30]:
frequencies_of_scores = valmap(lambda x: x/sum(freqs.values())*100, freqs)
{k:v for k,v in sorted(frequencies_of_scores.items(),key=lambda item: item[0])}


{0: 0.023239600278875203,
 1: 0.9760632117127586,
 2: 33.25586799907042,
 3: 56.63490587961887,
 4: 9.10992330931908}

We can see there are very few rows (about 0.02%) where there is no difference between the highest and lowest scoring haiku. This is good, as it means we have some variation in our scores.

Let's push this dataset to the Hub now; this will be our 'raw' version. We'll do some more initial filtering, but keeping this data around in a separate config means we can easily go back to it if we need to.

In [266]:
scored.push_to_hub("davanstrien/haiku_dpo", "raw-scored-haikus", private=True)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/haiku_dpo/commit/0ed529af613bdd7765ad6544ca7c900846ad40f2', commit_message='Upload dataset', commit_description='', oid='0ed529af613bdd7765ad6544ca7c900846ad40f2', pr_url=None, pr_revision=None, pr_num=None)

## Filtering

Again, we could (and will!) do a lot more work with this data, but since we're generating a preference dataset, we at least want to remove rows where, based on our criteria, there is nothing to prefer, i.e.

- We don't want to include rows where all the haikus have a perfect score
- We don't want rows where all the haikus have the lowest score

### Remove all rows where all haiku are top scored

In [28]:
def not_all_scores_are_four(row):
    scores = row['scores']
    return set(scores) != {4}
    


In [29]:
len(scored)

4303

In [30]:
not_all_top_scores = scored.filter(not_all_scores_are_four)
not_all_top_scores

Filter:   0%|          | 0/4303 [00:00<?, ? examples/s]

Dataset({
    features: ['input', 'generation_model', 'generation_prompt', 'generations', 'scores'],
    num_rows: 4303
})

### Remove all rows where all haiku are zero

In [31]:
def not_all_scores_are_zero(row):
    scores = row['scores']
    return set(scores) != {0}

In [32]:
cleaned = not_all_top_scores.filter(not_all_scores_are_zero)

Filter:   0%|          | 0/4303 [00:00<?, ? examples/s]

In [33]:
cleaned

Dataset({
    features: ['input', 'generation_model', 'generation_prompt', 'generations', 'scores'],
    num_rows: 4302
})

This only removes a single row (4303 down to 4302), but rows like this are unlikely to contain any signal for our model to learn from, so we don't want them.
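Note that the two filters above only catch rows where every score is 4 or every score is 0. A row where all haiku share some other score (say, all 3s) would also offer nothing to prefer. A more general check, sketched here as a hypothetical `has_score_variation` helper, keeps a row only if it contains at least two distinct scores.

```python
def has_score_variation(row):
    """Keep rows whose scores are not all identical."""
    return len(set(row["scores"])) > 1


# Toy rows with made-up scores
rows = [
    {"scores": [4, 4, 4]},  # all perfect: nothing to prefer
    {"scores": [3, 3, 3]},  # all equal, but not caught by the 4/0 filters
    {"scores": [4, 2, 0]},  # useful: a clear best and worst
]
print([has_score_variation(row) for row in rows])
```

With a `datasets.Dataset` this could be applied as `scored.filter(has_score_variation)`. For this dataset it makes no practical difference, since the tie check later in the notebook finds no remaining ties.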

We'll now create a function which will help us choose our actual preference pairs. This function will:

- identify min and max score
- identify the haiku with the min and max score
- store the chosen and rejected haiku
- return the chosen and rejected haiku scores
- indicate if a tie was present
- store the difference between the chosen and rejected haiku scores


All of these will potentially be useful for further filtering later. 

In [44]:
import random

def analyze_responses(data):
    responses = data['generations']
    scores = data['scores']

    # Finding the max and min scores
    max_score = max(scores)
    min_score = min(scores)

    # Identifying responses with the highest and lowest scores
    highest_scoring_responses = [responses[i] for i, score in enumerate(scores) if score == max_score]
    lowest_scoring_responses = [responses[i] for i, score in enumerate(scores) if score == min_score]

    # Handling ties
    tie = max_score == min_score

    # Selecting one response randomly in case of a tie
    chosen_response = random.choice(highest_scoring_responses)
    rejected_response = random.choice(lowest_scoring_responses)

    # Calculating the difference in score
    difference_in_score = max_score - min_score
   
    return {
        "chosen": chosen_response,
        "chosen_score": max_score,
        "rejected": rejected_response,
        "rejected_score": min_score,
        "tie": tie,
        "difference_in_score": difference_in_score
    }


In [45]:
cleaned

Dataset({
    features: ['input', 'generation_model', 'generation_prompt', 'generations', 'scores', 'chosen', 'chosen_score', 'rejected', 'rejected_score', 'tie', 'difference_in_score'],
    num_rows: 4302
})

We can now use `map` again to apply this function to our dataset.

In [46]:
cleaned = cleaned.map(analyze_responses)


Map:   0%|          | 0/4302 [00:00<?, ? examples/s]

In [47]:
cleaned[0]

{'input': 'Can you compose a haiku about the serenity of mountain peaks?',
 'generation_model': ['TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ',
  'TheBloke/OpenHermes-2.5-Mistral-7B-AWQ'],
 'generation_prompt': ['

### See how many chosen haiku are top scored

In [48]:
len(cleaned.filter(lambda x: x['chosen_score'] == 4))

Filter:   0%|          | 0/4302 [00:00<?, ? examples/s]

4157

### See how many ties

In [50]:
len(cleaned.filter(lambda x: x['tie'] == True))

Filter:   0%|          | 0/4302 [00:00<?, ? examples/s]

0

In [51]:
cleaned.push_to_hub("davanstrien/haiku_dpo", "rule_ranked")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/haiku_dpo/commit/318b45d6cf7981835fbb2f802b803cfac8268a24', commit_message='Upload dataset', commit_description='', oid='318b45d6cf7981835fbb2f802b803cfac8268a24', pr_url=None, pr_revision=None, pr_num=None)

## Keep only biggest difference in scores

Since we want a clear preference between the chosen and rejected haiku, we'll keep only rows where the chosen haiku has the highest possible score (4). We'll then remove any rows where the rejected haiku scores higher than 2, i.e. where the gap between chosen and rejected is small.

In [52]:
frequencies(cleaned['chosen_score'])

{4: 4157, 3: 143, 2: 2}

In [53]:
frequencies(cleaned['rejected_score'])

{2: 1342, 1: 2500, 0: 426, 3: 34}

In [55]:
def filter_top_and_bottom(row):
    return row["chosen_score"] == 4 and row["rejected_score"] < 3


In [56]:
top_and_bottom = cleaned.filter(filter_top_and_bottom)

Filter:   0%|          | 0/4302 [00:00<?, ? examples/s]

In [57]:
len(top_and_bottom)

4123

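We'll clean up the formatting of our data to make it easier to use out of the box as a DPO dataset. We start by extracting the system prompt from the first generation prompt, then add it to every row as a `system` column.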

In [59]:
system_prompt = top_and_bottom[0]['generation_prompt'][0].split("system")[1].split("<|im_end|>")[0]
system_prompt = system_prompt.strip("\n")
system_prompt

'You are a poet specialising in creating Haiku. \nYour haiku consist of three lines, with five syllables in the first line, seven in the second, and five in the third.\nBeyond being technically correct, your haiku should also be beautiful and meaningful'
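The chained `split` calls above rely on the prompt following the ChatML layout, where the system message sits between `<|im_start|>system` and `<|im_end|>`. Splitting on the bare word `system` would break if that word also appeared earlier in the prompt, so a regular expression anchored on the ChatML markers is a slightly more defensive alternative. The prompt string below is a shortened, hypothetical stand-in, since the real `generation_prompt` values are not shown in full here.

```python
import re

# Hypothetical, shortened ChatML-style prompt (not the actual dataset value)
prompt = (
    "<|im_start|>system\n"
    "You are a poet specialising in creating Haiku.<|im_end|>\n"
    "<|im_start|>user\n"
    "Can you compose a haiku about the serenity of mountain peaks?<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Capture everything between the system marker and the first end marker
match = re.search(r"<\|im_start\|>system\n(.*?)<\|im_end\|>", prompt, flags=re.DOTALL)
system_message = match.group(1).strip() if match else None
print(system_message)
```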

In [60]:
top_and_bottom = top_and_bottom.map(lambda x: {"system": system_prompt})

Map:   0%|          | 0/4123 [00:00<?, ? examples/s]

In [61]:
top_and_bottom = top_and_bottom.rename_column('input','question')

We'll now push this version to a `default` config. This will be the version that people get back if they use `load_dataset("davanstrien/haiku_dpo")` without specifying a config.

In [62]:
top_and_bottom.push_to_hub("davanstrien/haiku_dpo", "default")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

README.md:   0%|          | 0.00/12.9k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/datasets/davanstrien/haiku_dpo/commit/2d135a0406e3689fe361b6f2412bca70c37e1234', commit_message='Upload dataset', commit_description='fix tie function', oid='2d135a0406e3689fe361b6f2412bca70c37e1234', pr_url='https://huggingface.co/datasets/davanstrien/haiku_dpo/discussions/1', pr_revision='refs/pr/1', pr_num=1)