# Evolve Riddles

## Load dataset

In [1]:
from datasets import load_dataset, DatasetDict, Dataset, load_from_disk

dataset_name="Hypersniper/riddles_v1"
dataset = load_dataset(dataset_name, split="all")

Downloading readme:   0%|          | 0.00/2.33k [00:00<?, ?B/s]

In [2]:
dataset

Dataset({
    features: ['answer', 'output', 'instruction'],
    num_rows: 469
})

In [58]:
dataset[0]

{'answer': 'music',
 'output': "The question seems to be a riddle that is focusing on an entity that can evoke a range of emotional responses. The given clues are that it can make someone dream or stamp their feet, and it can also make someone laugh or weep. \n\nLet's break down these clues. Dreaming and stamping feet are both reactions that can be associated with feelings of joy, excitement, or anticipation. On the other hand, laughing and weeping are expressions of happiness and sadness respectively. So, we are looking for something that can induce these varied emotional reactions.\n\nConsidering these clues, one plausible answer could be 'music'. Here's why: \n\nMusic has a profound impact on our emotions. It has the power to uplift our spirits, soothe our nerves, make us feel happy, sad, excited, calm, and even lead us into a state of introspection or dreaming. \n\nWhen we listen to upbeat music, it often makes us want to move or dance. Hence, the reference to 'stamp their feet'. I

## Load Mistral

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "models/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto",    
)
tokenizer = AutoTokenizer.from_pretrained("models/Mistral-7B-Instruct-v0.2")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

# 10-shot prompt Mistral to generate more questions

In [3]:
import random

questions=[q for q in dataset["instruction"]]

questions[0:3]

['At the sound of me, one may dream or stamp their feet, At the sound of me, one may laugh or sometimes weep.',
 "Take off my skin, I won't cry, but you will.",
 'When they are caught, they are thrown away. When they escape, you itch all day.']

In [4]:
random.shuffle(questions)
q10_sample=questions[0:10]
prompt_template="""Below are 10 riddles. Come up with 10 more. Don't output anything else

Riddles:
{questions}"""

prompt=prompt_template.format(questions="\n\n".join(q10_sample))
print(prompt)

Below are 10 riddles. Come up with 10 more. Don't output anything else

Riddles:
I can run but not walk. Wherever I go, thought follows close behind.

Snake coiled round and round. Snake deep below the ground. Snake that's never had a head. Snake that binds but not with dread.

I have a hundred legs, but cannot stand. I have a long neck, but no head. I cannot see. I'm neat and tidy as can be.

In buckles or lace, they help set the pace. The farther you go, the thinner they grow.

It's in your hand though you can not feel it. Only you and time can reveal it.

I saw a strange creature. Long, hard, and straight, thrusting into a round, dark opening. Preparing to discharge its load of lives. Puffing and squealing noises accompanied it, then a final screech as it slowed and stopped.

What goes in the water red, and comes out black?

I am whole but incomplete. I have no eyes, yet I see. You can see, and see right through me. My largest part is one fourth of what I once was.

What is full of

In [None]:
messages = [
    {"role": "user", "content": prompt},
]

# Tokenize with the chat template and move the inputs to the model's device
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs, 
    max_new_tokens=500, 
    do_sample=True, 
    pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
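
The prompt construction above can also be wrapped in a small helper. This is only a sketch: `build_few_shot_prompt`, its `k` and `seed` parameters, and the demo riddles are hypothetical names introduced here, and `random.sample` is swapped in for the `random.shuffle` plus slicing used in the notebook so that the source list is left unchanged.

```python
import random

def build_few_shot_prompt(riddles, k=10, seed=None):
    """Return a prompt containing k randomly chosen riddles (hypothetical helper)."""
    rng = random.Random(seed)          # a seed makes the sample reproducible
    sample = rng.sample(riddles, k)    # unlike shuffle, this does not modify `riddles`
    header = (
        f"Below are {k} riddles. Come up with {k} more. "
        "Don't output anything else\n\nRiddles:\n"
    )
    return header + "\n\n".join(sample)

# Placeholder riddles, only for demonstration
demo_riddles = [f"Riddle number {i}?" for i in range(25)]
demo_prompt = build_few_shot_prompt(demo_riddles, k=10, seed=0)
print(demo_prompt)
```

Passing a fixed `seed` is optional; leaving it as `None` gives a different sample on every call, which is what the generation loop wants.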

# Evolve dataset

## Generate 3k more riddles (step 1)

In [20]:
from tqdm import tqdm

prompt_template="""Below are 10 riddles. Come up with 10 more. Output just the riddles, no numbering. Don't output anything else.

Riddles:
{questions}"""

synthetic_riddles=[]

for _ in tqdm(range(300)):
    random.shuffle(questions)
    q10_sample=questions[0:10]
    
    prompt=prompt_template.format(questions="\n\n".join(q10_sample))
    messages = [{"role": "user", "content": prompt}]
    
    input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    output_tokens = model.generate(
        input_tokens, 
        max_new_tokens=500, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id)
    output_tokens=output_tokens[0][len(input_tokens[0]):]
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)

    synthetic_riddles.extend( output.split("\n") )

100%|█████████████████████████████████████████| 300/300 [35:27<00:00,  7.09s/it]


In [23]:
synthetic_riddles=[s.strip() for s in synthetic_riddles if len(s.strip())>0]
len(synthetic_riddles)

3284

In [51]:
# clean riddles: drop incomplete lines and strip list numbering
prefixes= \
    [f"{num}." for num in range(50)] + \
    [f"{num})" for num in range(50)] + \
    ["I.",
    "II.",
    "III.",
    "IV.",
    "V.",
    "VI.",
    "VII.",
    "VIII.",
    "IX.",
    "X."]

synthetic_riddles_clean=[]

for r in tqdm(synthetic_riddles):
    r_clean=r
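    # keep only lines that end with "." or "?" (drops truncated generations)
    if r[-1]!="." and r[-1]!="?":
        continue
    for prefix in prefixes:
        if r.startswith(prefix):
            # slice off the leading prefix only; split() would also cut at later occurrences
            r_clean=r[len(prefix):].strip()
            break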
    synthetic_riddles_clean.append(r_clean) 

display(len(synthetic_riddles_clean))

# remove duplicates
synthetic_riddles_clean=list(set(synthetic_riddles_clean))
display(len(synthetic_riddles_clean))

# remove near-duplicates where the first 40 chars (see `chars` below) are the same
synthetic_riddles_clean.sort()

tmp=[]
chars=40
# note: the loop starts at index 1, so the first riddle in sorted order is never kept
for i in range(1,len(synthetic_riddles_clean)):
    if len(synthetic_riddles_clean[i])<chars and len(synthetic_riddles_clean[i-1])<chars:
        tmp.append(synthetic_riddles_clean[i])
        continue
    if synthetic_riddles_clean[i][:chars]==synthetic_riddles_clean[i-1][:chars]:
        continue
    else:
        tmp.append(synthetic_riddles_clean[i])
synthetic_riddles_clean=tmp

display(len(synthetic_riddles_clean))

100%|████████████████████████████████████| 3284/3284 [00:00<00:00, 52839.09it/s]


3088

1996

1682
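
The cleaning steps above (filter incomplete lines, strip numbering, drop exact and near-duplicates) could be collected into one function. The following is a sketch of a variant, not the code that produced the counts shown: `clean_riddles` and `PREFIX_RE` are hypothetical names, a regular expression replaces the explicit prefix list, and each riddle is compared against the last kept riddle.

```python
import re

# Leading list markers such as "12.", "3)" or Roman numerals "I." to "X."
PREFIX_RE = re.compile(r"^\s*(?:\d{1,2}[.)]|(?:VIII|VII|VI|IV|IX|III|II|I|V|X)\.)\s*")

def clean_riddles(lines, chars=40):
    """Filter, de-number and de-duplicate raw model output lines (hypothetical helper)."""
    cleaned = set()  # a set removes exact duplicates
    for line in lines:
        line = line.strip()
        # keep only complete-looking lines that end with "." or "?"
        if not line or line[-1] not in ".?":
            continue
        cleaned.add(PREFIX_RE.sub("", line, count=1))

    result = []
    for riddle in sorted(cleaned):
        # near-duplicate: same first `chars` characters as the last kept riddle
        if result and len(riddle) >= chars and riddle[:chars] == result[-1][:chars]:
            continue
        result.append(riddle)
    return result

raw_lines = [
    "1. What has keys but can't open locks?",
    "What has keys but can't open locks?",
    "II. I speak without a mouth and hear without ears. What am I?",
    "I speak without a mouth and hear without ears. Who am I?",
    "An unfinished line,",
    "",
]
print(clean_riddles(raw_lines))
```

Because the list is sorted before comparison, riddles that share an opening end up adjacent, so checking only the previously kept riddle is enough to catch them.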

In [50]:
synthetic_riddles_clean

['A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?',
 "A bath can't give you one, a river can't float one, but what finds it in the hot desert?",
 'A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?',
 "A bird is in a cage, but it's not a bird in the wild. What am I?",
 'A bird is in a cage. The cage is in a room. The room is on the first floor. The first floor is in a building. The building is on fire. How will the bird get out?',
 "A bird is in the tree, it's not alive, it has no feathers, it's not alive. It sings a song, but it doesn't breathe, it never eats, it never grows. What is it?",
 'A bird is outside your window, you do not open it. What kind of bird is it?',
 'A bird keeps it in its beak, a fish in its gills, a monk in his cell. What is it?',
 "A bird that can fly but cannot walk. It's not often seen, because it's usually talking. 

In [58]:
import json
def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)

synthetic_riddles_dict = [ dict(number=i, messages=[ riddle ]) for i, riddle in enumerate(synthetic_riddles_clean) ]
write_pretty_json("synthetic_riddles_step-1.json", synthetic_riddles_dict)

print(json.dumps(synthetic_riddles_dict[:3],indent=2))

[
  {
    "number": 0,
    "messages": [
      "A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?"
    ]
  },
  {
    "number": 1,
    "messages": [
      "A bath can't give you one, a river can't float one, but what finds it in the hot desert?"
    ]
  },
  {
    "number": 2,
    "messages": [
      "A big tree was once a little seed. A little bird was once a little egg. What was Alice before she was a girl?"
    ]
  }
]


## Generate answers to new riddles with mistral-instruct (step 2)

In [79]:
from tqdm import tqdm
import copy 

prompt_template="""{riddle}

Think step-by-step, keep your explanations simple, try your very best. If there is information missing for you to come up with a specific answer, just ask me a short question at the very end of your answer."""

synthetic_riddles_step2=copy.deepcopy(synthetic_riddles_dict)

for riddle in tqdm(synthetic_riddles_step2):
    prompt=prompt_template.format( riddle=riddle["messages"][0] )
    messages = [{"role": "user", "content": prompt}]
    
    input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    output_tokens = model.generate(
        input_tokens, 
        max_new_tokens=500, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id)
    output_tokens=output_tokens[0][len(input_tokens[0]):]
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)
    riddle["messages"].append(output)
    # print("riddle: {}\nanswer: {}\n\n".format(riddle["messages"][0], output))


100%|█████████████████████████████████████| 1682/1682 [1:11:58<00:00,  2.57s/it]


In [83]:
print(json.dumps(synthetic_riddles_step2[:3],indent=2))

[
  {
    "number": 0,
    "messages": [
      "A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?",
      "The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So,we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer."
    ]
  },
  {
    "number": 1,
    "messages": [
      "A bath can't give you one, a river can't float one, but what finds it in the hot desert?",
      "The answer to the riddle is a shadow. A shadow cannot exist in the absence of sunlight, which is typically found in hot desert climates. Shadows cannot be physically created by a bath o

In [84]:
import json
def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)

write_pretty_json("synthetic_riddles_step-2.json", synthetic_riddles_step2)

## Generate follow-up question by user, i.e. start turn 2 (step 3)

In [116]:
from tqdm import tqdm
import copy 

prompt_template="""Please continue the conversation below. Provide the next reply by the user. Formulate a very short question. Imitate a curious 10 year old kid asking a question.

user: {question}
assistant: {answer}"""

synthetic_riddles_step3=copy.deepcopy(synthetic_riddles_step2)

for riddle in tqdm(synthetic_riddles_step3):
    prompt=prompt_template.format( 
        question=riddle["messages"][0],
        answer=riddle["messages"][1] 
    )
    messages = [{"role": "user", "content": prompt}]
    
    input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    output_tokens = model.generate(
        input_tokens, 
        max_new_tokens=500, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id)
    output_tokens=output_tokens[0][len(input_tokens[0]):]
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)

    riddle["messages"].append(output)

100%|███████████████████████████████████████| 1682/1682 [26:20<00:00,  1.06it/s]


In [140]:
# Clean up Mistral's outputs

prefixes=[
    "kid:",
    "Kid:",
    "child:",
    "Child:",
    "old:",
    "user:",
    "User:",    
]

for riddle in synthetic_riddles_step3:
    msg=riddle["messages"][2]
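    for prefix in prefixes:
        if prefix in msg:
            # keep everything after the first occurrence of the prefix
            msg=msg.split(prefix, 1)[1]
    # remove surrounding whitespace and quote characters
    msg=msg.strip().strip("'\"")
    riddle["messages"][2]=msg.strip()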

a=[r["messages"][2] for r in synthetic_riddles_step3]
a.sort()
a

['A 10-year-old might ask: "Is there another thing with lots of keys, but it doesn\'t open locks like a piano?" (Bonus: "Is it similar to the musical answer we figured out earlier?")',
 'A 10-year-old might ask: "What kind of push or pull do we give a lever to make it work?',
 'A curious 10 year old might ask: "Which famous river fits this description in the riddle?',
 'A curious 10-year-old kid might ask, "What animal could people mistake for this riddle, not knowing it\'s about a human?',
 'A curious 10-year-old might ask, "Are you a storyteller hiding behind a riddle? What should I say to \'open\' you and hear your stories?',
 'A curious 10-year-old might ask, "Can we eat artichokes all year long, or is there a special time for it?',
 'A curious 10-year-old might ask, "How does a river start its journey?',
 'A house key, kiddo. House keys open the doors of our homes.',
 'A short table leg, like those on a small table, is the answer to your question.',
 "A: Hm, I'm not sure. Let's fi

In [141]:
import json
def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)

write_pretty_json("synthetic_riddles_step-3.json", synthetic_riddles_step3)
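
The sorted output above still contains replies wrapped in narration such as `A curious 10-year-old might ask, "..."`, which the prefix list does not catch. A possible extra cleanup pass is sketched below; `extract_question` and `NARRATION_RE` are hypothetical names, and the pattern only covers the "might ask" phrasing visible in the output, so other phrasings would need their own rules.

```python
import re

# Narration up to and including "might ask", optionally followed by "," or ":"
NARRATION_RE = re.compile(r"^.*?\bmight ask[,:]?\s*", re.IGNORECASE)
# A trailing parenthetical without nested parentheses, e.g. (Bonus: "...")
TRAILING_PAREN_RE = re.compile(r"\s*\([^()]*\)\s*$")

def extract_question(msg):
    """Strip narration, a trailing parenthetical and surrounding quotes (hypothetical helper)."""
    msg = NARRATION_RE.sub("", msg.strip(), count=1)
    msg = TRAILING_PAREN_RE.sub("", msg)
    return msg.strip().strip("'\"").strip()

examples = [
    'A curious 10-year-old might ask, "How does a river start its journey?',
    'A 10-year-old might ask: "Is there another thing with lots of keys?" (Bonus: "Is it similar?")',
    "But how can we tell if it's ripe?",
]
for example in examples:
    print(extract_question(example))
```

Replies that are not questions at all (for example ones that answer the riddle instead) could additionally be filtered with a check such as `extract_question(msg).endswith("?")`.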

## Answer follow-up question by user, i.e. finish turn 2 (step 4)

In [3]:
import json
def read_json(file_path):
    # the context manager closes the file even if json.load raises
    with open(file_path) as f:
        return json.load(f)

synthetic_riddles_step3=read_json("synthetic_riddles_step-3.json")

In [6]:
synthetic_riddles_step3[0]

{'number': 0,
 'messages': ['A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?',
  'The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So,we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer.',
  "But how can we tell if it's more likely to be a ripe apple if it's red, since sometimes red apples aren't ripe yet?"]}

In [11]:
from tqdm import tqdm
import copy 

synthetic_riddles_step4=copy.deepcopy(synthetic_riddles_step3)

for riddle in tqdm(synthetic_riddles_step4):
    messages = [
        {"role": "user", "content": riddle["messages"][0]},
        {"role": "assistant", "content": riddle["messages"][1]},
        {"role": "user", "content": riddle["messages"][2]},
    ]
    
    input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    output_tokens = model.generate(
        input_tokens, 
        max_new_tokens=500, 
        do_sample=True, 
        pad_token_id=tokenizer.eos_token_id)
    output_tokens=output_tokens[0][len(input_tokens[0]):]
    output = tokenizer.decode(output_tokens, skip_special_tokens=True)

    riddle["messages"].append(output)

import json
def write_pretty_json(file_path, data):
    with open(file_path, "w") as write_file:
        json.dump(data, write_file, indent=4)

write_pretty_json("synthetic_riddles_step-4.json", synthetic_riddles_step4)

 15%|█████▌                                | 246/1682 [18:56<1:50:35,  4.62s/it]


KeyboardInterrupt: 
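
Since this loop runs for hours and the JSON file is only written after it finishes, an interrupt loses all progress. One way to make such a run resumable is sketched below. `generate_with_checkpoints`, `save_json` and `fake_generate` are hypothetical names, and `generate_fn` stands in for the tokenize, `model.generate` and decode steps of the cell above.

```python
import json
import os
import tempfile

def save_json(path, data):
    with open(path, "w") as f:
        json.dump(data, f, indent=4)

def generate_with_checkpoints(records, generate_fn, path, target_len=4, save_every=50):
    """Append one reply per record, saving to `path` so an interrupted run can resume."""
    if os.path.exists(path):
        with open(path) as f:
            records = json.load(f)  # resume from the last checkpoint
    generated = 0
    try:
        for record in records:
            if len(record["messages"]) >= target_len:
                continue  # already answered in an earlier run
            record["messages"].append(generate_fn(record["messages"]))
            generated += 1
            if generated % save_every == 0:
                save_json(path, records)
    finally:
        # also runs on KeyboardInterrupt, so finished replies are kept
        save_json(path, records)
    return records

def fake_generate(messages):
    # placeholder for the real model call
    return "reply to: " + messages[-1]

demo = [{"number": i, "messages": ["riddle", "answer", "follow-up"]} for i in range(3)]
with tempfile.TemporaryDirectory() as tmp:
    done = generate_with_checkpoints(demo, fake_generate, os.path.join(tmp, "step-4.json"), save_every=2)
print(len(done[0]["messages"]))
```

Re-running the same call after an interruption skips every record that already has `target_len` messages and continues with the rest.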

# Push dataset to HF hub

In [3]:
import json
from datasets import Dataset

def read_json(file_path):
    # the context manager closes the file even if json.load raises
    with open(file_path) as f:
        return json.load(f)

dataset_json=read_json("synthetic_riddles_step-4.json")
dataset=Dataset.from_list(dataset_json)

In [4]:
dataset

Dataset({
    features: ['number', 'messages'],
    num_rows: 1682
})

In [5]:
dataset[0]

{'number': 0,
 'messages': ['A bag contains apples, some red, some green. If you reach in without looking, and pick one apple, what is the chance that it was ripe?',
  'The information given does not specify anything about the ripeness of the apples being red or green. Therefore, we cannot determine the chance that a picked apple is ripe based on its color alone. Additionally, ripeness is a subjective factor that can vary from apple to apple even within the same color group. So,we would need more information, such as the definition of ripeness used or any context about the storage conditions of the apples, to provide an answer.',
  "But how can we tell if it's more likely to be a ripe apple if it's red, since sometimes red apples aren't ripe yet?",
  'You are correct that the color of an apple (being red) does not necessarily indicate that it is ripe. Some apple varieties, such as Granny Smith apples, remain green even when fully ripe. Therefore, the color of an apple alone cannot be u
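
Each record stores its conversation as a flat list of strings that alternate between user and assistant. Before uploading, it may be worth checking that every record has all four turns and, if desired, converting to the role/content layout that `apply_chat_template` expects. The sketch below assumes that alternating order; `to_chat_format` and `is_complete` are hypothetical helpers, not part of the notebook.

```python
def to_chat_format(record):
    """Turn a flat list of alternating messages into role/content dicts."""
    roles = ("user", "assistant")
    return [
        {"role": roles[i % 2], "content": text}
        for i, text in enumerate(record["messages"])
    ]

def is_complete(record, turns=4):
    """True if the record has exactly `turns` non-empty string messages."""
    msgs = record["messages"]
    return len(msgs) == turns and all(isinstance(m, str) and m.strip() for m in msgs)

# Placeholder record, only for demonstration
demo_record = {"number": 0, "messages": ["riddle", "answer", "follow-up", "second answer"]}
print(is_complete(demo_record))
print(to_chat_format(demo_record))
```

A filter such as `[r for r in dataset_json if is_complete(r)]` would then drop records left unfinished by an interrupted run.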

## Login

In [None]:
!huggingface-cli login --token YOUR_TOKEN

## Upload

In [8]:
dataset.push_to_hub("g-ronimo/riddles_evolved")

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]