## Baby-step-experiment

In [7]:
import os
import json
import random
from openai import OpenAI
from datasets import load_dataset
from tqdm import tqdm
from dotenv import load_dotenv

### Loading the OpenAI API Key from the `.env` file
Before running this notebook, you need an OpenAI API key. You can get one from the [OpenAI website](https://platform.openai.com/signup). To use it securely in this notebook, do the following:
1. Create a file named `.env` in the same directory as this notebook.
2. Add the following line to the `.env` file, replacing `your_openai_api_key` with your actual OpenAI API key:
   ```
   OPENAI_API_KEY=your_openai_api_key
   ```
3. Add `.env` to your `.gitignore` file to prevent it from being committed to version control.

In [8]:
# Read the .env file and load its variables into the environment
load_dotenv()

# Get the key that was loaded from the .env file
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    client = OpenAI(api_key=api_key)
    print("OpenAI client initialized successfully.")
else:
    raise ValueError("OpenAI API key not found. Make sure it's set in your .env file.")

OpenAI client initialized successfully.


### Loading the GSM8K Dataset
To start off the baby-steps experiment, let's load the GSM8K dataset with Hugging Face's `datasets` library. GSM8K is a collection of roughly 8,500 grade-school math word problems (about 7,500 of them in the training split used here), each solvable with a few steps of basic arithmetic.

In [9]:
# Load the source dataset once
print("Loading GSM8K dataset...")
gsm8k_train = load_dataset("gsm8k", "main")['train']
print("Dataset loaded.")

Loading GSM8K dataset...
Dataset loaded.
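Each GSM8K record has a `question` field and an `answer` field, and the reference answer ends with a line of the form `#### <number>`. The sketch below pulls out that final number, which could be useful later for checking a generated problem against its solvable original. The helper name `extract_final_answer` and the sample string are illustrative assumptions, not part of the `datasets` library or of the dataset itself.

```python
import re
from typing import Optional


def extract_final_answer(answer_text: str) -> Optional[str]:
    """Return the number that follows the '####' marker, or None if absent."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", answer_text)
    if match is None:
        return None
    # Drop thousands separators so "1,234" becomes "1234"
    return match.group(1).replace(",", "")


# Made-up answer text written in the GSM8K style (not copied from the dataset)
sample_answer = (
    "She sold 48 / 2 = 24 clips in May.\n"
    "In total she sold 48 + 24 = 72 clips.\n"
    "#### 72"
)
print(extract_final_answer(sample_answer))
```

In the notebook this would be called as `extract_final_answer(gsm8k_train[i]['answer'])`.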


In [10]:
# Define the taxonomy of unanswerability we will use
UNANSWERABILITY_TAXONOMY = {
    "insufficient_information": "Make the problem unanswerable by removing a single, critical piece of numerical information. For example, if a problem mentions the cost of apples and oranges, remove the cost of apples.",
    "contradictory_information": "Make the problem unanswerable by adding a piece of information that directly contradicts another statement in the problem. For example, if a problem states there are 10 apples, add a sentence stating there are 12 apples.",
    # "ambiguous_question": "Make the problem unanswerable by making the final question ambiguous. The numbers and facts should remain, but the question itself should be interpretable in two or more ways, making a single answer impossible.",
    # "no_solution_possible": "Make the problem unanswerable by changing a number or condition so the premise becomes mathematically impossible. For example, a baker sells 5 cakes for $20 total, and makes a profit of $25.",
}

In [11]:
# The core function that calls the LLM
def make_problem_unanswerable(problem_text, modification_type, modification_instruction):
    system_prompt = "You are an expert in curriculum design and mathematical pedagogy. Your task is to subtly modify a solvable math problem to make it unanswerable, for the purpose of testing a student's critical thinking."

    user_prompt = f"""
    Please rewrite the following math problem.

    **Original Problem:**
    "{problem_text}"

    **Modification Type:**
    {modification_type}

    **Instruction:**
    {modification_instruction}

    **Your Task:**
    1.  Rewrite the problem according to the instruction.
    2.  Make the *minimal necessary change*. The problem should still look like a plausible, well-formed math problem.
    3.  Do NOT use placeholders like '[missing information]' or '[contradiction]'. The change should be subtle.
    4.  Output a JSON object with three keys:
        - "unanswerable_problem": The full text of the newly generated unanswerable problem.
        - "change_summary": A brief, one-sentence description of what you changed.
        - "reasoning": A clear explanation of why the new problem is unanswerable, directly referencing the modification type.

    Example JSON output format:
    {{
      "unanswerable_problem": "A bakery sells chocolate cakes for $18. On a certain day, it sold 10 cakes in total. How many chocolate cakes did it sell?",
      "change_summary": "I removed the price of vanilla cakes and the total revenue.",
      "reasoning": "This problem is now unanswerable due to insufficient_information. It is impossible to determine the number of each type of cake sold without knowing either the price of the other cake or the total revenue."
    }}
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # Recommended model for this task
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.5, # Lower temperature for more predictable, instruction-following behavior
        )
        return json.loads(response.choices[0].message.content)
    except Exception as e:
        print(f"An API error occurred: {e}")
        return None

print("Taxonomy and generation function are defined.")

Taxonomy and generation function are defined.
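The prompt asks for a JSON object with three keys, but JSON mode only guarantees valid JSON, not that those keys are present. A minimal sketch of a check that could run on the parsed response before it is written to disk; `REQUIRED_KEYS` and `is_valid_generation` are hypothetical names introduced here for illustration.

```python
REQUIRED_KEYS = ("unanswerable_problem", "change_summary", "reasoning")


def is_valid_generation(generated: object) -> bool:
    """True only if `generated` is a dict with a non-empty string for every required key."""
    if not isinstance(generated, dict):
        return False
    return all(
        isinstance(generated.get(key), str) and bool(generated[key].strip())
        for key in REQUIRED_KEYS
    )


# A complete record passes; a record missing "reasoning" does not.
complete = {
    "unanswerable_problem": "A shop sold 10 cakes. How many were chocolate?",
    "change_summary": "Removed the revenue figures.",
    "reasoning": "Insufficient information to split the total.",
}
incomplete = {"unanswerable_problem": "text", "change_summary": "text"}
print(is_valid_generation(complete), is_valid_generation(incomplete))
```

Skipping records that fail this check would avoid writing rows with `None` fields, which the `.get()` calls in the generation loop would otherwise let through silently.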


In [12]:
# --- Configuration ---
NUM_SAMPLES_TO_GENERATE = 10 # START WITH A VERY SMALL NUMBER!
OUTPUT_FILE = "unanswerable_math_dataset.jsonl"

# Get a random subset of the data to work with
indices = random.sample(range(len(gsm8k_train)), NUM_SAMPLES_TO_GENERATE)

print(f"Starting generation of {NUM_SAMPLES_TO_GENERATE} samples...")
print(f"Results will be saved to {OUTPUT_FILE}")

# Using 'w' mode to clear the file on each new run
with open(OUTPUT_FILE, 'w') as f:
    # Using tqdm for a progress bar, which works great in notebooks
    for i in tqdm(indices):
        original_problem = gsm8k_train[i]['question']
        mod_type_key, mod_instruction = random.choice(list(UNANSWERABILITY_TAXONOMY.items()))
        
        generated_data = make_problem_unanswerable(original_problem, mod_type_key, mod_instruction)
        
        if generated_data:
            final_record = {
                "original_problem": original_problem,
                "unanswerable_problem": generated_data.get("unanswerable_problem"),
                "modification_type": mod_type_key,
                "change_summary": generated_data.get("change_summary"),
                "reasoning": generated_data.get("reasoning"),
            }
            f.write(json.dumps(final_record) + "\n")

print(f"\nGeneration complete.")

Starting generation of 10 samples...
Results will be saved to unanswerable_math_dataset.jsonl


 10%|█         | 1/10 [00:00<00:01,  6.06it/s]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `gpt-4-turbo` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


 40%|████      | 4/10 [00:09<00:10,  1.83s/it]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `gpt-4-turbo` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `gpt-4-turbo` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


 80%|████████  | 8/10 [00:23<00:04,  2.48s/it]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `gpt-4-turbo` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


100%|██████████| 10/10 [00:30<00:00,  3.06s/it]


Generation complete.





That partly worked: 6 of the 10 generations succeeded, but 4 failed with a 404 `model_not_found` error for `gpt-4-turbo`, and I don't yet know why the same model name fails only some of the time. Next, I want to experiment with a few different OpenAI models.
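One way to tell transient failures from persistent ones is to retry each call a few times and keep every error message. The sketch below is a generic wrapper, not something from the OpenAI SDK; `call_with_retries` and its parameters are hypothetical. It assumes the wrapped callable raises on failure, whereas `make_problem_unanswerable` above catches exceptions and returns `None`, so that function would need to re-raise for this to apply.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


def call_with_retries(
    fn: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> Tuple[Optional[Any], List[str]]:
    """Call `fn` up to `max_attempts` times with exponential backoff.

    Returns (result, errors): `result` is None if every attempt failed,
    and `errors` lists one message per failed attempt.
    """
    errors: List[str] = []
    for attempt in range(max_attempts):
        try:
            return fn(), errors
        except Exception as exc:
            errors.append(f"attempt {attempt + 1}: {type(exc).__name__}: {exc}")
            # Wait longer after each failure, but not after the last one
            if attempt < max_attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return None, errors
```

If a call succeeds on the second or third attempt, the failure was transient; if all attempts fail with the same message, the problem is more likely the model name or account access.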

In [16]:
# The core function that calls the LLM
def make_problem_unanswerable(problem_text, modification_type, modification_instruction, model):
    system_prompt = "You are an expert in curriculum design and mathematical pedagogy. Your task is to subtly modify a solvable math problem to make it unanswerable, for the purpose of testing a student's critical thinking."

    user_prompt = f"""
    Please rewrite the following math problem.

    **Original Problem:**
    "{problem_text}"

    **Modification Type:**
    {modification_type}

    **Instruction:**
    {modification_instruction}

    **Your Task:**
    1.  Rewrite the problem according to the instruction.
    2.  Make the *minimal necessary change*. The problem should still look like a plausible, well-formed math problem.
    3.  Do NOT use placeholders like '[missing information]' or '[contradiction]'. The change should be subtle.
    4.  Output a JSON object with three keys:
        - "unanswerable_problem": The full text of the newly generated unanswerable problem.
        - "change_summary": A brief, one-sentence description of what you changed.
        - "reasoning": A clear explanation of why the new problem is unanswerable, directly referencing the modification type.

    Example JSON output format:
    {{
      "unanswerable_problem": "A bakery sells chocolate cakes for $18. On a certain day, it sold 10 cakes in total. How many chocolate cakes did it sell?",
      "change_summary": "I removed the price of vanilla cakes and the total revenue.",
      "reasoning": "This problem is now unanswerable due to insufficient_information. It is impossible to determine the number of each type of cake sold without knowing either the price of the other cake or the total revenue."
    }}
    """

    try:
        response = client.chat.completions.create(
            model=model,  # Model name supplied by the caller
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.5, # Lower temperature for more predictable, instruction-following behavior
        )
        return json.loads(response.choices[0].message.content)
    except Exception as e:
        print(f"An API error occurred: {e}")
        return None

print("Taxonomy and generation function are defined.")

Taxonomy and generation function are defined.


In [None]:
MODIFICATION_PAIRS = list(UNANSWERABILITY_TAXONOMY.items())

def run_baby_step_experiment(model, num_samples=10):
    """
    Run a baby-step experiment to generate unanswerable math problems.

    Results are written to baby-step-outputs/baby-step-outputs_{model}.jsonl.

    Args:
        model (str): The OpenAI model to use for generation.
        num_samples (int): Number of samples to generate.
    """
    # Set a seed for reproducibility
    random.seed(42)

    # Get a random subset of the data to work with
    indices = random.sample(range(len(gsm8k_train)), num_samples)

    print(f"Starting generation of {num_samples} samples...")

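    # Make sure the output directory exists before opening the file
    os.makedirs('baby-step-outputs', exist_ok=True)
    with open(f'baby-step-outputs/baby-step-outputs_{model}.jsonl', 'w') as f: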
        # Using tqdm for a progress bar, which works great in notebooks
        for i, index in tqdm(enumerate(indices)):
            original_problem = gsm8k_train[index]['question']
            if i % 2 == 0:
                mod_type_key, mod_instruction = MODIFICATION_PAIRS[0]
            else:
                mod_type_key, mod_instruction = MODIFICATION_PAIRS[1]

            generated_data = make_problem_unanswerable(original_problem, mod_type_key, mod_instruction, model)
            
            if generated_data:
                final_record = {
                    "original_problem": original_problem,
                    "unanswerable_problem": generated_data.get("unanswerable_problem"),
                    "modification_type": mod_type_key,
                    "change_summary": generated_data.get("change_summary"),
                    "reasoning": generated_data.get("reasoning"),
                }
                f.write(json.dumps(final_record) + "\n")

    print(f"\nGeneration complete.")

In [18]:
run_baby_step_experiment(model="gpt-4-turbo", num_samples=10)  

Starting generation of 10 samples...


10it [00:45,  4.57s/it]


Generation complete.





In [None]:
NUM_SAMPLES = 10
models_to_test = ["gpt-4o",
                  "gpt-4.1-mini",
                  "o3-mini",
                  "o4-mini"]  # the recorded run below used "o4-small", which returned 404 model_not_found

In [20]:
for model in models_to_test:
    print()
    run_baby_step_experiment(model=model, num_samples=NUM_SAMPLES)
    print()


Starting generation of 10 samples...


10it [00:27,  2.79s/it]



Generation complete.


Starting generation of 10 samples...


10it [00:29,  2.97s/it]



Generation complete.


Starting generation of 10 samples...


2it [00:00, 10.76it/s]

An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}
An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}
An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}


4it [00:00,  9.98it/s]

An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}


6it [00:00, 10.45it/s]

An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}
An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}
An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}


8it [00:00,  9.25it/s]

An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}


10it [00:01,  9.91it/s]


An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}
An API error occurred: Error code: 400 - {'error': {'message': "Unsupported parameter: 'temperature' is not supported with this model.", 'type': 'invalid_request_error', 'param': 'temperature', 'code': 'unsupported_parameter'}}

Generation complete.


Starting generation of 10 samples...


2it [00:00,  7.76it/s]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


5it [00:00,  9.81it/s]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


9it [00:01,  8.07it/s]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}


10it [00:01,  8.39it/s]

An API error occurred: Error code: 404 - {'error': {'message': 'The model `o4-small` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

Generation complete.
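The 400 errors above show that `o3-mini` rejects the `temperature` parameter. A minimal sketch of building the request arguments so that `temperature` is left out for such models; `build_request_kwargs` and `NO_TEMPERATURE_MODELS` are hypothetical names, and the set contains only `o3-mini` because that is the only model the log above confirms. Other models would have to be added by hand as they turn up.

```python
from typing import Any, Dict, FrozenSet, List

# Only "o3-mini" is confirmed by the error log; extend this set as needed.
NO_TEMPERATURE_MODELS: FrozenSet[str] = frozenset({"o3-mini"})


def build_request_kwargs(
    model: str,
    messages: List[Dict[str, str]],
    temperature: float = 0.5,
    no_temperature_models: FrozenSet[str] = NO_TEMPERATURE_MODELS,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a chat completion request."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "response_format": {"type": "json_object"},
    }
    # Skip temperature for models known to reject it
    if model not in no_temperature_models:
        kwargs["temperature"] = temperature
    return kwargs
```

Inside `make_problem_unanswerable`, the call would then become `client.chat.completions.create(**build_request_kwargs(model, messages))`.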






In [None]:
output_dict = {}
for model in ["gpt-4-turbo", "gpt-4o", "gpt-4.1-mini"]:
    data = []
    # Read from the same relative directory that run_baby_step_experiment writes to
    with open(f'baby-step-outputs/baby-step-outputs_{model}.jsonl', 'r') as f:
        for line in f:
            data.append(json.loads(line))
    output_dict[model] = data

print(output_dict)

{'gpt-4-turbo': [{'original_problem': 'For every 12 cans you recycle, you receive $0.50, and for every 5 kilograms of newspapers, you receive $1.50. If your family collected 144 cans and 20 kilograms of newspapers, how much money would you receive?', 'unanswerable_problem': 'For every 12 cans you recycle, you receive $0.50. If your family collected 144 cans and 20 kilograms of newspapers, how much money would you receive?', 'modification_type': 'insufficient_information', 'change_summary': 'I removed the payment information for recycling newspapers.', 'reasoning': 'This problem is now unanswerable due to insufficient information. Without knowing the amount of money received for each kilogram of newspapers recycled, it is impossible to calculate the total amount of money received from recycling both cans and newspapers.'}, {'original_problem': 'Betty picked 16 strawberries. Matthew picked 20 more strawberries than Betty and twice as many as Natalie. They used their strawberries to make 

In [None]:
import pandas as pd

pd.DataFrame(output_dict['gpt-4-turbo'])

| | original_problem | unanswerable_problem | modification_type | change_summary | reasoning |
|---|---|---|---|---|---|
| 0 | For every 12 cans you recycle, you receive \$0.... | For every 12 cans you recycle, you receive \$0.... | insufficient_information | I removed the payment information for recyclin... | This problem is now unanswerable due to insuff... |
| 1 | Betty picked 16 strawberries. Matthew picked 2... | Betty picked 16 strawberries. Matthew picked 2... | contradictory_information | Added a statement that Matthew picked a total ... | The problem becomes unanswerable due to contra... |
| 2 | Jack has a stack of books that is 12 inches th... | Jack has a stack of books that is 12 inches th... | insufficient_information | Removed the number of books Jack has. | This problem is now unanswerable due to insuff... |
| 3 | James dumps his whole collection of 500 Legos ... | James dumps his whole collection of 500 Legos ... | contradictory_information | Added a statement that James counts 240 Legos ... | This problem becomes unanswerable because of c... |
| 4 | Ines had \$20 in her purse. She bought 3 pounds... | Ines had \$20 in her purse. She bought 3 pounds... | insufficient_information | I removed the price per pound of the peaches. | This problem is now unanswerable due to insuff... |
| 5 | Aaron pays his actuary membership fees each ye... | Aaron pays his actuary membership fees each ye... | contradictory_information | Added a statement that the membership fee rema... | The problem is now unanswerable due to contrad... |
| 6 | Joseph invested \$1000 into a hedge fund. The f... | Joseph invested \$1000 into a hedge fund. The f... | insufficient_information | I removed the specification of the initial inv... | This problem is now unanswerable due to insuff... |
| 7 | The price of buying a wooden toy at the new Cr... | The price of buying a wooden toy at the new Cr... | contradictory_information | Added a statement that Kendra used a \$50 bill,... | This problem is now unanswerable due to contra... |
| 8 | James is trying to create a new breed of kitte... | James is trying to create a new breed of kitte... | insufficient_information | I removed the initial tail length of the first... | This problem is now unanswerable due to insuff... |
| 9 | The Rotary Club is holding its annual fundrais... | The Rotary Club is holding its annual fundrais... | contradictory_information | Changed the number of omelets seniors eat from... | This problem is now unanswerable due to contra... |
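To compare models at a glance, it may help to count how many records each one produced per modification type, since failed calls write nothing. The sketch below uses only the standard library and works on a dictionary shaped like `output_dict` above; `summarize_outputs` is a hypothetical helper and the toy data is made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def summarize_outputs(outputs: Dict[str, List[dict]]) -> Dict[str, Dict[str, int]]:
    """Count records per modification type for each model."""
    return {
        model: dict(Counter(r.get("modification_type", "unknown") for r in records))
        for model, records in outputs.items()
    }


# Toy data in the same shape as output_dict (not real results)
toy_outputs = {
    "model-a": [
        {"modification_type": "insufficient_information"},
        {"modification_type": "contradictory_information"},
        {"modification_type": "insufficient_information"},
    ],
    "model-b": [],
}
print(summarize_outputs(toy_outputs))
```

A model whose counts fall short of the requested sample size had calls that failed or returned nothing usable.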
