# Project Summary: Fine-Tuning an LLM for Mathematical Problem Classification

The core objective of this project is to fine-tune a small, efficient Large Language Model (LLM) to classify mathematical word problems into three distinct categories based on their solvability.

The methodology is divided into three main phases:

### 1. Rigorous Dataset Generation via Code Formalization

The primary challenge is creating a high-quality, verifiably correct dataset. This is addressed by converting each natural language math problem (from a source like GSM8K) into a parameterized Python function.

*   A powerful generator LLM is used to translate the problem's text and step-by-step solution into a generalized `solve()` function.
*   This function acts as a formal, executable representation of the problem's underlying logic.
*   By making the problem's numerical values the function's arguments, the logic becomes testable and easy to manipulate.
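
To make the idea concrete, here is a minimal sketch of what such a formalized function could look like. The word problem, argument names, and values are hypothetical and chosen only for illustration; they are not taken from GSM8K.

```python
def solve(
    hourly_wage: float = 25.0,  # "earns $25 per hour" (hypothetical problem text)
    hours_per_day: int = 8,     # "works 8 hours a day"
    days_worked: int = 5,       # "for 5 days"
):
    """Code for a hypothetical example problem (not a GSM8K index).
    Returns the total wages earned over the work period."""
    # Daily pay is 25*8 = <<25*8=200>>200
    daily_pay = hourly_wage * hours_per_day
    # Total pay is 200*5 = <<200*5=1000>>1000
    total_pay = daily_pay * days_worked
    # The final answer is the total pay
    return total_pay


# Defaults reproduce the original problem; overriding one reuses the same logic.
print(solve())
print(solve(hours_per_day=6))
```

Because every stated number is a keyword argument with a default, the original answer can be re-checked by calling `solve()` and variants can be produced by overriding a single parameter.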

### 2. Creating a Labeled Dataset with Three Solvability Classes

Using the verified Python functions from Phase 1, the final labeled dataset is constructed by programmatically modifying the original problems to fit into one of three classes:

*   **Class 1: Has a Unique Solution**
    *   This is the original, verified problem where all parameters are defined, leading to a single correct answer.

*   **Class 2: Has Multiple Solutions**
    *   Generated by taking a Class 1 problem and removing a key piece of numerical information from the problem statement. This makes the problem underspecified, as different values for the now-missing parameter would lead to different valid solutions.

*   **Class 0: Has No Solution**
    *   Generated by manipulating the parameters of the Python function to yield a logically or physically absurd result (e.g., a negative count of objects) or by introducing a direct contradiction into the problem statement.
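
The three classes above could be assigned programmatically along the following lines. This is only a sketch under stated assumptions: the `solve` function, the `label_variant` helper, and the "negative result means no solution" rule are illustrative and stand in for whatever checks the real pipeline applies.

```python
from typing import Optional

# Class ids as used in this summary
NO_SOLUTION, UNIQUE_SOLUTION, MULTIPLE_SOLUTIONS = 0, 1, 2


def solve(apples_start: int = 10, apples_eaten: int = 3) -> int:
    """Hypothetical formalized problem: returns the number of apples left."""
    return apples_start - apples_eaten


def label_variant(result: Optional[float], parameter_removed: bool) -> int:
    """Assign a solvability class to one problem variant (illustrative rule set)."""
    if parameter_removed:
        # A value was deleted from the problem text, so the answer is not determined
        return MULTIPLE_SOLUTIONS
    if result is None or result < 0:
        # Assumption: a negative count is treated as physically absurd
        return NO_SOLUTION
    return UNIQUE_SOLUTION


print(label_variant(solve(), parameter_removed=False))                 # original problem
print(label_variant(solve(apples_eaten=12), parameter_removed=False))  # perturbed parameters
print(label_variant(None, parameter_removed=True))                     # underspecified text
```

A negativity check covers only the example mentioned above; contradictions introduced directly into the problem text would need to be labeled at the point where the text is edited.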

### 3. Fine-Tuning the Classifier LLM

The resulting dataset, with its high-confidence labels, is used to fine-tune a smaller, more efficient LLM. The final model will be trained to take a new math problem as input and output its classification (Class 1, 2, or 0), having learned the underlying patterns of solvability, ambiguity, and contradiction from the generated data.

In [86]:
import pandas as pd
import numpy as np
import time

import importlib
import inspect
import os
import re
import json
import random
import openai
import google.generativeai as genai
import anthropic
from openai import OpenAI
from datasets import load_dataset
from tqdm import tqdm
from dotenv import load_dotenv

from typing import List, Dict, Any

In [87]:
# Initialize clients
load_dotenv()
openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
anthropic_client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

# Load the GSM8K dataset (train split)
gsm8k_train = load_dataset("gsm8k", "main", split="train")

In [88]:
SYSTEM_PROMPT = "You are an expert Python programmer specializing in data formalization. Your role is to meticulously convert natural language math problems and their step-by-step solutions into a single, well-structured Python function. You will be presented with examples of the required format followed by a final task to complete."

PROMPT_GUIDELINES = """### Guidelines

1.  **Function Naming & Docstring:** The function must be named `solve`. It must begin with a docstring that has exactly two lines:
    *   The first line must be: "Code for Q [Index] from the GSM8K dataset (train).", using the index from the task header.
    *   The second line must be a succinct, one-sentence description of what the function returns (e.g., "Returns the total cost of wages and taxes.").

2.  **Function Arguments:** The function arguments must be derived from the 'Question' text. 
    *   Create a distinct argument for every numerical value that is directly stated in the text.
    *   **Note:** Some of these arguments may end up not being used in the function body. This is expected. Do not worry about this and leave the unused arguments in the function signature.

3.  **Argument Formatting:** Each argument must include a type-hint (e.g., `int`, `float`) and a default value equal to its value in the 'Question'. You must also add a comment (`#`) next to each argument that quotes or describes the phrase in the 'Question' it comes from.

4.  **Function Body:** The body of the function should follow the logic of the provided 'Solution'. Each relevant line from the 'Solution' that involves a computation must be included as a comment, immediately followed by the Python code that formalizes that step.

5.  **Calculator Annotations:** Pay close attention to the calculator annotations (e.g., `<<25*8=200>>`) in the 'Solution' as they reveal the precise mathematical operations to implement.

6.  **Final Answer Comment:** Before the final `return` statement, you must add a comment identifying the variable that holds the final answer (e.g., `# The final answer is the grand total`)."""
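
Several of these guidelines are structural and could be checked automatically on the generated code. The sketch below uses the standard-library `ast` module; `check_solve_format` is a hypothetical helper, not part of the notebook. Comments are not preserved in the AST, so the comment-related guidelines (3, 4, and 6) are not covered here.

```python
import ast
from typing import List


def check_solve_format(source: str) -> List[str]:
    """Return a list of guideline violations found in generated code."""
    tree = ast.parse(source)
    funcs = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(funcs) != 1 or funcs[0].name != "solve":
        return ["expected exactly one top-level function named `solve`"]

    func = funcs[0]
    problems = []

    # Guideline 1: docstring with exactly two lines
    doc = ast.get_docstring(func)
    if doc is None or len(doc.strip().splitlines()) != 2:
        problems.append("docstring must have exactly two lines")

    # Guideline 3: every argument has a default value and a type hint
    args = func.args.args
    if len(func.args.defaults) != len(args):
        problems.append("every argument needs a default value")
    for arg in args:
        if arg.annotation is None:
            problems.append(f"argument `{arg.arg}` lacks a type hint")

    # The function must return its final answer
    if not any(isinstance(node, ast.Return) for node in ast.walk(func)):
        problems.append("function has no return statement")
    return problems


example = '''def solve(a: int = 1, b: float = 2.0):
    """Code for Q 5 from the GSM8K dataset (train).
    Returns the sum of both values."""
    return a + b
'''
print(check_solve_format(example))
```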

In [89]:
indices = [310, 3822, 2345, 1202, 7371]
code_strings = {}

for idx in indices:
    module = importlib.import_module(f"code_examples._{idx}")
    code = inspect.getsource(module.solve)
    code_strings[idx] = code

In [90]:
def _format_prompt_query(
        index: int, 
        code_strings: dict = code_strings,
        with_code: bool = False
):
    """
    Internal helper function to format a single entry.
    It creates the text for a problem's Index, Question, Solution, and (if `with_code == True`) the corresponding code.
    """
    sample = gsm8k_train[index] # type: ignore
    question = sample["question"]
    raw_answer = sample["answer"]
    solution = raw_answer.split('####')[0].strip()
    out = \
f"""*Index*: 
{index}

*Question*: 
{question}

*Solution*: 
{solution}

*Code*:
"""
    if with_code:
        out += f"""\n```python
{code_strings[index]}
```
"""
    return out
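
The `####` split used above, together with the calculator annotations mentioned in the guidelines, can be illustrated on a small string. This is a sketch: the answer text is made up in the GSM8K answer format, and `split_answer` and `extract_annotations` are hypothetical helper names.

```python
import re

# Matches calculator annotations of the form <<expression=result>>
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")


def split_answer(raw_answer: str):
    """Split a raw answer into (step-by-step solution, final answer)."""
    solution, _, final = raw_answer.partition("####")
    return solution.strip(), final.strip()


def extract_annotations(solution: str):
    """Return (expression, result) pairs for every calculator annotation."""
    return ANNOTATION.findall(solution)


# Made-up answer string in the GSM8K format
raw = "She earns 25*8 = <<25*8=200>>200 dollars per day.\n#### 200"
solution, final_answer = split_answer(raw)
print(solution)
print(final_answer)
print(extract_annotations(solution))
```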

In [91]:
def craft_user_prompt(
    index: int,
    example_indices: List[int],
    code_examples: Dict[int, str]
    ):
    """
    Generate a complete user prompt for code generation.

    Assembles the guidelines, few-shot examples, and the final unsolved task
    into a single string, ready to be sent to an LLM.

    Args:
        index: The index of the target problem to generate code for.
        example_indices: A list of indices to use as few-shot examples.
        code_examples: A dictionary mapping example indices to their code strings.

    Returns:
        A single string containing the full user prompt.
    """
    # This function assumes a variable `PROMPT_GUIDELINES` exists in its scope.

    # Generate the formatted strings for the few-shot examples
    example_prompts = [
        _format_prompt_query(index=idx, 
                             code_strings=code_examples,
                             with_code=True)
        for idx in example_indices
    ]

    # Generate the formatted string for the final task to be completed by the LLM
    task_prompt = _format_prompt_query(index=index, code_strings=code_examples)

    # Combine all parts into a single prompt string.
    # Parts are joined with single newlines; the section markers carry their own
    # surrounding newlines, so major sections end up separated by a blank line.
    full_prompt = "\n".join([
        PROMPT_GUIDELINES,
        "\n--- EXAMPLES ---\n",
        "\n".join(example_prompts),
        "--- TASK ---\n",
        task_prompt
    ])

    return full_prompt

In [92]:
check = craft_user_prompt(
    index = 5,
    example_indices= [310, 3822, 7371],
    code_examples=code_strings
)
print(check)

### Guidelines

1.  **Function Naming & Docstring:** The function must be named `solve`. It must begin with a docstring that has exactly two lines:
    *   The first line must be: "Code for Q [Index] from the GSM8K dataset (train).", using the index from the task header.
    *   The second line must be a succinct, one-sentence description of what the function returns (e.g., "Returns the total cost of wages and taxes.").

2.  **Function Arguments:** The function arguments must be derived from the 'Question' text. 
    *   Create a distinct argument for every numerical value that is directly stated in the text.
    *   **Note:** Some of these arguments may end up not being used in the function body. This is expected. Do not worry about this and leave the unused arguments in the function signature.

3.  **Argument Formatting:** Each argument must include a type-hint (e.g., `int`, `float`) and a default value equal to its value in the 'Question'. You must also add a comment (`#`) next to e

In [93]:
model_dict = \
{
  "anthropic": [
    "claude-sonnet-4-20250514",
    "claude-3-7-sonnet-20250219",
    "claude-3-5-sonnet-20240620",
    "claude-3-5-haiku-20241022",
    "claude-3-haiku-20240307"
  ],
  "openai": [
    "gpt-4.1",
    "o3-mini",
    "o4-mini",
    "gpt-4.1-mini"
  ],
  "google": [
    "gemini-2.5-pro-preview-06-05",
    "gemini-2.5-pro",
    "gemini-2.5-flash",
    "gemini-2.5-flash-preview-04-17-thinking",
    "gemini-2.0-flash-thinking-exp",
    "gemini-2.5-flash-lite-preview-06-17"
  ]
}

In [94]:
def call_model_api(
        provider: str, 
        model: str, 
        system_prompt: str, 
        user_prompt: str):
    """
    Calls the appropriate LLM API based on the provider and returns the raw text response.
    
    This function handles special cases for reasoning models like o3-mini that do not
    support the temperature parameter.

    Args:
        provider: The API provider ("google", "anthropic", or "openai").
        model: The specific model name to use.
        system_prompt: The system-level instructions for the model.
        user_prompt: The user-level prompt containing examples and the task.

    Returns:
        The model's generated text content as a string, or None if an error occurs.
    """
    try:
        if provider == "google":
            gemini = genai.GenerativeModel(
                model_name=model,
                system_instruction=system_prompt
            )
            generation_config = genai.types.GenerationConfig(
                temperature=0.1,
                max_output_tokens=4000
            )
            response = gemini.generate_content(
                user_prompt,
                generation_config=generation_config
            )
            return response.text

        elif provider == "anthropic":
            response = anthropic_client.messages.create(
                model=model,
                max_tokens=4000,
                temperature=0.1,
                system=system_prompt,
                messages=[{"role": "user", "content": user_prompt}]
            )
            return response.content[0].text

        elif provider == "openai":
            # Prepare the arguments for the API call
            kwargs = {
                "model": model,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ]
            }
            
            # Reasoning models (o3-mini, o4-mini) do not support `temperature`,
            # so sampling and length parameters are set only for the other models.
            if model not in ["o3-mini", "o4-mini"]:
                kwargs["temperature"] = 0.1
                kwargs["max_tokens"] = 4000

            response = openai_client.chat.completions.create(**kwargs)
            return response.choices[0].message.content
        
        else:
            print(f"Unknown provider: {provider}")
            return None
            
    except Exception as e:
        print(f"An API error occurred for {provider} model {model}: {e}")
        return None
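
Since `call_model_api` returns `None` on any error, a caller may want to retry transient failures before giving up. The wrapper below is a minimal sketch of that idea; `call_with_retries`, its parameters, and the backoff schedule are assumptions, not part of the notebook.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    call: Callable[[], Optional[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Invoke `call` until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = call()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None


# Intended usage with the function defined above (not executed here):
# response = call_with_retries(
#     lambda: call_model_api("openai", "gpt-4.1", SYSTEM_PROMPT, user_prompt)
# )
```

Passing the call as a zero-argument callable keeps the wrapper independent of any particular provider.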

In [95]:
def generate_GSM8K_code(
    model_dict: Dict[str, List[str]],
    indices_to_generate: List[int],
    example_indices: List[int],
    system_prompt: str = SYSTEM_PROMPT
):
    """
    Calls multiple LLM APIs, saves the raw output, and logs performance.

    Args:
        model_dict: Dictionary of providers and their models to test.
        indices_to_generate: List of GSM8K problem indices to generate code for.
        example_indices: List of indices to use as few-shot examples in the prompt.
        system_prompt: The system prompt to send to the models.
    """

    # 1. Initialize performance logging
    performance_data = []
    base_output_dir = 'code_generation_outputs'
    os.makedirs(base_output_dir, exist_ok=True)

    # Loop over each problem you want to solve
    for index in indices_to_generate:
        print(f"\n{'='*20} Starting Generation for Index: {index} {'='*20}")

        # 2. Create the output directory for the current problem index
        problem_dir = os.path.join(base_output_dir, str(index))
        os.makedirs(problem_dir, exist_ok=True)
        print(f"Output directory: {problem_dir}")

        # 3. Create the user prompt once for this problem index
        print("Crafting user prompt...")
        user_prompt = craft_user_prompt(
            index=index,
            example_indices=example_indices,
            code_examples=code_strings
        )

        # 4. Loop over each provider and model in your dictionary
        for provider, models in model_dict.items():
            for model_name in models:
                print(f"\n--- Calling {provider.capitalize()} model: {model_name} ---")

                # 5. Call the API and time the request
                # perf_counter() is monotonic, which makes it better suited to timing intervals
                start_time = time.perf_counter()
                raw_response = call_model_api(provider, model_name, system_prompt, user_prompt)
                end_time = time.perf_counter()
                
                time_taken = end_time - start_time
                print(f"  Response received in {time_taken:.2f} seconds.")

                # Log the performance data
                performance_data.append({
                    'provider': provider,
                    'model': model_name,
                    'index': index,
                    'time_taken': time_taken
                })

                # 6. Save the raw response to a file
                if raw_response:
                    output_filename = f'{provider}_{model_name}.txt'
                    output_path = os.path.join(problem_dir, output_filename)
                    try:
                        with open(output_path, 'w', encoding='utf-8') as f:
                            f.write(raw_response)
                        print(f"  Successfully saved raw output to: {output_path}")
                    except IOError as e:
                        print(f"  Error: Failed to write file. Reason: {e}")
                else:
                    print("  No response received. Skipping file save.")

    print(f"\n{'='*20} Generation Complete {'='*20}")

    # 7. Save the performance data to a CSV file at the end
    df = pd.DataFrame(performance_data)
    csv_path = os.path.join(base_output_dir, 'generation_performance.csv')
    df.to_csv(csv_path, index=False)
    print(f"Performance data successfully saved to {csv_path}; displayed below:")
    display(df)
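
The files written above hold each model's raw text, which may wrap the function in a Markdown code fence, as the few-shot examples do. A later step would need to pull out the code itself. The sketch below shows one way to do that; `extract_code` is a hypothetical helper, and falling back to the whole response when no fence is found is an assumption.

```python
import re

# Build the fence marker programmatically so it does not clash with this listing
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(raw_response: str) -> str:
    """Return the first fenced code block, or the stripped response if none exists."""
    match = CODE_BLOCK.search(raw_response)
    if match:
        return match.group(1).strip()
    return raw_response.strip()


# Made-up model response for illustration
raw = "Here is the function:\n" + FENCE + "python\ndef solve():\n    return 1\n" + FENCE + "\n"
print(extract_code(raw))
```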

In [96]:
# A few problem indices to generate code for in this test run
problems_to_solve = [3779, 4483, 6237]

# The hand-made examples to include in the prompt
examples_for_prompt = [310, 3822, 7371]

# --- Run the main generation function ---
print(f"Starting test run for indices {problems_to_solve} across all models...")
generate_GSM8K_code(
    model_dict=model_dict,
    indices_to_generate=problems_to_solve,
    example_indices=examples_for_prompt
)

Starting test run for indices [3779, 4483, 6237] across all models...

Output directory: code_generation_outputs/3779
Crafting user prompt...

--- Calling Anthropic model: claude-sonnet-4-20250514 ---
  Response received in 14.59 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-sonnet-4-20250514.txt

--- Calling Anthropic model: claude-3-7-sonnet-20250219 ---
  Response received in 9.77 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-3-7-sonnet-20250219.txt

--- Calling Anthropic model: claude-3-5-sonnet-20240620 ---
  Response received in 13.57 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-3-5-sonnet-20240620.txt

--- Calling Anthropic model: claude-3-5-haiku-20241022 ---
  Response received in 9.21 seconds.
  Successfully saved raw output to: code_generation_outputs/3779/anthropic_claude-3-5-haiku-20241022.txt

--- Calling Anthropic model: claude-3-haiku-2

|   | provider | model | index | time_taken |
|---|---|---|---|---|
| 0 | anthropic | claude-sonnet-4-20250514 | 3779 | 14.588613 |
| 1 | anthropic | claude-3-7-sonnet-20250219 | 3779 | 9.771739 |
| 2 | anthropic | claude-3-5-sonnet-20240620 | 3779 | 13.573699 |
| 3 | anthropic | claude-3-5-haiku-20241022 | 3779 | 9.208251 |
| 4 | anthropic | claude-3-haiku-20240307 | 3779 | 7.41134 |
| 5 | openai | gpt-4.1 | 3779 | 7.136371 |
| 6 | openai | o3-mini | 3779 | 10.64477 |
| 7 | openai | o4-mini | 3779 | 23.341708 |
| 8 | openai | gpt-4.1-mini | 3779 | 9.332341 |
| 9 | google | gemini-2.5-pro-preview-06-05 | 3779 | 18.293231 |
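
The timing rows shown above can be summarized per provider. The sketch below uses only the standard library and the ten rows visible here; `mean_time_by` is a hypothetical helper. With only one problem index and a partial set of rows, the averages are not a benchmark.

```python
from collections import defaultdict
from statistics import mean

# (provider, model, time_taken in seconds), copied from the rows shown above
rows = [
    ("anthropic", "claude-sonnet-4-20250514", 14.588613),
    ("anthropic", "claude-3-7-sonnet-20250219", 9.771739),
    ("anthropic", "claude-3-5-sonnet-20240620", 13.573699),
    ("anthropic", "claude-3-5-haiku-20241022", 9.208251),
    ("anthropic", "claude-3-haiku-20240307", 7.41134),
    ("openai", "gpt-4.1", 7.136371),
    ("openai", "o3-mini", 10.64477),
    ("openai", "o4-mini", 23.341708),
    ("openai", "gpt-4.1-mini", 9.332341),
    ("google", "gemini-2.5-pro-preview-06-05", 18.293231),
]


def mean_time_by(rows, key_index: int = 0) -> dict:
    """Average the time_taken column, grouped by the column at `key_index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[-1])
    return {key: mean(times) for key, times in groups.items()}


for provider, avg in mean_time_by(rows).items():
    print(f"{provider}: {avg:.2f} s")
```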
