# Exercise 3: Mathematical Problem Solving with LLMs

**This is a marked exercise (graded)**

Apply LLMs to solve mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using an accuracy metric with a numerical tolerance

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on test set (100 problems)
- Score: accuracy, where a prediction counts as correct if it matches the solution to 2 decimal places (passing threshold: 70%)

## Part 1: Setup and Load Data

In [2]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib requests -q


In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import requests
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")


Using device: cpu


## Part 2: Download Dataset

Download the math problem dataset (1000 problems: 900 train, 100 test).

In [3]:
# URLs for the dataset files
base_url = 'https://www.raphaelcousin.com/modules/data-science-practice/module8/exercise/'

train_url = base_url + 'train.csv'
test_url = base_url + 'test.csv'

def download_file(url, filename):
    """Download a file from URL."""
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")

# Download files
download_file(train_url, 'train.csv')
download_file(test_url, 'test.csv')

Downloaded train.csv
Downloaded test.csv


In [4]:
# Load the datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

# Display category distribution
print("\nTraining set category distribution:")
print(train_data['category'].value_counts().sort_index())

print("\nSample training problems:")
print(train_data.head(10))

Train set size: 900
Test set size: 100

Training set category distribution:
category
algebra          150
arithmetic       153
fractions        143
geometry         155
percentage       152
word_problems    147
Name: count, dtype: int64

Sample training problems:
   id       category                                            problem  \
0   0     percentage                                Increase 109 by 25%   
1   1     arithmetic                                   What is 76 + 55?   
2   2  word_problems  Sarah has $286. She spends $128. How much mone...   
3   3       geometry  What is the circumference of a circle with rad...   
4   4       geometry   What is the volume of a cube with side length 3?   
5   5     percentage                                 What is 7% of 132?   
6   6  word_problems  John is 10 years old now. How old was he 15 ye...   
7   7      fractions                  What is 1/5 + 2/5? (decimal form)   
8   8     percentage                                What is 2

## Part 3: Baseline - Dummy Model

Create a baseline to understand what poor performance looks like.

In [5]:
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.

    Two values are considered equal if their difference is <= tolerance
    OR if they round to the same value at 2 decimal places.
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        # Check if both round to same 2 decimal places
        if round(pred, 2) == round(truth, 2):
            correct += 1
        # Or if absolute difference is very small
        elif abs(pred - truth) <= tolerance:
            correct += 1

    return correct / len(predictions)

# Dummy baseline: always predict the mean
mean_solution = train_data['solution'].mean()
print(f"Dummy model (always predicts mean): {mean_solution:.2f}")
print("This demonstrates very poor performance. Your model should do much better!")

Dummy model (always predicts mean): 150.79
This demonstrates very poor performance. Your model should do much better!
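
The cell above prints the mean prediction but does not score it. A minimal sketch of how the dummy baseline could be evaluated under the same tolerance rule follows. `tolerance_accuracy` restates `check_accuracy` so the block runs on its own, and the `solutions` values are hypothetical stand-ins for `train_data['solution']`.

```python
def tolerance_accuracy(predictions, ground_truth, tolerance=0.01):
    """Same rule as check_accuracy: equal at 2 decimals, or within `tolerance`."""
    correct = sum(
        1
        for pred, truth in zip(predictions, ground_truth)
        if round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance
    )
    return correct / len(predictions)

# Hypothetical solutions standing in for train_data['solution']
solutions = [136.25, 131.0, 158.0, 27.0]

# Dummy baseline: predict the mean for every problem
mean_prediction = sum(solutions) / len(solutions)
baseline_accuracy = tolerance_accuracy([mean_prediction] * len(solutions), solutions)
print(f"Dummy baseline accuracy: {baseline_accuracy:.2%}")

# The tolerance accepts small rounding differences but not larger errors
near_miss = tolerance_accuracy([0.34], [1 / 3])   # difference is below 0.01
clear_miss = tolerance_accuracy([0.35], [1 / 3])  # difference is above 0.01
```

On the real training set the same call would be `check_accuracy([mean_solution] * len(train_data), train_data['solution'])`.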


## Part 4: Utility Functions

Helper functions to extract numerical answers from model outputs.

In [6]:
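def extract_number(text):
    """
    Extract a number from text. Return None if no number is found.

    Patterns are tried in order of specificity: a number that follows a
    keyword such as "answer" or "=" is preferred, then a number at the
    end of the text, then the first number anywhere in the text.

    Handles various formats:
    - "The answer is 42"
    - "42"
    - "= 42"
    - "Result: 42.5"
    - Negative numbers: "-15"
    """
    # Try different patterns in order of specificity
    patterns = [
        r'(?:answer|result|equals?|=)\s*(?:is\s*)?:?\s*(-?\d+\.?\d*)',  # "answer is 42", "Result: 42.5" or "= 42"
        r'(-?\d+\.?\d*)\s*$',  # Number at the end
        r'(-?\d+\.?\d*)',  # Any number (first occurrence)
    ]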

    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1))
            except (ValueError, IndexError):
                continue

    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42.5!",
    "No number here",
    "The value is -15"
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")

Number extraction tests:
  'The answer is 42' -> 42.0
  '42' -> 42.0
  '15 + 27 = 42' -> 42.0
  'Calculating... the result is 42.5!' -> 42.5
  'No number here' -> None
  'The value is -15' -> -15.0
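
Step-by-step responses usually contain intermediate results before the final answer, so a keyword-first search can return an intermediate value. One possible alternative, sketched below, is to take the last number in the response. `extract_last_number` is a hypothetical helper, not part of the exercise template, and it assumes commas inside a number are thousands separators.

```python
import re

# Optional sign, digits with optional thousands separators, optional decimals
_NUMBER = re.compile(r'-?\d+(?:,\d{3})*(?:\.\d+)?')

def extract_last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    # Remove thousands separators before converting
    return float(matches[-1].replace(',', ''))

cot_response = "109 * 0.25 = 27.25, so 109 + 27.25 = 136.25"
print(extract_last_number(cot_response))
```

This trades one failure mode for another: it ignores intermediate values, but it returns the wrong number if the model keeps generating text that contains numbers after its final answer.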


## Part 5: Load Pre-trained Model

Load a small, efficient model for math problem solving.

In [7]:
model_name = "microsoft/phi-2"

print(f"Loading model: {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

model = model.to(device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=model.device.index if model.device.type == 'cuda' else -1
)

print(f"Model loaded successfully!")
# Report the model size (Phi-2 has about 2.7 billion parameters)
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
print("Pipeline 'generator' is ready to use.")

Loading model: microsoft/phi-2...

Device set to use cpu


Model loaded successfully!
Model size: 2779.7M parameters
Pipeline 'generator' is ready to use.


## Part 6: Prompting Strategies

Test different prompt templates to improve model performance.

In [None]:
# 1. Load the Qwen model (this replaces the Phi-2 model loaded in Part 5).
# The imports and `device` from Part 1 are reused here.
model_name = "Qwen/Qwen1.5-1.8B"

print(f"Loading model: {model_name}...")
# trust_remote_code=True is kept for consistency with Part 5; recent transformers
# releases support Qwen1.5 natively, so the flag is likely unnecessary there
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

model = model.to(device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Initialize the text-generation pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=model.device.index if model.device.type == 'cuda' else -1
)

print(f"Model loaded successfully! Size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

# 2. Define the generation function
def generate_answer(problem, prompt_template="simple", max_new_tokens=50, temperature=0.1):
    """Build a prompt for `problem` with the chosen template and return the model's continuation."""

    if prompt_template == "simple":
        prompt = f"{problem}\nAnswer:"

    elif prompt_template == "instruction":
        prompt = f"Solve this math problem and provide only the numerical answer.\n\nProblem: {problem}\nAnswer:"

    elif prompt_template == "cot":
        prompt = f"Solve this math problem step by step, then provide the final numerical answer.\n\nProblem: {problem}\nSolution:\n"

    elif prompt_template == "few_shot":
        # Use the first 5 training problems as worked examples
        examples = []
        for i in range(min(5, len(train_data))):
            examples.append(f"Problem: {train_data['problem'].iloc[i]}\nAnswer: {train_data['solution'].iloc[i]}")

        examples_text = "\n\n".join(examples)
        prompt = f"Here are some example problems and solutions:\n\n{examples_text}\n\nNow solve this problem:\nProblem: {problem}\nAnswer:"

    else:
        prompt = problem

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)

    # Sample only when temperature > 0; otherwise decode greedily
    gen_kwargs = {"max_new_tokens": max_new_tokens, "pad_token_id": tokenizer.eos_token_id}
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature)
    else:
        gen_kwargs["do_sample"] = False

    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign if decoding does not reproduce the prompt exactly.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

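# 3. Test the different prompting strategies
# Index 5 is the first training problem that is not used as a few-shot example,
# so the few-shot prompt does not already contain its answer
test_problem = train_data['problem'].iloc[5]
test_solution = train_data['solution'].iloc[5]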

print("\n" + "="*70)
print(f"Testing problem: {test_problem}")
print(f"Correct answer: {test_solution}\n")
print("="*70)

for template in ["simple", "instruction", "cot", "few_shot"]:
    response = generate_answer(test_problem, template)
    extracted = extract_number(response)

    is_correct = check_accuracy([extracted if extracted is not None else 0.0], [test_solution])
    correct = "✓" if is_correct else "✗"

    print(f"{correct} {template}:")
    print(f"  Response: {response[:100]}{'...' if len(response) > 100 else ''}")
    print(f"  Extracted: {extracted}\n")

# 4. Evaluate on a validation set
best_template = "cot"  # Often the best choice for math-specialized models

val_data = train_data.tail(50).copy()

predictions_val = []
ground_truth_val = val_data['solution'].tolist()

print("\n" + "="*70)
print(f"Evaluating on {len(val_data)} validation problems (Template: {best_template})...")
print("="*70)

for idx, row in val_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    if prediction is None:
        prediction = 0.0

    predictions_val.append(prediction)

    if (len(predictions_val) % 10) == 0:
        print(f"Processed {len(predictions_val)}/{len(val_data)} problems...")

accuracy_val = check_accuracy(predictions_val, ground_truth_val)
print(f"\nValidation Accuracy ({best_template}): **{accuracy_val:.2%}**")
print(f"Target: 70% on the test set.")
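
The `few_shot` template above takes the first five training rows, which may not cover every problem category. A minimal sketch of a category-balanced alternative follows. `build_few_shot_prompt` is a hypothetical helper, and the `examples` rows are illustrative data that only assume the `category`, `problem`, and `solution` columns of `train.csv`.

```python
import pandas as pd

def build_few_shot_prompt(problem, examples, per_category=1):
    """Build a few-shot prompt with `per_category` examples from each category."""
    # GroupBy.head keeps the first n rows of every group, in original row order
    picked = examples.groupby("category").head(per_category)
    shots = [
        f"Problem: {row.problem}\nAnswer: {row.solution}"
        for row in picked.itertuples(index=False)
    ]
    return "\n\n".join(shots) + f"\n\nProblem: {problem}\nAnswer:"

# Hypothetical rows with the same columns as train.csv
examples = pd.DataFrame({
    "category": ["arithmetic", "arithmetic", "geometry"],
    "problem": ["What is 76 + 55?", "What is 9 - 4?",
                "What is the volume of a cube with side length 3?"],
    "solution": [131.0, 5.0, 27.0],
})

prompt = build_few_shot_prompt("What is 12 * 3?", examples)
print(prompt)
```

If validation rows are drawn from the same DataFrame, they should be excluded from `examples` so that no prompt contains the answer to the problem being scored.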

## Part 7: Evaluate on Validation Set

Evaluate your best prompting strategy on a held-out subset of the training data (here, the last 50 rows).

In [None]:
best_template = "few_shot"

val_data = train_data.tail(50).copy()

predictions = []
ground_truth = val_data['solution'].tolist()

print(f"Evaluating on {len(val_data)} validation problems using {best_template} prompting...\n")

for idx, row in val_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    if prediction is None:
        prediction = 0.0

    predictions.append(prediction)

    if (len(predictions) % 10) == 0:
        print(f"Processed {len(predictions)}/{len(val_data)} problems...")

accuracy = check_accuracy(predictions, ground_truth)
print(f"\nValidation Accuracy ({best_template}): **{accuracy:.2%}**")
print(f"Need to achieve: 70% on test set")

Evaluating on 50 validation problems using few_shot prompting...

Processed 10/50 problems...
Processed 20/50 problems...
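
A single accuracy figure hides which problem types fail. The sketch below shows one way to break validation accuracy down by category. `is_close` restates the rule in `check_accuracy`, and the `results` rows are hypothetical; in the notebook they would come from `val_data` and `predictions`.

```python
import pandas as pd

def is_close(pred, truth, tolerance=0.01):
    """Same rule as check_accuracy, applied to one prediction."""
    return round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance

# Hypothetical validation results
results = pd.DataFrame({
    "category": ["algebra", "algebra", "geometry", "geometry"],
    "solution": [4.0, 7.0, 27.0, 12.57],
    "prediction": [4.0, 9.0, 27.0, 12.566],
})

results["correct"] = [
    is_close(p, t) for p, t in zip(results["prediction"], results["solution"])
]

# The mean of a boolean column is the fraction of correct predictions
per_category = results.groupby("category")["correct"].mean()
print(per_category)
```

With only 50 validation problems across six categories, each per-category estimate rests on a handful of examples and should be read as a rough indication.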


## Part 8: Generate Test Predictions

Generate predictions for the test set and create submission file.

In [None]:
best_template = "few_shot"

print(f"Generating predictions on {len(test_data)} test problems using {best_template} prompting...\n")

test_predictions = []
test_ground_truth = test_data['solution'].tolist()

for idx, row in test_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    if prediction is None:
        prediction = 0.0
        print(f"⚠️ Warning: No number extracted for problem {idx}: {problem[:50]}...")

    test_predictions.append(prediction)

    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{len(test_data)} problems...")

print("\nAll test predictions generated!")

print("\n--- Final Evaluation ---")
final_accuracy = check_accuracy(test_predictions, test_ground_truth)

print(f"Final Test Accuracy ({best_template}): **{final_accuracy:.2%}**")
print(f"Goal: 70%")

## Part 9: Create Submission File

Save predictions in the required format for evaluation.

In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'solution': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission.head(10))

# Verify all predictions are numerical
non_numeric = submission['solution'].isna().sum()
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in incorrect answers. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

# Show statistics
print("\nPrediction statistics:")
print(submission['solution'].describe())
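
Beyond the missing-value check above, a few structural checks can catch a malformed file before submitting. This is a sketch under the assumption that the grader expects exactly the `id` and `solution` columns written above, with one row per test problem. `validate_submission` is a hypothetical helper.

```python
import pandas as pd

def validate_submission(submission, expected_ids):
    """Return a list of problems found in the submission (empty if none)."""
    problems = []
    if list(submission.columns) != ["id", "solution"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if not pd.api.types.is_numeric_dtype(submission["solution"]):
        problems.append("solution column is not numeric")
    if submission["solution"].isna().any():
        problems.append("solution column contains missing values")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids do not match the test set")
    return problems

# Hypothetical submissions checked against three expected ids
good = pd.DataFrame({"id": [0, 1, 2], "solution": [1.5, 2.0, -3.0]})
bad = pd.DataFrame({"id": [0, 1, 1], "solution": [1.5, None, 2.0]})

print(validate_submission(good, expected_ids=[0, 1, 2]))
print(validate_submission(bad, expected_ids=[0, 1, 2]))
```

In the notebook the call would be `validate_submission(submission, test_data['id'])`.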

## Part 10 (Optional): Fine-Tuning with LoRA

If prompting doesn't achieve 70% accuracy, consider fine-tuning with LoRA.

In [None]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

# This is a template - implement if needed
print("LoRA fine-tuning is optional.")
print("Use this if prompting strategies don't achieve 70% accuracy.")
print("\nConsider:")
print("- Prepare training dataset in correct format")
print("- Configure LoRA parameters (r=8, alpha=32)")
print("- Train for a few epochs")
print("- Evaluate and compare with prompting approaches")
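
For the first item in the list above, one possible way to turn each training row into a prompt and completion pair is sketched below. It mirrors the `instruction` template from Part 6. `format_training_example` is a hypothetical helper, and the tokenization and LoRA training loop are not shown.

```python
def format_training_example(problem, solution):
    """Return (prompt, completion) strings for one supervised example."""
    prompt = (
        "Solve this math problem and provide only the numerical answer.\n\n"
        f"Problem: {problem}\nAnswer:"
    )
    # Round to 2 decimals to match the scoring rule, then drop trailing zeros
    answer = f"{float(solution):.2f}".rstrip("0").rstrip(".")
    # A leading space separates the answer from "Answer:"
    completion = f" {answer}"
    return prompt, completion

prompt, completion = format_training_example("What is 76 + 55?", 131.0)
print(prompt + completion)
```

Keeping the training prompt identical to the inference prompt means the fine-tuned model sees the same format at test time.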

## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
   - YOUR ANSWER HERE

2. **What types of math problems were most challenging for the model?**
   - YOUR ANSWER HERE

3. **How did you handle number extraction from model outputs?**
   - YOUR ANSWER HERE

4. **What are the limitations of using LLMs for mathematical reasoning?**
   - YOUR ANSWER HERE

5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - YOUR ANSWER HERE (or N/A if you didn't use LoRA)