# Fine-tuning Mistral for Mathematical Problem Solving

This notebook demonstrates how to fine-tune the Mistral-7B language model specifically for solving mathematical word problems using the GSM8K dataset. The implementation uses quantization and Low-Rank Adaptation (LoRA) for efficient training.

## Prerequisites

- Python 3.8+
- CUDA-compatible GPU (16GB+ VRAM recommended)
- Hugging Face account with access to Mistral-7B
- Git installed on your system

### Required Libraries
```bash
pip install -q transformers datasets accelerate bitsandbytes wandb py7zr
pip install -q peft trl huggingface_hub
```

## Key Features

- 4-bit quantization for reduced memory usage
- LoRA fine-tuning for efficient parameter updates
- Step-by-step problem solving approach
- Comparison between base and fine-tuned model outputs
- GSM8K dataset integration

## Notebook Structure

1. **Setup and Dependencies** (Initial Setup)
   - Library installations
   - Basic configurations
   - GPU optimizations
   - Authentication setup

2. **Model Configuration** (Test Cases & Model Loading)
   - Test case definitions
   - Model and tokenizer initialization
   - 4-bit quantization setup
   - Token handling

3. **Dataset Processing** (GSM8K Integration)
   - Loading GSM8K dataset
   - Data formatting and preprocessing
   - Problem template structuring
   - Tokenization setup

4. **Training Configuration** (LoRA Setup)
   - LoRA parameters configuration
   - Training arguments setup
   - Batch size and learning rate settings
   - Gradient accumulation configuration

5. **Training Process** (Model Fine-tuning)
   - Dataset preparation
   - Model training initialization
   - Progress tracking
   - Model saving functionality

6. **Evaluation** (Model Comparison)
   - Base model vs. fine-tuned model comparison
   - Test case execution
   - Solution generation
   - Performance analysis

## Usage Guide

1. **Initial Setup**
   - Set your Hugging Face token
   - Configure environment variables
   - Verify model access

2. **Dataset Configuration**
   - Default: 1000 training examples, 100 test examples
   - Customizable dataset size
   - Structured problem format

3. **Training Parameters**
   - Epochs: 3
   - Batch size: 4
   - Gradient accumulation steps: 16
   - Learning rate: 2e-5
   - FP16 training enabled

4. **Model Customization**
   - Adjustable LoRA parameters (r=32, alpha=64)
   - Configurable target modules
   - Customizable dropout rate
   - Solution generation parameters

## Performance Optimization

- Uses torch.cuda optimizations
- Implements gradient checkpointing
- Employs mixed-precision training
- Utilizes efficient tokenization

## Test Cases

The notebook includes diverse test cases:
1. Factory production and profit calculation
2. Fundraising and commission problems
3. Comparative rate problems
4. Compound interest calculations

## Customization Options

You can customize:
- Dataset size and composition
- Training parameters
- Model architecture settings
- Test case scenarios
- Output formatting

## Troubleshooting

Common issues and solutions:
1. Memory Management:
   - Adjust batch size
   - Modify gradient accumulation
   - Fine-tune quantization settings

2. Training Stability:
   - Adjust learning rate
   - Modify LoRA parameters
   - Check dataset quality

3. Output Quality:
   - Adjust generation parameters
   - Modify problem templates
   - Increase training data

## Best Practices

1. Data Preparation:
   - Use consistent formatting
   - Include diverse problem types
   - Ensure clean data

2. Training:
   - Monitor loss curves
   - Save checkpoints
   - Validate outputs regularly

3. Evaluation:
   - Use diverse test cases
   - Compare with base model
   - Analyze step-by-step solutions

## Notes

- Model saves after each epoch
- Evaluation occurs per epoch
- Solutions include detailed steps
- Outputs are deterministic (can be made stochastic)

In [1]:
# Fine-tuning LLMs: Math Problem Solving Demo
# Section 1: Setup and Dependencies

# Install required packages
!pip install -q transformers datasets accelerate bitsandbytes wandb py7zr
!pip install -q peft trl huggingface_hub

import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from peft import (
    prepare_model_for_kbit_training,
    LoraConfig,
    get_peft_model,
    PeftModel
)
from getpass import getpass
from huggingface_hub import HfApi, login

# Section 2: Hugging Face Authentication
print("Please enter your HuggingFace token (from https://huggingface.co/settings/tokens)")
hf_token = getpass("Enter your HuggingFace token: ")

# Set the token
os.environ["HUGGING_FACE_HUB_TOKEN"] = hf_token
login(token=hf_token)  # This will authenticate your session

# Optional: Set cache directory
os.environ["HF_HOME"] = "/content/hf_home"

# Verify token and model access
try:
    api = HfApi()
    api.model_info("mistralai/Mistral-7B-v0.1")
    print("✅ Successfully authenticated and verified model access!")
except Exception as e:
    print("❌ Error: Unable to access the model. Please check your token and model permissions.")
    print(f"Error details: {str(e)}")
    raise Exception("Model access verification failed")

# Section 3: Configuration and Test Cases
# Enable CUDA optimizations
torch.cuda.empty_cache()
torch.backends.cuda.matmul.allow_tf32 = True

# Configuration
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
OUTPUT_DIR = "math_tuned_model"

# Test cases for clear demonstration
TEST_CASES = [
    """A factory produces widgets at a rate of 800 per hour. Due to quality control, 5% of widgets are rejected. 
    The factory operates 2 shifts per day, each shift being 8 hours long. If the factory sells each good widget for $12.50, 
    and the daily operating cost is $45,000, what is the daily profit?""",
    
    """In a school fundraiser, students sell chocolate bars. Each student receives a 20% commission on their sales. 
    If a chocolate bar costs $5, and the school needs to raise $10,000, and there are 45 students participating, 
    how many chocolate bars does each student need to sell on average to reach the goal?""",
    
    """A car rental company charges $45 per day plus $0.25 per mile. A competing company charges $35 per day plus 
    $0.45 per mile. At how many miles per day would the total cost be the same for both companies? Round to the nearest mile.""",
    
    """Jack is investing $10,000 with 6% annual compound interest. He wants to make periodic withdrawals to supplement 
    his income. If he withdraws $2,000 at the end of each year, how much money will be left in the account after 3 years?
    Round to the nearest dollar."""
]

# Section 4: Dataset Loading and Model Functions
def load_math_dataset():
    """Load and prepare the GSM8K dataset"""
    dataset = load_dataset("gsm8k", "main")
    
    # Take a subset for demonstration
    dataset = dataset.shuffle(seed=42)
    dataset["train"] = dataset["train"].select(range(1000))
    dataset["test"] = dataset["test"].select(range(100))
    
    def format_problem(example):
        return {
            "text": f"""### Problem: {example['question']}
### Step 1: Define Variables
### Step 2: Write Equations
### Step 3: Solve Step-by-Step
### Step 4: Verify Answer
### Final Answer: """
    }
    formatted_dataset = dataset.map(format_problem)
    return formatted_dataset

def load_model_and_tokenizer():
    """Load the base model and tokenizer with 4-bit quantization"""
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=False
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        token=hf_token
    )
    
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, token=hf_token)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    
    return model, tokenizer

def prepare_math_dataset(tokenizer, dataset):
    """Prepare math dataset for training"""
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=768,
            padding="max_length"
        )
    
    tokenized_dataset = {}
    for split in dataset.keys():
        tokenized_dataset[split] = dataset[split].map(
            tokenize_function,
            batched=True,
            remove_columns=dataset[split].column_names
        )
    
    return tokenized_dataset

def prepare_model_for_math_training(model):
    """Configure model with LoRA for math problem solving"""
    lora_config = LoraConfig(
        r=32,
        lora_alpha=64,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],  # Add more target modules
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM"
    )
    
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    return model

def setup_math_training_args():
    """Configure training arguments optimized for math problems"""
    return TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        fp16=True,
        logging_steps=10,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        save_total_limit=2,
        push_to_hub=False,
        report_to="none",
        warmup_steps=100
    )

def generate_solution(model, tokenizer, problem):
    """Generate a solution using the model"""
    prompt = f"### Problem: {problem}\n### Let's solve this step by step:"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=512,
        temperature=0.7,
        num_return_sequences=1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def demonstrate_comparison(base_model, fine_tuned_model, tokenizer, test_cases):
    """Compare solutions between base and fine-tuned models"""
    print("\n=== Model Solution Comparison ===\n")
    
    for i, problem in enumerate(test_cases, 1):
        print(f"\nTest Case {i}:\n")
        print("Problem:")
        print(problem)
        
        print("\nBase Model Solution:")
        base_solution = generate_solution(base_model, tokenizer, problem)
        print(base_solution)
        
        print("\nFine-tuned Model Solution:")
        fine_tuned_solution = generate_solution(fine_tuned_model, tokenizer, problem)
        print(fine_tuned_solution)
        
        print("\n" + "="*80)

# Section 5: Main Execution
def main():
    """Main execution function"""
    print("Loading base model and tokenizer...")
    base_model, tokenizer = load_model_and_tokenizer()
    
    print("\nGenerating solutions with base model...")
    for i, test_case in enumerate(TEST_CASES[:2], 1):
        print(f"\nTest Case {i}:")
        print("Problem:", test_case)
        solution = generate_solution(base_model, tokenizer, test_case)
        print("Solution:", solution)
    
    print("\nNow loading and preparing math dataset...")
    dataset = load_math_dataset()
    tokenized_dataset = prepare_math_dataset(tokenizer, dataset)
    
    print("\nPreparing model for training...")
    model = prepare_model_for_math_training(base_model)
    
    print("\nSetting up training arguments...")
    training_args = setup_math_training_args()
    
    print("\nInitializing trainer...")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    print("\nStarting training...")
    trainer.train()
    
    print("\nTraining complete! Saving model...")
    trainer.save_model()
    
    print("\nLoading fine-tuned model for comparison...")
    fine_tuned_model = PeftModel.from_pretrained(
        base_model,
        OUTPUT_DIR,
        device_map="auto"
    )
    
    print("\nComparing solutions between base and fine-tuned models...")
    demonstrate_comparison(base_model, fine_tuned_model, tokenizer, TEST_CASES)

if __name__ == "__main__":
    main()

[0mPlease enter your HuggingFace token (from https://huggingface.co/settings/tokens)


Enter your HuggingFace token:  ········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


✅ Successfully authenticated and verified model access!
Loading base model and tokenizer...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Generating solutions with base model...

Test Case 1:
Problem: A factory produces widgets at a rate of 800 per hour. Due to quality control, 5% of widgets are rejected. 
    The factory operates 2 shifts per day, each shift being 8 hours long. If the factory sells each good widget for $12.50, 
    and the daily operating cost is $45,000, what is the daily profit?
Solution: ### Problem: A factory produces widgets at a rate of 800 per hour. Due to quality control, 5% of widgets are rejected. 
    The factory operates 2 shifts per day, each shift being 8 hours long. If the factory sells each good widget for $12.50, 
    and the daily operating cost is $45,000, what is the daily profit?
### Let's solve this step by step:
1. Write the formula for the number of widgets produced per shift.
    Let $P$ be the number of widgets produced per shift.
    $P = 800$
2. Write the formula for the number of widgets rejected per shift.
    Let $R$ be the number of widgets rejected per shift.
    $R = 5

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Epoch,Training Loss,Validation Loss
0,1.6926,1.502407
1,1.1897,1.088021
2,1.1279,1.009423


  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  return fn(*args, **kwargs)
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]



Training complete! Saving model...

Loading fine-tuned model for comparison...

Comparing solutions between base and fine-tuned models...

=== Model Solution Comparison ===


Test Case 1:

Problem:
A factory produces widgets at a rate of 800 per hour. Due to quality control, 5% of widgets are rejected. 
    The factory operates 2 shifts per day, each shift being 8 hours long. If the factory sells each good widget for $12.50, 
    and the daily operating cost is $45,000, what is the daily profit?

Base Model Solution:
### Problem: A factory produces widgets at a rate of 800 per hour. Due to quality control, 5% of widgets are rejected. 
    The factory operates 2 shifts per day, each shift being 8 hours long. If the factory sells each good widget for $12.50, 
    and the daily operating cost is $45,000, what is the daily profit?
### Let's solve this step by step:
### Step 1: Define variables
### Step 2: Write equations
### Step 3: Solve step-by-step
### Step 4: Verify answer
### Final a