# L3: Fine-Tuning with OpenAI

## Why OpenAI Instead of Google Cloud

After attempting Google Cloud Vertex AI in `L3_automation.ipynb`, we encountered:
- API deprecation (text-bison)
- Regional limitations (Gemini not available)
- 3+ hours of debugging

OpenAI provides:
- Immediate access
- Same JSONL format (our data works as-is!)
- Reliable service
- Better documentation

## What We're Doing
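Fine-tuning GPT-4o-mini on our Stack Overflow Python Q&As. The data files on disk hold 8,000 examples (7,200 training, 800 evaluation); the run recorded below uses a 100-example training subset and a 20-example evaluation subset to keep the cost down.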

## Cost
- Training: ~$6-8
- Free credit: -$5
- **Your cost: ~$1-3**
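
The arithmetic behind an estimate like this can be sketched as billed training tokens times a per-token rate, minus any credit. This is a minimal sketch only: `estimate_cost`, its parameters, and the sample numbers are hypothetical placeholders, not OpenAI's actual pricing, and it assumes billing scales with tokens in the training file multiplied by the number of epochs.

```python
def estimate_cost(tokens_per_epoch, epochs, usd_per_million_tokens, credit_usd=0.0):
    """Rough fine-tuning cost: billed tokens x rate, minus credit (never below zero)."""
    billed_tokens = tokens_per_epoch * epochs
    gross = billed_tokens / 1_000_000 * usd_per_million_tokens
    return max(gross - credit_usd, 0.0)


# Hypothetical inputs, chosen only to illustrate the arithmetic
cost = estimate_cost(
    tokens_per_epoch=800_000,
    epochs=3,
    usd_per_million_tokens=3.0,
    credit_usd=5.0,
)
print(f"Estimated out-of-pocket cost: ${cost:.2f}")
```

Swapping in a real token count for the training file and the current published rate should give a more trustworthy figure than the round numbers above.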

SETUP

In [2]:
# L3: OpenAI Fine-Tuning Setup
import openai
import json
import os
from dotenv import load_dotenv

# Load API key
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

print("OpenAI initialized")
print(f"Key: {openai.api_key[:8]}...")  # show only the key-type prefix, never part of the secret

OpenAI initialized
Key: sk-proj-...


CELL 3.5

In [20]:
# Create even smaller dataset (100 examples - super cheap!)
import json

def create_smaller_dataset(input_file, output_file, n_examples):
    """Take only first N examples"""
    count = 0
    
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        for line in infile:
            if count >= n_examples:
                break
            outfile.write(line)
            count += 1
    
    return count

print("=" * 70)
print("CREATING TINY DATASET (100 EXAMPLES)")
print("=" * 70)
print("\nCost comparison:")
print("├── 7,200 examples: $55")
print("├── 500 examples: $4")
print("└── 100 examples: $0.50-$1.50 ✅")
print()

# Create tiny training set (100 examples)
train_count = create_smaller_dataset(
    "tune_data_stack_overflow_python_qa.jsonl", 
    "tune_data_tiny.jsonl", 
    100
)
print(f"✅ Training set: {train_count} examples")

# Create tiny eval set (20 examples)
eval_count = create_smaller_dataset(
    "tune_eval_stack_overflow_python_qa.jsonl", 
    "tune_eval_tiny.jsonl", 
    20
)
print(f" Evaluation set: {eval_count} examples")

# Update file paths for next cells
TRAIN_FILE = "tune_data_tiny.jsonl"
EVAL_FILE = "tune_eval_tiny.jsonl"

print("\n" + "=" * 70)
print(" TINY DATASET READY!")
print("=" * 70)
print(f"\nEstimated cost: $0.50 - $1.50")
print("This will DEFINITELY fit your $3.38 budget!")
print("\nNote: 100 examples is small but still useful!")
print("Many models are fine-tuned on 100-200 examples.")

CREATING TINY DATASET (100 EXAMPLES)

Cost comparison:
├── 7,200 examples: $55
├── 500 examples: $4
└── 100 examples: $0.50-$1.50 ✅

✅ Training set: 100 examples
 Evaluation set: 20 examples

 TINY DATASET READY!

Estimated cost: $0.50 - $1.50
This will DEFINITELY fit your $3.38 budget!

Note: 100 examples is small but still useful!
Many models are fine-tuned on 100-200 examples.
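
Taking the first N lines is the simplest option, but it inherits whatever ordering the source file has. As an alternative (not what the cell above does), a uniform random subset can be drawn in one pass with reservoir sampling. The helper name `sample_jsonl` and its `seed` parameter are hypothetical, introduced here only for illustration.

```python
import random


def sample_jsonl(input_file, output_file, n_examples, seed=0):
    """Write a uniform random sample of up to n_examples lines (reservoir sampling)."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    reservoir = []

    with open(input_file, "r", encoding="utf-8") as infile:
        for i, line in enumerate(infile):
            if not line.endswith("\n"):
                line += "\n"  # guard against a final line without a newline
            if i < n_examples:
                reservoir.append(line)
            else:
                # Replace an existing pick with probability n_examples / (i + 1)
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n_examples:
                    reservoir[j] = line

    with open(output_file, "w", encoding="utf-8") as outfile:
        outfile.writelines(reservoir)

    return len(reservoir)
```

A call such as `sample_jsonl("tune_data_stack_overflow_python_qa.jsonl", "tune_data_tiny.jsonl", 100)` would stand in for `create_smaller_dataset` with the same arguments.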


CHECK DATA FILES

In [4]:
import glob

# Find JSONL files
jsonl_files = glob.glob("*.jsonl")

print("=" * 70)
print("TRAINING DATA FILES")
print("=" * 70)

for file in jsonl_files:
    size = os.path.getsize(file) / (1024 * 1024)  # MB
    
    with open(file, 'r', encoding='utf-8') as f:
        lines = len(f.readlines())
    
    print(f"\n {file}")
    print(f"   Size: {size:.2f} MB")
    print(f"   Examples: {lines:,}")

# Set file paths
TRAIN_FILE = "tune_data_stack_overflow_python_qa.jsonl"
EVAL_FILE = "tune_eval_stack_overflow_python_qa.jsonl"

print("\n" + "=" * 70)
print(f"Will use: {TRAIN_FILE}")
print(f"Will use: {EVAL_FILE}")

TRAINING DATA FILES

 tune_data_stack_overflow_python_qa-20251214_102129.jsonl
   Size: 37.23 MB
   Examples: 8,000

 tune_data_stack_overflow_python_qa.jsonl
   Size: 33.56 MB
   Examples: 7,200

 tune_eval_stack_overflow_python_qa.jsonl
   Size: 3.67 MB
   Examples: 800

Will use: tune_data_stack_overflow_python_qa.jsonl
Will use: tune_eval_stack_overflow_python_qa.jsonl


CONVERT TO OPENAI FORMAT

In [22]:
def convert_to_openai_format(input_file, output_file):
    """Convert to OpenAI messages format"""
    converted = 0
    
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        for line in infile:
            data = json.loads(line)
            
            openai_format = {
                "messages": [
                    {
                        "role": "system",
                        "content": "You are a helpful Python expert who answers like Stack Overflow."
                    },
                    {
                        "role": "user", 
                        "content": data["input_text_instruct"]
                    },
                    {
                        "role": "assistant",
                        "content": data["output_text"]
                    }
                ]
            }
            
            outfile.write(json.dumps(openai_format) + '\n')
            converted += 1
    
    return converted

# Use the TINY files created in Cell 3.5
TRAIN_FILE = "tune_data_tiny.jsonl"  # ← Changed to tiny!
EVAL_FILE = "tune_eval_tiny.jsonl"    # ← Changed to tiny!

print("🔄 Converting training data...")
train_count = convert_to_openai_format(TRAIN_FILE, "train_openai.jsonl")
print(f"✅ {train_count:,} training examples converted")

print("\n🔄 Converting validation data...")
eval_count = convert_to_openai_format(EVAL_FILE, "eval_openai.jsonl")
print(f"✅ {eval_count:,} validation examples converted")

print("\n" + "=" * 70)
print("✅ DATA READY FOR UPLOAD!")
print("=" * 70)

🔄 Converting training data...
✅ 100 training examples converted

🔄 Converting validation data...
✅ 20 validation examples converted

✅ DATA READY FOR UPLOAD!
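
The converter reads `data["input_text_instruct"]` and `data["output_text"]` directly, so a record that lacks either field would stop the run with a `KeyError`. A pre-flight check along these lines could surface such records first. This is a sketch under the assumption that empty values should be flagged too; `find_bad_records` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

# The two fields read by convert_to_openai_format
REQUIRED_KEYS = ("input_text_instruct", "output_text")


def find_bad_records(path, required_keys=REQUIRED_KEYS):
    """Return (line_number, reason) pairs for records the converter could not handle."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_no, "invalid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((line_no, "record is not a JSON object"))
                continue
            # Treat a missing key and an empty value the same way
            missing = [key for key in required_keys if not record.get(key)]
            if missing:
                problems.append((line_no, "missing or empty: " + ", ".join(missing)))
    return problems
```

An empty list from `find_bad_records(TRAIN_FILE)` would suggest the file is safe to convert.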


UPLOAD TO OPENAI

In [23]:
print("=" * 70)
print("UPLOADING FILES TO OPENAI")
print("=" * 70)
print("\n This takes 2-3 minutes...\n")

# Upload training
# Note: the count in this message is stale; train_openai.jsonl holds the 100-example tiny set
print("1/2 Uploading training data (8,000 examples)...")
with open("train_openai.jsonl", "rb") as f:
    training_file = openai.files.create(
        file=f,
        purpose="fine-tune"
    )
print(f"     Training file: {training_file.id}")

# Upload validation  
# Note: the count in this message is stale; eval_openai.jsonl holds the 20-example tiny set
print("\n2/2 Uploading validation data (2,000 examples)...")
with open("eval_openai.jsonl", "rb") as f:
    validation_file = openai.files.create(
        file=f,
        purpose="fine-tune"
    )
print(f"     Validation file: {validation_file.id}")

print("\n" + "=" * 70)
print(" FILES UPLOADED SUCCESSFULLY!")
print("=" * 70)

UPLOADING FILES TO OPENAI

 This takes 2-3 minutes...

1/2 Uploading training data (8,000 examples)...
     Training file: file-W9XP7KVkdfT9cqg5Xz3DBL

2/2 Uploading validation data (2,000 examples)...
     Validation file: file-826aciiDurb8JuqB8X6fP3

 FILES UPLOADED SUCCESSFULLY!


 START FINE-TUNING! 

In [24]:
print("=" * 70)
print(" STARTING FINE-TUNING JOB")
print("=" * 70)
print("\n This will take 30-60 minutes...\n")

fine_tuning_job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    suffix="stackoverflow-qa"
)

print(" FINE-TUNING STARTED!")
print("\n" + "=" * 70)
print("JOB DETAILS")
print("=" * 70)
print(f"\n Job ID: {fine_tuning_job.id}")
print(f" Status: {fine_tuning_job.status}")
print(f" Base Model: {fine_tuning_job.model}")

# Save job ID
with open("finetuning_job_id.txt", "w") as f:
    f.write(fine_tuning_job.id)

print("\n Job ID saved to: finetuning_job_id.txt")

print("\n" + "=" * 70)
print("WHAT HAPPENS NOW?")
print("=" * 70)
print("""
 Training runs in the cloud (30-60 min)
 You can close this notebook
 OpenAI will email you when complete
 Check status anytime with Cell 7 below
 Cost: ~$6-8 (minus your $5 free credit!)

NEXT STEPS:
1. Wait for training to complete
2. Run Cell 7 to check status
3. When done, go to L4 to use your model!
""")
print("=" * 70)

 STARTING FINE-TUNING JOB

 This will take 30-60 minutes...

 FINE-TUNING STARTED!

JOB DETAILS

 Job ID: ftjob-F15zqv4WC6HSQWthkarxmgac
 Status: validating_files
 Base Model: gpt-4o-mini-2024-07-18

 Job ID saved to: finetuning_job_id.txt

WHAT HAPPENS NOW?

 Training runs in the cloud (30-60 min)
 You can close this notebook
 OpenAI will email you when complete
 Check status anytime with Cell 7 below
 Cost: ~$6-8 (minus your $5 free credit!)

NEXT STEPS:
1. Wait for training to complete
2. Run Cell 7 to check status
3. When done, go to L4 to use your model!



CHECK STATUS

In [28]:
# Check fine-tuning status
# Run this cell anytime to see progress!

print(" Checking training status...\n")

try:
    with open("finetuning_job_id.txt", "r") as f:
        job_id = f.read().strip()
    
    job = openai.fine_tuning.jobs.retrieve(job_id)
    
    print("=" * 70)
    print("FINE-TUNING STATUS")
    print("=" * 70)
    
    print(f"\n Job ID: {job.id}")
    print(f" Status: {job.status}")
    print(f" Base Model: {job.model}")
    
    if job.status == "succeeded":
        print("\n" + " " * 35)
        print("TRAINING COMPLETE!")
        print(" " * 35)
        print(f"\n✅ Your fine-tuned model: {job.fine_tuned_model}")
        
        # Save model name
        with open("finetuned_model_name.txt", "w") as f:
            f.write(job.fine_tuned_model)
        
        print("\n Model name saved to: finetuned_model_name.txt")
        print("\n NEXT STEP: Go to L4_predictions.ipynb!")
        print("   1. Restart kernel")
        print("   2. Run from Cell 13")
        print("   3. Your custom model will load automatically!")
        
    elif job.status == "running":
        print("\n Training in progress...")
        print("   Check back in 10-15 minutes!")
        print("   Or wait for email notification")
        
    elif job.status == "failed":
        print("\n Training failed!")
        print(f"   Error: {job.error}")
        
    else:
        print(f"\n Current status: {job.status}")
    
    print("\n" + "=" * 70)
    
except FileNotFoundError:
    print(" finetuning_job_id.txt not found!")
    print("   Did you run Cell 6 to start training?")
    
except Exception as e:
    print(f" Error: {e}")

 Checking training status...

FINE-TUNING STATUS

 Job ID: ftjob-F15zqv4WC6HSQWthkarxmgac
 Status: validating_files
 Base Model: gpt-4o-mini-2024-07-18

 Current status: validating_files
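
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below takes the retrieval function as an argument so it can be tried without network access; `wait_for_job`, `poll_seconds`, and `max_polls` are hypothetical names, and the terminal status set assumes jobs end in `succeeded`, `failed`, or `cancelled`.

```python
import time

# Statuses after which the job is assumed not to change any further
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status or polls run out."""
    job = retrieve(job_id)
    for _ in range(max_polls - 1):
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)  # wait before asking again
        job = retrieve(job_id)
    return job
```

In this notebook the call would look like `job = wait_for_job(openai.fine_tuning.jobs.retrieve, job_id)`, after which `job.status` and `job.fine_tuned_model` can be inspected as in the cell above.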



In [29]:
# Get detailed error information
import openai
from dotenv import load_dotenv
import os

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

with open("finetuning_job_id.txt", "r") as f:
    job_id = f.read().strip()

job = openai.fine_tuning.jobs.retrieve(job_id)

print("=" * 70)
print("ERROR DETAILS")
print("=" * 70)

print(f"\n📍 Job ID: {job.id}")
print(f"📊 Status: {job.status}")

if job.error:
    print(f"\n❌ Error Code: {job.error.code}")
    print(f"❌ Error Message: {job.error.message}")
    print(f"❌ Error Param: {job.error.param}")
else:
    print("\nNo error details available")

print("\n" + "=" * 70)

ERROR DETAILS

📍 Job ID: ftjob-F15zqv4WC6HSQWthkarxmgac
📊 Status: validating_files

❌ Error Code: None
❌ Error Message: None
❌ Error Param: None



In [30]:
# Check if files were validated properly
import openai

# Get training file details
print("=" * 70)
print("CHECKING TRAINING FILE")
print("=" * 70)

try:
    with open("train_openai.jsonl", "r") as f:
        lines = f.readlines()
        
    print(f"\n✅ File exists")
    print(f"✅ Total lines: {len(lines)}")
    
    # Check first example
    import json
    first = json.loads(lines[0])
    
    print("\n📄 First example structure:")
    print(json.dumps(first, indent=2)[:500])
    
    # Check for common issues
    print("\n🔍 Validation checks:")
    
    issues = 0  # count problems so the summary below reflects what was found
    for i, line in enumerate(lines[:10]):  # Check first 10
        try:
            data = json.loads(line)
            
            # Check required fields
            if "messages" not in data:
                print(f"❌ Line {i+1}: Missing 'messages' field")
                issues += 1
            else:
                msgs = data["messages"]
                if len(msgs) < 2:
                    print(f"❌ Line {i+1}: Need at least 2 messages")
                    issues += 1
                    
        except json.JSONDecodeError:
            print(f"❌ Line {i+1}: Invalid JSON")
            issues += 1
    
    if issues == 0:
        print("\n✅ First 10 lines look valid")
    else:
        print(f"\n❌ Found {issues} problem(s) in the first 10 lines")
    
except Exception as e:
    print(f"❌ Error: {e}")

print("\n" + "=" * 70)

CHECKING TRAINING FILE

✅ File exists
✅ Total lines: 100

📄 First example structure:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful Python expert who answers like Stack Overflow."
    },
    {
      "role": "user",
      "content": "Please answer the following Stackoverflow question on Python.\nAnswer it like you are a developer answering Stackoverflow questions.\n\nStackoverflow question:\nMLFlow active run does not match environment run id<p>I am trying to perform an MLFlow run but stuck with the following error after trying a lot of things

🔍 Validation checks:

✅ First 10 lines look valid
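
The checks above only confirm that `messages` exists and has at least two entries. A slightly stricter per-line check might also look at roles and content. This is a sketch, not OpenAI's own validator: `validate_chat_example` is a hypothetical helper, and `VALID_ROLES` covers only the three roles this notebook writes.

```python
import json

# Only the roles produced by convert_to_openai_format in this notebook
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_example(line):
    """Return a list of problems found in one JSONL line (empty list means it passed)."""
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]

    messages = data.get("messages") if isinstance(data, dict) else None
    if not isinstance(messages, list) or len(messages) < 2:
        return ["'messages' must be a list with at least 2 entries"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # Each training example should end with the answer the model is meant to learn
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems
```

Running it over every line of `train_openai.jsonl`, rather than the first 10, would catch problems further down the file.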

