# Advanced Uses of Generative AI 

## AI Math Tutor – Fine-Tuned ChatGPT for Elementary Learners

This project explores fine-tuning as a way to build an interactive AI math tutor for elementary students. Using the OpenAI API, a base gpt-3.5-turbo model was fine-tuned to deliver clear, friendly, and age-appropriate explanations, even when students provide vague or unstructured input.

The project follows the official OpenAI fine-tuning workflow and incorporates strategies from *Generative AI in Action* (Chapter 7), including prompt design, data formatting, and conversational behavior shaping.

This notebook documents the complete end-to-end process:
- Preparing and formatting training data in JSONL
- Uploading and managing files with the OpenAI Python client
- Running the fine-tuning job
- Testing and evaluating the customized model responses

The final model is integrated into a Streamlit web application to deliver a simple and engaging interface for student use.

Here are the steps I followed to fine-tune a generative AI model using the OpenAI API and Python:

- **Dataset Creation:**
  
    I created a small, high-quality dataset of kid-friendly math question-answer pairs using OpenAI's chat fine-tuning format, ensuring a friendly tone, emojis, and second-grade-level analogies. The dataset was validated to ensure correct formatting and readiness for fine-tuning.

- **Uploading the Dataset:**
  
    The training file was uploaded to OpenAI using the OpenAI Python client.

- **Fine-Tuning the Model:**
  
    A fine-tuning job was launched using gpt-3.5-turbo, and progress was monitored until completion.

- **Testing and Evaluation:**
  
    After fine-tuning, I tested the model on new prompts to evaluate tone, clarity, and how well it followed the desired kid-friendly style.

- **Reflections and Observations:**
  
    I documented insights from the results, including improvements over the base model and any limitations or next steps.


#### Setting Up OpenAI API Key

To start, I imported the necessary libraries and retrieved the API key from the system's environment variables. This ensures that the API key is securely accessed and available for making the API calls.

In [1]:
import openai
import os

# The API key is stored as an environment variable for security reasons.
# It is set in the system environment and can be accessed by using the os.getenv() function.
api_key = os.getenv("OPENAI_API_KEY")

# If the key is missing, it will print an error message to indicate the issue.
if not api_key:
    print("Error: API key is not set. Please ensure the OPENAI_API_KEY environment variable is configured properly.")
    print("To set your API key, run the following command in your terminal: export OPENAI_API_KEY=your_api_key_here")
    exit(1)  # Exit if the API key is not found

# Set the API key for the OpenAI API
openai.api_key = api_key

#### Dataset Preparation


For this milestone, I created a custom training dataset designed to help the model respond to elementary math questions with warm, age-appropriate explanations.

The dataset was hand-crafted in JSONL format, with each record containing a user prompt (e.g., a math question) and an assistant response (e.g., a cheerful, emoji-enhanced explanation). Each entry followed the OpenAI fine-tuning schema using a messages list with clearly defined user and assistant roles.

To keep the fine-tuning process manageable for this exercise, I used 10 high-quality examples and ensured that each sample was:

- Engaging and easy to follow for a second-grade reading level

- Aligned with best practices for fine-tuning (no special tokens, consistent formatting)

- Valid JSONL with no parsing issues or escape character errors

This dataset served as the foundation for the fine-tuning process that followed.
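The formatting requirements listed above can be checked with a short script. The following is a minimal sketch, not OpenAI's own validator: the function name `validate_chat_jsonl`, the `ALLOWED_ROLES` set, and the specific checks are assumptions chosen to match the record shape used in this dataset (a `messages` list of `role`/`content` pairs with at least one assistant reply).

```python
import json

# Assumption: only these roles appear in this dataset's records.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue
        if not all(isinstance(m, dict) for m in messages):
            problems.append((number, "every message must be a JSON object"))
            continue
        for m in messages:
            if m.get("role") not in ALLOWED_ROLES:
                problems.append((number, f"unexpected role: {m.get('role')!r}"))
            if not isinstance(m.get("content"), str):
                problems.append((number, "message content must be a string"))
        if not any(m.get("role") == "assistant" for m in messages):
            problems.append((number, "no assistant message to learn from"))
    return problems

# Illustrative records (not taken from the real training file).
sample = [
    '{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "2 + 2 = 4!"}]}',
    '{"messages": [{"role": "user", "content": "What is 3 + 1?"}]}',
    '{"messages": [{"role": "user", "content": "Oops"}',
]
for line_number, problem in validate_chat_jsonl(sample):
    print(f"line {line_number}: {problem}")
```

To check a real file, the same function could be called with an open file handle, since iterating over a text file yields one line at a time.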

I verified the first couple of records to confirm the data was correctly formatted:

In [2]:
# Open the original file and read the contents
with open('math_tutor_dataset.jsonl', 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Print the first two records as-is
for record in lines[:2]:
    print(record)

{"messages": [{"role": "user", "content": "What is 7 + 5? Explain it like I'm in 2nd grade."}, {"role": "assistant", "content": "Let’s count on our fingers! Start with 7 fingers up, and then count 5 more: 8, 9, 10, 11, 12! So, 7 + 5 = 12! 🎉 You did it! 🎉"}]}

{"messages": [{"role": "user", "content": "I have 3 boxes of crayons. Each box has 4 crayons. How many crayons do I have in total?"}, {"role": "assistant", "content": "Let’s do the math! If each box has 4 crayons and you have 3 boxes, you multiply 3 x 4 = 12 crayons! That’s a whole rainbow of colors! 🌈"}]}



#### Uploading the Cleaned Dataset for Fine-Tuning

Once the dataset was prepared, the next step was to upload the training file to OpenAI’s servers through the OpenAI API so that a fine-tuning job could reference it. The dataset would then be used to fine-tune the base model so that it responds with age-appropriate, engaging math explanations.

To handle potential issues during the upload, I implemented error handling in the code to ensure smooth file submission. This step made the dataset available for fine-tuning, allowing the model to learn from the provided examples.

In [3]:
from openai import OpenAI

TRAINING_FILENAME = 'math_tutor_dataset.jsonl'  # Path to the cleaned training dataset

# Initialize OpenAI client
client = OpenAI()

# Upload the training file to OpenAI's servers
try:
    # Open the training file in binary mode ('rb') for reading
    with open(TRAINING_FILENAME, "rb") as f:
        # Upload the file to OpenAI with the purpose of fine-tuning
        file_response = client.files.create(
            file=f,  # The file to upload
            purpose="fine-tune"  # The purpose for the file upload (fine-tuning)
        )
    
    # Extract the file ID from the response
    training_file_id = file_response.id
    
    # Print a success message with the file ID
    print(f"Training file uploaded successfully. File ID: {training_file_id}")

# Handle any exceptions that occur during the file upload process
except Exception as e:
    # Print the error message and exit the program if file upload fails
    print(f"Error uploading training file: {e}")
    exit()

Training file uploaded successfully. File ID: file-WUD3eoUxN4Y7JpsxCeXjpd


#### Create the Fine-Tuning Job

After uploading the cleaned training dataset, the next step was to initiate the fine-tuning job using the file ID provided by OpenAI. This triggered the model training process, enabling the model to learn from the dataset and improve its ability to provide age-appropriate math explanations.

For this fine-tuning task, I selected the base model (gpt-3.5-turbo) and set the relevant hyperparameters, such as the number of training epochs (n_epochs), which controls how many passes training makes over the dataset. Additionally, I applied a custom suffix to the fine-tuned model’s name for easy identification.

Once the job was successfully created, I received a job ID, which I used to monitor the progress and status of the fine-tuning process.

In [4]:
# Initiate fine-tuning job
try:
    # Create the fine-tuning job using the uploaded training file and model parameters
    fine_tune_response = client.fine_tuning.jobs.create(
        training_file=training_file_id,  # Pass the uploaded training file ID
        model="gpt-3.5-turbo",  # Specify the base model for fine-tuning (gpt-3.5-turbo)
        hyperparameters={"n_epochs": 3},  # Set hyperparameters for fine-tuning (e.g., number of epochs)
        suffix="math_tutor"  # Add a custom suffix to the fine-tuned model for identification
    )
    
    # Extract the job ID from the response
    fine_tuning_job_id = fine_tune_response.id
    
    # Print success message with the job ID for monitoring purposes
    print(f"Fine-tuning job started successfully. Job ID: {fine_tuning_job_id}")
    
except Exception as e:
    # If there is an error in initiating the fine-tuning job, print the error message and exit
    print(f"Error initiating fine-tuning job: {e}")
    exit()

Fine-tuning job started successfully. Job ID: ftjob-HS8NjJ9TFAKnDJxoVSSx8lSp


#### Monitoring and Documenting the Fine-Tuning Job

After initiating the fine-tuning job, I used the OpenAI API to monitor its status and document the training progress. This involved two key parts:

- **Job Monitoring**
  
    I implemented a script to track the job status by querying the OpenAI endpoint every 30 seconds. The script continued polling until the job either succeeded or failed. Upon successful completion, I retrieved the fine-tuned model name for future inference tasks.

- **Metrics Tracking**
  
    Once the job completed, I retrieved detailed job events including training loss logged at each step. These events provided visibility into how the model learned over time. The final output included the job status and the ID of the fine-tuned model.
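The polling pattern described above can be sketched separately from the HTTP details. This is a hedged illustration, not the notebook's actual code: `wait_for_job`, `TERMINAL_STATUSES`, and the `max_polls` cap are hypothetical names and choices. The status source is passed in as a function, so the loop can be exercised without calling the API.

```python
import time

# Statuses after which a fine-tuning job will not change again.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, interval_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or max_polls is reached.

    Returns the last status seen, which may be non-terminal if polling gave up.
    """
    status = None
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before asking again
    return status

# Demo with a stand-in status source and a no-op sleep (no network access).
fake_statuses = iter(["validating_files", "queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final_status)  # prints "succeeded"
```

With a real client, `fetch_status` could be something like `lambda: client.fine_tuning.jobs.retrieve(fine_tuning_job_id).status`. Capping the number of polls avoids waiting forever if the job never reaches a terminal state.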

Here's the code I used to monitor the fine-tuning job and retrieve the status and metrics:

In [5]:
import json
import requests

def get_fine_tuning_job_status(job_id):
    # Function to retrieve the status of a fine-tuning job
    url = f"https://api.openai.com/v1/fine_tuning/jobs/{job_id}"
    headers = {
        "Authorization": f"Bearer {api_key}"  # api_key was read from the OPENAI_API_KEY environment variable
    }
    try:
        # Send GET request to retrieve job status (time out rather than hang on a stalled connection)
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes
        job_data = response.json()
        return job_data['status']  # Return job status (succeeded, failed, etc.)
    except requests.exceptions.RequestException as e:
        # Handle request exceptions (e.g., network errors, timeouts)
        print(f"Error retrieving fine-tuning job status: {e}")
        return None
    except json.JSONDecodeError:
        # Handle errors when decoding the JSON response
        print("Error decoding JSON response.")
        return None

In [7]:
import requests
import json
import time

# Monitor the fine-tuning job status
if fine_tuning_job_id:
    print("Monitoring fine-tuning job...")
    while True:
        # Check job status
        status = get_fine_tuning_job_status(fine_tuning_job_id)
        
        # If the job reached a terminal state (succeeded, failed, or cancelled), break the loop
        if status in ["succeeded", "failed", "cancelled"]:
            break
        
        # Print status and wait before checking again
        print(f"Status: {status}. Checking again in 30 seconds...")
        time.sleep(30)  # Wait for 30 seconds before checking again

    print(f"Fine-tuning job completed with status: {status}")

    if status == "succeeded":
        # If job succeeded, retrieve the fine-tuned model ID
        url = f"https://api.openai.com/v1/fine_tuning/jobs/{fine_tuning_job_id}"
        headers = {
            "Authorization": f"Bearer {api_key}"
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Ensure the request was successful
        job_data = response.json()
        fine_tuned_model_name = job_data['fine_tuned_model']
        print(f"Fine-tuned model name: {fine_tuned_model_name}")

        # Retrieve and print training metrics from OpenAI client
        client = OpenAI()
        events = client.fine_tuning.jobs.list_events(fine_tuning_job_id)

        print("\n--- Training Metrics ---")
        for event in events.data:
            print(f"{event.created_at}: {event.message}")

        job_info = client.fine_tuning.jobs.retrieve(fine_tuning_job_id)
        print(f"\nFinal Status: {job_info.status}")
        print(f"Fine-tuned model: {job_info.fine_tuned_model}")

    else:
        print("Fine-tuning failed.")
        fine_tuned_model_name = None
else:
    # Handle case where fine-tuning job ID is not available
    print("Fine-tuning job ID is not available. Cannot monitor.")
    fine_tuned_model_name = None


Monitoring fine-tuning job...
Fine-tuning job completed with status: succeeded
Fine-tuned model name: ft:gpt-3.5-turbo-0125:personal:math-tutor:BVPvM7jo

--- Training Metrics ---
1746827527: The job has successfully completed
1746827521: New fine-tuned model created
1746827520: Checkpoint created at step 30
1746827520: Checkpoint created at step 15
1746827508: Step 45/45: training loss=0.49
1746827508: Step 44/45: training loss=0.58
1746827505: Step 43/45: training loss=0.58
1746827505: Step 42/45: training loss=0.44
1746827505: Step 41/45: training loss=0.43
1746827502: Step 40/45: training loss=0.40
1746827502: Step 39/45: training loss=0.68
1746827499: Step 38/45: training loss=0.25
1746827496: Step 37/45: training loss=0.27
1746827496: Step 36/45: training loss=0.57
1746827494: Step 35/45: training loss=0.65
1746827494: Step 34/45: training loss=0.41
1746827494: Step 33/45: training loss=0.66
1746827491: Step 32/45: training loss=0.42
1746827491: Step 31/45: training loss=0.54
1746

#### Interpreting the Training Loss

Training loss helps assess how well the model is fitting the training data.

The job ran for 45 steps, and the last step logged a training loss of 0.49. Per-step loss was noisy: across the 15 steps shown above (31 to 45) it ranged from 0.25 to 0.68, so the final value alone says little. The numbers are consistent with the model fitting the small training set, but training loss does not measure generalization. A larger, more diverse dataset and a held-out validation set would be needed to judge the model's robustness in real-world use.
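Because the loss at a single step is noisy, it can help to summarize the logged values rather than read only the last one. The sketch below parses event messages in the format shown in the job output above and reports the minimum, maximum, and mean. The helper name `parse_loss` and the regular expression are assumptions based on that message format, and the messages are copied from the visible part of the log rather than fetched from the API.

```python
import re
from statistics import mean

# Matches messages such as "Step 45/45: training loss=0.49".
LOSS_PATTERN = re.compile(r"Step (\d+)/(\d+): training loss=([0-9.]+)")

def parse_loss(message):
    """Return (step, total_steps, loss) for a loss message, or None for any other event."""
    match = LOSS_PATTERN.search(message)
    if match is None:
        return None
    step, total, loss = match.groups()
    return int(step), int(total), float(loss)

# Event messages copied from the visible portion of the job log.
messages = [
    "The job has successfully completed",
    "Checkpoint created at step 30",
    "Step 45/45: training loss=0.49",
    "Step 44/45: training loss=0.58",
    "Step 43/45: training loss=0.58",
    "Step 42/45: training loss=0.44",
    "Step 41/45: training loss=0.43",
    "Step 40/45: training loss=0.40",
    "Step 39/45: training loss=0.68",
    "Step 38/45: training loss=0.25",
    "Step 37/45: training loss=0.27",
    "Step 36/45: training loss=0.57",
    "Step 35/45: training loss=0.65",
    "Step 34/45: training loss=0.41",
    "Step 33/45: training loss=0.66",
    "Step 32/45: training loss=0.42",
    "Step 31/45: training loss=0.54",
]

# Keep only loss events, ordered by step number.
points = sorted(p for p in map(parse_loss, messages) if p is not None)
losses = [loss for _, _, loss in points]
print(f"steps {points[0][0]}-{points[-1][0]}: "
      f"min={min(losses):.2f} max={max(losses):.2f} mean={mean(losses):.2f}")
```

In the notebook, the same parsing could be applied to `event.message` for each item returned by `client.fine_tuning.jobs.list_events(...)`.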

#### Model Testing

With the fine-tuning job successfully completed and the fine-tuned model ready, I proceeded to test both the base model (gpt-3.5-turbo) and the newly fine-tuned model using a set of sample prompts. The goal of this step was to compare the responses from the base model and the fine-tuned model to see how they differed in terms of relevance, engagement, and clarity, particularly for elementary math questions.

This quick comparison helped me evaluate whether the fine-tuned model performed as expected and provided age-appropriate, engaging, and helpful explanations for the target audience.

In [8]:
def test_model(model_name, test_strings, description):
    print(f"\nTesting {description}: {model_name}")
    for test_string in test_strings:
        try:
            response = client.chat.completions.create(
                model=model_name,
                messages=[
                    {"role": "system", "content": "You are a friendly and engaging math tutor who provides clear, fun, and easy-to-understand math explanations for elementary students. Use emojis and simple language to explain the concepts."},
                    {"role": "user", "content": test_string}
                ],
                max_tokens=75,
            )
            output = response.choices[0].message.content.strip()
            print(f"\nInput: {test_string}")
            print(f"Output: {output}\n")
        except Exception as e:
            print(f"Error testing model with input '{test_string}': {e}")

In [15]:
# Initialize OpenAI client
client = OpenAI()

# Prompt strings 
test_prompts = [
    "What is 5 + 3?",
    "Can you explain what 10 divided by 2 means?",
    "I have 16 cookies 🍪 and want to share them equally with 4 friends. How many cookies will each friend get?"
]

print("\n--- Test foundation model ---")
# Test foundation model
test_model("gpt-3.5-turbo", test_prompts, "foundation model")

print("\n--- Test fine-tuned model ---")
# Test fine-tuned model
if fine_tuned_model_name:
    test_model(fine_tuned_model_name, test_prompts, "fine-tuned model")
else:
    print("No fine-tuned model to test.")


--- Test foundation model ---

Testing foundation model: gpt-3.5-turbo

Input: What is 5 + 3?
Output: Great question! 🌟 When you add 5 and 3 together, you get 8! 🎉 So, 5 + 3 = 8. Isn't math fun? 😀🧮


Input: Can you explain what 10 divided by 2 means?
Output: Of course! 🌟 When we divide 10 by 2, it means we are sharing 10 things equally into 2 groups. 🤝 So, if we have 10 cookies and we want to share them with 2 friends, each friend would get 5 cookies. 🍪🍪🍪🍪🍪 So


Input: I have 16 cookies 🍪 and want to share them equally with 4 friends. How many cookies will each friend get?
Output: Great question! 🌟 If you have 16 cookies 🍪 and want to share them equally with 4 friends, you can divide the total number of cookies by the number of friends. 
16 cookies ÷ 4 friends = 4 cookies each! 🎉
Each friend will get 4 cookies. Enjoy sharing your yummy cookies! 😊


--- Test fine-tuned model ---

Testing fine-tuned model: ft:gpt-3.5-turbo-0125:personal:math-tutor:BVPvM7jo

Input: What is 5 + 3?
Output:

#### Evaluation of the results and Conclusion

Fine-tuning succeeded in shaping the model's responses to better suit its intended audience. The fine-tuned model consistently delivered responses in a more playful, child-friendly tone that matched the goal of building an elementary math tutor. It used metaphors (like toys and pizza), added emojis, and kept the tone encouraging and fun, all of which help make math more engaging for young students.

That said, the base GPT-3.5-turbo model also performed well. It gave accurate and clear answers, and in some cases, was a bit more concise. However, it didn’t always maintain the consistent, imaginative tone or child-oriented language that the fine-tuned version achieved.

Overall, both models were capable, but the fine-tuned version better matched the intended teaching style for younger audiences. This shows that fine-tuning can be a powerful tool to adapt a general-purpose model to meet specific communication needs and user groups.

- **Metrics Considered**

Given the scope of this term project and the small fine-tuning dataset, I focused on two metrics to evaluate the model's performance:

Training Loss: The final training loss was 0.495, which indicates that the model fit the training data reasonably closely.

Training Accuracy: The model reached a training accuracy of 80.9%, reflecting how often it predicted the correct tokens during training.

These metrics were obtained from the OpenAI fine-tuning dashboard, which reports the model's learning progress as training runs.

- **Model Testing and Validation**

To assess the qualitative improvements brought by fine-tuning, I conducted comparative testing between the base model (gpt-3.5-turbo) and fine-tuned model. 

#### Key observations include:

*Enhanced Engagement:* The fine-tuned model consistently incorporated emojis and adopted a more conversational tone, aligning with the objective of creating a friendly math tutor for elementary students.

*Improved Clarity:* Explanations provided by the fine-tuned model were more tailored to young learners, using simple language and relatable examples.

- **Human Evaluation**

Recognizing the importance of human judgment in evaluating generative models, I performed manual reviews of the model's responses. This involved:

Comparative Analysis: Assessing outputs from both the base and fine-tuned models to identify improvements in tone, clarity, and engagement.

Qualitative Feedback: Noting areas where the fine-tuned model excelled or required further refinement.

While this evaluation was informal, it provided valuable insights into the model's real-world applicability and its effectiveness in achieving the project's educational goals.
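A manual review like this can be supplemented with a few rough, automatic style measurements. The sketch below is only one possible approach and was not part of the evaluation described above: `style_metrics` is a hypothetical helper, and counting characters in the Unicode "Symbol, other" category is an approximation of emoji usage rather than an exact emoji detector.

```python
import re
import unicodedata

def style_metrics(text):
    """Return rough style indicators for one model response."""
    # Most emoji fall in the Unicode category "So" (Symbol, other); this is an approximation.
    symbols = sum(1 for ch in text if unicodedata.category(ch) == "So")
    words = re.findall(r"[A-Za-z']+", text)
    # Split on sentence punctuation and keep only pieces that contain letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if re.search(r"[A-Za-z]", s)]
    return {
        "symbol_count": symbols,
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
    }

# Illustrative response text (not an actual model output).
print(style_metrics("Yay! 🎉 You did it."))
```

Shorter sentences and a consistent number of emojis per response would be weak signals of the kid-friendly style. They do not replace reading the responses.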

- **Token Size Experimentation**

One key learning from this project was the impact of the max_tokens parameter on the completeness of the model’s responses. Initially, setting a low max_tokens value resulted in truncated outputs, limiting the model’s ability to deliver full explanations. To address this, the parameter was gradually increased, which enabled the model to generate more comprehensive and coherent responses. This experimentation highlighted the importance of carefully tuning generation parameters to achieve a balance between response quality and resource efficiency.
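Truncation caused by max_tokens can also be detected programmatically instead of by eye. In the Chat Completions API, each choice carries a `finish_reason`, and the value `"length"` means generation stopped at the token limit. The sketch below uses stand-in objects shaped like a response so that it runs without an API call. `was_truncated` and `fake_response` are hypothetical helpers, and the finish reasons in the demo are assigned for illustration, not taken from the notebook's runs.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True when the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

def fake_response(text, finish_reason):
    """Build a stand-in object with the same attribute layout as a chat completion."""
    message = SimpleNamespace(content=text)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])

# Illustrative responses: one cut off mid-sentence, one complete.
cut_off = fake_response("each friend would get 5 cookies. So", "length")
complete = fake_response("5 + 3 = 8!", "stop")
print(was_truncated(cut_off), was_truncated(complete))  # prints "True False"
```

A check like this could be used to raise max_tokens only for the prompts whose responses were cut off, rather than increasing it for every request.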

    
### Conclusion
    
Throughout this milestone, I took away several important lessons:

**Fine-Tuning Effectiveness:** Even with a modest dataset, fine-tuning can significantly enhance a model's alignment with specific stylistic and functional objectives.

**Importance of Evaluation Metrics:** Monitoring training loss and accuracy provides essential feedback on the model's learning trajectory, while human evaluations offer nuanced insights into its practical performance.

**Prompt and Parameter Tuning:** Adjusting prompt and generation parameters like max_tokens is crucial for optimizing output quality.

For future work, expanding the dataset and incorporating more structured evaluations would further strengthen the model's capabilities and reliability.