# **Fine-Tuning GPT-4o mini on a Custom Dataset**

**The fine-tuned model is designed to assist real estate agents in efficiently searching for property information, providing concise and relevant responses to their queries.**


## Step 1: Install all the necessary libraries

In [2]:
# json, os, and time ship with Python and cannot be installed from PyPI.
!pip install openai requests tiktoken python-dotenv pandas numpy


In [40]:
import openai
import pandas as pd
import json
import tiktoken
import numpy as np
from collections import defaultdict
import os
from openai import AzureOpenAI
from dotenv import load_dotenv

## Step 2: Set up environment variables

In [41]:


load_dotenv("azure.env")


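# Module-level settings for the openai package; the AzureOpenAI client in Step 5 is configured explicitly.
openai.api_type = "azure"
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.api_base = os.getenv("OPENAI_API_BASE")
openai.api_version = "2024-05-01-preview"  # API version, not the model version (2024-07-18)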
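
A missing variable in `azure.env` only surfaces later as an authentication error, so it can help to check the values right after loading them. The following is a minimal sketch under that assumption; `require_env` is a hypothetical helper, not part of `python-dotenv` or the OpenAI SDK, and the example uses an in-memory mapping rather than the real environment.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with placeholder values instead of real credentials.
example_env = {
    "OPENAI_API_KEY": "placeholder-key",
    "OPENAI_API_BASE": "https://example.invalid/",
}
settings = require_env(["OPENAI_API_KEY", "OPENAI_API_BASE"], env=example_env)
print(sorted(settings))
```

In the notebook you would call `require_env(["OPENAI_API_KEY", "OPENAI_API_BASE"])` after `load_dotenv("azure.env")`.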

In [42]:
import os

current_directory = os.getcwd()
print(f"Current working directory: {current_directory}")

# Set the current working directory to the desired path
desired_directory = "c:\\Users\\akaruparti\\Documents\\gpt-4o-mini-FT\\GPT4o-mini-fine-tuning\\" # Replace this with the actual path if different
os.chdir(desired_directory)
print(f"New working directory: {os.getcwd()}")

Current working directory: c:\Users\akaruparti\Documents\gpt-4o-mini-FT\GPT4o-mini-fine-tuning
New working directory: c:\Users\akaruparti\Documents\gpt-4o-mini-FT\GPT4o-mini-fine-tuning


## Step 3: Load the provided training and validation data

In [43]:
# Load the training set
with open('data/training_data.jsonl', 'r', encoding='utf-8-sig') as f:
    training_dataset = [json.loads(line) for line in f]

# Training dataset stats
print("Number of examples in training set:", len(training_dataset))
print("First example in training set:")
for message in training_dataset[0]["messages"]:
    print(message)

# Load the validation set
with open('data/validation_data.jsonl', 'r', encoding='utf-8-sig') as f:
    validation_dataset = [json.loads(line) for line in f]


# Validation dataset stats
print("\nNumber of examples in validation set:", len(validation_dataset))
print("First example in validation set:")
for message in validation_dataset[0]["messages"]:
    print(message)


Number of examples in training set: 10
First example in training set:
{'role': 'system', 'content': 'You are a real estate customer support agent whose primary goal is to help users find and inquire about properties. You are friendly and concise. You only provide factual answers to queries and assist users in finding property information.'}
{'role': 'user', 'content': 'What is the price of the property at 123 Main Street?'}
{'role': 'assistant', 'content': 'The property at 123 Main Street is listed at $450,000. Would you like more details or schedule a viewing?'}
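
Each line of the JSONL files holds one chat example with a `messages` list of `role`/`content` pairs, as the printed example shows. Before uploading, a quick structural check can catch malformed lines. This is a rough sketch, not the service's own validation: `validate_example` and `ALLOWED_ROLES` are hypothetical names, and the role set is limited to the three roles seen in this dataset.

```python
import json

# Assumption: only the roles that appear in this dataset are expected.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example):
    """Return a list of problems found in one chat-format training example."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: 'content' is not a string")
    if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
        problems.append("no assistant message to learn from")
    return problems


good = json.loads(
    '{"messages": [{"role": "user", "content": "Hi"},'
    ' {"role": "assistant", "content": "Hello"}]}'
)
bad = {"messages": [{"role": "user", "content": "Hi"}]}
print(validate_example(good))
print(validate_example(bad))
```

Running it over `training_dataset` and `validation_dataset` and printing any non-empty result is usually enough to spot a broken line.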

## Step 4: Token Count Validation

In [44]:

# cl100k_base is the encoding for gpt-4, gpt-3.5-turbo, and text-embedding-ada-002.
# GPT-4o models use o200k_base, so the counts below are approximate for gpt-4o-mini.
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3
    return num_tokens

def num_assistant_tokens_from_messages(messages):
    num_tokens = 0
    for message in messages:
        if message["role"] == "assistant":
            num_tokens += len(encoding.encode(message["content"]))
    return num_tokens

def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")
    print(f"p10 / p90: {np.quantile(values, 0.1)}, {np.quantile(values, 0.9)}")

files = ['data/training_data.jsonl', 'data/validation_data.jsonl']

for file in files:
    print(f"Processing file: {file}")
    with open(file, 'r', encoding='utf-8-sig') as f:
        dataset = [json.loads(line) for line in f]

    total_tokens = []
    assistant_tokens = []

    for ex in dataset:
        messages = ex.get("messages", {})
        total_tokens.append(num_tokens_from_messages(messages))
        assistant_tokens.append(num_assistant_tokens_from_messages(messages))
    
    print_distribution(total_tokens, "total tokens")
    print_distribution(assistant_tokens, "assistant tokens")
    print('*' * 50)

Processing file: data/training_data.jsonl

#### Distribution of total tokens:
min / max: 81, 104
mean / median: 92.4, 93.0
p10 / p90: 85.5, 98.6

#### Distribution of assistant tokens:
min / max: 23, 33
mean / median: 27.6, 26.5
p10 / p90: 23.9, 32.1
**************************************************
Processing file: data/validation_data.jsonl

#### Distribution of total tokens:
min / max: 82, 100
mean / median: 93.18181818181819, 95.0
p10 / p90: 83.0, 99.0

#### Distribution of assistant tokens:
min / max: 25, 37
mean / median: 29.636363636363637, 30.0
p10 / p90: 26.0, 36.0
**************************************************
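
The token statistics also give a rough estimate of how many tokens the job will train on: dataset tokens multiplied by the number of epochs. The sketch below applies that arithmetic to the training-set figures above (10 examples averaging 92.4 tokens) and to the `n_epochs` value of 10 reported by the finished job in Step 8. `estimate_trained_tokens` is a hypothetical helper, and the service's own token accounting may differ from this estimate.

```python
def estimate_trained_tokens(mean_tokens_per_example, n_examples, n_epochs):
    """Estimate trained tokens as (tokens in the dataset) x (number of epochs)."""
    dataset_tokens = round(mean_tokens_per_example * n_examples)
    return dataset_tokens * n_epochs


# 92.4 mean tokens x 10 examples = 924 tokens per epoch; 10 epochs in total.
print(estimate_trained_tokens(92.4, 10, 10))  # 9240
```

For this run the estimate agrees with the `trained_tokens` value of 9240 in the job record.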


## Step 5: Upload training files to Azure OpenAI

In [45]:
# Initialize AzureOpenAI client

load_dotenv("azure.env")

# Print only the endpoint; never print the API key.
print("Azure API Base:", os.getenv("OPENAI_API_BASE"))


client = AzureOpenAI(
  azure_endpoint=os.getenv("OPENAI_API_BASE"), 
  api_key=os.getenv("OPENAI_API_KEY"),  
  api_version="2024-05-01-preview"  # API version used for the fine-tuning calls in this walkthrough
)

training_file_name = 'data/training_data.jsonl'
validation_file_name = 'data/validation_data.jsonl'

# Upload the training and validation dataset files to Azure OpenAI with the SDK.
# Context managers close the file handles once each upload finishes.

with open(training_file_name, "rb") as training_file:
    training_response = client.files.create(file=training_file, purpose="fine-tune")
training_file_id = training_response.id

with open(validation_file_name, "rb") as validation_file:
    validation_response = client.files.create(file=validation_file, purpose="fine-tune")
validation_file_id = validation_response.id

print("Training file ID:", training_file_id)
print("Validation file ID:", validation_file_id)


Azure API Base: https://aoaiswedencentral0312.openai.azure.com/
Training file ID: file-eb6a9476d11247c88838a18e607bcd19
Validation file ID: file-0e0b5ccd2a474af3966073f449a44778
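
When confirming that credentials were loaded, it is safer to log a masked form of the key than the key itself, since notebook output is often shared or committed. A minimal sketch of that idea follows; `mask_secret` is a hypothetical helper and the key shown is a made-up placeholder.

```python
def mask_secret(value, visible=4):
    """Return a masked form of a secret, keeping only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Placeholder value, not a real key.
print("Azure API Key:", mask_secret("example-secret-1234"))
```

In the notebook this would be called as `mask_secret(os.getenv("OPENAI_API_KEY"))`.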


## Step 6: Submit your fine-tuning job
### Fine-tuning took about 41 minutes for a dataset of 10 samples (9,240 trained tokens over 10 epochs)

![alt text](image.png)

In [47]:
response = client.fine_tuning.jobs.create(
    training_file=training_file_id,
    validation_file=validation_file_id,
    model="gpt-4o-mini", # Enter base model name. Note that in Azure OpenAI the model name contains dashes and cannot contain dot/period characters. 

)

job_id = response.id

# You can use the job ID to monitor the status of the fine-tuning job.
# The fine-tuning job will take some time to start and complete.

print("Job ID:", response.id)
print("Status:", response.status)
print(response.model_dump_json(indent=2))

Job ID: ftjob-ba7c72eaeda14eda87a76271b6a80e24
Status: pending
{
  "id": "ftjob-ba7c72eaeda14eda87a76271b6a80e24",
  "created_at": 1724251727,
  "error": null,
  "fine_tuned_model": null,
  "finished_at": null,
  "hyperparameters": {
    "n_epochs": -1,
    "batch_size": -1,
    "learning_rate_multiplier": 1
  },
  "model": "gpt-4o-mini-2024-07-18",
  "object": "fine_tuning.job",
  "organization_id": null,
  "result_files": null,
  "status": "pending",
  "trained_tokens": null,
  "training_file": "file-eb6a9476d11247c88838a18e607bcd19",
  "validation_file": "file-0e0b5ccd2a474af3966073f449a44778",
  "seed": 1388987542
}
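
In the pending job above, `n_epochs` and `batch_size` are reported as -1, which appears to mean the service picks them; the finished job in Step 8 shows the resolved values. If you would rather pin them, the SDK's job-creation call accepts a `hyperparameters` argument and a `seed`. The sketch below only assembles the keyword arguments so it can run without a connection; `build_job_kwargs` is a hypothetical helper, the file IDs are placeholders, and whether your API version accepts each option should be checked against the service documentation.

```python
def build_job_kwargs(training_file_id, validation_file_id, model, n_epochs=None, seed=None):
    """Assemble keyword arguments for client.fine_tuning.jobs.create()."""
    kwargs = {
        "training_file": training_file_id,
        "validation_file": validation_file_id,
        "model": model,
    }
    if n_epochs is not None:
        # Omit the key entirely to let the service choose the epoch count.
        kwargs["hyperparameters"] = {"n_epochs": n_epochs}
    if seed is not None:
        # A fixed seed is intended to make repeated runs more comparable.
        kwargs["seed"] = seed
    return kwargs


job_kwargs = build_job_kwargs(
    "file-TRAINING-ID", "file-VALIDATION-ID", "gpt-4o-mini", n_epochs=3, seed=42
)
print(job_kwargs)
# With a configured client: response = client.fine_tuning.jobs.create(**job_kwargs)
```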


## Step 7: Track Training Status

In [48]:
# Track training status

from IPython.display import clear_output
import time

start_time = time.time()

# Get the status of our fine-tuning job.
response = client.fine_tuning.jobs.retrieve(job_id)

status = response.status

# If the job isn't done yet, poll it every 10 seconds.
# Cancelled jobs are treated as finished too (both spellings are checked), so the loop cannot run forever.
while status not in ["succeeded", "failed", "cancelled", "canceled"]:
    time.sleep(10)

    response = client.fine_tuning.jobs.retrieve(job_id)
    print(response.model_dump_json(indent=2))
    elapsed = int(time.time() - start_time)
    print("Elapsed time: {} minutes {} seconds".format(elapsed // 60, elapsed % 60))
    status = response.status
    print(f'Status: {status}')
    clear_output(wait=True)

print(f'Fine-tuning job {job_id} finished with status: {status}')

# List all fine-tuning jobs for this resource.
print('Checking other fine-tune jobs for this resource.')
response = client.fine_tuning.jobs.list()
print(f'Found {len(response.data)} fine-tune jobs.')

Fine-tuning job ftjob-ba7c72eaeda14eda87a76271b6a80e24 finished with status: succeeded
Checking other fine-tune jobs for this resource.
Found 3 fine-tune jobs.
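
The polling loop can also be wrapped in a small function with a timeout, which keeps a stuck job from blocking the notebook indefinitely. This is a sketch under stated assumptions: `wait_for_job` and `TERMINAL_STATES` are hypothetical names, the status source is passed in as a callable so the example runs without a connection, and the set of terminal states is an assumption that includes both spellings of "cancelled".

```python
import time

# Assumption: these are the statuses after which the job will not change again.
TERMINAL_STATES = {"succeeded", "failed", "cancelled", "canceled"}


def wait_for_job(fetch_status, poll_seconds=10, timeout_seconds=3600):
    """Poll fetch_status() until a terminal state is reached or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Example with a fake status source standing in for the service.
fake_statuses = iter(["pending", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

With the real client, the callable would be `lambda: client.fine_tuning.jobs.retrieve(job_id).status`.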


## Step 8: Retrieve the Fine-Tuned Model Name

In [49]:
response = client.fine_tuning.jobs.retrieve(job_id)

print(response.model_dump_json(indent=2))
fine_tuned_model = response.fine_tuned_model

{
  "id": "ftjob-ba7c72eaeda14eda87a76271b6a80e24",
  "created_at": 1724251727,
  "error": null,
  "fine_tuned_model": "gpt-4o-mini-2024-07-18.ft-ba7c72eaeda14eda87a76271b6a80e24",
  "finished_at": 1724254215,
  "hyperparameters": {
    "n_epochs": 10,
    "batch_size": 1,
    "learning_rate_multiplier": 1
  },
  "model": "gpt-4o-mini-2024-07-18",
  "object": "fine_tuning.job",
  "organization_id": null,
  "result_files": [
    "file-34f0666467f24d10b4854f94735d4e89"
  ],
  "status": "succeeded",
  "trained_tokens": 9240,
  "training_file": "file-eb6a9476d11247c88838a18e607bcd19",
  "validation_file": "file-0e0b5ccd2a474af3966073f449a44778",
  "seed": 1388987542
}
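
The `created_at` and `finished_at` fields in the job record are Unix timestamps, so the wall-clock duration of the job can be worked out directly. A small sketch using the two values printed above; `job_duration` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def job_duration(created_at, finished_at):
    """Return (minutes, seconds) between two Unix timestamps."""
    return divmod(finished_at - created_at, 60)


created_at = 1724251727
finished_at = 1724254215

minutes, seconds = job_duration(created_at, finished_at)
print(f"Job took {minutes} min {seconds} s")  # Job took 41 min 28 s
print("Started (UTC):", datetime.fromtimestamp(created_at, tz=timezone.utc).isoformat())
```

Note that this span runs from job creation to completion, so it includes any time the job spent pending.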


In [83]:
response = client.fine_tuning.jobs.retrieve(job_id)
fine_tuned_model_id = response.fine_tuned_model

if fine_tuned_model_id is None:
    raise RuntimeError(
        "Fine-tuned model ID not found. Your job has likely not been completed yet."
    )

print("Fine-tuned model ID:", fine_tuned_model_id)

Fine-tuned model ID: gpt-4o-mini-2024-07-18.ft-ba7c72eaeda14eda87a76271b6a80e24
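
The model ID contains a period, while the Step 6 code notes that Azure OpenAI model names use dashes and cannot contain periods. The deployment name is your own choice; the sample test later in this notebook uses the model ID with the period replaced by a dash. A one-line sketch of that convention, where `suggest_deployment_name` is a hypothetical helper:

```python
def suggest_deployment_name(fine_tuned_model_id):
    """Derive a deployment name by replacing periods with dashes."""
    return fine_tuned_model_id.replace(".", "-")


model_id = "gpt-4o-mini-2024-07-18.ft-ba7c72eaeda14eda87a76271b6a80e24"
print(suggest_deployment_name(model_id))
# gpt-4o-mini-2024-07-18-ft-ba7c72eaeda14eda87a76271b6a80e24
```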


## Step 9: Deploy the fine-tuned model in Azure AI Studio

Deployment can also be done programmatically.
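
For the programmatic route, one option is a `PUT` request to the Azure management (control-plane) REST API. The sketch below only builds the request so it runs without credentials. Treat the details as assumptions to verify against the current Azure documentation: the URL shape, the `2023-05-01` API version, and the body fields follow the Cognitive Services deployments API as I understand it. `build_deployment_request` is a hypothetical helper, and the subscription, resource group, resource name, and token are placeholders you must supply.

```python
def build_deployment_request(subscription_id, resource_group, resource_name,
                             deployment_name, fine_tuned_model_id, token):
    """Build URL, query params, headers, and JSON body for a deployment request."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.CognitiveServices"
        f"/accounts/{resource_name}"
        f"/deployments/{deployment_name}"
    )
    params = {"api-version": "2023-05-01"}  # assumed control-plane API version
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    body = {
        "sku": {"name": "standard", "capacity": 1},
        "properties": {
            "model": {"format": "OpenAI", "name": fine_tuned_model_id, "version": "1"}
        },
    }
    return url, params, headers, body


url, params, headers, body = build_deployment_request(
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group="<RESOURCE_GROUP>",
    resource_name="<AZURE_OPENAI_RESOURCE>",
    deployment_name="gpt-4o-mini-2024-07-18-ft-ba7c72eaeda14eda87a76271b6a80e24",
    fine_tuned_model_id="gpt-4o-mini-2024-07-18.ft-ba7c72eaeda14eda87a76271b6a80e24",
    token="<AZURE_AD_TOKEN>",
)
print(url)
# To send it: requests.put(url, params=params, headers=headers, json=body)
```

The token here is an Azure Active Directory access token for the management API, not the Azure OpenAI API key used elsewhere in this notebook.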

## Sample Testing 1

### Ask the fine-tuned model whether there is a park near a particular property. It answers in the style of the training data.

In [103]:
import os
from openai import AzureOpenAI

client = AzureOpenAI(
  azure_endpoint=os.getenv("OPENAI_API_BASE"), 
  api_key=os.getenv("OPENAI_API_KEY"),  
  api_version="2024-05-01-preview"  # Same API version used when the fine-tuning job was created
)

response = client.chat.completions.create(
    model="gpt-4o-mini-2024-07-18-ft-ba7c72eaeda14eda87a76271b6a80e24", # Use the deployment name you chose for the fine-tuned model
    messages=[
        {"role": "system", "content": "You are a real estate customer support agent whose primary goal is to help users find and inquire about properties. You are friendly and concise. You only provide factual answers to queries and assist users in finding property information."},
        {"role": "user", "content": "Are there any nearby parks to the property on 110 Grove Street?"},
      
    ]
)

print(response.choices[0].message.content)

Yes, there is a large park within walking distance of the property on 110 Grove Street. Would you like more information about the amenities in the area?
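
To try several questions, it helps to reuse the same system prompt the model was trained with and build the message list in one place. A minimal sketch follows; `build_messages` is a hypothetical helper, and both questions are taken from this notebook.

```python
SYSTEM_PROMPT = (
    "You are a real estate customer support agent whose primary goal is to help "
    "users find and inquire about properties. You are friendly and concise. "
    "You only provide factual answers to queries and assist users in finding "
    "property information."
)


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Return a chat message list pairing the training system prompt with a question."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


questions = [
    "Are there any nearby parks to the property on 110 Grove Street?",
    "What is the price of the property at 123 Main Street?",
]
for question in questions:
    messages = build_messages(question)
    print(messages[1]["content"])
    # With a configured client and deployment:
    # response = client.chat.completions.create(model=deployment_name, messages=messages)
```

Here `deployment_name` stands for whatever name you gave the deployment in Step 9.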
