# Fine Tuning GPT-3.5-Turbo

In this notebook, we walk through an example of fine-tuning gpt-3.5-turbo.

Specifically, we attempt to distill GPT-4's knowledge, by generating training data with GPT-4 to then fine-tune GPT-3.5.

All training data is generated using two different sections of our index data, creating both a training and evalution set.

We then finetune with our `OpenAIFinetuneEngine` wrapper abstraction.

Evaluation is done using the `ragas` library, which we will detail later on.

In [59]:
%pip install llama-index-llms-vertex

Collecting llama-index-llms-vertex
  Downloading llama_index_llms_vertex-0.1.4-py3-none-any.whl.metadata (642 bytes)
Downloading llama_index_llms_vertex-0.1.4-py3-none-any.whl (7.0 kB)
Installing collected packages: llama-index-llms-vertex
Successfully installed llama-index-llms-vertex-0.1.4
Note: you may need to restart the kernel to use updated packages.


In [79]:
import google
creds,project = google.auth.default()
project_id = "sbx-196865-genaift-ds-ccd784e6"


In [77]:
# creds.project_id

In [147]:
llm = Vertex(model="gemini-pro", project=project_id, credentials=creds)

In [136]:
# help(Vertex)

In [1]:
%pip install llama-index-finetuning
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-openai
%pip install spacy

Collecting llama-index-finetuning
  Downloading llama_index_finetuning-0.1.4-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index-core<0.11.0,>=0.10.11.post1 (from llama-index-finetuning)
  Downloading llama_index_core-0.10.16.post1-py3-none-any.whl.metadata (3.6 kB)
Collecting llama-index-embeddings-adapter<0.2.0,>=0.1.2 (from llama-index-finetuning)
  Downloading llama_index_embeddings_adapter-0.1.3-py3-none-any.whl.metadata (695 bytes)
Collecting llama-index-llms-gradient<0.2.0,>=0.1.1 (from llama-index-finetuning)
  Downloading llama_index_llms_gradient-0.1.2-py3-none-any.whl.metadata (685 bytes)
Collecting llama-index-llms-openai<0.2.0,>=0.1.1 (from llama-index-finetuning)
  Downloading llama_index_llms_openai-0.1.7-py3-none-any.whl.metadata (557 bytes)
Collecting llama-index-postprocessor-cohere-rerank<0.2.0,>=0.1.1 (from llama-index-finetuning)
  Downloading llama_index_postprocessor_cohere_rerank-0.1.2-py3-none-any.whl.metadata (720 bytes)
Collecting sentence-transformers<

In [2]:
!pip install llama-index-embeddings-openai

Collecting llama-index-embeddings-openai
  Downloading llama_index_embeddings_openai-0.1.6-py3-none-any.whl.metadata (654 bytes)
Downloading llama_index_embeddings_openai-0.1.6-py3-none-any.whl (6.0 kB)
Installing collected packages: llama-index-embeddings-openai
Successfully installed llama-index-embeddings-openai-0.1.6


In [3]:
!pip install  pypdf sentence-transformers ragas

Collecting pypdf
  Downloading pypdf-4.1.0-py3-none-any.whl.metadata (7.4 kB)
Collecting ragas
  Downloading ragas-0.1.3-py3-none-any.whl.metadata (4.7 kB)
Collecting langchain (from ragas)
  Downloading langchain-0.1.11-py3-none-any.whl.metadata (13 kB)
Collecting langchain-core (from ragas)
  Downloading langchain_core-0.1.29-py3-none-any.whl.metadata (6.0 kB)
Collecting langchain-community (from ragas)
  Downloading langchain_community-0.0.25-py3-none-any.whl.metadata (8.1 kB)
Collecting langchain-openai (from ragas)
  Downloading langchain_openai-0.0.8-py3-none-any.whl.metadata (2.5 kB)
Collecting pysbd>=0.3.4 (from ragas)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain->ragas)
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain->

In [4]:
!pip install llama-index-readers-file

Collecting llama-index-readers-file
  Downloading llama_index_readers_file-0.1.8-py3-none-any.whl.metadata (926 bytes)
Collecting bs4<0.0.3,>=0.0.2 (from llama-index-readers-file)
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting pymupdf<2.0.0,>=1.23.21 (from llama-index-readers-file)
  Downloading PyMuPDF-1.23.26-cp310-none-manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.23.22 (from pymupdf<2.0.0,>=1.23.21->llama-index-readers-file)
  Downloading PyMuPDFb-1.23.22-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.4 kB)
Downloading llama_index_readers_file-0.1.8-py3-none-any.whl (34 kB)
Downloading bs4-0.0.2-py2.py3-none-any.whl (1.2 kB)
Downloading PyMuPDF-1.23.26-cp310-none-manylinux2014_x86_64.whl (4.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.4/4.4 MB[0m [31m18.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading PyMuPDFb-1.23.22-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_

In [60]:
import os
import openai

In [61]:
os.environ["OPENAI_API_KEY"] = "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"
openai.api_key = os.environ["OPENAI_API_KEY"]

In [62]:
# !pip install nest_asyncio


In [63]:
import nest_asyncio
nest_asyncio.apply()

## Data Setup

Here, we first down load the PDF that we will use to generate training data.

In [64]:
# !curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

The next step is generating a training and eval dataset.

We will generate 40 questions on different sections of the PDF we downloaded.

We can use GPT-3.5 on the eval questions to get our baseline performance.

Then, we will use GPT-4 on the train questions to generate our training data. The training data will be collected with out `OpenAIFineTuningHandler`.

This step is entirely optional if you don't want to spend the time/tokens -- the eval and training questions are also provided in this folder, as well as the training data!

### Train Generation

In [241]:
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import DatasetGenerator

documents = SimpleDirectoryReader(
    input_files=["accenture-reports-first-quarter-fiscal-2024-results.pdf",
                 "accentures-fourth-quarter-fiscal-2022-earnings-release.pdf",
                 "final-q4-fy23-earnings-press-release.pdf",
                "A061213R.pdf",
                "accenture-fiscal-2021-annual-report.pdf"]
).load_data()

# Shuffle the documents
import random

random.seed(42)
random.shuffle(documents)

gpt_35_llm = OpenAI(model="gpt-3.5-turbo-0125", temperature=0.3)

In [242]:
gpt_35_llm

OpenAI(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7f6ce963c730>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x7f6e0ada4280>, completion_to_prompt=<function default_completion_to_prompt at 0x7f6e0ae0c670>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model='gpt-3.5-turbo-0125', temperature=0.3, max_tokens=None, additional_kwargs={}, max_retries=3, timeout=60.0, default_headers=None, reuse_client=True, api_key='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', api_base='https://api.openai.com/v1', api_version='')

In [127]:
# documents

In [266]:
question_gen_query = (
    "You are a Financial expert in Accenture C suite. Your task is to setup "
    "a set of questions and capture the context in tabular format for the given question as per the format of examples given below from the "
    "Accenture finance reports, formulate "
    "a single question based on the tabular data that you extract from the\
    context. Restrict the question to the context information that you are able to extract. Please include numbers in your response. Make sure that the data to the columns present in the tabular data format, include only numbers"
    "Put numbers only in the data as you are able to extract from the context, do not interpolate or calculate any values. Extract only the accurate values from the data."
    "Return the extracted question and context as a single string enclosed in a list as [{question:<put the question here>,context:<put the context here>}]"
    "here are few examples:"
    "[{question: Generate a market financial narrative using the context,\
    context: Market | Revenue | Revenue to Plan | Sales | Sales to Plan | Profit | Profit to Plan | Costs | Costs to Plan | Backlog | Backlog to Plan\
North America | $5.3M | -12% | $3.4M | 9% | $4.9M | 5% | $2.1M | 15% | $7.2M | 30%\
Europe | $6.7M | 25% | $7.1M | -20% | $8.3M | -35% | $4.8M | -25% | $5.9M | -10%\
Asia | $8.2M | -30% | $9.8M | 15% | $7.6M | 20% | $6.3M | 10% | $4.5M | -18%},"
"{question:Write a financial report of Technology services based on the context,\
context: Technology Services\
Service Group | Bookings ($) | Bookings YoY (%) | Revenue ($) | Revenue YoY (%) | Profit ($) | Profit YoY (%) | Sales ($) | Sales YoY (%)\
SI | $14.3M | -1.2% | $6.2M | -4.1% | $3.6M | -2.9% | $6.0M | 0.3%\
AMS | $13.7M | 2.8% | $4.8M | 9.5% | $4.5M | 4.9% | $3.6M | -14.4%\
IMS | $8.1M | 5.9% | $6.6M | -2.9% | $4.1M | -5.4% | $2.0M | -7.1%}]")

preamble = "### Finance Report Summary ### Human: You are a financial analyst reviewing a report. Your task is to formulate a question based on the data provided.\
Provide detailed and accurate questions and answers. ### Assistant: Assume the role of an expert analyst. Generate the tabular data required to answer the question here. \
Provide only numbers inside the columns. ### Assistant: Assume you are a financial expert. Create the answer or summary based on the question here. Provide detailed and crisp answers.\", "

question_gen_query = (
    """Assume you are a financial analyst tasked with extracting insights from a finance report. """
    """Your goal is to generate questions and answers based on the provided text. If the question is not related to finance, respond with 'This is not related to finance'.The format should adhere to the following structure:"""
    # """{\"text:"""
    """{\"question\": \"What is the question you would ask regarding the provided financial data?\", """
    """\"table\": \"| Sample Column 1 | Sample Column 2 | Sample Column 3 |\\n|---|---|---|\\n| Sample Data 1 | Sample Data 2 | Sample Data 3 |\", """
    """\"answer\": \"Your detailed and crisp answer or summary goes here.\", """
    """\"clause\": \"If the question is not related to finance, respond with 'This is not related to finance'.\"}"""
    )







dataset_generator = DatasetGenerator.from_documents(
    documents[150:],
    question_gen_query=question_gen_query,
    # llm=llm
    llm=gpt_35_llm
)


  return cls(


In [256]:
# len(documents)

In [267]:
system_message =question_gen_query

In [268]:
# NOTE: this may take some time. Go grab a coffee!
questions = dataset_generator.generate_questions_from_nodes(num=50)
print("Generated ", len(questions), " questions")

Generated  32  questions


  return QueryResponseDataset(queries=queries, responses=responses_dict)


In [270]:
questions

['{"question": "What impact did gains related to an investment have on Accenture\'s effective tax rates and diluted earnings per share for fiscal 2021 and 2020?", "table": "| Fiscal Year | Gains on Investment (in millions) | Tax Expense (in millions) | Effective Tax Rate | Diluted Earnings Per Share |\\n|---|---|---|---|---|\\n| 2021 | $271 | $41 | 22.8% | $9.16 |\\n| 2020 | $332 | $52 | 23.5% | $7.89 |", "answer": "The gains related to an investment in fiscal 2021 and 2020 impacted Accenture\'s effective tax rates and diluted earnings per share as follows: In fiscal 2021, the gains were $271 million with a related tax expense of $41 million, resulting in an effective tax rate of 22.8% and diluted earnings per share of $9.16. In fiscal 2020, the gains were $332 million with a related tax expense of $52 million, resulting in an effective tax rate of 23.5% and diluted earnings per share of $7.89.", "clause": "If the question is not related to finance, respond with \'This is not related t

In [271]:
with open("train_questions_3.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")

In [253]:
questions

['{"question": "How are long-lived assets assessed for recoverability according to the provided financial data?", "table": "| Fiscal Year | Research and Development Costs | Advertising Costs | Provision for Doubtful Accounts |\\n|---|---|---|---|\\n| 2021 | $1,118,320 | $171,883 | $6,199 |\\n| 2020 | $870,611 | $57,658 | $147 |\\n| 2019 | $799,734 | $85,521 | $974 |", "answer": "Long-lived assets or groups of assets are assessed based on a comparison of the carrying amount to the estimated future net cash flows. If estimated future undiscounted net cash flows are less than the carrying amount, the asset is considered impaired and a loss is recorded. Intangible assets with finite lives are generally amortized using the straight-line method over their estimated economic useful lives.", "clause": "If the question is not related to finance, respond with \'This is not related to finance\'."}',
 '{"question": "What is the question you would ask regarding the provided financial data?", "table

### Eval Generation

Now, lets generate questions on a completely different set of documents, in order to create our eval dataset.

In [20]:
dataset_generator = DatasetGenerator.from_documents(
    documents[
        23:
    ],  # since we generated ~1 question for 40 documents, we can skip the first 40
    question_gen_query=question_gen_query,
    llm=gpt_35_llm,
)

  return cls(


In [21]:
# NOTE: this may take some time. Go grab a coffee!
questions = dataset_generator.generate_questions_from_nodes(num=40)
print("Generated ", len(questions), " questions")

Generated  28  questions


  return QueryResponseDataset(queries=queries, responses=responses_dict)


In [22]:
with open("eval_questions.txt", "w") as f:
    for question in questions:
        f.write(question + "\n")

In [23]:
questions

['What was the increase in Health & Public Service revenues in the fourth quarter of fiscal 2023 compared to the fourth quarter of fiscal 2022?',
 "What is Accenture's rationale for providing full-year revenue guidance on a local-currency basis rather than in U.S. dollars?",
 "What was Accenture's annual effective tax rate for fiscal 2022 compared to fiscal 2021?",
 'What is the total amount of current assets as of August 31, 2023, for Accenture plc according to the first quarter 2024 report?',
 "What is Accenture's total number of employees and in how many countries do they serve clients?",
 "Question: What are some of the risks mentioned in the first quarter 2024 report for Accenture that could potentially impact the company's results of operations?",
 'Answer: Some of the risks mentioned in the report include adverse effects from volatile economic and political conditions, challenges in matching people and skills with client demand, legal, reputational, and financial risks related t

## Initial Eval with GPT-3.5-Turbo Query Engine

For this eval, we will be using the [`ragas` evaluation library](https://github.com/explodinggradients/ragas).

Ragas has a ton of evaluation metrics for RAG pipelines, and you can read about them [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md).

For this notebook, we will be using the following two metrics

- `answer_relevancy` - This measures how relevant is the generated answer to the prompt. If the generated answer is incomplete or contains redundant information the score will be low. This is quantified by working out the chance of an LLM generating the given question using the generated answer. Values range (0,1), higher the better.
- `faithfulness` - This measures the factual consistency of the generated answer against the given context. This is done using a multi step paradigm that includes creation of statements from the generated answer followed by verifying each of these statements against the context. The answer is scaled to (0,1) range. Higher the better.

In [24]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [25]:
from llama_index.core import VectorStoreIndex

# limit the context window to 2048 tokens so that refine is used
from llama_index.core import Settings

Settings.context_window = 2048

index = VectorStoreIndex.from_documents(
    documents,
)

query_engine = index.as_query_engine(similarity_top_k=2, llm=gpt_35_llm)

In [26]:
contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

In [30]:
answers

['The increase in Health & Public Service revenues in the fourth quarter of fiscal 2023 compared to the fourth quarter of fiscal 2022 was 13%.',
 "Accenture provides full-year revenue guidance on a local-currency basis instead of in U.S. dollars due to the potential significant impact of foreign exchange rate fluctuations, which could vary considerably from the company's stated assumptions.",
 "Accenture's annual effective tax rate for fiscal 2022 was 24.0%, which was higher compared to the effective tax rate of 22.8% for fiscal 2021.",
 'The total amount of current assets as of August 31, 2023, for Accenture plc according to the first quarter 2024 report is $23,381,931.',
 'Accenture has 733,000 people serving clients in more than 120 countries.',
 "Some of the risks mentioned in the first quarter 2024 report for Accenture that could potentially impact the company's results of operations include adverse effects from uncertain economic and political conditions, challenges in maintainin

In [27]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

Evaluating:   0%|          | 0/56 [00:00<?, ?it/s]

{'answer_relevancy': 0.9261, 'faithfulness': 0.9167}


In [None]:
#how to run a custom model
from openai import OpenAI
client = OpenAI()

completion = client.chat.completions.create(
  model="ft:gpt-3.5-turbo-0125:personal:acnearningsreport:8xaOhqJD",
  messages=[
    {"role": "system", "content": "You are a geography geek"},
    {"role": "user", "content": "What is the capital of burundy"}
  ]
)
print(completion.choices[0].message)

ChatCompletionMessage(content='The capital of Burundi is Bujumbura.', role='assistant', function_call=None, tool_calls=None)


## GPT-4 to Collect Training Data

Here, we use GPT-4 and the `OpenAIFineTuningHandler` to collect data that we want to train on.

In [28]:
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)
llm.callback_manager = callback_manager

In [29]:
llm


OpenAI(callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7f0552430160>, system_prompt=None, messages_to_prompt=<function messages_to_prompt at 0x7f064cd06170>, completion_to_prompt=<function default_completion_to_prompt at 0x7f064cb7e4d0>, output_parser=None, pydantic_program_mode=<PydanticProgramMode.DEFAULT: 'default'>, query_wrapper_prompt=None, model='gpt-3.5-turbo', temperature=0.3, max_tokens=None, additional_kwargs={}, max_retries=3, timeout=60.0, default_headers=None, reuse_client=True, api_key='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', api_base='https://api.openai.com/v1', api_version='')

In [None]:
finetuning_handler

<llama_index.finetuning.callbacks.finetuning_handler.OpenAIFineTuningHandler at 0x7a979f58f850>

In [None]:
finetuning_handler.get_finetuning_events()

In [None]:
questions = []
with open("train_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
questions

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
)

query_engine = index.as_query_engine(similarity_top_k=2, llm=llm)

In [None]:
def generate_chat_output(system_message, user_messages, response_messages):
    # Validate input lengths
    if len(user_messages) != len(response_messages):
        raise ValueError("The length of user_messages and response_messages must be equal.")

    output = []  # List to hold the output dictionaries

    for user_message, response_message in zip(user_messages, response_messages):
        # Construct the message group
        message_group = {
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
                {"role": "assistant", "content": response_message}
            ]
        }

        # Append the constructed group to the output list
        output.append(message_group)

    return output

# Test the function
system_message = "Marv is a factual chatbot that is also sarcastic."
user_messages = ["What's the capital of France?", "Who wrote 'Romeo and Juliet'?", "How far is the Moon from Earth?"]
response_messages = [
    "Paris, as if everyone doesn't know that already.",
    "Oh, just some guy named William Shakespeare. Ever heard of him?",
    "Around 384,400 kilometers. Give or take a few, like that really matters."
]

# Generate and print the output
output = generate_chat_output(system_message, user_messages, response_messages)
for message_group in output:
    print(message_group)


In [None]:
import time
user_messages=[]
response_messages=[]
for question in questions:
    response = query_engine.query(question)
    print (question)
    user_messages.append(question)
    response_messages.append(str(response))
    print(response)
    time.sleep(1)

In [None]:
evalquestions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        evalquestions.append(line.strip())

In [None]:
import time
user_messageseval=[]
response_messageseval=[]
for question in evalquestions:
    response = query_engine.query(question)
    print (question)
    user_messageseval.append(question)
    response_messageseval.append(str(response))
    print(response)
    time.sleep(1)

In [None]:
import json
output = generate_chat_output(system_message, user_messages, response_messages)

# Convert the output to a JSON string
json_output = json.dumps(output, indent=4)

# Save to a JSON file
with open("finetuning_events.json", "w") as json_file:
    json_file.write(json_output)

# Save to a JSONL file
with open("finetuning_events.jsonl", "w") as jsonl_file:
    for item in output:
        jsonl_file.write(json.dumps(item) + '\n')

print("Output saved in both JSON and JSONL formats.")

NameError: name 'generate_chat_output' is not defined

In [None]:
import json
output = generate_chat_output(system_message, user_messageseval, response_messageseval)

# Convert the output to a JSON string
json_output = json.dumps(output, indent=4)

# Save to a JSON file
with open("eval_events.json", "w") as json_file:
    json_file.write(json_output)

# Save to a JSONL file
with open("eval_events.jsonl", "w") as jsonl_file:
    for item in output:
        jsonl_file.write(json.dumps(item) + '\n')

print("Output saved in both JSON and JSONL formats.")

## Create `OpenAIFinetuneEngine`

We create an `OpenAIFinetuneEngine`: the finetune engine will take care of launching a finetuning job, and returning an LLM model that you can directly plugin to the rest of LlamaIndex workflows.

We use the default constructor, but we can also directly pass in our finetuning_handler into this engine with the `from_finetuning_handler` class method.



In [None]:
finetuning_handler.save_finetuning_events("finetuning_events.jsonl")

In [None]:
from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo-0125",
    "finetuning_events.jsonl",
    # start_job_id="<start-job-id>"  # if you have an existing job, can specify id here
)

# finetune_engine = OpenAIFinetuneEngine.from_finetuning_handler(
#     finetuning_handler,
#     "gpt-3.5-turbo",
#     "tmp.jsonl"
# )

In [None]:
finetune_engine.finetune()

Num examples: 40
First example:
{'role': 'system', 'content': 'Marv is a factual chatbot that is also sarcastic.'}
{'role': 'user', 'content': "What is Accenture's total number of employees serving clients in more than 120 countries?"}
{'role': 'assistant', 'content': "Accenture's total number of employees serving clients in more than 120 countries is approximately 733,000 people."}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 56, 233
mean / median: 105.075, 91.5
p5 / p95: 62.9, 163.90000000000003

#### Distribution of num_assistant_tokens_per_example:
min / max: 7, 182
mean / median: 54.675, 37.5
p5 / p95: 19.700000000000003, 113.30000000000004

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~4203 tokens that will be cha

In [None]:
finetune_engine.get_current_job()

FineTuningJob(id='ftjob-I6PVWDqXJkFEg0Bs81tH1VPl', created_at=1709230766, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-t8Y1yHFRT15iXYj3Z2IftwaE', result_files=[], status='validating_files', trained_tokens=None, training_file='file-GjllJWui8Cefq6HmGLWKhewn', validation_file=None, user_provided_suffix=None)

In [None]:
ft_llm = finetune_engine.get_finetuned_model(temperature=0.3)

ValueError: Job ftjob-I6PVWDqXJkFEg0Bs81tH1VPl does not have a finetuned model id ready yet.

## Evaluation

After some time, your model will be done training!

The next step is running our fine-tuned model on our eval dataset again to measure any performance increase.

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager


# Option 1: pass in ft_llm directly into Settings
from llama_index.core import Settings

Settings.llm = ft_llm
Settings.context_window = (
    2048  # limit the context window artifically to test refine process
)

# # Option 2: you can also specify the model name manually
# ft_model_name = "ft:gpt-3.5-turbo-0613:..."
# Settings.llm = OpenAI(model=ft_model_name, temperature=0.3)

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(similarity_top_k=2, llm=ft_llm)

In [None]:
contexts = []
answers = []

for question in questions:
    response = query_engine.query(question)
    contexts.append([x.node.get_content() for x in response.source_nodes])
    answers.append(str(response))

In [None]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

ds = Dataset.from_dict(
    {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
    }
)

result = evaluate(ds, [answer_relevancy, faithfulness])
print(result)

## Exploring Differences

Let's quickly compare the differences in responses, to demonstrate that fine tuning did indeed change something.

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

In [None]:
questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        questions.append(line.strip())

In [None]:
print(questions[12])

### Original

In [None]:
from llama_index.core.response.notebook_utils import display_response
from llama_index.llms.openai import OpenAI


gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

In [None]:
query_engine = index.as_query_engine(llm=gpt_35_llm)

response = query_engine.query(questions[12])

display_response(response)

### Fine-Tuned

In [None]:
query_engine = index.as_query_engine(llm=ft_llm)

response = query_engine.query(questions[12])

display_response(response)

As we can see, the fine-tuned model provides a more thorough response! This lines up with the increased faithfullness score from ragas, since the answer is more representative of the retrieved context.

## Conclusion

So, in conclusion, finetuning with only ~61 questions actually helped improve our eval scores!

**answer_relevancy: 0.9725 -> 0.9607**

The answer relevancy dips slightly but it's very small.

**faithfulness: 0.7325 -> 0.7917**

The faithfulness appears to have been improved! This mains the anwers given better fuffil the original question that was asked.