# Custom evaluations of the fine-tuned model locally

After fine-tuning and deploying your model for testing, you can compare the accuracy and faithfulness of its responses with those of other models in RAG-based systems.  This notebook walks you through that comparison.

NOTE: If you'd like to do a standard evaluation using the llm-eval-app service, use the `eval_rh_api.ipynb` notebook.

To prepare for the evaluation, you will need to have the following:
1. A set of models deployed and accessible via an API.
2. A config.yaml file with the model information.
3. A set of reference questions and answers in a common format (csv, jsonl, or qna.yaml).
4. A set of context data in PDF format.  These are generally the documents that the model was trained on.

The process involves the following steps:
1. Sanity check the models to ensure your configuration is working correctly.
2. Generate reference questions and answers from the `reference_answers` directory.
3. Generate sample context data using a Milvus Lite Vector DB and the PDFs in the `data_preparation/document_collection` directory.
4. Get responses from each of the available models.
5. Grade responses using InstructLab.
6. Grade responses using OpenAI ChatGPT-4o as a Judge Model.
7. Save the results and create a resulting score report in Excel, Markdown, and HTML.

By the end of the notebook, you will have a JSON file with the evaluation and a summary of the evaluation results in Excel, Markdown, and HTML formats.

#### Summary
| question index   |   lab-tuned-granite |   lab-tuned-granite-rag |   granite-3.0-8b-instruct-rag |   gpt-4-rag |
|:-----------------|--------------------:|------------------------:|------------------------------:|------------:|
| Q1               |                   4 |                       5 |                             5 |     4       |
| Q2               |                   1 |                       5 |                             5 |     5       |
| ...              |                 ... |                     ... |                           ... |   ...       |
| QX               |                   4 |                       5 |                             5 |     5       |
| Sum              |                   9 |                      15 |                            15 |    14       |
| Average          |                   3 |                       5 |                             5 |     4.66667 |


#### lab-tuned-granite
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |

#### lab-tuned-granite-rag
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |


### Needed packages and imports

The following packages are needed to run the evaluation service.  If you have not already installed them, you can do so by running the following command:

In [1]:
!pip install -r requirements.txt

Collecting instructlab==0.22.1 (from -r requirements.txt (line 1))
  Using cached instructlab-0.22.1-py3-none-any.whl.metadata (56 kB)
Collecting docling==2.8.3 (from -r requirements.txt (line 2))
  Using cached docling-2.8.3-py3-none-any.whl.metadata (7.7 kB)
Collecting einops==0.8.0 (from -r requirements.txt (line 3))
  Using cached einops-0.8.0-py3-none-any.whl.metadata (12 kB)
Collecting langchain==0.3.12 (from -r requirements.txt (line 4))
  Using cached langchain-0.3.12-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community==0.3.12 (from -r requirements.txt (line 5))
  Using cached langchain_community-0.3.12-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-core==0.3.25 (from -r requirements.txt (line 6))
  Using cached langchain_core-0.3.25-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-milvus==0.1.7 (from -r requirements.txt (line 7))
  Using cached langchain_milvus-0.1.7-py3-none-any.whl.metadata (1.9 kB)
Collecting langchain-openai==0.2.12 (from -r r

### Testing Configuration - `config.yaml`

Before running the evaluation, you will need to create a `config.yaml` file with the model information.  You can start from [config_example.yaml](config_example.yaml), which contains placeholder `<FIELDS>`, such as the API key, that you need to fill in.

The file should be in the following format:
```yaml
name: my-eval # this determines the output directory of the evaluation.
judge:
    model_name: gpt-4o # choose the best OpenAI model for judging the responses.
    api_key: sk-12345  # OpenAI API Key is required to run the evaluations for both InstructLab and OpenAI
    template: |        # This is the langchain scoring template for the judge model. It is used in the ChatGPT-4o model to score the responses. The InstructLab model uses its own scoring template.
      Evaluate the answer_quality as:
      - Score 1: The response is completely incorrect, inaccurate, and/or not factual.
      - Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
      ...
testing_configs:
  - name: lab-tuned-granite # this is a name for the testing configuration.  It also determines output file names.
    endpoint_url: https://openai-api.com/v1 # The endpoint URL for the model. Ignore if using OpenAI. Don't forget /v1.
    model_name: finetuned   # model name used by the OpenAI API. e.g. finetuned, gpt-4, etc.
    model_type: vllm        # vllm/openai depending on the model type.  openai will ignore the endpoint_url
    api_key: eyafsdasdfsdf  # API Key for the served model
    rag: False              # Whether or not using RAG.  True if the template has context fields, False if it does not.
    template: |
      <|system|> I am a Red Hat Instruct Model
      <|user|>
      Answer the following question based on internal knowledge.
      Question: {question}
      Answer:
      <|assistant|>
  - name: gpt-4-rag
    model_name: gpt-4
    model_type: openai
    api_key: SK-12345
    rag: True
    template: |
      Context:
      {context}
      Answer the following question from the above context.
      Question: {question}
      Answer:
```
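
The comment on the `rag` field says it should be `True` only when the template has context fields. As a minimal sketch of that rule, the helper below checks a single `testing_configs` entry. The names `template_fields` and `check_rag_flag` are hypothetical and are not part of `eval_utils`, whose own `check_testing_config` may perform different checks.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the placeholder names (e.g. {question}) used in a template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def check_rag_flag(testing_config: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks consistent."""
    problems = []
    fields = template_fields(testing_config.get("template", ""))
    if "question" not in fields:
        problems.append("template has no {question} placeholder")
    uses_context = "context" in fields
    if bool(testing_config.get("rag")) != uses_context:
        problems.append(
            f"rag={testing_config.get('rag')} but template "
            f"{'has' if uses_context else 'lacks'} a {{context}} placeholder"
        )
    return problems


rag_entry = {
    "name": "gpt-4-rag",
    "rag": True,
    "template": "Context:\n{context}\nQuestion: {question}\nAnswer:",
}
mismatched_entry = {
    "name": "lab-tuned-granite",
    "rag": True,  # wrong on purpose: the template below has no {context}
    "template": "Question: {question}\nAnswer:",
}

print(check_rag_flag(rag_entry))         # no problems expected
print(check_rag_flag(mismatched_entry))  # one problem expected
```

Running a check like this on every entry before the sanity check catches a mismatched `rag` flag without spending any API calls.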

In [2]:
from eval_utils import get_config, check_judge_config, check_testing_config

config = get_config()
check_judge_config(config.get("judge"))
for testing_config in config.get("testing_configs"):
    check_testing_config(testing_config)


## Sanity check models

We will first test each of the models to ensure they are working correctly.  This will help us identify any issues with the configuration before running the evaluation.
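
To make the role of the template concrete, here is a rough sketch of how a prompt could be filled in for RAG and non-RAG configurations. It assumes the `{question}` and `{context}` placeholders are substituted with plain `str.format`; the function `render_prompt` is hypothetical, and the actual `chat_request` helper in `eval_utils` may build its prompt differently.

```python
def render_prompt(template: str, question: str, retrieved_context=None) -> str:
    """Fill a prompt template; context is only supplied for RAG configurations."""
    values = {"question": question}
    if retrieved_context is not None:
        values["context"] = retrieved_context
    # A template that expects {context} but receives none raises KeyError,
    # which surfaces a rag/template mismatch early.
    return template.format(**values)


rag_template = (
    "Context:\n{context}\n"
    "Answer the following question from the above context.\n"
    "Question: {question}\nAnswer:"
)
plain_template = (
    "Answer the following question based on internal knowledge.\n"
    "Question: {question}\nAnswer:"
)

print(render_prompt(plain_template, "Who are you?"))
print(render_prompt(rag_template, "Who are you?", "Pretend to be a human named Bob"))
```

Note that `str.format` treats every brace as a placeholder, so a template containing literal braces would need them doubled under this assumption.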


#### Test Requests

In [3]:
from eval_utils import create_llm, chat_request, get_first_config

for testing_config in config["testing_configs"]:
    print("-" * 80)
    print(testing_config.get("name") or testing_config.get("model_name"))
    llm = create_llm(testing_config)
    question = "Who are you?"
    if testing_config.get("rag"):
        retrieved_context = "Pretend to be a human named Bob"
    else:
        retrieved_context = None
    answer = chat_request(llm, testing_config.get("template"), question, retrieved_context)
    print(f"Question: {question}? Answer: {answer}")

--------------------------------------------------------------------------------
claude-3-7-sonnet
Anthropic is here with model claude-3-7-sonnet-20250219
Question: Who are you?? Answer: I'm Bob, a human. How can I help you today?


## Generate Reference Data (Questions, Answers, and Context)

### Use qna.yaml, csv, jsonl to create some data

Before creating a set of reference answers in a common `jsonl` format, you must:

1. Put your reference answers in the `reference_answers` directory.
2. Put any relevant source PDF documents in the `data_preparation/document_collection` directory.

The reference answers should be in a CSV, JSONL, or `qna.yaml` file.  It's preferable to use questions and reference answers written by human subject matter experts, and CSV and JSONL are easy formats for them to work with.  A `qna.yaml` file is also accepted as a convenient option.

The CSV should be formatted with `user_input` and `reference` fields.
| user_input | reference |
|:-----------|----------:|
| What is ...| It is...  |

The JSONL should be formatted with `user_input` and `reference` fields.
```json lines
{"user_input": "What is ...", "reference": "It is..."}
{"user_input": "What is ...", "reference": "It is..."}
```
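
As an illustration of how the CSV and JSONL formats above relate, the following sketch converts CSV text with `user_input` and `reference` columns into JSON Lines using only the standard library. The function names and the sample row are made up for this example; the notebook's own loader is `get_reference_answers` in `eval_utils`, and its behavior may differ.

```python
import csv
import io
import json

REQUIRED_FIELDS = ("user_input", "reference")


def csv_to_records(csv_text: str) -> list[dict]:
    """Parse CSV text into records, keeping only the required fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    return [{f: (row[f] or "").strip() for f in REQUIRED_FIELDS} for row in reader]


def records_to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up sample data for illustration only.
sample_csv = 'user_input,reference\n"What is a savings account?","It is a deposit account."\n'
records = csv_to_records(sample_csv)
print(records_to_jsonl(records))
```

Failing fast on missing columns is a deliberate choice here: a misnamed header would otherwise produce empty questions that only show up later as poor scores.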

The YAML file should be formatted with `seed_examples` and `questions_and_answers` fields.  This mirrors the normal `qna.yaml` format so that you can reuse the qna.yaml from your taxonomy.
```yaml
seed_examples:
    questions_and_answers:
      - question: >
          relevant question?
        answer: >
          reference answer
      - question: >
          relevant question 2?
        answer: >
          reference answer 2
```
After transforming the data, we will write it to a `jsonl` file and add a `retrieved_context` field to each record. A Milvus Lite Vector DB will be generated from the PDFs in `data_preparation/document_collection`, and the context for each question will be retrieved from that document collection.

At this point you can inspect the `reference_answers.jsonl` file in the output directory (named after the `name` field in `config.yaml`) and fix any issues you see, for example by manually correcting the `retrieved_context` field, before moving on.
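
A quick way to support that inspection is to flag records whose fields are missing or blank. The sketch below is one possible check, not part of `eval_utils`. It assumes each line holds `user_input`, `reference`, and `retrieved_context`, and it makes no assumption about whether `retrieved_context` is stored as a string or a list.

```python
import json


def find_incomplete_records(jsonl_text: str) -> list[int]:
    """Return 1-based line numbers of records with a missing or blank field."""
    required = ("user_input", "reference", "retrieved_context")
    incomplete = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if any(not str(record.get(field) or "").strip() for field in required):
            incomplete.append(line_number)
    return incomplete


# Made-up sample: the second record has an empty retrieved_context.
sample = "\n".join([
    json.dumps({"user_input": "Q1?", "reference": "A1", "retrieved_context": "Some passage"}),
    json.dumps({"user_input": "Q2?", "reference": "A2", "retrieved_context": ""}),
])
print(find_incomplete_records(sample))
```

In practice you would pass the contents of the generated `reference_answers.jsonl` file instead of the inline sample.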

In [4]:
import os
import json
from eval_utils import get_output_dir, get_reference_answers, get_context, write_jsonl

output_directory = get_output_dir()
reference_answers = get_reference_answers("./reference_answers")
reference_answers = get_context(reference_answers, "../data_preparation/document_collection")
print(f"{len(reference_answers)} reference answers loaded")

# Create the output directory before writing the reference answers into it.
os.makedirs(output_directory, exist_ok=True)
write_jsonl(f"{output_directory}/reference_answers.jsonl", reference_answers)

reference_answers/knowledge/finance/banking/products/flexible_savings/qna.yaml: 16 questions
reference_answers/knowledge/finance/banking/products/flexible_premier_checking/qna.yaml: 31 questions
reference_answers/knowledge/finance/banking/products/flexible_money_market_savings/qna.yaml: 46 questions
reference_answers/knowledge/finance/banking/products/flexible_core_checking/qna.yaml: 61 questions
reference_answers/knowledge/finance/banking/products/flexible_enhanced_checking/qna.yaml: 77 questions
reference_answers/knowledge/finance/banking/products/flexible_checking/qna.yaml: 92 questions
reference_answers/knowledge/finance/banking/policies/qna.yaml: 107 questions
reference_answers/knowledge/finance/banking/enablement/qna.yaml: 126 questions


RPC error: [create_index], <MilvusException: (code=65535, message=invalid index type: HNSW, local mode only support FLAT IVF_FLAT AUTOINDEX: )>, <Time:{'RPC start': '2025-05-14 16:58:14.762092', 'RPC error': '2025-05-14 16:58:14.762584'}>


345 documents loaded.


126 reference answers loaded


## Get responses from each of the available models

Now that we have the `user_input`, `reference`, and `retrieved_context` fields in the `reference_answers.jsonl` file, we can get responses from each of the available models.  We will save the responses in a `responses` directory for each model.  The responses will be saved in a `jsonl` file with the format of `user_input`, `reference`, `retrieved_context`, and `response`.

In [5]:
import json
import os
import pandas as pd
from eval_utils import read_jsonl

config = get_config()
output_directory = get_output_dir()

responses_directory = output_directory + "/responses"
os.makedirs(responses_directory, exist_ok=True)

reference_answers = read_jsonl(f"{output_directory}/reference_answers.jsonl")

In [6]:
from eval_utils import chat_request, get_testing_config_name
reference_answers_df = pd.DataFrame(reference_answers)

for testing_config in config["testing_configs"]:
    print("-" * 80)
    print(testing_config.get("name") or testing_config.get("model_name"))
    responses = reference_answers_df.copy()
    responses["response"] = ""
    llm = create_llm(testing_config)
    for index, row in responses.iterrows():
        question = row["user_input"]
        print(f"Question {index + 1}:", question[:40])
        if testing_config.get("rag"):
            retrieved_context = row["retrieved_context"]
        else:
            retrieved_context = None
        answer = chat_request(llm, testing_config.get("template"), question, retrieved_context)
        print("Answer: " + answer[:40])
        responses.at[index, "response"] = answer
    testing_config_name = get_testing_config_name(testing_config)
    responses.to_json(f"{responses_directory}/{testing_config_name}_responses.jsonl", orient="records", lines=True)

--------------------------------------------------------------------------------
claude-3-7-sonnet
Anthropic is here with model claude-3-7-sonnet-20250219
Question 1: What is the monthly maintenance fee for 
Answer: Based on the context provided, the month
Question 2: How can the monthly maintenance fee for 
Answer: Based on the context provided, the month
Question 3: Is the Parasol Financial Flexible Saving
Answer: Yes, the Parasol Financial Flexible Savi
Question 4: How is interest calculated on the Paraso
Answer: Based on the context provided, there is 
Question 5: What is extra interest?
Answer: Based on the context provided, extra int
Question 6: Where can I find current interest rate i
Answer: You can find current interest rate infor
Question 7: Does the Parasol Financial Flexible Savi
Answer: Based on the context provided, it is sta
Question 8: What happens if I don't have enough mone
Answer: If you don't have enough money in your a
Question 9: Can I be charged a fee by a mercha

## Grade responses using InstructLab

Now that we have the responses from each of the models, we can grade them using InstructLab.  We will save the scores in an `ilab_scores` directory, with one `jsonl` file per model containing the fields `user_input`, `reference`, `response`, and `score`.  The grading code passes only the first three fields to the evaluator, so `retrieved_context` is not carried over into these files.  InstructLab uses Ragas with ChatGPT-4o as a judge model to score the responses.

In [7]:
config = get_config()
output_directory = get_output_dir()
responses_directory = output_directory + "/responses"
ilab_scores_directory = output_directory + "/ilab_scores"
os.makedirs(ilab_scores_directory, exist_ok=True)

In [8]:
from instructlab_ragas import ModelConfig, RagasEvaluator, RunConfig, Sample
import os

for testing_config in config["testing_configs"]:
    testing_config_name = get_testing_config_name(testing_config)
    print("-" * 80)
    print(testing_config_name)

    responses_filename = f"{responses_directory}/{testing_config_name}_responses.jsonl"
    print(responses_filename)
    responses = pd.read_json(responses_filename, orient="records", lines=True)
    responses_list = responses[["user_input", "reference", "response"]].to_dict(orient="records")

    os.environ["OPENAI_API_KEY"] = config["judge"]["api_key"]
    evaluator = RagasEvaluator()
    evaluation_result = evaluator.run(dataset=responses_list)

    scores = pd.DataFrame(responses_list)
    scores["score"] = [score["domain_specific_rubrics"] for score in evaluation_result.scores]
    scores_filename = f"{ilab_scores_directory}/{testing_config_name}_scores"
    scores.to_json(f"{scores_filename}.jsonl", orient="records", lines=True)

--------------------------------------------------------------------------------
claude_3_7_sonnet
finetuned_eval/responses/claude_3_7_sonnet_responses.jsonl



## Grade responses using OpenAI ChatGPT-4o as a Judge Model

Alternatively, you can customize and score the responses using OpenAI ChatGPT-4o as a judge model and your own custom template from the `judge` field in the config.yaml.

```yaml
name: my-eval # this determines the output directory of the evaluation.
judge:
  endpoint_url: '' # defaults to OpenAI API endpoint
  model_name: gpt-4o
  api_key: your-openai-key
  template: |
    You are an evaluation system tasked with assessing the answer quality of an AI-generated response in relation to the posed question and reference answer. Assess if the response is correct, accurate, and factual based on the reference answer.
    To evaluate the factuality of the answer, look at the reference answer and compare the model answer to it.
    Evaluate the answer_quality as:
    - Score 1: The response is completely incorrect, inaccurate, and/or not factual.
    - Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
    - Score 3: The response is somewhat correct, accurate, and/or factual.
    - Score 4: The response is mostly correct, accurate, and factual.
    - Score 5: The response is completely correct, accurate, and factual.
    Here is the question: \n ------- \n {question} \n -------
    Here is model answer: \n ------- \n {answer} \n -------
    Here is the reference answer(may be very short and lack details or indirect, long and extractive):  \n ------- \n {reference_answer} \n ------- \n
    Assess the quality of the model answer with respect to the reference answer, but do not penalize the model answer for adding details or giving a direct answer to the user question.
    Approach your evaluation in a step-by-step manner.
    To evaluate, first list the key facts covered in the reference answer and check how many are covered by the model answer.
    If the question or reference answer is about steps then check if the steps and their order in model answer match with reference answer.
    Provide your response as JSON object with two keys: 'reasoning' and 'answer_quality'.
```

From this template, ChatGPT-4o will return a JSON object with the `answer_quality` and `reasoning` fields.  The `answer_quality` field will be a score between 1 and 5, with 5 being the best score.  The `reasoning` field will provide a reason for the score.

We will save the scores in an `openai_scores` directory, with one `jsonl` file per model.  Each record has the fields `user_input`, `reference`, `retrieved_context`, `response`, `score`, and `reasoning`.
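
The scoring cells that follow record a score of 0 with the reasoning `Error` when the judge reply cannot be parsed, and leave the score empty when a model returned no response. Neither value belongs to the 1 to 5 rubric, so it can be useful to separate those rows before averaging. The sketch below shows one way to do that; `split_valid_scores` is a hypothetical helper, not part of `eval_utils`.

```python
def split_valid_scores(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split scored rows into (valid, failed) using the 1-5 rubric range."""
    valid, failed = [], []
    for row in rows:
        score = row.get("score")
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if is_number and 1 <= score <= 5:
            valid.append(row)
        else:
            failed.append(row)
    return valid, failed


# Made-up rows for illustration only.
rows = [
    {"user_input": "Q1?", "score": 5, "reasoning": "Matches the reference."},
    {"user_input": "Q2?", "score": 0, "reasoning": "Error"},  # parse-failure fallback
    {"user_input": "Q3?", "score": None, "reasoning": None},  # empty response, never scored
]
valid, failed = split_valid_scores(rows)
mean_valid = sum(r["score"] for r in valid) / len(valid) if valid else None
print(len(valid), len(failed), mean_valid)
```

Whether failed rows should be excluded, retried, or counted as the lowest score is a judgment call; the point of the split is to make that decision explicit rather than letting a 0 silently lower the average.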



In [9]:
import os
from eval_utils import get_config, get_output_dir

config = get_config()
output_directory = get_output_dir()
responses_directory = output_directory + "/responses"
openai_scores_directory = output_directory + "/openai_scores"
os.makedirs(openai_scores_directory, exist_ok=True)

In [10]:
from langchain.prompts import PromptTemplate

scoring_template_str = config["judge"].get("template")
assert scoring_template_str
SCORING_PROMPT = PromptTemplate.from_template(scoring_template_str)

In [11]:
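import json
import re
from openai import OpenAI

# An empty `endpoint_url` falls back to the default OpenAI API endpoint.
judge_client = OpenAI(
    api_key=config["judge"]["api_key"],
    base_url=config["judge"].get("endpoint_url") or None,
)
judge_model_name = config["judge"]["model_name"]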

def openai_score_request(question, answer, reference_answer):
    messages = [
        {
            "role": "user",
            "content": SCORING_PROMPT.format(
                question=question,
                answer=answer,
                reference_answer=reference_answer
            )
        }
    ]

    completion = judge_client.chat.completions.create(
        model=judge_model_name,
        messages=messages,
        n=1,
        temperature=0.0,
        max_tokens=1024,
    )
    response_content = (completion.choices[0].message.content or "").strip()
    # The judge may wrap its JSON in a Markdown code fence; strip it before parsing.
    response_content = re.sub(r'^```(?:json)?\s*', '', response_content)
    response_content = re.sub(r'\s*```$', '', response_content)
    try:
        result = json.loads(response_content)
    except json.JSONDecodeError as e:
        # Fall back to a sentinel score of 0 so one malformed reply does not stop the run.
        result = {"answer_quality": 0, "reasoning": "Error"}
        print("response_content:", response_content)
        print(f"An error occurred: {e}")

    score = result["answer_quality"]
    reasoning = result["reasoning"]
    return score, reasoning

In [12]:
from eval_utils import replace_special_char

for testing_config in config["testing_configs"]:
    testing_config_name = get_testing_config_name(testing_config)
    print("-" * 80)
    print(testing_config_name)

    responses_filename = f"{responses_directory}/{testing_config_name}_responses.jsonl"
    scores = pd.read_json(responses_filename, orient="records", lines=True)
    scores["score"] = None
    scores["reasoning"] = None

    for index, row in scores.iterrows():
        user_input = row["user_input"]
        response = row["response"]
        reference_answer = row["reference"]
        print(f"Question {index + 1}:", user_input)
        if response:
            score, reasoning = openai_score_request(user_input, response, reference_answer)
            scores.at[index, "score"] = score
            scores.at[index, "reasoning"] = reasoning
            print("Answer:", response[:80])
            print("Score:", score, reasoning[:80])

    judge_name = replace_special_char(judge_model_name)
    scores_filename = f"{openai_scores_directory}/{testing_config_name}_scores"
    scores.to_json(f"{scores_filename}.jsonl", orient="records", lines=True)

--------------------------------------------------------------------------------
claude_3_7_sonnet
Question 1: What is the monthly maintenance fee for the Parasol Financial Flexible Savings account?
Answer: Based on the context provided, the monthly maintenance fee for the Parasol Finan
Score: 5 The reference answer provides one key fact: the monthly maintenance fee for the 
Question 2: How can the monthly maintenance fee for the Parasol Financial Flexible Savings account be waived?
Answer: Based on the context provided, the monthly maintenance fee of $8.00 for the Para
Score: 5 The reference answer lists four key ways to waive the monthly maintenance fee fo
Question 3: Is the Parasol Financial Flexible Savings Account insured by the Federal Deposit Insurance Corporation (FDIC)
Answer: Yes, the Parasol Financial Flexible Savings Account is insured by the Federal De
Score: 5 The reference answer states that the Parasol Financial Flexible Savings Account 
Question 4: How is interest calc

## Save Results

Now that we have the scores for each of the models, we can save the results in the `evaluation.json` file. The results will include the `reference_answers`, `ilab_evaluation`, and `openai_evaluation` fields.

```json
{
    "reference_answers": [
        {"user_input": "What is ...", "reference": "It is...", "retrieved_context": "There is ..."},
        {"user_input": "What is ...","reference": "It is...","retrieved_context": "There is ..."}
    ],
    "ilab_evaluation": {
        "status": "complete",
        "results": [
            {
                "name": "lab-tuned-granite",
                "scores": [
                    {"user_input": "What is ...", ..., "score": 4},
                    {"user_input": "What is ...", ..., "score": 4}
                ]
            }
        ]
    },
    "openai_evaluation": {
        "status": "complete",
        "results": [
            {
                "name": "other-model-rag",
                "scores": [
                    {"user_input": "What is ...", ..., "score": 4, "reasoning": "The answer..."},
                    {"user_input": "What is ...", ..., "score": 4, "reasoning": "The answer..."}
                ]
            }
        ]
    }
}
```


In [13]:
import json

config = get_config()
output_directory = get_output_dir()
responses_directory = output_directory + "/responses"
ilab_scores_directory = output_directory + "/ilab_scores"
openai_scores_directory = output_directory + "/openai_scores"
os.makedirs(openai_scores_directory, exist_ok=True)


def read_eval_results(directory):
    results = []
    for testing_config in config["testing_configs"]:
        testing_config_name = get_testing_config_name(testing_config)
        scores = pd.read_json(f"{directory}/{testing_config_name}_scores.jsonl", orient="records", lines=True)
        results.append({
            "name": testing_config.get("name") or testing_config.get("model_name"),
            "scores": scores.to_dict(orient="records")
        })
    return {
        "status": "complete",
        "results": results
    }

evaluation = {}
evaluation["reference_answers"] = read_jsonl(f"{output_directory}/reference_answers.jsonl")
evaluation["ilab_evaluation"] = read_eval_results(ilab_scores_directory)
evaluation["openai_evaluation"] = read_eval_results(openai_scores_directory)
# Use a context manager so the file is flushed and closed after writing.
with open(f"{output_directory}/evaluation.json", "w") as f:
    json.dump(evaluation, f, indent=4)

## Create resulting score report Excel / Markdown / HTML

Now that the evaluation is complete, we can summarize the results in Excel, Markdown, and HTML files for both the InstructLab evaluation and the OpenAI evaluation.  Feel free to use either.  You can find the files in the output directory (named after the `name` field in `config.yaml`).  The summary scores range from 1 to 5, with 5 being the best.  The first table summarizes the scores for each model; the per-model detail tables, including all of the data, follow it.  If a score looks wrong, the detail tables should help diagnose issues such as subpar context retrieval.

#### Summary
| question index   |   lab-tuned-granite |   lab-tuned-granite-rag |   granite-3.0-8b-instruct-rag | gpt-4-rag |
|:-----------------|--------------------:|------------------------:|------------------------------:|----------:|
| Q1               |                   4 |                       5 |                             5 |         4 |
| Q2               |                   1 |                       5 |                             5 |         5 |
| ...              |                 ... |                     ... |                           ... |       ... |
| QX               |                   4 |                       5 |                             5 |         5 |
| Sum              |                   9 |                      15 |                            15 |        14 |
| Average          |                   3 |                       5 |                             5 |   4.66667 |


#### lab-tuned-granite
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |

#### lab-tuned-granite-rag
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |
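
To show how the `Sum` and `Average` rows in the summary table relate to the per-question scores, here is a small sketch that uses the same numbers as two of the example columns above. It works on the `results` structure stored in `evaluation.json` (a list of entries with `name` and `scores`). The function `summarize` is hypothetical and is not the implementation of `summarize_results` in `eval_utils`.

```python
def summarize(results: list[dict]) -> dict[str, dict[str, float]]:
    """Map each model name to its per-question scores plus Sum and Average rows."""
    summary = {}
    for result in results:
        scores = [row["score"] for row in result["scores"]]
        column = {f"Q{i}": s for i, s in enumerate(scores, start=1)}
        column["Sum"] = sum(scores)
        column["Average"] = sum(scores) / len(scores) if scores else float("nan")
        summary[result["name"]] = column
    return summary


# Scores taken from the example summary table (three questions per model).
results = [
    {"name": "lab-tuned-granite", "scores": [{"score": 4}, {"score": 1}, {"score": 4}]},
    {"name": "gpt-4-rag", "scores": [{"score": 4}, {"score": 5}, {"score": 5}]},
]
summary = summarize(results)
print(summary["lab-tuned-granite"]["Sum"], summary["lab-tuned-granite"]["Average"])
print(summary["gpt-4-rag"]["Sum"], round(summary["gpt-4-rag"]["Average"], 5))
```

This sketch assumes every row has a numeric score; rows with missing scores would need to be filtered out first.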




In [14]:
import json

output_directory = get_output_dir()
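# Use a context manager so the file is closed after reading.
with open(f"{output_directory}/evaluation.json") as f:
    eval = json.load(f)  # note: this name shadows Python's built-in `eval`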

In [15]:
from eval_utils import summarize_results, write_excel, write_markdown, write_html

ilab_summary_output_df = summarize_results(eval.get("ilab_evaluation").get("results"))
openai_summary_output_df = summarize_results(eval.get("openai_evaluation").get("results"))

write_excel(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.xlsx"
)

write_excel(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.xlsx"
)

write_markdown(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.md"
)

write_markdown(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.md"
)

write_html(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.html"
)

write_html(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.html"
)
