#### **Evaluation pipeline for Instruction Finetuned Model**

In [10]:
# Load the test data (instructions, reference outputs, and model responses)
import json 

with open('instruction-data-with-response.json', 'r') as f:
    test_data = json.load(f)

In [11]:
print(f"Number of examples in test data : {len(test_data)}")

Number of examples in test data : 110


In [12]:
# Inspect one example from the test data
test_data[0]

{'instruction': 'Rewrite the sentence using a simile.',
 'input': 'The car is very fast.',
 'output': 'The car is as fast as lightning.',
 'model_response': 'The car is as fast as a cheetah.'}

##### **Checking whether Ollama is running**

In [3]:
import psutil

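def check_if_running(process_name):
    # Return True if any running process has process_name in its name
    for proc in psutil.process_iter(["name"]):
        # proc.info["name"] can be None when the name is not accessible
        if process_name in (proc.info["name"] or ""):
            return True
    return False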

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", ollama_running)

Ollama running: True
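
The process check above only confirms that a process with "ollama" in its name exists. As a complementary sketch, the server can also be probed over HTTP. This assumes Ollama listens on the same host and port used later in this notebook and that its root URL answers with a short status text containing the word "Ollama"; the helper name `ollama_server_responds` is hypothetical.

```python
import urllib.request

def ollama_server_responds(base_url="http://localhost:11434/", timeout=2):
    """Return True if an HTTP GET on base_url succeeds and the body mentions Ollama."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers refused connections, timeouts, and HTTP errors
        # (urllib.error.URLError is a subclass of OSError)
        return False
    # Assumption: the status text returned by the server contains "Ollama"
    return "Ollama" in body

print("Ollama server responds:", ollama_server_responds())
```

If this returns False while the process check returns True, the server may be listening on a different port or still starting up.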


In [6]:
import urllib.request
import json

def query_model(
    prompt,
    model="llama3",
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }


    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # Read and decode the response
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data
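
The read loop in `query_model` exists because the chat endpoint streams its answer as newline-delimited JSON, one chunk per line. The following minimal sketch shows the same assembly step on hand-written sample lines, so it can be run without a server. The helper name `join_stream_chunks` and the sample chunks are illustrative assumptions, not captured output.

```python
import json

def join_stream_chunks(lines):
    """Concatenate the message content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between chunks
        chunk = json.loads(line)
        # Each chunk carries a piece of the reply under message.content
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)

# Hand-written sample chunks in the shape the loop above expects
sample_lines = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}',
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_stream_chunks(sample_lines))
```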

In [7]:
model = "llama3"
result = query_model("What do Llamas eat?", model)
print(result)

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.
2. Bark: In some cases, ll

#### **Prompt for Ollama**
We'll write a prompt that instructs Llama 3 (the `llama3` model served by Ollama) to compare the ground-truth output with the finetuned model's response and assign a score from 0 to 100. We'll then save the outputs to a .txt file.


In [8]:
def build_prompt(data):
    return f"""
    You are an evaluation agent. Your task is to compare the correctness and alignment of a model's response with respect to a given instruction and input.

    You are provided with:
    - Instruction: {data['instruction']}
    - Input: {data['input']}
    - Ground-truth Output: {data['output']}
    - Finetuned Model Response: {data['model_response']}

    Your goals:
    1. Compare the finetuned model's response against the ground-truth output.
    2. Evaluate how well the response satisfies the instruction.
    3. Assign a similarity/quality score from 0 to 100.
    4. Provide a brief explanation for the score.

    **Important rules:**
    - Base the score strictly on correctness, completeness, and faithfulness to the required instruction.
    - Do NOT add extra fields.
    - Do NOT alter the content of the instruction, input, output, or model response.

    **Output format (return valid JSON only):**
    {{
        "instruction": "{data['instruction']}",
        "input": "{data['input']}",
        "output": "{data['output']}",
        "model_response": "{data['model_response']}",
        "score": <number between 0 and 100>,
        "reasoning": "<short explanation>"
    }}
    """


In [15]:
# Let's try it with one example

print(query_model(build_prompt(test_data[0]), model))

Here is the evaluation result:

{
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
    "output": "The car is as fast as lightning.",
    "model_response": "The car is as fast as a cheetah.",
    "score": 90,
    "reasoning": "The model's response is close to the ground-truth output, using a simile to compare the car's speed to that of an animal. While it doesn't exactly match the lightning example, it still effectively conveys the idea of the car being very fast."
}
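
The output-format template in `build_prompt` places each field between literal double quotes. If a field itself contains double quotes or a line break, the JSON shown to the evaluator is no longer valid. One possible way to guard against this, sketched below under that assumption, is to let `json.dumps` produce each quoted value. The helper name `json_field` is hypothetical and not part of the original pipeline.

```python
import json

def json_field(value):
    """Return value as a JSON string literal with quotes and newlines escaped."""
    return json.dumps(value, ensure_ascii=False)

# Illustrative entry with embedded double quotes and a line break
entry = {
    "instruction": 'Generate a sentence that follows the pattern: "Never have I ever _____ without _____"',
    "output": "Prime numbers: 11, 19\nComposite numbers: 14",
}

# json_field already adds the surrounding quotes, so the template must not
fragment = (
    "{"
    + f'"instruction": {json_field(entry["instruction"])}, '
    + f'"output": {json_field(entry["output"])}'
    + "}"
)

print(fragment)
# The fragment parses back to the original values
print(json.loads(fragment) == entry)
```

In `build_prompt`, this would mean writing `"instruction": {json_field(data['instruction'])},` without the surrounding quotes in the template.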


In [18]:
# Let's run the evaluation on the whole test set
from tqdm import tqdm

for example_no, data in enumerate(tqdm(test_data), start=1):
    response = query_model(build_prompt(data), model)
    # Append each result right away so partial results survive an interruption
    with open('evaluation_results.txt', 'a') as f:
        f.write(response + "\n\n")
    # Print every 10th evaluation as a progress sample
    if example_no % 10 == 0:
        print(response)
        print("---------------------------------------")

  9%|▉         | 10/110 [05:54<1:00:26, 36.27s/it]

Here is the evaluation result:

{
    "instruction": "Classify the following numbers as prime or composite.",
    "input": ": 11, 14, 19.",
    "output": "Prime numbers: 11, 19
Composite numbers: 14",
    "model_response": "Prime numbers: 11, 20
Composite numbers: 20",
    "score": 66,
    "reasoning": "The model correctly identified the prime number 11 and incorrectly classified 20 as a prime number. It also correctly identified the composite number 14. However, it missed the other prime number 19 in its response."
}
---------------------------------------


 18%|█▊        | 20/110 [11:55<57:41, 38.46s/it]  

Here is the evaluation result:

{
    "instruction": "Generate a sentence that follows the pattern: \"Never have I ever _____ without _____\"",
    "input": "",
    "output": "Never have I ever traveled without a map.",
    "model_response": "The sentence \"Never have I ever\" contains an exclamation mark to show excitement.",
    "score": 20,
    "reasoning": "The model response does not follow the required pattern and does not generate a sentence that completes the phrase. The input is ignored, and the output is irrelevant to the task."
}

Score: 20

Reasoning: The model response is not a correct completion of the given instruction. It provides an analysis of the original sentence instead of generating a new one that follows the pattern.
---------------------------------------


 27%|██▋       | 30/110 [17:46<44:23, 33.29s/it]

Here is the evaluation result:

{
    "instruction": "Convert 5 miles to kilometers.",
    "input": "",
    "output": "5 miles is approximately 8.05 kilometers.",
    "model_response": "5 miles is 5000 kilometers.",
    "score": 0,
    "reasoning": "The model response is completely incorrect, providing a conversion factor that is orders of magnitude off from the actual value. This indicates a lack of understanding of the instruction and the task at hand."
}
---------------------------------------


 36%|███▋      | 40/110 [23:18<40:49, 34.99s/it]

Here is the evaluation result:

{
    "instruction": "Explain what a sonnet is.",
    "input": "",
    "output": "A sonnet is a 14-line poem with a specific rhyme scheme and meter, often written in iambic pentameter.",
    "model_response": "A sonnet is a set of defined, step-by-step procedures or rules to solve a problem or accomplish a task.",
    "score": 20,
    "reasoning": "The model's response does not accurately describe what a sonnet is. The correct definition provided in the ground-truth output highlights the poetic form and its characteristics, whereas the model's response incorrectly defines a sonnet as a set of procedures or rules."
}
---------------------------------------


 45%|████▌     | 50/110 [29:19<37:46, 37.77s/it]

Here is the evaluation result:

{
    "instruction": "Find a synonym for the given verb.",
    "input": "Begin",
    "output": "Commence",
    "model_response": "A synonym for the word 'begin' could be 'start'.",
    "score": 60,
    "reasoning": "The model's response is partially correct, as 'start' can indeed be a synonym for 'begin'. However, it does not fully satisfy the instruction to find a synonym for the given verb. The ground-truth output 'Commence' is more accurate and directly answers the instruction."
}
---------------------------------------


 55%|█████▍    | 60/110 [36:31<34:47, 41.76s/it]

Here is the evaluation result:

{
    "instruction": "Categorize the following sentence as either factual or opinion-based.",
    "input": "Chocolate is the best dessert.",
    "output": "Opinion-based.",
    "model_response": "The topic of the given sentence is chocolate.",
    "score": 20,
    "reasoning": "The model's response does not directly answer the instruction, which specifically asks for categorization as factual or opinion-based. The provided response is also unrelated to the task and does not provide any insight into whether the sentence is factual or opinion-based."
}

Note: I assigned a score of 20 because the model's response is not relevant to the instruction and does not provide any correct information about the sentence being categorized.
---------------------------------------


 64%|██████▎   | 70/110 [39:41<14:10, 21.26s/it]

Here is the evaluation result:

{
    "instruction": "What is a synonym of 'sad'?",
    "input": "",
    "output": "A synonym for 'sad' is 'unhappy'.",
    "model_response": "A synonym for 'sad' is 'ambitious'.",
    "score": 0,
    "reasoning": "The model's response does not provide a correct synonym of 'sad', as 'ambitious' is not a synonym of 'sad'. The ground-truth output correctly identifies 'unhappy' as a synonym, while the model's response is incorrect."
}
---------------------------------------


 73%|███████▎  | 80/110 [42:53<09:20, 18.69s/it]

Here is the evaluation result:

{
    "instruction": "What is the abbreviation for 'Master of Business Administration'?",
    "input": "",
    "output": "The abbreviation for 'Master of Business Administration' is MBA.",
    "model_response": "The abbreviation for 'Master of Business Administration' is 'MBA'.",
    "score": 100,
    "reasoning": "The model's response perfectly matches the ground-truth output, providing the correct answer to the given instruction."
}
---------------------------------------


 82%|████████▏ | 90/110 [46:28<07:05, 21.28s/it]

Here is the evaluation result:

{
    "instruction": "What is the molecular formula for water?",
    "input": "",
    "output": "The molecular formula for water is H2O.",
    "model_response": "The chemical formula for water (dihydrogen monoxide) is H2O.",
    "score": 100,
    "reasoning": "The model's response is identical to the ground-truth output, accurately providing the molecular formula for water as H2O. The response also correctly addresses the instruction and provides additional context in parentheses without affecting the correctness of the answer."
}
---------------------------------------


 91%|█████████ | 100/110 [50:02<03:43, 22.31s/it]

Here is the evaluation result:

{
    "instruction": "Generate a sentence using the word 'innovative'.",
    "input": "",
    "output": "The company's innovative approach set it apart from its competitors.",
    "model_response": "Her innovative ideas led to the success of the project.",
    "score": 80,
    "reasoning": "The model response is mostly correct, but it does not exactly match the ground-truth output. The main difference is in the subject-verb agreement and sentence structure. The model's response uses 'her' as the subject, whereas the ground-truth output uses 'the company'. Additionally, the sentence structures are slightly different. However, both responses do use the word 'innovative' correctly."
}
---------------------------------------


100%|██████████| 110/110 [53:28<00:00, 29.17s/it]

Here is the evaluation result:

{
    "instruction": "Convert 50 miles per hour to kilometers per hour.",
    "input": "",
    "output": "50 miles per hour is approximately 80.47 kilometers per hour.",
    "model_response": "50 miles per hour is 0.5 kilometers per hour.",
    "score": 0,
    "reasoning": "The model response is completely incorrect, as 50 miles per hour cannot be converted to 0.5 kilometers per hour. The correct conversion is approximately 80.47 kilometers per hour."
}
---------------------------------------
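
The responses above are not always clean JSON: they start with a preamble line, sometimes end with extra notes, and can contain raw line breaks inside string values. A minimal sketch for summarizing the run is therefore to pull the `"score"` value out with a regular expression instead of parsing the whole response. The helper names `extract_score` and `average_score` are hypothetical, and the sample strings are shortened for illustration.

```python
import re

# Matches the number that follows the "score" key
SCORE_PATTERN = re.compile(r'"score"\s*:\s*(\d+(?:\.\d+)?)')

def extract_score(response_text):
    """Return the first score found in the text, or None if absent or out of range."""
    match = SCORE_PATTERN.search(response_text)
    if match is None:
        return None
    score = float(match.group(1))
    return score if 0 <= score <= 100 else None

def average_score(responses):
    """Average the scores that could be extracted; None if there are none."""
    scores = [s for s in map(extract_score, responses) if s is not None]
    return sum(scores) / len(scores) if scores else None

# Shortened sample responses in the shape shown above
samples = [
    'Here is the evaluation result:\n\n{\n    "score": 90,\n    "reasoning": "..."\n}',
    '{\n    "score": 0,\n    "reasoning": "..."\n}\n\nNote: ...',
    'No structured result was returned.',
]
print([extract_score(s) for s in samples])
print(average_score(samples))
```

To apply this to the saved file, the contents of `evaluation_results.txt` would first need to be split into individual responses, for example on the preamble line, which is an assumption about how consistently the evaluator emits it.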



