# Evaluating the fine-tuned LLM

In [2]:
import psutil


def check_if_running(process_name):
    # Return True if any running process has process_name in its name
    for proc in psutil.process_iter(["name"]):
        # psutil reports None when a process name cannot be read
        if process_name in (proc.info["name"] or ""):
            return True
    return False


ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError(
        "Ollama not running. Launch ollama before proceeding."
    )
print("Ollama running:", ollama_running)

Ollama running: True


In [3]:
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"
with open(file_path, "r") as file:
    test_data = json.load(file)


def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    )
    return instruction_text + input_text

## Querying a local Ollama model

- The query_model function below sends a prompt to a model served locally by Ollama. We use it to have the gemma2:2b model rate our fine-tuned model's responses on a scale from 0 to 100, using the test set responses as the reference.

In [None]:
import urllib.request
def query_model(
    prompt, 
    model="gemma2:2b", 
    url="http://localhost:11434/api/chat"
):
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        # Fixed seed and zero temperature for more reproducible responses
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }
    payload = json.dumps(data).encode("utf-8") 
    request = urllib.request.Request( 
        url, 
        data=payload, 
        method="POST" 
    )
    
    request.add_header("Content-Type", "application/json") 
    response_data = ""
    with urllib.request.urlopen(request) as response: 
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]
    return response_data
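
The loop in query_model reads the response line by line because Ollama's chat endpoint streams its answer as newline-delimited JSON, one chunk per line. The following is a minimal sketch of that assembly step in isolation. The helper name `collect_streamed_content` and the sample chunks are assumptions made for illustration; the chunks are hand-written, not real server output.

```python
import json


def collect_streamed_content(lines):
    """Concatenate message content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        # Each chunk is expected to carry a partial assistant message
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


# Hand-written sample chunks (illustrative, not real server output)
sample_lines = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}\n',
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}\n',
    '{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(collect_streamed_content(sample_lines))
```

Separating the parsing from the network call like this makes the assembly logic easy to check without a running Ollama instance.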

In [7]:
model = "gemma2:2b"
result = query_model("What do Llamas eat?", model)
print(result)

Llamas are herbivores, which means they primarily eat plants!  Here's a breakdown of their diet:

**Main Food Sources:**

* **Grasses:** This is the staple food for llamas. They love to munch on various types of grasses like Timothy grass, orchard grass, and bromegrass. 
* **Hay:** Dried hay provides additional nutrition during colder months or when access to fresh pasture is limited.  
* **Forbs:** These are flowering plants that offer a variety of nutrients.

**Other Occasional Foods:**

* **Leaves:** Llamas will eat leaves from trees and shrubs, but this should be done in moderation as they can be less nutritious than grasses. 
* **Fruits & Vegetables:** In some cases, llamas may enjoy fruits like apples or carrots, but these are not essential parts of their diet.  

**Important Considerations:**

* **Fresh Water:** Llamas need access to fresh water at all times. 
* **Variety:** Offering a variety of grasses and forbs helps ensure they get the full spectrum of nutrients. 
* **Qualit

In [9]:
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt, model))
    print("\n-------------------------")


Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.

Score:
>> I would give the model's response a score of **75/100**.

Here's why:

* **Strengths:** The simile "The car is as fast as a cheetah" effectively compares the speed of the car to that of a cheetah, which is known for its incredible speed. 
* **Weaknesses:**  While this simile works well, it could be even stronger with some slight adjustments. 

**Here's how we can improve the response:**

The car is as fast as a **bolt of lightning**. This comparison highlights the suddenness and explosive power of the car's speed.


**Explanation:**

* **Specificity:**  Using "bolt of lightning" adds a more vivid image to the simile, emphasizing the car's swiftness. 
* **Impact:** The comparison evokes a sense of awe and wonder at the car's incredible speed.



Overall, the model's response is good but could be even better with some minor tweaks. 


-------------------------

Dataset r

## Evaluating the instruction fine-tuned LLM


#### Finding the average score by instructing the model to respond with an integer only

In [11]:
def generate_model_scores(json_data, json_key, model="gemma2:2b"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."    
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            # Skip responses that are not a bare integer
            print(f"Could not convert score: {score}")
            continue
    return scores
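
Calling `int(score)` only succeeds when the judge model returns a bare number. If it adds text around the score (for example `**75/100**`), the entry is dropped. One possible workaround, sketched below, is to pull the first integer in the valid range out of the response. The helper `extract_score` is hypothetical and not part of the notebook, and it will pick the wrong number if the response mentions another in-range integer before the actual score.

```python
import re


def extract_score(text, low=0, high=100):
    """Return the first integer in [low, high] found in text, or None."""
    for match in re.finditer(r"\b\d+\b", text):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None  # no usable score in the response


print(extract_score("I would give the model's response a score of **75/100**."))
print(extract_score("No numeric rating given."))
```

Returning `None` instead of raising keeps the caller in control of whether to skip the entry or count it as a failure.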

In [12]:
scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries: 100%|██████████| 110/110 [05:34<00:00,  3.04s/it]

Number of scores: 110 of 110
Average score: 56.10
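
The average above divides by `len(scores)`, which raises a `ZeroDivisionError` if no response could be converted, and a single mean says little about how the scores are spread. As a rough sketch, the summary could be guarded and extended as follows. The function name `summarize_scores` and the example scores are made up for illustration and are not results from this notebook.

```python
import statistics


def summarize_scores(scores):
    """Return basic summary statistics, or None if there are no scores."""
    if not scores:
        return None
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Made-up scores for illustration only
example_scores = [75, 40, 90, 55]
print(summarize_scores(example_scores))
```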






# End of Notebook