In [1]:
import psutil

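def check_if_running(process_name):
    # Return True if any running process has `process_name` in its name
    for proc in psutil.process_iter(["name"]):
        # proc.info["name"] can be None when the process is inaccessible
        if process_name in (proc.info["name"] or ""):
            return True
    return False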

ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", ollama_running)

Ollama running: True


In [3]:
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"

with open(file_path, "r", encoding="utf-8") as file:
    test_data = json.load(file)


def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
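
The function above indexes `entry["input"]` directly, so it assumes every entry has that key. As a minimal sketch, a variant that tolerates a missing key could look like the following. The name `format_input_safe` and both sample entries are hypothetical and are not taken from the dataset file.

```python
def format_input_safe(entry):
    # Hypothetical variant of format_input: tolerates a missing "input" key
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    extra = entry.get("input")
    input_text = f"\n\n### Input:\n{extra}" if extra else ""
    return instruction_text + input_text


# Made-up entries for illustration only (not from the dataset file)
with_input = {"instruction": "Translate to French.", "input": "Good morning"}
without_input = {"instruction": "Name a primary color."}

prompt_with_input = format_input_safe(with_input)
prompt_without_input = format_input_safe(without_input)
print(prompt_with_input)
```

When the `input` field is empty or absent, the `### Input:` section is left out of the prompt entirely.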

In [8]:
import urllib.request

def query_model(
    prompt,
    model="llama3.2",
    url="http://localhost:11434/api/chat"
):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {     # Settings below make responses as reproducible as possible
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }


    # Convert the dictionary to a JSON formatted string and encode it to bytes
    payload = json.dumps(data).encode("utf-8")

    # Create a request object, setting the method to POST and adding necessary headers
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # Send the request and capture the response
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # The reply arrives as one JSON object per line; join the content chunks
        for line in response:
            response_json = json.loads(line.decode("utf-8"))
            response_data += response_json["message"]["content"]

    return response_data


model = "llama3.2"
result = query_model("list 5 egyptian rappers, think step by step", model)
print(result)

Here are 5 Egyptian rappers:

1. Tamer Hosny - Known for his unique blend of Arabic and English lyrics.
2. Mohamed Ramadan - A popular rapper and singer who has collaborated with international artists.
3. Ahmed Adel - A rising star in the Egyptian hip-hop scene, known for his energetic live performances.
4. Karim Hafez - A veteran rapper and songwriter who has been active in the industry since the 1990s.
5. Mohamed Mounir - A rapper and singer who has gained a large following in Egypt and the Middle East with his catchy hooks and lyrics.
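
`query_model` reads the reply line by line and concatenates the `message.content` field of each JSON object. The sketch below reproduces that assembly step offline so it can be tried without a running server. The helper name `collect_stream` is hypothetical, and the three response lines are hand-written for illustration; real replies may carry additional fields.

```python
import io
import json


def collect_stream(stream):
    # Concatenate the message.content field of each JSON line in the stream
    text = ""
    for raw_line in stream:
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # skip blank lines defensively
        chunk = json.loads(line)
        text += chunk["message"]["content"]
    return text


# Simulated response body (assumed shape, written by hand for this example)
fake_body = io.BytesIO(
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n'
)
assembled = collect_stream(fake_body)
print(assembled)
```

Each line contributes a fragment of the answer, and the concatenation yields the full text that `query_model` returns.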


In [10]:
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")


Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.

Score:
>> To rewrite the sentence using a simile, we need to compare the speed of the car to something else.

Correct output: The car is as fast as lightning.

Score: 100

The model response "The car is as fast as a cheetah." is not correct because it uses a metaphor (a direct comparison) instead of a simile (a comparison using "like" or "as"). A simile should use the words "like" or "as" to make the comparison, which is what the original instruction asked for.

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.

Score:
>> I would rate the model response a 20.

The reason for this low score is that the model response contains an error in its classification of clouds. Cumulonimbus clouds are indeed associated with thunders

In [11]:
def generate_model_scores(json_data, json_key, model="llama3.2"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            # The model sometimes appends an explanation; such entries are skipped
            print(f"Could not convert score: {score}")

    return scores


scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries:   1%|▋                                                                        | 1/110 [00:04<08:20,  4.59s/it]

Could not convert score: 60

Explanation: The original sentence "The car is very fast" can be rewritten using a simile as "The car is as fast as lightning." This response accurately conveys the idea of speed and uses a common simile to do so.

However, the provided correct output "The car is as fast as a cheetah" is not entirely accurate. While it does use a simile, it's more of an analogy than a traditional simile. A traditional simile would compare two things using "like" or "as," whereas an analogy compares two things without using those words.

Therefore, the score of 60 reflects that while the response is close, it's not entirely accurate in terms of simile usage.


Scoring entries:   9%|██████▌                                                                 | 10/110 [00:25<04:03,  2.44s/it]

Could not convert score: 40

The model's response contains an error in identifying 14 as composite and incorrectly including 19 in both categories.


Scoring entries:  15%|██████████▍                                                             | 16/110 [00:39<04:02,  2.58s/it]

Could not convert score: 90

Explanation: The correct classification is:

- Mercury: Liquid
- Oxygen: Gas
- Wood: Solid

The model response "Solid: Mercury / Liquid: Oxygen / Gas: Wood" is mostly correct but in a different format. It's still accurate, so the score remains high.


Scoring entries:  16%|███████████▊                                                            | 18/110 [00:44<03:46,  2.46s/it]

Could not convert score: 60

A note was left by someone.


Scoring entries:  29%|████████████████████▉                                                   | 32/110 [01:18<03:27,  2.66s/it]

Could not convert score: 80

A piece of cake is a type of dessert, not an idiomatic expression meaning something is easy. The original sentence "It's a piece of cake" is a common cliché phrase used to describe tasks that are effortless or require little effort. A more accurate and creative response would be "It's very easy."


Scoring entries:  35%|████████████████████████▊                                               | 38/110 [01:32<03:07,  2.60s/it]

Could not convert score: 80

Explanation: The model's response is mostly correct, but it could be improved by adding more context or providing alternative passive voice constructions. For example, "The plants were watered" could also be rephrased as "Water was applied to the plants."


Scoring entries:  48%|██████████████████████████████████▋                                     | 53/110 [02:08<02:25,  2.55s/it]

Could not convert score: 80

The model's response is mostly correct but contains an error in categorization. The bicycle should be classified as "Vehicles" instead of "Bicyclist".


Scoring entries:  67%|████████████████████████████████████████████████▍                       | 74/110 [02:58<01:33,  2.60s/it]

Could not convert score: 90

The model response "Technical document." is accurate because it correctly classifies the input text as a technical document, which is a manual providing instructions for installing software. The only reason it's not a perfect score is that the response could be more explicit and detailed, but overall, it accurately completes the request.


Scoring entries:  76%|██████████████████████████████████████████████████████▉                 | 84/110 [03:21<01:02,  2.39s/it]

Could not convert score: 80

He remained very calm.


Scoring entries:  99%|██████████████████████████████████████████████████████████████████████▎| 109/110 [04:20<00:02,  2.38s/it]

Could not convert score: 60

She never forgets to call.


Scoring entries: 100%|███████████████████████████████████████████████████████████████████████| 110/110 [04:22<00:00,  2.39s/it]

Number of scores: 100 of 110
Average score: 49.24
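
Ten of the 110 replies above were dropped because the model added text after the number, which makes `int(score)` fail. One possible way to recover them, shown here only as a sketch, is to read the leading integer and ignore whatever follows. The helper `extract_score` and its `low`/`high` parameters are hypothetical, and the numbers reported above come from the stricter `int()` conversion, not from this helper.

```python
import re


def extract_score(text, low=0, high=100):
    # Hypothetical helper: return the leading integer of `text`, or None
    match = re.match(r"\s*(\d+)\b", text)
    if match is None:
        return None
    value = int(match.group(1))
    # Reject values outside the requested scoring range
    return value if low <= value <= high else None


samples = [
    "85",
    "60\n\nA note was left by someone.",
    "I would rate the model response a 20.",
]
parsed = [extract_score(s) for s in samples]
print(parsed)
```

Replies that do not start with a number still yield `None`, so they can be counted and reported separately instead of being silently turned into a score.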




