## 3 Evaluating locally deployed models


### 3.1 Load the (quantized) model onto a single GPU


In [1]:
!pip install flash-attn



In [2]:
import accelerate, bitsandbytes
import torch, os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

from transformers import pipeline

os.environ["BNB_CUDA_VERSION"] = "125"

## choose one of the local models.  Let's use a small one for faster evaluation.
model_path = "/ssdshare/share/Phi-3-mini-128k-instruct/"

# here is how you load a local model using the Hugging Face API
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="cuda:0", torch_dtype="auto", trust_remote_code=True
)
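
The call above loads the checkpoint in its stored precision (`torch_dtype="auto"`), and the imported `BitsAndBytesConfig` is not used. A minimal sketch of what a 4-bit quantized load might look like, assuming `bitsandbytes` is installed and a CUDA GPU is available. The helper name `load_4bit_model` and the specific settings (NF4, bfloat16 compute) are illustrative assumptions, not the notebook's configuration.

```python
def load_4bit_model(model_path, device="cuda:0"):
    """Load a causal LM with 4-bit bitsandbytes quantization (illustrative settings)."""
    # Imports live inside the function so that defining it does not
    # require a GPU or the heavy libraries to be present.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumed settings: NF4 weights with bfloat16 compute.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map=device,
        trust_remote_code=True,
    )
    return model, tokenizer
```

It would be called as `model, tokenizer = load_4bit_model(model_path)`. Quantization trades some numerical precision for a smaller memory footprint, so results may differ from the unquantized model.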



Verify that the model is loaded to GPU (look at the memory utilization).


In [3]:
# let's check the GPU memory utilization after loading the model
!nvidia-smi

Wed Apr 23 16:27:44 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:37:00.0 Off |                  Off |
| 30%   34C    P0             63W /  450W |    7683MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### 3.2 Generate responses locally


In [4]:
def chat_resp(model, tokenizer, question_list):
    # here is how you use the pipeline to generate local responses
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
    )
    generation_args = {
        "max_new_tokens": 500,
        "return_full_text": False,
        "temperature": 0.6,
        "do_sample": True,
    }

    # Pass the whole list of questions in one call (faster than one call per question).
    # Sending too many questions at once can exhaust GPU memory; see chat_resp_batched.
    output = pipe(question_list, **generation_args)
    # Each element of `output` is a list of dicts with a "generated_text" key.
    return output


def chat_resp_batched(model, tokenizer, question_list, batch_size=4):
    # Split a large question list into batches of the specified size, to avoid running out of memory
    batches = [
        question_list[i : i + batch_size]
        for i in range(0, len(question_list), batch_size)
    ]
    all_responses = []
    for batch in batches:
        print(f"processing batch: {batch}")
        responses = chat_resp(model, tokenizer, batch)
        all_responses.extend(responses)
    return all_responses


def gsm8k_prompt(question):
    # add system prompt to the question
    chat = [
        {
            "role": "system",
            "content": """Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line. """,
        },
        {"role": "user", "content": "Question: " + question},
    ]
    return chat

In [5]:
## Test the model with a sample question
p = gsm8k_prompt(
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"
)
p = [p]  # remember to send in a list of questions

chat_resp(model, tokenizer, p)  # p is the list of questions


The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
You are not running the flash-attention implementation, expect numerical differences.


[[{'generated_text': ' Step 1: Identify the number of clips sold in April.\nNatalia sold clips to 48 of her friends in April. So, the number of clips sold in April is 48.\n\nStep 2: Determine the number of clips sold in May.\nIt is mentioned that Natalia sold half as many clips in May as she did in April. To find out the number of clips sold in May, we need to calculate half of the number of clips sold in April.\n\nStep 3: Calculate half the number of clips sold in April.\nHalf of 48 is calculated by dividing 48 by 2.\n48 / 2 = 24\n\nSo, Natalia sold 24 clips in May.\n\nStep 4: Calculate the total number of clips sold in April and May.\nNow, we need to add the number of clips sold in April to the number of clips sold in May to find the total number of clips sold over the two months.\n\nStep 5: Add the number of clips sold in both months.\n48 (April) + 24 (May) = 72\n\nNatalia sold a total of 72 clips in April and May.'}]]

### 3.3 Prepare the evaluation datasets


In [6]:
# add proxy to access huggingface ...
os.environ["HTTP_PROXY"] = "http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ["HTTPS_PROXY"] = "http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ["ALL_PROXY"] = "socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [7]:
from datasets import load_dataset

dataset = load_dataset("gsm8k", "main")  # read directly from huggingface

# if you want to use a local dataset, you can use the following code
# from datasets import load_dataset, load_from_disk
# dataset = load_from_disk("/ssdshare/share/gsm8k")

# to save time, we only use a small subset
subset = dataset["test"][5:12]
questions = subset["question"]
answers = subset["answer"]

dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

In [8]:
# We only want the numeric answers from the dataset for evaluation (maybe a bad choice?)


def get_exact_answer(x):
    i = x.index("####")
    return x[i + 5 :].strip("\n")


num_answers = list(map(get_exact_answer, answers))
print(num_answers)


['64', '260', '160', '45', '460', '366', '694']


In [9]:
# this is a very tentative and poor way to find the exact answer; consider fixing it.

import re


def get_numbers(s):
    # scan the lines from last to first (including the first line)
    # and stop at the first line that contains a number
    for line in reversed(s.split("\n")):
        numbers = re.findall(r"\d+(?:\.\d+)?", line)
        if numbers:
            return numbers[-1]  # the last number on that line is the answer
    return "-9999"  # sentinel: no number found anywhere
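
The comment at the top of this cell invites a better extractor. A minimal sketch of one possible direction, assuming the answer is still the last number on the last line that contains one, but also accepting a leading minus sign and thousands separators. The name `extract_final_number` and the regular expression are illustrative, not part of the notebook.

```python
import re

# Integers or decimals, with an optional minus sign and optional thousands
# separators. The lookbehind keeps "5-3" from being read as "-3".
_NUMBER_RE = re.compile(r"(?<!\d)-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def extract_final_number(text, default="-9999"):
    """Return the last number on the last line of `text` that contains one."""
    for line in reversed(text.splitlines()):
        matches = _NUMBER_RE.findall(line)
        if matches:
            # Drop thousands separators so "1,250" compares equal to "1250".
            return matches[-1].replace(",", "")
    return default


print(extract_final_number("Step 1: 48 / 2 = 24\nTotal: 48 + 24 = 72"))  # 72
print(extract_final_number("The final answer is $1,250."))  # 1250
print(extract_final_number("no digits here"))  # -9999
```

This still fails when the model states the answer and then adds a trailing sentence containing another number, so asking the model for a fixed answer marker and parsing that marker may be more reliable.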

### 3.4 Evaluate!


In [10]:
question_prompts = [gsm8k_prompt(q) for q in questions]
resps = chat_resp_batched(model, tokenizer, question_prompts, batch_size=10)


processing batch: [[{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and stating the final answer as a single exact numerical value on the last line. '}, {'role': 'user', 'content': 'Question: Kylar went to the store to buy glasses for his new apartment. One glass costs $5, but every second glass costs only 60% of the price. Kylar wants to buy 16 glasses. How much does he need to pay for them?'}], [{'role': 'system', 'content': 'Please solve the given math problem by providing a detailed, step-by-step explanation. Begin by outlining each step involved in your solution, ensuring clarity and precision in your calculations. After you have worked through the problem, conclude your response by summarizing the solution and s

In [11]:
llm_answers = []

for resp in resps:
    gen_text = resp[0]["generated_text"]
    print("--------")
    print(gen_text)
    print("--------")
    num = get_numbers(gen_text)
    print(num)
    llm_answers.append(num)
    print("---------")
    print(llm_answers)

--------
 Step 1: Identify the cost of the first glass
The first glass costs $5.

Step 2: Determine the cost of every second glass
Every second glass costs 60% of the price of the first glass. To find this, multiply the price of the first glass ($5) by 60% (or 0.60).
$5 * 0.60 = $3
So, every second glass costs $3.

Step 3: Calculate the number of first and second glasses
Kylar wants to buy 16 glasses. To find out how many first and second glasses he will get, we need to divide 16 by 2 (since every second glass is considered as one).
16 / 2 = 8
So, Kylar will get 8 first glasses and 8 second glasses.

Step 4: Calculate the total cost of first glasses
To find the total cost of the first glasses, multiply the number of first glasses (8) by the price of the first glass ($5).
8 * $5 = $40

Step 5: Calculate the total cost of second glasses
To find the total cost of the second glasses, multiply the number of second glasses (8) by the price of the second glass ($3).
8 * $3 = $24

Step 6: Calc

In [12]:
print(llm_answers)
print(num_answers)

['16', '260', '80', '4', '460', '366', '694']
['64', '260', '160', '45', '460', '366', '694']
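
To see which questions disagree, the two lists printed above can be compared position by position. A small sketch using the values from this run (the variable names `preds` and `refs` are chosen here to avoid overwriting the notebook's own lists):

```python
# Values copied from the output above.
preds = ["16", "260", "80", "4", "460", "366", "694"]
refs = ["64", "260", "160", "45", "460", "366", "694"]

# Collect (index, prediction, reference) for every disagreement.
mismatches = [
    (i, pred, ref)
    for i, (pred, ref) in enumerate(zip(preds, refs))
    if pred != ref
]
for i, pred, ref in mismatches:
    print(f"question {i}: predicted {pred}, expected {ref}")
```

Reading the generated text for each mismatched index helps tell apart a wrong solution from a correct solution whose final number was extracted incorrectly.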


In [13]:
## manual way to compute the correct rate

error = 0
for i in range(0, len(llm_answers)):
    if llm_answers[i] != num_answers[i]:
        error += 1
accuracy = 1 - error / len(llm_answers)
print(f"number of errors: {error} \n correct rate: {accuracy}")

number of errors: 3 
 correct rate: 0.5714285714285714
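
The comparison above is a string comparison, so `"72.0"` and `"72"` would count as an error. A hedged sketch of a numeric comparison using the standard library's `Decimal`; the helper names `to_decimal` and `numeric_accuracy` are assumptions made for this example.

```python
from decimal import Decimal, InvalidOperation


def to_decimal(text):
    """Parse a numeric string, ignoring thousands separators; None if unparseable."""
    try:
        return Decimal(text.replace(",", "").strip())
    except InvalidOperation:
        return None


def numeric_accuracy(predictions, references):
    """Fraction of positions where prediction and reference are numerically equal."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = to_decimal(pred), to_decimal(ref)
        if p is not None and p == r:
            correct += 1
    return correct / len(predictions)


# "72.0" matches "72", "1,250" matches "1250", "abc" cannot be parsed.
print(numeric_accuracy(["72.0", "1,250", "abc"], ["72", "1250", "3"]))
```

On the seven answers from this run the result is the same as the string comparison, because all of the answers are plain integers.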


In [14]:
## compute the same metric with the Hugging Face `evaluate` library

import evaluate

exact_match = evaluate.load("exact_match")
results = exact_match.compute(predictions=llm_answers, references=num_answers)
print(results)

{'exact_match': 0.5714285714285714}


In [15]:
# do not forget to release cached GPU memory
# (memory held by live objects such as `model` is not freed by this call)
import torch

torch.cuda.empty_cache()
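
`torch.cuda.empty_cache()` only returns cached blocks that no tensor is using; weights that are still referenced stay allocated, which is consistent with the `nvidia-smi` output below still showing the model in memory. A minimal sketch of a fuller cleanup, assuming the model is no longer needed. The helper name `release_gpu_memory` is hypothetical.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def release_gpu_memory():
    """Collect garbage and release PyTorch's unused cached GPU memory.

    Returns the number of bytes still allocated by live tensors (0 without CUDA).
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return torch.cuda.memory_allocated()
    return 0


# In the notebook, drop the references first, for example:
#   del model, tokenizer
# and then call:
remaining = release_gpu_memory()
print(f"bytes still allocated by live tensors: {remaining}")
```

Even after this, `nvidia-smi` typically reports some memory in use for the process, since the CUDA context itself occupies GPU memory until the kernel is restarted.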


In [16]:
# check the GPU memory utilization
!nvidia-smi


Wed Apr 23 16:30:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:37:00.0 Off |                  Off |
| 31%   45C    P0             66W /  450W |    7783MiB /  24564MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                