# Evaluation Pipeline

Evaluation the performance of LLM over question answering task, focus on commonsense Reasoning ability.

## Setup

Install the required library

In [10]:
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes

In [1]:
import torch
print(torch.cuda.is_available())

True


## Load the Model

In this section we will load the [Llama 7B model](https://huggingface.co/meta-llama/Llama-2-7b-hf), quantize it in 4bit and try to inference on it.

In [2]:
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load peft config for pre-trained checkpoint etc.
peft_model_id = "/media/asdw/130a808e-ec26-42a1-93ab-42b857d97bd4/capstone_files/llama3-8b-qa-med-12k-r64-a16"

In [3]:
# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  # device_map="auto",
  # torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

print("Peft model loaded")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Peft model loaded


## Load Dataset

we will use the timdettmers/openassistant-guanaco dataset for our evaluation.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco/tree/831dabac2283d99420cda0b673d7a2a43849f17a)

In [4]:
from datasets import load_dataset
from random import randint
import pandas as pd

In [5]:
# Load our test dataset
eval_dataset = load_dataset("json", data_files="/home/asdw/amazon-llm/code/CKM/dataset/medquad_instruct_test.json", split="train")

In [6]:
def extract_prompt_result(text):
    parts = text.split("<s>[INST]")
    prompt = parts[0].split("[/INST] \\n")[1].strip()
    # parts = text.split("### Assistant:")
    # prompt = parts[0].split("### Human:")[1].strip()

    result = parts[1].strip()
    return prompt, result

In [7]:
def extract_prompt_result(text):
    # Split the input text based on the instruction delimiter
    parts = text.split("<s>[INST]")
    
    if len(parts) < 2:
        raise ValueError("Invalid format: Missing '<s>[INST]' delimiter")

    # Split the second part to isolate the instruction and the following text
    instruction_and_result = parts[1].split("[/INST] \\n")

    if len(instruction_and_result) < 2:
        raise ValueError("Invalid format: Missing '[/INST] \\n' delimiter")

    # Extract the question (instruction part)
    question = instruction_and_result[0].strip()

    # Extract the result (remaining part)
    result = instruction_and_result[1].strip()
    
    return question, result

In [8]:
# Apply the function to the 'text' column
eval_df = eval_dataset.to_pandas()
eval_df[['prompt', 'result']] = eval_df['text'].apply(extract_prompt_result).apply(pd.Series)

# Display the new DataFrame
eval_df[['prompt', 'result']]

Unnamed: 0,prompt,result
0,"Answer the question truthfully, you are a medi...",What causes Madelung disease? The exact underl...
1,"Answer the question truthfully, you are a medi...",3MC syndrome is a disorder characterized by un...
2,"Answer the question truthfully, you are a medi...",What are the signs and symptoms of Buschke Oll...
3,"Answer the question truthfully, you are a medi...",Certain factors affect prognosis (chance of re...
4,"Answer the question truthfully, you are a medi...","Benign chronic pemphigus, often called Hailey-..."
...,...,...
195,"Answer the question truthfully, you are a medi...",How is Pelizaeus-Merzbacher disease inherited?...
196,"Answer the question truthfully, you are a medi...",Phenylketonuria (PKU) is a genetic disorder in...
197,"Answer the question truthfully, you are a medi...",Mucolipidosis III alpha/beta is a slowly progr...
198,"Answer the question truthfully, you are a medi...","If you have chronic hepatitis C, you should do..."


In [9]:
tmp = eval_df.copy()

tmp['result_len'] = len(str(tmp['result']))


print(tmp['result'].apply(lambda x: len(x.split())).mean())
print(tmp['result'].apply(lambda x: len(x.split())).std())

195.175
231.4525345589649


## Inference

In [9]:
def generate(prompt):
    formatted_prompt = (
    # f"A chat between a curious human and an artificial intelligence assistant."
    # f"The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    # f"### Human: {prompt} ### Assistant:"
    f"Answer the question truthfully, you are a medical professional. This is the question: {prompt}"
)
    inputs = tokenizer(formatted_prompt, return_tensors="pt")
    outputs = model.generate(**inputs,
                         max_new_tokens=128,
                         do_sample=True,
                         temperature=0.9,
                         top_k=50,
                         top_p=0.9)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]

In [6]:
prompt = "I want to start doing astrophotography as a hobby, any suggestions what could i do? Explain in 50 words."
formatted_prompt = (
    # f"A chat between a curious human and an artificial intelligence assistant."
    f"The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    f"### Human: {prompt} ### Assistant:"
)
# inputs = tokenizer(formatted_prompt, return_tensors="pt")
# outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=128)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))


In [10]:
rand_idx = randint(0, len(eval_df))

prompt = eval_df['prompt'][rand_idx]
print(prompt)

How does the Cypress test runner compare to the Selenium IDE in terms of usability and functionality?


In [11]:
res = generate("What are the symptoms of Mucopolysaccharidosis type VI ?")

print(res)

sional. This is the question: What are the symptoms of Mucopolysaccharidosis type VI ? [/] \n What are the signs and symptoms of Mucopolysaccharidosis type VI? The Human Phenotype Ontology provides the following list of signs and symptoms for Mucopolysaccharidosis type VI. If the information is available, the table below includes how often the symptom is seen in people with this condition. You can use the MedlinePlus Medical Dictionary to look up the definitions for these medical terms. Signs and Symptoms Approximate number of patients (when available) Abnormality of the teeth 90% Abnormality of the voice 


In [11]:
prompt = " What are the symptoms of Loeys-Dietz syndrome ?"
res = generate(prompt)

print(res)

e answers to the user's questions.
### Human:  What are the symptoms of Loeys-Dietz syndrome ? ### Assistant: The symptoms of LMNA-related Loeys-Dietz syndrome can vary depending on the individual, but they can include:

1.  Joint hypermobility, which is excessive range of motion in the joints.

2.  Skin that is easily stretched and has a "tent-like" appearance.

3.  Abnormalities of the spine and other bones, such as curvature, deformities, and weakness.

4.  Heart and blood vessel problems, such as aortic root dilatation,


: 

## Evaluation

In [10]:
eval_df.head(5)

Unnamed: 0,text,prompt,result
0,"<s>[INST] Answer the question truthfully, you ...","Answer the question truthfully, you are a medi...",What causes Madelung disease? The exact underl...
1,"<s>[INST] Answer the question truthfully, you ...","Answer the question truthfully, you are a medi...",3MC syndrome is a disorder characterized by un...
2,"<s>[INST] Answer the question truthfully, you ...","Answer the question truthfully, you are a medi...",What are the signs and symptoms of Buschke Oll...
3,"<s>[INST] Answer the question truthfully, you ...","Answer the question truthfully, you are a medi...",Certain factors affect prognosis (chance of re...
4,"<s>[INST] Answer the question truthfully, you ...","Answer the question truthfully, you are a medi...","Benign chronic pemphigus, often called Hailey-..."


In [11]:
from datasets import Dataset
eval_dataset = Dataset.from_dict(eval_df)
eval_dataset

Dataset({
    features: ['text', 'prompt', 'result'],
    num_rows: 200
})

In [12]:
def evaluate_model(sample):
    # Convert the list to a PyTorch tensor
    formatted_prompt = (
    # f"A chat between a curious human and an artificial intelligence assistant."
    f"Answer the question truthfully, you are a medical professional. This is the question: {sample['prompt']}"

)
    inputs = tokenizer(formatted_prompt, return_tensors="pt")

    # Generate outputs
    outputs = model.generate(**inputs,
                         max_new_tokens=128,
                         do_sample=True,
                         temperature=0.9,
                         top_k=50,
                         top_p=0.9)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(sample['prompt']):]

    labels = sample['result']

    # Some simple post-processing
    return prediction, labels

In [13]:
import random
from tqdm import tqdm

total_rows = len(eval_dataset)
random_indices = random.sample(range(total_rows), 200) # Random select 2 entries for evaluate
test_samples = eval_dataset.select(random_indices)

In [14]:
predictions, references = [] , []
for sample in tqdm(test_samples):
    p,l = evaluate_model(sample)
    predictions.append(p)
    references.append(l)

  0%|          | 0/200 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  0%|          | 1/200 [02:13<7:23:46, 133.80s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  1%|          | 2/200 [04:28<7:22:23, 134.06s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  2%|▏         | 3/200 [06:42<7:20:30, 134.16s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  2%|▏         | 4/200 [08:57<7:19:36, 134.58s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  2%|▎         | 5/200 [11:12<7:17:43, 134.68s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  3%|▎         | 6/200 [13:26<7:14:44, 134.46s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  4%|▎         | 7/200 [15:40<7:11:46, 134.23s/it]Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  4%|▍         | 8/200 [17:

In [15]:
# Create a DataFrame
df = pd.DataFrame({
    'prediction': predictions,
    'references': references
})

In [17]:
# Save DataFrame to CSV file
csv_filename = '/home/asdw/amazon-llm/code/CKM/Evaluation/inference_llama3_result_med_12k_maxlen_512_r64_a16.csv'
df.to_csv(csv_filename, index=False)

## Load the evaluation result

In [18]:
import evaluate
import numpy as np
from tqdm import tqdm

2024-05-21 09:03:18.705207: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [19]:
# Reload data from CSV file
import pandas as pd
csv_filename = '/home/asdw/amazon-llm/code/CKM/Evaluation/inference_llama3_result_med_12k_maxlen_512_r64_a16.csv'
loaded_df = pd.read_csv(csv_filename)

In [20]:
# Access the lists from the DataFrame
loaded_predictions = loaded_df['prediction'].tolist()
loaded_references = loaded_df['references'].tolist()

## BERT Score

In [21]:
# Metric
bertscore  = evaluate.load("bertscore")

def get_bert_score(predictions_, references_): # compute mean score of each prediction-reference match
  score = bertscore.compute(predictions=predictions_, references=references_, lang="en")
  BERT_names = ["precision", "recall"]
  bert_dict = dict((bn,  np.mean(score[bn])) for bn in BERT_names)
  bert_dict['f1'] = 2*(bert_dict['precision']*bert_dict['recall']) / (bert_dict['precision']+bert_dict['recall'])
  return bert_dict

In [22]:
torch.cuda.empty_cache()

In [23]:
scores = get_bert_score(loaded_predictions, loaded_references)

# print results
print(f"\nPrecision: {scores['precision']* 100:2f}%")
print(f"Recall: {scores['recall']* 100:2f}%")
print(f"f1: {scores['f1']* 100:2f}%")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Precision: 86.551847%
Recall: 85.274260%
f1: 85.908304%


## ROUGE Score

In [24]:
metric = evaluate.load("rouge")

In [25]:
# compute metric
rogue = metric.compute(predictions=loaded_predictions, references=loaded_references, use_stemmer=True)

# print results
print(f"Rogue1: {rogue['rouge1']* 100:2f}%")
print(f"rouge2: {rogue['rouge2']* 100:2f}%")
print(f"rougeL: {rogue['rougeL']* 100:2f}%")
print(f"rougeLsum: {rogue['rougeLsum']* 100:2f}%")

Rogue1: 36.463975%
rouge2: 17.431370%
rougeL: 24.310043%
rougeLsum: 25.005254%


### BLUERT Score

https://ai.googleblog.com/2020/05/evaluating-natural-language-generation.html

https://github.com/google-research/bleurt

https://arxiv.org/abs/2004.04696

> Can  capture semantic similarties between sentences

Employ transfer learning:

1. Use the contextual word representation of BERT

2. To prevent to effect of domain shift and quality drift -> Instead of collecting human ratings, we use a collection of metrics and models from the literature (BLEU, BERT Score, ROUGE...)

3. To scaled up the training example, the use random deletions / round-trip translation / random substitution... to get multiple scores out of smae metric

4. Fine-tuning on human ratings

In [None]:
# git clone https://github.com/google-research/bleurt.git
# cd bleurt
# pip install .
# wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip .
# unzip BLEURT-20.zip
# python -m bleurt.score_files \
#   -candidate_file=bleurt/test_data/candidates \
#   -reference_file=bleurt/test_data/references \
#   -bleurt_checkpoint=BLEURT-20

In [26]:
import os
new_working_directory = "/home/asdw/amazon-llm/code/vincent_dev/bleurt"
os.chdir(new_working_directory)

In [27]:
from bleurt import score

In [28]:
checkpoint = "BLEURT-20"
scorer = score.BleurtScorer(checkpoint)

INFO:tensorflow:Reading checkpoint BLEURT-20.


INFO:tensorflow:Reading checkpoint BLEURT-20.


INFO:tensorflow:Config file found, reading.


INFO:tensorflow:Config file found, reading.


INFO:tensorflow:Will load checkpoint BLEURT-20


INFO:tensorflow:Will load checkpoint BLEURT-20


INFO:tensorflow:Loads full paths and checks that files exists.


INFO:tensorflow:Loads full paths and checks that files exists.


INFO:tensorflow:... name:BLEURT-20


INFO:tensorflow:... name:BLEURT-20


INFO:tensorflow:... bert_config_file:bert_config.json


INFO:tensorflow:... bert_config_file:bert_config.json


INFO:tensorflow:... max_seq_length:512


INFO:tensorflow:... max_seq_length:512


INFO:tensorflow:... vocab_file:None


INFO:tensorflow:... vocab_file:None


INFO:tensorflow:... do_lower_case:None


INFO:tensorflow:... do_lower_case:None


INFO:tensorflow:... sp_model:sent_piece


INFO:tensorflow:... sp_model:sent_piece


INFO:tensorflow:... dynamic_seq_length:True


INFO:tensorflow:... dynamic_seq_length:True


INFO:tensorflow:Creating BLEURT scorer.


INFO:tensorflow:Creating BLEURT scorer.


INFO:tensorflow:Creating SentencePiece tokenizer.


INFO:tensorflow:Creating SentencePiece tokenizer.


INFO:tensorflow:Creating SentencePiece tokenizer.


INFO:tensorflow:Creating SentencePiece tokenizer.


INFO:tensorflow:Will load model: BLEURT-20/sent_piece.model.


INFO:tensorflow:Will load model: BLEURT-20/sent_piece.model.


INFO:tensorflow:SentencePiece tokenizer created.


INFO:tensorflow:SentencePiece tokenizer created.


INFO:tensorflow:Creating Eager Mode predictor.


INFO:tensorflow:Creating Eager Mode predictor.


INFO:tensorflow:Loading model.


INFO:tensorflow:Loading model.
2024-05-21 09:05:48.495136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5138 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:67:00.0, compute capability: 7.5
2024-05-21 09:05:48.495731: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9080 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5


INFO:tensorflow:BLEURT initialized.


INFO:tensorflow:BLEURT initialized.


In [29]:
scores = scorer.score(references=loaded_predictions, candidates=loaded_references)
print(f"BLUERT Score: {np.mean(scores)* 100:2f}%")

BLUERT Score: 38.047747%


1. Update the evaluation pipeline
2. Package the model into callable API
3. Create UI for demo