<a href="https://colab.research.google.com/github/ellolo/cthulhu_fine_tuning/blob/main/5_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# System evaluation

This Notebook contains code to evaluate several LLM-based systems on the task of answering 50 questions regarding the content of the Chtulhu Rulebook.
The questions are created manually, together with their expected best answer.

## Evaluation dataset
I evaluate the systems on a dataset of 50 Question Answering (QA) pairs manually derived from the Chtulhu Rulebook. The dataset reflects typical questions that a player could ask about the game, and varies from questions that require a precise answer to open-eneded questions that require the system to reason and summarize the content of the book. The answers provided in the dataset are our gold standard to evaluate the systems.

Examples:

```
{'question': 'Who wrote the Necronomicon chapter?',
 'answer': 'Keith Herber',
 'chapter': 1}

{'question': 'Can you provide a short description of how a Chase takes place',
 'answer': 'A chase take place when the participants have an escape route. The Keeper positions the participants, decided the order of play and how many movements each participant can make in each turn. At the beginning each participant makes a CON roll to adjust their MOV rating. Then, at each round of the chase the participants act in DEX order, and move a number of locations based on their MOV rating. In some cases attacks can also be performed. The Keeper can add hazards and barriers to the chase as well.',
 'chapter': 7}
```

## Evaluation metrics

Each answer in manually evaluated on three metrics:
- **Correctness**: how closely the generated text answers the question at hand.
- **Topicality**: how close the generated text is to the topic of the question at hand.
- **Fluency**: how natural and grammatically correct the generated text is.

For each metric a score from 1 to 5 is assigned, where 5 is best and 1 is worst.
To compare the systems, I take the average of the scores across all questions.
Plese refer to README file for more details on the evaluation metrics and dataset.

## Evaluated systems

I evaluated the following models:

- **base models**: `Llama-3.1-8B-Instruct` and `Llama-3.2-1B-Instruct`
- **LoRA QA fine-tuning**: LoRA adapters on top of a base model, fine-tuned using a QA dataset derived from the Chtulhu Rulebook. The code for to generate this system is in `2_lora_qa_finetuning.ipynb`.
- **LoRA continued pretraining**: LoRA adapters on top of a base model, trained on the raw text extracted from the Chtulhu Rulebook. The code for to generate this system is in `3_lora_cont_pretraining.ipynb`.
- **LoRA combined training**: LoRA adapters on top of a base model, trained on a dataset consisting of both the raw text and a sample of 2,000 QA pairs from the QA dataset. The code for to generate this system is in `4_lora_combined_training.ipynb`.
- **LoRA stacked**: LoRA QA and Continued Pretraining adapters stacked on each other, i.e. we sum their weights.
- **Vanilla RAG**: A simple RAG agent provided with Cthulhu Rulebook as knowledge base and using the base model as LLM.
- **RAG with LoRA**: A simple RAG agent provided with Cthulhu Rulebook as knowledge base and using LoRA adapters on top of the base model as LLM.

## Code in this Notebook

The notebook contains the following code:

- **Section 1,2**: Installs dependencies and load evaluation dataset
- **Section 3**: Implements helper functions to run a system over the evaluation dataset and generate answers
- **Section 4,5**: Loads base language model and runs it on the dataset to generate answers
- **Section 6**: Loads system based on single LoRa adapters and runs them on the dataset  to generate answers
- **Section 7**: Loads system based on stacked LoRa adapters and runs them on the dataset to generate answers
- **Section 8, 9**: Sets-up RAG systems and runs them on the dataset to generate answers
- **Section 10**: Puts together all generated answers of all system in a single file that will be used for manual evaluation.
- **Section 11**: After manual annotation is completed, reads back the annotation file and computes metrics for all systems.



# 1. Install dependencies and clone github repo

In [1]:
%%capture
! pip install -U 'transformers[torch]' datasets bitsandbytes peft evaluate rouge_score langchain langchain-huggingface langchain-community sentence-transformers faiss-gpu-cu12 #faiss-cpu
!pip install numpy==1.26.4 --force-reinstall

In [2]:
# get HF token from secrets and login in HF hub, so that we can download models from HF Hub

from huggingface_hub import login
from google.colab import userdata

HF_TOKEN=userdata.get('HF_TOKEN')
if HF_TOKEN:
  login(HF_TOKEN)
else:
  login()

In [None]:
# Clone github repository.

# Past here the SSH key stored on my personal laptopat: ~/dev/llm_cthulhu_fine_tuning/keys
# This is not safe, but unfortunaly using Colab secrets did not work.

! mkdir -p /root/.ssh
with open("/root/.ssh/id_rsa", mode="w") as fp:
    fp.write("""<YOUR SSH KEY TO THE REPO>""")

# <COPY FROM LOCAL DISK AT: ~/dev/llm_cthulhu_fine_tuning/keys>
! ssh-keyscan -t rsa github.com >> ~/.ssh/known_hosts
! chmod go-rwx /root/.ssh/id_rsa
! git clone git@github.com:ellolo/cthulhu_fine_tuning.git

# 2. Load evaluation dataset





In [2]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dataset_file = "evaluation/evaluation-dataset/cthulhu_eval_dataset.csv"

%cd /content/cthulhu_fine_tuning

/content/cthulhu_fine_tuning


In [None]:
# Load evaluation dataset

from datasets import load_dataset

dataset = load_dataset("csv", data_files=dataset_file, split="train")
dataset[0]

# 3. Define Helper functions

Helper functions to generate answers for the questions in the evaluation dataset using a target model, and pushing these results in github.

In [46]:
# helper functions for evaluation

# system prompt used for all systems (excluding rag)
DEFAULT_SYSTEM_PROMPT = "You are a helpful AI assistant."

# system prompt used for rag systems
DEFAULT_SYSTEM_PROMPT_RAG = (
 "Using the information contained in the context, give a comprehensive answer "
 "to the question. "
 "Respond only to the question asked, response should be concise and relevant "
 "to the question. "
 "Provide the number of the source document when relevant. "
 "If the answer cannot be deduced from the context, do not give an answer. "
)

from dataclasses import dataclass

@dataclass
class Adapter:
  name: str
  subfolder: str

  @property
  def shortname(self):
    return self.name.replace(".", "_")


def generate(model, system_prompt, user_prompt, temperature=0.1, greedy=False):
  """
  Generates text from a language model given a system prompt and user prompt.

  The input context provided to the language model is in chatml format, where
  the first message is a system message with the system prompt, and the second
  message is a user message with the user prompt.

  Args:
    model: The language model to use for generation.
    system_prompt: The system prompt to provide to the model.
    user_prompt: The user prompt to provide to the model.
    temperature: The temperature for text generation (default is 0.1).
    greedy: Whether to use greedy decoding (default is False).

  Returns:
    The generated text as a string.
  """
  # format to chatml
  chat = [
      {"role": "system", "content": f"{system_prompt}"},
      {"role": "user", "content": f"{user_prompt}"}
  ]
  chatml = tokenizer.apply_chat_template(
      chat,
      tokenize = False,
      add_generation_prompt = True
  )

  # tokenize
  inputs = tokenizer(chatml, return_tensors="pt").to(device)
  prompt_length = inputs['input_ids'].shape[1]

  # generate tokens
  with torch.no_grad():
    if greedy:
      # greedy generation
      output_ids = model.generate(**inputs, max_new_tokens = 256, do_sample = False, pad_token_id=tokenizer.pad_token_id)
    else:
      # multinomial sampling generation
      output_ids = model.generate(**inputs, max_new_tokens = 256, temperature = temperature, pad_token_id=tokenizer.pad_token_id)

  # decode generated tokens into text
  output_text = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True) # remove input prompt from output (https://discuss.huggingface.co/t/generate-returns-full-prompt-plus-answer/70453)
  return output_text


def answer_question(model, example, temperature=0.1, greedy=False, system_prompt=None):
    """
    Answers a question using a language model.

    Args:
      model: The language model to use for generation.
      example: A dictionary containing the input data, expected to have a "question" key.
      temperature: The temperature for text generation (default is 0.1).
      greedy: Whether to use greedy decoding (default is False).
      system_prompt: A system prompt to use for the model (default is DEFAULT_SYSTEM_PROMPT).

    Returns:
      The generated answer as a string.
    """
    if not system_prompt:
      system_prompt = DEFAULT_SYSTEM_PROMPT
    user_prompt = f"{example['question']}"
    return generate(model, system_prompt, user_prompt, temperature=temperature, greedy=greedy)


def evaluate_model(model, dataset, temperature=0.1, greedy=False, system_prompt=None):
  """
  Generates answers for all questions in a dataset using a given language model.

  Args:
    model: The language model to use for generation.
    dataset: A Hugging Face Dataset object, containing the questions in the field "question".
    temperature: The temperature for text generation (default is 0.1).
    greedy: Whether to use greedy decoding (default is False).
    system_prompt: A system prompt to use for the model (default is DEFAULT_SYSTEM_PROMPT).

  Returns:
    A Hugging Face Dataset object with an added "generated_answer" column containing the generated answers.
  """
  generated_texts = []
  for ct, example in enumerate(dataset):
    print(f"Generating answer {ct+1}/{len(dataset)}")
    generated_text = answer_question(model, example, temperature=temperature, greedy=greedy, system_prompt=system_prompt)
    generated_texts.append(generated_text)
  return dataset.add_column("generated_answer", generated_texts)


def answer_question_rag(model, example, rag_store, temperature=0.1, greedy=False, system_prompt=None):
    """
    Answers a question using a RAG approach with a language model.

    Args:
      model: The language model to use for generation.
      example: A dictionary containing the input data, expected to have a "question" key.
      rag_store: The RAG store containing relevant documents.
      temperature: The temperature for text generation (default is 0.1).
      greedy: Whether to use greedy decoding (default is False).
      system_prompt: A system prompt to use for the model (default is DEFAULT_SYSTEM_PROMPT_RAG).

    Returns:
      A tuple containing the generated answer as a string and a list of the top RAG documents used.
    """
    if not system_prompt:
      system_prompt = DEFAULT_SYSTEM_PROMPT_RAG
    top_chunks = rag_store.similarity_search(example["question"], k=5)
    rag_context = "\nExtracted documents:\n\n"
    rag_context += "\n\n".join([f"Document {str(i)}:::\n" + doc.page_content for i, doc in enumerate(top_chunks)])
    user_prompt = (
        "Context:\n"
        f"{rag_context}\n"
        "---\n"
        "Here below is the question you need to answer.\n"
        f"Question: {example['question']}"
    )
    output_text = generate(model, system_prompt, user_prompt, temperature=temperature, greedy=greedy)
    return (output_text, [doc.page_content for doc in top_chunks])


def evaluate_model_rag(model, dataset, rag_store, temperature=0.1, greedy=False, system_prompt=None):
  """
  Generates answers for all questions in a dataset using a RAG approach with a given language model.

  Args:
    model: The language model to use for generation.
    dataset: A Hugging Face Dataset object, containing the questions in the field "question".
    rag_store: The RAG store containing relevant documents.
    temperature: The temperature for text generation (default is 0.1).
    greedy: Whether to use greedy decoding (default is False).
    system_prompt: A system prompt to use for the model (default is DEFAULT_SYSTEM_PROMPT_RAG).

  Returns:
    A Hugging Face Dataset object with added "generated_answer" and "rag_top_docs" columns,
    containing the generated answers and the list of the top RAG documents used for the answers.
  """
  generated_texts = []
  rag_results = []
  for ct, example in enumerate(dataset):
    print(f"Generating answer {ct+1}/{len(dataset)}")
    generated_text, rag_result = answer_question_rag(model, example, rag_store, temperature=temperature, greedy=greedy, system_prompt=system_prompt)
    generated_texts.append(generated_text)
    rag_results.append("\n==========\n".join(rag_result))
  return dataset.add_column("generated_answer", generated_texts).add_column("rag_top_docs", rag_results)


def push_results_to_git(dataset_with_answers, adapter: Adapter, system_prompt=None, quantized_model=False):
  """
  Pushes evaluation results to the git repository.

  Args:
    dataset_with_answers: A Hugging Face Dataset object containing the generated answers.
    adapter: An Adapter object containing the adapter name and subfolder.
    system_prompt: The system prompt used for generation (optional).
    quantized_model: A boolean indicating if the model was quantized (default is False).
  """

  output_file_name = adapter.name + "-" + adapter.subfolder

  if not system_prompt:
    system_prompt = DEFAULT_SYSTEM_PROMPT

  if system_prompt != DEFAULT_SYSTEM_PROMPT:
    output_file_name = output_file_name + "_prompt_" + str(hash(system_prompt)% 10**6)

  if quantized_model:
    output_file = f"evaluation/evaluation-results-quantized/res_{output_file_name}.csv"
  else:
    output_file = f"evaluation/evaluation-results/res_{output_file_name}.csv"

  dataset_with_answers.add_column("system_prompt", [system_prompt for _ in range(len(dataset_with_answers))]) \
    .to_csv(output_file)

  ! git config --global user.email "marco.pennacchiotti@gmail.com"
  ! git add {output_file}
  ! git commit -m "added generated answers"
  ! git push origin main

# 4. Load base model

Load the base model to which we will attach the Lora adapters that we want to evaluate.

I load either `Llama-3.1-8B-Instruct` or `Llama-3.2-1B-Instruct`, since we have tuned adapters for both these base models

I experimented with two versions of the base models:
- **with quantization**: \\
I used the same quantization used during fine-tuning of the adapter. This saves memory and ensures that the setup is consistent with the fine-tuning setup.
- **without quantization**: \\
I experimented also with the non-quantized base model, to tackle an issue emerging when using stacked adapters. Specifically, when the first adapter of the stack is merged to the base mode using `merge_and_unload()` and the base model is quantized, we see that the results are not the same as simply applying the adapter on top of the base model. When instead we use the base model without quantization, we get the same results as expected. According to Hugging Face, this is exptected, especially when using quanization, see discussion [here](https://discuss.huggingface.co/t/model-merging-leads-to-different-output/103986/3).


In [None]:
# Load base model from Hugging Face Hub

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM, PeftModel

base_model_name = "meta-llama/Llama-3.1-8B-Instruct" # "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left", use_fast=True)

######
# Use this to load the model quantized in the same fashion as during fine-tuning of Lora
######
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_model_name, quantization_config=bnb_config).eval().to(device)

######
# Use this to use the model without quantization
######
#model = AutoModelForCausalLM.from_pretrained(base_model_name).eval().to(device)



# 5. Run base model

Use the base mode to generate answers for the questions in the evaluation dataset, and store these answers in github.


In [None]:
# generate answers for base model
dataset_with_answers = evaluate_model(model, dataset, greedy=True)
# push dataset contaning generate answer to github
push_results_to_git(dataset_with_answers, adapter=Adapter(name="Llama-3.1-8B-Instruct", subfolder="base"))

# 6. Run LoRA adapters

Use each of the fine-tuned LoRA adapters to generate answers for the questions in the evaluation dataset, and store these answers in github.


In [None]:
#evaluate first adapter

namespace="mpenna77"
adapter=Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu", subfolder="checkpoint-450")
#adapter=Adapter(name="Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu", subfolder="best-model")

# load adapter model
peft_model = PeftModel.from_pretrained(model, f"{namespace}/{adapter.name}", subfolder=adapter.subfolder).eval().to(device)
## generate answers
dataset_with_answers = evaluate_model(peft_model, dataset, greedy=True)
# push dataset contaning generate answer to github
push_results_to_git(dataset_with_answers, adapter=adapter)

In [None]:
#evaluate all other adapters

adapters = [
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu", subfolder="checkpoint-2127"),
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu", subfolder="checkpoint-2127"),
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu", subfolder="best-model"),
    Adapter(name="Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu", subfolder="checkpoint-1420"),
    Adapter(name="Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu", subfolder="best-model"),
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu", subfolder="checkpoint-455"),
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-combo-no_head-lora-cthulhu", subfolder="checkpoint-518"),
]
for adapter in adapters:
  print(f"Generating for adapter: {adapter.name}")
  # load adapter
  peft_model.load_adapter(f"{namespace}/{adapter.name}", subfolder=adapter.subfolder, adapter_name=adapter.shortname)
  # set adapter
  peft_model.set_adapter(adapter.shortname)
  # generate answers
  dataset_with_answers = evaluate_model(peft_model, dataset, greedy=True)
  # push dataset contaning generate answer to github
  push_results_to_git(dataset_with_answers, adapter)

'\nfor adapter in adapters:\n  print(f"Generating for adapter: {adapter.name}")\n  # load adapter\n  peft_model.load_adapter(f"{namespace}/{adapter.name}", subfolder=adapter.subfolder, adapter_name=adapter.shortname)\n  # set adapter\n  peft_model.set_adapter(adapter.shortname)\n  # generate answers\n  dataset_with_answers = evaluate_model_on_dataset(peft_model, dataset, greedy=True)\n  # push to git\n  push_results_to_git(dataset_with_answers, adapter)\n'

# 7. Run stacked LoRA adapters

Use stacked LoRA adapters to generate answers for the questions in the evaluation dataset, and store these answers in github.

I experimented two setups. These setups should lead to the same answers. However we double check both of them to double check that the HF libraries are working as expected.

1. **Base + qa + continuted** \\
First merge the QA Adapter (i.e. the adapter tuned with the QA dataset) to the base mode, and then add the Continued Adapter (i.e. the adapter trained with the Continued Pretrained dataset) on top.

2. **Base + continuted + qa** \\
First merge the Continued Adapter to the base mode, and then add the QA Adapter on top.

I also tried to use `add_weighted_adapter()` function (see [here](https://huggingface.co/docs/peft/main/en/developer_guides/lora#merge-lora-weights-into-the-base-model)) in order to stack two adapters, but this approach leads to generate gibberish text. This should be investigated further.

**Important:** Before running the code below, reload the base model from scratch.


In [None]:
#########################
# Base + qa + cotinuted #
#########################

# TODO: https://kaitchup.substack.com/p/lora-adapters-when-a-naive-merge

namespace="mpenna77"


# load QA adapter
qa_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu", subfolder="checkpoint-2127")
peft_model = PeftModel.from_pretrained(model, f"{namespace}/{qa_adapter.name}", subfolder=qa_adapter.subfolder, adapter_name="qa").eval().to(device)
# set QA adapter
peft_model.set_adapter("qa")
print(f"current active adapters: {peft_model.active_adapters}")
# merge QA adapter
peft_model.merge_and_unload()


# load cont adapter
#   we need to use a new name for the adapter, otherwise it seems like the adapter
#   has not effect (probably it gets unloaded after the merge_and_unload())
cont_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu", subfolder="checkpoint-455")
peft_model.load_adapter(f"{namespace}/{cont_adapter.name}", subfolder=cont_adapter.subfolder, adapter_name="cont")
# set cont adapter
peft_model.set_adapter("cont")
print(f"current active adapters: {peft_model.active_adapters}")


# generate answers
dataset_with_answers = evaluate_model(peft_model, dataset, greedy=True)
# push to git
push_results_to_git(dataset_with_answers, Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_qa-adapter_cont-lora-cthulhu", subfolder="checkpoint-last"))

In [None]:
#########################
# Base + cotinuted + qa #
#########################

namespace="mpenna77"

# load cont adapter
cont_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu", subfolder="checkpoint-455")
peft_model = PeftModel.from_pretrained(model, f"{namespace}/{cont_adapter.name}", subfolder=cont_adapter.subfolder, adapter_name="cont").eval().to(device)
# set cont adapter
peft_model.set_adapter("cont")
print(f"current active adapters: {peft_model.active_adapters}")
# merge cont adapter
peft_model.merge_and_unload()

#load qa adapter
qa_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu", subfolder="checkpoint-2127")
peft_model.load_adapter(f"{namespace}/{qa_adapter.name}", subfolder=qa_adapter.subfolder, adapter_name="qa")
# set qa adapter
peft_model.set_adapter("qa")
print(f"current active adapters: {peft_model.active_adapters}")


# generate answers
dataset_with_answers = evaluate_model(peft_model, dataset, greedy=True)
# push to git
push_results_to_git(dataset_with_answers, Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_cont-adapter_qa-lora-cthulhu", subfolder="checkpoint-last"))

# 8. Run RAG agent

Use a simple RAG agent to generate answers for the questions in the evaluation dataset, and store these answers in github.

I used [Langchain](https://www.langchain.com/) to implement the RAG agent. The vector database supporting the agent is created as follows:
- Split the Cthulhu Rule Book into chunks of 1000 characters, overlapping of 100 characters.
- Generate an embedding for each chunk using the [jinaai/jina-embeddings-v3](https://huggingface.co/jinaai/jina-embeddings-v3) embedding model. This model is known to perform well on RAG tasks.
- Store the embeddings in a [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) in-memory vector database

At inference time, the system answers each question from the evaluation dataset using the following typical RAG approach:
- the question is projected in the embedding space, and the top-5 most similar chunks from the vector database are retrieved
- The `Llama-3.1-8B-Instruct` language model is invoked to generate the answer. The model is provided a typical RAG context as input, including:
  1. a system prompt that recommends the model to base the answer on the top-5 chunks;
  2. a user prompt that contains the text of the top-5 chunks



In [25]:
## Split Rulebook into chunks

from langchain_community.document_loaders.text import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

%cd /content/cthulhu_fine_tuning


# load cthulhu document file
loader = TextLoader("data/output/cthulhu.txt", encoding="utf-8")
doc = loader.load()

# spit document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    strip_whitespace=True,
    add_start_index=True,
    separators = ["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(doc)

print(f"Number of generated chunks: {len(chunks)}")

/content/cthulhu_fine_tuning
Number of generated chunks: 2167


In [None]:
## Load embedding model

import torch
from langchain_huggingface import HuggingFaceEmbeddings

EMBEDDING_MODEL_NAME = "jinaai/jina-embeddings-v3"
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": device, "trust_remote_code":True},
    encode_kwargs={"normalize_embeddings": True, "convert_to_numpy":True},  # Set `True` for cosine similarity
    show_progress=True,
)

In [None]:
# Generate embeddings for chunks and populate Faiss vector database

import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

faiss_store = FAISS(
    embedding_function=embedding_model,
    index=faiss.IndexFlatL2(len(embedding_model.embed_query("hello world"))),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

faiss_store.add_documents(chunks)

In [None]:
# evaluate RAG with base model

dataset_with_answers = evaluate_model_rag(peft_model, dataset, faiss_store, greedy=True)
push_results_to_git(dataset_with_answers, Adapter(name="Llama-3.1-8B-Instruct-RAG", subfolder="base"))

# 9. Run RAG agent with stacked LoRA adapters

Use a simple RAG agent with stacked adapters to generate answers for the questions in the evaluation dataset, and store these answers in github.

This is the same as in (8). However instead of using `Llama-3.1-8B-Instruct` as language model, we use `Llama-3.1-8B-Instruct` with stacked Lora adapters as in (7), i.e. _base + qa + continuted_.

In [None]:
# evaluate RAG with PEFT adapters

namespace="mpenna77"
qa_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu", subfolder="checkpoint-2127")
peft_model = PeftModel.from_pretrained(model, f"{namespace}/{qa_adapter.name}", subfolder=qa_adapter.subfolder, adapter_name="qa").eval().to(device)
peft_model.set_adapter("qa")
peft_model.merge_and_unload()

cont_adapter = Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu", subfolder="checkpoint-455")
peft_model.load_adapter(f"{namespace}/{cont_adapter.name}", subfolder=cont_adapter.subfolder, adapter_name="cont")
peft_model.set_adapter("cont")

dataset_with_answers = evaluate_model_rag(peft_model, dataset, faiss_store, greedy=True)

push_results_to_git(
    dataset_with_answers,
    Adapter(name="Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_qa-adapter_cont-lora-cthulhu-RAG", subfolder="checkpoint-last"),
    quantized_model=True
)

In [None]:
# show results

import pandas as pd
df = pd.DataFrame(
    {
        "question": dataset_with_answers["question"],
        "answer": dataset_with_answers["answer"],
        "generated_answer": dataset_with_answers["generated_answer"],
        "rag_top_docs": dataset_with_answers["rag_top_docs"]
    }
)

df

# 10. Prepare manual annotation file

This is code that generates a CSV file, which contains the original 50 questions from the evaluation dataset, and, for each question, the answers generated by all systems experimented in this notebook.

This CSV file is used to perform a manual evaluation of the experiment systems. Each answer generated by a system, is evaluate on three metrics: **fluency**, **topicality** and **correctness**. Each answer is scored on each metric with values from 1 (worse) to 5 (best). See README.md for details on the manual annotation process.

In [None]:
import glob
import csv
from pathlib import Path

%cd /content/cthulhu_fine_tuning

# set evaluation metrics
metrics = ["Fluency", "Topicality", "Correctness"]
system_names = []
answers = {}

# load generated answer from all systems
for file_name in glob.glob("evaluation/evaluation-results*/*csv"):
  system_name = Path(file_name).stem.replace("res_","")
  if "quantized" in file_name:
    system_name = system_name + "-quantized"
  system_names.append(system_name)
  with open(file_name, "r") as f:
    reader = csv.DictReader(f)
    answers[system_name] = [row["generated_answer"] for row in reader]

# load ground truth questions and answers
questions = []
gs_answers = []
with open(glob.glob("evaluation/evaluation-results*/*csv")[0], "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
      questions.append(row["question"])
      gs_answers.append(row["answer"])

# write annotation file
header = []
header.append("question")
header.append("ground_truth_answer")
for system_name in system_names:
  header.append(f"{system_name} - generated_answer")
  for metric in metrics:
    header.append(f"{system_name} - {metric}")

Path("evaluation/manual-annotation/").mkdir(exist_ok=True)

with open("evaluation/manual-annotation/manual_annotation.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(header)
    for i in range(len(gs_answers)):
      row = [questions[i]]
      row.append(gs_answers[i])
      for system_name in system_names:
        row.append(answers[system_name][i])
        for metric in metrics:
          row.append("")
      writer.writerow(row)

# push annotation file to git
! git config --global user.email "marco.pennacchiotti@gmail.com"
! git add evaluation/manual-annotation/manual_annotation.csv
! git commit -m "added manual annotation"
! git push origin main


# 11. Evaluate systems

Code to read the results of the manual annotation, compute the performance of each system, and create plots for these.

For each of the performance metrics (**fluency**, **topicality**, **correctness**) I compute the average across the dataset, together with the 0.05 and 0.95 percentile.

## Load data

In [2]:
# Loaded annotated file including annotation for each system

import csv
import pandas as pd

%cd /content/cthulhu_fine_tuning

with open("evaluation/manual-annotation/manual_annotation_completed.csv") as f:
  df = pd.read_csv(f)

/content/cthulhu_fine_tuning


In [4]:
# check that all manual rating are in set {1,2,3,4,5}
# by computing numbers of cell that are not in the set

print("number of cells with annotations out of range:")
print(f"Fluency: {(~df.filter(regex='Fluency')).isin([1,2,3,4,5]).sum().sum()}")
print(f"Topicality: {(~df.filter(regex='Topicality')).isin([1,2,3,4,5]).sum().sum()}")
print(f"Correctness: {(~df.filter(regex='Correctness')).isin([1,2,3,4,5]).sum().sum()}")


number of cells with annotations out of range:
Fluency: 0
Topicality: 0
Correctness: 0


## Brakedown by metrics

In [5]:
# For each system, compute the average Correctness across the dataset, together
# with 0.05 and 0.95 quantiles

import numpy as np

def quantile_05(arr):
  return arr.quantile(q=0.5)

def quantile_95(arr):
  return arr.quantile(q=0.95)

df_cor = df.filter(regex="Correctness")
(df_cor.agg(["mean", quantile_05, quantile_95], axis=0).transpose()).sort_values(by="mean", ascending=False)

Unnamed: 0,mean,quantile_05,quantile_95
Llama-3.1-8B-Instruct-RAG-base-quantized - Correctness,4.04,5.0,5.0
Llama-3.1-8B-Instruct-RAG-base - Correctness,4.0,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu-RAG-checkpoint-450 - Correctness,3.0,3.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_qa-adapter_cont-lora-cthulhu-RAG-checkpoint-last-quantized - Correctness,2.6,2.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu-checkpoint-455 - Correctness,2.48,2.0,5.0
Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu-best-model-quantized - Correctness,2.36,2.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model - Correctness,2.34,2.0,5.0
Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu-best-model - Correctness,2.34,2.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-checkpoint-2127 - Correctness,2.32,2.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu-checkpoint-2127 - Correctness,2.32,2.0,5.0


In [8]:

# For each system, compute the average Topicality across the dataset, together
# with 0.05 and 0.95 quantiles

df_top = df.filter(regex="Topicality")
(df_top.agg(["mean", quantile_05, quantile_95], axis=0).transpose()).sort_values(by="mean", ascending=False)

Unnamed: 0,mean,quantile_05,quantile_95
Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu-checkpoint-1420-quantized - Topicality,4.94,5.0,5.0
Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu-best-model-quantized - Topicality,4.94,5.0,5.0
Llama-3.1-8B-Instruct-RAG-base - Topicality,4.92,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-checkpoint-2127 - Topicality,4.9,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu-checkpoint-2127 - Topicality,4.9,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model - Topicality,4.9,5.0,5.0
Llama-3.1-8B-Instruct-RAG-base-quantized - Topicality,4.9,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model-quantized - Topicality,4.88,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu-checkpoint-2127-quantized - Topicality,4.88,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-checkpoint-2127-quantized - Topicality,4.86,5.0,5.0


In [9]:
# For each system, compute the average Fluency across the dataset, together
# with 0.05 and 0.95 quantiles

df_flu = df.filter(regex="Fluency")
(df_flu.agg(["mean", quantile_05, quantile_95], axis=0).transpose()).sort_values(by="mean", ascending=False)

Unnamed: 0,mean,quantile_05,quantile_95
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-checkpoint-2127 - Fluency,4.96,5.0,5.0
Llama-3.1-8B-Instruct-RAG-base - Fluency,4.96,5.0,5.0
Llama-3.1-8B-Instruct-base-quantized - Fluency,4.96,5.0,5.0
Llama-3.1-8B-Instruct-RAG-base-quantized - Fluency,4.96,5.0,5.0
Llama-3.1-8B-Instruct-base - Fluency,4.96,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model-quantized - Fluency,4.86,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-checkpoint-2127-quantized - Fluency,4.84,5.0,5.0
Llama-3.2-1B-Instruct-base - Fluency,4.84,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model - Fluency,4.82,5.0,5.0
Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu-checkpoint-2127 - Fluency,4.82,5.0,5.0


## Aggregated results

In [4]:
# Show all metrics together

df_cor_x = df.filter(regex="Correctness") \
             .agg(["mean"], axis=0) \
             .transpose() \
             .reset_index() \
             .rename(columns={"mean":"Correctness", "index":"System"}) \
             .replace(to_replace=r' - Correctness', value='', regex=True)

df_top_x = df.filter(regex="Topicality") \
             .agg(["mean"], axis=0) \
             .transpose() \
             .reset_index() \
             .rename(columns={"mean":"Topicality", "index":"System"}) \
             .replace(to_replace=r' - Topicality', value='', regex=True)

df_flu_x = df.filter(regex="Fluency") \
             .agg(["mean"], axis=0) \
             .transpose() \
             .reset_index() \
             .rename(columns={"mean":"Fluency", "index":"System"}) \
             .replace(to_replace=r' - Fluency', value='', regex=True)

df_merged = df_cor_x.merge(df_top_x, left_on='System', right_on='System') \
                    .merge(df_flu_x, left_on='System', right_on='System') \

df_merged["Total"] = df_merged["Correctness"] * df_merged["Topicality"] * df_merged["Fluency"]

df_merged.sort_values(by="Total", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
22,Llama-3.1-8B-Instruct-RAG-base-quantized,4.04,4.9,4.96,98.18816
10,Llama-3.1-8B-Instruct-RAG-base,4.0,4.92,4.96,97.6128
1,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,3.0,4.72,4.74,67.1184
9,Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lor...,2.32,4.9,4.96,56.38528
16,Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora...,2.36,4.94,4.8,55.96032
8,Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lor...,2.34,4.9,4.82,55.26612
0,Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lor...,2.32,4.9,4.82,54.79376
20,Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lor...,2.22,4.88,4.86,52.651296
2,Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora...,2.34,4.76,4.72,52.573248
14,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,2.48,4.34,4.76,51.232832


In [5]:
# show only systems using: (1) QA lora adapter (2) non quantized
df_merged[
    df_merged["System"].str.contains("a32-lora") &
    ~df_merged["System"].str.contains("quantized") &
    df_merged["System"].str.contains("-8B-")].sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
2,Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora...,2.34,4.76,4.72,52.573248
8,Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lor...,2.34,4.9,4.82,55.26612
0,Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lor...,2.32,4.9,4.82,54.79376
9,Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lor...,2.32,4.9,4.96,56.38528
4,Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora...,2.2,4.82,4.8,50.8992


In [6]:
# show only systems using: (1) continuous lora adapter (2) non quantized (3) 8B
df_merged[
    df_merged["System"].str.contains("a32-continuous") &
    df_merged["System"].str.contains("-8B-") &
    ~df_merged["System"].str.contains("RAG")].sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
14,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,2.48,4.34,4.76,51.232832
25,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,2.3,4.48,4.62,47.60448
24,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,2.28,4.36,4.54,45.131232
13,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,2.22,3.82,4.72,40.027488


In [81]:
# show only systems using: (1) combo adapter (2) 8B
df_merged[
    df_merged["System"].str.contains("a32-combo") &
    df_merged["System"].str.contains("-8B-")].sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
6,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-com...,2.2,4.84,4.72,50.25856
19,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-com...,2.14,4.76,4.8,48.89472


In [82]:
# show only systems using: (1) stacked lora adapter (2) 8B
df_merged[
    df_merged["System"].str.contains("a32-merged") &
    df_merged["System"].str.contains("-8B-") &
    ~df_merged["System"].str.contains("RAG")].sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
7,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-mer...,2.0,4.78,4.56,43.5936
5,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-mer...,1.98,4.78,4.58,43.346952


In [83]:
# show only systems using: (1) RAG (2) 8B
df_merged[
    df_merged["System"].str.contains("-8B-") &
    df_merged["System"].str.contains("RAG")].sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
22,Llama-3.1-8B-Instruct-RAG-base-quantized,4.04,4.9,4.96,98.18816
10,Llama-3.1-8B-Instruct-RAG-base,4.0,4.92,4.96,97.6128
1,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-con...,3.0,4.72,4.74,67.1184
18,Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-mer...,2.6,4.74,4.0,49.296


## Create main summary

In [8]:
# which systems to retain in the summary and the mapping to their short names

system_mappings = {
    # base systems
    "Llama-3.2-1B-Instruct-base": "base-1B-full",
    "Llama-3.1-8B-Instruct-base": "base-8B-full",
    # qa systems
    "Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model": "lora-qa-full",
    # cont systems
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu-checkpoint-455": "lora_cont-full",
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu-checkpoint-450": "lora_cont_head-full",
    # combo systems
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-combo-no_head-lora-cthulhu-checkpoint-518": "lora_combo-full",
    # stacked systems
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_qa-adapter_cont-lora-cthulhu-checkpoint-last": "lora_stack-full",
    # rag systems
    "Llama-3.1-8B-Instruct-RAG-base": "base-full-rag",
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu-RAG-checkpoint-450": "lora_cont-full-rag",
    "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-merged_qa-adapter_cont-lora-cthulhu-RAG-checkpoint-last-quantized":"lora_stack-quan-rag"
}

df_merged[df_merged["System"].isin(system_mappings.keys())] \
         .set_index("System") \
         .rename(index=system_mappings) \
         .reset_index() \
         .sort_values(by="Correctness", ascending=False)

Unnamed: 0,System,Correctness,Topicality,Fluency,Total
5,base-full-rag,4.0,4.92,4.96,97.6128
0,lora_cont-full-rag,3.0,4.72,4.74,67.1184
9,lora_stack-quan-rag,2.6,4.74,4.0,49.296
8,lora_cont-full,2.48,4.34,4.76,51.232832
4,lora-qa-full,2.34,4.9,4.82,55.26612
7,lora_cont_head-full,2.22,3.82,4.72,40.027488
2,lora_combo-full,2.2,4.84,4.72,50.25856
3,lora_stack-full,2.0,4.78,4.56,43.5936
6,base-8B-full,1.44,2.14,4.96,15.284736
1,base-1B-full,1.4,2.02,4.84,13.68752


## Plot

In [None]:
from matplotlib import pyplot as plt

fig, axs = plt.subplots(figsize=(20,8))
box = df_cor.plot.box(
    showmeans=True,
    patch_artist=True,
    boxprops=dict(facecolor='bisque'),
    medianprops=dict(linewidth=3, color='green'),
    meanprops=dict(marker='D', color='green'),
    ax=axs
)
axs.tick_params(labelrotation=90)

In [16]:
# computing statistical significance using Wilconxon test

import scipy.stats as stats

base_rag_corr = df_cor["Llama-3.1-8B-Instruct-RAG-base - Correctness"].to_list()
lora_nohead_corr = df_cor["Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu-checkpoint-455 - Correctness"]
lora_head_corr = df_cor["Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu-checkpoint-450 - Correctness"]
#lora_nohead_corr = df_cor["Llama-3.2-1B-Instruct-base - Correctness"]

stats.wilcoxon(lora_head_corr, lora_nohead_corr)

WilcoxonResult(statistic=np.float64(61.0), pvalue=np.float64(0.08858881295542467))

In [15]:

lora_nohead_topi = df_top["Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-no_head-lora-cthulhu-checkpoint-455 - Topicality"]
lora_head_topi = df_top["Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu-checkpoint-450 - Topicality"]
lora_qa_topi = df_top["Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu-best-model - Topicality"]

stats.wilcoxon(lora_head_topi, lora_nohead_topi)

WilcoxonResult(statistic=np.float64(36.0), pvalue=np.float64(0.008487503337192))

# Generate single example

In [None]:
messages = [
    #{"role": "user", "content": f"What happens to an investigator at the age of 40?"},
    #{"role": "user", "content": f"What should you do if a roll doesn't mesh with your character concept?"},
    #{"role": "user", "content": f"What is the task in the Personal Description section?"},
    #{"role": "user", "content": f"What is the name of the wife mentioned in the text?"},
    {"role": "user", "content": f"What is the chance of success if Harvey makes two successive rolls against two different skills?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to(device)

print("*"*20)
print("Tokenized input:")
print(tokenizer.decode(inputs[0]).strip())
print("*"*20)
print("\nAnswer:")

_ = peft_model.generate(
    input_ids=inputs,
    pad_token_id=tokenizer.pad_token_id,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
    max_new_tokens = 256,
    temperature = 0.1
    )



# Utils to check GPU status

In [None]:
## free the memory again
import torch
import gc

del model
del peft_model
torch.cuda.empty_cache()
gc.collect()

In [None]:
!  nvidia-smi

In [None]:
! fuser -v /dev/nvidia*

In [None]:
! kill -9 927

# Old code (deprecated)

## Run automated evaluation

   - https://huggingface.co/spaces/evaluate-metric/bleu
   - https://huggingface.co/prometheus-eval/prometheus-13b-v1.0

Relevant APIs for automated metrics:

  - https://github.com/huggingface/evaluate/tree/main/metrics/
  - https://huggingface.co/spaces/evaluate-metric/bleu

In [None]:
import evaluate
from datasets import load_dataset


%cd /content/cthulhu_fine_tuning

gt_dataset_file = "evaluation/evaluation-dataset/cthulhu_eval_dataset.csv"
predictios_dir = "/content/cthulhu_fine_tuning/evaluation/evaluation-results/*csv"
metric_names = ['meteor', 'bleu', 'rouge']

# load metrics
metrics = []
for metric_name in metric_names:
  metrics.append(evaluate.load(metric_name))

# load ground truth
gt_dataset = load_dataset("csv", data_files=dataset_file, split="train")


results = {metric_name:[] for metric_name in metric_names}
results["system"] = []

# for each system to evaluate
for pred_file in glob.glob(predictios_dir):
  pred_dataset = load_dataset("csv", data_files=pred_file, split="train")
  system_name = pred_file.split("/")[-1]
  results["system"].append(system_name)
  # for each evaluation metric
  for metric in metrics:
    # compute metric
    score = metric.compute(predictions=pred_dataset['generated_answer'], references=gt_dataset['answer'])
    results[metric.name].append(score)

print(results)


## Other code

In [None]:
# FUNCTIONS TO GENERATE ANSWERS FROM QUESTIONS IN THE DATASET, AND PUSH RESULTS TO GIT

DEFAULT_SYSTEM_PROMPT = "You are a helpful AI assistant."

from dataclasses import dataclass

@dataclass
class Adapter:
  name: str
  subfolder: str

  @property
  def shortname(self):
    return self.name.replace(".", "_")


def generate_text(target_model, example, temperature=0.1, greedy=False, system_prompt=None):
  """
  Generates the answer based on the question of a given example using a language model.

  Args:
    example: A dictionary containing the input data, expected to have a "question" key.
    temperature: The temperature for text generation (default is 0.1).
    greedy: Whether to use greedy decoding (default is False).
    system_prompt: A system prompt to use for the model (default is "You are a helpful AI assistant.").

  Returns:
    The generated answer as a string.
  """
  # see: https://huggingface.co/docs/datasets/en/process#multiprocessing

  if not system_prompt:
    system_prompt = DEFAULT_SYSTEM_PROMPT

  # format to chatml
  chat = [
      {"role": "system", "content": f"{system_prompt}"},
      {"role": "user", "content": f"{example['question']}"}
  ]
  chatml = tokenizer.apply_chat_template(
      chat,
      tokenize = False,
      add_generation_prompt = True
  )

  # tokenize
  inputs = tokenizer(chatml, return_tensors="pt").to(device)
  prompt_length = inputs['input_ids'].shape[1]

  # generate tokens
  with torch.no_grad():
    if greedy:
      # greedy generation
      output_ids = target_model.generate(**inputs, max_new_tokens = 256, do_sample = False, pad_token_id=tokenizer.pad_token_id)
    else:
      # multinomial sampling generation
      output_ids = target_model.generate(**inputs, max_new_tokens = 256, temperature = temperature, pad_token_id=tokenizer.pad_token_id)

  # decode generated tokens into text
  output_text = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True) # remove input prompt from output (https://discuss.huggingface.co/t/generate-returns-full-prompt-plus-answer/70453)

  return output_text


def generate_texts(target_model, dataset, num_examples=3, temperature=0.1, greedy=False):
  results = []
  ct = 1
  for example in dataset:
    results.append(generate_text(target_model, example, temperature=temperature, greedy=greedy)) # let's generate using only a bit of creativity (temp=0.1)
    if ct == num_examples:
      break
    ct += 1
  return results


def evaluate_model_on_dataset(target_model, dataset, temperature=0.1, greedy=False, system_prompt=None):
  """
  Generates the answer for the question in each example in the dataset using the provided model.

  Args:
    dataset: A Hugging Face Dataset object, containing the questions in the field "question".
    model: A Hugging Face model for text generation.
    temperature: The temperature for text generation (default is 0.1).
    greedy: Whether to use greedy decoding (default is False).
    system_prompt: A system prompt to use for the model (default is "You are a helpful AI assistant.").

  Returns:
    A Hugging Face Dataset object with an added "generated_answer" column.
  """
  generated_texts = []

  for ct, example in enumerate(dataset):
    print(f"Generating answer {ct}/{len(dataset)}")
    generated_text = generate_text(target_model, example, temperature=temperature, greedy=greedy, system_prompt=system_prompt)
    generated_texts.append(generated_text)

  return dataset.add_column("generated_answer", generated_texts)


def push_results_to_git(dataset_with_answers, adapter: Adapter, system_prompt=None):
  """
  Pushes evaluation results to the git repository.

  Args:
    dataset_with_answers: A Hugging Face Dataset object containing the generated answers.
    model_name: The name of the model used for generation.
  """

  output_file_name = adapter.name + "-" + adapter.subfolder

  if not system_prompt:
    system_prompt = DEFAULT_SYSTEM_PROMPT

  if system_prompt != DEFAULT_SYSTEM_PROMPT:
    output_file_name = output_file_name + "_prompt_" + str(hash(system_prompt)% 10**6)


  output_file = f"evaluation/evaluation-results-quantized/res_{output_file_name}.csv"
  dataset_with_answers.add_column("system_prompt", [system_prompt for _ in range(len(dataset_with_answers))]) \
    .to_csv(output_file)

  ! git config --global user.email "marco.pennacchiotti@gmail.com"
  ! git add {output_file}
  ! git commit -m "added generated answers"
  ! git push origin main


In [None]:
# LOAD BASE MODEL FROM HF HUB
#https://discuss.huggingface.co/t/model-merging-leads-to-different-output/103986/3

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import AutoPeftModelForCausalLM, PeftModel

namespace="mpenna77"
base_model_name =  "meta-llama/Llama-3.1-8B-Instruct" # "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # weights stored in 4 bits (saves memory)
    bnb_4bit_quant_type="nf4",             # format used for storing in 4 bits
    bnb_4bit_use_double_quant=True,        # double quantize (saves memory)
    bnb_4bit_compute_dtype=torch.bfloat16, # perform operations in 16bit instead of 32bit (speed up fine-tuning)
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(base_model_name, quantization_config=bnb_config).eval().to(device)

In [None]:
# LOAD PEFT MODEL

#peft_model_name = "Llama-3.1-8B-Instruct-lr0.0001-b32-r64-a32-continuous-lora-cthulhu"
#peft_model_checkpoint = "checkpoint-450"

#peft_model_name = "Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu"
#peft_model_checkpoint = "checkpoint-2127"

#peft_model_name = "Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu"
#peft_model_checkpoint = "best-model"

#peft_model_name = "Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu"
#peft_model_checkpoint = "checkpoint-1420"

#peft_model_name = "Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu"
#peft_model_checkpoint = "best-model"

#peft_model_name = "Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu"
#peft_model_checkpoint = "best-model"

peft_model_name = "Llama-3.1-8B-Instruct-lr0.0001-b16-r64-a32-lora-cthulhu"
peft_model_checkpoint = "checkpoint-2127"

peft_model = PeftModel.from_pretrained(
    model,
    f"{namespace}/{peft_model_name}",
    subfolder=peft_model_checkpoint).eval().to(device)

#peft_model = AutoPeftModelForCausalLM.from_pretrained(f"{namespace}/{model_name}").eval().to(device)

In [None]:
# GENERATE ANSWERS FROM DATASET, PUSH RESULTS TO GIT

system_prompt = None

# let's generate using only a bit of creativity (temp=0.1)
dataset_with_answers = evaluate_model_on_dataset(dataset, temperature=0.1, greedy=False, system_prompt=system_prompt)
push_results_to_git(dataset_with_answers, system_prompt=system_prompt)



In [None]:
# Load models

"""
model_info = (
        "Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu-checkpoint-2127",
        "meta-llama/Llama-3.1-8B-Instruct",
        "/content/cthulhu_fine_tuning/fine-tuning/models/Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu/checkpoint-2127"
        )

model_info = (
        "Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu",
        "meta-llama/Llama-3.2-1B-Instruct",
        "/content/cthulhu_fine_tuning/fine-tuning/models/Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu/best-model"
        )

model_info = (
        "Llama-3.2-1B-Instruct-lr0.0001-b64-r16-a32-lora-cthulhu-best-model",
        "meta-llama/Llama-3.1-8B-Instruct",
        "/content/cthulhu_fine_tuning/fine-tuning/models/Llama-3.1-8B-Instruct-lr0.0001-b16-r16-a32-lora-cthulhu/best-model"
        )
"""
model_info =     (
        "Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu-checkpoint-1420",
        "meta-llama/Llama-3.1-8B-Instruct",
        "/content/cthulhu_fine_tuning/fine-tuning/models/Llama-3.1-8B-Instruct-lr3e-05-b32-r16-a32-lora-cthulhu/checkpoint-1420"
    )


In [None]:
# USE THIS IF MODEL WAS SAVED IN GITHUB

# Load model:
# 0. Load tokenizer
# 1. Load the base model
# 2. Load PEFT config
# 3. Load LoRA PEFT model on top of the base model
# 4. Move to GPU

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from transformers import BitsAndBytesConfig
from peft import PeftConfig, PeftModelForCausalLM, PeftModel

model_name = model_info[1]
model_dir = model_info[2]
device = "cuda" if torch.cuda.is_available() else "cpu"


# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left", use_fast=True)

# Load model
#   see:
#   https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora
#   https://huggingface.co/docs/peft/tutorial/peft_model_config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # weights stored in 4 bits (saves memory)
    bnb_4bit_quant_type="nf4",             # format used for storing in 4 bits
    bnb_4bit_use_double_quant=True,        # double quantize (saves memory)
    bnb_4bit_compute_dtype=torch.bfloat16, # perform operations in 16bit instead of 32bit (speed up fine-tuning)
)

# loads model in eval() mode. This means that stuff like batch norm and dropout are not applied (like it would be during training).
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config).eval().to(device)
peft_config = PeftConfig.from_pretrained(model_dir)
peft_model = PeftModel.from_pretrained(model, model_dir)

tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
# this code is used to evaluate different generation temperatures for non-greedy
# generation. For each temperature, run generation 3 times
# Also greedy generation is then run

max = 8
examples = [[] for x in range(max+1)]

for temperature in [0.1, 1.0]:
  print(f"Temperature: {temperature}")
  for run in range(3):
    print(f"  Run: {run}")
    for ct, example in enumerate(dataset):
        print(f"    Example: {ct}")
        examples[ct].append((generate_text(example, temperature=temperature), temperature, ct, example["question"]))
        if ct == max:
          break
print(f"greedy")
for run in range(3):
    print(f"  Run: {run}")
    for ct, example in enumerate(dataset):
        print(f"    Example: {ct}")
        examples[ct].append((generate_text(example, greedy=True), "greedy", ct, example["question"]))
        if ct == max:
          break

for lista in examples:
  for example in lista:
        print(example)
  print("*"*20)

In [None]:
# THE FOLLOWING DOES NOT WORK, CAUSE OOM ON MAIN MEM

#dataset_with_answers = dataset.map(
#    generate_text,
#    fn_kwargs={"model": model, "tokenizer": tokenizer},
    #cache_file_name='temp', # see: https://discuss.huggingface.co/t/how-to-load-this-simple-audio-data-set-and-use-dataset-map-without-memory-issues/17722/7
#)

In [None]:
# LOAD DATASET AS CSV
import csv

%cd /content/cthulhu_fine_tuning

with open("./eval-data/cthulhu_eval_dataset.csv") as f:
  reader = csv.DictReader(f)
  eval_dataset = [row for row in reader]

print(f"Dataset size: {len(eval_dataset)}")

In [None]:

def format_to_chatml(example):
     return {
          "messages": [
            {"role": "system", "content": f"You are a helpful AI assistant."},
            {"role": "user", "content": f"{example['question']}"},
          ],
     }

def tokenize(example, tokenizer):
    input_ids = tokenizer.apply_chat_template(
        example["messages"],
        tokenize = True,
        padding = True,
        truncation = True,
        max_length = 256,
        add_generation_prompt = True,
        return_tensors = "pt"
    )
    return {"input_ids": input_ids}



tokenizer.add_special_tokens({"pad_token": "<|finetune_right_pad_id|>"})
model.config.pad_token_id = tokenizer.pad_token_id
tokenizer.padding_side = 'left'


dataset = dataset.map(
    format_to_chatml,
    remove_columns=dataset.column_names
)


dataset_tokenized = dataset.map(
    tokenize,
    #batched=True,
    remove_columns=["messages"],
    fn_kwargs={"tokenizer": tokenizer}
)


print(dataset[0])
print(dataset_tokenized[0])


In [None]:
type(dataset_tokenized[0]["input_ids"][0])

In [None]:
print(dataset[0])
tokenized_with_template = tokenize(dataset[0], tokenizer)["input_ids"].to(device)
print(tokenized_with_template)
print(type(tokenized_with_template))

tokenized = tokenizer(str(dataset[0]), return_tensors="pt").to(device)
print(tokenized)
print(type(tokenized))

In [None]:


for model, base_model, model_dir in model_infos:
  tokenizer = AutoTokenizer.from_pretrained(base_model)
  model = AutoModelForCausalLM.from_pretrained(model_dir).to(device)

  # prepare eval dataset:

  for example in eval_dataset:
    chat = [
      {"role": "system", "content": f"You are a helpful AI assistant."},
      {"role": "user", "content": f"{example['question']}"},
    ]
  inputs = tokenizer.apply_chat_template(chat, tokenize = True, add_generation_prompt = True)





In [None]:

from transformers import AutoTokenizer
from pathlib import Path

print(MODELS_DIR[2][1])
tokenizer = AutoTokenizer.from_pretrained(MODELS_DIR[2][1])

In [None]:
eval_dataset[0]

In [None]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": f"{eval_dataset[5]['question']}"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to(device)

print("*"*20)
print("Tokenized input:")
print(tokenizer.decode(inputs[0]).strip())
print("*"*20)
print("\nAnswer:")

_ = model.generate(
    input_ids=inputs,
    pad_token_id=tokenizer.pad_token_id,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
    max_new_tokens = 256,
    temperature = 0.1
    )