<a href="https://colab.research.google.com/github/dimitrod/ehu_nlp_dimathina/blob/clean_branch/evaluate_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Evaluation
This notebook evaluates any model developed during the project against the evaluation data.

By default, the evaluation data is the entire validation split (7993 questions) of the TriviaQA dataset.
To use a different split or a different number of questions for your tests, run the *load_evaluation_data.py* script in the utils package. A guide for this is in the section [Changing the Evaluation Data](#scrollTo=H294GP72zhM9).





---



## Setup
To import the model into the evaluation environment, enter its parameters into the script below. Use the following table to find the right entries for each model:


|Model|Context|MODEL_NAME|DATABASE|PARAMS|Note|
|----------|----------|----------|----------|----------|----------|
|Tiny Llama|-|tiny_llama_no_retriever|-|-|**very slow**|
|Tiny Llama|Whole documents|tiny_llama_dense|**external**|-|**very slow**|
|Mistral Instruct|-|mistral_instruct_no_retriever|-|-|**Huggingface Token and GPU required**|
|Mistral Instruct|Whole documents|mistral_instruct_dense|**external**|-|**Huggingface Token and GPU required**|
|Mistral Instruct|Text fragments|mistral_instruct_hybrid|sparse-dense|k, c, o|**Huggingface Token and GPU required**|
|Bert Base|QA Pairs|bert_base_qa_embeddings|**directly imported**|k|-|
|Bert Base|Whole documents|bert_base_dense|**external**|-|-|
|Bert Base|Text fragments|bert_base_sparse|sparse|k|-|
|Bert Finetuned|Whole documents|bert_finetuned_dense|**external**|-|-|
|Chat GPT 4o|-|chat_gpt_no_retriever|-|t|**Not free to use**|
|Chat GPT 4o|Text fragments|chat_gpt_hybrid|sparse-dense|k, c, o, t|**Not free to use**|

The meaning of each parameter is given in the following table:

|Parameter Name|Description|
|----------|----------|
|k|Number of contexts the retriever sends to the reader|
|c|Chunk size of each context|
|o|Overlap between the contexts|
|t|Temperature of the reader model|

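To make the roles of `c` and `o` more concrete, here is a minimal sketch of overlapping chunking. It assumes that chunk size and overlap are measured in characters, which this notebook does not state (the project may count tokens or words instead). The function `chunk_text` is a hypothetical name and not part of the project code.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share `overlap` characters.
    Illustrative only: the unit (characters) is an assumption.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    # Each new chunk starts (chunk_size - overlap) characters after the previous one
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```
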
If the model uses an external database, a directly imported database, or no database, enter an empty string ("") for the DATABASE variable in the script.

If the model doesn't have any parameters, enter "-" for the PARAMS variable. If the model requires several parameters, separate them with blank spaces.
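
As a rough illustration of these conventions, the sketch below turns a `PARAMS` string into a list of values. The helper `parse_params` is hypothetical and not part of the evaluation package, which may parse the string differently; it only mirrors the rules stated above.

```python
def parse_params(raw: str) -> list[str]:
    """Split a PARAMS string into individual parameter values.

    Hypothetical helper for illustration only.
    """
    raw = raw.strip()
    # "-" means the model takes no parameters
    if raw == "-":
        return []
    # Several parameters are separated by blank spaces
    return raw.split()


print(parse_params("-"))         # no parameters
print(parse_params("5"))         # a single parameter, e.g. k
print(parse_params("5 512 64"))  # e.g. k, c, o (values are made up)
```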

In [None]:
import os
import importlib
from google.colab import files

os.environ["MODEL_NAME"] = "bert_base_sparse"
os.environ["DATABASE"] = "sparse"
os.environ["PARAMS"] = "5"
os.environ["SPLIT_SIZE"] = "7993"

Execute the following script to set up the evaluation environment:

In [None]:
import os
import shutil

# Read the configuration from the environment variables set above
directory = os.environ["MODEL_NAME"]
database = os.environ["DATABASE"]

# Install Git LFS
!sudo apt-get install git-lfs -y
!git lfs install

# Clone the repositories
!git clone https://github.com/mandarjoshi90/triviaqa.git
!git clone --branch clean_branch https://github.com/dimitrod/ehu_nlp_dimathina.git
%cd ehu_nlp_dimathina

# Fetch and checkout requirements
mod_path = f"models/{directory}/*"
!git lfs fetch --include="{mod_path}, evaluation"
!git lfs checkout
%cd ..

# Move the model and the evaluation package to the current directory
shutil.move(f"ehu_nlp_dimathina/models/{directory}", ".")
shutil.move("ehu_nlp_dimathina/evaluation", ".")

# Create the results directory in the evaluation package (no error if it already exists)
!mkdir -p evaluation/results

# Handle the optional database
if database:
    %cd ehu_nlp_dimathina
    db_path = f"databases/{database}/*"
    !git lfs fetch --include="{db_path}"
    !git lfs checkout
    %cd ..
    shutil.move(f"ehu_nlp_dimathina/databases/{database}", f"{directory}/database")


# Install model-specific requirements
!pip install -r {directory}/requirements.txt
!pip install -r evaluation/requirements.txt
!pip install -r triviaqa/requirements.txt

# Cleanup (sample_data only exists on a fresh Colab runtime, so ignore it if missing)
shutil.rmtree("ehu_nlp_dimathina")
shutil.rmtree("sample_data", ignore_errors=True)


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Git LFS initialized.
Cloning into 'triviaqa'...
remote: Enumerating objects: 70, done.
remote: Total 70 (delta 0), reused 0 (delta 0), pack-reused 70 (from 1)
Receiving objects: 100% (70/70), 20.60 KiB | 843.00 KiB/s, done.
Resolving deltas: 100% (28/28), done.
Cloning into 'ehu_nlp_dimathina'...
remote: Enumerating objects: 1922, done.
remote: Counting objects: 100% (513/513), done.
remote: Compressing objects: 100% (416/416), done.
remote: Total 1922 (delta 275), reused 134 (delta 97), pack-reused 1409 (from 1)
Receiving objects: 100% (1922/1922), 30.10 MiB | 32.48 MiB/s, done.
Resolving deltas: 100% (1169/1169), done.
/content/ehu_nlp_dimathina
fetch: Fetching reference refs/heads/clean_branch
Skipped checkout for "databases/sparse-dense/document_lib



---



## Changing the Evaluation Data

**If you don't want to change the split of the evaluation dataset, skip this step.**

This step is only required if you want to evaluate with a different dataset. You can change the evaluation dataset with the following command:

`!python3 -m evaluation.utils.load_evaluation_data --split {split} --split_size {split_size}`

The split size must not exceed the number of questions in the chosen split (7993 for the validation split).
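
One way to guard against an oversized request is to check the size before building the command. This is a minimal sketch: `build_load_command` is a hypothetical helper, the split name `validation` is an assumption about the value the script accepts, and the maximum size has to be supplied by the caller (7993 for the validation split, as stated at the top of this notebook).

```python
def build_load_command(split: str, split_size: int, max_size: int) -> str:
    """Return the load_evaluation_data command, or raise if the size is out of range.

    Hypothetical helper; max_size is the number of questions in the chosen split.
    """
    if not 1 <= split_size <= max_size:
        raise ValueError(
            f"split_size must be between 1 and {max_size}, got {split_size}"
        )
    return (
        "python3 -m evaluation.utils.load_evaluation_data "
        f"--split {split} --split_size {split_size}"
    )


# "validation" is an assumed split name; 7993 is the size given in this notebook
print(build_load_command("validation", 1000, max_size=7993))
```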

In [None]:
!python3 -m evaluation.utils.load_evaluation_data --split "train" --split_size 7900

Resolving data files: 100% 26/26 [00:00<00:00, 145.27it/s]
Loading dataset: 100% 7993/7993 [00:01<00:00, 4036.49it/s]
Replacing field names: 100% 19/19 [00:00<00:00, 83.74it/s]
Saving file...




---



## Executing the Evaluation Chain
Now you can run the evaluation chain with the model of your choice. To start it, use the following command:

`!python3 -m evaluation.evaluation_chain --model_name $MODEL_NAME --model_params $PARAMS --split_size $SPLIT_SIZE`

In [None]:
!python3 -m evaluation.evaluation_chain --model_name $MODEL_NAME --model_params $PARAMS --split_size $SPLIT_SIZE

Loading Questions: 100% 7993/7993 [00:00<00:00, 220735.40it/s]
Loading model...
2024-12-19 20:51:50.102172: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-12-19 20:51:50.118045: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-19 20:51:50.139078: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-19 20:51:50.145385: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register facto

If you want to download the results of your evaluation automatically, execute the following script while the evaluation is still running. Colab queues the cell and runs it once the evaluation has finished.




In [None]:
files.download(f'evaluation/results/{os.environ["MODEL_NAME"]}_split_size={os.environ["SPLIT_SIZE"]}_results.txt')
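
The file name in the download cell above follows a fixed pattern built from `MODEL_NAME` and `SPLIT_SIZE`. The sketch below reproduces that pattern in a small helper and checks whether the file exists, which can help when the download fails. `results_path` is a hypothetical name, not a function of the evaluation package, and the fallback values are simply the defaults from the Setup cell.

```python
import os


def results_path(model_name: str, split_size: str) -> str:
    """Build the results file path with the same pattern as the download cell."""
    return f"evaluation/results/{model_name}_split_size={split_size}_results.txt"


# Fall back to the Setup defaults if the environment variables are not set
path = results_path(
    os.environ.get("MODEL_NAME", "bert_base_sparse"),
    os.environ.get("SPLIT_SIZE", "7993"),
)

if os.path.exists(path):
    print(f"Results ready: {path}")
else:
    print(f"No results yet at {path}")
```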