<a href="https://colab.research.google.com/github/dimitrod/ehu_nlp_dimathina/blob/main/Guides/evaluate_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Evaluation
This notebook lets you evaluate any model developed during the project against the evaluation data.

By default, the evaluation data is the entire validation split (7993 questions) of the TriviaQA dataset.
To use a different split or a different default size for your tests, run the *load_evaluation_data.py* script in the utils package. A guide for this is in the section [Changing the Evaluation Data](#scrollTo=H294GP72zhM9).





---



## Setup
To import a model into the evaluation environment, enter its parameters in the cell below. Use the following table to find the right entries for each model:



|Model|Context|Embedding|BRANCH|DIRECTORY|MODULE|CLASS|PARAMS|Note|
|----------|----------|----------|----------|----------|----------|----------|----------|----------|
|Dummy Model|-|-|main|Dummy_Model|dummy_model|dummy_model|2 test params|-|
|Tiny Llama|-|-|tiny_llama|TinyLlama|tiny_llama|tinyllama|-|**very slow**|
|Tiny Llama|Whole documents|Dense|tiny_llama_rag|TinyLlamaRAG|tiny_llama_rag|tiny_llama_rag|||
|Mistral Instruct|-|-|mistral-instruct-no-rag|MistralInstruct|mistral_instruct|mistral_instruct||**Huggingface Token and GPU required**|
|Mistral Instruct|Whole documents|Dense|mistral-instruct|MistralInstruct|mistral_instruct|mistral_instruct|||
|Bert Base|QA Pairs|Dense|rag_qa_embeddings|RAG_QA_Embeddings|rag_qa_embeddings|rag_qa_embeddings|k|-|
|Bert Base|Whole documents|Dense|bert_base_rag|Bert_Base|bert_base|bert_base|||
|Bert Base|Text Fragments|Sparse|bert_base_sparse_embeddings|RAG_Sparse_Embeddings|rag_sparse_embeddings|rag_sparse_embeddings|||
|Bert Finetuned|Whole documents|Dense|bert_finetuned_rag|Bert_Finetuned|bert_finetuned|bert_finetuned|||
|Mistral Instruct|Text Fragments|Sparse/Dense|rag_sparse_dense_embeddings|RAG_Sparse_Dense_Embeddings|rag_sparse_dense_embeddings|rag_sparse_dense_embeddings|k, c, o, t, mt||

The meaning of each parameter is given in the following table:

|Parameter Name|Description|
|----------|----------|
|k|Number of contexts the retriever sends to the reader|
|c|Chunk size of each context|
|o|Overlap between the contexts|
|t|Temperature of the reader model|
|mt|Maximum number of tokens generated by the reader model|

If the model requires several parameters, separate them with a single space.
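As a minimal sketch of the space-separated format, the snippet below builds the `PARAMS` string from individual values. The helper name `build_params` is hypothetical and not part of the project. The order of the values is assumed to follow the PARAMS column of the model table (here `k, c, o, t, mt`).

```python
import os


def build_params(*values):
    """Join parameter values into one space-separated string.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    return " ".join(str(value) for value in values)


# Assumed order for rag_sparse_dense_embeddings: k, c, o, t, mt
params = build_params(10, 300, 0, 0.5, 30)
os.environ["PARAMS"] = params
print(params)  # 10 300 0 0.5 30
```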

In [1]:
import os

os.environ["BRANCH"] = "rag_sparse_dense_embeddings"
os.environ["DIRECTORY"] = "RAG_Sparse_Dense_Embeddings"
os.environ["MODULE"] = "rag_sparse_dense_embeddings"
os.environ["CLASS"] = "rag_sparse_dense_embeddings"
os.environ["PARAMS"] = "10 300 0 0.5 30"

Execute the following cell to set up the evaluation environment:

In [2]:
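# Python copy of the model directory, used for the {directory} expansion below
directory = os.environ["DIRECTORY"]

# Clone the project without checking out files, then fetch only the Evaluation package from main
!git clone --no-checkout https://github.com/dimitrod/ehu_nlp_dimathina.git
%cd ehu_nlp_dimathina/
!git sparse-checkout init --cone
!git sparse-checkout set Evaluation/
!git checkout main
%mv Evaluation/ ..
# Fetch only the model directory from the model's branch
!git sparse-checkout set $DIRECTORY
!git checkout $BRANCH
%mv $DIRECTORY ..
%cd ..
# Clone the TriviaQA repository and remove the folders that are no longer needed
!git clone https://github.com/mandarjoshi90/triviaqa.git
%rm -r ehu_nlp_dimathina
%rm -r sample_data
# Install the requirements of TriviaQA, the Evaluation package, and the model
%cd triviaqa/
!pip install -r requirements.txt
%cd ../Evaluation
!pip install -r requirements.txt
%cd ../{directory}
!pip install -r requirements.txt
%cd ..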

Cloning into 'ehu_nlp_dimathina'...
remote: Enumerating objects: 648, done.
remote: Counting objects: 100% (183/183), done.
remote: Compressing objects: 100% (121/121), done.
remote: Total 648 (delta 110), reused 95 (delta 59), pack-reused 465 (from 1)
Receiving objects: 100% (648/648), 27.99 MiB | 9.13 MiB/s, done.
Resolving deltas: 100% (345/345), done.
/content/ehu_nlp_dimathina
Branch 'main' set up to track remote branch 'main' from 'origin'.
Switched to a new branch 'main'
Downloading RAG_Sparse_Dense_Embeddings/database/document_library.pkl (149 MB)
Error downloading object: RAG_Sparse_Dense_Embeddings/database/document_library.pkl (6b3fb88): Smudge error: Error downloading RAG_Sparse_Dense_Embeddings/database/document_library.pkl (6b3fb883ce56e22c1ec1d009a2381aea7cf95e1ebe5b42f8d45bf0652638dd26): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Errors logged to /content/e



---



## Changing the Evaluation Data

**If you don't want to change the split of the evaluation dataset, skip this step.**

This step is only required if you want to evaluate with a different dataset. You can change the evaluation dataset with the following command:

`!python3 -m Evaluation.utils.load_evaluation_data --split {split} --split_size {split_size}`

Keep the maximum split sizes in mind: 61888 questions for train, 7993 for validation, and 7701 for test.
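A small check like the following can catch an oversized request before the script runs. This is only a sketch: the function `check_split_size` is hypothetical, and the limits are taken from the dataset loader output shown in this guide, so they may differ for other dataset versions.

```python
# Split sizes as reported by the dataset loader output in this guide (assumed limits).
MAX_SPLIT_SIZES = {"train": 61888, "validation": 7993, "test": 7701}


def check_split_size(split, split_size):
    """Return split_size if it fits the chosen split, otherwise raise ValueError.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    if split not in MAX_SPLIT_SIZES:
        raise ValueError(f"Unknown split: {split!r}")
    maximum = MAX_SPLIT_SIZES[split]
    if not 1 <= split_size <= maximum:
        raise ValueError(f"split_size for {split!r} must be between 1 and {maximum}")
    return split_size


print(check_split_size("train", 7900))  # 7900
```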

In [None]:
!python3 -m Evaluation.utils.load_evaluation_data --split "train" --split_size 7900

README.md: 100% 26.7k/26.7k [00:00<00:00, 44.1MB/s]
Resolving data files: 100% 26/26 [00:04<00:00,  5.25it/s]
train-00000-of-00001.parquet: 100% 25.0M/25.0M [00:00<00:00, 68.8MB/s]
validation-00000-of-00001.parquet: 100% 3.31M/3.31M [00:00<00:00, 63.4MB/s]
test-00000-of-00001.parquet: 100% 542k/542k [00:00<00:00, 219MB/s]
Generating train split: 100% 61888/61888 [00:00<00:00, 120788.64 examples/s]
Generating validation split: 100% 7993/7993 [00:00<00:00, 121794.20 examples/s]
Generating test split: 100% 7701/7701 [00:00<00:00, 250788.73 examples/s]
Loading dataset: 100% 7900/7900 [00:01<00:00, 3985.31it/s]
Replacing field names: 100% 19/19 [00:00<00:00, 59.48it/s]
Saving file...




---



## Executing the Evaluation Chain
Now you can run the evaluation chain with the model of your choice. To start the evaluation chain, use the following command:

`!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS --model_params $PARAMS`


*   If the model doesn't require any parameters, remove `--model_params $PARAMS` from the command.
*   If you want to evaluate the model on a smaller split, add `--split_size {split_size}` to the command.
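The two rules above can be sketched in Python as follows. The function `build_eval_command` is a hypothetical helper, not part of the project. It only assembles the argument list from the flags named in the command above, omitting `--model_params` when no parameters are given and adding `--split_size` only on request.

```python
import shlex


def build_eval_command(directory, module, model_class, params="", split_size=None):
    """Assemble the evaluation chain command as a list of arguments.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    command = [
        "python3", "-m", "Evaluation.evaluation_chain",
        "--model_directory", directory,
        "--model_module", module,
        "--model_class", model_class,
    ]
    # Only pass --model_params when the model actually takes parameters.
    if params.strip():
        command += ["--model_params", *params.split()]
    # Only pass --split_size when a smaller split is requested.
    if split_size is not None:
        command += ["--split_size", str(split_size)]
    return command


command = build_eval_command("Dummy_Model", "dummy_model", "dummy_model", split_size=10)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` outside a notebook, or printed and pasted after `!` in a Colab cell.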

In [4]:
!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS --model_params $PARAMS --split_size 10

Loading Questions: 100% 10/10 [00:00<00:00, 137518.16it/s]
Loading model...
2024-12-15 10:57:51.488762: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-15 10:57:51.512229: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-15 10:57:51.519848: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-15 10:57:51.536790: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in othe

The results of the evaluation are saved to a text file in the Evaluation package, for example *dummy_model_split_size=7900_results.txt*.
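To locate and print the results after a run, something like the following may help. It is a sketch under two assumptions: the results files sit directly in the `Evaluation` directory, and their names end in `_results.txt` as in the example file name above. The function `find_result_files` is hypothetical.

```python
from pathlib import Path


def find_result_files(evaluation_dir="Evaluation"):
    """Return all files ending in _results.txt in the given directory, sorted by name.

    Hypothetical helper; the naming pattern is assumed from the example file name.
    """
    return sorted(Path(evaluation_dir).glob("*_results.txt"))


# Print each results file found (nothing is printed if none exist).
for path in find_result_files():
    print(path)
    print(path.read_text())
```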


