<a href="https://colab.research.google.com/github/dimitrod/ehu_nlp_dimathina/blob/main/Guides/evaluate_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Evaluation
This notebook lets you evaluate any model developed during the project against the evaluation data.

By default, the evaluation data is the first 7900 questions of the train split of the TriviaQA dataset.
To change the evaluation data, use the *load_evaluation_data.py* script in the utils package. The section [Changing the Evaluation Data](#scrollTo=H294GP72zhM9) explains how.

The following models are available:

*   (Dummy Model: Always gives the correct answer to the first 7900 questions)
*   Tiny Llama
*   Mistral (without RAG)
*   RAG with QA Embeddings










---



## Setup
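Enter the values for the model you want to evaluate, taken from its row in the following table, into the cell below the table.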

| Model | BRANCH | DIRECTORY | MODULE | CLASS | PARAMS | Note |
|----------|----------|----------|----------|----------|----------|----------|
|Dummy Model|main|Dummy_Model|dummy_model|dummy_model|2 test params|-|
|Tiny Llama|tiny_llama|TinyLlama|tiny_llama|tinyllama|-|**very slow**|
|Mistral without RAG|mistral-instruct-no-rag|MistralInstruct|mistral_instruct|mistral_instruct|-|**Huggingface Token and GPU required**|
|RAG with QA Embeddings|rag_qa_embeddings|RAG_QA_Embeddings|rag_qa_embeddings|rag_qa_embeddings|k=contexts|-|

In [11]:
import os

os.environ["BRANCH"] = "main"
os.environ["DIRECTORY"] = "Dummy_Model"
os.environ["MODULE"] = "dummy_model"
os.environ["CLASS"] = "dummy_model"
os.environ["PARAMS"] = "hello world"
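A stray space or tab in one of these values is easy to miss and may cause confusing errors later on. The following is a minimal sketch of a sanity check for the settings above; the helper name `find_setting_problems` and the choice of which variables count as required are assumptions made for illustration, not part of the project.

```python
import os

# Variables every model in the table needs (assumption: PARAMS is optional,
# because some models in the table take no parameters).
REQUIRED = ("BRANCH", "DIRECTORY", "MODULE", "CLASS")


def find_setting_problems(env):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif value != value.strip():
            # Catches accidental spaces or tabs around the value.
            problems.append(f"{name} has leading or trailing whitespace: {value!r}")
    return problems


# Report problems for the current environment without stopping the notebook.
for problem in find_setting_problems(os.environ):
    print("Warning:", problem)
```

Running this right after the settings cell prints one warning per problematic variable and nothing when all values look clean.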

Execute the following cell to set up the evaluation environment:

In [12]:
directory = os.environ["DIRECTORY"]

!git clone --no-checkout https://github.com/dimitrod/ehu_nlp_dimathina.git
%cd ehu_nlp_dimathina/
!git sparse-checkout init --cone
!git sparse-checkout set Evaluation/
!git checkout main
%mv Evaluation/ ..
!git sparse-checkout set $DIRECTORY
!git checkout $BRANCH
%mv $DIRECTORY ..
%cd ..
!git clone https://github.com/mandarjoshi90/triviaqa.git
%rm -r ehu_nlp_dimathina
%rm -r sample_data
%cd triviaqa/
!pip install -r requirements.txt
%cd ../Evaluation
!pip install -r requirements.txt
%cd ../{directory}
!pip install -r requirements.txt
%cd ..

Cloning into 'ehu_nlp_dimathina'...
remote: Enumerating objects: 351, done.



---



## Changing the Evaluation Data

**If you don't want to change the size of the evaluation dataset, skip this step.**

This step is only required if you want to test the model with a larger dataset than the first 7900 questions of the train split of the TriviaQA Wikipedia dataset. For example, to test a RAG model that also uses the web dataset as its vector database, increase the evaluation dataset to the required first 9500 questions of the train split with the following command:

In [None]:
!python3 -m Evaluation.utils.load_evaluation_data --split_size 9500

README.md: 100% 26.7k/26.7k [00:00<00:00, 54.3MB/s]
Resolving data files: 100% 26/26 [00:00<00:00, 86.16it/s]
train-00000-of-00007.parquet: 100% 240M/240M [00:01<00:00, 161MB/s]
train-00001-of-00007.parquet: 100% 261M/261M [00:02<00:00, 106MB/s]
train-00002-of-00007.parquet: 100% 319M/319M [00:01<00:00, 183MB/s]
train-00003-of-00007.parquet: 100% 266M/266M [00:04<00:00, 55.5MB/s]
train-00004-of-00007.parquet: 100% 240M/240M [00:01<00:00, 152MB/s]
train-00005-of-00007.parquet: 100% 259M/259M [00:02<00:00, 88.6MB/s]
train-00006-of-00007.parquet: 100% 253M/253M [00:05<00:00, 45.7MB/s]
validation-00000-of-00001.parquet: 100% 235M/235M [00:03<00:00, 75.4MB/s]
test-00000-of-00001.parquet: 100% 221M/221M [00:02<00:00, 84.0MB/s]
Generating train split: 100% 61888/61888 [00:40<00:00, 1523.56 examples/s]
Generating validation split: 100% 7993/7993 [00:05<00:00, 1365.50 examples/s]
Generating test split: 100% 7701/7701 [00:03<00:00, 1942.78 examples/s]
Loading dataset: 100% 9500/9500 [00:09<00:00



---



## Executing the Evaluation Chain
Now you can run the evaluation chain with the model of your choice. In this case it is a dummy model that tests the functionality of the evaluation chain. The dummy model always looks up the correct answer to every question.

To start the evaluation chain, run the following command:

`!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS --model_params $PARAMS --split_size 7900`

**If your model doesn't use any parameters, delete the `--model_params $PARAMS` part from the command.**

To test the model against only part of the evaluation set, choose a smaller split size. With the default evaluation data, the maximum split size is 7900.
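The rule about dropping `--model_params` can also be expressed in code. The sketch below assembles the argument list from the settings and only adds `--model_params` when `PARAMS` is non-empty. The helper name `build_eval_command` is hypothetical, and splitting `PARAMS` on whitespace is an assumption based on how the unquoted `$PARAMS` is expanded in the command above.

```python
import shlex


def build_eval_command(env, split_size=7900):
    """Assemble the evaluation chain command as a list of arguments."""
    cmd = [
        "python3", "-m", "Evaluation.evaluation_chain",
        "--model_directory", env["DIRECTORY"].strip(),
        "--model_module", env["MODULE"].strip(),
        "--model_class", env["CLASS"].strip(),
    ]
    # Assumption: parameters are whitespace-separated, as with unquoted $PARAMS.
    params = shlex.split(env.get("PARAMS", ""))
    if params:
        cmd += ["--model_params", *params]
    cmd += ["--split_size", str(split_size)]
    return cmd


# Example with the dummy model settings from the setup section.
settings = {
    "DIRECTORY": "Dummy_Model",
    "MODULE": "dummy_model",
    "CLASS": "dummy_model",
    "PARAMS": "hello world",
}
print(shlex.join(build_eval_command(settings)))
```

In the notebook you could pass `os.environ` instead of the sample dictionary and hand the resulting list to `subprocess.run`, which avoids editing the command by hand for models without parameters.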

In [13]:
!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS --model_params $PARAMS --split_size 7900

Loading Questions: 100% 7900/7900 [00:00<00:00, 853752.84it/s]
Loading model...
This is param 1: hello
This is param 2: world
Model initiated
Starting QA...
Retrieving Answers: 100% 7900/7900 [00:00<00:00, 620087.61it/s]
QA complete
Starting evaluation...
Evaluation complete
Saving results...
Test results saved under Evaluation/results/dummy_model_split_size=7900_results.txt


The results of the test are saved in the file *dummy_model_split_size=7900_results.txt* in the *Evaluation/results* directory.
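To inspect the results from within the notebook, a small sketch like the following can help. The file name pattern is inferred from the single example in the log output above; whether the prefix comes from the class name or the module name is an assumption, since both are `dummy_model` here. The helper `results_path` is hypothetical.

```python
from pathlib import Path

# Directory taken from the log line "Test results saved under Evaluation/results/...".
RESULTS_DIR = Path("Evaluation") / "results"


def results_path(model_name, split_size):
    """Guess the results file path for a model and split size (inferred pattern)."""
    return RESULTS_DIR / f"{model_name}_split_size={split_size}_results.txt"


path = results_path("dummy_model", 7900)
if path.exists():
    print(path.read_text())
else:
    print(f"No results file found at {path}")
```

If the file is not found for another model, listing the contents of `Evaluation/results` shows the name the evaluation chain actually used.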


