<a href="https://colab.research.google.com/github/dimitrod/ehu_nlp_dimathina/blob/main/Guides/evaluate_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Evaluation
This notebook evaluates any model developed during the project against the evaluation data.

By default, the evaluation data is the first 7900 questions from the train split of the TriviaQA dataset.
To evaluate on a different split, use the *load_evaluation_data.py* script in the utils package. A guide for this is in the section [Changing the Evaluation Data](#scrollTo=H294GP72zhM9).

The following models are available:

*   (Dummy Model: Always gives the correct answer to the first 7900 questions)
*   Tiny Llama
*   Mistral (without RAG)
*   RAG with QA Embeddings

---



## Setup
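Enter the values for the model you want to evaluate into the configuration cell below the table. Each row of the table lists the settings for one model.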

| Model | BRANCH | DIRECTORY | MODULE | CLASS | PARAMS | Note |
|----------|----------|----------|----------|----------|----------|----------|
|Dummy Model|main|Dummy_Model|dummy_model|dummy_model|2 test params|-|
|Tiny Llama|tiny_llama|TinyLlama|tiny_llama|tinyllama|-|**very slow**|
|Mistral without RAG|mistral-instruct-no-rag|MistralInstruct|mistral_instruct|mistral_instruct|-|**Huggingface Token and GPU required**|
|RAG with QA Embeddings|rag_qa_embeddings|RAG_QA_Embeddings|rag_qa_embeddings|rag_qa_embeddings|k=contexts|-|

In [1]:
import os

os.environ["BRANCH"] = "main"
os.environ["DIRECTORY"] = "Dummy_Model"
os.environ["MODULE"] = "dummy_model"
os.environ["CLASS"] = "dummy_model"
os.environ["PARAMS"] = "hello world"
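
Stray whitespace or a missing value in these settings is easy to overlook, because the error only surfaces later when the branch is checked out or the model is loaded. The following is a minimal sketch of a sanity check, assuming the variable names used in the cell above; the helper `check_settings` is hypothetical and not part of the repository.

```python
import os

# Settings that every model in the table needs (PARAMS is optional).
REQUIRED = ("BRANCH", "DIRECTORY", "MODULE", "CLASS")


def check_settings(env):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif value != value.strip():
            problems.append(
                f"{name} has leading or trailing whitespace: {value!r}"
            )
    return problems


if __name__ == "__main__":
    # Prints nothing when all required settings look fine.
    for problem in check_settings(os.environ):
        print(problem)
```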

Execute the following script to set up the evaluation environment.

In [2]:
directory = os.environ["DIRECTORY"]

!git clone --no-checkout https://github.com/dimitrod/ehu_nlp_dimathina.git
%cd ehu_nlp_dimathina/
!git sparse-checkout init --cone
!git sparse-checkout set Evaluation/
!git checkout main
%mv Evaluation/ ..
!git sparse-checkout set $DIRECTORY
!git checkout $BRANCH
%mv $DIRECTORY ..
%cd ..
!git clone https://github.com/mandarjoshi90/triviaqa.git
%rm -r ehu_nlp_dimathina
%rm -r sample_data
%cd triviaqa/
!pip install -r requirements.txt
%cd ../Evaluation
!pip install -r requirements.txt
%cd ../{directory}
!pip install -r requirements.txt
%cd ..

Cloning into 'ehu_nlp_dimathina'...
remote: Enumerating objects: 441, done.
remote: Counting objects: 100% (293/293), done.
remote: Compressing objects: 100% (186/186), done.
remote: Total 441 (delta 154), reused 188 (delta 94), pack-reused 148 (from 1)
Receiving objects: 100% (441/441), 28.07 MiB | 17.52 MiB/s, done.
Resolving deltas: 100% (235/235), done.
/content/ehu_nlp_dimathina
Branch 'main' set up to track remote branch 'main' from 'origin'.
Switched to a new branch 'main'
Already on 'main'
Your branch is up to date with 'origin/main'.
/content
Cloning into 'triviaqa'...
remote: Enumerating objects: 70, done.
remote: Total 70 (delta 0), reused 0 (delta 0), pack-reused 70 (from 1)
Receiving objects: 100% (70/70), 20.60 KiB | 3.43 MiB/s, done.
Resolving deltas: 100% (28/28), done.
/content/triviaqa
/content/Evaluation
Collecting datasets (from -r requirements.txt (line 1))
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.



---



## Changing the Evaluation Data

**If you don't want to change the split of the evaluation dataset, skip this step.**

This step is only required if you want to evaluate with a different split of the dataset. You can change the evaluation dataset with the following command:

`!python3 -m Evaluation.utils.load_evaluation_data --split {split} --split_size {split_size}`

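Keep the maximum split sizes in mind. As the output below shows, the train split has 61888 examples, the validation split 7993, and the test split 7701.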
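
A requested split size can be checked against these limits before the loading script is run. The sketch below assumes the split sizes reported in the dataset loading output of this notebook (61888 train, 7993 validation, 7701 test); the helper `validate_split` is hypothetical and not part of the utils package.

```python
# Split sizes as reported by the dataset loading output in this notebook.
MAX_SPLIT_SIZES = {"train": 61888, "validation": 7993, "test": 7701}


def validate_split(split, split_size):
    """Raise ValueError if the split is unknown or the size is out of range."""
    if split not in MAX_SPLIT_SIZES:
        known = ", ".join(sorted(MAX_SPLIT_SIZES))
        raise ValueError(f"Unknown split {split!r}; expected one of: {known}")
    limit = MAX_SPLIT_SIZES[split]
    if not 1 <= split_size <= limit:
        raise ValueError(
            f"split_size must be between 1 and {limit} for the {split} split"
        )


# The default evaluation data passes the check.
validate_split("train", 7900)
```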

In [3]:
!python3 -m Evaluation.utils.load_evaluation_data --split "train" --split_size 7900

README.md: 100% 26.7k/26.7k [00:00<00:00, 44.1MB/s]
Resolving data files: 100% 26/26 [00:04<00:00,  5.25it/s]
train-00000-of-00001.parquet: 100% 25.0M/25.0M [00:00<00:00, 68.8MB/s]
validation-00000-of-00001.parquet: 100% 3.31M/3.31M [00:00<00:00, 63.4MB/s]
test-00000-of-00001.parquet: 100% 542k/542k [00:00<00:00, 219MB/s]
Generating train split: 100% 61888/61888 [00:00<00:00, 120788.64 examples/s]
Generating validation split: 100% 7993/7993 [00:00<00:00, 121794.20 examples/s]
Generating test split: 100% 7701/7701 [00:00<00:00, 250788.73 examples/s]
Loading dataset: 100% 7900/7900 [00:01<00:00, 3985.31it/s]
Replacing field names: 100% 19/19 [00:00<00:00, 59.48it/s]
Saving file...




---



## Executing the Evaluation Chain
Now you can run the evaluation chain with the model of your choice. In this case it is a dummy model that tests the functionality of the evaluation chain. The dummy model always looks up the correct answer to every question.

To start the evaluation chain, run the following command:

`!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS`



*   If the model requires parameters, you can set them by adding `--model_params $PARAMS`.
*   If you want to evaluate the model on a smaller split, you can do that by adding `--split_size {split_size}`.
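
The required and optional flags above can be assembled programmatically, which helps when switching between models. This is a minimal sketch that only builds the argument list from the flags named in this section; the function `build_evaluation_command` is hypothetical, and splitting `PARAMS` on whitespace is an assumption that mirrors how the shell expands the unquoted `$PARAMS`.

```python
import shlex


def build_evaluation_command(settings, split_size=None):
    """Build the evaluation chain command as a list of arguments."""
    command = [
        "python3", "-m", "Evaluation.evaluation_chain",
        "--model_directory", settings["DIRECTORY"],
        "--model_module", settings["MODULE"],
        "--model_class", settings["CLASS"],
    ]
    # Assumption: parameters are separated by whitespace, as in the shell.
    params = settings.get("PARAMS", "").split()
    if params:
        command += ["--model_params", *params]
    if split_size is not None:
        command += ["--split_size", str(split_size)]
    return command


settings = {
    "DIRECTORY": "Dummy_Model",
    "MODULE": "dummy_model",
    "CLASS": "dummy_model",
    "PARAMS": "hello world",
}
# Print the command in a form that can be pasted into a shell.
print(shlex.join(build_evaluation_command(settings, split_size=100)))
```

Models without parameters can leave `PARAMS` empty or unset, in which case the `--model_params` flag is omitted.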

In [4]:
!python3 -m Evaluation.evaluation_chain --model_directory $DIRECTORY --model_module $MODULE --model_class $CLASS --model_params $PARAMS

Loading Questions: 100% 7900/7900 [00:00<00:00, 84829.08it/s]
Loading model...
This is param 1: hello
This is param 2: world
Model initiated
Starting QA...
Retrieving Answers: 100% 7900/7900 [00:00<00:00, 611821.00it/s]
QA complete
Starting evaluation...
Evaluation complete
Saving results...
Test results saved under Evaluation/results/dummy_model_split_size=7993_results.txt


The results of the test are saved as a text file under *Evaluation/results/* in the Evaluation package. The exact file name is printed on the last line of the output above.
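
To see which result files exist after one or more runs, the results directory can be listed. This is a minimal sketch that assumes result files end in `_results.txt`, as in the output above; the function `list_result_files` is hypothetical, and the internal format of the result files is not assumed here.

```python
from pathlib import Path


def list_result_files(results_dir="Evaluation/results"):
    """Return all result files in the directory, sorted by name."""
    return sorted(Path(results_dir).glob("*_results.txt"))


# Prints one path per result file; prints nothing if there are none yet.
for path in list_result_files():
    print(path)
```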


