# Evaluation

Now that the solution is implemented, we want to see how it performs compared to the baseline.

In [1]:
from mediqa.config.core import config
from mediqa.rag.manager import VectorDBManager
from mediqa.rag.reader import Reader

vdb_manager = VectorDBManager(config.rag_config)
reader = Reader(config.reader_config)

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 8/8 [00:05<00:00,  1.56it/s]
Device set to use cuda:0


## Validation set

We first compute scores on the validation set that was used to compare the candidate models. This lets us see how the additions affect performance.

In [2]:
import pandas as pd

val_mini_df = pd.read_csv("../data/val_mini.csv")

In [3]:
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

In [7]:
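from tqdm import tqdm

# Generate one response per validation question, using the retrieved documents as context.
responses = []
for _, row in tqdm(val_mini_df.iterrows(), total=len(val_mini_df)):
    question = row["question"]
    retrieved_docs = vdb_manager.retrieve(question)
    # generate() takes batched inputs, so unwrap the single result for this question.
    response = reader.generate([question], [retrieved_docs])[0][0]["generated_text"]
    responses.append(response)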

100%|██████████| 20/20 [03:22<00:00, 10.14s/it]


In [8]:
rouge.compute(predictions=responses, references=val_mini_df["answer"].tolist())

{'rouge1': np.float64(0.2922905644961558),
 'rouge2': np.float64(0.08503566174558237),
 'rougeL': np.float64(0.1633339509506666),
 'rougeLsum': np.float64(0.18049857927222107)}

In [9]:
bleu.compute(predictions=responses, references=val_mini_df["answer"].tolist())

{'bleu': 0.052588917299906786,
 'precisions': [0.3598300970873786,
  0.09737484737484738,
  0.04176904176904177,
  0.0203955500618047],
 'brevity_penalty': 0.711476691707965,
 'length_ratio': 0.7460389316432775,
 'translation_length': 3296,
 'reference_length': 4418}

Based on the scores above, our solution achieves a higher BLEU score than before (0.0526 vs. 0.0455) but lower ROUGE scores overall. This suggests that the responses now share more exact phrasing with the references but cover less of the expected content. The BLEU output points the same way: a length ratio of about 0.75 means the responses are roughly 25% shorter than the references. A likely cause is that we encourage the model to respond only from the provided context, which may not contain all the details needed to construct a full response.
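To see how much of the BLEU score is lost to response length, the reported value can be rebuilt from its parts. The sketch below applies the standard BLEU formula (brevity penalty times the geometric mean of the n-gram precisions) to the numbers copied from the `bleu.compute` output above. The variable names are ours, not part of the `evaluate` API.

```python
import math

# Values copied from the bleu.compute output above.
precisions = [
    0.3598300970873786,
    0.09737484737484738,
    0.04176904176904177,
    0.0203955500618047,
]
translation_length = 3296
reference_length = 4418

# Ratio of generated tokens to reference tokens.
length_ratio = translation_length / reference_length

# Brevity penalty: 1 when the output is longer than the reference,
# otherwise exp(1 - 1/ratio), which shrinks the score for short outputs.
brevity_penalty = 1.0 if length_ratio > 1.0 else math.exp(1.0 - 1.0 / length_ratio)

# Geometric mean of the 1- to 4-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu_score = brevity_penalty * geo_mean

print(f"length ratio:    {length_ratio:.4f}")
print(f"brevity penalty: {brevity_penalty:.4f}")
print(f"geometric mean:  {geo_mean:.4f}")
print(f"BLEU:            {bleu_score:.4f}")
```

The geometric mean alone is about 0.074, so the brevity penalty removes close to 30% of the score. This is consistent with the responses being too short rather than poorly worded.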

One way to mitigate this would be to increase the number of documents returned by our RAG retrieval step.
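A minimal sketch of how such a comparison could be set up is shown below. It is not the project's implementation: `sweep_top_k` and the `toy_*` helpers are hypothetical, and whether `vdb_manager.retrieve` accepts a document count directly or reads it from `config.rag_config` is an assumption to check against the actual code. The toy functions stand in for the retriever, reader, and metric so the example runs without the model or vector store.

```python
from typing import Callable, Dict, List, Sequence


def sweep_top_k(
    questions: Sequence[str],
    references: Sequence[str],
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    score: Callable[[List[str], List[str]], float],
    ks: Sequence[int] = (2, 4, 8),
) -> Dict[int, float]:
    """Return one score per candidate k (number of retrieved documents)."""
    results: Dict[int, float] = {}
    for k in ks:
        predictions = [generate(q, retrieve(q, k)) for q in questions]
        results[k] = score(predictions, list(references))
    return results


# Hypothetical stand-ins for the real retriever, reader, and metric.
corpus = ["aspirin reduces fever", "aspirin thins blood", "rest helps recovery"]


def toy_retrieve(question: str, k: int) -> List[str]:
    # Pretend the first k corpus entries are the k best matches.
    return corpus[:k]


def toy_generate(question: str, docs: List[str]) -> str:
    # Pretend the reader simply repeats its context.
    return " ".join(docs)


def toy_recall(predictions: List[str], references: List[str]) -> float:
    # Average fraction of distinct reference words found in each prediction.
    total = 0.0
    for pred, ref in zip(predictions, references):
        ref_words = set(ref.split())
        total += len(ref_words & set(pred.split())) / len(ref_words)
    return total / len(references)


scores = sweep_top_k(
    ["what does aspirin do?"],
    ["aspirin reduces fever and thins blood"],
    toy_retrieve,
    toy_generate,
    toy_recall,
    ks=(1, 2, 3),
)
print(scores)
```

In the notebook, `retrieve` and `generate` would wrap `vdb_manager.retrieve` and `reader.generate`, and `score` would wrap `rouge.compute` or `bleu.compute`. In this toy run the score rises from k=1 to k=2 and then stays flat at k=3, because the third document adds no reference content. With a real model, extra documents can also add noise, so the sweep is worth running before picking a value.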