# Model Selection

For modern NLP tasks, training a model from scratch is often complex and rarely cost-effective. Fortunately, platforms like Hugging Face provide free access to many pretrained models specialized in different tasks and domains. We therefore start by selecting a few candidate models that may perform well out of the box.

In [1]:
import pandas as pd

val_df = pd.read_csv("../data/val.csv")

In [None]:
candidates = [
    "ThisIs-Developer/Llama-2-GGML-Medical-Chatbot",
    # "xdatasi/xdata-finetune-deepseek-reason-test-medical",
    "WangCa/Qwen2.5-7B-Medicine",
    "google/gemma-7b",
]

The first candidates listed above are fine-tuned models designed for question answering in the medical domain. They have different architectures and, according to their descriptions, were trained on a variety of medical sources, with some using up to 80 million documents. A general-purpose LLM (`gemma-7b`) is also included as a baseline, so that we can compare the specialized models against a general one.

## Initial validation
We want to see how the models perform as is: if we used them without any additional changes, how close would their responses be to the reference answers in our validation dataset? To find out, we evaluate the generated responses against the provided ones using BLEU and ROUGE. These metrics are widely used for text generation tasks, including generative QA, and broadly work by counting the n-grams shared between the reference and generated responses. They have limitations (for example, they are poorly suited to long answers and, because they only match surface n-grams, they give no credit for synonyms or paraphrases), but they suffice for evaluating relative improvements.
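The n-gram counting that both metrics build on can be sketched in a few lines. This is a simplified, single-pair illustration using whitespace tokenization, not the implementation used by the `evaluate` library; the helper names `ngrams` and `ngram_overlap` are hypothetical. Precision (the share of predicted n-grams found in the reference) is the quantity BLEU is built from, while recall (the share of reference n-grams found in the prediction) is the quantity ROUGE emphasizes.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(prediction, reference, n):
    """Return (precision, recall) of clipped n-gram matches for one pair.

    Hypothetical helper: whitespace tokenization, no stemming or smoothing.
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each match so a repeated n-gram is not counted more often
    # than it appears in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = matched / pred_total if pred_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    return precision, recall


print(ngram_overlap("general kenobi", "general obi wan kenobi", 1))  # (1.0, 0.5)
print(ngram_overlap("general kenobi", "general obi wan kenobi", 2))  # (0.0, 0.0)
```

Every predicted word appears in the reference, so unigram precision is perfect, but only half of the reference words are recovered, and the single predicted bigram never occurs in the reference.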

In [None]:
import evaluate
from transformers import pipeline

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

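def evaluate_model(model_name, val_df):
    # The candidates are generative models, so use "text-generation";
    # the extractive "question-answering" pipeline also requires a context passage.
    pipe = pipeline("text-generation", model=model_name)
    # max_new_tokens=128 is an arbitrary cap chosen for illustration.
    predictions = val_df["question"].apply(
        lambda q: pipe(q, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    )
    print(predictions)
    return predictions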

In [4]:
import evaluate

rouge = evaluate.load("rouge")
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general obi wan kenobi"]
results = rouge.compute(predictions=predictions, references=references)

In [5]:
results

{'rouge1': np.float64(0.8333333333333333),
 'rouge2': np.float64(0.5),
 'rougeL': np.float64(0.8333333333333333),
 'rougeLsum': np.float64(0.8333333333333333)}

In [7]:

bleu = evaluate.load("bleu")
predictions = ["hello there", "general kenobi"]
references = ["hello there", "general obi wan kenobi"]
results = bleu.compute(predictions=predictions, references=references)

In [9]:
results

{'bleu': 0.0,
 'precisions': [1.0, 0.5, 0.0, 0.0],
 'brevity_penalty': 0.6065306597126334,
 'length_ratio': 0.6666666666666666,
 'translation_length': 4,
 'reference_length': 6}
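The BLEU output above can be reproduced by hand. The sketch below assumes the standard, unsmoothed BLEU definition (brevity penalty times the geometric mean of the n-gram precisions); the function names are hypothetical and this is not the `evaluate` library's own code. It shows why the score is 0.0 even though the unigram precision is 1.0: the predictions are too short to contain any 3-grams or 4-grams, and a single zero precision drives the geometric mean to zero.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Penalize predictions that are shorter than the references."""
    if translation_length == 0:
        return 0.0
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


def bleu_from_precisions(precisions, bp):
    """Combine n-gram precisions with a geometric mean (no smoothing)."""
    if min(precisions) == 0:
        return 0.0  # any zero precision makes the geometric mean zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


# Values taken from the output above: 4 predicted tokens, 6 reference tokens.
bp = brevity_penalty(4, 6)
print(bp)  # exp(1 - 6/4) = exp(-0.5), about 0.6065
print(bleu_from_precisions([1.0, 0.5, 0.0, 0.0], bp))  # 0.0
```

For short answers like these, the unsmoothed score is therefore more informative when read alongside the individual `precisions` than on its own.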