# Model Selection

For modern NLP tasks, training models from scratch is often complex and rarely cost-effective. Thankfully, platforms like Hugging Face provide free access to many models specialized in different tasks and domains. We therefore start by selecting a few candidate models that may perform well out of the box.

In [1]:
import pandas as pd
from pathlib import Path
from mediqa.config.core import DATASET_DIR

val_df = pd.read_csv(Path(DATASET_DIR) / "val.csv")

In [2]:
candidates = [
    "AdaptLLM/medicine-LLM",
    "ritvik77/Medical_Doctor_AI_LoRA-Mistral-7B-Instruct_FullModel",
    "ContactDoctor/Bio-Medical-Llama-3-8B",
    "HuggingFaceH4/zephyr-7b-beta",
]

The first three candidates above are fine-tuned models designed for question answering in the medical domain. They have different architectures and, according to their descriptions, were trained on a variety of medical sources, with some using up to 80 million documents. Additionally, a modern general-purpose LLM (`HuggingFaceH4/zephyr-7b-beta`) has been added as a baseline, so that performance can be compared between specialized and general models.

## Initial validation
We want to see how the models perform as is: if we used them without any additional changes, how close would their responses be to those in our validation dataset? To answer this, we evaluate the generated responses against the provided ones using the BLEU and ROUGE metrics. These metrics are widely used for text generation tasks, including QA, and broadly work by counting matching n-grams between the reference and generated responses. They have limitations (they are poorly suited to long answers, capture word order only locally through n-grams, and give no credit for paraphrases), but they suffice for evaluating relative improvements.
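
As a rough illustration of the n-gram matching idea, the sketch below computes a clipped n-gram precision (the quantity BLEU builds on) and recall (the quantity ROUGE-N builds on). It is a simplified sketch, not the implementation used by the `evaluate` library: the function names `ngrams` and `ngram_overlap` are hypothetical, and tokenization is assumed to be plain lowercased whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(prediction, reference, n=1):
    """Clipped n-gram precision (BLEU-style) and recall (ROUGE-N-style).

    Assumption: lowercased whitespace tokenization, which is cruder than
    the tokenizers used by the real metric implementations.
    """
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    matches = sum((pred & ref).values())
    precision = matches / max(sum(pred.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    return precision, recall


p, r = ngram_overlap(
    "aspirin reduces fever and pain",
    "aspirin reduces pain and inflammation",
)
print(f"unigram precision={p:.2f}, recall={r:.2f}")
```

With `n=1` the two sentences share four of five tokens regardless of their order, while with `n=2` only one bigram matches, which shows how higher-order n-grams make the score sensitive to local word order.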

In [3]:
import torch
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)


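def load_pipeline(model_name):
    """Load `model_name` with 4-bit quantization and wrap it in a text-generation pipeline."""
    # 4-bit NF4 weights with double quantization reduce the memory footprint;
    # computation runs in bfloat16.
    bnb_config = BitsAndBytesConfig(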
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    pipe = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        do_sample=True,
        temperature=0.2,
        repetition_penalty=1.1,
        return_full_text=False,
        max_new_tokens=1000,
    )
    return pipe

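def evaluate_pipe(pipe, df: pd.DataFrame, evaluators: list):
    """Generate an answer for every question in `df` and score the answers with each evaluator."""
    tqdm.pandas(desc=f"Generating answers from {pipe.tokenizer.name_or_path}")
    # Each question is passed to the model as-is, without a prompt template.
    predictions = df['question'].progress_apply(lambda x: pipe(x)[0]['generated_text'])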
    references = df['answer']

    results = []

    for evaluator in evaluators:
        result = evaluator.compute(predictions=predictions.tolist(), references=references.tolist())
        results.append(result)

    return results



In [4]:
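# A fixed seed (an addition, not used for the results shown below) makes the sample reproducible.
val_mini_df = val_df.sample(20, random_state=42)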

In [5]:
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")


In [6]:
import gc

try:
    del pipe
except NameError:
    print("No pre-existing pipeline")
else:
    # Release the memory held by the previous pipeline.
    gc.collect()
    torch.cuda.empty_cache()


No pre-existing pipeline


In [7]:
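def get_benchmark_results(candidates) -> pd.DataFrame:
    """Score each candidate on `val_mini_df` and return one row of metrics per model."""
    benchmark = {}

    for candidate in candidates:
        pipe = load_pipeline(candidate)
        results = evaluate_pipe(pipe, val_mini_df, [rouge, bleu])
        # Merge the ROUGE and BLEU result dicts into a single row.
        total_results = {}
        for result_dict in results:
            total_results.update(result_dict)
        benchmark[candidate] = total_results

        # Free GPU memory before loading the next model.
        del pipe
        gc.collect()
        torch.cuda.empty_cache()

    # Models become rows, metrics become columns.
    return pd.DataFrame(benchmark).T

# This call scores only the first candidate; pass `candidates` to score them all.
benchmark_df = get_benchmark_results([candidates[0]])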

`low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 33/33 [00:08<00:00,  3.83it/s]

In [8]:
benchmark_df

| model | rouge1 | rouge2 | rougeL | rougeLsum | bleu | precisions | brevity_penalty | length_ratio | translation_length | reference_length |
|---|---|---|---|---|---|---|---|---|---|---|
| AdaptLLM/medicine-LLM | 0.086833 | 0.016102 | 0.0498 | 0.05099 | 0.003524 | [0.03234050052872753, 0.005821132474863292, 0.... | 1.0 | 5.137166 | 22696 | 4418 |
| ritvik77/Medical_Doctor_AI_LoRA-Mistral-7B-Instruct_FullModel | 0.288463 | 0.073674 | 0.159944 | 0.200634 | 0.039087 | [0.30434782608695654, 0.06748324474231569, 0.0... | 0.9838 | 0.983929 | 4347 | 4418 |
| ContactDoctor/Bio-Medical-Llama-3-8B | 0.191224 | 0.055803 | 0.116421 | 0.146675 | 0.032816 | [0.2637979420018709, 0.07530593034201444, 0.02... | 0.685497 | 0.725894 | 3207 | 4418 |
| HuggingFaceH4/zephyr-7b-beta | 0.294022 | 0.073845 | 0.157857 | 0.202263 | 0.045557 | [0.27755102040816326, 0.07027027027027027, 0.0... | 1.0 | 1.386374 | 6125 | 4418 |
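
The `length_ratio` and `brevity_penalty` columns can be recomputed from the two length columns as a sanity check. The sketch below uses the standard BLEU brevity penalty, which is 1 when the generated text is at least as long as the reference and $\exp(1 - r/c)$ otherwise, with $c$ the generated length and $r$ the reference length. The helper name `brevity_penalty` is hypothetical, and this is a sketch of the standard formula rather than the `evaluate` library's own code.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Standard BLEU brevity penalty: 1 if the output is at least as long
    as the reference, otherwise exp(1 - reference / translation)."""
    if translation_length == 0:
        return 0.0  # Guard against division by zero for empty output.
    if translation_length >= reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


# (translation_length, reference_length) pairs taken from the table above.
rows = {
    "ContactDoctor/Bio-Medical-Llama-3-8B": (3207, 4418),
    "HuggingFaceH4/zephyr-7b-beta": (6125, 4418),
}
for name, (c, r) in rows.items():
    # Prints the model name, its length ratio, and its brevity penalty.
    print(name, round(c / r, 6), round(brevity_penalty(c, r), 6))
```

Read this way, the penalty of about 0.69 for `ContactDoctor/Bio-Medical-Llama-3-8B` reflects answers that are shorter than the references, while a `length_ratio` above 5 for `AdaptLLM/medicine-LLM` indicates much longer outputs, which is consistent with its low n-gram precision.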


In [9]:
val_mini_df.to_csv("../data/val_mini.csv", index=False)