# Research candidate LLMs for RAG

Based on the findings in our EDA, we decided to finetune an LLM for question answering. In this notebook we explore some candidate models, restricted by the GPU currently available for development (an RTX 3060 Ti), and analyse the best configurations for them (quantization, PEFT training).
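
To see why quantization matters on this GPU, a rough estimate of the memory taken by the weights alone can help. The sketch below is a back-of-the-envelope calculation, not a measurement: the parameter counts are rounded assumptions (about 8B for Llama 3.1 8B, about 3.8B for Phi-3.5-mini), the helper name `weight_memory_gb` is our own, and activations, the KV cache and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Rounded parameter counts (assumptions, not values read from the checkpoints)
models = {"Llama-3.1-8B": 8.0e9, "Phi-3.5-mini": 3.8e9}

for name, params in models.items():
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB in 16-bit, ~{int4:.1f} GB in 4-bit")
```

Under these assumptions, the 16-bit weights of an 8B model alone would exceed the 8 GB of the RTX 3060 Ti, while the 4-bit weights leave room for inference and adapter training.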

In [1]:
from transformers import pipeline
import pandas as pd
import evaluate
from bert_score import score as bertscore
import numpy as np
from typing import Callable
import mlflow



# define evaluation metrics

In [2]:
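def compute_evaluations(preds, targets):
    """Return ROUGE scores plus the mean BERTScore precision, recall and F1."""
    # ROUGE: n-gram overlap between predictions and references
    rouge = evaluate.load("rouge")
    rouge_scores = rouge.compute(predictions=preds, references=targets, use_stemmer=True)

    # BERTScore: similarity of contextual embeddings (named to avoid shadowing the import above)
    bertscore_metric = evaluate.load("bertscore")
    bert_scores = bertscore_metric.compute(
        predictions=preds, references=targets, lang="en", model_type="distilbert-base-uncased"
    )

    # average the per-example scores, reading them by key instead of relying on dict order
    bertscore_avg = {
        "bertscore_precision": np.array(bert_scores["precision"]).mean().item(),
        "bertscore_recall": np.array(bert_scores["recall"]).mean().item(),
        "bertscore_f1": np.array(bert_scores["f1"]).mean().item(),
    }

    return {**rouge_scores, **bertscore_avg}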

In [3]:
m = compute_evaluations(["if you have the flu you need to rest"], ["rest is the best for the flu"])

In [9]:
import torch
torch.cuda.is_available()

True

In [5]:
compute_evaluations(["if you have the flu you need to rest"], ["I like pizza"])

{'rouge1': np.float64(0.0),
 'rouge2': np.float64(0.0),
 'rougeL': np.float64(0.0),
 'rougeLsum': np.float64(0.0),
 'bertscore_precision': 0.6455556750297546,
 'bertscore_recall': 0.6872128844261169,
 'bertscore_f1': 0.665733277797699}

In [6]:

bertscore = evaluate.load("bertscore")
predictions = ["hello there", "general kenobi", "if you have the flu you need to rest"]
references = ["hello there", "general kenobi", "I like pizza but I dont like mangos"]
P, R, F1, _  = bertscore.compute(predictions=predictions, references=references, lang="en", idf=True, model_type="distilbert-base-uncased").values()
P, R, F1, _

([1.0, 1.0, 0.6719011664390564],
 [1.0, 1.0, 0.6600363254547119],
 [1.0, 1.0, 0.665915846824646],
 'distilbert-base-uncased_L5_idf_version=0.3.12(hug_trans=4.53.1)')

#### Note:
As we can see here, we cannot blindly trust BERTScore, especially on short inputs: two unrelated sentences still score about 0.67 F1 while their ROUGE scores are 0. It should still be a good metric to help us compare semantic similarity.
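
To make the contrast with ROUGE concrete, here is a minimal sketch of a unigram-overlap F1. It is a simplified stand-in for ROUGE-1 (lowercased whitespace tokens, no stemming), not the implementation used by `evaluate`, and the function name `rouge1_f1` is our own. Because it only counts shared words, unrelated sentences score exactly zero, whereas BERTScore compares embeddings and stays well above zero for the same pair.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified ROUGE-1 without stemming."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # multiset intersection counts each shared word at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# No shared words, so the lexical score is zero
print(rouge1_f1("if you have the flu you need to rest", "I like pizza"))
# Some shared words ("the", "flu", "rest"), so the score is above zero
print(rouge1_f1("if you have the flu you need to rest", "rest is the best for the flu"))
```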

# test some models

In [10]:
val_df = pd.read_parquet("../data/cleaned/validation_dataset.parquet")
val_df.head(5)

| Unnamed: 0 | question | answer | answer_words | valid_question | valid_answer |
|------------|----------|--------|--------------|----------------|--------------|
| 1735 | What is (are) Hypoglycemia ? | Hypoglycemia means low blood glucose, or blood... | 180 | True | True |
| 2900 | What are the symptoms of Dysautonomia like dis... | What are the signs and symptoms of Dysautonomi... | 237 | True | True |
| 1900 | What is (are) Tumors and Pregnancy ? | Tumors during pregnancy are rare, but they can... | 183 | True | True |
| 52 | What are the treatments for Alcohol Use and Ol... | There is not one right treatment for everyone ... | 133 | True | True |
| 1878 | Do you have information about Genetic Testing | Summary : Genetic tests are tests on blood and... | 174 | True | True |


In [None]:
candidate_models = [
    # ("text2text-generation", "google/flan-t5-base"),
    ("text-generation", "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"),
    ("text-generation", "unsloth/Phi-3.5-mini-instruct"),
]

for task, model_id in candidate_models:
    qa_model = pipeline(task, model=model_id)
    # chat models take a list of messages, text2text models take a plain string
    if task == "text-generation":
        prompt = [{"role": "user", "content": "What is the flu?"}]
    else:
        prompt = "What is the flu?"
    response = qa_model(prompt)[0]["generated_text"]
    print(f"model: {model_id} -> response: {response}")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cuda:0


model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit -> response: [{'role': 'user', 'content': 'What is the flu?'}, {'role': 'assistant', 'content': "The flu, also known as influenza, is a highly contagious respiratory illness caused by the influenza virus. It affects the lungs, nose, throat, and other parts of the respiratory system.\n\nThe flu is characterized by a combination of symptoms, which can vary in severity and may include:\n\n1. **Fever**: High temperatures, usually above 102°F (39°C).\n2. **Chills**: Feeling cold, even if your body temperature is normal.\n3. **Cough**: Dry, hacking cough or a productive cough that brings up mucus.\n4. **Sore throat**: Pain or discomfort in the throat.\n5. **Runny or stuffy nose**: Nasal congestion or discharge.\n6. **Headache**: Pain or discomfort in the head.\n7. **Fatigue**: Feeling extremely tired or exhausted.\n8. **Muscle or body aches**: Pain or discomfort in the muscles, back, or other parts of the body.\n9. **Diarrhea and vomiting**:

Fetching 2 files: 100%|██████████| 2/2 [16:09<00:00, 484.77s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.93it/s]
Device set to use cuda:0


model: unsloth/Phi-3.5-mini-instruct -> response: [{'role': 'user', 'content': 'What is the flu?'}, {'role': 'assistant', 'content': ' To feuding and "explainable means is the cure for a " being, not a single word in linguism tritium-sladkasy in the ccur offlu nctua.\n\n\nRound Leaders and poise \'by\'exp among the top 30 and in fluvent withstandx, c \n taquiza un language\'sireleation\'srance\n\nly. "-aditor\n {like-for} endowa-seems-ad*\n   In ta_{to} other can, ta e}\n\n'}]


In [None]:
import os

def evaluate_qa_models(
    df: pd.DataFrame,
    qa_model: Callable[[list], list],
    experiment_name: str,
    model_name: str,
    mlflow_uri: str = None,
):
    """
    Runs a chat-style QA pipeline over df["question"] and logs the ROUGE and
    BERTScore metrics from compute_evaluations, plus a CSV of predictions, to MLflow.
    """
    if mlflow_uri:
        mlflow.set_tracking_uri(mlflow_uri)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("num_examples", len(df))

        preds = []
        truths = []
        for _, row in df.iterrows():
            print(f"Processing question: {row['question']}")
            pred = qa_model([
                            {"role": "user", "content": row['question']},
                        ])
            print(f"pred: {pred}")
            preds.append(pred[0]['generated_text'][-1]['content'])

            truth = row["answer"]
            truths.append(truth)

        metrics = compute_evaluations(preds, truths)
        mlflow.log_metrics(metrics)

        # log predictions and truth as artifact for inspection
        out_df = df.copy()
        out_df["predicted"] = preds


        os.makedirs("mlflow_artifacts", exist_ok=True)
        csv_path = f"mlflow_artifacts/{model_name}_predictions.csv"
        out_df.to_csv(csv_path, index=False)
        mlflow.log_artifact(str(csv_path), artifact_path="predictions")
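
The expression `pred[0]['generated_text'][-1]['content']` above relies on the chat-style output of the pipeline, where the model's reply is appended as the last message of the conversation. As a sketch, that lookup could be wrapped in a small helper that fails loudly when the last message is not from the assistant. The helper name `extract_assistant_reply` and the sample below are hypothetical; the sample is hand-written to mirror the structure printed in this notebook, not real model output.

```python
def extract_assistant_reply(pipeline_output: list) -> str:
    """Return the text of the last message in a chat-style pipeline result."""
    conversation = pipeline_output[0]["generated_text"]
    last_message = conversation[-1]
    if last_message.get("role") != "assistant":
        raise ValueError("the last message is not an assistant reply")
    # strip the leading space some models emit at the start of a reply
    return last_message["content"].strip()


# Hand-written sample with the same nesting as the pipeline output
sample = [{"generated_text": [
    {"role": "user", "content": "What is the flu?"},
    {"role": "assistant", "content": " The flu is a respiratory illness."},
]}]
print(extract_assistant_reply(sample))
```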


In [None]:
# these may need to be executed separately depending on the GPU memory size
candidate_models = ["unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
                    "unsloth/Phi-3.5-mini-instruct"]

for model_id in candidate_models:
    qa_model = pipeline("text-generation", model=model_id)
    model_name = model_id.replace("/", "_").replace(".", "_")
    evaluate_qa_models(
        df = val_df.head(10),
        qa_model=qa_model,
        experiment_name="initial comparision",
        model_name=model_name
        )

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 11.82it/s]
Device set to use cuda:0
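
The comment in the cell above suggests running the models separately when GPU memory is tight. One possible alternative, assuming PyTorch is the backend, is to drop the reference to the previous pipeline and clear the CUDA cache before loading the next model. `release_gpu_memory` is a hypothetical helper, and whether it frees enough memory depends on no other references to the model being alive.

```python
import gc


def release_gpu_memory() -> bool:
    """Run garbage collection and clear PyTorch's CUDA cache if a GPU is available.

    Returns True when the CUDA cache was cleared, False otherwise.
    """
    gc.collect()  # collect unreachable Python objects first so their tensors can be freed
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, nothing else to do
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # release cached, unused GPU memory blocks
    return True
```

In the loop above this would be used as `del qa_model` followed by `release_gpu_memory()` at the end of each iteration.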


# load and test finetuned models

(we load these models after running the finetune script)

### Model based on Meta-Llama-3.1-8B-Instruct-bnb-4bit

In [None]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "../runs/medqa_lora",
    device_map="cuda:0",
    max_seq_length = 2048, # Choose any! We auto support RoPE Scaling internally!
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True
    )

FastLanguageModel.for_inference(model)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "What is the flu?"},
]
res=pipe(messages)
res

==((====))==  Unsloth 2025.7.1: Fast Llama patching. Transformers: 4.53.1.
   \\   /|    NVIDIA GeForce RTX 3060 Ti. Num GPUs = 1. Max memory: 8.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2025.7.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


#### run validations to compare against base models

In [26]:
qa_model = pipeline("text-generation", model=model, tokenizer=tokenizer)
model_name = "medqa_lora"
evaluate_qa_models(
    df = val_df.head(10),
    qa_model=qa_model,
    experiment_name="initial comparision",
    model_name=model_name
    )

Device set to use cuda:0


Processing question: What is (are) Hypoglycemia ?
pred: [{'generated_text': [{'role': 'user', 'content': 'What is (are) Hypoglycemia ?'}, {'role': 'assistant', 'content': "Hypoglycemia is abnormally low blood sugar. Hypoglycemia is a serious health problem that can happen to anyone. You have a higher risk of developing it if you have diabetes and take too much insulin or other diabetes medicines. Hypoglycemia can happen when your body doesn't have enough glucose (sugar) in your blood. Glucose is important for your body to work properly. It is the main source of energy for your body. If your blood glucose drops below a certain level, your body starts to break down stored fat for energy. This process is called ketosis. Ketosis can cause your body to produce ketones. Ketones are substances that are produced when your body breaks down fat for energy. You can have ketones in your urine, breath, or blood. You can have ketones in your urine, breath, or blood. If you have hypoglycemia, you may

### Model based on Phi-3.5-mini-instruct

In [None]:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "../runs/medqa_lora_phi",
    device_map="cuda:0",
    max_seq_length = 2048, # Choose any! We auto support RoPE Scaling internally!
    dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True
    )

FastLanguageModel.for_inference(model)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "What is the flu?"},
]
res=pipe(messages)
res

==((====))==  Unsloth 2025.7.1: Fast Llama patching. Transformers: 4.53.1.
   \\   /|    NVIDIA GeForce RTX 3060 Ti. Num GPUs = 1. Max memory: 8.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Device set to use cuda:0


[{'generated_text': [{'role': 'user', 'content': 'What is the flu?'},
   {'role': 'assistant',
    'content': ' Influenza or "the flu" is a highly contagious respiratory illness caused by influenza viruses and is responsible for many serious illnesses and deaths among children and adults. There are 3 types of influenza viruses: A, B, and C. Influenza A and B are the two types that cause most human illnesses. Influenza viruses can be divided into 2 categories: seasonal or epidemic and pandemic. If a pandemic occurs, there may be serious illness and deaths in the United States and worldwide. Flu viruses can change quickly and often. They may change in ways that allow them to spread more easily among people. These changes can also make the viruses less sensitive to antiviral drugs. They can also make it more difficult to make vaccines that protect against the viruses. Flu viruses can change in ways that make it hard to keep up with them. For example, there are seasonal flu viruses and pan

#### run validations to compare against base models


In [None]:
qa_model = pipeline("text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    max_new_tokens=256,)
model_name = "medqa_lora_phi"
evaluate_qa_models(
    df = val_df.head(10),
    qa_model=qa_model,
    experiment_name="initial comparision",
    model_name=model_name
    )

Device set to use cuda:0


Processing question: What is (are) Hypoglycemia ?
pred: [{'generated_text': [{'role': 'user', 'content': 'What is (are) Hypoglycemia ?'}, {'role': 'assistant', 'content': ' Hypoglycemia is a condition in which blood sugar levels are too low. It usually occurs when someone does not eat enough food, or takes too much insulin or another diabetes medicine. If you have diabetes, you may take insulin or other medicines to lower your blood sugar. But if you take too much insulin or another medicine, your blood sugar can get too low. Hypoglycemia is a medical emergency. Do not treat it with home remedies. If you have diabetes, call 911 if you have signs and symptoms of hyp'}]}]
Processing question: What are the symptoms of Dysautonomia like disorder ?
pred: [{'generated_text': [{'role': 'user', 'content': 'What are the symptoms of Dysautonomia like disorder ?'}, {'role': 'assistant', 'content': ' The symptoms of Dysautonomia like disorder vary widely. Signs and symptoms include blurred vision,

## Final Comparison and conclusions:

On an initial run the base Llama model outperformed Phi, not only on the metrics (we showed earlier in this notebook that metrics alone can't always be trusted for generative tasks) but also when we inspected the generated responses.

| Llama Base                        | 
|------------------------------------|
| <img src="./imgs/llama_base_preds.png" />   | 
| Phi Base                        | 
|<img src="./imgs/phi_base_preds.png" />   |


As we can see here, the base Phi model does not even produce answers that make sense for the questions, e.g.:

Q: 
> What is (are) Animal Bites ?

A: 
>1. For Irish dance,
>Against-Animal forox The are: Receplacement

>:toat_toGTA

>using the two knowledge_degreires_asami

>The same conceptus notifying...


The good news is that after finetuning the models on our `intern_screening_dataset.csv` dataset, both models were able to improve, with an honorable mention for the Phi-based model, which after finetuning is now able to produce answers that make sense for a medical question.

| Llama Finetuned                            |
|------------------------------------|
|   <img src="./imgs/llama_finetuned_preds.png" />  |
| Phi Finetuned                        | 
| <img src="./imgs/phi_finetuned_preds.png" />    |

#### Score comparison

In the following image we can see that both models also improved on the validation metrics, again with a notable improvement for our Phi model. (The Phi model also seems to be the fastest, at least during these tests; we will confirm this later with a proper load test.)
<img src="./imgs/comparison_stats.png" />
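
To quantify the improvement rather than read it off the chart, the metric dictionaries of a base run and a finetuned run could be compared directly. The sketch below is one way to do that; `metric_deltas` is a hypothetical helper, and it assumes both inputs are flat dictionaries of numeric scores such as those returned by `compute_evaluations`.

```python
def metric_deltas(base: dict, finetuned: dict) -> dict:
    """Absolute and relative change for every metric present in both runs."""
    deltas = {}
    for name in sorted(base.keys() & finetuned.keys()):
        before, after = float(base[name]), float(finetuned[name])
        absolute = after - before
        # the relative change is undefined when the base score is zero
        relative = absolute / before if before != 0 else None
        deltas[name] = {"absolute": absolute, "relative": relative}
    return deltas
```

Reporting both values is a deliberate choice: a base score close to zero, as can happen with ROUGE on nonsensical output, makes the relative change very large or undefined, so the absolute change is the safer number to compare across models.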
