# Research candidate LLMs for RAG

Based on the findings of our EDA, we decided to fine-tune an LLM for question answering. In this notebook we explore candidate models, limited by the GPU currently available for development (an RTX 3060 Ti), and compare possible configurations (quantization, PEFT training).
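
As a rough illustration of why quantization matters on this card, the weight footprint of a model is approximately its parameter count times the bits per weight. The sketch below is back-of-the-envelope arithmetic only: the helper name `weight_memory_gb` is made up for this example, the 8-billion and 3.8-billion parameter counts are round illustrative figures, and it ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


VRAM_GB = 8  # memory of an RTX 3060 Ti

# (label, parameter count, bits per weight): illustrative round numbers
for label, params, bits in [
    ("8B params, 16-bit", 8e9, 16),
    ("8B params, 4-bit", 8e9, 4),
    ("3.8B params, 16-bit", 3.8e9, 16),
]:
    size = weight_memory_gb(params, bits)
    print(f"{label}: ~{size:.1f} GB of weights, below {VRAM_GB} GB: {size < VRAM_GB}")
```

Even when the weights alone come in under the VRAM limit, the remaining headroom has to cover generation and, later, training state, which is where PEFT methods help.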

In [1]:
from transformers import pipeline
import pandas as pd
import evaluate
from bert_score import score as bertscore
import numpy as np
from typing import Callable
import mlflow



# Define evaluation metrics

In [None]:
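def compute_evaluations(preds, targets):
    """Compute ROUGE and mean BERTScore (precision/recall/F1) for predictions vs. references."""
    rouge = evaluate.load("rouge")
    rouge_scores = rouge.compute(predictions=preds, references=targets, use_stemmer=True)

    # Named bertscore_metric so it does not reuse the name of the bert_score import.
    bertscore_metric = evaluate.load("bertscore")
    bert_scores = bertscore_metric.compute(
        predictions=preds, references=targets, lang="en", model_type="distilbert-base-uncased"
    )

    # Look results up by key instead of relying on the order of the returned dict.
    bertscore_avg = {
        "bertscore_precision": np.mean(bert_scores["precision"]).item(),
        "bertscore_recall": np.mean(bert_scores["recall"]).item(),
        "bertscore_f1": np.mean(bert_scores["f1"]).item(),
    }

    return {**rouge_scores, **bertscore_avg}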

In [3]:
m = compute_evaluations(["if you have the flu you need to rest"], ["rest is the best for the flu"])

In [4]:
import torch
torch.cuda.is_available()

True

In [5]:
compute_evaluations(["if you have the flu you need to rest"], ["I like pizza"])

{'rouge1': np.float64(0.0),
 'rouge2': np.float64(0.0),
 'rougeL': np.float64(0.0),
 'rougeLsum': np.float64(0.0),
 'bertscore_precision': 0.6455556750297546,
 'bertscore_recall': 0.6872128844261169,
 'bertscore_f1': 0.665733277797699}

In [None]:

bertscore = evaluate.load("bertscore")
predictions = ["hello there", "general kenobi", "if you have the flu you need to rest"]
references = ["hello there", "general kenobi", "I like pizza but I dont like mangos"]
P, R, F1, _  = bertscore.compute(predictions=predictions, references=references, lang="en", idf=True, model_type="distilbert-base-uncased").values()
P, R, F1, _

([1.0, 1.0, 0.6719011664390564],
 [1.0, 1.0, 0.6600363254547119],
 [1.0, 1.0, 0.665915846824646],
 'distilbert-base-uncased_L5_idf_version=0.3.12(hug_trans=4.53.1)')

#### Note:
As the last example shows, we cannot blindly trust BERTScore, especially on short inputs: two unrelated sentences still score about 0.67. It should still be a useful metric for comparing semantic similarity.
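
To make the contrast with ROUGE concrete, here is a simplified unigram-overlap F1, which is the idea behind ROUGE-1. It is only a sketch: `unigram_f1` is a name invented for this example, and it uses plain lowercased whitespace tokens with no stemming, so its values will not match the `evaluate` ROUGE implementation exactly. It shows why the unrelated flu/pizza pair scores zero on lexical overlap while BERTScore still reports a fairly high similarity.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens (simplified ROUGE-1, no stemming)."""
    pred_tokens = Counter(prediction.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Multiset intersection: each shared token counts min(pred count, ref count) times.
    overlap = sum((pred_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("hello there", "hello there"))  # 1.0
print(unigram_f1("if you have the flu you need to rest", "I like pizza"))  # 0.0
print(unigram_f1("if you have the flu you need to rest", "rest is the best for the flu"))
```

Reading the two metrics side by side is therefore more informative than either alone: ROUGE catches the absence of shared wording, BERTScore catches paraphrases.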

# Test some models

In [6]:
val_df = pd.read_parquet("../data/cleaned/validation_dataset.parquet")
val_df.head(5)

| Unnamed: 0 | question | answer | answer_words | valid_question | valid_answer |
|---|---|---|---|---|---|
| 719 | What causes Dry Eye ? | Most people with dry eye will not have serious... | 38 | True | True |
| 1574 | What is (are) Canker Sores ? | Canker sores are small, round sores in your mo... | 127 | True | True |
| 1704 | What is (are) Animal Bites ? | Wild animals usually avoid people. They might ... | 144 | True | True |
| 1514 | What is (are) Diphtheria ? | Diphtheria is a serious bacterial infection. Y... | 162 | True | True |
| 2257 | What is (are) Respiratory Syncytial Virus Infe... | Respiratory syncytial virus (RSV) causes mild,... | 160 | True | True |


In [None]:
candidate_models = [
    # ("text2text-generation", "google/flan-t5-base"),
    ("text-generation", "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"),
    ("text-generation", "unsloth/Phi-3.5-mini-instruct"),
]

for task, model_id in candidate_models:
    qa_model = pipeline(task, model=model_id)
    if task == "text-generation":
        # Chat models take a list of role/content messages.
        prompt = [{"role": "user", "content": "What is the flu?"}]
    else:
        # text2text models take a plain string.
        prompt = "What is the flu?"
    response = qa_model(prompt)[0]["generated_text"]
    print(f"model: {model_id} -> response: {response}")

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Device set to use cuda:0


model: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit -> response: [{'role': 'user', 'content': 'What is the flu?'}, {'role': 'assistant', 'content': "The flu, also known as influenza, is a highly contagious respiratory illness caused by the influenza virus. It affects the lungs, nose, throat, and other parts of the respiratory system.\n\nThe flu is characterized by a combination of symptoms, which can vary in severity and may include:\n\n1. **Fever**: High temperatures, usually above 102°F (39°C).\n2. **Chills**: Feeling cold, even if your body temperature is normal.\n3. **Cough**: Dry, hacking cough or a productive cough that brings up mucus.\n4. **Sore throat**: Pain or discomfort in the throat.\n5. **Runny or stuffy nose**: Nasal congestion or discharge.\n6. **Headache**: Pain or discomfort in the head.\n7. **Fatigue**: Feeling extremely tired or exhausted.\n8. **Muscle or body aches**: Pain or discomfort in the muscles, back, or other parts of the body.\n9. **Diarrhea and vomiting**:

Fetching 2 files: 100%|██████████| 2/2 [16:09<00:00, 484.77s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.93it/s]
Device set to use cuda:0


model: unsloth/Phi-3.5-mini-instruct -> response: [{'role': 'user', 'content': 'What is the flu?'}, {'role': 'assistant', 'content': ' To feuding and "explainable means is the cure for a " being, not a single word in linguism tritium-sladkasy in the ccur offlu nctua.\n\n\nRound Leaders and poise \'by\'exp among the top 30 and in fluvent withstandx, c \n taquiza un language\'sireleation\'srance\n\nly. "-aditor\n {like-for} endowa-seems-ad*\n   In ta_{to} other can, ta e}\n\n'}]
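
As the printed output shows, with chat-style input the `generated_text` field holds the whole conversation (a list of role/content messages), not just the reply. A minimal sketch of pulling out only the assistant's answer follows. It uses a hand-written sample in the same shape as the output above rather than a real model call, and `last_assistant_reply` is a hypothetical helper name.

```python
def last_assistant_reply(pipeline_output: list) -> str:
    """Return the content of the final message of the first generated conversation."""
    conversation = pipeline_output[0]["generated_text"]
    last_message = conversation[-1]
    if last_message["role"] != "assistant":
        raise ValueError("the conversation does not end with an assistant message")
    return last_message["content"]


# Hand-written sample mirroring the structure printed above (not real model output).
sample_output = [
    {
        "generated_text": [
            {"role": "user", "content": "What is the flu?"},
            {"role": "assistant", "content": "The flu is a contagious respiratory illness."},
        ]
    }
]

print(last_assistant_reply(sample_output))
```

Only this extracted string should be compared against the reference answer; scoring the full conversation would count the question itself as part of the prediction.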


In [None]:
import os
from typing import Optional

def evaluate_qa_models(
    df: pd.DataFrame,
    qa_model: Callable[[list], list],
    experiment_name: str,
    model_name: str,
    mlflow_uri: Optional[str] = None,
):
    """
    Runs a chat-style QA pipeline over df and logs ROUGE and BERTScore metrics to MLflow.

    qa_model takes a list of chat messages and returns the pipeline output, whose
    "generated_text" holds the conversation ending with the assistant's reply.
    """
    if mlflow_uri:
        mlflow.set_tracking_uri(mlflow_uri)
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("num_examples", len(df))

        preds = []
        truths = []
        for _, row in df.iterrows():
            # Each question is sent as a single-turn chat.
            messages = [{"role": "user", "content": row["question"]}]
            pred = qa_model(messages)
            # Keep only the assistant's reply (the last message of the conversation).
            preds.append(pred[0]["generated_text"][-1]["content"])
            truths.append(row["answer"])

        metrics = compute_evaluations(preds, truths)
        mlflow.log_metrics(metrics)

        # Log predictions next to the ground-truth answers as an artifact for inspection.
        out_df = df.copy()
        out_df["predicted"] = preds

        os.makedirs("mlflow_artifacts", exist_ok=True)
        csv_path = f"mlflow_artifacts/{model_name}_predictions.csv"
        out_df.to_csv(csv_path, index=False)
        mlflow.log_artifact(csv_path, artifact_path="predictions")
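
Because the evaluation loop depends only on how `qa_model` is called and what it returns, it can be smoke-tested without a GPU by passing a stand-in. The sketch below is one possible way to do that and is not part of the notebook's pipeline: `fake_qa_model` and `collect_predictions` are hypothetical names, and the loop mirrors only the prediction-gathering part of `evaluate_qa_models` (no MLflow, no metrics).

```python
from typing import Callable


def fake_qa_model(messages: list) -> list:
    """Stand-in that mimics the chat pipeline's return shape with a canned answer."""
    reply = {"role": "assistant", "content": f"stub answer to: {messages[-1]['content']}"}
    return [{"generated_text": messages + [reply]}]


def collect_predictions(questions: list, qa_model: Callable[[list], list]) -> list:
    """Ask each question as a single-turn chat and keep only the assistant reply."""
    preds = []
    for question in questions:
        output = qa_model([{"role": "user", "content": question}])
        preds.append(output[0]["generated_text"][-1]["content"])
    return preds


print(collect_predictions(["What causes Dry Eye ?"], fake_qa_model))
```

A stub like this checks the indexing into the pipeline output before spending time on model downloads and generation.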


In [None]:
candidate_models = [
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
]

for model_id in candidate_models:
    qa_model = pipeline("text-generation", model=model_id)
    # Make the model id safe to use in run names and file paths.
    model_name = model_id.replace("/", "_").replace(".", "_")
    evaluate_qa_models(
        df=val_df.head(10),
        qa_model=qa_model,
        experiment_name="initial comparison",
        model_name=model_name,
    )