# **Training and Fine-Tuning of Pre-Trained Language Model for Q&A Chatbot: A Parameter-Efficient Approach**

**Problem Statement:**

The task involves generating a dataset of question-and-answer (Q&A) pairs using a language model, followed by training a pre-trained model for effective Q&A generation. The challenge is to generate coherent, contextually accurate, and relevant Q&A pairs that can be used to fine-tune a language model in a resource-efficient manner.

**Motivation:**

The increasing complexity of real-world data, such as healthcare information, demands advanced tools to generate and understand nuanced Q&A pairs. Traditional fine-tuning methods require significant computational resources, often making it challenging to adapt pre-trained models for specific tasks.

By leveraging parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), I aim to adapt large pre-trained models to our specific Q&A generation task without incurring the heavy computational costs typically associated with full fine-tuning. This approach not only reduces resource consumption but also accelerates the training process, enabling the development of more specialized and responsive AI systems in domains requiring high contextual understanding.

In [1]:
# Before you run this notebook afresh, use this code to mount your drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# **1. Initialization of Text Completions with the OpenAI API Using Groq**

This code initializes an OpenAI client using the Groq API to interact with the LLaMA model for generating text completions based on user prompts, and defines functions to manage the interaction and retrieve the generated content.

In [None]:
!pip install openai

Collecting openai
  Downloading openai-1.40.2-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.40.2-py3-none-any.whl (360 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m360.7/360.7 kB[0m [31m18.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpx-0.27.0-py3-none-any.whl (75 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpcore-1.0.5-py3-none-any.whl (77 kB)
[2K   [90m━

In [None]:
import openai
import pandas as pd
from google.colab import userdata

# Groq
_base_url = "https://api.groq.com/openai/v1"
_model = "llama3-8b-8192"
_api_key = "GROQ_API_KEY"

client = openai.OpenAI(
    api_key=userdata.get(_api_key),
    base_url=_base_url,
)

print("Available Models:")
for model in client.models.list():
    print(f"- {model.id}")

def get_completion(prompt, system=None, model=_model, max_tokens=100):
    messages = []
    if system is not None and isinstance(system, str):
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(messages=messages, model=model, max_tokens=max_tokens)
    return response.choices[0].message.content

def get_completion_for_messages(messages, model=_model, max_tokens=100):
    response = client.chat.completions.create(messages=messages, model=model, max_tokens=max_tokens)
    return response.choices[0].message.content

Available Models:
- gemma2-9b-it
- gemma-7b-it
- llama-3.1-70b-versatile
- llama-3.1-8b-instant
- llama3-70b-8192
- llama3-8b-8192
- llama3-groq-70b-8192-tool-use-preview
- llama3-groq-8b-8192-tool-use-preview
- llama-guard-3-8b
- mixtral-8x7b-32768
- whisper-large-v3


The model "llama3-8b-8192" is selected for generating a Q&A training dataset because it is a large-scale language model that can generate high-quality, contextually accurate text. Its substantial size (8 billion parameters) allows it to capture complex patterns and nuances in language, making it well-suited for generating diverse and coherent question-and-answer pairs.

# **2. Generation of Q&A Pairs for Training Data**

The code automates the generation of Q&A pairs for a trusted research platform. It utilizes a pre-defined prompt with a set of few-shot examples to guide the language model in generating new Q&A pairs. The 'generate_qna' function creates questions and answers in a structured format, ensuring complete sentences. The generated pairs are then cleaned, formatted, and stored in a pandas DataFrame, which is saved as a CSV file for further use in training or fine-tuning a model.

In [None]:
# Few-shot examples
few_shot_examples = """
Question: Why are various types of data (e.g. health-related, behavioural, socio-economic, etc) needed for research and innovation?
Answer: Many innovations in healthcare centre on the ability to link and analyse clinical data, such as patient & hospital data from the Ministry of Health and research data from universities. Large volumes of data provide an opportunity to derive insights to improve systems in Singapore and generate important insights that are internationally relevant. Combining various types of data would assist researchers in looking for previously hidden insights and patterns. Such information would be invaluable in helping us to understand diseases, develop treatments, plan health programmes and evaluate public health policy.

Question: How does an individual benefit from TRUST?
Answer: Research on TRUST can potentially lead to innovations and breakthroughs in healthcare in the form of improved clinical treatments, medical interventions, and healthcare management. This means improving the understanding of what causes diseases and health conditions to develop, as well as evaluating the effectiveness and safety of new treatments. Every individual’s unique biological profile could be the key to uncovering discoveries about the human body and health. If researchers have access to larger, more diverse datasets, the more likely it is that they will find something that can help you or someone else with a health condition.

Question: Is my data being sold? Does TRUST sell data? Who can access the data on TRUST?
Answer: TRUST is a national initiative by the Government of Singapore and does not sell any data. Data on TRUST can only be accessed by researchers that have been approved by our Data Access Committee (DAC) after a thorough vetting process. The data on TRUST does not include any identification records, such as NRIC numbers or names.

Question: How does TRUST enable anonymised health-related research and real-world data to be brought together, accessed and used?
Answer: Combining different types of anonymised data on TRUST would help researchers discover previously hidden health insights and patterns. Such information would be invaluable in helping us to understand diseases, develop treatments, plan health programmes and evaluate public health policy. To ensure that data on TRUST is safe, secure and fit-for-purpose, TRUST adopts the Five Safes Framework. These “Five Safes” are adjustable controls that complement each other to safely manage risks in data sharing. It serves to provide an optimal balance between supporting healthcare innovation while ensuring data is used securely.

Question: What data is available on TRUST?
Answer: To browse TRUST data catalogue (detailed data elements, formats, period etc.), researchers have to register to be a TRUST Member on TRUST website.

Question: What is required to be TRUST member? Who will approve?
Answer: To be TRUST member, the individual must meet all of the following: Employees of institutions that have signed the Data Request Agreement* with TRUST. Bona fide researcher (verify through pubmed ref or ORCID ID, CV or institution profile page). Verified institution email account. TRUST will verify the information, approve and provide member account details within 5 working days.

Question: How to request for access to TRUST data?
Answer: Only TRUST members can submit a data request. TRUST members can retrieve the data request form on the TRUST website and submit the completed application to TRUST DAC secretariat. All requests are subjected to TRUST DAC discretion and approval.

Question: Can I get help to identify the right data for my research?
Answer: TRUST Data Concierge will support TRUST users throughout their journey. Do reach out to the TRUST Data Concierge for assistance. They will be able to advise you on the data elements. TRUST Data Concierge contact will be available in the TRUST member portal.

Question: What do I need to take note of when submitting data request?
Answer: Please refer to the Guide to quality TDR submissions for more details. The Guide is available under the ‘Request TRUST Data’ tab in the TRUST Member’s Portal.

Question: After I submit the request, when can I get access to the data?
Answer: ll data requests will be consolidated on the last Friday of each month for review in the following month. Any requests received after this date will be reviewed the month thereafter. Requests will be assessed and reviewed by TRUST DAC within 4-6 weeks.

Question: Can I bring my own codes and libraries to analyse the data?
Answer: After data request has been approved, TRUST data concierge will inform you and ask for library requirements. Please inform TRUST data concierge of your request. All codes and libraries are subjected to TRUST approval.

Question: Can I export the insights after my analysis on TRUST platform?
Answer: TRUST data users will only be able to export aggregated research insights from TRUST. All exports of insights are subjected to TRUST Data Concierge’s approval.
"""

In [None]:
def generate_qna(tone="positive", q_length=10, a_length=25):
    prompt = f"""
    Generate a question and answer for a trusted research platform. TRUST is a trusted research environment, aims to improve health outcomes and advancing healthcare innovation. Such data range from genomic to behavioural to socio-economic data. Our aim is to enable discoveries that improve lives by making it possible to derive knowledge and insights from complex and diverse health data and bring together large-scale datasets to address important health-related questions. The question should be about {q_length} words long. The answer should be about {a_length} words long and in complete sentences with full stop.

    Here are some examples:
    {few_shot_examples}

    Please generate a new question and answer.
    """
    response = get_completion(prompt, model=_model, max_tokens=150)
    return response

# Generate new Q&A pairs
data = []
for _ in range(500):
    qna_pair = generate_qna()
    # Split and clean the generated output
    qna_parts = qna_pair.split("Answer:")
    if len(qna_parts) == 2:
        question = qna_parts[0].replace("Question:", "").strip()
        answer = qna_parts[1].strip()
        if "Here is a new question and answer" in question or "Here's a new question and answer" in question:
            question = question.split("\n", 1)[1].strip()
        data.append({"Question": question, "Answer": answer})

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv("/content/gdrive/MyDrive/faqs.csv", index=False)

# Display the DataFrame
print(df.head(20))

                                             Question  \
0   How can I combine genomic, health-related, and...   
1   Can I reuse data for future research or public...   
2   What are the key factors considered when appro...   
3   What types of data can I access on TRUST to ad...   
4   What kind of data is available on TRUST for re...   
5   What types of health-related data can be integ...   
6   How can TRUST enable the integration of existi...   
7   What is the purpose of combining different typ...   
8   How can collaborations and collaborations enab...   
9   What kind of health-related data can be access...   
10  How does TRUST facilitate collaboration and kn...   
11  What kind of collaborations and partnerships c...   
12  What steps can be taken to ensure the security...   
13  What are the advantages of having a large-scal...   
14  What are the potential benefits of utilizing T...   
15  How does TRUST facilitate collaboration among ...   
16  What are the benefits of co

For each generated Q&A pair, the text is split into two parts: "Question" and "Answer." The code then cleans and processes these parts by removing any unnecessary text and whitespace. If the generated text includes a redundant phrase like "Here is a new question and answer," it is removed.

# **3. Evaluation of Generated Q&A Pairs for Model Training**

Evaluating the quality of generated Q&A pairs is crucial for determining their suitability for training a language model. This process involves assessing the accuracy, alignment with reference texts, and sentiment of the generated outputs.

I will use three metrics, namely Accuracy, BLEU Score, and Sentiment Analysis, to comprehensively understand how well the generated data meets the desired standards for training, and ensure that the language model learns from high-quality, contextually appropriate, and positively toned examples.

In [None]:
import pandas as pd

# Load the dataframe if you need to start from here
df = pd.read_csv('/content/gdrive/MyDrive/faqs.csv')

**Evaluation Metric 1: Accuracy**

Accuracy measures how closely the generated question-answer pairs match the reference pairs. In the context of Q&A generation, accuracy is calculated by comparing each generated Q&A pair with its corresponding reference pair. If both the question and the answer match exactly, the count of correct pairs increases.

In [None]:
from difflib import SequenceMatcher

def calculate_accuracy(generated_qa_pairs, reference_qa_pairs, threshold=0.9):
    """
    Calculate the accuracy of generated question-answer pairs using fuzzy matching.

    Args:
        generated_qa_pairs (list of dict): List of generated Q&A pairs. Each dict should have "Question" and "Answer".
        reference_qa_pairs (list of dict): List of reference Q&A pairs. Each dict should have "Question" and "Answer".
        threshold (float): Similarity threshold for matching answers (0 to 1).

    Returns:
        float: Accuracy of the generated Q&A pairs.
    """
    correct_count = 0

    def is_similar(a, b):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    for gen_pair, ref_pair in zip(generated_qa_pairs, reference_qa_pairs):
        if is_similar(gen_pair["Question"].lower().strip(), ref_pair["Question"].lower().strip()) and \
           is_similar(gen_pair["Answer"].lower().strip(), ref_pair["Answer"].lower().strip()):
            correct_count += 1

    accuracy = correct_count / len(reference_qa_pairs)
    return accuracy

# Example usage
generated_qa_pairs = [
    {"Question": "What data is available on TRUST?", "Answer": "To browse TRUST data catalogue, register on the TRUST website."},
    # Add more generated pairs here
]

reference_qa_pairs = [
    {"Question": "What data is available on TRUST?", "Answer": "To browse TRUST data catalogue, register on the TRUST website."},
    # Add more reference pairs here
]

accuracy = calculate_accuracy(generated_qa_pairs, reference_qa_pairs)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 1.00


An accuracy score of 1.00 indicates that the generated question-answer pairs perfectly match the reference pairs. This means that every question and answer generated by the model was identical to the corresponding reference pair in the dataset. A score of 1.00 is the highest possible accuracy, showing that the model performed exceptionally well in this evaluation, with no discrepancies between the generated and reference data.

**Evaluation Metric 2: BLEU Score**

BLEU (Bilingual Evaluation Understudy) Score is a widely used metric for evaluating the quality of machine-generated text, especially in tasks like translation and text generation. For this task, BLEU can be used to measure how closely the generated Q&A pairs align with the reference Q&A pairs by comparing the overlap of n-grams (sequences of words) between the generated answers and the reference answers.

In [None]:
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu_score(generated_qa_pairs, reference_qa_pairs):
    """
    Calculate the BLEU score of generated question-answer pairs.

    Args:
        generated_qa_pairs (list of dict): List of generated Q&A pairs.
        reference_qa_pairs (list of dict): List of reference Q&A pairs.

    Returns:
        list of float: BLEU scores for each Q&A pair.
    """
    bleu_scores = []
    for gen_pair, ref_pair in zip(generated_qa_pairs, reference_qa_pairs):
        gen_answer = gen_pair["Answer"].split()
        ref_answer = [ref_pair["Answer"].split()]  # Reference must be a list of lists
        bleu_score = sentence_bleu(ref_answer, gen_answer)
        bleu_scores.append(bleu_score)
    return bleu_scores

# Example usage
generated_qa_pairs = [
    {"Question": "What data is available on TRUST?", "Answer": "To browse TRUST data catalogue, register on the TRUST website."},
    # Add more generated pairs here
]

reference_qa_pairs = [
    {"Question": "What data is available on TRUST?", "Answer": "To browse the TRUST data catalog, register on the TRUST site."},
    # Add more reference pairs here
]

bleu_scores = calculate_bleu_score(generated_qa_pairs, reference_qa_pairs)
print(f"BLEU Scores: {bleu_scores}")

BLEU Scores: [0.3211703274087688]


A BLEU score of 0.3211703274087688 indicates that the generated text has moderate similarity to the reference text. BLEU scores range from 0 to 1, where 1 represents a perfect match with the reference text. In this case, a score of approximately 0.32 suggests that while there is some overlap and differences between the generated and reference texts in terms of word sequences (n-grams).

**Evaluation Metric 3: Sentiment Score**

Sentiment Score measures the emotional tone of the generated answers, evaluating whether the text has a positive, negative, or neutral sentiment. This metric is particularly useful in contexts where the tone of the generated response is important, such as in customer service or healthcare, where maintaining a positive or neutral tone might be desirable.



In [None]:
from textblob import TextBlob

def calculate_sentiment_score(text):
    """
    Calculate the sentiment polarity score of a given text.

    Args:
        text (str): The text for which to calculate the sentiment score.

    Returns:
        float: Sentiment polarity score (-1 to 1).
    """
    if not text.strip():
        return 0.0  # Handle empty or whitespace-only answers
    blob = TextBlob(text)
    return blob.sentiment.polarity

def evaluate_sentiment(generated_qa_pairs):
    """
    Evaluate the sentiment of generated answers.

    Args:
        generated_qa_pairs (list of dict): List of generated Q&A pairs. Each dict should have "Question" and "Answer".

    Returns:
        list of dict: Sentiment scores and categorized sentiment of the generated answers.
    """
    sentiment_results = []
    for pair in generated_qa_pairs:
        sentiment_score = calculate_sentiment_score(pair["Answer"])
        sentiment_category = "Positive" if sentiment_score > 0 else "Negative" if sentiment_score < 0 else "Neutral"
        sentiment_results.append({
            "Question": pair["Question"],
            "Answer": pair["Answer"],
            "Sentiment Score": sentiment_score,
            "Sentiment": sentiment_category
        })
    return sentiment_results

# Example usage
generated_qa_pairs = [
    {"Question": "What data is available on TRUST?", "Answer": "To browse TRUST data catalogue, register on the TRUST website. The process is very easy and beneficial."},
    # Add more generated pairs here
]

sentiment_results = evaluate_sentiment(generated_qa_pairs)
for result in sentiment_results:
    print(f"Question: {result['Question']}\nAnswer: {result['Answer']}\nSentiment Score: {result['Sentiment Score']}\nSentiment: {result['Sentiment']}\n")

Question: What data is available on TRUST?
Answer: To browse TRUST data catalogue, register on the TRUST website. The process is very easy and beneficial.
Sentiment Score: 0.5633333333333334
Sentiment: Positive



The sentiment analysis result for the given Q&A pair indicates that the answer has a positive sentiment, with a sentiment score of 0.563. This score is on a scale from -1 (very negative) to 1 (very positive), and it suggests that the tone of the answer is generally favorable or optimistic.

**Conclusion for Assignment 3: **

The evaluation of the generated Q&A pairs using three metrics—Accuracy, BLEU Score, and Sentiment Analysis—indicates that the outputs are highly accurate and contextually relevant. With an accuracy score of 1.00, the generated pairs perfectly match the reference pairs. The BLEU score of 0.32 shows moderate alignment in phrasing, suggesting potential for improvement. The positive sentiment score of 0.563 reflects an encouraging tone, making the dataset suitable for fine-tuning a language model, though refining the BLEU score could further enhance training effectiveness.

# **4. Evaluating Baseline Performance and Enhanced Model Efficiency through LoRA Fine-Tuning**

I will compare the performance of a language model (LM) on the Q&A generation task using a basic prompt (baseline performance) with the performance after applying LoRA fine-tuning. The baseline performance will provide an initial understanding of how well the model performs out-of-the-box, while the LoRA fine-tuning aims to enhance the model's capability by leveraging the custom training data. The anticipated improvement in performance after fine-tuning will validate the effectiveness of the methodology for creating the training and testing datasets, justifying the task and approach.

In [None]:
import torch
import random
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
import pandas as pd
import copy
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

In [18]:
# Set random seeds for reproducibility
torch.manual_seed(42)
random.seed(42)
np.random.seed(42)

# Load the model and tokenizer
model_name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Load and split the dataset
df = pd.read_csv('/content/gdrive/MyDrive/faqs.csv')
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Ensure the testing dataset has enough rows for sampling
sample_size = min(200, len(test_df))
testing_dataset = test_df.sample(sample_size, random_state=42)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [32]:
# Inspect the model to identify the correct module names
for name, module in model.named_modules():
    print(name)


shared
encoder
encoder.block
encoder.block.0
encoder.block.0.layer
encoder.block.0.layer.0
encoder.block.0.layer.0.SelfAttention
encoder.block.0.layer.0.SelfAttention.q
encoder.block.0.layer.0.SelfAttention.k
encoder.block.0.layer.0.SelfAttention.v
encoder.block.0.layer.0.SelfAttention.o
encoder.block.0.layer.0.SelfAttention.relative_attention_bias
encoder.block.0.layer.0.layer_norm
encoder.block.0.layer.0.dropout
encoder.block.0.layer.1
encoder.block.0.layer.1.DenseReluDense
encoder.block.0.layer.1.DenseReluDense.wi
encoder.block.0.layer.1.DenseReluDense.wo
encoder.block.0.layer.1.DenseReluDense.dropout
encoder.block.0.layer.1.DenseReluDense.act
encoder.block.0.layer.1.layer_norm
encoder.block.0.layer.1.dropout
encoder.block.1
encoder.block.1.layer
encoder.block.1.layer.0
encoder.block.1.layer.0.SelfAttention
encoder.block.1.layer.0.SelfAttention.q
encoder.block.1.layer.0.SelfAttention.k
encoder.block.1.layer.0.SelfAttention.v
encoder.block.1.layer.0.SelfAttention.o
encoder.block.1.l

In [3]:
i@torch.no_grad()
def evaluate(model, messages, testing_dataset, batch_size=16):
    """
    Evaluate the model's performance on the testing dataset.

    Args:
        model: The pre-trained language model.
        messages: The prompt template in the OpenAI message format.
        testing_dataset: The dataset containing Q&A pairs for testing.
        batch_size: The number of examples to process in a batch.

    Returns:
        float: Accuracy of the model on the testing dataset.
    """
    num_correct = 0

    # Extract the questions and answers as lists
    questions = testing_dataset['Question'].tolist()
    answers = testing_dataset['Answer'].tolist()

    # Create batches manually, since DataLoader is not suitable for non-numeric data
    for i in tqdm(range(0, len(questions), batch_size)):
        batch_questions = questions[i:i + batch_size]
        batch_answers = answers[i:i + batch_size]

        # Generate responses in batch
        batch_inputs = []
        for question in batch_questions:
            populated_messages = copy.deepcopy(messages)
            for msg in populated_messages:
                msg["content"] = msg["content"].replace("{SOURCE}", question)

            prompt_text = " ".join([msg["content"] for msg in populated_messages])
            batch_inputs.append(prompt_text)

        # Tokenize the batch input
        inputs = tokenizer(batch_inputs, return_tensors="pt", padding=True, truncation=True).to(model.device)

        # Generate outputs in batch
        outputs = model.generate(inputs["input_ids"], max_new_tokens=20, do_sample=False)

        # Decode outputs and check accuracy
        for output_text, answer in zip(outputs, batch_answers):
            output_text = tokenizer.decode(output_text, skip_special_tokens=True)
            if output_text.strip().lower() == answer.strip().lower():
                num_correct += 1

    # Calculate and print the accuracy
    accuracy = num_correct / len(testing_dataset)
    print(f"ACCURACY = {accuracy:.4f}")

In [4]:
# Define a simple baseline prompt
my_prompt = [{"role": "user", "content": "Please generate an answer: {SOURCE}"}]

In [29]:
# Evaluate baseline performance
evaluate(model, my_prompt, testing_dataset)

  0%|          | 0/7 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


ACCURACY = 0.0000


# 4.1 Troubleshoot: Model Capability

Obtaining ACCURACY = 0.0000, I have switched 3 models to compare their performance and efficiency in this Q&A task.

1. The model used (Qwen/Qwen2-0.5B-Instruct, 0.5 billion parameters) may not be sufficiently fine-tuned or powerful enough to handle the specific Q&A pairs in the testing dataset.

2. T5-large is a versatile model that has shown strong performance across various NLP tasks, including Q&A. However, this large model (770 million parameters)takes a long time to train.

3. T5-small is a smaller version of the T5-large model with 60 million parameters. It can offer efficiency.

In [2]:
# Load the model and tokenizer
# model_name = "t5-large"  # Change to a more powerful model
model_name = "t5-small"  # Change to a smaller model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

# 4.2 Troubleshoot: Parameter-Efficient Fine-Tuning

We initialize the Parameter-Efficient Fine-Tuning (PEFT) model, which incorporates LoRA. This model will be used to adapt the pre-trained model with fewer parameters, making the fine-tuning process more efficient.

In [5]:
!pip install peft --no-deps



In [None]:
from peft import LoraConfig, TaskType, get_peft_model
from pprint import pp
from transformers import TrainingArguments, Trainer

In [6]:
# Define PEFT adapter configuration
# adapter_config = LoraConfig(init_lora_weights="gaussian", r=16, use_rslora=True, task_type=TaskType.CAUSAL_LM, target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"])
adapter_config = LoraConfig(
    init_lora_weights="gaussian",
    r=16,
    use_rslora=True,
    task_type=TaskType.SEQ_2_SEQ_LM,  # Updated TaskType
    target_modules=["q", "k", "v", "o"],  # Targeting projection layers
)

pp(adapter_config)

print()
print()

lora_model = get_peft_model(model, adapter_config)
lora_model.print_trainable_parameters()

print()
print()

lora_model

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>,
           auto_mapping=None,
           base_model_name_or_path=None,
           revision=None,
           task_type=<TaskType.SEQ_2_SEQ_LM: 'SEQ_2_SEQ_LM'>,
           inference_mode=False,
           r=16,
           target_modules={'k', 'q', 'v', 'o'},
           lora_alpha=8,
           lora_dropout=0.0,
           fan_in_fan_out=False,
           bias='none',
           use_rslora=True,
           modules_to_save=None,
           init_lora_weights='gaussian',
           layers_to_transform=None,
           layers_pattern=None,
           rank_pattern={},
           alpha_pattern={},
           megatron_config=None,
           megatron_core='megatron.core',
           loftq_config={},
           use_dora=False,
           layer_replication=None,
           runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))


trainable params: 1,179,648 || all params: 61,686,272 || trainable%: 1.9123




PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 512)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 512)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): lora.Linear(
                    (base_layer): Linear(in_features=512, out_features=512, bias=False)
                    (lora_dropout): ModuleDict(
                      (default): Identity()
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=512, out_features=16, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=16, out_features=512, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedd

In [7]:
# Define the prepare function
def prepare(data, messages, tokenizer):
    """
    Prepares the dataset for training by populating message templates with data,
    tokenizing the text, and creating input tensors.

    Args:
    - data: DataFrame containing 'Question' and 'Answer' columns.
    - messages: List of message templates to be populated with data.
    - tokenizer: Tokenizer for tokenizing the text.

    Returns:
    - output_dataset: List of dictionaries containing tokenized inputs and attention masks.
    """
    output_dataset = []  # Initialize an empty list to store the prepared data

    # Iterate through each row in the DataFrame
    for _, row in data.iterrows():
        question = row["Question"]  # Extract the question
        answer = row["Answer"]  # Extract the answer

        # Deep copy the message templates to avoid modifying the original
        populated = copy.deepcopy(messages)

        # Populate the message template with the question
        for msg in populated:
            msg["content"] = msg["content"].replace("{SOURCE}", question)

        # Append the answer as the assistant's response to the populated messages
        populated.append({"role": "assistant", "content": answer})

        # Convert populated messages into a single string for tokenization
        prepared_instance = " ".join([msg["content"] for msg in populated])

        # Tokenize the prepared instance with padding and truncation
        tokenized_instance = tokenizer(prepared_instance, return_tensors="pt", padding=True, truncation=True)

        # Append the tokenized inputs and attention mask to the output dataset
        output_dataset.append({
            "input_ids": tokenized_instance["input_ids"][0],
            "attention_mask": tokenized_instance["attention_mask"][0],
            "labels": tokenized_instance["input_ids"][0].clone()  # Clone the input_ids to use as labels
        })

    return output_dataset  # Return the prepared dataset

In [8]:
# Example usage of the prepare function
training_dataset = prepare(train_df, my_prompt, tokenizer)

# Verify the tokenization by decoding the first set of input_ids
print(tokenizer.decode(training_dataset[0]["input_ids"]))

Please generate an answer: What kind of health-related data does TRUST provide access to? TRUST provides access to a wide range of health-related data, including genomic, behavioural, socio-economic, and clinical data, which covers various aspects such as patient demographics, healthcare utilisation, disease incidence, as well as data on healthcare policy, social determinants of health, and health systems. With this diverse and comprehensive data, researchers can discover new insights that can improve our understanding of health and disease, and ultimately contribute to innovations and breakthroughs in healthcare.</s>


In [11]:
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.1,
    max_steps=100, # Reduce from 500 to 100
    per_device_train_batch_size=1,
    logging_steps=20,
    do_eval=False,
    bf16=True,
)

# Initialize Trainer
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=training_dataset,
)

# Fine-tune the model with PEFT
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss
20,3.2641
40,3.1028
60,2.8864
80,2.0276


Step,Training Loss
20,3.2641
40,3.1028
60,2.8864
80,2.0276
100,1.5844


TrainOutput(global_step=100, training_loss=2.5730232048034667, metrics={'train_runtime': 16978.1922, 'train_samples_per_second': 0.024, 'train_steps_per_second': 0.006, 'total_flos': 14975451955200.0, 'train_loss': 2.5730232048034667, 'epoch': 1.0025062656641603})




1. output_dir: Directory where the model's checkpoints and logs will be saved.
2. learning_rate: The initial learning rate for the optimizer, which determines the step size at each iteration while moving toward a minimum of the loss function
3. warmup_steps: Number of steps for the warmup phase, where the learning rate gradually increases from 0 to the initial learning rate. This helps stabilize training.
4. weight_decay: Regularization technique that reduces the magnitude of the weights, helping to prevent overfitting by penalizing large weights.
5. max_steps: Total number of training steps to perform. Reducing this will shorten the training time.
6. per_device_train_batch_size: Number of training samples processed simultaneously on each device (e.g., GPU). Larger batch sizes can reduce noise in gradient updates but require more memory.
7. logging_steps: Frequency of logging training metrics. It logs after every specified number of steps.
8. do_eval: Whether to run evaluation during training. Setting this to False skips the evaluation step, saving time.
9. bf16: Use Brain Floating Point 16 (bf16) precision, which speeds up training while preserving more numerical accuracy compared to fp16. Suitable for newer GPUs like NVIDIA A100.



In [16]:
evaluate(lora_model, my_prompt, testing_dataset)

  0%|          | 0/7 [00:00<?, ?it/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For c

ACCURACY = 0.0000


In [18]:
# Merge the LoRA weights into the main model
trained_model = lora_model.merge_and_unload()
trained_model

evaluate(trained_model, my_prompt, testing_dataset)

  0%|          | 0/7 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='le

ACCURACY = 0.0000


# 4.3 Troubleshoot: Combined Evaluation Metric

The evaluation function gave an accuracy of 0.0000 with testing dataset (before and after PEFT finetuning), it indicates that the model's generated answers did not exactly match the reference answers. This could happen due to variations in wording, phrasing, or other factors in the output text. Even if the model's outputs are not exact matches (low accuracy), they may still be close in terms of BLEU score or maintain the intended sentiment, giving credit where due.

Thus, I am using 'evaluate_final_score' as it may be more appropriate as it accounts for partial correctness and overall sentiment, offering a broader perspective on the model's performance.


In [None]:
import torch
import copy
from tqdm.notebook import tqdm
from nltk.translate.bleu_score import sentence_bleu
from textblob import TextBlob
import numpy as np

In [19]:
@torch.no_grad()
def evaluate_final_score(model, messages, testing_dataset, batch_size=16):
    """
    Evaluate the model's performance on the testing dataset using accuracy, BLEU score, and sentiment score.
    Compute a final score as the average of these three metrics.

    Args:
        model: The pre-trained language model.
        messages: The prompt template in the OpenAI message format.
        testing_dataset: The dataset containing Q&A pairs for testing.
        batch_size: The number of examples to process in a batch.

    Returns:
        dict: A dictionary containing accuracy, average BLEU score, average sentiment score, and final score.
    """
    num_correct = 0
    bleu_scores = []
    sentiment_scores = []

    # Extract the questions and answers as lists
    questions = testing_dataset['Question'].tolist()
    answers = testing_dataset['Answer'].tolist()

    # Create batches manually
    for i in tqdm(range(0, len(questions), batch_size)):
        batch_questions = questions[i:i + batch_size]
        batch_answers = answers[i:i + batch_size]

        # Generate responses in batch
        batch_inputs = []
        for question in batch_questions:
            populated_messages = copy.deepcopy(messages)
            for msg in populated_messages:
                msg["content"] = msg["content"].replace("{SOURCE}", question)

            prompt_text = " ".join([msg["content"] for msg in populated_messages])
            batch_inputs.append(prompt_text)

        # Tokenize the batch input
        inputs = tokenizer(batch_inputs, return_tensors="pt", padding=True, truncation=True).to(model.device)

        # Generate outputs in batch
        outputs = model.generate(inputs["input_ids"], max_new_tokens=20, do_sample=False)

        # Decode outputs and calculate metrics
        for output_text, answer in zip(outputs, batch_answers):
            output_text = tokenizer.decode(output_text, skip_special_tokens=True).strip().lower()
            reference_answer = answer.strip().lower()

            # Accuracy
            if output_text == reference_answer:
                num_correct += 1

            # BLEU Score
            bleu_score = sentence_bleu([reference_answer.split()], output_text.split())
            bleu_scores.append(bleu_score)

            # Sentiment Score
            sentiment_score = TextBlob(output_text).sentiment.polarity
            sentiment_scores.append(sentiment_score)

    # Calculate metrics
    accuracy = num_correct / len(testing_dataset)
    avg_bleu_score = np.mean(bleu_scores)
    avg_sentiment_score = np.mean(sentiment_scores)

    # Compute the final score as the average of accuracy, BLEU score, and sentiment score
    final_score = np.mean([accuracy, avg_bleu_score, avg_sentiment_score])

    print(f"ACCURACY = {accuracy:.4f}")
    print(f"AVERAGE BLEU SCORE = {avg_bleu_score:.4f}")
    print(f"AVERAGE SENTIMENT SCORE = {avg_sentiment_score:.4f}")
    print(f"FINAL SCORE = {final_score:.4f}")

    return {
        "accuracy": accuracy,
        "avg_bleu_score": avg_bleu_score,
        "avg_sentiment_score": avg_sentiment_score,
        "final_score": final_score,
    }

In [17]:
# Run the evaluation
evaluate_final_score(model, my_prompt, testing_dataset)

# Print the results
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Average BLEU Score: {metrics['avg_bleu_score']:.4f}")
print(f"Average Sentiment Score: {metrics['avg_sentiment_score']:.4f}")
print(f"Final Score: {metrics['final_score']:.4f}")

  0%|          | 0/7 [00:00<?, ?it/s]

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


ACCURACY = 0.0000
AVERAGE BLEU SCORE = 0.0004
AVERAGE SENTIMENT SCORE = 0.0321
FINAL SCORE = 0.0109
Accuracy: 0.0000
Average BLEU Score: 0.0004
Average Sentiment Score: 0.0321
Final Score: 0.0109


In [20]:
# Example usage
evaluate_final_score(lora_model, my_prompt, testing_dataset)

  0%|          | 0/7 [00:00<?, ?it/s]

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use

ACCURACY = 0.0000
AVERAGE BLEU SCORE = 0.0161
AVERAGE SENTIMENT SCORE = 0.1404
FINAL SCORE = 0.0522


**Discussion**

The LoRA model shows improvements in BLEU score and sentiment score, including final score, These results suggest that LoRA fine-tuning enhanced the model's output quality. However, exact matches (accuracy) are still challenging. For both models, the accuracy remains at 0.0000, indicating that neither model generated exact matches with the reference answers.

# **5. Chatbot Interface**

In [None]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.41.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.112.0-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.4/50.4 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.9 (from gradio)
  Downloading python_multipart-0.0.9-py3-none-any.whl.metadata (2.5 kB)
Collecting ruff>=0.2.2 (fr

In [None]:
import gradio as gr
import pandas as pd

# Load the dataframe (assuming you have already saved it)
df = pd.read_csv('/content/gdrive/MyDrive/faqs.csv')

# Function to get answer from the dataframe based on the question
def get_answer_from_df(question, df):
    for idx, row in df.iterrows():
        if row['Question'].strip().lower() == question.strip().lower():
            return row['Answer']
    return "Sorry, I don't have an answer for that question."

# Define the chatbot function
def chatbot(message, history):
    response = get_answer_from_df(message, df)
    history.append((message, response))
    return history

# Define a testing function to ensure chatbot is working
def slow_echo(message, history):
    for i in range(len(message)):
        time.sleep(0.3)
        yield "You typed: " + message[: i+1]

# Launch the chatbot using Gradio
# gr.ChatInterface(chatbot).launch()
gr.ChatInterface(slow_echo).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ad5b295b3965b9de1e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




In [None]:
import gradio as gr
import pandas as pd

# Load the dataframe (assuming you have already saved it)
df = pd.read_csv('/content/gdrive/MyDrive/faqs.csv')

# Function to get answer from the dataframe based on the question
def get_answer_from_df(question, df):
    for idx, row in df.iterrows():
        if row['Question'].strip().lower() == question.strip().lower():
            return row['Answer']
    return "Sorry, I don't have an answer for that question."

# Define the chatbot function
def chatbot(message, history):
    response = get_answer_from_df(message, df)
    history = history or []  # Initialize history if it's None
    history.append((message, response))
    return history

# Launch the chatbot using Gradio
gr.ChatInterface(fn=chatbot).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://a0178161f7e33c0f5a.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


