# **Problem Statement**

*In today’s fast-paced digital world, companies need to provide accurate and fast answers to frequently asked questions (FAQs) while keeping the computational costs low. In this lab, you will develop an efficient FAQ Question Answering (QA) system by fine-tuning a lightweight student model using knowledge distillation from a high-performing teacher model. You will then apply dynamic quantization to the student model to reduce its memory footprint and inference latency without degrading its performance.*

## Challenge:
Experiment with different training epochs until the fine-tuned student model can reliably answer the test question, "On Saturdays, when do you open your office?" while maintaining efficiency gains after quantization.

## Import Libraries and Set Up Environment

In [1]:
!pip install pandas datasets torch transformers



In [2]:
# This cell imports all necessary libraries and sets up the environment.
import pandas as pd
from datasets import Dataset
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
import time
import torch.quantization

print("Libraries imported successfully.")


Libraries imported successfully.


## Prepare the Dataset
The dataset contains 51 distinct questions and answers for fine-tuning.

In [3]:
# This cell creates a dataset of 51 unique FAQ pairs.
questions = [
    "What are your opening hours?",
    "How can I reset my password?",
    "Where is your store located?",
    "What is the return policy?",
    "How do I track my order?",
    "Do you offer free shipping?",
    "How do I create an account?",
    "What payment methods do you accept?",
    "Can I cancel my order?",
    "How do I exchange a product?",
    "What warranty do your products have?",
    "How do I subscribe to your newsletter?",
    "What are the shipping costs?",
    "Where can I find the latest deals?",
    "Do you have a loyalty program?",
    "How do I use a promo code?",
    "Can I place an order over the phone?",
    "What is your privacy policy?",
    "How do I update my billing information?",
    "What is your refund policy?",
    "How do I contact customer support?",
    "Do you have gift cards available?",
    "What are your international shipping options?",
    "How can I apply for a job at your company?",
    "Do you offer same-day delivery?",
    "How do I return a defective product?",
    "What are your holiday hours?",
    "Where can I view product reviews?",
    "Can I pre-order new products?",
    "How do I report a problem with my order?",
    "What measures do you take for product safety?",
    "Do you offer installation services?",
    "How do I find a store near me?",
    "Can I get a price match guarantee?",
    "What is your exchange policy?",
    "Do you have financing options available?",
    "How do I check my order status?",
    "What is the process for bulk orders?",
    "How do I reset my account password?",
    "What is your policy on damaged goods?",
    "How do I update my shipping address?",
    "Are there any membership benefits?",
    "What discounts do you offer for students?",
    "How do I become a reseller?",
    "Do you offer warranties on electronics?",
    "What is the best way to contact your sales team?",
    "Can I request a catalog?",
    "How do I provide feedback on your service?",
    "What is your procedure for handling complaints?",
    "Where can I find instructions for product setup?",
    "Do you offer virtual shopping assistance?"
]

answers = [
    "Our store is open from 9 AM to 9 PM every day.",
    "To reset your password, click 'Forgot Password' on the login page and follow the emailed instructions.",
    "Our store is located at 123 Main Street in Anytown.",
    "Products can be returned within 30 days with a valid receipt.",
    "Log into your account and click 'Order History' to track your order.",
    "Yes, free shipping is available on orders over $50.",
    "Click 'Sign Up' on our homepage and fill in the registration form.",
    "We accept major credit cards, PayPal, and Apple Pay.",
    "Orders can be cancelled within one hour of placement.",
    "To exchange a product, visit any of our store locations with your receipt.",
    "Our products come with a one-year warranty for manufacturing defects.",
    "Subscribe by entering your email address on our newsletter signup page.",
    "Shipping costs vary by location and order size; please refer to our shipping policy.",
    "Visit our 'Promotions' page to view the latest deals.",
    "Yes, our loyalty program offers exclusive discounts and rewards.",
    "Enter your promo code at checkout to receive your discount.",
    "Yes, orders can be placed over the phone via our customer service hotline.",
    "Our privacy policy is detailed on our website; we take data security seriously.",
    "Update your billing information in your account settings under 'Billing'.",
    "Refunds are processed within 7-10 business days after return.",
    "You can contact customer support via our hotline or email support@ourstore.com.",
    "Gift cards are available in multiple denominations.",
    "We ship internationally to selected countries; see our international shipping page for details.",
    "Visit our careers page for job openings and application procedures.",
    "Same-day delivery is available in select areas.",
    "If a product is defective, contact support immediately for a replacement or refund.",
    "Holiday hours are posted on our website during seasonal periods.",
    "Product reviews can be found on the product detail pages of our website.",
    "Pre-orders for upcoming products are available online.",
    "Report any order issues by contacting our customer service department.",
    "We follow strict quality control standards to ensure product safety.",
    "Yes, installation services are offered for select products.",
    "Use our store locator on the website to find the nearest store.",
    "We offer a price match guarantee if you find a lower price elsewhere.",
    "Exchanges are allowed within 30 days under our exchange policy.",
    "Financing options are available through our partnered financial services.",
    "Check your order status by logging into your account and selecting 'Order Status'.",
    "For bulk orders, please contact our sales team for a custom quote.",
    "If you've forgotten your account password, use the 'Forgot Password' link on the login page.",
    "Our damaged goods policy allows returns or exchanges within 30 days with proof of damage.",
    "Update your shipping address in your account settings or by contacting support.",
    "Membership benefits include exclusive discounts and early access to new products.",
    "We offer a 10% discount for students with a valid student ID.",
    "To become a reseller, complete the reseller application form on our website.",
    "Yes, extended warranties on electronics are available for purchase.",
    "Contact our sales team via the 'Contact Us' page for prompt assistance.",
    "You can request a catalog by filling out our online catalog request form.",
    "Provide feedback through our online feedback form or by calling customer service.",
    "We have a dedicated process to handle complaints efficiently; please contact support.",
    "Setup instructions are included with your product and available online.",
    "Virtual shopping assistance is available via our online chat service."
]

print(f"Total unique QA pairs: {len(questions)}")
data = {"question": questions, "answer": answers}
faq_df = pd.DataFrame(data)
faq_dataset = Dataset.from_pandas(faq_df)
print(f"Dataset size: {len(faq_dataset)} examples")


Total unique QA pairs: 51
Dataset size: 51 examples


## Load Teacher and Student Models
The teacher model has already been fine-tuned on SQuAD and is expected to produce accurate QA outputs.
The student model will be trained to mimic the teacher's outputs on the FAQ data.


In [4]:
# This cell loads the teacher model and the student model.

teacher_model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(teacher_model_name)
teacher_model = AutoModelForQuestionAnswering.from_pretrained(teacher_model_name)

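# The student starts from the base DistilBERT checkpoint (same architecture as the teacher).
# Its QA head is newly initialized, so it must be fine-tuned before it can answer questions.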
student_model_name = "distilbert-base-uncased"
student_model = AutoModelForQuestionAnswering.from_pretrained(student_model_name)

print("Teacher and student models loaded successfully.")


Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Teacher and student models loaded successfully.


## Define Helper Functions
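 1. distillation_loss: Combines a soft-target loss (KL divergence against the teacher's temperature-softened predictions) with a hard-target cross-entropy loss on the true span, weighted by alpha.
 2. compute_answer_span: Returns the token indices of the context segment (the tokens between the first and second [SEP]), which this lab treats as the true answer span.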
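
To see how the two terms in item 1 interact, the following minimal sketch evaluates a single-head version of the same loss on toy logits. The function name `toy_distillation_loss` and the tensor values are illustrative assumptions, not part of the lab code; the lab's own function applies the same formula to both the start and end logits and averages them.

```python
import torch
import torch.nn.functional as F


def toy_distillation_loss(student, teacher, target, alpha=0.5, temperature=2.0):
    """Single-head distillation loss: alpha * soft + (1 - alpha) * hard."""
    # Soft term: KL divergence between temperature-softened distributions,
    # rescaled by T^2 so its gradient magnitude stays comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student / temperature, dim=-1),
        F.softmax(teacher / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard term: ordinary cross entropy against the labelled position.
    hard = F.cross_entropy(student, target)
    return alpha * soft + (1 - alpha) * hard, soft, hard


# One example with five token positions; the teacher strongly prefers position 1.
teacher_logits = torch.tensor([[0.0, 4.0, 1.0, 0.0, 0.0]])
student_logits = torch.tensor([[1.0, 1.0, 1.0, 1.0, 1.0]])  # uninformed student
target = torch.tensor([1])

loss, soft, hard = toy_distillation_loss(student_logits, teacher_logits, target)
same_loss, same_soft, same_hard = toy_distillation_loss(teacher_logits, teacher_logits, target)

print(f"uninformed student: soft={soft.item():.4f} hard={hard.item():.4f} total={loss.item():.4f}")
print(f"student == teacher: soft={same_soft.item():.4f} hard={same_hard.item():.4f}")
```

When the student reproduces the teacher exactly, the soft term vanishes and only the hard-label term remains, which is why alpha controls how much the student is pulled toward the teacher versus the labelled span.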

In [5]:

def distillation_loss(student_start, student_end, teacher_start, teacher_end, true_start, true_end, alpha=0.5, temperature=2.0):
    # Soft target loss using KL divergence.
    soft_loss_start = F.kl_div(
        F.log_softmax(student_start / temperature, dim=-1),
        F.softmax(teacher_start / temperature, dim=-1),
        reduction="batchmean"
    ) * (temperature ** 2)
    soft_loss_end = F.kl_div(
        F.log_softmax(student_end / temperature, dim=-1),
        F.softmax(teacher_end / temperature, dim=-1),
        reduction="batchmean"
    ) * (temperature ** 2)
    # Hard target loss using cross entropy.
    hard_loss_start = F.cross_entropy(student_start, true_start)
    hard_loss_end = F.cross_entropy(student_end, true_end)
    # Combine the losses.
    return alpha * ((soft_loss_start + soft_loss_end) / 2) + (1 - alpha) * ((hard_loss_start + hard_loss_end) / 2)

def compute_answer_span(single_input, tokenizer):
    # This function assumes the SQuAD-style input: [CLS] question tokens [SEP] context tokens [SEP] ...
    input_ids = single_input['input_ids'][0].tolist()
    sep_id = tokenizer.sep_token_id
    try:
        sep_index = input_ids.index(sep_id)
        second_sep_index = input_ids.index(sep_id, sep_index + 1)
    except ValueError:
        sep_index = 0
        second_sep_index = len(input_ids) - 1
    true_start = sep_index + 1  # Start of context tokens.
    true_end = second_sep_index - 1  # End of context tokens.
    return true_start, true_end

print("Helper functions defined.")


Helper functions defined.
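
The span logic in compute_answer_span can be traced by hand on a short token sequence. The sketch below is a plain-Python restatement under stated assumptions: `context_span` is a hypothetical helper, and the token ids are made up except for the special-token ids, which are assumed to be 101 for [CLS], 102 for [SEP], and 0 for [PAD].

```python
def context_span(input_ids, sep_id):
    """Return (start, end) indices of the tokens between the first two [SEP] tokens."""
    first_sep = input_ids.index(sep_id)
    second_sep = input_ids.index(sep_id, first_sep + 1)
    # The context begins right after the first [SEP] and ends right before the second.
    return first_sep + 1, second_sep - 1


# [CLS] q q q [SEP] c c c c [SEP] [PAD] [PAD]
ids = [101, 7, 8, 9, 102, 21, 22, 23, 24, 102, 0, 0]
start, end = context_span(ids, sep_id=102)

print(start, end)          # 5 8
print(ids[start:end + 1])  # [21, 22, 23, 24]
```

Because the search stops at the second [SEP], trailing padding added by `padding=True` does not shift the span.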


## Fine-Tuning via Knowledge Distillation

This cell iterates over the dataset in batches, tokenizes each question/answer pair, computes the true answer span,
and trains the student model to match both the teacher's outputs and the true span.

In [6]:
# This cell fine-tunes the student model using knowledge distillation.

train_loader = DataLoader(faq_dataset, batch_size=2, shuffle=True)
student_model.train()
optimizer = torch.optim.Adam(student_model.parameters(), lr=3e-5)

# Experiment with different numbers of epochs.
Epochs = 20  # Try different values (e.g., 10, 30, 50) until the model answers reliably.

for epoch in range(Epochs):
    for batch in train_loader:
        # Tokenize both the question and the answer (the answer acts as context).
        batch_inputs = tokenizer(batch["question"], batch["answer"], padding=True, truncation=True, return_tensors="pt")
        true_starts, true_ends = [], []
        # Compute the correct answer span for each sample.
        for i in range(len(batch["question"])):
            single_input = {k: v[i].unsqueeze(0) for k, v in batch_inputs.items()}
            ts, te = compute_answer_span(single_input, tokenizer)
            true_starts.append(ts)
            true_ends.append(te)
        true_start = torch.tensor(true_starts)
        true_end = torch.tensor(true_ends)

        # Obtain teacher outputs without computing gradients.
        with torch.no_grad():
            teacher_outputs = teacher_model(**batch_inputs)
            teacher_start = teacher_outputs.start_logits
            teacher_end = teacher_outputs.end_logits
        # Forward pass through the student model.
        student_outputs = student_model(**batch_inputs)
        student_start = student_outputs.start_logits
        student_end = student_outputs.end_logits

        # Compute the distillation loss.
        loss = distillation_loss(student_start, student_end, teacher_start, teacher_end, true_start, true_end)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1} completed. Loss: {loss.item()}")

print("Fine-tuning complete. Adjust Epochs if needed to improve performance.")


Epoch 1 completed. Loss: 3.1602189540863037
Epoch 2 completed. Loss: 1.5478594303131104
Epoch 3 completed. Loss: 1.761122226715088
Epoch 4 completed. Loss: 0.44907593727111816
Epoch 5 completed. Loss: 1.2566320896148682
Epoch 6 completed. Loss: 0.5907028913497925
Epoch 7 completed. Loss: 0.6963661909103394
Epoch 8 completed. Loss: 0.6079907417297363
Epoch 9 completed. Loss: 0.5653973817825317
Epoch 10 completed. Loss: 0.8307878971099854
Epoch 11 completed. Loss: 0.4184042811393738
Epoch 12 completed. Loss: 0.8936832547187805
Epoch 13 completed. Loss: 0.9776091575622559
Epoch 14 completed. Loss: 0.679531455039978
Epoch 15 completed. Loss: 0.8041097521781921
Epoch 16 completed. Loss: 0.3470720052719116
Epoch 17 completed. Loss: 0.7869094610214233
Epoch 18 completed. Loss: 0.8783726692199707
Epoch 19 completed. Loss: 0.7106615304946899
Epoch 20 completed. Loss: 0.8415277600288391
Fine-tuning complete. Adjust Epochs if needed to improve performance.
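
The values logged above come from the last batch of each epoch only, which may explain part of their epoch-to-epoch noise. A minimal sketch of one alternative, assuming a hypothetical `RunningMean` helper and made-up batch losses, is to report the example-weighted mean over all batches in the epoch:

```python
class RunningMean:
    """Accumulates an example-weighted mean of per-batch loss values."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_mean_loss, batch_size):
        # Each batch loss is already a mean over its examples, so weight it
        # by the batch size to recover the mean over all examples seen so far.
        self.total += batch_mean_loss * batch_size
        self.count += batch_size

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


# Hypothetical per-batch losses. With 51 examples and batch_size=2,
# the final batch of every epoch holds a single example.
batch_losses = [3.0, 1.0, 2.0]
batch_sizes = [2, 2, 1]

tracker = RunningMean()
for value, size in zip(batch_losses, batch_sizes):
    tracker.update(value, size)

print(f"Mean epoch loss: {tracker.mean:.4f}")
```

In the training loop, the equivalent call would be `tracker.update(loss.item(), len(batch["question"]))` after each optimizer step, with a fresh tracker per epoch.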


## *Important*:
### *Dynamic quantization in PyTorch is only supported on the CPU. This means that after applying dynamic quantization to your fine-tuned student model, the quantized model's operators (such as quantized::linear_dynamic) are implemented only for CPU. If you try to run inference with a quantized model on a GPU, you'll encounter errors*.

### *Therefore*:

### *All models must be moved to the CPU for a fair comparison*.
### *Use .cpu() to move your models to the CPU*.
### *When creating the QA pipeline, pass device=-1 to force CPU usage.*

## Apply Dynamic Quantization [TODO]

In this cell, you will apply dynamic quantization to the fine-tuned student model. Quantization converts parts of the
model (such as Linear layers) to lower precision (e.g., int8),
which reduces model size and can speed up inference.
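
Before quantizing the full student model, it can help to see the effect on something small. The sketch below is an illustration under stated assumptions: the two-layer `toy` network and its dimensions are invented for the example, and only the `quantize_dynamic` call mirrors what this lab uses.

```python
import torch
import torch.nn as nn

# A tiny stand-in model (hypothetical; not the lab's student model).
toy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).eval()

# quantize_dynamic returns a converted copy; the original model is left unchanged.
toy_q = torch.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(3, 16)
with torch.no_grad():
    ref = toy(x)
    out = toy_q(x)

print("original layer: ", type(toy[0]).__module__, type(toy[0]).__name__)
print("quantized layer:", type(toy_q[0]).__module__, type(toy_q[0]).__name__)
print("max abs output difference:", (ref - out).abs().max().item())
```

Only the Linear layers are swapped for dynamically quantized versions; the ReLU is untouched, and the outputs differ slightly because the weights are now stored in int8.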

In [11]:

import torch.quantization

# TODO: Apply dynamic quantization to the student_model for the torch.nn.Linear layers.
# quantize_dynamic returns a quantized copy; student_model itself is left unchanged.
quantized_model = torch.quantization.quantize_dynamic(
    student_model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# TODO: Set teacher_model, student_model and quantized_model to evaluation mode.

teacher_model.eval()
student_model.eval()
quantized_model.eval()
print("Dynamic quantization applied to the student model.")


Dynamic quantization applied to the student model.


## Inference and Evaluation [TODO]

This cell sets up a question-answering (QA) pipeline and
evaluates the teacher model, fine-tuned student model, and
quantized student model on a given context and question.

In [12]:

# Define the inference context and question.
context = (
    "Our office is located in the heart of downtown with modern facilities and a friendly environment. "
    "We operate from 9 AM to 9 PM on weekdays (Monday through Friday) and from 10 AM to 6 PM on Saturdays. "
    "Our team is dedicated to providing exceptional customer service. For further inquiries, please contact our customer support."
)
question = "On Saturdays, when do you open your office?"

def answer_query(model, question, context, tokenizer):
    # TODO: Create a QA pipeline for the given model using the 'pipeline' function.
    # device=-1 forces CPU execution, which the dynamically quantized model requires.
    qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer, device=-1)

    # TODO: Return the answer by running the pipeline on the provided question and context.
    return qa_pipe(question=question, context=context)

# TODO: Evaluate the teacher model and store its answer.
print("Teacher Model Answer:")
teacher_answer = answer_query(teacher_model, question, context, tokenizer)
print(teacher_answer)

# TODO: Evaluate the fine-tuned student model and store its answer.
student_answer = answer_query(student_model, question, context, tokenizer)
print("\nFine-tuned Student Model Answer:")
print(student_answer)

# TODO: Evaluate the quantized student model and store its answer.
quantized_answer = answer_query(quantized_model, question, context, tokenizer)
print("\nQuantized Student Model Answer:")
print(quantized_answer)


Device set to use cpu
Device set to use cpu


Teacher Model Answer:
{'score': 0.5255181789398193, 'start': 172, 'end': 185, 'answer': '10 AM to 6 PM'}


Device set to use cpu



Fine-tuned Student Model Answer:
{'score': 0.024858662858605385, 'start': 172, 'end': 199, 'answer': '10 AM to 6 PM on Saturdays.'}

Quantized Student Model Answer:
{'score': 0.017275527119636536, 'start': 172, 'end': 199, 'answer': '10 AM to 6 PM on Saturdays.'}


## Measure Inference Time and Model Size [TODO]

In this cell, you will measure:
- The average inference time for each model over multiple runs.
- The approximate model size based on the sum of parameter sizes.

These metrics help you compare the computational efficiency and
memory footprint of the teacher, fine-tuned student, and quantized models.
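
A single average can hide outliers caused by other processes on the machine. As a minimal sketch, assuming a hypothetical `time_callable` helper that is not part of the lab, the timing pattern can be written once for any zero-argument callable and report both mean and median:

```python
import statistics
import time


def time_callable(fn, runs=10, warmup=3):
    """Time fn() over several runs and return (mean_seconds, median_seconds)."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.median(samples)


calls = []


def dummy_workload():
    # Stand-in for a model call; records each invocation so the count can be checked.
    calls.append(sum(range(10_000)))


mean_s, median_s = time_callable(dummy_workload)
print(f"mean={mean_s:.6f}s median={median_s:.6f}s over {len(calls)} calls (incl. warm-up)")
```

For the lab's models, the callable could be `lambda: qa_pipe(question=question, context=context)` for a pipeline built once outside the timed region.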

In [10]:


def measure_inference_time(model, question, context, tokenizer, runs=10):
    # TODO: Create a QA pipeline for the given model.
    qa_pipe = pipeline("question-answering", model=model, tokenizer=tokenizer, device=-1)

    # Perform a few warm-up runs.
    for _ in range(3):
        _ = qa_pipe(question=question, context=context)

    # TODO: Start a timer, run the pipeline 'runs' times, and stop the timer.
    # time.perf_counter() is a monotonic, high-resolution clock suited to short intervals.
    start_time = time.perf_counter()
    for _ in range(runs):
        _ = qa_pipe(question=question, context=context)
    end_time = time.perf_counter()

    # TODO: Calculate and return the average inference time.
    avg_time = (end_time - start_time) / runs
    return avg_time

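def get_model_size(model):
    # TODO: Iterate over all parameters in the model and sum their sizes (in bytes).
    # Note: dynamically quantized Linear layers keep their int8 weights in packed
    # buffers rather than in model.parameters(), so this sum does not include them.
    param_size = 0
    for param in model.parameters():
        param_size += param.nelement() * param.element_size()

    # Convert the total size from bytes to megabytes (MB) and return.
    return param_size / (1024 ** 2)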

# TODO: Measure the average inference time for each model.
teacher_time = measure_inference_time(teacher_model, question, context, tokenizer)
student_time = measure_inference_time(student_model, question, context, tokenizer)
quantized_time = measure_inference_time(quantized_model, question, context, tokenizer)


# TODO: Calculate the approximate model sizes for each model.
teacher_size = get_model_size(teacher_model)
student_size = get_model_size(student_model)
quantized_size = get_model_size(quantized_model)


print("Average Inference Time (seconds):")
print(f"Teacher Model: {teacher_time:.4f}")
print(f"Fine-tuned Student Model: {student_time:.4f}")
print(f"Quantized Student Model: {quantized_time:.4f}")

print("\nApproximate Model Size (MB):")
print(f"Teacher Model: {teacher_size:.2f} MB")
print(f"Fine-tuned Student Model: {student_size:.2f} MB")
print(f"Quantized Student Model: {quantized_size:.2f} MB")


Device set to use cpu
Device set to use cpu
Device set to use cpu


Average Inference Time (seconds):
Teacher Model: 0.0963
Fine-tuned Student Model: 0.0931
Quantized Student Model: 0.0569

Approximate Model Size (MB):
Teacher Model: 253.16 MB
Fine-tuned Student Model: 253.16 MB
Quantized Student Model: 91.00 MB
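
The size reported for the quantized model should be read with care: summing `model.parameters()` skips the packed int8 weights of dynamically quantized Linear layers, so that figure likely reflects mostly the layers that were not quantized (such as embeddings). One alternative, sketched below with a toy model and hypothetical helper names, is to serialize the `state_dict` to an in-memory buffer and measure its length.

```python
import io

import torch
import torch.nn as nn


def parameter_size_mb(model):
    """Size of everything returned by model.parameters(), in MB."""
    return sum(p.nelement() * p.element_size() for p in model.parameters()) / (1024 ** 2)


def serialized_size_mb(model):
    """Size of the serialized state_dict, in MB (includes packed quantized weights)."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / (1024 ** 2)


# Toy model made only of Linear layers (an assumption for illustration).
toy = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).eval()
toy_q = torch.quantization.quantize_dynamic(toy, {nn.Linear}, dtype=torch.qint8)

print(f"parameters():  fp32={parameter_size_mb(toy):.2f} MB  int8={parameter_size_mb(toy_q):.2f} MB")
print(f"state_dict():  fp32={serialized_size_mb(toy):.2f} MB  int8={serialized_size_mb(toy_q):.2f} MB")
```

In this toy case the parameter-based size of the quantized model drops to zero because every layer with weights was quantized, while the serialized size still accounts for the int8 weights. Applying `serialized_size_mb` to `quantized_model` should give a more representative footprint than the parameter sum.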
