# 🤖 Question-Answering System 🤔

## 🏗️ Project Architecture and Model Selection

This Question-Answering System uses a fine-tuned **MiniLM** language model. MiniLM was chosen for its compact size, which suits environments with tight latency budgets and limited resources, so the system can respond promptly even on constrained hardware.

## 📘 Training with SQuAD Dataset

The model is fine-tuned on SQuAD (the Stanford Question Answering Dataset), a standard benchmark for extractive question answering. Each example pairs a question with a context passage that contains the answer, so the model learns to locate the answer span within the given context.

## 📊 Performance Metrics: MiniLM vs BERT

To balance model size, speed, and accuracy, we evaluated MiniLM against the heavier BERT base model. MiniLM reached an F1 score of 82.9%, compared with BERT's 88.6%. In exchange for that drop of about 5.7 points, the MiniLM checkpoint is only 90.9 MB, against 436 MB for BERT, and it is approximately 10 times faster.
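
The trade-off can be made concrete with a little arithmetic on the figures reported in this notebook. The snippet below is only a sketch: the variable names are ours, the sizes are the checkpoint download sizes shown in the loading logs, and the F1 values are the ones from the benchmark output.

```python
# Figures reported in this notebook (checkpoint size in MB, F1 in percent).
minilm = {"size_mb": 90.9, "f1": 82.88}
bert = {"size_mb": 436.0, "f1": 88.62}

size_ratio = bert["size_mb"] / minilm["size_mb"]  # how many times smaller MiniLM is
f1_gap = bert["f1"] - minilm["f1"]                # absolute F1 points given up
f1_retained = minilm["f1"] / bert["f1"]           # share of BERT's F1 that MiniLM keeps

print(f"MiniLM is {size_ratio:.1f}x smaller")
print(f"F1 gap: {f1_gap:.2f} points")
print(f"F1 retained: {f1_retained:.1%}")
```

In other words, MiniLM gives up a few F1 points in exchange for a checkpoint that is several times smaller.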

The project therefore pairs a working question-answering system with a model chosen for low latency and minimal resource requirements. 🚀

In [1]:
%%bash

pip install -q transformers
pip install -q datasets
pip install -q evaluate



In [2]:
from transformers import AutoModelForQuestionAnswering
from transformers import Trainer, TrainingArguments
from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np
import torch

## Pick a Model

In [None]:
model_checkpoint = "sentence-transformers/all-MiniLM-L6-v2"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2 and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
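# Baseline for comparison. Running this cell overwrites the `tokenizer` and
# `model` loaded in the MiniLM cell above, so run only one of the two cells
# before preprocessing and training.
model_checkpoint = "bert-base-cased"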

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# model_name = 'nreimers/MiniLM-L3-H384-uncased'

# tokenizer = AutoTokenizer.from_pretrained(model_name)

# device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(device)

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("squad")

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
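
The label computation above can be hard to follow inside the full preprocessing function. The sketch below isolates it on hand-written, word-level offsets. `char_span_to_token_span` is a hypothetical helper of ours, not part of `transformers`, and it returns `None` where the notebook code stores `(0, 0)` (the `[CLS]` position) for answers that fall outside the current window.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to (start_token, end_token) indices into `offsets`.

    `offsets` holds one (char_start, char_end) pair per context token.
    Returns None when the answer is not fully covered by these tokens.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None

    # Walk forward past every token that starts at or before the answer start.
    idx = 0
    while idx < len(offsets) and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward past every token that ends at or after the answer end.
    idx = len(offsets) - 1
    while idx >= 0 and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1

    return start_token, end_token


context = "Paris is the capital of France"
# Hand-written word-level offsets; a real tokenizer would produce subword offsets.
offsets = [(0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30)]

answer = "capital of France"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # (3, 5)

# A window that stops after "capital" does not contain the whole answer.
print(char_span_to_token_span(offsets[:4], start_char, end_char))  # None
```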

In [None]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

(87599, 88524)
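
There are more features than examples because contexts longer than `max_length` are split into several overlapping windows, with `stride` giving the number of tokens shared by consecutive windows. A minimal sketch of that windowing follows. It assumes a plain list of token ids and ignores the question and special tokens that the real tokenizer also fits into each window. `split_with_overlap` is a hypothetical helper, not a `transformers` function.

```python
def split_with_overlap(tokens, max_len, overlap):
    """Split `tokens` into windows of at most `max_len` items,
    where consecutive windows share `overlap` items."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


short = split_with_overlap(list(range(3)), max_len=4, overlap=2)
long = split_with_overlap(list(range(10)), max_len=4, overlap=2)

print(short)  # [[0, 1, 2]]
print(long)   # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A short sequence yields a single window, while a long one yields several, which is why the feature count exceeds the example count.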

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
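
The masking step in the validation preprocessing keeps offsets only for context tokens, so that post-processing can later reject any span that touches the question or a special token. A toy illustration, with made-up offsets and a hypothetical helper name:

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep an offset only when its token belongs to the context (sequence id 1)."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Toy layout: [CLS] question question [SEP] context context [SEP]
sequence_ids = [None, 0, 0, None, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 9), (0, 0), (0, 5), (6, 8), (0, 0)]

masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)  # [None, None, None, None, (0, 5), (6, 8), None]
```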

In [None]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

(10570, 10784)

In [None]:
import collections

example_to_features = collections.defaultdict(list)

In [None]:
import numpy as np

n_best = 20
max_answer_length = 30

In [None]:
import evaluate

metric = evaluate.load("squad")
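
For intuition about what the two reported numbers measure, here is a pure-Python sketch in the spirit of the official SQuAD v1.1 evaluation script: answers are lowercased, stripped of punctuation and articles, and then compared either exactly or by token overlap. It is an illustration with our own function names, not the code that `evaluate` runs.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(score_fn, prediction, references):
    """Each question may have several reference answers; keep the best score."""
    return max(float(score_fn(prediction, ref)) for ref in references)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # True
print(round(f1_score("the next token", "next token or word"), 3))  # 0.667
```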

In [None]:
from tqdm.auto import tqdm


def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
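
The core of `compute_metrics` is the search over the `n_best` start and end candidates. The sketch below reproduces that search for a single feature on toy logits, using plain Python lists instead of NumPy arrays. `best_span` and the logit values are illustrative assumptions, not notebook outputs.

```python
def best_span(start_logits, end_logits, offsets, context, n_best=20, max_answer_length=30):
    """Return the highest-scoring valid answer text for one feature, or ""."""
    by_score = lambda logits: sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top_starts = by_score(start_logits)[:n_best]
    top_ends = by_score(end_logits)[:n_best]

    best_score, best_text = None, ""
    for s in top_starts:
        for e in top_ends:
            # Skip spans that touch the question or a special token.
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best_score is None or score > best_score:
                best_score = score
                best_text = context[offsets[s][0]:offsets[e][1]]
    return best_text


context = "Paris is the capital of France"
# [CLS] Paris is the capital of France [SEP]; special tokens have no offset.
offsets = [None, (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30), None]
start_logits = [5.0, 0.1, 0.0, 1.0, 4.0, 0.0, 0.5, 0.0]
end_logits = [5.0, 0.2, 0.0, 0.0, 1.0, 0.0, 4.5, 0.0]

prediction = best_span(start_logits, end_logits, offsets, context)
print(prediction)  # capital of France
```

Note that the highest raw logits sit on `[CLS]` in this toy example. The span is still recovered because positions without an offset are skipped.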

## Fine-tuning the Model

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "miniLM-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5, # 2e-4 miniLM
    num_train_epochs=6, # 5 miniLM
    weight_decay=0.01,
    seed=42,
    # max_steps=50000,
    fp16=True,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
| --- | --- |
| 500 | 4.113 |
| 1000 | 2.6571 |
| 1500 | 2.2356 |
| 2000 | 2.0895 |
| 2500 | 1.9557 |
| 3000 | 1.8566 |
| 3500 | 1.8283 |
| 4000 | 1.7574 |
| 4500 | 1.7021 |
| 5000 | 1.6505 |


TrainOutput(global_step=66396, training_loss=1.1044718530194013, metrics={'train_runtime': 5205.5837, 'train_samples_per_second': 102.034, 'train_steps_per_second': 12.755, 'total_flos': 1.3030945552207872e+16, 'train_loss': 1.1044718530194013, 'epoch': 6.0})
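
The step count and throughput in `TrainOutput` can be sanity-checked from the dataset size. The calculation below assumes a per-device batch size of 8 (the `TrainingArguments` default, since none is set above), a single device, and no gradient accumulation.

```python
import math

num_features = 88524       # training features after preprocessing
num_epochs = 6             # num_train_epochs in TrainingArguments
batch_size = 8             # ASSUMPTION: default per-device batch size, one device
train_runtime = 5205.5837  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_features / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_features * num_epochs / train_runtime

print(steps_per_epoch, total_steps)  # 11066 66396
print(f"{samples_per_second:.1f} samples/s")
```

Under these assumptions the total matches the reported `global_step=66396`, and the throughput agrees with `train_samples_per_second`.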

In [None]:
predictions, _, _ = trainer.predict(validation_dataset)
start_logits, end_logits = predictions

In [None]:
compute_metrics(start_logits, end_logits, validation_dataset, raw_datasets["validation"])

  0%|          | 0/10570 [00:00<?, ?it/s]

{'exact_match': 73.70860927152317, 'f1': 82.87719443407029}

Benchmark of both models on the SQuAD validation set:

| Model | Exact Match | F1 |
| --- | --- | --- |
| "bert-base-cased" | **81.25** | **88.62** |
| "sentence-transformers/all-MiniLM-L6-v2" | 73.71 | 82.88 |


## Saving

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
trainer.push_to_hub(commit_message="Training MiniLM on SquAD completed")

model.safetensors:   0%|          | 0.00/90.3M [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

events.out.tfevents.1700241498.2c701727246a.40433.0:   0%|          | 0.00/25.6k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

'https://huggingface.co/josueadin/miniLM-finetuned-squad/tree/main/'

## Pipeline

In [32]:
context = """A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation.[1] LLMs are artificial neural networks (mainly transformers[2]) and are (pre-)trained using self-supervised learning and semi-supervised learning.

As autoregressive language models, they work by taking an input text and repeatedly predicting the next token or word. Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results.[4] They are thought to acquire embodied knowledge about syntax, semantics and "ontology" inherent in human language corpora, but also inaccuracies and biases present in the corpora.

Notable examples include OpenAI's GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT), Google's PaLM (used in Bard), and Meta's LLaMa, as well as BLOOM, Ernie 3.0 Titan, and Anthropic's Claude 2."""

In [33]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/bert-finetuned-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)

question = "What is a large language model?"
question_answerer(question=question, context=context)

{'score': 0.10473547875881195, 'start': 23, 'end': 27, 'answer': '(LLM'}
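
The `start` and `end` fields in the pipeline output are character offsets into `context`, so slicing the context with them reproduces `answer`. A quick check on the first sentence of the context, using the result printed above:

```python
# The first sentence of the context is enough to cover offsets 23..27.
context_prefix = "A large language model (LLM) is a type of language model"
result = {"score": 0.10473547875881195, "start": 23, "end": 27, "answer": "(LLM"}

extracted = context_prefix[result["start"]:result["end"]]
print(extracted)  # (LLM
```

The low score and the truncated span `(LLM` both suggest the model is not confident about this question, which asks for a definition rather than a short factual span.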

In [36]:
from transformers import pipeline

# Our fine-tuned MiniLM checkpoint, pushed to the Hub above
mymodel_checkpoint = "josueadin/miniLM-finetuned-squad"
question_answerer = pipeline("question-answering", model=mymodel_checkpoint)

question = "How does a large language model work?"
question_answerer(question=question, context=context)

{'score': 0.07125090062618256,
 'start': 531,
 'end': 603,
 'answer': 'by taking an input text and repeatedly predicting the next token or word'}