📌 **Project Overview**

🧠 **Project Title**

**Question Answering System using BERT (SQuAD Dataset)**

---

🎯 **Project Goal**

The goal of this project is to build a **Question Answering (QA) system** using a pre-trained BERT model.

The model learns how to:

* Read a paragraph (context)
* Understand a question
* Find the correct answer inside the paragraph

---

📂 **Dataset**

We use the **SQuAD dataset**.

Each example contains:

* `context` → A paragraph of text
* `question` → A question about the paragraph
* `answers` → The correct answer text and its position
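
As a minimal sketch of this layout: the record below is a toy one built from a single sentence of the first training example, and `answer_start` is computed here with `str.find` rather than copied from the dataset. The helper name `check_answer_span` is hypothetical. The point it illustrates is that `answer_start` is a character offset into `context`, so slicing the context at that offset should reproduce the answer text.

```python
# Toy record in the SQuAD layout. The context is one sentence from the first
# training example; answer_start is computed here, not copied from the dataset.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"

example = {
    "context": context,
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {"text": [answer_text], "answer_start": [context.find(answer_text)]},
}


def check_answer_span(record):
    """Return True if slicing the context at answer_start yields the answer text."""
    start = record["answers"]["answer_start"][0]
    text = record["answers"]["text"][0]
    return record["context"][start:start + len(text)] == text


print(check_answer_span(example))
```

Both `text` and `answer_start` are lists because the SQuAD format allows several reference answers per question.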

---

⚙️ **Project Steps**

 - 1️⃣ Load Dataset

   - We load the first 1,000 training examples of SQuAD.

 - 2️⃣ Tokenization

   - We use the BERT tokenizer to convert text into token IDs (`input_ids`).
   - We also handle long texts using:

     * Truncation
     * Stride
     * Offset mapping

 - 3️⃣ Preprocessing

   - We calculate:

     * `start_positions`
     * `end_positions`

   - These are the token indices where the correct answer starts and ends inside the tokenized input.

 - 4️⃣ Model

   - We use the `bert-base-uncased` checkpoint.
   - Specifically, we load it with `AutoModelForQuestionAnswering`.
   - This model predicts:

     * The start token of the answer
     * The end token of the answer

 - 5️⃣ Training

   - We train the model using the Hugging Face `Trainer`.

 - 6️⃣ Evaluation

   - We show how to use the SQuAD metric on a toy example (the notebook does not run a full evaluation of the trained model):

     * Exact Match (EM)
     * F1 Score

 - 7️⃣ Inference

   - We use a pipeline to test the model on new text.
   - The model returns:

     * The predicted answer
     * A confidence score
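
To make the "start token / end token" idea concrete, here is a small sketch of one common way to turn per-token start and end scores into an answer span. The logits and tokens are made up for illustration, and the function name `best_span` and the `max_answer_len` cap are hypothetical; the actual pipeline used later may select spans differently.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = None
    for s, s_logit in enumerate(start_logits):
        # Only consider spans that end at or after the start and are not too long.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


# Made-up scores for a five-token input (illustration only).
tokens = ["[CLS]", "in", "27", "years", "[SEP]"]
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.0, 0.1, 0.4, 2.5, 0.2]

s, e, score = best_span(start_logits, end_logits)
print(tokens[s:e + 1], score)
```

Requiring `s <= e` rules out spans that end before they start, which independent argmax calls over the two score lists would not guarantee.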

---

🏁 **Final Result**

At the end of this project:

* We have a fine-tuned BERT QA model
* The model can answer questions from text
* The model can be saved and used in a backend system

---
📌 **About the Author**

 - **Name:** **Mohamed Mamdouh**
 - Student at the Faculty of **Artificial Intelligence**

 - [**LinkedIn**](https://www.linkedin.com/in/ai-mohamed-mamdouh-74043b331/)
 - [**GitHub**](https://github.com/ai-mohamed-mamdouh)
 - [**Kaggle**](https://www.kaggle.com/mohamed00mamdouh)


In [1]:
# =========================================================
# Project: Question Answering (QA) with BERT on SQuAD
# =========================================================
# Goal: Train a BERT model to answer questions from a context text.
# Input: (context, question, answer_start, answer_text)
# Output: Model predicts start_position and end_position of the answer in the context.

In [None]:
!pip install datasets evaluate

# **Step 1: Load dataset + check transformers version**

In [3]:
from datasets import load_dataset
import transformers
print(transformers.__version__)

# Goal: Load a small part of SQuAD dataset for training.
# Why: SQuAD is a famous QA dataset (question + context + answer).
dataset = load_dataset("squad", split="train[:1000]")

# Goal: Print one example to understand dataset columns.
for col in ['id', 'title', 'context', 'question', 'answers']:
    print(dataset[0][col])
    print('--------------------------------------------------')

5.0.0



5733be284776f41900661182
--------------------------------------------------
University_of_Notre_Dame
--------------------------------------------------
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
--------------------------------------------------
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
-----------------------------

In [4]:
print("Dataset structure:")
print(dataset)

print("\nColumns:")
print(dataset.column_names)

print("\nFirst example:")
print(dataset[0])

Dataset structure:
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 1000
})

Columns:
['id', 'title', 'context', 'question', 'answers']

First example:
{'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did 

# **Step 2: Load tokenizer**

In [5]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# Goal: Use BERT tokenizer to convert text -> token IDs.
# Why: Model needs numbers (input_ids), not raw text.
# Note: use_fast=True gives offset mapping (important for QA).




In [6]:
print("\nTokenizer loaded:")
print(tokenizer)

print("\nTokenizer vocab size:")
print(tokenizer.vocab_size)


Tokenizer loaded:
BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

Tokenizer vocab size:
30522
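
The special-token ids in the printout above (`[PAD]`=0, `[CLS]`=101, `[SEP]`=102) determine how a question/context pair is laid out for BERT. The following is a hand-written sketch of that layout, not the tokenizer's own code; `build_pair_inputs` is a hypothetical helper and the word ids are the first few ids that appear in this notebook's tokenized output.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids from the tokenizer printout


def build_pair_inputs(question_ids, context_ids, max_length):
    """Lay out [CLS] question [SEP] context [SEP] and pad on the right."""
    input_ids = [CLS_ID] + question_ids + [SEP_ID] + context_ids + [SEP_ID]
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers context + last [SEP].
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    attention_mask = [1] * len(input_ids)
    pad = max_length - len(input_ids)
    if pad < 0:
        raise ValueError("pair is longer than max_length; it would need truncation")
    return {
        "input_ids": input_ids + [PAD_ID] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,
    }


features = build_pair_inputs([2000, 3183], [6549, 2135, 1010], max_length=10)
for key, value in features.items():
    print(key, value)
```

The `attention_mask` zeros tell the model to ignore the padding positions.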


# **Step 3: Preprocess function (tokenize + create labels)**

In [7]:
def preprocess_function(examples):
    # Goal: Clean questions (remove extra spaces).
    questions = [q.strip() for q in examples["question"]]

    # Goal: Tokenize (question, context) together.
    # Why: QA model needs both question and context as input.
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,                # Goal: limit max tokens
        truncation="only_second",      # Goal: cut only the context if too long
        stride=128,                    # Goal: overlap pieces when context is long
        return_overflowing_tokens=True,# Goal: create many chunks for long context
        return_offsets_mapping=True,   # Goal: keep (start_char, end_char) for each token
        padding="max_length",          # Goal: make same length for batch
    )

    # Goal: Get offset mapping then remove it from inputs.
    # Why: We use it to find answer start/end, then we don't need it in final dataset.
    offset_mapping = inputs.pop("offset_mapping")

    # Goal: Map each chunk back to original example index.
    # Why: One example can produce many chunks because of stride.
    sample_map = inputs.pop("overflow_to_sample_mapping")

    answers = examples["answers"]
    start_positions = []
    end_positions = []

    # Goal: For each chunk, find the answer token start/end positions.
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]        # which original example
        answer = answers[sample_idx]      # answers for that example

        # Goal: If there is no answer (rare), set 0,0
        if len(answer["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Goal: Get answer start and end in character level.
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        # Goal: Know which tokens are question (0) and which are context (1)
        sequence_ids = inputs.sequence_ids(i)

        # Step A: Find where context starts in tokens
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx

        # Step B: Find where context ends in tokens
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # Goal: Check if the answer is inside this chunk.
        # Why: Some chunks do not contain the answer because of truncation.
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Goal: Find token index for answer start
            curr = context_start
            while curr <= context_end and offset[curr][0] <= start_char:
                curr += 1
            start_positions.append(curr - 1)

            # Goal: Find token index for answer end
            curr = context_end
            while curr >= context_start and offset[curr][1] >= end_char:
                curr -= 1
            end_positions.append(curr + 1)

    # Goal: Add QA labels to inputs
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions

    print("\nInside preprocess:")
    print("Keys:", inputs.keys())
    print("Number of chunks (features):", len(inputs["input_ids"]))
    print("start_positions example:", start_positions[:5])
    print("end_positions example:", end_positions[:5])

    return inputs
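
The two scanning loops in `preprocess_function` can be tried in isolation. The sketch below uses a toy whitespace tokenizer (an assumption made so the example runs without downloading anything; BERT's WordPiece tokenizer splits text differently) and the hypothetical helper names `whitespace_offsets` and `char_span_to_token_span`. It shows how a character span is mapped to inclusive token indices through the offsets.

```python
import re


def whitespace_offsets(text):
    """Toy tokenizer: one token per whitespace-separated word, as (start_char, end_char)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, mirroring the two scans above."""
    # Forward scan: step past every token that starts at or before the answer start.
    curr = 0
    while curr < len(offsets) and offsets[curr][0] <= start_char:
        curr += 1
    start_token = curr - 1

    # Backward scan: step past every token that ends at or after the answer end.
    curr = len(offsets) - 1
    while curr >= 0 and offsets[curr][1] >= end_char:
        curr -= 1
    end_token = curr + 1
    return start_token, end_token


context = "It took around 27 years to complete."
answer = "27 years"
start_char = context.find(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
start_token, end_token = char_span_to_token_span(offsets, start_char, end_char)
words = [context[s:e] for s, e in offsets]
print(start_token, end_token, words[start_token:end_token + 1])
```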

# **Step 4: Apply preprocessing on dataset**

In [8]:
tokenized_datasets = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset.column_names
)

# Goal: Create training dataset with fields:
# input_ids, attention_mask, token_type_ids (maybe), start_positions, end_positions
# Why: Trainer needs these tensors to train QA model.



Inside preprocess:
Keys: KeysView({'input_ids': [[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 853

In [9]:
print("\nTokenized dataset:")
print(tokenized_datasets)

print("\nFirst tokenized example:")
print(tokenized_datasets[0])

print("\ninput_ids length:")
print(len(tokenized_datasets[0]["input_ids"]))


Tokenized dataset:
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 1032
})

First tokenized example:
{'input_ids': [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595,
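
The tokenized dataset has 1,032 rows although only 1,000 examples were loaded, because contexts that do not fit in `max_length` are split into overlapping chunks. The sketch below is a simplified model of that windowing, not the tokenizer's actual implementation: it assumes three special tokens per chunk (`[CLS]` and two `[SEP]`) and that consecutive chunks share `stride` context tokens. The function name `context_windows` and the token counts in the example are made up.

```python
def context_windows(num_context_tokens, num_question_tokens,
                    max_length=384, stride=128, num_special=3):
    """Return (start, end) context-token ranges, one per chunk (end is exclusive)."""
    capacity = max_length - num_question_tokens - num_special
    if capacity <= stride:
        raise ValueError("stride must be smaller than the room left for context")
    windows = []
    start = 0
    while True:
        end = min(start + capacity, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # Move forward, keeping `stride` tokens of overlap with the previous chunk.
        start += capacity - stride
    return windows


print(context_windows(150, 15))  # short context: a single chunk
print(context_windows(500, 15))  # long context: two overlapping chunks
```

The overlap means an answer that falls near a chunk boundary is still likely to appear whole in at least one chunk.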

# **Step 5: Load QA model + define training args**

In [10]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Goal: Load BERT model for Question Answering.
# Why: This model outputs start logits and end logits.
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

# Goal: Set training settings (light training).
training_args = TrainingArguments(
    output_dir="./bert-squad",          # where to save outputs
    eval_strategy="no",                 # no evaluation during training (simple)
    learning_rate=2e-5,                 # good learning rate for BERT
    per_device_train_batch_size=8,      # batch size
    num_train_epochs=1,                 # train 1 epoch for test
    weight_decay=0.01,                  # regularization
)

# Goal: Create Trainer to run training loop.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
    # processing_class=tokenizer,       # optional; replaces the older tokenizer= argument
)

print("Start training ...")
trainer.train()

# Output: trained model in memory (and files in output_dir if saving enabled).


BertForQuestionAnswering LOAD REPORT from: bert-base-uncased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
bert.pooler.dense.bias                     | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
bert.pooler.dense.weight                   | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
qa_outputs.bias                            | MISSING    | 
qa_outputs.weight                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized beca

Start training ...



TrainOutput(global_step=129, training_loss=4.425042204154554, metrics={'train_runtime': 97.4324, 'train_samples_per_second': 10.592, 'train_steps_per_second': 1.324, 'total_flos': 202243689713664.0, 'train_loss': 4.425042204154554, 'epoch': 1.0})
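
The numbers in `TrainOutput` can be checked by hand. This small sketch assumes a single device and no gradient accumulation (neither is stated in the output); the row count comes from the tokenized dataset printed earlier and the batch size from `TrainingArguments`.

```python
import math

num_features = 1032      # rows in the tokenized dataset
batch_size = 8           # per_device_train_batch_size
num_epochs = 1           # num_train_epochs
train_runtime = 97.4324  # seconds, from TrainOutput

# One optimizer step per batch (assumes one device, no gradient accumulation).
steps = math.ceil(num_features / batch_size) * num_epochs
samples_per_second = num_features * num_epochs / train_runtime

print(steps)
print(round(samples_per_second, 3))
```

Under those assumptions the step count matches `global_step=129`, and the throughput agrees with the reported `train_samples_per_second`.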

In [11]:
print("\nModel loaded:")
print(model)
print("\nNumber of parameters:")
print(model.num_parameters())


Model loaded:
BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), 

# **Step 6: Example metric usage (NOT real evaluation)**

In [None]:
import evaluate

# Goal: Load SQuAD metric (Exact Match and F1 for QA).
metric = evaluate.load("squad")

# WARNING:
# This cell uses a made-up example (Paris) ONLY to show how the metric works.
# It does not evaluate the trained model.
predictions = [{'prediction_text': 'Paris', 'id': '1'}]
references = [{'answers': {'answer_start': [40], 'text': ['Paris']}, 'id': '1'}]

results = metric.compute(predictions=predictions, references=references)
print(f"Results: {results}")
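
To show what Exact Match and F1 measure, here is a hand-rolled sketch for a single prediction and a single reference. It follows the usual SQuAD-style normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace) but is not the `evaluate` implementation: it returns fractions between 0 and 1, whereas the `squad` metric reports percentages and takes the best score over several reference answers. The function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Count tokens shared by prediction and reference (with multiplicity).
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris."))
print(f1_score("Bernadette Soubirous", "Saint Bernadette Soubirous"))
```

F1 gives partial credit when the prediction overlaps the reference without matching it exactly, which EM does not.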

In [None]:
trainer.save_model("./our_model")
tokenizer.save_pretrained("./our_model")

# **Step 7: Inference with pipeline**

In [None]:
from transformers import pipeline

# IMPORTANT:
# The model and tokenizer must already be saved to "./our_model" (see the previous cell).

# Goal: Build QA pipeline for easy inference.
qa_pipeline = pipeline(
    "question-answering",
    model="./our_model",      # folder written by trainer.save_model(...)
    tokenizer="./our_model"   # folder written by tokenizer.save_pretrained(...)
)

#__________________________________________________________________________________
# Goal: Test with a long context
long_context = """
The Great Pyramid of Giza is the largest Egyptian pyramid and served as the tomb of pharaoh Khufu.
It was built in the early 26th century BC and took around 27 years to complete.
It is the oldest of the Seven Wonders of the Ancient World.
The pyramid's height was originally 146.6 meters, making it the tallest man-made structure in the world for over 3,800 years.
"""

question = "How long did it take to build the pyramid?"

#__________________________________________________________________________________

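# NOTE:
# pipeline("question-answering") takes question=... and context=...
# max_seq_len, doc_stride, and top_k are optional. If your installed
# transformers version rejects any of them, remove that argument.
# Without top_k, the pipeline returns a single dict instead of a list,
# so the printing loop below would need to be adjusted.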

# Goal: Run inference and get best answers.
results = qa_pipeline(
    question=question,
    context=long_context,
    max_seq_len=512,   # may not work in some versions
    doc_stride=128,    # may not work in some versions
    top_k=3            # may not work in some versions
)

# Goal: Print answers
print(f"--- Question: {question} ---")
for i, res in enumerate(results):
    print(f"Rank {i+1}:")
    print(f"   - Answer: {res['answer']}")
    print(f"   - Score: {round(res['score'], 4)}")
    print("-" * 20)

**Extra Notes**

---


**To use in a backend:**
- Save the model and tokenizer in the same folder:
  - `trainer.save_model("./our_model")`
  - `tokenizer.save_pretrained("./our_model")`

- Then load the pipeline from that folder:
  - `qa_pipeline = pipeline("question-answering", model="./our_model", tokenizer="./our_model")`
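
A backend usually wraps the pipeline call in a small function that validates input and filters weak answers. The sketch below is one possible shape for that wrapper, not part of the notebook: `answer_question` and `min_score` are hypothetical names, and `fake_qa` is a stand-in that returns a made-up result so the example runs without a trained model. In real use, `qa_pipeline` would be passed in its place, assuming it is called without `top_k` so that it returns a single dict.

```python
def answer_question(qa, question, context, min_score=0.1):
    """Call a QA callable and return its answer, or None if input or score is unusable."""
    question, context = question.strip(), context.strip()
    if not question or not context:
        return None
    result = qa(question=question, context=context)
    if result["score"] < min_score:
        return None
    return {"answer": result["answer"], "score": round(result["score"], 4)}


def fake_qa(question, context):
    """Stand-in for qa_pipeline with a made-up result (illustration only)."""
    return {"answer": "around 27 years", "score": 0.87, "start": 8, "end": 23}


print(answer_question(fake_qa, "How long did it take?", "It took around 27 years."))
print(answer_question(fake_qa, "   ", "It took around 27 years."))
```

Passing the QA callable as an argument keeps the wrapper easy to test without loading the model.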


✅ **Project Summary**

In this project, we built a **Question Answering model using BERT and SQuAD**.

The model learns to read a paragraph and answer questions by predicting the start and end of the correct answer.

We prepared the data, fine-tuned BERT for one epoch on 1,000 training examples, and showed how to run inference on a sample passage with a pipeline.

The final model can be saved and used in real applications like chatbots and AI systems 🚀
