
# Question Answering with Transformers
**Goal:** Build and evaluate a Question Answering (QA) system using pretrained transformer models (like BERT, DistilBERT, RoBERTa, ALBERT).  
This notebook supports:  
- Loading SQuAD v1.1-style data or a CSV with `context`, `question`, `answers` columns.  
- Running inference with pretrained QA models to extract answer spans.  
- Evaluating using Exact Match (EM) and F1 (token-level) metrics.  
- Comparing two models side-by-side.  
- Fine-tuning a model on your dataset using the Hugging Face `Trainer`.  
- Simple interactive interfaces: CLI and Streamlit examples.

**Note:** The notebook requires the `transformers`, `datasets`, and `torch` libraries (`evaluate` is optional). Inference runs on CPU, as the outputs below show, but fine-tuning is only practical on a GPU.


## 1) Install dependencies

In [None]:

# Uncomment if you need to install packages in your environment
# %pip install transformers datasets evaluate pandas torch sentencepiece streamlit
# %pip install git+https://github.com/huggingface/transformers


## 2) Imports & Setup

In [2]:

import os, json, textwrap, warnings, math
from typing import List, Dict, Tuple, Optional
import numpy as np
import pandas as pd

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline, TrainingArguments, Trainer, default_data_collator
from datasets import load_dataset, Dataset, DatasetDict
# Note: `load_metric` was moved to the `evaluate` package. Use `import evaluate; evaluate.load("squad")` if needed.

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", DEVICE)


Device: cpu


## 3) Load SQuAD or custom CSV

In [None]:

# This cell tries to load local SQuAD-format JSON or a CSV with columns: context, question, answers
def load_local_squad(json_path="train-v1.1.json"):
    if not os.path.exists(json_path):
        return None
    # Minimal parser for SQuAD v1.1 structure
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    rows = []
    for article in data.get('data', []):
        for para in article.get('paragraphs', []):
            context = para.get('context', '')
            for qa in para.get('qas', []):
                q = qa.get('question', '')
                answers = [a['text'] for a in qa.get('answers', [])] if 'answers' in qa else []
                rows.append({'context': context, 'question': q, 'answers': answers})
    return pd.DataFrame(rows)

def load_dataset_csv(csv_path="squad_style.csv"):
    if not os.path.exists(csv_path):
        return None
    df = pd.read_csv(csv_path)
    # Expect columns: context, question, answers (answers may be JSON list or string)
    if 'answers' in df.columns:
        # ensure list type
        def parse_answers(x):
            if isinstance(x, str):
                try:
                    v = json.loads(x)
                    if isinstance(v, list): return v
                except Exception:
                    return [x]
            if isinstance(x, list):
                return x
            return [str(x)]
        df['answers'] = df['answers'].apply(parse_answers)
    else:
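        # One independent empty list per row (avoids every row sharing the same list object)
        df['answers'] = [[] for _ in range(len(df))]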
    return df[['context','question','answers']]

# Try local SQuAD, else CSV, else attempt to download small SQuAD via datasets.
# `a or b` cannot be used here: a DataFrame has no unambiguous truth value.
df = load_local_squad("./QA/train-v1.1.json")
if df is None:
    df = load_dataset_csv("squad_style.csv")
if df is None:
    print("No local SQuAD/CSV found. Attempting to load 'squad' via the datasets library (internet required).")
    try:
        ds = load_dataset("squad")
        # Convert a small subset to pandas. Note: each `answers` value here is a dict of
        # the form {'text': [...], 'answer_start': [...]}, not a plain list of strings.
        train = ds['train'].select(range(10000))  # small subset for quick runs
        df = pd.DataFrame({'context': train['context'], 'question': train['question'], 'answers': train['answers']})
        print("Loaded SQuAD subset via datasets (10000 samples).")
    except Exception as e:
        raise FileNotFoundError("No dataset found locally and failed to download SQuAD. Please provide 'train-v1.1.json' or 'squad_style.csv'.\nError: " + str(e))

print("Samples loaded:", len(df))
df.head()


No local SQuAD/CSV found. Attempting to load 'squad' via the datasets library (internet required).
Loaded SQuAD subset via datasets (10000 samples).
Samples loaded: 10000


| | context | question | answers |
|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


## 4) Helper: Exact Match & F1 scoring (SQuAD-style)

In [4]:

# SQuAD evaluation helpers (normalize answer & compute EM/F1)
import string, re
from collections import Counter

_PUNCTUATION = set(string.punctuation)

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)
    def white_space_fix(text):
        return ' '.join(text.split())
    def remove_punc(text):
        return ''.join(ch for ch in text if ch not in _PUNCTUATION)
    def lower(text):
        return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_score(pred, truth):
    """Token-level F1 between a prediction and one gold answer."""
    pred_tokens = normalize_answer(pred).split()
    truth_tokens = normalize_answer(truth).split()
    # If either side is empty, F1 is 1 only when both are empty (as in the SQuAD v2 script)
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match_score(pred, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(pred) == normalize_answer(truth))
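
As a quick sanity check of the metric definitions, the sketch below recomputes EM and F1 for one hand-picked prediction. It re-implements the normalization compactly so that it runs on its own; the names `normalize` and `token_f1` and the example strings are illustrative, not taken from a model run.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(pred: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens, truth_tokens = normalize(pred).split(), normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

pred, truth = "the Saint Bernadette", "Saint Bernadette Soubirous"
em = int(normalize(pred) == normalize(truth))
f1 = token_f1(pred, truth)
print(em, round(f1, 2))  # 0 0.8
```

The prediction misses one of three gold tokens, so precision is 2/2, recall is 2/3, and F1 is 0.8, while EM is 0 because the normalized strings differ.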


## 5) Load two QA models for comparison

In [11]:

# Try to load two QA models (priority list). The notebook will use the first two available.
CANDIDATE_QA_MODELS = [
    "distilbert-base-uncased-distilled-squad",
    "ktrapeznikov/albert-xlarge-v2-squad-v2",
    "bert-large-uncased-whole-word-masking-finetuned-squad",
    "deepset/roberta-base-squad2"
]

loaded_models = []
for name in CANDIDATE_QA_MODELS:
    try:
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForQuestionAnswering.from_pretrained(name).to(DEVICE)
        loaded_models.append((name, tok, model))
        print("Loaded:", name)
    except Exception as e:
        print("Could not load", name, ":", e)
    if len(loaded_models) >= 2:
        break

if not loaded_models:
    raise RuntimeError("No QA models could be loaded. Check your internet connection and the candidate model names.")
loaded_models_names = [m[0] for m in loaded_models]
loaded_models_names


Loaded: distilbert-base-uncased-distilled-squad


Some weights of the model checkpoint at ktrapeznikov/albert-xlarge-v2-squad-v2 were not used when initializing AlbertForQuestionAnswering: ['albert.pooler.bias', 'albert.pooler.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loaded: ktrapeznikov/albert-xlarge-v2-squad-v2


['distilbert-base-uncased-distilled-squad',
 'ktrapeznikov/albert-xlarge-v2-squad-v2']

## 6) Inference: extract answer span for (context, question)

In [13]:

from transformers import pipeline

# For convenience, build HF pipelines (handles tokenization & span extraction)
qa_pipelines = {}
for name, tok, mdl in loaded_models:
    try:
        qa_pipelines[name] = pipeline("question-answering", model=mdl, tokenizer=tok, device=0 if DEVICE=='cuda' else -1)
    except Exception as e:
        # No manual fallback is implemented; mark this model as unavailable
        print("Pipeline creation failed for", name, ":", e)
        qa_pipelines[name] = None

def answer_with_pipeline(pipeline_obj, context, question):
    """Return (answer_text, raw_output); ("", {}) if no pipeline is available."""
    if pipeline_obj is None:
        # Always return a 2-tuple so callers can unpack `ans, raw = ...`
        return "", {}
    out = pipeline_obj(question=question, context=context, top_k=1)
    # pipeline may return dict or list
    if isinstance(out, list):
        out = out[0] if out else {}
    return out.get('answer', ''), out

# Quick inference on a small subset
sample = df.iloc[0]
print("Context (first 5000 chars):\n", sample['context'][:5000])
print("\nQuestion:", sample['question'])
for name, pipe in qa_pipelines.items():
    ans, raw = answer_with_pipeline(pipe, sample['context'], sample['question'])
    print(f"Model: {name} | Answer: {ans}")


Device set to use cpu
Device set to use cpu


Context (first 5000 chars):
 Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Model: distilbert-base-uncased-distilled-squad | Answer: Saint Bernadette Soubirous
Model: ktrapeznikov/albert-xlarge-v2-squad-v2 | Answer:  Saint Bernadette Soubirous
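
For intuition about what span extraction involves, here is a simplified sketch of how a start/end pair can be chosen from per-token scores. It is not the pipeline's actual implementation: the function name `best_span`, the toy logits, and the `max_answer_len` limit are assumptions made for illustration.

```python
from typing import List, Tuple

def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_logit + end_logit,
    subject to start <= end and a maximum span length in tokens."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

tokens = ["appeared", "to", "saint", "bernadette", "soubirous", "in", "1858"]
start_logits = [0.1, 0.2, 4.0, 1.0, 0.3, 0.1, 0.5]  # toy values, not model output
end_logits = [0.0, 0.1, 0.5, 1.5, 4.5, 0.2, 1.0]    # toy values, not model output

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # saint bernadette soubirous
```

The `start <= end` constraint matters: taking the two argmax positions independently can yield an end that precedes the start, which is not a valid span.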


## 7) Batch inference & evaluation (EM / F1) on dataset subset

In [None]:

# Run on a subset for speed; increase for full evaluation
N = min(500, len(df))  # adjust for compute/time
subset = df.iloc[:N].reset_index(drop=True)

results = []
for name, pipe in qa_pipelines.items():
    print("Evaluating model:", name)
    em_total = 0.0
    f1_total = 0.0
    count = 0
    preds = []
    for i, row in subset.iterrows():
        context = row['context']
        question = row['question']
        ans, raw = answer_with_pipeline(pipe, context, question)
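        # Determine gold answer texts. Rows loaded via `datasets` store answers as a dict
        # {'text': [...], 'answer_start': [...]}; local JSON/CSV rows store a list.
        golds = row['answers']
        if isinstance(golds, dict):
            golds = golds.get('text', [])
        elif not isinstance(golds, (list, tuple)):
            golds = [] if golds is None else [golds]
        golds = [g['text'] if isinstance(g, dict) else str(g) for g in golds]
        # If no golds (e.g., CSV missing), skip metric computation
        if not golds:
            continue
        # Compare to all golds, take max
        em = max(exact_match_score(ans, g) for g in golds)
        f1 = max(f1_score(ans, g) for g in golds)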
        em_total += em
        f1_total += f1
        count += 1
        preds.append({'idx': i, 'pred': ans, 'gold': golds})
    avg_em = em_total / max(1, count)
    avg_f1 = f1_total / max(1, count)
    results.append({'model': name, 'EM': avg_em, 'F1': avg_f1, 'count': count})
    print(f"Done {name} -> EM: {avg_em:.4f} | F1: {avg_f1:.4f} on {count} samples")

pd.DataFrame(results).sort_values('F1', ascending=False)
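
Because `answers` arrives in different shapes depending on the loader (a list of strings from the local JSON/CSV loaders, a dict with a `text` list from `datasets`), it can help to normalize it once before scoring. The helper below is a sketch; `gold_texts` is a hypothetical name that is not part of the notebook, and the sample inputs are made up.

```python
from typing import Any, List

def gold_texts(answers: Any) -> List[str]:
    """Flatten the supported answer formats into a plain list of answer strings."""
    if answers is None:
        return []
    if isinstance(answers, dict):            # {'text': [...], 'answer_start': [...]}
        texts = answers.get("text", [])
        return [texts] if isinstance(texts, str) else [str(t) for t in texts]
    if isinstance(answers, str):             # single plain string
        return [answers]
    # Otherwise: a list of strings or a list of {'text': ...} dicts
    return [str(a["text"]) if isinstance(a, dict) else str(a) for a in answers]

print(gold_texts({"text": ["Paris"], "answer_start": [0]}))  # ['Paris']
print(gold_texts(["in 1858", {"text": "1858"}]))             # ['in 1858', '1858']
```

Iterating directly over the dict form would yield its keys (`'text'`, `'answer_start'`) rather than the answers, which is why the dict case is handled first.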


## 8) Inspect predictions and failure cases

In [14]:

# Show the first n examples with every model's prediction next to the gold answer
def show_examples(subset, qa_pipelines, n=5):
    for i in range(min(n, len(subset))):
        row = subset.iloc[i]
        print("==== Example", i, "====")
        print("Question:", row['question'])
        print("Context (snippet):", row['context'][:1000])
        for name, pipe in qa_pipelines.items():
            ans, raw = answer_with_pipeline(pipe, row['context'], row['question'])
            print(f"{name} -> {ans}")
        print("Gold:", row['answers'])
        print()

show_examples(subset, qa_pipelines, n=5)


==== Example 0 ====
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Context (snippet): Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
distilbert-base-uncased-distilled-squad -> Saint Bernadette Soubirous
ktrapeznikov/albert-xlarge-v2-squad-v2 ->  Saint Bernadette Soubirous
Gold: {'text': ['Saint Bernadette Soubirous
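
To focus on cases where the models actually disagree, predictions can be compared after light normalization. The sketch below assumes predictions have already been collected into a dict that maps each model name to a list of answers (one per example); `find_disagreements` and the sample data are hypothetical.

```python
from typing import Dict, List

def _norm(text: str) -> str:
    # Light normalization so casing and stray whitespace do not count as disagreement
    return " ".join(text.lower().split())

def find_disagreements(preds_by_model: Dict[str, List[str]]) -> List[int]:
    """Return indices of examples where the models do not all give the same answer."""
    columns = list(preds_by_model.values())
    return [
        i for i, answers in enumerate(zip(*columns))
        if len({_norm(a) for a in answers}) > 1
    ]

# Made-up predictions for illustration only
preds_by_model = {
    "model_a": ["Saint Bernadette Soubirous", "a copper statue", "the Main Building"],
    "model_b": [" Saint Bernadette Soubirous", "a copper statue of Christ", "the main building"],
}
print(find_disagreements(preds_by_model))  # [1]
```

The returned indices can then be passed to a display loop so that only the interesting rows are printed.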

## 9) Fine-tuning template using Trainer

In [16]:

# We'll use the Hugging Face datasets/dataloader conventions to prepare features for QA

from transformers import AutoModelForQuestionAnswering, AutoTokenizer
from datasets import Dataset

def _extract_first_answer(ans_obj):
    """Return (answer_text, answer_start) from various answer formats.
    Supports:
    - dict with 'text' (str or list) and optional 'answer_start' (int or list)
    - list of dicts or strings
    - plain string
    Falls back to a start of -1 if none is provided; the caller may then locate it via .find().
    """
    if ans_obj is None:
        return "", -1
    # SQuAD-style dict
    if isinstance(ans_obj, dict):
        texts = ans_obj.get("text", [])
        starts = ans_obj.get("answer_start", [])
        # Normalize to first element
        if isinstance(texts, list):
            text0 = texts[0] if texts else ""
        else:
            text0 = str(texts)
        if isinstance(starts, list):
            start0 = starts[0] if starts else -1
        else:
            start0 = int(starts) if isinstance(starts, int) else -1
        return text0, start0
    # List of answers (strings or dicts)
    if isinstance(ans_obj, list):
        if not ans_obj:
            return "", -1
        first = ans_obj[0]
        if isinstance(first, dict):
            text0 = first.get("text", "")
            start0 = first.get("answer_start", -1)
            if isinstance(text0, list):
                text0 = text0[0] if text0 else ""
            if isinstance(start0, list):
                start0 = start0[0] if start0 else -1
            return str(text0), int(start0) if isinstance(start0, int) else -1
        return str(first), -1
    # Plain string
    if isinstance(ans_obj, str):
        return ans_obj, -1
    return str(ans_obj), -1

def prepare_train_features(examples, tokenizer, max_length=384, doc_stride=128):
    # Tokenize our examples with truncation and maybe sliding window
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )
    # Map features to examples
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        sample_index = sample_mapping[i]
        context = examples["context"][sample_index]
        answers = examples["answers"][sample_index]

        # Extract first answer text/start (robust to formats)
        answer_text, answer_start = _extract_first_answer(answers)
        if not isinstance(context, str):
            context = str(context)
        if not isinstance(answer_text, str):
            answer_text = str(answer_text)

        if not answer_text:
            # No answer provided
            start_positions.append(0)
            end_positions.append(0)
            continue

        # If start not provided, locate via substring search
        if not isinstance(answer_start, int) or answer_start < 0:
            answer_start = context.find(answer_text)
        if answer_start < 0:
            # Could not find answer in context
            start_positions.append(0)
            end_positions.append(0)
            continue

        answer_end = answer_start + len(answer_text)

        # Identify the tokens that make up the context
        sequence_ids = tokenized_examples.sequence_ids(i)
        # Find start and end of the context in tokenized input
        token_start_index = 0
        while token_start_index < len(sequence_ids) and sequence_ids[token_start_index] != 1:
            token_start_index += 1
        token_end_index = len(sequence_ids) - 1
        while token_end_index >= 0 and sequence_ids[token_end_index] != 1:
            token_end_index -= 1

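        # If no context tokens were found, or the answer is not fully inside this
        # feature's context window (possible with overflowing windows), label as (0, 0)
        if (token_start_index >= len(offsets) or token_end_index < 0
                or offsets[token_start_index][0] > answer_start
                or offsets[token_end_index][1] < answer_end):
            start_positions.append(0)
            end_positions.append(0)
            continue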

        # Move token_start_index to the right to the first token that starts after answer_start
        while token_start_index < len(offsets) and offsets[token_start_index] and offsets[token_start_index][0] <= answer_start:
            token_start_index += 1
        token_start_index -= 1

        # Move token_end_index to the left to the last token that ends before answer_end
        while token_end_index >= 0 and offsets[token_end_index] and offsets[token_end_index][1] >= answer_end:
            token_end_index -= 1
        token_end_index += 1

        # Guard against invalid indices
        if token_start_index < 0 or token_end_index < 0 or token_start_index >= len(offsets) or token_end_index >= len(offsets) or token_start_index > token_end_index:
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_positions.append(token_start_index)
            end_positions.append(token_end_index)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions
    return tokenized_examples

# Example usage: convert pandas df subset to datasets.Dataset
if 'df' in globals():
    # Use a subset of the data to keep the demo fast
    demo = df.iloc[:10000].reset_index(drop=True).copy()
    # Keep 'answers' as-is; the extractor handles dict/list formats
    ds = Dataset.from_pandas(demo)
    # select tokenizer and model for fine-tuning
    model_name = loaded_models[0][0]
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name).to(DEVICE)
    tokenized = ds.map(lambda ex: prepare_train_features(ex, tokenizer), batched=True, remove_columns=ds.column_names)
    tokenized.set_format(type="torch")
    print("Prepared tokenized features for training.")


Map: 100%|██████████| 10000/10000 [00:05<00:00, 1683.04 examples/s]

Prepared tokenized features for training.
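
The core of `prepare_train_features` is converting a character-level answer span into token indices via the offset mapping. The toy example below illustrates that step with a whitespace tokenizer standing in for the real one; `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers, and real subword tokenizers produce different offsets.

```python
import re
from typing import List, Optional, Tuple

def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """(start_char, end_char) for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_span_to_token_span(offsets: List[Tuple[int, int]],
                            answer_start: int,
                            answer_end: int) -> Optional[Tuple[int, int]]:
    """Return (first_token, last_token) covering [answer_start, answer_end), or None."""
    covering = [i for i, (s, e) in enumerate(offsets)
                if s < answer_end and e > answer_start]  # token overlaps the answer
    return (covering[0], covering[-1]) if covering else None

context = "The Grotto is a replica of the grotto at Lourdes, France."
answer = "Lourdes, France"
answer_start = context.find(answer)
offsets = whitespace_offsets(context)
span = char_span_to_token_span(offsets, answer_start, answer_start + len(answer))
print(span, [context[s:e] for s, e in offsets[span[0]:span[1] + 1]])
```

When no token overlaps the answer, the helper returns `None`; this corresponds to the case that the notebook labels as `(0, 0)`.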





## 10) Interactive interfaces: CLI & Streamlit

In [19]:

# CLI example: ask a question given a context in Python
def answer_cli(context: str, question: str, model_name=None):
    if model_name is None:
        model_name = list(qa_pipelines.keys())[0]
    pipe = qa_pipelines[model_name]
    ans, raw = answer_with_pipeline(pipe, context, question)
    print("Answer:", ans)
    return ans

# Streamlit app (written to QA_App.py below; run with: streamlit run QA_App.py)
streamlit_app = r"""
import streamlit as st
from transformers import pipeline
import torch

st.set_page_config(page_title="QA App")
st.title('Question Answering')
model_name = st.sidebar.selectbox('Model', %s, key='model_select_sidebar')

if st.button('Load model'):
    device_index = 0 if torch.cuda.is_available() else -1
    # Load the model chosen in the sidebar and keep it across reruns
    pipe = pipeline('question-answering', model=model_name, tokenizer=model_name, device=device_index)
    st.session_state['pipe'] = pipe
    st.write('Model loaded:', model_name)

context = st.text_area('Context', height=300)
question = st.text_input('Question')
if st.button('Answer'):
    if 'pipe' not in st.session_state:
        st.warning('Load a model first.')
    elif not context.strip() or not question.strip():
        st.warning('Enter both a context and a question.')
    else:
        out = st.session_state['pipe'](question=question, context=context)
        st.write('Answer:', out.get('answer'))
""" % (loaded_models_names,)

with open("QA_App.py","w",encoding="utf-8") as f:
    f.write(streamlit_app)

print("Wrote Streamlit example to QA_App.py")


Wrote Streamlit example to QA_App.py


## 11) Save predictions and results

In [29]:

# Save sample predictions from first model to CSV
if 'subset' in globals():
    model0 = list(qa_pipelines.keys())[0]
    preds = []
    for i, row in subset.iterrows():
        ans, raw = answer_with_pipeline(qa_pipelines[model0], row['context'], row['question'])
        preds.append({'index': i, 'question': row['question'], 'prediction': ans, 'gold': row['answers']})
    out_df = pd.DataFrame(preds)
    out_df.to_csv('qa_predictions_sample.csv', index=False)
    print('Saved qa_predictions_sample.csv')
else:
    print('No subset available to save predictions.')


Saved qa_predictions_sample.csv


## 12) Summary & Next Steps


### Summary
- This notebook demonstrates loading SQuAD-style data, running pretrained QA models, and computing EM/F1 metrics.
- It prepares tokenized training features as the basis for fine-tuning with `Trainer`, and includes simple interactive CLI/Streamlit examples.
- For production: consider model quantization, server deployment (FastAPI), batching, and caching tokenizers/models for low latency.
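
One way to read the caching suggestion above is to load each model at most once per process and reuse it across requests. The sketch uses `functools.lru_cache` with a stand-in loader; `load_qa_model` and its dummy return value are placeholders for a real `pipeline(...)` call.

```python
from functools import lru_cache

LOAD_CALLS = []  # records how often the expensive load actually runs

@lru_cache(maxsize=2)
def load_qa_model(model_name: str) -> dict:
    """Placeholder for an expensive load such as building a QA pipeline."""
    LOAD_CALLS.append(model_name)
    return {"name": model_name}  # a real implementation would return the pipeline object

first = load_qa_model("distilbert-base-uncased-distilled-squad")
second = load_qa_model("distilbert-base-uncased-distilled-squad")
print(first is second, len(LOAD_CALLS))  # True 1
```

The `maxsize` bound keeps memory in check when several large models are requested over time; its value here is arbitrary.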

### Next steps / improvements
- Fine-tune on the full SQuAD dataset (requires GPU and time).
- Add better handling of multiple gold answers and answer_start indices when preparing training data.
- Use `evaluate` library's squad metric for robust evaluation and comparison.
- Serve as a REST API with FastAPI + model pooling for scale.
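
For the fine-tuning step, a `Trainer` setup might look roughly like the following. This is a sketch under assumptions: the hyperparameters are common starting points rather than tuned values, `build_trainer` and the `qa-finetuned` output directory are hypothetical, and `model` and `tokenized` are expected to come from the feature-preparation cell in section 9. The import sits inside the function so that defining it does not load the library.

```python
def build_trainer(model, train_dataset, output_dir: str = "qa-finetuned"):
    """Create a Trainer for extractive QA from a model and tokenized features."""
    # Imported here so that defining the function has no heavy side effects
    from transformers import Trainer, TrainingArguments, default_data_collator

    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=3e-5,              # assumed starting point, not tuned
        per_device_train_batch_size=16,  # reduce if GPU memory is limited
        num_train_epochs=2,
        weight_decay=0.01,
        logging_steps=50,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=default_data_collator,
    )

# Usage (not run here, since training needs a GPU and time):
# trainer = build_trainer(model, tokenized)
# trainer.train()
# trainer.save_model("qa-finetuned")
```

Because the features are already padded to `max_length`, the default collator is sufficient; no evaluation set is configured in this sketch.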
