A short project on the SQuAD v2.0 dataset.
- ✨ Trained on SQuAD v2.0 dataset
- 🔧 Question Answering with RoBERTa / BERT / ALBERT-v2
- 📦 Streamlit-based interactive QA demo
- About
- Demo
- Installation
- SQuAD v2.0 Statistics
- Preprocessing & Data Handling
- Model
- Experiments
- Next Steps
- References
This project demonstrates a Question Answering (QA) system trained on the SQuAD v2.0 dataset.
It can answer questions based on user-provided text and identify when no answer exists within the paragraph.
The project includes:
- A fine-tuned QA model
- A Streamlit web interface
- Easy-to-use inference pipeline (see the usage sketch below)
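Below is a minimal usage sketch of that inference pipeline, assuming a fine-tuned SQuAD v2.0 checkpoint has been saved locally (the path `models/albert-squad2` is a placeholder, not the repository's actual output path):

```python
from transformers import pipeline

# Placeholder checkpoint path; substitute the fine-tuned model produced by this project.
qa = pipeline("question-answering", model="models/albert-squad2")

context = (
    "The Normans were the people who in the 10th and 11th centuries "
    "gave their name to Normandy, a region in France."
)

# Answerable question: the answer span is extracted from the context.
print(qa(question="Which region did the Normans give their name to?", context=context))

# Unanswerable question: with handle_impossible_answer=True the pipeline can
# return an empty answer, signalling that no answer exists in the paragraph.
print(qa(
    question="What language did the Normans speak?",
    context=context,
    handle_impossible_answer=True,
))
```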
git clone https://github.com/dionvou/squad.git
cd squad
SQuAD v2.0 can be downloaded at:
🔗 https://rajpurkar.github.io/SQuAD-explorer/
Or, use the provided script included in the repository:
chmod +x download.sh
./download.sh
The SQuAD v2.0 dataset contains a mixture of answerable and unanswerable questions, which makes it more challenging than v1.1.
- Training set: 130,319 questions
- Answerable: 86,821 (~67%)
- Unanswerable: 43,498 (~33%)
- Development set: 11,873 questions
- Answerable: 5,928 (~50%)
- Unanswerable: 5,945 (~50%)
This dataset introduces unanswerable questions to train models to identify when no answer exists. To handle the varying lengths of contexts, questions, and answers during tokenization, we analyzed the distributions of token lengths using a BERT tokenizer.
The plot shows the number of BERT tokens for:
- Context: The full paragraph
- Question: Each question text
- Answer: Each answer span
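A minimal sketch of how these token-length statistics can be computed, assuming the Hugging Face `datasets` and `transformers` libraries (plotting is left out):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
squad = load_dataset("squad_v2", split="train")

# Token counts per field; unanswerable questions have an empty answer list.
context_lens = [len(tokenizer.tokenize(ex["context"])) for ex in squad]
question_lens = [len(tokenizer.tokenize(ex["question"])) for ex in squad]
answer_lens = [
    len(tokenizer.tokenize(ex["answers"]["text"][0]))
    for ex in squad
    if ex["answers"]["text"]
]

print(max(context_lens), max(question_lens), max(answer_lens))
```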
- Used Hugging Face tokenizers with `return_overflowing_tokens=True` to handle long contexts (see the sketch after this list)
- Split long paragraphs into smaller chunks to avoid truncation
- Initially, every chunk that did not contain an answer was labeled `0` (impossible)
- This created a heavy class imbalance and caused the model to overfit on predicting zeros
- To fix this:
  - Removed chunks of answerable questions that no longer contained the answer after splitting
  - Kept only the answer-containing chunks of answerable questions for training
- Result: a better balance between answerable and unanswerable examples and reduced model collapse
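A condensed sketch of this chunking and filtering step, assuming a Hugging Face fast tokenizer; the function and variable names are illustrative rather than the project's actual code:

```python
def prepare_chunks(examples, tokenizer, max_length=384, stride=128):
    # Tokenize question + context, letting long contexts overflow into several chunks.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    keep = []  # which chunks survive filtering
    sample_map = tokenized["overflow_to_sample_mapping"]
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answers = examples["answers"][sample_map[i]]
        if not answers["text"]:
            # Truly unanswerable question: keep the chunk, labeled impossible.
            keep.append(True)
            continue
        # Character span of the gold answer in the original context.
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        seq_ids = tokenized.sequence_ids(i)
        ctx_start = seq_ids.index(1)
        ctx_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)
        # Keep chunks of answerable questions only if their window contains
        # the answer; dropping the rest avoids flooding the data with zeros.
        keep.append(offsets[ctx_start][0] <= start_char and offsets[ctx_end][1] >= end_char)
    return keep
```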
We test the base variants of BERT, RoBERTa, ALBERT, DistilBERT, and SpanBERT.
To evaluate our models and select the best-performing architecture, we conducted a series of controlled experiments.
Due to time constraints, all evaluations were performed on a development split created from the original training set, using an 80% / 20% train–validation split.
After determining the strongest model, we retrained it on the full SQuAD training dataset.
All experiments employed early stopping with a patience of 3 epochs to prevent overfitting and reduce training time. The models were trained using a learning rate of 1e-5, a batch size of 64, a maximum sequence length of 384 tokens, and a document stride of 128.
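A minimal sketch of this setup with the Hugging Face `Trainer`; the checkpoint name, epoch cap, dataset variables, and output directory are assumptions, not the exact values used in the project:

```python
from transformers import (
    AutoModelForQuestionAnswering,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForQuestionAnswering.from_pretrained("albert-base-v2")

args = TrainingArguments(
    output_dir="qa-albert",              # assumed output directory
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    num_train_epochs=10,                 # assumed upper bound; early stopping halts sooner
    load_best_model_at_end=True,         # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_features,        # chunks built with max_length=384, stride=128
    eval_dataset=val_features,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```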
The results follow the trends observed in previous literature:
ALBERT consistently achieves the highest validation EM and F1 scores among the tested models.
Based on this outcome, we select ALBERT as the architecture for full fine-tuning on the complete dataset.
In future work, we aim to further enhance the performance of our QA system by implementing techniques from the paper “Retrospective Reader for Machine Reading Comprehension” (arXiv:2001.09694). This method combines a sketchy reading stage, which makes a coarse judgment of whether a question is answerable, with an intensive reading stage that verifies that judgment while predicting the answer span, which is particularly helpful for the unanswerable questions in SQuAD v2.0. By incorporating this approach, we hope to achieve state-of-the-art performance on SQuAD v2.0.
Additionally, we plan to explore model ensembling to boost overall accuracy.
- Know What You Don't Know: Unanswerable Questions for SQuAD
  Rajpurkar et al., 2018 – Introduces unanswerable questions in SQuAD 2.0 to improve model robustness.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Devlin et al., 2019 – Introduces BERT, a deeply bidirectional transformer for language understanding tasks.
- Question Answering on SQuAD 2.0: BERT Is All You Need
  Schwager et al., 2019 – Explores using BERT for SQuAD 2.0 and shows strong QA performance.
- Really Paying Attention: A BERT + BiDAF Ensemble Model for Question Answering
  Yin et al., 2019 – Combines BERT with BiDAF in an ensemble to enhance QA accuracy on SQuAD.



