
Python-based question answering system that uses pretrained BERT-based models from Hugging Face's transformers library.


Question Answering system using a pre-trained BERT-based model 💬

Question-answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document.
Some question-answering models can generate answers without context! (Hugging Face)

Types of Question Answering Models

  • Extractive Question Answering
  • Open Generative Question Answering
  • Closed Generative Question Answering

Our system is an Extractive Question Answering system: you provide a context and a question, and the model assumes that the answer lies inside the provided context.
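
For a quick feel of extractive QA in code, here is a minimal sketch using the Transformers pipeline API with the same model this project uses; the context and question strings are made up for illustration:

    from transformers import pipeline

    # Load an extractive QA pipeline with the model used in this project.
    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

    context = "SQuAD is a reading comprehension dataset created at Stanford University."
    question = "Where was SQuAD created?"

    # The pipeline returns the answer span it extracts from the context.
    result = qa(question=question, context=context)
    print(result["answer"])  # e.g. "Stanford University"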

QA-system interface using Gradio 🚀:

The interface offers two input modes:

  • Insert text
  • Upload Folder
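
The Gradio app code itself is not listed in the project structure below, so the following is only a minimal sketch of what the "Insert text" mode might look like; the answer_question wrapper and the interface layout are assumptions, and the folder-upload tab is omitted:

    import gradio as gr
    from transformers import pipeline

    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

    def answer_question(context, question):
        # Hypothetical wrapper: return only the extracted answer span.
        return qa(question=question, context=context)["answer"]

    demo = gr.Interface(
        fn=answer_question,
        inputs=[gr.Textbox(label="Context", lines=8), gr.Textbox(label="Question")],
        outputs=gr.Textbox(label="Answer"),
        title="Question Answering",
    )

    if __name__ == "__main__":
        demo.launch()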

Dataset Overview

The question-answering system in this project is evaluated using the Stanford Question Answering Dataset (SQuAD).
SQuAD is a widely used benchmark dataset for evaluating machine reading comprehension and question-answering systems.
The SQuAD dataset contains a diverse set of passages spanning a variety of topics and genres.

Each example in the dataset consists of the following components:

  • Context Paragraph: A passage of text that contains the information from which the answer can be extracted.
  • Question: A question related to the context, formulated to prompt the model to extract the relevant answer.
  • Answer Span: The exact span of text within the context paragraph that serves as the answer to the question.
The dataset can be found here: SQuAD link
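
A quick way to inspect these fields is to load the dataset with the datasets library pinned in this project's requirements; the split and row index below are arbitrary:

    from datasets import load_dataset

    # Load SQuAD from the Hugging Face Hub.
    squad = load_dataset("squad", split="validation")

    example = squad[0]
    print(example["context"])   # context paragraph
    print(example["question"])  # question about the paragraph
    print(example["answers"])   # {'text': [...], 'answer_start': [...]}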

Setup Instructions

Requirements:

  • Python >= 3.6 is recommended
  • Operating System: Windows

Python Packages Required:

  • datasets==2.14.4
  • numpy==1.24.4
  • pandas==2.0.3
  • tokenizers==0.13.3
  • torch==2.0.1
  • transformers
  • pytest

To use the question-answering system, follow these steps:

  1. Clone the repository:

    git clone https://github.com/geehaad/Question-Answering.git
    

    Then open cmd and go into the cloned directory:

    cd Question-Answering
    
  2. Create a virtual environment (replace venv with your virtual environment name):

    • Using conda, in CMD write:
      conda create -p venv python==3.8
      
  3. Activate the virtual environment:

    conda activate venv\

  4. Install the required packages:

    pip install -r requirements.txt

  5. Run the main script:

    python src/components/main.py

  6. The output is a CSV file named 'output' in your working directory.

  7. To run the test file:

    pytest src/tests/test_answer_questions.py


Model and Question Answering System

  • Model used:
    • Model name: 'distilbert-base-cased-distilled-squad' - a variant of the DistilBERT model that has been fine-tuned specifically on SQuAD. This model is designed to extract answers accurately from a given context.
  • How the System Works:
      Our Question Answering system takes a context paragraph and a question as inputs and extracts the relevant answer from the context in the following steps (illustrated by the sketch after this list):
    • Tokenization: The question and the context paragraph are tokenized into subword tokens using the model's tokenizer, loaded through the Hugging Face Transformers library together with AutoModelForQuestionAnswering.
    • Process input through the model: The tokenized inputs are passed through the distilbert-base-cased-distilled-squad model, which has been fine-tuned on the SQuAD dataset.
    • Extract the answer span: The model's output consists of start and end logits (scores) for each token in the context paragraph. The tokens with the highest start and end logits mark the beginning and end of the answer span within the context.
    • Generate the answer: Decoding the answer span tokens produces the final answer string, which is returned as the output of the system.
    • Evaluate the model: The first 100 rows of the SQuAD dataset are used to evaluate the performance of the QA system.
    • Testing: The system is tested with pytest using multiple test cases.
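
The steps above map onto the public Transformers API roughly as follows. This is a sketch rather than the project's helper.py; the variable names and the example context and question are made up:

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    model_name = "distilbert-base-cased-distilled-squad"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    context = "SQuAD was created by researchers at Stanford University."
    question = "Who created SQuAD?"

    # 1. Tokenization: encode the question and context together into subword tokens.
    inputs = tokenizer(question, context, return_tensors="pt")

    # 2. Forward pass: the model produces start and end logits for every token.
    with torch.no_grad():
        outputs = model(**inputs)

    # 3. Answer span: pick the tokens with the highest start and end logits.
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits) + 1

    # 4. Decode the span back into a string.
    answer = tokenizer.decode(inputs["input_ids"][0][start:end])
    print(answer)  # e.g. "researchers at Stanford University"

In the project itself this logic is wrapped by the functions in helper.py and applied to the first 100 SQuAD rows for evaluation.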

Project Directory Structure

The project directory is organized in a structured manner to facilitate easy navigation.
Below is an overview of the key folders and files within the project:
Question-Answering/
|-- notebooks/
|   |-- trails.ipynb
|-- src/
|   |-- __init__.py
|   |-- components/
|   |   |-- __init__.py
|   |   |-- helper.py
|   |   |-- main.py
|   |-- tests/
|   |   |-- __init__.py
|   |   |-- test_answer_questions.py
|-- requirements.txt
|-- README.md

How Files Are Used

src/: This folder contains the main source code of the project:

  • components/: The heart of the project, where the primary functionality resides. It contains:

    • helper.py: This file contains the core functions that power the question-answering system:
      • The answer_questions function takes a context and a question as input, tokenizes them, and extracts the answer using the chosen model.
      • The apply_answer_questions function applies answer_questions to a dataset, generating dictionaries containing the question, the original answer, and the detected answer.
    • main.py: The entry point of the project. Its main function applies apply_answer_questions to a subset of the dataset (the first 100 rows) and saves the results in a CSV file.

  • tests/:

    • test_answer_questions.py: Contains pytest test cases that validate the accuracy of the question-answering system. The tests use parameterized testing to check the behavior of the answer_questions function in different situations (see the sketch below).
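
A parameterized test for answer_questions might look roughly like the sketch below; the import path, argument order, return type, and test cases are assumptions based on the descriptions in this README rather than code copied from the repository:

    import pytest

    # Import path assumed from the directory structure shown above.
    from src.components.helper import answer_questions

    # Each case pairs a context and question with a substring expected in the answer.
    @pytest.mark.parametrize(
        "context, question, expected",
        [
            ("SQuAD was created at Stanford University.", "Where was SQuAD created?", "Stanford"),
            ("DistilBERT is a smaller, faster variant of BERT.", "What is DistilBERT?", "variant of BERT"),
        ],
    )
    def test_answer_questions(context, question, expected):
        answer = answer_questions(context, question)  # assumed to return the answer string
        assert expected in answer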

notebooks/:

  • trails.ipynb: A Jupyter notebook that serves as a sandbox for experimentation. It is used to explore the dataset and try different models before integrating them into the main system.

requirements.txt: Lists the Python packages required for the project to run successfully.

README.md: The central documentation file containing essential information about the project, its usage, and directory structure.
