# Scientific Question-Answering with Citations - Davide Abbattista


A client requests an AI-powered solution that can accurately respond to user queries related to scientific topics. They have gathered a [substantial collection of documents](https://huggingface.co/datasets/loukritia/science-journal-for-kids-data) and wish for the system to leverage this data to provide accurate and reliable answers to user queries.

Here, the proposed solution is described and implemented.

## System Description

The Scientific Question-Answering System is an end-to-end solution designed to provide accurate, contextually relevant, and citation-supported answers to user queries. The system leverages state-of-the-art NLP techniques for document retrieval, answer generation, and consistency verification. It is built to process a dataset of scientific abstracts and provide users with reliable answers while ensuring transparency through citations and warnings when inconsistencies are detected.

**Overview of Query and Document Processing**

When a user submits a query, the system follows a structured process to retrieve, process, and generate a response. Abstracts from the dataset are first preprocessed and embedded into a vector space for efficient similarity matching. The system then retrieves the most relevant documents and uses them as context for generating a coherent answer. The workflow is designed to keep the generated answer consistent with, and traceable to, its source documents.

**Dataset Preparation and Preprocessing**

The system starts by loading and inspecting the dataset using the Dataset Handler:
-	Data Loading: The dataset, provided in CSV format, is loaded into a structured DataFrame. Key columns include the original abstracts from academic papers and their simplified “Kids Abstracts.”
-	Inspection: The system checks for missing values to ensure completeness and analyzes the token lengths of abstracts to verify compatibility with the model's token limits.
-	Summary: A high-level overview of the dataset, including data types and sample entries, is displayed.
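The token-length inspection matters because several abstracts are later concatenated into a single prompt. As a rough illustration, the sketch below checks whether a set of documents fits an input budget. The helper name `fits_token_budget`, the per-document counts in the first call, and the 80-token prompt overhead are hypothetical values chosen for the example; the 2048-token limit mirrors the `max_length` used in the notebook's tokenizer calls, and 998 is the maximum combined abstract length reported in the notebook output.

```python
def fits_token_budget(doc_token_counts, prompt_overhead, max_input_tokens):
    """Return (fits, total_tokens) for a prompt built from several documents."""
    total = prompt_overhead + sum(doc_token_counts)
    return total <= max_input_tokens, total


# Hypothetical case: three average-sized documents plus instructions and question.
fits, total = fits_token_budget([520, 480, 610], prompt_overhead=80, max_input_tokens=2048)
print(fits, total)  # True 1690

# Worst case: three documents at the maximum combined length (998 tokens each).
worst_fits, worst_total = fits_token_budget([998, 998, 998], prompt_overhead=80, max_input_tokens=2048)
print(worst_fits, worst_total)  # False 3074
```

In the worst case the prompt would be truncated, so a check of this kind can help decide how many documents to pass as context.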

**Embedding Creation and Similarity Search**

To efficiently retrieve relevant documents, the Embedding Indexer encodes the dataset into dense vector embeddings:
-	Embedding Model: The [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) model is used to generate embeddings. This model is specifically designed for semantic search. It has been trained on question-answer pairs from diverse sources and ensures high-quality semantic representation of textual data.
-	Indexing: The embeddings are stored in a [FAISS](https://faiss.ai/) index, a fast and scalable library for nearest neighbor search. The system uses cosine similarity as the metric to retrieve the most relevant documents for a given query.
-	Query Matching: When a query is submitted, it is also encoded into a dense vector. This query embedding is then matched against the FAISS index to retrieve the top-k documents that are semantically closest to the query.

The use of dense embeddings and FAISS ensures scalability, making the system capable of handling large datasets efficiently.
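To make the retrieval step concrete, here is a minimal pure-Python sketch of what an exhaustive inner-product search computes: once vectors are scaled to unit length, the inner product equals cosine similarity. The function names and the two-dimensional toy vectors are assumptions for illustration only; the notebook itself relies on FAISS and real sentence embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_by_inner_product(query, docs, k):
    """Score every document against the query and return the k best (score, index) pairs."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(doc))), idx)
        for idx, doc in enumerate(docs)
    ]
    scored.sort(reverse=True)  # Highest similarity first
    return scored[:k]


# Toy "embeddings": document 0 points almost in the same direction as the query.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_by_inner_product([2.0, 0.1], docs, k=2)
print([idx for _, idx in results])  # [0, 2]
```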

**Query Answering and Consistency Verification**

Once relevant documents are retrieved, the Query Answering module processes the query and documents to generate a contextually appropriate answer:

1. Answer Generation:
 -	A zero-shot prompting approach is used with the [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) Seq2Seq model to generate a detailed answer. The context for this model includes the abstracts of the retrieved documents and the query itself.
 - The model excels at generating coherent, human-readable answers, leveraging its pre-trained capabilities on a variety of NLP tasks.

2. Consistency Verification:
 -	To ensure the generated answer is aligned with the source documents, the system calculates semantic similarity between the generated answer and potential answer spans extracted from the retrieved documents.
 -	The [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) model identifies potential answer spans, and the [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) model computes their embeddings.
 -	The similarity between the generated answer's embedding and each span's embedding is evaluated. If at least one similarity score meets or exceeds a predefined threshold, the answer is considered consistent.
 - The retrieved documents whose similarity scores meet or exceed the threshold are used as references in the system's answer.

3. Citations and Warnings:
 -	Citations for the documents that meet the consistency check, used in generating the answer, are appended to provide transparency and traceability.
 -	If the consistency check fails, the system flags the answer with a warning, alerting the user to potential inaccuracies.
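The decision logic described in steps 2 and 3 can be sketched independently of the models. The example below assumes the similarity scores have already been computed; the function names, scores, document ids, and titles are hypothetical, while the 0.5 threshold and the warning text follow the notebook's defaults.

```python
def check_consistency(similarities, doc_ids, threshold=0.5):
    """Return (is_consistent, relevant_doc_ids) for precomputed similarity scores."""
    relevant = [doc_id for doc_id, score in zip(doc_ids, similarities) if score >= threshold]
    return len(relevant) > 0, relevant


def format_answer(answer, relevant_titles):
    """Append references when the check passes, otherwise append a warning."""
    if relevant_titles:
        references = "\n".join(f"- {title}" for title in relevant_titles)
        return f"{answer}\n\nReferences:\n{references}"
    return f"{answer}\n\n[WARNING: Answer may not be consistent with source documents.]"


# Hypothetical scores for three retrieved documents.
print(check_consistency([0.82, 0.31, 0.50], doc_ids=[12, 7, 40]))  # (True, [12, 40])
print(check_consistency([0.10, 0.31, 0.49], doc_ids=[12, 7, 40]))  # (False, [])
```

Note that a score exactly equal to the threshold counts as consistent, matching the "meets or exceeds" rule above.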

**System Integration and Workflow**

The Scientific QA System orchestrates the entire process, combining the dataset preparation, embedding indexing, and query answering components:

1. Initialization:
 - The dataset is loaded and preprocessed.
 - 	Dense vector embeddings are created for all abstracts, and a FAISS index is built for similarity search.
 -	The query-answering model and consistency-checking mechanisms are initialized.

2.	Query Handling:
 -	The user's query is encoded into an embedding and matched against the FAISS index to retrieve the most relevant documents.
 -	The retrieved documents are used to generate an answer, which is then validated for consistency.

3. Answer Delivery:
 -	The system returns the generated answer, along with citations for the source documents that meet the consistency check.
 -	If the consistency check fails, the system includes a warning.

**Justification of Implementation Choices**
-	Dense Embeddings and FAISS: Using dense embeddings ensures robust semantic matching, while FAISS provides a scalable solution for similarity search, enabling the system to handle large datasets efficiently.
-	Zero-Shot Prompting: The [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) model is leveraged for its strong generalization capabilities, ensuring high-quality answers without task-specific fine-tuning.
-	Consistency Checks: The consistency verification mechanism enhances the reliability of the system, ensuring that the answers align with the source documents.
-	Transparency: Citations and warnings provide users with the necessary context to evaluate the trustworthiness of the generated answers.

## Setup and Installation

The required environment is set up by installing the necessary Python libraries and importing them into the notebook.

Here is what each library is used for:
- [pandas](https://pandas.pydata.org/docs/): A fast, powerful, flexible and easy to use data analysis and manipulation tool.
- [sentence-transformers](https://sbert.net/): A module for using state-of-the-art text embedding models.
- [faiss-cpu](https://github.com/facebookresearch/faiss/wiki/): A library for efficient similarity search of dense vectors.
- [transformers](https://huggingface.co/docs/transformers/index): A Hugging Face library that provides APIs and tools to easily download state-of-the-art pretrained models.
- [torch](https://pytorch.org/docs/stable/index.html): PyTorch, an optimized tensor library for deep learning using GPUs and CPUs.

In [1]:
# Installing required libraries
# !pip install pandas sentence-transformers transformers torch (requirements already satisfied)
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [3]:
# Import necessary libraries for data manipulation, embedding creation, indexing, and question-answering tasks
import pandas as pd  # For data manipulation and analysis
from sentence_transformers import SentenceTransformer  # For creating sentence embeddings
import faiss  # For fast similarity search using dense vector embeddings
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM  # For question-answering and Seq2Seq models
import torch  # For GPU/CPU computation

## System Implementation

### Dataset Handler


**Purpose**: Load and inspect the dataset.

**Functionality**
-	Reads the dataset provided in CSV format.
-	Checks for missing values and token lengths.
-	Summarizes dataset properties, such as data types and sample entries, to facilitate understanding.

In [6]:
# Class to manage dataset loading, inspection, and preprocessing
class DatasetHandler:
    def __init__(self, dataset_url):
        """
        Initialize the DatasetHandler.
        - dataset_url: URL to the dataset (CSV format).
        """
        self.dataset_url = dataset_url  # URL to the dataset
        self.data = None  # Placeholder for the dataset
        self.tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")  # Tokenizer of the answer-generation model, used to analyze token lengths

    def load_dataset(self):
        """
        Load the dataset from the specified URL or path.
        Prints the dataset's columns to verify successful loading.
        """
        self.data = pd.read_csv(self.dataset_url)  # Load dataset using pandas
        print("Dataset loaded successfully.")
        print("\nColumns:")
        print(*self.data.columns, sep=", ")  # Display available columns

    def check_missing_values(self):
        """
        Check for missing values in the dataset.
        Reports the count of missing values for each column.
        """
        missing_values = self.data.isnull().sum()  # Count missing values per column
        print("\nMissing Values Summary:")
        if missing_values.sum() == 0:  # If no missing values
            print("No missing values detected.")
        else:
            print(missing_values[missing_values > 0])  # Report only columns with missing values

    def check_abstract_token_lengths(self):
        """
        Analyze token lengths for the 'Abstract (Original academic paper)' and 'Kids Abstract' columns.
        Helps ensure the content is within token limits for the chosen model.
        """
        # Calculate token counts for the original abstract
        self.data['Original Abstract Length (tokens)'] = self.data['Abstract (Original academic paper)'].apply(
            lambda x: len(self.tokenizer.tokenize(str(x), truncation=True, max_length=2048)) if pd.notnull(x) else 0)

        # Calculate token counts for the Kids abstract
        self.data['Kids Abstract Length (tokens)'] = self.data['Kids Abstract'].apply(
            lambda x: len(self.tokenizer.tokenize(str(x), truncation=True, max_length=2048)) if pd.notnull(x) else 0)

        # Calculate combined token counts for both abstracts
        self.data['Combined Abstract Length (tokens)'] = (
            self.data['Original Abstract Length (tokens)'] + self.data['Kids Abstract Length (tokens)']
        )

        # Display summary statistics for token lengths
        print("\nToken Length Statistics:")
        print(f"Original Abstract - Min: {self.data['Original Abstract Length (tokens)'].min()}, "
              f"Max: {self.data['Original Abstract Length (tokens)'].max()}, "
              f"Mean: {self.data['Original Abstract Length (tokens)'].mean():.2f}")
        print(f"Kids Abstract - Min: {self.data['Kids Abstract Length (tokens)'].min()}, "
              f"Max: {self.data['Kids Abstract Length (tokens)'].max()}, "
              f"Mean: {self.data['Kids Abstract Length (tokens)'].mean():.2f}")
        print(f"Combined Abstract - Min: {self.data['Combined Abstract Length (tokens)'].min()}, "
              f"Max: {self.data['Combined Abstract Length (tokens)'].max()}, "
              f"Mean: {self.data['Combined Abstract Length (tokens)'].mean():.2f}")

    def summarize_dataset(self):
        """
        Provide a high-level summary of the dataset.
        Prints data types, memory usage, and previews the first few rows.
        """
        print("\nDataset Summary:")
        print(self.data.info())  # Summary of data types and memory usage
        print("\nFirst few rows of the dataset:")
        print(self.data.head())  # Preview dataset rows

### Embedding Indexer


**Purpose**: Create a searchable index of document embeddings for fast and efficient similarity retrieval.

**Functionality**
- Uses the [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) model, from the [Sentence Transformers](https://huggingface.co/sentence-transformers) library, to encode textual data into dense vector embeddings.
- Builds a [FAISS](https://faiss.ai/) index to allow for efficient similarity search based on cosine similarity.
- Supports searching for the most relevant documents using query embeddings.

In [7]:
# Class for embedding generation and FAISS indexing
class EmbeddingIndexer:
    def __init__(self, semantic_search_model='multi-qa-mpnet-base-cos-v1'):
        """
        Initialize the embedding model and FAISS index.
        - semantic_search_model: Name of the sentence-transformers model used to generate embeddings.
        """
        device = "cuda" if torch.cuda.is_available() else "cpu"  # Use GPU if available
        self.model = SentenceTransformer(semantic_search_model, device=device)  # Load embedding model
        self.index = None  # Placeholder for FAISS index

    def create_embeddings(self, texts):
        """
        Generate embeddings for a list of texts.
        - texts: List of textual data to encode.
        Returns: Numpy array of embeddings.
        """
        embeddings = self.model.encode(texts, convert_to_tensor=False)  # Encode texts into embeddings
        return embeddings

    def build_index(self, embeddings):
        """
        Build a FAISS index for fast similarity searches.
        - embeddings: Array of dense embeddings to index.
        """
        dimension = embeddings.shape[1]  # Dimensionality of embeddings
        self.index = faiss.IndexFlatIP(dimension)  # Inner-product index; equals cosine similarity only for unit-length embeddings (which the default model produces)
        self.index.add(embeddings)  # Add embeddings to the index
        print(f"\nFAISS index built with {self.index.ntotal} entries.")  # Confirm number of entries indexed

    def search(self, query_embedding, top_k=3):
        """
        Search the FAISS index for the top-k most relevant embeddings.
        - query_embedding: Query embedding to match against the index.
        - top_k: Number of top results to return.
        Returns: Distances and indices of top-k results.
        """
        distances, indices = self.index.search(query_embedding, top_k)  # Perform search
        return distances, indices

### Query Answering

**Purpose**: Generate accurate, context-aware answers to user queries and verify their consistency with the dataset.

**Functionality**
-	Retrieves documents using [FAISS](https://faiss.ai/).
- Generates answers with the [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) Seq2Seq model, using zero-shot prompting and the retrieved documents as context.
-	Verifies the consistency of the generated answers by calculating semantic similarity with potential answer spans extracted from the retrieved documents:
  - [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) is used to obtain potential answer spans;
  - the [multi-qa-mpnet-base-cos-v1](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-cos-v1) model is used to compute the embeddings of the generated answer and of the potential answer spans;
  - the similarity between the answer's embedding and each span's embedding is computed;
  - the generated answer is considered consistent if at least one similarity score meets or exceeds the threshold.
-	Appends citations for documents whose related similarity score is above the threshold and warns users if consistency is low.

In [8]:
# Class for handling query answering and consistency verification
class QueryAnswering:
    def __init__(self, nlp_model='google/flan-t5-large', qa_model='deepset/roberta-base-squad2', semantic_search_model='multi-qa-mpnet-base-cos-v1', consistency_threshold=0.5, data=None):
        """
        Initialize the query answering system.
        - nlp_model: Name of the Seq2Seq natural language model for answer generation.
        - qa_model: Name of the Extractive QA model for extracting potential answers.
        - semantic_search_model: Name of the sentence-transformers model used to generate embeddings.
        - consistency_threshold: Similarity score threshold to determine consistency (default: 0.5).
        - data: DataFrame containing the dataset.
        """
        use_gpu = torch.cuda.is_available()  # Use GPU if available
        pipeline_device = 0 if use_gpu else -1  # Hugging Face pipelines take a device index (-1 means CPU)
        torch_device = "cuda" if use_gpu else "cpu"  # PyTorch modules take a device string; .to(-1) would fail on CPU-only machines
        self.tokenizer = AutoTokenizer.from_pretrained(nlp_model)  # Load tokenizer for Seq2Seq model
        self.model = AutoModelForSeq2SeqLM.from_pretrained(nlp_model).to(torch_device)  # Load Seq2Seq model
        self.data = data  # Dataset for reference
        self.qa_pipeline = pipeline("question-answering", model=qa_model, device=pipeline_device)  # QA pipeline using the Extractive QA model
        self.consistency_threshold = consistency_threshold  # Consistency threshold
        self.similarity_model = SentenceTransformer(semantic_search_model, device=torch_device)  # Similarity model for consistency checks

    def verify_answer_consistency(self, query, answer, retrieved_indices):
        """
        Verify if the generated answer is consistent with the retrieved documents and identify the relevant ones.
        - query: User query string.
        - answer: Generated answer string.
        - retrieved_indices: Indices of documents retrieved by FAISS.
        Returns:
            - is_consistent (bool): True if the answer is consistent with at least one retrieved document, False otherwise.
            - relevant_indices (list): Indices of documents that are consistent with the generated answer.
        """
        potential_spans = []  # Store possible answer spans extracted from the retrieved documents

        # Loop through the indices of the retrieved documents
        for idx in retrieved_indices:
            # Construct a combined context from the original and simplified abstracts
            context = (
                f"Original Abstract: {self.data.iloc[idx]['Abstract (Original academic paper)']}\n"
                f"Kids Abstract: {self.data.iloc[idx]['Kids Abstract']}"
            )
            # Use the QA model to extract a potential answer from the context
            result = self.qa_pipeline(question=query, context=context)
            potential_spans.append(result["answer"])  # Append the extracted answer to the spans list

        # Generate embeddings for the generated answer
        answer_embedding = self.similarity_model.encode(answer, convert_to_tensor=False)
        # Generate embeddings for the extracted potential spans
        span_embeddings = self.similarity_model.encode(potential_spans, convert_to_tensor=False)

        # Compute similarity scores between the answer and each potential span
        similarities = [
            self.similarity_model.similarity(answer_embedding, span_embedding)
            for span_embedding in span_embeddings
        ]

        # Identify relevant documents where similarity meets or exceeds the threshold
        relevant_indices = [
            retrieved_indices[i] for i, score in enumerate(similarities) if score >= self.consistency_threshold
        ]

        # Determine consistency: True if there is at least one relevant document
        is_consistent = len(relevant_indices) > 0

        return is_consistent, relevant_indices

    def generate_answer(self, query, retrieved_indices):
        """
        Generate an answer to the query using the retrieved relevant documents.
        - query: User query string.
        - retrieved_indices: Indices of documents retrieved by FAISS.
        Returns: Generated answer with citations and a warning if consistency is low.
        """
        # Extract relevant documents from the dataset
        retrieved_docs = self.data.iloc[retrieved_indices]

        # Combine abstracts from retrieved documents into a single context string
        context = "\n".join(
            f"Title: {row['Title']}\n"
            f"Original Abstract: {row['Abstract (Original academic paper)']}\n"
            f"Kids Abstract: {row['Kids Abstract']}\n"
            for _, row in retrieved_docs.iterrows()
        )

        # Prepare input text for the Seq2Seq model
        input_text = (
            f"You are a scientific assistant specialized in providing accurate answers based on provided research articles.\n"
            f"Your task is to use the given context to answer the question precisely.\n"
            f"Focus on clarity and avoid adding unsupported information.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer:"
        )

        # Tokenize the input text and prepare it for the model
        inputs = self.tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048, padding='max_length').to(self.model.device)
        outputs = self.model.generate(**inputs, max_length=200)  # Generate the answer
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)  # Decode the output

        # Verify the consistency of the generated answer
        is_consistent, relevant_indices = self.verify_answer_consistency(query, answer, retrieved_indices)

        if is_consistent:
            relevant_docs = self.data.iloc[relevant_indices]

            # Prepare a references section listing the relevant documents
            references = "\n".join(
                f"- {row['Title']} ({row['URL (Original academic paper)']})"
                for _, row in relevant_docs.iterrows()
            )

            # Combine the answer with references
            answer_with_references = f"{answer}\n\nReferences:\n{references}"
        else:
            # Add a warning if the answer consistency is low
            answer_with_references = answer + "\n\n[WARNING: Answer may not be consistent with source documents.]"

        return answer_with_references

### Scientific QA System

**Purpose**: Orchestrates the entire pipeline, from data loading to answer generation.

**Functionality**
- Combines the Dataset Handler, Embedding Indexer, and Query Answering components into a unified system.
-	Manages preprocessing, embedding generation, and indexing.
-	Responds to user queries by leveraging the embeddings and NLP models to retrieve relevant documents and generate answers.
- Ensures end-to-end execution with minimal manual intervention.

In [9]:
# Main pipeline to orchestrate the scientific question-answering process
class ScientificQASystem:
    def __init__(self, dataset_url, nlp_model='google/flan-t5-large', qa_model='deepset/roberta-base-squad2', semantic_search_model='multi-qa-mpnet-base-cos-v1', consistency_threshold=0.5):
        """
        Initialize the entire pipeline.
        - dataset_url: Path or URL to the dataset.
        - nlp_model: Name of the Seq2Seq natural language model for answer generation.
        - qa_model: Name of the Extractive QA model for extracting potential answers.
        - semantic_search_model: Name of the sentence-transformers model used to generate embeddings.
        - consistency_threshold: Similarity score threshold to determine consistency (default: 0.5).
        """
        self.dataset_handler = DatasetHandler(dataset_url)  # Initialize the DatasetHandler
        self.embedding_indexer = EmbeddingIndexer(semantic_search_model)  # Initialize the EmbeddingIndexer
        self.query_answering = None  # Placeholder for the QueryAnswering system
        self.nlp_model = nlp_model  # Model for answer generation
        self.qa_model = qa_model  # Extractive QA model for answer-span extraction
        self.semantic_search_model = semantic_search_model  # Embedding model for retrieval and consistency checks
        self.consistency_threshold = consistency_threshold # Consistency threshold

    def initialize_pipeline(self):
        """
        Initialize the components of the QA pipeline.
        - Loads and preprocesses the dataset.
        - Creates embeddings and builds a FAISS index.
        - Prepares the query-answering system.
        """
        # Load and preprocess the dataset
        self.dataset_handler.load_dataset()
        self.dataset_handler.check_missing_values()
        self.dataset_handler.check_abstract_token_lengths()
        self.dataset_handler.summarize_dataset()
        data = self.dataset_handler.data  # Access the preprocessed dataset

        # Generate embeddings for combined abstracts and build the FAISS index
        combined_texts = data['Abstract (Original academic paper)'] + " " + data['Kids Abstract']
        embeddings = self.embedding_indexer.create_embeddings(combined_texts.tolist())
        self.embedding_indexer.build_index(embeddings)

        # Initialize the query-answering system
        self.query_answering = QueryAnswering(nlp_model=self.nlp_model, qa_model=self.qa_model, semantic_search_model=self.semantic_search_model, consistency_threshold=self.consistency_threshold, data=data)

    def query(self, user_query, top_k=3):
        """
        Process a user query and generate an answer.
        - user_query: Query string from the user.
        - top_k: Number of top results to retrieve from the index.
        Returns: Generated answer with citations and a warning if consistency is low.
        """
        # Create embeddings for the user query
        query_embedding = self.embedding_indexer.create_embeddings([user_query])

        # Search the FAISS index for the most relevant documents
        distances, indices = self.embedding_indexer.search(query_embedding, top_k)

        # Generate and return the answer
        return self.query_answering.generate_answer(user_query, indices[0])

## System Initialization

In [11]:
# Define the dataset URL
dataset_url = "hf://datasets/loukritia/science-journal-for-kids-data/science-journal-for-kids-dataset.csv"

# Initialize the ScientificQASystem
# - This combines dataset handling, embedding, and QA tasks into a single pipeline.
qa_system = ScientificQASystem(
    dataset_url=dataset_url,  # URL of the dataset
    nlp_model="google/flan-t5-large",  # Seq2Seq natural language model
    qa_model='deepset/roberta-base-squad2',  # QA model
    semantic_search_model='multi-qa-mpnet-base-cos-v1',  # Embedding model
    consistency_threshold=0.5 # Consistency threshold
)

# Initialize the pipeline
# - This step performs all preparatory tasks, including loading the dataset, creating embeddings, building a FAISS index,
#   and preparing the QA system for answering queries.
qa_system.initialize_pipeline()

Dataset loaded successfully.

Columns:
Category, Title, Kids Abstract, Abstract (Original academic paper), URL (Original academic paper), Reading Levels

Missing Values Summary:
No missing values detected.

Token Length Statistics:
Original Abstract - Min: 35, Max: 819, Mean: 335.29
Kids Abstract - Min: 87, Max: 435, Mean: 186.05
Combined Abstract - Min: 207, Max: 998, Mean: 521.34

Dataset Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284 entries, 0 to 283
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Category                            284 non-null    object
 1   Title                               284 non-null    object
 2   Kids Abstract                       284 non-null    object
 3   Abstract (Original academic paper)  284 non-null    object
 4   URL (Original academic paper)       284 non-null    object
 5   Reading Levels                      284 

## System Usage and Test

In [12]:
# Test the pipeline with a sample query
user_query = "Which marine mammals are one of the most extreme examples of female-biased sexual size dimorphism?"

# Perform the query
# - This step retrieves the most relevant documents, generates an answer from them,
#   and verifies its consistency with the source documents.
answer = qa_system.query(user_query, top_k=3)

# Display the query and answer
print("\nUser Query:")
print(user_query)
print("\nGenerated Answer:")
print(answer)


User Query:
Which marine mammals are one of the most extreme examples of female-biased sexual size dimorphism?

Generated Answer:
leopard seals

References:
- How can leopard seals survive climate change (https://www.frontiersin.org/articles/10.3389/fmars.2022.976019/full)


The answer provided by the system and the reference paper are correct.

From the abstract: "*... As females were 50% larger than their male counterparts, leopard seals are therefore one of the most extreme examples of female-biased sexual size dimorphism in marine mammals. ...*"

In [13]:
# Test the pipeline with another sample query
user_query = "Which stimulus may represent a cue that induces responses to nearby plants?"

# Perform the query
# - This step retrieves the most relevant documents, generates an answer from them,
#   and verifies its consistency with the source documents.
answer = qa_system.query(user_query, top_k=3)

# Display the query and answer
print("\nUser Query:")
print(user_query)
print("\nGenerated Answer:")
print(answer)


User Query:
Which stimulus may represent a cue that induces responses to nearby plants?

Generated Answer:
light contact with neighbouring plants

References:
- How do plants keep in touch (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165742)


The answer provided by the system and the reference paper are correct.

From the abstract: "*In natural habitats plants can be exposed to brief and light contact with neighbouring plants. This mechanical stimulus may represent a cue that induces responses to nearby plants. ...*"

In [14]:
# Test the pipeline with another sample query
user_query = "Is there a relationship between smartphone addiction and sleep quality?"

# Perform the query
# - This step retrieves the most relevant documents, generates an answer from them,
#   and verifies its consistency with the source documents.
answer = qa_system.query(user_query, top_k=3)

# Display the query and answer
print("\nUser Query:")
print(user_query)
print("\nGenerated Answer:")
print(answer)


User Query:
Is there a relationship between smartphone addiction and sleep quality?

Generated Answer:
Smartphone addiction was associated with poor sleep, independent of duration of usage.

References:
- How do smartphones affect our sleep (https://doi.org/10.3389/fpsyt.2021.629407)


The answer provided by the system and the reference paper are correct.

From the abstract: "*Smartphone addiction was associated with poor sleep, independent of duration of usage, indicating that length of time should not be used as a proxy for harmful usage.*"

In [16]:
# Test the pipeline with another sample query
user_query = "Why pharmaceuticals are increasingly found in wastewater?"

# Perform the query
# - This step retrieves the most relevant documents, generates an answer from them,
#   and verifies its consistency with the source documents.
answer = qa_system.query(user_query, top_k=3)

# Display the query and answer
print("\nUser Query:")
print(user_query)
print("\nGenerated Answer:")
print(answer)


User Query:
Why pharmaceuticals are increasingly found in wastewater?

Generated Answer:
Due to incomplete metabolism in humans and subsequent excretion in human waste.

References:
- Medicine in our waters so what (https://doi.org/10.1371/journal.pone.0197259)


The answer provided by the system and the reference paper are correct.

From the abstract: "*... Pharmaceuticals are increasingly found in wastewater and surface waters around the world, often due to incomplete metabolism in humans and subsequent excretion in human waste. ...*"

## Conclusions

The implemented solution addresses the core requirements of a scientific question-answering system that leverages a document collection to provide accurate, citation-backed responses. The approach employs natural language processing (NLP) methods for retrieval and generation. Preliminary tests on simple queries show that the system can parse queries, retrieve relevant documents, and generate answers while linking them to their sources. This aligns with the goal of transparency and reliability in the answers provided.

Despite these achievements, the current solution has several limitations that need to be addressed for it to become production-ready.

**Limitations**
1. Lack of Experimental Evaluation:
  -	The current implementation lacks a robust experimental evaluation to assess the system's performance rigorously.
  - An error analysis mechanism to identify the root causes of system failures is absent. This would help pinpoint weaknesses in document retrieval, relevance scoring, or answer generation.
2. Inconsistency Handling:
 - While a mechanism to detect inconsistencies was implemented, its accuracy and reliability need further validation. Incorrectly flagging valid responses or missing genuine inconsistencies could undermine trust in the system.
 - The consistency threshold should be fine-tuned.
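One possible way to tune the consistency threshold is to sweep candidate values over a small set of manually labelled answers and count both kinds of error. The sketch below is only an illustration: the function name, the labelled pairs, and the candidate thresholds are hypothetical, not results from this system. Each pair holds the highest answer-to-span similarity for a query and whether the generated answer was judged correct.

```python
def sweep_thresholds(labeled_scores, thresholds):
    """
    labeled_scores: list of (max_similarity, answer_is_correct) pairs.
    Returns one (threshold, false_warnings, missed_inconsistencies) row per threshold.
    """
    rows = []
    for t in thresholds:
        # Correct answers that would still receive a warning
        false_warnings = sum(1 for score, ok in labeled_scores if ok and score < t)
        # Incorrect answers that would pass the check
        missed = sum(1 for score, ok in labeled_scores if not ok and score >= t)
        rows.append((t, false_warnings, missed))
    return rows


# Hypothetical labelled examples, for illustration only.
labeled = [(0.91, True), (0.62, True), (0.48, True), (0.55, False), (0.20, False)]
rows = sweep_thresholds(labeled, [0.3, 0.5, 0.7])
for threshold, false_warnings, missed in rows:
    print(f"threshold={threshold}: false warnings={false_warnings}, missed={missed}")
```

On this toy data, raising the threshold trades missed inconsistencies for false warnings, which is the trade-off a real validation set would need to quantify.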

**Future Directions**

1. Rigorous Evaluation and Benchmarking:
	-	Establish a comprehensive experimental setup with clear performance metrics to evaluate each stage of the pipeline. Conduct comparative experiments using domain-specific datasets.
2. Error Analysis:
	-	Introduce systematic error categorization to identify whether errors stem from retrieval or generation. This could guide iterative improvements in specific components of the system.
3. System Variations:
 -	Experiment with different model architectures for the answer generation phase or fine-tune pre-trained models for improved contextual understanding and answering.
 - Design more advanced prompt engineering strategies, such as Few-Shot or Chain-of-Thought (CoT) prompting, to enhance the system's performance.
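As a starting point for the evaluation and error analysis proposed above, retrieval quality could be measured separately from generation with a hit rate at k: the fraction of queries whose expected source document appears among the top-k retrieved indices. The function name, the document indices, and the expected labels below are hypothetical and would have to come from a manually annotated test set.

```python
def hit_rate_at_k(retrieved, expected, k):
    """
    retrieved: one ranked list of document indices per query.
    expected: the index of the document that should be found for each query.
    Returns the fraction of queries whose expected document is in the top k.
    """
    hits = sum(1 for ranked, target in zip(retrieved, expected) if target in ranked[:k])
    return hits / len(expected)


# Hypothetical retrieval results for three queries.
retrieved = [[5, 2, 9], [1, 4, 7], [3, 8, 6]]
expected = [2, 7, 0]

print(hit_rate_at_k(retrieved, expected, k=3))  # 2 of 3 queries hit
print(hit_rate_at_k(retrieved, expected, k=2))  # 1 of 3 queries hit
```

A query that misses at this stage points to a retrieval error, while a query that hits but still yields a wrong answer points to the generation step.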

By addressing these directions, the system can evolve from a proof-of-concept to a robust and accurate scientific question-answering tool, meeting real-world needs effectively.