# Document Augmentation Through Question Generation for Enhanced Retrieval
## Overview
This implementation demonstrates a text augmentation technique that uses question generation to improve document retrieval from a vector database. For each text fragment, the system generates related questions and indexes them alongside the original text. This raises the likelihood of retrieving relevant documents that can serve as context for generative question answering.
## Motivation
By enriching text fragments with related questions, which tend to resemble the way users phrase their queries, we aim to identify more accurately the sections of a document that contain the answer to a user query.
## Key Components
1. PDF Processing and Text Chunking: Handling PDF documents and dividing them into manageable text fragments
2. Question Augmentation: Generating relevant questions at either the document or the fragment level using an LLM
3. Vector Store Creation: Calculating embeddings for documents using an embedding model and creating a FAISS vector store
4. Retrieval and Answer Generation: Finding the most relevant document using FAISS and generating answers based on the context provided
## Method Details
### Document Preprocessing
- Convert PDF to a string using PyPDFLoader from LangChain
- Split the text into overlapping text documents (text_document), which serve as answer context, and then split each document into overlapping text fragments (text_fragment), which are used for retrieval and semantic search
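The two-level split can be sketched in plain Python as follows. This is a minimal illustration, assuming word-based splitting; `split_words` and `two_level_split` are hypothetical helpers, and the tiny sizes are chosen only to keep the example readable.

```python
import re
from typing import Dict, List


def split_words(text: str, size: int, overlap: int) -> List[str]:
    """Split text into overlapping windows of `size` words (hypothetical helper)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = re.findall(r"\w+", text)
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def two_level_split(text: str, doc_size: int, doc_overlap: int,
                    frag_size: int, frag_overlap: int) -> List[Dict[str, str]]:
    """Pair every small fragment with the larger document it was cut from."""
    records = []
    for doc in split_words(text, doc_size, doc_overlap):
        for frag in split_words(doc, frag_size, frag_overlap):
            records.append({"fragment": frag, "parent": doc})
    return records


# Toy input: 20 words named w0 .. w19
text = " ".join(f"w{i}" for i in range(20))
records = two_level_split(text, doc_size=10, doc_overlap=2, frag_size=4, frag_overlap=1)
for record in records:
    print(record["fragment"], "<-", record["parent"])
```

Each fragment keeps a reference to its parent document, so a match on a small fragment can later be expanded to the larger context.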
### Document Augmentation
- Generate questions at the document or text fragment level using an LLM
- Configure the number of questions to generate using the QUESTIONS_PER_DOCUMENT constant
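As a rough sketch of how the augmented entries could be laid out, the snippet below stores every generated question as its own entry that points back to the text document it came from. `fake_generate_questions` is a hypothetical stand-in for the LLM call, and the dictionary layout is an assumption made for illustration; the notebook itself stores LangChain `Document` objects.

```python
from typing import Callable, Dict, List

QUESTIONS_PER_DOCUMENT = 2  # kept small for the illustration


def fake_generate_questions(text: str) -> List[str]:
    """Hypothetical stand-in for the LLM: builds questions from the first two words."""
    topic = " ".join(text.split()[:2])
    candidates = [f"What is said about {topic}?", f"Why does {topic} matter?"]
    return candidates[:QUESTIONS_PER_DOCUMENT]


def augment(text_document: str, fragments: List[str], per_fragment: bool,
            make_questions: Callable[[str], List[str]]) -> List[Dict[str, str]]:
    """Return ORIGINAL fragment entries plus AUGMENTED question entries."""
    entries = [{"type": "ORIGINAL", "content": fragment, "parent": text_document}
               for fragment in fragments]
    # Questions come either from each fragment or once from the whole document.
    sources = fragments if per_fragment else [text_document]
    for source in sources:
        for question in make_questions(source):
            entries.append({"type": "AUGMENTED", "content": question,
                            "parent": text_document})
    return entries


document = "Sea levels are rising. Glaciers are melting quickly."
fragments = ["Sea levels are rising", "Glaciers are melting quickly"]

doc_level = augment(document, fragments, per_fragment=False,
                    make_questions=fake_generate_questions)
frag_level = augment(document, fragments, per_fragment=True,
                     make_questions=fake_generate_questions)
print(len(doc_level), len(frag_level))
```

Fragment-level generation produces more index entries (and more LLM calls) than document-level generation, which is the trade-off the configuration switch controls.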
### Vector Store Creation
- Use the embedding model to compute a vector for each document
- Create a FAISS vector store from these embeddings
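To make the vector store step concrete, here is a minimal brute-force search in plain Python. It is a toy stand-in for what an exact index does conceptually, not FAISS itself; the three-dimensional vectors are made up, and squared Euclidean distance is assumed as the metric.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Sequence[float], vectors: List[Sequence[float]],
            k: int = 1) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up embeddings for three indexed entries
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query = [0.5, 0.9, 0.0]

hits = nearest(query, vectors, k=2)
print(hits)  # entry 2 is closest, followed by entry 1
```

A real vector store performs the same kind of lookup over high-dimensional embeddings, with index structures that avoid scanning every vector.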
### Retrieval and Generation
- Retrieve the most relevant document from the FAISS store based on the given query
- Use the retrieved document as context for generating answers with an LLM
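One detail worth illustrating: the best match may be a generated question rather than original text, so the answer context is taken from the parent text stored in the match's metadata. The sketch below assumes a plain dictionary that mirrors the `page_content` and `metadata` fields used in this notebook; `context_for` and `build_answer_prompt` are hypothetical helpers.

```python
from typing import Any, Dict


def context_for(hit: Dict[str, Any]) -> str:
    """Return the parent text stored in the hit's metadata, not the matched text."""
    return hit["metadata"]["text"]


def build_answer_prompt(hit: Dict[str, Any], question: str) -> str:
    """Fill an answer prompt with the parent text and the user's question."""
    return (
        f"Using the context data: {context_for(hit)}\n\n"
        f"Provide a brief and precise answer to the question: {question}"
    )


# A retrieved entry whose matched content is a generated question
hit = {
    "page_content": "How does climate change affect freshwater ecosystems?",
    "metadata": {
        "type": "AUGMENTED",
        "text": "Freshwater ecosystems including rivers lakes and wetlands are "
                "affected by changes in precipitation patterns temperature and water flow",
    },
}
query = "How do freshwater ecosystems change due to alterations in climatic factors?"

prompt = build_answer_prompt(hit, query)
print(prompt)
```

The generated question only serves as a retrieval key; the LLM never sees it, only the parent text it points to.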
## Benefits of this Approach
- Enhanced Retrieval Process: Increases the probability of finding the most relevant FAISS document for a given query
- Flexible Context Adjustment: Allows for easy adjustment of the context window size for both text documents and fragments
- High-Quality Language Understanding: Leverages an LLM for question generation and answer production
## Conclusion
This technique offers a way to improve the quality of information retrieval in vector-based document search systems. By using an LLM to generate additional questions that resemble user queries, it can lead to better matches at retrieval time and more accurate responses in downstream tasks such as question answering.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
openai_api_version = os.getenv("AZURE_API_VERSION")

llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    api_version="2024-10-01-preview",
    azure_endpoint=f"{openai_endpoint}openai/deployments/{openai_deployment}/chat/completions?api-version=2024-10-01-preview",
    temperature=0,
    logprobs=True,
)

In [2]:
from enum import Enum
class QuestionsGeneration(Enum):
    """
    Enum class to specify the level of question generation for document processing.

    Attributes:
        DOCUMENT_LEVEL (int): Represents question generation at the entire document level.
        FRAGMENT_LEVEL (int): Represents question generation at the individual text fragment level.
    """
    
    DOCUMENT_LEVEL = 1
    FRAGMENT_LEVEL = 2

DOCUMENT_MAX_TOKENS = 2048
DOCUMENT_OVERLAP_TOKENS = 128

FRAGMENT_MAX_TOKENS = 128
FRAGMENT_OVERLAP_TOKENS = 16

QUESTION_GENERATION = QuestionsGeneration.DOCUMENT_LEVEL
QUESTIONS_PER_DOCUMENT = 5

In [3]:
from pydantic import BaseModel, Field
from typing import List
class QuestionList(BaseModel):
    question_list: List[str] = Field(..., title="List of questions generated for the document or fragment")

openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")

from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings
embeddings = AzureOpenAIEmbeddings(
    deployment=openai_embedding,
    model="text-embedding-ada-002",
    chunk_size=16
)

In [4]:
import re
def clean_and_filter_questions(questions: List[str]) -> List[str]:
    """
    Cleans and filters a list of questions.

    Args:
        questions (List[str]): A list of questions to be cleaned and filtered.

    Returns:
        List[str]: A list of cleaned and filtered questions that end with a question mark.
    """

    cleaned_questions = []
    for question in questions:
        cleaned_question = re.sub(r"^\d+\.\s", "", question.strip())
        if cleaned_question.endswith("?"):
            cleaned_questions.append(cleaned_question)

    return cleaned_questions

from langchain_core.prompts import PromptTemplate

def generate_questions(text: str) -> List[str]:
    """
    Generates a list of questions based on the provided text using OpenAI.

    Args:
        text (str): The context data from which questions are generated.

    Returns:
        List[str]: A list of unique, filtered questions.
    """
    prompt = PromptTemplate(
        input_variables=["context", "num_questions"],
        template= """
        Using the context data: {context}\n\nGenerate a list of at least {num_questions} 
        possible questions that can be asked about this context. Ensure the questions are 
        directly answerable within the context and do not include any answers or headers. 
        Separate the questions with a new line character.
        """
    )


    chain = prompt | llm.with_structured_output(QuestionList)
    input_data = {"context": text, "num_questions": QUESTIONS_PER_DOCUMENT}
    result = chain.invoke(input_data)

    questions = result.question_list
    filtered_questions = clean_and_filter_questions(questions)

    # Drop duplicates while preserving order, so the result is unique as documented
    return list(dict.fromkeys(filtered_questions))

from langchain_core.messages import BaseMessage

def generate_answer(content: str, question: str) -> BaseMessage:
    """
    Generates an answer to a given question based on the provided context using OpenAI.

    Args:
        content (str): The context data used to generate the answer.
        question (str): The question for which the answer is generated.

    Returns:
        BaseMessage: The chat model's response message; the answer text is in its `content` attribute.
    """

    prompt = PromptTemplate(
        input_variables=["context", "question"],
        template= """
        Using the context data: {context}\n\nProvide a brief and precise answer to the question: {question}
        """
    )

    chain = prompt | llm
    input_data = {"context": content, "question": question}

    return chain.invoke(input_data)

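def split_document(document: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """
    Splits a document into smaller, overlapping chunks of text.

    Tokens here are whole words (runs of word characters), so punctuation is
    dropped and the sizes are word counts rather than model tokens.

    Args:
        document (str): The text of the document to be split.
        chunk_size (int): The size of each chunk in terms of the number of tokens.
        chunk_overlap (int): The number of overlapping tokens between consecutive chunks.

    Returns:
        List[str]: A list of text chunks, where each chunk is a string of the document content.
    """
    # A non-positive stride would make the loop below invalid
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = re.findall(r"\b\w+\b", document)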
    chunks = []
    for i in range(0, len(tokens), chunk_size - chunk_overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(chunk_tokens)
        if i + chunk_size >= len(tokens):
            break
    return [" ".join(chunk) for chunk in chunks]

from typing import Any
def print_document(comment: str, document: Any) -> None:
    """
    Prints a comment followed by the content of a document.

    Args:
        comment (str): The comment or description to print before the document details.
        document (Any): The document whose content is to be printed.

    Returns:
        None
    """
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"{comment} (type: {document.metadata['type']}, index: {document.metadata['index']}): {document.page_content}")

In [5]:
example_text = "This is an example document. It contains information about various topics."
questions = generate_questions(example_text)
print("Generated Questions:")
for q in questions:
    print(f"- {q}")

Generated Questions:
- What is this document?
- What does the document contain?
- How many topics are covered in the document?
- Is the document an example?
- What type of information is included in the document?


In [6]:
sample_question = questions[0] if questions else "What is this document about?"
answer = generate_answer(example_text, sample_question)
print(f"\nQuestion: {sample_question}")
print(f"Answer: {answer}")


Question: What is this document?
Answer: content='This document is an example that contains information about various topics.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 12, 'prompt_tokens': 43, 'total_tokens': 55, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_67802d9a6d', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': False, 'detected': False}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}], 'finish_reason': 'stop', 'logprobs': {'content': [{'token': 'This', 'bytes': [84, 104, 105, 115], 'logprob': -0.025102183, 'top_logprobs': []}, {'token': ' document', 'bytes': [32, 100, 111, 99, 117, 109, 101, 110, 116], 'logprob': -0.069937214, 'top_logprob

In [7]:
chunks = split_document(example_text, chunk_size=10, chunk_overlap=2)
print("\nDocument Chunks:")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}: {chunk}")


Document Chunks:
Chunk 1: This is an example document It contains information about various
Chunk 2: about various topics
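
The number of chunks follows from the stride between chunk starts, `chunk_size - chunk_overlap`. The sketch below works through that arithmetic, assuming the same loop as `split_document`; `expected_chunks` is a hypothetical helper and not part of the notebook.

```python
import math


def expected_chunks(n_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of chunks produced by a sliding window with the given overlap."""
    if n_tokens == 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    # One chunk at position 0, then one more per stride until the end is covered
    return 1 + math.ceil((n_tokens - chunk_size) / stride)


# The 11-word example text with chunk_size=10 and chunk_overlap=2
print(expected_chunks(11, 10, 2))  # 2

# A full 2048-token text document split into 128-token fragments with 16 overlap
print(expected_chunks(2048, 128, 16))  # 19
```

With the configured fragment settings the stride is 112 tokens, so a full 2048-token text document yields 1 + ceil(1920 / 112) = 19 fragments.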


In [8]:
doc_embedding = embeddings.embed_documents([example_text])
query_embedding = embeddings.embed_query("What is the main topic?")
print("\nDocument Embedding (first 5 elements):", doc_embedding[0][:5])
print("Query Embedding (first 5 elements):", query_embedding[:5])


Document Embedding (first 5 elements): [-0.009030998684465885, 0.023224463686347008, 0.006018453743308783, -0.01844685897231102, -0.009163710288703442]
Query Embedding (first 5 elements): [0.00890314020216465, 0.0026888789143413305, 0.0010232089553028345, -0.0037895417772233486, -0.01861506514251232]


In [12]:
from langchain_core.documents import Document
from helper_functions import FAISS
def process_documents(content: str, embedding: AzureOpenAIEmbeddings):
    """
    Process the document content, split it into fragments, generate questions,
    create a FAISS vector store, and return a retriever.

    Args:
        content (str): The content of the document to process.
        embedding (AzureOpenAIEmbeddings): The embedding model to use for vectorization.

    Returns:
        VectorStoreRetriever: A retriever for the most relevant FAISS document.
    """
    text_documents = split_document(content, DOCUMENT_MAX_TOKENS, DOCUMENT_OVERLAP_TOKENS)
    print(f"Text content split into:  {len(text_documents)} documents")

    documents = []
    counter = 0

    for i, text_document in enumerate(text_documents):
        text_fragments = split_document(text_document, FRAGMENT_MAX_TOKENS, FRAGMENT_OVERLAP_TOKENS)
        print(f"Text document {i} split into: {len(text_fragments)} fragments")

        for j, text_fragment in enumerate(text_fragments):
            documents.append(Document(
                page_content=text_fragment,
                metadata={"type": "ORIGINAL", "index": counter, "text": text_document}
            ))
            counter += 1

            if QUESTION_GENERATION == QuestionsGeneration.FRAGMENT_LEVEL:
                fragment_questions = generate_questions(text_fragment)
                documents.extend([
                    Document(
                        page_content=question,
                        metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document}
                    ) for idx, question in enumerate(fragment_questions)
                ])

                counter += len(fragment_questions)
                print(f"Text document {i} Text fragment {j} - generated: {len(fragment_questions)} questions")

        if QUESTION_GENERATION == QuestionsGeneration.DOCUMENT_LEVEL:
            document_questions = generate_questions(text_document)
            documents.extend([
                Document(
                    page_content=question,
                    metadata={"type": "AUGMENTED", "index": counter + idx, "text": text_document}
                ) for idx, question in enumerate(document_questions)
            ])

            counter += len(document_questions)
            print(f"Text document {i} - generated: {len(document_questions)} questions")

    for document in documents:
        print_document("Dataset", document)

    print(f"Creating store, calculating embeddings for {len(documents)} documents")
    vectorstore = FAISS.from_documents(documents, embedding)

    print("Creating retriever returning the most relevant FAISS document")
    return vectorstore.as_retriever(search_kwargs={"k": 1})

In [10]:
from helper_functions import read_pdf_to_string
path = "./data/Understanding_Climate_Change.pdf"
content = read_pdf_to_string(path)

In [13]:
document_query_retriever = process_documents(content, embeddings)

Text content split into:  5 documents
Text document 0 split into: 19 fragments
Text document 0 - generated: 5 questions
Text document 1 split into: 19 fragments
Text document 1 - generated: 5 questions
Text document 2 split into: 19 fragments
Text document 2 - generated: 10 questions
Text document 3 split into: 19 fragments
Text document 3 - generated: 10 questions
Text document 4 split into: 16 fragments
Text document 4 - generated: 5 questions
Dataset (type: ORIGINAL, index: 0): Understanding Climate Change Chapter 1 Introduction to Climate Change Climate change refers to significant long term changes in the global climate The term global climate encompasses the planet s overall weather patterns including temperature precipitation and wind patterns over an extended period Over the past century human activities particularly the burning of fossil fuels and deforestation have significantly contributed to climate change Historical Context The Earth s climate has changed throughout histor

In [15]:
query = "What is climate change?"
retrieved_docs = document_query_retriever.invoke(query)
print(f"\nQuery: {query}")
print(f"Retrieved document: {retrieved_docs[0].page_content}")


Query: What is climate change?
Retrieved document: What are the effects of climate change on global temperatures?


In [16]:
query = "How do freshwater ecosystems change due to alterations in climatic factors?"
print(f'Question:{os.linesep}{query}{os.linesep}')
retrieved_documents = document_query_retriever.invoke(query)

for doc in retrieved_documents:
    print_document("Relevant fragment retrieved", doc)

Question:
How do freshwater ecosystems change due to alterations in climatic factors?

Relevant fragment retrieved (type: AUGMENTED, index: 69): How does climate change affect freshwater ecosystems?


In [17]:
# Use the parent text of the top retrieved document as the answer context
context = retrieved_documents[0].metadata['text']
print(f'{os.linesep}Context:{os.linesep}{context}')
answer = generate_answer(context, query)
print(f'{os.linesep}Answer:{os.linesep}{answer}')


Context:
creativity and collective effort from all sectors of society Chapter 9 Climate Change and Biodiversity Impact on Ecosystems Terrestrial Ecosystems Climate change is altering terrestrial ecosystems by shifting habitat ranges changing species distributions and impacting ecosystem functions Forests grasslands and deserts are experiencing shifts in plant and animal species composition These changes can lead to a loss of biodiversity and disrupt ecological balance Marine Ecosystems Marine ecosystems are highly vulnerable to climate change Rising sea temperatures ocean acidification and changing currents affect marine biodiversity from coral reefs to deep sea habitats Species migration and changes in reproductive cycles can disrupt marine food webs and fisheries Freshwater Ecosystems Freshwater ecosystems including rivers lakes and wetlands are affected by changes in precipitation patterns temperature and water flow These changes can lead to altered water quality habitat loss and r