# Learning Objectives

- Illustrate how to improve the retrieval process of RAG using:
    - Query Expansion
    - Hypothetical Questions


# Setup

In [1]:
!pip install -q openai==1.66.3 \
                tiktoken==0.9.0 \
                langchain==0.3.20 \
                langchain-chroma==0.2.2 \
                langchain-openai==0.3.9 \
                chromadb==0.6.3


In [2]:
import os
import chromadb

from langchain_chroma import Chroma
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI

from langchain_core.documents import Document

from google.colab import userdata

In [4]:
azure_api_key = userdata.get('azure_api_key')
# Modify the Azure Endpoint and the API Versions as needed
azure_base_url = "https://oait3st.cognitiveservices.azure.com"
azure_api_version = "2024-12-01-preview"

In [5]:
llm = AzureChatOpenAI(
    azure_endpoint=azure_base_url,
    api_key=azure_api_key,
    api_version=azure_api_version,
    model='gpt-4o-mini',
    temperature=0.4
)

embedding_model = AzureOpenAIEmbeddings(
    api_key=azure_api_key,
    azure_endpoint=azure_base_url,
    api_version=azure_api_version,
    azure_deployment="text-embedding-3-small"
)

To illustrate these techniques for improving retrieval, let us set up an ephemeral (in-memory) Chroma database with a few documents.

In [6]:
chromadb_client = chromadb.EphemeralClient()


In [7]:
documents = [
    Document(
        id=1,
        page_content="We design, develop, manufacture, sell and lease high-performance fully electric vehicles and energy generation and storage systems, and offer services related to our products. We generally sell our products directly to customers, and continue to grow our customer-facing infrastructure through a global network of vehicle showrooms and service centers, Mobile Service, body shops, Supercharger stations and Destination Chargers to accelerate the widespread adoption of our products. We emphasize performance, attractive styling and the safety of our users and workforce in the design and manufacture of our products and are continuing to develop full self-driving technology for improved safety. We also strive to lower the cost of ownership for our customers through continuous efforts to reduce manufacturing costs and by offering financial and other services tailored to our products.",
        metadata={"year": 2023, "section": "business"}
    ),
    Document(
        id=2,
        page_content="We have previously experienced and may in the future experience launch and production ramp delays for new products and features. For example, we encountered unanticipated supplier issues that led to delays during the initial ramp of our first Model X and experienced challenges with a supplier and with ramping full automation for certain of our initial Model 3 manufacturing processes. In addition, we may introduce in the future new or unique manufacturing processes and design features for our products. As we expand our vehicle offerings and global footprint, there is no guarantee that we will be able to successfully and timely introduce and scale such processes or features.",
        metadata={"year": 2023, "section": "risk_factors"}
    ),
    Document(
        id=3,
        page_content="We recognize the importance of assessing, identifying, and managing material risks associated with cybersecurity threats, as such term is defined in Item 106(a) of Regulation S-K. These risks include, among other things: operational risks, intellectual property theft, fraud, extortion, harm to employees or customers and violation of data privacy or security laws. Identifying and assessing cybersecurity risk is integrated into our overall risk management systems and processes. Cybersecurity risks related to our business, technical operations, privacy and compliance issues are identified and addressed through a multi-faceted approach including third party assessments, internal IT Audit, IT security, governance, risk and compliance reviews. To defend, detect and respond to cybersecurity incidents, we, among other things: conduct proactive privacy and cybersecurity reviews of systems and applications, audit applicable data policies, perform penetration testing using external third-party tools and techniques to test security controls, operate a bug bounty program to encourage proactive vulnerability reporting, conduct employee training, monitor emerging laws and regulations related to data protection and information security (including our consumer products) and implement appropriate changes.",
        metadata={"year": 2023, "section": "cyber_security"}
    ),
    Document(
        id=4,
        page_content="The automotive segment includes the design, development, manufacturing, sales and leasing of high-performance fully electric vehicles as well as sales of automotive regulatory credits. Additionally, the automotive segment also includes services and other, which includes non-warranty after- sales vehicle services and parts, sales of used vehicles, retail merchandise, paid Supercharging and vehicle insurance revenue. The energy generation and storage segment includes the design, manufacture, installation, sales and leasing of solar energy generation and energy storage products and related services and sales of solar energy systems incentives.",
        metadata={"year": 2022, "section": "business"}
    ),
    Document(
        id=5,
        page_content="Since the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have, at times, limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. Global trade conditions and consumer trends that originated during the pandemic continue to persist and may also have long-lasting adverse impact on us and our industries independently of the progress of the pandemic.",
        metadata={"year": 2022, "section": "risk_factors"}
    ),
    Document(
        id=6,
        page_content="The German Umweltbundesamt issued our subsidiary in Germany a notice and fine in the amount of 12 million euro alleging its non-compliance under applicable laws relating to market participation notifications and take-back obligations with respect to end-of-life battery products required thereunder. In response to Tesla’s objection, the German Umweltbundesamt issued Tesla a revised fine notice dated April 29, 2021 in which it reduced the original fine amount to 1.45 million euro. This is primarily relating to administrative requirements, but Tesla has continued to take back battery packs, and filed a new objection in June 2021. A hearing took place on November 24, 2022, and the parties reached a settlement which resulted in a further reduction of the fine to 600,000 euro. Both parties have waived their right to appeal.",
        metadata={"year": 2022, "section": "legal_proceedings"}
    )
]

We can now point our vector store to the Chroma client and add these documents to the vector store as in previous modules.

In [8]:
vectorstore = Chroma(
    collection_name="full_document_chunks",
    collection_metadata={"hnsw:space": "cosine"},
    embedding_function=embedding_model,
    client=chromadb_client
)


In [9]:
chromadb_client.list_collections()

['full_document_chunks']

In [10]:
vectorstore.add_documents(
    documents=documents
)

['1', '2', '3', '4', '5', '6']

In [11]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        'k': 3
    }
)

# Query Expansion

In query expansion, we ask the LLM to generate variations of the original user query and then use each variation to retrieve relevant context. The final context is the unique set of documents retrieved across all the query variations.
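The merging step can be sketched without any API calls. The snippet below is a minimal illustration rather than this notebook's implementation: `retrieve_union` and the toy `fake_retriever` are hypothetical names, and the retriever is assumed to be any callable that maps a query string to a list of text chunks. Unlike a plain `set`, this version keeps chunks in the order in which they were first retrieved.

```python
from typing import Callable, Iterable, List


def retrieve_union(
    queries: Iterable[str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every query variant and return the unique chunks in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Toy stand-in for a vector-store retriever (hypothetical data).
def fake_retriever(query: str) -> List[str]:
    index = {
        "fines in 2022": ["chunk about fines", "chunk about risks"],
        "penalties in 2022": ["chunk about fines", "chunk about lawsuits"],
    }
    return index.get(query, [])


merged_chunks = retrieve_union(["fines in 2022", "penalties in 2022"], fake_retriever)
print(merged_chunks)
# ['chunk about fines', 'chunk about risks', 'chunk about lawsuits']
```

Keeping first-seen order is a design choice: chunks returned for the earliest variants stay at the front of the context, whereas a `set` returns them in arbitrary order.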

In [12]:
query_expansion_system_message = """
You are a financial domain expert assisting in answering questions related to 10-K reports.
Perform query expansion on the question below. If there are multiple common ways of phrasing a user question \
or common synonyms for key words in the question, make sure to return multiple versions \
of the query with the different phrasings.

If there are acronyms or words you are not familiar with, do not try to rephrase them.

Return at least 3 versions of the question as a list.
Generate only a list of questions, each question in a new line.
Do not number the list of questions or use bullet points.
Do not mention anything before or after the list.
"""

user_message_template="""
<Question>
{question}
</Question>
"""

In [13]:
user_input = "Any specific fines levied on the company in 2022?"

In [14]:
query_expansions = llm.invoke(
    [
        ('system', query_expansion_system_message),
        ('user', user_message_template.format(question=user_input))
    ]
)

In [15]:
query_expansions_list = query_expansions.content.strip().split("\n")

In [16]:
query_expansions_list

['Any particular penalties imposed on the company in 2022?  ',
 'Were there any specific fines assessed against the company in 2022?  ',
 'Did the company face any fines or penalties in 2022?']
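The list above still carries trailing whitespace from the model's response, and a plain `split("\n")` would also keep blank lines or stray bullets if the model ignored the formatting instructions. A minimal, hedged sketch of a more defensive parser follows; `parse_question_lines` is a hypothetical helper, not part of LangChain.

```python
import re
from typing import List


def parse_question_lines(raw_text: str) -> List[str]:
    """Split an LLM response into clean, non-empty, de-duplicated questions."""
    questions = []
    for line in raw_text.splitlines():
        # Remove surrounding whitespace and any leading bullet or list number.
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned not in questions:
            questions.append(cleaned)
    return questions


raw = (
    "Any particular penalties imposed on the company in 2022?  \n"
    "\n"
    "- Did the company face any fines or penalties in 2022?\n"
)
print(parse_question_lines(raw))
# ['Any particular penalties imposed on the company in 2022?',
#  'Did the company face any fines or penalties in 2022?']
```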

In [17]:
expanded_context_list = []

In [18]:
for query in query_expansions_list:
    expanded_context_list.extend([d.page_content for d in retriever.invoke(query)])


In [19]:
final_context_documents = set(expanded_context_list)

In [20]:
final_context_documents

{'Since the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have, at times, limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. Global trade conditions and consumer trends that originated during the pandemic continue to persist and may also have long-lasting adverse impact on us and our industries independently of the progress of the pandemic.',
 'The German Umweltbundesamt issued our subsidiary in Germany a notice and fine in the amount of 12 million euro alleging its non-compliance under applicable laws relating to market participation notifications and take-back obligations with respect to end-of-life battery products required thereunder. In response to Tesla’s objection, the German Umweltbundesamt issued Tesla a revised fine notice dated April 29, 2021 in which it reduced the original fine amount to 1.45 million euro. Thi

# Hypothetical Questions

In this approach, we generate 3 hypothetical questions that each document chunk can answer. These hypothetical questions are then indexed separately in the vector database, with the parent document chunk id stored as metadata. For each query, we first retrieve the most relevant hypothetical questions and then, as a second step, retrieve the chunks they were generated from. Note that similarity is computed between the user query and the hypothetical questions, not between the query and the chunks themselves.
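The second step of this lookup can be illustrated with plain Python data structures. This is a sketch under stated assumptions: `parent_ids_in_rank_order`, `parent_chunks`, and `ranked_hits` are hypothetical names and toy data standing in for the two vector-store collections, and the hits are assumed to arrive sorted from most to least similar.

```python
from typing import Dict, List


def parent_ids_in_rank_order(hit_metadatas: List[Dict[str, str]]) -> List[str]:
    """Collect unique parent chunk ids, ordered by their best-ranked question hit."""
    ordered_ids = []
    for metadata in hit_metadatas:
        parent_id = metadata["parent_chunk_id"]
        if parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    return ordered_ids


# Hypothetical stand-in for the collection of parent chunks.
parent_chunks = {
    "1": "Business overview chunk",
    "6": "Legal proceedings chunk",
}

# Metadata of the retrieved questions, most similar first (toy data).
ranked_hits = [
    {"parent_chunk_id": "6"},
    {"parent_chunk_id": "6"},
    {"parent_chunk_id": "1"},
]

context = [parent_chunks[pid] for pid in parent_ids_in_rank_order(ranked_hits)]
print(context)
# ['Legal proceedings chunk', 'Business overview chunk']
```

Because several questions can point to the same parent, retrieving `k` questions may yield fewer than `k` distinct chunks.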

In [21]:
hypothetical_questions_system_message = """
Generate a list of exactly 3 hypothetical questions that the document presented in the input could be used to answer.
Generate only a list of questions, each question in a new line.
Do not number the questions or use bullet points.
Do not mention anything before or after the list.
"""

user_message_template = """
<Document>
{document}
</Document>
"""

In [22]:
hypothetical_questions = []

In [23]:
for document in documents:

    try:
        response = llm.invoke(
            [
                ('system', hypothetical_questions_system_message),
                ('user', user_message_template.format(document=document.page_content))
            ]
        )

        questions = response.content.strip()
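    except Exception as e:
        # Report the failure and skip question generation for this chunk
        print(f"Question generation failed for document {document.id}: {e}")
        questions = ""

    # Drop blank lines so that no empty question gets indexed
    questions_list = [q for q in questions.split("\n") if q.strip()]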

    for question in questions_list:

        questions_metadata = {
            'parent_chunk_id': document.id,
            'parent_collection': 'full_document_chunks'
        }

        hypothetical_questions.append(
            Document(
                page_content=question,
                metadata=questions_metadata
            )
        )

Let us look at the first set of hypothetical questions.

In [24]:
hypothetical_questions[0], hypothetical_questions[1], hypothetical_questions[2]

(Document(metadata={'parent_chunk_id': '1', 'parent_collection': 'full_document_chunks'}, page_content='What types of products does the company design and manufacture?  '),
 Document(metadata={'parent_chunk_id': '1', 'parent_collection': 'full_document_chunks'}, page_content='How does the company aim to enhance the safety of its users and workforce?  '),
 Document(metadata={'parent_chunk_id': '1', 'parent_collection': 'full_document_chunks'}, page_content='What strategies does the company employ to lower the cost of ownership for its customers?'))

We can now index these hypothetical questions into a new collection.

In [25]:
hypothetical_questions_vectorstore = Chroma(
    collection_name="hypothetical_questions",
    collection_metadata={"hnsw:space": "cosine"},
    embedding_function=embedding_model,
    client=chromadb_client
)


In [26]:
chromadb_client.list_collections()

['full_document_chunks', 'hypothetical_questions']

In [27]:
hypothetical_questions_vectorstore.add_documents(
    documents=hypothetical_questions
)

['c127e61b-a734-437c-be8f-1efe979009b6',
 'a2a5c0a1-5980-478e-b279-bfbe035cea3a',
 '5e2a90dc-cc4e-4c29-bc71-e48f9646fdcb',
 '0f603a9e-e814-44a2-9f06-ed71e92fe606',
 'f6d80dfe-34df-4f26-ab0f-f55c4aa6fa53',
 '0518a2dc-292c-40d0-bdf1-45cb67bf4445',
 '0ccf05be-d942-420d-adc8-cecca863323e',
 '7ef59240-5cbe-4090-b162-cf28353f7ce4',
 'c85bcbe9-c95d-48e3-9118-9d563e4943dc',
 'a09cf7de-db08-495b-8508-bb6f2b3e1a4c',
 '975592ab-4c4b-4027-b89e-ff3ddcf49d44',
 '2605f6ab-6e1d-467e-9344-30521daf134a',
 '903da273-5aec-4a3a-9917-0c03a581e17e',
 'ad380d19-2f05-4cc2-bb03-7c7980de9684',
 '637c0d57-466c-4d81-8084-10174b6a08ec',
 '50e70e18-c1d9-4f77-a702-bf24f9ec890c',
 '8925b81e-f911-49f2-9ace-d86b1112e124',
 'ea6146ae-8ae8-4d13-9ff7-e1759135a491']

In [28]:
# Use a distinct name so the retriever over the full chunks is not overwritten
hypothetical_questions_retriever = hypothetical_questions_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 5}
)

In [29]:
user_query = "Any specific fines levied on the company in 2022?"

In [30]:
hypothetical_questions_retrieved = hypothetical_questions_retriever.invoke(user_query)


In [31]:
hypothetical_questions_retrieved

[Document(id='ea6146ae-8ae8-4d13-9ff7-e1759135a491', metadata={'parent_chunk_id': '6', 'parent_collection': 'full_document_chunks'}, page_content="What was the outcome of the hearing that took place on November 24, 2022, regarding Tesla's compliance issues?"),
 Document(id='50e70e18-c1d9-4f77-a702-bf24f9ec890c', metadata={'parent_chunk_id': '6', 'parent_collection': 'full_document_chunks'}, page_content='What were the initial and final fine amounts imposed on Tesla by the German Umweltbundesamt?  '),
 Document(id='8925b81e-f911-49f2-9ace-d86b1112e124', metadata={'parent_chunk_id': '6', 'parent_collection': 'full_document_chunks'}, page_content='What actions did Tesla take in response to the fine issued by the German Umweltbundesamt?  '),
 Document(id='a2a5c0a1-5980-478e-b279-bfbe035cea3a', metadata={'parent_chunk_id': '1', 'parent_collection': 'full_document_chunks'}, page_content='How does the company aim to enhance the safety of its users and workforce?  '),
 Document(id='7ef59240-5c

In [32]:
vectorstore.get(
    ids=list(set([d.metadata['parent_chunk_id'] for d in hypothetical_questions_retrieved]))
)


{'ids': ['1', '3', '6'],
 'embeddings': None,
 'documents': ['We design, develop, manufacture, sell and lease high-performance fully electric vehicles and energy generation and storage systems, and offer services related to our products. We generally sell our products directly to customers, and continue to grow our customer-facing infrastructure through a global network of vehicle showrooms and service centers, Mobile Service, body shops, Supercharger stations and Destination Chargers to accelerate the widespread adoption of our products. We emphasize performance, attractive styling and the safety of our users and workforce in the design and manufacture of our products and are continuing to develop full self-driving technology for improved safety. We also strive to lower the cost of ownership for our customers through continuous efforts to reduce manufacturing costs and by offering financial and other services tailored to our products.',
  'We recognize the importance of assessing, ide