
# Parent Child Retriever

This notebook demonstrates the process of setting up and using Azure OpenAI services for document retrieval and question answering. Below is a step-by-step summary of the workflow:

1. **Installation and Setup**:
    - Install the required packages, including `langchain`, `langchain-chroma`, `openai`, `tiktoken`, and `chromadb`.
    - Import required modules and load environment variables from a `.env` file.
    - Set up Azure OpenAI API credentials and initialize the `AzureChatOpenAI` and `AzureOpenAIEmbeddings` models.

2. **Document Loading and Preparation**:
    - Load a PDF document using `PyPDFLoader` and split it into smaller chunks using `RecursiveCharacterTextSplitter`.
    - Store the parent documents in an in-memory store and index the smaller child chunks in a `Chroma` vector store.

3. **Document Retrieval**:
    - Implement two modes of document retrieval:
      1. **Full Document Retrieval**: Match the query against small child chunks, then return the full documents those chunks came from.
      2. **Larger Chunk Retrieval**: Split documents into larger parent chunks, match the query against small child chunks, then return the parent chunks.

4. **Question Answering**:
    - Set up a `RetrievalQA` chain using the `AzureChatOpenAI` model and the document retriever.
    - Execute a sample query to retrieve relevant documents and generate an answer.

5. **Evaluation**:
    - Define evaluation metrics such as `faithfulness`, `answer_relevancy`, `answer_correctness`, `context_recall`, and `context_precision`.
    - Load a test dataset and populate it with responses and contexts using the QA chain.
    - Evaluate the performance of the QA system using the defined metrics and convert the results to a pandas DataFrame for analysis.

This notebook provides a comprehensive guide to leveraging Azure OpenAI services for advanced document retrieval and question answering tasks, including data preparation, model setup, retrieval strategies, and performance evaluation.

In [3]:
!pip install langchain langchain-openai langchain-community langchain-chroma openai tiktoken chromadb pypdf python-dotenv ragas datasets tqdm --quiet

In [2]:
import asyncio
import getpass
import os
import sys
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings


sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))  # Add the parent directory to the path since we work with notebooks


# Load environment variables from a .env file
load_dotenv('../.env')

# Read the Azure OpenAI settings from the environment
api_endpoint = os.getenv('AZURE_OPENAI_ENDPOINT')
api_key = os.getenv('AZURE_OPENAI_API_KEY')
llm_deployment_name = os.getenv('AZURE_OPENAI_MODEL_NAME')
embedding_deployment_name = os.getenv('AZURE_OPENAI_EMBEDDING_MODEL')
api_version = '2024-02-15-preview' # this might change in the future


if "AZURE_OPENAI_API_KEY" not in os.environ:
    os.environ["AZURE_OPENAI_API_KEY"] = getpass.getpass(
        "Enter your AzureOpenAI API key: "
    )
os.environ["AZURE_OPENAI_ENDPOINT"] = api_endpoint


# Initialize the chat model and the embedding model
llm = AzureChatOpenAI(
    model=llm_deployment_name,
    azure_deployment=llm_deployment_name,
    api_version=api_version,
    
)

embeddings = AzureOpenAIEmbeddings(azure_deployment=embedding_deployment_name)
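
The setup cell reads its settings with `os.getenv`, which returns `None` for unset variables, and assigning `None` into `os.environ` raises a `TypeError`. A minimal sketch of failing early with a clearer message is shown below. The helper name `require_env` and the placeholder endpoint are hypothetical and not part of the original notebook.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example with a hypothetical in-memory environment instead of the real one
fake_env = {"AZURE_OPENAI_ENDPOINT": "https://example.invalid/"}
print(require_env("AZURE_OPENAI_ENDPOINT", fake_env))
```

Calling `require_env` for each setting before building the models would surface a missing `.env` entry at the top of the notebook rather than in a later cell.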

## Parent Document Retriever

There are two ways to use it:

1. Look up small chunks, then return the full documents they came from.
2. Look up small chunks, then return the larger chunks they came from.
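
Both modes share one idea: search over small chunks, then hand back the larger text each chunk belongs to. The toy sketch below illustrates that lookup with plain Python. It is not how LangChain implements it: word overlap stands in for embedding similarity, and `split_fixed`, `overlap_score`, `retrieve_parents`, and the sample texts are all hypothetical.

```python
from collections import Counter


def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query, chunk):
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum((q & c).values())


# Hypothetical parent documents keyed by ID (the role of the docstore)
parents = {
    "p1": "Floating solar farms sit on water bodies. They save land and stay cool.",
    "p2": "Offshore wind farms are built far from shore where winds are stronger.",
}

# Index of (child_text, parent_id) pairs (the role of the vector store)
child_index = [
    (child, pid)
    for pid, text in parents.items()
    for child in split_fixed(text, 40)
]


def retrieve_parents(query, k=2):
    """Rank child chunks, then return their distinct parents in rank order."""
    ranked = sorted(
        child_index,
        key=lambda pair: overlap_score(query, pair[0]),
        reverse=True,
    )
    parent_ids = []
    for _, pid in ranked[:k]:
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]


print(retrieve_parents("floating solar farms", k=1))
```

The small chunks keep the match precise, while the returned parent gives the language model more surrounding context.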

In [5]:
from langchain.schema import Document
from langchain_chroma import Chroma
from langchain.retrievers import ParentDocumentRetriever

## Text Splitting & Docloader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.document_loaders import PyPDFLoader


## Data prep


In [6]:
loaders = [
    PyPDFLoader('/Users/ali/Dev/content/advanced-rag/data/Understanding_Climate_Change.pdf'),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [7]:
len(docs)

33

In [8]:
docs[0]

Document(metadata={'source': '/Users/ali/Dev/content/advanced-rag/data/Understanding_Climate_Change.pdf', 'page': 0}, page_content='Understanding Climate Change  \nChapter 1: Introduction to Climate Change  \nClimate change refers to significant, long -term changes in the global climate. The term \n"global climate" encompasses the planet\'s overall weather patterns, including temperature, \nprecipitation, and wind patterns, over an extended period. Over the past cent ury, human \nactivities, particularly the burning of fossil fuels and deforestation, have significantly \ncontributed to climate change.  \nHistorical Context  \nThe Earth\'s climate has changed throughout history. Over the past 650,000 years, there have \nbeen seven cycles of glacial advance and retreat, with the abrupt end of the last ice age about \n11,700 years ago marking the beginning of the modern climate era and  human civilization. \nMost of these climate changes are attributed to very small variations in Earth\'s

## 1. Retrieving full documents rather than chunks
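In this mode, the retriever searches over small child chunks but returns the full documents they belong to. Here each document is one PDF page, as produced by `PyPDFLoader`.

This works well if your full documents aren't too big themselves and you aren't going to return many of them.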
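
To judge whether full documents are "too big", a rough size estimate can help. The sketch below assumes about four characters per token, a common rule of thumb for English text rather than an exact tokenizer count. The function name and the page lengths are hypothetical.

```python
def estimate_prompt_tokens(doc_lengths, chars_per_token=4):
    """Roughly estimate the tokens needed for documents of the given character lengths."""
    return sum(doc_lengths) // chars_per_token


# Hypothetical: three returned pages of 2,250 characters each
print(estimate_prompt_tokens([2250, 2250, 2250]))
```

If the estimate approaches the model's context limit, returning larger chunks instead of full documents is the safer choice.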

In [9]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)


# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embeddings
)

# The storage layer for the parent documents
store = InMemoryStore()

full_doc_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [10]:
full_doc_retriever.add_documents(docs, ids=None)

In [11]:
# Keys of the parent documents held in the docstore
list(store.yield_keys())

['d13bd803-37f9-43ee-b8ef-5d471ed098c8',
 '04b259f9-10a7-414d-b2ad-b3c269c10f3d',
 '73e5955f-120f-45c5-9eeb-e530cb891e05',
 'ba109eeb-16d6-412b-a747-1c22ec972310',
 '0bdf8200-eac7-42d9-9862-094c68a0b60c',
 'b8dbbdc6-d3c6-4a2a-a1ca-4fd876540cc3',
 '22ee57cf-b18e-497e-94dd-dfc8ad56129f',
 '0ba41a34-80f1-4b8e-9537-53e851208e28',
 '262c14a8-3409-4551-8646-7eb4dd737702',
 'aef6ae3a-88bd-4ecb-9854-e47e976a5760',
 'bd28cd9e-186a-4031-a067-681678f23a8d',
 '8a408149-a228-445f-a9f1-327a2a541a98',
 '0de2ed5e-1045-4fc2-b272-d94b6427f78d',
 '6049f385-9c2e-4c55-9801-1205a240a5ff',
 '8ef8e84c-952f-41ef-aaad-8e1f55a0c0e7',
 '7b5bf995-185b-4a23-9610-82b9d30cbfb6',
 '18d0f329-78cc-4a8c-9c51-64151a3efa32',
 '629fb37d-f9d4-4fdf-82ce-b227253c39b5',
 '31a79e2c-ce1a-4a14-aea5-806cad7afb0c',
 '9169b09f-da9d-477c-9594-701f1aae4e99',
 '85620a4b-6e4a-489c-b4c7-e8c35ac4331d',
 '62f4fb14-9e41-45dd-bac1-08c71b1e3571',
 'b1a681ea-c294-495f-9ffc-69b8a7392cfa',
 'f0efca27-4ef7-4f44-85c4-fbc22ce20761',
 'df966b9e-6a5d-

In [12]:
# vectorstore

In [13]:
sub_docs = vectorstore.similarity_search("Floating Solar Farms", k=2)

In [14]:
len(sub_docs)

2

In [15]:
print(sub_docs[0].page_content)

adoption of solar energy globally, making it a more viable option for a broad er range of 
applications, including residential, commercial, and industrial uses.  
Floating Solar Farms  
Floating solar farms, installed on water bodies, offer a way to generate solar power without 
using valuable land space. These systems can also reduce evaporation from water bodies and


In [16]:
retrieved_docs = full_doc_retriever.invoke("What are floating solar farms?")

In [17]:
len(retrieved_docs[0].page_content)

2250

In [18]:
retrieved_docs[0].page_content

'New advancements in solar technology, such as perovskite solar cells and solar paint, \npromise higher efficiency and lower costs. These innovations could significantly enhance the \nadoption of solar energy globally, making it a more viable option for a broad er range of \napplications, including residential, commercial, and industrial uses.  \nFloating Solar Farms  \nFloating solar farms, installed on water bodies, offer a way to generate solar power without \nusing valuable land space. These systems can also reduce evaporation from water bodies and \nimprove solar panel efficiency due to the cooling effect of water.  \nOffshore Wind Farms  \nOffshore wind farms have the potential to generate vast amounts of electricity. They are \ntypically located far from shore, where winds are stronger and more consistent. Innovations \nin turbine design and installation methods are making offshore wind an incr easingly cost -\ncompetitive option.  \nEnergy Storage and Grid Management  \nBattery

## 2. Retrieving larger chunks
Sometimes the full documents are too big to retrieve as is. In that case, we first split the raw documents into larger chunks (the parents), and then split those into smaller chunks (the children). We index the smaller chunks, but on retrieval we return the larger chunks rather than the full documents.
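
The two-level split described above can be sketched in plain Python. This is a simplification: it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers natural separators. `split_fixed`, `build_two_level_index`, and the `parent-N` IDs are hypothetical names used only for illustration.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_two_level_index(doc, parent_size=2000, child_size=400):
    """Split a document into parents, then split each parent into children.

    Returns (parents, children): parents maps parent_id -> text, and
    children is a list of (child_text, parent_id) pairs to be indexed.
    """
    parents = {}
    children = []
    for n, parent_text in enumerate(split_fixed(doc, parent_size)):
        parent_id = f"parent-{n}"
        parents[parent_id] = parent_text
        for child_text in split_fixed(parent_text, child_size):
            children.append((child_text, parent_id))
    return parents, children


# A hypothetical 4,500-character document
doc = "x" * 4500
parents, children = build_two_level_index(doc)
print(len(parents), len(children))
```

Every child records the ID of the parent it was cut from, which is what lets retrieval swap a matched child for its larger parent.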

In [19]:
# This text splitter is used to create the parent documents - The big chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents - The small chunks
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings)

# The storage layer for the parent documents
store = InMemoryStore()

In [20]:
big_chunks_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    
)

In [21]:
big_chunks_retriever.add_documents(docs)

In [22]:
len(list(store.yield_keys()))

65

In [23]:
sub_docs = vectorstore.similarity_search("What are floating solar farms?")

In [24]:
len(sub_docs)

4

In [25]:
print(sub_docs[0].page_content)

adoption of solar energy globally, making it a more viable option for a broad er range of 
applications, including residential, commercial, and industrial uses.  
Floating Solar Farms  
Floating solar farms, installed on water bodies, offer a way to generate solar power without 
using valuable land space. These systems can also reduce evaporation from water bodies and


In [26]:
retrieved_docs = big_chunks_retriever.invoke("What are floating solar farms?")

In [27]:
len(retrieved_docs)

1
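
The child search above returned four chunks, while the retriever returned a single parent. One plausible explanation is that all four children share the same parent ID and duplicates are collapsed before the parents are fetched from the docstore. The sketch below illustrates that step with plain dictionaries. `unique_parent_ids` and the `parent-7` ID are hypothetical, and `doc_id` is assumed to be the metadata key that holds the parent ID.

```python
def unique_parent_ids(child_hits, id_key="doc_id"):
    """Collect parent IDs from child-hit metadata, keeping first-seen order."""
    seen = []
    for metadata in child_hits:
        parent_id = metadata.get(id_key)
        if parent_id is not None and parent_id not in seen:
            seen.append(parent_id)
    return seen


# Hypothetical metadata for four child hits that all point at one parent
hits = [{"doc_id": "parent-7"}] * 4
print(unique_parent_ids(hits))  # ['parent-7']
```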

In [28]:
len(retrieved_docs[0].page_content)

1939

In [29]:
print(retrieved_docs[0].page_content)

New advancements in solar technology, such as perovskite solar cells and solar paint, 
promise higher efficiency and lower costs. These innovations could significantly enhance the 
adoption of solar energy globally, making it a more viable option for a broad er range of 
applications, including residential, commercial, and industrial uses.  
Floating Solar Farms  
Floating solar farms, installed on water bodies, offer a way to generate solar power without 
using valuable land space. These systems can also reduce evaporation from water bodies and 
improve solar panel efficiency due to the cooling effect of water.  
Offshore Wind Farms  
Offshore wind farms have the potential to generate vast amounts of electricity. They are 
typically located far from shore, where winds are stronger and more consistent. Innovations 
in turbine design and installation methods are making offshore wind an incr easingly cost -
competitive option.  
Energy Storage and Grid Management  
Battery Storage  
Adva

In [30]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=big_chunks_retriever,
    return_source_documents=True
)
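
The `"stuff"` chain type places all retrieved documents into a single prompt. The sketch below shows the general idea in plain Python. The prompt wording is illustrative and is not LangChain's actual template; `build_stuff_prompt` and the sample contexts are hypothetical.

```python
def build_stuff_prompt(question, contexts):
    """Join all retrieved contexts into one prompt, followed by the question."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved parent chunks
contexts = [
    "Floating solar farms are installed on water bodies.",
    "They can reduce evaporation and improve panel efficiency.",
]
prompt = build_stuff_prompt("What are floating solar farms?", contexts)
print(prompt)
```

Because every retrieved parent ends up in one prompt, the parent chunk size and the number of returned parents directly determine the prompt length.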

In [31]:
query = "What are floating solar farms?"

result = qa.invoke(query)

In [32]:
result

{'query': 'What are floating solar farms?',
 'result': 'Floating solar farms are solar power systems installed on water bodies, allowing for energy generation without using valuable land space. They help reduce evaporation from water bodies and improve the efficiency of solar panels due to the cooling effect of water.',
 'source_documents': [Document(metadata={'source': '/Users/ali/Dev/content/advanced-rag/data/Understanding_Climate_Change.pdf', 'page': 7}, page_content='New advancements in solar technology, such as perovskite solar cells and solar paint, \npromise higher efficiency and lower costs. These innovations could significantly enhance the \nadoption of solar energy globally, making it a more viable option for a broad er range of \napplications, including residential, commercial, and industrial uses.  \nFloating Solar Farms  \nFloating solar farms, installed on water bodies, offer a way to generate solar power without \nusing valuable land space. These systems can also reduce 

In [33]:
from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    answer_correctness,
)

# list of metrics we're going to use
metrics = [
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
]

In [34]:
from datasets import load_from_disk


# Load the test set from the specified path
testset_path = "../data/testset"
testset = load_from_disk(testset_path)
testset

Dataset({
    features: ['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'],
    num_rows: 31
})

In [33]:
from tqdm import tqdm

def populate_responses(dataset):
    """
    Populates the 'response' and 'contexts' columns in the dataset using the qa.invoke function.

    Args:
        dataset: The dataset containing the questions.

    Returns:
        The dataset with the 'response' and 'contexts' columns populated.
    """
    responses = []
    retrieved_contexts = []
    
    for question in tqdm(dataset['question'], desc="Processing questions"):
        result = qa.invoke(question)
        responses.append(result['result'])

        source_documents = result['source_documents']
        relevant_context = [chunk.page_content for chunk in source_documents]
        retrieved_contexts.append(relevant_context)
    
    dataset = dataset.rename_column('contexts', 'groundtruth_contexts')
    dataset = dataset.add_column('response', responses)
    dataset = dataset.add_column('contexts', retrieved_contexts)
    
    return dataset

# Populate the 'response' and 'contexts' columns in the testset
testset = populate_responses(testset)

Processing questions: 100%|██████████| 31/31 [02:51<00:00,  5.54s/it]


In [34]:
testset[0]['response']

"The economic impacts of climate change include:\n\n1. **Damage to Infrastructure**: Extreme weather events like hurricanes and floods can cause significant damage to buildings, roads, and other infrastructure, leading to costly repairs and disruptions.\n\n2. **Reduced Agricultural Productivity**: Changes in climate patterns, such as droughts and floods, can negatively affect crop yields, impacting food production and supply chains.\n\n3. **Healthcare Costs**: Climate change can lead to increased health issues, such as heat-related illnesses and diseases spread by insects, resulting in higher healthcare expenses.\n\n4. **Lost Labor Productivity**: Higher temperatures and extreme weather conditions can reduce workers' efficiency and increase health-related absences, affecting overall productivity."

In [35]:
len(testset[0]['contexts'])

2

In [36]:
from ragas import evaluate

result = evaluate(testset, metrics=metrics, llm=llm, embeddings=embeddings)

Evaluating:   0%|          | 0/155 [00:00<?, ?it/s]

In [37]:
result

{'faithfulness': 0.7044, 'answer_relevancy': 0.9790, 'answer_correctness': 0.5808, 'context_recall': 0.9382, 'context_precision': 0.7823}

In [38]:
result.to_pandas()

| | question | contexts | answer | ground_truth | faithfulness | answer_relevancy | answer_correctness | context_recall | context_precision |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What are the economic impacts of climate chang... | [Local communities are often on the front line... | The economic impacts of climate change include... | The economic costs of climate change include d... | 1.0 | 0.946644 | 0.986215 | 1.0 | 1.0 |
| 1 | How does global collaboration contribute to ad... | [Global cooperation is crucial for addressing ... | Global collaboration contributes to addressing... | Global collaboration is essential for addressi... | 1.0 | 0.964458 | 0.872333 | 1.0 | 1.0 |
| 2 | How does global collaboration contribute to ad... | [Global cooperation is crucial for addressing ... | Global collaboration is essential for addressi... | Global collaboration is essential for addressi... | 0.6875 | 0.941245 | 0.553818 | 1.0 | 0.916667 |
| 3 | What are the potential applications and benefi... | [resources. Smart grids enhance grid reliabili... | Carbon capture and utilization (CCU) technolog... | Utilizing captured CO2 to produce valuable pro... | 0.8 | 0.997274 | 0.650796 | 1.0 | 1.0 |
| 4 | How do coastal ecosystems like mangroves, salt... | [sustainable forestry, support ecosystem healt... | Coastal ecosystems such as mangroves, salt mar... | Coastal ecosystems like mangroves, salt marshe... | 0.636364 | 0.958687 | 0.713256 | 1.0 | 1.0 |
| 5 | What are the primary human activities contribu... | [Understanding Climate Change \nChapter 1: In... | The primary human activities contributing to c... | The primary human activities contributing to c... | 1.0 | 0.999999 | 0.613435 | 1.0 | 1.0 |
| 6 | How do public awareness campaigns inform and e... | [Cultural movements play a crucial role in mob... | Public awareness campaigns inform and educate ... | Public awareness campaigns aim to inform and e... | 0.833333 | 0.990894 | 0.888638 | 1.0 | 1.0 |
| 7 | How does climate change impact terrestrial, ma... | [large -scale climate solutions. PPPs are part... | Climate change impacts ecosystems in the follo... | Climate change impacts terrestrial ecosystems ... | 0.923077 | 0.944372 | 0.827032 | 1.0 | 0.0 |
| 8 | How do synthetic fertilizers contribute to gre... | [Ruminant animals, such as cows and sheep, pro... | Synthetic fertilizers contribute to greenhouse... | The use of synthetic fertilizers in agricultur... | 0.833333 | 0.989851 | 0.569397 | 1.0 | 1.0 |
| 9 | How do digital technologies contribute to ener... | [Healthy ecosystems provide services such as w... | Digital technologies, such as artificial intel... | Digital technologies contribute to energy effi... | 0.166667 | 0.961692 | 0.389547 | 0.5 | 1.0 |
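
Per-row scores are most useful for finding the questions where the system struggles. The sketch below flags weak rows using the faithfulness and context precision values copied from the ten rows shown above. The 0.7 threshold is a hypothetical cut-off, and plain tuples are used instead of the DataFrame to keep the example self-contained.

```python
# (row index, faithfulness, context_precision) copied from the rows shown above
rows = [
    (0, 1.0, 1.0),
    (1, 1.0, 1.0),
    (2, 0.6875, 0.916667),
    (3, 0.8, 1.0),
    (4, 0.636364, 1.0),
    (5, 1.0, 1.0),
    (6, 0.833333, 1.0),
    (7, 0.923077, 0.0),
    (8, 0.833333, 1.0),
    (9, 0.166667, 1.0),
]

THRESHOLD = 0.7  # hypothetical cut-off for "needs a closer look"

# Rows whose answers are weakly supported by the retrieved context
low_faithfulness = [row for row, faith, _ in rows if faith < THRESHOLD]

# Rows where the retrieved context was ranked poorly
low_precision = [row for row, _, precision in rows if precision < THRESHOLD]

print(low_faithfulness)  # [2, 4, 9]
print(low_precision)     # [7]
```

Reading the question, contexts, and answer for the flagged rows is a practical next step before changing chunk sizes or retriever settings.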
