In [1]:
%pip install --upgrade pip

# Uninstall conflicting packages
%pip uninstall -y langchain-core langchain-openai langchain-experimental langchain-community langchain chromadb beautifulsoup4 python-dotenv PyPDF2 rank_bm25

# Install compatible versions of langchain libraries
%pip install langchain-core==0.3.6
%pip install langchain-openai==0.2.1
%pip install langchain-experimental==0.3.2
%pip install langchain-community==0.3.1
%pip install langchain==0.3.1

# Install remaining packages
%pip install chromadb==0.5.11
%pip install beautifulsoup4==4.12.3
%pip install python-dotenv==1.0.1
%pip install PyPDF2==3.0.1 -q --user
%pip install rank_bm25==0.2.2

Note: you may need to restart the kernel to use updated packages.
Found existing installation: langchain-core 0.3.28
Uninstalling langchain-core-0.3.28:
  Successfully uninstalled langchain-core-0.3.28
Found existing installation: langchain-openai 0.2.1
Uninstalling langchain-openai-0.2.1:
  Successfully uninstalled langchain-openai-0.2.1
Found existing installation: langchain-experimental 0.3.2
Uninstalling langchain-experimental-0.3.2:
  Successfully uninstalled langchain-experimental-0.3.2
Found existing installation: langchain-community 0.3.1
Uninstalling langchain-community-0.3.1:
  Successfully uninstalled langchain-community-0.3.1
Found existing installation: langchain 0.3.1
Uninstalling langchain-0.3.1:
  Successfully uninstalled langchain-0.3.1
Found existing installation: chromadb 0.5.11
Uninstalling chromadb-0.5.11:
  Successfully uninstalled chromadb-0.5.11
Found existing installation: beautifulsoup4 4.12.3
Uninstalling beautifulsoup4-4.12.3:
  Successfully uninstalled beau

In [1]:
import os
os.environ['USER_AGENT'] = 'RAGUserAgent'
import openai
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import chromadb
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnableParallel
from dotenv import load_dotenv, find_dotenv
from langchain_core.prompts import PromptTemplate
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents.base import Document
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# New in this notebook: serialize and deserialize Documents so they can be deduplicated
from langchain.load import dumps, loads


In [2]:
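# Load environment variables from a .env file, if one exists
load_dotenv(find_dotenv())
api_key = os.getenv('OPENAI_API_KEY')
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set")
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = api_key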
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
embedding_function = OpenAIEmbeddings()
pdf_path = "google-2023-environmental-report.pdf"
collection_name = "google_environmental_report"
str_output_parser = StrOutputParser()
user_query = "What are Google's environmental initiatives?"

In [3]:
docs = []
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PdfReader(pdf_file)
    # Concatenate the text of every page, then split on blank lines into Documents
    pdf_text = "".join(page.extract_text() for page in pdf_reader.pages)
    docs = [Document(page_content=block) for block in pdf_text.split("\n\n")]

In [4]:
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=200
)

splits = recursive_splitter.split_documents(docs)

In [5]:
dense_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "search_source": "dense"}) for i, doc in enumerate(splits)]
sparse_documents = [Document(page_content=doc.page_content, metadata={"id": str(i), "search_source": "sparse"}) for i, doc in enumerate(splits)]

In [6]:
chroma_client = chromadb.Client()
vectorstore = Chroma.from_documents(
    documents=dense_documents,
    embedding=embedding_function,
    collection_name=collection_name,
    client=chroma_client
)

In [7]:
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
sparse_retriever = BM25Retriever.from_documents(sparse_documents, k=10)
ensemble_retriever = EnsembleRetriever(retrievers=[dense_retriever, sparse_retriever], weights=[0.5, 0.5], c=0)
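
The `weights` and `c` arguments above control how the dense and sparse rankings are merged. To my understanding, `EnsembleRetriever` combines them with weighted Reciprocal Rank Fusion, where each document earns `weight / (rank + c)` from every list it appears in. The sketch below illustrates that idea in plain Python; `weighted_rrf` and the toy document ids are hypothetical names for illustration, not the library's code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        # Ranks start at 1, so with c=0 the top hit contributes the full weight
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; ties are broken by id to keep the order deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Toy rankings standing in for the dense and sparse retriever results
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.5, 0.5], c=0)
print(fused)
```

With `c=0` the score falls off steeply with rank, so top hits dominate; a larger `c` flattens the differences between ranks.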

In [8]:
prompt_decompose = PromptTemplate.from_template(
    """You are an AI language model assistant.
    Your task is to generate five different versions of the given 
    user query to retrieve relevant documents from a vector search. 
    By generating multiple perspectives on the user question, 
    your goal is to help the user overcome some of the limitations 
    of the distance-based similarity search. 
    Provide these alternative questions separated by newlines. 
    Original question: {question}"""
)

decompose_queries_chain = (
    prompt_decompose
    | llm
    | str_output_parser
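    # Split into one query per line, skipping blank lines the model may emit
    | (lambda x: [line for line in x.split("\n") if line.strip()])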
)

# Invoke decompose_queries_chain and print the five different versions
decomposed_queries = decompose_queries_chain.invoke({"question": user_query})
print("Five different versions of the user query:")
print(f"Original: {user_query}")
for question in decomposed_queries:
    print(question.strip())

Five different versions of the user query:
Original: What are Google's environmental initiatives?
1. What initiatives has Google implemented to promote environmental sustainability?
2. Can you provide information on Google's efforts towards environmental conservation?
3. What programs or projects does Google have in place to address environmental issues?
4. How is Google contributing to environmental protection and sustainability?
5. What are the key environmental strategies and policies adopted by Google?
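
As the output shows, the model numbers its variants, so each query string sent to the retrievers still begins with a prefix such as `1. `. If that is unwanted, one option is to strip the prefix before retrieval. The helper below is a minimal sketch; `clean_query_lines` is a hypothetical name and is not part of the chain defined above.

```python
import re

def clean_query_lines(raw_text):
    """Split LLM output into queries, dropping blanks and list numbering."""
    queries = []
    for line in raw_text.split("\n"):
        # Remove a leading "1." or "2)" style prefix, then surrounding whitespace
        line = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        if line:
            queries.append(line)
    return queries

sample_output = "1. First question?\n\n2) Second question?\n   "
print(clean_query_lines(sample_output))
```

It could replace the final `split` step of `decompose_queries_chain` if cleaner query strings are preferred.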


In [9]:
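def format_retrieved_docs(documents: list[list]):
    """Flatten the per-query result lists and drop exact duplicates."""
    # Serialize each Document to a string so it is hashable and comparable
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    print(f"FLATTENED DOCS: {len(flattened_docs)}")
    # dict.fromkeys removes duplicates while keeping first-seen order (a set would not)
    deduped_docs = list(dict.fromkeys(flattened_docs))
    print(f"DEDUPED DOCS: {len(deduped_docs)}")
    return [loads(doc) for doc in deduped_docs]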

retrieval_chain = (
    decompose_queries_chain 
    | ensemble_retriever.map() 
    | format_retrieved_docs
)

# Querying with every variant retrieves far more documents than a single-query retriever would
docs = retrieval_chain.invoke({"question":user_query})

FLATTENED DOCS: 97
DEDUPED DOCS: 63
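
Because deduplication compares the serialized documents, metadata is part of the comparison. The same chunk tagged `dense` by one query and `sparse` by another may therefore survive as two entries. If that matters, one option is to key on the text alone. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for a LangChain `Document`, and `dedupe_by_content` is likewise an illustrative name.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe_by_content(nested_docs):
    """Keep the first document seen for each distinct page_content."""
    first_seen = {}
    for sublist in nested_docs:
        for doc in sublist:
            # setdefault only stores the doc if this text has not been seen yet
            first_seen.setdefault(doc.page_content, doc)
    return list(first_seen.values())

results = [
    [Doc("chunk A", {"id": "0", "search_source": "dense"})],
    [Doc("chunk A", {"id": "0", "search_source": "sparse"}),
     Doc("chunk B", {"id": "1", "search_source": "sparse"})],
]
unique_docs = dedupe_by_content(results)
print([(d.page_content, d.metadata["search_source"]) for d in unique_docs])
```

The trade-off is that the surviving copy records only one `search_source`, so the metadata no longer shows that both retrievers found the chunk.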




In [10]:
prompt_primary = PromptTemplate.from_template(
    """
    You are an environment expert assisting others in 
    understanding what large companies are doing to 
    improve the environment. Use the following pieces 
    of retrieved context with information about what 
    a particular company is doing to improve the 
    environment to answer the question. 
    
    If you don't know the answer, just say that you don't know.
    
    Question: {question} 
    Context: {context} 
    
    Answer:
    """
)

# Relevance check prompt
relevance_prompt_template = PromptTemplate.from_template(
    """
    Given the following question and retrieved context, determine if the context is relevant to the question.
    Provide a score from 1 to 5, where 1 is not at all relevant and 5 is highly relevant.
    Return ONLY the numeric score, without any additional text or explanation.

    Question: {question}
    Retrieved Context: {retrieved_context}

    Relevance Score:"""
)

In [11]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
    
def extract_score(llm_output):
    # Parse the LLM's reply as a number; treat anything unparseable as not relevant
    try:
        return float(llm_output.strip())
    except ValueError:
        return 0.0

def conditional_answer(x):
    relevance_score = extract_score(x['relevance_score'])
    if relevance_score < 4:
        return "I don't know."
    else:
        return x['answer']
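
`extract_score` assumes the model returns a bare number. If it replies with something like `Relevance Score: 4`, the `float` conversion fails, the score becomes 0, and `conditional_answer` returns "I don't know." even for relevant context. A more forgiving parser is sketched below; `extract_score_lenient` is a hypothetical alternative, and taking the first number in the reply is an assumption that will not suit every phrasing.

```python
import re

def extract_score_lenient(llm_output, default=0.0):
    """Return the first number in the reply if it lies in the 1-5 range, else default."""
    match = re.search(r"\d+(?:\.\d+)?", llm_output)
    if match is None:
        return default
    score = float(match.group())
    # Reject values outside the scale requested by the relevance prompt
    return score if 1 <= score <= 5 else default

for reply in ["5", "Relevance Score: 4", "4/5", "not sure", "10"]:
    print(repr(reply), extract_score_lenient(reply))
```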

In [12]:
rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
    | RunnableParallel(
        {
            "relevance_score": (
                RunnablePassthrough()
                | (lambda x: relevance_prompt_template.format(question=x['question'], retrieved_context=x['context']))
                | llm
                | str_output_parser
            ),
            "answer": (
                RunnablePassthrough()
                | prompt_primary
                | llm
                | str_output_parser
            )
        }
    )
    | RunnablePassthrough.assign(final_answer=conditional_answer)
)

In [13]:
rag_chain_with_source = RunnableParallel(
    {"context": retrieval_chain, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)

In [14]:
result = rag_chain_with_source.invoke(user_query)
retrieved_docs = result['context']
print(f"Original Question: {user_query}\n")
print(f"Relevance Score: {result['answer']['relevance_score']}\n")
print(f"Final Answer:\n{result['answer']['final_answer']}\n\n")
print("Retrieved Documents:")
for i, doc in enumerate(retrieved_docs, start=1):
    print(f"Document {i}: Document ID: {doc.metadata['id']} source: {doc.metadata['search_source']}")
    print(f"Content:\n{doc.page_content}\n")

FLATTENED DOCS: 97
DEDUPED DOCS: 63
Original Question: What are Google's environmental initiatives?

Relevance Score: 5

Final Answer:
Google has implemented a variety of environmental initiatives aimed at improving sustainability and addressing climate change. Here are some key aspects of their efforts:

1. **Net-Zero Carbon Goals**: Google has committed to achieving net-zero carbon emissions across its operations and value chain. This includes sourcing 24/7 carbon-free energy and investing in renewable energy projects like wind and solar farms.

2. **Water Stewardship**: The company aims to replenish more water than it consumes and improve water quality in the communities where it operates. Google has set a goal to replenish 120% of the freshwater it uses.

3. **Circular Economy**: Google is focused on creating a circular economy by sourcing sustainable materials, promoting recycling, and reducing waste. They have initiatives to recycle e-waste and use recycled materials in their pro

In [15]:
from IPython.display import Markdown, display
markdown_text = result['answer']['final_answer']
display(Markdown(markdown_text))

Google has implemented a variety of environmental initiatives aimed at improving sustainability and addressing climate change. Here are some key aspects of their efforts:

1. **Net-Zero Carbon Goals**: Google has committed to achieving net-zero carbon emissions across its operations and value chain. This includes sourcing 24/7 carbon-free energy and investing in renewable energy projects like wind and solar farms.

2. **Water Stewardship**: The company aims to replenish more water than it consumes and improve water quality in the communities where it operates. Google has set a goal to replenish 120% of the freshwater it uses.

3. **Circular Economy**: Google is focused on creating a circular economy by sourcing sustainable materials, promoting recycling, and reducing waste. They have initiatives to recycle e-waste and use recycled materials in their products.

4. **Biodiversity and Habitat Restoration**: Google is working to restore habitats and enhance biodiversity, particularly in areas where they have a significant presence. They have committed to creating and restoring habitats for pollinators like monarch butterflies.

5. **Digital Technologies for Sustainability**: Google leverages its technology, such as AI and cloud computing, to support sustainability efforts. This includes tools for monitoring environmental changes, optimizing resource use, and providing data for better decision-making.

6. **Partnerships and Collaborations**: Google collaborates with various organizations, including NGOs and government agencies, to advance sustainability goals. They are involved in initiatives like the iMasons Climate Accord and the 24/7 Carbon-Free Energy Compact.

7. **Employee Engagement**: Google promotes sustainability within its corporate culture, providing employees with opportunities to engage in environmental initiatives and learn about sustainable practices.

8. **Public Policy Advocacy**: Google actively engages in public policy discussions to support strong sustainability outcomes and has provided comments on climate-related disclosures to the SEC.

9. **Sustainable Product Features**: Google has integrated sustainability features into its products, such as eco-friendly routing in Google Maps, which has helped reduce carbon emissions significantly.

Through these initiatives, Google aims to not only reduce its own environmental impact but also empower individuals and organizations to make more sustainable choices.