# Advanced Retrieval With LangChain

- Multi Query - Given a single user query, use an LLM to synthetically generate multiple other queries. Use each of the new queries to retrieve documents, then take the union of those documents as the final context for your prompt

- Contextual Compression - Fluff remover. Normal retrieval but with an extra step of pulling out relevant information from each returned document. This makes each relevant document smaller for your final prompt (which increases information density)

- Parent Document Retriever - Split and embed small chunks (for maximum information density), then return the parent documents (or larger chunks) those small chunks come from

- Ensemble Retriever - Combine multiple retrievers together

- Self-Query - The retriever infers filters from a user's query and applies those filters to the underlying data

In [None]:
# You may need to restart the session after these installs finish
!pip install langchain_openai
!pip install langchain
!pip install langchain-community
!pip install unstructured
!pip install chromadb
!pip install rank_bm25
!pip install lark



In [None]:
# from dotenv import load_dotenv
# import os

# load_dotenv()
OpenAI_key="your api key"

In [None]:
# Loaders and vector stores live in langchain_community; the OpenAI classes come from langchain_openai
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

In [None]:
file_dir = "/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/"

# loader = DirectoryLoader('../data/PaulGrahamEssaysLarge/', glob="**/*.txt", show_progress=True)
loader = DirectoryLoader(file_dir, glob="**/*.txt", show_progress=True)
docs = loader.load()

100%|██████████| 49/49 [00:26<00:00,  1.83it/s]


In [None]:
print(f"You have {len(docs)} essays loaded")

You have 49 essays loaded


In [None]:
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)

print(f"Your {len(docs)} documents have been split into {len(splits)} chunks")

Your 49 documents have been split into 468 chunks


In [None]:
if 'vectordb' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectordb.delete_collection()

embedding = OpenAIEmbeddings(openai_api_key=OpenAI_key)
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

### MultiQuery
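This retrieval method uses an LLM to generate 3 additional versions of your question and retrieves documents for each of them. In recent LangChain versions the original question is only added to the list (for 4 queries in total) if you pass `include_original=True`. This helps when relevant documents are worded differently from your question, since each rewrite can match different phrasing of the same meaning.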
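
To make the "union" step concrete, here is a minimal sketch in plain Python. It assumes each query has already returned its own list of documents (the retrieval call itself is left out). The `Doc` class, the `unique_union` helper, and the toy file names are hypothetical and only for illustration; they are not LangChain API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    source: str


def unique_union(results_per_query):
    """Merge per-query result lists, keeping the first copy of each document."""
    seen = set()
    merged = []
    for results in results_per_query:
        for doc in results:
            if doc not in seen:  # frozen dataclasses are hashable
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy results for several query rewrites with overlapping hits
a = Doc("Release early.", "a.txt")
b = Doc("Don't hire too fast.", "b.txt")
c = Doc("Make something people want.", "c.txt")
results = [[a, b], [a, c], [b], [c, a]]

merged = unique_union(results)
print(len(merged))  # fewer documents than the 7 total hits
```

Because overlapping hits are collapsed, the number of unique documents is usually well below the number of queries times the number of results per query.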

In [None]:
from langchain_openai import ChatOpenAI  # avoids the deprecated langchain.chat_models import path
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.prompts import PromptTemplate
# Set logging for the queries
import logging

In [None]:
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [None]:
question = "What is the authors view on the early stages of a startup?"
llm = ChatOpenAI(openai_api_key=OpenAI_key, temperature=0)

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)



In [None]:
unique_docs = retriever_from_llm.get_relevant_documents(query=question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How does the author perceive the initial phases of a startup?', "2. What are the author's thoughts on the beginning stages of a startup?", "3. What is the author's perspective on the early development of a startup?"]


In [None]:
len(unique_docs)

7

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
llm.predict(text=PROMPT.format_prompt(
    context=unique_docs,
    question=question
).text)



"The author's view on the early stages of a startup is that it is important to release a minimal version 1 quickly and then improve it based on users' reactions. The author emphasizes the importance of releasing early and iterating based on feedback."

### Contextual Compression
Next up is contextual compression. This takes the chunks you've made (above) and compresses their information down to the parts relevant to your query.

Say you have a chunk with 3 topics in it, but you only really care about one of them. The compressor looks at your query, sees that you only need one of the 3 topics, then extracts and returns just that topic.

This one is a bit more expensive because each returned doc gets processed an additional time (to pull out the relevant data).
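
Here is a toy sketch of that extra step. It swaps the LLM for a simple keyword check (an assumption made so the example runs offline), and the `extract_relevant` and `compress_documents` functions are hypothetical names, not the LangChain implementation.

```python
import re


def extract_relevant(text, query):
    """Keep only sentences sharing a word with the query (toy stand-in for an LLM)."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if query_words & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)


def compress_documents(texts, query):
    """Compress each text and drop those with nothing relevant left."""
    compressed = (extract_relevant(t, query) for t in texts)
    return [c for c in compressed if c]


# One chunk with three topics
chunk = (
    "Release a minimal version early. "
    "Wealth variation is a separate debate. "
    "Hiring too fast kills companies."
)
print(compress_documents([chunk], "release version"))
print(compress_documents([chunk], "gardening"))
```

Note that a query unrelated to the chunk leaves nothing to keep, so the result is an empty list rather than a shortened document.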

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [None]:
llm = ChatOpenAI(temperature=0, model='gpt-4-turbo-2024-04-09', openai_api_key=OpenAI_key)

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=vectordb.as_retriever())

In [None]:
splits[0].page_content

'Want to start a startup? Get funded by\n\nY Combinator.\n\nJuly 2004(This essay is derived from a talk at Oscon 2004.)\n\nA few months ago I finished a new\n\nbook,\n\nand in reviews I keep\n\nnoticing words like "provocative\'\' and "controversial.\'\' To say\n\nnothing of "idiotic. \'\'I didn\'t mean to make the book controversial. I was trying to make\n\nit efficient. I didn\'t want to waste people\'s time telling them\n\nthings they already knew. It\'s more efficient just to give them\n\nthe diffs. But I suppose that\'s bound to yield an alarming book.EdisonsThere\'s no controversy about which idea is most controversial:\n\nthe suggestion that variation in wealth might not be as big a\n\nproblem as we think.I didn\'t say in the book that variation in wealth was in itself a\n\ngood thing. I said in some situations it might be a sign of good\n\nthings. A throbbing headache is not a good thing, but it can be\n\na sign of a good thing-- for example, that you\'re recovering\n\nconsciou

In [None]:
compressor.compress_documents(documents=[splits[0]], query="test for what you like to do")
# An empty list here likely means the extractor found nothing in this chunk relevant to the query; worth testing with other queries
# For a relevant query, the result should be a shorter document with denser information

[]

In [None]:
question = "What is the authors view on the early stages of a startup?"
compressed_docs = compression_retriever.get_relevant_documents(question)

In [None]:
print(len(compressed_docs))
compressed_docs
# This appears to work, but it is still worth comparing the compressed docs against the uncompressed retrieval results

4


[Document(page_content="Startups can die from releasing something full of bugs, and not fixing them fast enough, but I don't know of any that died from releasing something stable but minimal very early, then promptly improving it. [2]", metadata={'source': '/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content='Release Early.The thing I probably repeat most is this recipe for a startup: get\n\na version 1 out fast, then improve it based on users\' reactions.By "release early" I don\'t mean you should release something full\n\nof bugs, but that you should release something minimal. Users hate\n\nbugs, but they don\'t seem to mind a minimal version 1, if there\'s\n\nmore coming soon.There are several reasons it pays to get version 1 done fast. One\n\nis that this is simply the right way to write software, whether for\n\na startup or not. I\'ve been repeating that since 1993, and I haven\'t seen much since to\n\ncontradict it. I

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [None]:
llm.predict(text=PROMPT.format_prompt(
    context=compressed_docs,
    question=question
).text)

"The author's view on the early stages of a startup emphasizes the importance of releasing a product early, even if it is minimal, rather than waiting to release a more complete version that might be full of bugs. The author advocates for getting a version 1 out quickly and then improving it based on user feedback. This approach is seen as beneficial because it allows startups to adapt and evolve based on actual user needs and reactions, rather than assumptions. The author believes that releasing something stable but minimal early on and then promptly improving it is less likely to lead to the failure of a startup compared to releasing something full of bugs without quick fixes. This strategy is described as the right way to write software, not just for startups but in general, and is reiterated as a recipe for startup success."

### Parent Document Retriever
The [LangChain](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/) documentation does a great job describing this; my lightly edited version is below:

When you split your docs, you generally want small chunks so that their embeddings most accurately reflect their meaning. If a chunk is too long, its embedding can lose meaning.

But at the same time, you may want the information around those small chunks so you keep the context of the longer document.

The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
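
The lookup from small chunks back to parents can be sketched in a few lines. This is a toy model under stated assumptions: the dictionaries, the placeholder texts, and the `parents_for` helper are hypothetical, and the similarity search is replaced by a hard-coded list of hits. Only the `doc_id` metadata key mirrors what the real retriever stores on each child chunk.

```python
# Hypothetical toy data: child chunks carry the id of their parent document
docstore = {
    "p1": "Full text of essay one ...",
    "p2": "Full text of essay two ...",
}
child_hits = [  # pretend these came back from a similarity search, best first
    {"text": "small chunk A", "doc_id": "p2"},
    {"text": "small chunk B", "doc_id": "p1"},
    {"text": "small chunk C", "doc_id": "p2"},
]


def parents_for(child_hits, docstore):
    """Map child hits to their parents, returning each parent once, in hit order."""
    parent_ids = []
    for hit in child_hits:
        if hit["doc_id"] not in parent_ids:
            parent_ids.append(hit["doc_id"])
    return [docstore[pid] for pid in parent_ids]


parents = parents_for(child_hits, docstore)
print(len(parents))
```

Three child hits map to only two parents here, because two of the small chunks come from the same parent.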

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

In [None]:
# This text splitter is used to create the child documents, so it should use a small chunk size.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [None]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="return_full_documents",
    embedding_function=OpenAIEmbeddings(openai_api_key=OpenAI_key)
)

In [None]:
# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Now we will add the whole essays (`docs`) that we loaded above. We pass in the unchunked essays, and `.add_documents` does the small chunking for us with the `child_splitter` above.

In [None]:
retriever.add_documents(docs, ids=None)

In [None]:
sub_docs = vectorstore.similarity_search("what is some investing advice?")
sub_docs

[Document(page_content="people there are rich, or expect to be when their options vest.\n\nOrdinary employees find it very hard to recommend an acquisition;\n\nit's just too annoying to see a bunch of twenty year olds get rich\n\nwhen you're still working for salary. Even if it's the right thing\n\nfor your company to do.The Solution(s)Bad as things look now, there is a way for VCs to save themselves.", metadata={'doc_id': '66742520-3cb8-4846-aec8-3f07a4888411', 'source': '/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/vcsqueeze.txt'}),
 Document(page_content="the product is expensive to develop or sell, or simply because\n\nthey're wasteful.If you're paying attention, you'll be asking at this point not just\n\nhow to avoid the fatal pinch, but how to avoid being default dead.\n\nThat one is easy: don't hire too fast. Hiring too fast is by far\n\nthe biggest killer of startups that raise money.", metadata={'doc_id': '1d7f8f52-b3b5-4a01-81a0-8b538b134205', 'source'

In [None]:
retrieved_docs = retriever.get_relevant_documents("what is some investing advice?")

I'm only going to show the first doc to save space, but there are more waiting for you. Keep in mind that LangChain takes the union of docs, so if two child docs come from the same parent doc, the retriever returns that parent doc once, not twice.

In [None]:
retrieved_docs[0].page_content[:1000]

"November 2005In the next few years, venture capital funds will find themselves\n\nsqueezed from four directions. They're already stuck with a seller's\n\nmarket, because of the huge amounts they raised at the end of the\n\nBubble and still haven't invested. This by itself is not the end\n\nof the world. In fact, it's just a more extreme version of the\n\nnorm\n\nin the VC business: too much money chasing too few deals.Unfortunately, those few deals now want less and less money, because\n\nit's getting so cheap to start a startup. The four causes: open\n\nsource, which makes software free; Moore's law, which makes hardware\n\ngeometrically closer to free; the Web, which makes promotion free\n\nif you're good; and better languages, which make development a lot\n\ncheaper.When we started our startup in 1995, the first three were our biggest\n\nexpenses. We had to pay $5000 for the Netscape Commerce Server,\n\nthe only software that then supported secure http connections. We\n\npaid $3000

In [None]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="return_split_parent_documents", embedding_function=OpenAIEmbeddings(openai_api_key=OpenAI_key))

# The storage layer for the parent documents
store = InMemoryStore()

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

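Now this time when we add documents two things will happen:

1. Larger chunks - We'll split our docs into large chunks with the `parent_splitter`
2. Smaller chunks - We'll split those large chunks into smaller chunks with the `child_splitter`

The small chunks are embedded and indexed in the vector store, each tagged with the id of the large chunk it came from, while the large chunks are kept in the docstore.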
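
The two-level indexing can be sketched as follows. This is a simplified model, not the library's code: `split_fixed` is a naive fixed-width splitter standing in for `RecursiveCharacterTextSplitter`, and `index_two_levels` is a hypothetical helper name.

```python
import uuid


def split_fixed(text, size):
    """Naive fixed-width splitter (a stand-in for RecursiveCharacterTextSplitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def index_two_levels(text, parent_size, child_size):
    """Return (docstore, child_index) for one document."""
    docstore = {}      # parent id -> large chunk (what retrieval returns)
    child_index = []   # (small chunk, parent id) pairs (what gets embedded)
    for parent in split_fixed(text, parent_size):
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = parent
        for child in split_fixed(parent, child_size):
            child_index.append((child, parent_id))
    return docstore, child_index


text = "x" * 4500  # toy document of 4500 characters
docstore, child_index = index_two_levels(text, parent_size=2000, child_size=400)
print(len(docstore), len(child_index))  # 3 12
```

The docstore ends up with one entry per large chunk, which is why counting its keys tells you how many parent chunks were created.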

In [None]:
retriever.add_documents(docs)

In [None]:
len(list(store.yield_keys()))

385


Then let's go get our small chunks to make sure it's working and see how long they are

In [None]:
sub_docs = vectorstore.similarity_search("what is some investing advice?")
sub_docs

[Document(page_content="people there are rich, or expect to be when their options vest.\n\nOrdinary employees find it very hard to recommend an acquisition;\n\nit's just too annoying to see a bunch of twenty year olds get rich\n\nwhen you're still working for salary. Even if it's the right thing\n\nfor your company to do.The Solution(s)Bad as things look now, there is a way for VCs to save themselves.", metadata={'doc_id': '6e124873-b58e-45e4-85ae-7926d7d5456b', 'source': '/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/vcsqueeze.txt'}),
 Document(page_content="the product is expensive to develop or sell, or simply because\n\nthey're wasteful.If you're paying attention, you'll be asking at this point not just\n\nhow to avoid the fatal pinch, but how to avoid being default dead.\n\nThat one is easy: don't hire too fast. Hiring too fast is by far\n\nthe biggest killer of startups that raise money.", metadata={'doc_id': 'f13982e6-27ce-400f-8f2d-b82983065bbf', 'source'

Now let's do the full process. The search still runs over the small chunks, but the retriever returns the larger chunks they came from as our relevant documents.

In [None]:
larger_chunk_relevant_docs = retriever.get_relevant_documents("what is some investing advice?")
larger_chunk_relevant_docs[0]
# Compare this larger chunk with the small chunks returned by the similarity search above

Document(page_content='all practical purposes, succeeding now equals getting bought. Which\n\nmeans VCs are now in the business of finding promising little 2-3\n\nman startups and pumping them up into companies that cost $100\n\nmillion to acquire. They didn\'t mean to be in this business; it\'s\n\njust what their business has evolved into.Hence the fourth problem: the acquirers have begun to realize they\n\ncan buy wholesale. Why should they wait for VCs to make the startups\n\nthey want more expensive? Most of what the VCs add, acquirers don\'t\n\nwant anyway. The acquirers already have brand recognition and HR\n\ndepartments. What they really want is the software and the developers,\n\nand that\'s what the startup is in the early phase: concentrated\n\nsoftware and developers.Google, typically, seems to have been the first to figure this out.\n\n"Bring us your startups early," said Google\'s speaker at the Startup School. They\'re quite\n\nexplicit about it: they like to acquire sta

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

llm.predict(text=PROMPT.format_prompt(
    context=larger_chunk_relevant_docs,
    question=question
).text)

"The investing advice extracted from the provided documents includes:\n\n1. **Avoid Overhiring**: Startups should be cautious about hiring too quickly. Rapid hiring can be a significant factor leading to the failure of startups that have raised money. Founders often overestimate the need to hire to foster growth, which can lead to excessive spending and operational inefficiencies.\n\n2. **Focus on Product Appeal**: Ensure that the product is highly appealing to drive growth. Startups often fail because their product is only moderately appealing, leading to mediocre growth. Founders should focus on enhancing the product's appeal rather than assuming that hiring more employees will automatically boost growth.\n\n3. **Be Wary of Dependence on Further Funding**: Founders should not assume it will be easy to raise more money. It's crucial to distinguish between current facts and hopeful future outcomes. Startups should have a clear plan B for survival if additional funding cannot be secured

### Ensemble Retriever
The next one on our list combines multiple retrievers together. The goal here is to see what multiple methods return, then pull them together for (hopefully) better results.

You may need to install BM25 support with `!pip install rank_bm25`

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

We'll use a [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) retriever for this one, which is really good at keyword matching (as opposed to semantic matching). When you combine this method with regular semantic search, it's known as hybrid search.
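
One common way to merge a keyword ranking and a semantic ranking is weighted Reciprocal Rank Fusion, which is also what LangChain's `EnsembleRetriever` is documented to use. The sketch below is a minimal version under assumptions: the constant `c=60`, the tie-breaking, and the `weighted_rrf` name are mine, and documents are represented by plain string ids.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order
    return sorted(scores, key=scores.get, reverse=True)


keyword_hits = ["doc_a", "doc_b"]   # e.g. from a keyword retriever with k=2
semantic_hits = ["doc_b", "doc_c"]  # e.g. from a vector retriever with k=2

fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_c']
```

A document found by both retrievers rises to the top, and two retrievers with `k=2` can yield 3 fused results when one document overlaps.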

In [None]:
# Initialize the BM25 (keyword) retriever; the Chroma retriever is set up in the next cell
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 2

In [None]:
embedding = OpenAIEmbeddings(openai_api_key=OpenAI_key)
vectordb = Chroma.from_documents(splits, embedding)
# Keep the retriever under its own name so `vectordb` stays a Chroma store
chroma_retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [None]:
# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.5, 0.5])

In [None]:
# Note: for non-English languages such as Korean, the basic BM25Retriever performs poorly; use a morpheme analyzer for tokenization
ensemble_docs = ensemble_retriever.get_relevant_documents("what is some investing advice?")
len(ensemble_docs)

3

In [None]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

llm.predict(text=PROMPT.format_prompt(
    context=ensemble_docs,
    question=question
).text)

'Investing advice mentioned in the context includes making a larger number of smaller investments instead of a handful of giant ones, funding younger, more technical founders instead of MBAs, and seeking seed funding from successful startup founders who can also provide advice.'

### Self Querying
The last one we'll look at today is self querying. This is when the retriever uses an LLM to write a structured query of its own, so it can apply filters when doing its final query.

This means it'll use the user's query for semantic search, but also its own query for filtering (so the user doesn't have to give a structured filter).
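
As a rough mental model, the LLM turns the user's sentence into a search string plus a metadata filter (and optionally a limit). The sketch below assumes that shape. `StructuredQuery` here is a hypothetical dataclass for illustration, the toy docs and the `essay` metadata key are made up, and similarity ranking is left out so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StructuredQuery:
    """Hypothetical result of the LLM's query-construction step."""
    query: str                                    # text for semantic search
    filters: dict = field(default_factory=dict)   # metadata equality filters
    limit: Optional[int] = None                   # max number of docs to return


def apply_structured_query(docs, sq):
    """Filter by metadata, then truncate; ranking by similarity is omitted here."""
    matches = [
        d for d in docs
        if all(d["metadata"].get(k) == v for k, v in sq.filters.items())
    ]
    return matches[: sq.limit] if sq.limit is not None else matches


docs = [
    {"text": "A test for what you're addicted to.", "metadata": {"essay": "island"}},
    {"text": "Make many small investments.", "metadata": {"essay": "worked"}},
    {"text": "Give talks to develop essays.", "metadata": {"essay": "worked"}},
]

# What an LLM might produce for "investment advice in the 'worked' essay, return only 1"
sq = StructuredQuery(query="investment advice", filters={"essay": "worked"}, limit=1)
result = apply_structured_query(docs, sq)
print(result)
```

The filter only matches exact metadata values, which is why the value in the user's request has to line up with what is stored.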

In [None]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

embeddings = OpenAIEmbeddings(openai_api_key=OpenAI_key)
llm = ChatOpenAI(temperature=0, model='gpt-4', openai_api_key=OpenAI_key)

In [None]:
if 'vectorstore' in globals(): # If you've already made your vectorstore this will delete it so you start fresh
    vectorstore.delete_collection()

vectorstore = Chroma.from_documents(
    splits, embeddings
)

Below is the information on the filters available. This tells the model which metadata fields it can filter on and what they mean.

In [None]:
metadata_field_info=[
    AttributeInfo(
        name="source",
        description="The filename of the essay",
        type="string or list[string]",
    ),
]

In [None]:
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)

In [None]:
retriever.get_relevant_documents("Return only 1 essay. What is one thing you can do to figure out what you like to do from source '/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/island.txt'")

[Document(page_content="July 2006I've discovered a handy test for figuring out what you're addicted\n\nto. Imagine you were going to spend the weekend at a friend's house\n\non a little island off the coast of Maine. There are no shops on\n\nthe island and you won't be able to leave while you're there. Also,\n\nyou've never been to this house before, so you can't assume it will\n\nhave more than any house might.What, besides clothes and toiletries, do you make a point of packing?\n\nThat's what you're addicted to. For example, if you find yourself\n\npacking a bottle of vodka (just in case), you may want to stop and\n\nthink about that.For me the list is four things: books, earplugs, a notebook, and a\n\npen.There are other things I might bring if I thought of it, like music,\n\nor tea, but I can live without them. I'm not so addicted to caffeine\n\nthat I wouldn't risk the house not having any tea, just for a\n\nweekend.Quiet is another matter. I realize it seems a bit eccentric to\n\

In [None]:
import re

for split in splits:
    split.metadata['essay'] = re.search(r'[^/]+(?=\.\w+$)', split.metadata['source']).group()
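
As a quick check of the regular expression above, this standalone snippet runs it on one of the source paths from this notebook. It also compares the result with `pathlib.Path.stem`, an alternative that I'm assuming works just as well for these simple `.txt` filenames.

```python
import re
from pathlib import Path

source = "/content/drive/MyDrive/llm/RAG/playground/data/PaulGrahamEssaysLarge/island.txt"

# Match the last path segment, stopping before the final ".ext"
essay = re.search(r"[^/]+(?=\.\w+$)", source).group()
print(essay)  # island

# Path.stem also drops the directory and the final suffix
print(Path(source).stem)  # island
```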

In [None]:
metadata_field_info=[
    AttributeInfo(
        name="essay",
        description="The name of the essay",
        type="string or list[string]",
    ),
]

In [None]:
if 'vectorstore' in globals(): # If you've already made your vectorstore this will delete it so you start fresh
    vectorstore.delete_collection()

vectorstore = Chroma.from_documents(
    splits, embeddings
)

In [None]:
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)

In [None]:
retriever.get_relevant_documents("Tell me about investment advice the 'worked' essay? return only 1")

[Document(page_content='should make a larger number of smaller investments instead of a\n\nhandful of giant ones, they should be funding younger, more technical\n\nfounders instead of MBAs, they should let the founders remain as\n\nCEO, and so on.One of my tricks for writing essays had always been to give talks.\n\nThe prospect of having to stand up in front of a group of people\n\nand tell them something that won\'t waste their time is a great\n\nspur to the imagination. When the Harvard Computer Society, the\n\nundergrad computer club, asked me to give a talk, I decided I would\n\ntell them how to start a startup. Maybe they\'d be able to avoid the\n\nworst of the mistakes we\'d made.So I gave this talk, in the course of which I told them that the\n\nbest sources of seed funding were successful startup founders,\n\nbecause then they\'d be sources of advice too. Whereupon it seemed\n\nthey were all looking expectantly at me. Horrified at the prospect\n\nof having my inbox flooded by b

Awesome! It returned the right essay for us. It's a bit rigid because you need to put in the exact name of the file/essay you want to get. You could add a pre-step that infers the correct essay from the user's input, but that is out of scope for now and application specific.
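
For a flavor of what such a pre-step might look like, here is one possible sketch using fuzzy string matching from the standard library. The `infer_essay` helper and the `0.6` cutoff are assumptions for illustration, and a real application would likely need something more robust.

```python
import difflib


def infer_essay(user_text, essay_names, cutoff=0.6):
    """Return the closest essay name to user_text, or None if nothing is close."""
    matches = difflib.get_close_matches(
        user_text.lower(), essay_names, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Essay names as they would appear in the `essay` metadata field
essay_names = ["island", "worked", "vcsqueeze", "startuplessons"]
print(infer_essay("VC squeeze", essay_names))
print(infer_essay("zzzz", essay_names))
```

The inferred name could then be substituted into the question before it is passed to the self-query retriever.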