In [1]:
import dotenv

dotenv.load_dotenv()

True

In this document I explore different kinds of vector retrievers in LangChain. I first try the vanilla vector store retriever, then `MultiVectorRetriever` with embeddings of smaller chunks (similar to `ParentDocumentRetriever`), and finally `MultiVectorRetriever` with summary embeddings. I use the same query throughout to compare their performance.

In [2]:
import bs4
from langchain import hub
from langchain_core.documents import Document
from langchain_community.document_loaders import WebBaseLoader
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
from langchain_core.prompts import ChatPromptTemplate

In [3]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

## Load pages and split into chunks

In [4]:
pages = ["https://zylo.com/blog/guide-saas-renewal/", "https://zylo.com/blog/saas-management/"]

In [5]:
loader = WebBaseLoader(
    web_paths=pages,
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("hero__text-section-container col col-12 col-lg-6", "site-main")
        )
    ),
)
docs = loader.load()

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(docs)

In [7]:
len(docs)

76

## Vanilla retriever: vectorstore

In [8]:
vectorstore = Chroma.from_documents(documents=docs, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
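As a rough mental model of what a `similarity` search with a fixed `k` does, here is a minimal pure-Python sketch. It is not how Chroma is implemented (the real store uses its own index and distance metric), and the three-dimensional vectors and chunk names below are made up for illustration rather than taken from `OpenAIEmbeddings`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k vectors most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy embeddings, one per chunk.
toy_vectors = {
    "renewal_goals": [0.9, 0.1, 0.0],
    "renewal_management": [0.7, 0.3, 0.1],
    "security": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

top_two = top_k(toy_query, toy_vectors, k=2)
```

With `k=2` the unrelated `security` chunk is dropped; raising `k` trades precision for recall in the same way `search_kwargs={"k": 6}` does above.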

In [9]:
retrieved_docs = retriever.invoke("saas renewal strategy")

In [10]:
retrieved_docs

[Document(page_content='However, this can be challenging due to the sheer number of known and unknown applications operating across an organization. That’s why your organization needs a strong SaaS renewal strategy.\xa0\nGoals of a SaaS Renewal Strategy\nProactively managing your renewals is vital for holistic SaaS management. Rather than dealing with renewals only when they appear, a strong SaaS renewal strategy empowers organizations to cut SaaS costs and risks. This is essential for optimizing your SaaS investments.\xa0\nAn effective SaaS renewal strategy will:', metadata={'source': 'https://zylo.com/blog/guide-saas-renewal/'}),
 Document(page_content='SaaS Renewal Management\nIn most companies, SaaS renewals occur without any sort of structure. Auto-renewals, click-through terms, and lack of benchmark data leave you ill-prepared and overpaying for software. The average organization has more than 200 renewals a year – that’s about one per business day!\xa0\nWith that many renewals, 

## MultiVectorRetriever with smaller chunks embedding

In [11]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
    search_kwargs={"k": 10},
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs]

In [12]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [13]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

In [14]:
len(sub_docs)

224

In [15]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
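The two calls above set up a two-step lookup: the query is matched against the small child chunks, and each hit is then mapped back to its parent through the `doc_id` stored in its metadata. The sketch below imitates that mapping with plain dictionaries. It reflects my understanding of the retriever's behaviour rather than its actual source, and `resolve_parents`, `toy_parents`, and `toy_hits` are hypothetical names used only for illustration.

```python
def resolve_parents(child_hits, parent_store, id_key="doc_id"):
    """Map ranked child hits to their parents, keeping order and dropping repeats."""
    parent_ids = []
    for hit in child_hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    # Skip ids that are missing from the parent store.
    return [parent_store[pid] for pid in parent_ids if pid in parent_store]


# Hypothetical stand-ins for the docstore and for ranked child-chunk hits.
toy_parents = {
    "p1": "parent chunk about renewal goals",
    "p2": "parent chunk about renewal management",
}
toy_hits = [
    {"text": "goals of a renewal strategy", "metadata": {"doc_id": "p1"}},
    {"text": "why a strategy is needed", "metadata": {"doc_id": "p1"}},
    {"text": "renewal management", "metadata": {"doc_id": "p2"}},
]

resolved = resolve_parents(toy_hits, toy_parents)
```

Because several child chunks can point to the same parent, a search with `k=10` over child chunks may return fewer than 10 parent documents.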

In [16]:
retriever.vectorstore.similarity_search("SaaS Renewal Strategy")[0]

Document(page_content='Goals of a SaaS Renewal Strategy\nProactively managing your renewals is vital for holistic SaaS management. Rather than dealing with renewals only when they appear, a strong SaaS renewal strategy empowers organizations to cut SaaS costs and risks. This is essential for optimizing your SaaS investments.\xa0\nAn effective SaaS renewal strategy will:', metadata={'doc_id': '29ce4bbe-6e74-4186-b412-ccfac5d89fb3', 'source': 'https://zylo.com/blog/guide-saas-renewal/'})

In [17]:
retrieved_docs = retriever.invoke("saas renewal strategy")

In [18]:
retrieved_docs

[Document(page_content='However, this can be challenging due to the sheer number of known and unknown applications operating across an organization. That’s why your organization needs a strong SaaS renewal strategy.\xa0\nGoals of a SaaS Renewal Strategy\nProactively managing your renewals is vital for holistic SaaS management. Rather than dealing with renewals only when they appear, a strong SaaS renewal strategy empowers organizations to cut SaaS costs and risks. This is essential for optimizing your SaaS investments.\xa0\nAn effective SaaS renewal strategy will:', metadata={'source': 'https://zylo.com/blog/guide-saas-renewal/'}),
 Document(page_content='SaaS Renewal Management\nIn most companies, SaaS renewals occur without any sort of structure. Auto-renewals, click-through terms, and lack of benchmark data leave you ill-prepared and overpaying for software. The average organization has more than 200 renewals a year – that’s about one per business day!\xa0\nWith that many renewals, 

## MultiVectorRetriever with summary embeddings

In [19]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

In [20]:
summaries = chain.batch(docs, {"max_concurrency": 5})

In [21]:
len(summaries)

76

In [22]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
    search_kwargs={"k": 10},
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [23]:
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

In [24]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [25]:
vectorstore.similarity_search("saas renewal strategy")

[Document(page_content='The document emphasizes the importance of having a strong SaaS renewal strategy in place in order to proactively manage renewals, cut costs, and minimize risks associated with the numerous applications operating within an organization. The goals of such a strategy include optimizing SaaS investments and taking a holistic approach to SaaS management.', metadata={'doc_id': 'fe63348e-38c2-41f1-a234-4bd97b65ba42'}),
 Document(page_content='SaaS renewals are often overlooked and can result in overpaying for software due to lack of structure and benchmark data. With an average of 200 renewals per year, renewal management is crucial for negotiating the best price and making intelligent purchasing decisions. This process involves strategy, preparation, negotiation, and acceptance of software contracts, and is typically managed by Procurement, IT, or Software Asset Management departments.', metadata={'doc_id': 'e6d1a195-53fd-4aba-9e2b-4530ed35acbc'}),
 Document(page_cont

In [26]:
retriever.invoke("saas renewal strategy")

[Document(page_content='However, this can be challenging due to the sheer number of known and unknown applications operating across an organization. That’s why your organization needs a strong SaaS renewal strategy.\xa0\nGoals of a SaaS Renewal Strategy\nProactively managing your renewals is vital for holistic SaaS management. Rather than dealing with renewals only when they appear, a strong SaaS renewal strategy empowers organizations to cut SaaS costs and risks. This is essential for optimizing your SaaS investments.\xa0\nAn effective SaaS renewal strategy will:', metadata={'source': 'https://zylo.com/blog/guide-saas-renewal/'}),
 Document(page_content='SaaS Renewal Management\nIn most companies, SaaS renewals occur without any sort of structure. Auto-renewals, click-through terms, and lack of benchmark data leave you ill-prepared and overpaying for software. The average organization has more than 200 renewals a year – that’s about one per business day!\xa0\nWith that many renewals, 
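To put a number on how much the three retrievers agree for the same query, one simple option is the Jaccard overlap of their top-`k` results. This is only a sketch of one possible comparison, not a measurement performed in this notebook; `overlap_at_k` is a hypothetical helper and the chunk labels are placeholders.

```python
def overlap_at_k(results_a, results_b, k):
    """Jaccard overlap between the top-k items of two ranked result lists."""
    top_a = set(results_a[:k])
    top_b = set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists agree trivially
    return len(top_a & top_b) / len(top_a | top_b)


# Placeholder labels standing in for retrieved chunk texts.
vanilla_results = ["chunk_1", "chunk_2", "chunk_3"]
summary_results = ["chunk_1", "chunk_3", "chunk_4"]

score = overlap_at_k(vanilla_results, summary_results, k=3)
```

With real results, the inputs could be built as `[d.page_content for d in retrieved_docs]` for each retriever. A score near 1 means the retrievers return largely the same chunks; it says nothing about which ranking is better, which would need labelled relevance judgments.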