### MultiVectorRetriever - Retrieve Full Documents Using Document Summaries

In [1]:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = DirectoryLoader('./source', glob="./*.txt", loader_cls=TextLoader)
docs = loader.load()

In [3]:
docs

[Document(page_content='McIlroy aiming for Madrid title\n\nNorthern Ireland man James McIlroy is confident he can win his first major title at this weekend\'s Spar European Indoor Championships in Madrid.\n\nThe 28-year-old has been in great form in recent weeks and will go in as one of the 800 metres favourites. "I believe after my wins abroad and in our trial race in Sheffield, I can run my race from the front, back or middle," said McIlroy. New coach Tony Lester has helped get McIlroy\'s career back on track. The 28-year-old 800 metres runner has not always matched his promise with performances but believes his decision to change coaches and move base will bring the rewards. McIlroy now lives in Windsor and feels his career has been transformed by the no-nonsense leadership style of former Army sergeant Lester. Lester is better known for his work with 400m runners Roger Black and Mark Richardson in the past but under his guidance McIlroy has secured five wins this indoor season.\n\n

In [16]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# LLM is a helper from this project's own utils package, not part of LangChain
from utils.llm import LLM

summ_prompt = ChatPromptTemplate.from_template("Summarize the following document in a concise and meaningful manner:\n {doc}")
llm = LLM().get_llama_together()

In [17]:
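# Extract each document's text, fill the prompt, call the LLM, and parse the reply into a string
chain = (
    {"doc": lambda x: x.page_content}
    | summ_prompt
    | llm
    | StrOutputParser()
)

# One summary per document, returned in the same order as docs
summaries = chain.batch(docs)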

In [18]:
summaries[0]

'\n\nIn summary, James McIlroy, a 28-year-old Northern Irish runner, is aiming to win his first major title at the Spar European Indoor Championships in Madrid. He has been in good form recently and has a new coach, Tony Lester, who has helped him transform his career. McIlroy is confident he can run his race from the front, back or middle and is focused on his ambition of competing at the world championships this summer.'

In [19]:
summaries[1]

"\n\nThe news of the improved industrial output and retail sales was welcomed by investors, who had been worried about the economy's ability to shake off the effects of the global slowdown. The Nikkei 225 index had fallen 10% in the past three months, as concerns about the economy's health had grown. The improved data will help to ease those fears, and could also help to boost the economy in the coming months.\n\nIn conclusion, Japan's industrial output and retail sales have shown signs of improvement, boosting hopes for the country's economic revival. The news has been welcomed by investors, who had been worried about the economy's ability to shake off the effects of the global slowdown. The improved data could help to ease those fears and boost the economy in the coming months."

In [20]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings


model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
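
The `normalize_embeddings` flag matters because, for unit-length vectors, the plain dot product equals cosine similarity (and squared Euclidean distance equals 2 - 2·cos, so ranking by either gives the same order). The sketch below checks this with made-up 3-dimensional vectors; the helper names are hypothetical and are not part of LangChain or the embedding model.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings (assumed values)
query_vec = [3.0, 4.0, 0.0]
doc_vec = [1.0, 2.0, 2.0]

cos = cosine_similarity(query_vec, doc_vec)
dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))

# After normalization the two quantities agree up to floating-point error
print(cos, dot_of_normalized)
```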


In [21]:
from langchain.storage import InMemoryByteStore
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vector store used to index the document summaries
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=embeddings,
)

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

In [22]:
# Create one ID per document; each ID is stored as metadata in the vector store alongside that document's summary
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Wrap each summary in a Document tagged with the ID of its full document
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Index the summaries in the vector store and store the full documents under the same IDs
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
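
The pairing above relies on `summaries` and `docs` having the same length and order. A small guard makes that assumption explicit. The sketch below uses plain strings and dicts in place of `Document` objects, and `link_summaries` is a hypothetical helper, not a LangChain function.

```python
import uuid

def link_summaries(parents, summaries, id_key="doc_id"):
    """Pair each summary with its parent under a fresh UUID.

    Returns (summary_records, parent_pairs). Raises ValueError when the two
    lists differ in length, since a silent mismatch would attach summaries
    to the wrong parents.
    """
    if len(parents) != len(summaries):
        raise ValueError(
            f"got {len(parents)} parents but {len(summaries)} summaries"
        )
    ids = [str(uuid.uuid4()) for _ in parents]
    # What would go into the vector store: summary text plus the parent's ID
    summary_records = [
        {"text": s, "metadata": {id_key: doc_id}}
        for doc_id, s in zip(ids, summaries)
    ]
    # What would go into the document store: (ID, full document) pairs
    parent_pairs = list(zip(ids, parents))
    return summary_records, parent_pairs

records, pairs = link_summaries(
    ["full text A", "full text B"],
    ["summary A", "summary B"],
)
```

Each entry in `records` carries the same ID as the matching entry in `pairs`, which is the only link between a summary and its full document.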

In [24]:
query = "Tell me about Japan's Industrial Growth"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs

[Document(page_content="\n\nThe news of the improved industrial output and retail sales was welcomed by investors, who had been worried about the economy's ability to shake off the effects of the global slowdown. The Nikkei 225 index had fallen 10% in the past three months, as concerns about the economy's health had grown. The improved data will help to ease those fears, and could also help to boost the economy in the coming months.\n\nIn conclusion, Japan's industrial output and retail sales have shown signs of improvement, boosting hopes for the country's economic revival. The news has been welcomed by investors, who had been worried about the economy's ability to shake off the effects of the global slowdown. The improved data could help to ease those fears and boost the economy in the coming months.", metadata={'doc_id': '4d063b36-15d2-44f9-8537-c13a70544f82'})]

In [33]:
sub_docs[0].metadata['doc_id']

'4d063b36-15d2-44f9-8537-c13a70544f82'
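
This `doc_id` is what connects a summary hit back to its full document. The retrieval is a two-step lookup: search the summaries, then fetch the parents whose IDs appear in the hits. Below is a simplified sketch of that idea using plain dicts; `retrieve_parents` and the toy data are hypothetical, and this is not the library's actual implementation.

```python
def retrieve_parents(hits, parent_store, id_key="doc_id"):
    """Map summary hits to their full documents.

    Collects parent IDs in hit order, skips duplicates and hits without an
    ID, then returns the parents that exist in the store.
    """
    ids = []
    for hit in hits:
        doc_id = hit["metadata"].get(id_key)
        if doc_id is not None and doc_id not in ids:
            ids.append(doc_id)
    return [parent_store[doc_id] for doc_id in ids if doc_id in parent_store]

# Toy document store: parent ID -> full text (assumed values)
parent_store = {
    "id-japan": "full article about Japan",
    "id-mcilroy": "full article about McIlroy",
}

# Toy search results, best match first
hits = [
    {"text": "summary of the Japan article", "metadata": {"doc_id": "id-japan"}},
    {"text": "another chunk of the Japan article", "metadata": {"doc_id": "id-japan"}},
    {"text": "summary of the McIlroy article", "metadata": {"doc_id": "id-mcilroy"}},
    {"text": "orphan summary", "metadata": {}},
]

parents = retrieve_parents(hits, parent_store)
print(parents)
```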

In [26]:
# Limit the summary search to the single best match; the retriever passes
# search_kwargs to the vector store's similarity search
retriever.search_kwargs = {"k": 1}
retrieved_docs = retriever.get_relevant_documents(query)

print(retrieved_docs[0].page_content)


Industrial revival hope for Japan

Japanese industry is growing faster than expected, boosting hopes that the country's retreat back into recession is over.


Within the overall industrial output figure, there were signs of a pullback from the export slowdown. Among the best-performing sectors were key overseas sales areas such as cars, chemicals and electronic goods. With US growth doing better than expected the picture for exports in early 2005 could also be one of sustained demand. Electronics were also one of the keys to the improved domestic market, with products such as flat-screen TVs in high demand during January.
