# Multi-Representation Indexing

What is multi-representation indexing?
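Multi-representation indexing stores several representations of each document in the retrieval system. A compact representation is used for search, and each match is mapped back to the original document. Representations can be derived with different techniques, such as:

Textual analysis: extracting keywords or named entities, or applying topic-modeling algorithms.

Semantic embeddings: using pre-trained language models to capture the semantic meaning of the text.

Visual features: processing images or diagrams associated with the document.

This notebook uses LLM-generated summaries as the extra representation: the summaries are embedded and searched, and the full chunks they point to are returned.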
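As a rough illustration of the idea, the sketch below stores two representations per document (a keyword set and the first sentence) that both carry the ID of the same parent document. The `build_index` and `extract_keywords` helpers, the stop-word list, and the sample texts are hypothetical and only show the shape of the data, not how any particular library implements it.

```python
import re
import uuid

# Tiny illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in", "on"}

def extract_keywords(text):
    """Textual-analysis representation: lowercase word set minus stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_index(documents):
    """Return (representations, docstore); each representation keeps its parent doc_id."""
    representations, docstore = [], {}
    for text in documents:
        doc_id = str(uuid.uuid4())
        docstore[doc_id] = text
        representations.append(
            {"doc_id": doc_id, "kind": "keywords", "value": extract_keywords(text)}
        )
        representations.append(
            {"doc_id": doc_id, "kind": "first_sentence", "value": text.split(".")[0]}
        )
    return representations, docstore

representations, docstore = build_index([
    "The Transformer is based on attention. It drops recurrence.",
    "Word2Vec learns word vectors. GloVe is comparable.",
])
```

Any representation that matches a query can be followed back to the full text through `docstore[representation["doc_id"]]`.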

![image.png](attachment:a6547477-6a65-4e0e-8445-f524f7bacf95.png)

# Install Necessary Libraries



In [2]:
! pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.5-py3-none-any.whl.metadata (2.9 kB)
Collecting SQLAlchemy<2.0.36,>=1.4 (from langchain_community)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.6 (from langchain_community)
  Downloading langchain-0.3.7-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.15 (from langchain_community)
  Downloading langchain_core-0.3.15-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.6.1-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from datac

In [4]:
! pip install langchain_openai

Collecting langchain_openai
  Downloading langchain_openai-0.2.5-py3-none-any.whl.metadata (2.6 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading langchain_openai-0.2.5-py3-none-any.whl (50 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken, langchain_openai
Successfully installed langchain_openai-0.2.5 tiktoken-0.8.0


In [8]:
! pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [12]:
! pip install chromadb

Collecting chromadb
  Downloading chromadb-0.5.17-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.27.0-py3

In [45]:
! pip install langchain_chroma

Collecting langchain_chroma
  Downloading langchain_chroma-0.1.4-py3-none-any.whl.metadata (1.6 kB)
Downloading langchain_chroma-0.1.4-py3-none-any.whl (10 kB)
Installing collected packages: langchain_chroma
Successfully installed langchain_chroma-0.1.4


# Import necessary libraries

In [50]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
import uuid
from langchain_core.documents import Document

# Set up the OpenAI API key

In [76]:
# Read the key from the environment rather than hardcoding it in the notebook.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
if not OPENAI_API_KEY:
    raise ValueError("Please set the OPENAI_API_KEY environment variable")

# Load documents and split text

In [55]:
loaders = [
    PyPDFLoader("Attention_is_all_you_need.pdf"),
    PyPDFLoader("Comparativestudyofwordembeddingalgorithm.pdf")
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
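To make the splitting step less of a black box, here is a simplified sketch of recursive character splitting: try a coarse separator first, merge pieces greedily up to `chunk_size`, and fall back to finer separators only for pieces that are still too long. The `split_text` function is a hypothetical stand-in written for this explanation; the real `RecursiveCharacterTextSplitter` has more options (for example chunk overlap) and may differ in detail.

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Simplified recursive splitter: coarse separators first, finer ones as a fallback."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard-cut into fixed-size pieces.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(split_text(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

example_chunks = split_text("aaaa bbbb cccc\n\ndddd", chunk_size=9)
```

With `chunk_size=10000`, as in the cell above, most PDF pages fit in a single chunk, so the chunks roughly correspond to pages.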

In [57]:
len(docs)

27

In [63]:
docs[0]

Document(metadata={'source': 'Attention_is_all_you_need.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network archite

In [64]:
docs[26]

Document(metadata={'source': 'Comparativestudyofwordembeddingalgorithm.pdf', 'page': 9}, page_content='Marwa Naili et al. / Procedia Computer Science 112 (2017) 340–349 349\nIf we compare our results with related works, some statements can be doubted. For example, Pennington et\nal. 2 argue that GloVe performs better than Word2Vec in every task (word analogies, word similarity and named\nentity recognition). Even Mitra 4 proved the same results. Yet, we share the same statement of Mikolov et al. 1 that\nWord2Vec outperforms other models. Yet, the performance of GloVe is comparable to Word2Vec and they can be\nconsidered as two powerful methods to learn word vector representations in the domain of topic segmentation. On the\nother hand, if we compare modern prediction-based embeddings (Word2Vec and GloVe) with traditional count-based\nmethods (LSA), we share the same claim as Baroni et al. 6 that LSA is less eﬃcient than the other models for synonym\ndetection, concept categorization, s

# Generate document summaries with LLM

In [78]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-4o-mini", max_retries=0)
    | StrOutputParser()
)
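The `|` operator in the chain above composes steps left to right: extract the page text, fill the prompt template, call the model, and parse the output to a string. The sketch below mimics that flow in plain Python. The `pipe` helper and `fake_llm` are hypothetical stand-ins (no model is called), so this only illustrates the data flow and not LangChain's actual implementation.

```python
from functools import reduce
from types import SimpleNamespace

def pipe(*steps):
    """Compose callables left to right, similar in spirit to `a | b | c`."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run

def fake_llm(prompt):
    """Hypothetical model stand-in: 'summarizes' by returning the first 20 characters of the document."""
    document = prompt.split("\n\n", 1)[1]
    return document[:20]

toy_chain = pipe(
    lambda x: {"doc": x.page_content},                                          # pick the text
    lambda variables: "Summarize the following document:\n\n{doc}".format(**variables),  # fill the prompt
    fake_llm,                                                                   # call the "model"
    str.strip,                                                                  # parse to a clean string
)

toy_summary = toy_chain(SimpleNamespace(page_content="Attention is all you need."))
```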


In [74]:
chain

{
  doc: RunnableLambda(lambda x: x.page_content)
}
| ChatPromptTemplate(input_variables=['doc'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['doc'], input_types={}, partial_variables={}, template='Summarize the following document:\n\n{doc}'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x78b58c1288b0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x78b5929462f0>, root_client=<openai.OpenAI object at 0x78b58c12a3e0>, root_async_client=<openai.AsyncOpenAI object at 0x78b58c128820>, model_name='gpt-4o-mini', model_kwargs={}, openai_api_key=SecretStr(''), max_retries=0)
| StrOutputParser()

In [79]:
summaries = chain.batch(docs, {"max_concurrency": 3})
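The `max_concurrency` setting caps how many documents are summarized at the same time while keeping the results in input order. Conceptually this resembles mapping a function over the inputs with a bounded thread pool, as in the sketch below. The `batch` and `summarize` functions here are hypothetical stand-ins, not LangChain's code.

```python
from concurrent.futures import ThreadPoolExecutor

def batch(func, inputs, max_concurrency=3):
    """Apply func to every input with at most max_concurrency calls in flight.

    Executor.map yields results in input order, so results[i] belongs to inputs[i].
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(func, inputs))

def summarize(text):
    """Hypothetical stand-in for a model call: keep the first 10 characters."""
    return text[:10]

texts = ["first document text", "second document text", "third"]
results = batch(summarize, texts, max_concurrency=2)
```

Keeping the order matters here because the next section pairs `summaries[i]` with `docs[i]` by position.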

In [80]:
len(summaries)

27

In [82]:
summaries[0]

'The document outlines a research paper titled "Attention Is All You Need," authored by a team from Google Brain and Google Research, which introduces a novel neural network architecture called the Transformer. This architecture solely relies on attention mechanisms, eliminating the need for complex recurrent or convolutional networks typical in previous sequence transduction models. The authors demonstrate that the Transformer model outperforms existing models in machine translation tasks, achieving state-of-the-art BLEU scores of 28.4 for English-to-German and 41.8 for English-to-French translations, with significantly reduced training times and costs. Additionally, the Transformer shows strong generalization capabilities, performing well in English constituency parsing tasks. The paper emphasizes the collaborative contributions of the authors, highlighting their roles in developing and evaluating the model. The work was presented at the 31st Conference on Neural Information Processi

In [84]:
docs[0].page_content

'Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.edu\nŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurren

In [85]:
summaries[26]

"The document by Marwa Naili et al. compares various word embedding methods (LSA, Word2Vec, and GloVe) in the context of topic segmentation for both English and Arabic languages. The authors note that while some studies claim GloVe consistently outperforms Word2Vec across tasks, they align with Mikolov et al. in asserting that Word2Vec is superior overall, although GloVe's performance is comparable. Traditional count-based methods like LSA are deemed less effective than modern predictive methods for tasks such as synonym detection and analogy, yet some studies suggest LSA performs better in specific scenarios.\n\nIn their evaluation of topic segmentation performance, the authors find that exogenous segmenters (those utilizing external knowledge) significantly outperform endogenous ones in both languages. Specifically, ToSe-Word2Vec and ToSe-GloVe emerged as the most effective segmenters, particularly highlighting Word2Vec's efficiency due to its ability to capture semantic meanings thr

In [86]:
docs[26].page_content

'Marwa Naili et al. / Procedia Computer Science 112 (2017) 340–349 349\nIf we compare our results with related works, some statements can be doubted. For example, Pennington et\nal. 2 argue that GloVe performs better than Word2Vec in every task (word analogies, word similarity and named\nentity recognition). Even Mitra 4 proved the same results. Yet, we share the same statement of Mikolov et al. 1 that\nWord2Vec outperforms other models. Yet, the performance of GloVe is comparable to Word2Vec and they can be\nconsidered as two powerful methods to learn word vector representations in the domain of topic segmentation. On the\nother hand, if we compare modern prediction-based embeddings (Word2Vec and GloVe) with traditional count-based\nmethods (LSA), we share the same claim as Baroni et al. 6 that LSA is less eﬃcient than the other models for synonym\ndetection, concept categorization, selection preferences and analogy. Yet, Levy et al. 3 argue that LSA outperforms\nWord2Vec and GloVe fo

# Index with multiple representations

In [87]:
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())

store = InMemoryByteStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
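The cell above keeps two stores linked by a shared ID: the vector store holds the embedded summaries, and the document store holds the full chunks. The following self-contained sketch shows the same pattern with a toy bag-of-words "embedding" and cosine similarity in place of OpenAI embeddings and Chroma. The `ToyMultiVectorIndex` class and its helpers are assumptions made for illustration only.

```python
import math
import re
import uuid
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyMultiVectorIndex:
    def __init__(self):
        self.vectors = []    # (summary embedding, doc_id) pairs: what gets searched
        self.docstore = {}   # doc_id -> full text: what gets returned

    def add(self, summary, full_text):
        doc_id = str(uuid.uuid4())
        self.vectors.append((embed(summary), doc_id))
        self.docstore[doc_id] = full_text
        return doc_id

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.vectors, key=lambda item: cosine(q, item[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]

index = ToyMultiVectorIndex()
index.add("summary about topic segmentation with word embeddings", "FULL TEXT A")
index.add("summary about the transformer attention architecture", "FULL TEXT B")
top = index.retrieve("what is topic segmentation")
```

The query is matched against the short summaries, but the caller receives the full text, which is the point of indexing with multiple representations.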

In [100]:
query = "What is topic segmentation?"

In [101]:
sub_docs = vectorstore.similarity_search(query)

In [102]:
sub_docs[0]

Document(metadata={'doc_id': '15680797-40ba-4a4e-85fb-ae86150a9e24'}, page_content='The document discusses various methods for topic segmentation in natural language processing, focusing on three main techniques: Latent Semantic Analysis (LSA), Word2Vec, and GloVe. \n\n1. **Latent Semantic Analysis (LSA)**: The process involves constructing a term-document matrix, followed by a singular value decomposition to reduce the dimensionality of the semantic space. The authors reference an empirical study by Naili et al. that identified optimal parameters for LSA, concluding that local and global frequency settings, weight functions, and the dimensionality of the reduced space significantly affect topic segmentation quality.\n\n2. **Word2Vec**: This is a word embedding model that utilizes two architectures—Continuous Bag of Words (CBOW) and Skip-Gram. CBOW predicts a target word based on its context, while Skip-Gram predicts context words from a given target word. The document highlights the c

In [103]:
retrieved_docs = retriever.invoke(query)
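Note the difference between the two calls: `vectorstore.similarity_search` returns the matching summaries, while `retriever.invoke` returns the parent chunks those summaries point to. A plausible way to picture the second step is sketched below: collect the parent IDs from the hits in rank order, drop duplicates, and look each one up in the document store. The `parents_for_hits` function and the dictionary-based hits are hypothetical; the actual retriever may differ in detail.

```python
def parents_for_hits(hits, docstore, id_key="doc_id"):
    """Map search hits to their parent documents, in rank order and without duplicates."""
    ordered_ids = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip IDs that are missing from the docstore instead of failing.
    return [docstore[i] for i in ordered_ids if i in docstore]

hits = [
    {"page_content": "summary of chunk A", "metadata": {"doc_id": "a"}},
    {"page_content": "keywords of chunk A", "metadata": {"doc_id": "a"}},
    {"page_content": "summary of chunk B", "metadata": {"doc_id": "b"}},
    {"page_content": "orphan hit", "metadata": {}},
]
toy_docstore = {"a": "full text of chunk A", "b": "full text of chunk B"}
parents = parents_for_hits(hits, toy_docstore)
```

Deduplication matters once a document has more than one representation in the vector store, since several hits can then resolve to the same parent.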

In [105]:
retrieved_docs[0].page_content[0:500]

'342 Marwa Naili et al. / Procedia Computer Science 112 (2017) 340–349\nterm i in the document j. The second step is the singular value decomposition where M will be decomposed,\naccording to the equation 1, into three matrices: U , V T which are two orthogonal matrices and S which is\na diagonal matrix. Finally, based on equation 2, only the k largest singular values and their corresponding\nsingular vectors from U and V T will be used in order to reduce the semantic space which corresponds to M k.\n'

In [106]:
len(retrieved_docs[0].page_content)

8190

In [108]:
retrieved_docs[0].page_content

'342 Marwa Naili et al. / Procedia Computer Science 112 (2017) 340–349\nterm i in the document j. The second step is the singular value decomposition where M will be decomposed,\naccording to the equation 1, into three matrices: U , V T which are two orthogonal matrices and S which is\na diagonal matrix. Finally, based on equation 2, only the k largest singular values and their corresponding\nsingular vectors from U and V T will be used in order to reduce the semantic space which corresponds to M k.\nM = U ∗ S ∗ V T (1)\nM k = U k ∗ Sk ∗ V t\nk (2)\nThe LSA method is based on several parameters which are: local and global frequencies setting, local and\nglobal weighting functions and the dimension of the semantic space. Naili et al. 12 have conducted an empirical\nstudy of LSA parameters in the ﬁeld of topic segmentation. As result, they proved that these parameters have an\nimportant impact on the quality of topic segmentation. Furthermore, the best choices are: local frequency =3,\ng

In [110]:
docs[17].page_content

'Marwa Naili et al. / Procedia Computer Science 112 (2017) 340–349 341Available online at www.sciencedirect.com\nProcedia Computer Science 00 (2017) 000–000\nwww.elsevier.com/ locate /procedia\nInternational Conference on Knowledge Based and Intelligent Information and Engineering\nSystems, KES2017, 6-8 September 2017, Marseille, France\nComparative study of word embedding methods in topic\nsegmentation\nMarwa Naili ∗, Anja Habacha Chaibi, Henda Hajjami Ben Ghezala\nRIADI laboratory, National School of computer Science (ENSI),\nUniversity of Mannouba 2010, Tunisia\nAbstract\nThe vector representations of words are very useful in di ﬀerent natural language processing tasks in order to capture the semantic\nmeaning of words. In this context, the three known methods are: LSA, Word2Vec and GloVe. In this paper, these methods will\nbe investigated in the ﬁeld of topic segmentation for both languages Arabic and English. Moreover, Word2Vec is studied in depth\nby using diﬀerent models and app

In [129]:
query = "What is an Encoder?"

In [130]:
sub_docs = vectorstore.similarity_search(query, k=3)

In [131]:
sub_docs[0]

Document(metadata={'doc_id': '1c1917eb-2814-4204-ad86-821f125c1eed'}, page_content='The document introduces the Transformer model, a novel architecture in the field of sequence modeling and transduction tasks, particularly in language processing. It highlights the limitations of traditional recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent networks, which rely on sequential computation and hinder parallelization, especially with longer sequences.\n\nThe Transformer model addresses these constraints by completely eliminating recurrence and utilizing an attention mechanism to capture global dependencies between input and output sequences. This design allows for significant parallelization and achieves state-of-the-art performance in translation tasks with efficient training times.\n\nThe document also discusses the background of other models like the Extended Neural GPU, ByteNet, and ConvS2S, which use convolutional neural networks to reduce se

In [132]:
retrieved_docs = retriever.invoke(query)

In [133]:
len(retrieved_docs[0].page_content)

4257

In [134]:
len(retrieved_docs[1].page_content)

818

In [135]:
len(retrieved_docs[2].page_content)

1826

In [136]:
retrieved_docs[0].page_content

'1 Introduction\nRecurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks\nin particular, have been firmly established as state of the art approaches in sequence modeling and\ntransduction problems such as language modeling and machine translation [ 35, 2, 5]. Numerous\nefforts have since continued to push the boundaries of recurrent language models and encoder-decoder\narchitectures [38, 24, 15].\nRecurrent models typically factor computation along the symbol positions of the input and output\nsequences. Aligning the positions to steps in computation time, they generate a sequence of hidden\nstates ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently\nsequential nature precludes parallelization within training examples, which becomes critical at longer\nsequence lengths, as memory constraints limit batching across examples. Recent work has achieved\nsignificant improvements in computational efficiency t

In [137]:
retrieved_docs[1].page_content

'Input-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nInput-Input Layer5\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nThe\nLaw\nwill\nnever\nbe\nperfect\n,\nbut\nits\napplication\nshould\nbe\njust\n-\nthis\nis\nwhat\nwe\nare\nmissing\n,\nin\nmy\nopinion\n.\n<EOS>\n<pad>\nFigure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the\nsentence. We give two such examples above, from two different heads from the encoder self-attention\nat layer 5 of 6. The heads clearly learned to perform different tasks.\n15'