#  🔍 Improve retrieval by embedding meaningful metadata 🏷️

<a target="_blank" href="https://colab.research.google.com/github/anakin87/notebooks/blob/main/improve_retrieval_by_embedding_metadata.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" width="200"/>
</a>

In this notebook, I run some experiments on embedding meaningful metadata to improve Document retrieval.

I use the [Haystack LLM orchestration framework](https://github.com/deepset-ai/haystack), which provides several out-of-the-box features for embedding creation and retrieval. In particular, I use the 2.0 preview version.

In [1]:
%%capture
! pip install wikipedia "haystack-ai==0.117.0" sentence_transformers rich

In [2]:
import rich

## Load data from Wikipedia

We are going to download the Wikipedia pages related to some bands, using the Python library `wikipedia`.

These pages are converted into Haystack Documents.

In [3]:
some_bands="""The Beatles
Rolling stones
Dire Straits
The Cure
The Smiths""".split("\n")

In [4]:
import wikipedia
from haystack.preview.dataclasses import Document

raw_docs=[]

for title in some_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

## 🔧 Setup the experiment

### Utility functions to create Pipelines

The **indexing Pipeline** transforms the Documents and stores them (with vectors) in a Document Store. The **retrieval Pipeline** takes a query as input and performs the vector search.
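
To make the vector search step concrete, here is a minimal, library-free sketch of ranking stored chunks by cosine similarity to a query vector. The function names and the toy 2-dimensional vectors are made up for illustration. The real retriever works on the model's embeddings and is not implemented this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vector, indexed, top_k=3):
    """Return the top_k (text, score) pairs, ranked by cosine similarity.

    `indexed` is a list of (text, vector) pairs: a toy stand-in for a Document Store.
    """
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# toy vectors, invented for this example (real embeddings have many more dimensions)
toy_index = [
    ("chunk about Bangor", [0.9, 0.1]),
    ("chunk about guitars", [0.1, 0.9]),
    ("chunk about touring", [0.6, 0.4]),
]
results = vector_search([1.0, 0.0], toy_index, top_k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The chunk whose vector points in the direction closest to the query comes first, which is the same ranking criterion the cosine-based Document Stores use later in this notebook.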


I build some utility functions to create different indexing and retrieval Pipelines.

Specifically, I want to compare the standard approach (where we embed only the text) with the metadata embedding strategy (where we embed the text plus meaningful metadata).
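
As a rough illustration of the difference between the two approaches, the sketch below builds the string that would be passed to the embedding model in each case. It assumes the selected metadata values are simply joined with the text using a newline separator. The function name `build_text_to_embed`, the separator, and the sample chunk are my own assumptions, not the embedder's actual code.

```python
def build_text_to_embed(content, meta, metadata_fields_to_embed, separator="\n"):
    """Prepend the selected metadata values to the text, joined by `separator`.

    Hypothetical helper: it only approximates what a document embedder might do.
    """
    meta_values = [
        str(meta[field])
        for field in metadata_fields_to_embed
        if meta.get(field) is not None  # skip fields that are missing or empty
    ]
    return separator.join(meta_values + [content])


# made-up sample chunk and metadata, for illustration only
chunk = "They travelled to Bangor for a weekend seminar."
meta = {"title": "The Beatles", "url": "https://example.com/the-beatles"}

standard = build_text_to_embed(chunk, meta, metadata_fields_to_embed=[])
with_title = build_text_to_embed(chunk, meta, metadata_fields_to_embed=["title"])

print(repr(standard))
print(repr(with_title))
```

In the standard case the chunk never mentions the band, so a query about "the beatles" has little to match. With the title embedded, the band name becomes part of the text the model sees.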



In [5]:
from haystack.preview import Pipeline
from haystack.preview.document_stores import InMemoryDocumentStore
from haystack.preview.components.preprocessors import DocumentCleaner, TextDocumentSplitter
from haystack.preview.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.preview.components.writers import DocumentWriter
from haystack.preview.components.writers.document_writer import DuplicatePolicy
from haystack.preview.components.retrievers import InMemoryEmbeddingRetriever

In [6]:
def create_indexing_pipeline(document_store, metadata_fields_to_embed):

  indexing = Pipeline()
  indexing.add_component("cleaner", DocumentCleaner())
  indexing.add_component("splitter", TextDocumentSplitter(split_by='sentence', split_length=2))

  # in the following component, the `metadata_fields_to_embed` parameter
  # lists the metadata fields to embed together with the text
  indexing.add_component(
      "doc_embedder",
      SentenceTransformersDocumentEmbedder(
          model_name_or_path="thenlper/gte-large",
          device="cuda:0",
          metadata_fields_to_embed=metadata_fields_to_embed,
      ),
  )
  indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

  indexing.connect("cleaner", "splitter")
  indexing.connect("splitter", "doc_embedder")
  indexing.connect("doc_embedder", "writer")

  return indexing


In [7]:
def create_retrieval_pipeline(document_store):

  retrieval = Pipeline()
  retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder(model_name_or_path="thenlper/gte-large", device="cuda:0"))
  retrieval.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3))

  retrieval.connect("text_embedder", "retriever")

  return retrieval

###  Create the Pipelines

Let's define 2 Document Stores, to compare the different approaches.

In [8]:
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
document_store_w_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function="cosine")

Now, I create the 2 indexing pipelines and run them.

In [9]:
indexing_pipe_std = create_indexing_pipeline(document_store=document_store, metadata_fields_to_embed=[])

# here we specify the fields to embed
# we select the field `title`, containing the name of the band
indexing_pipe_w_embedded_metadata = create_indexing_pipeline(document_store=document_store_w_embedded_metadata, metadata_fields_to_embed=["title"])

In [10]:
indexing_pipe_std.run({"cleaner":{"documents":raw_docs}})
indexing_pipe_w_embedded_metadata.run({"cleaner":{"documents":raw_docs}})

{'writer': {'documents_written': 1233}}

In [11]:
print(len(document_store.filter_documents()))
print(len(document_store_w_embedded_metadata.filter_documents()))

1203
1203


Create the 2 retrieval pipelines.

In [12]:
retrieval_pipe_std = create_retrieval_pipeline(document_store=document_store)

retrieval_pipe_w_embedded_metadata = create_retrieval_pipeline(document_store=document_store_w_embedded_metadata)

## 🧪 Run the experiment!

In [13]:
# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)

❌ the retrieved Documents seem irrelevant

In [14]:
# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)

✅ the first Document is relevant

In [15]:
# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)

❌ the retrieved Documents seem irrelevant

In [16]:
# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)

✅ the first 2 Documents are relevant
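
Since every comparison above repeats the same two calls, a small helper can run one query through several retrieval Pipelines and collect short previews side by side. This is only a sketch: `compare_retrieval` is a hypothetical name, and it assumes each pipeline accepts and returns the same structures as the retrieval Pipelines in this notebook, with Documents exposing `meta` and `content`.

```python
def compare_retrieval(query, pipelines, max_chars=80):
    """Run `query` through each retrieval pipeline and collect short previews.

    `pipelines` maps a label to an object whose `run` method takes
    {"text_embedder": {"text": ...}} and returns {"retriever": {"documents": [...]}}.
    """
    summary = {}
    for label, pipe in pipelines.items():
        res = pipe.run({"text_embedder": {"text": query}})
        docs = res["retriever"]["documents"]
        # keep only the title and the beginning of each retrieved chunk
        summary[label] = [
            (doc.meta.get("title"), (doc.content or "")[:max_chars])
            for doc in docs
        ]
    return summary


# usage with the notebook's pipelines (commented out so the block runs on its own):
# summary = compare_retrieval(
#     "have the beatles ever been to bangor?",
#     {"standard": retrieval_pipe_std, "with metadata": retrieval_pipe_w_embedded_metadata},
# )
# rich.print(summary)
```

Returning a plain dictionary keeps the helper independent of how the results are displayed, so it works with `rich.print` or a simple loop.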

## ⚠️ Notes of caution

- This technique is not a silver bullet
- It works well when the embedded metadata are meaningful and distinctive
- The embedded metadata should also be meaningful from the perspective of the embedding model. For example, I don't expect embedding numbers to work well.
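
One way to act on these notes is to screen metadata fields before passing them to `metadata_fields_to_embed`. The heuristic below is only a sketch: `pick_fields_to_embed` is a hypothetical helper, the `page_id` and `year` fields are invented sample data, and the rules (skip non-strings, URLs, and strings without letters) are my assumptions about what a text embedding model is unlikely to use well.

```python
def pick_fields_to_embed(meta, candidate_fields):
    """Keep candidate fields whose values look like natural-language text."""
    selected = []
    for field in candidate_fields:
        value = meta.get(field)
        if not isinstance(value, str):
            continue  # skip missing and non-string values (e.g. numbers)
        text = value.strip()
        if not text:
            continue  # skip empty strings
        if text.startswith(("http://", "https://")):
            continue  # assumption: URLs add little meaning for the model
        if not any(ch.isalpha() for ch in text):
            continue  # skip strings made only of digits or punctuation
        selected.append(field)
    return selected


# invented sample metadata, for illustration only
meta = {
    "title": "The Cure",
    "url": "https://en.wikipedia.org/wiki/The_Cure",
    "page_id": 12345,
    "year": "1978",
}
fields = pick_fields_to_embed(meta, ["title", "url", "page_id", "year"])
print(fields)
```

A filter like this only removes obviously unsuitable fields. Whether a remaining field is distinctive enough to help retrieval still has to be checked with experiments like the ones above.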