[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aiembassy/workshop-rag-haystack/blob/master/notebooks/02-vector-search.ipynb)

In [None]:
!pip install "haystack-ai" \
    "qdrant-haystack" \
    "qdrant-client" \
    "sentence-transformers"

# Vector search

Vector search is not the only way to find documents relevant to a query, but it suits LLM applications well, since they usually offer a more conversational interface than search engines. Users express themselves in natural language instead of carefully crafted queries built from keywords that might match the words used in the database.
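
To make the limitation of keyword matching concrete, here is a minimal sketch that lists the words shared by a natural-language question and a stored sentence. The `keyword_overlap` helper and both example sentences are made up for illustration and are not part of Haystack.

```python
import re


def tokens(text: str) -> set[str]:
    """Split a text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the words that appear in both the query and the document."""
    return tokens(query) & tokens(document)


# Hypothetical example texts: same topic, different vocabulary
query = "How can I fix my flat tyre?"
document = "Repairing punctured bicycle wheels"

# The texts describe the same problem, yet they share no words at all
print(keyword_overlap(query, document))
```

A purely keyword-based search has nothing to match on here, whereas embeddings compare the meaning of the two texts rather than their exact wording.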

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="all-MiniLM-L6-v2")
text_embedder.warm_up()  # Downloads the model from the hub on first use

In [None]:
texts = [
    "Which continent has the most countries?",
    "What is the longest river in the world?",
    "Which country has the most islands in the world?",
]

In [None]:
embeddings = []
for text in texts:
    vector = text_embedder.run(text=text)["embedding"]
    embeddings.append(vector)

len(embeddings[0])

In [None]:
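import numpy as np


def cosine_distance(v: list[float], w: list[float]) -> float:
    """Return the cosine similarity of v and w (despite the name, not a distance)."""
    v, w = np.asarray(v, dtype=float), np.asarray(w, dtype=float)
    # Scale both vectors to unit length, so their dot product equals the cosine
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)
    return float(np.dot(v, w))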

In [None]:
cosine_distance(embeddings[0], embeddings[1])

In [None]:
cosine_distance(embeddings[0], embeddings[2])

## Text vs document embeddings

When we build any kind of text search mechanism, we have documents and queries. Documents contain not only some textual data, but usually also some kind of unique identifier and metadata. This is also reflected in Haystack, as there is a concept of a `Document`.

In [None]:
from haystack import Document

document = Document(
    id="my-unique-document-id",  # Typically some kind of UUID
    content="Africa has the most countries with 54 internationally recognized sovereign states.",
    meta={
        "source": "Wikipedia",
        "author": "John Doe",
        "date": "2025-01-05",
    },
)
document

Vectors are usually created from the text content alone, but there is nothing wrong with including some metadata in the vectorization process. Haystack has two kinds of embedders: document embedders, which work on `Document` objects, and text embedders, which work on plain strings such as queries. A model usually comes with both counterparts.
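
As a rough illustration of including metadata in the vectorization process, the sketch below builds the string that would be handed to the embedding model by prepending selected metadata values to the content. Haystack's document embedders expose a `meta_fields_to_embed` parameter for this purpose. The `text_to_embed` helper and the newline separator used here are assumptions made for illustration, not the library's actual implementation.

```python
def text_to_embed(
    content: str, meta: dict, meta_fields: list[str], separator: str = "\n"
) -> str:
    """Join the selected metadata values and the content into one string."""
    # Skip fields that are missing or empty, so they do not add noise
    parts = [str(meta[field]) for field in meta_fields if meta.get(field)]
    return separator.join(parts + [content])


meta = {"source": "Wikipedia", "author": "John Doe", "date": "2025-01-05"}
content = "Africa has the most countries with 54 internationally recognized sovereign states."

# Only the source and the date are added in front of the content
print(text_to_embed(content, meta, meta_fields=["source", "date"]))
```

Whether this helps depends on the data: a title or a category often carries meaning worth embedding, while an opaque identifier most likely does not.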

In [None]:
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

document_embedder = SentenceTransformersDocumentEmbedder(model="all-MiniLM-L6-v2")
document_embedder.warm_up()

In [None]:
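# Use the returned document, as run() may return a copy instead of modifying the input in place
document = document_embedder.run(documents=[document])["documents"][0]
document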

In [None]:
cosine_distance(document.embedding, embeddings[0])

In [None]:
cosine_distance(document.embedding, embeddings[1])

In [None]:
cosine_distance(document.embedding, embeddings[2])

Embedding models are trained to produce high similarity scores for a query and a document that are semantically similar. This is why we can usually use cosine similarity to compare the vectors. The closer the score is to 1, the more similar the vectors are.
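
To see how the score behaves, here is a small standard-library sketch of cosine similarity applied to hand-made two-dimensional vectors. The `cosine_similarity` helper is written just for this example. The score depends only on the angle between the vectors, not on their length, and it ranges from -1 to 1.

```python
import math


def cosine_similarity(v: list[float], w: list[float]) -> float:
    """Cosine of the angle between v and w: dot(v, w) / (|v| * |w|)."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0, same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0, orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0, opposite direction
```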

## Vector databases

Real-world applications operate on thousands or even millions of documents. We could store their embeddings in the existing database that powers our system, but every search would then have to load all the embeddings and compute the similarity score between the query and each document. This is inefficient, especially when results are needed in real time. Vector databases are designed to store and search vectors efficiently. They typically rely on approximate nearest neighbor search, so the results may be slightly less precise than those of a brute-force search, but they are returned much faster.
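
The brute-force approach described above can be sketched in a few lines of plain Python. Random vectors stand in for real document embeddings, and the `brute_force_search` helper is a hypothetical name used only here. The point is that every query has to touch every stored vector, so the cost grows linearly with the size of the collection.

```python
import math
import random


def cosine_similarity(v: list[float], w: list[float]) -> float:
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


def brute_force_search(query, vectors, top_k=3):
    """Score every stored vector against the query and return the best top_k."""
    # One similarity computation per stored vector, on every single query
    scored = [(cosine_similarity(query, vector), idx) for idx, vector in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


random.seed(42)
# 1,000 random 384-dimensional vectors stand in for real document embeddings
vectors = [[random.gauss(0, 1) for _ in range(384)] for _ in range(1_000)]

# Use a stored vector as the query, so it has to come back as the best match
results = brute_force_search(vectors[7], vectors, top_k=3)
print(results[0][0])  # 7
```

A vector database avoids this full scan by building an index over the vectors, trading a small amount of recall for a large gain in speed.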

In [None]:
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",  # Never use in production systems! It's a mode only for testing purposes.
    embedding_dim=384,  # The size of the embeddings produced by the model
    index="facts",  # We can have multiple indexes in the same database
)

In [None]:
facts = [
    "Africa has the most countries with 54 internationally recognized sovereign states.",
    "The Nile River is the longest river in the world, stretching approximately 6,650 kilometers (4,132 miles) through eleven countries in northeastern Africa.",
    "Sweden has the most islands in the world, with over 267,570 islands, though only about 984 of them are inhabited.",
    "Vatican City, with an area of just 0.44 square kilometers (0.17 square miles), is the world's smallest independent state.",
    "The Andes Mountains are the longest continental mountain range in the world, stretching about 7,000 kilometers (4,300 miles) along South America's western coast.",
    "The Challenger Deep in the Mariana Trench is the deepest known point on Earth, reaching a depth of approximately 11,034 meters (36,201 feet) below sea level.",
    "The Antarctic Desert is technically the world's largest desert. However, if considering only hot deserts, the Sahara Desert is the largest, covering about 9.2 million square kilometers (3.6 million square miles).",
    "Indonesia has the most volcanoes of any country, with 147 volcanoes, of which around 129 are considered active.",
    "The Dead Sea, located between Israel and Jordan, has the highest salt concentration of any body of water on Earth, with about 34% salinity.",
    "La Paz, Bolivia, is the world's highest administrative capital at approximately 3,650 meters (11,975 feet) above sea level, although Sucre is the constitutional capital.",
]

In [None]:
import uuid

documents = [
    Document(
        id=uuid.uuid4().hex,
        content=fact,
        meta={
            "source": "Wikipedia",
            "author": "John Doe",
            "date": "2025-01-05",
            "category": "Geography",  # That's a new attribute!
        },
    )
    for fact in facts
]
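# Keep the returned documents, as run() may return copies instead of modifying the input in place
documents = document_embedder.run(documents=documents)["documents"]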

# Save the documents to the database and display the current number of documents.
# Since we are using an in-memory database, the documents will be lost after the
# kernel restart.
document_store.write_documents(documents)
document_store.count_documents()

### Search with vector databases

We can now search for the documents most similar to the query. The search runs inside the vector database, and the results come back sorted by similarity score, with the most similar documents first.

In [None]:
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever

retriever = QdrantEmbeddingRetriever(document_store=document_store)

In [None]:
questions = [
    "Which continent has the most countries?",
    "What is the longest river in the world?",
    "Which country has the most islands in the world?",
    "What is the smallest country in the world by area?",
    "Which mountain range is the longest in the world?",
    "What is the deepest point in the ocean?",
    "Which desert is the largest in the world?",
    "What country has the most volcanoes?",
    "Which sea has the highest salt content?",
    "What is the highest capital city in the world?",
]

In [None]:
for question in questions:
    query_vector = text_embedder.run(text=question)["embedding"]
    results = retriever.run(query_embedding=query_vector, top_k=3)
    print("Question", question)
    print("Top answers:")
    for result in results["documents"]:
        print("\t", result.content, f"score = {result.score}")

#### Semantic search with filtering

Vector embeddings capture the meaning of the text quite well. However, our documents may carry other useful information that we also need to consider when searching. For example, semantic search won't capture the price of a product, which is a structured attribute that may also change over time. We can use metadata filters to narrow down the search results.
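
To show what such a filter does conceptually, the sketch below evaluates a filter dictionary against plain metadata dictionaries. The product data and the `matches` helper are hypothetical, and a real vector database applies filters inside its index rather than in a Python loop. The nested `AND` / `conditions` form follows Haystack's filter syntax as far as I know, so treat it as an assumption and check the documentation for your version.

```python
import operator

COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: dict, condition: dict) -> bool:
    """Check a metadata dict against a comparison or an AND/OR group."""
    if "conditions" in condition:
        outcomes = [matches(meta, sub) for sub in condition["conditions"]]
        if condition["operator"] == "AND":
            return all(outcomes)
        if condition["operator"] == "OR":
            return any(outcomes)
        raise ValueError(f"Unsupported logical operator: {condition['operator']}")
    # Fields are addressed as "meta.<name>", so strip the prefix
    field = condition["field"].removeprefix("meta.")
    if field not in meta:
        return False
    return COMPARISONS[condition["operator"]](meta[field], condition["value"])


# Hypothetical product metadata
products = [
    {"name": "Trail running shoes", "price": 89.0, "category": "Shoes"},
    {"name": "Leather boots", "price": 149.0, "category": "Shoes"},
    {"name": "Running socks", "price": 12.0, "category": "Accessories"},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "Shoes"},
        {"field": "meta.price", "operator": "<=", "value": 100.0},
    ],
}

# Only the matching items would take part in the similarity ranking
candidates = [product["name"] for product in products if matches(product, filters)]
print(candidates)  # ['Trail running shoes']
```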

In [None]:
new_facts = [
    "The cell is the smallest unit of life. It is the basic structural, functional, and biological unit of all known living organisms. Cells can exist as single-celled organisms or as part of a multicellular organism.",
    "Mitochondria are often called the powerhouses of the cell because they generate most of the cell's supply of adenosine triphosphate (ATP), the energy currency of cells. They do this through cellular respiration, breaking down glucose to produce energy.",
    "DNA (deoxyribonucleic acid) is a double-stranded molecule that stores genetic information, while RNA (ribonucleic acid) is typically single-stranded and helps in expressing genes. DNA uses thymine while RNA uses uracil, and their sugars differ (deoxyribose vs. ribose).",
    "Photosynthesis is the process by which plants and other organisms convert light energy into chemical energy. Plants use sunlight, water, and carbon dioxide to produce glucose and oxygen. This process is crucial as it provides food for plants and produces oxygen that most living things need to survive.",
    "The four main blood types in humans are A, B, AB, and O. These types are determined by the presence or absence of specific antigens on the surface of red blood cells. Additionally, each type can be either Rh-positive or Rh-negative, creating eight possible blood types.",
]

In [None]:
new_documents = [
    Document(
        id=uuid.uuid4().hex,
        content=fact,
        meta={
            "source": "Wikipedia",
            "author": "John Doe",
            "date": "2025-01-05",
            "category": "Biology",
        },
    )
    for fact in new_facts
]
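# Keep the returned documents, as run() may return copies instead of modifying the input in place
new_documents = document_embedder.run(documents=new_documents)["documents"]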
document_store.write_documents(new_documents)
document_store.count_documents()

In [None]:
question = "What is the role of mitochondria in a cell?"
query_vector = text_embedder.run(text=question)["embedding"]
results = retriever.run(
    query_embedding=query_vector,
    top_k=3,
    filters={
        "field": "meta.category",
        "operator": "==",
        "value": "Geography",  # Oops! We should be looking for biology facts!
    },
)

print("Question", question)
print("Top answers:")
for result in results["documents"]:
    print("\t", result.content, f"score = {result.score}")

In [None]:
results = retriever.run(
    query_embedding=query_vector,
    top_k=3,
    filters={
        "field": "meta.category",
        "operator": "==",
        "value": "Biology",
    },
)

print("Question", question)
print("Top answers:")
for result in results["documents"]:
    print("\t", result.content, f"score = {result.score}")