# 02 - Building the vector database

Notebook steps:

- Load the TSV file that contains the chunks to index
- Generate vector indexing configurations from `config.yml`
- Index the chunks in the vector database using the `VectorDB` abstraction


Once the chunks have been encoded by the embedding model, the resulting vectors are stored in a vector database. When a user enters a query, it is encoded by the same model and compared with the vectors in the database to identify the most similar documents.

The main technical challenge is as follows:
> Given a query vector, quickly find its **k nearest neighbors** in the vector database, i.e., the k most relevant documents.
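
To make the problem concrete, here is a minimal brute-force sketch in pure Python. It scores every stored vector against the query and keeps the k best, which is exact but scales linearly with the size of the database; dedicated backends exist precisely to do better than this. The function names and the toy 2-D vectors are illustrative assumptions and are not part of the project code.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest_neighbors(query, vectors, k):
    """Return the k (index, score) pairs most similar to the query."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-D "embeddings" (illustrative values, not taken from the corpus).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.05]

top2 = k_nearest_neighbors(toy_query, toy_vectors, k=2)
print(top2)
```

With real embeddings (hundreds of dimensions, tens of thousands of chunks) the same idea applies, but the scan is usually delegated to an optimized index.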


Let's start by creating the vector database and saving it to disk.



In [1]:
import os
import pandas as pd
import warnings


from lib.io_utils import read_yaml, get_absolute_path
from lib.vector_store import VectorDB
warnings.filterwarnings("ignore")

In [2]:
## configuration loading
config = read_yaml("../config.yml")
data_path = get_absolute_path("data/raw/encpos_chunked_tok_512_51.csv")
defaults = config.get("defaults", {})

In [3]:
## data loading
if not os.path.exists(data_path):
    raise FileNotFoundError(f"File not found: {data_path}. Run the notebook 01-prepare_chunk_corpus.ipynb first.")
df = pd.read_csv(data_path, sep="\t")
print(f"Total chunks to index: {len(df)}")

Total chunks to index: 39377


We prepare the configurations to create our persistent vector databases.

In [4]:
vector_indexing = []
for entry in config.get("vector_indexing", []):
    model_id = entry["model_id"]
    model = next((m for m in config["embedding_models"] if m["id"] == model_id), None)
    if not model:
        raise ValueError(f"Model not found: {model_id}")

    for backend in entry["backends"]:
        suffix = f"{model_id}_{backend}"
        name = f"{model['name']} - {backend.upper()}"
        collection_name = f"{defaults.get('collection_prefix', 'encpos')}_{model_id}"
        path = os.path.join(defaults.get("base_path", "data/vectordb"), suffix)

        vector_indexing.append({
            "name": name,
            "backend": backend,
            "embedding_model": model["model_path"],
            "metric": defaults.get("metric", "cosine"),
            "text_column": defaults.get("text_column", "full_chunk"),
            "metadata_columns": defaults.get("metadata_columns", []),
            "path": path,
            "qdrant_collection_name": collection_name,
            "k": defaults.get("k", 10),
            "force_rebuild": defaults.get("force_rebuild", False)
        })
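
To see what this expansion produces, the sketch below runs a trimmed-down version of the same logic on a hand-written dictionary. The dictionary is only a guess at the shape of `config.yml` inferred from the keys the cell reads (`defaults`, `embedding_models`, `vector_indexing`); the `model_path` value is a placeholder, and the real file may contain more entries and fields.

```python
import os

# Hypothetical stand-in for what read_yaml("../config.yml") might return.
example_config = {
    "defaults": {"base_path": "data/vectordb", "metric": "cosine", "k": 10},
    "embedding_models": [
        {
            "id": "camembert-large",
            "name": "CamemBERT Large",
            "model_path": "path/to/camembert-large",  # placeholder, not the real path
        },
    ],
    "vector_indexing": [
        {"model_id": "camembert-large", "backends": ["faiss", "lancedb"]},
    ],
}

example_defaults = example_config["defaults"]
models_by_id = {m["id"]: m for m in example_config["embedding_models"]}

# One configuration per (model, backend) pair, as in the cell above.
expanded = []
for entry in example_config["vector_indexing"]:
    model = models_by_id[entry["model_id"]]
    for backend in entry["backends"]:
        expanded.append({
            "name": f"{model['name']} - {backend.upper()}",
            "backend": backend,
            "path": os.path.join(example_defaults["base_path"], f"{model['id']}_{backend}"),
        })

for item in expanded:
    print(item["name"], "->", item["path"])
```

Each model listed under `vector_indexing` therefore yields one index per backend, stored in its own directory.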

To index our data efficiently in a vector database that may hold tens of thousands of chunks (39,377 here), we need to choose two things:

- A **distance metric** to compare vectors (like cosine similarity, Euclidean distance, etc.);

- A **nearest neighbors search algorithm** to quickly find relevant documents.

In the following cell, we have set up a main loop that builds a vector database for each configuration defined in the `config.yml` file.
The goal is to test different combinations of embedding models and storage backends.

The two backends currently supported are:

- Faiss: very fast for indexing and searching, but metadata filters are limited;

- LanceDB: slower for indexing, but supports complex queries on metadata, for example with SQL-style filters.

We chose cosine similarity as the metric for comparing vectors. It is the cosine of the angle between two vectors, so it compares their direction independently of their norm. Computing it as a plain inner product requires normalizing all vectors (i.e., giving them unit norm) before indexing or searching.
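
The link between normalization and cosine similarity can be checked on a small example. The sketch below, in pure Python with illustrative values, computes the cosine similarity from its definition and then as an inner product of unit-normalized vectors; the helper names are assumptions and do not come from the project code.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit (L2) norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity from its definition: dot(a, b) / (|a| * |b|).
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Same quantity obtained as an inner product of unit-normalized vectors.
inner_product = dot(normalize(a), normalize(b))

print(cosine, inner_product)
```

Both values agree up to floating-point rounding, which is why normalizing once at indexing time lets an inner-product search stand in for cosine similarity.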

To facilitate indexing, we have developed a Python abstraction called `VectorDB` that supports:

- Vector normalization

- Creation of vector databases and their persistence

- Indexing of embeddings

- Searching

This abstraction allows us to compare different models and backends in a uniform manner and evaluate them under fair conditions.
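
As a rough illustration of what such an abstraction involves, here is a toy in-memory store covering the same four responsibilities (normalization, persistence, indexing, search). It is a deliberately simplified sketch: `ToyVectorStore`, its methods, and the JSON file format are all hypothetical, and it makes no claim about how the project's `VectorDB` class is actually implemented.

```python
import json
import math
import os
import tempfile


class ToyVectorStore:
    """Illustrative in-memory store; NOT the notebook's VectorDB class."""

    def __init__(self):
        self.vectors = []   # unit-normalized vectors
        self.metadata = []  # one metadata dict per vector

    @staticmethod
    def _normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def add(self, vector, meta):
        """Index one vector together with its metadata."""
        self.vectors.append(self._normalize(vector))
        self.metadata.append(meta)

    def search(self, query, k):
        """Return the k (metadata, score) pairs with the highest cosine similarity."""
        q = self._normalize(query)
        scores = [sum(x * y for x, y in zip(q, v)) for v in self.vectors]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [(self.metadata[i], scores[i]) for i in ranked[:k]]

    def save(self, path):
        """Persist the store to a JSON file."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vectors": self.vectors, "metadata": self.metadata}, f)

    @classmethod
    def load(cls, path):
        """Rebuild a store from a file written by save()."""
        store = cls()
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        store.vectors, store.metadata = data["vectors"], data["metadata"]
        return store


store = ToyVectorStore()
store.add([1.0, 0.0], {"id": "chunk-a"})
store.add([0.0, 2.0], {"id": "chunk-b"})

# Round-trip through disk, then query the reloaded store.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "toy_index.json")
    store.save(index_path)
    reloaded = ToyVectorStore.load(index_path)

results = reloaded.search([0.1, 1.0], k=1)
print(results)
```

Hiding the backend behind one small interface like this is what makes it possible to swap backends while keeping the indexing and evaluation code unchanged.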



In [6]:
%%time
#df = df.sample(n=50)
for conf in vector_indexing:
    print("\n--- Indexation in progress ---")
    print("Nom:", conf["name"])

    db = VectorDB(
        backend=conf["backend"],
        embedding_model=conf["embedding_model"],
        metric=conf["metric"],
        path=get_absolute_path(conf["path"]),
        k=conf["k"],
        force_rebuild=bool(conf["force_rebuild"])
    )


    db.add_from_dataframe(
        df=df,
        text_column=conf["text_column"],
        metadata_columns=conf["metadata_columns"]
    )

    db.save()
    print("📦 Index is created:", conf["name"])


--- Indexation in progress ---
Nom: CamemBERT Large - FAISS
📂 Loading existing FAISS index from /Users/lucaterre/Documents/pro/Travail_courant/DEV/AI-ENC-Projects/on-github/encpos-qa-rag/data/retrievers/vectordb/camembert-large_faiss
ℹ️ FAISS index already loaded — skipping creation.
📦 Index is created: CamemBERT Large - FAISS

--- Indexation in progress ---
Nom: CamemBERT Large - LANCEDB
📦 LanceDB intialization on: /Users/lucaterre/Documents/pro/Travail_courant/DEV/AI-ENC-Projects/on-github/encpos-qa-rag/data/retrievers/vectordb/camembert-large_lancedb
ℹ️ LanceDB table founded.
📦 Index is created: CamemBERT Large - LANCEDB

--- Indexation in progress ---
Nom: CamemBERT Base - FAISS
📂 Loading existing FAISS index from /Users/lucaterre/Documents/pro/Travail_courant/DEV/AI-ENC-Projects/on-github/encpos-qa-rag/data/retrievers/vectordb/camembert-base_faiss
ℹ️ FAISS index already loaded — skipping creation.
📦 Index is created: CamemBERT Base - FAISS

--- Indexation in progress ---
Nom: Cam

➡️ Next notebook: [03-assemble_rag.ipynb](./03-assemble_rag.ipynb)