This interactive example walks you through initializing a RAG and embedding a few documents.

In [1]:
from pathlib import Path
import os

from backend.config import RagConfig
from backend.rag_components.rag import RAG

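# The current working directory is used as the repository root,
# so run this notebook from the root of the repo.
repo_root = Path(os.getcwd())

# Build the RAG components (LLM, embeddings, vector store, retriever) from the YAML config.
rag_config = RagConfig.from_yaml(repo_root / "backend" / "config.yaml")
rag = RAG(config=rag_config)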

print("LLM:", rag.llm.__class__.__name__)
print("Embedding model:", rag.embeddings.__class__.__name__)
print("Vector store:", rag.vector_store.__class__.__name__)
print("Retriever:", rag.retriever.__class__.__name__)

Successfuly ran script at /Users/alexis.vialaret/vscode_projects/skaff-rag-accelerator/backend/rag_components/rag_tables.sql for sqlite


LLM: AzureChatOpenAI
Embedding model: HuggingFaceEmbeddings
Vector store: Chroma
Retriever: VectorStoreRetriever


Here we transform our CSV into standalone, embeddable documents that we can feed to the vector store.

We generate one document per row, and each document contains `header: value` pairs for all the columns.

This is a very simplistic example, but vector store data models can get more advanced to support more [powerful retrieval methods](https://python.langchain.com/docs/modules/data_connection/retrievers/).
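To make the row-to-document mapping concrete, here is a minimal sketch that uses only the standard library. It is not the `CSVLoader` implementation. The `rows_to_documents` helper, the plain-dict document shape, and the three-column sample are assumptions made for illustration, and the second sample row is a placeholder rather than real data.

```python
import csv
import io

# Illustrative sample: three of the dataset's columns.
# The second row is a placeholder, not real data.
SAMPLE_CSV = (
    "rank,personName,country\n"
    "1,Bernard Arnault & family,France\n"
    "2,<placeholder name>,<placeholder country>\n"
)


def rows_to_documents(csv_text: str) -> list[dict]:
    """Turn each CSV row into one document made of 'header: value' lines.

    Hypothetical helper: it mimics the shape of the output shown in this
    notebook, not the actual CSVLoader code.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One line per column, in the order the headers appear in the file.
        page_content = "\n".join(f"{header}: {value}" for header, value in row.items())
        documents.append({"page_content": page_content, "metadata": {"row": row_index}})
    return documents


docs = rows_to_documents(SAMPLE_CSV)
print(docs[0]["page_content"])
```

Each row becomes a self-contained text, so the embedding of a document carries the column names along with the values.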

In [2]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores.utils import filter_complex_metadata


data_sample_path = repo_root / "examples" / "billionaires.csv"

loader = CSVLoader(
    file_path=str(data_sample_path),
    csv_args={"delimiter": ",", "quotechar": '"', "escapechar": "\\"},
    encoding="utf-8-sig",
)

raw_documents = loader.load()
documents = filter_complex_metadata(raw_documents)
documents[:5]

[Document(page_content='rank: 1\nfinalWorth: 211000\ncategory: Fashion & Retail\npersonName: Bernard Arnault & family\nage: 74\ncountry: France\ncity: Paris\nsource: LVMH\nindustries: Fashion & Retail\ncountryOfCitizenship: France\norganization: LVMH Moët Hennessy Louis Vuitton\nselfMade: FALSE\nstatus: U\ngender: M\nbirthDate: 3/5/1949 0:00\nlastName: Arnault\nfirstName: Bernard\ntitle: Chairman and CEO\ndate: 4/4/2023 5:01\nstate: \nresidenceStateRegion: \nbirthYear: 1949\nbirthMonth: 3\nbirthDay: 5\ncpi_country: 110.05\ncpi_change_country: 1.1\ngdp_country: $2,715,518,274,227\ngross_tertiary_education_enrollment: 65.6\ngross_primary_education_enrollment_country: 102.5\nlife_expectancy_country: 82.5\ntax_revenue_country_country: 24.2\ntotal_tax_rate_country: 60.7\npopulation_country: 67059887\nlatitude_country: 46.227638\nlongitude_country: 2.213749', metadata={'source': '/Users/alexis.vialaret/vscode_projects/skaff-rag-accelerator/examples/billionaires.csv', 'row': 0}),
 Document(pa

To load the docs into the vector store, we recommend using the `load_documents` method, as it [indexes previously embedded docs](https://python.langchain.com/docs/modules/data_connection/indexing), making the process idempotent.
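The idea behind idempotent loading can be pictured with a toy example. This is a sketch under stated assumptions, not the project's `load_documents` nor the LangChain indexing API: the `TinyIndex` class is hypothetical, and it assumes documents are deduplicated by a hash of their text.

```python
import hashlib


class TinyIndex:
    """Toy stand-in for a vector store plus a record of what was already indexed.

    Hypothetical class for illustration only.
    """

    def __init__(self):
        self.seen_hashes = set()  # fingerprints of documents already stored
        self.stored = []          # the "vector store" contents

    def load_documents(self, texts):
        result = {"num_added": 0, "num_skipped": 0}
        for text in texts:
            # Assumption: a document is identified by the hash of its content.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                result["num_skipped"] += 1
                continue
            self.seen_hashes.add(digest)
            self.stored.append(text)
            result["num_added"] += 1
        return result


index = TinyIndex()
first = index.load_documents(["rank: 1", "rank: 2"])
second = index.load_documents(["rank: 1", "rank: 2", "rank: 3"])
print(first)
print(second)
```

In this sketch, the second call stores only the document that was not seen before, so re-running a load does not create duplicates.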

In [3]:
rag.load_documents(documents)

Indexing 2640 documents.
Indexing batch 0 to 100.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 100 to 200.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 200 to 300.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 300 to 400.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 400 to 500.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 500 to 600.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 600 to 700.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Indexing batch 700 to 800.
{'event': 'load_documents', 'num_added': 100, 'num_updated': 