## MongoDB Atlas Quickstart

[MongoDB Atlas Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search) is part of the MongoDB platform that enables MongoDB customers to build intelligent applications powered by semantic search over any type of data. Atlas Vector Search allows you to integrate your operational database and vector search in a single, unified, fully managed platform with full vector database capabilities.

You can integrate TruLens with your application built on Atlas Vector Search to leverage observability and measure improvements in your application's search capabilities.

This tutorial walks you through setting up TruLens with MongoDB Atlas Vector Search, using LlamaIndex as the orchestrator.

Even better, you'll learn how to use metadata filters to create specialized query engines and leverage a router to choose the most appropriate query engine based on the query.

See [MongoDB Atlas/LlamaIndex Quickstart](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/llamaindex/) for more details.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/expositional/vector-dbs/mongodb_atlas/atlas_quickstart.ipynb)



In [6]:
!pip install trulens-eval llama-index llama-index-vector-stores-mongodb llama-index-embeddings-openai pymongo python-dotenv "ipython>=8.12.0" "ipywidgets>=8.0.6"


## Import TruLens and start the dashboard

In [1]:
from trulens_eval import Tru

tru = Tru()

tru.reset_database()

tru.run_dashboard()

Using legacy llama_index version 0.9.34. Consider upgrading to 0.10.0 or later.


🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of Tru` to prevent this.
Starting dashboard ...



Dashboard started at http://192.168.1.11:8501 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## Set imports, keys and llama-index settings

In [2]:
import getpass, os, pymongo, pprint
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.settings import Settings
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, ExactMatchFilter, FilterOperator
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

In [5]:
from urllib.parse import quote_plus

from dotenv import load_dotenv

# Load variables from a local .env file into the environment
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
mongo_key = os.getenv("mongo_key")
mongo_pass = os.getenv("mongo_pass")

# Percent-encode the credentials so reserved characters cannot break the URI
username = quote_plus(mongo_key)
password = quote_plus(mongo_pass)
ATLAS_CONNECTION_STRING = 'mongodb+srv://' + username + ':' + password + '@datacluster.hjhs3xb.mongodb.net/'
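
The credentials are passed through `quote_plus` because characters such as `@`, `:` and `/` in a username or password would otherwise be read as part of the URI structure. The following is a minimal sketch of that step; the helper name `build_atlas_uri`, the credentials and the host are illustrative placeholders, not values from this notebook.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return a mongodb+srv URI with percent-encoded credentials.

    `build_atlas_uri` is a hypothetical helper for illustration only;
    it is not part of pymongo.
    """
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# A password containing reserved characters is escaped before it reaches the URI.
example_uri = build_atlas_uri("app_user", "p@ss:word/1", "cluster0.example.mongodb.net")
print(example_uri)
```

Without the encoding step, the `@` in the password would be interpreted as the separator between credentials and host.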

In [6]:
Settings.llm = OpenAI()
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.chunk_size = 100
Settings.chunk_overlap = 10

## Load sample data

Here we'll load four PDF files from the local `docs/` directory.

In [None]:
from pathlib import Path
# PDFReader is provided by the llama-index-readers-file package
from llama_index.readers.file import PDFReader
loader = PDFReader()

# Load the PDFs from the local docs/ directory
documents = loader.load_data(file=Path("docs/Seeratul Mustafa Abridged.pdf"))

documents1 = loader.load_data(file=Path("docs/Ar-Raheeq_Al-Makhtum_(The_Sealed_Nectar)_English_(www.TheChoice.one)_text.pdf"))
documents2 = loader.load_data(file=Path("docs/Seerat Ibn e Hisham - English Translation (1st Edition)_text.pdf"))
documents3 = loader.load_data(file=Path("docs/Seerat Ibn e Hisham - English Translation (2nd Edition)_text.pdf"))



In [45]:
print(documents[0])

Doc ID: 2b62cee6-ebd8-4cfc-a2ab-6428008683f0
Text: اَ    it    04    04   Better than you no eye has ever seen
More beautiful than you no woman has given birth to   You have been
created free from any defect  As if you were created like how you
desired
Seeratul Mustafa    


In [46]:

all_documents = documents + documents1 + documents2 + documents3


## Create a vector store

Next you need to create an Atlas Vector Search Index.

When you do so, use the following definition in the JSON editor:

```
{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "cosine",
      "type": "vector"
    },
    {
      "path": "metadata.file_name",
      "type": "filter"
    }
  ]
}
```
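
The `numDimensions` value has to match the output size of the embedding model (1536 for `text-embedding-ada-002`), and every metadata field you intend to filter on needs its own `filter` entry. As a rough sketch, the same definition can be assembled in Python so those two inputs are explicit; `vector_index_definition` is a hypothetical helper, not an Atlas or pymongo API.

```python
import json


def vector_index_definition(num_dimensions: int, filter_paths: list[str]) -> dict:
    """Build an Atlas Vector Search index definition as a plain dict."""
    fields = [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": num_dimensions,
            "similarity": "cosine",
        }
    ]
    # One filter entry per metadata path that queries will filter on.
    fields += [{"type": "filter", "path": path} for path in filter_paths]
    return {"fields": fields}


# text-embedding-ada-002 returns 1536-dimensional vectors.
definition = vector_index_definition(1536, ["metadata.file_name"])
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Atlas JSON editor.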

In [19]:
# Connect to your Atlas cluster
mongodb_client = pymongo.MongoClient(ATLAS_CONNECTION_STRING)

# Instantiate the vector store
atlas_vector_search = MongoDBAtlasVectorSearch(
    mongodb_client,
    db_name = "books",
    collection_name = "seerahbooks",
    index_name = "vector_index"
)
vector_store_context = StorageContext.from_defaults(vector_store=atlas_vector_search)





In [47]:
import re

def clean_up_text(content: str) -> str:
    """
    Remove unwanted characters and patterns in text input.

    :param content: Text input.
    
    :return: Cleaned version of original text input.
    """

    # Fix hyphenated words broken by newline
    content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)

    # Remove specific unwanted patterns and characters
    unwanted_patterns = [
        "\\n", "  —", "——————————", "—————————", "—————",
        r'\\u[\dA-Fa-f]{4}', r'\uf075', r'\uf0b7'
    ]
    for pattern in unwanted_patterns:
        content = re.sub(pattern, "", content)

    # Fix improperly spaced hyphenated words and normalize whitespace
    content = re.sub(r'(\w)\s*-\s*(\w)', r'\1-\2', content)
    content = re.sub(r'\s+', ' ', content)

    return content
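
To make the effect of the hyphen and whitespace rules concrete, here is a small standalone sketch that applies the same three regular expressions to a made-up string. It skips the unwanted-pattern removal step, and the sample text is invented for illustration.

```python
import re

sample = "The data-\nbase stores   vectors and meta - data."

# Step 1: rejoin words that were hyphenated across a line break.
joined = re.sub(r"(\w+)-\n(\w+)", r"\1\2", sample)

# Step 2: tighten hyphens that picked up stray spaces, then collapse whitespace.
tightened = re.sub(r"(\w)\s*-\s*(\w)", r"\1-\2", joined)
normalized = re.sub(r"\s+", " ", tightened)

print(normalized)
```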

In [53]:
# Clean the first book and tag each document with its title
cleaned_docs = []
for d in documents:
    d.text = clean_up_text(d.text)
    # Add metadata identifying the book
    d.metadata.update({"title": "Seeratul Mustafa Abridged"})
    cleaned_docs.append(d)

# Inspect output
cleaned_docs[0].get_content()

' اَ it 04 04 Better than you no eye has ever seen More beautiful than you no woman has given birth to You have been created free from any defect As if you were created like how you desired Seeratul Mustafa \uf048\uf020'

In [54]:
# Clean the remaining books and tag each document with its book title
for docs, title in [
    (documents1, "AR-RAHEEQ al-makhtum"),
    (documents2, "Seerat Ibn e Hisham"),
    (documents3, "Seerat Ibn e Hisham"),
]:
    for d in docs:
        d.text = clean_up_text(d.text)
        d.metadata.update({"title": title})
        cleaned_docs.append(d)

In [None]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Override the earlier chunk size of 100 for indexing
Settings.chunk_size = 1024

# Embed all four books and load them into the vector store.
# all_documents holds the same Document objects that were cleaned in place above.
vector_store_index = VectorStoreIndex.from_documents(
    all_documents,
    storage_context=vector_store_context,
    show_progress=True,
    transformations=[SentenceSplitter(chunk_size=1024)],
)

## Setup basic RAG

In [None]:
query_engine = vector_store_index.as_query_engine()

## Add feedback functions

In [None]:
# Note: this OpenAI is the TruLens feedback provider; it shadows the
# llama_index OpenAI LLM class imported earlier in the notebook.
from trulens_eval.feedback.provider import OpenAI
from trulens_eval import Feedback
from trulens_eval.app import App
from trulens_eval.feedback import Groundedness
import numpy as np

# Initialize the feedback provider
provider = OpenAI()

# Select the context to be used in feedback. The location of the context is app specific.
context = App.select_context(query_engine)

# Define a groundedness feedback function, reusing the same provider
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name = "Groundedness")
    .on(context.collect()) # collect context chunks into a list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name = "Answer Relevance")
    .on_input_output()
)
# Context relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
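
Context relevance is scored once per retrieved chunk, and `.aggregate(np.mean)` then reduces those per-chunk scores to a single value for the record. A minimal sketch of that reduction, using invented scores and `statistics.fmean` as a dependency-free stand-in for `np.mean`:

```python
from statistics import fmean

# Hypothetical per-chunk context relevance scores for one query
# (assumed here to lie between 0 and 1).
chunk_scores = [1.0, 0.5, 0.0, 0.5]

# One record-level score: the arithmetic mean across the retrieved chunks.
context_relevance = fmean(chunk_scores)
print(context_relevance)
```

A mean rewards retrievals where most chunks are relevant; a different aggregator such as `min` or `max` would emphasize the worst or best chunk instead.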

In [None]:
from trulens_eval import TruLlama
tru_query_engine_recorder = TruLlama(query_engine,
    app_id='Basic RAG',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance])

## Write test cases

Let's write a few queries to test how well our RAG answers questions about the documents in the vector store.

In [None]:
from trulens_eval.generate_test_set import GenerateTestSet

test_set = {'MongoDB Atlas': [
                "How do you secure MongoDB Atlas?",
                "How can Time to Live (TTL) be used to expire data in MongoDB Atlas?",
                "What is a vector search index in MongoDB Atlas?",
                "How does MongoDB Atlas differ from a relational database in terms of data modeling?"],
            'Database Essentials': [
                "What is the impact of interleaving transactions in database operations?",
                "What is a vector search index? How is it related to semantic search?"
                ]
}

## Alternatively, generate the test set automatically


In [None]:
# test = GenerateTestSet(app_callable = query_engine.query)
# Generate the test set of a specified breadth and depth without examples automatically
# test_set = test.generate_test_set(test_breadth = 3, test_depth = 2)

## Get testing!

Our test set is made up of 2 topics (test breadth), with 4 and 2 questions respectively (test depth).
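
As a quick sanity check, breadth and depth can be read straight off a dictionary of this shape. The sketch below uses a stand-in dictionary with placeholder topics and questions rather than the real `test_set`.

```python
# A stand-in test set with the same shape: topic -> list of prompts.
example_test_set = {
    "Topic A": ["question 1", "question 2", "question 3"],
    "Topic B": ["question 4", "question 5"],
}

# Breadth is the number of topics; depth is the number of prompts per topic.
breadth = len(example_test_set)
depth_per_topic = {topic: len(prompts) for topic, prompts in example_test_set.items()}
total_queries = sum(depth_per_topic.values())

print(breadth, depth_per_topic, total_queries)
```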

We can store the topic as record level metadata and then test queries from each topic, using `tru_query_engine_recorder` as a context manager.

In [None]:
with tru_query_engine_recorder as recording:
    for category in test_set:
        recording.record_metadata=dict(prompt_category=category)
        test_prompts = test_set[category]
        for test_prompt in test_prompts:
            response = query_engine.query(test_prompt)

## Check evaluation results

Evaluation results can be viewed in the TruLens dashboard (started at the top of the notebook) or directly in the notebook.

In [None]:
tru.get_leaderboard()

Perhaps if we use metadata filters to create specialized query engines, we can improve the search results and thus, the overall evaluation results.

But it may be clunky to have several separate query engines, because then we have to decide which one to use!

Instead, let's use a router query engine to choose the query engine based on the query.
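
Conceptually, a router compares the incoming query against each tool's description and forwards the query to the best match. In this notebook that choice is delegated to an LLM-based selector; the toy sketch below replaces it with simple keyword overlap, purely to illustrate the idea. The `Tool` class and `route` function are hypothetical and are not LlamaIndex APIs.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Hypothetical stand-in for a query-engine tool: a name plus a description."""
    name: str
    description: str


def route(query: str, tools: list[Tool]) -> Tool:
    """Pick the tool whose description shares the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(tool: Tool) -> int:
        return len(query_words & set(tool.description.lower().split()))

    return max(tools, key=overlap)


tools = [
    Tool("essentials", "Useful for retrieving context about database essentials"),
    Tool("atlas", "Useful for retrieving context about MongoDB Atlas"),
]

chosen = route("How do you secure MongoDB Atlas", tools)
print(chosen.name)
```

Because routing depends on the descriptions, clear and distinct tool descriptions matter as much for an LLM selector as they do for this keyword version.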

## Router Query Engine + Metadata Filters

In [None]:
# Specify metadata filters
metadata_filters_db_essentials = MetadataFilters(
   filters=[ExactMatchFilter(key="metadata.file_name", value="DBEssential-2021.pdf")]
)
metadata_filters_atlas = MetadataFilters(
   filters=[ExactMatchFilter(key="metadata.file_name", value="atlas_best_practices.pdf")]
)

metadata_filters_databrick = MetadataFilters(
   filters=[ExactMatchFilter(key="metadata.file_name", value="DataBrick_vector_search.pdf")]
)
# Instantiate Atlas Vector Search as a retriever for each set of filters
vector_store_retriever_db_essentials = VectorIndexRetriever(index=vector_store_index, filters=metadata_filters_db_essentials, similarity_top_k=5)
vector_store_retriever_atlas = VectorIndexRetriever(index=vector_store_index, filters=metadata_filters_atlas, similarity_top_k=5)
vector_store_retriever_databrick = VectorIndexRetriever(index=vector_store_index, filters=metadata_filters_databrick, similarity_top_k=5)
# Pass the retrievers into the query engines
query_engine_with_filters_db_essentials = RetrieverQueryEngine(retriever=vector_store_retriever_db_essentials)
query_engine_with_filters_atlas = RetrieverQueryEngine(retriever=vector_store_retriever_atlas)
query_engine_with_filters_databrick = RetrieverQueryEngine(retriever=vector_store_retriever_databrick)

from llama_index.core.tools import QueryEngineTool

# Set up one tool per filtered query engine

essentials_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine_with_filters_db_essentials,
    description=(
        "Useful for retrieving context about database essentials"
    ),
)

atlas_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine_with_filters_atlas,
    description=(
        "Useful for retrieving context about MongoDB Atlas"
    ),
)

databrick_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine_with_filters_databrick,
    description = (
        "Useful for retrieving context about Databrick's course on Vector Databases and Search"
    )
)

# Create the router query engine

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import PydanticSingleSelector


router_query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        essentials_tool,
        atlas_tool,
        databrick_tool
    ],
)

from trulens_eval import TruLlama
tru_query_engine_recorder_with_router = TruLlama(router_query_engine,
    app_id='Router Query Engine + Filters v2',
    feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance])

In [None]:
with tru_query_engine_recorder_with_router as recording:
    for category in test_set:
        recording.record_metadata=dict(prompt_category=category)
        test_prompts = test_set[category]
        for test_prompt in test_prompts:
            response = router_query_engine.query(test_prompt)

## Check results!

In [None]:
tru.get_leaderboard()