## Chroma Vector DB Usage
#### Prerequisites: Run the "Populate Chroma Vector DB with document embeddings" job before using this notebook so that Chroma contains the relevant embeddings

#### 5.5 Initialize persistent Chroma Vector DB connection

In [1]:
## Open the persistent (on-disk) Chroma DB; PersistentClient runs in-process and does not need a separate server
import chromadb
import os

## Use the following line to connect from within CML
chroma_client = chromadb.PersistentClient(path="/home/cdsw/chroma-data")
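
The storage path above is hard-coded for CML. As a minimal sketch, assuming the notebook may also be run in another environment, the path could be read from an environment variable with the CML path as the fallback. The variable name CHROMA_DATA_DIR and both helper functions are hypothetical and are not part of Chroma or CML.

```python
import os

DEFAULT_CHROMA_PATH = "/home/cdsw/chroma-data"  # path used in the cell above


def resolve_chroma_path(env=None):
    """Return the Chroma storage path, preferring the hypothetical CHROMA_DATA_DIR variable."""
    env = os.environ if env is None else env
    # An unset or empty variable falls back to the CML default
    return env.get("CHROMA_DATA_DIR") or DEFAULT_CHROMA_PATH


def open_chroma_client(path=None):
    """Open an on-disk Chroma client at the given or resolved path."""
    import chromadb  # imported lazily so the path helper works without chromadb installed

    return chromadb.PersistentClient(path=path or resolve_chroma_path())
```

With this in place, `chroma_client = open_chroma_client()` would behave like the cell above whenever the variable is not set.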

#### 5.6 Get Chroma Vector DB Collection and number of collection objects
This cell defines the embedding model, sets the collection name to 'cml-default', and tries to fetch that collection with the specified embedding function, creating it if it does not exist yet. It then retrieves and prints the total number of embeddings stored in the collection.

In [3]:
from chromadb.utils import embedding_functions

EMBEDDING_MODEL_REPO = "sentence-transformers/all-mpnet-base-v2"
EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"
EMBEDDING_FUNCTION = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBEDDING_MODEL_NAME)

COLLECTION_NAME = 'cml-default'

print("initialising Chroma DB connection...")

print(f"Getting '{COLLECTION_NAME}' as object...")
try:
    # Fetch the existing collection; Chroma raises an exception if it does not exist
    collection = chroma_client.get_collection(name=COLLECTION_NAME, embedding_function=EMBEDDING_FUNCTION)
    print("Success")
except Exception:
    # Collection not found, so create it
    print("Creating new collection...")
    collection = chroma_client.create_collection(name=COLLECTION_NAME, embedding_function=EMBEDDING_FUNCTION)
    print("Success")

# Get latest statistics from index
current_collection_stats = collection.count()
print('Total number of embeddings in Chroma DB index is ' + str(current_collection_stats))


initialising Chroma DB connection...
Getting 'cml-default' as object...
Success
Total number of embeddings in Chroma DB index is 5
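
The get-then-create logic in the cell above can also be expressed with the client's get_or_create_collection method. The sketch below wraps it in a hypothetical helper, open_collection, which also warns when the collection is empty, since that would suggest the prerequisite population job has not been run. The helper name and the warning text are illustrative assumptions.

```python
def open_collection(client, name, embedding_function=None):
    """Fetch the named collection, creating it if missing, and return it with its size."""
    kwargs = {"name": name}
    if embedding_function is not None:
        # Only pass the embedding function when one was supplied
        kwargs["embedding_function"] = embedding_function
    collection = client.get_or_create_collection(**kwargs)
    count = collection.count()
    if count == 0:
        print(f"Warning: collection '{name}' is empty; run the population job first.")
    return collection, count
```

In this notebook it could be called as `collection, n = open_collection(chroma_client, COLLECTION_NAME, EMBEDDING_FUNCTION)`.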


#### 5.7 Sample demonstration of populating a vector into Chroma given several attributes

Here we add a sample document to the Chroma collection so that it can be found by semantic search. The text is the content that gets embedded, the classification is stored as metadata, and the file path serves as the document's unique ID.

In [36]:
## Sample add to Chroma vector DB
file_path = '/example/of/file/path/to/doc.txt'
classification = "public"
text = "This is a sample document which would represent content for a semantic search."

collection.add(
    documents=[text],
    metadatas=[{"classification": classification}],
    ids=[file_path]
)
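
The cell above adds a single document. For several documents, the same three parallel lists can be built in one pass. The sketch below assumes each document is a dict with the hypothetical keys path, text, and classification; build_add_payload is an illustrative helper and not part of Chroma, and the second sample path is made up. The result is meant to be passed to collection.upsert rather than collection.add, so that re-running the cell with the same IDs updates the existing entries instead of trying to insert duplicates.

```python
def build_add_payload(docs):
    """Turn a list of document dicts into the parallel lists Chroma expects."""
    ids = [d["path"] for d in docs]
    if len(set(ids)) != len(ids):
        raise ValueError("document paths must be unique because they are used as IDs")
    return {
        "ids": ids,
        "documents": [d["text"] for d in docs],
        "metadatas": [{"classification": d["classification"]} for d in docs],
    }


# Hypothetical sample documents for illustration only
sample_docs = [
    {
        "path": "/example/of/file/path/to/doc.txt",
        "text": "This is a sample document which would represent content for a semantic search.",
        "classification": "public",
    },
    {
        "path": "/example/of/file/path/to/other.txt",
        "text": "A second sample document.",
        "classification": "public",
    },
]

payload = build_add_payload(sample_docs)
# With a live collection: collection.upsert(**payload)
```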

#### 5.8 Sample demonstration of querying a vector in Chroma along with using metadata to reduce noise

This code runs a semantic search against the collection using a sample query and retrieves the two most similar results. The optional where and where_document filters, shown commented out in the cell, restrict the results by metadata fields or by document content, which allows more precise, context-aware queries.

In [4]:
## Query Chroma vector DB 
## This query returns the two most similar results from a semantic search
results = collection.query(
    query_texts=["What is Apache Iceberg?"],
    n_results=2,  # trailing comma so the optional filters below can be uncommented as-is
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)
print(results)

{'ids': [['/home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/architecture-overview/topics/ml-architecture-overview-cml.txt', '/home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/product/topics/ml-product-overview.txt']], 'distances': [[1.4979395183284212, 1.5874341272395678]], 'metadatas': [[{'classification': 'public'}, {'classification': 'public'}]], 'embeddings': None, 'documents': [["CML ArchitectureCloudera Docs\nCML Architecture\nOnce a CML workspace is provisioned, you can start using Cloudera Machine Learning\n      (CML) for your end-to-end Machine Learning workflow. \nCML is a three-tier application that consists of a presentation tier, an application tier\n         and a data tier. \nWeb tier\nCML is a web application that provides a UI that simplifies the action of managing\n            workloads and resources for data scientists. It offers users a convenient way to deploy\n            and scale their analytical pipeline and collaborate with their 
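
The optional filters that are commented out in the query above can be combined with the semantic search. As a sketch, the hypothetical helper below restricts results to a single classification value through the where argument. The helper name and its default values are assumptions for illustration.

```python
def query_by_classification(collection, question, classification="public", n_results=2):
    """Run a semantic query restricted to documents with the given classification."""
    return collection.query(
        query_texts=[question],
        n_results=n_results,
        where={"classification": classification},  # exact-match metadata filter
    )
```

For example, `results = query_by_classification(collection, "What is Apache Iceberg?")` would return at most two results whose metadata has classification set to "public".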

#### 5.9 Outcomes of using Chroma to map to the original file in the local file system (the complete file)

The helper function load_context_chunk_from_data reads a knowledge base document from disk using its ID, which is the document's file path. The loop then walks through the search results and prints, for each one, the file path, the classification, the document snippet stored in Chroma, and the full document content loaded from the file.

In [5]:
# Helper function to return the Knowledge Base doc based on its Knowledge Base ID (the document's file path)
def load_context_chunk_from_data(id_path):
    with open(id_path, "r") as f: # Open file in read mode
        return f.read()
    
## Clean up output and display full file
for i in range(len(results['ids'][0])):
    file_path = results['ids'][0][i]
    classification = results['metadatas'][0][i]['classification']
    document = results['documents'][0][i]
    
    print("------------- RESULT " + str(i+1) + " ----------------\n")
    print(f"FILE PATH: {file_path}")
    print(f"CLASSIFICATION: {classification}")
    print(f"DOCUMENT: {document}\n")
    print(f"FULL DOCUMENT (FROM FILE): {load_context_chunk_from_data(file_path)}\n")


------------- RESULT 1 ----------------

FILE PATH: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/architecture-overview/topics/ml-architecture-overview-cml.txt
CLASSIFICATION: public
DOCUMENT: CML ArchitectureCloudera Docs
CML Architecture
Once a CML workspace is provisioned, you can start using Cloudera Machine Learning
      (CML) for your end-to-end Machine Learning workflow. 
CML is a three-tier application that consists of a presentation tier, an application tier
         and a data tier. 
Web tier
CML is a web application that provides a UI that simplifies the action of managing
            workloads and resources for data scientists. It offers users a convenient way to deploy
            and scale their analytical pipeline and collaborate with their colleagues in a secure
            compartmentalized environment. 
CML communicates using HTTPS, Websocket, and gRPC. External communication is limited to
            HTTP and Websocket for the web UI and APIs. In-c