## Using the Chroma Vector DB
#### Prerequisite: Before using this notebook, run the "Populate Chroma Vector DB with document embeddings" job so that Chroma contains the relevant embeddings.

#### 2.5 Create a new connection to the Chroma Vector DB

In [2]:
## Initialize a client for the Chroma DB persisted on local disk
import chromadb
import os

## Use the following line to connect from within CML
chroma_client = chromadb.PersistentClient(path="/home/cdsw/chroma-data")

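#### 2.6 Get the Chroma Vector DB collection and its object count
This code initializes a connection to Chroma DB, a database for managing and querying embeddings. It defines the embedding model to use, sets the collection name to 'cml-default', and tries to get that collection with the specified embedding function, creating it if it does not exist. Finally, it retrieves and prints the total number of embeddings in the Chroma DB index as a basic statistic about the collection.

Note: This exercise does not use a GPU, so you can safely ignore any GPU-related warnings.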

In [3]:
from chromadb.utils import embedding_functions


EMBEDDING_MODEL_REPO = os.environ.get("HF_EMBEDDING_MODEL_REPO")
EMBEDDING_MODEL_NAME = EMBEDDING_MODEL_REPO.split('/')[-1]
EMBEDDING_FUNCTION = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBEDDING_MODEL_NAME)

COLLECTION_NAME = 'cml-default'

print("initialising Chroma DB connection...")

print(f"Getting '{COLLECTION_NAME}' as object...")
try:
    collection = chroma_client.get_collection(name=COLLECTION_NAME, embedding_function=EMBEDDING_FUNCTION)
    print("Success")
except Exception:  # collection not found; the exact exception type varies by chromadb version
    print("Creating new collection...")
    collection = chroma_client.create_collection(name=COLLECTION_NAME, embedding_function=EMBEDDING_FUNCTION)
    print("Success")

# Get latest statistics from index
current_collection_stats = collection.count()
print('Total number of embeddings in Chroma DB index is ' + str(current_collection_stats))


2024-12-17 18:01:13.029971: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-17 18:01:13.030011: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-17 18:01:13.036584: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-17 18:01:13.082096: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


initialising Chroma DB connection...
Getting 'cml-default' as object...
Success
Total number of embeddings in Chroma DB index is 27
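
The model-name step in the cell above calls `.split('/')` directly on the value of `HF_EMBEDDING_MODEL_REPO`, which raises an `AttributeError` when the variable is unset. A minimal sketch of a more explicit lookup follows. The helper name `resolve_embedding_model_name` and the example repo ID are assumptions for illustration and do not come from this notebook.

```python
import os


def resolve_embedding_model_name(env=None, var="HF_EMBEDDING_MODEL_REPO"):
    """Return the last path segment of the repo ID stored in `var`.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    repo = env.get(var)
    if not repo:
        raise RuntimeError(f"Environment variable {var} is not set")
    # "org/model-name" -> "model-name"; a bare "model-name" is returned unchanged.
    return repo.rstrip("/").split("/")[-1]


# Hypothetical repo ID, used only for illustration.
example_env = {"HF_EMBEDDING_MODEL_REPO": "sentence-transformers/all-mpnet-base-v2"}
print(resolve_embedding_model_name(example_env))
```

Failing early with a named variable in the error message tends to be easier to diagnose in a shared workspace than a `NoneType` error raised further down the cell.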


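#### 2.7 Sample demo: populating Chroma with vectors that carry multiple attributes

Here we add a sample document to the Chroma vector database collection for semantic search. The document's text is stored together with associated metadata (its classification) and a unique ID (its file path).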

In [4]:
## Sample add to Chroma vector DB
file_path = '/example/of/file/path/to/doc.txt'
classification = "public"
text = "This is a sample document which would represent content for a semantic search."

collection.add(
    documents=[text],
    metadatas=[{"classification": classification}],
    ids=[file_path]
)

Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/requirements-azure/topics/ml-requirements-azure.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/requirements-azure/topics/ml-limitations-azure.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/requirements-azure/topics/ml-azure-planning.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/models/topics/ml-challenges-in-prod.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/user-accounts/topics/ml-user-roles.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/requirements-aws/topics/ml-requirements-aws.txt
Add of existing embedding ID: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/requirements-aws/topics/ml-aws-network-planning.txt
Add of exist
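
The `Add of existing embedding ID` messages above suggest that IDs already present in the index were passed to an add operation again. A minimal sketch of filtering a batch down to unseen IDs before adding follows. The helper `split_new_and_existing` and the sample IDs are made up for illustration. With a real collection, the stored IDs could be fetched with `collection.get(ids=ids)["ids"]`, and `collection.upsert` is an alternative when overwriting is the intent.

```python
def split_new_and_existing(existing_ids, ids, documents, metadatas):
    """Split a batch into records with unseen IDs and a list of skipped IDs."""
    seen = set(existing_ids)
    new_batch = {"ids": [], "documents": [], "metadatas": []}
    skipped = []
    for doc_id, document, metadata in zip(ids, documents, metadatas):
        if doc_id in seen:
            skipped.append(doc_id)
            continue
        seen.add(doc_id)  # also catches duplicates inside the same batch
        new_batch["ids"].append(doc_id)
        new_batch["documents"].append(document)
        new_batch["metadatas"].append(metadata)
    return new_batch, skipped


# Made-up sample data standing in for IDs already stored in the collection.
already_stored = ["/docs/a.txt"]
new_batch, skipped = split_new_and_existing(
    already_stored,
    ids=["/docs/a.txt", "/docs/b.txt", "/docs/b.txt"],
    documents=["old text", "new text", "duplicate of new text"],
    metadatas=[{"classification": "public"}] * 3,
)
print(new_batch["ids"], skipped)

# With a real collection (assumption: `collection` is a Chroma collection):
# if new_batch["ids"]:
#     collection.add(**new_batch)
```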

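#### 2.8 Sample demo: running a vector query in Chroma and using metadata to reduce noise

This code runs a semantic search against the Chroma vector database with a sample query text and retrieves the two most similar results. Metadata can refine the results further: specifying a filter on a metadata field makes queries more precise and context-aware.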

In [5]:
## Query Chroma vector DB 
## This query returns the two most similar results from a semantic search
results = collection.query(
    query_texts=["What is Apache Iceberg?"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)
print(results)

{'ids': [['/home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/architecture-overview/topics/ml-architecture-overview-cml.txt', '/home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/architecture-overview/topics/ml-architecture-overview-provisioning.txt']], 'distances': [[1.501272571600087, 1.5616063231252637]], 'metadatas': [[{'classification': 'public'}, {'classification': 'public'}]], 'embeddings': None, 'documents': [["Cloudera AI ArchitectureCloudera Docs\nCloudera AI Architecture\nOnce a Cloudera AI Workbench is provisioned, you can start using Cloudera AI for your end-to-end Machine Learning workflow. \nCloudera AI is a three-tier application that consists of a presentation tier, an application tier\nand a data tier. \nWeb tier\nCloudera AI is a web application that provides a UI that simplifies the action of managing\nworkloads and resources for data scientists. It offers users a convenient way to deploy\nand scale their analytical pipeline and collaborate 
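
The printed result is a dictionary of nested lists, with one inner list per query text. A minimal sketch of flattening it into per-hit records and applying an equality filter on metadata after the query follows. The helpers `flatten_query_results` and `filter_by_metadata` and all sample values are hypothetical. The query-time equivalent is the `where` argument shown commented out in the query cell.

```python
def flatten_query_results(results, query_index=0):
    """Turn Chroma-style nested query results into a list of per-hit dicts."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    documents = results["documents"][query_index]
    return [
        {"id": i, "distance": d, "metadata": m, "document": doc}
        for i, d, m, doc in zip(ids, distances, metadatas, documents)
    ]


def filter_by_metadata(records, field, value):
    """Keep only records whose metadata[field] equals value."""
    return [r for r in records if (r["metadata"] or {}).get(field) == value]


# Made-up results shaped like the output printed above.
sample_results = {
    "ids": [["/docs/a.txt", "/docs/b.txt"]],
    "distances": [[1.50, 1.56]],
    "metadatas": [[{"classification": "public"}, {"classification": "internal"}]],
    "embeddings": None,
    "documents": [["first document text", "second document text"]],
}

records = flatten_query_results(sample_results)
public_only = filter_by_metadata(records, "classification", "public")
for record in public_only:
    print(record["id"], record["distance"])
```

Filtering at query time with `where` is usually preferable, because `n_results` then counts only matching documents. Filtering afterwards can leave fewer hits than requested.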

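#### 2.9 Using Chroma results to map back to the original (full) files on the local file system

This code defines a helper function, load_context_chunk_from_data, that retrieves the content of a knowledge base document from its file path (ID). It then iterates over the search results and prints, for each result, the file path, the classification, a snippet of the document, and the full document content loaded from the file.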
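
The helper in the next cell assumes every result ID is an existing file path, so a missing file stops the loop with an exception. A minimal sketch of a more forgiving variant follows. The name `load_context_chunk_or_none` is hypothetical, and the files are assumed to be UTF-8 encoded.

```python
import tempfile
from pathlib import Path


def load_context_chunk_or_none(id_path, encoding="utf-8"):
    """Return the file's text, or None when the path is not an existing file."""
    path = Path(id_path)
    if not path.is_file():
        return None
    return path.read_text(encoding=encoding)


# Demonstration with a temporary file standing in for a knowledge base document.
with tempfile.TemporaryDirectory() as tmp_dir:
    doc_path = Path(tmp_dir) / "doc.txt"
    doc_path.write_text("sample knowledge base text", encoding="utf-8")
    found = load_context_chunk_or_none(doc_path)
    missing = load_context_chunk_or_none(Path(tmp_dir) / "missing.txt")

print(found)
print(missing)
```

Returning `None` lets the display loop print a placeholder for stale IDs instead of aborting on the first missing file.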

In [6]:
# Helper function to return the Knowledge Base doc based on its Knowledge Base ID (the file path)
def load_context_chunk_from_data(id_path):
    with open(id_path, "r") as f: # Open file in read mode
        return f.read()
    
## Clean up output and display full file
for i in range(len(results['ids'][0])):
    file_path = results['ids'][0][i]
    classification = results['metadatas'][0][i]['classification']
    document = results['documents'][0][i]
    
    print("------------- RESULT " + str(i+1) + " ----------------\n")
    print(f"FILE PATH: {file_path}")
    print(f"CLASSIFICATION: {classification}")
    print(f"DOCUMENT: {document}\n")
    print(f"FULL DOCUMENT (FROM FILE): {load_context_chunk_from_data(file_path)}\n")


------------- RESULT 1 ----------------

FILE PATH: /home/cdsw/data/https:/docs.cloudera.com/machine-learning/cloud/architecture-overview/topics/ml-architecture-overview-cml.txt
CLASSIFICATION: public
DOCUMENT: Cloudera AI ArchitectureCloudera Docs
Cloudera AI Architecture
Once a Cloudera AI Workbench is provisioned, you can start using Cloudera AI for your end-to-end Machine Learning workflow. 
Cloudera AI is a three-tier application that consists of a presentation tier, an application tier
and a data tier. 
Web tier
Cloudera AI is a web application that provides a UI that simplifies the action of managing
workloads and resources for data scientists. It offers users a convenient way to deploy
and scale their analytical pipeline and collaborate with their colleagues in a secure
compartmentalized environment. 
Cloudera AI communicates using HTTPS, Websocket, and gRPC. External communication is limited to
HTTP and Websocket for the web UI and APIs. In-cluster service-to-service communica