# Exploitation Zone Text

This notebook builds a text embedding pipeline. It loads the S3/MinIO credentials, connects to MinIO and to a ChromaDB HTTP server, and instantiates Chroma's default embedding function, which wraps the sentence-transformers model all-MiniLM-L6-v2, for text embeddings. It then creates a Chroma collection, iterates over the text objects stored in MinIO, computes an embedding for each object, and copies the object into the exploitation-zone bucket. These embeddings will be used in the last part of the project.

In [2]:
import boto3
import os
from dotenv import load_dotenv

load_dotenv()
access_key_id = os.getenv("ACCESS_KEY_ID")
secret_access_key = os.getenv("SECRET_ACCESS_KEY")
minio_url = "http://" + os.getenv("S3_API_ENDPOINT")


minio_client = boto3.client(
    "s3",
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    endpoint_url=minio_url
)

new_bucket = "exploitation-zone"
try:
    minio_client.create_bucket(Bucket=new_bucket)
except (minio_client.exceptions.BucketAlreadyExists, minio_client.exceptions.BucketAlreadyOwnedByYou):
    print(f"Bucket '{new_bucket}' already exists")

Bucket 'exploitation-zone' already exists
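
The cell above builds the endpoint by prefixing `http://` to `S3_API_ENDPOINT`, which raises a TypeError if the variable is unset and doubles the scheme if the variable already contains one. A minimal sketch of a more defensive variant follows. The helper name `build_endpoint_url` is hypothetical and not part of the notebook, and the default `http` scheme is an assumption carried over from the cell above.

```python
def build_endpoint_url(raw_endpoint, default_scheme="http"):
    """Return a full URL for an endpoint given as 'host:port' or as a URL."""
    endpoint = (raw_endpoint or "").strip()
    if not endpoint:
        raise ValueError("S3_API_ENDPOINT is not set")
    # Keep an explicit scheme if one is present; otherwise prepend the default.
    if endpoint.startswith(("http://", "https://")):
        return endpoint
    return f"{default_scheme}://{endpoint}"


# Illustrative value only; in the notebook the argument would be
# os.getenv("S3_API_ENDPOINT").
print(build_endpoint_url("localhost:9000"))
```

Failing early with a clear message is usually easier to debug than a TypeError raised during string concatenation.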


In [3]:
import chromadb
from chromadb.utils.embedding_functions import DefaultEmbeddingFunction

client = chromadb.HttpClient(host="localhost", port=8000)
default_ef = DefaultEmbeddingFunction()
paginator = minio_client.get_paginator("list_objects_v2")
exploitation_zone = "exploitation-zone"
trusted_zone = "trusted-zone"

collection_name = "exploitation-zone_text"
try:
    collection = client.get_or_create_collection(name=collection_name)
except Exception as e:
    print(f"Error accessing or creating collection: {e}")
    # Re-raise rather than calling exit(1), which would shut down the notebook kernel
    raise

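# Walk every object under the "text/" prefix of the trusted zone
for page in paginator.paginate(Bucket=trusted_zone, Prefix="text/"):
    for obj in page.get("Contents", []):
        key = obj.get("Key", "")
        # Skip folder placeholder keys, which carry no document content
        if not key or key.endswith("/"):
            continue
        response = minio_client.get_object(Bucket=trusted_zone, Key=key)
        document_content = response['Body'].read().decode('utf-8')

        # Embed the document text with the default embedding function
        embedding = default_ef([document_content])[0]

        # Store text and embedding in Chroma, using the object key as id
        collection.add(
            documents=[document_content],
            embeddings=[embedding],
            ids=[key]
        )
        # Copy the processed object into the exploitation zone bucket
        minio_client.copy_object(
            Bucket=exploitation_zone,
            CopySource={'Bucket': trusted_zone, 'Key': key},
            Key=key
        )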
result = collection.get()
print("returned keys:", list(result.keys()))

returned keys: ['ids', 'embeddings', 'metadatas', 'documents', 'data', 'uris', 'included']
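
The loop above issues one `collection.add` request per object. Since `add` takes lists of ids, documents and embeddings, one possible refinement is to group the objects into batches first. The sketch below shows only the grouping step in plain Python. The helper `batch_documents` and the batch size are hypothetical, and the toy keys stand in for real MinIO objects.

```python
from itertools import islice


def batch_documents(items, batch_size=64):
    """Yield (ids, documents) list pairs from an iterable of (key, text) pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        ids = [key for key, _ in chunk]
        documents = [text for _, text in chunk]
        yield ids, documents


# Toy data standing in for (object key, decoded text) pairs read from MinIO.
toy_items = [(f"text/doc_{i}.txt", f"content {i}") for i in range(5)]
batches = list(batch_documents(toy_items, batch_size=2))
for ids, documents in batches:
    # In the notebook, each batch could be passed on as
    # collection.add(ids=ids, documents=documents, embeddings=default_ef(documents))
    print(ids)
```

Whether batching pays off depends on the number and size of the documents. For a handful of small text files the per-object loop is probably sufficient.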


# Nearest Neighbour search on text

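This cell runs a same-modality search: a text query against the text collection. The query string is "What are common symptoms of skin cancer?", which we encode into an embedding with the same embedding function used for indexing (`default_ef`). We then set top_k=1 and query the Chroma collection for the k nearest neighbours. Since k=1, only the closest neighbour is returned. Finally, the cell prints that document together with its distance to the query. The cell depends on `default_ef` and `collection` from the indexing cell above, so that cell has to run first in the same kernel session; otherwise the query fails with a NameError, as in the recorded output below.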

In [1]:
query_text = "What are common symptoms of skin cancer?"
query_emb = default_ef([query_text])[0]

top_k = 1
results = collection.query(
    query_embeddings=[query_emb],
    n_results=top_k,
    include=["documents", "distances"]
)

print("\n--- Query Results ---")
print(f"Query: '{query_text}'")
print(f"Most similar document: {results['documents'][0][0]}")
print(f"Distance: {results['distances'][0][0]}")

NameError: name 'default_ef' is not defined
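
To make the nearest-neighbour step concrete, the following sketch performs the same kind of lookup by brute force over a few hand-written 3-dimensional vectors. It assumes squared Euclidean distance, which, as far as we know, is Chroma's default metric. The vectors, the ids and the helper `nearest_neighbours` are invented for illustration and are neither real embeddings nor notebook output.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbours(query, vectors, k=1):
    """Return the k (id, distance) pairs closest to query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "index": ids mapped to made-up 3-dimensional vectors.
toy_index = {
    "text/a.txt": [1.0, 0.0, 0.0],
    "text/b.txt": [0.0, 1.0, 0.0],
    "text/c.txt": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]
top = nearest_neighbours(toy_query, toy_index, k=1)
print(top)
```

A vector database replaces this linear scan with an approximate index, so the lookup stays fast as the collection grows, but the ranking idea is the same: smaller distance means a closer match.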