# Query OLS Embeddings with ChromaDB

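This notebook demonstrates similarity search over ontology term embeddings stored in ChromaDB: it embeds a text query with OpenAI's `text-embedding-3-small` model, retrieves the nearest terms from the collection, and then narrows the search to one or more ontologies.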

**Prerequisites:** An existing ChromaDB collection at `../data/chromadb/chroma_ols20_nonols4/` (see CHROMADB_SETUP.md for how to create it) and an `OPENAI_API_KEY` set in `../.env` or in the environment.

In [1]:
import openai
import chromadb
import os
from dotenv import load_dotenv
from chromadb.config import Settings

In [2]:
# Load environment variables
load_dotenv(dotenv_path='../.env')
openai.api_key = os.getenv("OPENAI_API_KEY")

if not openai.api_key:
    raise ValueError("OPENAI_API_KEY not found in .env file or environment variables.")

In [3]:
chromadb_path = "../data/chromadb/chroma_ols20_nonols4"
chromadb_collection_name = "combined_embeddings"

In [4]:
# Connect to ChromaDB
client = chromadb.PersistentClient(
    path=chromadb_path,
    settings=Settings(anonymized_telemetry=False)
)

collection = client.get_collection(name=chromadb_collection_name)
print(f"Collection loaded: {collection.count():,} embeddings")

Collection loaded: 452,942 embeddings
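
The distances returned by the queries below depend on the distance function the collection was built with, so it can help to check it right after loading. The following is a minimal sketch, assuming the space was recorded under the `hnsw:space` metadata key when the collection was created; newer ChromaDB releases may store this setting elsewhere, so a missing key is not proof that the default is in use. `describe_space` is a hypothetical helper, not part of ChromaDB.

```python
from typing import Optional


def describe_space(metadata: Optional[dict]) -> str:
    """Return the distance space recorded in a collection's metadata.

    Assumption: the space lives under the 'hnsw:space' key. When the key
    is absent, fall back to ChromaDB's documented default, "l2"
    (squared Euclidean distance).
    """
    return (metadata or {}).get("hnsw:space", "l2")


# In this notebook the call would be: describe_space(collection.metadata)
print(describe_space({"hnsw:space": "cosine"}))
print(describe_space(None))
```

Knowing the space matters later when interpreting the printed distances: lower is always more similar, but the scale differs between squared L2 and cosine distance.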


In [5]:
# Your search query
query = "agricultural soil"

In [6]:

# Generate embedding for the query
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=query
)

query_embedding = response.data[0].embedding
print(f"Generated embedding for: '{query}'")

Generated embedding for: 'agricultural soil'


In [7]:
query_embedding

[-0.008643225766718388,
 0.002361387712880969,
 0.022854099050164223,
 0.014709287323057652,
 -0.002233745064586401,
 -0.02543126419186592,
 0.03428114950656891,
 -0.014526940882205963,
 -0.024665407836437225,
 0.0020073314663022757,
 0.0048230658285319805,
 -0.004142305348068476,
 0.011396658606827259,
 -0.026573969051241875,
 -0.0049051218666136265,
 -0.03566698357462883,
 -0.025236761197447777,
 -0.047507353127002716,
 -0.029540138319134712,
 -0.014672817662358284,
 0.05193229392170906,
 -0.02034987322986126,
 -0.0038870202843099833,
 -0.015949243679642677,
 0.010551786050200462,
 -0.07950308918952942,
 0.005841167643666267,
 -0.027789613232016563,
 0.00043687192373909056,
 -0.015122606419026852,
 -0.04704540595412254,
 -0.04167226329445839,
 -0.00894105900079012,
 -0.012113888747990131,
 0.04086993634700775,
 -0.040116239339113235,
 0.04493018612265587,
 -0.03980017080903053,
 -0.0024890301283448935,
 0.005029725376516581,
 -0.03457290306687355,
 0.012703475542366505,
 -0.022343529
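
Printing the whole vector is rarely informative. A short summary is usually enough to confirm that the embedding looks right: `text-embedding-3-small` returns 1536 dimensions by default, and OpenAI embeddings are normalized to unit length, so the norm should be close to 1. The sketch below uses a hypothetical helper, `summarize_embedding`, and a toy three-element vector in place of the real one.

```python
import math
from typing import Sequence


def summarize_embedding(vec: Sequence[float], head: int = 5) -> str:
    """Summarize a vector as its dimension, L2 norm, and first few values."""
    norm = math.sqrt(sum(x * x for x in vec))
    preview = ", ".join(f"{x:.4f}" for x in vec[:head])
    return f"dim={len(vec)} norm={norm:.4f} head=[{preview}, ...]"


# In this notebook the call would be: print(summarize_embedding(query_embedding))
print(summarize_embedding([0.6, 0.8, 0.0]))
```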

In [8]:
# Query ChromaDB for similar embeddings
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,  # Top 10 results
    include=["documents", "metadatas", "distances"]
)

In [9]:


print("\nTop 10 most similar terms (across ALL ontologies):\n")
for iri, distance, metadata, document in zip(
    results['ids'][0],
    results['distances'][0],
    results['metadatas'][0],
    results['documents'][0],
):
    ontology = metadata['ontologyId']

    # Distances are printed as returned: lower means more similar.
    # ChromaDB defaults to squared L2 distance; to get cosine distance, the
    # collection must be created with metadata={"hnsw:space": "cosine"}.
    print(f"{distance:.4f} | {ontology} | {iri}")
    print(f"  {document[:150]}...\n")


Top 10 most similar terms (across ALL ontologies):

0.0000 | meo | meo:class:https://mdatahub.org/data/meo/MEO_0000125
  agricultural soil...

0.0000 | micro | micro:class:http://purl.obolibrary.org/obo/ENVO_00002259
  agricultural soil...

0.5117 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00002259
  agricultural soil; Soil which is part of an ecosystem used for agricultural activities....

0.5369 | meo | meo:class:https://mdatahub.org/data/meo/MEO_0000829
  agricultural field...

0.6157 | micro | micro:class:http://purl.obolibrary.org/obo/ENVO_00005742
  arable soil...

0.6196 | micro | micro:class:http://purl.obolibrary.org/obo/ENVO_00005749
  farm soil...

0.6197 | meo | meo:class:https://mdatahub.org/data/meo/MEO_0000197
  farm soil...

0.6339 | meo | meo:class:https://mdatahub.org/data/meo/MEO_0000198
  field soil...

0.6340 | micro | micro:class:http://purl.obolibrary.org/obo/ENVO_00005755
  field soil...

0.6936 | meo | meo:class:https://mdatahub.org/data/meo/MEO_0
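
The values in the first column are distances, not similarities. If the collection uses ChromaDB's default squared L2 space and the stored vectors are unit-normalized (both are assumptions about how this collection was built), the two are related by $\lVert a-b\rVert^2 = \lVert a\rVert^2 + \lVert b\rVert^2 - 2\,a\cdot b = 2 - 2\cos(a,b)$, so $\cos(a,b) = 1 - d/2$. A minimal sketch of that conversion, with hypothetical helper names:

```python
from typing import Sequence


def squared_l2_to_cosine(distance: float) -> float:
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Sanity check on two unit vectors whose cosine similarity is 0.6.
a, b = (1.0, 0.0), (0.6, 0.8)
print(squared_l2_to_cosine(squared_l2(a, b)))

# Applied to distances printed above (valid only under the stated assumptions).
for d in (0.0, 0.5117, 0.6936):
    print(f"distance {d:.4f} -> cosine similarity {squared_l2_to_cosine(d):.3f}")
```

If the collection was instead created with cosine space, the returned distance is already `1 - cosine similarity` and this conversion does not apply.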

In [10]:
# Filter by specific ontology (e.g., only ENVO terms)
results_envo = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    where={"ontologyId": "envo"},  # Filter condition
    include=["documents", "metadatas", "distances"]
)

print("\nTop 10 most similar ENVO terms:\n")
for i in range(len(results_envo['ids'][0])):
    iri = results_envo['ids'][0][i]
    distance = results_envo['distances'][0][i]
    document = results_envo['documents'][0][i]
    
    print(f"{distance:.4f} | {iri}")
    print(f"  {document[:150]}...\n")


Top 10 most similar ENVO terms:

0.5117 | envo:class:http://purl.obolibrary.org/obo/ENVO_00002259
  agricultural soil; Soil which is part of an ecosystem used for agricultural activities....

0.7365 | envo:class:http://purl.obolibrary.org/obo/ENVO_00005755
  field soil; Soil which is part of an agricultural field....

0.7492 | envo:class:http://purl.obolibrary.org/obo/ENVO_00005742
  arable soil; Soil which is capable of supporting the growth of crops....

0.8351 | envo:class:http://purl.obolibrary.org/obo/CHEBI_33286
  agrochemical...

0.8486 | envo:class:http://purl.obolibrary.org/obo/ENVO_00005749
  farm soil; A portion of soil which is part of a cropland or a rangeland biome....

0.8768 | envo:class:http://purl.obolibrary.org/obo/ENVO_00000519
  agricultural terrace; A terrace which is used for agricultural activities....

0.8848 | envo:class:http://purl.obolibrary.org/obo/ENVO_03000133
  agricultural potting mixture; potting soil; An agricultural environmental material which 1) i
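
The `where` argument accepts either an exact match on a metadata field or an operator expression such as `$in` for a list of allowed values. A small helper can build the right shape for one or several ontologies. This is a sketch: `ontology_where` is a hypothetical name, and it assumes the metadata field is `ontologyId`, as in the query above.

```python
from typing import Sequence, Union


def ontology_where(ontology_ids: Union[str, Sequence[str]]) -> dict:
    """Build a ChromaDB `where` filter for one or more ontology ids."""
    # Treat a bare string as a single id rather than a sequence of characters.
    ids = [ontology_ids] if isinstance(ontology_ids, str) else list(ontology_ids)
    if not ids:
        raise ValueError("at least one ontology id is required")
    if len(ids) == 1:
        return {"ontologyId": ids[0]}          # exact match
    return {"ontologyId": {"$in": ids}}        # any of the listed ids


print(ontology_where("envo"))
print(ontology_where(["envo", "go", "mondo"]))
```

The result can be passed directly as `where=ontology_where([...])` in `collection.query`.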

In [11]:
# Filter by multiple ontologies
results_multi = collection.query(
    query_embeddings=[query_embedding],
    n_results=20,
    where={"ontologyId": {"$in": ["envo", "go", "mondo"]}},  # Multiple ontologies
    include=["documents", "metadatas", "distances"]
)

print("\nTop 20 results from ENVO, GO, and MONDO:\n")
for i in range(len(results_multi['ids'][0])):
    iri = results_multi['ids'][0][i]
    distance = results_multi['distances'][0][i]
    ontology = results_multi['metadatas'][0][i]['ontologyId']
    document = results_multi['documents'][0][i]
    
    print(f"{distance:.4f} | {ontology} | {iri}")
    print(f"  {document[:100]}...\n")


Top 20 results from ENVO, GO, and MONDO:

0.5117 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00002259
  agricultural soil; Soil which is part of an ecosystem used for agricultural activities....

0.7365 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00005755
  field soil; Soil which is part of an agricultural field....

0.7492 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00005742
  arable soil; Soil which is capable of supporting the growth of crops....

0.8351 | envo | envo:class:http://purl.obolibrary.org/obo/CHEBI_33286
  agrochemical...

0.8351 | go | go:class:http://purl.obolibrary.org/obo/CHEBI_33286
  agrochemical...

0.8486 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00005749
  farm soil; A portion of soil which is part of a cropland or a rangeland biome....

0.8768 | envo | envo:class:http://purl.obolibrary.org/obo/ENVO_00000519
  agricultural terrace; A terrace which is used for agricultural activities....

0.8848 | envo | envo:cla
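
The three result loops above share the same shape, and the output shows the same term appearing under several ontologies (for example `CHEBI_33286` in both `envo` and `go`). The sketch below flattens a query result into one row per hit and keeps only the first, closest hit per underlying IRI. The helper names are hypothetical, and it assumes ids follow the `<ontology>:<type>:<IRI>` pattern seen in the output; the sample dict reuses three rows from the results above.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the hits of the first query in a ChromaDB result into row dicts."""
    rows = []
    for id_, dist, meta, doc in zip(
        results["ids"][0],
        results["distances"][0],
        results["metadatas"][0],
        results["documents"][0],
    ):
        rows.append({
            "id": id_,
            # Assumed id layout: "<ontology>:<type>:<IRI>"; keep the IRI part.
            "iri": id_.split(":", 2)[2],
            "ontology": meta["ontologyId"],
            "distance": dist,
            "document": doc,
        })
    return rows


def dedupe_by_iri(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first row per IRI; rows arrive sorted by distance."""
    seen = set()
    unique = []
    for row in rows:
        if row["iri"] not in seen:
            seen.add(row["iri"])
            unique.append(row)
    return unique


sample = {
    "ids": [[
        "envo:class:http://purl.obolibrary.org/obo/ENVO_00002259",
        "envo:class:http://purl.obolibrary.org/obo/CHEBI_33286",
        "go:class:http://purl.obolibrary.org/obo/CHEBI_33286",
    ]],
    "distances": [[0.5117, 0.8351, 0.8351]],
    "metadatas": [[{"ontologyId": "envo"}, {"ontologyId": "envo"}, {"ontologyId": "go"}]],
    "documents": [["agricultural soil", "agrochemical", "agrochemical"]],
}

for row in dedupe_by_iri(flatten_results(sample)):
    print(f"{row['distance']:.4f} | {row['ontology']} | {row['iri']}")
```

Deduplicating by IRI only merges terms that are literally the same class imported into several ontologies; terms with the same label but different IRIs are kept as separate hits.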