# Query relevance using Chroma density

Density is a new experimental feature in Chroma. The aim is to give LM-in-the-loop application developers an easy way 
to evaluate whether a given input is supported by the data in the embedding store. 

For an in-depth writeup and more example use-cases, see [this post]().

**We value feedback on this experimental feature! [Join us in Discord](https://discord.gg/9WZAkTEEwC)** 

## Relevant queries

Simply retrieving a number of 'most relevant' documents is often insufficient, because while they might be the _most_ 
relevant, it's not necessarily true that they're _actually_ relevant to the query. 

The simplest way to deal with this is to restrict results by a certain distance threshold, for example a maximum cosine
distance between the query and the retrieved results. 
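
As a rough illustration of the thresholding idea, here is a minimal sketch in plain Python. The helper names (`cosine_distance`, `filter_by_distance`), the toy 2-D embeddings, and the cutoff of `0.2` are all assumptions made up for this example; none of them are part of Chroma.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filter_by_distance(query, candidates, max_distance):
    """Keep only (doc, distance) pairs within max_distance of the query."""
    kept = []
    for doc, emb in candidates:
        d = cosine_distance(query, emb)
        if d <= max_distance:
            kept.append((doc, d))
    # Closest results first.
    return sorted(kept, key=lambda pair: pair[1])


# Toy 2-D embeddings, invented for illustration only.
query = [1.0, 0.0]
candidates = [
    ("close", [0.9, 0.1]),
    ("orthogonal", [0.0, 1.0]),
    ("opposite", [-1.0, 0.0]),
]

# The 0.2 cutoff is arbitrary; picking it well is exactly the hard part.
results = filter_by_distance(query, candidates, max_distance=0.2)
print(results)
```

Only the `"close"` document survives the cutoff here, but the cutoff itself had to be chosen by hand.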

However, the right threshold depends on both the dataset and the embedding model. Finding it might require 
significant tuning and experimentation for each new dataset, and it is likely to change as the data does. 

## Just ask the model?

One approach to this problem is to just ask the model whether information in its context is relevant to the query.
This might work well in the general case, but comes with some drawbacks. 

LM calls are relatively expensive compared to querying Chroma. Additionally, in many use-cases, the model may not be 
able to assess the relevancy of a document to a given query, especially if there are few or no examples of these types 
of documents in the training corpus. 

## Algorithmic relevance 

Density is an algorithmic approach to the problem of query relevance, which automatically adapts to the dataset and
embedding model. In brief, Chroma computes a distance percentile for each query result and returns it as an auxiliary 
output along with the query result itself. Here's how it works: 

- For each embedding in the dataset, we compute the distances to its N nearest neighbors. 
- We construct a cumulative density over the returned distances, and store it as a lookup table. 

When a query arrives:

- We get the distance to the query for each result.
- We use the cumulative density lookup table to determine the percentile each distance lies in relative to the dataset 
as a whole.
- We return this percentile $p$ as an auxiliary output. 
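
The steps above can be sketched in plain Python. This is not Chroma's implementation: the function names are hypothetical, Euclidean distance stands in for whatever metric the collection uses, and a sorted list of neighbor distances stands in for the lookup table. It also assumes the percentile is oriented so that a higher value means the query distance is smaller than most neighbor distances in the dataset, which matches how the percentiles are read in the examples in this notebook.

```python
import bisect
import math


def nearest_neighbor_distances(embeddings, n_neighbors):
    """Collect, for every embedding, the distances to its n_neighbors nearest neighbors.

    Returns all collected distances as one sorted list, which serves as an
    empirical cumulative-density lookup table.
    """
    all_distances = []
    for i, a in enumerate(embeddings):
        dists = sorted(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        all_distances.extend(dists[:n_neighbors])
    return sorted(all_distances)


def distance_percentile(sorted_distances, query_distance):
    """Fraction of dataset neighbor distances that are >= query_distance.

    Assumption: higher means "closer than is typical for this dataset".
    """
    below = bisect.bisect_left(sorted_distances, query_distance)
    return 1.0 - below / len(sorted_distances)


# Toy 1-D dataset, invented for illustration only.
embeddings = [[0.0], [1.0], [2.0], [4.0]]
table = nearest_neighbor_distances(embeddings, n_neighbors=1)

for d in (0.5, 1.5, 3.0):
    print(d, distance_percentile(table, d))
```

Because the table is built from the dataset itself, the same code adapts to a different dataset or embedding model without a hand-tuned distance threshold.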

The percentile $p$ can be thought of as a proxy for the probability that a given neighbor is relevant to the query. We
can then use this proxy value to decide whether a result is relevant before giving it to the LM.
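
One way to act on this proxy is to drop results below a minimum percentile before building the LM prompt. The sketch below assumes documents and percentiles arrive as parallel lists of lists (one inner list per query); `filter_relevant`, the toy values, and the `0.5` cutoff are hypothetical choices for illustration, not a recommended setting.

```python
def filter_relevant(documents, percentiles, min_percentile=0.5):
    """Return, per query, the (document, percentile) pairs at or above min_percentile."""
    kept = []
    for docs, pcts in zip(documents, percentiles):
        kept.append([(doc, p) for doc, p in zip(docs, pcts) if p >= min_percentile])
    return kept


# Toy values, invented for illustration only.
documents = [["doc a", "doc b", "doc c"]]
percentiles = [[0.84, 0.49, 0.02]]

relevant = filter_relevant(documents, percentiles, min_percentile=0.5)
print(relevant)
```

If a query ends up with no results above the cutoff, the application can skip the LM call or tell the user that the data does not cover the question.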

See [this post]() for a more in-depth write up. On to the examples! 

In [1]:
import os
import urllib.request
import zipfile

import chromadb.experimental
from chromadb.config import Settings

# Dataset

As an example dataset we use an embedding of the Bible. 

We embed the Bible by verse, using the `multi-qa-mpnet-base-dot-v1` embedding model 
from [Sentence Transformers](https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1). This model is 
trained such that short questions and their accompanying answers lie close together in embedding space, 
under the dot product. 

For convenience, we supply the dataset as a pre-made Chroma index, and load it using a persisted client. 
The metadata contains information about the chapter, verse, and book the verse is from. 

In [2]:
# Check if the chroma-bible persistent directory exists
persist_directory = "chroma-bible"

if not os.path.exists(persist_directory):

    # Download the chroma-bible index from https://chroma-datastore.sfo3.digitaloceanspaces.com/chroma-bible.zip
    urllib.request.urlretrieve("https://chroma-datastore.sfo3.digitaloceanspaces.com/chroma-bible.zip", "chroma-bible.zip")

    # Unzip the chroma-bible persistent directory
    with zipfile.ZipFile("chroma-bible.zip", "r") as zip_ref:
        zip_ref.extractall()
        

# Setup
We load the data using the Chroma experimental client, which gives us access to the density feature. 

In [3]:
client = chromadb.experimental.ExperimentalClient(
    Settings(persist_directory=persist_directory, chroma_db_impl="duckdb+parquet")
)
collection = client.get_collection("bible")


     ###### Welcome to the Chroma experimental client! #####

    



We initialize the same embedding function we used to originally create the collection.

In [28]:
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

embedding_function = SentenceTransformerEmbeddingFunction(model_name="multi-qa-mpnet-base-dot-v1")

Density is computed when the first query is run, and remains precomputed until the data is updated. We run a test
query to initialize.

In [50]:
query_emb = embedding_function(["Just a test"])
collection.query(query_embeddings=query_emb, n_results=1)

({'ids': [['13721']],
  'embeddings': [[[-0.22741438448429108,
     -0.5974783897399902,
     -0.42101210355758667,
     ... (output truncated)

Putting it all together, we define a function that queries the Bible collection and prints the 
results along with their distance percentiles.

In [43]:
def query_bible(query: str, n_results: int = 3):
    # Embed the query with the same model used to build the collection.
    query_emb = embedding_function([query])
    # The experimental client returns the usual result plus the distance percentiles.
    result, distance_percentiles = collection.query(query_embeddings=query_emb, n_results=n_results)

    print(f"Query: {query}")
    print("Results:")
    for documents, percentiles in zip(result['documents'], distance_percentiles):
        for document, percentile in zip(documents, percentiles):
            print(f"\t{document}")
            print(f"\tPercentile: {percentile}")


# Queries

Let's run some queries. First, let's start with an irrelevant question, which we can use as a baseline.

**Note:** Density is computed on the first query.

In [45]:
query_bible("Who was the 35th president of the United States?")

Query: Who was the 35th president of the United States?
Results:
	I said in my haste, All men are liars.
	Percentile: 0.07713452280821043
	And again he denied with an oath, I know not the man.
	Percentile: 0.02268830866694105
	Behold, I have told you beforehand.
	Percentile: 0.01549777197158142


As expected, each of these results has a relatively low distance percentile - they aren't likely relevant to the query.

Next, let's try something that might be covered in the Bible. 

In [46]:
query_bible("What is the meaning of life?")

Query: What is the meaning of life?
Results:
	The Spirit of God hath made me, And the breath of the Almighty giveth me life.
	Percentile: 0.5794090803342937
	He that loveth his life loseth it; and he that hateth his life in this world shall keep it unto life eternal.
	Percentile: 0.5794090803342937
	And this is life eternal, that they should know thee the only true God, and him whom thou didst send, [even] Jesus Christ.
	Percentile: 0.539043289965819


The results are somewhat more relevant. Though their content doesn't immediately answer the query, we might supply 
these results to an LM to synthesize an answer for us. 

Next, let's ask more direct questions. 

In [48]:
query_bible("How old was Noah when he built the ark?")
query_bible("What is the number of wives of Solomon?")

Query: How old was Noah when he built the ark?
Results:
	And all the days of Noah were nine hundred and fifty years: And he died.
	Percentile: 0.841675763703717
	And Noah was six hundred years old when the flood of waters was upon the earth.
	Percentile: 0.841675763703717
	And Noah lived after the flood three hundred and fifty years.
	Percentile: 0.7610658744787846
Query: What is the number of wives of Solomon?
Results:
	And he had seven hundred wives, princesses, and three hundred concubines; and his wives turned away his heart.
	Percentile: 0.6943646092449758
	And there were born unto him seven sons and three daughters.
	Percentile: 0.6190339841443118
	The proverbs of Solomon the son of David, king of Israel:
	Percentile: 0.6190339841443118


We get pretty relevant results! 
Noah's age is given at various points, though only the second result actually answers the query. We also correctly get 
the number of Solomon's wives (seems he was pretty prolific, at least according to scripture).

Interestingly, despite both being direct questions about numbers, the percentiles for the results about Solomon
are lower, and indeed, only one result is actually relevant. 
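
To make that comparison concrete, the percentiles printed above can be summarized per query. This is just a convenience sketch over the values copied from the output; using the mean and maximum as summaries is an arbitrary choice, not something the density feature provides.

```python
from statistics import mean

# Percentiles copied from the query output above.
percentiles = {
    "How old was Noah when he built the ark?": [
        0.841675763703717, 0.841675763703717, 0.7610658744787846,
    ],
    "What is the number of wives of Solomon?": [
        0.6943646092449758, 0.6190339841443118, 0.6190339841443118,
    ],
}

# Summarize each query by its mean and best (maximum) percentile.
summary = {query: (mean(p), max(p)) for query, p in percentiles.items()}

for query, (avg, best) in summary.items():
    print(f"{query}\n\tmean: {avg:.3f}\tmax: {best:.3f}")
```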

Next, let's ask some questions which aren't just about a single quantity.

In [49]:
query_bible("Which Apostle took the Gospel to the city of Samaria?")
query_bible("What did the apostle Simon do for a living?")

Query: Which Apostle took the Gospel to the city of Samaria?
Results:
	Now when the apostles that were at Jerusalem heard that Samaria had received the word of God, they sent unto them Peter and John:
	Percentile: 0.9238139551530986
	And when they had preached the gospel to that city, and had made many disciples, they returned to Lystra, and to Iconium, and to Antioch,
	Percentile: 0.8628536659567996
	And Philip went down to the city of Samaria, and proclaimed unto them the Christ.
	Percentile: 0.8628536659567996
Query: What did the apostle Simon do for a living?
Results:
	and Simon he surnamed Peter;
	Percentile: 0.841675763703717
	Paul, an apostle of Christ Jesus through the will of God, to the saints that are at Ephesus, and the faithful in Christ Jesus:
	Percentile: 0.4990139408364501
	Paul, an apostle (not from men, neither through man, but through Jesus Christ, and God the Father, who raised him from the dead),
	Percentile: 0.4990139408364501


The results we get are mixed. For the query regarding Samaria, we get pretty relevant results, though one of them seems
to be relevant only because it mentions several cities.

For the query regarding Simon, we don't get any results relevant to the actual query, and a high percentile for the 
first result - it does mention Simon, but not his profession. 

The reason for this might be twofold:

- Simon's profession might never be in the same verse as his name. Without this context, no single verse can 
answer this query. 

- The embedding model we used might weight matching entities (e.g. names) more strongly than matching other features,
leading to results being similar because they mention the same kind of entity (e.g. both mention cities), rather than 
because of their role in the actual question.

There is still a lot to explore here - results like these suggest that density can be used not just to estimate 
query relevance, but also to explore the structure of the dataset for a given embedding model. 

# Conclusion

Density provides a way to estimate the relevancy of query results returned by Chroma which automatically adapts to the 
dataset and embedding model. We hope it provides you with another tool to build robust AI applications.

There's a lot more work to do to answer questions around how text should be chunked, tokenized, embedded and retrieved.
If you're interested in working with us on these and other problems, [we are hiring!](https://www.notion.so/trychroma/careers-chroma-9d017c3007c7478ebd85bad854101497).

