<a href="https://colab.research.google.com/github/alex-smith-uwec/NLP_Spring2025/blob/main/ChromaStart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic embedding retrieval with Chroma

This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.

## What are embeddings?

Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for AI-powered tools and algorithms. They can represent text, images, and soon audio and video.

To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than the vectors of dissimilar data.
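
To make "nearer" concrete, here is a minimal sketch using cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors and their labels are invented for illustration; a real embedding model outputs vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; the values are made up for this sketch.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)

# The related pair should score higher than the unrelated pair.
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```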

## Embeddings and retrieval

We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.
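
As a rough illustration of retrieval by similarity, the sketch below ranks a handful of stored vectors by their squared Euclidean distance to a query vector and returns the closest ones. The store, the vectors, and the `top_k` helper are all hypothetical; a vector database such as Chroma does this work for us and typically uses an index rather than scanning every vector.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors closest to the query."""
    ranked = sorted(store, key=lambda doc_id: squared_distance(query_vec, store[doc_id]))
    return ranked[:k]

# Hypothetical toy store mapping document ids to made-up embedding vectors.
toy_store = {
    "teeth": [0.9, 0.1, 0.0],
    "bones": [0.7, 0.3, 0.1],
    "light": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

Brute-force ranking like this is fine for a few vectors, which is why it is useful as a mental model even though it does not scale.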


In [None]:
%pip install -Uq chromadb numpy datasets


In [None]:
import textwrap
import random

## Example Dataset

As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).

Dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

In this notebook, we will demonstrate how to retrieve supporting evidence for a given question.


In [None]:
# Get the SciQ dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset("sciq", split="train")

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

In [None]:
dataset

Dataset({
    features: ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support'],
    num_rows: 10481
})

In [None]:
dataset[10]

In [None]:
dataset['support'][10]

'One way to keep iron from corroding is to keep it painted. The layer of paint prevents the water and oxygen necessary for rust formation from coming into contact with the iron. As long as the paint remains intact, the iron is protected from corrosion. Other strategies include alloying the iron with other metals. For example, stainless steel is mostly iron with a bit of chromium. The chromium tends to collect near the surface, where it forms an oxide layer that protects the iron. Zinc-plated or galvanized iron uses a different strategy. Zinc is more easily oxidized than iron because zinc has a lower reduction potential. Since zinc has a lower reduction potential, it is a more active metal. Thus, even if the zinc coating is scratched, the zinc will still oxidize before the iron. This suggests that this approach should work with other active metals. Another important way to protect metal is to make it the cathode in a galvanic cell. This is cathodic protection and can be used for metals 

## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the SciQ dataset into Chroma with just a few lines of code.


In [None]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [None]:

# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding function, and the default will be used.
collection = client.create_collection("sciq_supports")

In [None]:
k = 2500  # number of supports to embed for this demo

# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

# Delete the existing "sciq_supports" collection, if it exists
try:
    client.delete_collection("sciq_supports")
except Exception:
    # Ignore a missing collection; the exact exception type varies across Chroma versions
    pass

# Create a new Chroma collection to store the supporting evidence. We don't need to specify an embedding function, and the default will be used.
collection = client.create_collection("sciq_supports")

# Embed and store the first k supports for this demo
collection.add(
    ids=[str(i) for i in range(0, k)],  # IDs are just strings
    documents=dataset["support"][:k],
    metadatas=[{"type": "support"} for _ in range(0, k)]
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 72.3MiB/s]
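
Adding all `k` supports in a single call works for this demo. For larger collections, some Chroma versions may limit how many records one `add` call accepts, so a common pattern is to add in batches. The sketch below only builds the batches; `make_batches` and the toy documents are hypothetical, and the `collection.add` call is left as a comment so the example runs without Chroma.

```python
def make_batches(documents, batch_size, id_offset=0):
    """Yield (ids, documents, metadatas) tuples with at most batch_size items each."""
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        # IDs are stringified row positions, matching the convention used above
        ids = [str(id_offset + start + i) for i in range(len(chunk))]
        metadatas = [{"type": "support"} for _ in chunk]
        yield ids, chunk, metadatas

# Hypothetical stand-in for dataset["support"][:k]
toy_documents = [f"support text {i}" for i in range(5)]

batches = list(make_batches(toy_documents, batch_size=2))
for ids, docs, metas in batches:
    print(ids, len(docs), len(metas))
    # With a real collection, each batch would be stored with:
    # collection.add(ids=ids, documents=docs, metadatas=metas)
```

Because the ids are row positions, `collection.get("1500")` and `dataset[1500]` refer to the same support text.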


In [None]:
dataset[1500]

{'question': 'The appendicular skeleton is made up of all bones of the upper and lower what?',
 'distractor3': 'digestive tract',
 'distractor1': 'organs',
 'distractor2': 'hemispheres',
 'correct_answer': 'limbs',
 'support': 'Figure 7.2 Axial and Appendicular Skeleton The axial skeleton supports the head, neck, back, and chest and thus forms the vertical axis of the body. It consists of the skull, vertebral column (including the sacrum and coccyx), and the thoracic cage, formed by the ribs and sternum. The appendicular skeleton is made up of all bones of the upper and lower limbs.'}

In [None]:
collection.get("1500")

{'ids': ['1500'],
 'embeddings': None,
 'metadatas': [{'type': 'support'}],
 'documents': ['Figure 7.2 Axial and Appendicular Skeleton The axial skeleton supports the head, neck, back, and chest and thus forms the vertical axis of the body. It consists of the skull, vertebral column (including the sacrum and coccyx), and the thoracic cage, formed by the ribs and sternum. The appendicular skeleton is made up of all bones of the upper and lower limbs.'],
 'uris': None,
 'data': None}

## Querying the data

Once the data is loaded, we can use Chroma to find supporting evidence for the questions in the dataset.
In this example, we retrieve the three most relevant results according to embedding similarity.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.


In [None]:

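# Pick a random question index; it is only used if the next line is uncommented
random_index = random.randrange(len(dataset))
# random_question = dataset["question"][random_index]
# For this demo we use a fixed, hand-written question instead
random_question = "Why do we have teeth?"
# Query the collection for the three supports most similar to the question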
results = collection.query(
    query_texts=[random_question],
    n_results=3
)
print(f"Random question index: {random_index}")
print(f"Random question: {random_question}")


Random question index: 1183
Random question: Why do we have teeth?


In [None]:
print(random_question)


Why do we have teeth?


In [None]:
# dataset[53]

{'question': 'The angle at which light bends when it enters a different medium is known as what?',
 'distractor3': 'resonance',
 'distractor1': 'bounce',
 'distractor2': 'frequency',
 'correct_answer': 'refraction',
 'support': 'The angle at which light bends when it enters a different medium depends on its change in speed. The greater the change in speed, the greater the angle of refraction is. For example, light refracts more when it passes from air to diamond than it does when it passes from air to water. That’s because the speed of light is slower in diamond than it is in water.'}

In [None]:
# Show the second-ranked result (position 1) for the first, and only, query
print(results['ids'][0][1])
print(results['documents'][0][1])

846
All crocodilians have, like humans, teeth set in bony sockets. But unlike mammals, they replace their teeth throughout life. Crocodiles and gharials (large crocodilians with longer jaws) have salivary glands on their tongue, which are used to remove salt from their bodies. This helps with life in a saltwater environment. Crocodilians are often seen lying with their mouths open, a behavior called gaping . One of its functions is probably to cool them down.
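
The query result is a dict of nested lists: the outer list has one entry per query text, and the inner list holds the ranked hits for that query. Rather than indexing hits by hand, we can loop over them. The sketch below uses a hand-built stand-in for `results` so it runs without Chroma; the ids and (shortened) documents come from the outputs in this notebook, while the distance values are placeholders, not real scores. The `ranked_hits` helper is hypothetical.

```python
# Stand-in for the dict returned by collection.query(...), limited to two hits.
mock_results = {
    "ids": [["529", "846"]],
    "documents": [[
        "Mammalian teeth are also important for digestion.",
        "All crocodilians have, like humans, teeth set in bony sockets.",
    ]],
    "distances": [[0.9, 1.1]],  # placeholder numbers for illustration only
}

def ranked_hits(results, query_index=0):
    """Return (rank, id, distance, document) tuples for one query."""
    ids = results["ids"][query_index]
    dists = results["distances"][query_index]
    docs = results["documents"][query_index]
    return [
        (rank, doc_id, dist, doc)
        for rank, (doc_id, dist, doc) in enumerate(zip(ids, dists, docs), start=1)
    ]

hits = ranked_hits(mock_results)
for rank, doc_id, dist, doc in hits:
    print(f"{rank}. id={doc_id} distance={dist:.2f} {doc[:50]}")
```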


In [None]:

wrapped_question = textwrap.fill(random_question, width=80)
wrapped_support = textwrap.fill(f"Index: {results['ids'][0][0]}\n\n{results['documents'][0][0]}", width=80)

print(f"Question: {wrapped_question}\n\n")
print(f"Retrieved support: {wrapped_support}")
print()


Question: Why do we have teeth?


Retrieved support: Index: 529  Mammalian teeth are also important for digestion. The four types of
teeth are specialized for different feeding functions, as shown in Figure below
. Together, the four types of teeth can cut, tear, and grind food. This makes
food easier and quicker to digest.



The cell above displays the query question along with its top-ranked retrieved support.

## What's next?

Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications.

The core embeddings-based retrieval functionality demonstrated here is at the heart of many powerful AI applications, such as using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/chat_with_your_documents), as well as memory for agents like [BabyAGI](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).

Chroma is already integrated with many popular AI application frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html).

Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)
