### Working with Chroma DB

* Working with LLMs requires a different approach than traditional machine learning models.

* One of the core enabling technologies for LLMs is vector embeddings. Computers cannot work with raw text directly, so embeddings represent text as numeric vectors. User-provided text is converted to embeddings, which are then used to generate responses.

* Converting text into embeddings is a time-consuming process. To avoid repeating that work, we have vector databases explicitly designed for efficient storage and retrieval of vector embeddings.
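
To make the idea concrete, here is a minimal, self-contained sketch of what "embed once, store, then retrieve by similarity" means. It is only an illustration: `toy_embed` is a hypothetical word-count embedding and not how all-MiniLM-L6-v2 or any real embedding model works, and the document IDs and texts are made up for the example.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to word counts over a fixed vocabulary (toy stand-in for a learned embedding)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, used only for this illustration.
documents = {
    "doc_chess": "the chess club meets to play chess",
    "doc_swim": "the swim team meets at the pool",
}
vocabulary = sorted({word for text in documents.values() for word in text.split()})

# "Store" step: embed every document once and keep the vectors around.
stored_vectors = {doc_id: toy_embed(text, vocabulary) for doc_id, text in documents.items()}


def most_similar(query):
    """Embed the query and return the ID of the closest stored document."""
    query_vector = toy_embed(query, vocabulary)
    return max(
        stored_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, stored_vectors[doc_id]),
    )


best_match = most_similar("who wants to play chess")
print(best_match)
```

A vector database plays the role of `stored_vectors` and `most_similar` here, but with persistent storage and an index that scales to many more vectors.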

### Vector Stores

* Chroma DB is an open-source vector store used for storing and retrieving vector embeddings. Its main use is to save embeddings along with metadata to be used later by large language models. It can also be used to build semantic search engines over text data.

* First, you have to create a collection, which is similar to a table in a relational database. By default, Chroma converts the text into embeddings using all-MiniLM-L6-v2, but you can modify the collection to use another embedding model.
* Add text documents to the newly created collection with metadata and a unique ID. When your collection receives the text, it automatically converts it into embeddings.
* Query the collection by text or embedding to receive similar documents. You can also filter the results based on metadata.

* Create an in-memory DB for testing by creating a Chroma DB client without settings.
* Create a persistent database that will be stored in the directory given by 'path'.

In [6]:
#!pip install chromadb openai

In [7]:
# Old Chroma client code (before the API migration)
# See: https://docs.trychroma.com/migration
#import chromadb
#from chromadb.config import Settings
#client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet", persist_directory="db/"))

In [1]:
import chromadb
client = chromadb.EphemeralClient()

In [2]:
client = chromadb.PersistentClient(path="C:/Users/Dipjyoti/chromadb")

In [3]:
# Create a collection object using the client. It is similar to creating a table in a traditional database.
collection = client.create_collection(name="Students")

* To add text to our collection, we need some sample text about a student, a club, and a university. You can generate such sample text using ChatGPT.

In [10]:
student_info = """
Maya Simmons, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""

club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""

university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""

* Use the add function to add text data with metadata and unique IDs. Chroma then automatically downloads the all-MiniLM-L6-v2 model (on first use), converts the text into embeddings, and stores them in the “Students” collection.

In [11]:
collection.add(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

C:\Users\Dipjyoti\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|███████████████████████████| 79.3M/79.3M [00:14<00:00, 5.79MiB/s]


* To run a similarity search, you can use the query function and ask questions in natural language. It converts the query into an embedding and uses a similarity measure to return the closest documents.

In [15]:
results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)

{'ids': [['id1', 'id2']], 'distances': [[1.34322580609167, 1.3954030668049473]], 'metadatas': [[{'source': 'student info'}, {'source': 'club info'}]], 'embeddings': None, 'documents': [['\nMaya Simmons, a 19-year-old computer science sophomore with a 3.7 GPA,\nis a member of the programming and chess clubs who enjoys pizza, swimming, and hiking\nin her free time in hopes of working at a tech company after graduating from the University of Washington.\n', "\nThe university chess club provides an outlet for students to come together and enjoy playing\nthe classic strategy game of chess. Members of all skill levels are welcome, from beginners learning\nthe rules to experienced tournament players. The club typically meets a few times per week to play casual games,\nparticipate in tournaments, analyze famous chess matches, and improve members' skills.\n"]]}
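
The result is a dictionary of parallel nested lists, with one inner list per query text. A small helper such as the hypothetical `flatten_query_results` below can turn it into one record per hit. The `sample_results` dictionary is an abbreviated copy of the output above (document texts shortened with `...`), so the sketch runs without a Chroma client.

```python
def flatten_query_results(results, query_index=0):
    """Return one dict per hit for the query at position query_index."""
    return [
        {
            "id": doc_id,
            "distance": distance,
            "metadata": metadata,
            "document": document.strip(),
        }
        for doc_id, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


# Abbreviated copy of the query output shown above.
sample_results = {
    "ids": [["id1", "id2"]],
    "distances": [[1.34322580609167, 1.3954030668049473]],
    "metadatas": [[{"source": "student info"}, {"source": "club info"}]],
    "embeddings": None,
    "documents": [["\nMaya Simmons, a 19-year-old ...\n", "\nThe university chess club ...\n"]],
}

hits = flatten_query_results(sample_results)
for hit in hits:
    print(f'{hit["id"]}  {hit["distance"]:.3f}  {hit["metadata"]["source"]}')
```

Hits are listed in the order returned, with the smallest distance (closest match) first.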


### Embeddings

https://docs.trychroma.com/embeddings

* Use the OpenAI embedding model called “text-embedding-ada-002” to convert text into embeddings.
* Pass the list of text documents to the embedding function to generate the embeddings.


In [16]:
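from chromadb.utils import embedding_functions

# Assumes an OpenAI API key is available to the embedding function,
# for example by passing api_key="..." here.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                model_name="text-embedding-ada-002"
            )
# Calling the embedding function on a list of texts returns one embedding per text.
students_embeddings = openai_ef([student_info, club_info, university_info])
print(students_embeddings)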

[[-0.004352970980107784, -0.005270606372505426, -0.007235074881464243, -0.05022148787975311, -0.022447282448410988, 0.029046306386590004, -0.027164658531546593, 0.00532361026853323, -0.02040662057697773, -0.038242537528276443, 0.04285390302538872, -0.0013706557219848037, 0.012522244825959206, -0.01799492910504341, 0.013608830980956554, -0.02467346005141735, 0.01214458979666233, -0.020101847127079964, 0.02870178036391735, -0.02026085928082466, 0.011939198710024357, -0.0102761909365654, 0.01939954049885273, -0.01775640994310379, -0.011197139509022236, -0.005939784459769726, 0.01778291165828705, -0.009474502876400948, 0.018206946551799774, -0.010455080308020115, 0.032703597098588943, -0.02000908926129341, -0.018432214856147766, -0.020870408043265343, -0.018432214856147766, 0.009249234572052956, -0.0036539777647703886, 0.004651119466871023, 0.004412600304931402, -0.0007047903491184115, 0.013940107077360153, -0.009838906116783619, -0.014827927574515343, 0.010945369489490986, -0.012767389416

In [None]:
# • The first line of the cell below (In [17]) creates a collection named "Students2" in the client database, or gets it if it already exists.
# • The add method is then called on collection2 to add data to this collection.
# • The embeddings parameter is set to students_embeddings, the list of embeddings generated above (one per document).
# • The documents parameter is set to a list containing student_info, club_info, and university_info.
# • The metadatas parameter is set to a list of dictionaries, each containing a single key-value pair indicating the source of the data.
# • The ids parameter is set to a list of strings that serve as unique identifiers for the documents in the documents list.

Instead of using the default embedding model, you can load already created embeddings directly into a collection.

* Use the get_or_create_collection function to create a new collection called "Students2". This function is different from create_collection: it returns the collection if it already exists and creates it otherwise.
* Add the embeddings, text documents, metadata, and IDs to our newly created collection.

In [17]:
collection2 = client.get_or_create_collection(name="Students2")

collection2.add(
    embeddings = students_embeddings,
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

* We can add an OpenAI embedding function while creating or accessing the collection. Apart from OpenAI, you can use Cohere, Google PaLM, HuggingFace, and Instructor models.
* In our case, adding new text documents will run the OpenAI embedding function instead of the default model to convert text into embeddings. Because the IDs id1, id2, and id3 already exist in "Students2", the add call below only logs warnings about existing IDs, as the output shows.

In [18]:
collection2 = client.get_or_create_collection(name="Students2",embedding_function=openai_ef)

collection2.add(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

Add of existing embedding ID: id1
Add of existing embedding ID: id2
Add of existing embedding ID: id3
Insert of existing embedding ID: id1
Insert of existing embedding ID: id2
Insert of existing embedding ID: id3


In [19]:
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)

{'ids': [['id1', 'id3']], 'distances': [[0.45334473141675297, 0.489124293351394]], 'metadatas': [[{'source': 'student info'}, {'source': 'university info'}]], 'embeddings': None, 'documents': [['\nMaya Simmons, a 19-year-old computer science sophomore with a 3.7 GPA,\nis a member of the programming and chess clubs who enjoys pizza, swimming, and hiking\nin her free time in hopes of working at a tech company after graduating from the University of Washington.\n', '\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.\n']]}


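The results have changed. The similarity search now returns the university text as the second result instead of the club text. The distances are also lower than with the default embedding model, although distances produced by different embedding models are not directly comparable, so lower values on their own do not prove that the ranking is better.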
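
To get a feel for what these distance values mean, the sketch below converts them to cosine similarities. It rests on two assumptions that the notebook does not state: that the collections use Chroma's default distance, taken here to be squared L2, and that both embedding models return unit-length vectors. Under those assumptions, the squared distance d and the cosine similarity c are related by d = 2 - 2c.

```python
import math


def cosine_from_squared_l2(squared_distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - squared_distance / 2.0


# Sanity check of the identity on two unit vectors that are 60 degrees apart.
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
recovered_cosine = cosine_from_squared_l2(squared_l2)  # cos(60 degrees) = 0.5

# Top-hit distances copied from the two query outputs above.
default_top = cosine_from_squared_l2(1.34322580609167)     # default model
openai_top = cosine_from_squared_l2(0.45334473141675297)   # text-embedding-ada-002

print(round(recovered_cosine, 3), round(default_top, 3), round(openai_top, 3))
```

If the assumptions hold, the top hit corresponds to a cosine similarity of roughly 0.33 with the default model and roughly 0.77 with the OpenAI model. Each model spreads its similarities over a different range, which is why the ranking within one model is more informative than the raw distance values across models.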

### Updating and Removing Data

In [None]:
# You can update or remove records in a collection. To update the text and metadata, provide the ID of the record together with the new text and metadata.

In [22]:
collection2.update(
    ids=["id1"],
    documents=["Anne Lauren, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}],
)

In [23]:
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)

{'ids': [['id1', 'id3']], 'distances': [[0.38385508350143666, 0.48909799098730483]], 'metadatas': [[{'source': 'student info'}, {'source': 'university info'}]], 'embeddings': None, 'documents': [['Anne Lauren, a 19-year-old computer science sophomore with a 3.7 GPA', '\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.\n']]}


In [None]:
# The student's name has now been updated to Anne Lauren.

In [24]:
# To remove a record from the collection:
collection2.delete(ids = ['id1'])


results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

print(results)

{'ids': [['id3', 'id2']], 'distances': [[0.4892809532928519, 0.5302249700530861]], 'metadatas': [[{'source': 'university info'}, {'source': 'club info'}]], 'embeddings': None, 'documents': [['\nThe University of Washington, founded in 1861 in Seattle, is a public research university\nwith over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.\nAs the flagship institution of the six public universities in Washington state,\nUW encompasses over 500 buildings and 20 million square feet of space,\nincluding one of the largest library systems in the world.\n', "\nThe university chess club provides an outlet for students to come together and enjoy playing\nthe classic strategy game of chess. Members of all skill levels are welcome, from beginners learning\nthe rules to experienced tournament players. The club typically meets a few times per week to play casual games,\nparticipate in tournaments, analyze famous chess matches, and improve members' skills.\n"]]}


In [None]:
# The student information text has been removed, so the query now returns the next-best results.

### Collection Management

* This section covers the collection utility functions for counting, viewing, renaming, accessing, and deleting collections.

* We will create a new collection called "vectordb" and add the information about the Chroma DB cheat sheet, documentation, and JS API with metadata.

In [25]:
vector_collections = client.create_collection("vectordb")

vector_collections.add(
    documents=["This is Chroma DB CheatSheet",
               "This is Chroma DB Documentation",
               "This document Chroma JS API Docs"],
    metadatas=[{"source": "Chroma Cheatsheet"},
    {"source": "Chroma Doc"},
    {'source':'JS API Doc'}],
    ids=["id1", "id2", "id3"]
)

In [27]:
# Check how many records the collection has
print(vector_collections.count())

3


In [28]:
# To view all the records from the collection
vector_collections.get()

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': None,
 'metadatas': [{'source': 'Chroma Cheatsheet'},
  {'source': 'Chroma Doc'},
  {'source': 'JS API Doc'}],
 'documents': ['This is Chroma DB CheatSheet',
  'This is Chroma DB Documentation',
  'This document Chroma JS API Docs']}

In [29]:
# To change the collection name, use the modify() function. To view all collection names, use list_collections().
vector_collections.modify(name="chroma_info")
# list all collections
client.list_collections()

[Collection(name=chroma_info),
 Collection(name=Students2),
 Collection(name=Students)]

In [None]:
# The "vectordb" collection has been renamed to "chroma_info".

In [30]:
# To access an existing collection, use get_collection with the collection's name.
vector_collections_new = client.get_collection(name="chroma_info")

In [31]:
print(vector_collections_new)

name='chroma_info' id=UUID('8bc8d8c4-6ae3-4a10-89e0-b54a0ed02e3d') metadata=None


In [32]:
# Delete a collection using the client function delete_collection and specify the collection name.
client.delete_collection(name="chroma_info")
client.list_collections()

[Collection(name=Students2), Collection(name=Students)]

In [None]:
# We can delete all collections in the database by using client.reset().
# This is not recommended, as there is no way to restore the data after deletion.
# client.reset()
# client.list_collections()

* Vector stores like Chroma DB are becoming essential components of large language model systems. By providing specialized storage and efficient retrieval of vector embeddings, they enable fast access to relevant semantic information to power LLMs.

* This notebook covered the basics of creating a collection, adding documents, converting text to embeddings, querying for semantic similarity, and managing collections.

### End