<a href="https://colab.research.google.com/github/anshupandey/Generative-AI-for-Professionals/blob/main/VectorDB_Getting_started_with_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What is ChromaDB?
- ChromaDB is an open-source embedding database designed for building applications with Large Language Models (LLMs). It stores, searches, and manages embeddings together with their documents and metadata, and it can embed documents and queries for you. Client SDKs make it available from several programming languages, so the same retrieval features can be integrated into a range of software projects.

## Architecture of ChromaDB Integration with Large Language Models
<img src="https://docs.trychroma.com/img/hrm4.svg">

The image illustrates how the application interacts with the ChromaDB service. The flow is as follows:

1. **Queries**: The app sends queries which are processed to generate embeddings.
2. **Gen Embedding**: This refers to the process where queries are transformed into embeddings, which are numerical representations understandable by machine learning models.
3. **Chroma**: The ChromaDB service, where:
    - **Documents**: Text data to be analyzed is stored.
    - **Embeddings**: Each document or query is converted into an array of numerical values (embeddings) for efficient processing.
4. **LLM Context Window**: The documents that Chroma returns for the query embedding are placed in the context window of a Large Language Model (LLM) alongside the query.
5. **Answer**: The LLM, using the context provided by Chroma, produces an answer to the initial query.
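
The flow above can be sketched without any external service. This is a minimal illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the sample documents are made up, and the final LLM call (step 5) is left out.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Chroma stores documents together with their embeddings.",
    "Bananas are rich in potassium.",
    "An embedding is a list of numbers that represents text.",
]

# Embed the stored documents once, then embed the incoming query.
index = [(doc, toy_embed(doc)) for doc in documents]
query = "what is an embedding"
query_vec = toy_embed(query)

# Rank the stored documents by similarity to the query and keep the best two.
ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
top_docs = [doc for doc, _ in ranked[:2]]

# Place the retrieved text (not the vectors) in the prompt sent to the LLM.
prompt = "Answer using this context:\n" + "\n".join(top_docs) + "\n\nQuestion: " + query
print(prompt)
```

In the rest of this notebook, ChromaDB and an OpenAI embedding model take over the embedding, storage, and ranking parts of this sketch.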

## Tools provided by ChromaDB:

- Store embeddings and their metadata for easy retrieval and management.
- Embed documents and queries to transform text into a format suitable for machine learning models.
- Search through embeddings to find the most relevant information or insights based on query inputs.

## Advantages of ChromaDB

- Focuses on simplicity and developer productivity.
- Enables analysis on top of search.
- Offers fast embedding storage and retrieval.

For a detailed overview, please refer to the Chroma documentation at https://docs.trychroma.com/.

## Step 1: Initializing Embedding Models

## Installing necessary libraries

In [None]:
# install chromadb python API
!pip install chromadb --quiet


In [None]:
#install other necessary libraries
!pip install langchain pymupdf openai --quiet


## Importing necessary libraries

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
import fitz  # Import PyMuPDF
import chromadb

In [None]:
OPENAI_API_KEY = "sk-xxxxxx"  # placeholder: replace with your own key and keep real keys out of shared notebooks

## Extracting text from the PDF

In [None]:
# Downloading PDF File
!wget -q https://anshupandey.blob.core.windows.net/generativeaidocs/VectorEmbeddings.pdf

In [None]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

pdf_path = "VectorEmbeddings.pdf"
text = extract_text_from_pdf(pdf_path)

## Dividing text into chunks

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=600,
    chunk_overlap=20,
)

In [None]:
chunks = text_splitter.split_text(text)

In [None]:
chunks

['You might not know it yet, but vector embeddings are everywhere. They are the \nbuilding blocks of many machine learning and deep learning algorithms used by \napplications ranging from search to AI assistants. If you’re considering building your \nown application in this space, you will likely run into vector embeddings at some point. \nIn this post, we’ll try to get a basic intuition for what vector embeddings are and how \nthey can be used. \nWhat problem are we trying to solve? \nWhen you build a traditional application, your data structures are represented as',
 'objects that probably come from a database. These objects have properties (or columns \nin a database) that are relevant to the application you’re building. \nOver time, the number of properties of these objects grows — to the point where you \nmay need to be more intentional about which properties you need to complete a given \ntask. You may even end up creating specialized representations of these objects to solve \np

In [None]:
len(chunks)

14
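
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter tries the separators in order so that chunks break at natural boundaries, which is why its chunks vary in length. The helper name `fixed_size_chunks` is hypothetical and only illustrates how the two size parameters interact.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be between 0 and chunk_size - 1")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=2)
print(pieces)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window repeats the last two characters of the previous one, which is the role `chunk_overlap=20` plays for the 600-character chunks in this notebook.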

## Creating a custom embedding function class for ChromaDB (using OpenAI embeddings)

In [None]:
from chromadb import Documents, EmbeddingFunction, Embeddings
import chromadb.utils.embedding_functions as embedding_functions

In [None]:
class MyOpenAIEmbeddingFunction(EmbeddingFunction):
    def __init__(self, api_key: str, model_name: str = "text-embedding-ada-002"):
        self.openai_ef = embedding_functions.OpenAIEmbeddingFunction(
            api_key=api_key,
            model_name=model_name
        )
        print("MyOpenAIEmbeddingFunction initialized with model:", model_name)

    def __call__(self, input: Documents) -> Embeddings:
        print("Embedding documents with OpenAI...")
        embeddings = self.openai_ef(input)
        print("Embedding complete.")
        return embeddings

In [None]:
my_openai_embedding_function = MyOpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name="text-embedding-3-large")

MyOpenAIEmbeddingFunction initialized with model: text-embedding-3-large
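
When experimenting without an API key, a deterministic stand-in with the same call shape (a list of texts in, a list of vectors out) can be handy. The class below is a hypothetical sketch, not part of ChromaDB: it hashes words into a small fixed-size vector and does not capture meaning the way a real embedding model does. It deliberately does not subclass `EmbeddingFunction`, so it runs without `chromadb` installed; whether it can be passed to a collection as-is is an assumption that should be checked against the installed ChromaDB version.

```python
import hashlib
import math

class ToyHashEmbeddingFunction:
    """Hypothetical offline stand-in: hashes words into a fixed-size unit vector."""

    def __init__(self, dimensions: int = 16):
        self.dimensions = dimensions

    def __call__(self, input: list[str]) -> list[list[float]]:
        # Same shape as the custom class above: one vector per input text.
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # Normalize to unit length so vectors of different texts are comparable.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector

toy_ef = ToyHashEmbeddingFunction(dimensions=16)
vectors = toy_ef(["vector embeddings", "vector embeddings", "something else"])
print(len(vectors), len(vectors[0]))  # 3 16
```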


## Initializing ChromaDB and setting the path where data will be saved

In [None]:
chroma_client = chromadb.PersistentClient(path="myVectorDB/chroma")

## Creating a collection with the custom embedding function

In [None]:
collection = chroma_client.create_collection(name="my_collection", embedding_function=my_openai_embedding_function)

## Step 2: Database Operations for ChromaDB

### Creating one id per chunk

In [None]:
# One string id per chunk: "1", "2", ... (note: the name `id` shadows Python's built-in id())
id = [str(i + 1) for i in range(len(chunks))]

id

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14']
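
Sequential ids are fine for a demo, but they point at different text as soon as the chunking changes. One alternative, sketched here as an assumption rather than a ChromaDB recommendation, is to derive each id from the chunk's content. The helper name `content_ids` and the `chunk-` prefix are hypothetical.

```python
import hashlib

def content_ids(documents: list[str], prefix: str = "chunk") -> list[str]:
    """Build one id per document from a SHA-256 digest of its text."""
    ids = []
    seen = {}
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()[:12]
        # Identical texts hash identically, so append a counter to keep ids unique.
        count = seen.get(digest, 0)
        seen[digest] = count + 1
        ids.append(f"{prefix}-{digest}" if count == 0 else f"{prefix}-{digest}-{count}")
    return ids

example_ids = content_ids(["first chunk", "second chunk", "first chunk"])
print(example_ids)
```

The same text always yields the same id, so re-running the notebook on an unchanged PDF would address the same records.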

### Adding data in collection using custom embedding

In [None]:
collection.add(
    documents=chunks,
    ids=id
)  # this embeds the documents we pass in, using the custom embedding function

Embedding documents with OpenAI...
Embedding complete.


#### If you have already created embeddings some other way, you can add them together with their documents
##### ex:   
        `collection.add(
            embeddings=[[.....],[....],....],
            documents=data,
            metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, ...], # metadata is optional
            ids=id
            )`
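
When supplying precomputed embeddings, the lists passed to `add` are parallel: item *i* of each list describes the same record. A small pre-flight check can catch mismatches early. The helper below is a hypothetical sketch in plain Python, not a ChromaDB function, and the sample values are made up.

```python
def check_add_arguments(ids, documents=None, embeddings=None, metadatas=None):
    """Raise ValueError if the parallel lists for an add-style call disagree."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    named_lists = (("documents", documents), ("embeddings", embeddings), ("metadatas", metadatas))
    for name, values in named_lists:
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} has {len(values)} items but there are {len(ids)} ids")
    if embeddings:
        # All vectors in one collection need the same number of dimensions.
        dimensions = {len(vector) for vector in embeddings}
        if len(dimensions) != 1:
            raise ValueError(f"embeddings have mixed dimensions: {sorted(dimensions)}")

check_add_arguments(
    ids=["1", "2"],
    documents=["doc one", "doc two"],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}],
)
print("arguments are consistent")
```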

### Get a collection

In [None]:
coll1 = chroma_client.get_collection(name="my_collection") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
coll1

Collection(name=my_collection)

In [None]:
coll2 = chroma_client.get_or_create_collection(name="test") # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
coll2

Collection(name=test)

### Peek into a collection

In [None]:
coll1.peek() # returns a list of the first 10 items in the collection

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

In [None]:
coll2.peek() # returns a list of the first 10 items in the collection

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

### Delete a collection

In [None]:
chroma_client.delete_collection(name="test") # Delete a collection and all associated embeddings, documents, and metadata.

In [None]:
try:
    coll = chroma_client.get_collection(name="test")
except Exception:  # catch the lookup error without swallowing KeyboardInterrupt
    print("collection name does not exist")

collection name does not exist


In [None]:
coll1.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

### See number of items in collection

In [None]:
collection.count() # returns the number of items in the collection

14

### Rename the collection

In [None]:
collection.modify(name="embeddings1") # Rename the collection

In [None]:
try:
    chroma_client.get_collection(name="my_collection") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
except Exception:  # the old name should no longer resolve after the rename
    print("collection does not exist")

In [None]:
chroma_client.get_collection(name="embeddings1") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.


Collection(name=embeddings1)

### Using the get method to retrieve data from the collection

In [None]:
get = collection.get(
    ids=["1","2","3","4","5"],
)
get

{'ids': ['1', '2', '3', '4', '5'],
 'embeddings': None,
 'metadatas': [None, None, None, None, None],
 'documents': ['You might not know it yet, but vector embeddings are everywhere. They are the \nbuilding blocks of many machine learning and deep learning algorithms used by \napplications ranging from search to AI assistants. If you’re considering building your \nown application in this space, you will likely run into vector embeddings at some point. \nIn this post, we’ll try to get a basic intuition for what vector embeddings are and how \nthey can be used. \nWhat problem are we trying to solve? \nWhen you build a traditional application, your data structures are represented as',
  'objects that probably come from a database. These objects have properties (or columns \nin a database) that are relevant to the application you’re building. \nOver time, the number of properties of these objects grows — to the point where you \nmay need to be more intentional about which properties you ne

#### Use the 'where_document' filter dictionary to filter by the contents of the document.

In [None]:
get = collection.get(
    ids=["1","2","3","4","5"],
    where_document={"$contains":"vector embeddings"} # $contains keeps only documents that contain the string "vector embeddings"
)
get

{'ids': ['1', '3', '4'],
 'embeddings': None,
 'metadatas': [None, None, None],
 'documents': ['You might not know it yet, but vector embeddings are everywhere. They are the \nbuilding blocks of many machine learning and deep learning algorithms used by \napplications ranging from search to AI assistants. If you’re considering building your \nown application in this space, you will likely run into vector embeddings at some point. \nIn this post, we’ll try to get a basic intuition for what vector embeddings are and how \nthey can be used. \nWhat problem are we trying to solve? \nWhen you build a traditional application, your data structures are represented as',
  'picking only the essential features relevant to the task at hand. \nWhen you deal with unstructured data, you will have to go through this same feature \nengineering process. However, unstructured data is likely to have many more pertinent \nfeatures, and performing manual feature engineering is bound to be untenable. \nIn tho

In [None]:
get = collection.get(
    ids=["1","2","3","4","5"],
    where_document={"$not_contains":"vector embeddings"} # $not_contains keeps only documents that do not contain the string "vector embeddings"
)
get

{'ids': ['2', '5'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['objects that probably come from a database. These objects have properties (or columns \nin a database) that are relevant to the application you’re building. \nOver time, the number of properties of these objects grows — to the point where you \nmay need to be more intentional about which properties you need to complete a given \ntask. You may even end up creating specialized representations of these objects to solve \nparticular tasks without paying the overhead of having to process very “fat” objects. \nThis process is known as feature engineering — you optimize your application by',
  'When we look at a bunch of vectors in one space, we can say that some are closer to one \nanother, while others are far apart. Some vectors can seem to cluster together, while \nothers could be sparsely distributed in the space. \n \nWe’ll soon explore how these relationships between vectors can be useful. \nVectors ar

##### You can use the 'where' filter dictionary to filter by the metadata associated with each document.
###### ex:
    `get = collection.get(
    ids=["1","2","3","4","5"],
    where={"style": "style1"}
    )
    get`

##### You can also combine it with the 'where_document' filter.
###### ex:
    `get = collection.get(
        ids=["1","2","3","4","5"],
        where={"style": "style1"},
        where_document={"$contains":"vector embeddings"}
    )
    get`
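
The filter semantics can be mimicked in plain Python to see which records would survive. This sketch covers only what the notebook uses (metadata equality, `$contains`, and `$not_contains`). It assumes that both filters must pass and that `$contains` is a case-sensitive substring test. The function `matches` and the sample records are hypothetical.

```python
def matches(document, metadata, where=None, where_document=None):
    """Return True if a record passes both the metadata and the document filter."""
    if where is not None:
        metadata = metadata or {}
        # Every key in the where filter must equal the record's metadata value.
        if any(metadata.get(key) != value for key, value in where.items()):
            return False
    if where_document is not None:
        for operator, needle in where_document.items():
            if operator == "$contains" and needle not in document:
                return False
            if operator == "$not_contains" and needle in document:
                return False
    return True

records = [
    ("vector embeddings are everywhere", {"style": "style1"}),
    ("objects come from a database", {"style": "style1"}),
    ("what are vector embeddings?", {"style": "style2"}),
]
selected = [
    doc for doc, meta in records
    if matches(doc, meta, where={"style": "style1"},
               where_document={"$contains": "vector embeddings"})
]
print(selected)  # ['vector embeddings are everywhere']
```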

### Query method

In [None]:
results = collection.query(
    query_texts=["what is vector embeddings"],
    n_results=5,  # number of closest matches to return for each query text
)
results

Embedding documents with OpenAI...
Embedding complete.


{'ids': [['4', '1', '9', '5', '11']],
 'distances': [[0.5083547252793864,
   0.6269708208667786,
   0.6720473213787388,
   0.6935371904709291,
   0.7173474459290626]],
 'metadatas': [[None, None, None, None, None]],
 'embeddings': None,
 'documents': [['that is more compact while preserving what’s meaningful about the data. \nWhat are vector embeddings? \nBefore we delve into what vector embeddings are, let’s talk about vectors. A vector is a \nmathematical structure with a size and a direction. For example, we can think of the \nvector as a point in space, with the “direction” being an arrow from (0,0,0) to that point \nin the vector space. \n \nAs developers, it might be easier to think of a vector as an array containing numerical \nvalues. For example: \nvector = [0,-2,...4]',
   'You might not know it yet, but vector embeddings are everywhere. They are the \nbuilding blocks of many machine learning and deep learning algorithms used by \napplications ranging from search to AI assist
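
The `distances` in the result are sorted in ascending order, so a smaller distance means a closer match. As a rough sketch of what such a ranking involves, the code below scores stored vectors by squared Euclidean distance and returns the closest ones. Treating squared L2 as the metric is an assumption about the collection's default settings, the tiny two-dimensional vectors are made up, and a real vector database uses an approximate index rather than this exhaustive scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(query_vector, stored, n_results):
    """Return (id, distance) pairs for the n_results closest stored vectors."""
    scored = [(item_id, squared_l2(query_vector, vector)) for item_id, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:n_results]

stored_vectors = {"1": [1.0, 0.0], "2": [0.0, 1.0], "3": [0.6, 0.8]}
top = nearest([1.0, 0.0], stored_vectors, n_results=2)
print([item_id for item_id, _ in top])  # ['1', '3']
```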

#### You can use `include` to choose which data to return

In [None]:
results = collection.query(
    query_texts=["what is vector embeddings"],
    n_results=2,
    include=["documents","embeddings","distances"]
)
results

Embedding documents with OpenAI...
Embedding complete.


{'ids': [['4', '1']],
 'distances': [[0.5083547252793864, 0.6269708208667786]],
 'metadatas': None,
 'embeddings': [[[-0.013817156665027142,
    0.017874164506793022,
    -0.017913049086928368,
    -0.029941493645310402,
    0.03263752534985542,
    -0.015878064557909966,
    0.019235141575336456,
    0.007226139772683382,
    -0.04394011199474335,
    -0.045754749327898026,
    -0.00989624671638012,
    0.009624050930142403,
    0.0026992710772901773,
    0.017096463590860367,
    -0.014996670186519623,
    0.03012295626103878,
    0.011691439896821976,
    0.02939710207283497,
    -0.006354466080665588,
    0.0055378801189363,
    0.021231241524219513,
    -0.017277926206588745,
    -0.020816465839743614,
    -0.01420600712299347,
    0.0344780832529068,
    0.008800984360277653,
    -0.02758246660232544,
    0.005194395314902067,
    -0.06356410682201385,
    0.05184674263000488,
    0.01929994858801365,
    -0.017601968720555305,
    0.01774454675614834,
    -0.015981758013367653,


### Updating data in collection

#### Printing data before updating

In [None]:
get = collection.get(
    ids=["1","2"],
)
get

{'ids': ['1', '2'],
 'embeddings': None,
 'metadatas': [None, None],
 'documents': ['You might not know it yet, but vector embeddings are everywhere. They are the \nbuilding blocks of many machine learning and deep learning algorithms used by \napplications ranging from search to AI assistants. If you’re considering building your \nown application in this space, you will likely run into vector embeddings at some point. \nIn this post, we’ll try to get a basic intuition for what vector embeddings are and how \nthey can be used. \nWhat problem are we trying to solve? \nWhen you build a traditional application, your data structures are represented as',
  'objects that probably come from a database. These objects have properties (or columns \nin a database) that are relevant to the application you’re building. \nOver time, the number of properties of these objects grows — to the point where you \nmay need to be more intentional about which properties you need to complete a given \ntask. Yo

#### Updating data

In [None]:
collection.update(
    ids=["1", "2"],
    metadatas=[{"page": "3", "line": "16"}, {"page": "3", "line": "5"}], # example metadata
    documents=["doc1", "doc2"] # example document
)
# If an id is not present in the collection, an error will be logged and the update will be ignored.

Embedding documents with OpenAI...
Embedding complete.


#### Printing data after updating

In [None]:
get = collection.get(
    ids=["1","2"],
    include=['embeddings','metadatas','documents']
)
get

{'ids': ['1', '2'],
 'embeddings': [[-0.00269283982925117,
   -0.007498405873775482,
   -0.016891678795218468,
   0.024319155141711235,
   0.0007422408671118319,
   0.020569952204823494,
   -0.017479391768574715,
   0.11113853752613068,
   0.0011906252475455403,
   0.033479370176792145,
   0.02221149392426014,
   -0.014743487350642681,
   -0.02802782505750656,
   -0.053866926580667496,
   -0.026852399110794067,
   -0.011551598086953163,
   -0.01110574696213007,
   0.014196306467056274,
   0.02557564526796341,
   -0.011784656904637814,
   0.02259654738008976,
   0.009661797434091568,
   -0.014865083619952202,
   0.05779852345585823,
   0.024927133694291115,
   0.01966811716556549,
   0.017013275995850563,
   0.02275867573916912,
   -0.006870161276310682,
   -0.03293218836188316,
   -0.002670040586963296,
   0.05143501237034798,
   -0.010016451589763165,
   -0.03532357141375542,
   -0.021562984213232994,
   0.018715616315603256,
   0.03832293301820755,
   0.01909053698182106,
   0.004400

### Update data, or add it if it doesn't exist

#### Printing data before updating

In [None]:
get = collection.get(
    ids=["6","16"],
)
get

{'ids': ['6'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['comes into play. It’s a technique that allows us to take virtually any data type and \nrepresent it as vectors. \nBut it isn’t as simple as just turning data into vectors. We want to ensure that we can \nperform tasks on this transformed data without losing the data’s original meaning. For \nexample, if we want to compare two sentences — we don’t want just to compare the \nwords they contain but rather whether or not they mean the same thing. To preserve \nthe data’s meaning, we need to understand how to produce vectors \nwhere relationships between the vectors make sense.'],
 'uris': None,
 'data': None}

#### Upsert data

In [None]:
collection.upsert(
    ids=["6", "16"],
    metadatas=[{"page": "4", "line": "6"}, {"page": "7", "line": "16"}], # example metadata
    documents=["doc3", "doc4"] # example document
)

Embedding documents with OpenAI...
Embedding complete.
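
The difference between `update` and `upsert` can be modelled with a plain dictionary. This is a conceptual sketch only, not how ChromaDB is implemented: the `update` and `upsert` functions below are hypothetical helpers that operate on a dict, and the model ignores embeddings and metadata.

```python
def update(store, ids, documents):
    """Change existing records only; return the ids that were skipped."""
    skipped = []
    for item_id, document in zip(ids, documents):
        if item_id in store:
            store[item_id] = document
        else:
            skipped.append(item_id)
    return skipped

def upsert(store, ids, documents):
    """Change existing records and insert the ones that are missing."""
    for item_id, document in zip(ids, documents):
        store[item_id] = document

store = {"6": "original text"}

skipped = update(store, ["6", "16"], ["doc3", "doc4"])
print(skipped, sorted(store))  # ['16'] ['6']

upsert(store, ["6", "16"], ["doc3", "doc4"])
print(sorted(store))  # ['16', '6']
```

With `update`, id `16` is ignored because it does not exist yet; with `upsert`, it is created.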


#### Printing data after upserting

In [None]:
get = collection.get(
    ids=["6","16"],
    include=['embeddings','metadatas','documents']
)
get

{'ids': ['16', '6'],
 'embeddings': [[0.002441534772515297,
   0.015049630776047707,
   -0.0182426068931818,
   0.057327963411808014,
   0.009095301851630211,
   0.013707956299185753,
   -0.027270305901765823,
   0.108914814889431,
   -0.016412105411291122,
   0.05965769290924072,
   0.03116011992096901,
   -0.010858199559152126,
   -0.013759959489107132,
   -0.041581496596336365,
   -0.047842640429735184,
   -0.008549272082746029,
   -0.04597053676843643,
   0.026313452050089836,
   0.009022497572004795,
   -0.017389759421348572,
   0.0006227343692444265,
   0.013967971317470074,
   -0.013135924935340881,
   0.06452516466379166,
   0.010920602828264236,
   0.02390052005648613,
   0.028975998982787132,
   0.008637676946818829,
   -0.005845122504979372,
   -0.0038066101260483265,
   0.0046386560425162315,
   0.029350420460104942,
   0.009745338000357151,
   -0.031347330659627914,
   -0.0015015829121693969,
   0.014945625327527523,
   0.02835196442902088,
   0.033448245376348495,
   -0.0

## Thank You