# Intro to Gen AI Orchestration - Vector Databases

<a target="_blank" href="https://colab.research.google.com/github/adeshmukh/gaiip/blob/main/notebooks/Intro_to_Vector_Databases.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Install dependencies


In [2]:
%%capture

# Takes about 1 minute to install all dependencies.

!pip install chromadb         # ChromaDB vector database
!pip install faiss-cpu        # FAISS vector database
!pip install pinecone-client  # Pinecone vector database client

!pip install openai[embeddings]     # OpenAI embeddings (requires API key)
!pip install sentence-transformers  # For generating embeddings using sentence transformers models

!pip install protoc-gen-openapiv2 # Protobuf support (pinecone dependency)

!pip install mmh3  # Murmur3 hash algorithm

## Test data

Here we're just setting up some data that we will insert into the vector databases.


In [15]:
math_statements = [
    "The area of a rectangle with dimensions L & W is LW.",
    "The area of a circle with radius R is `πR²`.",
    "The area of a triangle with base B and height H is `½ BH`.",
    "The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.",
    "The area of a parallelogram with base B and height H is BH.",
    "The surface area of a rectangular prism with dimensions H, L, B is `2(LB + BH + HL)`.",
    "The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.",
]

pets_statements = [
    """
    As an integrated part of modern life, animals play the role of domestic companions, give
physical and emotional support to humans, and provide value to many types of organizations
(e.g., search and rescue dogs, zoo animals). Animals are also becoming more present in
organizations due to employees and customers who bring their pets into the workplace. In
addition, the integration of pets into individuals’ family lives also plays an important role in
employees’ work-family dynamics.
    """,
    """
     Even though animals are becoming more present in
organizational life and play an influential role in employees’ lives, management research has
lagged behind these social trends. Therefore, we seek to define the ways in which animals
intersect with organizations, highlight opportunities for theory development, and illustrate
important areas for future research.
    """,
    """
    To provide more precision about how animals relate to organizations, we posit four types of
animals that intersect with organizations: 1) animals who work alongside humans, 2) animals as
the focus of organizations or employees, 3) companion animals brought into the workplace by
employees or customers, and 4) employees’ companion animals that stay at home.
    """,
    """
    The offices of dentists and doctors often include fish tanks in their patient waiting
areas, and previous research has shown that having a fish tank might decrease stress for patients.
    """,
    """
    Employees might feel left out if they are required to avoid certain areas of
the office due to an allergy or a fear of dogs or cats
    """,
]

all_statements = math_statements + pets_statements

## Using Vector Databases


### ChromaDB

(Ref: [https://www.trychroma.com/](https://www.trychroma.com/))

ChromaDB is an open-source vector database system. It can be used for storing & searching unstructured data like documents & images.


First, we will create an empty ChromaDB collection.


In [4]:
import chromadb

chroma_client = chromadb.Client()

facts_collection = chroma_client.get_or_create_collection(name="interesting_facts")

Now we upsert the data into the collection.

_(You may also use `add` instead of `upsert` in the example below. `add` is meant for inserting new records, whereas `upsert` inserts records with new ids and updates records whose ids already exist.)_


In [5]:
import mmh3

# Upsert the math facts to the collection
facts_collection.upsert(
    ids=[hex(mmh3.hash(statement)) for statement in math_statements],
    documents=math_statements,
    metadatas=[{"source": "math"} for _ in math_statements],
)

# Upsert the pet statements to the collection
facts_collection.upsert(
    ids=[hex(mmh3.hash(statement)) for statement in pets_statements],
    documents=pets_statements,
    metadatas=[{"source": "pets"} for _ in pets_statements],
)

facts_collection.count()

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 28.8MiB/s]


12

Notes:

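1. We are providing 3 pieces of information for each item that we insert:  
   a. `id`, which is calculated as the murmur3 hash of the content.  
   b. The text value, as a `document`.  
   c. `metadata`, which contains a single key, `source`, whose value is either `math` or `pets`.
2. The ChromaDB library call to `upsert` (and `add`) is columnar in nature: each argument is a list, and the entries at the same position across `ids`, `documents`, and `metadatas` together describe one record.
3. We did not pass any embeddings. ChromaDB computed them from the documents using its default embedding model (`all-MiniLM-L6-v2`, as the download message above shows).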
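
To make the notes above concrete, here is a minimal sketch that uses only the Python standard library. It is not ChromaDB's implementation: `ToyCollection` and `content_id` are hypothetical names, and `hashlib.md5` is swapped in for murmur3 so that the snippet has no dependencies. It shows how the parallel lists line up into records, and why ids derived from the content make repeated upserts safe.

```python
import hashlib


def content_id(text: str) -> str:
    """Deterministic id derived from the text (md5 here; the notebook uses murmur3)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


class ToyCollection:
    """Hypothetical in-memory stand-in for a collection, keyed by id."""

    def __init__(self):
        self.records = {}

    def upsert(self, ids, documents, metadatas):
        # Columnar call: the i-th entries of each list describe one record.
        if not (len(ids) == len(documents) == len(metadatas)):
            raise ValueError("ids, documents and metadatas must have equal length")
        for id_, document, metadata in zip(ids, documents, metadatas):
            # Same id -> the record is overwritten instead of duplicated.
            self.records[id_] = {"document": document, "metadata": metadata}

    def count(self):
        return len(self.records)


statements = [
    "The area of a circle with radius R is πR².",
    "The area of a parallelogram with base B and height H is BH.",
]

collection = ToyCollection()
for _ in range(2):  # upsert the same data twice
    collection.upsert(
        ids=[content_id(s) for s in statements],
        documents=statements,
        metadatas=[{"source": "math"} for _ in statements],
    )

print(collection.count())  # 2
```

Because the id is a function of the text, rerunning the upsert cell does not grow the collection.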

<hr/>


🔎 Now let's query the collection for a fact that we know exists.


In [25]:
results = facts_collection.query(
    query_texts=["What is the area of a trapezoid?"], n_results=3
)
results

{'ids': [['0x568ae5b4', '-0x46830a05', '-0x52bebbd3']],
 'distances': [[0.49865448474884033, 0.8906270265579224, 1.0057439804077148]],
 'metadatas': [[{'source': 'math'}, {'source': 'math'}, {'source': 'math'}]],
 'embeddings': None,
 'documents': [['The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.',
   'The area of a parallelogram with base B and height H is BH.',
   'The area of a triangle with base B and height H is `½ BH`.']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

Let's look at just the top result for our single query.


In [26]:
results["documents"][0][0]

'The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.'
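
The double indexing (`[0][0]`) reflects the shape of the result: judging by the output above, the outer list has one entry per query text and each inner list has one entry per hit. The sketch below flattens that structure into one dictionary per hit. It runs on a hard-coded sample copied from the output above (trimmed to two hits), and `hits_for_query` is a hypothetical helper, not part of ChromaDB.

```python
# Hard-coded sample shaped like the query output above (two hits kept for brevity).
sample_results = {
    "ids": [["0x568ae5b4", "-0x46830a05"]],
    "distances": [[0.49865448474884033, 0.8906270265579224]],
    "metadatas": [[{"source": "math"}, {"source": "math"}]],
    "documents": [[
        "The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.",
        "The area of a parallelogram with base B and height H is BH.",
    ]],
}


def hits_for_query(results: dict, query_index: int = 0) -> list:
    """Return one dict per hit for the query at position `query_index`."""
    return [
        {
            "id": id_,
            "distance": distance,
            "source": metadata["source"],
            "document": document,
        }
        for id_, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = hits_for_query(sample_results)
for hit in hits:
    print(f"{hit['distance']:.3f} [{hit['source']}] {hit['document']}")
```

Hits arrive ordered by increasing distance, so the first entry is the closest match.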

🔎 Let's search with a synonymous term that does not exist as a literal string in the facts collection.


In [70]:
query_texts = [
    "What is the area of a trapezium?",
    "What is the area of a rectangular shape?",
    "What is the surface area of a rectangular prism?",
    "What is the area of a circular shape?",
    "What is the surface area of a circular cone?",
    "What is the area of a cone?",
]

results = facts_collection.query(query_texts=query_texts, n_results=1)

for query_text, metadata, document in zip(
    query_texts, results["metadatas"], results["documents"]
):
    print(query_text, f"[category: {metadata[0]['source']}]", document)

What is the area of a trapezium? [category: math] ['The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.']
What is the area of a rectangular shape? [category: math] ['The area of a rectangle with dimensions L & W is LW.']
What is the surface area of a rectangular prism? [category: math] ['The surface area of a rectangular prism with dimensions H, L, B is `2(LB + BH + HL)`.']
What is the area of a circular shape? [category: math] ['The area of a circle with radius R is `πR²`.']
What is the surface area of a circular cone? [category: math] ['The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.']
What is the area of a cone? [category: math] ['The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.']
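
For contrast, here is a small sketch of what a purely literal search would do with the same kind of query. `keyword_search` is a hypothetical helper written for this illustration; it simply looks for a case-insensitive substring.

```python
statements = [
    "The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.",
    "The area of a circle with radius R is `πR²`.",
]


def keyword_search(term: str, documents: list) -> list:
    """Return the documents that contain `term` as a case-insensitive substring."""
    return [d for d in documents if term.lower() in d.lower()]


print(len(keyword_search("trapezoid", statements)))  # 1
print(len(keyword_search("trapezium", statements)))  # 0
```

The literal search finds nothing for "trapezium", whereas the embedding-based query above still returns the trapezoid fact.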


In [8]:
# Uncomment the below line and run this cell if you want to clear the collection and start over.

# chroma_client.delete_collection("interesting_facts")

✨ Semantic search


In [9]:
results = facts_collection.query(
    query_texts=["What are the downsides of having pets at work?"], n_results=2
)

results

{'ids': [['-0x1eac65dc', '0x6d8504ee']],
 'distances': [[0.7810479402542114, 0.8007870316505432]],
 'metadatas': [[{'source': 'pets'}, {'source': 'pets'}]],
 'embeddings': None,
 'documents': [['\n    Employees might feel left out if they are required to avoid certain areas of\nthe office due to an allergy or a fear of dogs or cats\n    ',
   '\n    As an integrated part of modern life, animals play the role of domestic companions, give\nphysical and emotional support to humans, and provide value to many types of organizations\n(e.g., search and rescue dogs, zoo animals). Animals are also becoming more present in\norganizations due to employees and customers who bring their pets into the workplace. In\naddition, the integration of pets into individuals’ family lives also plays an important role in\nemployees’ work-family dynamics.\n    ']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

### FAISS

(Ref: [https://ai.meta.com/tools/faiss/](https://ai.meta.com/tools/faiss/))

FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search ... multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions.

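Compared to ChromaDB, FAISS offers a lower-level abstraction: it indexes and searches raw vectors, and leaves generating the embeddings and storing the original documents and metadata to the caller. It is designed to be flexible and can be used as an engine for building a custom vector database.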
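
As a rough mental model of the simplest kind of FAISS index (the flat L2 index used below), here is a pure-Python sketch of exhaustive nearest-neighbour search with squared Euclidean distance. `ToyFlatIndex` is a hypothetical class written for this illustration; it mimics the idea, not FAISS's API or performance.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyFlatIndex:
    """Hypothetical brute-force index: stores vectors, compares against all of them."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.vectors = []

    def add(self, vectors):
        for vector in vectors:
            if len(vector) != self.dimension:
                raise ValueError("vector has the wrong dimension")
            self.vectors.append(list(vector))

    @property
    def ntotal(self) -> int:
        return len(self.vectors)

    def search(self, query, k: int):
        # Score every stored vector, then keep the k closest.
        scored = sorted(
            (squared_l2(query, vector), position)
            for position, vector in enumerate(self.vectors)
        )
        top = scored[:k]
        return [d for d, _ in top], [p for _, p in top]


index = ToyFlatIndex(dimension=2)
index.add([[0.0, 0.0], [1.0, 1.0]])
index.add([[5.0, 5.0]])

distances, positions = index.search([0.9, 1.2], k=2)
print(positions)  # [1, 0]
```

Note that the search returns positions (insertion order), not documents. Mapping positions back to text is the caller's job.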


First, we will create an embedding model object.


In [40]:
from sentence_transformers import SentenceTransformer

# initialize sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# You can try other SentenceTransformer models, e.g.:
# model = SentenceTransformer('bert-base-nli-mean-tokens')

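# ... or from other providers like OpenAI (requires an API key).
# Note: OpenAI embedding models are not drop-in replacements for SentenceTransformer;
# they are called through the OpenAI client rather than `model.encode(...)`.
# !pip install openai[embeddings]
# Available models include:
#   text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002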



In [11]:
# create sentence embeddings
math_embeddings = model.encode(math_statements)
pets_embeddings = model.encode(pets_statements)

print("Math embeddings shape (items, dimensions): ", math_embeddings.shape)
print("Pets embeddings shape (items, dimensions): ", pets_embeddings.shape)

Math embeddings shape (items, dimensions):  (7, 384)
Pets embeddings shape (items, dimensions):  (5, 384)


In [12]:
import faiss


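# IndexFlatL2 performs exact (brute-force) search using Euclidean (L2) distance.
facts_index = faiss.IndexFlatL2(model.get_sentence_embedding_dimension())

# FAISS stores only vectors and identifies them by insertion position, so the
# insertion order must match the order of `all_statements` (math first, then pets).
facts_index.add(math_embeddings)
facts_index.add(pets_embeddings)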

facts_index.ntotal

12

In [16]:
from typing import List


def search_faiss_index(query_text: str, n_results: int = 1) -> List[str]:
    """
    Convenience function to make it easier to query the index
    with different text queries.
    """

    # Encode the query into an embedding (shape: 1 x dimensions)
    query = model.encode([query_text])

    # Find the closest embeddings; FAISS returns their positions in the index
    distances, indices = facts_index.search(query, n_results)

    # Look up the statements at those positions. FAISS pads the result with -1
    # when fewer than n_results vectors are available, so skip those entries.
    matching_statements = [all_statements[i] for i in indices[0] if i != -1]

    return matching_statements

In [17]:
results = search_faiss_index(query_text="What is the area of a trapezoid?")
results

['The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.']
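
Unlike the ChromaDB results, the FAISS results carry no metadata, because the index holds only vectors. One way to recover something like the `source` category is to keep a second list in the same order as the vectors. The sketch below assumes that approach; `sources`, `statements`, and `lookup` are hypothetical names, and the position lists are written by hand rather than returned by FAISS.

```python
# Parallel lists: position i in each list refers to the same item,
# in the same order the vectors were added to the index.
statements = [
    "The area of a circle with radius R is `πR²`.",
    "The area of a parallelogram with base B and height H is BH.",
    "Employees might feel left out if they must avoid areas of the office.",
]
sources = ["math"] * 2 + ["pets"] * 1


def lookup(positions):
    """Map one row of index positions to (source, statement) pairs, skipping -1 padding."""
    return [(sources[i], statements[i]) for i in positions if i != -1]


for source, statement in lookup([2, 0, -1]):
    print(f"[category: {source}]", statement)
```

This only works while the lists and the index stay in sync, which is one of the bookkeeping tasks a full vector database handles for you.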

In [18]:
query_texts = [
    "What is the area of a trapezium?",
    "What is the area of a rectangular shape?",
    "What is the surface area of a rectangular prism?",
    "What is the area of a circular shape?",
    "What is the surface area of a circular cone?",
    "What is the area of a cone?",
    "What are the downsides of having pets at work?",
]

for query_text in query_texts:
    results = search_faiss_index(query_text=query_text, n_results=1)
    print(query_text, results)

What is the area of a trapezium? ['The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.']
What is the area of a rectangular shape? ['The area of a rectangle with dimensions L & W is LW.']
What is the surface area of a rectangular prism? ['The surface area of a rectangular prism with dimensions H, L, B is `2(LB + BH + HL)`.']
What is the area of a circular shape? ['The area of a circle with radius R is `πR²`.']
What is the surface area of a circular cone? ['The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.']
What is the area of a cone? ['The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.']
What are the downsides of having pets at work? ['\n    Employees might feel left out if they are required to avoid certain areas of\nthe office due to an allergy or a fear of dogs or cats\n    ']


### Pinecone


#### Setting up Pinecone

1. Create an account at https://app.pinecone.io/ (you may sign in with Google, GitHub, or Microsoft).
2. Log in and get or create an API key (Manage > API Keys > Copy key value).
3. Create a Colab secret named `PINECONE_API_KEY` and paste the copied value.
4. Enable the secret for this notebook (sliding toggle next to secret name).


#### Using Pinecone


For this session, we will be creating a `Serverless` index.

Pinecone also supports what they call a `pod-based` index. The difference between the two is roughly analogous to using a serverless AWS service versus a managed AWS service: with a `pod-based` index, you need to pick and configure a few more parameters.

Generally, `serverless` seems to be what Pinecone promotes as the lower-cost, hassle-free option, so it is a good option to start with.


In [34]:
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec

from google.colab import userdata

pinecone_api_key = userdata.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)

In [41]:
import mmh3

In [42]:
from sentence_transformers import SentenceTransformer

# initialize sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# You can try other SentenceTransformer models, e.g.:
# model = SentenceTransformer('bert-base-nli-mean-tokens')

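# ... or from other providers like OpenAI (requires an API key).
# Note: OpenAI embedding models are not drop-in replacements for SentenceTransformer;
# they are called through the OpenAI client rather than `model.encode(...)`.
# !pip install openai[embeddings]
# Available models include:
#   text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002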

Now we will create the Pinecone index. The index will be created remotely in the Pinecone service.

Note:

1. `create_index` does not return the index object.
2. Pinecone index names cannot contain the `_` character.
3. `create_index` is not rerunnable. It fails if the index already exists.
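
Notes 2 and 3 can be handled with two small helpers, sketched below with the standard library only. Both helpers are hypothetical. `to_index_name` assumes that index names may contain only lowercase letters, digits, and hyphens, which is stricter than the single rule stated above and should be checked against Pinecone's documentation. `create_if_missing` takes the existing names and the creation call as parameters, so it makes no assumption about the Pinecone client itself.

```python
import re


def to_index_name(raw: str) -> str:
    """Turn an arbitrary label into a hyphenated, lowercase index name (assumed rule)."""
    name = raw.lower()
    name = re.sub(r"[^a-z0-9-]+", "-", name)  # replace disallowed runs with one hyphen
    name = re.sub(r"-{2,}", "-", name)        # collapse repeated hyphens
    return name.strip("-")


def create_if_missing(name: str, existing_names, create) -> bool:
    """Call `create(name)` only when `name` is not already present. Returns True if created."""
    if name in existing_names:
        return False
    create(name)
    return True


print(to_index_name("interesting_facts"))  # interesting-facts

# Demonstration with a fake "service": a plain list standing in for remote indexes.
fake_service = []
first = create_if_missing("interesting-facts", fake_service, fake_service.append)
second = create_if_missing("interesting-facts", fake_service, fake_service.append)
print(first, second)  # True False
```

In the notebook, `existing_names` would be the names of the indexes that already exist in your Pinecone project, and `create` would wrap the `pc.create_index(...)` call, which makes the cell safe to rerun.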


In [56]:
pc.create_index(
    name="interesting-facts",
    dimension=model.get_sentence_embedding_dimension(),
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
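
The index above is created with `metric="cosine"`, whereas the FAISS index earlier used L2 distance. Here is a minimal standard-library sketch of cosine similarity on toy vectors. `cosine_similarity` is written for illustration and is not Pinecone's code.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Cosine similarity ignores vector length and compares direction only. With this metric a higher score means a closer match, which is the opposite of a distance, where lower means closer.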

In [57]:
pc_index = pc.Index("interesting-facts")

for statement in math_statements:
    pc_index.upsert(
        vectors=[
            {
                "id": hex(mmh3.hash(statement)),
                "values": model.encode(statement),
                "metadata": {"source": "math", "text": statement},
            }
        ]
    )

for statement in pets_statements:
    pc_index.upsert(
        vectors=[
            {
                "id": hex(mmh3.hash(statement)),
                "values": model.encode(statement),
                "metadata": {"source": "pets", "text": statement},
            }
        ]
    )
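
The loops above send one request per statement, which is fine for a dozen items. Since `upsert` takes a list of vectors, larger datasets are usually sent in groups. The sketch below shows only the grouping step, using the standard library; `batched` is a hypothetical helper, and the vectors are dummy values rather than real embeddings.

```python
def batched(items: list, batch_size: int):
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy records in the same shape as the dictionaries passed to upsert above.
vectors = [
    {"id": f"id-{i}", "values": [0.0, 0.0], "metadata": {"source": "math"}}
    for i in range(5)
]

batches = list(batched(vectors, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be passed to `pc_index.upsert(vectors=batch)`, giving three requests instead of five in this example.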

In [54]:
# Uncomment and run to delete index

# pc.delete_index("interesting-facts")

In [69]:
query_texts = [
    "What is the area of a trapezium?",
    "What is the area of a rectangular shape?",
    "What is the surface area of a rectangular prism?",
    "What is the area of a circular shape?",
    "What is the surface area of a circular cone?",
    "What is the area of a cone?",
    "What are the downsides of having pets at work?",
]

sample_results = None
for query_text in query_texts:
    results = pc_index.query(
        vector=model.encode(query_text),
        top_k=1,
        include_values=False,
        include_metadata=True,
    )
    metadata = results["matches"][0]["metadata"]  # Pick [0] since top_k = 1

    sample_results = results
    print(query_text, f"[category: {metadata['source']}]", metadata["text"])

What is the area of a trapezium? [category: math] The area of a trapezoid with parallel sides A, B and height H is `H(A+B)/2`.
What is the area of a rectangular shape? [category: math] The area of a rectangle with dimensions L & W is LW.
What is the surface area of a rectangular prism? [category: math] The surface area of a rectangular prism with dimensions H, L, B is `2(LB + BH + HL)`.
What is the area of a circular shape? [category: math] The area of a circle with radius R is `πR²`.
What is the surface area of a circular cone? [category: math] The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.
What is the area of a cone? [category: math] The surface area of a right circular cone with radius of base R and height H is `πR(R + √(R² + H²))`.
What are the downsides of having pets at work? [category: pets] 
    Employees might feel left out if they are required to avoid certain areas of
the office due to an allergy or a fear of dogs or cats
    
