Vector Search using vCore-based Azure Cosmos DB for MongoDB

This notebook demonstrates using an Azure OpenAI embedding model to vectorize documents already stored in Azure Cosmos DB API for MongoDB, storing the embedding vectors and the creation of a vector index. Lastly, the notebook will demonstrate how to query the vector index to find similar documents.

Make sure that the lab in load_data was ran

In [20]:
! pip install pymongo
! pip install openai
! pip install python-dotenv
! pip install tenacity




[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip





[notice] A new release of pip is available: 24.0 -> 24.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import os
import pymongo
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt

Load settings and establish connectivity to database and openAI

In [30]:
load_dotenv()
CONNECTION_STRING = os.environ.get("DB_CONNECTION_STRING")
client = pymongo.MongoClient(CONNECTION_STRING)
# Create database to hold cosmic works data
# MongoDB will create the database if it does not exist
db = client.cosmic_works

EMBEDDINGS_DEPLOYMENT_NAME = "embeddings"
COMPLETIONS_DEPLOYMENT_NAME = "completions"
AOAI_ENDPOINT = os.environ.get("AOAI_ENDPOINT")
AOAI_KEY = os.environ.get("AOAI_KEY")
AOAI_API_VERSION = "2023-05-15"

ai_client = AzureOpenAI(
    azure_endpoint = AOAI_ENDPOINT,
    api_version = AOAI_API_VERSION,
    api_key = AOAI_KEY
    )

  client = pymongo.MongoClient(CONNECTION_STRING)


Vectorize and store the embeddings in each document

The process of creating a vector embedding field on each document only needs to be done once. However, if a document changes, the vector embedding field will need to be updated with an updated vector.

In [22]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def generate_embeddings(text: str):
    '''
    Generate embeddings from string of text using the deployed Azure OpenAI API embeddings model.
    This will be used to vectorize document data and incoming user messages for a similarity search with
    the vector index.
    '''
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    time.sleep(0.5) # rest period to avoid rate limiting on AOAI
    return embeddings

In [23]:
# demonstrate embeddings generation using a test string
test = "hello, world"
print(generate_embeddings(test))

[-0.016903093, -0.0068888566, -0.027763786, -0.046490017, -0.011016962, 0.01019655, -0.014064207, -0.004762948, -0.018856455, -0.028388862, 0.029092072, 0.01996336, -0.021812543, -0.006263781, 0.009525896, 0.006566552, 0.017410967, -0.0143507, 0.011863419, 0.018804366, -0.0124950055, -1.7002898e-05, 0.009206846, -0.010281196, -0.009695187, -0.016551487, 0.006986525, -0.016759846, 0.024560273, -0.03815567, 0.00072762737, 0.0034574508, -0.016043615, -0.006322382, 0.01116672, -0.011935042, 0.0009498223, -0.027789831, 0.029534834, -0.011290433, 0.0023961242, -0.007116749, 0.0041606613, -0.013725624, -0.03263417, 0.0127554545, 0.008718506, -0.015079955, 0.0042420514, 0.02260691, 0.021773476, 0.001420257, -0.024443071, -0.0018866222, -0.013289373, 0.008751062, -0.035473056, 0.014780439, 0.020145673, -0.020666571, 0.01601757, 0.0037309215, -0.025471842, 0.012091311, -0.009988192, 0.010235617, 0.016199883, 0.008588281, -0.019351307, 0.014585103, 0.021005154, 0.019051792, -0.005482436, -0.00776

In [31]:
print(db.list_collection_names())


['products', 'customers', 'sales']


Vectorize and update all documents in the Cosmic Works database

In [36]:
def add_collection_content_vector_field(collection_name: str):
    '''
    Add a new field to the collection to hold the vectorized content of each document.
    '''
    collection = db[collection_name]
    bulk_operations = []

    # Check if collection is empty
    if collection.count_documents({}) == 0:
        print(f"The collection '{collection_name}' is empty.")
        return
    

    for doc in collection.find():

        # Remove any previous contentVector embeddings
        if "contentVector" in doc:
            del doc["contentVector"]

        # Generate embeddings for the document string representation
        content = json.dumps(doc, default=str)
        content_vector = generate_embeddings(content)
        
        bulk_operations.append(pymongo.UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"contentVector": content_vector}},
            upsert=True
        ))

    # Check if bulk_operations is empty
    if not bulk_operations:
        print(f"No operations to execute for collection '{collection_name}'.")
        return

    # Execute bulk operations
    collection.bulk_write(bulk_operations)



In [34]:
# Add vector field to products documents - this will take approximately 3-5 minutes due to rate limiting
add_collection_content_vector_field("products")

Processing document ID: 027D0B9A-F9D9-4C96-8213-C8546C4AAE71
Processing document ID: 08225A9E-F2B3-4FA3-AB08-8C70ADD6C3C2
Processing document ID: 0A7E57DA-C73F-467F-954F-17B7AFD6227E
Processing document ID: 14174164-F6C0-47FC-83FB-604C6A63408D
Processing document ID: 1A176FDB-D9A8-4888-BDD9-CE4F12E97AAE
Processing document ID: 201D0D79-81AD-43D2-AD6E-F09EEE6AC2D7
Processing document ID: 24BE4267-85D8-4C1A-B184-C08709495752
Processing document ID: 290B4594-95BE-47C5-863A-4EFAAFC0AED7
Processing document ID: 29663491-D2E9-47B4-83AE-D9459B6B5B67
Processing document ID: 2C981511-AC73-4A65-9DA3-A0577E386394
Processing document ID: 3F105575-8677-42F9-8E1F-76E4B450F136
Processing document ID: 3FE1A99E-DE14-4D11-B635-F5D39258A0B9
Processing document ID: 44873725-7B3B-4B28-804D-963D2D62E761
Processing document ID: 47C70E1E-E500-41B3-8615-DCCB963D9E35
Processing document ID: 4B0848F8-7BF5-4DB9-84A7-C4D69F2E3E8E
Processing document ID: 4E4B38CB-0D82-43E5-89AF-20270CD28A04
Processing document ID: 

In [33]:
# Add vector field to customers documents - this will take approximately 1-2 minutes due to rate limiting
add_collection_content_vector_field("customers")

Processing document ID: 022BB1FA-35E6-4CC5-9079-8EA61FE7FAAE
Processing document ID: 0E57A241-1B95-43A2-BCFB-637608B0AD1A
Processing document ID: 23A65A9A-479C-44D2-9F6A-E6CDA8B0BE08
Processing document ID: 29C95F8A-9C52-48DB-A1C4-8A14C430FF06
Processing document ID: 34E7A125-0F66-4673-A80B-20B4C46EAD3A
Processing document ID: 35D52474-3D1A-433C-A310-10FA7DF8950B
Processing document ID: 3945DB3E-2632-466C-BCBE-0C252729C937
Processing document ID: 44A6D5F6-AF44-4B34-8AB5-21C5DC50926E
Processing document ID: 45E422FD-0AE2-4C73-8883-61B1C3BB4431
Processing document ID: 4FEAA310-61D4-4A89-8E78-3CA6B34F7934
Processing document ID: 537E369C-C65B-4F23-B7C0-D07DFFFAC08B
Processing document ID: 5EE9C404-EBE5-45F6-8063-E537AA5E750C
Processing document ID: 6325847B-D85C-4F23-9C47-7346082D38A1
Processing document ID: 670C1D45-DCAF-4B08-8358-F1DFBAE8E8C8
Processing document ID: 6A0B9894-8EDB-4568-A4AC-03241F72F62C
Processing document ID: 6DF10977-EA10-4607-AF0F-74EDABA54943
Processing document ID: 

In [37]:
# Add vector field to sales documents - this will take approximately 15-20 minutes due to rate limiting
add_collection_content_vector_field("sales")

In [38]:
# Create the products vector index
db.command({
  'createIndexes': 'products',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the customers vector index
db.command({
  'createIndexes': 'customers',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the sales vector index
db.command({
  'createIndexes': 'sales',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': False,
   'ok': 1}},
 'ok': 1}

Use vector search in vCore-based Azure Cosmos DB for MongoDB

Now that each document has its associated vector embedding and the vector indexes have been created on each collection, we can now use the vector search capabilities of vCore-based Azure Cosmos DB for MongoDB.

In [39]:
def vector_search(collection_name, query, num_results=3):
    """
    Perform a vector search on the specified collection by vectorizing
    the query and searching the vector index for the most similar documents.

    returns a list of the top num_results most similar documents
    """
    collection = db[collection_name]
    query_embedding = generate_embeddings(query)    
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

def print_product_search_result(result):
    '''
    Print the search result document in a readable format
    '''
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Name: {result['document']['name']}")   
    print(f"Category: {result['document']['categoryName']}")
    print(f"SKU: {result['document']['categoryName']}")
    print(f"_id: {result['document']['_id']}\n")

In [40]:
query = "What bikes do you have?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result) 

Similarity Score: 0.7668056488037109
Name: Road-750 Black, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 2595584F-EA4E-4D45-948E-99A17AF8C519

Similarity Score: 0.7646317937750707
Name: Road-550-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 3A70EDD4-6C8C-44AA-A13D-49D0F6058699

Similarity Score: 0.764350845554422
Name: Mountain-300 Black, 48
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: E8767BC9-D6BA-47FC-9842-3511468869B6

Similarity Score: 0.7631214229771299
Name: Road-550-W Yellow, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 26E8185C-782A-4B48-87FA-1E715E3825FB



In [41]:
query = "What do you have that is yellow?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)  

Similarity Score: 0.7417972924536421
Name: Road-550-W Yellow, 48
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 26E8185C-782A-4B48-87FA-1E715E3825FB

Similarity Score: 0.7403228217074814
Name: Road-350-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 9E5C74FD-F685-45AE-A799-D67EFB5C28A1

Similarity Score: 0.7373172703967041
Name: Road-550-W Yellow, 40
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: 3A70EDD4-6C8C-44AA-A13D-49D0F6058699

Similarity Score: 0.7355404996187269
Name: LL Touring Frame - Yellow, 62
Category: Components, Touring Frames
SKU: Components, Touring Frames
_id: 91AA100C-D092-4190-92A7-7C02410F04EA



In [43]:
query = "What type of products you have?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)  

Similarity Score: 0.7479221743229322
Name: HL Fork
Category: Components, Forks
SKU: Components, Forks
_id: 751115E7-BD5E-45C7-932B-E9DDE9D62579

Similarity Score: 0.7470571398735258
Name: LL Headset
Category: Components, Headsets
SKU: Components, Headsets
_id: FC0B659C-C1EF-41F3-AFE2-F87C7F43AD48

Similarity Score: 0.7468865362934313
Name: Mountain Pump
Category: Accessories, Pumps
SKU: Accessories, Pumps
_id: FE292D83-1F34-4845-A467-7C62AD3C6CBE

Similarity Score: 0.7463898952472138
Name: HL Mountain Frame - Black, 38
Category: Components, Mountain Frames
SKU: Components, Mountain Frames
_id: 0990C3D9-4EC2-4272-ADB6-9481CA12F5F6



In [None]:
query = "What do you have that is yellow?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)  

Use vector search results in a RAG pattern with Chat GPT-3.5


In [44]:
# A system prompt describes the responsibilities, instructions, and persona of the AI.
system_prompt = """
You are a helpful, fun and friendly sales assistant for Cosmic Works, a bicycle and bicycle accessories store. 
Your name is Cosmo.
You are designed to answer questions about the products that Cosmic Works sells.

Only answer questions related to the information provided in the list of products below that are represented
in JSON format.

If you are asked a question that is not in the list, respond with "I don't know."

List of products:
"""

In [45]:
def rag_with_vector_search(question: str, num_results: int = 3):
    """
    Use the RAG model to generate a prompt using vector search results based on the
    incoming question.  
    """
    # perform the vector search and build product list
    results = vector_search("products", question, num_results=num_results)
    product_list = ""
    for result in results:
        if "contentVector" in result["document"]:
            del result["document"]["contentVector"]
        product_list += json.dumps(result["document"], indent=4, default=str) + "\n\n"

    # generate prompt for the LLM with vector results
    formatted_prompt = system_prompt + product_list

    # prepare the LLM request
    messages = [
        {"role": "system", "content": formatted_prompt},
        {"role": "user", "content": question}
    ]

    completion = ai_client.chat.completions.create(messages=messages, model=COMPLETIONS_DEPLOYMENT_NAME)
    return completion.choices[0].message.content

In [46]:
print(rag_with_vector_search("What bikes do you have?", 5))

We have the following bikes available:
1. Road-750 Black, 48
2. Road-550-W Yellow, 40
3. Mountain-300 Black, 48
4. Road-550-W Yellow, 48
5. Touring-1000 Yellow, 60


In [47]:
print(rag_with_vector_search("What are the names and skus of yellow products?", 5))

The names and skus of yellow products are as follows:

1. Road-550-W Yellow, 48 (SKU: BK-R64Y-48)
2. ML Road Frame-W - Yellow, 48 (SKU: FR-R72Y-48)
3. Road-550-W Yellow, 40 (SKU: BK-R64Y-40)
4. Road-350-W Yellow, 48 (SKU: BK-R79Y-48)
5. Touring-1000 Yellow, 60 (SKU: BK-T79Y-60)


In [49]:
print(rag_with_vector_search("What are the diferent types of products you have?", 10))

The different types of products we have are Components, Forks (HL Fork, LL Fork), Components, Headsets (LL Headset), Accessories, Cleaners (Bike Wash - Dissolver), Accessories, Pumps (Mountain Pump), Accessories, Lights (Headlights - Dual-Beam), Components, Derailleurs (Front Derailleur), Components, Mountain Frames (HL Mountain Frame - Black, 38), Components, Chains (Chain), and Components, Road Frames (ML Road Frame-W - Yellow, 48).
