# Ingest data to a Vector DB (Amazon DocumentDB (with MongoDB Compatibility))
**_Use of Amazon DocumentDB (with MongoDB Compatibility) as a vector database to store embeddings_**

This notebook works well on an `ml.t3.medium` instance with the `Python3` kernel from **JupyterLab** or the `Data Science 2.0` kernel from **SageMaker Studio Classic**.

Here is a list of packages that are used in this notebook.
```
!pip freeze | grep -E -w "langchain|pymongo|pypdf"
---------------------------------------------------
langchain==0.2.6
langchain-aws==0.1.9
langchain-community==0.2.6
langchain-core==0.2.11
langchain-text-splitters==0.2.2
pymongo==4.6.3
pypdf==4.2.0
```

## Step 1: Set up
Install the required packages

In [None]:
%%capture --no-stderr

!pip install -U langchain==0.2.6
!pip install -U langchain-community==0.2.6
!pip install -U langchain-aws==0.1.9
!pip install -U pypdf==4.2.0
!pip install -U pymongo==4.6.3

In [None]:
!pip list | grep -E -w "langchain|pymongo|pypdf"

## Step 2: Download the data from the web

In this step we use `wget` to download the PDF version of the Amazon DocumentDB (with MongoDB Compatibility) Developer Guide.

**The download may take a few minutes**.

In [None]:
%%sh
mkdir -p data
wget --no-check-certificate -O data/documentdb-guide.pdf https://docs.aws.amazon.com/documentdb/latest/developerguide/developerguide.pdf
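
Before moving on, it can help to confirm that the download produced a real PDF rather than an empty file or an HTML error page. The following is a minimal sketch of such a check; the helper name `looks_like_pdf` is hypothetical and not part of the original notebook. It only inspects the file's leading bytes, so it is a sanity check, not a full validation.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists, is non-empty, and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as fh:
        # PDF files begin with the marker "%PDF-" followed by a version number
        return fh.read(5) == b"%PDF-"


# Path taken from the download cell above
print(looks_like_pdf("data/documentdb-guide.pdf"))
```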

## Step 3: Load data into Amazon DocumentDB (with MongoDB Compatibility)

In [None]:
import boto3

aws_region = boto3.Session().region_name

In [None]:
import json
from typing import Dict

def get_cfn_outputs(stackname: str, region_name: str) -> Dict[str, str]:
    """Return the CloudFormation stack outputs as an {OutputKey: OutputValue} mapping."""
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs


def get_credentials(secret_id: str, region_name: str) -> dict:
    """Fetch a secret from AWS Secrets Manager and parse its JSON payload."""
    client = boto3.client('secretsmanager', region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    secrets_value = json.loads(response['SecretString'])
    return secrets_value

In [None]:
CFN_STACK_NAME = 'RAGDocDBStack'

cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)
docdb_secret_name = cfn_stack_outputs['DocDBSecret']
docdb_host = cfn_stack_outputs['DocumentDBCluster']
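
To make the shape of the data concrete: `describe_stacks` returns each stack output as a small dictionary with an `OutputKey` and an `OutputValue`, and `get_cfn_outputs` flattens that list into a single mapping. The sketch below reproduces that flattening offline. The function name `flatten_outputs` and every value in `sample_outputs` are made up for illustration; only the two keys `DocDBSecret` and `DocumentDBCluster` come from this notebook.

```python
from typing import Dict, List


def flatten_outputs(outputs: List[Dict[str, str]]) -> Dict[str, str]:
    """Turn [{'OutputKey': k, 'OutputValue': v}, ...] into {k: v, ...}."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}


# Hypothetical sample data, shaped like the 'Outputs' list of a stack description
sample_outputs = [
    {"OutputKey": "DocDBSecret", "OutputValue": "example-secret-name"},
    {"OutputKey": "DocumentDBCluster", "OutputValue": "example-cluster.example.com"},
]

flat = flatten_outputs(sample_outputs)
print(flat["DocDBSecret"])
print(flat["DocumentDBCluster"])
```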

In [None]:
creds = get_credentials(docdb_secret_name, aws_region)
USER, PASSWORD = creds['username'], creds['password']


##### Get Amazon DocumentDB Certificate Authority (CA) certificate
Download the Amazon DocumentDB Certificate Authority (CA) certificate required to authenticate to your instance.

In [None]:
!wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem

In [None]:
import pymongo

client = pymongo.MongoClient(
    host=docdb_host,
    port=27017,
    username=USER,
    password=PASSWORD,
    retryWrites=False,
    tls=True,
    tlsCAFile="global-bundle.pem"
)
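
The same connection settings can also be expressed as a single MongoDB connection string, which is convenient when another tool expects a URI. The sketch below is one way to build it, assuming the standard URI options `tls`, `tlsCAFile`, and `retryWrites`. The helper name `build_docdb_uri` and the sample host and credentials are hypothetical. The username and password are percent-escaped with `quote_plus`, because PyMongo rejects URIs that contain unescaped reserved characters in the credentials.

```python
from urllib.parse import quote_plus


def build_docdb_uri(
    host: str,
    username: str,
    password: str,
    port: int = 27017,
    ca_file: str = "global-bundle.pem",
) -> str:
    """Build a connection string equivalent to the keyword arguments used above."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return (
        f"mongodb://{user}:{pwd}@{host}:{port}/"
        f"?tls=true&tlsCAFile={quote_plus(ca_file)}&retryWrites=false"
    )


# Hypothetical values; in the notebook these would be docdb_host, USER, PASSWORD
uri = build_docdb_uri("host.example", "admin", "p@ss/w:rd")
print(uri)
```

The resulting string could then be passed as the first argument to `pymongo.MongoClient` instead of the separate keyword arguments.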

In [None]:
import traceback

from pymongo.errors import ConnectionFailure

try:
    # 'ping' is a cheap command that confirms the cluster is reachable
    client.admin.command('ping')
    print('Server available')
except ConnectionFailure:
    print('Server not available')
    traceback.print_exc()

In [None]:
db_name = "ragdemo" # name the database
collection_name = "rag" # name the collection

db = client[db_name] # create a database object
collection = db[collection_name] # create a collection object

In [None]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters.character import RecursiveCharacterTextSplitter

In [None]:
pdf_path = './data/documentdb-guide.pdf'

loader = PyPDFLoader(file_path=pdf_path)

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=1000,
    chunk_overlap=100
)
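
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain fixed-width sliding window. This is a deliberate simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on the listed separators and only falls back to finer ones when a piece is still too long. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in demo:
    print(chunk)
```

Each window repeats the last three characters of the previous one, which is the role `chunk_overlap=100` plays in the cell above: context at a chunk boundary is not lost entirely.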

In [None]:
%%time
chunks = loader.load_and_split(text_splitter)

CPU times: user 1min 15s, sys: 125 ms, total: 1min 15s
Wall time: 1min 30s


In [None]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    region_name=aws_region
)
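
The vector index created later in this notebook declares `"dimensions": 1536`, so the embedding model must produce vectors of exactly that length. A small guard like the one below can catch a mismatch before any data is inserted. This is a sketch only: `check_dimensions` and `EXPECTED_DIMENSIONS` are hypothetical names, and the example call uses a dummy vector so that it runs without calling Amazon Bedrock.

```python
from typing import Sequence

# Must match the "dimensions" value in the vector index options used in this notebook
EXPECTED_DIMENSIONS = 1536


def check_dimensions(vector: Sequence[float], expected: int = EXPECTED_DIMENSIONS) -> None:
    """Raise ValueError if the embedding length differs from the index dimensions."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, expected {expected}"
        )


# Dummy vector for illustration; in the notebook one could instead pass
# embeddings.embed_query("hello")
check_dimensions([0.0] * 1536)
print("dimensions OK")
```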

In [None]:
%%time
from langchain_community.vectorstores import DocumentDBVectorSearch

# Use the LangChain DocumentDB integration to embed each chunk and insert it into the collection (DocumentDB is compatible with the MongoDB insert API)
vectorstore = DocumentDBVectorSearch.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection=collection
)

CPU times: user 7.33 s, sys: 426 ms, total: 7.75 s
Wall time: 4min 24s


In [None]:
# Count the documents (chunks) stored in the collection
collection.count_documents({})

In [None]:
%%time

# Create a vector index on the "vectorContent" field. By default, LangChain inserts chunks with the following fields: vectorContent, source, page, textContent
# See the following for the vector options available when creating an index: https://docs.aws.amazon.com/documentdb/latest/developerguide/vector-search.html#w5aac21c11c11
collection.create_index(
    [("vectorContent", "vector")],
    vectorOptions={
        "type": "hnsw",
        "similarity": "cosine",
        "dimensions": 1536,
        "m": 16,
        "efConstruction": 64
    },
    name=vectorstore.get_index_name()
)

CPU times: user 3.48 ms, sys: 59 µs, total: 3.54 ms
Wall time: 1.39 s


'vectorSearchIndex'

In [None]:
for index in collection.list_indexes():
    print(index)

SON([('v', 4), ('key', SON([('_id', 1)])), ('name', '_id_'), ('ns', 'ragdemo.rag')])
SON([('v', 4), ('key', SON([('vectorContent', 'vector')])), ('name', 'vectorSearchIndex'), ('vectorOptions', SON([('type', 'hnsw'), ('dimensions', 1536), ('similarity', 'cosine'), ('m', 16), ('efConstruction', 64)])), ('ns', 'ragdemo.rag')])
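
The index above uses `"similarity": "cosine"`. As a reference point, the sketch below computes cosine similarity in plain Python and performs an exact, brute-force top-k search over a tiny set of vectors. An HNSW index approximates this kind of ranking without comparing the query against every stored vector. The function names and the toy two-dimensional vectors are illustrative assumptions; this is not how DocumentDB implements the search.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Exact nearest neighbours: score every vector, return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data for illustration only
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], vectors, k=2)
print(best)
```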


## Step 4: Do a similarity search for user input to documents (embeddings) in Amazon DocumentDB (with MongoDB Compatibility)

In [None]:
query = "What is the company's strategy for generative AI?"

embedded_query = embeddings.embed_query(query)

pipeline = [
    {"$match": {}},
    {
        "$search": {
            "vectorSearch" : {
                "vector" : embedded_query,
                "path": "vectorContent",
                "similarity": "cosine",
                "k": 2,
                "efSearch": 40
            }
        }
    }
]

docs = collection.aggregate(pipeline)

results = [doc['textContent'] for doc in docs]
for i, e in enumerate(results):
    print(f"[doc-{i}]\n", e)
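
When the query is run more than once, it can be convenient to wrap the pipeline in a small function so that `k` and `efSearch` are easy to vary. The sketch below only builds the pipeline as a Python data structure, mirroring the stages used in the cell above; the function name `build_vector_search_pipeline` and its parameter names are hypothetical. The result would be passed to `collection.aggregate(...)` as before.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    k: int = 2,
    ef_search: int = 40,
    path: str = "vectorContent",
    similarity: str = "cosine",
) -> List[Dict[str, Any]]:
    """Return the same two-stage aggregation pipeline as above, with tunable parameters."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [
        {"$match": {}},
        {
            "$search": {
                "vectorSearch": {
                    "vector": list(query_vector),
                    "path": path,
                    "similarity": similarity,
                    "k": k,
                    "efSearch": ef_search,
                }
            }
        },
    ]


# Dummy query vector for illustration; in the notebook this would be embedded_query
example_pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], k=5)
print(example_pipeline[1]["$search"]["vectorSearch"]["k"])
```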

## Clean up

To avoid incurring future charges, delete the resources. You can do this by deleting the CloudFormation stack used to create the IAM role and SageMaker notebook.

## Conclusion

In this notebook we used Amazon Bedrock to generate embeddings, ingested those embeddings into Amazon DocumentDB (with MongoDB Compatibility), and ran a similarity search that matches user input against the documents (embeddings) stored there. We used LangChain as an abstraction layer to talk to both Amazon Bedrock and Amazon DocumentDB (with MongoDB Compatibility).

## References

- [Vector search for Amazon DocumentDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/vector-search.html)
- [Amazon DocumentDB (with MongoDB compatibility) samples](https://github.com/aws-samples/amazon-documentdb-samples/)
- [LangChain Providers - AWS](https://python.langchain.com/docs/integrations/platforms/aws/) - The `LangChain` integrations for the `AWS` platform.