# Ingest data to a Vector DB (Amazon MemoryDB for Redis)
**_Use of Amazon MemoryDB for Redis as a vector database to store embeddings_**

This notebook works well with the `Data Science 2.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

Here is a list of packages that are used in this notebook.
```
!pip freeze | grep -E -w "langchain|pypdf|redis"
----------------------------------------------------------------------------------------
langchain==0.1.16
langchain-aws==0.1.0
langchain-community==0.0.34
langchain-core==0.1.45
langchain-text-splitters==0.0.1
pypdf==4.2.0
redis==5.0.1
```

## Step 1: Set up
Install the required packages.

### Install LangChain MemoryDB

In [None]:
%%sh

git clone --depth=1 https://github.com/aws-samples/amazon-memorydb-for-redis-samples.git
cd amazon-memorydb-for-redis-samples/tutorials/langchain-memorydb
pip install .

In [None]:
!pip install -U langchain==0.1.16 langchain-aws==0.1.0 langchain-community==0.0.34
!pip install -U pypdf==4.2.0 redis==5.0.1

In [None]:
!pip list | grep -E -w "langchain|pypdf|redis"

langchain                            0.1.16
langchain-aws                        0.1.0
langchain-community                  0.0.34
langchain-core                       0.1.45
langchain-text-splitters             0.0.1
pypdf                                4.0.1
redis                                5.0.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Step 2: Download the data from the web

In this step, we use `wget` to download the PDF version of the Amazon MemoryDB for Redis Developer Guide.

**The download may take a few minutes.**

In [None]:
%%sh
mkdir -p data
wget -O data/memorydb-guide.pdf https://docs.aws.amazon.com/memorydb/latest/devguide/memorydb-guide.pdf.pdf
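
A failed download can leave an HTML error page saved under a `.pdf` name, which would only surface later when the loader tries to parse it. As a quick sanity check (a sketch that is not part of the original notebook; `looks_like_pdf` is a hypothetical helper), you can confirm that the file starts with the `%PDF-` signature:

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and begins with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # PDF files start with the bytes "%PDF-" followed by a version number.
        return f.read(5) == b"%PDF-"


print(looks_like_pdf("data/memorydb-guide.pdf"))
```

If this prints `False`, re-check the download URL and the `wget` output before moving on.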

## Step 3: Load data into Amazon MemoryDB for Redis

In [None]:
import boto3

aws_region = boto3.Session().region_name

In [None]:
import json
from typing import Any, Dict

def get_cfn_outputs(stackname: str, region_name: str) -> Dict[str, str]:
    """Return a CloudFormation stack's outputs as an OutputKey -> OutputValue dict."""
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs


def get_credentials(secret_id: str, region_name: str) -> Dict[str, Any]:
    """Fetch a secret from AWS Secrets Manager and parse its JSON payload."""
    client = boto3.client('secretsmanager', region_name=region_name)
    response = client.get_secret_value(SecretId=secret_id)
    secrets_value = json.loads(response['SecretString'])
    return secrets_value

In [None]:
CFN_STACK_NAME = 'RAGMemoryDBAclStack'

cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)
memorydb_secret_name = cfn_stack_outputs['MemoryDBSecretName']

In [None]:
CFN_STACK_NAME = 'RAGMemoryDBStack'

cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)
memorydb_host = cfn_stack_outputs['MemoryDBClusterEndpoint']

In [None]:
creds = get_credentials(memorydb_secret_name, aws_region)
USER, PASSWORD = creds['username'], creds['password']

In [None]:
REDIS_URL = f"rediss://{USER}:{PASSWORD}@{memorydb_host}:6379/ssl=True&ssl_cert_reqs=none"
INDEX_NAME = 'idx:vss-mm'

# REDIS_URL embeds the password, so display only the index name.
INDEX_NAME
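
If the username or password contains characters such as `@`, `:` or `/`, interpolating them directly into the URL can produce a string that parses incorrectly. A minimal sketch, assuming you want to percent-encode the credentials first: `build_redis_url` is a hypothetical helper (not part of this notebook or of `redis-py`), the host name is a placeholder, and the query parameters from the cell above are left out for brevity.

```python
from urllib.parse import quote, urlsplit


def build_redis_url(user: str, password: str, host: str, port: int = 6379) -> str:
    """Build a rediss:// URL with percent-encoded credentials."""
    # safe="" makes quote() also encode "/", which it leaves untouched by default.
    return f"rediss://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Placeholder credentials and host, for illustration only.
url = build_redis_url("app-user", "p@ss:w/rd", "example-cluster.memorydb.local")
parts = urlsplit(url)
print(parts.hostname, parts.port)
```

URL parsers generally decode percent-encoded credentials again on the client side, but it is worth confirming that behavior for the client version you use.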

In [None]:
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters.character import RecursiveCharacterTextSplitter

In [None]:
pdf_path = './data/memorydb-guide.pdf'

loader = PyPDFLoader(file_path=pdf_path)

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=1000,
    chunk_overlap=100
)
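
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain fixed-size sliding window. It is a simplification for illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on the listed separators first rather than cutting at fixed offsets.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk.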

In [None]:
%%time
chunks = loader.load_and_split(text_splitter)

In [None]:
from langchain_community.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(
    region_name=aws_region
)

In [None]:
%%time
from langchain_memorydb import MemoryDB

vectorstore = MemoryDB.from_documents(
    chunks,
    embedding=embeddings,
    redis_url=REDIS_URL,
    index_name=INDEX_NAME
)

#### Check Index

Next, we inspect the index that was created for the documents, using Redis commands.

In [None]:
import redis

redis_client = redis.Redis(host=memorydb_host, port=6379,
                           username=USER, password=PASSWORD,
                           decode_responses=True, ssl=True, ssl_cert_reqs="none")

redis_client.execute_command('ft._list')

['idx:vss-mm']

In [None]:
redis_client.ft(INDEX_NAME).info()

## Step 4: Do a similarity search for user input against the documents (embeddings) in Amazon MemoryDB for Redis
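
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates the idea with cosine similarity over toy 3-dimensional vectors. The vectors, chunk labels, and the `top_k` helper are made up for illustration; the distance metric and search algorithm that MemoryDB actually applies depend on how the index was created.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; real embedding vectors have many more dimensions.
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
ranked = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([doc_id for doc_id, _ in ranked])
# ['chunk-a', 'chunk-b']
```

This brute-force version compares the query with every stored vector, whereas a vector index avoids that full scan on large collections.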

In [None]:
query = "What is the company's strategy for generative AI?"

# Keep %time on the same line as the call; on a line by itself it times an empty statement.
%time results = vectorstore.similarity_search(query)
results


[Document(page_content='recomputing them, FM buﬀer memory reduces the amount of computation required during \ninference through the FMs. This FM buﬀer memory allows large language models to respond faster \nwith lower costs due to service charges from the FM.\nRetrieval Augmented Generation (RAG) 295', metadata={'id': 'doc:idx:vss-mm:4b468d1b5d4f4c3592301c2082555bf3', 'source': './data/memorydb-guide.pdf', 'page': '302'}),
 Document(page_content='managed experience enabling developers to execute LUA scripts with application logic stored on \nthe MemoryDB cluster, without requiring clients to re-send the scripts to the server with every \nconnection.\nEngine versions 92', metadata={'id': 'doc:idx:vss-mm:835df288eae94702b19de8db615dc2f9', 'source': './data/memorydb-guide.pdf', 'page': '99'}),
 Document(page_content='Amazon MemoryDB for Redis Developer Guide\nFollowing are use cases of vector search.\nRetrieval Augmented Generation (RAG)\nRetrieval Augmented Generation (RAG) leverages vec

## Clean up

To avoid incurring future charges, delete the resources you created. You can do this by deleting the CloudFormation stack that was used to create the IAM role and SageMaker notebook.

## Conclusion

In this notebook, we used Amazon Bedrock to generate embeddings, ingested those embeddings into Amazon MemoryDB for Redis, and ran a similarity search for user input against the documents (embeddings) stored there. We used LangChain as an abstraction layer to talk to both Amazon Bedrock and Amazon MemoryDB for Redis.