From abaf41e9a22ba2a998f59feb51f32c2daca64b13 Mon Sep 17 00:00:00 2001
From: Paul Cornell
Date: Fri, 18 Jul 2025 10:04:57 -0700
Subject: [PATCH] Use Amazon S3 Vectors with Unstructured

---
 docs.json                        |   1 +
 examplecode/tools/s3-vectors.mdx | 397 +++++++++++++++++++++++++++++++
 2 files changed, 398 insertions(+)
 create mode 100644 examplecode/tools/s3-vectors.mdx

diff --git a/docs.json b/docs.json
index 93ca66c0..68afa20d 100644
--- a/docs.json
+++ b/docs.json
@@ -279,6 +279,7 @@
             "examplecode/tools/google-drive-events",
             "examplecode/tools/onedrive-events",
             "examplecode/tools/sharepoint-events",
+            "examplecode/tools/s3-vectors",
             "examplecode/tools/jq",
             "examplecode/tools/firecrawl",
             "examplecode/tools/langflow",
diff --git a/examplecode/tools/s3-vectors.mdx b/examplecode/tools/s3-vectors.mdx
new file mode 100644
index 00000000..f47ec170
--- /dev/null
+++ b/examplecode/tools/s3-vectors.mdx
@@ -0,0 +1,397 @@
---
title: Amazon S3 Vectors
---

[Amazon S3 Vectors](https://aws.amazon.com/s3/features/vectors/) is a durable vector storage solution that can greatly reduce the total cost of uploading, storing,
and querying vectors. S3 Vectors is a cloud object store with native support for storing large vector datasets and
providing subsecond query performance, which makes it more affordable for businesses to store AI-ready data at massive scale.

This hands-on walkthrough demonstrates how to use Amazon S3 Vectors with Unstructured. In this walkthrough, you will:

1. Create an S3 vector bucket.
2. Create a vector index in the bucket.
3. Add the contents of one or more source JSON output files that have been generated by Unstructured to the vector index.
4. Query the vector index against the contents of the source JSON output files that were added.

## Requirements

import GetStartedSimpleUIOnly from '/snippets/general-shared-text/get-started-simple-ui-only.mdx'

To use this example, you will need:

- A set of one or more JSON output files that have been generated by Unstructured and stored somewhere on your local development machine. For maximum compatibility with this example, these files must contain vector embeddings that were
    generated by Amazon Bedrock, using the **Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`) embedding model, with 1024 dimensions. (A quick way to check these files is shown after this list.) To get these files, you will need:

    - An Unstructured account, as follows:

      <GetStartedSimpleUIOnly />

    - A workflow that generates vector embeddings and adds them to the JSON output files. Learn how to [create a custom workflow](/ui/workflows#create-a-custom-workflow) and [add an Embedder node](/ui/embedding#generate-embeddings) to that workflow.

      <Note>
          The [destination connector](/ui/destinations/overview) for your workflow must generate JSON output files. These include destination connectors for file storage services such as
          Databricks Volumes, Google Cloud Storage, OneDrive, and S3. Destination connectors for databases such as Elasticsearch, Kafka, and MongoDB, and for vector stores such as Astra DB, Pinecone, and Weaviate, do not generate JSON output files.
      </Note>

    - After your workflow generates the JSON output files, copy them from your workflow's destination location to a location on your local development machine so that the scripts in this walkthrough can access them.

- Python installed on your local development machine.
- An AWS account. [Create an AWS account](https://aws.amazon.com/free).
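Before you continue, you can optionally spot-check one of your JSON output files to confirm that its elements contain both text and 1024-dimension embeddings, because the scripts later in this walkthrough skip any elements that are missing either one. The following is a minimal sketch that assumes only what the rest of this walkthrough already relies on (each file holds a JSON array of element objects with `element_id`, `text`, and `embeddings` fields); the file path is a placeholder for the path to one of your own JSON output files.

```python
import json

# Placeholder: the path to one of your Unstructured JSON output files.
sample_file_path = '<path-to-one-json-output-file>'

with open(sample_file_path, 'r') as f:
    elements = json.load(f)

# Each file contains a JSON array of Unstructured element objects.
for element in elements:
    has_text = 'text' in element
    num_dimensions = len(element['embeddings']) if 'embeddings' in element else 0
    print(f"Element {element.get('element_id')}: has text: {has_text}, "
          f"embedding dimensions: {num_dimensions}")
```

Each element should report 1024 dimensions to match the vector index that you create in Step 2.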
## Step 1: Create the S3 vector bucket

1. Sign in to the [AWS Management Console](https://console.aws.amazon.com/).
2. Open the [Amazon S3 Console](https://console.aws.amazon.com/s3/home).
3. On the sidebar, click **Vector buckets**.
4. Click **Create vector bucket**.
5. For **Vector bucket name**, enter a name for your bucket.
6. For **Encryption**, select an encryption method, or leave the default.
7. Click **Create vector bucket**.

## Step 2: Add the vector index to the bucket

1. With the list of vector buckets showing from the previous step, click the name of the bucket that you just created.
2. Click **Create vector index**.
3. For **Vector index name**, enter a name for your index.
4. For **Dimension**, enter the number of dimensions that Unstructured generated for your vector embeddings. For example,
    for the **Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`) embedding model, enter `1024`. If you are not sure how many dimensions to enter,
    see your workflow's **Embedder** node settings.
5. Select the appropriate **Distance metric** for your embedding model. For example, for the
    **Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`) embedding model, select **Cosine**. If you are not sure which distance metric to use,
    see your embedding model's documentation.
6. Click **Create vector index**.
7. After the vector index is created, copy the value of the index's **Amazon Resource Name (ARN)**, as you will need it in
    later steps. This ARN takes the format `arn:aws:s3vectors:<region>:<account-id>:bucket/<bucket-name>/index/<index-name>`.

## Step 3: Add the source JSON output files' contents to the vector index

1. In your local Python virtual environment, install the `boto3` library. (The `uuid` module that the following script also uses is part of the Python standard library, so you do not need to install it.)
2. [Set up Boto3 credentials](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for your AWS account.
    The following steps assume that you have set up your Boto3 credentials outside of the following code, for example by setting
    [environment variables](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#environment-variables) or by
    configuring a [shared credentials file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#shared-credentials-file).

    One approach to getting and setting up Boto3 credentials is to [create an AWS access key and secret access key](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey)
    and then use the [AWS Command Line Interface (AWS CLI)](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
    to [set up your credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-methods) on your local development machine.

3. Add the following code to a Python script file in your virtual environment, replacing the following placeholders:

    - Replace `<path-to-json-files>` with the path to the directory that contains your JSON output files.
    - Replace `<index-arn>` with the ARN of the vector index that you created previously in Step 2.
    - Replace `<region-short-id>` with the short ID of the region where your vector index is located, for example `us-east-1`.

    ```python
    import boto3
    import os
    import json
    import uuid

    source_json_file_path = '<path-to-json-files>'
    index_arn = '<index-arn>'
    index_region_short_id = '<region-short-id>'
    s3vectors = boto3.client('s3vectors', region_name=index_region_short_id)
    num_vectors = 0
    verbose = True # Set to False to only print final results.

    # For each JSON file in the source directory...
    for filename in os.listdir(source_json_file_path):
        if filename.endswith('.json'):
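            # Each JSON output file is expected to contain a JSON array of
            # Unstructured element objects. Elements that are missing the 'text'
            # or 'embeddings' fields are skipped below instead of being added
            # to the vector index.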
            # ...read the JSON file, and then...
            with open(os.path.join(source_json_file_path, filename), 'r') as f:
                elements = json.load(f)
            # ...for each element in the Unstructured-formatted JSON array...
            for element in elements:
                # ...add the element's text and vectors to the S3 vector index.
                # Use the following format:

                # {
                #     "key": "<unique ID>",
                #     "data": {
                #         "float32": <the element's embeddings>
                #     },
                #     "metadata": {
                #         "text": "<the element's text>"
                #     }
                # }
                json_object = {}
                json_object['key'] = str(uuid.uuid4())
                json_object['data'] = {}
                json_object['metadata'] = {}

                # If the element has no text, do not add it to the
                # vector index, and move on to the next element.
                if 'text' in element:
                    json_object['metadata']['text'] = element['text']
                else:
                    if verbose:
                        print(f"Skipping element with ID {element['element_id']} as it has no text")
                    continue

                # If the element has no embeddings, do not add it to the
                # vector index either, and move on to the next element.
                if 'embeddings' in element:
                    json_object['data']['float32'] = element['embeddings']
                else:
                    if verbose:
                        print(f"Skipping element with ID {element['element_id']} as it has no embeddings")
                    continue

                # Add the element's entry to the vector index.
                s3vectors.put_vectors(
                    indexArn=index_arn,
                    vectors=[json_object]
                )

                if verbose:
                    print(f"Added a vector entry and assigned it the internal ID {json_object['key']}. " +
                          f"First 20 characters of the text: {element['text'][:20]}")

                num_vectors += 1

    print(f"Added {num_vectors} vector entries to the vector index.")
    ```

4. Run the script to add the JSON output files' contents to the vector index. Each element in each JSON output file
    is added as a vector entry in the vector index.

## Step 4: Query the vector index

1. In your local Python virtual environment, install the `numpy` library.
2. Add the following code to another Python script file in your virtual environment, replacing the following placeholders:

    - Replace `<index-arn>` with the ARN of the vector index that you created previously in Step 2.
    - Replace `<region-short-id>` with the short ID of the region where your vector index is located, for example `us-east-1`.
    - Replace `<query-text>` with the search text that you want to embed for the query.

    This code uses Amazon Bedrock to generate an embedding for your search text. If the Bedrock call fails with an access error, make sure that
    your AWS account has access to the **Titan Text Embeddings V2** (`amazon.titan-embed-text-v2:0`) model in Amazon Bedrock in that region.

    ```python
    import boto3
    import json
    import numpy as np

    index_arn = '<index-arn>'
    index_region_short_id = '<region-short-id>'
    client = boto3.client('s3vectors', region_name=index_region_short_id)

    # The sentence to embed.
    sentence = '<query-text>'

    # Generate embeddings for the sentence to embed.
    model_id = 'amazon.titan-embed-text-v2:0'
    bedrock = boto3.client('bedrock-runtime', region_name=index_region_short_id)
    body = {'inputText': sentence}
    json_string = json.dumps(body)
    json_bytes = json_string.encode('utf-8')

    response = bedrock.invoke_model(
        modelId=model_id,
        body=json_bytes,
        contentType='application/json',
        accept='application/json'
    )

    # Get the embeddings for the sentence and prepare them for the query.
    response_body = json.loads(response['body'].read().decode())
    embedding = response_body['embedding']
    embedding = np.array(embedding, dtype=np.float32).tolist()

    # Run the query.
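    # topK controls how many of the closest vector entries the query returns.
    # The query embedding must have the same number of dimensions as the
    # vector index that you created in Step 2 (here, 1024).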
    query_response = client.query_vectors(
        indexArn=index_arn,
        topK=5,
        queryVector={'float32': embedding}
    )

    print(f"Original search query: {sentence}")
    print("\nTop 5 results by similarity search...")
    print('-----')

    # For each matching vector entry, fetch its stored metadata, including the text.
    for match in query_response['vectors']:
        get_response = client.get_vectors(
            indexArn=index_arn,
            keys=[match['key']],
            returnData=True,
            returnMetadata=True
        )

        for vector in get_response['vectors']:
            print(vector['metadata']['text'])
            print('-----')
    ```

3. Run the script to query the vector index and see the query results.

## Appendix: Additional operations

Use the following code examples to perform additional vector index and vector bucket operations.

### List all entries in a vector index

Replace the following placeholders:

- Replace `<index-arn>` with the ARN of the vector index that you created earlier in Step 2.
- Replace `<region-short-id>` with the short ID of the region where your vector index is located, for example `us-east-1`.

```python
import boto3

index_arn = '<index-arn>'
index_region_short_id = '<region-short-id>'
client = boto3.client('s3vectors', region_name=index_region_short_id)
num_vectors = 0
next_token = None
verbose = True # Set to False to only print final results.

# List all of the vectors in the S3 vector index.
# Vectors are fetched by "page", so loop through the index in pages.
while True:
    kwargs = {
        'indexArn': index_arn,
        'returnData': True,
        'returnMetadata': True
    }

    if next_token:
        kwargs['nextToken'] = next_token

    response = client.list_vectors(**kwargs)

    for vector in response['vectors']:
        if verbose:
            print(f"Found vector entry with internal ID {vector['key']}. " +
                  f"First 20 characters of the text: {vector['metadata']['text'][:20]}")
        num_vectors += 1

    if 'nextToken' in response:
        next_token = response['nextToken']
    else:
        break

print(f"Total number of vector entries found: {num_vectors}")
```

### Delete all entries from a vector index

<Warning>
    This operation will permanently delete all vector entries in the vector index. This operation cannot be undone.
</Warning>

Replace the following placeholders:

- Replace `<index-arn>` with the ARN of the vector index that you created earlier in Step 2.
- Replace `<region-short-id>` with the short ID of the region where your vector index is located, for example `us-east-1`.

```python
import boto3

index_arn = '<index-arn>'
index_region_short_id = '<region-short-id>'
client = boto3.client('s3vectors', region_name=index_region_short_id)
num_vectors = 0
next_token = None
verbose = True # Set to False to only print final results.

# Delete all of the vectors in the S3 vector index.
# Vectors are deleted by "page", so loop through the index in pages.
while True:
    kwargs = {
        'indexArn': index_arn,
        'returnData': True,
        'returnMetadata': True
    }

    if next_token:
        kwargs['nextToken'] = next_token

    response = client.list_vectors(**kwargs)

    # Delete each vector entry in this page by its key.
    for vector in response['vectors']:
        if verbose:
            print(f"Deleting vector entry with internal ID {vector['key']}. " +
                  f"First 20 characters of the text: {vector['metadata']['text'][:20]}")

        client.delete_vectors(
            indexArn=index_arn,
            keys=[vector['key']]
        )

        num_vectors += 1

    if 'nextToken' in response:
        next_token = response['nextToken']
    else:
        break

print(f"Deleted {num_vectors} vector entries from the vector index.")
```

### Delete a vector index

<Warning>
    This operation will permanently delete a vector index. This operation cannot be undone.
</Warning>
Replace the following placeholders:

- Replace `<index-arn>` with the ARN of the vector index that you created earlier in Step 2.
- Replace `<region-short-id>` with the short ID of the region where your vector index is located, for example `us-east-1`.

```python
import boto3

index_arn = '<index-arn>'
index_region_short_id = '<region-short-id>'
client = boto3.client('s3vectors', region_name=index_region_short_id)

client.delete_index(
    indexArn=index_arn
)
```

### Delete a vector bucket

<Warning>
    This operation will permanently delete a vector bucket. This operation cannot be undone.
</Warning>

Replace the following placeholders:

- Replace `<bucket-arn>` with the ARN of the vector bucket that you created earlier in Step 1. To get the ARN, do the following:

    1. In the Amazon S3 console, on the sidebar, click **Vector buckets**.
    2. Next to the name of the vector bucket that you want to delete, click the copy button next to the bucket's **Amazon Resource Name (ARN)**.

- Replace `<region-short-id>` with the short ID of the region where your vector bucket is located, for example `us-east-1`.

```python
import boto3

bucket_arn = '<bucket-arn>'
bucket_region_short_id = '<region-short-id>'
client = boto3.client('s3vectors', region_name=bucket_region_short_id)

client.delete_vector_bucket(
    vectorBucketArn=bucket_arn
)
```
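Before you delete a vector bucket, you might first want to confirm which vector indexes it still contains. The following is a minimal sketch that assumes the S3 Vectors `list_indexes` operation accepts the vector bucket's name through a `vectorBucketName` parameter; check the boto3 `s3vectors` client reference for the exact operation and parameter names before relying on it. Replace `<bucket-name>` with the name (not the ARN) of your vector bucket, and `<region-short-id>` with the short ID of the bucket's region.

```python
import boto3

# Placeholders: the vector bucket's name and its region short ID.
bucket_name = '<bucket-name>'
bucket_region_short_id = '<region-short-id>'
client = boto3.client('s3vectors', region_name=bucket_region_short_id)

# Assumption: 'list_indexes' lists the vector indexes in the named bucket.
response = client.list_indexes(vectorBucketName=bucket_name)

# Print whatever the service returns for each index, without assuming its exact shape.
for index in response.get('indexes', []):
    print(index)
```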