# Amazon S3, AWS Glue, and Amazon Redshift Serverless Data Ingestion

This notebook provides a step-by-step guide to ingesting an open dataset into Amazon S3, configuring AWS Glue tables, and setting up an Amazon Redshift Serverless namespace and workgroup. You can use this setup to test the structured data retrieval capability of Amazon Bedrock Knowledge Bases, with Amazon S3 and AWS Glue as the data store and Amazon Redshift Serverless as the query engine.

## Steps

1. Download the dataset and load it into an S3 bucket
2. Create AWS Glue Data Catalog tables using an AWS Glue crawler
3. Create an Amazon Redshift Serverless namespace and workgroup
4. Validate the data ingestion setup


### Setup Environment and Install Required Libraries

This step installs the required libraries and imports them.

In [None]:
# Install required libraries
!pip install --upgrade pip -q --no-color
!pip install tabulate pandas -q --no-color
!pip install boto3 -q --no-color
!pip install awswrangler -q --no-color
!pip install retrying -q --no-color
!pip install awscli -q --no-color

In [None]:
# Import necessary libraries
import pandas as pd
import boto3
import awswrangler as wr
import time
from botocore.exceptions import ClientError
import sys
import json
from pathlib import Path

The following code completes the setup:
- Adds the `features-examples` directory to the Python system path
- Imports the helper functions `create_glue_crawler` and `create_redshift_workgroup` from the `utils.structured_knowledge_base` module, which later steps use

In [None]:
# Set the path to import utils module
current_path = Path().resolve()
current_path = str(current_path.parent.parent)  + "/features-examples"
if str(current_path) not in sys.path:
    sys.path.append(str(current_path))
# Print sys.path to verify
print(sys.path)
from utils.structured_knowledge_base import create_glue_crawler,create_redshift_workgroup

### Step 1: Download E-commerce Dataset from Kaggle

This step downloads a free public dataset from Kaggle into a local folder and loads it into a pandas DataFrame for a preview:
- Reads the CSV file into a pandas DataFrame
- Displays the first few rows of data
- Shows dataset information, including dtypes and null counts

In [None]:
# Create directories and download dataset

# Download the dataset using curl
!curl -L -o e-commerce-dataset.zip \
  "https://www.kaggle.com/api/v1/datasets/download/steve1215rogg/e-commerce-dataset" 

# Unzip the downloaded file
!unzip e-commerce-dataset.zip -d e-commerce-data
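
If the `unzip` command is not available in the notebook environment, the extraction can also be done with the standard library. The following is a minimal sketch; `extract_csvs` is a hypothetical helper name and is not part of the notebook's `utils` module.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir):
    """Extract a zip archive and return the paths of the CSV files it contained."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Keep only the CSV members, as paths inside the destination folder
    return sorted(dest / name for name in names if name.lower().endswith(".csv"))
```

For this notebook, the call would be `extract_csvs("e-commerce-dataset.zip", "e-commerce-data")`, and the returned list should contain the CSV file that the next cell reads.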

In [None]:
# Read and preview the dataset
df = pd.read_csv('e-commerce-data/ecommerce_dataset_updated.csv')
print("Dataset Preview:")
print(df.head())
print("\nDataset Info:")
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

### Step 3: Configure AWS, IAM, and Redshift

Sets up AWS connectivity:
- Configures a boto3 session with the region
- Creates the S3 and STS clients and looks up the account ID
- Generates a timestamp-based suffix for resource naming

In [None]:
# Configure AWS session
region_name='us-east-1'
session = boto3.Session(region_name=region_name)
s3 = session.client('s3')
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()["Account"]

# Get the current timestamp
current_time = time.time()
# Format the timestamp as a string
timestamp_str = time.strftime("%Y%m%d%H%M%S", time.localtime(current_time))[-7:]
# Create the suffix using the timestamp
suffix = f"{timestamp_str}"
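
The suffix above keeps only the last seven digits of the timestamp (the last digit of the day plus `HHMMSS`), so it makes names unlikely to clash but not strictly unique. The sketch below reproduces that logic as a function and adds a simplified check of the S3 bucket naming rules. `make_suffix` and `is_valid_bucket_name` are hypothetical names used for illustration only, and the regular expression covers only the basic length and character rules.

```python
import re
import time


def make_suffix(now=None, length=7):
    """Return the last `length` digits of a YYYYmmddHHMMSS timestamp."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    return stamp[-length:]


# Simplified form of the S3 general purpose bucket naming rules:
# 3-63 characters, lowercase letters, digits, and hyphens,
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name):
    """Return True if `name` passes the simplified naming check."""
    return bool(_BUCKET_RE.match(name))


candidate = f"my-ecommerce-data-bucket-{make_suffix()}"
print(candidate, is_valid_bucket_name(candidate))
```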

### Step 4: Create S3 Bucket

This step handles S3 storage setup:
1. Creates a new S3 bucket
2. Uploads the DataFrame to S3 as a CSV file using AWS SDK for pandas (`awswrangler`)

In [None]:
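# Create S3 bucket
bucket_name = f"my-ecommerce-data-bucket-{suffix}"
if region_name == 'us-east-1':
    # us-east-1 is the default location and must not be passed as a LocationConstraint
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region_name}
    )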

# Upload data using AWS Wrangler
wr.s3.to_csv(
    df=df,
    path=f's3://{bucket_name}/ecommerce/data.csv',
    index=False
)
print(bucket_name)
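
Before moving on to the crawler, it can be useful to confirm that the object actually landed under the expected prefix. The following is a minimal sketch using the S3 `list_objects_v2` call; `list_keys` is a hypothetical helper name, and it reads only the first page of results, which is enough for the single file uploaded here.

```python
def list_keys(s3_client, bucket, prefix):
    """Return the object keys under `prefix` (first page only, up to 1,000 keys)."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when nothing matches the prefix
    return [obj["Key"] for obj in response.get("Contents", [])]
```

With the variables defined above, `list_keys(s3, bucket_name, "ecommerce/")` should include `ecommerce/data.csv`.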

In [None]:
# Optional: uncomment to reuse an existing bucket instead of the one created above
#bucket_name = "my-ecommerce-data-bucket-8214019"

### Step 5: Create AWS Glue Crawler and Tables

This step creates an AWS Glue crawler for the S3 dataset:
1. Creates an IAM role for the AWS Glue crawler, using the AWS CLI, with the necessary permissions on the S3 source bucket
2. Runs the AWS Glue crawler to create the AWS Glue Data Catalog database `ecommerce` and table `ecommerce`
3. After the crawler completes successfully, lists the tables to validate table creation

In [None]:
# Define your variables
crawler_name = "ecommerce_crawler"
database_name = "ecommerce"
s3_path = f"s3://{bucket_name}/ecommerce/"
role_name = f"GlueCrawlerRole-{suffix}"
glue_policy_name = f"GlueS3Access-{suffix}"
print(s3_path)
print(role_name)
print(glue_policy_name)

In [None]:
result = create_glue_crawler(
    crawler_name=crawler_name,
    database_name=database_name,
    bucket_name=bucket_name,
    s3_path=s3_path,
    role_name=role_name,
    glue_policy_name=glue_policy_name,
    account_id=account_id
)

print("Crawler creation and start process completed!")
print(f"Created resources: {json.dumps(result, indent=2)}")
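
The source of `create_glue_crawler` is not shown in this notebook, so the exact IAM documents it creates are unknown. As an illustration of what a crawler role generally needs, the sketch below builds a trust policy that lets the AWS Glue service assume the role and a read-only policy scoped to one bucket. The function names and the bucket name are hypothetical, and the helper may grant different or broader permissions.

```python
import json


def glue_trust_policy():
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def s3_read_policy(bucket):
    """Read-only access to one bucket (assumed scope; the real helper may differ)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself
            {"Effect": "Allow", "Action": ["s3:ListBucket"],
             "Resource": f"arn:aws:s3:::{bucket}"},
            # Reading applies to the objects inside it
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": f"arn:aws:s3:::{bucket}/*"},
        ],
    }


# Hypothetical bucket name, for illustration only
print(json.dumps(s3_read_policy("my-ecommerce-data-bucket-example"), indent=2))
```

A crawler role typically also needs AWS Glue permissions of its own, for example through the AWS managed policy `AWSGlueServiceRole`.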

In [None]:
# Monitor crawler status
def check_crawler_status(crawler_name):
    glue = boto3.client('glue')
    while True:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        state = crawler["State"]  # READY, RUNNING, or STOPPING
        print(f"Crawler status: {state}")

        if state == "READY":
            # Crawler.State never reports FAILED; the outcome is in LastCrawl.Status
            last_status = crawler.get("LastCrawl", {}).get("Status")
            if last_status == "FAILED":
                print("Crawler failed!")
            else:
                print("Crawler finished!")
            break

        time.sleep(30)

# Check status
check_crawler_status(crawler_name)

# Optionally, show tables created
print("\nTables created in database:")
!aws glue get-tables --database-name {database_name} --query "TableList[].Name" --output text
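
The monitoring loop above runs until the crawler reaches a final state, which can block the notebook indefinitely if something goes wrong. A minimal sketch of the same polling pattern with a timeout follows; `wait_for_state` is a hypothetical helper, and the default timeout and interval are arbitrary choices rather than recommended values.

```python
import time


def wait_for_state(fetch_state, target_states, timeout=900, interval=30, sleep=time.sleep):
    """Poll fetch_state() until it returns a value in target_states.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in target_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout} seconds")
        sleep(interval)
```

For the crawler, `fetch_state` could be `lambda: boto3.client("glue").get_crawler(Name=crawler_name)["Crawler"]["State"]` with `target_states={"READY"}`. The `sleep` parameter exists only so the wait can be replaced in tests.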

### Step 6: Create Redshift Serverless Namespace and Workgroup

[You can skip this step if you are using an existing Redshift cluster.]

This step creates a Redshift Serverless namespace and workgroup:
1. Selects the default VPC, subnets, and security groups for the Serverless workgroup
2. Creates a new namespace if it does not exist
3. Creates a new workgroup

In [None]:
namespace_name = "my-namespace"
workgroup_name = "my-workgroup"

# Run the function
create_redshift_workgroup(namespace_name, workgroup_name)
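
The internals of `create_redshift_workgroup` are not shown here. As a rough illustration of the first step it is described as performing, the sketch below looks up the default VPC and returns its subnet IDs and default security group IDs using the EC2 `describe_vpcs`, `describe_subnets`, and `describe_security_groups` calls. `default_network_config` is a hypothetical name, and the real helper may select network resources differently.

```python
def default_network_config(ec2_client):
    """Return (subnet_ids, security_group_ids) for the default VPC in the client's region."""
    vpcs = ec2_client.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if not vpcs:
        raise RuntimeError("No default VPC found in this region")

    vpc_filter = [{"Name": "vpc-id", "Values": [vpcs[0]["VpcId"]]}]
    subnets = ec2_client.describe_subnets(Filters=vpc_filter)["Subnets"]
    # Restrict to the security group named "default" inside that VPC
    groups = ec2_client.describe_security_groups(
        Filters=vpc_filter + [{"Name": "group-name", "Values": ["default"]}]
    )["SecurityGroups"]
    return [s["SubnetId"] for s in subnets], [g["GroupId"] for g in groups]
```

Called as `default_network_config(session.client("ec2"))`, it returns two lists that have the shape expected by the `subnetIds` and `securityGroupIds` parameters of the Redshift Serverless `create_workgroup` API.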

### Step 7: Validate data ingestion

Once the Amazon Redshift Serverless workgroup has been created, run the following queries in the Amazon Redshift query editor to validate that the ecommerce data in the AWS Glue Data Catalog can be queried from the Amazon Redshift Serverless query engine:

-- Check that the Data Catalog auto mount is available in this Region; this should return "on"  
SHOW data_catalog_auto_mount;  
-- List the databases available in the AWS Glue Data Catalog  
SHOW SCHEMAS FROM DATABASE awsdatacatalog;  
-- List the tables accessible in the ecommerce database  
SHOW TABLES FROM SCHEMA awsdatacatalog.ecommerce;  
-- Query data from the table  
SELECT * FROM "awsdatacatalog".ecommerce.ecommerce LIMIT 10;  
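
The same checks can in principle be scripted from the notebook through the Redshift Data API instead of the query editor. The sketch below is one way to do that under some assumptions: `run_sql` is a hypothetical helper, the database name `dev` in the usage note is assumed to be the namespace's default database, and whether the calling identity can see `awsdatacatalog` depends on its IAM and Data Catalog permissions.

```python
import time


def run_sql(data_client, workgroup, database, sql, interval=2, sleep=time.sleep):
    """Run one statement through the Redshift Data API and return its rows."""
    statement_id = data_client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )["Id"]

    # The Data API is asynchronous, so poll until the statement settles
    while True:
        desc = data_client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {status}: {desc.get('Error')}")
        sleep(interval)

    if not desc.get("HasResultSet"):
        return []
    result = data_client.get_statement_result(Id=statement_id)  # first page only
    # Each field is a one-entry dict such as {"stringValue": "on"}; keep its value.
    # Note that NULLs arrive as {"isNull": True} and therefore show up as True.
    return [[next(iter(field.values())) for field in record]
            for record in result["Records"]]
```

A call such as `run_sql(boto3.client("redshift-data"), workgroup_name, "dev", "SHOW data_catalog_auto_mount;")` would then return the rows as a list of lists.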

## Clean Up Resources

Clean up the Redshift Serverless workgroup and namespace

In [None]:
# Configuration
namespace_name = "my-namespace"
workgroup_name = "my-workgroup"

def cleanup_redshift_serverless(namespace_name, workgroup_name):
    """Delete Redshift Serverless workgroup and namespace"""
    
    # Initialize boto3 client
    redshift_client = boto3.client('redshift-serverless')
    
    try:
        # Check workgroup status
        print("Checking workgroup status...")
        workgroup_status = get_workgroup_status(redshift_client, workgroup_name)
        print(f"Workgroup status: {workgroup_status}")

        # Delete workgroup if it exists
        if workgroup_status != "NONEXISTENT":
            print(f"Deleting workgroup '{workgroup_name}'...")
            try:
                redshift_client.delete_workgroup(workgroupName=workgroup_name)
                print("Workgroup deletion initiated")
                
                # Wait and verify deletion
                print("Waiting for workgroup deletion...")
                while True:
                    time.sleep(30)
                    status = get_workgroup_status(redshift_client, workgroup_name)
                    if status == "NONEXISTENT":
                        print("Workgroup deleted successfully")
                        break
                    print(f"Waiting for workgroup deletion... Current status: {status}")
                    
            except ClientError as e:
                print(f"Error deleting workgroup: {e}")
                return False
        else:
            print("Workgroup does not exist, skipping deletion")

        # Check namespace status  
        print("Checking namespace status...")
        namespace_status = get_namespace_status(redshift_client, namespace_name)
        print(f"Namespace status: {namespace_status}")

        # Delete namespace if it exists
        if namespace_status != "NONEXISTENT":
            print(f"Deleting namespace '{namespace_name}'...")
            try:
                redshift_client.delete_namespace(namespaceName=namespace_name)
                print("Namespace deletion initiated")
                
                # Wait and verify deletion
                print("Waiting for namespace deletion...")
                while True:
                    time.sleep(30)
                    status = get_namespace_status(redshift_client, namespace_name)
                    if status == "NONEXISTENT":
                        print("Namespace deleted successfully")
                        break
                    print(f"Waiting for namespace deletion... Current status: {status}")
                    
            except ClientError as e:
                print(f"Error deleting namespace: {e}")
                return False
        else:
            print("Namespace does not exist, skipping deletion")

        print("Cleanup completed!")
        return True
        
    except Exception as e:
        print(f"Unexpected error during cleanup: {e}")
        return False


def get_workgroup_status(client, workgroup_name):
    """Get workgroup status, return 'NONEXISTENT' if not found"""
    try:
        response = client.get_workgroup(workgroupName=workgroup_name)
        return response['workgroup']['status']
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            return "NONEXISTENT"
        else:
            print(f"Error checking workgroup status: {e}")
            return "ERROR"
    except Exception as e:
        print(f"Unexpected error checking workgroup: {e}")
        return "ERROR"


def get_namespace_status(client, namespace_name):
    """Get namespace status, return 'NONEXISTENT' if not found"""
    try:
        response = client.get_namespace(namespaceName=namespace_name)
        return response['namespace']['status']
    except ClientError as e:
        if e.response['Error']['Code'] == 'ResourceNotFoundException':
            return "NONEXISTENT"
        else:
            print(f"Error checking namespace status: {e}")
            return "ERROR"
    except Exception as e:
        print(f"Unexpected error checking namespace: {e}")
        return "ERROR"


# Execute the cleanup
if __name__ == "__main__":
    cleanup_redshift_serverless(namespace_name, workgroup_name)


Clean up the AWS Glue crawler and database (deleting the database also removes its tables)

In [None]:
glue_client = boto3.client('glue')
# 1. Delete crawler
print("Deleting crawler...")
try:
    glue_client.delete_crawler(Name=crawler_name)
    print(f"Crawler {crawler_name} deleted successfully")
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityNotFoundException':
        print("Crawler does not exist")
    else:
        print(f"Error deleting crawler: {e}")

# 2. Delete Glue database
print("Deleting Glue database...")
try:
    glue_client.delete_database(Name=database_name)
    print(f"Database {database_name} deleted successfully")
except ClientError as e:
    if e.response['Error']['Code'] == 'EntityNotFoundException':
        print("Database does not exist")
    else:
        print(f"Error deleting database: {e}")

print("Cleanup completed!")
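
The cell above removes the crawler and the database but leaves the IAM role and policy from Step 5 (`role_name` and `glue_policy_name`) in place. How `create_glue_crawler` attached that policy is not shown, so the following is only a sketch that handles both attached and inline policies. `delete_role_and_policies` is a hypothetical helper, and deleting a customer managed policy will fail if it is still attached elsewhere or has additional versions.

```python
def delete_role_and_policies(iam_client, role_name):
    """Detach or delete a role's policies, then delete the role itself."""
    attached = iam_client.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
    for policy in attached:
        arn = policy["PolicyArn"]
        iam_client.detach_role_policy(RoleName=role_name, PolicyArn=arn)
        # AWS managed policies (arn:aws:iam::aws:...) cannot be deleted, only detached
        if not arn.startswith("arn:aws:iam::aws:"):
            iam_client.delete_policy(PolicyArn=arn)

    # Inline policies live on the role and are removed by name
    for name in iam_client.list_role_policies(RoleName=role_name)["PolicyNames"]:
        iam_client.delete_role_policy(RoleName=role_name, PolicyName=name)

    iam_client.delete_role(RoleName=role_name)
```

It would be called as `delete_role_and_policies(boto3.client("iam"), role_name)` after the crawler has been deleted.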

Clean up the S3 bucket (`--force` deletes all objects in the bucket before removing it)

In [None]:
cleanup_command = f"""
aws s3 rb s3://{bucket_name} --force
"""

print(f"Cleaning up bucket {bucket_name}...")
!{cleanup_command}