# Cross-Account S3 Bucket Access from SageMaker

This notebook demonstrates how to access S3 buckets in different AWS accounts from a SageMaker notebook instance or SageMaker Studio.

## Understanding Cross-Account Access

SageMaker notebooks run with the permissions of an execution role. To access cross-account resources, you need to:

1. Have an IAM role in your account that is permitted to call `sts:AssumeRole` on roles in the target account
2. Have an IAM role in the target account whose trust policy allows your account's role to assume it
3. Explicitly assume the target account role in your code

Our infrastructure includes:
- `sagemaker_execution_role`: The role attached to your SageMaker instance
- `data_lake_access_role`: Role in dev_org that can assume a role in data_org
- `extrn_data_access_role`: Role in dev_org that can assume a role in data_org for external data
- Corresponding roles in data_org that can be assumed
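
The second requirement above is expressed as a trust policy attached to the role being assumed. As a rough sketch of what such a document looks like (the account ID `111111111111` is a placeholder, and `build_trust_policy` is a hypothetical helper, not part of this infrastructure):

```python
import json

def build_trust_policy(trusted_role_arn):
    """Return a trust policy document that lets one IAM role call sts:AssumeRole."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# Placeholder ARN for illustration only
example_arn = "arn:aws:iam::111111111111:role/data-science-data-lake-access"
trust_policy = build_trust_policy(example_arn)
print(json.dumps(trust_policy, indent=2))
```

The actual policies are managed outside this notebook; the snippet only illustrates the shape of the document.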

In [None]:
# Import required libraries
import boto3
import os
import json
import pandas as pd
from botocore.exceptions import ClientError

## Get Current SageMaker Role ARN

First, let's identify which role we're currently using.

In [None]:
# Get current SageMaker execution role ARN
import sagemaker

role = sagemaker.get_execution_role()
print(f"Current SageMaker execution role: {role}")

# Get current AWS account ID
sts = boto3.client('sts')
account_id = sts.get_caller_identity()["Account"]
print(f"Current AWS account ID: {account_id}")

## Method 1: Assume Role and Create New Session

This is the most common approach: assume the cross-account role and create a new boto3 session from the temporary credentials.

In [None]:
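# Define the roles we can assume
# These roles live in the current (dev_org) account, so the ARNs use account_id from above
# Replace the role names with the actual values from your terraform outputs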
data_lake_access_role_arn = f"arn:aws:iam::{account_id}:role/data-science-data-lake-access"
extrn_data_access_role_arn = f"arn:aws:iam::{account_id}:role/data-science-extrn-data-access"

# Function to assume a role and get temporary credentials
def assume_role(role_arn, session_name="AssumeRoleSession"):
    sts_client = boto3.client('sts')
    try:
        response = sts_client.assume_role(
            RoleArn=role_arn,
            RoleSessionName=session_name
        )
        return response['Credentials']
    except ClientError as e:
        print(f"Error assuming role {role_arn}: {e}")
        return None

In [None]:
# Assume data lake access role and create a session
def get_data_lake_session():
    credentials = assume_role(data_lake_access_role_arn, "DataLakeAccessSession")
    if credentials:
        session = boto3.Session(
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
        return session
    return None

# Create a session with data lake access permissions
data_lake_session = get_data_lake_session()
if data_lake_session:
    print("Successfully created data lake access session")
    # Create an S3 client using the new session
    s3_data_lake = data_lake_session.client('s3')
else:
    print("Failed to create data lake access session")
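
To confirm that a session is really running under the assumed role, you can call `get_caller_identity()` on an STS client created from that session and inspect the returned `Arn`. For assumed roles it has the form `arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>`. The parser below is a sketch; `parse_assumed_role_arn` is a hypothetical helper and the sample ARN is made up.

```python
def parse_assumed_role_arn(arn):
    """Split an assumed-role ARN into account ID, role name, and session name.

    Returns None if the ARN does not describe an assumed-role session.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "sts":
        return None
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "assumed-role":
        return None
    return {
        "account_id": parts[4],
        "role_name": resource[1],
        "session_name": resource[2],
    }

# Made-up ARN for illustration only
sample_arn = (
    "arn:aws:sts::111111111111:assumed-role/"
    "data-science-data-lake-access/DataLakeAccessSession"
)
identity = parse_assumed_role_arn(sample_arn)
print(identity)

# In the notebook, the ARN would come from the assumed-role session, e.g.:
# arn = data_lake_session.client("sts").get_caller_identity()["Arn"]
```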

## Access Data Lake Processed Bucket

Now we can access the processed data lake bucket in the data_org account using our assumed role.

In [None]:
# Example: List objects in the processed data lake bucket
data_lake_processed_bucket = "asl-dataset-data-lake-processed-asl-dataset-00"

try:
    if data_lake_session:
        # Use the session with assumed role permissions
        response = s3_data_lake.list_objects_v2(
            Bucket=data_lake_processed_bucket,
            MaxKeys=10
        )
        
        if 'Contents' in response:
            print(f"Objects in {data_lake_processed_bucket}:")
            for obj in response['Contents']:
                print(f"- {obj['Key']} ({obj['Size']} bytes)")
        else:
            print(f"No objects found in {data_lake_processed_bucket}")
except Exception as e:
    print(f"Error accessing data lake bucket: {e}")

## Access External Training Data Bucket

Similarly, we can access the external training data bucket by assuming the appropriate role.

In [None]:
# Assume external data access role
def get_extrn_data_session():
    credentials = assume_role(extrn_data_access_role_arn, "ExtrnDataAccessSession")
    if credentials:
        session = boto3.Session(
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
        return session
    return None

# Create a session with external data access permissions
extrn_data_session = get_extrn_data_session()
if extrn_data_session:
    print("Successfully created external data access session")
    # Create an S3 client using this session
    s3_extrn_data = extrn_data_session.client('s3')
else:
    print("Failed to create external data access session")

In [None]:
# Example: List objects in the external training data bucket
extrn_training_data_bucket = "asl-dataset-extrn-raw-training-asl-dataset-00"

try:
    if extrn_data_session:
        response = s3_extrn_data.list_objects_v2(
            Bucket=extrn_training_data_bucket,
            MaxKeys=10
        )
        
        if 'Contents' in response:
            print(f"Objects in {extrn_training_data_bucket}:")
            for obj in response['Contents']:
                print(f"- {obj['Key']} ({obj['Size']} bytes)")
        else:
            print(f"No objects found in {extrn_training_data_bucket}")
except Exception as e:
    print(f"Error accessing external training data bucket: {e}")

## Method 2: Using a Boto3 Resource with Assumed Role

Another approach is to skip the session and pass the temporary credentials directly to `boto3.resource` to create an S3 resource for the assumed role.

In [None]:
# Create S3 resource with data lake access role
def get_s3_resource_with_role(role_arn):
    credentials = assume_role(role_arn)
    if credentials:
        s3_resource = boto3.resource(
            's3',
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
        return s3_resource
    return None

# Get S3 resource with data lake access
s3_data_lake_resource = get_s3_resource_with_role(data_lake_access_role_arn)
if s3_data_lake_resource:
    print("Successfully created S3 resource with data lake access")

In [None]:
# Example: Download a CSV file from the data lake and load it into a pandas DataFrame
def read_csv_from_s3(bucket, key, s3_resource=None):
    try:
        if s3_resource is None:
            s3_resource = boto3.resource('s3')  # Uses default credentials
            
        obj = s3_resource.Object(bucket, key)
        # Download file to local temporary file
        local_file = f"/tmp/{os.path.basename(key)}"
        obj.download_file(local_file)
        
        # Read into pandas
        df = pd.read_csv(local_file)
        return df
    except Exception as e:
        print(f"Error reading CSV from S3: {e}")
        return None

# Example usage (replace with an actual CSV path)
csv_path = "sample-dir/sample-data.csv"  # Replace with actual path
# Uncomment when you have an actual file to read
# df = read_csv_from_s3(data_lake_processed_bucket, csv_path, s3_data_lake_resource)
# if df is not None:
#     print(f"Successfully loaded data with shape {df.shape}")
#     display(df.head())
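
`read_csv_from_s3` writes to `/tmp/<file name of key>`, so two keys that share a file name under different prefixes would overwrite each other locally. One way around this, sketched below with standard-library calls only, is to download into a fresh temporary directory per call. The helper name `unique_local_path` is hypothetical.

```python
import os
import tempfile

def unique_local_path(key, base_dir=None):
    """Return a path in a newly created temp directory that keeps the key's file name."""
    target_dir = tempfile.mkdtemp(prefix="s3-download-", dir=base_dir)
    return os.path.join(target_dir, os.path.basename(key))

# Two keys with the same file name map to different local paths
path_a = unique_local_path("2023/sample-data.csv")
path_b = unique_local_path("2024/sample-data.csv")
print(path_a)
print(path_b)
```

The returned path can be passed to `obj.download_file(...)` in place of the fixed `/tmp` path. If disk space matters, remove the directory afterwards, for example with `shutil.rmtree`.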

## Creating a Helper Function for Data Scientists

To simplify the process, you can create helper functions that data scientists can use without needing to understand the details of cross-account access.

In [None]:
def get_data_from_data_org(data_type="processed", prefix="", limit=100):
    """Helper function to access data in data_org buckets
    
    Args:
        data_type (str): Type of data to access - 'processed', 'external', or 'captured'
        prefix (str): S3 prefix/folder to list
        limit (int): Maximum number of objects to list
        
    Returns:
        list: List of S3 object information dictionaries
    """
    # Define bucket and role based on data type
    if data_type == "processed":
        bucket = "asl-dataset-data-lake-processed-asl-dataset-00"
        role_arn = data_lake_access_role_arn
    elif data_type == "external":
        bucket = "asl-dataset-extrn-raw-training-asl-dataset-00"
        role_arn = extrn_data_access_role_arn
    elif data_type == "captured":
        bucket = "asl-dataset-captured-raw-training-asl-dataset-00"
        role_arn = extrn_data_access_role_arn
    else:
        print(f"Unknown data type: {data_type}")
        return None
    
    # Assume the appropriate role
    credentials = assume_role(role_arn)
    if not credentials:
        print(f"Failed to assume role for {data_type} data")
        return None
    
    # Create S3 client with temporary credentials
    s3 = boto3.client(
        's3',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken']
    )
    
    # List objects
    try:
        if prefix:
            response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=limit)
        else:
            response = s3.list_objects_v2(Bucket=bucket, MaxKeys=limit)
            
        if 'Contents' in response:
            return response['Contents']
        else:
            print(f"No objects found in {bucket} with prefix '{prefix}'")
            return []
    except Exception as e:
        print(f"Error accessing {data_type} data: {e}")
        return None
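
A single `list_objects_v2` call returns at most 1,000 keys, so a `limit` above that is capped without warning. A boto3 paginator follows the continuation tokens for you. The sketch below assumes `s3_client` is any S3 client (for example one created from an assumed-role session); `list_keys` is a hypothetical helper name.

```python
def list_keys(s3_client, bucket, prefix="", limit=None):
    """Collect object summaries across pages, stopping after `limit` items if given."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pagination_config = {"MaxItems": limit} if limit is not None else {}
    pages = paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        PaginationConfig=pagination_config,
    )
    objects = []
    for page in pages:
        # Pages for empty listings have no 'Contents' key
        objects.extend(page.get("Contents", []))
    return objects

# Example call (requires a real S3 client and bucket):
# objects = list_keys(s3_data_lake, data_lake_processed_bucket, limit=5000)
```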

In [None]:
# Example usage of the helper function
processed_data = get_data_from_data_org("processed")
if processed_data:
    print(f"Found {len(processed_data)} objects in the processed data lake")
    # Show first few items
    for item in processed_data[:3]:
        print(f"- {item['Key']} ({item['Size']} bytes)")

## Creating a High-Level Data Access API

For even more convenience, you can create a class that handles all the cross-account access details.

In [None]:
class DataOrgAccess:
    """Class to simplify access to data in data_org buckets"""
    
    def __init__(self):
        self.account_id = boto3.client('sts').get_caller_identity()["Account"]
        self.data_lake_access_role_arn = f"arn:aws:iam::{self.account_id}:role/data-science-data-lake-access"
        self.extrn_data_access_role_arn = f"arn:aws:iam::{self.account_id}:role/data-science-extrn-data-access"
        
        # Define buckets
        self.processed_bucket = "asl-dataset-data-lake-processed-asl-dataset-00"
        self.external_bucket = "asl-dataset-extrn-raw-training-asl-dataset-00"
        self.captured_bucket = "asl-dataset-captured-raw-training-asl-dataset-00"
        
        # Sessions (created on demand)
        self._data_lake_session = None
        self._extrn_data_session = None
    
    def _assume_role(self, role_arn, session_name="AssumeRoleSession"):
        """Assume a role and get temporary credentials"""
        sts_client = boto3.client('sts')
        try:
            response = sts_client.assume_role(
                RoleArn=role_arn,
                RoleSessionName=session_name
            )
            return response['Credentials']
        except ClientError as e:
            print(f"Error assuming role {role_arn}: {e}")
            return None
    
    @property
    def data_lake_session(self):
        """Get or create a session with data lake access"""
        if self._data_lake_session is None:
            credentials = self._assume_role(self.data_lake_access_role_arn, "DataLakeAccessSession")
            if credentials:
                self._data_lake_session = boto3.Session(
                    aws_access_key_id=credentials['AccessKeyId'],
                    aws_secret_access_key=credentials['SecretAccessKey'],
                    aws_session_token=credentials['SessionToken']
                )
        return self._data_lake_session
    
    @property
    def extrn_data_session(self):
        """Get or create a session with external data access"""
        if self._extrn_data_session is None:
            credentials = self._assume_role(self.extrn_data_access_role_arn, "ExtrnDataAccessSession")
            if credentials:
                self._extrn_data_session = boto3.Session(
                    aws_access_key_id=credentials['AccessKeyId'],
                    aws_secret_access_key=credentials['SecretAccessKey'],
                    aws_session_token=credentials['SessionToken']
                )
        return self._extrn_data_session
    
    def list_processed_data(self, prefix="", limit=100):
        """List objects in the processed data lake bucket"""
        if self.data_lake_session is None:
            return None
        
        s3 = self.data_lake_session.client('s3')
        try:
            if prefix:
                response = s3.list_objects_v2(Bucket=self.processed_bucket, Prefix=prefix, MaxKeys=limit)
            else:
                response = s3.list_objects_v2(Bucket=self.processed_bucket, MaxKeys=limit)
                
            return response.get('Contents', [])
        except Exception as e:
            print(f"Error listing processed data: {e}")
            return None
    
    def list_external_data(self, prefix="", limit=100):
        """List objects in the external training data bucket"""
        if self.extrn_data_session is None:
            return None
        
        s3 = self.extrn_data_session.client('s3')
        try:
            if prefix:
                response = s3.list_objects_v2(Bucket=self.external_bucket, Prefix=prefix, MaxKeys=limit)
            else:
                response = s3.list_objects_v2(Bucket=self.external_bucket, MaxKeys=limit)
                
            return response.get('Contents', [])
        except Exception as e:
            print(f"Error listing external data: {e}")
            return None
    
    def read_csv(self, bucket_type, key):
        """Read a CSV file from one of the data_org buckets
        
        Args:
            bucket_type (str): 'processed', 'external', or 'captured'
            key (str): S3 object key (path to the CSV file)
            
        Returns:
            pandas.DataFrame or None
        """
        if bucket_type == "processed":
            bucket = self.processed_bucket
            session = self.data_lake_session
        elif bucket_type in ["external", "captured"]:
            bucket = self.external_bucket if bucket_type == "external" else self.captured_bucket
            session = self.extrn_data_session
        else:
            print(f"Unknown bucket type: {bucket_type}")
            return None
        
        if session is None:
            return None
        
        try:
            s3 = session.client('s3')
            local_file = f"/tmp/{os.path.basename(key)}"
            s3.download_file(bucket, key, local_file)
            return pd.read_csv(local_file)
        except Exception as e:
            print(f"Error reading CSV: {e}")
            return None

In [None]:
# Example usage of the DataOrgAccess class
data_access = DataOrgAccess()

# List processed data
processed_data = data_access.list_processed_data(limit=5)
if processed_data:
    print(f"Found {len(processed_data)} objects in processed data")
    for item in processed_data[:3]:
        print(f"- {item['Key']} ({item['Size']} bytes)")

# List external data
external_data = data_access.list_external_data(limit=5)
if external_data:
    print(f"\nFound {len(external_data)} objects in external data")
    for item in external_data[:3]:
        print(f"- {item['Key']} ({item['Size']} bytes)")

## Conclusion

SageMaker does not automatically assume cross-account roles. You need to:

1. Explicitly assume roles in your notebook code using `boto3.client('sts').assume_role()`
2. Create sessions or clients with the temporary credentials
3. Use those sessions when accessing cross-account resources

The helper functions and class above wrap these steps so that data scientists can read cross-account data without handling the IAM and role-assumption details themselves.