# Accessing AWS Open Data Using Boto3
This notebook demonstrates how to access public datasets in the AWS Open Data program using the Boto3 SDK in Python. We'll use the `human-pangenomics` dataset as an example and explore how to list contents, navigate folders, and download files programmatically.

This notebook is designed to run in **Amazon SageMaker Studio**. You can also run it in [SageMaker Studio Lab](https://studiolab.sagemaker.aws/) if you don't have an AWS account.

## List Top-Level Contents of the Bucket
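When working with AWS Open Data, one of the first tasks is to browse the contents of an S3 bucket. Boto3 can do this much as `aws s3 ls` does in the AWS CLI. Because the bucket is public, the client below is configured with `signature_version=UNSIGNED`, so requests are sent anonymously and no AWS credentials are needed. The example calls `list_objects_v2` with `Delimiter='/'`, which groups keys by their first path segment: the top-level "folders" are returned under `CommonPrefixes`, and objects stored directly at the bucket root are returned under `Contents`.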

In [None]:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

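# Unsigned (anonymous) client: public Open Data buckets need no credentials
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = "human-pangenomics"

# Delimiter='/' limits the listing to the top level of the bucket
response = s3.list_objects_v2(Bucket=bucket_name, Delimiter='/')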

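# Top-level "folders" (common prefixes)
for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'])

# Objects stored directly at the bucket root
for obj in response.get('Contents', []):
    print(obj['Key'])

# Report an empty listing only when neither folders nor objects were returned
if 'CommonPrefixes' not in response and 'Contents' not in response:
    print("No objects found in the bucket.")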
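
A single `list_objects_v2` call returns at most 1,000 entries, so a large level of the bucket may be cut short. The sketch below shows one way to collect a complete level with Boto3's `list_objects_v2` paginator. The helper name `list_level` and its `client` parameter are illustrative assumptions, not part of the original notebook.

```python
def list_level(client, bucket, prefix=""):
    """Collect every sub-prefix and object key directly under `prefix`.

    `client` is assumed to be a Boto3 S3 client such as the `s3` object
    created earlier in this notebook (hypothetical helper, for illustration).
    """
    prefixes, keys = [], []
    paginator = client.get_paginator("list_objects_v2")
    # The paginator follows continuation tokens until the listing is complete
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        prefixes.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))
        keys.extend(o["Key"] for o in page.get("Contents", []))
    return prefixes, keys
```

In this notebook, `list_level(s3, bucket_name)` would return the top-level folders and files as two lists, regardless of how many entries the level holds.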

## Navigate to a Specific Folder
Once you know a bucket's top-level layout, you may want to explore a specific folder to locate the files you need. Adding a `Prefix` to the request restricts the results to keys that begin with that prefix, and keeping `Delimiter='/'` limits the listing to one level below it. This mirrors listing a folder with the AWS CLI. Let's explore the contents of the `pangenomes/` folder.

In [None]:
folder_prefix = "pangenomes/"
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix, Delimiter='/')

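# Subfolders directly under the prefix
for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'])

# Objects stored directly under the prefix
for obj in response.get('Contents', []):
    print(obj['Key'])

# Report an empty listing only when neither subfolders nor objects were returned
if 'CommonPrefixes' not in response and 'Contents' not in response:
    print("No objects found in the folder.")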
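
Each entry under `Contents` carries more than the key: it also includes the object's `Size` in bytes and its `LastModified` timestamp, which help when deciding what to download. The following is a minimal sketch of how a response could be summarized. The helper names `human_size` and `describe_objects` are hypothetical and introduced only for illustration.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def describe_objects(response):
    """Return (key, readable size, last-modified) tuples from a listing response."""
    return [
        (obj["Key"], human_size(obj["Size"]), obj["LastModified"])
        for obj in response.get("Contents", [])
    ]
```

Passing the `response` from the cell above to `describe_objects` would give one tuple per file in the folder. Subfolders under `CommonPrefixes` have no size of their own and are not included.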

## Download a Single File
After identifying the files you need, the next step is to download them to your local environment. Boto3's `download_file` method retrieves a single object from an S3 bucket, taking the bucket name, the object key, and the local file name to write to, much like copying a file with the AWS CLI. Let's download a single README file from the dataset.

In [None]:
file_key = "pangenomes/scratch/2021_03_22_minigraph/00README.txt"
local_file_name = "00README.txt"
s3.download_file(bucket_name, file_key, local_file_name)
print(f"File {local_file_name} downloaded successfully.")
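
When a notebook is re-run, files may already be present locally. One possible approach, sketched below, is to compare the remote size reported by `head_object` (its `ContentLength` field) with the local file size and skip the download when they match. The helper `download_if_missing` is a hypothetical name, the size comparison is only a cheap heuristic rather than an integrity check, and the sketch assumes the bucket allows anonymous `head_object` requests.

```python
import os


def download_if_missing(client, bucket, key, local_path):
    """Download `key` unless a local file of the same size already exists.

    Returns True if a download happened, False if it was skipped.
    `client` is assumed to be a Boto3 S3 client (hypothetical helper).
    """
    # head_object fetches metadata only, not the object body
    remote_size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if os.path.isfile(local_path) and os.path.getsize(local_path) == remote_size:
        return False
    client.download_file(bucket, key, local_path)
    return True
```

In this notebook it could be called as `download_if_missing(s3, bucket_name, file_key, local_file_name)`.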

## Download an Entire Folder Recursively
When you need many files, downloading an entire folder in one step is more convenient than requesting files one by one. Unlike the AWS CLI, which has a built-in `--recursive` flag, Boto3 requires you to iterate over the folder's contents and download each file yourself. The example below uses a paginator, since a single `list_objects_v2` call returns at most 1,000 keys, to retrieve all objects within the `working/T2T/CHM13/paper/Nurk_2021/fig3/` folder.

In [None]:
import os

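def download_folder(bucket_name, folder_prefix, local_directory):
    """Download every object under folder_prefix, preserving the relative layout."""
    # The paginator follows continuation tokens, so listings over 1,000 keys are covered
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=folder_prefix):
        for obj in page.get('Contents', []):
            file_key = obj['Key']
            # Skip zero-byte "folder" placeholder keys, which have no file to write
            if file_key.endswith('/'):
                continue
            local_path = os.path.join(local_directory, os.path.relpath(file_key, folder_prefix))
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket_name, file_key, local_path)
            print(f"Downloaded {file_key} to {local_path}")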

folder_prefix = "working/T2T/CHM13/paper/Nurk_2021/fig3/"
local_directory = "./copied_folder"
download_folder(bucket_name, folder_prefix, local_directory)
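
Open Data folders can be large, so it may be worth checking how much data a prefix holds before starting a recursive download. The sketch below sums the `Size` field of every object under a prefix without downloading anything. The helper name `summarize_prefix` and its `client` parameter are assumptions made for this example.

```python
def summarize_prefix(client, bucket, prefix):
    """Return (object_count, total_bytes) for all objects under `prefix`.

    `client` is assumed to be a Boto3 S3 client such as the notebook's `s3`
    (hypothetical helper, for illustration).
    """
    count, total = 0, 0
    paginator = client.get_paginator("list_objects_v2")
    # No Delimiter here, so the listing descends into every nested "folder"
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # ignore zero-byte "folder" placeholder keys
            count += 1
            total += obj["Size"]
    return count, total
```

For example, `summarize_prefix(s3, bucket_name, folder_prefix)` would report the number of files and total bytes that `download_folder` is about to fetch.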