# Goal

This notebook picks one of our medium-to-large Delta datasets and converts it to plain Parquet files, so that we can test data sharding with `ShardedByS3Key` in a SageMaker Processing Job.

The selected dataset is stored at `s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/`. It is one of our largest datasets, originally stored at `s3://ml-rd-ml-datasets/generateVectorEmbed/Qwen3-Embedding-0.6B/miracl/fr/vector_corpus/`, and has been repartitioned so that each underlying Parquet file is about 1 GB.
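
For context, here is a minimal sketch of how the sharded input could be declared once the Parquet files are in place. It builds one `ProcessingInputs` entry in the shape expected by the `CreateProcessingJob` API, where the distribution type is spelled `ShardedByS3Key`. The helper name `sharded_processing_input`, the input name, and the local path are assumptions made for illustration; the S3 prefix is the copy destination used later in this notebook.

```python
def sharded_processing_input(s3_uri, local_path="/opt/ml/processing/input", input_name="dataset"):
    """Build one ProcessingInputs entry for the CreateProcessingJob API.

    Hypothetical helper: only the dictionary layout follows the API;
    the function name and default values are illustrative.
    """
    return {
        "InputName": input_name,
        "S3Input": {
            "S3Uri": s3_uri,
            "S3DataType": "S3Prefix",  # treat every object under the prefix as input
            "S3InputMode": "File",  # download objects to local disk before the job starts
            "S3DataDistributionType": "ShardedByS3Key",  # each instance receives a subset of the objects
            "LocalPath": local_path,
        },
    }


processing_input = sharded_processing_input(
    "s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-parquet/"
)
print(processing_input["S3Input"]["S3DataDistributionType"])
```

Because sharding is done per S3 object, a prefix that holds only the Parquet files of the current table version is easier to reason about than a Delta directory, which also contains the `_delta_log/` folder and possibly files from older versions.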

## 1. Authentication to AWS

In [1]:
import boto3

ml_session = boto3.Session(profile_name="ml", region_name="us-east-1")

In [2]:
import os

credentials = ml_session.get_credentials().get_frozen_credentials()
os.environ["AWS_ACCESS_KEY_ID"] = credentials.access_key
os.environ["AWS_SECRET_ACCESS_KEY"] = credentials.secret_key
if credentials.token:  # only temporary credentials include a session token
    os.environ["AWS_SESSION_TOKEN"] = credentials.token
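
As an alternative to exporting environment variables, the same credentials can probably be handed to `deltalake` directly through `storage_options`. The sketch below assumes that `storage_options` accepts the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, and `AWS_REGION` keys; the helper name `build_storage_options` and the placeholder credentials are hypothetical. Setting the region explicitly may also avoid the region lookup against IMDS, which is not reachable outside EC2.

```python
from types import SimpleNamespace


def build_storage_options(frozen_credentials, region, timeout="3600s"):
    """Turn boto3 frozen credentials into a deltalake-style storage_options dict.

    Hypothetical helper: the key names are assumed to be the ones deltalake
    understands for S3 access.
    """
    options = {
        "AWS_ACCESS_KEY_ID": frozen_credentials.access_key,
        "AWS_SECRET_ACCESS_KEY": frozen_credentials.secret_key,
        "AWS_REGION": region,
        "timeout": timeout,
    }
    # Long-lived credentials have no session token, so add the key only when present.
    if frozen_credentials.token:
        options["AWS_SESSION_TOKEN"] = frozen_credentials.token
    return options


# Placeholder values standing in for ml_session.get_credentials().get_frozen_credentials()
example_credentials = SimpleNamespace(
    access_key="EXAMPLE_ACCESS_KEY", secret_key="example-secret", token=None
)
storage_options = build_storage_options(example_credentials, region="us-east-1")
print(sorted(storage_options))
```

The resulting dictionary would then be passed as `DeltaTable(table_uri=..., storage_options=storage_options)`.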

## 2. Copy underlying Parquet files from the latest version of the Delta Lake table

In [3]:
from deltalake import DeltaTable

ORIGINAL_DATASET_S3_URI = "s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/"

dt = DeltaTable(
    table_uri=ORIGINAL_DATASET_S3_URI,
    storage_options={"timeout": "3600s"}
)

[90m[[0m2025-07-31T07:57:05Z [33mWARN [0m aws_config::imds::region[90m][0m failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: timeout: client error (Connect): HTTP connect timeout occurred after 1s: timed out (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper_util::client::legacy::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), connection: Unknown } }) }))
[90m[[0m2025-07-31T07:57:06Z [33mWARN [0m aws_config::imds::region[90m][0m failed to load region from IMDS err=failed to load IMDS session token: dispatch failure: timeout: client error (Connect): HTTP connect timeout occurred after 1s: timed out (FailedToLoadToken(FailedToLoadToken { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Timeout, source: hyper_util::client::legacy::Error(Connect, HttpTimeoutError { kind: "HTTP connect", duration: 1s }), conne

In [4]:
dt.to_pyarrow_dataset().count_rows()

14636953

Retrieve the Parquet files that form the latest version of the table:

In [5]:
parquet_files = dt.file_uris()

print(f"# of Parquet files forming the latest version of the Delta table: {len(parquet_files)}\n")
print(parquet_files[0])  # print only one sample

# of Parquet files forming the latest version of the Delta table: 61

s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-0eebc8f7-76e7-4024-84e1-fa2b5651b08c-c000.zstd.parquet
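
`file_uris()` returns full `s3://` URIs, whereas the S3 API works with a bucket name and an object key. A minimal sketch of splitting one into the other is shown below; `split_s3_uri` is a hypothetical helper, not part of `boto3` or `deltalake`, and it assumes the URI is not percent-encoded.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the object key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {uri!r}")
    return bucket, key


sample_uri = (
    "s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/"
    "part-00001-0eebc8f7-76e7-4024-84e1-fa2b5651b08c-c000.zstd.parquet"
)
bucket, key = split_s3_uri(sample_uri)
file_name = key.rsplit("/", 1)[-1]  # last path component, used to build the destination key
print(bucket)
print(file_name)
```

Validating the scheme up front makes a malformed URI fail loudly instead of silently producing a wrong object key.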


Let's copy all the Parquet files of the latest version of the Delta table to a new destination:

In [None]:
s3_client = ml_session.client("s3")

source_bucket = "mvp-mlops-platform"
destination_bucket = "mvp-mlops-platform"
destination_prefix = "poc-multi-instance-data-prep-repartitioned-parquet/"

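for i, parquet_file in enumerate(parquet_files):
    print(f"Copying {i} of {len(parquet_files)}: {parquet_file}...")

    # Strip the "s3://<bucket>/" prefix to get the object key inside the source bucket
    source_key = parquet_file.replace(f"s3://{source_bucket}/", "")
    # Keep only the file name so the copies sit directly under the destination prefix
    destination_key = destination_prefix + source_key.split("/")[-1]

    # Server-side copy; a single copy_object call handles objects up to 5 GB,
    # which is enough for the ~1 GB files here
    s3_client.copy_object(
        Bucket=destination_bucket,
        CopySource={"Bucket": source_bucket, "Key": source_key},
        Key=destination_key,
    )

    print(f"{parquet_file} successfully copied!")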

Copying 0 of 61: s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-0eebc8f7-76e7-4024-84e1-fa2b5651b08c-c000.zstd.parquet...
s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-0eebc8f7-76e7-4024-84e1-fa2b5651b08c-c000.zstd.parquet successfully copied!
Copying 1 of 61: s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-0ed79d16-c12d-4c6c-8f87-8253daa072c6-c000.zstd.parquet...
s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-0ed79d16-c12d-4c6c-8f87-8253daa072c6-c000.zstd.parquet successfully copied!
Copying 2 of 61: s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-b2c77227-e837-47cf-b409-f371f50250c0-c000.zstd.parquet...
s3://mvp-mlops-platform/poc-multi-instance-data-prep-repartitioned-delta/part-00001-b2c77227-e837-47cf-b409-f371f50250c0-c000.zstd.parquet successfully copied!
Copying 3 of 61: s3://mvp-mlops-platform/po