TL;DR: i created download_trades.py from this notebook.

long story:

here is what i've found.

1) rclone doesn't compare checksums for polygon flat files. proof:

```
> rclone copy s3polygon:flatfiles/us_stocks_sip/trades_v1/2025/06/2025-06-12.csv.gz . --log-level DEBUG --progress --checksum
...
2025/06/15 12:13:14 DEBUG : 2025-06-12.csv.gz: Src hash empty - aborting Dst hash check
...
```

2) files in s3 have an ETag that contains a checksum. but... large files in s3 are uploaded in chunks (multipart upload), so it's not possible to just calculate the md5 of a local file and compare it to the ETag. you have to split the local file into chunks first, calculate the md5 of each chunk, join the md5s of all chunks (the raw digest bytes, not the hex strings) and take the md5 of the join. the ETag has the number of chunks at the end, for example for the 1.8G file `2025-06-12.csv.gz` the ETag is `b8744a0cf028db0fb5a01f6b89a2c853-18`. here 18 means 18 chunks.

3) the size of chunks is not standardized in s3 and can be configured by whoever uploads the file (no proof, chatGPT said and i believe it).
i found out that for polygon files the chunk size is 100MB (100 * 1024 * 1024 bytes).
i'm not sure if it's the same size for all files, but so far i checked it for 2GB files, for files smaller than 100MB and for a 109MB file.
the checksum check seems to be working fine.
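
a quick way to sanity-check a guessed chunk size without downloading anything: the part count at the end of the ETag should equal the file size divided by the chunk size, rounded up. the sketch below is only an illustration. `parse_etag` and `chunk_size_is_plausible` are hypothetical helper names, and the 1,800,000,000 byte size is a made-up stand-in for "1.8G" (the real size comes from `ContentLength` in `head_object`).

```python
import math

def parse_etag(etag):
    """Split an S3 ETag into (hex_digest, part_count).

    part_count is None when the ETag has no '-N' suffix (single-part upload).
    """
    etag = etag.strip('"')
    digest, sep, parts = etag.partition('-')
    return digest, (int(parts) if sep else None)

def chunk_size_is_plausible(etag, size_bytes, chunk_size):
    """True if the part count in the ETag equals ceil(size_bytes / chunk_size)."""
    _, parts = parse_etag(etag)
    if parts is None:
        return False  # plain MD5 ETag, no chunk size to check
    return parts == math.ceil(size_bytes / chunk_size)

MB = 1024 * 1024
etag = 'b8744a0cf028db0fb5a01f6b89a2c853-18'
size_bytes = 1_800_000_000  # hypothetical size, stands in for "1.8G"

print(chunk_size_is_plausible(etag, size_bytes, 100 * MB))  # 100MB chunks
print(chunk_size_is_plausible(etag, size_bytes, 8 * MB))    # 8MB chunks
```

this only narrows things down: several chunk sizes can give the same part count, so a match here does not prove the chunk size. the md5 comparison is still the real check.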


In [None]:
%pip install boto3 botocore

In [None]:
# setup client for AWS S3
import sys
import os

# Add the parent directory to Python path to import api_key module
sys.path.append(os.path.dirname(os.path.abspath('')))

import api_key

In [None]:
aws_access_key_id = api_key.read_api_key_id()
aws_secret_access_key = api_key.read_api_key()

In [None]:
import boto3
from botocore.config import Config

# Initialize a session using your credentials
session = boto3.Session(
  aws_access_key_id=aws_access_key_id,
  aws_secret_access_key=aws_secret_access_key,
)

# Create a client with your session and specify the endpoint
s3 = session.client(
  's3',
  endpoint_url='https://files.polygon.io',
  config=Config(signature_version='s3v4'),
)

# List Example
# Initialize a paginator for listing objects
paginator = s3.get_paginator('list_objects_v2')

# Choose the appropriate prefix depending on the data you need:
# - 'global_crypto' for global cryptocurrency data
# - 'global_forex' for global forex data
# - 'us_indices' for US indices data
# - 'us_options_opra' for US options (OPRA) data
# - 'us_stocks_sip' for US stocks (SIP) data
prefix = 'us_stocks_sip/trades_v1'  # Example: Change this prefix to match your data need

# List objects using the selected prefix
for page in paginator.paginate(Bucket='flatfiles', Prefix=prefix):
  # 'Contents' is missing from a page when no objects match the prefix
  for obj in page.get('Contents', []):
    print(obj['Key'])

In [None]:
# Specify the bucket name
bucket_name = 'flatfiles'

# Specify the S3 object key name
object_key = 'us_stocks_sip/trades_v1/2025/06/2025-06-12.csv.gz'

# Specify the local file name and path to save the downloaded file
# This splits the object_key string by '/' and takes the last segment as the file name
local_file_name = object_key.split('/')[-1]

# This constructs the full local file path
local_file_path = './' + local_file_name

In [None]:
# Get ETag without downloading the file
response = s3.head_object(Bucket=bucket_name, Key=object_key)
etag = response['ETag'].strip('"')  # Remove quotes around ETag
print(f"ETag for {object_key}: {etag}")

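# For single-part uploads the ETag is typically the plain MD5 of the file,
# so it can be compared directly with the MD5 of the local file.
# For multipart uploads it has the form '<md5 of the part md5s>-<part count>'.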

In [None]:
# Download the file
# s3.download_file(bucket_name, object_key, local_file_path)

In [None]:
import hashlib

def calculate_multipart_etag(file_path, chunk_size=8 * 1024 * 1024):
    """Calculate ETag for multipart upload to match S3's calculation"""
    md5s = []
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5s.append(hashlib.md5(chunk).digest())
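    if len(md5s) == 1:
        # One chunk: treated as a single-part upload, return the plain MD5.
        # Note: a multipart upload that happened to have one part would carry
        # a '-1' suffix in its ETag, which this branch does not reproduce.
        return md5s[0].hex()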
    else:
        # Multipart upload - combine MD5s and add part count
        combined_md5 = hashlib.md5(b''.join(md5s)).hexdigest()
        return f"{combined_md5}-{len(md5s)}"

def equal_md5(bucket_name, object_key, local_file_path, s3_client):
    """Verify that local file and the bucket object have the same size and MD5 checksum"""
    print(f"Verifying {local_file_path} against {bucket_name}/{object_key}")
    # Get remote file metadata
    response = s3_client.head_object(Bucket=bucket_name, Key=object_key)
    remote_etag = response['ETag'].strip('"')
    remote_size = response['ContentLength']
    # Check file sizes first (quick check)
    local_size = os.path.getsize(local_file_path)
    if remote_size != local_size:
        print("✗ File sizes do not match")
        return False
    print("✓ File sizes match")
    # For multipart uploads use the chunk size observed for polygon files (100 MB)
    if '-' in remote_etag:
        chunk_size = 100 * 1024 * 1024
        calculated_etag = calculate_multipart_etag(local_file_path, chunk_size)
        if calculated_etag == remote_etag:
            print(f"✓ File verified with {chunk_size // (1024*1024)}MB chunks")
            return True
        print("✗ Could not verify multipart file integrity")
        return False
    else:
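        # Single part - plain MD5 of the whole file, streamed in 8 MB blocks.
        # (calculate_multipart_etag with its default 8 MB chunk size would return
        # a multipart-style ETag for any file larger than 8 MB.)
        md5 = hashlib.md5()
        with open(local_file_path, 'rb') as f:
            for block in iter(lambda: f.read(8 * 1024 * 1024), b''):
                md5.update(block)
        local_md5 = md5.hexdigest()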
        if local_md5 == remote_etag:
            print("✓ File verified (single part)")
            return True
        else:
            print("✗ File verification failed")
            return False

is_valid = equal_md5(bucket_name, object_key, local_file_path, s3)
is_valid
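
since i'm not sure the 100MB chunk size holds for every file, one option is to try a few candidate sizes and see which one reproduces the ETag. this is a rough sketch, not what download_trades.py does: `multipart_etag` and `find_chunk_size` are hypothetical names, the candidate list is a guess, and it assumes the ETag follows the usual multipart scheme (md5 of the joined part md5s plus `-<part count>`).

```python
import hashlib
import math
import os

def multipart_etag(file_path, chunk_size):
    """Multipart-style ETag: md5 of the joined per-chunk md5 digests plus '-<parts>'."""
    digests = []
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digests.append(hashlib.md5(chunk).digest())
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"

def find_chunk_size(file_path, remote_etag, candidates_mb=(100, 8, 16, 64)):
    """Return the first candidate chunk size (in MB) that reproduces remote_etag, else None."""
    if '-' not in remote_etag:
        return None  # plain md5 ETag, not a multipart upload
    expected_parts = int(remote_etag.rsplit('-', 1)[1])
    size = os.path.getsize(file_path)
    for mb in candidates_mb:
        chunk_size = mb * 1024 * 1024
        # Cheap rejection: skip sizes that cannot give the right part count
        if math.ceil(size / chunk_size) != expected_parts:
            continue
        if multipart_etag(file_path, chunk_size) == remote_etag:
            return mb
    return None
```

usage would be something like `find_chunk_size(local_file_path, etag)` with the `etag` from `head_object`. each candidate that passes the part-count filter re-reads the whole file, so keep the candidate list short for 2GB files.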