# Base S3 Bucket preparation and training data download to Jupyter Notebook Pod

This notebook prepares the base S3 bucket and then downloads the available x-ray training images to this pod, where the ML training notebook is executed.

Note: uploading the train/test/validation files is outside the scope of this notebook. Please use either the AWS console or the AWS CLI to upload the dataset to S3.

In [1]:
!pip install boto3
import boto3
import json
import os
import tqdm



In [2]:
# Keys for AWS S3 itself, not for the ODF storage instance in OCP4
aws_access_key_id = 'your_key_id'
aws_secret_access_key = 'your_access_key'
region_name = 'default' # replace with the default region for the profile, e.g. us-east-2
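
Hard-coding keys in a notebook makes them easy to leak when the file is shared. As a minimal sketch, the same three values could instead be read from the standard `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` environment variables. The helper name `load_aws_settings` and the `us-east-2` fallback are assumptions made for this illustration, not part of the original notebook.

```python
import os

def load_aws_settings(env=None):
    """Hypothetical helper: read the AWS settings from environment variables."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
        # Assumption: fall back to us-east-2 when no region is set
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-2"),
    }
```

The returned dictionary uses the same keyword names as `boto3.client`, so it could be passed along as `boto3.client('s3', **load_aws_settings())`.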

In [3]:
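def create_bucket(bucket_name):
    # Uses the s3 client created in the next cell.
    # us-east-1 is the S3 default region and must not be passed as a LocationConstraint
    if region_name == 'us-east-1':
        return s3.create_bucket(Bucket=bucket_name)
    location = {'LocationConstraint': region_name}
    return s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)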

In [4]:
# Create the S3 client used by the cells below
s3 = boto3.client('s3',
                  aws_access_key_id = aws_access_key_id,
                  aws_secret_access_key = aws_secret_access_key,
                  region_name = region_name)

#### Uncomment the line below to create the base bucket where the images are stored.
Optionally, you can change the name of the bucket; if you do, replace it with the new name everywhere it is used (this file included).

In [5]:
# create_bucket('ml-pneumonia-datasource')

In [6]:
response = s3.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

Existing buckets:
  cluster-tws4r-289dz-image-registry-us-east-2-woscgptkqjgwywhvd
  ml-pneumonia-datasource
  nb.1634660370505.cluster-tws4r.tws4r.sandbox702.opentlc.com
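
Rather than reading the printed list by eye, the `list_buckets` response can be checked programmatically before starting the download. The sketch below assumes only the response shape used above (a `Buckets` list of dictionaries with a `Name` key); the helper name `bucket_exists` and the second bucket name in the sample are hypothetical.

```python
def bucket_exists(list_buckets_response, bucket_name):
    """Return True if bucket_name appears in a list_buckets() response."""
    buckets = list_buckets_response.get("Buckets", [])
    return any(b["Name"] == bucket_name for b in buckets)

# Hand-written sample in the same shape as the response above
sample_response = {
    "Buckets": [
        {"Name": "ml-pneumonia-datasource"},
        {"Name": "some-other-bucket"},  # hypothetical name
    ]
}
print(bucket_exists(sample_response, "ml-pneumonia-datasource"))  # True
```

In the notebook itself, the call would be `bucket_exists(response, 'ml-pneumonia-datasource')`.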


In [7]:
def download_dir(aws_access_key_id, aws_secret_access_key, region_name, bucket, s3_prefix = '', local_base = ''):
    """
    params:
    - aws_access_key_id: The aws_access_key_id
    - aws_secret_access_key: The aws_secret_access_key
    - region_name: The region where the bucket was created
    - bucket: s3 bucket with target contents
    - s3_prefix: only keys starting with this prefix are downloaded ('' matches every key)
    - local_base: local path to folder in which to place files
    """

    s3_resource = boto3.resource('s3',
                             aws_access_key_id = aws_access_key_id,
                             aws_secret_access_key = aws_secret_access_key,
                             region_name = region_name)

    ml_ds_bucket = s3_resource.Bucket(bucket)
    # An empty prefix matches every object in the bucket
    bucket_objects = ml_ds_bucket.objects.filter(Prefix=s3_prefix)

    # Skip "folder" placeholder keys, which end with '/' and carry no file content
    files = [item.key for item in bucket_objects if not item.key.endswith('/')]

    print('Downloading files...')
    for file in tqdm.tqdm(files):
        dest_pathname = os.path.join(local_base, file)
        dest_dir = os.path.dirname(dest_pathname)
        # dest_dir is empty for top-level keys when local_base is ''
        if dest_dir:
            os.makedirs(dest_dir, exist_ok=True)
        ml_ds_bucket.download_file(file, dest_pathname)

    print('Done!')
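
Because the full download takes several minutes, re-running the cell after an interruption repeats work that is already done. One possible refinement, shown here only as a sketch, is to skip files that already exist locally with the expected size. The helper name `needs_download` is hypothetical; the size comparison assumes the object summaries yielded by `ml_ds_bucket.objects` expose a `size` attribute next to `key`.

```python
import os

def needs_download(dest_pathname, expected_size):
    """Return True if the local file is missing or its size differs from expected_size."""
    if not os.path.isfile(dest_pathname):
        return True
    return os.path.getsize(dest_pathname) != expected_size
```

Inside the download loop, this would guard the call to `download_file`, for example `if needs_download(dest_pathname, item.size): ...`. A size match does not prove the content is identical, so this is a convenience check rather than an integrity guarantee.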

#### The cell below downloads the contents of the S3 bucket to this pod, into the (new) dataset folder.

In [8]:
download_dir(aws_access_key_id = aws_access_key_id,
              aws_secret_access_key = aws_secret_access_key,
              region_name = region_name,
              bucket = 'ml-pneumonia-datasource',
              s3_prefix = '',
              local_base = 'dataset')

  0%|          | 2/5856 [00:00<06:53, 14.14it/s]

Downloading files...


100%|██████████| 5856/5856 [06:06<00:00, 15.97it/s]

Done!





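Final check that the number of downloaded files matches the number of objects in the bucket. Note that `ls -lR | wc -l` also counts the directory names, `total` lines and blank separator lines it prints, so the result is somewhat higher than the 5856 files downloaded.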

In [10]:
!ls -lR dataset | wc -l

5894
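
For a count that includes only regular files, the directory tree can be walked in Python instead. This is a minimal sketch; the helper name `count_files` is an assumption made for this illustration.

```python
import os

def count_files(root):
    """Count regular files under root, ignoring the directories themselves."""
    return sum(len(filenames) for _dirpath, _dirnames, filenames in os.walk(root))
```

Calling `count_files('dataset')` after the download gives a number that can be compared directly with the total reported by the progress bar.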
