# Data Preprocessing Tutorial: Medical Image Splitting

Split medical imaging datasets into train/test/validation sets using SageMaker Processing.

## Prepare the Dockerfile
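The Dockerfile below defines a minimal Ubuntu image with Python 3 as its entry point. Build it only if you want to run the processing job in your own container; Step 4 uses a prebuilt SageMaker scikit-learn image.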

In [None]:
%%writefile Dockerfile
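FROM ubuntu:latest

# Update the package index and install Python 3 in a single layer,
# so the install never runs against a stale cached index
RUN apt-get update && \
    apt-get install -y python3 && \
    rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["python3"]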

## Build and push the Docker image

In [None]:
%%writefile build_and_push.sh
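#!/bin/bash
# Usage: bash build_and_push.sh <aws_account_id> <aws_region>
set -euo pipefail

AWS_ACCOUNT_ID=$1
AWS_REGION=$2
AWS_ECR_REPO=sm-preprocessing

REGISTRY=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
IMAGE=$REGISTRY/$AWS_ECR_REPO:latest

# Build the image locally
docker build -t "$IMAGE" .
# Optional local check:
# docker run -it -p 8080:8080 "$IMAGE"

# Authenticate Docker with ECR
aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$REGISTRY"

# Create the repository only if it does not already exist
aws ecr describe-repositories --repository-names "$AWS_ECR_REPO" --region "$AWS_REGION" > /dev/null 2>&1 || \
    aws ecr create-repository --repository-name "$AWS_ECR_REPO" --region "$AWS_REGION"

# Push the image to ECR
docker push "$IMAGE"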

Run the script with your AWS account ID and region:

`bash build_and_push.sh <Your_account_id> <region>`

In [None]:
!bash build_and_push.sh xxxxxxxxx us-east-1
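
The image URI that `build_and_push.sh` tags and pushes follows a fixed pattern. As a minimal sketch, the helper below rebuilds that URI in Python so it can be reused later in the notebook, for example as `image_uri` when creating the processor. `ecr_image_uri` is a hypothetical helper, not part of any SDK, and the account ID is a placeholder.

```python
def ecr_image_uri(account_id: str, region: str,
                  repo: str = "sm-preprocessing", tag: str = "latest") -> str:
    """Rebuild the URI used by build_and_push.sh (hypothetical helper)."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


# Placeholder account ID; substitute your own
print(ecr_image_uri("123456789012", "us-east-1"))
```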

## Step 1: Setup

Import SageMaker libraries and configure session.

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker import get_execution_role
import sagemaker

# Use the default session; its region comes from your AWS configuration
sagemaker_session = sagemaker.Session()
execution_role = get_execution_role()
region = sagemaker_session.boto_region_name

print(f"Role: {execution_role}")
print(f"Region: {region}")

## Step 2: Configure Data Paths

Set S3 paths for input and output data.

### Dataset Structure
```
data
├── class_1
│   ├── image_1.png
│   ├── image_2.png
│   └── ...
├── class_2
│   ├── image_1.png
│   ├── image_2.png
│   └── ...
└── ...
```
Each class folder contains images belonging to that class.
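
Before uploading, it can help to confirm that a local copy of the dataset matches this layout. The sketch below counts the files in each class folder; `count_images_per_class` is a hypothetical helper, and it assumes every file inside a class folder is an image.

```python
import os
from collections import Counter


def count_images_per_class(data_dir: str) -> Counter:
    """Return {class_name: number_of_files} for a class-per-folder dataset."""
    counts = Counter()
    for entry in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, entry)
        # Only directories are treated as classes; stray files are skipped
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                1 for name in os.listdir(class_dir)
                if os.path.isfile(os.path.join(class_dir, name))
            )
    return counts
```

Calling `count_images_per_class("data")` on a local copy returns one entry per class, which makes a heavily imbalanced or empty class easy to spot.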

### Processed Dataset Structure
```
processed-data
├── train
│   ├── class_1
│   │   ├── image_1.png
│   │   ├── image_2.png
│   │   └── ...
│   ├── class_2
│   │   ├── image_1.png
│   │   ├── image_2.png
│   │   └── ...
│   └── ...
├── test
│   ├── class_1
│   │   └── ...
│   ├── class_2
│   │   └── ...
│   └── ...
└── val
    ├── class_1
    │   └── ...
    ├── class_2
    │   └── ...
    └── ...
```
The preprocessing script names the three splits `train`, `test`, and `val`. Each split contains the same class folders as the source dataset.


In [None]:
# S3 paths to be changed as per your setup
input_bucket_name = "Enter-Your-Bucket-Name-Here"
dataset_name = 'enter-dataset-name-here'
output_dataset_name = 'processed-data'
input_s3_uri = f's3://{input_bucket_name}/{dataset_name}'
output_s3_uri = f's3://{input_bucket_name}/{output_dataset_name}'

print(f"Input: {input_s3_uri}")
print(f"Output: {output_s3_uri}")

## Step 3: Create Preprocessing Script

Write a script that splits each class folder into train, test, and validation sets, so that all three sets keep the same class distribution.

In [None]:
%%writefile preprocessing.py
"""
Split an image dataset into train, test and validation sets.

The source directory contains one folder per class. Each class is split
separately, so all three sets keep the same class distribution.
"""
import argparse
import logging
import os
import shutil

logging.getLogger().setLevel(logging.INFO)


def parse_args():
    # Ratios arrive from the processing job via `arguments=[...]`;
    # the defaults match the previously hard-coded 70/20/10 split
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-split', type=float, default=0.7)
    parser.add_argument('--test-split', type=float, default=0.2)
    parser.add_argument('--val-split', type=float, default=0.1)
    return parser.parse_args()


def main():
    args = parse_args()
    total = args.train_split + args.test_split + args.val_split
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"Split ratios must sum to 1, got {total}")

    logging.info("Starting preprocessing script")
    data_src = '/opt/ml/processing/input'
    data_dest = '/opt/ml/processing/output'

    # Keep only directories; each one is treated as a class
    list_classes = sorted(
        d for d in os.listdir(data_src)
        if os.path.isdir(os.path.join(data_src, d))
    )
    logging.info(f"Classes found: {list_classes}")

    for class_name in list_classes:
        # Files are split in sorted order, not shuffled
        list_images = sorted(os.listdir(os.path.join(data_src, class_name)))
        # Slice boundaries; the validation set takes whatever remains
        train_end = int(len(list_images) * args.train_split)
        test_end = train_end + int(len(list_images) * args.test_split)
        splits = {
            'train': list_images[:train_end],
            'test': list_images[train_end:test_end],
            'val': list_images[test_end:],
        }
        for split_name, images in splits.items():
            split_dir = os.path.join(data_dest, split_name, class_name)
            os.makedirs(split_dir, exist_ok=True)
            for image in images:
                shutil.copy(os.path.join(data_src, class_name, image), os.path.join(split_dir, image))


if __name__ == "__main__":
    main()
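
Because split sizes come from truncating `count * ratio` with `int()`, a class with very few images can leave one split empty. The sketch below works through that arithmetic on a plain count. `split_sizes` is a hypothetical helper and the ratios are illustrative values; the script's exact boundaries may differ by an image.

```python
def split_sizes(n_images: int, train_ratio: float, test_ratio: float):
    """Return (train, test, val) sizes; val takes whatever remains."""
    n_train = int(n_images * train_ratio)   # truncates toward zero
    n_test = int(n_images * test_ratio)
    n_val = n_images - n_train - n_test
    return n_train, n_test, n_val


print(split_sizes(10, 0.7, 0.2))  # (7, 2, 1)
print(split_sizes(3, 0.7, 0.2))   # (2, 0, 1): the test split is empty
```

With only three images in a class, the test split receives nothing, so it is worth checking the smallest class before choosing ratios.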


## Step 4: Create Script Processor

Configure SageMaker processor with container image and instance type.

In [None]:
# Use the prebuilt SageMaker scikit-learn container, resolved for the session's region
# (ScriptProcessor, execution_role and region come from Step 1)
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework='sklearn',
    region=region,
    version='1.2-1',
    py_version='py3',
    instance_type='ml.m5.xlarge',
)
# To run in the custom image built earlier, point image_uri at its ECR URI instead:
# image_uri = f'<Your_account_id>.dkr.ecr.{region}.amazonaws.com/sm-preprocessing:latest'

script_processor = ScriptProcessor(
    command=['python3'],
    image_uri=image_uri,
    role=execution_role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
)

print(f"Using image: {image_uri}")

## Step 5: Run Processing Job

Run the processing job, passing the split ratios to the script as command-line arguments. The three ratios should sum to 1.

In [None]:
import time

# Generate unique job name
job_name = f'data-splitting-{int(time.time())}'

script_processor.run(
    code='preprocessing.py',
    job_name=job_name,
    inputs=[
        ProcessingInput(
            source=input_s3_uri, 
            destination='/opt/ml/processing/input'
        )
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/output/train', 
            destination=output_s3_uri + '/train'
        ),
        ProcessingOutput(
            source='/opt/ml/processing/output/test', 
            destination=output_s3_uri + '/test'
        ),
        ProcessingOutput(
            source='/opt/ml/processing/output/val', 
            destination=output_s3_uri + '/val'
        )
    ],
    arguments=['--train-split', '0.7', '--test-split', '0.15', '--val-split', '0.15']
)

print(f"Processing job {job_name} completed")
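
The ratios are passed as plain strings, so a typo only surfaces once the job is already running. One possible pre-flight check is sketched below. `check_split_arguments` is a hypothetical helper, not part of the SageMaker SDK, and it assumes the list holds only flag/value pairs.

```python
import math


def check_split_arguments(arguments):
    """Parse ['--flag', 'value', ...] pairs and verify the ratios sum to 1."""
    if len(arguments) % 2 != 0:
        raise ValueError("Expected flag/value pairs")
    ratios = {
        flag: float(value)
        for flag, value in zip(arguments[::2], arguments[1::2])
    }
    total = sum(ratios.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Split ratios sum to {total}, expected 1.0")
    return ratios


ratios = check_split_arguments(
    ['--train-split', '0.7', '--test-split', '0.15', '--val-split', '0.15']
)
print(ratios)
```

Running the check on the same list that is handed to `arguments=` keeps the validated values and the submitted values identical.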

## Step 6: Verify Output

Check processed data in S3.

In [None]:
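import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Count every object under each split prefix.
# A single list_objects_v2 call returns at most 1,000 keys, so paginate.
for split in ['train', 'test', 'val']:
    count = 0
    pages = paginator.paginate(
        Bucket=input_bucket_name,
        Prefix=f'{output_dataset_name}/{split}/'
    )
    for page in pages:
        count += page.get('KeyCount', 0)
    print(f"{split}: {count} objects")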
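
Object totals per split do not show whether every class reached every split. The sketch below tallies keys by split and class, assuming keys are laid out as `<output_dataset_name>/<split>/<class>/<file>`. `tally_keys` is a hypothetical helper and the sample keys are made up for illustration; in practice the keys would come from the `Contents` entries of the `list_objects_v2` responses.

```python
from collections import Counter


def tally_keys(keys, prefix):
    """Count objects per (split, class) for keys shaped like prefix/split/class/file."""
    counts = Counter()
    for key in keys:
        if not key.startswith(prefix + "/"):
            continue
        parts = key[len(prefix) + 1:].split("/")
        # Need at least split/class/file; skip folder placeholder keys ending in "/"
        if len(parts) >= 3 and parts[-1]:
            counts[(parts[0], parts[1])] += 1
    return counts


# Made-up keys for illustration only
sample_keys = [
    "processed-data/train/class_1/image_1.png",
    "processed-data/train/class_1/image_2.png",
    "processed-data/test/class_1/image_3.png",
    "processed-data/val/class_2/image_9.png",
]
for (split, class_name), n in sorted(tally_keys(sample_keys, "processed-data").items()):
    print(split, class_name, n)
```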

## Next Steps

Data is now split and ready for training. Use the processed data paths in the EC2 training notebook.