# TODO: Title
**TODO**: Give a helpful introduction to what this notebook is for. Remember that comments, explanations and good documentation make your project informative and professional.

**Note:** This notebook has a number of code and markdown cells with TODOs that you have to complete. These are meant to be helpful guidelines for finishing your project while meeting the requirements in the project rubrics. Feel free to change the order of the TODOs and/or use more than one cell to complete all the tasks.

In [None]:
# TODO: Install any packages that you might need

%pip install protobuf==3.20.3
%pip install smdebug
%pip install torch
%pip install -U sagemaker
%pip install torchvision

In [13]:
# TODO: Import any packages that you might need
from tqdm import tqdm
import os
import json
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from PIL import Image
from io import BytesIO
import numpy as np

import sagemaker
import sagemaker.image_uris
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import Rule, rule_configs, ProfilerRule, ProfilerConfig, FrameworkProfile, DebuggerHookConfig
import IPython.display
import matplotlib.pyplot as plt
from smdebug.trials import create_trial
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import IdentitySerializer
from sagemaker.predictor import Predictor
from sagemaker.inputs import TrainingInput




## Data Preparation
**TODO:** Run the cell below to download the data.

The cell below creates a folder called `train_data`, downloads the training data, and arranges it in subfolders. Each subfolder contains images in which the number of objects equals the name of the folder. For instance, every image in folder `1` contains exactly 1 object. The images are not divided into training, testing, or validation sets. If you feel the number of samples is not enough, you can always download more data (instructions can be found [here](https://registry.opendata.aws/amazon-bin-imagery/)). However, we are not assessing you on the accuracy of your final trained model, but on how you create your machine learning engineering pipeline.

In [None]:
# Get the SageMaker execution role
role = sagemaker.get_execution_role()

# Get the current AWS region
region = boto3.Session().region_name

# Create a SageMaker session
session = sagemaker.Session()

# Set the default S3 bucket
default_bucket = session.default_bucket()


In [8]:
# Function to download and arrange data from S3
train_data_dir = 'train_data'

def download_and_arrange_data():

    if os.path.exists(train_data_dir):
        print(f"{train_data_dir} folder already exists. Skipping download.")
        return
    
    s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))

    with open('file_list.json', 'r') as f:
        d=json.load(f)

    for k, v in d.items():
        print(f"Downloading Images with {k} objects")
        directory = os.path.join(train_data_dir, k)
        os.makedirs(directory, exist_ok=True)
        for file_path in tqdm(v):
            file_name = os.path.basename(file_path).split('.')[0] + '.jpg'
            # S3 keys always use '/', so build the key explicitly instead of with os.path.join
            s3_resource.Bucket('aft-vbi-pds').download_file(f"bin-images/{file_name}", os.path.join(directory, file_name)) # type: ignore

download_and_arrange_data()

Downloading Images with 1 objects


  0%|          | 0/1228 [00:00<?, ?it/s]

100%|██████████| 1228/1228 [14:46<00:00,  1.39it/s]


Downloading Images with 2 objects


100%|██████████| 2299/2299 [28:28<00:00,  1.35it/s] 


Downloading Images with 3 objects


100%|██████████| 2666/2666 [31:41<00:00,  1.40it/s]


Downloading Images with 4 objects


100%|██████████| 2373/2373 [30:32<00:00,  1.30it/s] 


Downloading Images with 5 objects


100%|██████████| 1875/1875 [22:02<00:00,  1.42it/s]


## Dataset
**TODO:** Explain what dataset you are using for this project. Give a small overview of the classes, class distributions etc that can help anyone not familiar with the dataset get a better understanding of it. You can find more information about the data [here](https://registry.opendata.aws/amazon-bin-imagery/).
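
As a starting point for that overview, the class distribution can be read off the download log above, which shows one progress bar per object count. The snippet below is a minimal sketch: `class_counts` is simply copied from those totals, and the variable names are illustrative rather than part of the project template.

```python
# Images per class, copied from the download progress bars above
# (key = number of objects in the bin, value = number of images).
class_counts = {"1": 1228, "2": 2299, "3": 2666, "4": 2373, "5": 1875}

total_images = sum(class_counts.values())

# Share of each class, useful for spotting imbalance before training.
class_shares = {label: count / total_images for label, count in class_counts.items()}

for label, count in class_counts.items():
    print(f"{label} object(s): {count:5d} images ({class_shares[label]:.1%})")
print(f"Total: {total_images} images")
```

The classes are moderately imbalanced, with class `3` the largest and class `1` the smallest, which is worth keeping in mind when reading per-class accuracy.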

In [14]:
#TODO: Perform any data cleaning or data preprocessing

def preprocess_images(data_dir):
    processed_images = []
    labels = []
    image_names = []
    
    for class_folder in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_folder)
        if os.path.isdir(class_path):
            for image_file in os.listdir(class_path):
                image_path = os.path.join(class_path, image_file)
                
                try:
                    # Open the image
                    with Image.open(image_path) as img:
                        # Convert to RGB if not already
                        img = img.convert('RGB')
                        
                        # Resize the image (e.g., to 224x224 for many standard models)
                        img = img.resize((224, 224))
                        
                        # Save processed image to BytesIO object
                        buffer = BytesIO()
                        img.save(buffer, format="JPEG")
                        buffer.seek(0)

                        # Record image, label and name together so the three lists stay aligned
                        processed_images.append(buffer)
                        labels.append(class_folder)
                        image_names.append(image_file)
                        
                except (IOError, SyntaxError) as e:
                    print(f"Skipping corrupted image: {image_path}: {str(e)}")
                    continue
    
    return processed_images, labels, image_names




In [None]:
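# Verify some uploaded files in S3
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')


def verify_some_uploaded_files(bucket_name):
    sample = 'data/test/4/02573.jpg'
    try:
        s3_client.head_object(Bucket=bucket_name, Key=sample)
        print(f"Verified {sample} is in {bucket_name}")
        return True
    except ClientError as e:
        # head_object reports a missing key as a generic ClientError with code 404
        if e.response['Error']['Code'] in ('404', 'NoSuchKey'):
            print(f"{sample} not found in {bucket_name}")
        else:
            print(f"Error verifying {sample}: {str(e)}")
    except Exception as e:
        print(f"Error verifying {sample}: {str(e)}")
    return False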

# Verify some of the uploaded data
is_uploaded = verify_some_uploaded_files(default_bucket)


In [None]:
#TODO: Upload the data to AWS S3

def upload_data_to_s3(data, labels, image_names, bucket_name):
    # Split data into train, test, and validation sets (the remainder goes to validation)
    train_ratio, test_ratio = 0.6, 0.2

    for index, buffer in enumerate(data):
        # Generate a random number for each image to pick its split
        random_num = np.random.random()

        if random_num < train_ratio:
            split = "train"
        elif random_num < train_ratio + test_ratio:
            split = "test"
        else:
            split = "valid"
        s3_key = f"data/{split}/{labels[index]}/{image_names[index]}"
            
        try:
            s3_client.upload_fileobj(buffer, bucket_name, s3_key)
            print(f"Uploaded {image_names[index]} to {s3_key}")
        except Exception as e:
            print(f"Error uploading {image_names[index]}: {str(e)}")

if is_uploaded:
    print("Data already uploaded. Skipping upload.")
else:
    processed_images, labels, image_names = preprocess_images(train_data_dir)
    # Upload the data
    upload_data_to_s3(processed_images, labels, image_names, default_bucket)
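
The upload cell above assigns each image to a split with a fresh random draw, so a re-run can place the same image in a different split, and the verification cell only succeeds if its hard-coded sample key happened to land in `data/test`. One possible alternative, sketched here under the assumption that file names are unique, is to derive the split from a hash of the file name. The helper `assign_split` is hypothetical and not part of the project template.

```python
import hashlib

def assign_split(image_name, train_ratio=0.6, test_ratio=0.2):
    """Map a file name to 'train', 'test' or 'valid' deterministically."""
    digest = hashlib.md5(image_name.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    if bucket < train_ratio:
        return "train"
    if bucket < train_ratio + test_ratio:
        return "test"
    return "valid"

# Hypothetical file names, for illustration only.
names = [f"{i:05d}.jpg" for i in range(1000)]
splits = [assign_split(name) for name in names]
print({split: splits.count(split) for split in ("train", "test", "valid")})
```

Because the mapping depends only on the name, the same image always ends up in the same split, and the proportions approach 60/20/20 as the number of images grows.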


## Model Training
**TODO:** This is the part where you can train a model. The type or architecture of the model you use is not important. 

**Note:** You will need to use the `train.py` script to train your model.
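
SageMaker passes the estimator's hyperparameters to the entry point as command-line arguments and exposes each data channel through an environment variable named `SM_CHANNEL_<CHANNEL>`, with the model output directory in `SM_MODEL_DIR`. The real `train.py` is not shown in this notebook, so the sketch below is only an assumption about how its argument parsing could look. The flag names mirror the keys of the `hyperparameters` dictionary used later, and `parse_args` is a hypothetical helper.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters: names match the keys of the notebook's `hyperparameters` dict.
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--weight-decay", type=float, default=1e-4)
    parser.add_argument("--num-classes", type=int, default=5)
    # Data and model locations: SageMaker exposes them as environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--valid", default=os.environ.get("SM_CHANNEL_VALID"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args(argv)

# Passing an explicit list keeps the example independent of sys.argv.
args = parse_args(["--batch-size", "16", "--learning-rate", "0.01"])
print(args.batch_size, args.learning_rate, args.num_classes)
```

Note that `argparse` converts hyphens to underscores in attribute names, so `--batch-size` is read as `args.batch_size`. If the script declares its flags with underscores instead, the hyperparameter keys in the notebook need to match that spelling.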

In [None]:
#TODO: Declare your model training hyperparameter.
#NOTE: You do not need to do hyperparameter tuning. You can use fixed hyperparameter values

hyperparameters = {
    'batch-size': 32,
    'learning-rate': 0.001,
    'momentum': 0.9,
    'weight-decay': 1e-4,
    'num-classes': 5,
}

# Print hyperparameters for logging
print("Hyperparameters:")
for key, value in hyperparameters.items():
    print(f"{key}: {value}")


In [None]:
#TODO: Create your training estimator

# Define the training script
training_script = 'train.py'

# Define the framework version
version = "1.13.1"

#  Define the python version
py_version = "py39"

# Define the training instance type
# instance_type = 'ml.c5.2xlarge'
instance_type = 'ml.c6i.2xlarge'

metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'Train Loss: ([0-9\\.]+)'},
    {'Name': 'test:loss', 'Regex': 'Average loss: ([0-9\\.]+)'},
    {'Name': 'test:accuracy', 'Regex': 'Accuracy: ([0-9\\.]+)'}
]

# Create the estimator
estimator = PyTorch(
    entry_point=training_script,
    framework_version=version,
    py_version=py_version,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    region=region,
    hyperparameters=hyperparameters,
    output_path=f"s3://{default_bucket}/output",
    metric_definitions=metric_definitions,
    dependencies=['requirements.txt'],
    enable_sagemaker_metrics=True
)


# Split data for training
train_input = TrainingInput(f"s3://{default_bucket}/data/train", content_type="application/x-image")
test_input = TrainingInput(f"s3://{default_bucket}/data/test", content_type="application/x-image")
valid_input = TrainingInput(f"s3://{default_bucket}/data/valid", content_type="application/x-image")
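
The metric definitions above only produce values if their regular expressions match what `train.py` actually prints. A quick local check can catch a mismatch before a training job is started. The sketch below copies the three patterns and applies them to made-up log lines. The log format is an assumption, since the print statements of `train.py` are not shown here, and `extract_metrics` is a hypothetical helper.

```python
import re

# Copies of the patterns from the estimator cell above.
example_definitions = [
    {"Name": "train:loss", "Regex": "Train Loss: ([0-9\\.]+)"},
    {"Name": "test:loss", "Regex": "Average loss: ([0-9\\.]+)"},
    {"Name": "test:accuracy", "Regex": "Accuracy: ([0-9\\.]+)"},
]

# Hypothetical log lines; the real format depends on train.py.
sample_log = [
    "Epoch 1 Train Loss: 1.5342",
    "Test set: Average loss: 1.4810, Accuracy: 31.25",
]

def extract_metrics(lines, definitions):
    """Return the last value captured for each metric name."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition["Regex"], line)
            if match:
                found[definition["Name"]] = float(match.group(1))
    return found

metrics = extract_metrics(sample_log, example_definitions)
print(metrics)
```

A metric name missing from the result means its pattern never matched, which in a real job would leave that metric empty, including the objective metric used for tuning.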


In [23]:
# train.py --batch_size 32 --learning_rate 0.001 --momentum 0.9 --num_classes 5 --weight_decay 0.0001
# !python train.py --batch_size 32 --learning_rate 0.001 --momentum 0.9 --num_classes 5 --weight_decay 0.0001 --train-data ./train_data --test-data ./train_data --valid ./train_data --model-dir ./model


In [None]:
# TODO: Fit your estimator
# Set the data channels
estimator.fit({
    'train': train_input,
    'test': test_input,
    'valid': valid_input
})

## Standout Suggestions
You do not need to perform the tasks below to finish your project. However, you can attempt these tasks to turn your project into a more advanced portfolio piece.

### Hyperparameter Tuning
**TODO:** Here you can perform hyperparameter tuning to increase the performance of your model. You are encouraged to 
- tune as many hyperparameters as you can to get the best performance from your model
- explain why you chose to tune those particular hyperparameters and the ranges.
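
When explaining the chosen ranges, it can help to look at how a range is sampled. The learning-rate range used below spans two orders of magnitude (0.0001 to 0.01), and sampling it uniformly puts few trials in the lower decade. The sketch below compares uniform and log-uniform sampling with plain Python; it only illustrates the idea and is not how SageMaker's Bayesian search picks points. In the SageMaker SDK, `ContinuousParameter` accepts a `scaling_type` argument (for example `"Logarithmic"`) for this purpose.

```python
import math
import random

low, high = 0.0001, 0.01   # learning-rate range used in this notebook
rng = random.Random(0)     # fixed seed so the sketch is repeatable
n = 10_000

uniform_samples = [rng.uniform(low, high) for _ in range(n)]
log_samples = [10 ** rng.uniform(math.log10(low), math.log10(high)) for _ in range(n)]

def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)

print(f"uniform:     {share_below(uniform_samples, 0.001):.1%} of samples below 0.001")
print(f"log-uniform: {share_below(log_samples, 0.001):.1%} of samples below 0.001")
```

With uniform sampling roughly one sample in eleven falls below 0.001, while log-uniform sampling spends about half of its samples there.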


In [None]:
#TODO: Create your hyperparameter search space

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Define the hyperparameter search space
hyperparameter_ranges = {
    'batch-size': IntegerParameter(16, 32),
    'learning-rate': ContinuousParameter(0.0001, 0.01),
    'momentum': ContinuousParameter(0.8, 0.99),
}

# Define the objective metric
objective_metric_name = 'test:loss'


In [None]:
#TODO: Create your training estimator
# Create the training estimator for hyperparameter tuning
tuning_estimator = PyTorch(
    entry_point=training_script,
    base_job_name='pytorch-inventory-tuning',
    role=role,
    framework_version=version,
    instance_count=1,
    instance_type=instance_type,
    py_version=py_version
)

# Create the hyperparameter tuner
tuner = HyperparameterTuner(
    estimator=tuning_estimator,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=4,
    max_parallel_jobs=2,
    objective_type='Minimize'
)


In [None]:
# TODO: Fit your estimator
# Start the hyperparameter tuning job
tuner.fit({
    'train': train_input,
    'test': test_input,
    'valid': valid_input
})


In [None]:
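# TODO: Find the best hyperparameters

# Get the name of the best training job
best_training_job = tuner.best_training_job()

# Attach to the best training job
best_estimator = PyTorch.attach(best_training_job)

# Print the hyperparameters of the best estimator
best_hyperparameters = best_estimator.hyperparameters()
print("Best Estimator Hyperparameters:")
print(best_hyperparameters)

# Update the hyperparameters dictionary with the tuned values only: the attached job also
# reports SageMaker-internal keys, and values may come back as quoted strings
tuned_hyperparameters = {
    name: str(best_hyperparameters[name]).strip('"')
    for name in hyperparameter_ranges
    if name in best_hyperparameters
}
hyperparameters.update(tuned_hyperparameters) # type: ignore

hyperparameters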




### Model Profiling and Debugging
**TODO:** Use model debugging and profiling to better monitor and debug your model training job.

In [None]:
# TODO: Set up debugging and profiling rules and hooks
# Define rules for debugging and profiling
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

# Configure debugger hook
hook_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100",  # Save every 100 steps
        "eval.save_interval": "10"     # Save every 10 steps during evaluation
    }
)

# Configure profiler
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=1000,  # Monitor system metrics every 1 second
    framework_profile_params=FrameworkProfile(num_steps=10)  # Profile 10 steps
)

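# Add a loss-not-decreasing rule with custom parameters
# (num_steps and diff_percent are parameters of the built-in LossNotDecreasing rule)
custom_loss_not_decreasing_rule = Rule.sagemaker(
    base_config=rule_configs.loss_not_decreasing(),
    rule_parameters={
        "num_steps": "10",     # compare the loss every 10 steps
        "diff_percent": "1"    # require at least a 1% decrease between checks
    }
)
rules.append(custom_loss_not_decreasing_rule)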

In [None]:
# TODO: Create and fit an estimator
# Estimator with debugging and profiling enabled
inventory_estimator = PyTorch(
    entry_point='train.py',
    base_job_name='inventory-monitoring',  # More descriptive job name
    role=role,
    instance_count=1,
    instance_type=instance_type,
    framework_version=version,
    py_version=py_version,
    hyperparameters=hyperparameters,
    # Debugger and Profiler parameters
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config,
)

# Handle potential errors during training and only report success if fit() returned normally
try:
    inventory_estimator.fit({
        'train': train_input,
        'test': test_input,
        'valid': valid_input
    })
except Exception as e:
    print(f"An error occurred during training: {str(e)}")
    # Optionally, add more error handling or logging here
else:
    print("Training completed successfully!")

In [None]:
# TODO: Plot a debugging output.
job_name = inventory_estimator.latest_training_job.job_name # type: ignore

# Get the S3 path to the debugger artifacts of the latest training job
debugger_artifacts_path = inventory_estimator.latest_job_debugger_artifacts_path()

# Create a trial to access the debugger artifacts
trial = create_trial(debugger_artifacts_path)

# Example to get loss values; values() returns a dict mapping step number -> tensor value
losses = trial.tensor("CrossEntropyLoss_output_0").values() # type: ignore

steps = sorted(losses)
loss = [float(np.mean(losses[step])) for step in steps]

plt.plot(steps, loss, label='Loss')
plt.xlabel('Steps')
plt.ylabel('Loss')
plt.title('Loss over Steps')
plt.legend()
plt.show()


**TODO**: Is there some anomalous behaviour in your debugging output? If so, what is the error and how will you fix it?  
**TODO**: If not, suppose there was an error. What would that error look like and how would you have fixed it?
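
One anomaly worth looking for in the plotted curve is a loss that stops decreasing. The sketch below shows the idea behind such a check on a plain list of loss values. It is a simplified illustration, not the implementation of the SageMaker Debugger rule, and `loss_plateaus`, its parameters, and the sample values are all made up for this example.

```python
def loss_plateaus(losses, window=5, min_relative_drop=0.01):
    """Return the indices where the loss failed to drop by `min_relative_drop`
    relative to its value `window` steps earlier."""
    flagged = []
    for i in range(window, len(losses)):
        previous, current = losses[i - window], losses[i]
        if previous <= 0:
            continue  # relative drop is undefined for non-positive losses
        relative_drop = (previous - current) / previous
        if relative_drop < min_relative_drop:
            flagged.append(i)
    return flagged

# Synthetic loss curve for illustration: improves at first, then stalls.
example_losses = [1.60, 1.40, 1.25, 1.15, 1.08, 1.03, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
print(loss_plateaus(example_losses))
```

If a check like this fires early in training, common remedies include lowering or raising the learning rate, verifying that labels and images are aligned, and confirming that the optimizer is updating the intended layers.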

In [None]:
# TODO: Display the profiler output

# Rule output lives under the output path of the estimator that ran the profiled job
rule_output_path = f"{inventory_estimator.output_path.rstrip('/')}/{job_name}/rule-output" # type: ignore

# Split the S3 URI into bucket and key prefix
output_bucket, _, output_prefix = rule_output_path[len("s3://"):].partition('/')
output_prefix += '/'

# Download every file, keeping the folder structure below rule-output
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=output_bucket, Prefix=output_prefix):
    for obj in page.get('Contents', []):
        file_key = obj['Key']
        if file_key.endswith('/'):
            continue  # skip folder placeholder objects
        local_path = file_key[len(output_prefix):]
        os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
        s3_client.download_file(output_bucket, file_key, local_path)

profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in inventory_estimator.latest_training_job.rule_job_summary() # type: ignore
    if "Profiler" in rule["RuleConfigurationName"]
][0]
IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

### Model Deploying and Querying
**TODO:** Can you deploy your model to an endpoint and then query that endpoint to get a result?

In [None]:
# TODO: Deploy your model to an endpoint
# Get the model data from the estimator
model_data = str(inventory_estimator.model_data)
print(f"Model data location: {model_data}")

# Create a PyTorchModel for deployment
pytorch_inference_model = PyTorchModel(
    model_data=model_data,
    role=sagemaker.get_execution_role(),
    entry_point='inference.py',
    framework_version=version,
    py_version=py_version,
)

# Deploy the model to an endpoint
deployment = pytorch_inference_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

# Get the endpoint name for later use
endpoint_name = pytorch_inference_model.endpoint_name
print(f"Model deployed to endpoint: {endpoint_name}")

In [None]:
# TODO: Run a prediction on the endpoint

# Initialize the predictor
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session
)

# Set the serializer to handle image data
predictor.serializer = IdentitySerializer("image/jpeg")

# Define the image path
image_path = "train_data/4/00059.jpg"

# Read the image file
with open(image_path, "rb") as f:
    payload = f.read()

# Make a prediction
response = predictor.predict(payload)

# Print the response
print(response)
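
What the endpoint returns depends entirely on `inference.py`, which is not shown in this notebook. Assuming it returns the raw model outputs (logits) as a JSON list, and assuming class indices follow the alphabetical order of the folder names `1` to `5`, the response could be turned into a predicted object count as sketched below. `interpret_response` and the example payload are hypothetical.

```python
import json
import math

# Assumed class order: index 0 corresponds to folder "1", and so on.
class_names = ["1", "2", "3", "4", "5"]

def interpret_response(raw_response):
    """Convert a JSON list of logits into (predicted label, probability)."""
    logits = json.loads(raw_response)
    # Accept both a flat list and a batch containing a single prediction.
    if logits and isinstance(logits[0], list):
        logits = logits[0]
    # Numerically stable softmax.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probabilities = [e / total for e in exps]
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]

# Made-up payload for illustration; a real response comes from predictor.predict().
example_response = b"[[0.1, 0.3, 0.2, 2.5, -0.4]]"
label, confidence = interpret_response(example_response)
print(f"Predicted object count: {label} (probability {confidence:.2f})")
```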

In [None]:
# TODO: Remember to shutdown/delete your endpoint once your work is done

predictor.delete_endpoint()

### Cheaper Training and Cost Analysis
**TODO:** Can you perform a cost analysis of your system and then use spot instances to lessen your model training cost?

In [1]:
# TODO: Cost Analysis

# Define the cost of training on a regular instance
regular_instance_cost_per_hour = 0.204  # Example on-demand rate in USD/hour (placeholder; look up the current price for your instance type)
training_time_hours = 2  # Example training time

# Calculate the total cost for training on a regular instance
total_regular_instance_cost = regular_instance_cost_per_hour * training_time_hours
print(f"Total cost for training on a regular instance: ${total_regular_instance_cost:.2f}")

# Define the cost of training on a spot instance
spot_instance_cost_per_hour = 0.17  # Example spot rate in USD/hour (placeholder; actual spot savings vary)

# Calculate the total cost for training on a spot instance
total_spot_instance_cost = spot_instance_cost_per_hour * training_time_hours
print(f"Total cost for training on a spot instance: ${total_spot_instance_cost:.2f}")

# Calculate the cost savings
cost_savings = total_regular_instance_cost - total_spot_instance_cost
print(f"Cost savings by using a spot instance: ${cost_savings:.2f}")


Total cost for training on a regular instance: $0.41
Total cost for training on a spot instance: $0.34
Cost savings by using a spot instance: $0.07


In [None]:
# TODO: Train your model using a spot instance

# Spot training settings (SageMaker Python SDK v2 parameter names)
use_spot_instances = True
max_run = 3600  # maximum training time in seconds
max_wait = 3600 if use_spot_instances else None  # total time allowed, including waiting for spot capacity; must be >= max_run

# Create the spot training estimator
spot_estimator = PyTorch(
    entry_point=training_script,
    base_job_name="pytorch-inventory-spot",
    role=role,
    instance_type=instance_type,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    instance_count=1,
    framework_version=version,
    py_version=py_version,
    hyperparameters=hyperparameters,
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config
)

spot_estimator.fit({
    'train': train_input,
    'test': test_input,
    'valid': valid_input
})

### Multi-Instance Training
**TODO:** Can you train your model on multiple instances?

In [None]:
# TODO: Train your model on Multiple Instances

# Create the multi-instance training estimator
multi_instance_estimator = PyTorch(
    entry_point=training_script,
    base_job_name="pytorch-inventory-multi-instance",
    role=role,
    instance_type=instance_type,
    instance_count=2,
    framework_version=version,
    py_version=py_version,
    hyperparameters=hyperparameters,
    rules=rules,
    debugger_hook_config=hook_config,
    profiler_config=profiler_config
)

# Start the multi-instance training job
multi_instance_estimator.fit({
    'train': train_input,
    'test': test_input,
    'valid': valid_input
})