## Get Data and load to S3

In [1]:
import boto3
import pandas as pd
import re
import tarfile
from io import StringIO
import io
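
The upload step itself is not shown in this notebook. As a rough sketch only, data could be serialized to CSV in memory and written to S3 with `put_object`. The function name `upload_rows_as_csv` and the column names in the usage note are hypothetical; the sketch uses the standard-library `csv` module, but a pandas DataFrame could be written into the same `StringIO` buffer with `to_csv`.

```python
import csv
import io


def upload_rows_as_csv(s3_client, bucket, key, header, rows):
    """Serialize rows to CSV in memory and upload them with put_object.

    `s3_client` is assumed to be a boto3 S3 client (or any object that
    exposes a compatible put_object method).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    # put_object accepts the payload as bytes
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```

With a real client this might be called as `upload_rows_as_csv(boto3.client('s3'), 'censussmdemo', 'housingdemo/train/train.csv', ['feature', 'target'], rows)`, where the bucket and key match the locations used later in this notebook.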

## Train model

- Identify location on S3 with the train & test data
- Define the git repo that houses the code that will be executed (train.py)
- Define the output location and infrastructure requirements
- Launch the training job

We're using the AWS-maintained scikit-learn image for this training job, accessed through the SageMaker SDK. Similar prebuilt images exist for other frameworks (PyTorch, TensorFlow, etc.), or you can build your own container image and register it in ECR. The prebuilt image is the simpler option because no customization is required; if customization is necessary, defining your own container is a good solution.

Launching the training job starts one or more dedicated compute instances, separate from the notebook, where the model trains. The instances run only for the duration of the job, and you are charged only for that time. When the job finishes, the model is saved to an S3 location, along with any output data if desired. This is a single job with fixed hyperparameters, but the job can be adapted for hyperparameter optimization as well.
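
The training script (`train.py`) lives in the git repo and is not reproduced here. As a minimal sketch of how such an entry point could read its inputs, assume the hyperparameters arrive as command-line flags named after the dictionary keys, and that the container exposes the data and model locations through the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR` environment variables. The function name `parse_training_args` and the default values are assumptions, not the repo's actual code.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    # Hyperparameters: flag names mirror the keys of the `hyperparameters` dict
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    parser.add_argument("--max_depth", type=int, default=None)
    parser.add_argument("--target", type=str, default="target")
    # Paths: taken from environment variables set inside the training container
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    # parse_known_args tolerates extra flags the platform may append
    args, _unknown = parser.parse_known_args(argv)
    return args
```

Note that argparse converts dashes to underscores, so `--n-estimators` is read back as `args.n_estimators`, while `--max_depth` stays `args.max_depth`. Anything the script writes to the model directory is packaged into `model.tar.gz` and uploaded to the job's output path.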

In [2]:
# Define git config

from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role

git_config = {'repo': 'https://github.com/WesleyPasfield/sagemaker_demo.git'}
framework_version = '1.2-1' # scikit-learn container version, required by the SKLearn estimator
output_root = 'housingdemo/output'
s3_bucket = 'censussmdemo'
s3_location_train = 'housingdemo/train/train.csv'
s3_location_test = 'housingdemo/test/test.csv'

sklearn_estimator = SKLearn(
    entry_point="train.py",
    source_dir='initial_demo',
    role=get_execution_role(),
    framework_version=framework_version,
    output_path= f's3://{s3_bucket}/{output_root}/model_output/',
    instance_count=1, # Flexibility if parallelizable
    instance_type="ml.m5.large",  # Flexibility to scale up if needed
    base_job_name="rf-scikit",
    use_spot_instances=True, # Keep it cheap
    max_wait=1000, # Needed when using spot
    max_run = 800, # Needed when using spot
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "max_depth": 4,
        "target": "target"
    },
    git_config=git_config
)



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
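
A note on the two spot settings used above: `max_run` caps the training runtime in seconds, while `max_wait` caps the total time, including time spent waiting for spot capacity, so `max_wait` must be at least `max_run`. The small check below is a hypothetical helper (not part of the SageMaker SDK) that makes this relationship explicit before a job is submitted.

```python
def check_spot_limits(max_run, max_wait):
    """Validate spot time limits and return the spare waiting time in seconds.

    The return value is the time left for waiting on spot capacity
    if the job uses its full runtime budget.
    """
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}) must be greater than or equal to max_run ({max_run})"
        )
    return max_wait - max_run
```

With the values used in this notebook, `check_spot_limits(800, 1000)` leaves 200 seconds of slack for acquiring capacity.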


In [3]:
train_s3_path = f"s3://{s3_bucket}/{s3_location_train}"
test_s3_path = f"s3://{s3_bucket}/{s3_location_test}"
sklearn_estimator.fit({"train": train_s3_path, "test": test_s3_path})

Cloning into '/tmp/tmpl8mnmplz'...
INFO:sagemaker:Creating training-job with name: rf-scikit-2024-09-16-17-54-45-674


2024-09-16 17:54:47 Starting - Starting the training job...
2024-09-16 17:55:00 Starting - Preparing the instances for training...
2024-09-16 17:55:50 Downloading - Downloading the training image........
2024-09-16 17:57:01,276 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-09-16 17:57:01,280 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-16 17:57:01,283 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-16 17:57:01,304 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-09-16 17:57:01,552 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-16 17:57:01,556 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-16 17:57:01,578 sagemaker-training-toolkit INFO     No GPUs detected (normal if

## Compute Predictions in Bulk

- Define the model to use
- Define the git repo with the code that will be used (inference.py)
- Define the input data and the model location
- Define the output location in S3 where predictions will go

This simulates generating predictions from a trained model. Predictions could be produced during training itself, but a separate job better reflects a practical scenario in which predictions are generated against a model already in production.

We will reuse the SKLearn container image here. It is intended for model training, but it can be repurposed for batch predictions as well. If real-time predictions are required, a different process can stand up an API endpoint. We could also use other services (Glue, SageMaker Batch Transform, even Lambda if the model is small enough), but we'll use this approach for simplicity.
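
Because the trained model is passed to this job as an input channel, the script receives `model.tar.gz` as an ordinary file in the channel directory and has to unpack it before loading the model. The actual `inference.py` is in the git repo and is not shown here; the sketch below only illustrates that unpacking step, and the function name `extract_model_archive` and the directory layout are assumptions.

```python
import os
import tarfile


def extract_model_archive(channel_dir, dest_dir, archive_name="model.tar.gz"):
    """Unpack a model archive found in an input channel directory.

    Returns the sorted names of the entries now present in dest_dir.
    """
    archive_path = os.path.join(channel_dir, archive_name)
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        # Use the safer "data" extraction filter where this Python version offers it
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))
```

Inside the container, `channel_dir` would typically come from the `SM_CHANNEL_TRAIN` environment variable, since the model is supplied under the `train` channel name. Files the script writes to the directory named by `SM_OUTPUT_DATA_DIR` are packaged into `output.tar.gz` under the job's output path.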

In [4]:
# Define git config

from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role

git_config = {'repo': 'https://github.com/WesleyPasfield/sagemaker_demo.git'}
framework_version = '1.2-1'
output_root = 'housingdemo/output'
training_job_name = sklearn_estimator._current_job_name

sklearn_predict = SKLearn(
    entry_point="inference.py",
    source_dir='initial_demo',
    role=get_execution_role(),
    framework_version=framework_version,
    use_spot_instances=True,
    max_wait=1000,
    max_run = 800,
    output_path= f's3://{s3_bucket}/{output_root}/prediction_output/{training_job_name}/',
    instance_count=1, # Flexibility if parallelizable
    instance_type="ml.m5.large",  # Flexibility to scale up if needed
    base_job_name="rf-preds",
    hyperparameters={
        "n-estimators": 100,
        "min-samples-leaf": 3,
        "max_depth": 4,
        "target": "target"
    },
    git_config=git_config
)

In [5]:
model_s3_path = f'{sklearn_estimator.output_path}{sklearn_estimator._current_job_name}/output/model.tar.gz'
test_s3_path = f"s3://{s3_bucket}/{s3_location_test}"
sklearn_predict.fit({"train": model_s3_path, "test": test_s3_path})

Cloning into '/tmp/tmp7jzlhpmi'...
INFO:sagemaker:Creating training-job with name: rf-preds-2024-09-16-17-58-03-817


2024-09-16 17:58:05 Starting - Starting the training job...
2024-09-16 17:58:19 Starting - Preparing the instances for training...
2024-09-16 17:59:07 Downloading - Downloading the training image......
2024-09-16 18:00:08 Training - Training image download completed. Training in progress..
2024-09-16 18:00:10,791 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2024-09-16 18:00:10,794 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-16 18:00:10,797 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-09-16 18:00:10,814 sagemaker_sklearn_container.training INFO     Invoking user training script.
2024-09-16 18:00:11,090 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2024-09-16 18:00:11,094 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

In [6]:
# Helper: download the prediction archive from S3 and load its CSV into a DataFrame.
# Relies on a module-level `s3_client`, which is created in the next cell before the call.

def prediction_reader(s3_bucket, sklearn_predict_job):
    pattern = r'^[^/]+://[^/]+/'
    s3_location_preds = re.sub(pattern, '', f'{sklearn_predict_job.output_path}{sklearn_predict_job._current_job_name}/output/output.tar.gz')
    print(f'S3 location with predictions: {s3_location_preds}')

    s3_preds = s3_client.get_object(Bucket=s3_bucket, Key=s3_location_preds)
    prediction_raw = s3_preds['Body'].read()

    # Create a BytesIO object from the S3 object body
    tar_content = io.BytesIO(prediction_raw)

    # Open the tarfile
    with tarfile.open(fileobj=tar_content, mode='r:gz') as tar:
        # Find the first CSV file in the archive
        csv_file = next((f for f in tar.getmembers() if f.name.endswith('.csv')), None)

        if csv_file is None:
            raise ValueError("No CSV file found in the tar.gz archive")

        # Extract the CSV file content
        csv_content = tar.extractfile(csv_file)

        # Read the CSV content into a pandas DataFrame
        preds = pd.read_csv(csv_content, header=0)

    return preds
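
The helper above strips the `s3://bucket/` prefix with a regular expression to obtain the object key. An alternative sketch, using `urllib.parse` from the standard library, returns both the bucket and the key from a full S3 URI. The function name `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path keeps a leading slash, which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")
```

Deriving the bucket from the URI would remove the need to pass `s3_bucket` separately, at the cost of one more helper.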

## Review Predictions

After running the model & generating predictions, we can pull predictions down from S3 to review.

Note that the notebook only orchestrates training and prediction generation; the heavy computation runs on separate instances. The predictions themselves are loaded directly into the notebook environment. Because the predictions are likely small, we can scale up to larger compute only when needed while keeping a small, persistent instance available for analysis. The helper function that pulls in the data is in the cell above for reference.
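
The archive handling in that helper can be tried locally without S3. The sketch below builds a small `.tar.gz` in memory and reads the first CSV member back, mirroring the helper's logic. The function names, the file name `predictions.csv`, and the column names are made up for illustration, and the sketch uses the standard-library `csv` module; `pd.read_csv` accepts the same extracted file object.

```python
import csv
import io
import tarfile


def build_csv_archive(name, text):
    """Pack one CSV file into an in-memory .tar.gz archive."""
    data = text.encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_first_csv(archive_bytes):
    """Return the rows of the first .csv member as a list of dicts."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        member = next((m for m in tar.getmembers() if m.name.endswith(".csv")), None)
        if member is None:
            raise ValueError("No CSV file found in the tar.gz archive")
        text = tar.extractfile(member).read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Round trip with made-up values
archive = build_csv_archive("predictions.csv", "id,prediction\n0,1.5\n1,2.5\n")
rows = read_first_csv(archive)
```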


In [7]:
s3_client = boto3.client('s3')

preds = prediction_reader(s3_bucket, sklearn_predict)
preds

S3 location with predictions: housingdemo/output/prediction_output/rf-scikit-2024-09-16-17-54-45-674/rf-preds-2024-09-16-17-58-03-817/output/output.tar.gz


| Unnamed: 0 | # predictions |
|---|---|
| 0 | 1.239431 |
| 1 | 1.226810 |
| 2 | 3.162654 |
| 3 | 2.587397 |
| 4 | 1.748417 |
| ... | ... |
| 5155 | 3.822272 |
| 5156 | 1.028411 |
| 5157 | 1.465872 |
| 5158 | 3.165391 |
