# Amazon SageMaker Processing Jobs - Using Scikit-learn to preprocess data

In this tutorial, we use a ready-made scikit-learn container to preprocess data, and we submit the job to run in the cloud with the SageMaker Python SDK. We also show how to reuse the same scikit-learn container with additional requirements (to manage extra dependencies) without building a custom image.

In [1]:
# Import general modules
import boto3
import sagemaker

In [3]:
# Set global variables and initialize the sessions used throughout the notebook

region = "eu-west-2" # Replace with your region
sagemaker_session = sagemaker.session.Session()
default_bucket = sagemaker_session.default_bucket() # Replace if you have another bucket in mind
prefix_bucket = "sagemaker-training-preprocessing" # Use it in case you want to put your artifacts inside another directory

# Get the execution role ARN used to perform operations in the cloud
try:
    # get_execution_role() only works within SageMaker Studio or a notebook instance
    role_arn = sagemaker.get_execution_role()
except ValueError:
    # Otherwise, look up the role ARN by role name through an IAM client
    iam = boto3.client('iam')
    role_arn = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230204T144648')['Role']['Arn']
    print("Role ARN successfully extracted")

Couldn't call 'get_role' to get Role ARN from role name francisco-learning to get Role path.


Role ARN successfully extracted


## Download data

In [6]:
# Download data from s3 bucket to local
!aws s3 cp s3://sagemaker-sample-data-$region/processing/census/census-income.csv .

Completed 256.0 KiB/99.1 MiB (902.8 KiB/s) with 1 file(s) remaining

In [11]:
# Read data and store in pandas df
import pandas as pd

input_data = "census-income.csv"  # Downloaded to the working directory by the previous cell
df = pd.read_csv(input_data, nrows=10)
df.head(n=10)

Unnamed: 0,age,class of worker,detailed industry recode,detailed occupation recode,education,wage per hour,enroll in edu inst last wk,marital stat,major industry code,major occupation code,...,country of birth father,country of birth mother,country of birth self,citizenship,own business or self employed,fill inc questionnaire for veteran's admin,veterans benefits,weeks worked in year,year,income
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.
3,9,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
4,10,Not in universe,0,0,Children,0,Not in universe,Never married,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,0,0,94,- 50000.
5,48,Private,40,10,Some college but no degree,1200,Not in universe,Married-civilian spouse present,Entertainment,Professional specialty,...,Philippines,United-States,United-States,Native- Born in the United States,2,Not in universe,2,52,95,- 50000.
6,42,Private,34,3,Bachelors degree(BA AB BS),0,Not in universe,Married-civilian spouse present,Finance insurance and real estate,Executive admin and managerial,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
7,28,Private,4,40,High school graduate,0,Not in universe,Never married,Construction,Handlers equip cleaners etc,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,30,95,- 50000.
8,47,Local government,43,26,Some college but no degree,876,Not in universe,Married-civilian spouse present,Education,Adm support including clerical,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,95,- 50000.
9,34,Private,4,37,Some college but no degree,0,Not in universe,Married-civilian spouse present,Construction,Machine operators assmblrs & inspctrs,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.


## Preprocessing Job 

This job uses the ready-made scikit-learn container provided by AWS.

In [14]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning

warnings.filterwarnings(action="ignore", category=DataConversionWarning)


columns = [
    "age",
    "education",
    "major industry code",
    "class of worker",
    "num persons worked for employer",
    "capital gains",
    "capital losses",
    "dividends from stocks",
    "income",
]
class_labels = [" - 50000.", " 50000+."]


def print_shape(df):
    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data shape: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print("Received arguments {}".format(args))

    input_data_path = os.path.join("/opt/ml/processing/input", "census-income.csv")

    print("Reading input data from {}".format(input_data_path))
    df = pd.read_csv(input_data_path)
    df = pd.DataFrame(data=df, columns=columns)
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    df.replace(class_labels, [0, 1], inplace=True)

    negative_examples, positive_examples = np.bincount(df["income"])
    print(
        "Data after cleaning: {}, {} positive examples, {} negative examples".format(
            df.shape, positive_examples, negative_examples
        )
    )

    split_ratio = args.train_test_split_ratio
    print("Splitting data into train and test sets with ratio {}".format(split_ratio))
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop("income", axis=1), df["income"], test_size=split_ratio, random_state=0
    )

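    # Note: the (columns, transformer) tuple order follows scikit-learn 0.20.0, the version
    # pinned for the processing container; later releases expect (transformer, columns)
    preprocess = make_column_transformer(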
        (
            ["age", "num persons worked for employer"],
            KBinsDiscretizer(encode="onehot-dense", n_bins=10),
        ),
        (["capital gains", "capital losses", "dividends from stocks"], StandardScaler()),
        (["education", "major industry code", "class of worker"], OneHotEncoder(sparse=False)),
    )
    print("Running preprocessing and feature engineering transformations")
    train_features = preprocess.fit_transform(X_train)
    test_features = preprocess.transform(X_test)

    print("Train data shape after preprocessing: {}".format(train_features.shape))
    print("Test data shape after preprocessing: {}".format(test_features.shape))

    train_features_output_path = os.path.join("/opt/ml/processing/train", "train_features.csv")
    train_labels_output_path = os.path.join("/opt/ml/processing/train", "train_labels.csv")

    test_features_output_path = os.path.join("/opt/ml/processing/test", "test_features.csv")
    test_labels_output_path = os.path.join("/opt/ml/processing/test", "test_labels.csv")

    print("Saving training features to {}".format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)

    print("Saving test features to {}".format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)

    print("Saving training labels to {}".format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)

    print("Saving test labels to {}".format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)

Overwriting preprocessing.py
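
The script reads its split ratio from the command line, which is how the list passed through `arguments` in the `run` call reaches it. The following is a minimal, standalone sketch of that parsing step; `parse_split_ratio` is a hypothetical helper introduced here for illustration and is not part of `preprocessing.py`.

```python
import argparse


def parse_split_ratio(argv):
    """Parse --train-test-split-ratio the same way preprocessing.py does."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    # parse_known_args returns (namespace, leftovers) and does not fail on unknown flags
    args, unknown = parser.parse_known_args(argv)
    return args.train_test_split_ratio, unknown


# Arguments shaped like the list given to `arguments=[...]` in the job definition
ratio, unknown = parse_split_ratio(["--train-test-split-ratio", "0.2"])
print(ratio, unknown)

# Without the flag the default applies; an unrecognised flag is returned, not rejected
default_ratio, extra = parse_split_ratio(["--some-other-flag", "1"])
print(default_ratio, extra)
```

Using `parse_known_args` rather than `parse_args` means the script tolerates extra arguments it does not recognise instead of exiting with an error.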


In [17]:
# Define the processor, specifying the container version and instance to use at run time
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    framework_version="0.20.0", role=role_arn, instance_type="ml.t3.large", instance_count=1
)

INFO:sagemaker.image_uris:Defaulting to only available Python version: py3


In [18]:
# Run processing job, specifying code, inputs, outputs, etc.
from sagemaker.processing import ProcessingInput, ProcessingOutput
sklearn_processor.run(
    code="preprocessing.py",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2023-02-04-15-18-08-063



Job Name:  sagemaker-scikit-learn-2023-02-04-15-18-08-063
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sagemaker-scikit-learn-2023-02-04-15-18-08-063/input/input-1/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sagemaker-scikit-learn-2023-02-04-15-18-08-063/input/code/preprocessing.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'AppManaged': False, 'S3Output': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sagemaker-scikit-learn-2023-02-04-15-18-08-063/output/train_data', 'LocalPath': '/opt/ml/processing/train
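
The job above lets SageMaker pick the output location, under a prefix named after the job. As a minimal sketch, assuming one would rather write under the bucket and `prefix_bucket` defined at the top of the notebook, the target URIs could be built as below. `s3_uri` is a hypothetical helper and `example-bucket` is a placeholder standing in for `default_bucket`.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    key = "/".join(p.strip("/") for p in parts if p and p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Placeholder bucket name (assumption); in the notebook this would be `default_bucket`
bucket = "example-bucket"
prefix_bucket = "sagemaker-training-preprocessing"

train_destination = s3_uri(bucket, prefix_bucket, "train")
test_destination = s3_uri(bucket, prefix_bucket, "test")
print(train_destination)
print(test_destination)
```

Each string could then be passed as the `destination` argument of the matching `ProcessingOutput`, alongside the existing `output_name` and `source`.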

In [19]:
# Get details about the job execution, such as location of train, test data.
preprocessing_job_description = sklearn_processor.jobs[-1].describe()
output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train_data":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test_data":
        preprocessed_test_data = output["S3Output"]["S3Uri"]
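
The loop above can also be written as a single lookup table, which may be handy when a job has more than two outputs. This is a sketch only: `output_uris` is a hypothetical helper, and the sample dictionary is made up to mirror the keys the loop reads from `describe()`, not real job output.

```python
def output_uris(job_description):
    """Map each processing output name to its S3 URI."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Made-up description with the same shape as the fields read in the loop above
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {"S3Uri": "s3://example-bucket/job/output/train_data"},
            },
            {
                "OutputName": "test_data",
                "S3Output": {"S3Uri": "s3://example-bucket/job/output/test_data"},
            },
        ]
    }
}

sample_uris = output_uris(sample_description)
sample_train_uri = sample_uris["train_data"]
sample_test_uri = sample_uris["test_data"]
print(sample_train_uri)
print(sample_test_uri)
```

In the notebook, the same function could be called with `preprocessing_job_description` in place of the sample dictionary.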

In [25]:
# Download the preprocessed training data locally to inspect what it looks like (optional)
!aws s3 cp $preprocessed_training_data/train_features.csv .

Completed 256.0 KiB/17.8 MiB (1.2 MiB/s) with 1 file(s) remaining

## Preprocessing Job - with extra dependencies

The containers provided by AWS will not always suit our use case, and some additional dependencies might be needed. Instead of building an image ourselves, pushing it to ECR and using it, we keep the same scikit-learn container and provide a requirements file that is installed before the code executes.

In [27]:
%%writefile code/requirements.txt

#
###### Requirements without Version Specifiers ######
keras
tensorflow
pandas
seaborn
pillow

# More examples below to illustrate ways to list specific dependencies 
###### Requirements with Version Specifiers ######
#   See https://www.python.org/dev/peps/pep-0440/#version-specifiers
#numpy==1.13.0
#docopt == 0.6.1             # Version Matching. Must be version 0.6.1
#keyring >= 4.1.1            # Minimum version 4.1.1
#coverage != 3.5             # Version Exclusion. Anything except version 3.5
#Mopidy-Dirble ~= 1.1        # Compatible release. Same as >= 1.1, == 1.*
#
###### Example for referring to other requirements files with additional dependencies ######
# -r other-requirements.txt
#

Writing code/requirements.txt
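
Since the requirements file lives in `code/`, and that directory is what gets passed as `source_dir`, the entry-point script needs to sit in the same directory. To our understanding, the SDK looks for the script inside `source_dir` and rejects the job otherwise. A minimal sketch of staging the directory follows; `stage_source_dir` is a hypothetical helper, and the demonstration runs in a temporary directory with stand-in files so it does not touch the notebook's own files.

```python
import shutil
import tempfile
from pathlib import Path


def stage_source_dir(script_path, source_dir):
    """Copy the entry-point script into source_dir and check requirements.txt is there."""
    source_dir = Path(source_dir)
    source_dir.mkdir(parents=True, exist_ok=True)
    target = source_dir / Path(script_path).name
    shutil.copy(script_path, target)
    if not (source_dir / "requirements.txt").is_file():
        raise FileNotFoundError(f"requirements.txt not found in {source_dir}")
    return target


# Demonstration with stand-in files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    script = tmp_path / "preprocessing.py"
    script.write_text("print('hello')\n")
    code_dir = tmp_path / "code"
    code_dir.mkdir()
    (code_dir / "requirements.txt").write_text("seaborn\n")

    staged = stage_source_dir(script, code_dir)
    staged_names = sorted(p.name for p in code_dir.iterdir())
    print(staged_names)
```

In the notebook itself, the equivalent call would be `stage_source_dir("preprocessing.py", "code")`.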


In [28]:
from sagemaker.processing import FrameworkProcessor

est_cls = sagemaker.sklearn.estimator.SKLearn
framework_version_str = "0.20.0"

script_processor = FrameworkProcessor(
    role=role_arn,
    instance_count=1,
    instance_type="ml.t3.large",
    estimator_cls=est_cls,
    framework_version=framework_version_str,
)

In [30]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

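# With source_dir set, `code` is resolved relative to it, so preprocessing.py
# must be present in code/ next to requirements.txt
script_processor.run(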
    code="preprocessing.py",
    source_dir="code",
    inputs=[ProcessingInput(source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)

INFO:sagemaker.processing:Uploaded code to s3://sagemaker-eu-west-2-247231311879/sklearn-2023-02-04-15-42-14-454/source/sourcedir.tar.gz
INFO:sagemaker.processing:runproc.sh uploaded to s3://sagemaker-eu-west-2-247231311879/sklearn-2023-02-04-15-42-14-454/source/runproc.sh
INFO:sagemaker:Creating processing-job with name sklearn-2023-02-04-15-42-14-454



Job Name:  sklearn-2023-02-04-15-42-14-454
Inputs:  [{'InputName': 'input-1', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sklearn-2023-02-04-15-42-14-454/input/input-1/census-income.csv', 'LocalPath': '/opt/ml/processing/input', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sklearn-2023-02-04-15-42-14-454/source/sourcedir.tar.gz', 'LocalPath': '/opt/ml/processing/input/code/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}, {'InputName': 'entrypoint', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/sklearn-2023-02-04-15-42-14-454/source/runproc.sh', 'LocalPath': '/opt/ml/processing/input/entrypoint', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistri