## Task 3: Perform data processing with SageMaker Processing and the built-in scikit-learn container

In this notebook, you set up the environment needed to run a scikit-learn script using a Docker image provided and maintained by Amazon SageMaker Processing. 

You then use the **SKLearnProcessor** class from the Amazon SageMaker Python SDK to define and run a scikit-learn processing job.

Finally, you validate the data processing results saved in Amazon Simple Storage Service (Amazon S3).

**Note:** The processing script does some basic data processing, such as removing duplicates, transforming the target column into a column that contains two labels, one-hot encoding, and an 80-20 split to produce training and test datasets. 
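
The steps listed in the note can be sketched with pandas on a tiny made-up sample. This is only an illustration under stated assumptions, not the lab's **sklearn_preprocessing.py** script: the column names (`age`, `workclass`, `income`) and the values are hypothetical, and the split uses `DataFrame.sample` instead of whatever the lab script uses.

```python
import pandas as pd

# Hypothetical sample that mimics a few columns of the lab dataset.
df = pd.DataFrame(
    {
        "age": [39, 39, 50, 28, 45, 33],
        "workclass": ["Private", "Private", "State-gov", "Private", "Local-gov", "Private"],
        "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"],
    }
)

# 1. Remove exact duplicate rows (the first two rows are identical).
df = df.drop_duplicates().reset_index(drop=True)

# 2. Transform the target column into two numeric labels (0 and 1).
df["income"] = (df["income"] == ">50K").astype(int)

# 3. One-hot encode the categorical feature.
df = pd.get_dummies(df, columns=["workclass"])

# 4. 80-20 split: hold out 20% of the rows as the test dataset.
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

print(train_df.shape, test_df.shape)
```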

### Task 3.1: Set up the environment

In this task, you install the needed packages and dependencies. 

You set up an Amazon S3 bucket to store the outputs from the processing job and get the execution role to run the SageMaker Processing job.

Update the system `libstdc++` library by replacing it with the copy bundled with conda. To do so, run the following commands.

In [1]:
%%sh
sudo rm /usr/lib/x86_64-linux-gnu/libstdc++.so.6
sudo cp /opt/conda/lib/libstdc++.so.6 /usr/lib/x86_64-linux-gnu/libstdc++.so.6

In [2]:
#install-dependencies
import logging
import boto3
import sagemaker
import pandas as pd
from sagemaker.s3 import S3Downloader

sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

#Execution role to run the SageMaker Processing job
role = sagemaker.get_execution_role()
print("SageMaker Execution Role: ", role)

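#S3 bucket for the SKLearn processing job inputs and outputs
s3 = boto3.resource('s3')
bucket = None
for candidate in s3.buckets.all():
    if 'labdatabucket' in candidate.name:
        bucket = candidate.name
#Fail early with a clear message if the lab bucket is missing
if bucket is None:
    raise RuntimeError("No S3 bucket with 'labdatabucket' in its name was found.")
print("Bucket: ", bucket)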

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
SageMaker Execution Role:  arn:aws:iam::281777908394:role/LabVPC-notebook-role
Bucket:  labdatabucket-us-west-2-226013955


### Task 3.2: Run the SageMaker processing job

In this task, you import and review the raw dataset, and then run a SageMaker processing job to preprocess it.

In [3]:
#import-data
prefix = 'data/input'

S3Downloader.download(s3_uri=f"s3://{bucket}/{prefix}/adult_data.csv", local_path='data/')

#Load the raw CSV (the file has no header row) and preview five random rows
df = pd.read_csv("data/adult_data.csv", header=None)
df.sample(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 533 | 60 | Private | 209844 | Some-college | 10 | Divorced | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 208 | 73 | Local-gov | 143437 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 739 | 41 | Local-gov | 129793 | HS-grad | 9 | Never-married | Other-service | Own-child | White | Male | 0 | 0 | 30 | United-States | <=50K |
| 793 | 57 | Private | 111385 | 10th | 6 | Never-married | Other-service | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 461 | 50 | State-gov | 196900 | HS-grad | 9 | Married-civ-spouse | Other-service | Wife | White | Female | 0 | 0 | 39 | United-States | <=50K |


For this lab, the processing script performs data transformations such as removing duplicates, transforming the target column into a column that contains two labels, and one-hot encoding the categorical features.

You then create an instance of the SKLearnProcessor class to define and run a scikit-learn processing script as a processing job. Refer to [SageMaker scikit-learn SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor) for more information about this class.

To create the SKLearnProcessor instance, you configure the following parameters:
- **base_job_name**: Prefix for the processing job name
- **framework_version**: scikit-learn version
- **role**: SageMaker execution role
- **instance_count**: Number of instances to run the processing job
- **instance_type**: Type of Amazon Elastic Compute Cloud (Amazon EC2) instance used for the processing job

In [4]:
#scikit-learn-processor
import os
from sagemaker.sklearn.processing import SKLearnProcessor

# create a SKLearnProcessor
sklearn_processor = SKLearnProcessor(
    base_job_name="sklearn-preprocessor",
    framework_version="1.2-1", 
    role=role, 
    instance_count=1,
    instance_type="ml.m5.xlarge"
)

Defaulting to only available Python version: py3



Next, you use the **SKLearnProcessor.run()** method to run a **sklearn_preprocessing.py** script as a processing job. 

To run the processing job, you configure the following parameters:
- **code**: Path of the preprocessing script 
- **inputs and outputs**: Path of input and output for the preprocessing script (Amazon S3 input and output locations)
- **arguments**: Command-line arguments to the preprocessing script (such as a train and test split ratio)

The processing job takes approximately 4–5 minutes to complete. While the job is running, you can review the source for the preprocessing script (which has been preconfigured as part of this lab) by opening the **sklearn_preprocessing.py** file from the file browser.
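
To make the link between these parameters and the script more concrete, the following minimal sketch shows how a processing script could read the `--train-test-split-ratio` argument and locate its input and output directories inside the container. It is not the contents of **sklearn_preprocessing.py**. The helper name `parse_args` and the default value are assumptions made for illustration. The directory constants mirror the container paths configured for this job.

```python
import argparse
import os

# Container-side paths. These mirror the ProcessingInput and ProcessingOutput
# locations configured for the job in this notebook.
INPUT_DIR = "/opt/ml/processing/input"
TRAIN_DIR = "/opt/ml/processing/train"
TEST_DIR = "/opt/ml/processing/test"


def parse_args(argv=None):
    """Parse the command-line arguments passed through the `arguments` parameter."""
    parser = argparse.ArgumentParser()
    # Assumed default of 0.2 (an 80-20 split); the lab script may differ.
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    return parser.parse_args(argv)


# Simulate the arguments that the processing job passes to the script.
args = parse_args(["--train-test-split-ratio", "0.2"])
input_path = os.path.join(INPUT_DIR, "adult_data.csv")

print("Split ratio:", args.train_test_split_ratio)
print("Reading input from:", input_path)
```

Files that the script writes to the train and test directories are uploaded to the corresponding Amazon S3 destinations when the job finishes.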

In [None]:
#processing-job
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Amazon S3 path prefix
input_raw_data_prefix = "data/input"
output_preprocessed_data_prefix = "data/output"

# Run the processing job
sklearn_processor.run(
    code="sklearn_preprocessing.py",
    inputs=[ProcessingInput(source="s3://" + os.path.join(bucket, input_raw_data_prefix, "adult_data.csv"),
                            destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train_data", 
                         source="/opt/ml/processing/train",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "train")),
        ProcessingOutput(output_name="test_data", 
                         source="/opt/ml/processing/test",
                         destination="s3://" + os.path.join(bucket, output_preprocessed_data_prefix, "test")),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

print("SKLearn Processing Job Completed.")

Creating processing-job with name sklearn-preprocessor-2024-11-05-14-47-38-476
INFO:sagemaker:Creating processing-job with name sklearn-preprocessor-2024-11-05-14-47-38-476


.....
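
After the job finishes, you may want to confirm where each output was written. In the notebook, `sklearn_processor.jobs[-1].describe()` returns the job description as a dictionary. The sketch below extracts the output locations from such a dictionary. The helper name `output_uris` is hypothetical, and the sample dictionary is a hand-written stand-in that contains only the relevant keys, with a placeholder bucket name.

```python
def output_uris(job_description):
    """Map each processing output name to its Amazon S3 URI."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# In the notebook, use: description = sklearn_processor.jobs[-1].describe()
# The dictionary below is an illustrative stand-in, not real job output.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/data/output/train",
                    "LocalPath": "/opt/ml/processing/train",
                },
            },
            {
                "OutputName": "test_data",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/data/output/test",
                    "LocalPath": "/opt/ml/processing/test",
                },
            },
        ]
    }
}

uris = output_uris(sample_description)
for name, uri in uris.items():
    print(name, "->", uri)
```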

### Task 3.3: Validate the data processing results

In this task, you validate the output of the processing job that you ran by reviewing the first five rows of the train and test output datasets.

In [None]:
#view-train-dataset
print("Top 5 rows from s3://{}/{}/train/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/train/train_features.csv - | head -n5

In [None]:
#view-test-dataset
print("Top 5 rows from s3://{}/{}/test/".format(bucket, output_preprocessed_data_prefix))
!aws s3 cp --quiet s3://$bucket/$output_preprocessed_data_prefix/test/test_features.csv - | head -n5
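
Beyond viewing the first rows, a quick sanity check could compare the row counts and column counts of the two datasets. The following sketch assumes that the CSV contents are already available as strings and that the files have no header row. The helper name `split_summary` and the sample rows are hypothetical.

```python
import csv
import io


def split_summary(train_csv_text, test_csv_text):
    """Summarize row counts, test fraction, and column-count agreement."""
    train_rows = list(csv.reader(io.StringIO(train_csv_text)))
    test_rows = list(csv.reader(io.StringIO(test_csv_text)))
    total = len(train_rows) + len(test_rows)
    return {
        "train_rows": len(train_rows),
        "test_rows": len(test_rows),
        "test_fraction": len(test_rows) / total,
        "same_width": len(train_rows[0]) == len(test_rows[0]),
    }


# Made-up feature rows: four for training and one for testing.
train_text = "1,0,1\n0,1,0\n1,1,0\n0,0,1\n"
test_text = "1,0,0\n"

summary = split_summary(train_text, test_text)
print(summary)
```

For an 80-20 split, the test fraction is expected to be close to 0.2, and both datasets are expected to have the same number of columns.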

### Conclusion

Congratulations! You have successfully used the SageMaker Python SDK to create and run a scikit-learn processing job with SageMaker Processing.

The next task of the lab focuses on data processing using SageMaker Processing with your own processing container.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with **Task 4: Perform data processing with your own container**.