# Prepare data for ML using SageMaker Processing

This notebook demonstrates how we can move the prototyped scripts developed in the notebook [Data-Preparation](./../00-onboard/00-Data-Preparation.ipynb) into plain Python scripts for executing SageMaker Processing Jobs.

**SageMaker Studio Kernel**: Data Science

In this notebook you will:
 - Run a preprocessing job with Amazon SageMaker Processing to prepare data for training ML models

***

# Dataset

The dataset (The Social Dilemma Tweets - Text Classification 2020) was downloaded from [Kaggle](https://www.kaggle.com/datasets/kaushiksuresh147/the-social-dilemma-tweets).
It contains the Twitter responses posted with the #TheSocialDilemma hashtag by viewers of the documentary "The Social Dilemma", released on an OTT platform (Netflix) on September 9th, 2020.
The dataset was extracted using the Twitter API and consists of nearly 10,526 tweets from Twitter users all over the globe.

We'd like to train a model that determines the sentiment of a tweet from the content of its text.

This is a multi-class classification problem:
* Negative - 0
* Neutral - 1
* Positive - 2
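
As a small illustration of the label encoding listed above, the mapping can be written as a dictionary. This is only a sketch: the names `SENTIMENT_TO_ID`, `ID_TO_SENTIMENT`, and `encode_sentiment` are hypothetical, and it assumes the sentiment labels are stored as the strings shown in the list.

```python
# Label encoding taken from the list above.
SENTIMENT_TO_ID = {"Negative": 0, "Neutral": 1, "Positive": 2}

# Reverse lookup, handy for turning model predictions back into labels.
ID_TO_SENTIMENT = {idx: label for label, idx in SENTIMENT_TO_ID.items()}


def encode_sentiment(label: str) -> int:
    """Return the integer class for a sentiment label (case-insensitive)."""
    return SENTIMENT_TO_ID[label.strip().capitalize()]
```

An unknown label raises a `KeyError`, which makes unexpected values in the data visible early.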

In [None]:
! rm -rf ./data && mkdir -p data
! curl https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/tweets_dataset/TheSocialDilemma.csv -o data/data.csv

# Step 1 - Import Modules

Here we’ll import some libraries and define some variables.

In [None]:
import boto3
import sagemaker
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn.estimator import SKLearn

In [None]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client("s3")

Create a SageMaker session and store the default region and the execution role in Python variables.

In [None]:
sagemaker_session = sagemaker.Session()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

In [None]:
bucket_name = sagemaker_session.default_bucket()

## Upload the dataset to the default Amazon S3 Bucket

To make the data available to the SageMaker Processing Job, let's copy the dataset to the default S3 bucket.

In [None]:
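# Clean up objects left under the input prefix by a previous run.
# delete_object removes a single exact key, so list the prefix and delete each object.
for obj in s3_client.list_objects_v2(Bucket=bucket_name, Prefix="e2e-base/data/input/").get("Contents", []):
    s3_client.delete_object(Bucket=bucket_name, Key=obj["Key"])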

input_data = sagemaker_session.upload_data('./data/data.csv', key_prefix="e2e-base/data/input")

input_data

***

# Step 2 - Run the processing job

With [FrameworkProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job-frameworks.html), we can provide the SageMaker Processing Job with the execution scripts and a `requirements.txt` file for installing additional Python modules.

To make sure that Amazon SageMaker installs our additional Python modules from `requirements.txt`, we compress the content of the [processing](./code/processing) folder and upload it to the default S3 bucket.

In [None]:
! pygmentize ./code/processing/processing.py

In [None]:
! ./code/buildspec.sh processing
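
The content of `buildspec.sh` is not reproduced in this notebook. As a rough sketch of what a packaging step like this might do, the snippet below archives a folder into `sourcedir.tar.gz` with the files placed at the archive root. The function name `build_sourcedir` and the flat archive layout are assumptions for illustration, not the script's actual behavior.

```python
import tarfile
from pathlib import Path


def build_sourcedir(src_dir: str, dist_dir: str) -> Path:
    """Archive every file under src_dir into dist_dir/sourcedir.tar.gz."""
    src = Path(src_dir)
    out = Path(dist_dir) / "sourcedir.tar.gz"
    out.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so that the script and
                # requirements.txt sit at the root of the archive.
                tar.add(path, arcname=path.relative_to(src).as_posix())
    return out
```

For example, `build_sourcedir("./code/processing", "./code/dist/processing")` would produce an archive at the path uploaded in the next cell. Keep `dist_dir` outside `src_dir` so the archive does not include itself.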

Upload the generated `sourcedir.tar.gz` to the default S3 bucket.

In [None]:
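# Clean up objects left under the artifact prefix by a previous run.
# delete_object removes a single exact key, so list the prefix and delete each object.
for obj in s3_client.list_objects_v2(Bucket=bucket_name, Prefix="e2e-base/artifact/processing/").get("Contents", []):
    s3_client.delete_object(Bucket=bucket_name, Key=obj["Key"])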

code_path = sagemaker_session.upload_data('./code/dist/processing/sourcedir.tar.gz', key_prefix="e2e-base/artifact/processing")

code_path

## Global Parameters

To allow users to run the SageMaker Processing Job locally, we define the variable `local_mode`. To test the local mode capability, set the variable to `True`.

In [None]:
local_mode = False
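
One possible way to wire a flag like `local_mode` into the job setup is sketched below. The helper names `resolve_instance_type` and `make_session` are hypothetical. The sketch assumes the SageMaker Python SDK's local mode, which uses `LocalSession` together with the instance type `"local"` and needs Docker on the machine running the notebook.

```python
def resolve_instance_type(local_mode: bool, instance_type: str) -> str:
    """Return "local" when running in local mode, else the configured type."""
    return "local" if local_mode else instance_type


def make_session(local_mode: bool):
    """Create a LocalSession in local mode, else a regular SageMaker Session."""
    # Imported lazily so resolve_instance_type works without the SDK installed.
    import sagemaker
    from sagemaker.local import LocalSession

    if local_mode:
        session = LocalSession()
        # Tell the SDK to run the job with local code in a local container.
        session.config = {"local": {"local_code": True}}
        return session
    return sagemaker.Session()
```

With helpers like these, the processor could be given `sagemaker_session=make_session(local_mode)` and `instance_type=resolve_instance_type(local_mode, processing_instance_type)`.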

In [None]:
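# URI of the sagemaker-processing-sklearn image in this account's Amazon ECR registry
account_id = boto3.client("sts").get_caller_identity()["Account"]
processing_image_uri = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-sklearn:latest".format(account_id, region)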

processing_artifact_path = "e2e-base/artifact/processing"
processing_artifact_name = "sourcedir.tar.gz"

processing_input_files_path = "e2e-base/data/input"
processing_output_files_path = "e2e-base/data/output"

processing_instance_count = 1
processing_instance_type = "ml.t3.large"

Define the `FrameworkProcessor` object

In [None]:
processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    image_uri=processing_image_uri,
    framework_version=None,
    role=role,
    instance_count=processing_instance_count,
    instance_type=processing_instance_type,
    sagemaker_session=sagemaker_session
)

In [None]:
run_args = processor.get_run_args(
    "processing.py",
    source_dir="s3://{}/{}/{}".format(
        bucket_name, processing_artifact_path, processing_artifact_name
    ),
    inputs=[
        ProcessingInput(
            input_name="input",
            source="s3://{}/{}".format(bucket_name, processing_input_files_path),
            destination="/opt/ml/processing/input"
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="output",
            source="/opt/ml/processing/output",
            destination="s3://{}/{}".format(bucket_name, processing_output_files_path)
        )
    ]
)
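
The S3 locations passed to `get_run_args` above are assembled with `str.format`. As an optional alternative, a small helper can join the bucket and key parts while guarding against stray slashes. The helper `s3_uri` is hypothetical and not part of the SageMaker SDK.

```python
def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket and key parts into an s3:// URI without duplicate slashes."""
    # Strip leading/trailing slashes from each part and drop empty parts.
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"
```

For instance, `s3_uri(bucket_name, processing_artifact_path, processing_artifact_name)` yields the same value as the `source_dir` argument above.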

In [None]:
processor.run(
    code=run_args.code,
    inputs=run_args.inputs,
    outputs=run_args.outputs,
    wait=True
)
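
Inside the container, the script finds the input files under `/opt/ml/processing/input` and anything it writes to `/opt/ml/processing/output` is uploaded to the configured S3 destination. The actual `processing.py` is not reproduced here. The snippet below is only a hypothetical sketch of that input/output contract: the function name `preprocess`, the column name `text`, and the filtering rule (dropping rows with empty text) are assumptions for illustration.

```python
import csv
from pathlib import Path


def preprocess(input_dir: str = "/opt/ml/processing/input",
               output_dir: str = "/opt/ml/processing/output",
               text_col: str = "text") -> int:
    """Copy rows with a non-empty text column from input CSVs to output_dir.

    Returns the number of rows written.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for src_path in sorted(Path(input_dir).glob("*.csv")):
        with open(src_path, newline="", encoding="utf-8") as src, \
             open(out_dir / src_path.name, "w", newline="", encoding="utf-8") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [],
                                    extrasaction="ignore")
            writer.writeheader()
            for row in reader:
                # Keep only rows whose text column is present and non-blank.
                if (row.get(text_col) or "").strip():
                    writer.writerow(row)
                    written += 1
    return written
```

Because the directories are parameters, the same function can be tried on a local folder before it runs in a Processing Job.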