
# Dataset analysis with Amazon SageMaker Processing Jobs using Apache Spark


Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Spark are used to pre-process datasets in order to prepare them for training. In this notebook, we'll use Amazon SageMaker Processing and leverage the power of Spark in a managed SageMaker environment to run our preprocessing workload. Then, we'll take our preprocessed dataset and train a regression model using XGBoost.

In [None]:
!pip install pandas==1.0.2

In [None]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

## Using Amazon SageMaker Processing to execute a SparkML job

### Build a Spark container for running the preprocessing job

An example Spark container is included in the `./container` directory of this example. The container handles the bootstrapping of all Spark configuration, and serves as a wrapper around the `spark-submit` CLI. At a high level the container provides:
* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application
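
To make the wrapper idea concrete, here is a minimal sketch of how a script like the one behind `/opt/program/submit` might turn environment variables into a `spark-submit` command. It is not the code in `./container`: the function name `build_spark_submit_command`, the YARN client-mode defaults, the example paths, and the interpretation of the `mode` and `main_class` variables are all assumptions for illustration.

```python
import shlex


def build_spark_submit_command(app_path, app_args, env):
    """Build a spark-submit argument list (hypothetical wrapper logic)."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "client"]
    # Assumption: mode == 'jar' means a JVM application that needs --class.
    if env.get("mode") == "jar":
        cmd += ["--class", env.get("main_class", "Main")]
    cmd.append(app_path)
    cmd += list(app_args)
    return cmd


cmd = build_spark_submit_command(
    "/opt/ml/processing/input/code/app.jar",  # hypothetical application path
    ["s3_input_data", "s3://my-bucket/in/"],  # hypothetical arguments
    env={"mode": "jar", "main_class": "Main"},
)
print(shlex.join(cmd))
```

A real wrapper would also have to wait for the Spark/YARN services to come up before submitting; that part is left out here.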


After the container build and push process is complete, use the Amazon SageMaker Python SDK to submit a managed, distributed Spark application that performs our dataset preprocessing.

Build the example Spark container.

In [None]:
!cat container/Dockerfile

In [None]:
docker_repo = 'amazon-reviews-spark-analyzer'
docker_tag = 'latest'

In [None]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image.

In [None]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

Log in to Amazon ECR, create the repository if it does not already exist, then tag and push the Docker image.

In [None]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

In [None]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo

In [None]:
!docker tag $docker_repo:$docker_tag $image_uri

In [None]:
!docker push $image_uri
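
The describe-or-create step can also be done from Python. The following is a minimal sketch using the boto3 ECR client methods `describe_repositories` and `create_repository`; the helper name `ensure_ecr_repository` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it is missing."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet, so create it.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
```

In this notebook it could be called as `ensure_ecr_repository(boto3.client('ecr'), docker_repo)`.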

### Run the preprocessing job using Amazon SageMaker Processing

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the Spark container that was just built, and a SparkML script for preprocessing in the job configuration.

Review the Spark preprocessing script.

In [None]:
!cat preprocess-deequ.py

In [None]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-analyzer',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            # instance_count needs to be > 1 or you will see the following error:
                            #   "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_count=2,
                            instance_type='ml.r5.8xlarge',
#                            max_runtime_in_seconds=600,
                            env={
                                'mode': 'jar',
                                'main_class': 'Main'
                            })

In [None]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

In [None]:
!aws s3 ls $s3_input_data

## Set Up Output Data

In [None]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-analyzer-{}'.format(timestamp_prefix)
processing_job_name = 'amazon-reviews-spark-analyzer-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

In [None]:
s3_output_analyze_data = 's3://{}/{}/analyze_data'.format(bucket, output_prefix)

print(s3_output_analyze_data)

## Start the Spark Processing Job

_Notes on Invoking from Lambda:_
* If we use the boto3 SDK instead (i.e., from a Lambda function), we need to copy the `preprocess.py` file to S3 ourselves and specify everything explicitly, including `--py-files`, etc.
* We would need to do the following before invoking the Lambda:
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/code/preprocess.py
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/py_files/preprocess.py
* Then reference the `s3://<location>` paths above in `--py-files`, etc.
* See the Lambda example code in this same project for more details.
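
For the boto3 path, the request to `create_processing_job` has to spell out what `ScriptProcessor` otherwise fills in. The sketch below only builds the request dictionary. The helper name `build_processing_request`, the local container paths, the volume size, and the way the script location is appended to the entrypoint are assumptions for illustration; the Lambda example code in this project remains the reference.

```python
def build_processing_request(job_name, image_uri, role_arn, code_s3_uri,
                             s3_input_data, s3_output_analyze_data, environment):
    """Build keyword arguments for boto3's SageMaker create_processing_job."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {
            "ImageUri": image_uri,
            # Assumption: the "code" input below is downloaded to this local
            # directory, so the script path is appended to the entrypoint.
            "ContainerEntrypoint": ["/opt/program/submit",
                                    "/opt/ml/processing/input/code/preprocess.py"],
            "ContainerArguments": ["s3_input_data", s3_input_data,
                                   "s3_output_analyze_data", s3_output_analyze_data],
        },
        "ProcessingInputs": [{
            "InputName": "code",
            "S3Input": {
                "S3Uri": code_s3_uri,
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        }],
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": 2,
                "InstanceType": "ml.r5.8xlarge",
                "VolumeSizeInGB": 30,  # assumed value
            },
        },
        "Environment": environment,
    }
```

The resulting dictionary would then be passed as `boto3.client('sagemaker').create_processing_job(**request)`.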

_Notes on not using ProcessingInput and Output:_
* Since Spark natively reads from and writes to S3 using `s3a://`, we can avoid the copy required by ProcessingInput and ProcessingOutput (FullyReplicated or ShardedByS3Key) and just specify the S3 input and output buckets/prefixes.
* See https://github.com/awslabs/amazon-sagemaker-examples/issues/994 for issues related to using /opt/ml/processing/input/ and output/
* If we use ProcessingInput, the data will be copied to each node (which we don't want in this case since Spark already handles this)
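
If the Spark script is handed `s3://` URIs like the ones built in this notebook, it may need to rewrite them to the `s3a://` scheme before reading. A minimal sketch, assuming a hypothetical helper named `to_s3a` and a placeholder bucket name:

```python
def to_s3a(uri):
    """Rewrite an s3:// URI to s3a://; leave any other URI untouched."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


print(to_s3a("s3://my-bucket/amazon-reviews-pds/tsv/"))
# prints: s3a://my-bucket/amazon-reviews-pds/tsv/
```

Inside the job, the rewritten URI would be handed to a Spark reader such as `spark.read.csv(...)`.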

In [None]:
from sagemaker.processing import ProcessingOutput

processor.run(code='preprocess-deequ.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_analyze_data', s3_output_analyze_data,
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status
              # See https://github.com/aws/sagemaker-python-sdk/issues/1341
              outputs=[
                  ProcessingOutput(s3_upload_mode='EndOfJob',
                                   output_name='dummy-output',
                                   source='/opt/ml/processing/output')
              ],
              logs=True,
              wait=False
)
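
The `arguments` list alternates names and values, so the script receives them as plain positional strings. One way the script side could turn them back into a mapping is sketched below; `parse_key_value_args` is a hypothetical helper and the bucket names are placeholders, so this is not necessarily what `preprocess-deequ.py` does.

```python
def parse_key_value_args(argv):
    """Turn ['name1', 'value1', 'name2', 'value2'] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError("expected name/value pairs, got {} arguments".format(len(argv)))
    return dict(zip(argv[0::2], argv[1::2]))


example = parse_key_value_args([
    "s3_input_data", "s3://my-bucket/in/",
    "s3_output_analyze_data", "s3://my-bucket/out/",
])
print(example["s3_input_data"])
```

In a real script the input would be `sys.argv[1:]` rather than a literal list.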

In [None]:
preprocessing_job_description = processor.jobs[-1].describe()
print(preprocessing_job_description)

In [None]:
processing_job_name = preprocessing_job_description['ProcessingJobName']

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))

In [None]:
processing_job_status = preprocessing_job_description['ProcessingJobStatus']
if (processing_job_status in ['Completed', 'Stopped']):
    # TODO:  Do something interesting...
    print('Complete')
else:
    print(processing_job_status)

## Please wait until the Processing Job Completes above
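
Rather than re-running the status cell by hand, the wait can be scripted. The sketch below polls a describe function until the job reaches a terminal state. `wait_for_processing_job` is a hypothetical helper, the polling interval and limit are arbitrary, and the describe callable is injected so the loop can be tried without AWS access.

```python
import time


def wait_for_processing_job(describe, poll_seconds=30, max_polls=120):
    """Poll describe() until ProcessingJobStatus is terminal; return the status."""
    terminal_states = {"Completed", "Failed", "Stopped"}
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in terminal_states:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("processing job did not finish after {} polls".format(max_polls))
```

In this notebook, `describe` could be `processor.jobs[-1].describe`.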


In [None]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
                                                                            sagemaker_session=sagemaker_session)
running_processor.describe()

In [None]:
from IPython.display import display, HTML

s3_job_output_prefix = output_prefix

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, s3_job_output_prefix, region)))

#### Inspect the processed dataset
Take a look at a few rows of the transformed dataset to make sure the preprocessing was successful.

In [None]:
!aws s3 ls $s3_output_analyze_data/

# Load the output CSV into a DataFrame and analyze
The analysis output is organized under the following prefixes:
* constraint-checks/
* constraint-suggestions/
* dataset-metrics/
* success-metrics/

In [None]:
!aws s3 cp --recursive $s3_output_analyze_data ./amazon-reviews-spark-analyzer/

In [None]:
import pandas as pd

# Change the filenames below to match the filenames above
TODO:  Programmatically retrieve the names (or just `read_csv('s3://...`)
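
One way to address the TODO is to discover the part files with `glob` instead of hard-coding them. A minimal sketch, assuming the local directory layout produced by the `aws s3 cp --recursive` cell and a hypothetical helper name `find_part_files`:

```python
import glob
import os


def find_part_files(base_dir, subdirs):
    """Map each output subdirectory to its sorted list of part-*.csv files."""
    return {
        subdir: sorted(glob.glob(os.path.join(base_dir, subdir, "part-*.csv")))
        for subdir in subdirs
    }


part_files = find_part_files(
    "./amazon-reviews-spark-analyzer",
    ["constraint-checks", "constraint-suggestions", "dataset-metrics", "success-metrics"],
)
for name, paths in part_files.items():
    print(name, paths)
```

Each discovered path can then be passed to `pd.read_csv(path, delimiter='\t', header=0)` in place of the literal filenames.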

In [None]:
pd.read_csv('./amazon-reviews-spark-analyzer/constraint-checks/part-00000-ba7d24dd-ec7d-4e06-9a12-9496fdbda252-c000.csv', delimiter='\t', header=0)

In [None]:
pd.read_csv('./amazon-reviews-spark-analyzer/constraint-suggestions/part-00000-21a096b1-e5dd-4190-aa05-0e93c9549d99-c000.csv', delimiter='\t', header=0)

In [None]:
pd.read_csv('./amazon-reviews-spark-analyzer/dataset-metrics/part-00000-9897ff58-4376-411d-a569-40de3d764da9-c000.csv', delimiter='\t', header=0)

In [None]:
pd.read_csv('./amazon-reviews-spark-analyzer/success-metrics/part-00000-c42aeccd-d35d-44ed-8fa3-72d25f05a374-c000.csv', delimiter='\t', header=0)