# Feature Transformation with an Amazon SageMaker Processing Job and Apache Spark

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Apache Spark are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Apache Spark in a managed SageMaker environment to run our processing workload.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Setup Input Data

In [2]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-west-2-393371431575/amazon-reviews-pds/tsv/


In [3]:
!aws s3 ls $s3_input_data

2020-07-25 17:13:26   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-07-25 17:13:29   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz


# Build a Spark Docker Image to Run the Processing Job

An example Spark container is included in the `./container` directory of this example. The container handles the bootstrapping of all Spark configuration, and serves as a wrapper around the `spark-submit` CLI. At a high level the container provides:
* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application
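
The container's own wrapper script is not reproduced in this notebook. As a rough illustration of what a wrapper around `spark-submit` does, the sketch below assembles a command line for a PySpark script. The helper name `build_spark_submit_command` and its defaults are assumptions made for this example, not the container's actual code.

```python
import shlex


def build_spark_submit_command(script_path, script_args, master="yarn", deploy_mode="client"):
    """Assemble a spark-submit command line (hypothetical helper, not the container's code)."""
    command = [
        "spark-submit",
        "--master", master,            # cluster manager that schedules the application
        "--deploy-mode", deploy_mode,  # "client" keeps the driver on the submitting node
        script_path,
    ]
    # Everything after the script path is handed to the application unchanged.
    command.extend(script_args)
    return command


example = build_spark_submit_command(
    "/opt/ml/processing/input/code/preprocess-spark-text-to-bert.py",
    ["s3_input_data", "s3://my-bucket/amazon-reviews-pds/tsv/"],  # placeholder bucket name
)
print(shlex.join(example))
```

Returning a list instead of a single string makes it easy to hand the command to `subprocess.run` without shell quoting problems.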


After the container build and push process is complete, use the Amazon SageMaker Python SDK to submit a managed, distributed Spark application that performs our dataset processing.

Build the example Spark container.

In [4]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [5]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Sending build context to Docker daemon  4.441MB
Step 1/37 : FROM openjdk:8-jre-slim
 ---> f2e91f81bf2c
Step 2/37 : RUN apt-get update
 ---> Using cache
 ---> a2e6bbe0e5e2
Step 3/37 : RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil
 ---> Using cache
 ---> d89d8588c27a
Step 4/37 : RUN pip3 install py4j psutil==5.6.5 numpy==1.17.4
 ---> Using cache
 ---> 9c4ed3a036d1
Step 5/37 : RUN apt-get clean
 ---> Using cache
 ---> 4c0b61a8e5a6
Step 6/37 : RUN rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> a29d440c3512
Step 7/37 : ENV PYTHONHASHSEED 0
 ---> Using cache
 ---> 44e45fa97bce
Step 8/37 : ENV PYTHONIOENCODING UTF-8
 ---> Using cache
 ---> 3a635a1b310f
Step 9/37 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Using cache
 ---> 00533a1ded45
Step 10/37 : ENV HADOOP_VERSION 3.2.1
 ---> Using cache
 ---> 56e303fe9daa
Step 11/37 : ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
 ---> Using cache
 ---> bd2df5877dee
Step 12/37 : ENV HADOOP

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image.

In [6]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

393371431575.dkr.ecr.us-west-2.amazonaws.com/amazon-reviews-spark-processor:latest
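
The image URI follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. A small sketch of building and validating such a URI is shown below. The helper names are illustrative assumptions, and the pattern only covers the standard `amazonaws.com` domain with an explicit tag.

```python
import re

# Matches URIs of the form <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:]+):(?P<tag>.+)$"
)


def build_image_uri(account_id, region, repository, tag="latest"):
    """Compose an ECR image URI from its parts (hypothetical helper)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


def parse_image_uri(image_uri):
    """Split a tagged ECR image URI into its parts, or raise ValueError."""
    match = ECR_URI_PATTERN.match(image_uri)
    if match is None:
        raise ValueError("Not a tagged ECR image URI: {}".format(image_uri))
    return match.groupdict()


uri = build_image_uri("393371431575", "us-west-2", "amazon-reviews-spark-processor")
print(parse_image_uri(uri))
```

Parsing the URI before pushing catches a missing region or account ID early, for example when `boto3` has no default region configured and `region` is `None`.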


### Create ECR repository and push docker image

In [7]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
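
The `aws ecr get-login` command is available only in version 1 of the AWS CLI; version 2 replaces it with `aws ecr get-login-password`. If neither fits your environment, the same credentials can be obtained through `boto3`. The sketch below is one possible approach: ECR returns a base64-encoded `user:password` token, which is decoded locally. The function names are assumptions for this example.

```python
import base64


def decode_ecr_authorization_token(authorization_token):
    """Decode the base64 'username:password' token returned by ECR."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def get_docker_credentials(ecr_client):
    """Fetch Docker login credentials from ECR.

    ecr_client is expected to be boto3.client('ecr'); calling this needs valid
    AWS credentials, so it is defined here but not invoked.
    """
    response = ecr_client.get_authorization_token()
    auth = response["authorizationData"][0]
    username, password = decode_ecr_authorization_token(auth["authorizationToken"])
    return username, password, auth["proxyEndpoint"]
```

The returned username, password, and registry endpoint can then be passed to `docker login`, ideally with `--password-stdin` so the password does not appear in the process list.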


### Ignore any `RepositoryNotFoundException` error; the repository is created immediately afterwards.

In [8]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo


An error occurred (RepositoryNotFoundException) when calling the DescribeRepositories operation: The repository with name 'amazon-reviews-spark-processor' does not exist in the registry with id '393371431575'
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:393371431575:repository/amazon-reviews-spark-processor",
        "registryId": "393371431575",
        "repositoryName": "amazon-reviews-spark-processor",
        "repositoryUri": "393371431575.dkr.ecr.us-west-2.amazonaws.com/amazon-reviews-spark-processor",
        "createdAt": 1595705481.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        }
    }
}
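
The same describe-or-create logic can be written in Python so that the expected `RepositoryNotFoundException` is handled instead of printed. This is a minimal sketch; `ensure_repository` is a name chosen for this example, and `ecr_client` is assumed to be `boto3.client('ecr')`.

```python
def ensure_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it does not exist."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # Expected on the first run: create the repository and return its URI.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]
```

In the notebook this would be called as `ensure_repository(boto3.client('ecr'), docker_repo)`. Running it twice is safe because the second call finds the existing repository.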


In [9]:
!docker tag $docker_repo:$docker_tag $image_uri

In [10]:
!docker push $image_uri

The push refers to repository [393371431575.dkr.ecr.us-west-2.amazonaws.com/amazon-reviews-spark-processor]
latest: digest: sha256:7d4b6f4465e20d0ba2299481bd029c690be7d972d2bdb84c96a6e6f16b86ae39 size: 4318


# Run the Job using Amazon SageMaker Processing Jobs

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the Spark container that was just built, and a Spark ML script for processing in the job configuration.

Review the Spark processing script.

In [11]:
!pygmentize preprocess-spark-text-to-bert.py

from __future__ import print_function
from __future__ import unicode_literals

import time
import sys
import os
import shutil
import csv
import collections
import subprocess
import sys
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pip', '--upgrade'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'wrapt', '--upgrade', '--ignore-installed'])
#subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0', '--ignore-installed'])
import tensorflow

    ])

    bert_transformer = udf(lambda text, label: convert_input(text, label), tfrecord_schema)

    spark.udf.register('bert_transformer', bert_transformer)

    transformed_df = features_df.select(bert_transformer('star_rating', 'review_body').alias('tfrecords'))
    transformed_df.show(truncate=False)

    flattened_df = transformed_df.select('tfrecords.*')
    flattened_df.show()

    # Split 90-5-5%
    train_df, validation_df, test_df = flattened_df.randomSplit([0.9, 0.05, 0.05])

    train_df.write.format('tfrecords').option('recordType

In [12]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=2, # instance_count needs to be > 1 or you will see the following error:  "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_type='ml.r5.xlarge',
                            env={'mode': 'python'})

# Setup Output Data

In [13]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

In [14]:
train_data_bert_output = 's3://{}/{}/output/bert-train'.format(bucket, output_prefix)
validation_data_bert_output = 's3://{}/{}/output/bert-validation'.format(bucket, output_prefix)
test_data_bert_output = 's3://{}/{}/output/bert-test'.format(bucket, output_prefix)

print(train_data_bert_output)
print(validation_data_bert_output)
print(test_data_bert_output)

s3://sagemaker-us-west-2-393371431575/amazon-reviews-spark-processor-2020-07-25-19-32-49/output/bert-train
s3://sagemaker-us-west-2-393371431575/amazon-reviews-spark-processor-2020-07-25-19-32-49/output/bert-validation
s3://sagemaker-us-west-2-393371431575/amazon-reviews-spark-processor-2020-07-25-19-32-49/output/bert-test


In [15]:
from sagemaker.processing import ProcessingOutput

processor.run(code='preprocess-spark-text-to-bert.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_train_data', train_data_bert_output,
                         's3_output_validation_data', validation_data_bert_output,
                         's3_output_test_data', test_data_bert_output,                         
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status                            
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-train',
                                        source='/opt/ml/processing/output/bert/train'),
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-validation',
                                        source='/opt/ml/processing/output/bert/validation'),
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='bert-test',
                                        source='/opt/ml/processing/output/bert/test'),
              ],          
              logs=True,
              wait=False
)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  spark-amazon-reviews-processor-2020-07-25-19-32-49-384
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/output/bert-validation', 'LocalPath': '/opt/ml/processing/output/bert/validation', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-test', 'S3Output': {'S3Uri
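
The `arguments` list passed to `processor.run()` alternates names and values without `--` prefixes. How the processing script reads them is not shown in this excerpt; one plausible way to turn such a flat list into a dictionary is sketched below. The function name `parse_flat_arguments` is an assumption for this example.

```python
def parse_flat_arguments(argv):
    """Turn ['name1', 'value1', 'name2', 'value2', ...] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError("Expected name/value pairs, got an odd number of arguments")
    # Even positions hold names, odd positions hold the matching values.
    return dict(zip(argv[0::2], argv[1::2]))


args = parse_flat_arguments([
    "s3_input_data", "s3://my-bucket/amazon-reviews-pds/tsv/",  # placeholder bucket name
    "s3_output_train_data", "s3://my-bucket/output/bert-train",
])
print(args["s3_input_data"])
```

Inside a script, the list would typically come from `sys.argv[1:]`. Rejecting an odd-length list surfaces a forgotten value immediately rather than silently shifting every pair.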

In [16]:
from IPython.display import display, HTML

spark_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, spark_processing_job_name)))


In [17]:
from IPython.display import display, HTML

# This differs from the job name because the Spark job does not use ProcessingOutputs for its results.
spark_processing_job_s3_output_prefix = output_prefix

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, spark_processing_job_s3_output_prefix, region)))


# List Processing Jobs through boto3 Python SDK

In [18]:
import boto3

client = boto3.client('sagemaker')
client.list_processing_jobs()

{'ProcessingJobSummaries': [{'ProcessingJobName': 'spark-amazon-reviews-processor-2020-07-25-19-32-49-384',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:processing-job/spark-amazon-reviews-processor-2020-07-25-19-32-49-384',
   'CreationTime': datetime.datetime(2020, 7, 25, 19, 32, 49, 851000, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 7, 25, 19, 32, 49, 851000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'InProgress'},
  {'ProcessingJobName': 'sagemaker-scikit-learn-2020-07-25-19-21-49-035',
   'ProcessingJobArn': 'arn:aws:sagemaker:us-west-2:393371431575:processing-job/sagemaker-scikit-learn-2020-07-25-19-21-49-035',
   'CreationTime': datetime.datetime(2020, 7, 25, 19, 21, 49, 485000, tzinfo=tzlocal()),
   'ProcessingEndTime': datetime.datetime(2020, 7, 25, 19, 26, 35, tzinfo=tzlocal()),
   'LastModifiedTime': datetime.datetime(2020, 7, 25, 19, 26, 35, 842000, tzinfo=tzlocal()),
   'ProcessingJobStatus': 'Completed'},
  {'ProcessingJobName
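
The unfiltered listing grows quickly. `list_processing_jobs` accepts filters such as `NameContains`, `SortBy`, `SortOrder`, and `MaxResults`, and the response can be reduced further in Python. The sketch below shows one way to do both; the helper names are assumptions for this example, and the default name filter mirrors the `base_job_name` used in this notebook.

```python
def list_recent_spark_jobs(sagemaker_client, name_contains="spark-amazon-reviews-processor"):
    """Ask SageMaker for the ten most recent matching jobs.

    sagemaker_client is expected to be boto3.client('sagemaker'); this needs
    AWS credentials, so it is defined here but not invoked.
    """
    return sagemaker_client.list_processing_jobs(
        NameContains=name_contains,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=10,
    )


def job_names_with_status(list_response, status):
    """Pick out the job names in a list_processing_jobs response with a given status."""
    return [
        summary["ProcessingJobName"]
        for summary in list_response.get("ProcessingJobSummaries", [])
        if summary.get("ProcessingJobStatus") == status
    ]
```

For example, `job_names_with_status(client.list_processing_jobs(), 'InProgress')` returns only the jobs that are still running.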

# Please Wait Until the Processing Job Completes
Re-run this next cell until the job status shows `Completed`.

In [19]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=spark_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)



InProgress


{'ProcessingInputs': [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/input/code/preprocess-spark-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3Uri': 's3://sagemaker-us-west-2-393371431575/spark-amazon-reviews-processor-2020-07-25-19-32-49-384/output/bert-validation', 'LocalPath': '/opt/ml/processing/output/bert/validation', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-test', 'S3Output': {'S3Uri': 's3://sa
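
Instead of re-running the cell by hand, the status check can be wrapped in a polling loop. This is a minimal sketch, assuming that `Completed`, `Failed`, and `Stopped` are the statuses at which polling should end; `wait_for_processing_job` is a name chosen for this example.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe_job, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call describe_job() until the job reaches a terminal status.

    describe_job is any zero-argument callable that returns a dict containing
    'ProcessingJobStatus', for example running_processor.describe.
    """
    for _ in range(max_polls):
        description = describe_job()
        if description["ProcessingJobStatus"] in TERMINAL_STATUSES:
            return description
        sleep(poll_seconds)
    raise TimeoutError("Processing job did not finish after {} polls".format(max_polls))
```

In this notebook the call would be `wait_for_processing_job(running_processor.describe)`. Unlike `running_processor.wait()`, this loop does not stream the job logs into the notebook output.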

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._

In [20]:
running_processor.wait()

2020-07-25 19:36:26,115 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = algo-1/10.0.237.72
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 3.2.1
STARTUP_MSG:   classpath = /usr/hadoop-3.2.1/etc/hadoop:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-compress-1.18.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/kerby-util-1.0.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/hadoop-auth-3.2.1.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/json-smart-2.3.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-lang3-3.7.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/usr/hadoop-3.2.1/share/hadoop/common/lib/commons-logging-1.1.3.jar:/usr/hadoop-3.2.1/share/ha

Starting nodemanagers
localhost: /usr/hadoop-3.2.1/bin/../libexec/hadoop-functions.sh: line 982: ssh: command not found
2020-07-25 19:36:39,337 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-07-25 19:36:40.256785: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-07-25 19:36:40.256873: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-07-25 19:36:40.256882: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing librari

2020-07-25 19:36:47,343 INFO yarn.Client: Submitting application application_1595705797766_0001 to ResourceManager
2020-07-25 19:36:47,562 INFO impl.YarnClientImpl: Submitted application application_1595705797766_0001
2020-07-25 19:36:47,565 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1595705797766_0001 and attemptId None
2020-07-25 19:36:48,583 INFO yarn.Client: Application report for application_1595705797766_0001 (state: ACCEPTED)
2020-07-25 19:36:48,586 INFO yarn.Client:
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1595705807449
	 final status: UNDEFINED
	 tracking URL: http://algo-1:8088/proxy/application_1595705797766_0001/
	 user

2020-07-25 19:37:18,565 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2536 ms on algo-2 (executor 1) (1/1)
2020-07-25 19:37:18,568 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
2020-07-25 19:37:18,572 INFO scheduler.DAGScheduler: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0) finished in 2.620 s
2020-07-25 19:37:18,575 INFO scheduler.DAGScheduler: Job 0 finished: showString at NativeMethodAccessorImpl.java:0, took 2.651232 s
+-----------+-----------+--------------+----------+--------------+--------------------+-------------------+-----------+-------------+-----------+----+-----------------+--------------------+--------------------+-----------+
|marketplace|customer_id|     review_id|product_id|product_parent|       product_title|   product_category|star_rating|helpful_votes|total_votes|vine|verified_purchase|     review_headline|         review_body|

2020-07-25 19:37:28,768 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 6.0 (TID 6) in 1942 ms on algo-2 (executor 1) (1/1)
2020-07-25 19:37:28,768 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
2020-07-25 19:37:28,769 INFO scheduler.DAGScheduler: ResultStage 6 (showString at NativeMethodAccessorImpl.java:0) finished in 1.954 s
2020-07-25 19:37:28,769 INFO scheduler.DAGScheduler: Job 6 finished: showString at NativeMethodAccessorImpl.java:0, took 1.957557 s
+--------------------+--------------------+--------------------+---------+
|           input_ids|          input_mask|         segment_ids|label_ids|
+--------------------+--------------------+--------------------+---------+
|[101, 1045, 2562,...|[1, 1, 1, 1, 1, 1...|[0, 0, 0, 0, 0, 0...|      [1]|
|[101, 12476, 102,...|[1, 1, 1, 0, 0, 0...|[0, 0, 0, 0, 0, 0...|      [4]|
|[101, 2065, 2017,...|

2020-07-25 19:46:44,233 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 7.0 (TID 8) in 555061 ms on algo-2 (executor 1) (1/2)
2020-07-25 19:50:15,538 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 7.0 (TID 7) in 766367 ms on algo-2 (executor 1) (2/2)
2020-07-25 19:50:15,539 INFO cluster.YarnScheduler: Removed TaskSet 7.0, whose tasks have all completed, from pool
2020-07-25 19:50:15,539 INFO scheduler.DAGScheduler: ResultStage 7 (runJob at SparkHadoopWriter.scala:78) finished in 766.402 s
2020-07-25 19:50:15,540 INFO scheduler.DAGScheduler: Job 7 finished: runJob at SparkHadoopWriter.scala:78, took 766.406970 s
2020-07-25 19:50:15,560 INFO io.SparkHadoopWriter: Job job_20200725193729_0038 committed.
Wrote to output file:  /opt/ml/processing/output/bert/train
2020-07-25 19:50:15,603 INFO datasources.FileSourceStrategy: Pruning directories with:
2020-07-25 19:50:15,604 INFO datasources.Fil

2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 244
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 284
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 217
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 249
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 224
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 250
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 269
2020-07-25 20:06:53,941 INFO spark.ContextCleaner: Cleaned accumulator 233
2020-07-25 20:06:53,943 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on 10.0.237.72:35827 in memory (size: 692.2 KB, free: 364.8 MB)
2020-07-25 20:06:53,946 INFO storage.BlockManagerInfo: Removed broadcast_14_piece0 on algo-2:33601 in memory (size: 692.2 KB, free: 11.9 GB)
2020-07-25 20:06:

2020-07-25 20:15:34
Finished Yarn configuration files setup.
Received end of job signal, exiting...
Finished Yarn configuration files setup.



# Inspect the Processed Output Dataset


## _The next cells will not work properly until the job completes above._


List the output files in S3 to confirm that the processing job wrote the train, validation, and test splits.

In [21]:
!aws s3 ls --recursive $train_data_bert_output/

In [22]:
!aws s3 ls --recursive $validation_data_bert_output/

In [23]:
!aws s3 ls --recursive $test_data_bert_output/
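
The same check can be done from Python, which is convenient when a later cell needs the list of output files. The sketch below splits an `s3://` URI into bucket and prefix and then pages through the matching objects. The helper names are assumptions for this example, and `s3_client` is assumed to be `boto3.client('s3')`.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")


def list_output_keys(s3_client, s3_uri):
    """Return every object key under the given S3 URI.

    This needs AWS credentials, so it is defined here but not invoked.
    """
    bucket_name, prefix = split_s3_uri(s3_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

An empty list from `list_output_keys(boto3.client('s3'), train_data_bert_output)` means the job has not written any training output yet.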

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();