# Distributed Data Processing using Apache Spark and SageMaker Processing



Apache Spark is a unified analytics engine for large-scale data processing. The Spark framework is often used within the context of machine learning workflows to run data transformation or feature engineering workloads at scale. Amazon SageMaker provides a set of prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs on Amazon SageMaker. This example notebook demonstrates how to use the prebuilt Spark images on SageMaker Processing using the SageMaker Core Python SDK.

## Runtime

This notebook takes approximately 22 minutes to run.

## Contents

1. [Setup](#Setup)
1. [Example 1: Running a basic PySpark application](#Example-1:-Running-a-basic-PySpark-application)
1. [Example 2: Specify additional Python and jar file dependencies](#Example-2:-Specify-additional-Python-and-jar-file-dependencies)
1. [Example 3: Run a Java/Scala Spark application](#Example-3:-Run-a-Java/Scala-Spark-application)
1. [Example 4: Specifying additional Spark configuration](#Example-4:-Specifying-additional-Spark-configuration)

## Setup

### Install the latest SageMaker Core Python SDK

In [None]:
!pip uninstall sagemaker-core -y
!pip install pip --upgrade --quiet
!pip install sagemaker-core --upgrade

*Restart your notebook kernel after upgrading the SDK*

## Example 1: Running a basic PySpark application

The first example is a basic Spark MLlib data processing script. The script takes a raw dataset and applies transformations such as string indexing and one-hot encoding.

### Set up S3 bucket locations and roles

First, set up some locations in the default SageMaker bucket to store the raw input datasets and the Spark job output. Here, you'll also define the role that will be used to run all SageMaker Processing jobs.

In [93]:
import logging
from time import gmtime, strftime
from sagemaker_core.helper.session_helper import get_execution_role, Session

sagemaker_logger = logging.getLogger("sagemaker")
sagemaker_logger.setLevel(logging.INFO)
sagemaker_logger.addHandler(logging.StreamHandler())

sagemaker_session = Session()
REGION_NAME = sagemaker_session._region_name
role = get_execution_role()
s3_bucket_name = sagemaker_session.default_bucket()

Next, you'll download the example dataset from a SageMaker staging bucket.

In [None]:
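# Fetch the dataset from the SageMaker sample-files bucket
import os

import boto3

# Create the local folders used by this notebook if they do not exist yet
os.makedirs("./data", exist_ok=True)
os.makedirs("./code", exist_ok=True)

s3 = boto3.client("s3")
s3.download_file(
    "sagemaker-sample-files", "datasets/tabular/uci_abalone/abalone.csv", "./data/abalone.csv"
)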

### Write the PySpark script

The source for the preprocessing script is in the cell below. The cell uses the `%%writefile` directive to save the file locally. The script does some basic feature engineering on a raw input dataset, the [Abalone Data Set](https://archive.ics.uci.edu/ml/datasets/abalone): it performs string indexing, one-hot encoding, and vector assembly, and combines these stages into a pipeline so that they run in order. The script then does an 80-20 split to produce training and validation datasets as output.

In [None]:
%%writefile ./code/preprocess.py
from __future__ import print_function
from __future__ import unicode_literals

import argparse
import csv
import os
import shutil
import sys
import time

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    OneHotEncoder,
    StringIndexer,
    VectorAssembler,
    VectorIndexer,
)
from pyspark.sql.functions import *
from pyspark.sql.types import (
    DoubleType,
    StringType,
    StructField,
    StructType,
)


def csv_line(data):
    r = ",".join(str(d) for d in data[1])
    return str(data[0]) + "," + r


def main():
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--s3_input_bucket", type=str, help="s3 input bucket")
    parser.add_argument("--s3_input_key_prefix", type=str, help="s3 input key prefix")
    parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket")
    parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix")
    args = parser.parse_args()

    spark = SparkSession.builder.appName("PySparkApp").getOrCreate()

    # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
    )

    # Defining the schema corresponding to the input data. The input data does not contain the headers
    schema = StructType(
        [
            StructField("sex", StringType(), True),
            StructField("length", DoubleType(), True),
            StructField("diameter", DoubleType(), True),
            StructField("height", DoubleType(), True),
            StructField("whole_weight", DoubleType(), True),
            StructField("shucked_weight", DoubleType(), True),
            StructField("viscera_weight", DoubleType(), True),
            StructField("shell_weight", DoubleType(), True),
            StructField("rings", DoubleType(), True),
        ]
    )

    # Downloading the data from S3 into a Dataframe
    total_df = spark.read.csv(
        ("s3://" + os.path.join(args.s3_input_bucket, args.s3_input_key_prefix, "abalone.csv")),
        header=False,
        schema=schema,
    )

    # StringIndexer on the sex column which has categorical value
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")

    # one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")

    # vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format
    assembler = VectorAssembler(
        inputCols=[
            "sex_vec",
            "length",
            "diameter",
            "height",
            "whole_weight",
            "shucked_weight",
            "viscera_weight",
            "shell_weight",
        ],
        outputCol="features",
    )

    # The pipeline is comprised of the steps added above
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])

    # This step trains the feature transformers
    model = pipeline.fit(total_df)

    # This step transforms the dataset with information obtained from the previous fit
    transformed_total_df = model.transform(total_df)

    # Split the overall dataset into 80-20 training and validation
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])

    # Convert the train dataframe to RDD to save in CSV format and upload to S3
    train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    train_lines = train_rdd.map(csv_line)
    train_lines.saveAsTextFile(
        "s3://" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, "train")
    )

    # Convert the validation dataframe to RDD to save in CSV format and upload to S3
    validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    validation_lines = validation_rdd.map(csv_line)
    validation_lines.saveAsTextFile(
        "s3://" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, "validation")
    )


if __name__ == "__main__":
    main()
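
To make the indexing and encoding stages of the script above concrete, here is a minimal plain-Python sketch of what `StringIndexer` and `OneHotEncoder` compute with their default settings: labels are indexed by descending frequency, and the last category is dropped from the one-hot vector. The helper names `fit_string_index` and `one_hot` and the sample column are illustrative only and are not part of the script; Spark stores the encoded result as a vector rather than a Python list.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == int(index) else 0.0 for i in range(size)]


# Illustrative sample of the "sex" column, not the real dataset
sex_column = ["M", "M", "M", "I", "I", "F"]

index = fit_string_index(sex_column)
encoded = {label: one_hot(idx, len(index)) for label, idx in index.items()}

print(index)
print(encoded)
```

With three categories the encoded vector has two slots, so `sex_vec` contributes two values to the `features` vector and the seven numeric columns contribute the rest.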

### Run the SageMaker Processing Job

Next, you'll use the `ProcessingJob` class to define a Spark job and run it using SageMaker Processing. A few things to note in the definition of the `ProcessingJob`:

* This is a multi-node job with two `ml.m5.xlarge` instances, specified via the `instance_count` and `instance_type` parameters of `ProcessingClusterConfig`.
* The Spark framework version 3.1 image is specified via the `image_uri` parameter of `AppSpecification`.
* The PySpark script defined above is passed in via the `ProcessingInput` class.
* Command-line arguments to the PySpark script (such as the S3 input and output locations) are passed via the `container_arguments` parameter.
* Spark event logs are offloaded to the `spark_event_logs` folder in S3.
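
As a rough illustration of how the command-line arguments reach the script: the argument list is appended to the container entrypoint, so `preprocess.py` sees the values as ordinary flags. The sketch below parses such a list locally with the same four `argparse` options that the script declares. The bucket name and prefixes are placeholders, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser():
    """Declare the same four options that preprocess.py expects."""
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    for name in (
        "s3_input_bucket",
        "s3_input_key_prefix",
        "s3_output_bucket",
        "s3_output_key_prefix",
    ):
        parser.add_argument(f"--{name}", type=str)
    return parser


# Placeholder values standing in for the real bucket and prefixes
container_arguments = [
    "--s3_input_bucket", "example-bucket",
    "--s3_input_key_prefix", "demo/input/raw/abalone",
    "--s3_output_bucket", "example-bucket",
    "--s3_output_key_prefix", "demo/input/preprocessed/abalone",
]

args = build_parser().parse_args(container_arguments)
input_uri = f"s3://{args.s3_input_bucket}/{args.s3_input_key_prefix}/abalone.csv"
print(input_uri)
```

Each flag and its value are separate list items, which is why the job definition alternates option names and values.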


In [None]:
from sagemaker_core.shapes import (
    AppSpecification,
    ProcessingClusterConfig,
    ProcessingInput,
    ProcessingOutput,
    ProcessingOutputConfig,
    ProcessingResources,
    ProcessingS3Input,
    ProcessingS3Output,
)
from sagemaker_core.resources import ProcessingJob

# Upload the raw input dataset to a unique S3 location
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = "sagemaker/spark-preprocess-demo/{}".format(timestamp_prefix)
input_prefix_abalone = "{}/input/raw/abalone".format(prefix)
input_preprocessed_prefix_abalone = "{}/input/preprocessed/abalone".format(prefix)

base_job_name = "sm-spark"
final_job_name = base_job_name + "-"+ timestamp_prefix

input_script_prefix = "{}/input/code".format(final_job_name)

# Upload the required data to S3
# Upload abalone.csv to the S3 bucket
sagemaker_session.upload_data(
    path="./data/abalone.csv", bucket=s3_bucket_name, key_prefix=input_prefix_abalone
)

# uploading preprocess.py to S3 bucket
sagemaker_session.upload_data(
    path="./code/preprocess.py", bucket=s3_bucket_name, key_prefix=input_script_prefix
)

# initializing ProcessingInputs,ProcessingResources,ProcessingOutputConfig and AppSpecification configurations
processing_input = ProcessingInput(input_name="code",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/code/preprocess.py",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/code",
                                        s3_input_mode="File"
                                        ))

processing_output_config = ProcessingOutputConfig(outputs= [ProcessingOutput(output_name = "output-1",s3_output=ProcessingS3Output(
    s3_uri=f"s3://{s3_bucket_name}/{prefix}/spark_event_logs",
    local_path="/opt/ml/processing/spark-events/", s3_upload_mode="Continuous"))])

processing_resources = ProcessingResources(cluster_config=ProcessingClusterConfig
                                           (instance_count=2,instance_type="ml.m5.xlarge",volume_size_in_gb=30))

app_specification = AppSpecification(image_uri = "173754725891.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark-processing:3.1-cpu",
                                    container_entrypoint = ["smspark-submit",
                                                            "--local-spark-event-logs-dir",
                                                            "/opt/ml/processing/spark-events/",
                                                            "/opt/ml/processing/input/code/preprocess.py"],
                                    container_arguments = ["--s3_input_bucket",
                                                           f"{s3_bucket_name}",
                                                           "--s3_input_key_prefix",
                                                           f"{input_prefix_abalone}",
                                                           "--s3_output_bucket",
                                                           f"{s3_bucket_name}",
                                                           "--s3_output_key_prefix",
                                                           f"{input_preprocessed_prefix_abalone}"])

# Run the processing job
processing_job_obj = ProcessingJob.create(processing_job_name = final_job_name,
                            processing_resources=processing_resources,
                            app_specification=app_specification,
                            role_arn=role,
                            processing_inputs=[processing_input],
                            processing_output_config=processing_output_config)

processing_job_obj.wait()
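
The image URI in the cell above is hard-coded for `us-east-1`, so it needs to change if you run the notebook in another Region. The helper below is only a sketch of the usual ECR URI layout. `spark_image_uri` is a hypothetical name, the registry account ID differs per Region and must be looked up in the SageMaker documentation, and the `amazonaws.com` suffix assumed here does not apply to every AWS partition.

```python
def spark_image_uri(account_id, region, tag="3.1-cpu"):
    """Assemble an ECR image URI for the sagemaker-spark-processing repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-spark-processing:{tag}"


# The account ID below is the one this notebook uses for us-east-1;
# do not assume it is valid for other Regions.
image_uri = spark_image_uri("173754725891", "us-east-1")
print(image_uri)
```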

### Validate Data Processing Results

Next, validate the output of our data preprocessing job by looking at the first 5 rows of the output dataset.

In [None]:
print("Top 5 rows from s3://{}/{}/train/".format(s3_bucket_name, input_preprocessed_prefix_abalone))
!aws s3 cp --quiet s3://$s3_bucket_name/$input_preprocessed_prefix_abalone/train/part-00000 - | head -n5
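
Each output line is produced by the `csv_line` function in `preprocess.py`: the label (`rings`) comes first, followed by the assembled feature values. The sketch below runs the same function locally. A plain Python list stands in for the Spark feature vector, and the row values are illustrative, so the exact formatting of real output may differ.

```python
def csv_line(data):
    # Same logic as in preprocess.py: label first, then the feature values
    r = ",".join(str(d) for d in data[1])
    return str(data[0]) + "," + r


# (label, features): a plain list is assumed here in place of a Spark vector
row = (15.0, [1.0, 0.0, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15])

line = csv_line(row)
fields = line.split(",")
print(line)
print(len(fields), "fields: 1 label + 9 features")
```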

## Example 2: Specify additional Python and jar file dependencies

The next example demonstrates a scenario where additional Python file dependencies are required by the PySpark script. You'll use a sample PySpark script that requires additional user-defined functions (UDFs) defined in a local module.

In [None]:
%%writefile ./code/hello_py_spark_app.py
import argparse
import time

# Import local module to test spark-submit --py-files dependencies
import hello_py_spark_udfs as udfs
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

if __name__ == "__main__":
    print("Hello World, this is PySpark!")

    parser = argparse.ArgumentParser(description="inputs and outputs")
    parser.add_argument("--input", type=str, help="path to input data")
    parser.add_argument("--output", required=False, type=str, help="path to output data")
    args = parser.parse_args()
    spark = SparkSession.builder.appName("SparkTestApp").getOrCreate()
    sqlContext = SQLContext(spark.sparkContext)

    # Load test data set
    inputPath = args.input
    outputPath = args.output
    salesDF = spark.read.json(inputPath)
    salesDF.printSchema()

    salesDF.createOrReplaceTempView("sales")

    # Define a UDF that doubles an integer column
    # The UDF function is imported from a local module to test spark-submit --py-files dependencies
    double_udf_int = udf(udfs.double_x, IntegerType())

    # Save transformed data set to disk
    salesDF.select("date", "sale", double_udf_int("sale").alias("sale_double")).write.json(
        outputPath
    )

Next, create the `hello_py_spark_udfs.py` module inside the `code` folder. It defines the function that the application wraps as a UDF.

In [None]:
%%writefile ./code/hello_py_spark_udfs.py
def double_x(x):
    return x + x
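
To see what the application does with this function, the sketch below applies `double_x` to a couple of records in plain Python, mirroring the `select("date", "sale", ... "sale_double")` step without Spark. The records are hypothetical; only the `date` and `sale` field names are taken from the script.

```python
import json


def double_x(x):
    return x + x


# Hypothetical input records with the fields the script selects
records = [
    {"date": "2024-01-01", "sale": 3},
    {"date": "2024-01-02", "sale": 10},
]

# Add the doubled column, as the Spark select does with the UDF
transformed = [
    {"date": r["date"], "sale": r["sale"], "sale_double": double_x(r["sale"])}
    for r in records
]

for row in transformed:
    print(json.dumps(row))
```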

### Create a processing job with Python file dependencies

Then, you'll create a processing job where the additional Python file dependencies are specified via the `py-files` input name in the `ProcessingInput` class.

In [None]:
from sagemaker_core.shapes import (
    AppSpecification,
    ProcessingClusterConfig,
    ProcessingInput,
    ProcessingOutput,
    ProcessingOutputConfig,
    ProcessingResources,
    ProcessingS3Input,
    ProcessingS3Output,
)
from sagemaker_core.resources import ProcessingJob

# Upload the raw input dataset to a unique S3 location
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = "sagemaker/spark-preprocess-demo/{}".format(timestamp_prefix)
input_prefix_sales = "{}/input/sales".format(prefix)
output_prefix_sales = "{}/output/sales".format(prefix)
input_s3_uri = "s3://{}/{}".format(s3_bucket_name, input_prefix_sales)
output_s3_uri = "s3://{}/{}".format(s3_bucket_name, output_prefix_sales)

base_job_name = "sm-spark-udfs"
final_job_name = base_job_name + "-"+ timestamp_prefix

input_script_prefix = "{}/input/code".format(final_job_name)
input_pyfiles_prefix = "{}/input/py-files".format(final_job_name)

# uploading required data to S3 for reference
# uploading data.jsonl to S3
sagemaker_session.upload_data(
    path="./data/data.jsonl", bucket=s3_bucket_name, key_prefix=input_prefix_sales
)

# uploading hello_py_spark_app.py to S3
sagemaker_session.upload_data(
    path="./code/hello_py_spark_app.py", bucket=s3_bucket_name, key_prefix=input_script_prefix
)

# uploading hello_py_spark_udfs.py to S3
sagemaker_session.upload_data(
    path="./code/hello_py_spark_udfs.py", bucket=s3_bucket_name, key_prefix=input_pyfiles_prefix
)

# initializing ProcessingInputs,ProcessingResources and AppSpecification configurations
# providing processing script
processing_input_code = ProcessingInput(input_name="code",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/code/hello_py_spark_app.py",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/code",
                                        s3_input_mode="File"
                                        ))

#providing py files
processing_input_pyfiles = ProcessingInput(input_name="py-files",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/py-files",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/py-files",
                                        s3_input_mode="File"
                                        ))

# providing processing resources
processing_resources = ProcessingResources(cluster_config=ProcessingClusterConfig
                                           (instance_count=2,instance_type="ml.m5.xlarge",volume_size_in_gb=30))

# providing app specification
app_specification = AppSpecification(image_uri = "173754725891.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark-processing:3.1-cpu",
                                    container_entrypoint = ["smspark-submit",
                                                            "--py-files",
                                                            "/opt/ml/processing/input/py-files",
                                                            "/opt/ml/processing/input/code/hello_py_spark_app.py"],
                                    container_arguments = ["--input",
                                                           f"s3://{s3_bucket_name}/{input_prefix_sales}",
                                                           "--output",
                                                           f"s3://{s3_bucket_name}/{output_prefix_sales}",
                                                           ])

# Run the processing job
processing_job_obj = ProcessingJob.create(processing_job_name = final_job_name,
                            processing_resources=processing_resources,
                            app_specification=app_specification,
                            role_arn=role,
                            processing_inputs=[processing_input_code,processing_input_pyfiles])

processing_job_obj.wait()

### Validate Data Processing Results

Next, validate the output of the Spark job by checking that the output URI contains the Spark `_SUCCESS` file along with the output JSON Lines files.

In [None]:
print("Output files in {}".format(output_s3_uri))
!aws s3 ls $output_s3_uri/
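
The same check can be expressed in code. The sketch below takes a list of object keys (for example, collected from an S3 listing) and reports whether it contains both the `_SUCCESS` marker and at least one `part-` file. `spark_output_complete` is a hypothetical helper, and the keys shown are made-up examples.

```python
import posixpath


def spark_output_complete(keys):
    """Return True if the keys include a _SUCCESS marker and at least one part file."""
    names = [posixpath.basename(key) for key in keys]
    has_success = "_SUCCESS" in names
    has_parts = any(name.startswith("part-") for name in names)
    return has_success and has_parts


# Made-up example keys under an output prefix
complete_listing = ["demo/output/sales/_SUCCESS", "demo/output/sales/part-00000-example.json"]
partial_listing = ["demo/output/sales/part-00000-example.json"]

print(spark_output_complete(complete_listing))
print(spark_output_complete(partial_listing))
```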

## Example 3: Run a Java/Scala Spark application

In the next example, you'll take a Spark application jar (located in `./code/spark-test-app.jar`) that is already built and run it using SageMaker Processing.

In [None]:
from sagemaker_core.shapes import (
    AppSpecification,
    ProcessingClusterConfig,
    ProcessingInput,
    ProcessingOutput,
    ProcessingOutputConfig,
    ProcessingResources,
    ProcessingS3Input,
    ProcessingS3Output,
)
from sagemaker_core.resources import ProcessingJob

# Upload the raw input dataset to S3
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = "sagemaker/spark-preprocess-demo/{}".format(timestamp_prefix)
input_prefix_sales = "{}/input/sales".format(prefix)
output_prefix_sales = "{}/output/sales".format(prefix)
input_s3_uri = "s3://{}/{}".format(s3_bucket_name, input_prefix_sales)
output_s3_uri = "s3://{}/{}".format(s3_bucket_name, output_prefix_sales)


base_job_name = "sm-spark-java"
final_job_name = base_job_name + "-"+ timestamp_prefix

input_script_prefix = "{}/input/code".format(final_job_name)

# uploading required data to S3 for reference
sagemaker_session.upload_data(
    path="./data/data.jsonl", bucket=s3_bucket_name, key_prefix=input_prefix_sales
)

sagemaker_session.upload_data(
    path="./code/spark-test-app.jar", bucket=s3_bucket_name, key_prefix=input_script_prefix
)

# initializing ProcessingInputs,ProcessingResources and AppSpecification configurations
processing_input_code = ProcessingInput(input_name="code",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/code/spark-test-app.jar",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/code",
                                        s3_input_mode="File"
                                        ))

processing_resources = ProcessingResources(cluster_config=ProcessingClusterConfig
                                           (instance_count=2,instance_type="ml.m5.xlarge",volume_size_in_gb=30))


app_specification = AppSpecification(image_uri = "173754725891.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark-processing:3.1-cpu",
                                    container_entrypoint = ["smspark-submit",
                                                            "--class",
                                                            "com.amazonaws.sagemaker.spark.test.HelloJavaSparkApp",
                                                            "/opt/ml/processing/input/code/spark-test-app.jar"],
                                    container_arguments = ["--input",
                                                           f"s3://{s3_bucket_name}/{input_prefix_sales}",
                                                           "--output",
                                                           f"s3://{s3_bucket_name}/{output_prefix_sales}",
                                                           ])

# Run the processing job
processing_job_obj = ProcessingJob.create(processing_job_name = final_job_name,
                            processing_resources=processing_resources,
                            app_specification=app_specification,
                            role_arn=role,
                            processing_inputs=[processing_input_code])

# waiting for the processing job to be completed
processing_job_obj.wait()
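
The first three examples differ mainly in the `smspark-submit` entrypoint they pass to the container. As a sketch, the hypothetical helper below assembles that list from the same options used in this notebook (`--class`, `--py-files`, and `--local-spark-event-logs-dir`); it does not cover any other flags.

```python
def build_entrypoint(app_path, main_class=None, py_files_dir=None, event_logs_dir=None):
    """Build a container entrypoint list for smspark-submit from optional settings."""
    command = ["smspark-submit"]
    if main_class:
        command += ["--class", main_class]
    if py_files_dir:
        command += ["--py-files", py_files_dir]
    if event_logs_dir:
        command += ["--local-spark-event-logs-dir", event_logs_dir]
    # The application (script or jar) always comes last
    command.append(app_path)
    return command


java_entrypoint = build_entrypoint(
    "/opt/ml/processing/input/code/spark-test-app.jar",
    main_class="com.amazonaws.sagemaker.spark.test.HelloJavaSparkApp",
)
print(java_entrypoint)
```

The result can be passed as `container_entrypoint` when constructing `AppSpecification`.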

## Example 4: Specifying additional Spark configuration

Overriding Spark configuration is crucial for a number of tasks, such as tuning your Spark application or configuring the Hive metastore. With SageMaker Core, you override Spark, Hive, or Hadoop configuration by supplying a `configuration.json` file to the job as a `ProcessingInput` named `conf`, mounted at `/opt/ml/processing/input/conf`.

The next example demonstrates this by overriding Spark executor memory and cores.

For more information on configuring your Spark application, see the EMR documentation on [Configuring Applications](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html).

### Create a `configuration.json` file to override Spark executor memory and cores

In [None]:
%%writefile ./code/configuration.json
[{"Classification": "spark-defaults", "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"}}]
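
If you prefer to generate the file from Python rather than write it by hand, the sketch below builds the same classification list and serializes it with `json`. `spark_defaults` is a hypothetical helper; it only covers the `spark-defaults` classification shown above and converts every property value to a string, as in the handwritten file.

```python
import json


def spark_defaults(properties):
    """Wrap Spark properties in a single spark-defaults classification entry."""
    return [
        {
            "Classification": "spark-defaults",
            "Properties": {key: str(value) for key, value in properties.items()},
        }
    ]


configuration = spark_defaults({"spark.executor.memory": "2g", "spark.executor.cores": 1})
configuration_text = json.dumps(configuration)
print(configuration_text)
```

Writing `configuration_text` to `./code/configuration.json` would produce a file equivalent to the one above.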

In [None]:
from sagemaker_core.shapes import (
    AppSpecification,
    ProcessingClusterConfig,
    ProcessingInput,
    ProcessingOutput,
    ProcessingOutputConfig,
    ProcessingResources,
    ProcessingS3Input,
    ProcessingS3Output,
)
from sagemaker_core.resources import ProcessingJob

# Upload the raw input dataset to a unique S3 location
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = "sagemaker/spark-preprocess-demo/{}".format(timestamp_prefix)
input_prefix_abalone = "{}/input/raw/abalone".format(prefix)
input_preprocessed_prefix_abalone = "{}/input/preprocessed/abalone".format(prefix)

#base job name
base_job_name = "sm-spark"
final_job_name = base_job_name + "-"+ timestamp_prefix

input_script_prefix = "{}/input/code".format(final_job_name)
input_conf_prefix = "{}/input/conf".format(final_job_name)

# uploading required data to S3 for reference
sagemaker_session.upload_data(
    path="./data/abalone.csv", bucket=s3_bucket_name, key_prefix=input_prefix_abalone
)

sagemaker_session.upload_data(
    path="./code/preprocess.py", bucket=s3_bucket_name, key_prefix=input_script_prefix
)

sagemaker_session.upload_data(
    path="./code/configuration.json", bucket=s3_bucket_name, key_prefix=input_conf_prefix
)

# initializing ProcessingInputs,ProcessingResources and AppSpecification configurations
processing_input_code = ProcessingInput(input_name="code",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/code/preprocess.py",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/code",
                                        s3_input_mode="File"
                                        ))

processing_input_conf = ProcessingInput(input_name="conf",s3_input = ProcessingS3Input(
                                        s3_uri = f"s3://{s3_bucket_name}/{final_job_name}/input/conf/configuration.json",
                                        s3_data_type="S3Prefix",
                                        local_path = "/opt/ml/processing/input/conf",
                                        s3_input_mode="File"
                                        ))


processing_resources = ProcessingResources(cluster_config=ProcessingClusterConfig
                                           (instance_count=2,instance_type="ml.m5.xlarge",volume_size_in_gb=30))

app_specification = AppSpecification(image_uri = "173754725891.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark-processing:3.1-cpu",
                                    container_entrypoint = ["smspark-submit",
                                                            "/opt/ml/processing/input/code/preprocess.py"],
                                    container_arguments = ["--s3_input_bucket",
                                                           s3_bucket_name,
                                                           "--s3_input_key_prefix",
                                                           input_prefix_abalone,
                                                           "--s3_output_bucket",
                                                           s3_bucket_name,
                                                           "--s3_output_key_prefix",
                                                           input_preprocessed_prefix_abalone])

# Run the processing job
processing_job_obj = ProcessingJob.create(processing_job_name = final_job_name,
                            processing_resources=processing_resources,
                            app_specification=app_specification,
                            role_arn=role,
                            processing_inputs=[processing_input_code,processing_input_conf])

# waiting for the processing job to be completed
processing_job_obj.wait()
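
Every example in this notebook derives its job name from a base name plus a UTC timestamp. The sketch below wraps that pattern in a hypothetical helper, `make_job_name`, and checks the result against the naming rules assumed here for processing jobs (letters, digits, and hyphens, starting with a letter or digit, at most 63 characters). Treat the pattern as an assumption and confirm it against the SageMaker API reference.

```python
import re
from time import gmtime, strftime

# Assumed naming rule for processing job names; verify against the API reference
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(base_job_name, when=None):
    """Return '<base>-<UTC timestamp>' and raise ValueError if it breaks the assumed rules."""
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", when or gmtime())
    name = f"{base_job_name}-{timestamp}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid processing job name: {name}")
    return name


# A fixed time is used here so the output is reproducible
example_name = make_job_name("sm-spark", gmtime(0))
print(example_name)
```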