# Feature Transformation with Amazon SageMaker and Spark

With large datasets, single-machine preprocessing techniques often become too slow. In this notebook, we show how to use Spark, a distributed processing framework, to speed up data preparation on such datasets. We run the workload as an Amazon SageMaker Processing job and use the Spark library inside it to perform the data preparation tasks.

**Note:** We need to build the Spark container ourselves, which is why the `container` directory is needed. Docker must also be installed to build the image and push it to ECR.

## Setup

In [1]:
# Import general modules
import boto3
import sagemaker
from time import gmtime, strftime

In [2]:
# Set global variables and initialize the sessions used throughout the notebook

region = "eu-west-2" # Replace with your region
sagemaker_session = sagemaker.session.Session()
default_bucket = sagemaker_session.default_bucket() # Replace if you have another bucket in mind
prefix_bucket = "sagemaker-training-preprocessing" # Use it in case you want to put your artifacts inside another directory

# Get the execution role ARN used to perform operations in the cloud
try:
    # get_execution_role() only works within SageMaker Studio or a notebook instance
    role_arn = sagemaker.get_execution_role()
except ValueError:
    # Otherwise, look up the role ARN by role name using an IAM client
    iam = boto3.client('iam')
    role_arn = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230204T144648')['Role']['Arn']  # Replace with your role name
    print("Role ARN successfully extracted")

Couldn't call 'get_role' to get Role ARN from role name francisco-learning to get Role path.


Role ARN successfully extracted


In [3]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

prefix = f"{prefix_bucket}/spark-preprocess-demo/{timestamp_prefix}"
input_prefix = prefix + "/input/raw/abalone"
input_preprocessed_prefix = prefix + "/input/preprocessed/abalone"
model_prefix = prefix + "/model"

## Download data

In [4]:
# Fetch the dataset from the SageMaker bucket and download it locally
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone.csv .

# Uploading the training data to S3 (default bucket)
sagemaker_session.upload_data(path="abalone.csv", bucket=default_bucket, key_prefix=input_prefix)

Completed 187.4 KiB/187.4 KiB (254.0 KiB/s) with 1 file(s) remaining
download: s3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone.csv to .\abalone.csv


's3://sagemaker-eu-west-2-247231311879/sagemaker-training-preprocessing/spark-preprocess-demo/2023-02-07-20-37-43/input/raw/abalone/abalone.csv'
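Judging by the output above, the URI returned by `upload_data` follows the pattern `s3://<bucket>/<key_prefix>/<file name>`, which is how the prefixes defined earlier map onto the bucket layout. The following is a minimal sketch of that composition. `build_s3_uri` is a hypothetical helper written for illustration, not part of the SageMaker SDK, and the bucket name and timestamp are copied from the output above.

```python
import posixpath


def build_s3_uri(bucket: str, key_prefix: str, filename: str) -> str:
    """Hypothetical helper: compose an S3 URI with the layout shown in the output above."""
    # posixpath.join always uses "/" as the separator, regardless of the
    # operating system the notebook runs on.
    key = posixpath.join(key_prefix, filename)
    return f"s3://{bucket}/{key}"


# Values taken from this notebook run
prefix = "sagemaker-training-preprocessing/spark-preprocess-demo/2023-02-07-20-37-43"
input_prefix = prefix + "/input/raw/abalone"

uri = build_s3_uri("sagemaker-eu-west-2-247231311879", input_prefix, "abalone.csv")
print(uri)
```

Because the timestamp is part of the prefix, each run of the notebook writes its inputs and outputs under a separate S3 path.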

## Build docker container

We need to build a docker container with the Dockerfile inside the `container` directory. This will help us set up Spark and the master/worker nodes correctly.

For this, do the following before running the next code cell.

1. `cd` into the `container` directory.
2. Run `docker build -t sagemaker-spark-example .` Note: make sure Docker is installed correctly.

In [5]:
# Define the variables used to create the ECR repository and push the image.

account_id = boto3.client("sts").get_caller_identity().get("Account") # Account number for AWS
region = boto3.session.Session().region_name # Region name

For pushing the image to ECR, we will do the following:
1. Create ECR repository, using the following command: 

    `aws ecr create-repository --repository-name sagemaker-spark-example`
2. Log in to ECR to be able to push the image, using the following command: 

    `aws ecr get-login-password --region {REPLACE_WITH_REGION} | docker login --username AWS --password-stdin {REPLACE_WITH_ACCOUNT_ID}.dkr.ecr.{REPLACE_WITH_REGION}.amazonaws.com`
3. Build image to be pushed, using the following command: 

    `docker build -t sagemaker-spark-example .` 
    
    Note: make sure you are inside the `container` directory.
4. Tag the image with `latest` so it can be pushed to ECR, using the following command: 

    `docker tag sagemaker-spark-example:latest {REPLACE_WITH_ACCOUNT_ID}.dkr.ecr.{REPLACE_WITH_REGION}.amazonaws.com/sagemaker-spark-example:latest`
5. Push image to ECR repository, using the following command: 

    `docker push {REPLACE_WITH_ACCOUNT_ID}.dkr.ecr.{REPLACE_WITH_REGION}.amazonaws.com/sagemaker-spark-example:latest`
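The commands above all share the same registry address, so filling in the placeholders by hand is error-prone. As a sketch, the strings can be generated from the account ID and region instead. `ecr_image_uri` and `ecr_push_commands` are hypothetical helpers written for this illustration, and `123456789012` is a placeholder account ID. The code only builds the command strings; it does not run them.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Hypothetical helper: build the image URI used in the tag and push steps."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def ecr_push_commands(account_id: str, region: str, repository: str, tag: str = "latest") -> list:
    """Hypothetical helper: return the five shell commands listed above, with placeholders filled in."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image_uri = ecr_image_uri(account_id, region, repository, tag)
    return [
        f"aws ecr create-repository --repository-name {repository}",
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {repository} .",
        f"docker tag {repository}:{tag} {image_uri}",
        f"docker push {image_uri}",
    ]


# Placeholder account ID and the region used in this notebook
commands = ecr_push_commands("123456789012", "eu-west-2", "sagemaker-spark-example")
for command in commands:
    print(command)
```

In the notebook, the `account_id` and `region` variables defined in the previous cell could be passed in place of the literal values.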

In [10]:
spark_repository_uri = "247231311879.dkr.ecr.eu-west-2.amazonaws.com/sagemaker-spark-example:latest"

## Preprocessing Job

Now that the container image is in ECR, we can create the preprocessing script that uses Spark (in this case, PySpark) to prepare the data.

We will first create the preprocessing script, and then a preprocessing job definition using SageMaker.

In [9]:
%%writefile preprocess.py
from __future__ import print_function
from __future__ import unicode_literals

import sys
import os

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler


def csv_line(data):
    r = ",".join(str(d) for d in data[1])
    return str(data[0]) + "," + r


def main():
    spark = SparkSession.builder.appName("PySparkAbalone").getOrCreate()

    # Convert command line args into a map of args
    args_iter = iter(sys.argv[1:])
    args = dict(zip(args_iter, args_iter))

    # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
    )

    # Defining the schema corresponding to the input data. The input data does not contain the headers
    schema = StructType(
        [
            StructField("sex", StringType(), True),
            StructField("length", DoubleType(), True),
            StructField("diameter", DoubleType(), True),
            StructField("height", DoubleType(), True),
            StructField("whole_weight", DoubleType(), True),
            StructField("shucked_weight", DoubleType(), True),
            StructField("viscera_weight", DoubleType(), True),
            StructField("shell_weight", DoubleType(), True),
            StructField("rings", DoubleType(), True),
        ]
    )

    # Downloading the data from S3 into a Dataframe
    total_df = spark.read.csv(
        (
            "s3a://"
            + os.path.join(args["s3_input_bucket"], args["s3_input_key_prefix"], "abalone.csv")
        ),
        header=False,
        schema=schema,
    )

    # StringIndexer on the sex column which has categorical value
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")

    # one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")

    # vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format
    assembler = VectorAssembler(
        inputCols=[
            "sex_vec",
            "length",
            "diameter",
            "height",
            "whole_weight",
            "shucked_weight",
            "viscera_weight",
            "shell_weight",
        ],
        outputCol="features",
    )

    # The pipeline consists of the steps defined above
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])

    # This step trains the feature transformers
    model = pipeline.fit(total_df)

    # This step transforms the dataset with information obtained from the previous fit
    transformed_total_df = model.transform(total_df)

    # Split the overall dataset into 80-20 training and validation
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])

    # Convert the train dataframe to RDD to save in CSV format and upload to S3
    train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    train_lines = train_rdd.map(csv_line)
    train_lines.saveAsTextFile(
        "s3a://" + os.path.join(args["s3_output_bucket"], args["s3_output_key_prefix"], "train")
    )

    # Convert the validation dataframe to RDD to save in CSV format and upload to S3
    validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    validation_lines = validation_rdd.map(csv_line)
    validation_lines.saveAsTextFile(
        "s3a://"
        + os.path.join(args["s3_output_bucket"], args["s3_output_key_prefix"], "validation")
    )


if __name__ == "__main__":
    main()

Writing preprocess.py
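The script's `main()` expects its command-line arguments as alternating key/value pairs and converts them into a dictionary by zipping an iterator with itself. The sketch below isolates that idiom so it can be tried without Spark. `parse_pairs` is a name introduced here for illustration, and the bucket and prefix values are placeholders.

```python
def parse_pairs(argv):
    """Turn ["k1", "v1", "k2", "v2"] into {"k1": "v1", "k2": "v2"}."""
    args_iter = iter(argv)
    # Both arguments to zip are the same iterator, so each step consumes
    # two consecutive items: the first becomes the key, the second the value.
    return dict(zip(args_iter, args_iter))


# Placeholder values in the same key order the script reads
argv = [
    "s3_input_bucket", "my-bucket",
    "s3_input_key_prefix", "demo/input/raw/abalone",
    "s3_output_bucket", "my-bucket",
    "s3_output_key_prefix", "demo/input/preprocessed/abalone",
]

args = parse_pairs(argv)
print(args["s3_input_key_prefix"])
```

One consequence of this design is that a key without a value at the end of the list is silently dropped, so the argument list passed to the job needs an even number of entries.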


In [13]:
from sagemaker.processing import ScriptProcessor, ProcessingInput

# Define the preprocessing job, and use the container we just built by specifying an image_uri value.
spark_processor = ScriptProcessor(
    base_job_name="spark-preprocessor",
    image_uri=spark_repository_uri,
    command=["/opt/program/submit"],
    role=role_arn,
    instance_count=2,
    instance_type="ml.t3.xlarge", # Note: this instance type is not covered by the free tier.
    max_runtime_in_seconds=1200,
    env={"mode": "python"},
)

spark_processor.run(
    code="preprocess.py",
    arguments=[
        "s3_input_bucket",
        default_bucket,
        "s3_input_key_prefix",
        input_prefix,
        "s3_output_bucket",
        default_bucket,
        "s3_output_key_prefix",
        input_preprocessed_prefix,
    ],
    logs=False,
)

INFO:sagemaker:Creating processing-job with name spark-preprocessor-2023-02-07-21-26-03-070



Job Name:  spark-preprocessor-2023-02-07-21-26-03-070
Inputs:  [{'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-eu-west-2-247231311879/spark-preprocessor-2023-02-07-21-26-03-070/input/code/preprocess.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  []
........................................................................!

Now, let's inspect the data produced by the preprocessing job. We will first download one output file (in this case, a text file) and then display its first 5 lines.

**Note:** The first column is the target variable, and the remaining columns are the features. There is no header row because the intended next step is to train an XGBoost model, and the SageMaker built-in XGBoost algorithm expects CSV input without headers.
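To make the column layout concrete, the sketch below parses one line in the format the job writes: the `rings` label first, followed by the assembled features in the order given to `VectorAssembler`. It assumes that the three `sex` categories were encoded into two one-hot columns, which is what Spark's `OneHotEncoder` produces with its default of dropping the last category. The names `sex_vec_0` and `sex_vec_1` and the helper `parse_line` are hypothetical and do not appear in the output file.

```python
FEATURE_NAMES = [
    "sex_vec_0", "sex_vec_1",  # hypothetical names for the two one-hot columns
    "length", "diameter", "height",
    "whole_weight", "shucked_weight", "viscera_weight", "shell_weight",
]


def parse_line(line):
    """Split one output line into the label and a name -> value feature mapping."""
    values = [float(v) for v in line.strip().split(",")]
    label, features = values[0], values[1:]
    if len(features) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} features, got {len(features)}")
    return label, dict(zip(FEATURE_NAMES, features))


# One line from the training output of this notebook run
label, features = parse_line("5.0,0.0,0.0,0.275,0.195,0.07,0.08,0.031,0.0215,0.025\n")
print(label)
print(features["length"])
```

Under the same assumption, a row whose two one-hot columns are both `0.0` belongs to the category that was dropped by the encoder.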

In [18]:
!aws s3 cp --quiet s3://$default_bucket/$input_preprocessed_prefix/train/part-00000 .

In [23]:
with open("part-00000") as f: # File downloaded to the current working directory by the previous cell
    for _ in range(5): # first 5 lines
        print(f.readline())

5.0,0.0,0.0,0.275,0.195,0.07,0.08,0.031,0.0215,0.025

6.0,0.0,0.0,0.29,0.21,0.075,0.275,0.113,0.0675,0.035

5.0,0.0,0.0,0.29,0.225,0.075,0.14,0.0515,0.0235,0.04

7.0,0.0,0.0,0.305,0.225,0.07,0.1485,0.0585,0.0335,0.045

7.0,0.0,0.0,0.305,0.23,0.08,0.156,0.0675,0.0345,0.048

