## Amazon SageMaker Processing in Fannie Mae Sandbox
### PySpark example

This notebook shows a working example of using the Amazon SageMaker Processing feature to run PySpark data preprocessing workloads on the Amazon SageMaker platform within the Fannie Mae Sandbox environment.

The data and scripts used in this notebook originally come from the SageMaker example notebook: \
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/feature_transformation_with_sagemaker_processing/feature_transformation_with_sagemaker_processing.ipynb

#### 0. Prerequisite

Take the following steps to make sure that the latest version of the SageMaker Python SDK is installed:

- pip install sagemaker --upgrade
- pip install pyarrow
- check the version of SageMaker installed
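- ensure that `sagemaker_notebook_config.py` is uploaded to the same directory as this notebook (only needed if you import `SageMakerNotebookConfig` from it instead of using the inline class definition in section 1)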

In [None]:
!pip install sagemaker --upgrade

In [None]:
!pip install pyarrow

In [None]:
import sagemaker
print(sagemaker.__version__)

If the SageMaker version shown above is 2.11 or later, you can move forward. If not, check that the notebook kernel was restarted after `pip install`, then rerun the cell above.
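A plain string comparison would mis-order versions such as `2.9` and `2.11`, so a numeric check can be handy. The sketch below is one possible way to do this with the standard library only; `version_tuple` and `is_at_least` are hypothetical helper names and are not part of the SageMaker SDK.

```python
def version_tuple(version, width=3):
    """Parse '2.11.0' into (2, 11, 0), ignoring suffixes such as 'rc1'."""
    numbers = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # stop at the first non-numeric component
        numbers.append(int(digits))
    # pad with zeros so that '2.11' and '2.11.0' compare as equal
    numbers += [0] * (width - len(numbers))
    return tuple(numbers)


def is_at_least(installed, minimum="2.11"):
    """Return True when the installed version is numerically >= the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


print(is_at_least("2.200.1"))  # a later release passes
print(is_at_least("2.9.3"))    # 2.9 is older than 2.11, despite sorting later as a string
```

In the notebook, `is_at_least(sagemaker.__version__)` could replace the visual check of the printed version.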

#### 1. Configuration


- set an existing bucket as the default bucket of the SageMaker session to bypass the `CreateBucket` operation
- retrieve the notebook instance configuration
- upload the input data to S3, in both CSV and Parquet formats

In [None]:
import json
import warnings
import boto3

import pandas as pd

import sagemaker

#from sagemaker_notebook_config import SageMakerNotebookConfig
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
from sagemaker.network import NetworkConfig

In [None]:
bucket = "fnma-dsmlp-devl-edl-us-east-1-edl"
user_home_folder = "home/user/gaubxs"  # the folder that was created in s3 bucket

sagemaker_session = sagemaker.Session(default_bucket=bucket)

# run the following line to bypass `try: create_bucket` block in sagemaker session:
sagemaker_session._default_bucket = bucket

# test if the default bucket has been set 
print(sagemaker_session.default_bucket())

In [None]:
class SageMakerNotebookConfig:
    def __init__(self, notebook_instance_name=None, fp_resource_metadata=None):
        self.fp_resource_metadata = fp_resource_metadata or self._fp_resource_metadata_default
        self.notebook_metadata = self.get_resource_metadata()
        self.notebook_instance_name = self.get_instance_name(notebook_instance_name)

        self._sgm_client = boto3.client("sagemaker")
        self._ec2_client = boto3.client("ec2")

    @property
    def _fp_resource_metadata_default(self):
        return "/opt/ml/metadata/resource-metadata.json"

    def get_resource_metadata(self):
        with open(self.fp_resource_metadata) as fp:
            metadata = json.load(fp)
        return metadata

    def get_instance_name(self, notebook_instance_name):
        if notebook_instance_name and self.notebook_metadata["ResourceName"] != notebook_instance_name:
            message = "Inconsistent Notebook Instance Name(s): \n"
            message += f"Input instance name: {notebook_instance_name} \n"
            message += "Instance name found in default metadata: {} \n".format(self.notebook_metadata["ResourceName"])
            message += "Will use the input instance name. \n"
            warnings.warn(message, ResourceWarning)

            return notebook_instance_name
        else:
            return self.notebook_metadata["ResourceName"]
    
    @property
    def desc_nb_instance(self):
        return self._sgm_client.describe_notebook_instance(NotebookInstanceName=self.notebook_instance_name)
    
    @property
    def notebook_instance_arn(self):
        return self.desc_nb_instance["NotebookInstanceArn"]
    
    @property
    def account_id(self):
        return self.notebook_instance_arn.split(":")[4]
    
    @property
    def instance_region(self):
        return self.notebook_instance_arn.split(":")[3]
    
    @property
    def subnet_id(self):
        return self.desc_nb_instance.get("SubnetId", None)
    
    @property
    def security_groups(self):
        return self.desc_nb_instance.get("SecurityGroups", [])

    @property
    def role(self):
        return self.desc_nb_instance.get("RoleArn", "")
    
    @property
    def kms_id(self):
        return self.desc_nb_instance.get("KmsKeyId", "")
    
    @property
    def kms_key(self):
        # [-1] also handles a bare key ID (or an empty value) without raising IndexError
        return self.kms_id.split("key/")[-1]
    
    @property
    def tags(self):
        tags = []
        for tag_dict in self._sgm_client.list_tags(ResourceArn=self.notebook_instance_arn).get("Tags", []):
            if not tag_dict["Key"].startswith("aws"):
                tags.append(tag_dict)
        return tags
    
    def get_tag_value(self, key_name):
        for tag in self.tags:
            if tag["Key"] == key_name:
                return tag["Value"]
        
        message = f"Key name: {key_name} not found in tags \n"
        message += "Return None"
        warnings.warn(message)
        return None

    def vpc_config(self, tag_key_filter=("Function",)):
        # need to get the vpc_id first
        # use the subnet_id of the current notebook instance
        # to retrieve the vpc_id
        
        desc_current_subnet = self._ec2_client.describe_subnets(SubnetIds=[self.subnet_id])["Subnets"][0]
        
        vpc_id = desc_current_subnet["VpcId"]
        
        # create a tag filter based on the input tag_key_filter 
        # and the corresponding value of the current subnet
        
        tag_filter = []
        for key in tag_key_filter:
            for tag in desc_current_subnet.get("Tags", []):
                if tag["Key"] == key:
                    tag_filter.append({"Key": key, "Value": tag["Value"]})
                    break
            else:
                # the subnet does not carry this tag: skip it rather than reuse a stale value
                warnings.warn(f"Tag key {key} not found on subnet {self.subnet_id}")
        
        # use filtering to get a list of subnets inside the same vpc
        
        filters = [{"Name": "vpc-id", "Values": [vpc_id]}]
        
        for tag in tag_filter:
            filters.append({"Name": "tag:{}".format(tag["Key"]), "Values": [tag["Value"]]})
        
        subnets = [subnet["SubnetId"] for subnet in self._ec2_client.describe_subnets(Filters=filters)["Subnets"]]
        
        return {
            "SecurityGroupIds": self.security_groups,
            "Subnets": subnets
        }
    
            

In [None]:
config = SageMakerNotebookConfig()
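The `account_id`, `instance_region` and `kms_key` properties rely on the colon-separated layout of AWS ARNs. As a rough illustration of that parsing, the sketch below splits a made-up ARN into named parts; the ARN values and the `parse_arn` helper are hypothetical and are not taken from the Sandbox.

```python
# Hypothetical ARNs, for illustration only
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/example-notebook"
example_kms_arn = "arn:aws:kms:us-east-1:123456789012:key/11111111-2222-3333-4444-555555555555"


def parse_arn(arn):
    """Split an ARN into its named components.

    ARNs have the form arn:partition:service:region:account-id:resource.
    The resource part may itself contain colons, hence maxsplit=5.
    """
    _, partition, service, region, account_id, resource = arn.split(":", 5)
    return {
        "partition": partition,
        "service": service,
        "region": region,          # index 3, as used by instance_region
        "account_id": account_id,  # index 4, as used by account_id
        "resource": resource,
    }


parsed = parse_arn(example_arn)
print(parsed["region"], parsed["account_id"])

# the key ID is whatever follows "key/" in a KMS key ARN
kms_key_id = example_kms_arn.split("key/")[-1]
print(kms_key_id)
```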

#### 2. Download the sample dataset and upload it to S3

In [None]:
!wget https://s3-us-west-2.amazonaws.com/sparkml-mleap/data/abalone/abalone.csv

column_names = [
    "sex", 
    "length", 
    "diameter", 
    "height", 
    "whole_weight", 
    "shucked_weight", 
    "viscera_weight", 
    "shell_weight", 
    "rings", 
]

df = pd.read_csv("abalone.csv", header=None, names=column_names)
df.to_parquet("abalone.parquet")
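Because the CSV file has no header row, the column names are assigned purely by position, and pandas may silently fill short rows with missing values. A quick field-count check before the conversion can catch that. The sketch below uses only the standard library; `find_bad_rows` is a hypothetical helper and the sample rows are illustrative values in the same nine-column layout.

```python
import csv
import io


def find_bad_rows(lines, expected_fields):
    """Return the 1-based line numbers whose field count differs from expected_fields."""
    bad = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected_fields:
            bad.append(line_number)
    return bad


# Illustrative rows; the last one is deliberately too short
sample = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
    "I,0.33,0.255,0.08,0.205\n"
)
bad_rows = find_bad_rows(sample, expected_fields=9)
print(bad_rows)
```

In the notebook, the same function could be applied to the downloaded file with `open("abalone.csv", newline="")` and `len(column_names)`.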





In [None]:
# upload csv data file to S3
input_data = f"s3://{bucket}/{user_home_folder}/sagemaker_processing/examples/abalone/input/abalone.csv"
!aws s3 cp abalone.csv $input_data --sse aws:kms --sse-kms-key-id $config.kms_id

# upload parquet data file to S3
input_data = f"s3://{bucket}/{user_home_folder}/sagemaker_processing/examples/abalone/input/abalone.parquet"
!aws s3 cp abalone.parquet $input_data --sse aws:kms --sse-kms-key-id $config.kms_id
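The S3 locations in this notebook all share the same bucket and home folder, so building them through one small helper may reduce copy-paste mistakes. This is only a sketch: `s3_uri` is a hypothetical helper, and the bucket and folder names below are placeholders.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without doubled slashes."""
    key = "/".join(part.strip("/") for part in parts if part.strip("/"))
    return f"s3://{bucket}/{key}"


# Placeholder bucket and folder names, for illustration only
base = ("home/user/example", "sagemaker_processing/examples/abalone")
csv_uri = s3_uri("example-bucket", *base, "input", "abalone.csv")
parquet_uri = s3_uri("example-bucket", *base, "input", "abalone.parquet")
print(csv_uri)
print(parquet_uri)
```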

#### 3. Preprocessing by using SageMaker PySparkProcessor

1. Create the PySpark processing script
2. Upload the script to S3
3. Upload the data that needs to be processed to S3


***Note***
At a high level, data processing is divided into three parts:
1) Reading the data
2) Processing the data
3) Saving the output (processed data)

Reading the data: the script should read from a local file system path, because SageMaker Processing copies the S3 input to the path given as the `ProcessingInput` destination (`/opt/ml/processing/input-data` in this example) before the script starts.

Saving the output: the script writes its output to the local file system (`/opt/ml/processing/output-data` in this example), and SageMaker Processing then copies it to S3.
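The path convention described in this note can be sketched as follows. The `file://` scheme is spelled out so that Spark reads from the container's local disk rather than from whatever default file system the cluster is configured with. `local_spark_uri` is a hypothetical helper; the directory names match the ones used in this example.

```python
from pathlib import PurePosixPath

# Root directory that SageMaker Processing uses inside the container
PROCESSING_ROOT = PurePosixPath("/opt/ml/processing")


def local_spark_uri(subdir):
    """Return a file:// URI for a directory under /opt/ml/processing."""
    return "file://" + str(PROCESSING_ROOT / subdir)


input_uri = local_spark_uri("input-data")    # where the S3 input is copied to
output_uri = local_spark_uri("output-data")  # where the script writes its results
print(input_uri)
print(output_uri)
```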

In [None]:
%%writefile preprocess_pyspark.py
import glob
import os

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

print("&"*100)
root_dir = '/opt/ml/processing'
input_data = 'file:///opt/ml/processing/input-data'
output_data = 'file:///opt/ml/processing/output-data'

# list everything under the processing root to help debug the input/output mounts
for filename in glob.iglob(root_dir + '/**', recursive=True):
    print(filename)
for currentpath, folders, files in os.walk(root_dir):
    print(currentpath, folders, files)



def csv_line(data):
    r = ','.join(str(d) for d in data[1])
    return str(data[0]) + "," + r

def main():
    print("start preprocessing job.")
    print("%"*100)
    spark = SparkSession.builder.appName("PySparkAbalone").getOrCreate()
    

    
    print("set up schema")
    # Schema of the headerless CSV input. It is not applied below, because Parquet files carry their own schema.
    schema = StructType([StructField("sex", StringType(), True), 
                         StructField("length", DoubleType(), True),
                         StructField("diameter", DoubleType(), True),
                         StructField("height", DoubleType(), True),
                         StructField("whole_weight", DoubleType(), True),
                         StructField("shucked_weight", DoubleType(), True),
                         StructField("viscera_weight", DoubleType(), True), 
                         StructField("shell_weight", DoubleType(), True), 
                         StructField("rings", DoubleType(), True)])

    
    print("read parquet")
    print("!"*100)
    
    ###########################################################################
    ###################### reading data #######################################
    ###########################################################################
    #total_df = spark.read.parquet('file:///opt/ml/processing/input/abalone.parquet')
    total_df = spark.read.parquet(input_data)
    print(total_df.head())
    
    print("Successfully read parquet file as input")
    
    ###########################################################################
    ###################### data processing ####################################
    ###########################################################################

    #StringIndexer on the sex column which has categorical value
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")
    
    #one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")

    #vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format
    assembler = VectorAssembler(inputCols=["sex_vec", 
                                           "length", 
                                           "diameter", 
                                           "height", 
                                           "whole_weight", 
                                           "shucked_weight", 
                                           "viscera_weight", 
                                           "shell_weight"], 
                                outputCol="features")
    
    # The pipeline comprises the steps defined above
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])
    
    # This step trains the feature transformers
    model = pipeline.fit(total_df)
    
    # This step transforms the dataset with information obtained from the previous fit
    transformed_total_df = model.transform(total_df)
    
    # Split the overall dataset into 80-20 training and validation
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])
    
    # Convert the train dataframe to RDD to save in CSV format and upload to S3
    #train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    #train_lines = train_rdd.map(csv_line)
    #train_lines.saveAsTextFile('s3a://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'train'))
    

    ###########################################################################
    ###################### saving output ######################################
    ###########################################################################
    #transformed_file_name='transformed.parquet'
    #validation_file_name='validate.parquet'
    transformed_total_df.write.parquet(output_data+'/transformed')
    validation_df.write.parquet(output_data+'/validate')
    
    # os.walk needs a plain local path, not a file:// URI
    for currentpath, folders, files in os.walk(root_dir + '/output-data'):
        print(currentpath, folders, files)

    
    # Convert the validation dataframe to RDD to save in CSV format and upload to S3
    #validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    #validation_lines = validation_rdd.map(csv_line)
    #validation_lines.saveAsTextFile('s3a://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'validation'))
    
    # The validation data is already written in Parquet format above; SageMaker Processing copies it to S3
    
if __name__ == "__main__":
    main()
    print("#"*100)
    print("end of job")
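To make the feature transformation in the script more concrete, here is a plain-Python imitation of what `StringIndexer` followed by `OneHotEncoder` does to the `sex` column. It is a simplified sketch, not the Spark implementation: it assumes the default settings (labels ordered by descending frequency, last category dropped), uses made-up values, and returns integer indices where Spark produces doubles.

```python
from collections import Counter


def string_index(values):
    """Imitate StringIndexer's default ordering: the most frequent label gets index 0."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Imitate OneHotEncoder: with drop_last, the final category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


sex_column = ["M", "M", "M", "F", "F", "I"]  # made-up values with no frequency ties
mapping = string_index(sex_column)
encoded = {label: one_hot(index, len(mapping)) for label, index in mapping.items()}
print(mapping)
print(encoded)
```

With three categories the encoded vector has two entries, so the assembled `features` vector would hold those two entries plus the seven numeric columns.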

In [None]:
input_code = f"s3://{bucket}/{user_home_folder}/sagemaker_processing/examples/abalone/input/preprocess_pyspark.py"
!aws s3 cp preprocess_pyspark.py $input_code --sse aws:kms --sse-kms-key-id $config.kms_id

In [None]:

input_data =  f"s3://{bucket}/{user_home_folder}/sagemaker_processing/examples/abalone/input/abalone.parquet"
output_location = f"s3://{bucket}/{user_home_folder}/sagemaker_processing/spark/census-income/output"
spark_log_location = f"s3://{bucket}/{user_home_folder}/store-spark-events"

**Note**
`/opt/program/submit` is the entry point of the pre-built Spark Docker container. Inside the container, `/opt/program/submit` is a Python script that executes the input Python script shown above.

#### 4. Run the SageMaker PySparkProcessor to process the data

1. Create the PySparkProcessor object. For the list of parameters, refer to: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.spark.processing.PySparkProcessor
2. Run the PySparkProcessor to perform the data processing

In [None]:
from sagemaker.spark.processing import PySparkProcessor

# look up the VPC settings once; each call to vpc_config() queries the EC2 API
vpc_config = config.vpc_config()

spark_processor = PySparkProcessor(
    framework_version="2.4",
    role=config.role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    py_version="py37",
    container_version="1",
    volume_kms_key=config.kms_id,
    output_kms_key=config.kms_id,
    base_job_name='dsmlp-usr-gaubxs',
    sagemaker_session=sagemaker_session,
    tags=config.tags,
    network_config=NetworkConfig(
        enable_network_isolation=False,  # the processing instance needs to communicate with S3
        encrypt_inter_container_traffic=True,
        security_group_ids=vpc_config['SecurityGroupIds'],
        subnets=vpc_config['Subnets'],
    ),
    max_runtime_in_seconds=1200,
)
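If the tag filter matches no subnets, the dictionary returned by `vpc_config()` would carry an empty list, and that may only surface as an error once the job is submitted. A small pre-flight check along the following lines could flag it earlier. `check_vpc_config` is a hypothetical helper, and the IDs below are placeholders.

```python
def check_vpc_config(vpc_config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if not vpc_config.get("Subnets"):
        problems.append("no subnets matched the tag filter")
    if not vpc_config.get("SecurityGroupIds"):
        problems.append("the notebook instance reports no security groups")
    return problems


# Placeholder values, for illustration only
good = {"SecurityGroupIds": ["sg-0123456789abcdef0"], "Subnets": ["subnet-0123456789abcdef0"]}
bad = {"SecurityGroupIds": [], "Subnets": []}
print(check_vpc_config(good))
print(check_vpc_config(bad))
```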



In [None]:
spark_processor.run(
    submit_app=input_code,
    inputs=[
        ProcessingInput(
            source=input_data,
            destination='/opt/ml/processing/input-data'
        )
    ],
    outputs=[
        ProcessingOutput(
            destination=output_location,
            output_name='train_data',
            source='/opt/ml/processing/output-data'
        ),
    ],
    spark_event_logs_s3_uri=spark_log_location,
    #wait=False
)
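Once the job has finished, it can be useful to confirm that the output files actually landed in S3. The sketch below is one way to list them with boto3, assuming the notebook role is allowed to list the bucket. `split_s3_uri` and `list_output_keys` are hypothetical helpers, and the URI in the example is a placeholder.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, prefix = uri[len("s3://"):].partition("/")
    return bucket, prefix


def list_output_keys(uri):
    """List the object keys under an S3 URI (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so split_s3_uri works without it

    bucket, prefix = split_s3_uri(uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Placeholder URI, for illustration only
print(split_s3_uri("s3://example-bucket/home/user/example/output"))
```

In the notebook, `list_output_keys(output_location)` would be expected to show keys under the `transformed` and `validate` folders written by the script.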