# Data preparation in a Spark cluster for ML in Amazon SageMaker

This example prepares the [New York City Taxi and Limousine Commission Trip Record Data](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset for machine learning in [Amazon SageMaker](https://aws.amazon.com/sagemaker/).

This example requires that this Jupyter notebook run in an [Amazon SageMaker Notebook instance](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html) attached to a Spark cluster. If you have not already done so, first create an Amazon SageMaker Notebook instance attached either to an AWS Glue Development Endpoint (Option 1) or to an Amazon EMR cluster (Option 2), as described below.

### Option 1: Create Amazon SageMaker Notebook instance attached to AWS Glue Development Endpoint

The first option is to create an Amazon SageMaker Notebook instance attached to an [AWS Glue Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint.html), as described below:

  -  [Add an AWS Glue Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/add-dev-endpoint.html)
  -  [Use an Amazon SageMaker Notebook with your AWS Glue Development Endpoint](https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-sage.html)

### Option 2: Create Amazon SageMaker Notebook instance attached to Amazon EMR

The second option is to create an [Amazon EMR Cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-launch.html) using <b>Advanced Options</b> to include <b>Spark and Livy</b> in the software for the EMR cluster. Then create an Amazon SageMaker Notebook instance attached to the Amazon EMR cluster you just created, following this [Amazon blog post](https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/).

After you create the Amazon SageMaker Notebook instance using one of the two options described above, reopen this Jupyter notebook from the notebook instance you just created, so that this notebook can use the attached Spark cluster.

### Configure PySpark

Below we configure PySpark to use Python 3. 

In [None]:
%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
         }
}

### Load data into a DataFrame 
Next, we create a Resilient Distributed Dataset (RDD) from a CSV file stored in an S3 bucket as part of the [New York City Taxi and Limousine Commission (TLC) Trip Record Data](https://registry.opendata.aws/nyc-tlc-trip-records-pds/) dataset in the [Registry of Open Data on AWS](https://registry.opendata.aws/). We then convert the RDD to a PySpark DataFrame. Because each line is split as plain text, every column is loaded as a string.

In [None]:
# Load the CSV file as an RDD of text lines
lines = sc.textFile("s3://nyc-tlc/misc/uber_nyc_data.csv")
# Split lines into columns; change the split() argument to match the delimiter, e.g. '\t'
parts = lines.map(lambda l: l.split(','))
# Convert the RDD into a DataFrame
uber_df = spark.createDataFrame(parts, ['id','origin_taz','destination_taz','pickup_datetime','trip_distance','trip_duration'])
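
To make the split step concrete, here is a minimal pure-Python sketch of what the `map` above does to a single line. The sample line is invented for illustration and is not taken from the dataset. Note that a plain `split(',')` does not handle quoted fields that contain commas, and that every value remains a string.

```python
# Column names as passed to createDataFrame above
columns = ['id', 'origin_taz', 'destination_taz',
           'pickup_datetime', 'trip_distance', 'trip_duration']

# Hypothetical sample line, invented for illustration only
sample_line = "1,7C,6A,2014-09-01 09:00:00,4.25,0:15:11"

# Same operation the lambda applies to each line of the RDD
fields = sample_line.split(',')

# Pair each value with its column name; all values are still strings
row = dict(zip(columns, fields))
print(row)
```

The same split is applied to the header line of the file, so the header shows up as an ordinary row of data until it is filtered out.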

We print the schema of the data we just loaded and also show 10 rows to understand the data.

In [None]:
# printSchema() prints the schema itself and returns None, so no print() is needed
uber_df.printSchema()
uber_df.show(10)

### Clean data ###
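Next, we clean the data. We remove the `id` column, drop any rows with null or NaN values, and filter out rows in which a field holds the literal string `'NULL'`, as well as the header row, which was loaded as data.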

In [None]:
# clean up data 
# remove id column as we don't need it
uber_df1=uber_df.drop(uber_df.id)

# drop all rows with any null value
uber_df1=uber_df1.dropna(how='any')

# keep rows where destination, origin, and trip duration are not the string 'NULL',
# and drop the header row that was loaded as data
uber_df1=uber_df1.filter((uber_df1.destination_taz != 'NULL')  & 
    (uber_df1.origin_taz != 'NULL')  & 
    (uber_df1.trip_duration != 'NULL')  & 
    (uber_df1.destination_taz != 'destination_taz'))

# show 10 rows
uber_df1.show(10)
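
The explicit filter is needed in addition to `dropna` because, after a plain-text split, a missing value arrives as the literal string `'NULL'` rather than as a real null. The sketch below expresses the same predicate in plain Python on a few rows that are invented for illustration; `keep_row` is a hypothetical helper, not part of the notebook.

```python
def keep_row(row):
    """Return True if the row passes the same checks as the filter above."""
    return (row['destination_taz'] != 'NULL'
            and row['origin_taz'] != 'NULL'
            and row['trip_duration'] != 'NULL'
            and row['destination_taz'] != 'destination_taz')

# Hypothetical rows, invented for illustration only
rows = [
    # header line that was loaded as data
    {'origin_taz': 'origin_taz', 'destination_taz': 'destination_taz',
     'trip_duration': 'trip_duration'},
    # row with a missing destination
    {'origin_taz': '7C', 'destination_taz': 'NULL', 'trip_duration': '0:15:11'},
    # complete row
    {'origin_taz': '7C', 'destination_taz': '6A', 'trip_duration': '0:15:11'},
]

kept = [r for r in rows if keep_row(r)]
print(len(kept))  # 1
```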

### Define PySpark user-defined functions ###
Below, we import the Python classes and functions needed to define PySpark user-defined functions.

In [None]:
from pyspark.sql.functions import udf, to_timestamp
from pyspark.sql.types import IntegerType
from datetime import datetime

Below we define a PySpark user-defined function for extracting the ordinal day of the week (Monday is 0, Sunday is 6) from the pickup timestamp.

In [None]:
# define UDF for extracting pickup day of the week from datetime

def weekday(x):
    pickup=datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    return int(pickup.date().weekday())
    
pickup_day_udf = udf(weekday, IntegerType())
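
The function can be checked locally without Spark. The sketch below repeats the `weekday` function and runs it on two timestamps chosen for illustration. Python's `weekday()` numbers the days from Monday (0) to Sunday (6); Spark's built-in `dayofweek` function uses a different convention (1 for Sunday through 7 for Saturday), which matters if the two are ever mixed.

```python
from datetime import datetime

def weekday(x):
    # Parse the timestamp string and return the day of the week (Monday = 0)
    pickup = datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    return int(pickup.date().weekday())

# 2014-09-01 was a Monday and 2014-09-07 was a Sunday
print(weekday('2014-09-01 09:00:00'))  # 0
print(weekday('2014-09-07 23:59:59'))  # 6
```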

Below we define a PySpark user-defined function for extracting month from the pickup date timestamp.

In [None]:
# define month udf for extracting pickup month from datetime
def month(x):
    pickup=datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    return int(pickup.date().month)
    
pickup_month_udf = udf(month, IntegerType())

Below we define a PySpark user-defined function for extracting hour of the day from the pickup date timestamp.

In [None]:
# define pickup_time udf for extracting pickup hour from datetime

def pickup_time(x):
    ptime = datetime.strptime(x, '%Y-%m-%d %H:%M:%S').time()
    return int(ptime.hour)
    
pickup_time_udf = udf(pickup_time, IntegerType())
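
The three date-related functions each parse the same timestamp string. As a minimal sketch of how they could share one parse, the hypothetical helper `pickup_features` below (not part of the notebook) returns month, weekday, and hour together. Registering it as a Spark UDF would require a struct return type, which is not shown here.

```python
from datetime import datetime

# Same format string as the UDFs above
PICKUP_FORMAT = '%Y-%m-%d %H:%M:%S'

def pickup_features(x):
    """Parse the timestamp once and return (month, weekday, hour)."""
    pickup = datetime.strptime(x, PICKUP_FORMAT)
    return pickup.month, pickup.weekday(), pickup.hour

# Illustrative timestamp: Monday 1 September 2014, 09:30
print(pickup_features('2014-09-01 09:30:00'))  # (9, 0, 9)
```

A string that does not match the format makes `strptime` raise a `ValueError`, so malformed timestamps need to be removed before this step.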

Below we define a PySpark user-defined function that parses the origin and destination zone codes as hexadecimal integers.

In [None]:
def encode_taz(x):
   return int(x, 16)

taz_udf=udf(encode_taz, IntegerType())
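
As a worked example, `int(x, 16)` reads the string as a base-16 number. The zone codes below are invented for illustration. The sketch also includes a hypothetical variant, `encode_taz_or_none`, which returns `None` for input that is not valid hexadecimal instead of raising; a `None` returned from a UDF becomes a null in the resulting column.

```python
def encode_taz(x):
    # Interpret the zone code as a hexadecimal number
    return int(x, 16)

print(encode_taz('7C'))  # 7*16 + 12 = 124
print(encode_taz('2B'))  # 2*16 + 11 = 43

def encode_taz_or_none(x):
    """Hypothetical variant: return None when x is not a valid hex string."""
    try:
        return int(x, 16)
    except (TypeError, ValueError):
        return None

print(encode_taz_or_none('NULL'))  # None
```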

Below we define a PySpark user-defined function that computes duration of the trip in minutes.

In [None]:
# define duration udf for converting trip duration to whole minutes
def duration(x):
    # the first field is hours and the second is minutes; any further fields are ignored
    parts = x.split(':')
    return int(parts[0]) * 60 + int(parts[1])

duration_udf = udf(duration, IntegerType())
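
The conversion can be verified locally. The sketch below assumes the duration string has the form `hours:minutes:seconds`; the sample values are invented, and `duration_minutes` is a hypothetical name used only for this example. The hours field must be converted to an integer before it is multiplied by 60, because multiplying a Python string repeats it instead.

```python
def duration_minutes(x):
    """Convert an 'hours:minutes:seconds' string to whole minutes."""
    parts = x.split(':')
    # Convert to int first, then multiply
    return int(parts[0]) * 60 + int(parts[1])

print(duration_minutes('0:15:11'))  # 15
print(duration_minutes('1:05:00'))  # 65

# String repetition, shown for contrast: this is not arithmetic
print('1' * 3)  # 111
```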

### Prepare data for SageMaker XGBoost algorithm ###
For CSV input, the SageMaker XGBoost algorithm expects the label to be in the first column. We therefore transform the PySpark DataFrame so that `duration` is the first column, because we want to train a model that predicts the duration of an Uber ride given the origin and destination zones, the month of the year, the day of the week, and the hour of the day. The columns of the DataFrame are transformed using the PySpark user-defined functions defined above.

We also drop any rows that contain null or NaN values after the transformations. Finally, we keep only the rows with a duration greater than 0 and less than 120 minutes.

In [None]:
# create a new data frame
# we want trip duration (minutes) in the first column as label for the row
# our feature vector includes origin, destination, and pickup month, day, and hour
# we will discard other columns
uber_df2 = uber_df1.select(duration_udf(uber_df1.trip_duration).alias('duration'),
    taz_udf(uber_df1.origin_taz).alias('origin'), 
    taz_udf(uber_df1.destination_taz).alias('destination'), 
    pickup_month_udf(uber_df1.pickup_datetime).alias('month'), 
    pickup_day_udf(uber_df1.pickup_datetime).alias('day'), 
    pickup_time_udf(uber_df1.pickup_datetime).alias('pickup_time'))

uber_df3 = uber_df2.dropna(how='any')
uber_df4 = uber_df3.filter((uber_df3.duration > 0) & (uber_df3.duration < 120))

In [None]:
# show 
uber_df4.show(10)
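
To illustrate the resulting row layout and the filter bounds without Spark, the sketch below applies the same rules to individual rows in plain Python. `to_training_row` is a hypothetical helper and the values are invented; it only mirrors the column order and the two filters used above.

```python
def to_training_row(duration, origin, destination, month, day, pickup_time):
    """Return the label-first row, or None if the row would be dropped."""
    values = [duration, origin, destination, month, day, pickup_time]
    # Mirrors dropna(how='any'): drop the row if any value is missing
    if any(v is None for v in values):
        return None
    # Mirrors the filter: both bounds are exclusive
    if not (0 < duration < 120):
        return None
    return values

print(to_training_row(15, 124, 106, 9, 0, 9))    # [15, 124, 106, 9, 0, 9]
print(to_training_row(120, 124, 106, 9, 0, 9))   # None
print(to_training_row(15, None, 106, 9, 0, 9))   # None
```

Because both comparisons are strict, trips of exactly 0 or exactly 120 minutes are excluded.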

### Save prepared data in S3 bucket ###
Finally, we save the transformed PySpark DataFrame to an S3 bucket. First, set the name of the S3 bucket to use.

In [None]:
# bucket name for saving PySpark output
bucket_name=

Because Spark writes each partition of a DataFrame as a separate file, the data is saved in S3 as multiple files in CSV format.

In [None]:
# save the prepared DataFrame to the S3 bucket
uber_df4.write.save(f"s3://{bucket_name}/emr/output/uber_nyc/v1", format='csv', header=False)