# Perform ETL and train a model using PySpark
---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

---

To perform extract, transform, and load (ETL) operations on multiple files, we recommend opening a Jupyter notebook within Amazon SageMaker Studio and using the `Glue PySpark and Ray` kernel. The kernel is connected to an AWS Glue Interactive Session. The session connects your notebook to a cluster that automatically scales the storage and compute up to meet your data processing needs. When you shut down the kernel, the session stops and you're no longer charged for the compute on the cluster.

Within the notebook, you can use Spark commands to join and transform your data. Writing Spark commands can be faster and easier than writing SQL queries. For example, you can use the `join` command to join two tables. Instead of writing a query that can sometimes take minutes to complete, you can often join the tables within seconds.

To show the utility of using the PySpark kernel for your ETL and model training workflows, we predict the fare amount using the NYC taxi dataset. The notebook imports data from 47 files across 2 different Amazon Simple Storage Service (Amazon S3) locations. Amazon S3 is an object storage service that you can use to save and access data and machine learning artifacts for your models. For more information about Amazon S3, see [What is Amazon S3?](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html).

The notebook is not meant to be a comprehensive analysis. Instead, it's meant to be a proof of concept to help you quickly get started.

__Prerequisites:__

This tutorial assumes that you're in the us-east-1 AWS Region. It also assumes that you've provided the IAM role you're using to run the notebook with permissions to use AWS Glue. For more information, see [Providing AWS Glue permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/perform-etl-and-train-model-pyspark.html#providing-aws-glue-permissions).

## Solution overview 

To perform ETL on the NYC taxi data and train a model, we do the following:

1. Start a Glue Session and load the SageMaker Python SDK.
2. Set up the utilities needed to work with AWS Glue.
3. Load the data from Amazon S3 into Spark dataframes.
4. Verify that we've loaded the data successfully.
5. Save a 20,000-row sample of the Spark dataframe as a pandas dataframe.
6. Create a correlation matrix as an example of the types of analyses we can perform.
7. Split the Spark dataframe into training, validation, and test datasets.
8. Write the datasets to Amazon S3 locations that can be accessed by an Amazon SageMaker training job.
9. Use the training and validation datasets to train a model.

### Start a Glue Session and load the SageMaker Python SDK

In [None]:
%additional_python_modules sagemaker

### Set up the utilities needed to work with AWS Glue

We're importing `Join` to join our Spark dataframes. `GlueContext` provides methods for transforming our dataframes. In the context of the notebook, it reads the data from the Amazon S3 locations and uses the Spark cluster to transform the data. `SparkContext` represents the connection to the Spark cluster. `GlueContext` uses `SparkContext` to transform the data. `getResolvedOptions` lets you resolve configuration options within the Glue interactive session.

In [None]:
import sys
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

### Create the `df_ride_info` dataframe

Create a single dataframe from all the ride_info Parquet files for 2019.

In [None]:
df_ride_info = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            "s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year-multiple-files/ride-info/year=2019/"
        ],
        "recurse": True,
    },
).toDF()

### Create the `df_ride_fare` dataframe

Create a single dataframe from all the ride_fare Parquet files for 2019.

In [None]:
df_ride_fare = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            "s3://dsoaws/nyc-taxi-orig-cleaned-split-parquet-per-year-multiple-files/ride-fare/year=2019/"
        ],
        "recurse": True,
    },
).toDF()

### Show the first five rows of `df_ride_fare`

In [None]:
df_ride_fare.show(5)

### Join `df_ride_fare` and `df_ride_info` on the `ride_id` column

In [None]:
df_joined = df_ride_info.join(df_ride_fare, ["ride_id"])
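
To make the join semantics concrete, the following is a plain-Python sketch (not Spark code) of what an inner join on `ride_id` does: only rows whose `ride_id` appears in both tables survive, and their columns are merged. The sample records, the column names other than `ride_id`, and the `inner_join` helper are invented for illustration and are not taken from the dataset.

```python
# Hypothetical miniature versions of the two tables; all values are invented.
ride_info = [
    {"ride_id": 1, "passenger_count": 2},
    {"ride_id": 2, "passenger_count": 1},
    {"ride_id": 3, "passenger_count": 4},
]
ride_fare = [
    {"ride_id": 1, "total_amount": 12.5},
    {"ride_id": 3, "total_amount": 30.0},
    {"ride_id": 4, "total_amount": 8.0},
]


def inner_join(left, right, key):
    """Return merged rows for every key value present in both inputs."""
    # Index the right-hand rows by key so each left row is matched quickly.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


joined = inner_join(ride_info, ride_fare, "ride_id")
for row in joined:
    print(row)
```

Rides 2 and 4 are dropped because each appears in only one table. This is why the row count after an inner join can be smaller than the row count of either input.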

### Show the first five rows of the joined dataframe

In [None]:
df_joined.show(5)

### Show the data types of the dataframe

In [None]:
df_joined.printSchema()

### Count the number of rows

In [None]:
df_joined.count()

### Drop duplicates if there are any

In [None]:
df_no_dups = df_joined.dropDuplicates(["ride_id"])

### Count the number of rows after dropping the duplicates

In this case, there were no duplicates in the original dataframe.

In [None]:
df_no_dups.count()
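
Comparing the row counts before and after deduplication is how we conclude that there were no duplicates. The following plain-Python sketch illustrates the same check on invented data. The `drop_duplicates` helper is hypothetical and keeps the first row per key, whereas Spark's `dropDuplicates` does not guarantee which row of a duplicate group is kept.

```python
def drop_duplicates(rows, key):
    """Keep one row per distinct key value (here: the first one seen)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Invented rows: ride_id 2 appears twice.
rows = [
    {"ride_id": 1, "fare": 10.0},
    {"ride_id": 2, "fare": 7.5},
    {"ride_id": 2, "fare": 7.5},
    {"ride_id": 3, "fare": 22.0},
]

deduped = drop_duplicates(rows, "ride_id")
n_removed = len(rows) - len(deduped)
print(f"rows before: {len(rows)}, after: {len(deduped)}, removed: {n_removed}")
```

If the two counts are equal, as they were for the joined dataframe, the key column contained no duplicates.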

### Drop columns
Time series data and categorical data are outside the scope of this notebook, so we drop those columns.

In [None]:
df_cleaned = df_no_dups.drop(
    "pickup_at", "dropoff_at", "store_and_fwd_flag", "vendor_id", "payment_type"
)

### Take a sample from the dataframe and convert it to a pandas dataframe

In [None]:
df_sample = df_cleaned.sample(False, 0.1, seed=0).limit(20000)

In [None]:
df_sample.count()

In [None]:
df_pandas = df_sample.toPandas()
df_pandas.describe()

In [None]:
print("Dataset shape: ", df_pandas.shape)

In [None]:
df_pandas.head()

In [None]:
df_pandas.info()

### Create a correlation matrix of the features

We're creating a correlation matrix to see which features are the most predictive. This is an example of an analysis that you can use for your own use case.

In [None]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd  # imported here because earlier imports may be lost if you return to the notebook after a while

vector_col = "corr_features"
assembler = VectorAssembler(inputCols=df_sample.columns, outputCol=vector_col)
df_vector = assembler.transform(df_sample).select(vector_col)

matrix = Correlation.corr(df_vector, vector_col).collect()[0][0]
corr_matrix = matrix.toArray().tolist()
corr_matrix_df = pd.DataFrame(data=corr_matrix, columns=df_sample.columns, index=df_sample.columns)

plt.figure(figsize=(16, 10))
sns.heatmap(
    corr_matrix_df,
    xticklabels=corr_matrix_df.columns.values,
    yticklabels=corr_matrix_df.columns.values,
    cmap="Greens",
    annot=True,
)

%matplot plt
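
For reference, `Correlation.corr` computes the Pearson correlation coefficient by default, so each cell of the heatmap is the Pearson coefficient between two columns. The following standard-library sketch shows that calculation for a single pair of columns. The `pearson` helper and the sample values are illustrative assumptions, not part of the notebook or the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values for two hypothetical columns.
trip_distance = [1.0, 2.0, 3.0, 4.0, 5.0]
fare_amount = [5.0, 7.5, 10.0, 12.5, 15.0]  # exactly linear in trip_distance

r = pearson(trip_distance, fare_amount)
print(round(r, 6))
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a coefficient near 0 indicates a weak one. Pearson correlation measures only linear association, so a low value doesn't rule out a nonlinear relationship with the target.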

### Split the dataset into train, validation, and test sets

In [None]:
df_train, df_val, df_test = df_cleaned.randomSplit([0.7, 0.15, 0.15])
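
Note that `randomSplit` normalizes the weights if they don't sum to 1 and treats them as approximate proportions, so the split sizes are not exact. It also accepts an optional `seed` argument (for example, `df_cleaned.randomSplit([0.7, 0.15, 0.15], seed=42)`) if you want more reproducible splits. The following plain-Python sketch illustrates the idea conceptually: each row gets a random draw and lands in the split whose weight range contains it. This is not Spark's implementation, and the `random_split` helper is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.7, 0.85, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(10_000))
train, val, test = random_split(rows, [0.7, 0.15, 0.15], seed=42)
print(len(train), len(val), len(test))
```

The three sizes add up to the number of input rows, but each is only close to its target proportion, not equal to it.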

### Define the Amazon S3 locations that store the datasets

If you're getting a module not found error, restart the kernel and run all the cells again.

In [None]:
# Define the S3 locations to store the datasets
import boto3
import sagemaker

sagemaker_session = sagemaker.Session()
s3_bucket = sagemaker_session.default_bucket()
train_data_prefix = "sandbox/glue-demo/train"
validation_data_prefix = "sandbox/glue-demo/validation"
test_data_prefix = "sandbox/glue-demo/test"
region = boto3.Session().region_name

### Write the files to the locations

In [None]:
df_train.write.parquet(f"s3://{s3_bucket}/{train_data_prefix}", mode="overwrite")

In [None]:
df_val.write.parquet(f"s3://{s3_bucket}/{validation_data_prefix}", mode="overwrite")

In [None]:
df_test.write.parquet(f"s3://{s3_bucket}/{test_data_prefix}", mode="overwrite")
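
The training job later reads from these same locations, so the URIs used for writing and for training must agree. One way to avoid a mismatch is to build each URI once and reuse it, as in the following sketch. The `s3_uri` helper and the bucket name `example-bucket` are hypothetical; the notebook itself uses the session's default bucket.

```python
def s3_uri(bucket, prefix):
    """Build an S3 URI for a prefix, normalizing stray slashes."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


# Hypothetical bucket name, used only for illustration.
bucket = "example-bucket"
base_prefix = "sandbox/glue-demo"

# One URI per dataset, built from a single base prefix.
dataset_uris = {
    name: s3_uri(bucket, f"{base_prefix}/{name}")
    for name in ("train", "validation", "test")
}

for name, uri in dataset_uris.items():
    print(name, uri)
```

You could then pass the same URI to both the Parquet writer and the training input, instead of formatting the path separately in each place.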

### Train a model

The following code uses the `df_train` and `df_val` datasets to train an XGBoost model. 

In [None]:
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput

hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
}

# Set an output path to save the trained model.
prefix = "sandbox/glue-demo"
output_path = f"s3://{s3_bucket}/{prefix}/xgb-built-in-algo/output"

# Retrieve the URI of the SageMaker built-in XGBoost container image.
# We use version 1.7-1; you can specify a different version if you prefer.
xgboost_container = image_uris.retrieve("xgboost", region, "1.7-1")

# Construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    output_path=output_path,
)

content_type = "application/x-parquet"
train_input = TrainingInput(f"s3://{s3_bucket}/{prefix}/train/", content_type=content_type)
validation_input = TrainingInput(
    f"s3://{s3_bucket}/{prefix}/validation/", content_type=content_type
)

# Run the XGBoost training job
estimator.fit({"train": train_input, "validation": validation_input})

### Clean up

To clean up, shut down the kernel. Shutting down the kernel stops the Glue cluster, so you aren't charged for any compute beyond what you used to run the tutorial.

## Notebook CI Test Results
    
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/use-cases|pyspark_etl_and_training|pyspark-etl-training.ipynb)