# From Delta Lake to Amazon SageMaker

[Delta Lake](https://delta.io/) is an open-source storage framework commonly used for building Lakehouse architectures.

This sample demonstrates how to integrate Delta tables with Amazon SageMaker for data exploration, ingestion, processing, training, and hosting of Machine Learning models.

---

## 1 - Data Exploration and Visualization

***Use Kernel "Data Science 3.0 (Python 3)" for running this notebook***

In this notebook, we will perform some Exploratory Data Analysis (EDA) over our Delta Tables.

In [2]:
import sagemaker
sagemaker.__version__

'2.130.0'

In [3]:
import numpy as np
import pandas as pd
import boto3

In [4]:
# S3 bucket for saving processing job outputs
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
region = sm_session.boto_region_name

sm_client = boto3.client('sagemaker')
iam_role = sagemaker.get_execution_role()

print('Default bucket: '+bucket)

Default bucket: sagemaker-eu-west-1-889960878219


In [5]:
# Import PySpark (the Spark session itself is built in the cells below)
from pyspark.sql import SparkSession

In [6]:
# Build list of packages entries using Maven coordinates (groupId:artifactId:version)
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")

packages=(",".join(pkg_list))
print('packages: '+packages)

packages: io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2
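Each entry above is a Maven coordinate. The `_2.12` suffix on `delta-core` is the Scala binary version, and Delta Lake 1.1.x is built against Spark 3.2.x, so these versions need to change together if you move to a different Spark release. As a minimal sketch, the string could also be assembled from its parts. The helper names `maven_coordinate` and `build_packages` are hypothetical and not part of the original sample:

```python
def maven_coordinate(group_id, artifact_id, version, scala_version=None):
    """Return 'groupId:artifactId:version', appending a Scala suffix if given."""
    if scala_version:
        artifact_id = f"{artifact_id}_{scala_version}"
    return f"{group_id}:{artifact_id}:{version}"


def build_packages(coordinates):
    """Join coordinates into the comma-separated form used by spark.jars.packages."""
    return ",".join(coordinates)


# Same two packages as in the cell above, built from their components
packages_sketch = build_packages([
    maven_coordinate("io.delta", "delta-core", "1.1.0", scala_version="2.12"),
    maven_coordinate("org.apache.hadoop", "hadoop-aws", "3.2.2"),
])
print("packages: " + packages_sketch)
```

Keeping the Scala version as a separate argument makes it harder to forget the suffix when a version is bumped.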


In [7]:
# Instantiate Spark via builder
# Note: we use the `ContainerCredentialsProvider` to give us access to underlying IAM role permissions

spark = (SparkSession
    .builder
    .appName("PySparkApp") 
    .config("spark.jars.packages", packages) 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") 
    .config("fs.s3a.aws.credentials.provider",'com.amazonaws.auth.ContainerCredentialsProvider') 
    .getOrCreate())

sc = spark.sparkContext

print('Spark version: '+str(sc.version))



:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c2e80bfc-2ebd-456a-a206-b0984bf78362;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.1.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 660ms :: artifacts dl 35ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
	io.delta#delta-core_2.12;1.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
	--------

Spark version: 3.2.0


In [8]:
s3a_delta_table_uri=f's3a://{bucket}/delta_to_sagemaker/delta_format/'
print(s3a_delta_table_uri)

s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/
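The `s3a://` scheme selects the Hadoop S3A connector provided by `hadoop-aws`, whereas the SageMaker SDK and boto3-based tooling typically work with `s3://` URIs. If the table location reaches you in the `s3://` form, a small conversion helper is one option. This is a sketch only; `to_s3a` is a hypothetical name, and it assumes a plain `scheme://bucket/key` URI without query parameters:

```python
from urllib.parse import urlparse


def to_s3a(uri):
    """Rewrite an s3:// (or s3n://) URI to the s3a:// scheme; s3a:// passes through."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"Not an S3 URI: {uri}")
    # Keep bucket (netloc) and key prefix (path) unchanged; only the scheme differs
    return f"s3a://{parsed.netloc}{parsed.path}"


# Hypothetical bucket name, used for illustration only
print(to_s3a("s3://my-bucket/delta_to_sagemaker/delta_format/"))
```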


In [9]:
# Create SQL command inserting the S3 path location

sql_cmd = f'SELECT * FROM delta.`{s3a_delta_table_uri}` ORDER BY timestamp'
print(f'SQL command: {sql_cmd}')

SQL command: SELECT * FROM delta.`s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/` ORDER BY timestamp
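The SQL string above addresses the table by path through the `delta.` prefix. An equivalent read can be expressed with the DataFrame reader, which avoids formatting the path into a SQL string. The following is a minimal sketch rather than the approach used in this sample; `read_delta_ordered` is a hypothetical helper that expects the Delta-enabled `spark` session created earlier:

```python
def read_delta_ordered(spark, table_uri, order_col="timestamp"):
    """Load a Delta table by path and sort it by one column.

    `spark` is assumed to be a SparkSession configured with the Delta Lake
    extension, such as the one built in the previous cells.
    """
    return spark.read.format("delta").load(table_uri).orderBy(order_col)


# In this notebook the call would look like:
# sql_results = read_delta_ordered(spark, s3a_delta_table_uri)
```

Both forms return a Spark `DataFrame`, so the remaining cells work unchanged with either.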


In [10]:
# Execute SQL command which returns dataframe
sql_results = spark.sql(sql_cmd)
print(type(sql_results))

sql_results.show(10)

23/01/30 10:56:55 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

<class 'pyspark.sql.dataframe.DataFrame'>


+-----+----------+--------+------+-------+--------------+-----------+--------------+
|rowID| timestamp|ratingID|userID|placeID|rating_overall|rating_food|rating_service|
+-----+----------+--------+------+-------+--------------+-----------+--------------+
|    0|2022-08-25|    3416|    gK|    681|             1|          2|             2|
|    1|2022-08-25|    3417|    gK|    719|             1|          1|             1|
|    2|2022-08-25|    3418|    gK|   1128|             1|          2|             2|
|    3|2022-08-25|    3419|    gK|   1203|             1|          2|             2|
|    4|2022-08-25|    3420|    gK|   1058|             1|          1|             1|
|    5|2022-08-25|    3421|    gK|    585|             1|          0|             0|
|    6|2022-08-25|    3422|    gL|    990|             2|          2|             2|
|    7|2022-08-25|    3423|    gL|   1192|             2|          2|             2|
|    8|2022-08-25|    3424|    gL|   1390|             2|        

                                                                                

In [11]:
# Importing sagemaker_datawrangler enables the interactive data preparation widget for pandas DataFrames
import sagemaker_datawrangler

df = sql_results.toPandas()
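Note that `toPandas()` collects every row onto the notebook instance, which is fine for the few thousand rows in this table but can exhaust memory on larger ones. One possible safeguard, sketched here under the assumption that a truncated sample is acceptable for exploration, is to cap the row count before converting. The helper name `to_pandas_capped` and its default limit are hypothetical:

```python
def to_pandas_capped(spark_df, max_rows=100_000):
    """Convert at most `max_rows` rows of a Spark DataFrame to pandas."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is applied in Spark, so only the capped rows are collected
    return spark_df.limit(max_rows).toPandas()


# In this notebook the call would look like:
# df = to_pandas_capped(sql_results)
```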

                                                                                

In [12]:
df

     rowID   timestamp ratingID userID placeID rating_overall rating_food  \
0        0  2022-08-25     3416     gK     681              1           2   
1        1  2022-08-25     3417     gK     719              1           1   
2        2  2022-08-25     3418     gK    1128              1           2   
3        3  2022-08-25     3419     gK    1203              1           2   
4        4  2022-08-25     3420     gK    1058              1           1   
...    ...         ...      ...    ...     ...            ...         ...   
8443  8443  2022-08-25    11859     zV     984              1           1   
8444  8444  2022-08-25    11860     zV    1311              0           1   
8445  8445  2022-08-25    11861     zV    1025              1           2   
8446  8446  2022-08-25    11862     zV     871              1           2   
8447  8447  2022-08-25    11863     zV     432              1           1   

     rating_service  
0                 2  
1                 1  
2        

Running the cell above should open the interactive data preparation widget embedded in your notebook, powered by SageMaker Data Wrangler.

The widget provides insights into the data, along with recommended transforms that improve data quality in preparation for training.

<center><img src="../images/DeltaLake_to_SageMaker_1.png" width="60%"></center>

---

In the following notebook, we rely on SageMaker Processing to perform these transformations. Alternatively, you could run them with SageMaker Data Wrangler by clicking the "Apply and export code" button on the suggested transforms.