# From Delta Lake to Amazon SageMaker

[Delta Lake](https://delta.io/) is a common open-source framework used for storing data in Lakehouse architectures.

In this sample we demonstrate how to integrate Delta tables with Amazon SageMaker for data exploration, ingestion, processing, training, and hosting in machine learning workflows.

---

## 0 - Connection Set-up - Via AWS Glue Interactive Sessions in Amazon SageMaker Studio

***Use Kernel "SparkAnalytics 2.0 (Glue PySpark)" for running this notebook***

In this notebook, we will set up a connection between Amazon SageMaker and a Delta table.
This time we will use the built-in integration between Amazon SageMaker and AWS Glue, through the "Glue Interactive Sessions" kernel in SageMaker Studio.

<center><img src="../images/DeltaLake_to_SageMaker_0_GIS.png" width="50%"></center>

This method provides the following advantages over the alternatives explored in the previous notebook:
* It uses SageMaker Studio notebooks as the development environment for setting up the connections, exploring, and processing the Delta Lake tables
* It uses the serverless, highly scalable compute of AWS Glue to power the querying, exploration, and pre-processing of the Delta Lake tables' data

#### Important - Prerequisite:

To use the Delta Lake library, the session must point to the delta-core JAR file stored in Amazon S3. Upload the file "delta-core_2.12-1.0.1.jar" included with this repo to your preferred S3 location, and update the paths passed to `%extra_jars` and `%extra_py_files` in the following cell accordingly.
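The upload can be done in several ways (console, AWS CLI, SDK). The following is a minimal sketch using boto3, assuming it runs from an environment that has AWS credentials and write access to the target bucket. The helper names `jar_s3_uri` and `upload_delta_jar`, and any bucket or prefix you pass in, are hypothetical and not part of this repo.

```python
import posixpath

JAR_NAME = "delta-core_2.12-1.0.1.jar"  # file shipped with this repo


def jar_s3_uri(bucket: str, prefix: str, jar_name: str = JAR_NAME) -> str:
    """Build the s3:// URI to paste into %extra_jars and %extra_py_files."""
    key = posixpath.join(prefix.strip("/"), jar_name)
    return f"s3://{bucket}/{key}"


def upload_delta_jar(local_path: str, bucket: str, prefix: str) -> str:
    """Upload the jar with boto3 and return its S3 URI (hypothetical helper)."""
    import boto3  # imported lazily so jar_s3_uri works without AWS access

    key = posixpath.join(prefix.strip("/"), JAR_NAME)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return jar_s3_uri(bucket, prefix)
```

For example, `upload_delta_jar("delta-core_2.12-1.0.1.jar", "<your-bucket>", "delta_to_sagemaker/delta_jar")` would return the URI to use in the magics below, with `<your-bucket>` replaced by a bucket you own.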


In [10]:
%help

%session_id_prefix delta-to-sagemaker-
%glue_version 3.0
%idle_timeout 480
%additional_python_modules 'sagemaker'
%extra_jars "s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_jar/delta-core_2.12-1.0.1.jar"
%extra_py_files "s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_jar/delta-core_2.12-1.0.1.jar"
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://sagemaker-eu-west-1-889960878219/gis-spark-logs/",
"--conf": "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
"--conf": "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
}

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.37.2 and you have 0.36 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel

Available Magic Commands

## Sessions Magics
%help | Return a list of descriptions and input types for all magic commands. 
%profile | String | Specify a profile in your aws configuration to use as the credentials provider.
%region | String | Specify the AWS region in which to initialize a session | Default from ~/.aws/configure
%idle_timeout | Int | The number of minutes of inactivity after which a session will timeout. The default idle timeout value is 2880 minutes (48 hours).
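
One detail of the `%%configure` cell above is worth checking: its JSON body uses the key `"--conf"` twice. The sketch below shows how Python's standard `json` module treats such a body (the last value for a repeated key wins) and how duplicates can be detected. Whether the Glue kernel parses the body the same way is an assumption, but it is consistent with the session log in the next cell's output, which lists only the `spark.sql.extensions` setting. The helper `find_duplicate_keys` is hypothetical.

```python
import json

# Same shape as the %%configure body used in this notebook
configure_body = """
{
  "--enable-spark-ui": "true",
  "--conf": "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
  "--conf": "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
}
"""


def find_duplicate_keys(text: str) -> list:
    """Return every key that appears more than once in any JSON object in `text`."""
    duplicates = []

    def hook(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                duplicates.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(text, object_pairs_hook=hook)
    return duplicates


parsed = json.loads(configure_body)
print(parsed["--conf"])                     # only the last "--conf" value survives
print(find_duplicate_keys(configure_body))  # keys that were repeated
```

In this notebook the dropped catalog setting is applied again when the `SparkSession` is built in a later cell. A commonly used alternative is to chain several settings inside a single `"--conf"` value, separated by ` --conf `.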

In [1]:
import sagemaker
from pyspark.sql import SparkSession
# The Python `delta` module is loaded from the jar passed via %extra_py_files
from delta import DeltaTable

# Display the installed SageMaker Python SDK version
sagemaker.__version__

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::889960878219:role/service-role/AmazonSageMaker-ExecutionRole-20180920T165537
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: delta-to-sagemaker--c089f5f4-27a3-4ff0-8a90-94ed24d70642
Applying the following default arguments:
--glue_kernel_version 0.36
--enable-glue-datacatalog true
--additional-python-modules sagemaker
--extra-jars s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_jar/delta-core_2.12-1.0.1.jar
--extra-py-files s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_jar/delta-core_2.12-1.0.1.jar
--enable-spark-ui true
--spark-event-logs-path s3://sagemaker-eu-west-1-889960878219/gis-spark-logs/
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
Waiting for session delta-to-sagemaker--c089f5f4-27a3-4ff0-8a90-94ed24d70642 to get into ready status...
Session delta-to-sagemaker--c089f5f4-27a3-4ff0-8a90-

In [2]:
# S3 bucket for saving processing job outputs
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
print('Default bucket: '+bucket)

Default bucket: sagemaker-eu-west-1-889960878219


In [3]:
spark = (SparkSession
    .builder
    .appName("PySparkApp") 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") 
    .getOrCreate())

sc = spark.sparkContext

print('Spark version: '+str(sc.version))

Spark version: 3.1.1-amzn-0


In [4]:
s3a_delta_table_uri=f's3a://{bucket}/delta_to_sagemaker/delta_format/'
print(s3a_delta_table_uri)

s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/


In [5]:
print(f'Is this a Delta Table?:\n{DeltaTable.isDeltaTable(spark, s3a_delta_table_uri)}')

Is this a Delta Table?:
True
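
The check above only prints its result. If later cells should fail fast when the path is not a Delta table, the check and the read can be combined. This is a minimal sketch: `load_delta_table` is a hypothetical helper, and the check function is passed in as an argument so the helper does not import Spark or Delta itself.

```python
def load_delta_table(spark, uri, is_delta_table):
    """Return a DataFrame for `uri`, or raise if it is not a Delta table.

    `is_delta_table` is a callable taking (spark, uri), for example
    DeltaTable.isDeltaTable as used in the cell above.
    """
    if not is_delta_table(spark, uri):
        raise ValueError(f"No Delta table found at {uri}")
    return spark.read.format("delta").load(uri)
```

In this notebook it could be called as `load_delta_table(spark, s3a_delta_table_uri, DeltaTable.isDeltaTable)`.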


In [6]:
# Read the Delta table into a Spark DataFrame, reusing the URI defined above
rating_df_gis = spark.read.format("delta").load(s3a_delta_table_uri)
rating_df_gis.show(10)

+-----+----------+--------+------+-------+--------------+-----------+--------------+
|rowID| timestamp|ratingID|userID|placeID|rating_overall|rating_food|rating_service|
+-----+----------+--------+------+-------+--------------+-----------+--------------+
|    0|2022-08-25|    3416|    gK|    681|             1|          2|             2|
|    1|2022-08-25|    3417|    gK|    719|             1|          1|             1|
|    2|2022-08-25|    3418|    gK|   1128|             1|          2|             2|
|    3|2022-08-25|    3419|    gK|   1203|             1|          2|             2|
|    4|2022-08-25|    3420|    gK|   1058|             1|          1|             1|
|    5|2022-08-25|    3421|    gK|    585|             1|          0|             0|
|    6|2022-08-25|    3422|    gL|    990|             2|          2|             2|
|    7|2022-08-25|    3423|    gL|   1192|             2|          2|             2|
|    8|2022-08-25|    3424|    gL|   1390|             2|        

----

In the next notebook, we will perform exploratory data analysis using the SageMaker Data Wrangler interactive widgets embedded in the notebook.