# Create an Iceberg Table using Amazon Athena and AWS Glue Catalog
This notebook creates a table in the AWS Glue Data Catalog (a Hive-compatible metastore) so that the dataset files in S3 can be queried from Amazon Athena. The table is based on the `Amazon Customer Reviews Dataset` in S3. We use the Apache Iceberg table format to create an ACID-compliant data source in S3 that supports capabilities such as row-level updates and time travel.


# Upload Maven JARs to S3

In [5]:
!curl -O https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.15.40/bundle-2.15.40.jar
!curl -O https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.1/iceberg-spark-runtime-3.1_2.12-0.13.1.jar
!curl -O https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.15.40/url-connection-client-2.15.40.jar

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
It looks like there is a newer version of the kernel available. The latest version is 0.37.2 and you have 0.37.0 installed.
Please run `pip install --upgrade aws-glue-sessions` to upgrade your kernel
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  245M  100  245M    0     0   290M      0 --:--:-- --:--:-- --:--:--  290M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 21.0M  100 21.0M    0     0   189M      0 --:--:-- --:--:-- --:--:--  190M
  % Tota

In [7]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION 


371366150581
us-east-1


In [9]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION && aws s3 cp iceberg-spark-runtime-3.1_2.12-0.13.1.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar && aws s3 cp bundle-2.15.40.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/bundle-2.15.40.jar && aws s3 cp url-connection-client-2.15.40.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/url-connection-client-2.15.40.jar && aws s3 ls s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/


371366150581
us-east-1
upload: ./iceberg-spark-runtime-3.1_2.12-0.13.1.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar
upload: ./bundle-2.15.40.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/bundle-2.15.40.jar
upload: ./url-connection-client-2.15.40.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/url-connection-client-2.15.40.jar
2023-04-04 18:21:47  257939967 bundle-2.15.40.jar
2023-04-04 18:21:45   22123750 iceberg-spark-runtime-3.1_2.12-0.13.1.jar
2023-04-04 18:21:50      21027 url-connection-client-2.15.40.jar


# Set Up the Glue Interactive Session

In [63]:
%stop_session

Stopping session: iceberg-to-sagemaker-f1d25814-d8b5-4c4a-8de9-c58487529b89
Stopped session.


In [13]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION 


371366150581
us-east-1


# YOU MUST UPDATE THE FOLLOWING CELL WITH YOUR REGION (e.g. `us-east-1`) AND ACCOUNT_ID (e.g. `123456789012`) IN 3 PLACES

In [None]:
%session_id_prefix iceberg-to-sagemaker
%additional_python_modules seaborn,psutil,sagemaker
%number_of_workers 10
%extra_jars s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/bundle-2.15.40.jar,s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar,s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/url-connection-client-2.15.40.jar
%glue_version 3.0
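
The three paths in `%extra_jars` differ only in the bucket name, so one way to avoid hand-editing them is to generate the line from your region and account ID. The following is a minimal sketch: the helper name `extra_jars_magic` is hypothetical, while the bucket layout (`sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/`) and the JAR file names are taken from the upload cell above.

```python
# JAR file names as uploaded earlier in this notebook, in the same order
# as they appear in the %extra_jars line.
JARS = [
    "bundle-2.15.40.jar",
    "iceberg-spark-runtime-3.1_2.12-0.13.1.jar",
    "url-connection-client-2.15.40.jar",
]


def extra_jars_magic(region: str, account_id: str) -> str:
    """Return the %extra_jars line for the given region and account ID."""
    prefix = f"s3://sagemaker-{region}-{account_id}/glue-iceberg-jars"
    return "%extra_jars " + ",".join(f"{prefix}/{jar}" for jar in JARS)


# Example values only; substitute your own region and 12-digit account ID.
print(extra_jars_magic("us-east-1", "123456789012"))
```

Paste the printed line into the cell above in place of the `%extra_jars` line.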

In [71]:
%%configure
{
    "conf": "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=demo-iceberg-gis"
}

The following configurations have been updated: {'conf': 'spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=demo-iceberg-gis'}
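
The `conf` value above is a single string in which the first `key=value` pair stands alone and every later pair is prefixed with `--conf`. Because the string is long and easy to mistype, it may help to assemble it from a dictionary. The sketch below does that with the same settings used in this notebook; the helper names `build_conf` and `iceberg_settings` are hypothetical, and the warehouse path is a placeholder you would replace with your own bucket.

```python
import json


def build_conf(settings: dict) -> str:
    """Join key=value pairs with ' --conf ', leaving the first pair unprefixed."""
    return " --conf ".join(f"{key}={value}" for key, value in settings.items())


def iceberg_settings(warehouse: str,
                     lock_table: str = "demo-iceberg-gis",
                     catalog: str = "glue_catalog") -> dict:
    """Spark settings for an Iceberg catalog backed by Glue (values from this notebook)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
        f"{prefix}.lock.table": lock_table,
    }


# Placeholder warehouse path: replace REGION and ACCOUNT_ID with your own values.
conf = build_conf(iceberg_settings("s3://sagemaker-REGION-ACCOUNT_ID/iceberg/"))

# json.dumps produces a valid JSON body for the %%configure cell.
body = json.dumps({"conf": conf}, indent=4)
print(body)
```

Copy the printed JSON into a `%%configure` cell. Note that the warehouse path embeds the same region and account ID as the `%extra_jars` paths, so it needs updating too.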


In [1]:
spark

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::371366150581:role/SageMakerRepoRole
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 10
Session ID: iceberg-to-sagemaker-9c333891-afeb-4c75-b5df-a9801138c2a8
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
--additional-python-modules seaborn,psutil,sagemaker
--extra-jars s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/bundle-2.15.40.jar,s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar,s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/url-connection-client-2.15.40.jar
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionE

# Create the Iceberg Database in Glue Catalog

In [3]:
catalog = "glue_catalog"
database = "iceberg_reviews"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{database}")

DataFrame[]


In [4]:
dbs = spark.sql("SHOW DATABASES")
assert database in dbs.toPandas()['namespace'].values, "Database has not been created properly"




# Write Data to Iceberg Format
**This section will take a few minutes because it copies the source data into your S3 bucket.**

In [112]:
table = 'reviews'  # table name used throughout the rest of the notebook
statement = f"DROP TABLE IF EXISTS {catalog}.{database}.{table}"
spark.sql(statement)

DataFrame[]


In [113]:
df_parquet = spark.read.parquet("s3://amazon-reviews-pds/parquet/")
df_parquet = df_parquet[df_parquet.product_category.isin(['Home', 'Furniture', 'Shoes'])]




In [114]:
df_parquet_old = df_parquet[df_parquet.product_category.isin(['Home', 'Furniture'])]
df_parquet_new = df_parquet[df_parquet.product_category.isin(['Shoes'])]




In [115]:
table = 'reviews'
df_parquet_old.writeTo(f"{catalog}.{database}.{table}").tableProperty("format-version", "2").createOrReplace()




In [116]:
df_parquet_old.count()

7020781


In [117]:
df_parquet_new.count()

4379475


# Verify The Table Has Been Created Successfully

In [118]:
statement = f"SHOW TABLES in {catalog}.{database}"
df_tables = spark.sql(statement).toPandas()
df_tables

         namespace tableName
0  iceberg_reviews   reviews


# Run A Sample Query

In [119]:
query = f"""
SELECT product_category, product_title, star_rating from {catalog}.{database}.{table}
WHERE product_category = "Home"
"""
df = spark.sql(query)
df.show(10)

+----------------+--------------------+-----------+
|product_category|       product_title|star_rating|
+----------------+--------------------+-----------+
|            Home|Reed & Barton Kin...|          5|
|            Home|Mpi 6-1/2-Inch by...|          3|
|            Home|Leachco Snoogle O...|          3|
|            Home|Darice 68-Piece A...|          5|
|            Home|Eureka Enviro Har...|          3|
|            Home|Hoover Vacuum Cle...|          2|
|            Home|Wall Mounted Squi...|          5|
|            Home|      Hoover Shampoo|          4|
|            Home|Woodland Camo Com...|          4|
|            Home|ORECK Steam-It St...|          2|
+----------------+--------------------+-----------+
only showing top 10 rows


In [120]:
empty = df.count() == 0
if empty:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH ATHENA.")
    print("LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.             ")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++")
else:
    print("[OK]")

[OK]


# Demonstrate Iceberg Time Travel

In [121]:
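import datetime as dt

# Append the 'Shoes' reviews as a new Iceberg snapshot.
(
    df_parquet_new
    .write
    .format("iceberg")
    .mode("append")
    .save(f"{catalog}.{database}.{table}")
)
# Record the time just after the append commits; used below for time travel.
append_timestamp = dt.datetime.now()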




In [123]:
iceberg_table = (
    spark
    .read
    .format("iceberg")
    .load(f"{catalog}.{database}.{table}")
)
iceberg_table.select('product_category').distinct().orderBy('product_category', ascending=False).show(5)

+----------------+
|product_category|
+----------------+
|           Shoes|
|            Home|
|       Furniture|
+----------------+


In [124]:
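# Read the table as it was 10 seconds before append_timestamp was recorded,
# which is expected to fall before the append was committed.
# "as-of-timestamp" takes milliseconds since the Unix epoch.
iceberg_table = (
    spark
    .read
    .option("as-of-timestamp", str(int((append_timestamp.timestamp() - 10) * 1000)))
    .format("iceberg")
    .load(f"{catalog}.{database}.{table}")
)
iceberg_table.select('product_category').distinct().orderBy('product_category', ascending=False).show(5)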

+----------------+
|product_category|
+----------------+
|            Home|
|       Furniture|
+----------------+
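
The timestamp approach above depends on a fixed 10 second margin. A possible alternative is to look up snapshot IDs in Iceberg's `snapshots` metadata table and read with the `snapshot-id` option. The sketch below wraps both approaches in small helpers; the function names are hypothetical, and the metadata table columns and read option should be checked against the Iceberg 0.13.1 documentation before relying on them.

```python
import datetime as dt


def as_of_millis(moment: dt.datetime, margin_seconds: int = 10) -> str:
    """Epoch milliseconds (as a string) for a point margin_seconds before moment."""
    return str(int((moment.timestamp() - margin_seconds) * 1000))


def snapshots_sql(catalog: str, database: str, table: str) -> str:
    """SQL listing the table's snapshots, oldest first."""
    return (
        "SELECT committed_at, snapshot_id, operation "
        f"FROM {catalog}.{database}.{table}.snapshots "
        "ORDER BY committed_at"
    )


def read_snapshot(spark, catalog: str, database: str, table: str, snapshot_id: int):
    """Load the table as of a specific snapshot ID (needs a live Spark session)."""
    return (
        spark.read
        .option("snapshot-id", snapshot_id)
        .format("iceberg")
        .load(f"{catalog}.{database}.{table}")
    )


# 20 seconds after the epoch, minus the 10 second margin, in milliseconds.
print(as_of_millis(dt.datetime(1970, 1, 1, 0, 0, 20, tzinfo=dt.timezone.utc)))  # prints 10000
```

In the notebook session, `spark.sql(snapshots_sql(catalog, database, table)).show()` would list the snapshots, and the ID of the first one could be passed to `read_snapshot` to see the table before the `Shoes` append.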


# Perform Iceberg Optimizations on Data Lake Files

In [74]:
# WIP
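
Until this section is filled in, here is one possible starting point. Iceberg exposes maintenance operations as Spark SQL stored procedures, which are available when `IcebergSparkSessionExtensions` is enabled as in the session configuration above. The sketch below only builds the `CALL` statements; the helper names are hypothetical, and the procedure names and arguments are assumptions to verify against the Iceberg 0.13.1 documentation.

```python
import datetime as dt


def rewrite_data_files_sql(catalog: str, database: str, table: str) -> str:
    """CALL statement that compacts small data files into larger ones."""
    return f"CALL {catalog}.system.rewrite_data_files(table => '{database}.{table}')"


def expire_snapshots_sql(catalog: str, database: str, table: str,
                         older_than: dt.datetime) -> str:
    """CALL statement that expires snapshots committed before older_than."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{database}.{table}', older_than => TIMESTAMP '{ts}')"
    )


print(rewrite_data_files_sql("glue_catalog", "iceberg_reviews", "reviews"))
print(expire_snapshots_sql("glue_catalog", "iceberg_reviews", "reviews",
                           dt.datetime(2023, 4, 1)))
```

Each statement would be run with `spark.sql(...)` in the Glue session. Expiring snapshots removes the ability to time travel to them, so run it only after the time travel demonstration above.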