# Create an Iceberg Table using Amazon Athena and AWS Glue Catalog
This notebook creates a table in the Glue Catalog (Hive Metastore) that lets us query the dataset files in S3 from Athena. The table is based on the `Amazon Customer Reviews Dataset` in S3. We use the Apache Iceberg format to create an ACID-compliant data source in S3 that supports capabilities such as in-place updates and time travel.


# Upload Maven JARs to S3

In [5]:
!curl -O https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.15.40/bundle-2.15.40.jar
!curl -O https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.1_2.12/0.13.1/iceberg-spark-runtime-3.1_2.12-0.13.1.jar
!curl -O https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.15.40/url-connection-client-2.15.40.jar


In [7]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION 


371366150581
us-east-1


In [9]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION && aws s3 cp iceberg-spark-runtime-3.1_2.12-0.13.1.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar && aws s3 cp bundle-2.15.40.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/bundle-2.15.40.jar && aws s3 cp url-connection-client-2.15.40.jar s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/url-connection-client-2.15.40.jar && aws s3 ls s3://sagemaker-$REGION-$ACCOUNT_ID/glue-iceberg-jars/


371366150581
us-east-1
upload: ./iceberg-spark-runtime-3.1_2.12-0.13.1.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar
upload: ./bundle-2.15.40.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/bundle-2.15.40.jar
upload: ./url-connection-client-2.15.40.jar to s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/url-connection-client-2.15.40.jar
2023-04-05 02:56:09  257939967 bundle-2.15.40.jar
2023-04-05 02:56:08   22123750 iceberg-spark-runtime-3.1_2.12-0.13.1.jar
2023-04-05 02:56:12      21027 url-connection-client-2.15.40.jar


# Set Up the Glue Interactive Session

In [11]:
!ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"') && echo $ACCOUNT_ID && REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]') && echo $REGION 


371366150581
us-east-1


In [13]:
%stop_session

There is no current session.


# YOU MUST UPDATE THE FOLLOWING CELL WITH YOUR REGION (e.g. `us-east-1`) AND ACCOUNT_ID (e.g. `123456789012`) IN 3 PLACES!

In [20]:
%session_id_prefix iceberg-to-sagemaker
%additional_python_modules seaborn,psutil,sagemaker
%number_of_workers 10
%extra_jars s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/bundle-2.15.40.jar,s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar,s3://sagemaker-REGION-ACCOUNT_ID/glue-iceberg-jars/url-connection-client-2.15.40.jar
%glue_version 3.0
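
The three `REGION` and `ACCOUNT_ID` placeholders all sit in the `%extra_jars` line. As a minimal sketch, the value can be generated rather than edited by hand. `extra_jars_value` and `JAR_NAMES` are hypothetical names introduced here, and the bucket layout is assumed to match the upload cell earlier in this notebook.

```python
# Hypothetical helper: builds the comma-separated value for the %extra_jars magic.
JAR_NAMES = [
    "bundle-2.15.40.jar",
    "iceberg-spark-runtime-3.1_2.12-0.13.1.jar",
    "url-connection-client-2.15.40.jar",
]


def extra_jars_value(region: str, account_id: str, prefix: str = "glue-iceberg-jars") -> str:
    """Return the S3 URIs of the uploaded JARs, joined with commas."""
    # Assumption: the JARs live in the default SageMaker bucket for this region and account.
    bucket = f"sagemaker-{region}-{account_id}"
    return ",".join(f"s3://{bucket}/{prefix}/{name}" for name in JAR_NAMES)


# Print the line to paste into the cell (the account ID here is a placeholder).
print("%extra_jars " + extra_jars_value("us-east-1", "123456789012"))
```

Paste the printed line over the `%extra_jars` line in the cell above.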

In [22]:
%%configure
{
    "conf": "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=demo-iceberg-gis"
}

The following configurations have been updated: {'conf': 'spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=demo-iceberg-gis'}
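
The `conf` value is one long string: the first property is written bare and every later one is prefixed with `--conf`. The sketch below assembles that string from a dictionary, which may be easier to review and to adapt to your own bucket. `build_conf` and `warehouse` are illustrative names, and the property keys are copied from the cell above.

```python
import json


def build_conf(properties: dict) -> str:
    """Join Spark properties into the single string used by the %%configure cell."""
    pairs = [f"{key}={value}" for key, value in properties.items()]
    # Only the separators carry "--conf"; the first pair has no prefix.
    return " --conf ".join(pairs)


# Assumption: replace this placeholder with your own warehouse location.
warehouse = "s3://sagemaker-us-east-1-123456789012/iceberg/"

properties = {
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.warehouse": warehouse,
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
    "spark.sql.catalog.glue_catalog.lock.table": "demo-iceberg-gis",
}

# Print a JSON body that can be pasted under the %%configure magic.
print(json.dumps({"conf": build_conf(properties)}, indent=4))
```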


In [1]:
spark

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::371366150581:role/SageMakerRepoRole
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 10
Session ID: iceberg-to-sagemaker-41cdae23-cf1c-4c1e-ba70-41e26f7a133f
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
--additional-python-modules seaborn,psutil,sagemaker
--extra-jars s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/bundle-2.15.40.jar,s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/iceberg-spark-runtime-3.1_2.12-0.13.1.jar,s3://sagemaker-us-east-1-371366150581/glue-iceberg-jars/url-connection-client-2.15.40.jar
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://sagemaker-us-east-1-371366150581/iceberg/ --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionE

# Create the Iceberg Database in Glue Catalog

In [2]:
catalog = "glue_catalog"
database = "iceberg_reviews"
spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{database}")

DataFrame[]


In [3]:
dbs = spark.sql("SHOW DATABASES")
assert database in dbs.toPandas()['namespace'].values, "Database has not been created properly"
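
The check above lists databases without naming a catalog. As a sketch of a check scoped to the Iceberg catalog, the helper below queries `SHOW NAMESPACES IN <catalog>` and compares the first column of each row. `namespace_exists` is a hypothetical name, and `spark` is assumed to be the session created earlier.

```python
def namespace_exists(spark, catalog: str, database: str) -> bool:
    """Return True if `database` is listed as a namespace of `catalog`."""
    rows = spark.sql(f"SHOW NAMESPACES IN {catalog}").collect()
    # Rows are tuple-like; the namespace name is in the first column.
    return any(row[0] == database for row in rows)
```

In this notebook it could be used as `assert namespace_exists(spark, catalog, database), "Database has not been created properly"`.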




# Write Data to Iceberg Format
**This section will take a few minutes because we are transferring the source data to your S3 bucket.**

In [4]:
table = 'reviews'
statement = f"DROP TABLE IF EXISTS {catalog}.{database}.{table}"
spark.sql(statement)

DataFrame[]


In [5]:
# Read the public reviews dataset and keep only the categories used in this demo.
df_parquet = spark.read.parquet("s3://amazon-reviews-pds/parquet/")
df_parquet = df_parquet[df_parquet.product_category.isin(['Home', 'Furniture', 'Sports', 'Tools', 'Kitchen', 'Jewelry', 'Shoes',])]
# The "old" categories are written first; "Shoes" is appended later to demonstrate time travel.
df_parquet_old = df_parquet[df_parquet.product_category.isin(['Home', 'Furniture', 'Sports', 'Tools', 'Kitchen', 'Jewelry',])]
df_parquet_new = df_parquet[df_parquet.product_category.isin(['Shoes'])]




In [6]:
df_parquet_old.writeTo(f"{catalog}.{database}.{table}").tableProperty("format-version", "2").createOrReplace()




# Verify The Table Has Been Created Successfully

In [7]:
statement = f"SHOW TABLES in {catalog}.{database}"
df_tables = spark.sql(statement).toPandas()
df_tables

         namespace tableName
0  iceberg_reviews   reviews


# Run A Sample Query

In [8]:
query = f"""
SELECT product_category, product_title, star_rating from {catalog}.{database}.{table}
WHERE product_category = "Home"
"""
df = spark.sql(query)
df.show(10)

+----------------+--------------------+-----------+
|product_category|       product_title|star_rating|
+----------------+--------------------+-----------+
|            Home|Skinned Right Arm...|          5|
|            Home|Leegoal DIY 12 PC...|          2|
|            Home|Bai Pick-Me-Up Al...|          1|
|            Home|Nachtmann Dancing...|          4|
|            Home|PEWTER MY GRANDSO...|          5|
|            Home|Hospitology Heave...|          5|
|            Home|1600 Thread Count...|          5|
|            Home|The Birth of Venu...|          3|
|            Home|Hoover Vacuum Cle...|          4|
|            Home|Libman Wonder Mop...|          5|
+----------------+--------------------+-----------+
only showing top 10 rows


In [9]:
empty = df.count() == 0
if empty:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOUR DATA HAS NOT BEEN REGISTERED WITH GLUE.")
    print("LOOK IN PREVIOUS CELLS TO FIND THE ISSUE.           ")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++")
else:
    print("[OK]")

[OK]


# Iceberg Time Travel

In [10]:
import datetime as dt
(
    df_parquet_new
    .write
    .format("iceberg")
    .mode("append")
    .save(f"{catalog}.{database}.{table}")
)
append_timestamp = dt.datetime.now()
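
Each write to an Iceberg table commits a new snapshot, and time travel reads from those snapshots. As a rough sketch, the commits can be listed from the table's `snapshots` metadata table. `list_snapshots` is a hypothetical helper, and `spark` is assumed to be the session created earlier.

```python
def list_snapshots(spark, table_name: str):
    """Return (committed_at, snapshot_id, operation) tuples, oldest first."""
    rows = spark.sql(
        "SELECT committed_at, snapshot_id, operation "
        f"FROM {table_name}.snapshots ORDER BY committed_at"
    ).collect()
    # Convert the tuple-like rows into plain tuples for easy printing.
    return [(row[0], row[1], row[2]) for row in rows]
```

Calling `list_snapshots(spark, f"{catalog}.{database}.{table}")` after the append should show at least two entries. A `snapshot_id` from this list can also be passed to the reader with `.option("snapshot-id", ...)` as an alternative to a timestamp.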




In [11]:
iceberg_table = (
    spark
    .read
    .format("iceberg")
    .load(f"{catalog}.{database}.{table}")
)
iceberg_table.select('product_category').distinct().orderBy('product_category', ascending=False).show(10)

+----------------+
|product_category|
+----------------+
|           Tools|
|          Sports|
|           Shoes|
|         Kitchen|
|         Jewelry|
|            Home|
|       Furniture|
+----------------+


In [12]:
iceberg_table = (
    spark
    .read
    .option("as-of-timestamp", str(int((append_timestamp.timestamp() - 5) * 1000)))
    .format("iceberg")
    .load(f"{catalog}.{database}.{table}")
)
iceberg_table.select('product_category').distinct().orderBy('product_category', ascending=False).show(10)

+----------------+
|product_category|
+----------------+
|           Tools|
|          Sports|
|         Kitchen|
|         Jewelry|
|            Home|
|       Furniture|
+----------------+


Notice that "Shoes" is missing from the time travel query result above, because the query reads the table as it was before the append.
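
The `as-of-timestamp` option expects milliseconds since the Unix epoch, which is why the cell above multiplies by 1000. It also steps back 5 seconds so that the timestamp falls before the append was committed. The sketch below isolates that conversion. `as_of_millis` is a hypothetical helper, and the 5-second back-off only works if the append committed within that window.

```python
import datetime as dt


def as_of_millis(moment: dt.datetime, back_off_seconds: float = 5.0) -> str:
    """Epoch milliseconds, as a string, for a point `back_off_seconds` before `moment`."""
    return str(int((moment.timestamp() - back_off_seconds) * 1000))


# 10 seconds after the epoch, minus the 5-second back-off, is 5000 ms.
example = dt.datetime(1970, 1, 1, 0, 0, 10, tzinfo=dt.timezone.utc)
print(as_of_millis(example))  # prints 5000
```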

# Compact Data Lake Files

In [13]:
files = spark.sql(f"SELECT * FROM {catalog}.{database}.{table}.files")
total_files_original = files.count()




In [14]:
import time
start = time.time()
example_query = spark.sql(f"SELECT count(*) FROM {catalog}.{database}.{table}")
example_query.show()
end = time.time()
time_for_query_original = end - start

+--------+
|count(1)|
+--------+
|24659504|
+--------+


In [15]:
%%sql
CALL glue_catalog.system.rewrite_data_files(
  table => 'iceberg_reviews.reviews', 
  strategy => 'binpack'
)

+--------------------------+----------------------+
|rewritten_data_files_count|added_data_files_count|
+--------------------------+----------------------+
|                        52|                     8|
+--------------------------+----------------------+
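
The `CALL` above uses the default settings of the `binpack` strategy. As a sketch of how the same statement could be built in Python and extended with the procedure's `options` map, the helper below formats the SQL string. `rewrite_data_files_sql` is a hypothetical name, and the 512 MiB target size is an arbitrary illustration, not a recommendation.

```python
def rewrite_data_files_sql(catalog, table_ident, strategy="binpack", options=None):
    """Format a CALL statement for Iceberg's rewrite_data_files procedure."""
    args = [f"table => '{table_ident}'", f"strategy => '{strategy}'"]
    if options:
        # The procedure takes options as a SQL map of string keys and values.
        flat = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
        args.append(f"options => map({flat})")
    joined = ", ".join(args)
    return f"CALL {catalog}.system.rewrite_data_files({joined})"


statement = rewrite_data_files_sql(
    "glue_catalog",
    "iceberg_reviews.reviews",
    options={"target-file-size-bytes": str(512 * 1024 * 1024)},  # illustrative value
)
print(statement)
```

The resulting string can be passed to `spark.sql(statement)` in place of the `%%sql` cell.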


In [16]:
files = spark.sql(f"SELECT * FROM {catalog}.{database}.{table}.files")
total_files_compact = files.count()




In [17]:
time.sleep(5)
start = time.time()
example_query = spark.sql(f"SELECT count(*) FROM {catalog}.{database}.{table}")
example_query.show()
end = time.time()
time_for_query_compact = end - start

+--------+
|count(1)|
+--------+
|24659504|
+--------+
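
A single timed run can be skewed by caching, session warm-up, or cluster noise. A more careful comparison might repeat the query and take the median. The sketch below is one way to do that. `median_runtime` is a hypothetical helper that takes any zero-argument callable.

```python
import statistics
import time


def median_runtime(action, repeats: int = 5) -> float:
    """Run `action` `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)
```

In this notebook the callable could be `lambda: spark.sql(f"SELECT count(*) FROM {catalog}.{database}.{table}").collect()`.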


In [18]:
print(f'Before file compaction, there were {total_files_original} files in our table')
print(f'After file compaction, there were {total_files_compact} files in our table')
print()
print(f'Before file compaction, our analysis query took {time_for_query_original:.2f} seconds')
print(f'After file compaction, our analysis query took {time_for_query_compact:.2f} seconds')

Before file compaction, there were 52 files in our table
After file compaction, there were 8 files in our table

Before file compaction, our analysis query took 1.82 seconds
After file compaction, our analysis query took 1.62 seconds
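
To put the figures above in relative terms, the short sketch below computes the percentage reduction in file count and in query time. `percent_reduction` is a hypothetical helper, the inputs are the values printed above, and the timing comes from a single run each, so the time figure is only indicative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage of `before`."""
    return (before - after) / before * 100


print(f"{percent_reduction(52, 8):.1f}% fewer files")             # prints 84.6% fewer files
print(f"{percent_reduction(1.82, 1.62):.1f}% less query time")   # prints 11.0% less query time
```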
