# Apache Iceberg example using EMR Serverless on EMR Studio

#### Topics covered in this example
<ol>
    <li> Configure a Spark session </li>
    <li> Create an Apache Iceberg table </li>
    <li> Query the table </li>
</ol>

***

## Prerequisites
<div class="alert alert-block alert-info">
<b>NOTE :</b> To run this notebook as is, make sure the following prerequisites are met.</div>

* Choose EMR Serverless as the Compute. The Application version must be 6.14 or higher.
* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.
* You must have a database in AWS Glue named "default".
* This notebook uses the `PySpark` kernel.
***

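## 1. Configure your Spark session
Configure the Spark session: enable the Apache Iceberg Spark SQL extensions and register an Iceberg catalog named `glue_catalog` that is backed by AWS Glue and stores data in Amazon S3.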
<div class="alert alert-block alert-info">
    <b>NOTE :</b> You will need to update <b>my_bucket</b> in the warehouse setting of the configuration below to your own bucket. Please make sure you have read and write permissions for this bucket.</div>

In [None]:
%%configure -f
{
    "conf": {
        "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://my_bucket/aws_workshop",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"
    }
}
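
The bucket name is the only value in this configuration that changes between accounts. As a minimal sketch, the same payload can be generated from a single variable in plain Python, with no Spark dependency. The helper name `iceberg_session_conf` and its parameters are hypothetical and not part of EMR or Iceberg; the configuration keys and values are copied from the cell above.

```python
import json


def iceberg_session_conf(bucket, prefix="aws_workshop", catalog="glue_catalog"):
    """Build the %%configure payload (hypothetical helper, not an EMR API)."""
    key = f"spark.sql.catalog.{catalog}"
    return {
        "conf": {
            # Enables Iceberg-specific SQL syntax in Spark.
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            # Registers the catalog and backs it with AWS Glue and Amazon S3.
            key: "org.apache.iceberg.spark.SparkCatalog",
            f"{key}.warehouse": f"s3://{bucket}/{prefix}",
            f"{key}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            f"{key}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            "spark.jars": "/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar",
        }
    }


# Print the JSON so it can be pasted under %%configure -f.
print(json.dumps(iceberg_session_conf("my_bucket"), indent=4))
```

The printed JSON still has to be pasted into the `%%configure -f` cell by hand, because the magic reads its body as literal JSON.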

---
## 2. Create an Apache Iceberg Table
We will create a Spark DataFrame with sample data, create an Iceberg table in the Glue catalog, and append the DataFrame to that table.

<div class="alert alert-block alert-info">
    <b>NOTE :</b> You will need to update <b>my_bucket</b> in the Spark SQL statement below to your own bucket. Please make sure you have read and write permissions for this bucket.</div>

In [None]:
data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

# Create the Iceberg table at the Amazon S3 location if it does not exist yet.
spark.sql("""
CREATE TABLE IF NOT EXISTS glue_catalog.default.iceberg_table (
    id string,
    creation_date string,
    last_update_time string)
USING iceberg
LOCATION 's3://my_bucket/aws_workshop/iceberg_table'
""")

# Append the sample DataFrame to the Iceberg table.

data.writeTo("glue_catalog.default.iceberg_table").append()
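
The table location must point at the same bucket as the warehouse configured in step 1. One way to keep the two in sync is to build the `CREATE TABLE` statement from a single bucket variable. The sketch below assumes a hypothetical helper, `create_table_sql`, and the same `aws_workshop` prefix used above; it only builds the string and does not touch Spark.

```python
def create_table_sql(bucket, prefix="aws_workshop",
                     table="glue_catalog.default.iceberg_table"):
    """Return the CREATE TABLE statement for the sample table (hypothetical helper)."""
    # Use the last part of the identifier as the folder name under the prefix.
    table_name = table.split(".")[-1]
    location = f"s3://{bucket}/{prefix}/{table_name}"
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        "    id string,\n"
        "    creation_date string,\n"
        "    last_update_time string)\n"
        "USING iceberg\n"
        f"LOCATION '{location}'"
    )


print(create_table_sql("my_bucket"))
# In the notebook, the statement would be run with:
#     spark.sql(create_table_sql("my_bucket"))
```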

---
## 3. Query the table
We will query the table with a Spark SQL statement, using the `%%sql` cell magic.

In [None]:
%%sql

SELECT * FROM glue_catalog.default.iceberg_table LIMIT 10


We will also read the table into a Spark DataFrame using `spark.read`.

In [None]:
df = spark.read.format("iceberg").load("glue_catalog.default.iceberg_table")
df.show()
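
Iceberg also exposes metadata tables next to the data, so the append performed in step 2 can be checked by querying the table's `snapshots` metadata table. The sketch below is an optional extra, not part of the original notebook flow; the helper names `snapshot_history_sql` and `show_snapshot_history` are hypothetical, and the code assumes the table created above already exists.

```python
def snapshot_history_sql(table="glue_catalog.default.iceberg_table"):
    """Build a query against the Iceberg `snapshots` metadata table (hypothetical helper)."""
    return (
        f"SELECT committed_at, snapshot_id, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def show_snapshot_history(spark, table="glue_catalog.default.iceberg_table"):
    """Run the query on an existing Spark session and print one row per snapshot."""
    spark.sql(snapshot_history_sql(table)).show(truncate=False)


# In the notebook, where `spark` is already defined by the PySpark kernel:
#     show_snapshot_history(spark)
```

Each successful `append()` should add one row with the operation `append`, so running the write cell more than once produces several snapshots.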

### You have made it to the end of this notebook!!