# Apache Hudi example using EMR Serverless on EMR Studio

#### Topics covered in this example
<ol>
    <li> Configure a Spark session </li>
    <li> Create an Apache Hudi table </li>
    <li> Query the table </li>
</ol>

***

## Prerequisites
<div class="alert alert-block alert-info">
<b>NOTE :</b> To run this notebook as is, complete the following prerequisites.</div>

* Choose EMR Serverless as the Compute. The Application version must be 6.14 or higher.
* Make sure the Studio user role has permission to attach the Workspace to the Application and to pass the runtime role to it.
* This notebook uses the `PySpark` kernel.
***

## 1. Configure your Spark session
Configure the Spark session for Apache Hudi: add the Hudi Spark bundle JAR, use the Kryo serializer, and use the AWS Glue Data Catalog as the Hive metastore. Then set the options for the Hudi table.

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
}

<div class="alert alert-block alert-info">
    <b>NOTE :</b> Update <b>my_bucket</b> in the <code>basePath</code> value below to your own bucket. Make sure you have read and write permissions for this bucket.</div>

In [None]:
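tableName = "hudi_table"
# Replace my_bucket with a bucket you can read from and write to.
basePath = "s3://my_bucket/aws_workshop/hudi_data_location/" + tableName

hudi_options = {
  'hoodie.table.name': tableName,
  # Column that uniquely identifies each record.
  'hoodie.datasource.write.recordkey.field': 'id',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'insert',
  # Column whose larger value wins when two records share the same key.
  'hoodie.datasource.write.precombine.field': 'creation_date'
}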
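
The option dictionary above can also be produced by a small helper, which may be convenient when the same settings are reused for several tables. The sketch below is an illustration only: `build_hudi_options` and `KNOWN_OPERATIONS` are hypothetical names, not part of Hudi, and the allow-list is a deliberately partial subset of the write operations Hudi supports.

```python
# Hypothetical helper (not part of Hudi): builds the same option dictionary
# as the cell above and rejects operations outside a small, partial allow-list.
KNOWN_OPERATIONS = {"insert", "upsert", "bulk_insert", "delete"}  # assumed subset


def build_hudi_options(table_name, record_key, precombine_field, operation="insert"):
    if operation not in KNOWN_OPERATIONS:
        raise ValueError(f"Unsupported write operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Reproduces the options used in this notebook.
example_options = build_hudi_options("hudi_table", "id", "creation_date")
```

Validating the operation name up front surfaces a typo before any data is written, rather than when the Spark job runs.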

---
## 2. Create an Apache Hudi table
Create a Spark DataFrame with sample data and write it to a Hudi table.

In [None]:
data = spark.createDataFrame([
 ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
 ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
 ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
 ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z")
],["id", "creation_date", "last_update_time"])

In [None]:
# "overwrite" replaces any existing table data at basePath.
(data.write.format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save(basePath))
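
Once the table exists, later writes typically use `append` mode so that existing data is kept. As a hedged sketch that is not part of the original notebook, the hypothetical helper `upsert_to_hudi` below reuses the table options but switches the write operation to `upsert`, so rows whose record key already exists are updated instead of inserted again. It assumes `df` is a Spark DataFrame with the same columns as `data`.

```python
def upsert_to_hudi(df, base_path, options):
    """Hypothetical helper: upsert `df` into the Hudi table at `base_path`."""
    # Copy the options so the caller's dictionary is left unchanged.
    upsert_options = {**options, "hoodie.datasource.write.operation": "upsert"}
    (df.write.format("hudi")
        .options(**upsert_options)
        .mode("append")  # keep the existing table instead of replacing it
        .save(base_path))
    return upsert_options
```

In this notebook it could be called as `upsert_to_hudi(updates, basePath, hudi_options)`, where `updates` is a hypothetical DataFrame holding changed rows.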


---
## 3. Query the table
Read the table into a Spark DataFrame using `spark.read`.

In [None]:
df = spark.read.format("hudi").load(basePath)
df.show()
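
The DataFrame returned by the read contains Hudi's metadata columns, whose names start with `_hoodie_` (for example `_hoodie_commit_time` and `_hoodie_record_key`), alongside `id`, `creation_date`, and `last_update_time`. A minimal sketch for keeping only the data columns follows. `user_columns` and `HUDI_META_PREFIX` are hypothetical names, and the prefix check is an assumption based on Hudi's naming convention for these columns.

```python
HUDI_META_PREFIX = "_hoodie_"  # assumed prefix shared by Hudi metadata columns


def user_columns(columns):
    """Hypothetical helper: drop Hudi metadata columns, preserving order."""
    return [c for c in columns if not c.startswith(HUDI_META_PREFIX)]


# Column list as it might look after reading the table written above.
sample_columns = [
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name",
    "id", "creation_date", "last_update_time",
]
print(user_columns(sample_columns))
```

In the notebook, the same helper could be applied as `df.select(*user_columns(df.columns)).show()`.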

### You have made it to the end of this notebook!