## Reference Material

* [Hudi Documentation Quick Start Guide][1]
* [EMR Hudi Documentation][2]
* [EMR Hudi Documentation - Work with a Hudi dataset][3]

[1]:https://hudi.apache.org/docs/quick-start-guide/#setup
[2]:https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html
[3]:https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html

## Configuration

Before running the code in the cells below, SSH into your EMR cluster and run the following commands:

```hdfs dfs -mkdir -p /apps/hudi/lib```

```hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar```

```hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar /apps/hudi/lib/spark-avro.jar```

These commands copy the Hudi jar files from the local file system of the notebook cluster's master node to HDFS, where the ```spark.jars``` setting below picks them up.

In [1]:
%%configure
{
    "conf": {
            "spark.jars":"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar",
            "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
            "spark.sql.hive.convertMetastoreParquet":"false"
    }
}

In [2]:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.spark.sql.types
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.config.HoodieWriteConfig._

import org.apache.hudi.hive.MultiPartKeysValueExtractor

import java.sql.Timestamp

Starting Spark application

ID|YARN Application ID|Kind|State|Spark UI|Driver log|Current session?
:---|:---|:---|:---|:---|:---|:---|
7|application_1638389236079_0008|spark|idle|Link|Link|✔

SparkSession available as 'spark'.



In [3]:
val inputDF = Seq(
    ("1", "Chris", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00")),
    ("2", "Will", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00")),
    ("3", "Emma", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00")),
    ("4", "John", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00")),
    ("5", "Eric", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00")),
    ("6", "Adam", "2020-01-01", Timestamp.valueOf("2020-01-01 00:00:00"))
).toDF(
    "id",
    "name",
    "create_date",
    "last_update_time"
)

// inputDF.show()
// inputDF.printSchema()


inputDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]


## Write to S3 via Hudi

We create a ```hudiOptions``` map and pass it to the writer whenever we write data to S3.

DataSourceWriteOptions for ***Hudi***: 

Option|Description
:---|:---|
TABLE_NAME|The table name under which to register the dataset
TABLE_TYPE_OPT_KEY|Optional. Specifies whether the dataset is created as ```COPY_ON_WRITE``` or ```MERGE_ON_READ```. The default is ```COPY_ON_WRITE```
RECORDKEY_FIELD_OPT_KEY|The record key field whose value will be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, for example, a.b.c
PARTITIONPATH_FIELD_OPT_KEY|The partition path field whose value will be used as the partitionPath component of HoodieKey. The actual value will be obtained by invoking .toString() on the field value
PRECOMBINE_FIELD_OPT_KEY|The field used in pre-combining before actual write. When two records have the same key value, Hudi picks the one with the largest value for the precombine field as determined by Object.compareTo(..)

DataSourceWriteOptions for ***Hive***:

Option|Description
:---|:---|
HIVE_DATABASE_OPT_KEY|The Hive database to sync to. The default is ```default```
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY|The class used to extract partition field values into Hive partition columns
HIVE_PARTITION_FIELDS_OPT_KEY|The field in the dataset to use for determining Hive partition columns
HIVE_SYNC_ENABLED_OPT_KEY|When set to ```true```, registers the dataset with the Apache Hive metastore
HIVE_TABLE_OPT_KEY|Required. The name of the table in Hive to sync to, for example ```my_hudi_table```
HIVE_USER_OPT_KEY|Optional. The Hive user name to use when syncing. For example, ```hadoop```
HIVE_PASS_OPT_KEY|Optional. The Hive password for the user specified by HIVE_USER_OPT_KEY
HIVE_URL_OPT_KEY|The Hive metastore URL

For a full list of configurations, see the [Hudi Documentation - Configurations][4].

*Ensure that ```DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY``` and ```DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY``` use different columns.*

[4]:https://hudi.apache.org/docs/configurations/#Index-Configs

In [4]:
// Create hudiOptions variable
val hudiOptions = Map[String,String](
  HoodieWriteConfig.TABLE_NAME -> "copy_on_write_scala",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE", 
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "last_update_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "copy_on_write_scala",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "creation_date",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)


hudiOptions: scala.collection.immutable.Map[String,String] = Map(hoodie.datasource.write.precombine.field -> last_update_time, hoodie.datasource.hive_sync.partition_fields -> creation_date, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.MultiPartKeysValueExtractor, hoodie.datasource.hive_sync.table -> copy_on_write_scala, hoodie.datasource.hive_sync.enable -> true, hoodie.datasource.write.recordkey.field -> id, hoodie.table.name -> copy_on_write_scala, hoodie.datasource.write.table.type -> COPY_ON_WRITE, hoodie.datasource.write.partitionpath.field -> creation_date)
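
The field options in the map above refer to DataFrame columns by name, so a misspelled name is easy to miss. The following is a minimal sketch of a pre-write check, not part of Hudi: the object ```HudiOptionCheck``` and its ```problems``` method are hypothetical helpers, and the option keys are the string keys printed in the output above.

```scala
object HudiOptionCheck {
  // Option keys as printed in the hudiOptions output of this notebook.
  val RecordKeyField = "hoodie.datasource.write.recordkey.field"
  val PartitionPathField = "hoodie.datasource.write.partitionpath.field"
  val PrecombineField = "hoodie.datasource.write.precombine.field"

  /** Returns human-readable problems; an empty result means the options look consistent. */
  def problems(options: Map[String, String], columns: Seq[String]): Seq[String] = {
    val fieldKeys = Seq(RecordKeyField, PartitionPathField, PrecombineField)

    // Every configured field should name an existing DataFrame column.
    val missing = for {
      key <- fieldKeys
      column <- options.get(key).toSeq
      if !columns.contains(column)
    } yield s"$key -> $column is not a column of the DataFrame"

    // The partition path and precombine fields should use different columns.
    val sameColumn =
      if (options.get(PartitionPathField).isDefined &&
          options.get(PartitionPathField) == options.get(PrecombineField))
        Seq("partition path and precombine fields use the same column")
      else Seq.empty[String]

    missing ++ sameColumn
  }
}
```

In the notebook it could be called as ```HudiOptionCheck.problems(hudiOptions, inputDF.columns.toSeq)```. With the map above, whose partition path field is ```creation_date```, and the ```inputDF``` columns (```id```, ```name```, ```create_date```, ```last_update_time```), the check returns one problem. The ```default``` value in the ```_hoodie_partition_path``` column of the snapshot query output further down is consistent with that mismatch.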


*Note:* adjust the S3 path in ```.save()``` to point to your own bucket.

Values for ```DataSourceWriteOptions.OPERATION_OPT_KEY```:

Option|Description
:---|:---|
UPSERT_OPERATION_OPT_VAL|This is the default operation where the input records are first tagged as inserts or updates by looking up the index. The records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. This operation is recommended for use-cases like database change capture where the input almost certainly contains updates. The target table will never show duplicates
INSERT_OPERATION_OPT_VAL|This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the table can tolerate duplicates, but just needs the transactional writes/incremental pull/storage management capabilities of Hudi
BULK_INSERT_OPERATION_OPT_VAL|Both upsert and insert operations keep input records in memory to speed up storage heuristics computations (among other things) and thus can be cumbersome for the initial loading/bootstrapping of a Hudi table. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do
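
The guidance in the table can be condensed into a small decision helper. This is only a sketch of the trade-offs listed above: ```WriteOperation.choose``` is a hypothetical function, its two flags are simplifications, and the returned strings are assumed to be the values behind the ```*_OPERATION_OPT_VAL``` constants.

```scala
object WriteOperation {
  /**
   * Picks a value for the write operation option.
   *
   * @param largeInitialLoad true when bootstrapping a table with a very large first load
   * @param inputHasUpdates  true when the input may contain keys that already exist in the table
   */
  def choose(largeInitialLoad: Boolean, inputHasUpdates: Boolean): String =
    if (largeInitialLoad) "bulk_insert" // sort-based writing, best-effort file sizing
    else if (inputHasUpdates) "upsert"  // index lookup, target table shows no duplicates
    else "insert"                       // skips the index lookup, duplicates are possible
}
```

The result could be passed as the value of ```DataSourceWriteOptions.OPERATION_OPT_KEY``` in place of one of the constants.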

In [5]:
// Write the DataFrame as a Hudi dataset
inputDF.
    write.
    format("org.apache.hudi").
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL).
    options(hudiOptions).
    mode(SaveMode.Overwrite).
    save("s3://hudi-sharkech/copy_on_wrte_scala/")


## Read the Hudi Table

Hudi performs snapshot queries by default. A snapshot query returns the latest committed state of the table.

In [6]:
val snapshotQueryDF = spark.read.format("org.apache.hudi").load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

// snapshotQueryDF.orderBy("id").show()
snapshotQueryDF.select("id", "_hoodie_record_key", "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_partition_path", "_hoodie_file_name").orderBy("id").show()

// snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()


snapshotQueryDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+------------------+-------------------+--------------------+----------------------+--------------------+
| id|_hoodie_record_key|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_partition_path|   _hoodie_file_name|
+---+------------------+-------------------+--------------------+----------------------+--------------------+
|  1|                 1|     20211202020217|  20211202020217_0_4|               default|e4ddbdb0-abfb-435...|
|  2|                 2|     20211202020217|  20211202020217_0_1|               default|e4ddbdb0-abfb-435...|
|  3|                 3|     20211202020217|  20211202020217_0_5|               default|e4ddbdb0-abfb-435...|
|  4|                 4|     20211202020217|  20211202020217_0_2|               default|e4ddbdb0-abfb-435...|
|  5|                 5|     20211202020217|  20211202020217_0_6|               default|e4ddbdb0-abfb-

## Upsert data

Let's do an upsert. This will be *upsert #1*.
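
Before running the upsert, it can help to see the precombine rule from the options table in isolation. The sketch below is a plain-Scala model, not Hudi code: ```PrecombineModel```, ```Person```, ```precombine``` and ```upsert``` are hypothetical names. Within one batch, the row with the largest ```last_update_time``` wins for each ```id```. Merging into the existing table is modelled as a simple overwrite by key, which is an assumption about the default behaviour.

```scala
object PrecombineModel {
  // One row of the example dataset; timestamps are kept as sortable strings.
  final case class Person(id: String, name: String, createDate: String, lastUpdateTime: String)

  /** Within a batch, keep one row per record key: the one with the largest precombine value. */
  def precombine(batch: Seq[Person]): Seq[Person] =
    batch.groupBy(_.id).values.map(_.maxBy(_.lastUpdateTime)).toSeq

  /** Apply a batch to the table: new keys are inserted, existing keys are overwritten. */
  def upsert(table: Map[String, Person], batch: Seq[Person]): Map[String, Person] =
    table ++ precombine(batch).map(p => p.id -> p)
}
```

Applying the model to a table that holds ```Chris``` under key ```1``` and a batch with ```Chris Sharkey``` (key ```1```) and ```Kelly``` (key ```7```) gives a two-row table in which key ```1``` carries the updated name.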

In [7]:
val updateDF = Seq(
    ("1", "Chris Sharkey", "2020-01-01", Timestamp.valueOf("2020-01-02 00:00:00")),
    ("7", "Kelly", "2020-01-02", Timestamp.valueOf("2020-01-02 00:00:00"))
).toDF(
    "id",
    "name",
    "create_date",
    "last_update_time"
)

// updateDF.show()
// updateDF.printSchema()


updateDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]


In [8]:
// Upsert the records in updateDF
updateDF.
    write.
    format("org.apache.hudi").
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
    options(hudiOptions).
    mode(SaveMode.Append).
    save("s3://hudi-sharkech/copy_on_wrte_scala/")


In [9]:
// Check that the upsert worked
val snapshotQueryDF = spark.read.format("org.apache.hudi").load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

// snapshotQueryDF.orderBy("id").show()
// snapshotQueryDF.select("id", "_hoodie_record_key", "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_partition_path", "_hoodie_file_name").orderBy("id").show()

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()


snapshotQueryDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-------------+-----------+-------------------+
| id|         name|create_date|   last_update_time|
+---+-------------+-----------+-------------------+
|  1|Chris Sharkey| 2020-01-01|2020-01-02 00:00:00|
|  2|         Will| 2020-01-01|2020-01-01 00:00:00|
|  3|         Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|         John| 2020-01-01|2020-01-01 00:00:00|
|  5|         Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|         Adam| 2020-01-01|2020-01-01 00:00:00|
|  7|        Kelly| 2020-01-02|2020-01-02 00:00:00|
+---+-------------+-----------+-------------------+



Let's do another upsert. This will be *upsert #2*.

In [10]:
val updateDF = Seq(
    ("1", "Christopher Sharkey", "2020-01-01", Timestamp.valueOf("2020-01-03 00:00:00")),
    ("8", "Ella", "2020-01-03", Timestamp.valueOf("2020-01-03 00:00:00"))
).toDF(
    "id",
    "name",
    "create_date",
    "last_update_time"
)

// updateDF.show()
// updateDF.printSchema()


updateDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]


In [11]:
// Upsert the records in updateDF
updateDF.
    write.
    format("org.apache.hudi").
    option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL).
    options(hudiOptions).
    mode(SaveMode.Append).
    save("s3://hudi-sharkech/copy_on_wrte_scala/")


In [12]:
// Check that the upsert worked
val snapshotQueryDF = spark.read.format("org.apache.hudi").load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

// snapshotQueryDF.orderBy("id").show()
// snapshotQueryDF.select("id", "_hoodie_record_key", "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_partition_path", "_hoodie_file_name").orderBy("id").show()

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()


snapshotQueryDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-------------------+-----------+-------------------+
| id|               name|create_date|   last_update_time|
+---+-------------------+-----------+-------------------+
|  1|Christopher Sharkey| 2020-01-01|2020-01-03 00:00:00|
|  2|               Will| 2020-01-01|2020-01-01 00:00:00|
|  3|               Emma| 2020-01-01|2020-01-01 00:00:00|
|  4|               John| 2020-01-01|2020-01-01 00:00:00|
|  5|               Eric| 2020-01-01|2020-01-01 00:00:00|
|  6|               Adam| 2020-01-01|2020-01-01 00:00:00|
|  7|              Kelly| 2020-01-02|2020-01-02 00:00:00|
|  8|               Ella| 2020-01-03|2020-01-03 00:00:00|
+---+-------------------+-----------+-------------------+



## Incremental query

So far we have performed 3 actions on our Hudi table:

1. Initial write to the Hudi table ```copy_on_write_scala```
2. *Upsert 1* - Changed **Chris** to **Chris Sharkey** & added a new record for **Kelly**
3. *Upsert 2* - Changed **Chris Sharkey** to **Christopher Sharkey** & added a new record for **Ella**

We performed each of the 3 actions separately, that is, in 3 Hudi commits.

Let's look at the distinct commit times.

In [13]:
// View the commit times
val snapshotQueryDF = spark.read.format("org.apache.hudi").load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

// Deduplicate first, then sort, so the ordering applies to the final result
snapshotQueryDF.select("_hoodie_commit_time").distinct().orderBy("_hoodie_commit_time").show()


snapshotQueryDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+-------------------+
|_hoodie_commit_time|
+-------------------+
|     20211202020217|
|     20211202020316|
|     20211202020406|
+-------------------+



Notice that we have 3 unique commit times, corresponding to the 3 commits we have performed so far.

Hudi provides a query type ```incremental``` which can identify all of the records that have changed between given commit times.

The example below identifies all of the records that have changed since our first commit.

Note the options we set in the variable ```incremental_read_options```:
* ```QUERY_TYPE_OPT_KEY``` is set to ```QUERY_TYPE_INCREMENTAL_OPT_VAL```
* ```BEGIN_INSTANTTIME_OPT_KEY``` is set to the time of our first commit
* Since we do not specify an ```END_INSTANTTIME_OPT_KEY```, this query will return all of the records that have changed since the ```BEGIN_INSTANTTIME_OPT_KEY```

*Note:* adjust the ```BEGIN_INSTANTTIME_OPT_KEY``` based on the results of the last query. Set ```BEGIN_INSTANTTIME_OPT_KEY``` to the time of the first Hudi commit.
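
The selection rule can be pictured with a small plain-Scala model. It is a sketch that reproduces the results shown in this notebook, not Hudi's implementation: ```IncrementalModel```, ```Version``` and ```incremental``` are hypothetical names. Judging by the outputs here, the begin time is exclusive, the end time is inclusive, and each key is returned at most once with its latest version inside the window.

```scala
object IncrementalModel {
  // One written version of a record, tagged with the commit that wrote it.
  final case class Version(id: String, name: String, commitTime: String)

  /**
   * Records changed after `begin` (exclusive) and up to `end` (inclusive, optional).
   * Commit times are yyyyMMddHHmmss strings, so string comparison orders them correctly.
   */
  def incremental(history: Seq[Version], begin: String, end: Option[String] = None): Seq[Version] =
    history
      .filter(v => end.forall(v.commitTime <= _)) // ignore commits after the end time
      .groupBy(_.id).values.map(_.maxBy(_.commitTime)) // latest visible version per key
      .filter(_.commitTime > begin) // keep only keys changed after the begin time
      .toSeq
      .sortBy(v => (v.commitTime, v.id))
}
```

Fed with the three commits of this notebook and the first commit time as ```begin```, the model returns the records for ids 7, 1 and 8, with id 1 carrying the name from the latest commit.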

In [14]:
val incremental_read_options = Map[String,String](
    DataSourceReadOptions.QUERY_TYPE_OPT_KEY -> QUERY_TYPE_INCREMENTAL_OPT_VAL,
    DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY -> "20211202020217"
)

val tripsIncrementalDF = spark.
    read.format("hudi").
    options(incremental_read_options).
    load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

tripsIncrementalDF.select("id", "name", "create_date", "last_update_time", "_hoodie_commit_time").orderBy("_hoodie_commit_time").show()


incremental_read_options: scala.collection.immutable.Map[String,String] = Map(hoodie.datasource.query.type -> incremental, hoodie.datasource.read.begin.instanttime -> 20211202020217)
tripsIncrementalDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-------------------+-----------+-------------------+-------------------+
| id|               name|create_date|   last_update_time|_hoodie_commit_time|
+---+-------------------+-----------+-------------------+-------------------+
|  7|              Kelly| 2020-01-02|2020-01-02 00:00:00|     20211202020316|
|  1|Christopher Sharkey| 2020-01-01|2020-01-03 00:00:00|     20211202020406|
|  8|               Ella| 2020-01-03|2020-01-03 00:00:00|     20211202020406|
+---+-------------------+-----------+-------------------+-------------------+



## Point in Time query

Building on our incremental query, we can specify both a ```BEGIN_INSTANTTIME_OPT_KEY``` and an ```END_INSTANTTIME_OPT_KEY``` to restrict the query to a specific commit range, also known as a point in time query.

This shows the changes at specific points in time.

For the first point in time query we can set the ```BEGIN_INSTANTTIME_OPT_KEY``` to a time before our first commit and ```END_INSTANTTIME_OPT_KEY``` to the time of our first commit.

*Note:* adjust the ```END_INSTANTTIME_OPT_KEY``` to the time of our first commit.

In [16]:
val incremental_read_options = Map[String,String](
    DataSourceReadOptions.QUERY_TYPE_OPT_KEY -> QUERY_TYPE_INCREMENTAL_OPT_VAL,
    DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY -> "20211105180300",
    DataSourceReadOptions.END_INSTANTTIME_OPT_KEY -> "20211202020217"
)

val tripsIncrementalDF = spark.
    read.format("hudi").
    options(incremental_read_options).
    load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

tripsIncrementalDF.select("id", "name", "create_date", "last_update_time", "_hoodie_commit_time").orderBy("id").show()


incremental_read_options: scala.collection.immutable.Map[String,String] = Map(hoodie.datasource.query.type -> incremental, hoodie.datasource.read.begin.instanttime -> 20211105180300, hoodie.datasource.read.end.instanttime -> 20211202020217)
tripsIncrementalDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-----+-----------+-------------------+-------------------+
| id| name|create_date|   last_update_time|_hoodie_commit_time|
+---+-----+-----------+-------------------+-------------------+
|  1|Chris| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
|  2| Will| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
|  3| Emma| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
|  4| John| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
|  5| Eric| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
|  6| Adam| 2020-01-01|2020-01-01 00:00:00|     20211202020217|
+---+-----+-----------+-------------------+-------------------+

For the second point in time query we can set the ```BEGIN_INSTANTTIME_OPT_KEY``` equal to the time of our first commit and ```END_INSTANTTIME_OPT_KEY``` equal to the time of our second commit.

*Note:* adjust the ```BEGIN_INSTANTTIME_OPT_KEY``` to the time of our first commit and adjust the ```END_INSTANTTIME_OPT_KEY``` to the time of our second commit.
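
Rather than copying commit times by hand, the begin/end pairs for consecutive point in time queries can be derived from the list of distinct commit times. This is a minimal sketch with hypothetical names (```CommitWindows```, ```windows```). It assumes that ```"000"``` sorts before every real commit time and can therefore stand in for "a time before our first commit".

```scala
object CommitWindows {
  // Assumption: "000" sorts before any yyyyMMddHHmmss commit time.
  val BeforeFirstCommit = "000"

  /** One (begin, end) pair per commit, each window covering exactly that commit. */
  def windows(commitTimes: Seq[String]): Seq[(String, String)] = {
    val sorted = commitTimes.distinct.sorted
    (BeforeFirstCommit +: sorted).zip(sorted)
  }
}
```

Each pair could then be used as the values of ```BEGIN_INSTANTTIME_OPT_KEY``` and ```END_INSTANTTIME_OPT_KEY```. For the three commits in this notebook, the second pair matches the begin and end times used in the cell below.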

In [17]:
val incremental_read_options = Map[String,String](
    DataSourceReadOptions.QUERY_TYPE_OPT_KEY -> QUERY_TYPE_INCREMENTAL_OPT_VAL,
    DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY -> "20211202020217",
    DataSourceReadOptions.END_INSTANTTIME_OPT_KEY -> "20211202020316"
)

val tripsIncrementalDF = spark.
    read.format("hudi").
    options(incremental_read_options).
    load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

tripsIncrementalDF.select("id", "name", "create_date", "last_update_time", "_hoodie_commit_time").orderBy("_hoodie_commit_time").show()


incremental_read_options: scala.collection.immutable.Map[String,String] = Map(hoodie.datasource.query.type -> incremental, hoodie.datasource.read.begin.instanttime -> 20211202020217, hoodie.datasource.read.end.instanttime -> 20211202020316)
tripsIncrementalDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-------------+-----------+-------------------+-------------------+
| id|         name|create_date|   last_update_time|_hoodie_commit_time|
+---+-------------+-----------+-------------------+-------------------+
|  1|Chris Sharkey| 2020-01-01|2020-01-02 00:00:00|     20211202020316|
|  7|        Kelly| 2020-01-02|2020-01-02 00:00:00|     20211202020316|
+---+-------------+-----------+-------------------+-------------------+



## Delete data

In [18]:
// Create a new data frame
val deleteDF = Seq(
    ("1", "Christopher Sharkey", "2020-01-01", Timestamp.valueOf("2020-01-03 00:00:00")),
    ("8", "Ella", "2020-01-03", Timestamp.valueOf("2020-01-03 00:00:00"))
).toDF(
    "id",
    "name",
    "create_date",
    "last_update_time"
)

// deleteDF.show()
// deleteDF.printSchema()


deleteDF: org.apache.spark.sql.DataFrame = [id: string, name: string ... 2 more fields]


In [19]:
deleteDF.
    write.
    format("org.apache.hudi").
    option(OPERATION_OPT_KEY,"delete").
    option(PRECOMBINE_FIELD_OPT_KEY, "last_update_time").
    option(RECORDKEY_FIELD_OPT_KEY, "id").
    option(PARTITIONPATH_FIELD_OPT_KEY, "creation_date").
    option(TABLE_NAME, "copy_on_write_scala").
    mode(SaveMode.Append).
    save("s3://hudi-sharkech/copy_on_wrte_scala/")


In [20]:
// Check that the delete worked
val snapshotQueryDF = spark.read.format("org.apache.hudi").load("s3://hudi-sharkech/copy_on_wrte_scala" + "/*/*")

// snapshotQueryDF.show()

snapshotQueryDF.select("id", "name", "create_date", "last_update_time").orderBy("id").show()


snapshotQueryDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 7 more fields]
+---+-----+-----------+-------------------+
| id| name|create_date|   last_update_time|
+---+-----+-----------+-------------------+
|  2| Will| 2020-01-01|2020-01-01 00:00:00|
|  3| Emma| 2020-01-01|2020-01-01 00:00:00|
|  4| John| 2020-01-01|2020-01-01 00:00:00|
|  5| Eric| 2020-01-01|2020-01-01 00:00:00|
|  6| Adam| 2020-01-01|2020-01-01 00:00:00|
|  7|Kelly| 2020-01-02|2020-01-02 00:00:00|
+---+-----+-----------+-------------------+

