# Apache Hudi Core Conceptions (5) - COW: Clustering

## 1. Configuration

In [1]:
%%sh
# deploy hudi bundle jar
hdfs dfs -copyFromLocal -f /usr/lib/hudi/hudi-spark-bundle.jar /tmp/hudi-spark-bundle.jar
hdfs dfs -ls /tmp/hudi-spark-bundle.jar

-rw-r--r--   1 emr-notebook hdfsadmingroup   61421977 2023-03-17 08:52 /tmp/hudi-spark-bundle.jar


In [2]:
%%configure -f
{
    "conf" : {
        "spark.jars":"hdfs:///tmp/hudi-spark-bundle.jar",            
        "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
        "spark.sql.extensions":"org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":"org.apache.spark.sql.hudi.catalog.HoodieCatalog"
    }
}

In [3]:
%env S3_BUCKET=apache-hudi-core-conceptions

env: S3_BUCKET=apache-hudi-core-conceptions


In [4]:
%%sql
set S3_BUCKET=apache-hudi-core-conceptions

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
93,application_1678096020253_0174,spark,idle,Link,Link,,✔


SparkSession available as 'spark'.

In [5]:
%env WORKSPACE=/home/emr-notebook/apache-hudi-core-conceptions

env: WORKSPACE=/home/emr-notebook/apache-hudi-core-conceptions


In [6]:
%%sh
# make workspace
mkdir -p $WORKSPACE
# deploy hudi-stat.sh, a utility shell script to output hudi table status
wget https://raw.githubusercontent.com/bluishglc/apache-hudi-core-conceptions/master/hudi-stat.sh -O $WORKSPACE/hudi-stat.sh &>/dev/null
chmod a+x $WORKSPACE/hudi-stat.sh
ls $WORKSPACE/hudi-stat.sh

/home/emr-notebook/apache-hudi-core-conceptions/hudi-stat.sh


In [7]:
%%html
<style>
table {float:left}
</style>

## 2. Test Case 1 - Sync Clustering ( Inline Schedule + Inline Execute )

### 2.1. Test Plan

Step No.|Action|Volume / Partition |Storage
:--------:|:------|:------|:----------
1|Insert|96MB|+ 1 Small FileGroup
2|Insert|213MB|+ 1 Max File + 1 Small File
3|Insert|182MB|+ 1 Max File + 1 Small File

### 2.2. Key Settings

KEY|DEFAULT VALUE|SET VALUE
:---|:---|:---
hoodie.clustering.inline|false|true
hoodie.clustering.schedule.inline|false|false
hoodie.clustering.async.enabled|false|false
hoodie.clustering.inline.max.commits|4|3
hoodie.clustering.plan.strategy.target.file.max.bytes|1073741824 / 1GB|314572800 / 300MB
hoodie.clustering.plan.strategy.small.file.limit|314572800 / 300MB|209715200 / 200MB
hoodie.clustering.plan.strategy.sort.columns|-|review_date
hoodie.parquet.small.file.limit|104857600 / 100MB | 0
hoodie.copyonwrite.record.size.estimate|1024|175
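
The byte values in the table above are plain MiB multiples, and together with the small-file limit they give a rough way to predict how many files clustering will produce per partition. The sketch below is a simplified model, not Hudi's actual planner: it assumes every base file smaller than `hoodie.clustering.plan.strategy.small.file.limit` is a candidate, and that the candidates are rewritten into files of at most `hoodie.clustering.plan.strategy.target.file.max.bytes`. The function name and the per-file sizes in the usage are hypothetical.

```python
import math

MB = 1024 * 1024

# Values taken from the key settings table.
SMALL_FILE_LIMIT = 200 * MB        # 209715200
TARGET_FILE_MAX_BYTES = 300 * MB   # 314572800


def estimate_clustered_files(file_sizes_mb,
                             small_file_limit=SMALL_FILE_LIMIT,
                             target_file_max_bytes=TARGET_FILE_MAX_BYTES):
    """Rough estimate of output files for one partition (simplified model)."""
    # Only files below the small-file limit are treated as clustering candidates.
    candidates = [size * MB for size in file_sizes_mb if size * MB < small_file_limit]
    total = sum(candidates)
    if total == 0:
        return 0
    # Candidates are assumed to be packed into files no larger than the target size.
    return math.ceil(total / target_file_max_bytes)


# Hypothetical per-file sizes adding up to the 491MB per partition of the test plan.
print(estimate_clustered_files([96, 120, 93, 120, 62]))
```

Under these assumptions, 491MB of small files with a 300MB target rounds up to two output files per partition.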

### 2.3. Set Variables

In [8]:
%%sql
set TABLE_NAME=reviews_cow_clustering_1


In [9]:
%env TABLE_NAME=reviews_cow_clustering_1

env: TABLE_NAME=reviews_cow_clustering_1


### 2.4. Create Table

In [10]:
%%sh
aws s3 rm s3://${S3_BUCKET}/${TABLE_NAME} --recursive &>/dev/null
rm -rf ${WORKSPACE}/${TABLE_NAME}
sleep 5

In [11]:
%%sql
drop table if exists ${TABLE_NAME}


In [12]:
%%sql
create table if not exists ${TABLE_NAME}(
    review_id string, 
    star_rating int, 
    review_body string, 
    review_date date, 
    year long,
    timestamp long,
    parity int
)
using hudi
location 's3://${S3_BUCKET}/${TABLE_NAME}'
partitioned by (parity)
options ( 
    type = 'cow',  
    primaryKey = 'review_id', 
    preCombineField = 'timestamp',
    hoodie.clustering.inline = 'true',
    hoodie.clustering.schedule.inline = 'false',
    hoodie.clustering.async.enabled = 'false',
    hoodie.clustering.inline.max.commits = '3',
    hoodie.clustering.plan.strategy.small.file.limit = '209715200',
    hoodie.clustering.plan.strategy.target.file.max.bytes = '314572800',
    hoodie.clustering.plan.strategy.sort.columns = 'review_date',
    hoodie.parquet.small.file.limit = '0',
    hoodie.copyonwrite.record.size.estimate = '175'
);


### 2.5. Insert 96MB / + 1 Small File

In [13]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2003;
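
The `parity` partition column in the statement above is derived from `mod(crc32(review_id), 2)`, which splits the records into two partitions of roughly equal size. A minimal Python sketch of the same derivation follows, assuming Spark's `crc32` is the standard CRC-32 computed over the UTF-8 bytes of the string (the same checksum `zlib.crc32` computes). The sample review ID is made up for illustration.

```python
import zlib


def parity(review_id: str) -> int:
    """Return 0 or 1, mirroring mod(crc32(review_id), 2) in the SQL above."""
    checksum = zlib.crc32(review_id.encode("utf-8"))  # unsigned 32-bit CRC
    return checksum % 2


# Hypothetical review ID, only to show the call.
print(parity("R1EXAMPLEID0001"))
```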


In [14]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085304196 │ commit │ COMPLETED │ 03-17 08:53 │ 03-17 08:53 │ 03-17 08:53 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_1
├── [ 96M 08:54:09]  parity=0
│   └── [ 96M 08:53:48]  3283c2a3-af0a-43cd-8db5-d0410806083f-0_0-34-628_20230317085304196.parquet
└── [ 96M 08:54:09]  parity=1
    └── [ 96M 08:53:48]  cc5b25d3-7d8f-4098-a785-32c6946e52da-0_1-34-629_20230317085304196.parquet
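
The instant in the timeline and the base file names in the storage tree share the same timestamp. The sketch below decodes both, assuming the instant format is `yyyyMMddHHmmssSSS` and that base files are named `<fileId>_<writeToken>_<instantTime>.parquet`. Both helper names are hypothetical.

```python
from datetime import datetime


def parse_instant(instant: str) -> datetime:
    """Parse an instant such as 20230317085304196 (assumed yyyyMMddHHmmssSSS)."""
    # %f accepts 1 to 6 digits, so the 3 trailing digits are read as milliseconds.
    return datetime.strptime(instant, "%Y%m%d%H%M%S%f")


def parse_base_file_name(name: str) -> dict:
    """Split a base file name into file ID, write token, and instant."""
    stem = name.rsplit(".", 1)[0]                  # drop the .parquet extension
    file_id, write_token, instant = stem.split("_")
    return {"file_id": file_id, "write_token": write_token, "instant": instant}


parts = parse_base_file_name(
    "3283c2a3-af0a-43cd-8db5-d0410806083f-0_0-34-628_20230317085304196.parquet"
)
print(parts["file_id"], parse_instant(parts["instant"]))
```

Matching the instant embedded in a file name against the timeline shows which commit wrote that file.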


### 2.6. Insert 213MB / + 1 Max File + 1 Small File

In [15]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year in (2004, 2005);


In [16]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085304196 │ commit │ COMPLETED │ 03-17 08:53 │ 03-17 08:53 │ 03-17 08:53 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317085410617 │ commit │ COMPLETED │ 03-17 08:54 │ 03-17 08:54 │ 03-17 08:55 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_1
├── [309M 08:55:25]  parity=0
│   ├── [ 96M 08:53:48]  3283c2a3-af0a-43cd-8db5-d0410806083f-0_0-34-628_20230317085304196.parquet
│   ├── [120M 08:55:05]  6f6cfb7a-8978-4d

### 2.7. Insert 182MB / + 1 Max File + 1 Small File

In [17]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2007;


In [18]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085304196 │ commit        │ COMPLETED │ 03-17 08:53 │ 03-17 08:53 │ 03-17 08:53 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317085410617 │ commit        │ COMPLETED │ 03-17 08:54 │ 03-17 08:54 │ 03-17 08:55 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317085526534 │ commit        │ COMPLETED │ 03-17 08:55 │ 03-17 08:55 │ 03-17 08:56 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

## 3. Test Case 2 - Async Clustering ( Offline Schedule -> Offline Execute )

### 3.1. Test Plan

Step No.|Action|Volume / Partition |Storage
:--------:|:------|:------|:----------
1|Insert|96MB|+ 1 Small FileGroup
2|Insert|213MB|+ 1 Max File + 1 Small File
3|Insert|182MB|+ 1 Max File + 1 Small File
4|Offline Schedule|491MB|N/A
5|Offline Execute|491MB|+ 2 Clustered Files

### 3.2. Key Settings

KEY|DEFAULT VALUE|SET VALUE
:---|:---|:---
hoodie.clustering.inline|false|false
hoodie.clustering.schedule.inline|false|false
hoodie.clustering.async.enabled|false|true
hoodie.clustering.async.max.commits|4|3
hoodie.clustering.plan.strategy.target.file.max.bytes|1073741824 / 1GB|314572800 / 300MB
hoodie.clustering.plan.strategy.small.file.limit|314572800 / 300MB|209715200 / 200MB
hoodie.clustering.plan.strategy.sort.columns|-|review_date
hoodie.parquet.small.file.limit|104857600 / 100MB | 0
hoodie.copyonwrite.record.size.estimate|1024|175

### 3.3. Set Variables

In [19]:
%%sql
set TABLE_NAME=reviews_cow_clustering_2


In [20]:
%env TABLE_NAME=reviews_cow_clustering_2

env: TABLE_NAME=reviews_cow_clustering_2


### 3.4. Create Table

In [21]:
%%sh
aws s3 rm s3://${S3_BUCKET}/${TABLE_NAME} --recursive &>/dev/null
rm -rf ${WORKSPACE}/${TABLE_NAME}
sleep 5

In [22]:
%%sql
drop table if exists ${TABLE_NAME}


In [23]:
%%sql
-- For async mode:
-- 1. Some clustering properties work as table properties, e.g. hoodie.clustering.async.enabled and hoodie.clustering.async.max.commits.
-- 2. All clustering async & plan properties are overridden by the Spark job-level configuration passed via --hoodie-conf or --props.
-- To keep things simple, do NOT set any clustering-related properties on the table; always pass them via --hoodie-conf or --props.
-- The clustering-related settings below are therefore commented out and kept only as a reference.
create table if not exists ${TABLE_NAME}(
    review_id string, 
    star_rating int, 
    review_body string, 
    review_date date, 
    year long,
    timestamp long,
    parity int
)
using hudi
location 's3://${S3_BUCKET}/${TABLE_NAME}'
partitioned by (parity)
options ( 
    type = 'cow',  
    primaryKey = 'review_id', 
    preCombineField = 'timestamp',
    -- hoodie.clustering.inline = 'false',
    -- hoodie.clustering.schedule.inline = 'false',
    -- hoodie.clustering.async.enabled = 'true',
    -- hoodie.clustering.async.max.commits = '3',
    -- hoodie.clustering.plan.strategy.small.file.limit = '209715200',
    -- hoodie.clustering.plan.strategy.target.file.max.bytes = '314572800',
    -- hoodie.clustering.plan.strategy.sort.columns = 'review_date',
    hoodie.parquet.small.file.limit = '0',
    hoodie.copyonwrite.record.size.estimate = '175'
);


### 3.5. Insert 96MB / + 1 Small File

In [24]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2003;


In [25]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085930947 │ commit │ COMPLETED │ 03-17 08:59 │ 03-17 08:59 │ 03-17 09:00 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_2
├── [ 96M 09:00:23]  parity=0
│   └── [ 96M 09:00:04]  23929c20-1018-433c-b2bd-6f59848ed246-0_0-217-2973_20230317085930947.parquet
└── [ 96M 09:00:23]  parity=1
    └── [ 96M 09:00:04]  8862197d-b470-4652-b971-91fe1a9458f5-0_1-217-2974_20230317085930947.parquet


### 3.6. Insert 213MB / + 1 Max File + 1 Small File

In [26]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year in (2004, 2005);


In [27]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085930947 │ commit │ COMPLETED │ 03-17 08:59 │ 03-17 08:59 │ 03-17 09:00 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090024178 │ commit │ COMPLETED │ 03-17 09:00 │ 03-17 09:00 │ 03-17 09:01 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_2
├── [309M 09:01:32]  parity=0
│   ├── [ 96M 09:00:04]  23929c20-1018-433c-b2bd-6f59848ed246-0_0-217-2973_20230317085930947.parquet
│   ├── [ 93M 09:01:09]  5f305998-4b02-

### 3.7. Insert 182MB / + 1 Max File + 1 Small File

In [28]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2007;


In [29]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085930947 │ commit │ COMPLETED │ 03-17 08:59 │ 03-17 08:59 │ 03-17 09:00 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090024178 │ commit │ COMPLETED │ 03-17 09:00 │ 03-17 09:00 │ 03-17 09:01 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317090133814 │ commit │ COMPLETED │ 03-17 09:01 │ 03-17 09:02 │ 03-17 09:02 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_c

### 3.8. Offline Schedule 491MB

In [30]:
%%sh
# the current user (emr-notebook) needs sudo permission to run this job as hadoop
sudo -u hadoop spark-submit \
  --jars '/usr/lib/hudi/hudi-spark-bundle.jar' \
  --class 'org.apache.hudi.utilities.HoodieClusteringJob' \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --spark-memory '4g' \
  --mode 'schedule' \
  --base-path "s3://${S3_BUCKET}/${TABLE_NAME}" \
  --table-name "${TABLE_NAME}" \
  --hoodie-conf "hoodie.clustering.async.enabled=true" \
  --hoodie-conf "hoodie.clustering.async.max.commits=3" \
  --hoodie-conf "hoodie.clustering.plan.strategy.small.file.limit=209715200" \
  --hoodie-conf "hoodie.clustering.plan.strategy.target.file.max.bytes=314572800" \
  --hoodie-conf "hoodie.clustering.plan.strategy.sort.columns=review_date" > ${WORKSPACE}/${TABLE_NAME}.schedule.out 2>&1
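
The job arguments in the command above follow a regular pattern: a mode, the table location and name, then one `--hoodie-conf key=value` pair per setting. A small sketch that assembles the same argument list from a dictionary may help when several tables need the same job. It uses only flags that appear in this notebook; the function name and the bucket and table names in the usage are hypothetical.

```python
def build_clustering_job_args(mode, base_path, table_name,
                              hoodie_conf=None, spark_memory="4g"):
    """Build the HoodieClusteringJob arguments used in this notebook."""
    # Only the three modes exercised in this notebook are accepted here.
    if mode not in ("schedule", "execute", "scheduleAndExecute"):
        raise ValueError(f"unsupported mode: {mode}")
    args = [
        "--spark-memory", spark_memory,
        "--mode", mode,
        "--base-path", base_path,
        "--table-name", table_name,
    ]
    # Each setting becomes its own --hoodie-conf key=value pair.
    for key, value in (hoodie_conf or {}).items():
        args += ["--hoodie-conf", f"{key}={value}"]
    return args


# Hypothetical bucket and table names, for illustration only.
args = build_clustering_job_args(
    "schedule",
    "s3://my-bucket/my_table",
    "my_table",
    {"hoodie.clustering.async.enabled": "true",
     "hoodie.clustering.async.max.commits": "3"},
)
print(" ".join(args))
```

The resulting list can be appended to the `spark-submit` prefix (`--jars`, `--class`, and the utilities bundle jar) shown in the cell.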

In [31]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085930947 │ commit        │ COMPLETED │ 03-17 08:59 │ 03-17 08:59 │ 03-17 09:00 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090024178 │ commit        │ COMPLETED │ 03-17 09:00 │ 03-17 09:00 │ 03-17 09:01 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317090133814 │ commit        │ COMPLETED │ 03-17 09:01 │ 03-17 09:02 │ 03-17 09:02 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

### 3.9. Offline Execute 491MB / + 2 Clustered Files

In [32]:
%%sh
# no need to pass the --hoodie-conf settings again for execute; they are only used for schedule
# in execute mode, this job executes any requested replacecommits found in the timeline
# the current user (emr-notebook) needs sudo permission to run this job as hadoop
sudo -u hadoop spark-submit \
  --jars '/usr/lib/hudi/hudi-spark-bundle.jar' \
  --class 'org.apache.hudi.utilities.HoodieClusteringJob' \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --spark-memory '4g' \
  --mode 'execute' \
  --base-path "s3://${S3_BUCKET}/${TABLE_NAME}" \
  --table-name "${TABLE_NAME}" > ${WORKSPACE}/${TABLE_NAME}.execute.out 2>&1
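
Execute mode acts on requested replacecommits in the timeline, so it can be useful to check for them before submitting the job. The sketch below takes a listing of the table's `.hoodie` directory and returns the instants that have a `.replacecommit.requested` file but no completed `.replacecommit` file. That naming is an assumption about the timeline layout of the Hudi release used here, and the function name and sample file names are hypothetical.

```python
def pending_clustering_instants(timeline_files):
    """Return instants with a requested but not yet completed replacecommit."""
    requested, completed = set(), set()
    for name in timeline_files:
        instant = name.split(".")[0]
        if name.endswith(".replacecommit.requested"):
            requested.add(instant)
        elif name.endswith(".replacecommit"):
            completed.add(instant)
    # A plan is still pending when no completed replacecommit exists for it.
    return sorted(requested - completed)


# Hypothetical directory listing, for illustration only.
listing = [
    "20230101000000000.commit",
    "20230101000100000.replacecommit.requested",
    "20230101000200000.replacecommit.requested",
    "20230101000200000.replacecommit.inflight",
    "20230101000200000.replacecommit",
]
print(pending_clustering_instants(listing))
```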

In [33]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317085930947 │ commit        │ COMPLETED │ 03-17 08:59 │ 03-17 08:59 │ 03-17 09:00 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090024178 │ commit        │ COMPLETED │ 03-17 09:00 │ 03-17 09:00 │ 03-17 09:01 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317090133814 │ commit        │ COMPLETED │ 03-17 09:01 │ 03-17 09:02 │ 03-17 09:02 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

## 4. Test Case 3 - Async Clustering ( Offline Schedule + Offline Execute )

### 4.1. Test Plan

Step No.|Action|Volume / Partition |Storage
:--------:|:------|:------|:----------
1|Insert|96MB|+ 1 Small FileGroup
2|Insert|213MB|+ 1 Max File + 1 Small File
3|Insert|182MB|+ 1 Max File + 1 Small File
4|Offline Schedule + Execute|491MB|+ 2 Clustered Files

### 4.2. Key Settings

KEY|DEFAULT VALUE|SET VALUE
:---|:---|:---
hoodie.clustering.inline|false|false
hoodie.clustering.schedule.inline|false|false
hoodie.clustering.async.enabled|false|true
hoodie.clustering.async.max.commits|4|3
hoodie.clustering.plan.strategy.target.file.max.bytes|1073741824 / 1GB|314572800 / 300MB
hoodie.clustering.plan.strategy.small.file.limit|314572800 / 300MB|209715200 / 200MB
hoodie.clustering.plan.strategy.sort.columns|-|review_date
hoodie.parquet.small.file.limit|104857600 / 100MB | 0
hoodie.copyonwrite.record.size.estimate|1024|175

### 4.3. Set Variables

In [34]:
%%sql
set TABLE_NAME=reviews_cow_clustering_3


In [35]:
%env TABLE_NAME=reviews_cow_clustering_3

env: TABLE_NAME=reviews_cow_clustering_3


### 4.4. Create Table

In [36]:
%%sh
aws s3 rm s3://${S3_BUCKET}/${TABLE_NAME} --recursive &>/dev/null
rm -rf ${WORKSPACE}/${TABLE_NAME}
sleep 5

In [37]:
%%sql
drop table if exists ${TABLE_NAME}


In [38]:
%%sql
-- For async mode:
-- 1. Some clustering properties work as table properties, e.g. hoodie.clustering.async.enabled and hoodie.clustering.async.max.commits.
-- 2. All clustering async & plan properties are overridden by the Spark job-level configuration passed via --hoodie-conf or --props.
-- To keep things simple, do NOT set any clustering-related properties on the table; always pass them via --hoodie-conf or --props.
-- The clustering-related settings below are therefore commented out and kept only as a reference.
create table if not exists ${TABLE_NAME}(
    review_id string, 
    star_rating int, 
    review_body string, 
    review_date date, 
    year long,
    timestamp long,
    parity int
)
using hudi
location 's3://${S3_BUCKET}/${TABLE_NAME}'
partitioned by (parity)
options ( 
    type = 'cow',  
    primaryKey = 'review_id', 
    preCombineField = 'timestamp',
    -- hoodie.clustering.inline = 'false',
    -- hoodie.clustering.schedule.inline = 'false',
    -- hoodie.clustering.async.enabled = 'true',
    -- hoodie.clustering.async.max.commits = '3',
    -- hoodie.clustering.plan.strategy.small.file.limit = '209715200',
    -- hoodie.clustering.plan.strategy.target.file.max.bytes = '314572800',
    -- hoodie.clustering.plan.strategy.sort.columns = 'review_date',
    hoodie.parquet.small.file.limit = '0',
    hoodie.copyonwrite.record.size.estimate = '175'
);


### 4.5. Insert 96MB / + 1 Small File

In [39]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2003;


In [40]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317090651018 │ commit │ COMPLETED │ 03-17 09:06 │ 03-17 09:07 │ 03-17 09:07 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_3
├── [ 96M 09:07:57]  parity=0
│   └── [ 96M 09:07:37]  be11e5a7-ab09-48e2-af11-04cdc0677d70-0_0-378-4564_20230317090651018.parquet
└── [ 96M 09:07:57]  parity=1
    └── [ 96M 09:07:39]  fe1beb81-eb50-441f-aaf1-d84c9750584b-0_1-378-4565_20230317090651018.parquet


### 4.6. Insert 213MB / + 1 Max File + 1 Small File

In [41]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year in (2004, 2005);


In [42]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317090651018 │ commit │ COMPLETED │ 03-17 09:06 │ 03-17 09:07 │ 03-17 09:07 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090759205 │ commit │ COMPLETED │ 03-17 09:08 │ 03-17 09:08 │ 03-17 09:09 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_3
├── [309M 09:09:22]  parity=0
│   ├── [120M 09:09:01]  1c457f8c-5012-40c6-8de8-82c097a5aa40-0_0-425-4833_20230317090759205.parquet
│   ├── [ 93M 09:08:54]  588614c5-5eae-

### 4.7. Insert 182MB / + 1 Max File + 1 Small File

In [43]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2007;


In [44]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │        │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317090651018 │ commit │ COMPLETED │ 03-17 09:06 │ 03-17 09:07 │ 03-17 09:07 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090759205 │ commit │ COMPLETED │ 03-17 09:08 │ 03-17 09:08 │ 03-17 09:09 ║
╟─────┼───────────────────┼────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317090923456 │ commit │ COMPLETED │ 03-17 09:09 │ 03-17 09:09 │ 03-17 09:10 ║
╚═════╧═══════════════════╧════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_c

### 4.8. Offline Schedule + Execute 491MB / + 2 Clustered Files

In [45]:
%%sh
# the current user (emr-notebook) needs sudo permission to run this job as hadoop
sudo -u hadoop spark-submit \
  --jars '/usr/lib/hudi/hudi-spark-bundle.jar' \
  --class 'org.apache.hudi.utilities.HoodieClusteringJob' \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --spark-memory '4g' \
  --mode 'scheduleAndExecute' \
  --base-path "s3://${S3_BUCKET}/${TABLE_NAME}" \
  --table-name "${TABLE_NAME}" \
  --hoodie-conf "hoodie.clustering.async.enabled=true" \
  --hoodie-conf "hoodie.clustering.async.max.commits=3" \
  --hoodie-conf "hoodie.clustering.plan.strategy.small.file.limit=209715200" \
  --hoodie-conf "hoodie.clustering.plan.strategy.target.file.max.bytes=314572800" \
  --hoodie-conf "hoodie.clustering.plan.strategy.sort.columns=review_date" > ${WORKSPACE}/${TABLE_NAME}.schedule.out 2>&1

In [46]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317090651018 │ commit        │ COMPLETED │ 03-17 09:06 │ 03-17 09:07 │ 03-17 09:07 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317090759205 │ commit        │ COMPLETED │ 03-17 09:08 │ 03-17 09:08 │ 03-17 09:09 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317090923456 │ commit        │ COMPLETED │ 03-17 09:09 │ 03-17 09:09 │ 03-17 09:10 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

## 5. Test Case 4 - Semi Async Clustering ( Inline Schedule + Offline Execute )

NOTE:

This mode has a problem: neither `hoodie.clustering.inline.max.commits` nor `hoodie.clustering.async.max.commits` takes effect. This behavior differs from compaction.

### 5.1. Test Plan

Step No.|Action|Volume / Partition |Storage
:--------:|:------|:------|:----------
1|Insert|96MB|+ 1 Small FileGroup
2|Insert|213MB|+ 1 Max File + 1 Small File
3|Insert|182MB|+ 1 Max File + 1 Small File
4|Offline Execute|96MB|+ 1 Clustered File
5|Insert|14.6MB|+ 1 Small File
6|Offline Execute|503MB|+ 2 Clustered Files
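
The clustered-file counts in steps 4 and 6 are consistent with a simple estimate: the total input volume divided by `hoodie.clustering.plan.strategy.target.file.max.bytes` (300MB in this test), rounded up. The sketch below is back-of-the-envelope arithmetic only, not Hudi's actual planning logic; `estimate_clustered_files` is a hypothetical helper.

```shell
# Hypothetical helper: ceil(total_mb / target_max_mb) using integer arithmetic.
estimate_clustered_files() {
  local total_mb="$1" target_max_mb="$2"
  echo $(( (total_mb + target_max_mb - 1) / target_max_mb ))
}

estimate_clustered_files 96 300    # step 4: prints 1
estimate_clustered_files 503 300   # step 6: prints 2
```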

### 5.2. Key Settings

KEY|DEFAULT VALUE|SET VALUE
:---|:---|:---
hoodie.clustering.inline|false|false
hoodie.clustering.schedule.inline|false|true
hoodie.clustering.async.enabled|false|false
hoodie.clustering.plan.strategy.target.file.max.bytes|1073741824 / 1GB|314572800 / 300MB
hoodie.clustering.plan.strategy.small.file.limit|314572800 / 300MB|209715200 / 200MB
hoodie.clustering.plan.strategy.sort.columns|-|review_date
hoodie.parquet.small.file.limit|104857600 / 100MB|0
hoodie.copyonwrite.record.size.estimate|1024|175
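
The byte values in the table are binary megabytes (1MB = 1024 * 1024 bytes). When adapting the thresholds to another test, they can be derived as sketched below; `mb_to_bytes` is a hypothetical helper, not part of Hudi.

```shell
# Hypothetical helper: convert binary megabytes to bytes.
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

mb_to_bytes 300   # hoodie.clustering.plan.strategy.target.file.max.bytes: prints 314572800
mb_to_bytes 200   # hoodie.clustering.plan.strategy.small.file.limit: prints 209715200
```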

### 5.3. Set Variables

In [47]:
%%sql
set TABLE_NAME=reviews_cow_clustering_4

In [48]:
%env TABLE_NAME=reviews_cow_clustering_4

env: TABLE_NAME=reviews_cow_clustering_4


### 5.4. Create Table

In [49]:
%%sh
aws s3 rm s3://${S3_BUCKET}/${TABLE_NAME} --recursive &>/dev/null
rm -rf ${WORKSPACE}/${TABLE_NAME}
sleep 5

In [50]:
%%sql
drop table if exists ${TABLE_NAME}

In [51]:
%%sql
create table if not exists ${TABLE_NAME}(
    review_id string, 
    star_rating int, 
    review_body string, 
    review_date date, 
    year long,
    timestamp long,
    parity int
)
using hudi
location 's3://${S3_BUCKET}/${TABLE_NAME}'
partitioned by (parity)
options ( 
    type = 'cow',  
    primaryKey = 'review_id', 
    preCombineField = 'timestamp',
    hoodie.clustering.inline = 'false',
    hoodie.clustering.schedule.inline = 'true',
    hoodie.clustering.async.enabled = 'false',
    -- has no effect in this mode
    hoodie.clustering.inline.max.commits = '3',
    -- has no effect in this mode
    hoodie.clustering.async.max.commits = '3',
    hoodie.clustering.plan.strategy.small.file.limit = '209715200',
    hoodie.clustering.plan.strategy.target.file.max.bytes = '314572800',
    hoodie.clustering.plan.strategy.sort.columns = 'review_date',
    hoodie.parquet.small.file.limit = '0',
    hoodie.copyonwrite.record.size.estimate = '175'
);

### 5.5. Insert 96MB / + 1 Small File

In [52]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2003;

In [53]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ REQUESTED │ 03-17 09:14 │ -           │ -           ║
╚═════╧═══════════════════╧═══════════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ STORAGE ]

/home/emr-notebook/apache-hudi-core-conceptions/reviews_cow_clustering_4
├── [ 96M 09:15:10]  parity=0
│   └── [ 96M 09:14:50]  3d9a2d9d-cce5-4f58-8c7b-91dddd078939-0_0-539-5266_202303170

### 5.6. Insert 213MB / + 1 Max File + 1 Small File

In [54]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year in (2004, 2005);

In [55]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ REQUESTED │ 03-17 09:14 │ -           │ -           ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317091512213 │ commit        │ COMPLETED │ 03-17 09:15 │ 03-17 09:15 │ 03-17 09:16 ║
╚═════╧═══════════════════╧═══════════════╧═══════════╧═════════════╧═════════════╧═════════════╝

[ ST

### 5.7. Insert 182MB / + 1 Max File + 1 Small File

In [56]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 2007;

In [57]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ REQUESTED │ 03-17 09:14 │ -           │ -           ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317091512213 │ commit        │ COMPLETED │ 03-17 09:15 │ 03-17 09:15 │ 03-17 09:16 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

### 5.8. Offline Execute 96MB / + 1 Clustered File

In [58]:
%%sh
# the current user (emr-notebook) needs sudo permission to run this
sudo -u hadoop spark-submit \
  --jars '/usr/lib/hudi/hudi-spark-bundle.jar' \
  --class 'org.apache.hudi.utilities.HoodieClusteringJob' \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --spark-memory '4g' \
  --mode 'execute' \
  --base-path "s3://${S3_BUCKET}/${TABLE_NAME}" \
  --table-name "${TABLE_NAME}" > "${WORKSPACE}/${TABLE_NAME}.execute.out" 2>&1

In [59]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ COMPLETED │ 03-17 09:14 │ 03-17 09:18 │ 03-17 09:19 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317091512213 │ commit        │ COMPLETED │ 03-17 09:15 │ 03-17 09:15 │ 03-17 09:16 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

### 5.9. Insert 14.6MB / + 1 Small File

In [60]:
%%sql
insert into 
    ${TABLE_NAME}
select 
    review_id, 
    star_rating, 
    review_body, 
    review_date, 
    year,
    unix_timestamp(current_timestamp()) as timestamp,
    mod(crc32(review_id), 2) as parity
from
    reviews
where
    year = 1998;

In [61]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ COMPLETED │ 03-17 09:14 │ 03-17 09:18 │ 03-17 09:19 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317091512213 │ commit        │ COMPLETED │ 03-17 09:15 │ 03-17 09:15 │ 03-17 09:16 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  

### 5.10. Offline Execute 503MB / + 2 Clustered Files

In [62]:
%%sh
# the current user (emr-notebook) needs sudo permission to run this
sudo -u hadoop spark-submit \
  --jars '/usr/lib/hudi/hudi-spark-bundle.jar' \
  --class 'org.apache.hudi.utilities.HoodieClusteringJob' \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --spark-memory '4g' \
  --mode 'execute' \
  --base-path "s3://${S3_BUCKET}/${TABLE_NAME}" \
  --table-name "${TABLE_NAME}" > "${WORKSPACE}/${TABLE_NAME}.execute.out" 2>&1

In [63]:
%%sh
${WORKSPACE}/hudi-stat.sh s3://${S3_BUCKET}/${TABLE_NAME} timeline commits storage


[ TIMELINE ]

╔═════╤═══════════════════╤═══════════════╤═══════════╤═════════════╤═════════════╤═════════════╗
║ No. │ Instant           │ Action        │ State     │ Requested   │ Inflight    │ Completed   ║
║     │                   │               │           │ Time        │ Time        │ Time        ║
╠═════╪═══════════════════╪═══════════════╪═══════════╪═════════════╪═════════════╪═════════════╣
║ 0   │ 20230317091403489 │ commit        │ COMPLETED │ 03-17 09:14 │ 03-17 09:14 │ 03-17 09:14 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 1   │ 20230317091456798 │ replacecommit │ COMPLETED │ 03-17 09:14 │ 03-17 09:18 │ 03-17 09:19 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 2   │ 20230317091512213 │ commit        │ COMPLETED │ 03-17 09:15 │ 03-17 09:15 │ 03-17 09:16 ║
╟─────┼───────────────────┼───────────────┼───────────┼─────────────┼─────────────┼─────────────╢
║ 3  