
[MINOR] Fix default config values if not specified in MultipleSparkJobExecutionStrategy#9625

Merged
yihua merged 1 commit into apache:master from voonhous:minor_fix_default_values
Sep 21, 2023

Conversation

@voonhous (Member) commented Sep 6, 2023

Change Logs

The default values for the configs below are incorrect:

  1. hoodie.datasource.write.row.writer.enable
  2. hoodie.clustering.preserve.commit.metadata (getPreserveHoodieMetadata)

The default values are not loaded from #defaultVal because the configurations are defined in a module scope that is inaccessible from the current scope. This is why the config keys are defined as string literals here.

Raising a PR to fix these inconsistencies first. Subsequent refactoring may be required to move these config keys to a scope that is accessible by all other (relevant) modules.

Note: The existing test coverage does not cover clustering performed using the RowWriter API; only the RDD API is covered as of now.
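To illustrate the failure mode, here is a minimal, self-contained sketch (all names are hypothetical, not actual Hudi code): when a call site hardcodes a string key with a literal fallback instead of reading the canonical default, the two defaults can silently drift apart.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of default-value drift. Names are illustrative; not Hudi code.
public class DefaultDriftSketch {
    // Canonical default, defined in a config module the call site cannot see
    // (assumed to be true here, matching the fix in this PR):
    static final boolean ROW_WRITER_DEFAULT = true;

    // Stand-in for a getBooleanOrDefault-style lookup on raw properties.
    static boolean getBooleanOrDefault(Map<String, String> conf, String key, boolean fallback) {
        String v = conf.get(key);
        return v == null ? fallback : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // user did not set the key

        // Buggy call site: hardcodes `false`, diverging from the canonical default.
        boolean buggy = getBooleanOrDefault(conf, "hoodie.datasource.write.row.writer.enable", false);
        // Fixed call site: passes the canonical default instead of a literal.
        boolean fixed = getBooleanOrDefault(conf, "hoodie.datasource.write.row.writer.enable", ROW_WRITER_DEFAULT);

        System.out.println(buggy + " " + fixed); // prints "false true"
    }
}
```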

Impact

None; this change improves correctness and eases debugging through consistent defaults.

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@pratyakshsharma (Contributor)

@voonhous Please check the CI failures.

@danny0405 (Contributor)

Please check the CI failures.

@voonhous voonhous force-pushed the minor_fix_default_values branch 2 times, most recently from c8d5378 to 2136b10 Compare September 13, 2023 06:20
@voonhous (Member, Author) commented Sep 13, 2023

The affected test TestSparkConsistentBucketClustering#testClusteringColumnSort assumes the default config below:

hoodie.datasource.write.row.writer.enable=false

Since I changed the default to true to align it with the global config, the test started failing. As such, I have fixed the test by overriding the value back to false there.
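The test fix can be sketched as follows (hypothetical names, not the actual Hudi test code): the test pins the config back to false so it keeps exercising the RDD clustering path even though the default is now true.

```java
import java.util.Properties;

// Sketch of pinning a config in a test after its default changed.
// Names are illustrative; this is not the actual Hudi test code.
public class PinRowWriterDefault {
    static final String ROW_WRITER_KEY = "hoodie.datasource.write.row.writer.enable";

    // Build the test's write properties, overriding the new default (true)
    // back to false so the test still takes the RDD clustering path.
    static Properties testProps() {
        Properties props = new Properties();
        props.setProperty(ROW_WRITER_KEY, "false");
        return props;
    }

    public static void main(String[] args) {
        boolean rowWriterEnabled = Boolean.parseBoolean(
            testProps().getProperty(ROW_WRITER_KEY, "true")); // "true" = new default
        System.out.println(rowWriterEnabled); // prints "false"
    }
}
```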

Will open a separate PR to fix sorting for native row writers when performing clustering for ConsistentBucketClustering.

Caused by: java.lang.UnsupportedOperationException: org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$8$1
	at org.apache.parquet.io.api.PrimitiveConverter.addLong(PrimitiveConverter.java:105)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:325)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)


Error:  Errors: 
Error:    Can not read value at 1 in block 0 in file file:/tmp/junit2525135472431698271/dataset/2016/03/15/398f4e47-ded4-46b7-90d4-3da6e4a1485a-0_2-116-257_20230906061353657.parquet
Error:    Can not read value at 1 in block 0 in file file:/tmp/junit13763144683442925835/dataset/2016/03/15/fca915de-8b3b-42fd-b2b5-a151558f64ec-0_1-105-225_20230906061411593.parquet
[INFO] 
Error:  Tests run: 199, Failures: 0, Errors: 2, Skipped: 1

@voonhous voonhous force-pushed the minor_fix_default_values branch from 2136b10 to 0847642 Compare September 13, 2023 10:07
@voonhous (Member, Author)

Alright, added comments for future devs who are writing tests around this area.

@apache apache deleted a comment from hudi-bot Sep 14, 2023
@yihua (Contributor) commented Sep 14, 2023

Looks like Azure CI still fails. I triggered a rerun.

@yihua (Contributor) commented Sep 14, 2023

@voonhous There are still a lot of CI failures. Could you check them?

@voonhous (Member, Author)

@yihua Looked through the CI failures; they seem to be errors from invoking the RowWriter implementation when performing clustering.

[ERROR] Errors: 
[ERROR] TestHoodieBackedMetadata.testClusterOperationOnMainTable()(TestHoodieBackedMetadata)
[ERROR]   Run 1: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 2: java.util.concurrent.CancellationException
[ERROR]   Run 3: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 4: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[INFO] 
[ERROR] TestHoodieBackedMetadata.testMDTCompactionWithFailedCommits()(TestHoodieBackedMetadata)
[ERROR]   Run 1: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 2: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 3: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 4: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter

Prior to this change, the existing tests were using the RDD implementation. Due to the config mismatch, the RowWriter implementation was not really exercised by the tests invoking clustering.

Since this is a "[MINOR]" PR fix, I will add configs in the affected tests to ensure that they use the RDD implementation.

We can create another PR to increase the coverage of the clustering writers after this.

@yihua (Contributor) commented Sep 15, 2023

> Since this is a "[MINOR]" PR fix, I will add configs in the affected tests to ensure that they use the RDD implementation. We can create another PR to increase the coverage of the clustering writers after this.

Sounds good.

@voonhous voonhous force-pushed the minor_fix_default_values branch 6 times, most recently from 239190a to 51a71ae Compare September 18, 2023 09:27
@apache apache deleted a comment from hudi-bot Sep 18, 2023
@voonhous voonhous force-pushed the minor_fix_default_values branch 2 times, most recently from 49743c6 to 1655dcc Compare September 19, 2023 06:53
@voonhous voonhous force-pushed the minor_fix_default_values branch from 1655dcc to c84d093 Compare September 19, 2023 08:29
@apache apache deleted a comment from hudi-bot Sep 19, 2023
@yihua (Contributor) commented Sep 19, 2023

I triggered a rerun of CI. CI has been flaky recently.

@voonhous (Member, Author)

@hudi-bot run azure

@hudi-bot (Collaborator)

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous (Member, Author) commented Sep 20, 2023

@yihua @danny0405 @pratyakshsharma
Okay, CI has finally passed...

Me RN:

[meme image: when_your_ci_finally_passes]

@yihua (Contributor) commented Sep 21, 2023

+1. @voonhous Have you created follow-up JIRAs to fix the row writer in relevant write flows?

@yihua yihua merged commit 9259287 into apache:master Sep 21, 2023
@voonhous (Member, Author)

> +1. @voonhous Have you created follow-up JIRAs to fix the row writer in relevant write flows?

Nope, no follow-up JIRAs to fix the tests yet.

Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
    clusteringPlan.getInputGroups().stream()
        .map(inputGroup -> {
          if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {
A review comment from a Contributor on this snippet:

@yihua: this was intentionally kept as false (the default value), so the row writer is used with clustering only if the user explicitly enables it.

nsivabalan pushed a commit that referenced this pull request Nov 21, 2023
(The commit message mirrors the PR description above.)

Co-authored-by: voon <voonhou.su@shopee.com>
@voonhous voonhous deleted the minor_fix_default_values branch December 20, 2025 10:09