
[MINOR] Fix default config values if not specified in MultipleSparkJobExecutionStrategy#9625

Merged
yihua merged 1 commit into apache:master from voonhous:minor_fix_default_values
Sep 21, 2023

Conversation

@voonhous (Member) commented Sep 6, 2023

Change Logs

The default values for the configs below are incorrect:

  1. hoodie.datasource.write.row.writer.enable
  2. hoodie.clustering.preserve.commit.metadata (getPreserveHoodieMetadata)

The default values are not loaded from #defaultVal because the configurations are defined in a module scope that is inaccessible from the current scope. This is why the config keys are defined as string literals here.

Raising a PR to fix these inconsistencies first. Subsequent refactoring may be required to move these config keys to a scope that is accessible by all other (relevant) modules.

Note: The existing test coverage does not cover clustering performed using the RowWriter API; only the RDD API is covered as of now.
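To illustrate the failure mode, here is a minimal, self-contained sketch (all names are hypothetical, not actual Hudi code): when a call site hardcodes a string key with a literal fallback instead of reading the canonical default, the two defaults can silently drift apart.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of default-value drift. Names are illustrative; not Hudi code.
public class DefaultDriftSketch {
    // Canonical default, defined in a config module the call site cannot see
    // (assumed to be true here, matching the fix in this PR):
    static final boolean ROW_WRITER_DEFAULT = true;

    // Stand-in for a getBooleanOrDefault-style lookup on raw properties.
    static boolean getBooleanOrDefault(Map<String, String> conf, String key, boolean fallback) {
        String v = conf.get(key);
        return v == null ? fallback : Boolean.parseBoolean(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // user did not set the key

        // Buggy call site: hardcodes `false`, diverging from the canonical default.
        boolean buggy = getBooleanOrDefault(conf, "hoodie.datasource.write.row.writer.enable", false);
        // Fixed call site: passes the canonical default instead of a literal.
        boolean fixed = getBooleanOrDefault(conf, "hoodie.datasource.write.row.writer.enable", ROW_WRITER_DEFAULT);

        System.out.println(buggy + " " + fixed); // prints "false true"
    }
}
```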

Impact

None; this change improves correctness and eases debugging through consistent defaults.

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@pratyakshsharma (Contributor)

@voonhous Please check the CI failures.

@danny0405 (Contributor)

Please check the CI failures.

@voonhous voonhous force-pushed the minor_fix_default_values branch 2 times, most recently from c8d5378 to 2136b10 Compare September 13, 2023 06:20
@voonhous (Member, Author) commented Sep 13, 2023

The affected test TestSparkConsistentBucketClustering#testClusteringColumnSort assumes the default config below:

hoodie.datasource.write.row.writer.enable=false

Since I changed the default to true to align it with the global config, the test started failing. As such, I have fixed the test by overriding the value back to false there.
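The test fix can be sketched as follows (hypothetical names, not the actual Hudi test code): the test pins the config back to false so it keeps exercising the RDD clustering path even though the default is now true.

```java
import java.util.Properties;

// Sketch of pinning a config in a test after its default changed.
// Names are illustrative; this is not the actual Hudi test code.
public class PinRowWriterDefault {
    static final String ROW_WRITER_KEY = "hoodie.datasource.write.row.writer.enable";

    // Build the test's write properties, overriding the new default (true)
    // back to false so the test still takes the RDD clustering path.
    static Properties testProps() {
        Properties props = new Properties();
        props.setProperty(ROW_WRITER_KEY, "false");
        return props;
    }

    public static void main(String[] args) {
        boolean rowWriterEnabled = Boolean.parseBoolean(
            testProps().getProperty(ROW_WRITER_KEY, "true")); // "true" = new default
        System.out.println(rowWriterEnabled); // prints "false"
    }
}
```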

Will open a separate PR to fix sorting for native row writers when performing clustering for ConsistentBucketClustering.

Caused by: java.lang.UnsupportedOperationException: org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$8$1
	at org.apache.parquet.io.api.PrimitiveConverter.addLong(PrimitiveConverter.java:105)
	at org.apache.parquet.column.impl.ColumnReaderBase$2$4.writeValue(ColumnReaderBase.java:325)
	at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
	at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
	at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:234)


Error:  Errors: 
Error:    Can not read value at 1 in block 0 in file file:/tmp/junit2525135472431698271/dataset/2016/03/15/398f4e47-ded4-46b7-90d4-3da6e4a1485a-0_2-116-257_20230906061353657.parquet
Error:    Can not read value at 1 in block 0 in file file:/tmp/junit13763144683442925835/dataset/2016/03/15/fca915de-8b3b-42fd-b2b5-a151558f64ec-0_1-105-225_20230906061411593.parquet
[INFO] 
Error:  Tests run: 199, Failures: 0, Errors: 2, Skipped: 1

@voonhous voonhous force-pushed the minor_fix_default_values branch from 2136b10 to 0847642 Compare September 13, 2023 10:07
@voonhous (Member, Author)

Alright, added comments for future devs who are writing tests around this area.

@apache apache deleted a comment from hudi-bot Sep 14, 2023
@yihua (Contributor) commented Sep 14, 2023

Looks like Azure CI still fails. I triggered a rerun.

@yihua (Contributor) commented Sep 14, 2023

@voonhous There are still a lot of CI failures. Could you check them?

@voonhous (Member, Author)

@yihua Looked through the CI failures; they seem to be errors from invoking the RowWriter implementation when performing clustering.

[ERROR] Errors: 
[ERROR] TestHoodieBackedMetadata.testClusterOperationOnMainTable()(TestHoodieBackedMetadata)
[ERROR]   Run 1: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 2: java.util.concurrent.CancellationException
[ERROR]   Run 3: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 4: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[INFO] 
[ERROR] TestHoodieBackedMetadata.testMDTCompactionWithFailedCommits()(TestHoodieBackedMetadata)
[ERROR]   Run 1: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 2: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 3: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter
[ERROR]   Run 4: java.lang.ClassNotFoundException: org.apache.spark.sql.adapter.Spark3_2Adapter

Prior to this change, the existing tests were using the RDD implementation. Due to the config mismatch, the RowWriter implementation was not really exercised by the tests invoking clustering.

Since this is a "[MINOR]" PR fix, I will add configs in the affected tests to ensure that they use the RDD implementation.

We can create another PR to increase the coverage of the clustering writers after this.

@yihua (Contributor) commented Sep 15, 2023

> Since this is a "[MINOR]" PR fix, I will add configs in the affected tests to ensure that they use the RDD implementation. We can create another PR to increase the coverage of the clustering writers after this.

Sounds good.

@voonhous voonhous force-pushed the minor_fix_default_values branch 6 times, most recently from 239190a to 51a71ae Compare September 18, 2023 09:27
@apache apache deleted a comment from hudi-bot Sep 18, 2023
@voonhous voonhous force-pushed the minor_fix_default_values branch 2 times, most recently from 49743c6 to 1655dcc Compare September 19, 2023 06:53
@voonhous voonhous force-pushed the minor_fix_default_values branch from 1655dcc to c84d093 Compare September 19, 2023 08:29
@apache apache deleted a comment from hudi-bot Sep 19, 2023
@yihua (Contributor) commented Sep 19, 2023

I triggered a rerun of CI. CI has been flaky recently.

@voonhous (Member, Author)

@hudi-bot run azure

@hudi-bot (Collaborator)

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build

@voonhous (Member, Author) commented Sep 20, 2023

@yihua @danny0405 @pratyakshsharma
Okay, CI has finally passed...

Me RN:

[meme image: when_your_ci_finally_passes]

@yihua (Contributor) commented Sep 21, 2023

+1. @voonhous Have you created follow-up JIRAs to fix the row writer in relevant write flows?

@yihua yihua merged commit 9259287 into apache:master Sep 21, 2023
@voonhous (Member, Author)

> +1. @voonhous Have you created follow-up JIRAs to fix the row writer in relevant write flows?

Nope, no follow-up JIRAs to fix the tests yet.

Stream<HoodieData<WriteStatus>> writeStatusesStream = FutureUtils.allOf(
    clusteringPlan.getInputGroups().stream()
        .map(inputGroup -> {
          if (getWriteConfig().getBooleanOrDefault("hoodie.datasource.write.row.writer.enable", false)) {
A review comment from a Contributor on this snippet:

@yihua: this was intentionally kept as false (the default value), so the row writer is used with clustering only if the user explicitly enables it.

nsivabalan pushed a commit that referenced this pull request Nov 21, 2023
(The commit message mirrors the PR description above.)

Co-authored-by: voon <voonhou.su@shopee.com>
@voonhous voonhous deleted the minor_fix_default_values branch December 20, 2025 10:09