
[HUDI-6528] Fix premature RDD unpersist during index lookup #9188

Draft · wants to merge 1 commit into base: master

Conversation

@xushiyan (Member) commented Jul 13, 2023

Change Logs

Currently, when the bloom/simple index tags locations for input records, the incoming RDDs are supposed to be cached (by default), but rdd.unpersist() was invoked prematurely, rendering the caching ineffective. This PR fixes the behavior by marking the cached RDDs for uncaching at the SparkRDDWriteClient#releaseResources stage.
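
To illustrate the intended lifecycle, here is a minimal sketch using plain Spark APIs. The CachedRddTracker class below is hypothetical and not part of Hudi; it only mirrors the idea of deferring unpersist() to a releaseResources-style stage instead of calling it inside the index lookup.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

import java.util.ArrayList;
import java.util.List;

// Hypothetical tracker: the incoming records RDD stays cached across the
// tagging and write stages and is released in one place at the end, instead
// of being unpersisted inside the index lookup itself.
class CachedRddTracker {
  private final List<JavaRDD<?>> cached = new ArrayList<>();

  <T> JavaRDD<T> cache(JavaRDD<T> rdd, StorageLevel level) {
    rdd.persist(level);   // cache for reuse by later stages
    cached.add(rdd);      // remember it so it can be released later
    return rdd;
  }

  // Analogue of the SparkRDDWriteClient#releaseResources stage mentioned
  // above: the only place where unpersist() is called.
  void releaseAll() {
    cached.forEach(rdd -> rdd.unpersist(false));
    cached.clear();
  }
}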

Impact

Indexing performance

Risk level

Medium

  • e2e testing & verification

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@xushiyan closed this Jul 13, 2023
@xushiyan reopened this Jul 13, 2023
@apache deleted a comment from hudi-bot Jul 14, 2023
@@ -80,7 +81,7 @@ public O updateLocation(O writeStatuses, HoodieEngineContext context,
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public abstract <R> HoodieData<HoodieRecord<R>> tagLocation(
       HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
-      HoodieTable hoodieTable) throws HoodieIndexException;
+      HoodieTable hoodieTable, Option<String> instantTime) throws HoodieIndexException;
Contributor:

This is a public API. We might have to deprecate it and add a new one if we wish to change the signature.

Member Author:

The API is marked as "Evolving", so changes are expected in a major release.

Contributor:

I get it. We do the same for key generation and other public interfaces. Let's add a new method without breaking any existing users.
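
For reference, one common non-breaking evolution path for an abstract @PublicAPIMethod looks roughly like the sketch below, placed inside the HoodieIndex abstract class. This is illustrative only, not necessarily the exact change adopted in this PR: keep the old abstract method and deprecate it, and add the new variant with a default body that delegates to it.

  // Existing abstract method kept (and deprecated) so current HoodieIndex
  // implementations keep compiling.
  @Deprecated
  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
  public abstract <R> HoodieData<HoodieRecord<R>> tagLocation(
      HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
      HoodieTable hoodieTable) throws HoodieIndexException;

  // New variant carrying an instant time; the default body ignores it and
  // delegates to the old method, so only implementations that care about the
  // instant time need to override this.
  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
  public <R> HoodieData<HoodieRecord<R>> tagLocation(
      HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
      HoodieTable hoodieTable, Option<String> instantTime) throws HoodieIndexException {
    return tagLocation(records, context, hoodieTable);
  }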

-      records.persist(new HoodieConfig(config.getProps())
-          .getString(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE));
+    if (config.getBloomIndexUseCaching() && instantTime.isPresent()) {
+      String storageLevel = config.getString(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE);
Contributor:

Can we move this to the constructor and use it everywhere instead of parsing it multiple times?

Member Author:

sounds good
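
For reference, a minimal self-contained sketch of that suggestion; the class and field names here are illustrative, not the actual Hudi code. The idea is to resolve the configured storage level once, in the constructor, and reuse it on every lookup.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Illustrative helper: the configured storage level is parsed once, in the
// constructor, instead of being re-read from the config on every
// tagLocation call.
class BloomIndexCachingHelper {
  private final StorageLevel inputStorageLevel;

  BloomIndexCachingHelper(String configuredLevel) {
    // configuredLevel would come from
    // HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL_VALUE, e.g. "MEMORY_AND_DISK_SER"
    this.inputStorageLevel = StorageLevel.fromString(configuredLevel);
  }

  <T> JavaRDD<T> cacheInput(JavaRDD<T> records) {
    return records.persist(inputStorageLevel);
  }
}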

@@ -103,11 +102,6 @@ record -> new ImmutablePair<>(record.getPartitionPath(), record.getRecordKey()))
     // Step 3: Tag the incoming records, as inserts or updates, by joining with existing record keys
     HoodieData<HoodieRecord<R>> taggedRecords = tagLocationBacktoRecords(keyFilenamePairs, records, hoodieTable);

-    if (config.getBloomIndexUseCaching()) {
Contributor:

I guess this was intentional. After this, taggedRecords is what gets used, and we do cache that in BaseSparkCommitActionExecutor.execute:

  @Override
  public HoodieWriteMetadata<HoodieData<WriteStatus>> execute(HoodieData<HoodieRecord<T>> inputRecords) {
    // Cache the tagged records, so we don't end up computing both
    JavaRDD<HoodieRecord<T>> inputRDD = HoodieJavaRDD.getJavaRDD(inputRecords);
    if (inputRDD.getStorageLevel() == StorageLevel.NONE()) {
      HoodieJavaRDD.of(inputRDD).persist(config.getTaggedRecordStorageLevel(),
          context, HoodieDataCacheKey.of(config.getBasePath(), instantTime));
    } else {
      LOG.info("RDD PreppedRecords was persisted at: " + inputRDD.getStorageLevel());
    }
    ...

So, I'm not sure we want to keep the persistence until the very end for these RDDs, which may no longer be used.

Member Author:

The main purpose of this PR is to fix premature un-persisting, like this example here.
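
As a standalone illustration of that point (plain Spark, not Hudi code): if the parent RDD is unpersisted before the derived RDD is ever materialized, the cache is never used and Spark recomputes the parent from its lineage.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

public class PrematureUnpersistDemo {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "premature-unpersist-demo");

    // Analogue of the incoming records RDD that the index is asked to tag.
    JavaRDD<Integer> records = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
        .map(x -> x * 10)
        .persist(StorageLevel.MEMORY_ONLY());

    // Analogue of tagLocationBacktoRecords(...): a derived, still-lazy RDD.
    JavaRDD<Integer> tagged = records.map(x -> x + 1);

    // Premature unpersist: nothing has been computed yet, so the cache is
    // dropped before it is ever filled or reused.
    records.unpersist(true);

    // This action recomputes 'records' from scratch; the earlier persist()
    // bought nothing.
    System.out.println(tagged.collect());

    sc.stop();
  }
}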


@nsivabalan (Contributor) left a review:
Let's do a manual verification that both RDDs are persisted and are not unpersisted until releaseResources is invoked.
LGTM otherwise.
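
A quick way to do that manual check is sketched below using plain Spark APIs; the surrounding write-client steps (tagging, commit, releaseResources) are placeholders and not shown. The idea is to list the RDDs the SparkContext still holds persisted before and after resources are released.

import org.apache.spark.api.java.JavaSparkContext;

// Sketch of the manual verification: dump whatever the SparkContext still
// holds persisted. Run it after tagging/commit (both cached RDDs should
// appear) and again after SparkRDDWriteClient#releaseResources (they should
// be gone).
class PersistedRddInspector {
  static void dumpPersistedRdds(JavaSparkContext jsc, String label) {
    System.out.println("=== persisted RDDs " + label + " ===");
    jsc.getPersistentRDDs().forEach((id, rdd) ->
        System.out.println("rdd id=" + id + " storageLevel=" + rdd.getStorageLevel().description()));
  }
}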


@codope added the priority:critical label and removed the priority:blocker label Aug 4, 2023
@github-actions bot added the size:M (lines of changes in (100, 300]) label Feb 26, 2024