
Checking storage level before persisting preppedRecords #358

Merged

merged 1 commit into apache:master from jian/prepped-record-persist on Mar 23, 2018

Conversation

jianxu (Contributor) commented Mar 20, 2018:

@vinothchandar @ovj

In HoodieWriteClient, for the upsertPreppedRecords API (or insert without dedup), the client may have already persisted preppedRecords at a storage level other than MEMORY_AND_DISK_SER. Because Spark does not allow changing the storage level of an RDD once one has been assigned (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L170), the subsequent persist() call inside Hudi can throw UnsupportedOperationException.

This PR addresses the issue by persisting preppedRecords in upsertRecordsInternal only when it has not already been persisted.

An alternative would be to introduce a new config, hoodie.write.input.storage.level, for preppedRecords, so that preppedRecords could be persisted at the same storage level the user chose.
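For illustration, a minimal sketch of the guarded persist described above (the class and helper names are hypothetical; persist, getStorageLevel, and the StorageLevel constants are real Spark APIs):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public final class PersistGuard {

  // Calling persist() with a different level on an already-persisted RDD
  // makes Spark throw UnsupportedOperationException ("Cannot change storage
  // level of an RDD after it was already assigned a level"), so we only
  // persist when no level has been assigned yet.
  public static <T> void persistIfNeeded(JavaRDD<T> preppedRecords) {
    if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
      preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    }
  }
}
```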

ovj (Contributor) commented Mar 21, 2018:

Can we detect on the fly whether the given RDD is already persisted, and persist it only if it is not, perhaps driven by configs? Something like:
    public static Optional<RDDInfo> getRddInfo(@NonNull final SparkContext sc, final int rddId) {
      for (RDDInfo rddInfo : sc.getRDDStorageInfo()) {
        if (rddInfo.id() != rddId) {
          continue;
        }
        return Optional.of(rddInfo);
      }
      return Optional.absent();
    }
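A hypothetical caller of the snippet above (the method and variable names are assumptions; Guava's Optional matches the Optional.absent() call in the snippet):

```java
import com.google.common.base.Optional;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

// Scan the driver's storage info and persist only when the RDD is not
// already tracked there. Note that records.getStorageLevel() is a cheaper
// local check, which is the route the PR ultimately takes.
static void cacheIfUntracked(JavaSparkContext jsc, JavaRDD<String> records) {
  if (!getRddInfo(jsc.sc(), records.rdd().id()).isPresent()) {
    records.persist(StorageLevel.MEMORY_AND_DISK_SER());
  }
}
```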

vinothchandar (Member) commented:

@bvaradar this is a good starter review.

jianxu (Contributor, Author) commented Mar 22, 2018:

Synced with @ovj; the current approach is very simple and achieves what we want. cc @vinothchandar @bvaradar

    @@ -414,7 +414,11 @@ private void saveWorkloadProfileMetadataToInflight(WorkloadProfile profile,
             final boolean isUpsert) {

         // Cache the tagged records, so we don't end up computing both
    -    preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    +    if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
bvaradar (Contributor) commented on the diff:

The code change looks good, but I have a general concern about the API contract here.

HoodieWriteClient is one of the main user-facing classes. There are a bunch of APIs (insert/upsert and their bulk versions) that take non-prepped records, and another set that takes prepped records. In all cases, the APIs take a JavaRDD as input.

We need to give a consistent contract (e.g., whether operations have a persist side effect). In the case of non-prepped records, we just apply transformations.

To make them consistent, I see 2 options (a sketch of the first follows the list):

  1. When Hudi persists the caller's prepped RDD, Hudi will unpersist() it before returning control to the caller (preferred choice), or
  2. Make an explicit ownership contract where Hudi never touches the client-provided RDD, and it is the caller's responsibility to cache/persist.
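A minimal sketch of option 1 (the method name upsertInternal and the persistedByHudi flag are hypothetical; only the Spark persist/unpersist/getStorageLevel calls are real):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Option 1: persist the caller's RDD only when needed, and unpersist it
// before returning, so the caller observes no side effect.
static <T> void upsertInternal(JavaRDD<T> preppedRecords) {
  boolean persistedByHudi = false;
  if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
    preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    persistedByHudi = true;
  }
  try {
    // ... run the upsert work that reuses preppedRecords ...
  } finally {
    if (persistedByHudi) {
      preppedRecords.unpersist(); // hand the RDD back in its original state
    }
  }
}
```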

jianxu (Contributor, Author) replied:

@bvaradar Good point on making the contract in HoodieWriteClient consistent. Since we currently do not unpersist preppedRecords, the effect of this PR is to support the case where the client has already persisted preppedRecords, without any other changes.

I'm in favor of suggestion #1: unpersist if Hudi persisted the records. However, since this involves a contract change, can we make this change in a separate issue and have the new contract reviewed and tested by all clients?

n3nash (Contributor) commented:

@jianxu @bvaradar For option 1, we should look at the performance overhead of doing a persist() and unpersist() if the client eventually wants to persist anyway.

bvaradar (Contributor) replied:

@n3nash: In ingestion use cases, I was expecting that writing to Hoodie (the sink) will be the last step most of the time. But in any case, in practice, isn't prepping records (outside Hoodie) usually non-trivial work, so that it is a good idea to persist as soon as they are ready, to prevent rework on failure? I am fine with (2) too, as it gives flexibility and ownership to the clients.

@jianxu: I am OK with doing the change in a separate PR, since this PR is more of a bug fix. Can you open an issue to track the cleanup and add a TODO here?

Balaji.V

jianxu (Contributor, Author) replied:

Good points @bvaradar @n3nash. #362 is open to track the consistent contract.

jianxu force-pushed the jian/prepped-record-persist branch from 4562aa3 to 855c34b on March 22, 2018 at 23:19.
vinothchandar (Member) commented:

Introducing a config may not be a bad idea, with the default being the current level, so that clients have the ability to control this if they choose not to cache here.
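A hypothetical sketch of such a config, reusing the hoodie.write.input.storage.level key proposed in the PR description (the key and helper are assumptions, not an existing Hudi config; StorageLevel.fromString is a real Spark API):

```java
import java.util.Properties;
import org.apache.spark.storage.StorageLevel;

// Resolve the input storage level from config, defaulting to the current
// behavior (MEMORY_AND_DISK_SER). StorageLevel.fromString accepts names
// such as "MEMORY_ONLY", "DISK_ONLY", or "NONE".
static StorageLevel inputStorageLevel(Properties props) {
  String level = props.getProperty(
      "hoodie.write.input.storage.level", "MEMORY_AND_DISK_SER");
  return StorageLevel.fromString(level);
}
```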

vinothchandar (Member) commented:

Merging for now. But I encourage us to come up with longer-term solutions in the first cut going forward, or to sit together and do a bug-bash week of some sort to fix the issues that keep accumulating.

vinothchandar merged commit 4864379 into apache:master on Mar 23, 2018.