
Checking storage level before persisting preppedRecords #358

Merged

merged 1 commit into apache:master from jian/prepped-record-persist on Mar 23, 2018

Conversation

jianxu (Contributor) commented Mar 20, 2018:

@vinothchandar @ovj

In HoodieWriteClient, for the upsertPreppedRecords API (or insert without dedup), the client may have already persisted preppedRecords at a storage level other than MEMORY_AND_DISK_SER. Because Spark does not allow changing the storage level of an RDD once one has been assigned (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L170), the subsequent persist() call inside Hudi can throw UnsupportedOperationException.

This PR addresses the issue by persisting preppedRecords in upsertRecordsInternal only when it has not already been persisted.

An alternative would be to introduce a new config, hoodie.write.input.storage.level, for preppedRecords, so that preppedRecords could be persisted at the same storage level the user chose.
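For illustration, a minimal sketch of the guarded persist described above (the class and helper names are hypothetical; persist, getStorageLevel, and the StorageLevel constants are real Spark APIs):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

public final class PersistGuard {

  // Calling persist() with a different level on an already-persisted RDD
  // makes Spark throw UnsupportedOperationException ("Cannot change storage
  // level of an RDD after it was already assigned a level"), so we only
  // persist when no level has been assigned yet.
  public static <T> void persistIfNeeded(JavaRDD<T> preppedRecords) {
    if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
      preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    }
  }
}
```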

ovj (Contributor) commented Mar 21, 2018:

Can we detect on the fly whether the given RDD is already persisted, and persist it only if it is not, perhaps driven by configs? Something like:
    public static Optional<RDDInfo> getRddInfo(@NonNull final SparkContext sc, final int rddId) {
      for (RDDInfo rddInfo : sc.getRDDStorageInfo()) {
        if (rddInfo.id() != rddId) {
          continue;
        }
        return Optional.of(rddInfo);
      }
      return Optional.absent();
    }
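A hypothetical caller of the snippet above (the method and variable names are assumptions; Guava's Optional matches the Optional.absent() call in the snippet):

```java
import com.google.common.base.Optional;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

// Scan the driver's storage info and persist only when the RDD is not
// already tracked there. Note that records.getStorageLevel() is a cheaper
// local check, which is the route the PR ultimately takes.
static void cacheIfUntracked(JavaSparkContext jsc, JavaRDD<String> records) {
  if (!getRddInfo(jsc.sc(), records.rdd().id()).isPresent()) {
    records.persist(StorageLevel.MEMORY_AND_DISK_SER());
  }
}
```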

vinothchandar (Member) commented:

@bvaradar this is a good starter review.

jianxu (Contributor, Author) commented Mar 22, 2018:

Synced with @ovj; the current approach is very simple and achieves what we want. cc @vinothchandar @bvaradar

    @@ -414,7 +414,11 @@ private void saveWorkloadProfileMetadataToInflight(WorkloadProfile profile,
             final boolean isUpsert) {

         // Cache the tagged records, so we don't end up computing both
    -    preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    +    if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
bvaradar (Contributor) commented on the diff:

The code change looks good, but I have a general concern about the API contract here.

HoodieWriteClient is one of the main user-facing classes. There are a bunch of APIs (insert/upsert and their bulk versions) that take non-prepped records, and another set that takes prepped records. In all cases, the APIs take a JavaRDD as input.

We need to give a consistent contract (e.g., whether operations have a persist side effect). In the case of non-prepped records, we just apply transformations.

To make them consistent, I see 2 options (a sketch of the first follows the list):

  1. When Hudi persists the caller's prepped RDD, Hudi will unpersist() it before returning control to the caller (preferred choice), or
  2. Make an explicit ownership contract where Hudi never touches the client-provided RDD, and it is the caller's responsibility to cache/persist.
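A minimal sketch of option 1 (the method name upsertInternal and the persistedByHudi flag are hypothetical; only the Spark persist/unpersist/getStorageLevel calls are real):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Option 1: persist the caller's RDD only when needed, and unpersist it
// before returning, so the caller observes no side effect.
static <T> void upsertInternal(JavaRDD<T> preppedRecords) {
  boolean persistedByHudi = false;
  if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
    preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
    persistedByHudi = true;
  }
  try {
    // ... run the upsert work that reuses preppedRecords ...
  } finally {
    if (persistedByHudi) {
      preppedRecords.unpersist(); // hand the RDD back in its original state
    }
  }
}
```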

jianxu (Contributor, Author) replied:

@bvaradar Good point on making the contract in HoodieWriteClient consistent. Since we currently do not unpersist preppedRecords, the effect of this PR is to support the case where the client has already persisted preppedRecords, without any other changes.

I'm in favor of suggestion #1: unpersist if Hudi persisted the records. However, since this involves a contract change, can we make this change in a separate issue and have the new contract reviewed and tested by all clients?

n3nash (Contributor) commented:

@jianxu @bvaradar For option 1, we should look at the performance overhead of doing a persist() and unpersist() if the client eventually wants to persist anyway.

bvaradar (Contributor) replied:

@n3nash: In ingestion use cases, I was expecting that writing to Hoodie (the sink) will be the last step most of the time. But in any case, in practice, isn't prepping records (outside Hoodie) usually non-trivial work, so that it is a good idea to persist as soon as they are ready, to prevent rework on failure? I am fine with (2) too, as it gives flexibility and ownership to the clients.

@jianxu: I am OK with doing the change in a separate PR, since this PR is more of a bug fix. Can you open an issue to track the cleanup and add a TODO here?

Balaji.V

jianxu (Contributor, Author) replied:

Good points @bvaradar @n3nash. #362 is open to track the consistent contract.

jianxu force-pushed the jian/prepped-record-persist branch from 4562aa3 to 855c34b on March 22, 2018 at 23:19.
vinothchandar (Member) commented:

Introducing a config may not be a bad idea, with the default being the current level, so that clients have the ability to control this if they choose not to cache here.
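A hypothetical sketch of such a config, reusing the hoodie.write.input.storage.level key proposed in the PR description (the key and helper are assumptions, not an existing Hudi config; StorageLevel.fromString is a real Spark API):

```java
import java.util.Properties;
import org.apache.spark.storage.StorageLevel;

// Resolve the input storage level from config, defaulting to the current
// behavior (MEMORY_AND_DISK_SER). StorageLevel.fromString accepts names
// such as "MEMORY_ONLY", "DISK_ONLY", or "NONE".
static StorageLevel inputStorageLevel(Properties props) {
  String level = props.getProperty(
      "hoodie.write.input.storage.level", "MEMORY_AND_DISK_SER");
  return StorageLevel.fromString(level);
}
```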

vinothchandar (Member) commented:

Merging for now. But I encourage us to come up with longer-term solutions in the first cut going forward, or to sit together and do a bug-bash week of some sort to fix the issues that keep accumulating.

vinothchandar merged commit 4864379 into apache:master on Mar 23, 2018.