Checking storage level before persisting preppedRecords #358
Conversation
Can we detect on the fly whether a given RDD is already persisted, and persist it only if it is not, perhaps using configs?
@bvaradar this is a good starter review.
Synced with @ovj; the current approach is very simple and would achieve what we want. cc @vinothchandar @bvaradar
```diff
@@ -414,7 +414,11 @@ private void saveWorkloadProfileMetadataToInflight(WorkloadProfile profile,
                                       final boolean isUpsert) {

     // Cache the tagged records, so we don't end up computing both
-    preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
+    if (preppedRecords.getStorageLevel() == StorageLevel.NONE()) {
+      preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER());
+    }
```
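The guard in the diff can be sketched in isolation. This is a hypothetical, self-contained illustration: `StorageLevel` and `FakeRdd` below are simplified stand-ins for the Spark classes, not Hudi or Spark code; the stand-in mimics Spark's behavior of rejecting a storage-level change on an already-persisted RDD, which is the failure this PR avoids.

```java
// Simplified stand-in for Spark's StorageLevel; NONE means "not cached".
enum StorageLevel { NONE, MEMORY_ONLY, MEMORY_AND_DISK_SER }

// Minimal stand-in for a JavaRDD that remembers its storage level. Real
// Spark RDDs throw UnsupportedOperationException if the level is changed
// after it has been set.
class FakeRdd {
    private StorageLevel level = StorageLevel.NONE;

    StorageLevel getStorageLevel() {
        return level;
    }

    void persist(StorageLevel newLevel) {
        if (level != StorageLevel.NONE && level != newLevel) {
            throw new UnsupportedOperationException(
                "Cannot change storage level of an RDD after it was already assigned a level");
        }
        level = newLevel;
    }
}

public class PersistGuard {
    // The pattern from the diff: persist only when nothing is cached yet,
    // so a caller-chosen level is left untouched.
    static void persistIfNeeded(FakeRdd preppedRecords) {
        if (preppedRecords.getStorageLevel() == StorageLevel.NONE) {
            preppedRecords.persist(StorageLevel.MEMORY_AND_DISK_SER);
        }
    }

    public static void main(String[] args) {
        FakeRdd callerPersisted = new FakeRdd();
        callerPersisted.persist(StorageLevel.MEMORY_ONLY);
        persistIfNeeded(callerPersisted); // no-op: caller's level is kept
        System.out.println(callerPersisted.getStorageLevel()); // MEMORY_ONLY

        FakeRdd fresh = new FakeRdd();
        persistIfNeeded(fresh); // cached at the default level
        System.out.println(fresh.getStorageLevel()); // MEMORY_AND_DISK_SER
    }
}
```

Without the guard, calling `persist(MEMORY_AND_DISK_SER)` on the first RDD above would throw, which is exactly the bug being fixed.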
The code change looks good, but I have a general concern about the API contract here.
HoodieWriteClient is one of the main user-facing classes. There are a bunch of APIs (insert/upsert and their bulk versions) which take non-prepped records, and another set which take prepped records. In all cases, the APIs take a JavaRDD as input.
We need to give a consistent contract (e.g., whether operations have a persist side effect). In the case of non-prepped records, we just apply transformations.
To make them consistent, I see 2 options:
- When Hudi persists the caller's prepped RDD, Hudi will unpersist() before returning control to the caller (preferred choice), or
- Make an explicit ownership contract where Hudi never touches the client-provided RDD, and it is the caller's responsibility to cache/persist.
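Option 1 can be sketched roughly as follows. This is a hypothetical illustration of the proposed contract, not Hudi code: `CachedRdd` and `upsertPrepped` are made-up names, and the stand-in only tracks a persisted flag so the pattern stays self-contained.

```java
// Minimal stand-in for an RDD that tracks whether it is cached.
class CachedRdd {
    private boolean persisted = false;

    boolean isPersisted() { return persisted; }
    void persist() { persisted = true; }
    void unpersist() { persisted = false; }
}

public class OwnershipContract {
    // Sketch of option 1: Hudi records whether it persisted the caller's
    // RDD and, if so, unpersists before returning, so the RDD's caching
    // state is unchanged from the caller's point of view.
    static void upsertPrepped(CachedRdd preppedRecords) {
        boolean persistedByHudi = false;
        if (!preppedRecords.isPersisted()) {
            preppedRecords.persist();
            persistedByHudi = true;
        }
        try {
            // ... the actual write path would run here ...
        } finally {
            // Undo only our own persist; never touch a caller-managed cache.
            if (persistedByHudi) {
                preppedRecords.unpersist();
            }
        }
    }
}
```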
@bvaradar Good point on making a consistent contract in HoodieWriteClient. Since we currently don't unpersist preppedRecords, the effect of this PR is to support the use case where the client has already persisted preppedRecords, without other additional changes.
I'm in favor of suggestion #1, unpersisting if Hudi has persisted the records. However, since this involves a contract change, can we consider making it in a separate issue and have the new contract reviewed and tested by all clients?
@n3nash: In ingestion use cases, I was expecting that writing to Hoodie (sink) would be the last step most of the time. But in any case, practically speaking, isn't prepping records (outside Hoodie) usually non-trivial work, and isn't it a good idea to persist as soon as they are ready, to prevent rework in case of failures? I am fine with (2) too, as it gives flexibility and ownership to clients.
@jianxu: I am OK with doing the change in a separate PR, as this PR is more of a bug fix. Can you open an issue to do the cleanup and add a TODO here?
Balaji.V
Force-pushed from 4562aa3 to 855c34b.
Introducing a config may not be a bad idea, with the default being the current level, so clients have the ability to control this if they choose not to cache here.
Merging for now. But I encourage us to come up with more long-term solutions in the first cut moving forward, or sit together and do a bug-bash week of some sort to fix such issues that keep accumulating.
Co-authored-by: Lokesh Lingarajan <lokeshlingarajan@Lokeshs-MacBook-Pro.local>
@vinothchandar @ovj
In `HoodieWriteClient`, for the APIs `upsertPreppedRecords` or `insert` without using dedup, it's possible that the client persists the preppedRecords at a different level than `MEMORY_AND_DISK_SER`, and due to https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L170, an `UnsupportedOperationException` could be thrown.
This PR addresses the issue by persisting preppedRecords in `upsertRecordsInternal` only when they have not already been persisted.
An alternative would be to introduce a new config, `hoodie.write.input.storage.level`, for preppedRecords, so that preppedRecords could be persisted using the same storage level chosen by the user.
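The alternative config approach could be resolved along these lines. This is a hypothetical sketch: the config key comes from the description above (it was not an actual Hudi config at the time of this PR), `InputStorageLevelConfig` and `resolveLevelName` are made-up names, and the default mirrors the currently hard-coded level.

```java
import java.util.Properties;

public class InputStorageLevelConfig {
    // Hypothetical config key, taken from the PR description above.
    static final String INPUT_STORAGE_LEVEL_KEY = "hoodie.write.input.storage.level";

    // Resolve the storage level name, defaulting to the currently
    // hard-coded one. In real code, Spark's StorageLevel.fromString(...)
    // would convert the name into an actual StorageLevel instance.
    static String resolveLevelName(Properties props) {
        return props.getProperty(INPUT_STORAGE_LEVEL_KEY, "MEMORY_AND_DISK_SER");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolveLevelName(props)); // MEMORY_AND_DISK_SER

        props.setProperty(INPUT_STORAGE_LEVEL_KEY, "MEMORY_ONLY");
        System.out.println(resolveLevelName(props)); // MEMORY_ONLY
    }
}
```

With such a config, Hudi could persist preppedRecords at the same level the client chose, avoiding the level mismatch entirely.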