[WIP][HUDI-752]Make CompactionAdminClient spark-free #1471
Conversation
/**
 * Util class for Spark Engine.
 */
public class SparkEngineUtils {
Actually, if we depend on this class, we cannot call it Spark-free.
I understand the spirit here.. but if we are introducing an abstraction like SparkEngineUtils, then we need more upfront design on how it's going to fit in everywhere.. thoughts? If we can produce such an abstraction, then we need not do this piecemeal, class by class.. we can see if this model can replace the core writing logic and whether you can run the TestCopyOnWriteStorage..... unit test successfully.. I think that's a better parallel approach to pursue...
I have scoped the work for HUDI-677, which will move a bunch of code.. so I would also love to avoid stepping on toes for this week at least :)
Also, I would have a base interface EngineContext (let's please not overload Utils anymore.. I am trying to move us towards SRP principles).. and a subclass SparkRDDEngineContext (we may add a DataFrame engine, a Flink engine).. and generify the code such that we pass the engineContext once to HoodieWriteClient and the rest of the code can execute..
This sort of PoC will be extremely valuable to us at this stage, rather than doing classes one by one.. We will take a long time to be done :) .. @yanghua would you also agree with my thoughts here
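To make the suggestion concrete, here is a minimal sketch of what such an abstraction could look like. The names (HoodieEngineContext, LocalEngineContext) and the single map method are illustrative assumptions, not Hudi's actual API; a Spark subclass would delegate the same call to jsc.parallelize, so callers like HoodieWriteClient stay engine-agnostic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical base interface: engines implement Hudi-level operations,
// and the rest of the code is generified over this context.
interface HoodieEngineContext {
  <T, R> List<R> map(List<T> input, Function<T, R> fn, int parallelism);
}

// An engine-free implementation backed by plain Java parallel streams.
// A SparkRDDEngineContext would instead call jsc.parallelize(input, parallelism).
class LocalEngineContext implements HoodieEngineContext {
  @Override
  public <T, R> List<R> map(List<T> input, Function<T, R> fn, int parallelism) {
    return input.parallelStream().map(fn).collect(Collectors.toList());
  }
}

public class EngineContextSketch {
  public static void main(String[] args) {
    HoodieEngineContext ctx = new LocalEngineContext();
    List<Integer> doubled = ctx.map(Arrays.asList(1, 2, 3), x -> x * 2, 2);
    System.out.println(doubled); // prints [2, 4, 6]
  }
}
```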
Yes, I suggest we temporarily block this PR. Internally, we expect an incomplete Flink-based implementation to be ready this Friday. Can we look at that implementation and then discuss further?
Yes, agree too @yanghua @vinothchandar
@hddong Thanks for your contribution. Any time you work on the spark-free issue, please make sure you communicate with @vinothchandar.
@yanghua Thanks, I just wanted to have a try. I will communicate first next time.
/**
 * Parallelize map function.
 */
public static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f) {
My early thoughts are that this EngineContext abstraction needs to abstract Hudi logic, not operate at the level of map, filter, etc. (if we wanted that, we could think of Beam).. There would be operations with different signatures across engines.. e.g. something like sortAndRepartitionWithPartitions exists for Spark RDDs but not for DataFrames.
I may be wrong.. but it's worth first compiling a table of all the RDD APIs we invoke today and their inputs, and seeing how this can evolve.. ?
@vinothchandar I totally agree with you; we need the abstraction class first. My original idea was that we have many transforms (List to List) that use jsc. They do not have to depend on Spark, so we can take them out.
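To illustrate the point about List-to-List transforms: a call like jsc.parallelize(list, num).map(f::apply).collect() can run on a plain thread pool instead of Spark. This is only a hedged sketch of what a Spark-free parallelizeMap could look like (the class name is hypothetical, and error handling is simplified):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class SparkFreeParallelMap {
  // Spark-free equivalent of jsc.parallelize(list, num).map(f::apply).collect():
  // applies f to each element on a fixed-size thread pool, preserving input order.
  public static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(num);
    try {
      List<Future<R>> futures = new ArrayList<>();
      for (T item : list) {
        futures.add(pool.submit(() -> f.apply(item)));
      }
      List<R> results = new ArrayList<>();
      for (Future<R> future : futures) {
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> upper = parallelizeMap(Arrays.asList("a", "b", "c"), 2, String::toUpperCase);
    System.out.println(upper); // prints [A, B, C]
  }
}
```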
Closing due to inactivity; we have done a few different fixes around this. Please rebase and reopen if it's still relevant.
Tips
What is the purpose of the pull request
There is only one SparkContext in a JVM. So, we can store it in a Factory class, then we can get it everywhere. After that, we make many classes Spark-free.
Brief change log
(for example:)
Verify this pull request
This pull request is a trivial rework / code cleanup without any test coverage.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.