
[WIP][HUDI-752]Make CompactionAdminClient spark-free #1471

Closed
wants to merge 3 commits

Conversation


@hddong commented Mar 31, 2020


What is the purpose of the pull request

  • There can be only one SparkContext in a JVM, so we can store it in a factory class and then retrieve it everywhere. After that, we can make many classes spark-free.
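The "store the one SparkContext in a factory class" idea above can be sketched as a simple holder. This is an illustrative assumption, not an actual Hudi class: EngineContextFactory is a hypothetical name, and the held object is typed generically here so the sketch runs without a Spark dependency (in Hudi it would hold the JavaSparkContext).

```java
// Hypothetical sketch: a process-wide holder for the single engine context.
class EngineContextFactory {
  private static volatile Object context; // would be the JavaSparkContext in Hudi

  private EngineContextFactory() {}

  // Register the single engine context once, e.g. at driver startup.
  static synchronized void register(Object ctx) {
    if (context != null && context != ctx) {
      throw new IllegalStateException("An engine context is already registered");
    }
    context = ctx;
  }

  // Retrieve the registered context from anywhere in the code base.
  @SuppressWarnings("unchecked")
  static <T> T get() {
    if (context == null) {
      throw new IllegalStateException("No engine context registered");
    }
    return (T) context;
  }
}
```

Callers would then fetch the context from the factory instead of threading a jsc parameter through every class.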

Brief change log


  • Make CompactionAdminClient spark-free

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hddong hddong changed the title [HUDI-752]Make CompactionAdminClient spark-free [WIP][HUDI-752]Make CompactionAdminClient spark-free Mar 31, 2020
/**
* Util class for Spark Engine.
*/
public class SparkEngineUtils {
Contributor

Actually, if we depend on this class, we cannot call it spark-free.

Contributor Author

@yanghua IMO, this has the same purpose as HUDI-678: just make this class spark-free. After all the Spark RDD computation moves into this class, we can abstract the FlinkEngine in the same way.

Member

I understand the spirit here, but if we are introducing an abstraction like SparkEngineUtils, then we need more upfront design on how it is going to fit in everywhere. Thoughts? If we can produce such an abstraction, then we need not do this piecemeal, class by class; we can see if this model can replace the core writing logic, and you can run the TestCopyOnWriteStorage..... unit test successfully. I think that's a better parallel approach to pursue.

I have scoped the work for HUDI-677, which will move a bunch of code, so I would also love to avoid stepping on toes, for this week at least :)

Member

Also, I would have a base interface EngineContext (please, let's not overload Utils anymore; I am trying to move us toward SRP principles) and a subclass SparkRDDEngineContext (we may add a DataFrame engine, a Flink engine), and generify the code such that we pass the engineContext once to HoodieWriteClient and the rest of the code can execute.

This sort of PoC would be extremely valuable to us at this stage, rather than doing classes one by one; we will take a long time to be done :) @yanghua would you also agree with my thoughts here?
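The EngineContext idea described above could look something like the following. All names and signatures here are illustrative assumptions, not the actual Hudi API; where the real SparkRDDEngineContext would call jsc.parallelize(data, parallelism).map(func::apply).collect(), this sketch uses a plain-Java stand-in so it runs without a Spark dependency.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical base interface: engine-agnostic primitives the write path needs.
interface EngineContext {
  <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Local stand-in for a SparkRDDEngineContext subclass; uses Java parallel
// streams instead of Spark RDDs, but presents the same engine-neutral API.
class LocalEngineContext implements EngineContext {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    // Ordered collect preserves the input order, like RDD collect().
    return data.parallelStream().map(func).collect(Collectors.toList());
  }
}
```

Under this design, HoodieWriteClient would receive an EngineContext once at construction and the rest of the code would never touch Spark types directly.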

Contributor

Yes, I suggest we temporarily block this PR. We internally expect an incomplete version of the Flink-based implementation to be ready this Friday. Can we look at its implementation and then discuss further?

Member

yes.. I am with you... @yanghua and @hddong

Contributor Author

Yes, agree too @yanghua @vinothchandar


yanghua commented Mar 31, 2020

@hddong Thanks for your contribution. Any time you work on the spark-free issue, please make sure you communicate with @vinothchandar.


hddong commented Mar 31, 2020

@yanghua thanks, I just wanted to have a try. I will communicate first next time.

/**
* Parallelize map function.
*/
public static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f) {
Member

My early thoughts are that this EngineContext abstraction needs to abstract Hudi logic and not operate at the level of map, filter, etc. (if we wanted that, we could think of Beam). There would be operations with different signatures across engines; e.g. something like sortAndRepartitionWithPartitions exists for Spark RDDs and not DataFrames.

I may be wrong, but it's worth first compiling a table of all the RDD APIs we invoke today, with their inputs, and seeing how this can evolve?


hddong commented Apr 1, 2020

@vinothchandar I totally agree with you; we need the abstraction class first. My original idea was that we have many transforms (List to List) that use jsc but do not have to depend on Spark, so we can take them out.
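Taking those List-to-List transforms out of jsc could look like the sketch below: matching the parallelizeMap signature from the diff above, a fixed thread pool stands in for jsc.parallelize(list, num).map(f::apply).collect(). The body is an assumption for illustration, not the PR's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical spark-free version of the parallelizeMap helper.
class ParallelMapSketch {
  static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f) {
    ExecutorService pool = Executors.newFixedThreadPool(num);
    try {
      // Submit one task per element, like distributing RDD partitions.
      List<Future<R>> futures = new ArrayList<>(list.size());
      for (T item : list) {
        futures.add(pool.submit(() -> f.apply(item)));
      }
      // Collect results in submission order, like RDD collect().
      List<R> results = new ArrayList<>(list.size());
      for (Future<R> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException("parallelizeMap failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Such a helper keeps callers parallel without any Spark dependency, which is the point being made: these transforms are not required to depend on Spark.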

@vinothchandar
Member

Closing due to inactivity, and we have done a few different fixes around this. Please rebase and reopen if it's still relevant.
