
[WIP][HUDI-752]Make CompactionAdminClient spark-free #1471

Closed
wants to merge 3 commits

Conversation


@hddong commented Mar 31, 2020


What is the purpose of the pull request

  • There can be only one SparkContext in a JVM, so we can store it in a factory class and then retrieve it everywhere. After that, we can make many classes spark-free.
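The "store the one SparkContext in a factory class" idea above can be sketched as a simple holder. This is an illustrative assumption, not an actual Hudi class: EngineContextFactory is a hypothetical name, and the held object is typed generically here so the sketch runs without a Spark dependency (in Hudi it would hold the JavaSparkContext).

```java
// Hypothetical sketch: a process-wide holder for the single engine context.
class EngineContextFactory {
  private static volatile Object context; // would be the JavaSparkContext in Hudi

  private EngineContextFactory() {}

  // Register the single engine context once, e.g. at driver startup.
  static synchronized void register(Object ctx) {
    if (context != null && context != ctx) {
      throw new IllegalStateException("An engine context is already registered");
    }
    context = ctx;
  }

  // Retrieve the registered context from anywhere in the code base.
  @SuppressWarnings("unchecked")
  static <T> T get() {
    if (context == null) {
      throw new IllegalStateException("No engine context registered");
    }
    return (T) context;
  }
}
```

Callers would then fetch the context from the factory instead of threading a jsc parameter through every class.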

Brief change log


  • Make CompactionAdminClient spark-free

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@hddong hddong changed the title [HUDI-752]Make CompactionAdminClient spark-free [WIP][HUDI-752]Make CompactionAdminClient spark-free Mar 31, 2020
/**
* Util class for Spark Engine.
*/
public class SparkEngineUtils {
Contributor

Actually, if we depend on this class, we cannot call it spark-free.

Contributor Author

@yanghua IMO, this has the same purpose as HUDI-678: just make this class spark-free. After all the Spark RDD computation moves into this class, we can abstract the FlinkEngine in the same way.

Member

I understand the spirit here, but if we are introducing an abstraction like SparkEngineUtils, then we need more upfront design on how it is going to fit in everywhere. Thoughts? If we can produce such an abstraction, then we need not do this piecemeal, class by class; we can see if this model can replace the core writing logic, and you can run the TestCopyOnWriteStorage..... unit test successfully. I think that's a better parallel approach to pursue.

I have scoped the work for HUDI-677, which will move a bunch of code, so I would also love to avoid stepping on toes, for this week at least :)

Member

Also, I would have a base interface EngineContext (please, let's not overload Utils anymore; I am trying to move us toward SRP principles) and a subclass SparkRDDEngineContext (we may add a DataFrame engine, a Flink engine), and generify the code such that we pass the engineContext once to HoodieWriteClient and the rest of the code can execute.

This sort of PoC would be extremely valuable to us at this stage, rather than doing classes one by one; we will take a long time to be done :) @yanghua would you also agree with my thoughts here?
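The EngineContext idea described above could look something like the following. All names and signatures here are illustrative assumptions, not the actual Hudi API; where the real SparkRDDEngineContext would call jsc.parallelize(data, parallelism).map(func::apply).collect(), this sketch uses a plain-Java stand-in so it runs without a Spark dependency.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical base interface: engine-agnostic primitives the write path needs.
interface EngineContext {
  <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Local stand-in for a SparkRDDEngineContext subclass; uses Java parallel
// streams instead of Spark RDDs, but presents the same engine-neutral API.
class LocalEngineContext implements EngineContext {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
    // Ordered collect preserves the input order, like RDD collect().
    return data.parallelStream().map(func).collect(Collectors.toList());
  }
}
```

Under this design, HoodieWriteClient would receive an EngineContext once at construction and the rest of the code would never touch Spark types directly.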

Contributor

Yes, I suggest we temporarily block this PR. We internally expect an incomplete version of the Flink-based implementation to be ready this Friday. Can we look at its implementation and then discuss further?

Member

yes.. I am with you... @yanghua and @hddong

Contributor Author

Yes, agree too @yanghua @vinothchandar


yanghua commented Mar 31, 2020

@hddong Thanks for your contribution. Any time you work on the spark-free issue, please make sure you communicate with @vinothchandar.


hddong commented Mar 31, 2020

@yanghua thanks, I just wanted to have a try. I will communicate first next time.

/**
* Parallelize map function.
*/
public static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f) {
Member

My early thoughts are that this EngineContext abstraction needs to abstract Hudi logic and not operate at the level of map, filter, etc. (if we wanted that, we could think of Beam). There would be operations with different signatures across engines; e.g. something like sortAndRepartitionWithPartitions exists for Spark RDDs and not DataFrames.

I may be wrong, but it's worth first compiling a table of all the RDD APIs we invoke today, with their inputs, and seeing how this can evolve?


hddong commented Apr 1, 2020

@vinothchandar I totally agree with you; we need the abstraction class first. My original idea was that we have many transforms (List to List) that use jsc but do not have to depend on Spark, so we can take them out.
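Taking those List-to-List transforms out of jsc could look like the sketch below: matching the parallelizeMap signature from the diff above, a fixed thread pool stands in for jsc.parallelize(list, num).map(f::apply).collect(). The body is an assumption for illustration, not the PR's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical spark-free version of the parallelizeMap helper.
class ParallelMapSketch {
  static <T, R> List<R> parallelizeMap(List<T> list, int num, Function<T, R> f) {
    ExecutorService pool = Executors.newFixedThreadPool(num);
    try {
      // Submit one task per element, like distributing RDD partitions.
      List<Future<R>> futures = new ArrayList<>(list.size());
      for (T item : list) {
        futures.add(pool.submit(() -> f.apply(item)));
      }
      // Collect results in submission order, like RDD collect().
      List<R> results = new ArrayList<>(list.size());
      for (Future<R> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException("parallelizeMap failed", e);
    } finally {
      pool.shutdown();
    }
  }
}
```

Such a helper keeps callers parallel without any Spark dependency, which is the point being made: these transforms are not required to depend on Spark.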

@vinothchandar
Member

Closing due to inactivity, and we have done a few different fixes around this. Please rebase and reopen if it's still relevant.
