chore: improve spark parallel #450

zyxxoo · 2023-04-06T10:40:21Z

No description provided.

zyxxoo · 2023-04-06T10:46:51Z

hugegraph-loader/src/main/java/org/apache/hugegraph/loader/spark/HugeGraphSparkLoader.java

+                    LOG.info("\n Start to load data using spark bulkload \n");
+                    // gen-hfile
+                    HBaseDirectLoader directLoader = new HBaseDirectLoader(loadOptions, struct,
+                                                                           loadDistributeMetrics);


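The loadDistributeMetrics here puzzles me. This code runs inside a Spark operator, so as I understand it the operator should only receive a copy of this object, right? How does Spark get its values back to the driver?

simon824 · Apr 6, 2023

LoadDistributeMetrics uses Spark accumulators, which aggregate the values from the executors back to the driver: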
https://github.com/apache/incubator-hugegraph-toolchain/blob/master/hugegraph-loader/src/main/java/org/apache/hugegraph/loader/metrics/LoadDistributeMetrics.java#L54
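
For readers following the question above, here is a minimal sketch of why an ordinary object mutated inside an operator does not reach the driver. It uses plain Java serialization as a stand-in for shipping a closure to an executor; it does not use Spark, and `PlainCounter`, `shipToExecutor`, and `driverCountAfterExecutorUpdate` are hypothetical names, not part of hugegraph-loader. Spark accumulators avoid this problem because each task's accumulator updates are sent back with the task result and merged on the driver.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureCopySketch {

    // Hypothetical plain metrics holder; NOT the real LoadDistributeMetrics.
    static class PlainCounter implements Serializable {
        private static final long serialVersionUID = 1L;
        long count;
    }

    // Round-trips the object through Java serialization. This is only a
    // stand-in for what happens when a captured object is shipped to an
    // executor: the receiving side works on a deserialized copy.
    static PlainCounter shipToExecutor(PlainCounter driverSide)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(driverSide);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (PlainCounter) in.readObject();
        }
    }

    // Updates the "executor" copy and reports what the "driver" still sees.
    public static long driverCountAfterExecutorUpdate()
            throws IOException, ClassNotFoundException {
        PlainCounter driverSide = new PlainCounter();
        PlainCounter executorSide = shipToExecutor(driverSide);
        executorSide.count += 100;   // mutation happens on the copy only
        return driverSide.count;     // the original object is unchanged
    }

    public static void main(String[] args) throws Exception {
        System.out.println("driver sees: " + driverCountAfterExecutorUpdate());
    }
}
```

The update made on the copy never shows up in the original, which matches the intuition in the question; an accumulator-backed metrics class is one way to get the per-executor counts merged on the driver instead.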

codecov · 2023-04-06T10:49:34Z

Codecov Report

Merging #450 (42927ce) into master (36a1ada) will decrease coverage by 0.05%.
The diff coverage is 0.00%.

❗ Current head 42927ce differs from pull request most recent head dad4504. Consider uploading reports for the commit dad4504 to get more accurate results

@@             Coverage Diff              @@
##             master     #450      +/-   ##
============================================
- Coverage     62.57%   62.52%   -0.05%     
+ Complexity     1867      894     -973     
============================================
  Files           260       91     -169     
  Lines          9418     4395    -5023     
  Branches        872      516     -356     
============================================
- Hits           5893     2748    -3145     
+ Misses         3143     1444    -1699     
+ Partials        382      203     -179

Impacted Files	Coverage Δ
...e/hugegraph/loader/spark/HugeGraphSparkLoader.java	`0.00% <0.00%> (ø)`

... and 169 files with indirect coverage changes

zyxxoo · 2023-04-06T10:56:55Z

hugegraph-loader/src/main/java/org/apache/hugegraph/loader/spark/HugeGraphSparkLoader.java

-                    LoadContext context = initPartition(this.loadOptions, struct);
-                    p.forEachRemaining((Row row) -> {
-                        loadRow(struct, row, p, context);
+            Future<?> future = Executors.newCachedThreadPool().submit(() -> {


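As I understand it, running this concurrently should no longer cause any thread-safety issues.

I use a cached thread pool here. As I understand it, multiple files are being loaded, so they run in parallel and produce multiple DAGs, and Spark then schedules the actual tasks. That is why I did not consider the thread pool size here.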
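
The pattern described in this comment (submit one task per input to a cached thread pool, then wait on the futures) can be sketched in plain Java as follows. This is an illustration under assumptions, not the merged code: `loadStruct` is a hypothetical stand-in for the work that would trigger a Spark job, and the sketch shares a single pool and shuts it down afterwards, which is a choice made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ParallelLoadSketch {

    // Hypothetical stand-in for loading one input struct. In the PR this is
    // where a Spark job would be triggered; here it just returns a number.
    static int loadStruct(String struct) {
        return struct.length();
    }

    // Submits one task per struct and waits for all of them to finish.
    public static int loadAll(List<String> structs)
            throws InterruptedException, ExecutionException {
        // A cached pool grows on demand, so no pool size is chosen here.
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String struct : structs) {
                Callable<Integer> task = () -> loadStruct(struct);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                // get() blocks until the task completes and rethrows failures
                // wrapped in ExecutionException.
                total += future.get();
            }
            return total;
        } finally {
            // Release the worker threads once all submitted work is done.
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadAll(Arrays.asList("vertex_person", "edge_knows")));
    }
}
```

Because a cached pool creates a thread for every task that cannot reuse an idle one, the number of threads follows the number of inputs; that is acceptable when there are only a few inputs and the heavy lifting is done by the cluster scheduler, which is the assumption stated in the comment.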

zyxxoo force-pushed the zy_dev branch from f082a97 to 42927ce Compare April 6, 2023 10:41

imbajin requested review from simon824 and imbajin April 6, 2023 10:48

chore: improve spark parallel

dad4504

zyxxoo force-pushed the zy_dev branch from 42927ce to dad4504 Compare April 6, 2023 11:00

javeme approved these changes Apr 6, 2023

simon824 approved these changes Apr 7, 2023

simon824 merged commit cf1312e into master Apr 7, 2023
9 checks passed

simon824 deleted the zy_dev branch April 7, 2023 01:49
