[HUDI-575] Support Async Compaction for spark streaming writes to hudi table #1752
Conversation
* In case of deltastreamer, Spark job scheduling configs are automatically set.
* As the configs need to be set before the Spark context is initialized, this is
* not automated for Structured Streaming.
* https://spark.apache.org/docs/latest/job-scheduling.html
https://jira.apache.org/jira/browse/HUDI-1031 to add to docs
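For reference, the scheduling configs that the code comment refers to have to be supplied at application submit time, before the SparkContext is created. A sketch of what that looks like (the allocation-file path is illustrative):

```sh
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml \
  ... # rest of the structured-streaming job arguments
```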
Just some preliminary comments. I rebased this PR against the latest master. Will continue to review.
hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
val asyncCompactionEnabled = isAsyncCompactionEnabled(client, parameters, jsc.hadoopConfiguration())
val compactionInstant: common.util.Option[java.lang.String] =
  if (asyncCompactionEnabled) {
    client.scheduleCompaction(common.util.Option.of(new util.HashMap[String, String](mapAsJavaMap(metaMap))))
is it possible that nothing is actually scheduled here, since there is nothing to compact?
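As a side note on the question above: `scheduleCompaction` returns an Option-typed instant, so the caller has to be prepared for an empty result when there is nothing to compact. A minimal sketch of that contract, using `java.util.Optional` in place of Hudi's `common.util.Option` (the method body and the instant format are illustrative, not Hudi's actual implementation):

```java
import java.util.Optional;

public class CompactionScheduler {
    /** Pretends to schedule a compaction plan; empty means there was no work. */
    public static Optional<String> scheduleCompaction(int pendingDeltaFiles) {
        if (pendingDeltaFiles == 0) {
            return Optional.empty();  // nothing to compact, so nothing is scheduled
        }
        return Optional.of("20200618120000");  // instant time of the scheduled plan
    }

    public static void main(String[] args) {
        // The write path should only hand an instant to the async compactor when present.
        scheduleCompaction(0).ifPresent(instant ->
                System.out.println("trigger async compaction for " + instant));
    }
}
```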
private def isAsyncCompactionEnabled(client: HoodieWriteClient[HoodieRecordPayload[Nothing]],
                                     parameters: Map[String, String], configuration: Configuration): Boolean = {
  log.info(s"Config.isInlineCompaction ? ${client.getConfig.isInlineCompaction}")
  if (!client.getConfig.isInlineCompaction
what if the user sets the writeClient config for inline compaction = false and does not set the async compaction datasource option? should we control this at a single level?
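One way to address the single-level-of-control question is to resolve the flag in one place, letting an explicit datasource option win and falling back to the write-client setting otherwise. A hedged sketch (the option key mirrors Hudi's async-compaction datasource option, but treat the exact names and precedence as assumptions, not the PR's final behavior):

```java
import java.util.Map;

public class CompactionConfigResolver {
    /**
     * Single point of control: an explicit datasource option wins; otherwise
     * fall back to the write client's inline-compaction setting.
     */
    public static boolean isAsyncCompactionEnabled(Map<String, String> options,
                                                   boolean inlineCompactionEnabled) {
        String opt = options.get("hoodie.datasource.compaction.async.enable");
        if (opt != null) {
            return Boolean.parseBoolean(opt);
        }
        // No explicit option: async compaction only makes sense when inline is off.
        return !inlineCompactionEnabled;
    }

    public static void main(String[] args) {
        System.out.println(isAsyncCompactionEnabled(Map.of(), true));
    }
}
```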
 * active. If there is no activity for a sufficient period, the async compactor shuts down. If the sink was indeed
 * active, a subsequent batch will re-trigger async compaction.
 */
public class SparkStreamingWriterActivityDetector {
basic question.. if there are no writes, no compaction gets scheduled, right? so async compaction is a no-op, i.e. it will check if there is some work to do and, if not, won't trigger anything?
Looks close, but we need to iron out the activity detection / async compaction thread shutdown.
}

/**
 * Spark Structured Streaming Sink implementations do not have a mechanism to know when the stream is shut down.
move comments that refer to a sub-class impl to that class itself?
Done
long currTime = System.nanoTime();
long elapsedTimeSecs = Double.valueOf(Math.ceil(1.0 * (currTime - lastEndBatchTime) / SECS_TO_NANOS)).longValue();
if (elapsedTimeSecs > sinkInactivityTimeoutSecs) {
  LOG.warn("Streaming Sink has been idle for " + elapsedTimeSecs + " seconds");
this does not mean there is no work for compaction, right?
This code is deleted.
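For what it's worth, the deleted elapsed-time arithmetic can be written more directly with `java.util.concurrent.TimeUnit` (note this truncates to whole seconds rather than rounding up as the original `Math.ceil` did, which is usually fine for an idle-timeout check):

```java
import java.util.concurrent.TimeUnit;

public class IdleDetector {
    /** Elapsed whole seconds between two System.nanoTime() readings. */
    public static long elapsedSecs(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toSeconds(endNanos - startNanos);
    }

    public static void main(String[] args) {
        long start = 0L;
        long end = TimeUnit.SECONDS.toNanos(90);
        System.out.println(elapsedSecs(start, end));  // 90
    }
}
```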
@@ -58,6 +58,7 @@
protected static final String PRESTO_COORDINATOR = "/presto-coordinator-1";
protected static final String HOODIE_WS_ROOT = "/var/hoodie/ws";
protected static final String HOODIE_JAVA_APP = HOODIE_WS_ROOT + "/hudi-spark/run_hoodie_app.sh";
protected static final String HOODIE_JAVA_STREAMING_APP = HOODIE_WS_ROOT + "/hudi-spark/run_hoodie_streaming_app.sh";
more importantly, we should also re-enable the test in TestDataSource
Enabled it after adding timed retry logic to wait for commits
mode,
options,
data)
sqlContext, mode, options, data, writeClient, Some(triggerAsyncCompactor))
just confirming that reuse of writeClient across batches is fine..
Yes, this worked fine.
})

// Add Shutdown Hook
Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
this alone should be good enough to prevent the JVM from hanging during exit? do we really need the lastStart/lastEnd logic?
Yes, this and setting daemon mode were good enough.
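The daemon-thread-plus-shutdown-hook pattern the reply refers to can be sketched as follows (method and thread names are illustrative, not the PR's actual code):

```java
public class DaemonCompactorSketch {
    /** Starts a background loop as a daemon thread and registers a cleanup hook. */
    public static Thread startDaemonCompactor(Runnable pollOnce) {
        Thread compactor = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                pollOnce.run();  // e.g. run any pending compaction instants
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "async-compactor");
        // Daemon: JVM exit does not block on this thread, so the streaming
        // driver cannot hang at shutdown waiting for the compactor loop.
        compactor.setDaemon(true);
        compactor.start();
        // Best-effort cleanup when the driver exits normally.
        Runtime.getRuntime().addShutdownHook(new Thread(compactor::interrupt));
        return compactor;
    }

    public static void main(String[] args) {
        Thread t = startDaemonCompactor(() -> { });
        System.out.println(t.isDaemon());  // true
    }
}
```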
}))

// First time, scan .hoodie folder and get all pending compactions
val metaClient = new HoodieTableMetaClient(sqlContext.sparkContext.hadoopConfiguration,
Seems like this will happen on each trigger, not just the first time?
Now it happens only the first time, when the async compactor is null.
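The fix described here — scanning only when the async compactor has not been created yet — is the usual lazy one-time initialization pattern. A simplified sketch (class and method names are illustrative):

```java
import java.util.function.Supplier;

public class LazyCompactorInit {
    private Object asyncCompactor = null;

    /** Runs the expensive scan only on the first micro-batch trigger. */
    public synchronized Object getOrCreate(Supplier<Object> scanPendingCompactions) {
        if (asyncCompactor == null) {
            // e.g. scan the .hoodie folder for pending compactions exactly once
            asyncCompactor = scanPendingCompactions.get();
        }
        return asyncCompactor;
    }

    public static void main(String[] args) {
        LazyCompactorInit sink = new LazyCompactorInit();
        Object first = sink.getOrCreate(Object::new);
        Object second = sink.getOrCreate(Object::new);
        System.out.println(first == second);  // true: scan ran only once
    }
}
```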
@@ -68,7 +74,7 @@
private String tableName = "hoodie_test";

@Parameter(names = {"--table-type", "-t"}, description = "One of COPY_ON_WRITE or MERGE_ON_READ")
- private String tableType = HoodieTableType.MERGE_ON_READ.name();
+ private String tableType = HoodieTableType.COPY_ON_WRITE.name();
why move to COW?
Reverted.
@@ -38,46 +50,65 @@ class HoodieStreamingSink(sqlContext: SQLContext,
private val retryIntervalMs = options(DataSourceWriteOptions.STREAMING_RETRY_INTERVAL_MS_OPT_KEY).toLong
private val ignoreFailedBatch = options(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH_OPT_KEY).toBoolean

private var isAsyncCompactorServiceShutdownAbnormally = false
https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-StreamingQueryManager.html seems like there are some listeners we can exploit to know of a StreamingQuery?
Thanks. Fixed by setting daemon mode for the async compactor thread.
Codecov Report
@@ Coverage Diff @@
## master #1752 +/- ##
============================================
- Coverage 62.82% 59.95% -2.87%
- Complexity 3437 3609 +172
============================================
Files 401 439 +38
Lines 17091 19088 +1997
Branches 1698 1943 +245
============================================
+ Hits 10737 11445 +708
- Misses 5623 6850 +1227
- Partials 731 793 +62
Force-pushed from d542689 to 54e2e25.
All tests pass now. Will run this in a setup for a few hours before merging.
Force-pushed from 8d515f9 to f3736c2.
For the remaining 2 questions, here is the answer:

> what if the user sets the writeClient config for inline compaction = false and does not set the async compaction datasource option?

> basic question.. if there are no writes, no compaction gets scheduled, right? so async compaction is a no-op, i.e. it will check if there is some work to do and, if not, won't trigger anything?
Merging as was previously discussed.
This PR is dependent on #1577. It has 2 commits: the first commit corresponds to #1577 and the second commit is for this PR.
Contains: