
[HUDI-575] Support Async Compaction for spark streaming writes to hudi table #1752

Merged
1 commit merged on Aug 5, 2020

Conversation

bvaradar
Contributor

This PR depends on #1577. It has two commits: the first corresponds to #1577 and the second is for this PR.

Contains:

  1. Structured Streaming async compaction support
  2. Integration tests were missing Structured Streaming coverage; added it to ITTestHoodieSanity with async compaction enabled for MOR tables.

* In case of DeltaStreamer, Spark job scheduling configs are set automatically.
* Because these configs need to be set before the Spark context is initiated,
* this is not automated for Structured Streaming.
* https://spark.apache.org/docs/latest/job-scheduling.html
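For illustration, such scheduler configs would typically be supplied at submit time, before the SparkContext exists. This is only a sketch: the application class, jar name, and allocation-file path below are hypothetical, while `spark.scheduler.mode` and `spark.scheduler.allocation.file` are standard Spark properties.

```shell
# Hypothetical sketch: scheduler configs must be in place before the
# SparkContext starts, so pass them to spark-submit rather than setting
# them from inside the streaming job.
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml \
  --class com.example.MyStreamingApp my-streaming-app.jar
```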

@vinothchandar vinothchandar self-assigned this Jun 21, 2020
@bvaradar bvaradar changed the title [WIP] [HUDI-575] Support Async Compaction for spark streaming writes to hudi table [HUDI-575] Support Async Compaction for spark streaming writes to hudi table Jun 24, 2020
Member

@vinothchandar vinothchandar left a comment

Just some preliminary comments.

I rebased this PR against the latest master and will continue to review.

val asyncCompactionEnabled = isAsyncCompactionEnabled(client, parameters, jsc.hadoopConfiguration())
val compactionInstant: common.util.Option[java.lang.String] =
  if (asyncCompactionEnabled) {
    client.scheduleCompaction(common.util.Option.of(new util.HashMap[String, String](mapAsJavaMap(metaMap))))
Member

Is it possible that nothing is actually scheduled here, since there is nothing to compact?

private def isAsyncCompactionEnabled(client: HoodieWriteClient[HoodieRecordPayload[Nothing]],
                                     parameters: Map[String, String], configuration: Configuration): Boolean = {
  log.info(s"Config.isInlineCompaction ? ${client.getConfig.isInlineCompaction}")
  if (!client.getConfig.isInlineCompaction
Member

What if the user sets the writeClient config for inline = false and does not set the async compaction datasource option? Should we control this at a single level?

* active. If there is no activity for sufficient period, async compactor shuts down. If the sink was indeed active,
* a subsequent batch will re-trigger async compaction.
*/
public class SparkStreamingWriterActivityDetector {
Member

Basic question: if there are no writes, no compaction gets scheduled, right? So async compaction is a no-op, i.e. it checks whether there is some work to do and, if not, won't trigger anything?

Member

@vinothchandar vinothchandar left a comment

Looks close, but we need to iron out the activity detection / async compaction thread shutdown.

}

/**
* Spark Structured Streaming Sink implementations do not have a mechanism to know when the stream is shut down.
Member

Move comments that refer to a subclass implementation to that class itself?

Contributor Author

Done

long currTime = System.nanoTime();
long elapsedTimeSecs = Double.valueOf(Math.ceil(1.0 * (currTime - lastEndBatchTime) / SECS_TO_NANOS)).longValue();
if (elapsedTimeSecs > sinkInactivityTimeoutSecs) {
LOG.warn("Streaming Sink has been idle for " + elapsedTimeSecs + " seconds");
Member

This does not mean there is no work for compaction, right?

Contributor Author

This code is deleted.

@@ -58,6 +58,7 @@
protected static final String PRESTO_COORDINATOR = "/presto-coordinator-1";
protected static final String HOODIE_WS_ROOT = "/var/hoodie/ws";
protected static final String HOODIE_JAVA_APP = HOODIE_WS_ROOT + "/hudi-spark/run_hoodie_app.sh";
protected static final String HOODIE_JAVA_STREAMING_APP = HOODIE_WS_ROOT + "/hudi-spark/run_hoodie_streaming_app.sh";
Member

More importantly, we should also re-enable the test in TestDataSource.

Contributor Author

Enabled it after adding timed retry logic to wait for commits

mode,
options,
data)
sqlContext, mode, options, data, writeClient, Some(triggerAsyncCompactor))
Member

just confirming that reuse of writeClient across batches is fine..

Contributor Author

Yes, this worked fine.

})

// Add Shutdown Hook
Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
Member

This alone should be good enough to prevent the JVM from hanging during exit? Do we really need the lastStart/lastEnd logic?

Contributor Author

Yes, this together with setting daemon mode was good enough.
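The pattern agreed on here (daemon thread plus shutdown hook) can be sketched as follows; the class and method names are illustrative, not Hudi's:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the daemon-thread + shutdown-hook pattern discussed above.
// Names are illustrative, not Hudi's classes. The worker runs on a daemon
// thread, so the JVM can exit even if it is still looping; the shutdown hook
// gives it a chance to stop cleanly.
public class DaemonWorkerSketch {
  private static final AtomicBoolean shutdownRequested = new AtomicBoolean(false);

  public static Thread startWorker(Runnable loopBody) {
    Thread worker = new Thread(() -> {
      while (!shutdownRequested.get()) {
        loopBody.run();
      }
    }, "async-compactor-sketch");
    worker.setDaemon(true); // daemon thread does not block JVM exit
    worker.start();
    Runtime.getRuntime().addShutdownHook(
        new Thread(() -> shutdownRequested.set(true)));
    return worker;
  }
}
```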

}))

// First time, scan .hoodie folder and get all pending compactions
val metaClient = new HoodieTableMetaClient(sqlContext.sparkContext.hadoopConfiguration,
Member

Seems like this will happen on each trigger, not just the first time?

Contributor Author

Now it happens only the first time, when the async compactor is null.
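The settled-on behavior is a standard lazy-initialization pattern. A generic sketch (hypothetical names, not Hudi's code): the expensive pending-compaction scan runs only while the held reference is still null, and later triggers reuse the cached instance.

```java
import java.util.function.Supplier;

// Sketch of the "initialize only on the first batch" pattern described above.
// Names are illustrative, not Hudi's code.
public class LazyInitSketch<T> {
  private T instance;               // null until the first trigger
  private int initCount = 0;        // how many times the factory ran
  private final Supplier<T> factory;

  public LazyInitSketch(Supplier<T> factory) {
    this.factory = factory;
  }

  public synchronized T getOrCreate() {
    if (instance == null) {         // only the first batch pays the cost
      instance = factory.get();
      initCount++;
    }
    return instance;
  }

  public synchronized int initializations() {
    return initCount;
  }
}
```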

@@ -68,7 +74,7 @@
private String tableName = "hoodie_test";

@Parameter(names = {"--table-type", "-t"}, description = "One of COPY_ON_WRITE or MERGE_ON_READ")
private String tableType = HoodieTableType.MERGE_ON_READ.name();
private String tableType = HoodieTableType.COPY_ON_WRITE.name();
Member

why move to COW?

Contributor Author

Reverted.

@@ -38,46 +50,65 @@ class HoodieStreamingSink(sqlContext: SQLContext,
private val retryIntervalMs = options(DataSourceWriteOptions.STREAMING_RETRY_INTERVAL_MS_OPT_KEY).toLong
private val ignoreFailedBatch = options(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH_OPT_KEY).toBoolean

private var isAsyncCompactorServiceShutdownAbnormally = false
Member

https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-StreamingQueryManager.html seems like there are some listeners we can exploit to know of a StreamingQuery?

Contributor Author

Thanks. Fixed by setting up daemon mode for the async compactor thread.

@codecov-commenter

Codecov Report

Merging #1752 into master will decrease coverage by 2.86%.
The diff coverage is 14.68%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1752      +/-   ##
============================================
- Coverage     62.82%   59.95%   -2.87%     
- Complexity     3437     3609     +172     
============================================
  Files           401      439      +38     
  Lines         17091    19088    +1997     
  Branches       1698     1943     +245     
============================================
+ Hits          10737    11445     +708     
- Misses         5623     6850    +1227     
- Partials        731      793      +62     
Flag Coverage Δ Complexity Δ
#hudicli 67.89% <0.00%> (?) 1430.00 <0.00> (?)
#hudiclient 78.35% <0.00%> (-0.89%) 1258.00 <0.00> (ø)
#hudicommon 54.29% <ø> (+0.04%) 1486.00 <ø> (+1.00)
#hudihadoopmr 39.36% <ø> (ø) 163.00 <ø> (ø)
#hudihivesync 72.25% <ø> (ø) 121.00 <ø> (ø)
#hudispark 41.54% <18.89%> (-2.69%) 78.00 <2.00> (+2.00) ⬇️
#huditimelineservice 63.47% <ø> (ø) 47.00 <ø> (ø)
#hudiutilities 73.49% <100.00%> (-0.27%) 284.00 <0.00> (-3.00)
Impacted Files Coverage Δ Complexity Δ
...va/org/apache/hudi/async/AbstractAsyncService.java 2.17% <0.00%> (+0.04%) 1.00 <0.00> (ø)
...ava/org/apache/hudi/async/AsyncCompactService.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...rc/main/java/org/apache/hudi/client/Compactor.java 0.00% <ø> (ø) 0.00 <0.00> (?)
.../hudi/async/SparkStreamingAsyncCompactService.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...di/async/SparkStreamingWriterActivityDetector.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...in/scala/org/apache/hudi/HoodieStreamingSink.scala 0.00% <0.00%> (ø) 0.00 <0.00> (ø)
...tilities/deltastreamer/SchedulerConfGenerator.java 90.90% <ø> (ø) 8.00 <0.00> (ø)
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala 53.60% <50.00%> (-1.16%) 0.00 <0.00> (ø)
...src/main/java/org/apache/hudi/DataSourceUtils.java 54.90% <66.66%> (-0.10%) 27.00 <2.00> (+2.00) ⬇️
...main/scala/org/apache/hudi/DataSourceOptions.scala 93.81% <100.00%> (+0.26%) 0.00 <0.00> (ø)
... and 41 more

@bvaradar
Contributor Author

bvaradar commented Aug 3, 2020

All tests pass now. Will run this in a test setup for a few hours before merging.

@bvaradar bvaradar force-pushed the hudi-575 branch 3 times, most recently from 8d515f9 to f3736c2 Compare August 5, 2020 02:49
@bvaradar
Contributor Author

bvaradar commented Aug 5, 2020

For the remaining two questions, here are the answers:

What if the user sets the writeClient config for inline = false and does not set the async compaction datasource option? Should we control this at a single level?
After discussion, we decided to enable async compaction by default for MOR tables. If async compaction is disabled by config, inline compaction is enabled automatically.
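The default behavior described above can be expressed as a small decision function. This is a hedged sketch: the class, enum, and method names are illustrative, not Hudi's actual API.

```java
// Sketch of the compaction-mode decision described above. Names are
// illustrative, not Hudi's API.
public class CompactionModeSketch {
  public enum Mode { ASYNC, INLINE, NONE }

  public static Mode resolve(boolean isMergeOnRead, Boolean asyncEnabledOption) {
    if (!isMergeOnRead) {
      return Mode.NONE; // COPY_ON_WRITE tables have no compaction to run
    }
    // Async compaction is on by default for MOR; explicitly disabling it
    // falls back to inline compaction automatically.
    boolean async = (asyncEnabledOption == null) || asyncEnabledOption;
    return async ? Mode.ASYNC : Mode.INLINE;
  }
}
```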

Basic question: if there are no writes, no compaction gets scheduled, right? So async compaction is a no-op, i.e. it checks whether there is work to do and, if not, won't trigger anything?
Yes, that is correct.
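That no-op behavior can be illustrated with a minimal sketch: scheduling returns an empty Optional when there is nothing to compact, and the caller simply skips triggering the async compactor. The names here are hypothetical, not Hudi's API.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch of the behavior confirmed above; names are hypothetical.
public class ScheduleSketch {
  public static Optional<String> scheduleCompaction(List<String> candidateFiles,
                                                    String instantTime) {
    if (candidateFiles.isEmpty()) {
      return Optional.empty(); // no writes -> no work -> nothing scheduled
    }
    return Optional.of(instantTime); // compaction planned at this instant
  }
}
```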

@bvaradar
Contributor Author

bvaradar commented Aug 5, 2020

Merging as was previously discussed.


3 participants