[GOBBLIN-1996] Add ability for Yarn app to terminate on finishing of temporal flow #3865
Conversation
Force-pushed from 6666bdd to f479355
  // Start the application
- this.serviceManager.startAsync().awaitHealthy();
+ try {
+   this.serviceManager.startAsync().awaitHealthy(startupTimeout.getSeconds(), TimeUnit.SECONDS);
One of the services, the FSConfigurationManager, has main-thread-interrupting behavior that causes the start call to block indefinitely.
This prevents users from calling the shutdown method, because that method is synchronized. I'm adding a configuration to make the timeout configurable, with the default being to wait indefinitely.
what does it mean for the "start call [to] last indefinitely"?
- the startAsync (it's not actually async)?
- or the awaitHealthy (e.g. the FSConfigMgr never actually becomes healthy)?

also, if awaitHealthy never returns, does that impede our ability to fail-fast when there's a legit problem, like the deployment is DOA?
In this case, startAsync is not really working as expected when using the FSConfigurationManager. The actual work done in the cluster goes through the FSConfigurationManager, which takes the config and produces a source out of the .conf file.
The awaitHealthy call here waits indefinitely and suspends the current thread in the current implementation. It does not impede our ability to fail-fast, because the call often just hangs when there's an issue. If anything, we can fail faster: a configurable timeout allows us to unblock this thread and trigger the shutdown when the Workflow finishes (successfully or unsuccessfully).
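A minimal sketch of the bounded-startup pattern being discussed here. The class shape, constructor wiring, and timeout handling are assumptions for illustration, not the PR's actual code; only the awaitHealthy-with-timeout call mirrors the diff:

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.google.common.util.concurrent.ServiceManager;

public class BoundedStartup {
  private final ServiceManager serviceManager;
  private final Duration startupTimeout; // hypothetical: read from a config key, defaulting to "indefinite"

  public BoundedStartup(ServiceManager serviceManager, Duration startupTimeout) {
    this.serviceManager = serviceManager;
    this.startupTimeout = startupTimeout;
  }

  public void start() {
    try {
      // Bounded wait: a service that never becomes healthy (like the FSConfigurationManager
      // behavior described above) can no longer pin this thread forever, so the synchronized
      // shutdown method stays reachable.
      this.serviceManager.startAsync().awaitHealthy(this.startupTimeout.getSeconds(), TimeUnit.SECONDS);
    } catch (TimeoutException te) {
      // Timing out unblocks the caller; shutdown can then be triggered when the Temporal
      // workflow finishes, instead of hanging on startup.
    }
  }
}
```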
    properties.setProperty(ServiceBasedAppLauncher.APP_STOP_TIME_SECONDS, Long.toString(300));
  }
- this.applicationLauncher = new ServiceBasedAppLauncher(properties, this.clusterName);
+ this.applicationLauncher = new ServiceBasedAppLauncherWithoutMetrics(properties, this.clusterName);
Metrics in these pipelines are causing noisy logging, and the metrics here are not currently used by Temporal in any meaningful way, so I've disabled them here.
Disabling metrics via the Gobblin metrics key is not desirable because we still need GTE emission to work.
  LOGGER.info("Stopping the Gobblin Cluster Manager");
  ...
  if (this.idleProcessThread != null) {
Temporal needs to run in yarn mode, so we do not need the standalone cluster / bare-metal code.
phet left a comment
nice crisp solution... I like it!
Force-pushed from ba5e03e to f1ae369
Force-pushed from f1ae369 to 87a478b
Force-pushed from 87a478b to 4d2f1c7
  public static class Factory {
-   private static final ActivityOptions DEFAULT_OPTS = ActivityOptions.newBuilder().build();
+   private static final ActivityOptions DEFAULT_OPTS = ActivityOptions.newBuilder()
+       .setStartToCloseTimeout(Duration.ofHours(24))
what are the ramifications of this? a job could still run >24 hours right? does this merely constrain the activity that attempts to send GTEs to finish performing the send within 24 hours?
Correct, the job itself can still run for >24 hours. But if for some reason Kafka is down, this operation can take an arbitrary amount of time, so let's cap it at a reasonably high number (24 hours).
There are 2 things to consider here:
- GaaS won't be able to detect that a job has finished until this activity succeeds. If Kafka is down, we want to be able to retry this task until Kafka is back.
- If the job itself is completed but this timer is unable to send the message, I think 24 hours is a reasonable amount of time to retry before giving up. We ideally don't want to have to retry the work just because the timer failed for a brief period of time (e.g. a few hours of downtime).
sure, that's fine. for this, even 8 or 12 hours should be more than sufficient. please add a clarifying comment that this is the max time to attempt the send, potentially waiting out a kafka outage
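One way the requested clarification could look in code. The setStartToCloseTimeout call matches the diff; the retry intervals are illustrative assumptions, not values from the PR:

```java
import java.time.Duration;

import io.temporal.activity.ActivityOptions;
import io.temporal.common.RetryOptions;

public class GteActivityOptions {
  // Max time to attempt sending the GTE, potentially waiting out a Kafka outage.
  // This bounds only the event-emission activity; the job itself may run longer.
  private static final ActivityOptions DEFAULT_OPTS = ActivityOptions.newBuilder()
      .setStartToCloseTimeout(Duration.ofHours(24))
      // Retry intervals below are assumptions for illustration.
      .setRetryOptions(RetryOptions.newBuilder()
          .setInitialInterval(Duration.ofSeconds(10))
          .setMaximumInterval(Duration.ofMinutes(10))
          .build())
      .build();
}
```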
Force-pushed from 0c53cca to 40153a4
phet left a comment
I think we're there, or at least super close!
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
In the above diagram, the Yarn application will terminate when it is done executing the submitted Temporal workflow, and the monitoring Azkaban job will terminate when it sees the Yarn application finish.
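A minimal sketch of that termination pattern, assuming a hypothetical GobblinWorkflow interface, a made-up task queue name, and a stopYarnApplication() placeholder; the real PR's wiring differs:

```java
import io.temporal.client.WorkflowClient;
import io.temporal.client.WorkflowOptions;
import io.temporal.serviceclient.WorkflowServiceStubs;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

public class TerminateOnWorkflowFinish {

  // Hypothetical workflow interface standing in for the PR's actual workflow.
  @WorkflowInterface
  public interface GobblinWorkflow {
    @WorkflowMethod
    void run();
  }

  public static void main(String[] args) {
    WorkflowServiceStubs service = WorkflowServiceStubs.newLocalServiceStubs();
    WorkflowClient client = WorkflowClient.newInstance(service);
    GobblinWorkflow workflow = client.newWorkflowStub(
        GobblinWorkflow.class,
        WorkflowOptions.newBuilder().setTaskQueue("gobblin-temporal-queue").build()); // queue name assumed

    try {
      workflow.run(); // blocks until the submitted workflow completes, successfully or not
    } finally {
      stopYarnApplication(); // hypothetical hook: tear down the Yarn app so Azkaban sees it finish
    }
  }

  private static void stopYarnApplication() {
    // Placeholder for the application shutdown path (e.g. stopping the launched services).
  }
}
```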
Tests
Commits