[GOBBLIN-798] Clean up workflows from Helix when the Gobblin application master starts #2665

htran1 · 2019-06-06T23:43:17Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-XXX] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-798

Description

Here are some details about my PR, including screenshots (if applicable):
If the application master aborts, YARN may spawn a new one. The second application master resubmits the jobs. This results in duplicate jobs in Helix, and multiple instances of the same job may run, producing duplicate data.

The Gobblin application master should clean up all workflows on startup to avoid executing multiple instances of a job.
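A minimal sketch of the failure mode and the fix, assuming a hypothetical in-memory WorkflowStore as a stand-in for the Helix TaskDriver and a hypothetical AppMaster that resubmits a fixed job list on start. Neither class is Gobblin or Helix code. Without cleanup, a second application master leaves two copies of the same job; with cleanup, only one remains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Helix workflow store; NOT the Helix TaskDriver API.
class WorkflowStore {
  private final List<String> workflows = new ArrayList<>();

  void submit(String jobName) {
    workflows.add(jobName);
  }

  // Returns a snapshot copy so callers can iterate while deleting.
  List<String> getWorkflows() {
    return new ArrayList<>(workflows);
  }

  void delete(String workflow) {
    workflows.remove(workflow); // removes one occurrence
  }
}

// Hypothetical application master that (re)submits its jobs every time it starts.
class AppMaster {
  private final WorkflowStore store;
  private final boolean cleanUpOnStart;

  AppMaster(WorkflowStore store, boolean cleanUpOnStart) {
    this.store = store;
    this.cleanUpOnStart = cleanUpOnStart;
  }

  void start(List<String> jobs) {
    if (cleanUpOnStart) {
      // Delete every workflow left behind by a previous application master.
      for (String workflow : store.getWorkflows()) {
        store.delete(workflow);
      }
    }
    for (String job : jobs) {
      store.submit(job);
    }
  }
}
```

Starting two AppMaster instances against the same store with cleanUpOnStart set to false leaves ["jobA", "jobA"]; with it set to true the store ends with a single "jobA".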

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
GobblinYarnAppLauncherTest.testJobCleanup()

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"
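The JIRA reference and rules 1, 2, 3 and 5 above can be checked mechanically; rules 4 and 6 need a human. A minimal sketch of such a checker, assuming the "[GOBBLIN-NNN]" key format from the JIRA section. CommitMessageChecker is a hypothetical helper, not part of Gobblin's tooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical checker for the mechanically verifiable commit-message rules.
class CommitMessageChecker {
  // Assumed JIRA key format, e.g. "[GOBBLIN-798]".
  private static final Pattern JIRA_REF = Pattern.compile("\\[GOBBLIN-\\d+\\]");

  static List<String> check(String message) {
    List<String> problems = new ArrayList<>();
    String[] lines = message.split("\n", -1);
    String subject = lines[0];

    if (!JIRA_REF.matcher(subject).find()) {
      problems.add("subject does not reference a GOBBLIN JIRA issue");
    }
    if (lines.length > 1 && !lines[1].isEmpty()) {
      problems.add("rule 1: subject must be followed by a blank line");
    }
    if (subject.length() > 50) {
      problems.add("rule 2: subject is longer than 50 characters");
    }
    if (subject.endsWith(".")) {
      problems.add("rule 3: subject ends with a period");
    }
    // Body starts after the subject and the blank separator line.
    for (int i = 2; i < lines.length; i++) {
      if (lines[i].length() > 72) {
        problems.add("rule 5: body line " + (i + 1) + " is longer than 72 characters");
      }
    }
    return problems;
  }
}
```

An empty result means the message passes every rule the checker covers.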

sv2000 · 2019-06-07T17:29:40Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterConfigurationKeys.java

+
+  // for cleaning up jobs on cluster manager startup
+  public static final String CLEAN_UP_JOBS_ON_MANAGER_START = GOBBLIN_CLUSTER_PREFIX + "cleanUpJobsOnManagerStart";
+  public static final boolean DEFAULT_CLEAN_UP_JOBS_ON_MANAGER_START = false;


Is the default "false" to preserve the current behavior of Gobblin cluster in non-YARN mode?

Yes, this is to keep the behavior the same.

Removed this option, since the existing behavior of not cleaning up on startup led to the issue fixed in this PR. It is reasonable not to keep an option that preserves the buggy behavior.

sv2000 · 2019-06-07T17:30:41Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterConfigurationKeys.java

@@ -161,4 +161,8 @@

  public static final String CANCEL_RUNNING_JOB_ON_DELETE = GOBBLIN_CLUSTER_PREFIX + "job.cancelRunningJobOnDelete";
  public static final String DEFAULT_CANCEL_RUNNING_JOB_ON_DELETE = "false";
+
+  // for cleaning up jobs on cluster manager startup
+  public static final String CLEAN_UP_JOBS_ON_MANAGER_START = GOBBLIN_CLUSTER_PREFIX + "cleanUpJobsOnManagerStart";


Shouldn't we always clean up on restart?

I have this defaulting to true for YARN mode. For standalone mode we clean up on leadership change. All other modes keep the existing behavior, since I wanted to avoid changing behavior as much as possible. Initial startup already blows away the Helix cluster; this option only matters when YARN restarts the Gobblin application master without a restart of the application launcher.

sv2000 · 2019-06-07T17:35:15Z

gobblin-yarn/src/main/java/org/apache/gobblin/yarn/GobblinApplicationMaster.java

-        GobblinClusterUtils.addDynamicConfig(config), Optional.<Path>absent());
+        GobblinClusterUtils.addDynamicConfig(config)
+        .withFallback(ConfigFactory.parseMap(
+            ImmutableMap.of(GobblinClusterConfigurationKeys.CLEAN_UP_JOBS_ON_MANAGER_START, "true"))),


Should it be ImmutableMap.of(GobblinClusterConfigurationKeys.CLEAN_UP_JOBS_ON_MANAGER_START, DEFAULT_CLEAN_UP_JOBS_ON_MANAGER_START)?

No, this explicitly sets it to "true" for the Gobblin app master. All other Gobblin cluster modes use the default of "false".
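One detail of the diff above: in Typesafe Config, a.withFallback(b) keeps every value already present in a and only takes keys from b that a does not define. So the "true" supplied through parseMap applies only when the application master's own config does not already set the key. A minimal sketch of that precedence for flat keys, using plain maps; FallbackDemo is hypothetical (not the Typesafe Config API), and the full key name is an assumption because the value of GOBBLIN_CLUSTER_PREFIX is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper mimicking primary.withFallback(fallback) for flat keys:
// primary values win, fallback only fills in keys the primary lacks.
final class FallbackDemo {
  // Assumed full key name; GOBBLIN_CLUSTER_PREFIX's real value is not shown here.
  static final String KEY = "gobblin.cluster.cleanUpJobsOnManagerStart";

  static Map<String, String> withFallback(Map<String, String> primary, Map<String, String> fallback) {
    Map<String, String> merged = new HashMap<>(fallback);
    merged.putAll(primary); // primary entries overwrite fallback entries
    return merged;
  }

  static boolean cleanUpOnStart(Map<String, String> userConfig) {
    Map<String, String> appMasterDefault = new HashMap<>();
    appMasterDefault.put(KEY, "true");
    Map<String, String> effective = withFallback(userConfig, appMasterDefault);
    // If neither map had the key, fall back to the cluster-wide default of false.
    return Boolean.parseBoolean(effective.getOrDefault(KEY, "false"));
  }
}
```

With an empty user config the result is true; a user config that sets the key to "false" still wins.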

sv2000 · 2019-06-07T17:36:46Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinHelixMultiManager.java

  private void cleanUpJobs(HelixManager helixManager) {
    // Clean up existing jobs
    TaskDriver taskDriver = new TaskDriver(helixManager);

    Map<String, WorkflowConfig> workflows = taskDriver.getWorkflows();

+    log.debug("cleanUpJobs workflow count {} workflows {}", workflows.size(), workflows);


Maybe just dump workflows.keySet() instead of the entire map?
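With SLF4J the suggestion amounts to passing workflows.keySet() instead of workflows as the second argument to log.debug. A minimal sketch of the resulting message, assuming a hypothetical WorkflowLogSummary helper that uses String.format in place of the logger. The values are typed as "?" so the sketch does not depend on Helix's WorkflowConfig.

```java
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper showing what the debug line contains when only names are logged.
final class WorkflowLogSummary {
  static String summarize(Map<String, ?> workflows) {
    // Copying the keys into a TreeSet gives a stable, sorted "[a, b]" rendering
    // and omits the (potentially large) config values entirely.
    return String.format("cleanUpJobs workflow count %d workflows %s",
        workflows.size(), new TreeSet<>(workflows.keySet()));
  }
}
```

For a map with keys "jobB" and "jobA" this yields "cleanUpJobs workflow count 2 workflows [jobA, jobB]".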

sv2000 · 2019-06-07T17:45:13Z

gobblin-cluster/src/main/java/org/apache/gobblin/cluster/GobblinClusterManager.java

@@ -264,6 +268,12 @@ public synchronized void start() {
    this.eventBus.register(this);
    this.multiManager.connect();

+    // Standalone mode registers a handler to clean up on leadership change, so don't do the cleanup
+    // now even if the option to clean up on startup is set.
+    if (this.cleanUpJobsOnStartup && !this.isStandaloneMode) {


Is this check needed for correctness or to avoid duplicate cleanup calls? If it is the latter, shouldn't the second call be handled as a no-op?

It is to avoid duplicate cleanup calls. The code would be correct without this check. A standalone-mode manager can transition through multiple leadership change events, and there is only one cleanup call per transition, so there is no need to dedupe or no-op in that code path.

Workflows will always be deleted on cluster manager startup in non-standalone mode. For standalone mode the cleanup will be in the existing leadership change path.
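The resulting placement of the cleanup can be modeled as below. This is a hypothetical sketch, not the actual GobblinClusterManager code: ClusterManagerSketch and onBecomeLeader are illustrative names, and the cleanup only increments a counter instead of deleting Helix workflows. A non-standalone manager cleans up once in start(); a standalone manager skips that call and cleans up once per leadership transition.

```java
// Hypothetical model of where cleanup runs; counts calls instead of touching Helix.
class ClusterManagerSketch {
  private final boolean standaloneMode;
  int cleanUpCalls = 0;

  ClusterManagerSketch(boolean standaloneMode) {
    this.standaloneMode = standaloneMode;
  }

  private void cleanUpJobs() {
    cleanUpCalls++; // the real code would delete all Helix workflows here
  }

  void start() {
    // Standalone mode cleans up on leadership change instead, so skip it here
    // to avoid a duplicate cleanup call.
    if (!standaloneMode) {
      cleanUpJobs();
    }
  }

  // Assumed leadership callback; in the PR the handler is only registered in
  // standalone mode, which the standaloneMode check models.
  void onBecomeLeader() {
    if (standaloneMode) {
      cleanUpJobs();
    }
  }
}
```

A non-standalone manager records one cleanup after start(); a standalone manager that starts and then becomes leader twice records two, one per transition.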

sv2000

+1. LGTM.

[GOBBLIN-798] Clean up workflows from Helix when the Gobblin application master starts

5ea0ab8

sv2000 requested changes Jun 7, 2019

Removed the option to skip workflow cleanup on startup.

13d0eef

sv2000 reviewed Jun 10, 2019

sv2000 approved these changes Jun 10, 2019

asfgit closed this in af84c57 Jun 10, 2019