Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-798] Clean up workflows from Helix when the Gobblin applicat… #2665

Closed
wants to merge 2 commits into from

Conversation

htran1
Copy link
Contributor

@htran1 htran1 commented Jun 6, 2019

…ion master starts

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    If the application master aborts a new one may be spawned by YARN. The second application master will resubmit the jobs. This results in duplicate jobs in Helix and multiple instances of the job may run, resulting in duplicate data.

The Gobblin application master should clean up all workflows on startup to avoid executing multiple instances of a job.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    GobblinYarnAppLauncherTest.testJobCleanup()

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"


// for cleaning up jobs on cluster manager startup
public static final String CLEAN_UP_JOBS_ON_MANAGER_START = GOBBLIN_CLUSTER_PREFIX + "cleanUpJobsOnManagerStart";
public static final boolean DEFAULT_CLEAN_UP_JOBS_ON_MANAGER_START = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the default "false" to preserve the current behavior with Gobblin cluster in non-Yarn mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is to keep behavior the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this option since the existing behavior of not cleaning up on startup lead to the issue being fixed in this PR, so it is reasonable to not have an option to preserve the buggy behavior.

@@ -161,4 +161,8 @@

public static final String CANCEL_RUNNING_JOB_ON_DELETE = GOBBLIN_CLUSTER_PREFIX + "job.cancelRunningJobOnDelete";
public static final String DEFAULT_CANCEL_RUNNING_JOB_ON_DELETE = "false";

// for cleaning up jobs on cluster manager startup
public static final String CLEAN_UP_JOBS_ON_MANAGER_START = GOBBLIN_CLUSTER_PREFIX + "cleanUpJobsOnManagerStart";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we always clean up on restart?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this default to true for YARN mode. For standalone mode we clean up on leader ship change. For all other modes the existing behavior is maintained. I wanted to avoid changing behavior as much as possible. Initial startup already blows away the Helix cluster. This is only for yarn restart of the Gobblin application master without a restart of the application launcher.

GobblinClusterUtils.addDynamicConfig(config), Optional.<Path>absent());
GobblinClusterUtils.addDynamicConfig(config)
.withFallback(ConfigFactory.parseMap(
ImmutableMap.of(GobblinClusterConfigurationKeys.CLEAN_UP_JOBS_ON_MANAGER_START, "true"))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be ImmutableMap.of(GobblinClusterConfigurationKeys.CLEAN_UP_JOBS_ON_MANAGER_START, DEFAULT_CLEAN_UP_JOBS_ON_MANAGER_START) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is explicitly setting it to "true" for the Gobblin app master. All other gobblin cluster modes use the default of "false".

private void cleanUpJobs(HelixManager helixManager) {
// Clean up existing jobs
TaskDriver taskDriver = new TaskDriver(helixManager);

Map<String, WorkflowConfig> workflows = taskDriver.getWorkflows();

log.debug("cleanUpJobs workflow count {} workflows {}", workflows.size(), workflows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just dump workflows.keySet() instead of the entire map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -264,6 +268,12 @@ public synchronized void start() {
this.eventBus.register(this);
this.multiManager.connect();

// Standalone mode registers a handler to clean up on leadership change, so don't do the cleanup
// now even if the option to clean up on startup is set.
if (this.cleanUpJobsOnStartup && !this.isStandaloneMode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check needed for correctness or to avoid duplicate clean up calls? If it is the latter, shouldn't the 2nd call be handled as a No-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to avoid duplicate clean up calls. The code would be correct without this check. A standalone mode manager can transition through multiple leadership change events and there is only one clean up call per transition, so there is no need to dedupe/no-op in that code path.

Workflows will always be deleted on cluster manager startup in non-standalone mode.
For standalone mode the cleanup will be in the existing leadership change path.
Copy link
Contributor

@sv2000 sv2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. LGTM.

Copy link
Contributor

@sv2000 sv2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. LGTM.

@asfgit asfgit closed this in af84c57 Jun 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants