HIVE-24068: Add re-execution plugin for handling DAG submission and unmanaged AM failures #1428

Merged
merged 3 commits into from Aug 26, 2020

Conversation

prasanthj
Contributor

What changes were proposed in this pull request?

DAG submission can also fail in environments where the AM container has died, causing DNS issues. DAG submissions are safe to retry because the DAG has not started executing yet. Retries already exist at the getSession and submitDAG levels individually, but some submitDAG failures need to retry getSession as well, because the AM may be unreachable. This can be handled in a re-execution plugin. This PR adds a new re-execution plugin for intermittent DAG submission failures.
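
The retry shape described above can be sketched in Java roughly as follows. This is an illustrative sketch only, not the code in this PR: `DagSubmitRetrySketch`, `Session`, and `submitWithReacquire` are hypothetical names, and the actual change plugs into Hive's re-execution framework rather than looping inline. The point it shows is that each retry goes back to getSession instead of resubmitting to the same, possibly unreachable, AM.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.function.Supplier;

final class DagSubmitRetrySketch {

    /** Hypothetical stand-in for a session bound to one AM; not a Hive or Tez type. */
    interface Session {
        String submitDag(String dagName) throws IOException;
    }

    /**
     * Submits a DAG, re-acquiring the session on every attempt.
     * Retrying is assumed safe because a failed submission means the DAG never started.
     */
    static String submitWithReacquire(Supplier<Session> getSession, String dagName, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Fetch a fresh session each time: the previous AM may be gone.
            Session session = getSession.get();
            try {
                return session.submitDag(dagName);
            } catch (UnknownHostException e) {
                // Treated here as an intermittent DNS/network failure; try again.
                last = e;
            } catch (IOException e) {
                // Any other I/O failure is not retried in this sketch.
                throw new IllegalStateException("DAG submission failed", e);
            }
        }
        throw new IllegalStateException("DAG submission failed after " + maxAttempts + " attempts", last);
    }
}
```

Which exception types count as retryable is an assumption of this sketch. `UnknownHostException` is used only because it is the failure simulated in the manual test described below.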

Why are the changes needed?

To make Hive resilient to environments with network/DNS issues.

Does this PR introduce any user-facing change?

Yes. Adds the re-exec plugin as a default option.

How was this patch tested?

Manually. The Tez code was changed to explicitly throw UnknownHostException to simulate a DNS/network issue, and the change was tested to confirm that the retry happens.

Member

@kgyrtkirk kgyrtkirk left a comment

There were some other strange things... and it turns out that HIVE-23725 has messed things up around here... this patch at least follows the existing concepts.
+1

Comment on lines 44 to 46
// there could be race condition where getSession could return a healthy AM but by the time DAG is submitted
// the AM could become unhealthy/unreachable (possible DNS or network issues) which can fail tez DAG
// submission. Since the DAG hasn't started execution yet this failure can be safely restarted.
Member

I think this explanation could be moved into the class documentation.

@prasanthj prasanthj changed the title HIVE-24068: Add re-execution plugin for handling DAG submission failures HIVE-24068: Add re-execution plugin for handling DAG submission and unmanaged AM failures Aug 25, 2020
@prasanthj
Contributor Author

@kgyrtkirk thanks for the review! Addressed the review comment. Also handled maxExecutions within the plugin, which was missing before. One minor change added is to update the AM loss plugin to handle unmanaged AM failure as well.
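
For readers following along, bounding re-execution by a maximum count could look roughly like the sketch below. It is a hypothetical illustration, not the plugin code from this PR: the class name, the method signature, and the error-message pattern are all assumptions made for the example.

```java
import java.util.regex.Pattern;

final class RetryDecisionSketch {

    // Hypothetical pattern; the real plugin may match different messages.
    private static final Pattern RETRYABLE =
        Pattern.compile("UnknownHostException|DAG submit failed");

    private final int maxExecutions;

    RetryDecisionSketch(int maxExecutions) {
        this.maxExecutions = maxExecutions;
    }

    /**
     * Returns true only if the failure looks like a submission-time failure
     * and the execution budget has not been used up.
     *
     * @param executionNum 1-based number of the execution that just failed
     */
    boolean shouldReExecute(int executionNum, String errorMessage) {
        if (executionNum >= maxExecutions) {
            return false; // budget exhausted, surface the failure
        }
        return errorMessage != null && RETRYABLE.matcher(errorMessage).find();
    }
}
```

Without the `maxExecutions` check, a persistent DNS outage would cause the query to be re-executed indefinitely.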

@prasanthj prasanthj merged commit cd4154e into apache:master Aug 26, 2020
saihemanth-cloudera pushed a commit to saihemanth-cloudera/hive that referenced this pull request Sep 1, 2020
…nmanaged AM failures (apache#1428)

* HIVE-24068: Add re-execution plugin for handling DAG submission failures

* addressed Zoltan's code review comments. Added unmanaged AM failure to lost AM query plugin.

* fix comments

Co-authored-by: Prasanth Jayachandran <pjayachandran@cloudera.com>