Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added resubmit handlers on Pulsar connection failures #7053

Merged
merged 6 commits into from Dec 21, 2018

Conversation

@nuwang
Copy link
Member

nuwang commented Nov 28, 2018

Currently, the Pulsar runner does not have support for resubmit handlers. This PR adds support for resubmit handlers to the Pulsar runner if an unknown error occurs during job preparation or finalisation. For example:

        <destination id="initial_pulsar" runner="pulsar_rest">
            <param id="url">https://<some_pulsar_server>:8913/</param>
            <param id="private_token">some_token</param>
            <resubmit condition="unknown_error" destination="local" />
        </destination>

In addition, it also introduces a referrer parameter for dynamic destinations. The referrer parameter is useful for finding out the resubmit conditions of a dynamic destination's referrer, or any of its parameters.
For example:

        <destination id="galaxycloudrunner" runner="dynamic">
            <param id="type">python</param>
            <param id="function">cloudlaunch_pulsar_burst</param>
            <param id="fallback_destination">local</param>
            <resubmit condition="unknown_error and attempt &lt;= 7" destination="galaxycloudrunner" />
        </destination>

With the referrer parameter being accessible within the dynamic function as:

def cloudlaunch_pulsar_burst(app, referrer):
    print(referrer.id) // should print galaxycloudrunner
@galaxybot galaxybot added the triage label Nov 28, 2018
@galaxybot galaxybot added this to the 19.01 milestone Nov 28, 2018
nuwang added 3 commits Nov 28, 2018
@nuwang nuwang force-pushed the nuwang:pulsar_resubmit branch from d1d3142 to 2632c07 Nov 28, 2018
@jmchilton

This comment has been minimized.

Copy link
Member

jmchilton commented Dec 12, 2018

Can reproduce the test error locally, trying to figure it out though...

@nuwang

This comment has been minimized.

Copy link
Member Author

nuwang commented Dec 13, 2018

Thanks for looking into this. I was hoping that you could help, because I can't immediately see a connection between the changes and why the tool output is empty, as the resubmission works as expected when I run against the galaxycloudrunner.

jmchilton added 2 commits Dec 21, 2018
More uniform error hanlding to fix test. Simplify test case of resubmission to test just jobs resubmit properly (no need to test Pulsar-to-Pulsar resubmission in particular here I don't think, maybe as a second test - but not a first?).
@jmchilton

This comment has been minimized.

Copy link
Member

jmchilton commented Dec 21, 2018

I think the problem is finish being called there? I can see you getting a command_line in production but the transport failing so that you'd get a chance to call fail_job without finish. Regardless of the specifics, I feel like the approach I pushed is a bit more uniform and it seems to cause the now simplified test of resubmission to pass. Locally this test takes a minute or two to pass though - I'm a little concerned about including it in the suite of tests that run with each PR for that reason. 🤔

@nuwang

This comment has been minimized.

Copy link
Member Author

nuwang commented Dec 21, 2018

@jmchilton I totally missed that - thanks for tracking it down. Perhaps the test will run faster if we use an invalid dns name instead of a reserved IP perhaps? I can try that next.

@jmchilton jmchilton merged commit 26c950f into galaxyproject:dev Dec 21, 2018
2 of 6 checks passed
2 of 6 checks passed
api test Test started.
Details
framework test Test started.
Details
integration test Test started.
Details
selenium test Test started.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
toolshed test Build finished. 577 tests run, 0 skipped, 0 failed.
Details
@jmchilton

This comment has been minimized.

Copy link
Member

jmchilton commented Dec 21, 2018

I'm just going to assume this works, if it is broken in production it should be close - hopefully it is just a small bug fix PR to fix it. galaxycloudrunner looks awesome, keep me updated on the amazing things you do with it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.