Issue with resubmission + dynamic destination? #7118

Open
erasche opened this Issue Dec 14, 2018 · 0 comments

erasche commented Dec 14, 2018

Hi @jmchilton, since we debugged this together last time, maybe you can shed some light on an issue I'm having with resubmission. Since I can't really pass parameters around, I've settled on three destinations: 1x, 1.5x, and 2x memory.

I have the following dynamic destination:

from galaxy.jobs import JobDestination

def testway(tool_id, user, memory_scale=1.0, next_dest=None):
    # On any failure, resubmit to the next destination in the chain
    # (capped at three attempts).
    resubmit = []
    if next_dest:
        resubmit = [{
            'condition': 'any_failure and attempt <= 3',
            'destination': next_dest
        }]

    return JobDestination(
        id='blah',
        runner='condor',
        params={'description': 'aaaa', 'request_memory': '%sG' % memory_scale},
        resubmit=resubmit,
    )

def gateway_1x(tool_id, user):
    return testway(tool_id, user, memory_scale=1, next_dest='gateway_1_5x')

def gateway_1_5x(tool_id, user):
    return testway(tool_id, user, memory_scale=1.5, next_dest='gateway_2x')

def gateway_2x(tool_id, user):
    return testway(tool_id, user, memory_scale=2)
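
For reference, the first hop asks Galaxy to build a destination like the following (just gateway_1x's kwargs spelled out as plain data; note that every hop returns the same literal id):

# What gateway_1x(tool_id, user) returns, written out:
first_hop = dict(
    id='blah',                          # same id for every hop in the chain
    runner='condor',
    params={'description': 'aaaa', 'request_memory': '1G'},
    resubmit=[{
        'condition': 'any_failure and attempt <= 3',
        'destination': 'gateway_1_5x',  # a job conf id, so it re-enters the dynamic mapper
    }],
)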

And the following job conf:

<job_conf>
    <plugins workers="4">
        <plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner" />
    </plugins>
    <destinations default="gateway_1x">

        <destination id="gateway_1x" runner="dynamic">
            <param id="function">gateway</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

        <destination id="gateway_1_5x" runner="dynamic">
            <param id="function">gateway_1_5x</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

        <destination id="gateway_2x" runner="dynamic">
            <param id="function">gateway_2x</param>
            <param id="tmp_dir">True</param>
            <param id="type">python</param>
        </destination>

    </destinations>
</job_conf>

My expectation was that the job would be submitted to the first destination, then resubmitted to the second, and then to the third.
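
Spelled out, the escalation I expect on repeated failures is (a sketch of the intended behaviour, assuming each resubmit destination id is looked up in the job conf and its dynamic function is re-evaluated):

# Intended escalation chain, assuming every attempt fails.
# (destination id, request_memory, resubmit target) per the rules above.
CHAIN = [
    ('gateway_1x',   '1G',   'gateway_1_5x'),
    ('gateway_1_5x', '1.5G', 'gateway_2x'),
    ('gateway_2x',   '2G',   None),  # end of the chain: the job fails for real
]

for attempt, (dest, memory, next_dest) in enumerate(CHAIN, start=1):
    action = "resubmit to '%s'" % next_dest if next_dest else 'fail the job'
    print('attempt %d: %s requests %s; on failure: %s'
          % (attempt, dest, memory, action))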

In practice, I see that it is only resubmitted once:

Galaxy logs
galaxy.tools DEBUG 2018-12-14 10:25:18,113 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Validated and populated state for tool request (12.920 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,158 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Handled output named out_file1 for tool test_history_sanitization (26.258 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,174 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Added output datasets to history (15.416 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,181 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Setup for job Job[unflushed,tool_id=test_history_sanitization] complete, ready to flush (6.363 ms)
galaxy.tools.actions INFO 2018-12-14 10:25:18,216 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Flushed transaction for job Job[id=33,tool_id=test_history_sanitization] (35.057 ms)
galaxy.tools.execute DEBUG 2018-12-14 10:25:18,216 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Tool [test_history_sanitization] created job [33] (91.104 ms)
galaxy.tools.execute DEBUG 2018-12-14 10:25:18,222 [p:60895,w:1,m:0] [uWSGIWorker1Core1] Executed 1 job(s) for tool test_history_sanitization request: (108.683 ms)
galaxy.jobs.mapper DEBUG 2018-12-14 10:25:18,520 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Mapped job to destination id: blah
galaxy.jobs.handler DEBUG 2018-12-14 10:25:18,528 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Dispatching to condor runner
galaxy.jobs DEBUG 2018-12-14 10:25:18,536 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Persisting job destination (destination id: blah)
galaxy.jobs DEBUG 2018-12-14 10:25:18,538 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs.runners DEBUG 2018-12-14 10:25:18,554 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [33] queued (25.845 ms)
galaxy.jobs.handler INFO 2018-12-14 10:25:18,577 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Job dispatched
galaxy.jobs.command_factory INFO 2018-12-14 10:25:18,700 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] Built script [/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh] for tool command [sleep 0; env | sort> /home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat; exit 1]
galaxy.jobs.runners DEBUG 2018-12-14 10:25:18,749 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) command is: rm -rf working; mkdir -p working; cd working; /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh; return_code=$?; cd '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/set_metadata_9qLh3K.py" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/registry.xml" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/working/galaxy.json" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_ZIRRRv,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAssociation_33_E92cHz,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAssociation_33_05ETQK,/home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_6CUyJJ" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:18,780 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) submitting file /home/hxr/arbeit/galaxy/database/pbs/galaxy_33.sh
galaxy.jobs.runners.util.condor DEBUG 2018-12-14 10:25:18,847 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] condor external id: 320
galaxy.jobs.runners.condor INFO 2018-12-14 10:25:18,848 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) queued as 320
galaxy.jobs DEBUG 2018-12-14 10:25:18,849 [p:60895,w:1,m:0] [CondorRunner.work_thread-0] (33) Persisting job destination (destination id: blah)
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:19,043 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:22,115 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:22,175 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/320) job has completed
galaxy.jobs.output_checker INFO 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] Job 33: Fatal error: Exit code 1 ()
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] Tool exit code indicates an error, failing job.
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:22,200 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] job failed, standard error is - [Fatal error: Exit code 1 ()
]
galaxy.jobs.runners.state_handlers.resubmit INFO 2018-12-14 10:25:22,213 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33/320) Job will be resubmitted to 'gateway_1_5x' because tool detected error at the 'blah' destination
galaxy.jobs.mapper DEBUG 2018-12-14 10:25:22,217 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Mapped job to destination id: blah
galaxy.jobs DEBUG 2018-12-14 10:25:22,218 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs DEBUG 2018-12-14 10:25:22,218 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Previous working directory moved to /home/hxr/arbeit/galaxy/database/jobs_directory/_cleared_contents/000/33/20181214-102522
galaxy.jobs DEBUG 2018-12-14 10:25:22,253 [p:60895,w:1,m:0] [CondorRunner.work_thread-1] (33) Persisting job destination (destination id: blah)
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,717 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Job was resubmitted and is being dispatched immediately
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,719 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Recovered destination id (blah) does not exist in job config (but this may be normal in the case of a dynamically generated destination)
galaxy.jobs.handler DEBUG 2018-12-14 10:25:22,719 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Dispatching to condor runner
galaxy.jobs DEBUG 2018-12-14 10:25:22,736 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Persisting job destination (destination id: blah)
galaxy.jobs DEBUG 2018-12-14 10:25:22,743 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] (33) Working directory for job is: /home/hxr/arbeit/galaxy/database/jobs_directory/000/33
galaxy.jobs.runners DEBUG 2018-12-14 10:25:22,762 [p:60895,w:1,m:0] [JobHandlerQueue.monitor_thread] Job [33] queued (42.119 ms)
galaxy.jobs.command_factory INFO 2018-12-14 10:25:22,884 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] Built script [/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh] for tool command [sleep 0; env | sort> /home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat; exit 1]
galaxy.jobs.runners DEBUG 2018-12-14 10:25:22,976 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) command is: rm -rf working; mkdir -p working; cd working; /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/tool_script.sh; return_code=$?; cd '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/set_metadata_R8Sdq5.py" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/registry.xml" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/working/galaxy.json" "/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_8pYEJ6,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAss
ociation_33_Jwg8le,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_iBInaR,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAss
ociation_33_Ndsn0g,/home/hxr/arbeit/galaxy/database/files/000/dataset_33.dat,/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_hbTGcI" 5242880; sh -c "exit $ret
urn_code"
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:23,010 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) submitting file /home/hxr/arbeit/galaxy/database/pbs/galaxy_33.sh
galaxy.jobs.runners.util.condor DEBUG 2018-12-14 10:25:23,069 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] condor external id: 321
galaxy.jobs.runners.condor INFO 2018-12-14 10:25:23,069 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) queued as 321
galaxy.jobs DEBUG 2018-12-14 10:25:23,070 [p:60895,w:1,m:0] [CondorRunner.work_thread-2] (33) Persisting job destination (destination id: blah)
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:38,388 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:41,447 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job is now running
galaxy.jobs.runners.condor DEBUG 2018-12-14 10:25:41,501 [p:60895,w:1,m:0] [CondorRunner.monitor_thread] (33/321) job has completed
galaxy.jobs.output_checker INFO 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Job 33: Fatal error: Exit code 1 ()
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Tool exit code indicates an error, failing job.
galaxy.jobs.output_checker DEBUG 2018-12-14 10:25:41,544 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] job failed, standard error is - [Fatal error: Exit code 1 ()
]
galaxy.jobs DEBUG 2018-12-14 10:25:41,561 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] (33) setting dataset 33 state to ERROR
galaxy.jobs INFO 2018-12-14 10:25:41,606 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Collecting metrics for Job 33
galaxy.jobs DEBUG 2018-12-14 10:25:41,617 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] job 33 ended (finish() executed in (70.957 ms))
galaxy.tools.error_reports DEBUG 2018-12-14 10:25:41,626 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Bug report plugin  generated response None
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,626 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Cleaning up external metadata files
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup MetadataTempFile temp files from /home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X: [Errno 2] No such file or directory: u'/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_in) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_in_HistoryDatasetAssociation_33_ZIRRRv'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_out) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_out_HistoryDatasetAssociation_33_f1Bd3X'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_results_code) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_results_HistoryDatasetAssociation_33_05ETQK'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,631 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_kwds) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_kwds_HistoryDatasetAssociation_33_E92cHz'
galaxy.model.metadata DEBUG 2018-12-14 10:25:41,632 [p:60895,w:1,m:0] [CondorRunner.work_thread-3] Failed to cleanup external metadata file (filename_override_metadata) for HistoryDatasetAssociation_33: [Errno 2] No such file or directory: '/home/hxr/arbeit/galaxy/database/jobs_directory/000/33/metadata_override_HistoryDatasetAssociation_33_6CUyJJ'

erasche added the area/jobs label Dec 14, 2018

erasche added a commit to usegalaxy-eu/infrastructure-playbook that referenced this issue Dec 14, 2018
