
Improve interaction between job state and purged datasets. #3619

Merged
merged 3 commits into galaxyproject:dev from jmchilton:purge_jobs_rework on Mar 27, 2017

Conversation

@jmchilton (Member) commented Feb 15, 2017:

Overview

There are two basic behavior changes here and a bunch of testing to verify the new behaviors.

  1. Previously, purging a single output dataset would cause the creating job to be cancelled, while with deletion all outputs needed to be "deleted" before the job was cancelled. After this change the behavior is the same for both deletion and purging - the job will run until all of its outputs are deleted or purged.

  2. Previously, Galaxy jobs wouldn't attempt to clean up purged datasets that were created by a running tool - this cleanup now occurs.

These are tackled as a pair because I assume the difference in behavior between deletion and purging was a hack to work around tools populating datasets that had been purged while the job was running. Cancelling the job wasn't a sure fix - but it would reduce the likelihood of such problems. I think the new approach is a bit more robust and explicit.
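As a rough illustration of the new rule (a sketch only - the function, attribute names, and class shape below are hypothetical, not the actual Galaxy code):

```python
def should_stop_creating_job(job):
    """Sketch of the new behavior: a creating job is stopped only once every
    one of its outputs is either deleted or purged; purging a single output
    no longer cancels the job on its own."""
    # `job.output_datasets`, `deleted`, and `purged` are assumed attribute
    # names used for this illustration.
    return all(output.deleted or output.purged for output in job.output_datasets)
```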

Additionally, there are a couple of small, atomic commits that clean up a few things related to jobs before tackling this.

Testing

All new tests (and a couple of old ones) can be executed using the command:

./run_tests.sh -api test/api/test_jobs.py

# We need to ignore a potential flush=False here and force the flush
# so that job cleanup associated with stop_creating_job will see
# the dataset as flushed.

@nsoranzo (Member) commented Feb 22, 2017:
s/flushed/purged/ I think

if not purged and not clean_only:
    self.app.object_store.update_from_file(dataset, create=True)
else:
    # If the datset is purged and Galaxy is configured to write directly

@nsoranzo (Member) commented Feb 22, 2017:
s/datset/dataset/

@@ -1297,17 +1302,19 @@ def finish(
        # lets not allow this to occur
        # need to update all associated output hdas, i.e. history was shared with job running
        for dataset in dataset_assoc.dataset.dataset.history_associations + dataset_assoc.dataset.dataset.library_associations:
            purged = dataset.dataset.purged
            trynum = 0

@nsoranzo (Member) commented Feb 22, 2017:
This line can be moved inside the next if.

@@ -1492,7 +1499,8 @@ def path_rewriter( path ):

         # fix permissions
         for path in [ dp.real_path for dp in self.get_mutable_output_fnames() ]:
-            util.umask_fix_perms( path, self.app.config.umask, 0o666, self.app.config.gid )
+            if os.path.exists(path):
+                util.umask_fix_perms( path, self.app.config.umask, 0o666, self.app.config.gid )

@nsoranzo (Member) commented Feb 22, 2017:
This change prevents the logging of an exception if path does not exist - is it safe to hide these from the admins?

@jmchilton (Author, Member) commented Feb 23, 2017:
The path might not exist and that might be fine. Is this a common error message you see that indicates a problem of some kind?

@nsoranzo (Member) commented Feb 23, 2017:
No clue, just worried. Maybe try/except: log.debug() ?
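A minimal sketch of the try/except pattern being suggested here (a generic illustration only, not code from this PR; the helper name and use of os.chmod are stand-ins):

```python
import logging
import os

log = logging.getLogger(__name__)

def fix_perms_quietly(path, mode=0o666):
    """Attempt a permission fix; if the path is missing, log at debug level
    instead of surfacing an exception to the admin logs."""
    try:
        os.chmod(path, mode)
    except OSError:
        # The path may legitimately be gone (e.g. a purged output dataset).
        log.debug("Skipping permission fix, path does not exist: %s", path)
```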

@jmchilton (Author, Member) commented Feb 23, 2017:
That method does its own logging though - at log.exception - and it assumes the path exists; this piece of code shouldn't assume the path exists. I feel like this check is the correct thing to do.

@nsoranzo (Member) commented Feb 23, 2017:
> this piece of code shouldn't assume the path exists

Then this change should be fine, thanks!

dataset_1 = self._get_history_item_as_admin( history_id, outputs[0]["id"] )
dataset_2 = self._get_history_item_as_admin( history_id, outputs[1]["id"] )
if "file_name" in dataset_1:
    output_datset_paths = [ dataset_1[ "file_name" ], dataset_2[ "file_name" ] ]

@nsoranzo (Member) commented Feb 22, 2017:
s/datset/dataset/ here and in the rest of this file (or is there a specific reason for using datset?)

@jmchilton (Author, Member) commented Feb 22, 2017:
Nope - just a typo. I'll fix this - thanks @nsoranzo!

@jmchilton jmchilton force-pushed the jmchilton:purge_jobs_rework branch 3 times, most recently from 8fbab65 to d8ae777 Feb 23, 2017

jmchilton added some commits Feb 13, 2017

Re-use job.finish abstraction in dataset manager stop_creating_job.
It is used in purge a bit above and it is the better abstraction.
Improve interaction between job state and purged datasets.
There are two basic behavior changes here and a bunch of testing to verify the new behaviors.

1) Previously, purging a single output dataset would cause the creating job to be cancelled, while with deletion all outputs needed to be "deleted" before the job was cancelled. After this change the behavior is the same for both deletion and purging - the job will run until all of its outputs are deleted or purged.

2) Previously, Galaxy jobs wouldn't attempt to clean up purged datasets that were created by a running tool - this cleanup now occurs.

These are tackled as a pair because I assume the difference in behavior between deletion and purging was a hack to work around tools populating datasets that had been purged while the job was running. Cancelling the job wasn't a sure fix - but it would reduce the likelihood of such problems. I think the new approach is a bit more robust and explicit.

@jmchilton jmchilton force-pushed the jmchilton:purge_jobs_rework branch from d8ae777 to 009e2cb Mar 20, 2017

@martenson (Member) left a comment:

the changes here make sense to me, thanks @jmchilton

@martenson martenson merged commit 967d0b6 into galaxyproject:dev Mar 27, 2017

5 checks passed

api test: Build finished. 266 tests run, 0 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 140 tests run, 0 skipped, 0 failed.
integration test: Build finished. 25 tests run, 0 skipped, 0 failed.
toolshed test: Build finished. 580 tests run, 0 skipped, 0 failed.
@jmchilton (Author, Member) commented Mar 28, 2017:

Huge thanks for the merge @martenson.

Ping @natefoo - I know this is all stuff you care about so I wanted to make sure you saw these changes.

@jmchilton jmchilton deleted the jmchilton:purge_jobs_rework branch Mar 28, 2017

jmchilton added a commit to jmchilton/galaxy that referenced this pull request Apr 27, 2017

Attempted fix for transiently failing API test_jobs introduced in galaxyproject#3619.

A couple of things need to occur while the jobs are "running", and it seems that 60 seconds wasn't enough time to sleep for all of that to happen on Jenkins. I checked the timing in the raw logs (they are for some reason being filtered out of the Jenkins display of the XUnit results) - and indeed the jobs are properly sleeping the correct amount of time, but the client API calls seem to need some more time. My initial thought was that maybe the jobs were being interrupted or something, but that does not appear to be the case.
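One way to make such API tests less sensitive to timing is to poll for the expected job state with a deadline rather than sleeping a fixed amount; a hypothetical sketch (the `get_job_state` callable and its return value are assumptions, not the test suite's actual helpers):

```python
import time

def wait_for_job_state(get_job_state, job_id, expected_state, timeout=120, poll_interval=1):
    """Hypothetical helper: poll a job via the API until it reaches the
    expected state, failing only after the deadline passes."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_job_state(job_id) == expected_state:
            return
        time.sleep(poll_interval)
    raise AssertionError("job %s did not reach state %r within %s seconds" % (job_id, expected_state, timeout))
```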

martenson added a commit that referenced this pull request Apr 30, 2017

Merge pull request #3988 from jmchilton/job_test_timing_improvements
Attempted fix for transiently failing API test_jobs introduced in #3619.