
Skip job execution if equivalent job exists #4690

Merged
merged 51 commits into from Jan 2, 2018

Conversation

6 participants
@mvdbeek
Member

mvdbeek commented Sep 25, 2017

This PR adds the ability to find and re-use the output of jobs that were run with an identical combination of input files and parameters, which, assuming deterministic tools, should produce the same result.

Possible use-cases:

  • I have run a workflow. An update to a tool I have been using in the workflow has been published, and I would like to rerun the workflow with the new tool. This will now only rerun the steps of the workflow that may change.
  • I have workflows that overlap, for example QC using read trimming, read mapping, duplicate marking and reporting using multiqc. Another analysis workflow with read trimming, read mapping, duplicate marking, and variant calling uses many of the same steps. I can now run the analysis workflow, and it will only re-run the modules that differed.
  • ?

Limitations:

  • The inputs must match exactly. If the input dataset names differ (for example Cut on data1 and Cut on data2), we will not consider the resulting dataset to be equivalent, because a tool may make use of the name to modify its content, for example to add a header. To avoid such problems one can run jobs in clean histories, and then copy the inputs to a new history, so that the datasets carry the same name. We may in the future introduce an additional switch to ignore the name if an element identifier is present.
  • You must be the owner of the input datasets.
  • Either the input dataset's update time (i.e. the last time it was renamed or modified) must be earlier than the equivalent job's creation time, or the input dataset must have been created with Galaxy release 18.01 or newer, which tracks a dataset's metadata with each change.

How can I use this functionality:

This functionality is off by default. There is a user interface, which can be activated by copying config/user_preferences_extra_conf.yml.sample to config/user_preferences_extra_conf.yml (or by copying only the use_cached_job section of the sample config file). Once the admin has activated this, users can choose in their user preferences whether they would like to see this option in the tool form and the workflow submission form (User -> Preferences -> Manage Information -> "Do you want to be able to re-use equivalent jobs?"). If you're using Galaxy via the API, you can pass use_cached_job=True in the tool/workflow submission payload.
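For API users, below is a minimal sketch of what such a request could look like using the requests library. The server URL, API key, and encoded IDs are placeholders, and the use_cached_job key follows the spelling used in this PR's comments; the exact parameter name may differ between Galaxy releases.

```python
import requests

GALAXY_URL = "https://usegalaxy.example.org"  # placeholder server
API_KEY = "YOUR_API_KEY"                      # placeholder key

# Run the "Concatenate datasets" tool on one input, asking Galaxy to reuse an
# equivalent finished job instead of scheduling a new one.
payload = {
    "history_id": "ENCODED_HISTORY_ID",       # placeholder encoded ID
    "tool_id": "cat1",
    "inputs": {"input1": {"src": "hda", "id": "ENCODED_DATASET_ID"}},
    "use_cached_job": True,
}

response = requests.post(
    f"{GALAXY_URL}/api/tools", params={"key": API_KEY}, json=payload
)
response.raise_for_status()
print(response.json())
```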

Future improvements:

  • We may be able to use this functionality in more situations if we can ignore the dataset name, which often varies based on the order of the history items ($tool.name on data1, data2, data3). We could add an additional tool form / workflow submission option that ignores the name when an element identifier is available.
  • We could allow input and output datasets to be shared with, rather than owned by, the user that attempts to execute a job.
  • We could add a tool element that says: ignore name differences. That would be helpful for tools that do not look at the dataset name. Alternatively, it may be possible to parse the command section to see whether the dataset name is being accessed.

@mvdbeek mvdbeek changed the title [WIP] Skip job execution is equivalent job exists [WIP] Skip job execution if equivalent job exists Sep 25, 2017

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from 72b2495 to fbdd8b9 Sep 25, 2017

@mvdbeek mvdbeek added this to the 18.01 milestone Sep 25, 2017

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from fbdd8b9 to 819280b Sep 28, 2017

@mvdbeek

Member Author

mvdbeek commented Sep 28, 2017

So this works now for workflows, collections, mapped-over collections and tools with repeats (which was a little more difficult due to the way JobParameters are stored in the database).
This is still pretty hackish in some parts, but I'm confident this could work, with one major caveat:

If a tool uses the name or element_identifier of a dataset to modify the output content (e.g. by adding a column with the sample name), a user may not get the expected output if the element_identifier or name has changed between job runs. I don't think there is an easy way around this, but I still think this can be very useful: I am not planning to make this the default (it currently is, just to see where/if it crashes in the tests). Most of my workflows wouldn't be affected by this, and this could be enabled on a per-tool / per-workflow basis. Tool authors may even declare (in a future Galaxy profile version) that a tool is safe to use with this feature.

@dannon

Member

dannon commented Sep 28, 2017

Awesome! Can't wait to test this -- this is something we've had on a blue sky wishlist for a long time!

@jmchilton

Member

jmchilton commented Sep 28, 2017

@mvdbeek I'd just add to that list of possible external factors that the dataset's (HDA's) metadata (like the datatype declared metadata) may change and its datatype may change. Might it be worth tracking the last time a dataset's metadata was updated and using that in the calculation? That would catch changes based on name as well. If alternatively you somehow hashed all of these things together - dataset content + metadata - you could think about doing re-runs across even like copied histories and stuff where a new HDA is generated. That approach may be out of scope for this approach and PR - something to think about though.

It all makes me quite nervous to be honest - but I appreciate the tests, the caveats, and that it is off by default. It is pretty awesome.

@bgruening

Member

bgruening commented Sep 28, 2017

This is super awesome! But I'm also nervous about this. Marius, could you implement a dry mode? It could log cases and calculate checksums for both files. I would be willing to test the dry run on our server and report back after a few months.

@mvdbeek

Member Author

mvdbeek commented Sep 29, 2017

Thanks everyone for the comments!

I'd just add to that list of possible external factors that the dataset's (HDA's) metadata (like the datatype declared metadata) may change and its datatype may change. Might it be worth tracking the last time a dataset's metadata was updated and using that in the calculation?

Thanks, that's a good idea, I will do that.

That would catch changes based on name as well.

Currently the job search works by identifying shared Dataset instances (for HDA input) or collections (either a direct match or collections that have the copied_from_collection attribute set). This has the advantage that it works across histories and workflows, and HDAs don't need to have the same name. I could add a check for the name and make this tighter, but I don't think it would solve the issue I mentioned (I would need to check that the job that produced the dataset that could potentially be used as an output dataset for the current job used inputs of the same name -- wait, that may actually be doable, the identifier is encoded in the JobParameters, and I think the name is as well!).
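To make the "shared Dataset instances" idea concrete, here is a rough, simplified sketch of the kind of query involved, assuming Galaxy's SQLAlchemy models; the actual job-search code in this PR checks far more (all parameters, names, metadata, collections, ownership, and timestamps).

```python
from galaxy import model

def find_equivalent_job(sa_session, tool_id, tool_version, param_name, input_hda):
    """Simplified sketch: look for a finished job of the same tool whose input
    bound to ``param_name`` points at the same underlying Dataset as our input
    HDA, even if it is a different HDA in a different history."""
    return (
        sa_session.query(model.Job)
        .join(model.Job.input_datasets)                    # JobToInputDatasetAssociation
        .join(model.JobToInputDatasetAssociation.dataset)  # the candidate job's input HDA
        .filter(
            model.Job.tool_id == tool_id,
            model.Job.tool_version == tool_version,
            model.Job.state == model.Job.states.OK,
            model.JobToInputDatasetAssociation.name == param_name,
            # Shared Dataset instance: different HDAs, same file on disk.
            model.HistoryDatasetAssociation.dataset_id == input_hda.dataset_id,
        )
        .first()
    )
```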

Marius could you implement a dry mode?

That's a good suggestion, I'll think about it. I think that if I manage to implement John's suggestion there shouldn't be a large margin where things can go wrong. I haven't checked out #4659 yet -- maybe we can re-use the hashing.

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from 053b6d9 to 2d4f746 Sep 29, 2017

@mvdbeek

Member Author

mvdbeek commented Oct 2, 2017

So I think for jobs that don't use explicit collection input this should be pretty solid by now. We verify name, element_identifier, datatype and metadata. Mapping over collections should be solid as well, but I'll need to check HDCAs more carefully (applying the same checks that are already in place for HDAs).

One thing that we should improve (maybe not for the initial implementation here) is either an audit trail for mutable things, like HistoryDatasetAssociations, which can change name, datatype, metadata and so on, or recording mutable parameters in the JobToInputDatasetAssociation. I kind of prefer having versioned HDAs, because that may be useful outside of the context of this PR. Without a trace of these changes we can only rely on the HDA's update_time attribute pointing to a date that is older than the job's creation_time. Hiding and deleting HDAs unfortunately alters the update time, so that seriously limits the ability to verify that a job's input is equivalent to the requested input.
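A minimal sketch of the timestamp heuristic described above, using attribute names as in Galaxy's model (the real check is part of the larger job-search logic):

```python
def input_untouched_since(candidate_job, input_hda):
    """Without an audit trail for HDAs, only trust a cached job if the input
    dataset has not been modified (renamed, hidden, re-detected, ...) since
    that job was created."""
    return input_hda.update_time < candidate_job.create_time
```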

@jmchilton

Member

jmchilton commented Oct 6, 2017

@mvdbeek Your implementation of the versioning is very clever. I have a request that is unfair and so you may choose to ignore it I think - I'd prefer if the state history wasn't tracked until datasets were in an "okay" state - or maybe until the job is finished. I've tried to eliminate a lot of flushes and to optimize the dataset creation process and the job finish process a great deal - and having to write additional rows and such during each random flush in there might slow things down and wouldn't really be beneficial. The dataset doesn't really have a complete set of metadata until the end of the job finish process, so why track that? Does that make sense? It is also possible you've already thought of that and it doesn't do that - I just walked through the code - I didn't run it and verify it is doing what I'm worried about (or indeed that what I'm worried about has an appreciable impact on performance).

@mvdbeek

Member Author

mvdbeek commented Oct 6, 2017

Thanks @jmchilton, the implementation is actually close to http://docs.sqlalchemy.org/en/latest/orm/examples.html#versioning-objects, except that we only version HDAs (or classes with a __create_version__ attribute). I think your suggestion makes sense, it should be easy enough to do, and if we have only "meaningful" versions we may in the future expose a history of a dataset in the UI!
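For readers unfamiliar with that SQLAlchemy recipe, here is a minimal, self-contained sketch of the opt-in idea; the snapshot helper and the print are placeholders, and the actual versioning code in this PR differs in detail (it writes real history rows and, per the discussion above, can skip intermediate states).

```python
from sqlalchemy import event, inspect
from sqlalchemy.orm import Session

def snapshot_columns(obj):
    """Hypothetical helper: capture the current column values of a mapped
    object so they could be written to a companion history table."""
    return {attr.key: getattr(obj, attr.key) for attr in inspect(obj).mapper.column_attrs}

@event.listens_for(Session, "before_flush")
def create_versions(session, flush_context, instances):
    # Only objects that opt in via a __create_version__ marker get versioned,
    # and only when they were actually modified in this flush.
    for obj in session.dirty:
        if not getattr(obj, "__create_version__", False):
            continue
        if not session.is_modified(obj, include_collections=False):
            continue
        # A real implementation would persist this snapshot as a new version
        # row; here we only illustrate where the hook sits.
        print("versioning", obj, snapshot_columns(obj))
```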

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch 2 times, most recently from ca084b0 to 9e4fa3c Oct 8, 2017

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch 7 times, most recently from 2c4b28e to f4062bd Oct 17, 2017

@mvdbeek mvdbeek referenced this pull request Dec 8, 2017

Merged

Samtools to pysam #5037

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from f4062bd to ae646fd Dec 10, 2017

@mvdbeek

Member Author

mvdbeek commented Dec 11, 2017

This is getting closer -- it's off by default, and passing use_cached_job=True along to tool / wf payload will attempt to use jobs that have already been run. Next up: UI :).

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from 647fd6e to 075e210 Dec 17, 2017

@mvdbeek

Member Author

mvdbeek commented Dec 17, 2017

@galaxybot test this

@mvdbeek mvdbeek force-pushed the mvdbeek:skip_job_execution branch from 86b9443 to 655af30 Dec 31, 2017

@mvdbeek

Member Author

mvdbeek commented Jan 1, 2018

Hmm, not quite sure why the qunit test is failing now (I only rebuilt the client), it works locally ...

@jmchilton

Member

jmchilton commented Jan 1, 2018

Probably just a transient failure - I think we need to increase some sort of timeout in the karma setup. I reran them.

@jmchilton jmchilton merged commit 32624df into galaxyproject:dev Jan 2, 2018

6 checks passed

api test: Build finished. 343 tests run, 4 skipped, 0 failed.
continuous-integration/travis-ci/pr: The Travis CI build passed
framework test: Build finished. 166 tests run, 0 skipped, 0 failed.
integration test: Build finished. 67 tests run, 0 skipped, 0 failed.
selenium test: Build finished. 118 tests run, 2 skipped, 0 failed.
toolshed test: Build finished. 577 tests run, 0 skipped, 0 failed.
@jmchilton

Member

jmchilton commented Jan 2, 2018

Wonderful!

@bgruening

Member

bgruening commented Jan 2, 2018

This is amazing @mvdbeek! The entire GTN owes you so many beers - training can now be very fast if needed and still use large datasets :)

What a nice start in the new year!!!

@mvdbeek

Member Author

mvdbeek commented Jan 2, 2018

Thanks everyone, I hope this will be useful!
I have updated the initial description and added some hints about limitations and future improvements.

@martenson

Member

martenson commented Jan 2, 2018

@mvdbeek would it justify a special topic in docs?

@mvdbeek

Member Author

mvdbeek commented Jan 2, 2018

Maybe a tutorial or a blog entry would be more visible? I think this is interesting to users as well as admins.

@martenson

Member

martenson commented Jan 2, 2018

@mvdbeek Sounds great to me. Btw, have you thought about using input dataset checksums for this and making it instance-wide (or wider)? How much effort would that be?

@mvdbeek

Member Author

mvdbeek commented Jan 2, 2018

Yeah, I'd love that. I think if we started hashing all datasets with a fast checksum, that might also circumvent the input name / metadata limitations I mentioned above. Computation-wise, calculating a cheap hash should be minor compared to, let's say, aligning a dataset. I think #4659 is already a good step in that direction.
Once we have a checksum associated with a dataset it should be fairly trivial to implement this.
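As an illustration of how cheap such a check could be, here is a small sketch of streaming a dataset file through a hash; the choice of algorithm and how the digest would be stored on the dataset are assumptions, not what #4659 or Galaxy actually implements.

```python
import hashlib

def dataset_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Stream a dataset file through a hash so the digest can be stored with
    the dataset and compared when searching for equivalent jobs."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```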

Going a step further one could even think about looking up public input datasets from public servers and see if someone has already done this job. Completely opt-in of course :).

@martenson

Member

martenson commented Jan 2, 2018

  • Make results of this workflow execution publicly available?
@bernt-matthias

Contributor

bernt-matthias commented Jan 3, 2018

Sounds super great to me. Just one thought: would this also cover data sources, i.e. are multiple downloads avoided if users download exactly the same data (with the same tool parameters)?

@mvdbeek

Member Author

mvdbeek commented Jan 3, 2018

Yes, that should be covered.

@mvdbeek mvdbeek deleted the mvdbeek:skip_job_execution branch Jun 12, 2018
