New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend job search for hdcas #4665

Merged
merged 17 commits into from Sep 29, 2017

Conversation

Projects
None yet
4 participants
@mvdbeek
Member

mvdbeek commented Sep 21, 2017

This PR extends the job search:

  • Jobs that used dataset collections can now be found correctly (jobs have to be newer than this PR, see 19818cb)
  • We check that a input dataset is supplied at the correct input parameter (previously jobs with same inputs for different parameters would be returned as well)
  • The core of the functionality is now in galaxy.manager.jobs so that it can be re-used for #4468
  • Some new API tests that exercise this functionality.

mvdbeek added some commits Sep 20, 2017

Check JobParameter match
Matching the JobParameters will prevent returning jobs where a job uses the
same input datasets, but where the inputs are used in a different order (and
hence the result may differ)
Allow job_search to find job even if current HDCA was copied from ano…
…ther HDCA and make sure that jobs with same inputs matched to different job parameter inputs are not considered as a match
Test deleting input datasets and recovering jobs from search
This revealed a wrong query filter (comparing
JobToInputDatasetAssociation.dataset_id == HistoryDatasetAssociation.id,
instead of JobToInputDatasetAssociation.dataset_id ==
HistoryDatasetAssociation.dataset_id).
from galaxy.managers.lddas import LDDAManager
class JobSearch(object):

This comment has been minimized.

@jmchilton

jmchilton Sep 23, 2017

Member

I'd not put this file in galaxy.jobs, I think you are implementing something that belongs in galaxy.managers somewhere - probably galaxy.managers.jobs. galaxy.jobs to me feels more like a low level thing related to dealing with actually running jobs. This is implementing manager-style business logic.

mvdbeek added some commits Sep 24, 2017

Fix loading of model.JobToInputDatasetCollectionAssociation
If we don't load this relationship we do not properly record a job's
input_dataset_collections in `DefaultToolAction._record_inputs`.
Filter jobs on input data using JobToInput*Association.
This has the advantage that for complex parameters we don't need to decompose
parameter names like `queries_0|input2` in order to compare JobParameters.

@mvdbeek mvdbeek changed the title from WIP: Extend job search for hdcas to Extend job search for hdcas Sep 24, 2017

Skip recording JobToInputDatasetCollection relations
if the item to record is not a HistoryDatasetCollectionAssociation. This fixes

```
Error executing tool: Attempting to flush an item of type <class
\'galaxy.model.DatasetCollectionElement\'> as a member of collection
"JobToInputDatasetCollectionAssociation.dataset_collection". Expected an object
of type <class \'galaxy.model.HistoryDatasetCollectionAssociation\'> or a
polymorphic subclass of this type.' ```

It isn't ideal, but before 19818cb we didn't record Collection-like elements at all.
@@ -0,0 +1,161 @@
import json

This comment has been minimized.

@nsoranzo

nsoranzo Sep 25, 2017

Member

Missing blank line here.

from galaxy import model
from galaxy.managers.hdas import HDAManager
from galaxy.managers.collections import DatasetCollectionManager

This comment has been minimized.

@nsoranzo

nsoranzo Sep 25, 2017

Member

Wrong import order (you may want to add this file to .ci/flake8_lint_include_list.txt so it will be checked)

return self.__search(tool_id=tool_id, user=user, input_param=input_param, input_data=input_data, job_state=job_state)
def one_not_deleted(self, items):
return any((True for item in items if not item.deleted))

This comment has been minimized.

@nsoranzo

nsoranzo Sep 25, 2017

Member

You can remove the internal parentheses.

Linting fixes
and drop unnecessary internal parenthesis (thanks @nsoranzo!).
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Sep 25, 2017

Putting this back in WIP, there is probably a problem parsing the input parameters.

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Sep 26, 2017

(Mostly writing this down for myself ...)
I'm having some problems with repeats, which can contain a mix of one or more data input parameters and non-data parameters. Repeat parameters are dumped like this and can be retrieved as JobParameter values from the database:

'[{"__index__": 0, "f1": {"values": [{"src": "hda", "id": 1}]}}]'

This example doesn't contain additional nested repeats, or non-input data, but the problem is that if I query for exactly this JobParameter, I will not be able to find jobs that ran on equivalent input data (A key motivation in #4468). Ideally I would like to filter jobs that have this JobParameter, or an equivalent JobParameter where the hda id points to the same dataset as the hda with id 1.

My first thought was that instead of querying just for this parameter I could replace the hda id with all equivalent hda ids and instead of using JobParameter.value == v I could do JobParameter.value.in_(v). The problem is that this increases in complexity with each hda and data input parameter.

My next idea was to replace the id with a wildcard and do JobParameter.like(v), but the % wildcard would potentially result in false positives and isn't ideal either.

I am thinking now that I could possibly query for jobs with matching input and output dataset(-collection)s and JobParameters that don't contain a InputDataset. I could then go through the list of possibly equivalent jobs, fill in equivalent dataset ids in the JobParameter dumps and keep only those jobs that match.

@mvdbeek mvdbeek added status/review and removed status/WIP labels Sep 28, 2017

@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Sep 28, 2017

Alright, I've implemented my last idea, and it looks like this works pretty well.

Tighten job search result
We now expand incoming tool parameters, so that we have a dictionary of all
tool parameters (i.e specified/posted parameters and defaults). We start the
search by quering the database for jobs using the same tool id for the same
owner with the same input datasets, taking into account copied
dataset(-collection)s.  We then replace all dataset ids (HDA/LDA or HDCA) in
the parameter dict with the input data used for the jobs queried in the first
step. This allows us to check every parameter for a match with the job.
@@ -68,6 +68,9 @@
FILENAME_VALID_CHARS = '.,^_-()[]0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
defaultdict = collections.defaultdict

This comment has been minimized.

@jmchilton

jmchilton Sep 29, 2017

Member

I get the impulse here - it is a utility so add it to galaxy.util - but I'd generally prefer dependencies on collections over dependencies on galaxy.util in code that consume this. galaxy.util is very light and doesn't have many nested dependencies but it will never be a portable as collections. I'm not -1 or anything - I'm going to merge this - just sharing my two cents.

This comment has been minimized.

@mvdbeek

mvdbeek Oct 16, 2017

Member

@jmchilton I would have imported from collections, but I'm using defaultdict in managers.jobs, but doing from collections import defaultdict doesn't work there because we also have managers.collections, this is just a hack around that.

@jmchilton jmchilton merged commit 4b7bd98 into galaxyproject:dev Sep 29, 2017

6 checks passed

api test Build finished. 297 tests run, 4 skipped, 0 failed.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
framework test Build finished. 161 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 46 tests run, 0 skipped, 0 failed.
Details
lgtm analysis: JavaScript No alert changes
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details
@mvdbeek

This comment has been minimized.

Member

mvdbeek commented Oct 16, 2017

Thanks for the merge, just saw this now :).

@mvdbeek mvdbeek deleted the mvdbeek:extend_job_search branch Jun 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment