Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a validator that ensures a dataset's extra_files_path #3918

Merged
merged 4 commits into from Apr 28, 2017

Conversation

Projects
None yet
7 participants
@gregvonkuster
Copy link
Contributor

commented Apr 11, 2017

directory exists and is non-empty

@galaxybot galaxybot added the triage label Apr 11, 2017

@galaxybot galaxybot added this to the 17.05 milestone Apr 11, 2017

@bgruening

This comment has been minimized.

Copy link
Member

commented Apr 13, 2017

@gregvonkuster I'm curious what is the use-case for this?

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 24, 2017

@bgruening Several of my PlantTribes Galaxy tools take as input and produce as output dataset(s) that have associated directories of files via the files_path attribute. Similar to how some existing Galaxy tools produce empty outputs (which are valid), these PlantTribes tools can in some cases produce empty directories of files (again, valid). In these cases, the history item that is associated with these empty directories of files should behave the same as empty input datasets for tools (i.e., not selectable since they are empty).

The use-case for this validator is to ensure that selected inputs do, in fact, have associated files_path directories that are not empty.

An example of one of the tools that uses this validator is here: https://github.com/gregvonkuster/galaxy_tools/blob/master/tools/plant_tribes/gene_family_aligner/gene_family_aligner.xml#L91

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 26, 2017

Please let me know if there is anything blocking this from being merged and I'll address it. Thanks! ;)


def validate( self, value, trans=None ):
if value:
if not os.path.isdir(value.extra_files_path) or len(os.listdir(value.extra_files_path)) == 0:

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Apr 26, 2017

Member

There are some abstractions in galaxy.model.Dataset, could you use those? In particular you have get_size and get_total_size, which in the case of empty extra files would be the same, I think. That also takes care of object_store interaction and caching the result in the database.

This comment has been minimized.

Copy link
@gregvonkuster

gregvonkuster Apr 26, 2017

Author Contributor

Thanks for the quick response on this. Just to confirm, since these abstractions generally relate to files instead of directories, I still need to check if the path is a valid directory, right?

>>> p1 = '/not/a/path'
>>> os.path.getsize(p1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/biotools/galaxy/galaxy/.venv/lib/python2.7/genericpath.py", line 49, in getsize
    return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/not/a/path'

So the only change would be to this line: https://github.com/galaxyproject/galaxy/pull/3918/files#diff-cbfb6ffd518f2a68bf40004cfe889947R263, which could become:

if not os.path.isdir(value.extra_files_path) or value.get_size() == 0:

Is this correct?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Apr 26, 2017

Member

The code in question is this (set_total_size is being called by get_total_size):

    def set_total_size( self ):
        if self.file_size is None:
            self.set_size()
        self.total_size = self.file_size or 0
        if self.object_store.exists(self, extra_dir=self._extra_files_path or "dataset_%d_files" % self.id, dir_only=True):
            for root, dirs, files in os.walk( self.extra_files_path ):
                self.total_size += sum( [ os.path.getsize( os.path.join( root, file ) ) for file in files if os.path.exists( os.path.join( root, file ) ) ] )

So if value.get_total_size() == value.get_size(): should do what you want, right?

This comment has been minimized.

Copy link
@mvdbeek

mvdbeek Apr 26, 2017

Member

The idea is that total_size is the sum of primary file size (what you get with value.get_size()) and all files in extra_files_path or dataset_%d_files (which is not exposed separately). So the extra files total size is zero if value.get_total_size() == value.get_size(). It does call self.object_store.exists with the extra_files_path, so that should be safe without an additional os.path.isdir(value.extra_files_path).

This comment has been minimized.

Copy link
@gregvonkuster

gregvonkuster Apr 26, 2017

Author Contributor

Yes, that works - thanks for the pointer! ;)

gregvonkuster added some commits Apr 26, 2017

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 26, 2017

Hmm...not sure how to fix this issue https://travis-ci.org/galaxyproject/galaxy/jobs/226023677#L203 or the integration test failure. Any ideas?

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Apr 26, 2017

@gregvonkuster The py27-lint Travis build is currently broken due to https://gitlab.com/pycqa/flake8-docstrings/issues/19 .

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Apr 26, 2017

@galaxybot test this

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 27, 2017

Can this be merged?

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 27, 2017

I haven't tested it yet, I will do it in the evening. Do you have a tool or input dataset that I can use to test this?

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 27, 2017

I have several phylogenomics tools in the TTS that use this validator, but I have to comment out the validator since it is not yet included in the Galaxy core.

Here's an example: https://testtoolshed.g2.bx.psu.edu/view/greg/gene_family_aligner/5a5f80ea6306

Unfortunately validators like this that are not part of the Galaxy core result in metadata not being generated for the tool when uploaded to the TS, resulting in an invalid tool which cannot be installed.
It's been a bit painful for me trying to work around this issue since I have several tools that use the validator. This is why I'm trying to get the PR merged as quickly as possible.

Here is a snapshot of the validator in use if that helps. If you'd like to install the above tool from the TTS, you could then edit it on disk to uncomment the validator. You'd have to dummy up a dataset to have the ptorthocs data format. Let me know how you'd like to proceed on this and I'll help out.

validator

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 27, 2017

You'd have to dummy up a dataset to have the ptorthocs data format. Let me know how you'd like to proceed on this and I'll help out.

it would be great if you could post such a dummy dataset. I guess you can still download them from the UI, right?

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 27, 2017

Sure, here is one that is empty and another that is non-empty. of course, you'll have to manually create a files_path directory associated with the dataset that is non-empty. The Galaxy datatype of each of these is ptorthocs.

ptorthocs_dataset.txt
ptorthocs_empty_dataset.txt

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 27, 2017

Sorry @gregvonkuster, I'll not have the time today. In any case I guess the toolshed needs to be updated as well right? I think this will make it to the 17.05 release which the toolsheds will run once it is released.

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 28, 2017

Thanks @mvdbeek for your help on this. I wonder if there is anything more I can do to help out with getting this merged. Hmm...maybe I'll try to stop by and see the core team here on the PSU campus today. Thanks again! ;)

@davebx

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2017

I have tested this locally, and it appears functionally correct. +1 from me.

@davebx

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2017

@galaxybot test this

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 28, 2017

Alright, one last thing is that I think empty_files_path should be empty_extra_files_path and DatasetFilesPathEmptyValidator should be DatasetExtraFilesPathEmptyValidator, after that you got my 👍 as well (which you don't need).

@davebx

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2017

@mvdbeek I'm not entirely sure about that, I'd personally prefer to see the de facto standard moving more towards using files_path over extra_files_path, since the latter is more verbose without adding any particular value.

@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 28, 2017

Well, the extra does imply that this is not the primary dataset. Other than that I don't care too strongly about this.

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Apr 28, 2017

According to galaxy-iuc/standards#40 (comment) , since this is an input dataset validator (IIUC), then @mvdbeek is correct in asking for the "extra" to be added.

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 28, 2017

Ah, ok this makes it very clear:

$input.extra_files_path for inputs and $output.files_path for outputs

I'll rename the validator as requested. Thanks!

@nsoranzo

This comment has been minimized.

Copy link
Member

commented Apr 28, 2017

LGTM

@davebx

This comment has been minimized.

Copy link
Contributor

commented Apr 28, 2017

Looks like we have unanimous approval, merging when tests pass.

@mvdbeek mvdbeek merged commit a32e785 into galaxyproject:dev Apr 28, 2017

4 of 5 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
api test Build finished. 275 tests run, 0 skipped, 0 failed.
Details
framework test Build finished. 148 tests run, 0 skipped, 0 failed.
Details
integration test Build finished. 34 tests run, 0 skipped, 0 failed.
Details
toolshed test Build finished. 579 tests run, 0 skipped, 0 failed.
Details
@mvdbeek

This comment has been minimized.

Copy link
Member

commented Apr 28, 2017

Alright, thanks @gregvonkuster !

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented Apr 28, 2017

Thanks so much everyone for all of the help on this. And thanks for the merge! ;)

@martenson No rush (before Monday), but can you ping me when the TTS is updated with this? I owe beers! ;)

@martenson

This comment has been minimized.

Copy link
Member

commented May 1, 2017

@gregvonkuster TTS has been updated

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2017

Thanks a whole bunch @martenson ')

@bgruening

This comment has been minimized.

Copy link
Member

commented May 1, 2017

@gregvonkuster gregvonkuster deleted the gregvonkuster:fpv branch May 1, 2017

@gregvonkuster

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2017

Sure. Is this all that is needed? #3994

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.