File order in tar-ball from shed_upload #159

Closed
peterjc opened this issue Apr 30, 2015 · 6 comments

Comments


peterjc commented Apr 30, 2015

I know the order does not matter by the time the tar-ball is pushed to the Tool Shed, but for local checks it would be nice if the .shed.yml ordering given in the include directives was preserved.

e.g.

$ cat ~/repositories/pico_galaxy/tools/venn_list/.shed.yml
name: venn_list
owner: peterjc
homepage_url: https://github.com/peterjc/pico_galaxy/tools/venn_list
remote_repository_url: https://github.com/peterjc/pico_galaxy/tools/venn_list
description: Draw Venn Diagram (PDF) from lists, FASTA files, etc
long_description: |
  Draw Venn Diagrams for 1, 2 or 3 sets of identifiers as a PDF file.

  Can parse FASTA, FASTQ, SFF or tabular files (taking column one) for identifiers.
  This can be combined with the extensive tabular file filtering and manipulation
  tools within Galaxy to prepare the input data.

  Uses the R/Bioconductor package limma to draw the PDF file, called from Python with
  rpy.

  Uses Biopython to parse SFF files.
categories:
- Graphics
- Sequence Analysis
- Visualization
type: unrestricted
include:
- README.rst
- venn_list.py
- venn_list.xml
- tool_dependencies.xml

This gave the following:

$ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/venn_list
cp /tmp/tmpWm9B9U shed_upload.tar.gz
$ tar -xvf /tmp/tmpWm9B9U
venn_list.py
tool_dependencies.xml
README.rst
venn_list.xml
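
For this kind of local check, the stored member order can also be read from Python rather than with tar -xvf. This is only a minimal sketch using the standard library tarfile module; the helper name list_tar_members is made up for illustration and is not part of planemo.

```python
import tarfile


def list_tar_members(tar_path):
    """Return member names in the order they are stored in the archive.

    Hypothetical helper for local checks; not a planemo function.
    """
    # "r:*" lets tarfile detect the compression (gzip here) transparently.
    with tarfile.open(tar_path, "r:*") as tar:
        return tar.getnames()
```

Calling it on the temporary file printed by planemo shed_upload --tar_only should give the same order as the tar listing above.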

peterjc commented Apr 30, 2015

Reading over planemo/shed.py, I think the code constructs a temporary folder containing all the files and folders to be turned into the tar-ball? If so, the order is at the mercy of the file system and my request is non-trivial.

Plan B would be sorting by name - sorting the top level is easy but does not alter the order of files in sub-folders, which is down to how tarfile's recursive add works:

$ git diff
diff --git a/planemo/shed.py b/planemo/shed.py
index 6189176..8bd9a2e 100644
--- a/planemo/shed.py
+++ b/planemo/shed.py
@@ -401,7 +401,9 @@ def build_tarball(realized_path, **kwds):
     try:
         tar = tarfile.open(temp_path, "w:gz", dereference=True)
         try:
-            for name in os.listdir(realized_path):
+            # File system order essentially random, so at least sort
+            # top level entries by name:
+            for name in sorted(os.listdir(realized_path)):
                 tar.add(
                     os.path.join(realized_path, name),
                     arcname=name,

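To sort below the top level as well, the recursion could be done by hand instead of relying on tarfile's recursive add. What follows is a rough sketch and not the actual planemo code: the function name build_sorted_tarball and its arguments are assumptions for illustration. It walks the tree with os.walk, sorts the directory names in place so the descent order is fixed, and adds each file individually. Within each directory the files come first, then its sub-directories, so the result is deterministic but not a single global sort by path.

```python
import os
import tarfile


def build_sorted_tarball(source_dir, tar_path):
    """Write a gzipped tar of source_dir with a deterministic file order.

    Illustrative sketch only; names and signature are assumptions.
    """
    with tarfile.open(tar_path, "w:gz", dereference=True) as tar:
        for root, dirs, files in os.walk(source_dir):
            # Sorting in place controls the order os.walk descends in.
            dirs.sort()
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Store paths relative to the top of the realized folder.
                arcname = os.path.relpath(full_path, source_dir)
                tar.add(full_path, arcname=arcname, recursive=False)
```

Only files are added here, so an empty directory would not appear in the archive, and os.walk does not descend into symlinked directories by default.
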
jmchilton commented

Yes. To let all of the shed operations reason uniformly about things, and to allow multiple repositories to be mapped from a single .shed.yml, I really like creating the directory. Happy to add that sort if it helps (or even rework the recursion manually so it always sorts). Given that the include expressions can be globs, I feel that sorting by name for repeatability and consistency is actually preferable to sorting by input order. Is that sufficient, @peterjc?


peterjc commented Apr 30, 2015

Wildcards (and thus what happens to be on disk, and in what order) introduce another level of unpredictability. So yes, for repeatability and consistency there is a lot to be said for always sorting by name rather than by input order.

jmchilton commented

Do you want to commit your patch above then? Happy to do it myself if you don't mind losing commit authorship.


peterjc commented Apr 30, 2015

OK, will do - but this only does a top-level sort, so I'll leave this issue open.

peterjc added a commit that referenced this issue Apr 30, 2015
See GitHub issue #159. For reproducibility it would be nice to
have all the files in the tar-ball sorted by name, rather than
by the order on disk in the temporary folder being compressed.
That will require reworking the recursion as well.
peterjc changed the title from "shed_upload include order not preserved in tar ball" to "File order in tar-ball from shed_upload" on Apr 30, 2015
jmchilton commented

@peterjc Awesome - thanks.

peterjc added a commit to peterjc/planemo that referenced this issue Apr 30, 2015
See GitHub issue galaxyproject#159.

Note this does not explicitly handle directories, thus any empty
directory would be missing from the tar-ball.
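
If empty directories should survive, one option is to add each directory entry explicitly before recursing into it. The sketch below is an assumption about how that could look, not the code in the commit: add_sorted and build_tarball_with_dirs are hypothetical names. It relies on tarfile's add() with recursive=False, which stores a directory entry without its contents.

```python
import os
import tarfile


def add_sorted(tar, path, arcname):
    """Add path to tar, then recurse into directories in name order."""
    # recursive=False stores just this entry, so a directory gets its own
    # record even when it turns out to be empty.
    tar.add(path, arcname=arcname, recursive=False)
    if os.path.isdir(path):
        for name in sorted(os.listdir(path)):
            add_sorted(tar, os.path.join(path, name), arcname + "/" + name)


def build_tarball_with_dirs(source_dir, tar_path):
    """Hypothetical variant that keeps directories, including empty ones."""
    with tarfile.open(tar_path, "w:gz", dereference=True) as tar:
        for name in sorted(os.listdir(source_dir)):
            add_sorted(tar, os.path.join(source_dir, name), name)
```

This version sorts every level by name and keeps directory entries, at the cost of doing the recursion manually; it makes no attempt to guard against symlink loops.
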
peterjc closed this as completed on Apr 30, 2015