Command to compare repositories in different Tool Sheds #27

Closed
peterjc opened this issue Nov 25, 2014 · 5 comments · Fixed by #33

Comments

@peterjc
Contributor

peterjc commented Nov 25, 2014

I'm not sure if this falls under the planemo scope, but posting it here for discussion at least.

My workflow is to release tools on the Test Tool Shed first and then, if the functional tests pass, upload them to the main Tool Shed. To support this, I would like a "ToolShed diff" command which could be used as follows:

$ shed_diff https://toolshed.g2.bx.psu.edu/view/peterjc/blast_rbh https://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_rbh
...

I would like this to output something along the following lines. This is a bit of a hack using the command line tools hg and diff to fetch and compare the files from the Tool Shed, here showing a harmless diff in the dependencies:

$ hg clone https://peterjc@toolshed.g2.bx.psu.edu/repos/peterjc/blast_rbh blast_rbh_main
$ hg clone https://peterjc@testtoolshed.g2.bx.psu.edu/repos/peterjc/blast_rbh blast_rbh_test
$ rm -rf blast_rbh_main/.hg
$ rm -rf blast_rbh_test/.hg
$ diff -r blast_rbh_main blast_rbh_test
diff -r blast_rbh_main/tools/blast_rbh/tool_dependencies.xml blast_rbh_test/tools/blast_rbh/tool_dependencies.xml
4c4
<         <repository changeset_revision="5477a05cc158" name="package_biopython_1_64" owner="biopython" toolshed="https://toolshed.g2.bx.psu.edu" />
---
>         <repository changeset_revision="268128adb501" name="package_biopython_1_64" owner="biopython" toolshed="https://testtoolshed.g2.bx.psu.edu" />
7c7
<         <repository changeset_revision="0fe5d5c28ea2" name="package_blast_plus_2_2_30" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" />
---
>         <repository changeset_revision="f69b90d89b62" name="package_blast_plus_2_2_30" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu" />
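
For discussion, the hg and diff steps above could be wrapped roughly as follows. This is only a sketch under some assumptions: the function names are made up for illustration, the `A` and `B` folder names are arbitrary, and `--exclude` is assumed to be available (it is in GNU diff), which avoids deleting the `.hg` folders by hand.

```python
import subprocess
import tempfile
from pathlib import Path


def clone_commands(url_a, url_b, workdir):
    """Return the two hg clone commands (hypothetical helper, nothing is run here)."""
    workdir = Path(workdir)
    return [
        ["hg", "clone", url_a, str(workdir / "A")],
        ["hg", "clone", url_b, str(workdir / "B")],
    ]


def diff_command():
    """Return a recursive diff command that skips the Mercurial metadata."""
    # Assumption: a diff supporting --exclude, so the .hg folders can stay in place.
    return ["diff", "-r", "--exclude=.hg", "A", "B"]


def shed_diff(url_a, url_b):
    """Clone both repositories into a temporary folder and print their diff."""
    with tempfile.TemporaryDirectory(prefix="tool_shed_diff_") as workdir:
        for cmd in clone_commands(url_a, url_b, workdir):
            subprocess.check_call(cmd)
        # diff exits with status 1 when the trees differ, so check_call is not used.
        return subprocess.call(diff_command(), cwd=workdir)
```

Calling `shed_diff()` with the two `/repos/` URLs from the example needs `hg`, `diff` and network access, and returns the exit status of `diff`.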

In terms of the command line API, alternative ways to specify the Tool Sheds might make sense too. I'd probably set up an alias like this for the typical case where the same author ID owns both repositories:

$ toolshed_main_test_diff peterjc/blast_rbh
...

This would help greatly in spotting when I have forgotten to push an update from the Test Tool Shed to the main Tool Shed.

However, it would also be nice to compare any two tools (e.g. alternative versions of a wrapper from two different authors), which the full URL style would allow.
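
One way the two calling styles could be handled is sketched below. The function names are hypothetical, and the mapping from `/view/` page URLs to `/repos/` clone URLs is an assumption based on the URLs used earlier in this comment.

```python
MAIN_SHED = "https://toolshed.g2.bx.psu.edu"
TEST_SHED = "https://testtoolshed.g2.bx.psu.edu"


def view_to_repo_url(url):
    """Turn a /view/owner/name page URL into the /repos/owner/name clone URL."""
    return url.rstrip("/").replace("/view/", "/repos/", 1)


def resolve_targets(args):
    """Return the pair of clone URLs to compare.

    A single "owner/name" argument means main versus test for the same owner;
    two full URLs are used as given (after the /view/ to /repos/ mapping).
    """
    if len(args) == 1 and "://" not in args[0]:
        owner, name = args[0].strip("/").split("/")
        return (
            f"{MAIN_SHED}/repos/{owner}/{name}",
            f"{TEST_SHED}/repos/{owner}/{name}",
        )
    if len(args) == 2:
        return tuple(view_to_repo_url(arg) for arg in args)
    raise ValueError("expected owner/name or two Tool Shed URLs")
```

With this, `resolve_targets(["peterjc/blast_rbh"])` covers the alias case, while passing two full URLs covers comparing repositories from different authors.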

@peterjc
Contributor Author

peterjc commented Nov 25, 2014

Python script doing the above quick-and-dirty hack using hg and diff: https://gist.github.com/peterjc/13653e6907d75c470d01

@peterjc
Contributor Author

peterjc commented Nov 25, 2014

A related task is comparing the current directory (possibly under git version control) with a remote Tool Shed, e.g. to see if you need to upload a new tar-ball or not. However, this runs into the issues flagged in #26 about how to determine which files to look at.

@peterjc
Contributor Author

peterjc commented Nov 25, 2014

I updated my gist so the Python script can now also compare a remote repository to local files (ignoring local files not present in the remote repository, which is generally what I need with my Galaxy Tool development setup).

Sample output, perhaps too verbose, comparing my development repository to the Test Tool Shed (identical bar trivial differences in tool_dependencies.xml):

$ shed_diff https://testtoolshed.g2.bx.psu.edu/view/peterjc/seq_select_by_id
Fetching https://testtoolshed.g2.bx.psu.edu/repos/peterjc/seq_select_by_id
diff /tmp/tool_shed_diff_zpAPqC/remote/tools/seq_select_by_id/seq_select_by_id.xml /mnt/galaxy/pico_galaxy/tools/seq_select_by_id/seq_select_by_id.xml
diff /tmp/tool_shed_diff_zpAPqC/remote/tools/seq_select_by_id/tool_dependencies.xml /mnt/galaxy/pico_galaxy/tools/seq_select_by_id/tool_dependencies.xml
4c4
<         <repository changeset_revision="ac9cc2992b69" name="package_biopython_1_62" owner="biopython" toolshed="https://testtoolshed.g2.bx.psu.edu" />
---
>         <repository name="package_biopython_1_62" owner="biopython" />
diff /tmp/tool_shed_diff_zpAPqC/remote/tools/seq_select_by_id/README.rst /mnt/galaxy/pico_galaxy/tools/seq_select_by_id/README.rst
diff /tmp/tool_shed_diff_zpAPqC/remote/tools/seq_select_by_id/seq_select_by_id.py /mnt/galaxy/pico_galaxy/tools/seq_select_by_id/seq_select_by_id.py
diff /tmp/tool_shed_diff_zpAPqC/remote/test-data/k12_hypothetical.fasta /mnt/galaxy/pico_galaxy/test-data/k12_hypothetical.fasta
diff /tmp/tool_shed_diff_zpAPqC/remote/test-data/k12_hypothetical.tabular /mnt/galaxy/pico_galaxy/test-data/k12_hypothetical.tabular
diff /tmp/tool_shed_diff_zpAPqC/remote/test-data/k12_ten_proteins.fasta /mnt/galaxy/pico_galaxy/test-data/k12_ten_proteins.fasta
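
The remote-versus-local rule described above (walk the remote files, ignore anything that exists only locally) might look roughly like this using only the standard library. This is an illustrative sketch rather than the code in the gist: the function name is made up, and it emits unified diffs via `difflib` instead of calling the `diff` tool.

```python
import difflib
from pathlib import Path


def diff_remote_against_local(remote_root, local_root):
    """Yield unified diff lines for remote files that differ from the local copy.

    Files that exist only under local_root are never visited, so they are
    ignored; remote files missing locally get a one-line note.
    """
    remote_root, local_root = Path(remote_root), Path(local_root)
    for remote_file in sorted(remote_root.rglob("*")):
        rel = remote_file.relative_to(remote_root)
        # Skip directories and the Mercurial metadata of a cloned repository.
        if not remote_file.is_file() or ".hg" in rel.parts:
            continue
        local_file = local_root / rel
        if not local_file.is_file():
            yield f"Only in remote: {rel}\n"
            continue
        old = remote_file.read_text(errors="replace").splitlines(keepends=True)
        new = local_file.read_text(errors="replace").splitlines(keepends=True)
        # unified_diff yields nothing at all for identical files.
        yield from difflib.unified_diff(old, new, f"remote/{rel}", f"local/{rel}")
```

Printing `"".join(diff_remote_against_local(remote, local))` gives output only for files that differ, which would also address the verbosity of listing every compared file.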

Sample output comparing the Test Tool Shed and main Tool Shed repositories, showing I might want to push the v0.0.9 release to the main Tool Shed, which is missing several updates:

$ shed_diff https://testtoolshed.g2.bx.psu.edu/view/peterjc/seq_select_by_id https://toolshed.g2.bx.psu.edu/view/peterjc/seq_select_by_id
Fetching https://testtoolshed.g2.bx.psu.edu/repos/peterjc/seq_select_by_id
Fetching https://toolshed.g2.bx.psu.edu/repos/peterjc/seq_select_by_id
diff -r A/tools/seq_select_by_id/README.rst B/tools/seq_select_by_id/README.rst
4c4
< This tool is copyright 2011-2014 by Peter Cock, The James Hutton Institute
---
> This tool is copyright 2011-2013 by Peter Cock, The James Hutton Institute
39,40c39,40
< * ``seq_select_by_id.py`` (the Python script)
< * ``seq_select_by_id.xml`` (the Galaxy tool definition)
---
> * seq_select_by_id.py (the Python script)
> * seq_select_by_id.xml (the Galaxy tool definition)
42c42
< The suggested location is a dedicated ``tools/seq_select_by_id`` folder.
---
> The suggested location is a dedicated tools/seq_select_by_id folder.
44c44
< You will also need to modify the ``tools_conf.xml`` file to tell Galaxy to offer the
---
> You will also need to modify the tools_conf.xml file to tell Galaxy to offer the
49,50c49,50
< If you wish to run the unit tests, also move/copy the ``test-data/`` files
< under Galaxy's ``test-data/`` folder. Then::
---
> If you wish to run the unit tests, also add this to tools_conf.xml.sample
> and move/copy the test-data files under Galaxy's test-data folder. Then::
52c52
<     $ ./run_tests.sh -id seq_select_by_id
---
>     $ ./run_functional_tests.sh -id seq_select_by_id
76,79c76
< v0.0.8  - Corrected automated dependency definition.
< v0.0.9  - Simplified XML to apply input format to output data.
<         - Tool definition now embeds citation information.
<         - Include input dataset name in output dataset names.
---
> v0.0.8  - Corrected automated dependency definition
diff -r A/tools/seq_select_by_id/seq_select_by_id.xml B/tools/seq_select_by_id/seq_select_by_id.xml
1c1
< <tool id="seq_select_by_id" name="Select sequences by ID" version="0.0.9">
---
> <tool id="seq_select_by_id" name="Select sequences by ID" version="0.0.6">
22c22,32
<         <data name="output_file" format="input" metadata_source="input_file" label="Selected sequences from $input_file.name"/>
---
>         <data name="output_file" format="fasta" label="Selected sequences">
>             <!-- TODO - Replace this with format="input:input_fastq" if/when that works -->
>             <change_format>
>                 <when input_dataset="input_file" attribute="extension" value="sff" format="sff" />
>                 <when input_dataset="input_file" attribute="extension" value="fastq" format="fastq" />
>                 <when input_dataset="input_file" attribute="extension" value="fastqsanger" format="fastqsanger" />
>                 <when input_dataset="input_file" attribute="extension" value="fastqsolexa" format="fastqsolexa" />
>                 <when input_dataset="input_file" attribute="extension" value="fastqillumina" format="fastqillumina" />
>                 <when input_dataset="input_file" attribute="extension" value="fastqcssanger" format="fastqcssanger" />
>             </change_format>
>         </data>
62,65d71
<     <citations>
<         <citation type="doi">10.7717/peerj.167</citation>
<         <citation type="doi">10.1093/bioinformatics/btp163</citation>
<     </citations>
diff -r A/tools/seq_select_by_id/tool_dependencies.xml B/tools/seq_select_by_id/tool_dependencies.xml
4c4
<         <repository changeset_revision="ac9cc2992b69" name="package_biopython_1_62" owner="biopython" toolshed="https://testtoolshed.g2.bx.psu.edu" />
---
>         <repository changeset_revision="3e82cbc44886" name="package_biopython_1_62" owner="biopython" toolshed="http://toolshed.g2.bx.psu.edu" />

@jmchilton
Member

Thanks for laying this all out Peter - this is definitely in scope and something I wanted to work on so this is perfect. Things are a bit hectic right now - but I am definitely going to look at this in detail at some point soon. Thanks again!

@peterjc
Contributor Author

peterjc commented Dec 4, 2014

Good. If this becomes part of planemo, we can do clever things via the .shed.yml file (see #25) like inferring the associated Test/Main Tool Shed URLs and the base paths (and even file lists, see #26).

jmchilton added a commit that referenced this issue Dec 5, 2014
Inspired by script from @peterjc - https://gist.github.com/peterjc/13653e6907d75c470d01.

By default this compares the local changes against the main Tool Shed repository defined by [.][tool][_]shed.yml, but command line options enable all sorts of other comparisons. Some of these are demonstrated below:

Default against main tool shed:

```
% planemo shed_diff
wget -q --recursive -O - 'https://toolshed.g2.bx.psu.edu/repository/download?repository_id=b6b97c236de89252&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_CuRq5U/_toolshed_ --strip-components 1
mkdir "/tmp/tool_shed_diff_CuRq5U/_local_"; tar -xzf "/tmp/tmpdVW07c" -C "/tmp/tool_shed_diff_CuRq5U/_local_"; rm -rf /tmp/tmpdVW07c
cd "/tmp/tool_shed_diff_CuRq5U"; diff -r _local_ _toolshed_
diff -r _local_/count_covariates.xml _toolshed_/count_covariates.xml
7d6
<    <version_command>echo "A REALLY OLD OPEN SOURCE VERSION OF GATK"</version_command>
diff -r _local_/tool_dependencies.xml _toolshed_/tool_dependencies.xml
4c4
<       <repository name="package_gatk_1_4" owner="devteam" prior_installation_required="False" />
---
>       <repository changeset_revision="ec95ec570854" name="package_gatk_1_4" owner="devteam" prior_installation_required="False" toolshed="http://toolshed.g2.bx.psu.edu" />
7c7
<       <repository name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" />
---
>       <repository changeset_revision="171cd8bc208d" name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" toolshed="http://toolshed.g2.bx.psu.edu" />
```
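
The first command in the log above fetches a tarball from the Tool Shed download endpoint. As a rough illustration of how such a URL could be composed, here is a small sketch; the function and dictionary names are hypothetical and are not planemo's own code, and only the URL shape is taken from the log.

```python
from urllib.parse import urlencode

# Short names as used with --shed_target, mapped to the base URLs seen in the log.
SHED_URLS = {
    "toolshed": "https://toolshed.g2.bx.psu.edu",
    "testtoolshed": "https://testtoolshed.g2.bx.psu.edu",
}


def download_url(shed_target, repository_id, changeset_revision="default"):
    """Build the tarball download URL for one repository on one Tool Shed."""
    query = urlencode(
        {
            "repository_id": repository_id,
            "changeset_revision": changeset_revision,
            "file_type": "gz",
        }
    )
    return f"{SHED_URLS[shed_target]}/repository/download?{query}"
```

For example, `download_url("toolshed", "b6b97c236de89252")` reproduces the URL passed to wget in the log above.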

Check local diff against test tool shed.

```
% planemo shed_diff --shed_target testtoolshed
/home/john/workspace/planemo/.venv/local/lib/python2.7/site-packages/requests-2.4.3-py2.7.egg/requests/packages/urllib3/connectionpool.py:730: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html (This warning will only appear once by default.)
  InsecureRequestWarning)
wget -q --recursive -O - 'https://testtoolshed.g2.bx.psu.edu/repository/download?repository_id=4dd15c58c2ade087&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_LWnNZt/_testtoolshed_ --strip-components 1
mkdir "/tmp/tool_shed_diff_LWnNZt/_local_"; tar -xzf "/tmp/tmpNKEpuO" -C "/tmp/tool_shed_diff_LWnNZt/_local_"; rm -rf /tmp/tmpNKEpuO
cd "/tmp/tool_shed_diff_LWnNZt"; diff -r _local_ _testtoolshed_
diff -r _local_/count_covariates.xml _testtoolshed_/count_covariates.xml
7d6
<    <version_command>echo "A REALLY OLD OPEN SOURCE VERSION OF GATK"</version_command>
diff -r _local_/tool_dependencies.xml _testtoolshed_/tool_dependencies.xml
4c4
<       <repository name="package_gatk_1_4" owner="devteam" prior_installation_required="False" />
---
>       <repository changeset_revision="0cc94f66d00e" name="package_gatk_1_4" owner="devteam" prior_installation_required="False" toolshed="http://testtoolshed.g2.bx.psu.edu" />
7c7
<       <repository name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" />
---
>       <repository changeset_revision="c0f72bdba484" name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" toolshed="http://testtoolshed.g2.bx.psu.edu" />
```

Check difference between test and main for this repository.

```
% planemo shed_diff --shed_target_source testtoolshed
/home/john/workspace/planemo/.venv/local/lib/python2.7/site-packages/requests-2.4.3-py2.7.egg/requests/packages/urllib3/connectionpool.py:730: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html (This warning will only appear once by default.)
  InsecureRequestWarning)
wget -q --recursive -O - 'https://toolshed.g2.bx.psu.edu/repository/download?repository_id=b6b97c236de89252&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_Aa9wj3/_toolshed_ --strip-components 1
wget -q --recursive -O - 'https://testtoolshed.g2.bx.psu.edu/repository/download?repository_id=4dd15c58c2ade087&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_Aa9wj3/_testtoolshed_ --strip-components 1
cd "/tmp/tool_shed_diff_Aa9wj3"; diff -r _testtoolshed_ _toolshed_
diff -r _testtoolshed_/tool_dependencies.xml _toolshed_/tool_dependencies.xml
4c4
<       <repository changeset_revision="0cc94f66d00e" name="package_gatk_1_4" owner="devteam" prior_installation_required="False" toolshed="http://testtoolshed.g2.bx.psu.edu" />
---
>       <repository changeset_revision="ec95ec570854" name="package_gatk_1_4" owner="devteam" prior_installation_required="False" toolshed="http://toolshed.g2.bx.psu.edu" />
7c7
<       <repository changeset_revision="c0f72bdba484" name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" toolshed="http://testtoolshed.g2.bx.psu.edu" />
---
>       <repository changeset_revision="171cd8bc208d" name="package_samtools_0_1_18" owner="devteam" prior_installation_required="False" toolshed="http://toolshed.g2.bx.psu.edu" />
```

Ignore YAML file and just check difference between main and test tool shed for arbitrary repository.

```
% planemo shed_diff --owner peterjc --name blast_rbh --shed_target_source testtoolshed
/home/john/workspace/planemo/.venv/local/lib/python2.7/site-packages/requests-2.4.3-py2.7.egg/requests/packages/urllib3/connectionpool.py:730: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html (This warning will only appear once by default.)
  InsecureRequestWarning)
wget -q --recursive -O - 'https://toolshed.g2.bx.psu.edu/repository/download?repository_id=d5dd1c5d2070513e&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_II0eAD/_toolshed_ --strip-components 1
wget -q --recursive -O - 'https://testtoolshed.g2.bx.psu.edu/repository/download?repository_id=c053d26daf6271bf&changeset_revision=default&file_type=gz' | tar -xzf - -C /tmp/tool_shed_diff_II0eAD/_testtoolshed_ --strip-components 1
cd "/tmp/tool_shed_diff_II0eAD"; diff -r _testtoolshed_ _toolshed_
diff -r _testtoolshed_/tools/blast_rbh/blast_rbh.py _toolshed_/tools/blast_rbh/blast_rbh.py
35c35
<     print "BLAST RBH v0.1.6"
---
>     print "BLAST RBH v0.1.5"
110c110
<     if blast_type not in ["blastp", "blastp-fast", "blastp-short"]:
---
>     if blast_type not in ["blastp", "blastp-short"]:
332c332
<     sys.stderr.write("Warning: Sequences with tied best hits found, you may have duplicates/clusters\n")
---
>     sys.stderr.write("Warning: Sequencies with tied best hits found, you may have duplicates/clusters\n")
diff -r _testtoolshed_/tools/blast_rbh/blast_rbh.xml _toolshed_/tools/blast_rbh/blast_rbh.xml
1c1
< <tool id="blast_reciprocal_best_hits" name="BLAST Reciprocal Best Hits (RBH)" version="0.1.6">
---
> <tool id="blast_reciprocal_best_hits" name="BLAST Reciprocal Best Hits (RBH)" version="0.1.5">
48d47
<                     <option value="blastp-fast">blastp-fast - Uses longer words as described by Shiryev et al (2007)</option>
167c166
<             <param name="nucl_type" value="blastp-fast"/>
---
>             <param name="nucl_type" value="blastp"/>
diff -r _testtoolshed_/tools/blast_rbh/README.rst _toolshed_/tools/blast_rbh/README.rst
65d64
< v0.1.6  - Offer the new blastp-fast task added in BLAST+ 2.2.30.
diff -r _testtoolshed_/tools/blast_rbh/tool_dependencies.xml _toolshed_/tools/blast_rbh/tool_dependencies.xml
4c4
<         <repository changeset_revision="268128adb501" name="package_biopython_1_64" owner="biopython" toolshed="https://testtoolshed.g2.bx.psu.edu" />
---
>         <repository changeset_revision="5477a05cc158" name="package_biopython_1_64" owner="biopython" toolshed="https://toolshed.g2.bx.psu.edu" />
7c7
<         <repository changeset_revision="f69b90d89b62" name="package_blast_plus_2_2_30" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu" />
---
>         <repository changeset_revision="0fe5d5c28ea2" name="package_blast_plus_2_2_30" owner="iuc" toolshed="https://toolshed.g2.bx.psu.edu" />
```

Closes #27.