planemo lint --urls false positive from pmid:12345678 #573

Closed
peterjc opened this Issue Sep 22, 2016 · 5 comments

peterjc commented Sep 22, 2016

Linting this file is failing: https://github.com/peterjc/pico_galaxy/blob/801daf8dc7932a087eb83a96d0be1e99ed0447c3/tools/chromosome_diagram/chromosome_diagram.xml

$ planemo --version
planemo, version 0.33.0.dev0
$ planemo shed_lint --tools --urls tools/chromosome_diagram/ ; echo "Return code $?"
(snip)
Applying linter tool_urls... FAIL
.. ERROR: URL Error <urlopen error unknown url type: pmid> accessing pmid:19304878
.. INFO: URL OK http://dx.doi.org/10.1093/bioinformatics/btp163
Failed linting
Return code 1

Or,

$ planemo lint --urls tools/chromosome_diagram/ ; echo "Return code $?"
...
Applying linter tool_urls... FAIL
.. ERROR: URL Error <urlopen error unknown url type: pmid> accessing pmid:19304878
.. INFO: URL OK http://dx.doi.org/10.1093/bioinformatics/btp163
Failed linting
Return code 1

This is triggered by the RST help text in the tool XML file:

Cock et al 2009. Biopython: freely available Python tools for computational
molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

https://github.com/peterjc/pico_galaxy/blob/801daf8dc7932a087eb83a96d0be1e99ed0447c3/tools/chromosome_diagram/chromosome_diagram.xml#L75

It appears pmid:19304878 is wrongly being picked up as a URL despite not having a double slash after the colon?

# http://stackoverflow.com/questions/7676255/find-and-replace-urls-in-a-block-of-te

>>> import re
>>> HTTP_REGEX_PATTERN = re.compile(r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>\[\]]+|\(([^\s()<>\[\]]+|(\([^\
... \s()<>\[\]]+\)))*\))+(?:\(([^\s()<>\[\]]+|(\([^\s()<>\[\]]+\)))*\)|[^\s`!(){};:'".,<>?\[\]]))""")
>>> HTTP_REGEX_PATTERN.findall("\nSee pmid:12345678 for details.")
[('pmid:12345678', '', '', '', '')]
>>> HTTP_REGEX_PATTERN.findall("\nSee http://example.org or pmid:12345678.")
[('http://example.org', '', '', '', ''), ('pmid:12345678', '', '', '', '')]
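
The match seems to come from the first alternative in the pattern, which accepts any scheme-like word followed by a colon and then either one to three slashes or a single alphanumeric character. A minimal sketch isolating that branch (the name SCHEME_BRANCH and the helper function are made up for illustration and are not part of Planemo):

```python
import re

# Only the "scheme:" alternative of the full pattern, isolated for illustration.
# After the colon it accepts 1-3 slashes OR one alphanumeric/percent character,
# so no double slash is required.
SCHEME_BRANCH = re.compile(r"(?i)\b[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])")


def matches_scheme_branch(text):
    """Return True if text starts with something this branch accepts."""
    return SCHEME_BRANCH.match(text) is not None


for text in ["http://example.org", "pmid:12345678", "ratio: 5"]:
    print(text, matches_scheme_branch(text))
```

The first two inputs are accepted and the third is not, since a space after the colon satisfies neither option.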

peterjc commented Sep 22, 2016

Also failing on GO:0005856 in this example RST help text:

Numbers in parentheses, such as "0005856(2)" indicate that descendant "part_of"
cellular components were also included, up to the specified depth (2 in this case).
For example, all of the children and grandchildren of "GO:0005856" were
included as "cysk". 

https://github.com/peterjc/pico_galaxy/blob/11dcc5c30e024871ff33d9c509743515c71b9b81/tools/protein_analysis/wolf_psort.xml#L91

See e.g. https://travis-ci.org/peterjc/galaxy_blast/jobs/161948618

Applying linter tool_urls... FAIL
.. ERROR: URL Error <urlopen error unknown url type: go> accessing GO:0005856

peterjc commented Sep 23, 2016

Also failing on HWI-ST916:79:D04M5ACXX:1:1101:10000:100326 which is an example Illumina read name used in:

https://github.com/peterjc/pico_galaxy/blob/ea24615d4adfe1625a505d71da82eebfe5f6abfc/tools/fastq_paired_unpaired/fastq_paired_unpaired.xml#L80

peterjc added a commit to peterjc/pico_galaxy that referenced this issue Sep 23, 2016

erasche commented Sep 24, 2016

Hey @peterjc, I was on vacation this past week. I'm writing a quick patch now that fixes this and additionally restricts URLs to http / ftp URLs.
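
For readers following along, restricting by scheme could look roughly like the sketch below. This is not the code from the patch; the prefix list and the function name checkable_urls are assumptions made for illustration, and the candidates are the strings reported in this issue.

```python
# Assumed whitelist; the actual patch may use a different set of schemes.
ALLOWED_PREFIXES = ("http://", "https://", "ftp://")


def checkable_urls(candidates):
    """Keep only regex matches that start with an allowed scheme prefix."""
    return [c for c in candidates if c.lower().startswith(ALLOWED_PREFIXES)]


candidates = [
    "http://dx.doi.org/10.1093/bioinformatics/btp163",
    "pmid:19304878",
    "GO:0005856",
    "HWI-ST916:79:D04M5ACXX:1:1101:10000:100326",
]
print(checkable_urls(candidates))
```

Filtering after the regex match leaves the pattern itself untouched, at the cost of silently skipping any other URL scheme.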

peterjc commented Sep 24, 2016

Thanks - your pull request #579 is along the lines I was thinking of. Editing the REGEX might be more elegant, but it is already so complicated it would be easy to trigger side effects.

peterjc commented Sep 28, 2016

With #579 merged this should be fixed in the next release of Planemo, so closing - thanks!
