planemo lint --urls UserAgent? #578

Closed
peterjc opened this Issue Sep 23, 2016 · 7 comments

peterjc commented Sep 23, 2016

Just noticed via https://travis-ci.org/peterjc/galaxy_blast/jobs/162157454 that however planemo accesses the URL, it can be blocked, apparently based on the UserAgent. Here it is linting https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_makeblastdb.xml

Applying linter tool_urls... FAIL
.. ERROR: HTTP Error 403 accessing https://www.ncbi.nlm.nih.gov/books/NBK279690/
.. INFO: URL OK http://dx.doi.org/10.1186/s13742-015-0080-7
.. INFO: URL OK http://dx.doi.org/10.1186/1471-2105-10-421
.. INFO: URL OK http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus

Curiously the URL https://www.ncbi.nlm.nih.gov/books/NBK279690/ is fine in web-browsers or with curl, but wget also gets the Error 403 reply:

$ wget https://www.ncbi.nlm.nih.gov/books/NBK279690/
--2016-09-23 14:54:10--  https://www.ncbi.nlm.nih.gov/books/NBK279690/
Resolving www.ncbi.nlm.nih.gov... 130.14.29.110, 2607:f220:41e:4290::110
Connecting to www.ncbi.nlm.nih.gov|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-09-23 14:54:11 ERROR 403: Forbidden.

You might argue that when linting the user-facing documentation (the reStructuredText in the <help> tag) we should fetch links with a standard web-browser User Agent, while for package download URLs the default User Agent makes sense, since those URLs are used for programmatic downloads (e.g. in Python when installing via the ToolShed).
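A minimal sketch of that split, not planemo's actual code. The `BROWSER_USER_AGENT` constant, the `"help"` and `"download"` labels, and the `headers_for` helper are all hypothetical names chosen for illustration, and the Firefox string is only an example of a browser-like value.

```python
from typing import Dict

# Hypothetical browser-like User-Agent; any common browser string would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) "
    "Gecko/20100101 Firefox/21.0"
)


def headers_for(kind: str) -> Dict[str, str]:
    """Return extra request headers for a link of the given kind.

    "help" links are read by people, so they get a browser-like User-Agent.
    "download" links are fetched by programs, so no header is added and the
    HTTP library's default User-Agent is used.
    """
    if kind == "help":
        return {"User-Agent": BROWSER_USER_AGENT}
    if kind == "download":
        return {}
    raise ValueError("unknown link kind: %r" % kind)
```

The returned dictionary could be passed as the `headers` argument of whatever HTTP client does the check, so download links are still tested the way a real install would fetch them.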

jmchilton commented Sep 23, 2016

Well that stinks - yeah I really like your idea of splitting user agent based on human or computer downloads.

$ wget  --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0"  https://www.ncbi.nlm.nih.gov/books/NBK279690/
--2016-09-23 10:02:13--  https://www.ncbi.nlm.nih.gov/books/NBK279690/
Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

    [ <=>                                              ] 32,155      --.-K/s   in 0.04s   

2016-09-23 10:02:14 (852 KB/s) - ‘index.html’ saved [32155]

peterjc commented Sep 23, 2016

Thanks John - setting the UserAgent with wget works for me too. I'll email the NCBI BLAST help team to see if they can get the website fixed; I can see no reason to block access to the BLAST manual and other HTML pages like this.

peterjc added a commit to peterjc/galaxy_blast that referenced this issue Sep 23, 2016

TravisCI: Turn off URL checking (false positive)
See galaxyproject/planemo#578
NCBI currently giving Error 403: Forbidden on webpages
depending on the user-agent.

peterjc commented Sep 28, 2016

The NCBI got back to me: the UserAgent filtering is done deliberately on the PMC servers to protect some restricted copyright content from bulk downloads. That this also blocks wget of the BLAST command line manual is unfortunate.

Spoofing a browser for verifying download links seems like the best solution here.
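As a rough illustration of what spoofing a browser could look like with only the Python standard library (this is a sketch, not planemo's implementation): `build_request` and `check_url` are hypothetical helpers, the User-Agent string is just an example, and the message wording only imitates the lint output quoted earlier in this issue.

```python
import urllib.error
import urllib.request

# Example browser-like User-Agent; an assumption, not a value planemo uses.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) "
    "Gecko/20100101 Firefox/21.0"
)


def build_request(url):
    """Build a request that identifies itself as a web browser."""
    return urllib.request.Request(
        url, headers={"User-Agent": BROWSER_USER_AGENT}
    )


def check_url(url, timeout=10):
    """Fetch the URL and return an (ok, message) pair."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout):
            return True, "URL OK %s" % url
    except urllib.error.HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return False, "HTTP Error %s accessing %s" % (e.code, url)
    except urllib.error.URLError as e:
        return False, "URL Error %s accessing %s" % (e.reason, url)
```

Calling `check_url("https://www.ncbi.nlm.nih.gov/books/NBK279690/")` would then send the browser-like header instead of the library default; whether the server accepts it depends on its filtering rules at the time.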

blankenberg commented Sep 28, 2016

/facepalm

$ wget https://www.ncbi.nlm.nih.gov/books/NBK279690/ --user-agent="user-agent blocking is silly"
--2016-09-28 10:12:58--  https://www.ncbi.nlm.nih.gov/books/NBK279690/
Resolving www.ncbi.nlm.nih.gov... 2607:f220:41e:4290::110, 130.14.29.110
Connecting to www.ncbi.nlm.nih.gov|2607:f220:41e:4290::110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html                              [ <=>                                                              ]  31.40K  --.-KB/s    in 0.04s   

2016-09-28 10:12:59 (896 KB/s) - ‘index.html’ saved [32155]

jmchilton commented Sep 28, 2016

I really just want to bulk download some restricted copyright content I'd otherwise not care at all about now.

peterjc commented Sep 28, 2016

peterjc commented Oct 10, 2016

Great - thanks for reviewing that quickly John.

Once the next planemo release goes live on PyPI we can revert some of the workarounds that skip tool URL linting. Although right now https://toolshed.g2.bx.psu.edu/ is down, so that will be annoying...
