[rss] Retrieval of RSS site returns forbidden without user-agent header #128

Closed
albertinisg opened this issue Mar 23, 2017 · 0 comments

Trying to fetch an RSS feed with Perceval returns a 403 Forbidden error:

$ perceval rss https://blog.couchbase.com/feed/
[2017-03-23 19:58:58,425] - Sir Perceval is on his quest.
[2017-03-23 19:58:58,426] - Looking for rss entries at feed 'https://blog.couchbase.com/feed/'
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/perceval/backend.py", line 309, in run
    for item in items:
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/perceval/backend.py", line 359, in decorator
    for data in func(self, *args, **kwargs):
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/perceval/backends/core/rss.py", line 73, in fetch
    raw_entries = self.client.get_entries()
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/perceval/backends/core/rss.py", line 170, in get_entries
    req.raise_for_status()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 851, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/perceval", line 4, in <module>
    __import__('pkg_resources').run_script('perceval==0.7.0.dev4', 'perceval')
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 746, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 1501, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/EGG-INFO/scripts/perceval", line 191, in <module>
    main()
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/EGG-INFO/scripts/perceval", line 109, in main
    cmd.run()
  File "/usr/local/lib/python3.5/dist-packages/perceval-0.7.0.dev4-py3.5.egg/perceval/backend.py", line 314, in run
    raise RuntimeError(str(e))
RuntimeError: 403 Client Error: Forbidden

But the site is reachable using curl:

$ curl -v https://blog.couchbase.com/feed/
*   Trying 104.199.116.147...
* Connected to blog.couchbase.com (104.199.116.147) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 696 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_256_GCM_SHA384
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: blog.couchbase.com (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: CN=blog.couchbase.com
* 	 start date: Mon, 13 Feb 2017 23:57:00 GMT
* 	 expire date: Sun, 14 May 2017 23:57:00 GMT
* 	 issuer: C=US,O=Let's Encrypt,CN=Let's Encrypt Authority X3
* 	 compression: NULL
* ALPN, server accepted to use http/1.1
> GET /feed/ HTTP/1.1
> Host: blog.couchbase.com
> User-Agent: curl/7.45.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx
< Date: Thu, 23 Mar 2017 19:00:00 GMT
< Content-Type: application/rss+xml; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Keep-Alive: timeout=20
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Pragma: no-cache
< Last-Modified: Thu, 23 Mar 2017 17:50:51 GMT
< ETag: "3d880cc1444b27b26be2adf1ddee205b-gzip"
< X-Robots-Tag: noindex, follow
< Link: <https://blog.couchbase.com/wp-json/>; rel="https://api.w.org/"
< X-Cacheable: CacheAlways: feed
< Cache-Control: max-age=600, must-revalidate
< X-Cache: HIT: 5
< X-Pass-Why: 
< X-Cache-Group: 
< X-Type: feed

However, if we make the same request with the Python requests library, the error still appears:

In [1]: import requests
In [2]: req = requests.get('https://blog.couchbase.com/feed/')
In [3]: req.text
Out[3]: u'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

But it works if we add a User-Agent header:

In [6]: req = requests.get('https://blog.couchbase.com/feed/', headers={'User-Agent':'curl/7.45.0'})
In [7]: req.text
Out[7]: u'<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"\n\txmlns:content="http://purl.org/rss/1.0/modules/content/"\n\txmlns:wfw="http://wellformedweb.org/CommentAPI/"\n\txmlns:dc="http://purl.org/dc/elements/1.1/"\n\txmlns:atom="http://www.w3.org/2005/Atom"\n\txmlns:sy="http://purl.org/rss/1.0/modules/syndication/"\n\txmlns:slash="http://purl.org/rss/1.0/modules/slash/"\n\t>\n\n<channel>\n\t<title>
...
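
The workaround above can be wrapped in a small helper. This is only a minimal sketch, not Perceval's actual fix: the names `DEFAULT_USER_AGENT`, `build_headers` and `fetch_feed`, the User-Agent string and the timeout value are all assumptions made for illustration. Note that requests normally sends its own default User-Agent (`python-requests/<version>`), so the server is presumably rejecting that value rather than a missing header.

```python
# Hypothetical User-Agent string; any value the server accepts would do.
DEFAULT_USER_AGENT = "perceval-rss-example/0.1"


def build_headers(user_agent=DEFAULT_USER_AGENT):
    """Return request headers carrying an explicit User-Agent."""
    return {"User-Agent": user_agent}


def fetch_feed(url, user_agent=DEFAULT_USER_AGENT, timeout=30):
    """Fetch a feed with an explicit User-Agent and return its body as text."""
    # Imported here so build_headers() can be used without requests installed.
    import requests

    response = requests.get(url, headers=build_headers(user_agent),
                            timeout=timeout)
    # Raises requests.exceptions.HTTPError on 4xx/5xx responses,
    # such as the 403 shown in the traceback above.
    response.raise_for_status()
    return response.text
```

Calling `fetch_feed('https://blog.couchbase.com/feed/', user_agent='curl/7.45.0')` would repeat the request from `In [6]`; whether a given User-Agent value is accepted depends on the server.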
acs pushed a commit to acs/perceval that referenced this issue Mar 23, 2017
…s return the data.

In wordpress.com if the User-Agent header is not included, a forbidden (403) response is received.
Fix chaoss#128. Solution proposed by albertinisg!