Although we fetch the HTML using the Python requests module with a User-Agent that masquerades as Firefox, so that we can cache a local copy and be polite, it seems that rdflib, even when asked to parse a string, goes on to fetch URLs over the network anyway.
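For reference, the fetch-and-parse pattern in question looks roughly like this (the URL, User-Agent string, and graph identifier below are illustrative placeholders, not our exact values):

```python
import requests
import rdflib

# Masquerade as Firefox so we can politely fetch and cache one local copy
# (the header value here is just an example).
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:40.0) "
                         "Gecko/20100101 Firefox/40.0"}
html = requests.get("http://example.org/page.html", headers=headers).text

g = rdflib.Graph()
# We hand rdflib the already-fetched string, yet the pyRdfa plugin
# still opens its own network connections while parsing it.
g.parse(data=html, format="rdfa", publicID="http://example.org/page.html")
```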
This is problematic for a number of reasons:
First, we want to analyze the HTML that we fetched from the site; if rdflib is going behind our backs and grabbing newer or additional content from the network, it's not exactly a repeatable experiment. SCIENCE!
Second, adding network traffic greatly increases the amount of time required to parse the URLs.
Third, it can overburden the network and potentially get my IP banned from some servers.
Fourth, because this happens inside rdflib code, we have no way to set the User-Agent, and so for some sites, like http://www.robertscreekcommunity.ca/roberts-creek-library-and-reading-room.html, we get the following error printed directly to stderr (not even logged, sigh):
HTTP Error 403: Access Denied<br/>HTTP request error 17f4e8c8<br/>You can fix the problem
yourself by visiting <a href='http://www.ioerror.us/bb2-support-key?key=17f4e8c8'>this URL</a>
and following its instructions.
The corresponding traceback looks something like:
Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 83, in __init__
    self.data = urlopen(req)
  File "/usr/lib64/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib64/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 587, in graph_from_source
    if not rdfOutput : raise h
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 576, in graph_from_source
    input = self._get_input(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 455, in _get_input
    raise sys.exc_info()[1]
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 426, in _get_input
    url_request = URIOpener(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 133, in __init__
    raise HTTPError('%s' % msg[1], e.code)
rdflib.plugins.parsers.pyRdfa.HTTPError
Perhaps we could import rdflib/plugins/parsers/pyRdfa/utils.py and override URIOpener.__init__ so that it won't even try to open a request that doesn't start with file:///; or perhaps even patch urllib.request itself (so that all parsers get the same behaviour).
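A rough, untested sketch of the urllib.request option might look like this (the exception type and message are arbitrary choices, and `_local_only_urlopen` is a made-up name):

```python
import urllib.error
import urllib.request

_real_urlopen = urllib.request.urlopen

def _local_only_urlopen(url, *args, **kwargs):
    # urlopen accepts either a URL string or a Request object.
    full_url = url if isinstance(url, str) else url.full_url
    if not full_url.startswith("file://"):
        # Refuse instead of quietly hitting the network.
        raise urllib.error.URLError("blocked non-local URL: %s" % full_url)
    return _real_urlopen(url, *args, **kwargs)

urllib.request.urlopen = _local_only_urlopen
```

One caveat: a parser that does `from urllib.request import urlopen` at import time keeps its own reference to the original function, so the patch would have to be applied before that module is first imported.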