Prevent rdflib from refetching URLs #2

dbs · 2016-01-10T18:37:00Z

We fetch the HTML using the Python requests module, with a User Agent that identifies itself as Firefox, so that we can cache a local copy and be polite. Even so, it seems that rdflib goes on to fetch URLs anyway, even when asked to parse a string.

This is problematic for a number of reasons:

First, we want to analyze the HTML that we fetched from the site; if rdflib is going behind our backs and grabbing newer or additional content from the network, it's not exactly a repeatable experiment. SCIENCE!

Second, adding network traffic greatly increases the amount of time required to parse the URLs.

Third, it can overburden the network and could get my IP banned from some servers.

Fourth, because the fetch happens within rdflib code, we have no way to set the User Agent. As a result, for some sites like http://www.robertscreekcommunity.ca/roberts-creek-library-and-reading-room.html we get the following error, which is printed directly to stderr (not even logged, sigh):

HTTP Error 403: Access Denied<br/>HTTP request error 17f4e8c8<br/>You can fix the problem 
yourself by visiting <a href='http://www.ioerror.us/bb2-support-key?key=17f4e8c8'>this URL</a>
and following its instructions.

The corresponding traceback looks something like:

Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 83, in __init__
    self.data       = urlopen(req)
  File "/usr/lib64/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib64/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 587, in graph_from_source
    if not rdfOutput : raise h
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 576, in graph_from_source
    input = self._get_input(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 455, in _get_input
    raise sys.exc_info()[1]
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 426, in _get_input
    url_request       = URIOpener(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 133, in __init__
    raise HTTPError('%s' % msg[1], e.code)
rdflib.plugins.parsers.pyRdfa.HTTPError

Perhaps we could import rdflib/plugins/parsers/pyRdfa/utils.py and override URIOpener.__init__ so that it won't even try to open a req that doesn't start with file:///; or perhaps even override urllib.request (so that it results in the same behaviour for all parsers).
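A minimal sketch of the urllib.request variant of this idea, using only the standard library. Rather than replacing urlopen itself (which would miss any module that had already done "from urllib.request import urlopen"), it builds an opener whose handler refuses http, https, and ftp requests before a connection is made, while leaving file:/// URLs alone. The names OfflineHandler and build_offline_opener are hypothetical, and it is an assumption that every rdflib parser fetches through urlopen the way pyRdfa does in the traceback above.

```python
import urllib.error
import urllib.request


class OfflineHandler(urllib.request.BaseHandler):
    """Hypothetical handler: refuse network schemes, leave file:// alone."""

    # Lower numbers run first; 0 puts this ahead of the default
    # proxy (100) and HTTP (500) handlers.
    handler_order = 0

    def _refuse(self, req):
        # Raising here stops the request before any socket is opened.
        raise urllib.error.URLError(
            "network access blocked: %s" % req.full_url)

    # OpenerDirector dispatches on "<scheme>_open" method names.
    http_open = https_open = ftp_open = _refuse


def build_offline_opener():
    """Return an opener that still serves file:// URLs but not the network."""
    return urllib.request.build_opener(OfflineHandler())
```

Calling urllib.request.install_opener(build_offline_opener()) once at startup would apply this to every later urlopen call in the process. The requests module does not go through urllib.request, so the initial polite fetch should be unaffected.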


dbs · 2016-01-10T20:32:34Z

Ah, so it seems that there's a significant difference between the following methods of parsing a file with a set location.

To start with, this does not trigger a network request:

g.parse(open('empty.html', 'r'), format='html')

This, however, does trigger a network request:

g.parse(open('empty.html', 'r'), format='html', location='http://laurentian.ca/library/')

And this does not:

g.parse(file=open('empty.html', 'r'), format='html', location='http://laurentian.ca/library/')

So even though http://rdflib.readthedocs.org/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.parse says that source as a positional argument accepts a file-like object, and the named file argument specifies a file-like object, they are not equivalent. This is surprising to me, but perhaps not to others.
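To make "does" and "does not trigger a network request" checkable without touching the network, something like the following standard-library sketch could wrap each call form. It temporarily installs an opener that records every http or https URL passed to urlopen and then refuses it. The names RecordingBlocker and recorded_fetches are hypothetical, and the sketch only sees fetches that go through urllib.request, which is what the pyRdfa traceback in this issue shows.

```python
import contextlib
import urllib.error
import urllib.request


class RecordingBlocker(urllib.request.BaseHandler):
    """Hypothetical handler: note each attempted URL, then refuse it."""

    handler_order = 0  # run before the default proxy and HTTP handlers

    def __init__(self, seen):
        self.seen = seen

    def _record(self, req):
        self.seen.append(req.full_url)
        raise urllib.error.URLError("blocked: %s" % req.full_url)

    http_open = https_open = _record


@contextlib.contextmanager
def recorded_fetches():
    """Yield a list that collects URLs urlopen tried inside the block."""
    seen = []
    urllib.request.install_opener(
        urllib.request.build_opener(RecordingBlocker(seen)))
    try:
        yield seen
    finally:
        # In CPython, passing None makes urlopen rebuild its default
        # opener on next use; any previously installed opener is lost.
        urllib.request.install_opener(None)
```

Wrapping each g.parse(...) variant in "with recorded_fetches() as seen:" (inside a try/except, since a blocked fetch surfaces as an exception) and then inspecting seen would show which call form reaches for the location URL.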

dbs closed this as completed in 13e0955 Jan 10, 2016
