Prevent rdflib from refetching URLs #2

Closed
dbs opened this issue Jan 10, 2016 · 1 comment

Comments

@dbs
Owner

dbs commented Jan 10, 2016

We fetch the HTML using the Python requests module, with a User-Agent that masquerades as Firefox, so that we can cache a local copy and be polite. Even so, it seems that rdflib, even when asked to parse a string, goes on to fetch URLs anyway.

This is problematic for a number of reasons:

First, we want to analyze the HTML that we fetched from the site; if rdflib is going behind our backs and grabbing newer or additional content from the network, it's not exactly a repeatable experiment. SCIENCE!

Second, adding network traffic greatly increases the amount of time required to parse the pages.

Third, it can overburden the network and could get my IP banned from some servers.

Fourth, because it happens within rdflib code, we have no way to set the User-Agent, and thus for some sites like http://www.robertscreekcommunity.ca/roberts-creek-library-and-reading-room.html we get the following error, which is printed directly to stderr (not even logged, sigh):

HTTP Error 403: Access Denied<br/>HTTP request error 17f4e8c8<br/>You can fix the problem 
yourself by visiting <a href='http://www.ioerror.us/bb2-support-key?key=17f4e8c8'>this URL</a>
and following its instructions.

The corresponding traceback looks something like:

Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 83, in __init__
    self.data       = urlopen(req)
  File "/usr/lib64/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib64/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib64/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib64/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 587, in graph_from_source
    if not rdfOutput : raise h
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 576, in graph_from_source
    input = self._get_input(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 455, in _get_input
    raise sys.exc_info()[1]
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/__init__.py", line 426, in _get_input
    url_request       = URIOpener(name)
  File "/lib/python3.4/site-packages/rdflib/plugins/parsers/pyRdfa/utils.py", line 133, in __init__
    raise HTTPError('%s' % msg[1], e.code)
rdflib.plugins.parsers.pyRdfa.HTTPError

Perhaps we could import rdflib/plugins/parsers/pyRdfa/utils.py and override URIOpener.__init__ so that it won't even try to open a request that doesn't start with file:///; or perhaps even override urllib.request.urlopen (so that we get the same behaviour for all parsers).
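A minimal sketch of the second idea, refusing anything that is not a file:// URL at the urllib.request level. The names block_network_urlopen and NetworkBlockedError are hypothetical, invented for this illustration; nothing here is part of rdflib.

```python
import contextlib
import urllib.request
from unittest import mock


class NetworkBlockedError(RuntimeError):
    """Raised when code tries to open a non-file URL while the guard is active."""


def _url_of(target):
    # urlopen accepts either a URL string or a urllib.request.Request object.
    if isinstance(target, urllib.request.Request):
        return target.full_url
    return str(target)


@contextlib.contextmanager
def block_network_urlopen():
    """Temporarily make urllib.request.urlopen refuse anything but file:// URLs."""
    real_urlopen = urllib.request.urlopen

    def guarded_urlopen(target, *args, **kwargs):
        url = _url_of(target)
        if not url.startswith("file://"):
            raise NetworkBlockedError("blocked fetch of %s" % url)
        return real_urlopen(target, *args, **kwargs)

    # mock.patch.object restores the original attribute on exit, even on error.
    with mock.patch.object(urllib.request, "urlopen", guarded_urlopen):
        yield
```

Wrapping the parse call in `with block_network_urlopen():` would turn a silent refetch into an exception we can catch and log. One caveat: this only affects code that looks up urllib.request.urlopen at call time. A module that ran `from urllib.request import urlopen` holds its own reference, and the bare `urlopen(req)` call in the traceback suggests pyRdfa/utils.py may be such a module (an inference from the traceback, not something verified here), in which case the name would have to be patched in that module instead.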

@dbs
Owner Author

dbs commented Jan 10, 2016

Ah, so it seems that there's a significant difference between the following methods of parsing a file with a set location.

To start with, this does not trigger a network request:

g.parse(open('empty.html', 'r'), format='html')

This, however, does trigger a network request:

g.parse(open('empty.html', 'r'), format='html', location='http://laurentian.ca/library/')

And this does not:

g.parse(file=open('empty.html', 'r'), format='html', location='http://laurentian.ca/library/')

So even though http://rdflib.readthedocs.org/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.parse says that the positional source argument accepts a file-like object, and that the named file argument also takes a file-like object, the two are not equivalent when a location is given. This is surprising to me, but perhaps not to others.
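One way to avoid slipping back into the positional form is to route every parse of a cached page through a small helper. This is only a sketch: parse_local_html and its parameters are hypothetical names, and it relies on the behaviour observed in the comparison above, which has not been checked against other rdflib versions.

```python
def parse_local_html(graph, path, base_url):
    """Parse a cached HTML file into graph, using base_url as the document location."""
    with open(path, "r") as handle:
        # Pass the handle as file= rather than positionally: in the comparison
        # above, this keyword form did not trigger a network request.
        graph.parse(file=handle, format="html", location=base_url)
    return graph
```

With an rdflib Graph g, the call would look like `parse_local_html(g, 'empty.html', 'http://laurentian.ca/library/')`.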

@dbs dbs closed this as completed in 13e0955 Jan 10, 2016