pyld does not inspect Link headers #128

Open
alpha-beta-soup opened this issue Jun 19, 2020 · 4 comments

@alpha-beta-soup

Let's say I have this extremely minimal bit of JSON-LD to be expanded with pyld:

>>> import pyld
>>> d = {
...    "@context": "https://schema.org",
...    "@type":"Dataset",
...    "@id":"http://localhost:5000/collections/obs",
...    "url":"http://localhost:5000/collections/obs"
... }
>>> pyld.expand(d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'pyld' has no attribute 'expand'
>>> pyld.jsonld.expand(d)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 72, in loader
    'document': response.json()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 143, in _fetch_context
    remote_doc = jsonld.load_document(url,
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 6583, in load_document
    remote_doc = options['documentLoader'](url, options)
  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 100, in loader
    raise JsonLdError(
pyld.jsonld.JsonLdError: ('Could not retrieve a JSON-LD document from the URL.',)
Type: jsonld.LoadDocumentError
Code: loading document failed
Cause: Expecting value: line 1 column 1 (char 0)  File "/usr/local/lib/python3.8/dist-packages/pyld/documentloader/requests.py", line 72, in loader
    'document': response.json()
  File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 163, in expand
    return JsonLdProcessor().expand(input_, options)
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 870, in expand
    expanded = self._expand(active_ctx, None, document, options,
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 2302, in _expand
    active_ctx = self._process_context(
  File "/usr/local/lib/python3.8/dist-packages/pyld/jsonld.py", line 3049, in _process_context
    resolved = options['contextResolver'].resolve(active_ctx, local_ctx, options.get('base', ''))
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 58, in resolve
    resolved = self._resolve_remote_context(
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 108, in _resolve_remote_context
    context, remote_doc = self._fetch_context(active_ctx, url, cycles)
  File "/usr/local/lib/python3.8/dist-packages/pyld/context_resolver.py", line 148, in _fetch_context
    raise jsonld.JsonLdError(
pyld.jsonld.JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://schema.org', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}

If I substitute "https://schema.org" with "https://schema.org/docs/jsonldcontext.jsonld", with the code otherwise unchanged, it correctly prints what I expected:

>>> [{'@id': 'http://localhost:5000/collections/obs', '@type': ['http://schema.org/Dataset'], 'http://schema.org/url': [{'@id': 'http://localhost:5000/collections/obs'}]}]

However, that then seems to mess up other parsers, including the Google Structured Data Testing Tool.
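One way to keep "https://schema.org" in the document itself is to do the substitution at the loader level instead. The following is only a sketch of that idea: `make_redirecting_loader` and `SCHEMA_ORG_REDIRECTS` are hypothetical names, the wrapper assumes the loader is called as `loader(url, options)` (as the traceback above shows), and it has not been verified end to end against pyld.

```python
def make_redirecting_loader(base_loader, redirects):
    """Wrap a document loader so selected URLs are fetched from another URL."""
    def loader(url, options=None):
        # Fall back to the requested URL when no redirect is configured for it.
        return base_loader(redirects.get(url, url), options or {})
    return loader


# Hypothetical mapping for the case described in this issue.
SCHEMA_ORG_REDIRECTS = {
    "https://schema.org": "https://schema.org/docs/jsonldcontext.jsonld",
    "https://schema.org/": "https://schema.org/docs/jsonldcontext.jsonld",
}

# With pyld, the wiring would look roughly like this (not executed here):
#   from pyld import jsonld
#   jsonld.set_document_loader(
#       make_redirecting_loader(jsonld.requests_document_loader(),
#                               SCHEMA_ORG_REDIRECTS))
```

Because the redirect lives in the loader, other consumers of the JSON-LD still see the plain "https://schema.org" context.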

The root issue seems to be with pyld's remote fetching of contexts: "https://schema.org/" no longer responds with an application/ld+json content type, and instead uses a Link header with rel=alternate and type=application/ld+json. It seems that pyld needs to be updated to handle that case:

$ curl -I https://schema.org/ 
HTTP/2 200 
access-control-allow-credentials: true
access-control-allow-headers: Accept
access-control-allow-methods: GET
access-control-allow-origin: *
access-control-expose-headers: Link
link: </docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"
date: Fri, 19 Jun 2020 03:17:19 GMT
expires: Fri, 19 Jun 2020 03:27:19 GMT
etag: "G8zMyg"
x-cloud-trace-context: d2d5c536d73ce1590813f8e1018a2ad6
content-type: text/html
server: Google Frontend
content-length: 5100
age: 73
cache-control: public, max-age=600
alt-svc: h3-28=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-25=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

If you do curl https://schema.org/ -H "Accept: application/ld+json" you will still get back an HTML response.

Perhaps the cleanest way to implement this would be to check whether a non-JSON-LD response is received and, if so, to look for an appropriate Link header and then make a request to its target.
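The Link-header lookup described here could be sketched as below. This is not pyld's code: `parse_link_header` and `alternate_jsonld_url` are hypothetical helpers, and the parser is deliberately simplified (it does not handle commas or semicolons inside quoted parameter values).

```python
import re
from urllib.parse import urljoin

JSONLD_TYPE = "application/ld+json"


def parse_link_header(value):
    """Parse a Link header into a list of dicts (simplified parser)."""
    links = []
    # Each link looks like: <target>; param="value"; param=value
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', value):
        link = {"url": match.group(1)}
        for param in match.group(2).split(";"):
            if "=" in param:
                key, _, val = param.partition("=")
                link[key.strip().lower()] = val.strip().strip('"')
        links.append(link)
    return links


def alternate_jsonld_url(request_url, link_header):
    """Return the absolute URL of a JSON-LD alternate link, or None."""
    for link in parse_link_header(link_header or ""):
        rels = link.get("rel", "").lower().split()
        if "alternate" in rels and link.get("type", "").lower() == JSONLD_TYPE:
            # The target may be relative, so resolve it against the request URL.
            return urljoin(request_url, link["url"])
    return None


header = '</docs/jsonldcontext.jsonld>; rel="alternate"; type="application/ld+json"'
print(alternate_jsonld_url("https://schema.org/", header))
# https://schema.org/docs/jsonldcontext.jsonld
```

A loader would call this only after deciding the first response is not JSON-LD, and then issue a second request to the returned URL.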

@alpha-beta-soup
Author

alpha-beta-soup commented Jun 19, 2020

After reviewing the source, it isn't that pyld fails to inspect Link headers. Rather, it calls response.json() first, which raises an exception just before the Link header would be inspected, so it never gets that far. This could possibly be avoided by first checking whether the Content-Type is some kind of JSON (since https://schema.org responds with HTML). The error suggests that for whatever reason, at the point of the exception, response is None.

pyld.jsonld.JsonLdError: ('Dereferencing a URL did not result in a valid JSON-LD object. Possible causes are an inaccessible URL perhaps due to a same-origin policy (ensure the server uses CORS if you are using client-side JavaScript), too many redirects, a non-JSON response, or more than one HTTP Link Header was provided for a remote context.',)

Why require JSON responses if a Link of type alternate is intended to point to the alternate representation? https://html.spec.whatwg.org/multipage/links.html#rel-alternate

If the alternate keyword is used with the type attribute, it indicates that the referenced document is a reformulation of the current document in the specified format.
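The Content-Type check suggested in this comment might look like the following sketch. `is_json_media_type` and `decode_if_json` are hypothetical helper names, not part of pyld or requests; they take the header value and body as plain strings so the ordering can be shown without a network call.

```python
import json


def is_json_media_type(content_type):
    """True for application/json and any +json type such as application/ld+json."""
    # Drop parameters like "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


def decode_if_json(content_type, body):
    """Return the parsed document, or None when the server did not declare JSON."""
    if not is_json_media_type(content_type):
        # Skip parsing so the caller can go on to inspect the Link header.
        return None
    return json.loads(body)


print(decode_if_json("text/html", "<!DOCTYPE html>"))       # None
print(decode_if_json("application/ld+json", '{"@id": "x"}'))  # {'@id': 'x'}
```

Returning None instead of raising leaves room for the loader to try an alternate link before reporting a failure.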

@davidlehn
Member

The Link handling code is in the document loaders right below where that json() call happens. Quite possible that code hadn't been properly tested before. If someone has time to refactor that code to handle Link header in the proper order, that would be great.

@alpha-beta-soup
Author

alpha-beta-soup commented Jun 21, 2020

@davidlehn I'm trying to learn how the tests are put together, to get a clear failing case before trying to fix the issue. If you can help with that, I'm willing to try and fix it.

I have the existing test suite running (although I get five failures). To that I've added a manifest.json in the root, and two test cases at the root as well.

manifest.json:

{
  "@context": ["context.jsonld", {"@base": "manifest"}],
  "@id": "",
  "@type": "mf:Manifest",
  "name": "JSON-LD Test Suite",
  "description": "This manifest loads some tests for resolving https://github.com/digitalbazaar/pyld/issues/128",
  "sequence": [
    "sample.jsonld",
    "sample2.jsonld"
  ]
}

sample.jsonld

{
	"@context": "https://schema.org",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

sample2.jsonld

{
	"@context": "https://schema.org/docs/jsonldcontext.jsonld",
	"@type":"Dataset",
	"@id":"http://localhost:5000/collections/obs",
	"url":"http://localhost:5000/collections/obs"
}

I run the tests in a virtual environment as: python tests/runtests.py ./manifest.jsonld, but the test suite skips them:

/usr/lib/python3/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.24.1) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
PyLD Tests
Use -h or --help to view options.

JSON-LD Test Suite: http://localhost:5000/collections/obs: None ... skipped "Test type of ['Dataset']"
JSON-LD Test Suite: http://localhost:5000/collections/obs: None ... skipped "Test type of ['Dataset']"

----------------------------------------------------------------------
Ran 2 tests in 0.000s

OK (skipped=2)

How can I test these?

@mathiasrichter

The Link handling code is in the document loaders right below where that json() call happens. Quite possible that code hadn't been properly tested before. If someone has time to refactor that code to handle Link header in the proper order, that would be great.

Wouldn't a possible fix be as follows:

After performing the initial request which returns a response with alternate link headers:

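            # Requires: from urllib.parse import urljoin
            # response.links parses the Link header; it is empty when the header is absent.
            alternate = response.links.get('alternate', {})
            if alternate.get('type') == 'application/ld+json':
                # Resolve the (possibly relative) target against the final response URL.
                response = requests.get(
                    urljoin(response.url, alternate['url']), headers=headers, **kwargs)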
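One detail worth checking in a fix like this is how the Link target is combined with the response URL. The short comparison below assumes the response URL is "https://schema.org/" and uses the target from the Link header shown earlier in the thread. Plain string concatenation and `urllib.parse.urljoin` give different results.

```python
from urllib.parse import urljoin

base = "https://schema.org/"
target = "/docs/jsonldcontext.jsonld"

concatenated = base + target      # naive string concatenation
joined = urljoin(base, target)    # relative-reference resolution

print(concatenated)  # https://schema.org//docs/jsonldcontext.jsonld
print(joined)        # https://schema.org/docs/jsonldcontext.jsonld
```

The difference grows when the response URL has a path: a root-relative target replaces that path under `urljoin`, while concatenation appends to it.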
