Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RDF grammar #1217

Closed
wants to merge 1 commit into from
Closed

Fix RDF grammar #1217

wants to merge 1 commit into from

Conversation

Hendrikto
Copy link

@Hendrikto Hendrikto commented Mar 4, 2022

According to the specification [0], PN_CHARS_U is defined as:

PN_CHARS_U ::= PN_CHARS_BASE | '_' | ':'

The previous implementation was missing the ':' part.

Additionally, according to the code comment, PN_CHARS_U_N is defined as:

PN_CHARS_U_N ::= PN_CHARS_U | [0-9]

The previous implementation did not match the spec by deferring to PN_CHARS_U, and instead repeated its definition. Changing (in this case fixing) PN_CHARS_U should also affect PN_CHARS_U_N, which can be achieved simply by following the specification.

According to the specification [0], PN_CHARS_U is defined as:

    PN_CHARS_U ::= PN_CHARS_BASE | '_' | ':'

The previous implementation was missing the ':' part.

Additionally, according to the code comment, PN_CHARS_U_N is defined as:

    PN_CHARS_U_N ::= PN_CHARS_U | [0-9]

The previous implementation did not match the spec by deferring to
PN_CHARS_U, and instead repeated its definition. Changing (in this case
fixing) PN_CHARS_U should also affect PN_CHARS_U_N, which can be
achieved simply by following the specification.

[0]: https://www.w3.org/TR/n-triples/#grammar-production-PN_CHARS_U
@afs
Copy link
Member

afs commented Mar 4, 2022

Hi @Hendrikto - interesting - I think this is a bug in the N-triples spec.

How did you come across it?

In Turtle,

[164s] 	PN_CHARS_U 	::= 	PN_CHARS_BASE | '_'

No : -- and in Turtle : is used in prefix names. I ran your PR and got a lot of test failures because of that.

This matters because N-triples is supposed to be a subset of Turtle (a Turtle parser should be able to parse N-triples)

The way the numbering of the s rules are broken ("s" = supposed to be the same rule as SPARQL) suggest the N-Triples text was written much earlier. Neither set of s is right, Turtle is closer.

@gkellogg
Copy link

gkellogg commented Mar 4, 2022

The Ruby implementation of N-Triples uses a version of PN_CHARS_U missing ':'; not sure how it ever got in there.

I think it's worth an erratum and queue for eventual update to the spec.

@afs
Copy link
Member

afs commented Mar 5, 2022

We have tracked down the intent (in 2006) -- the NT and NQ specs didn't get updated during grammar development and aligning SPARQL with Turtle.

@Hendrikto - do you know of any system that creates N-Triples relying on : in the blank node label?

@Hendrikto
Copy link
Author

Hendrikto commented Mar 5, 2022

@Hendrikto - do you know of any system that creates N-Triples relying on : in the blank node label?

Yes, I work at Spinque, where we have encountered this discrepancy between spec and implementation while investigating an indefinite hang of our application during ingestion of RDF data. The same data is successfully parsed by rdfproc, but not by Jena.

I was not the one who discovered the bug. I requested more information from the reporter and the one who investigated the root cause. I will let you know once I have more information.

Anyway, there definitely are tools out in the wild that produce data according to the spec including ':', and tools which can parse that format.

@afs
Copy link
Member

afs commented Mar 5, 2022

@Hendrikto - thanks

Which tool generated the N-triples?

@Hendrikto
Copy link
Author

@afs Which tool generated the N-triples?

That is what I am trying to get more information on. I have not yet received an answer, but tomorrow is our weekly company meeting, and I can take the opportunity to ask there. I will report back to you with more information afterwards.

What I already know:

  • The file is called vgw_groninger_museum.nt. Not very helpful, I know… But it sounds like it was generated by a Dutch museum.
  • Example triple cuasign parsing errors: _:visual_item:61044b5c-228d-4e3e-a612-27211061684c-0-6422

@afs
Copy link
Member

afs commented Mar 9, 2022

Hi @Hendrikto --

The string 61044b5c-228d-4e3e-a612-27211061684c is a V4 UUID, then the string "-0-6422". That's a lot of uniqueness.

I did a quick survey of other open source systems. Some accept, some don't.

Oxigraph and RubyRDF have said they don't support colon in blank node labels. I ran Serd and it does not accept.

From code inspection, Eclipse RDF4J and rdfjs don't but I didn't run them so I might have got it wrong.

Redland (run tested) and rdflib (inspection) do accept.

There is a way to fix Jena without breaking Turtle. It'll need a change in the tokenizer and signalling that it should use the NT-definition of blank node labels.

Jena won't write such labels. Generally, other systems allocate a known-globally-unique system id.

Redland and Serd do not change the blank node label on parsing - they leave it to the app to do so if it is combined with other data, the app has to ensure uniqueness across files.

I've filed an errata for the RDF specs. It is definitely a bug because the text says NT is a subset of Turtle and this breaks that design choice. It's also why many systems do not accept - they use some or all of the same code.

https://lists.w3.org/Archives/Public/public-rdf-comments/2022Mar/0000.html
https://www.w3.org/2001/sw/wiki/RDF1.1_Errata

@Hendrikto
Copy link
Author

Hendrikto commented Mar 9, 2022

Hi @afs,

Thanks for looking into this so thoroughly. If I understand correctly, the situation is as follows:

Implementation accepts ':'
Oxigraph
RubyRDF
Serd
Eclipse RDF4J ✗(?)
rdfjs ✗(?)
Redland
rdflib

I have received the following statement from my company:

As far as we know at least internally we produce blank-nodes with ':'. The reason for using ':' is to be able to construct blank-nodes by combining/concatenating multiple simple names only consisting of PN_CHARS_BASE characters.

rdfproc handles according to spec:

$ cat test.nt
_:a:b <https://schema.org/name> "test" .
$ rdfproc test parse test.nt ntriples
rdfproc: Parsing URI file:///tmp/rdfproc/test.nt with ntriples parser
$ rdfproc test serialize ntriples
_:a:b <https://schema.org/name> "test" .

If Jena could comply with the current n-triples spec (or have a flag that can be set to allow the current specs), that would be great.


@afs I've filed an errata for the RDF specs. It is definitely a bug because the text says NT is a subset of Turtle

If you say so, I believe it. I was not aware of that design goal.

@afs There is a way to fix Jena without breaking Turtle.

If it can be done, that would be very nice. Should I close this PR and maybe open an issue instead, or just leave it? What's your preferred workflow?

@afs It'll need a change in the tokenizer and signalling that it should use the NT-definition of blank node labels.

I think working on such changes is out-of-scope for me. There is a good argument for applying this fix, because the specification bug has been around for quite a while, and there are implementations that follow it. Therefore, I hope you will consider implementing it.

Thanks again, and let me know if you need anything else.

@afs
Copy link
Member

afs commented Mar 9, 2022

Out of curiosity, what's going on here?

What do you use Jena for in your system.

The reason for using ':' is to be able to construct blank-nodes by combining/concatenating multiple simple names only consisting of PN_CHARS_BASE characters.

It suggests the label is meaningful, and the example has a UUID in it.

You might as well send the data through sed and turn the : into _ which is also not in PN_CHARS_BASE.

Jena does not preserve labels end-to-end so the simple name can't be extracted anyway.

_:ab <urn:ex:123> "abc"@ab .

which writes out using the encoded internal id which is unique for read parser run (you'll see a different id).

_:B36a51bfa87b2e87b418774d143cf2bbd <urn:ex:123> "abc"@ab .

then next time:

_:Bd70094c0f2e0e08779438a2c5dc498a1 <urn:ex:123> "abc"@ab .

@afs
Copy link
Member

afs commented Mar 9, 2022

Recorded as JENA-2304.

@Hendrikto
Copy link
Author

After reviewing the situation, we have decided to implement the fix you suggested: We replace : with _.

@afs
Copy link
Member

afs commented Mar 12, 2022

Hi @Hendrikto -- thanks for the update.

This has been an interesting issue - it's not often there is a issue with a spec like N-triples!

@afs afs closed this Mar 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants