Fix RDF grammar #1217

Hendrikto · 2022-03-04T12:27:50Z

According to the specification [0], PN_CHARS_U is defined as:

PN_CHARS_U ::= PN_CHARS_BASE | '_' | ':'

The previous implementation was missing the ':' part.

Additionally, according to the code comment, PN_CHARS_U_N is defined as:

PN_CHARS_U_N ::= PN_CHARS_U | [0-9]

The previous implementation did not match the spec by deferring to PN_CHARS_U, and instead repeated its definition. Changing (in this case fixing) PN_CHARS_U should also affect PN_CHARS_U_N, which can be achieved simply by following the specification.

According to the specification [0], PN_CHARS_U is defined as: PN_CHARS_U ::= PN_CHARS_BASE | '_' | ':' The previous implementation was missing the ':' part. Additionally, according to the code comment, PN_CHARS_U_N is defined as: PN_CHARS_U_N ::= PN_CHARS_U | [0-9] The previous implementation did not match the spec by deferring to PN_CHARS_U, and instead repeated its definition. Changing (in this case fixing) PN_CHARS_U should also affect PN_CHARS_U_N, which can be achieved simply by following the specification. [0]: https://www.w3.org/TR/n-triples/#grammar-production-PN_CHARS_U

afs · 2022-03-04T14:38:10Z

Hi @Hendrikto - interesting - I think this is a bug in the N-triples spec.

How did you come across it?

In Turtle,

[164s] 	PN_CHARS_U 	::= 	PN_CHARS_BASE | '_'

No : -- and in Turtle : is used in prefix names. I ran your PR and got a lot of test failures because of that.

This matters because N-triples is supposed to be a subset of Turtle (a Turtle parser should be able to parse N-triples)

The way the numbering of the s rules are broken ("s" = supposed to be the same rule as SPARQL) suggest the N-Triples text was written much earlier. Neither set of s is right, Turtle is closer.

gkellogg · 2022-03-04T19:09:09Z

The Ruby implementation of N-Triples uses a version of PN_CHARS_U missing ':'; not sure how it ever got in there.

I think it's worth an erratum and queue for eventual update to the spec.

afs · 2022-03-05T10:18:18Z

We have tracked down the intent (in 2006) -- the NT and NQ specs didn't get updated during grammar development and aligning SPARQL with Turtle.

@Hendrikto - do you know of any system that creates N-Triples relying on : in the blank node label?

Hendrikto · 2022-03-05T12:51:58Z

@Hendrikto - do you know of any system that creates N-Triples relying on : in the blank node label?

Yes, I work at Spinque, where we have encountered this discrepancy between spec and implementation while investigating an indefinite hang of our application during ingestion of RDF data. The same data is successfully parsed by rdfproc, but not by Jena.

I was not the one who discovered the bug. I requested more information from the reporter and the one who investigated the root cause. I will let you know once I have more information.

Anyway, there definitely are tools out in the wild that produce data according to the spec including ':', and tools which can parse that format.

afs · 2022-03-05T13:42:20Z

@Hendrikto - thanks

Which tool generated the N-triples?

Hendrikto · 2022-03-08T18:50:00Z

@afs Which tool generated the N-triples?

That is what I am trying to get more information on. I have not yet received an answer, but tomorrow is our weekly company meeting, and I can take the opportunity to ask there. I will report back to you with more information afterwards.

What I already know:

The file is called vgw_groninger_museum.nt. Not very helpful, I know… But it sounds like it was generated by a Dutch museum.
Example triple cuasign parsing errors: _:visual_item:61044b5c-228d-4e3e-a612-27211061684c-0-6422

afs · 2022-03-09T09:50:56Z

Hi @Hendrikto --

The string 61044b5c-228d-4e3e-a612-27211061684c is a V4 UUID, then the string "-0-6422". That's a lot of uniqueness.

I did a quick survey of other open source systems. Some accept, some don't.

Oxigraph and RubyRDF have said they don't support colon in blank node labels. I ran Serd and it does not accept.

From code inspection, Eclipse RDF4J and rdfjs don't but I didn't run them so I might have got it wrong.

Redland (run tested) and rdflib (inspection) do accept.

There is a way to fix Jena without breaking Turtle. It'll need a change in the tokenizer and signalling that it should use the NT-definition of blank node labels.

Jena won't write such labels. Generally, other systems allocate a known-globally-unique system id.

Redland and Serd do not change the blank node label on parsing - they leave it to the app to do so if it is combined with other data, the app has to ensure uniqueness across files.

I've filed an errata for the RDF specs. It is definitely a bug because the text says NT is a subset of Turtle and this breaks that design choice. It's also why many systems do not accept - they use some or all of the same code.

https://lists.w3.org/Archives/Public/public-rdf-comments/2022Mar/0000.html
https://www.w3.org/2001/sw/wiki/RDF1.1_Errata

Hendrikto · 2022-03-09T11:38:24Z

Hi @afs,

Thanks for looking into this so thoroughly. If I understand correctly, the situation is as follows:

Implementation	accepts `':'`
Oxigraph	✗
RubyRDF	✗
Serd	✗
Eclipse RDF4J	✗(?)
rdfjs	✗(?)
Redland	✓
rdflib	✓

I have received the following statement from my company:

As far as we know at least internally we produce blank-nodes with ':'. The reason for using ':' is to be able to construct blank-nodes by combining/concatenating multiple simple names only consisting of PN_CHARS_BASE characters.

rdfproc handles according to spec:
$ cat test.nt
_:a:b <https://schema.org/name> "test" .
$ rdfproc test parse test.nt ntriples
rdfproc: Parsing URI file:///tmp/rdfproc/test.nt with ntriples parser
$ rdfproc test serialize ntriples
_:a:b <https://schema.org/name> "test" .
If Jena could comply with the current n-triples spec (or have a flag that can be set to allow the current specs), that would be great.

@afs I've filed an errata for the RDF specs. It is definitely a bug because the text says NT is a subset of Turtle

If you say so, I believe it. I was not aware of that design goal.

@afs There is a way to fix Jena without breaking Turtle.

If it can be done, that would be very nice. Should I close this PR and maybe open an issue instead, or just leave it? What's your preferred workflow?

@afs It'll need a change in the tokenizer and signalling that it should use the NT-definition of blank node labels.

I think working on such changes is out-of-scope for me. There is a good argument for applying this fix, because the specification bug has been around for quite a while, and there are implementations that follow it. Therefore, I hope you will consider implementing it.

Thanks again, and let me know if you need anything else.

afs · 2022-03-09T12:02:26Z

Out of curiosity, what's going on here?

What do you use Jena for in your system.

The reason for using ':' is to be able to construct blank-nodes by combining/concatenating multiple simple names only consisting of PN_CHARS_BASE characters.

It suggests the label is meaningful, and the example has a UUID in it.

You might as well send the data through sed and turn the : into _ which is also not in PN_CHARS_BASE.

Jena does not preserve labels end-to-end so the simple name can't be extracted anyway.

_:ab <urn:ex:123> "abc"@ab .

which writes out using the encoded internal id which is unique for read parser run (you'll see a different id).

_:B36a51bfa87b2e87b418774d143cf2bbd <urn:ex:123> "abc"@ab .

then next time:

_:Bd70094c0f2e0e08779438a2c5dc498a1 <urn:ex:123> "abc"@ab .

afs · 2022-03-09T17:50:57Z

Recorded as JENA-2304.

Hendrikto · 2022-03-11T15:11:26Z

After reviewing the situation, we have decided to implement the fix you suggested: We replace : with _.

afs · 2022-03-12T09:09:26Z

Hi @Hendrikto -- thanks for the update.

This has been an interesting issue - it's not often there is a issue with a spec like N-triples!

afs closed this Mar 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix RDF grammar #1217

Fix RDF grammar #1217

Hendrikto commented Mar 4, 2022 •

edited

Loading

afs commented Mar 4, 2022

gkellogg commented Mar 4, 2022

afs commented Mar 5, 2022 •

edited

Loading

Hendrikto commented Mar 5, 2022 •

edited

Loading

afs commented Mar 5, 2022

Hendrikto commented Mar 8, 2022

afs commented Mar 9, 2022

Hendrikto commented Mar 9, 2022 •

edited by afs

Loading

afs commented Mar 9, 2022

afs commented Mar 9, 2022

Hendrikto commented Mar 11, 2022

afs commented Mar 12, 2022

Fix RDF grammar #1217

Fix RDF grammar #1217

Conversation

Hendrikto commented Mar 4, 2022 • edited Loading

afs commented Mar 4, 2022

gkellogg commented Mar 4, 2022

afs commented Mar 5, 2022 • edited Loading

Hendrikto commented Mar 5, 2022 • edited Loading

afs commented Mar 5, 2022

Hendrikto commented Mar 8, 2022

afs commented Mar 9, 2022

Hendrikto commented Mar 9, 2022 • edited by afs Loading

afs commented Mar 9, 2022

afs commented Mar 9, 2022

Hendrikto commented Mar 11, 2022

afs commented Mar 12, 2022

Hendrikto commented Mar 4, 2022 •

edited

Loading

afs commented Mar 5, 2022 •

edited

Loading

Hendrikto commented Mar 5, 2022 •

edited

Loading

Hendrikto commented Mar 9, 2022 •

edited by afs

Loading