-
Notifications
You must be signed in to change notification settings - Fork 650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix RDF grammar #1217
Fix RDF grammar #1217
Conversation
According to the specification [0], PN_CHARS_U is defined as: PN_CHARS_U ::= PN_CHARS_BASE | '_' | ':' The previous implementation was missing the ':' part. Additionally, according to the code comment, PN_CHARS_U_N is defined as: PN_CHARS_U_N ::= PN_CHARS_U | [0-9] The previous implementation did not match the spec by deferring to PN_CHARS_U, and instead repeated its definition. Changing (in this case fixing) PN_CHARS_U should also affect PN_CHARS_U_N, which can be achieved simply by following the specification. [0]: https://www.w3.org/TR/n-triples/#grammar-production-PN_CHARS_U
Hi @Hendrikto - interesting - I think this is a bug in the N-triples spec. How did you come across it? In Turtle,
No This matters because N-triples is supposed to be a subset of Turtle (a Turtle parser should be able to parse N-triples) The way the numbering of the |
The Ruby implementation of N-Triples uses a version of PN_CHARS_U missing ':'; not sure how it ever got in there. I think it's worth an erratum and queue for eventual update to the spec. |
We have tracked down the intent (in 2006) -- the NT and NQ specs didn't get updated during grammar development and aligning SPARQL with Turtle. @Hendrikto - do you know of any system that creates N-Triples relying on |
Yes, I work at Spinque, where we have encountered this discrepancy between spec and implementation while investigating an indefinite hang of our application during ingestion of RDF data. The same data is successfully parsed by I was not the one who discovered the bug. I requested more information from the reporter and the one who investigated the root cause. I will let you know once I have more information. Anyway, there definitely are tools out in the wild that produce data according to the spec including |
@Hendrikto - thanks Which tool generated the N-triples? |
That is what I am trying to get more information on. I have not yet received an answer, but tomorrow is our weekly company meeting, and I can take the opportunity to ask there. I will report back to you with more information afterwards. What I already know:
|
Hi @Hendrikto -- The string I did a quick survey of other open source systems. Some accept, some don't. Oxigraph and RubyRDF have said they don't support colon in blank node labels. I ran Serd and it does not accept. From code inspection, Eclipse RDF4J and rdfjs don't but I didn't run them so I might have got it wrong. Redland (run tested) and rdflib (inspection) do accept. There is a way to fix Jena without breaking Turtle. It'll need a change in the tokenizer and signalling that it should use the NT-definition of blank node labels. Jena won't write such labels. Generally, other systems allocate a known-globally-unique system id. Redland and Serd do not change the blank node label on parsing - they leave it to the app to do so if it is combined with other data, the app has to ensure uniqueness across files. I've filed an errata for the RDF specs. It is definitely a bug because the text says NT is a subset of Turtle and this breaks that design choice. It's also why many systems do not accept - they use some or all of the same code. https://lists.w3.org/Archives/Public/public-rdf-comments/2022Mar/0000.html |
Hi @afs, Thanks for looking into this so thoroughly. If I understand correctly, the situation is as follows:
I have received the following statement from my company:
If you say so, I believe it. I was not aware of that design goal.
If it can be done, that would be very nice. Should I close this PR and maybe open an issue instead, or just leave it? What's your preferred workflow?
I think working on such changes is out-of-scope for me. There is a good argument for applying this fix, because the specification bug has been around for quite a while, and there are implementations that follow it. Therefore, I hope you will consider implementing it. Thanks again, and let me know if you need anything else. |
Out of curiosity, what's going on here? What do you use Jena for in your system.
It suggests the label is meaningful, and the example has a UUID in it. You might as well send the data through Jena does not preserve labels end-to-end so the simple name can't be extracted anyway.
which writes out using the encoded internal id which is unique for read parser run (you'll see a different id).
then next time:
|
Recorded as JENA-2304. |
After reviewing the situation, we have decided to implement the fix you suggested: We replace |
Hi @Hendrikto -- thanks for the update. This has been an interesting issue - it's not often there is a issue with a spec like N-triples! |
According to the specification [0],
PN_CHARS_U
is defined as:The previous implementation was missing the
':'
part.Additionally, according to the code comment,
PN_CHARS_U_N
is defined as:The previous implementation did not match the spec by deferring to
PN_CHARS_U
, and instead repeated its definition. Changing (in this case fixing)PN_CHARS_U
should also affectPN_CHARS_U_N
, which can be achieved simply by following the specification.