Write canonical NTriples 1.1 by default #35

Open
plasticfist opened this issue Dec 31, 2021 · 6 comments

Comments

@plasticfist

plasticfist commented Dec 31, 2021

(Edited) The output does not appear to be UTF-8. Is this a bug? I thought UTF-8 would be the default, given that there is an option to "Write ASCII output if possible".

Example:

source triple from dbpedia/article-templates_lang=en_nested.ttl
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl
article-templates_lang=en_nested.ttl: UTF-8 Unicode text

serdi output:
<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested-serdi.nt
article-templates_lang=en_nested-serdi.nt: ASCII text, with very long lines

apache jena riot output:
<http://dbpedia.org/resource/André_Éric_Létourneau> <http://dbpedia.org/property/wikiPageUsesTemplate> <http://dbpedia.org/resource/Template:Birth_date_and_age> .

$ file article-templates_lang=en_nested.ttl.bz2-riot.nt
article-templates_lang=en_nested.ttl.bz2-riot.nt: UTF-8 Unicode text

Spec Reference:
https://www.w3.org/TR/n-triples/#canonical-ntriples

Note: At first I thought this might be a BOM-related rendering/display issue, but the file command would reveal a BOM if there were one, and the same tools were used to find and display the examples above.
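
As a stopgap for output like the serdi line above, the escapes can be decoded in a post-processing step. The following is a minimal Python sketch, not part of serd: the function name unescape_non_ascii is hypothetical, and it only rewrites \uXXXX and \UXXXXXXXX escapes for non-ASCII code points, leaving every other escape (such as \n, \" or \\) untouched. It does not check any of the other canonical N-Triples rules.

```python
import re

# Matches any backslash escape. Consuming "\\" as one unit prevents the
# text after an escaped backslash from being misread as a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")


def unescape_non_ascii(line: str) -> str:
    """Replace \\uXXXX / \\UXXXXXXXX escapes of non-ASCII characters with the characters themselves."""

    def repl(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # some other escape (\n, \", \\): keep as written
        code_point = int(hex_digits, 16)
        # Leave ASCII, surrogates, and out-of-range values escaped (assumption:
        # only non-ASCII characters need to be written literally).
        if code_point < 0x80 or 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _ESCAPE.sub(repl, line)


sample = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
print(unescape_non_ascii(sample))
```

Applied line by line to a file and written back with UTF-8 encoding, this should turn the escaped subject IRI into the same form that riot produces.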

@plasticfist changed the title from "UTF-8 characters in input are converted to \u code in the output (ntriples)" to "Output is ASCII? desire UTF-8 output (edited for clarity)" on Jan 3, 2022
@drobilla
Owner

drobilla commented Jan 6, 2022

This is a holdover from back in the day when NTriples was ASCII. serd now supports RDF 1.1 NTriples, which is UTF-8, but the command-line tool behaviour is still the same. The upcoming major version is more precise about this and lets you mix and match all kinds of options to get what you want.

I'm not sure if the default could be changed without breaking things for people in the current version. Maybe? I agree that the option existing (it's meant for Turtle) makes this confusing, but I'm hesitant to change it and potentially break people's existing scripts/workflows/whatever...

@drobilla
Owner

drobilla commented Jan 6, 2022

For reference, this is how the new command-line tool interfaces look: https://drobilla.net/files/serd_man_pages/ where serd-pipe is the closest thing to serdi. So the default will be UTF-8 everywhere, but you can -O ascii to ASCIIfy any syntax. This also lets you do nice things like write a "flat Turtle" file, like NTriples but with namespace prefixes, and so on.

@joelduerksen

I understand and can empathize with backwards compatibility, but the (current) specs seemed to be clear on this question, or I thought so on first read.

Quote: "The content encoding of N-Triples is always UTF-8."
Reference: 6. Media Type and Content Encoding

That said, I have to say they seem to walk back the clear directive in section 6.1 (if the document is served as text/plain, it must be ASCII with escapes, etc.). I guess this gets into the nuances of "web document types" as opposed to files, so when working outside that framework it is left up to individual interpretation. sigh.

@drobilla
Owner

drobilla commented Jan 7, 2022

ASCII is a subset of UTF-8. In other words, the output of serdi is UTF-8, and valid N-Triples.

It's not canonical RDF 1.1 N-Triples though, because escaping like this is not allowed there (see link in OP).
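
The distinction can be illustrated with a short Python sketch. The helper names are hypothetical, and the escape check is a simple heuristic rather than an N-Triples parser: the escaped serdi line is pure ASCII and decodes cleanly as UTF-8, yet it still contains \u escapes, which the canonical form does not allow.

```python
import re

# Matches any backslash escape; group 1 is set only for \uXXXX / \UXXXXXXXX.
_ESCAPE = re.compile(r"\\(?:(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})|.)")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_unicode_escape(line: str) -> bool:
    """Return True if the line contains a \\uXXXX or \\UXXXXXXXX escape."""
    return any(m.group(1) for m in _ESCAPE.finditer(line))


escaped = r"<http://dbpedia.org/resource/Andr\u00E9_\u00C9ric_L\u00E9tourneau>"
literal = "<http://dbpedia.org/resource/André_Éric_Létourneau>"

# Both forms are valid UTF-8 byte sequences; only the first is pure ASCII.
print(is_valid_utf8(escaped.encode("ascii")), escaped.isascii())
print(is_valid_utf8(literal.encode("utf-8")), literal.isascii())

# Only the escaped form contains the escapes that canonical output avoids.
print(has_unicode_escape(escaped), has_unicode_escape(literal))
```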

@plasticfist
Author

Ok, I'll rephrase the ticket request: I would like a command-line tool that outputs canonical N-Triples (no escaped characters).
Whether you make it the default or not is up to you, as long as it is possible. I wouldn't mind adding --canonical to the command line if required. No worries here.

@drobilla
Owner

drobilla commented Jan 7, 2022

Sure, I was just responding to the above comment. If you want this right now, I suggest building the serd1 branch from git and using serd-pipe. My top priority is getting the new major version out; there will probably not be any more non-trivial releases of 0.x.x.

I'll make a note to double-check the other canonical rules and make sure that the default output adheres to them, but I think it does.

@drobilla changed the title from "Output is ASCII? desire UTF-8 output (edited for clarity)" to "Write canonical NTriples 1.1 by default" on Jan 20, 2024
3 participants