
Specify encoding of .license files #106

Merged
merged 2 commits into from Jun 21, 2023

Conversation

mxmehl
Member

@mxmehl mxmehl commented May 12, 2022

Fixes #73

I took the very simple suggestion by @kirelagin. @silverhook had some concerns regarding UTF-8 but if I understood correctly they were cleared. Please correct me if I'm wrong :)

@mxmehl mxmehl added the spec Specification label May 12, 2022
@silverhook
Collaborator

The issue with Unicode in general was that in the embedded sphere to save space they tend to use ASCII encoding.

Personally, I much prefer Unicode, but it might make sense to check with the embedded community if this messes anything up for them. (and if so, we’d need to weigh the two)

@mxmehl
Member Author

mxmehl commented May 12, 2022

True, but how about what has been said in the issue?

UTF-8 is an ASCII-compatible encoding (a superset of ASCII where every byte value that is allowed in ASCII means the same thing in UTF-8), so every ASCII text file is automatically a valid UTF-8 text file.

Wouldn't then implementors be able to use ASCII encoding and still comply with the spec?
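To illustrate the point being made here, a quick Python sketch (not from the PR itself, sample string is made up): any byte sequence that is valid ASCII decodes identically under UTF-8, while the reverse only holds as long as the text stays within ASCII's 128 characters.

```python
# An ASCII-only file is automatically a valid UTF-8 file:
# both codecs interpret those byte values the same way.
ascii_bytes = "Copyright 2022 Jane Doe".encode("ascii")
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# The reverse does not hold once the text leaves the ASCII range:
try:
    "Šuklje".encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")
```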

@silverhook
Collaborator

I know enough about character encoding to know that Unicode includes all of ASCII, so it probably should fall back gracefully.

So I did some tests on my machine to see what happens if I save BSD-4-Clause with my name as copyright holder and encode it differently. But take this with a huge pinch of salt, as I’m nowhere near either an encoding or embedded expert.

The results seem to be that:

  • for UTF8, only the non-ASCII characters consume more space
  • for UTF16 it does not seem to matter; it’s double the size per character however you spin it (according to Wikipedia it is also the only common web encoding that is incompatible with ASCII)

So given that, I’d say at least UTF8 should not make things more complicated.

converted to ASCII with iconv:

  • 11 lines
  • 241 characters
  • 1634 bytes

UTF8, no non-ASCII chars (= Matija Suklje):

  • 11 lines
  • 241 characters
  • 1634 bytes

UTF8, with special chars (= Matija Šuklje):

  • 11 lines
  • 242 characters
  • 1643 bytes

UTF16, no non-ASCII chars:

  • 11 lines
  • 241 characters
  • 3270 bytes

UTF16, with special chars:

  • 11 lines
  • 241 characters
  • 3270 bytes

P.S. This webpage also seems interesting on this topic: http://utf8everywhere.org/
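The pattern in the measurements above can be reproduced with a short Python sketch (my own illustration, using only the name from the comment rather than the full license text): in UTF-8 only the non-ASCII character costs an extra byte, while UTF-16 doubles the size regardless.

```python
plain = "Copyright 2022 Matija Suklje"     # ASCII only, 28 characters
accented = "Copyright 2022 Matija Šuklje"  # one non-ASCII character

for text in (plain, accented):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # without the 2-byte BOM
    print(len(text), "chars,", len(utf8), "bytes UTF-8,", len(utf16), "bytes UTF-16")
# UTF-8: 28 vs 29 bytes (only Š costs extra); UTF-16: 56 bytes either way.
```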

@kirelagin

Hey everyone, so, I think let’s first fix the terminology in order to avoid any possible confusion.

Unicode is not an encoding, it is just a huge list of characters (codepoints), so basically it’s just a collection of all symbols that computers can work with (like, letters, digits, punctuation, hieroglyphics, emojis, etc.). There is really no alternative to it, so we are always implicitly talking about Unicode – if a character is in Unicode, you can use it in a computer; if it is not – then you can’t.

Now, Unicode is a list of “abstract” characters. The real question is how to represent those characters inside a computer, because computers want bytes (or rather bits). That’s where encodings come into play as they specify how to serialise a Unicode character into a sequence of bytes and deserialise it back. Some encodings cover the entirety of Unicode (e.g. UTF-8), some only cover some subsets (e.g. ASCII only supports digits, latin letters, and some control characters – a total of 128 possible characters, much less than all of Unicode).
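The codepoint-versus-encoding distinction can be made concrete with a small Python sketch (my own illustration, not from the comment): the same abstract character has one codepoint but different byte serialisations depending on the encoding.

```python
ch = "é"                      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(hex(ord(ch)))           # the abstract codepoint: 0xe9

# Different encodings serialise the same codepoint to different bytes:
print(ch.encode("utf-8"))     # b'\xc3\xa9' (two bytes)
print(ch.encode("utf-16-le")) # b'\xe9\x00' (two different bytes)

# ASCII covers only 128 codepoints, so it cannot represent this one:
try:
    ch.encode("ascii")
except UnicodeEncodeError:
    print("outside the ASCII subset")
```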

Next, file types. When you have some file (a sequence of bytes), in order to be able to interpret it, you need to know what the format of this file is. Think JPEG, which has a specification, which explains how to take the bytes that constitute a JPEG file and interpret it as a picture. An unfortunate truth about text files is that a “text file” is a sequence of Unicode characters, but that is an extremely high-level, abstract definition, since it says nothing about how those characters are actually stored on disk. So, one can’t just say that a REUSE .license file is a text file, that just does not say enough about how to actually read or write such a file. You have to specify the encoding.

Now, that’s true for any text file. Before the Internet, that was not such a big issue, because everyone just had some default encoding configured on their computer and they would not need to think about it, since the files were only read and written by them on their own computer. But with the Internet, we are constantly exchanging text files and suddenly we need to know what each file’s encoding is in order to be able to read it. These days, the de-facto standard encoding for text files is UTF-8 (for many reasons, including historical, but, most importantly, because it is actually the most sensible choice).

You can also read about this here: https://serokell.io/blog/haskell-with-utf8, just to rehash what I wrote above.

Now, embedded systems. There are sort of two questions here.

The first one is software support. If you have, like, vim on your router, it might or might not be able to correctly display and edit UTF-8 encoded text. If it does, then all is fine, if it does not, then you are, ummm, out of luck? Because, like, you do not always control what is in the files that you want to edit. The space saving here can potentially come from the fact that having proper support for UTF-8 required some code (or, more likely, libraries), so some people might want to build their software without UTF-8 support.

The second one is the storage space used by files themselves. When we are talking about UTF-8, that’s not really any concern at all, since UTF-8 is very efficient and, more importantly, it gives you control: if you stay within basic latin characters, the size of your file will end up being as small as reasonably possible.


To sum up.

  1. The specification for .license files must say what encoding is used for the text files, since otherwise it is ambiguous and those files are borderline useless – to successfully read them, one needs to know the encoding.
  2. The choice is really between ASCII and UTF-8. If we specify ASCII, we are limiting the content of .license files to the 128 Unicode characters that are representable in ASCII. If we specify UTF-8, we allow all Unicode characters at the cost of those files potentially not being fully readable on systems where software has no support for UTF-8 for whatever reason.

Lastly, the following is true: for any Unicode character, if it is representable in ASCII, its encoding in ASCII is guaranteed to be the same as its encoding in UTF-8. This provides a degree of backward-compatibility: if your UTF-8-encoded file only contains characters from the subset representable in ASCII, then you can work with it using software that does not know anything about UTF-8 and assumes ASCII encoding.
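This backward-compatibility guarantee can be verified exhaustively with a short Python sketch (my own illustration): every one of the 128 ASCII codepoints encodes to exactly the same byte under both codecs.

```python
# For every ASCII codepoint, the ASCII and UTF-8 encodings are
# byte-for-byte identical, which is why an ASCII-only UTF-8 file
# remains readable by software that only understands ASCII.
for cp in range(128):
    ch = chr(cp)
    assert ch.encode("ascii") == ch.encode("utf-8")
print("all 128 ASCII codepoints encode identically in UTF-8")
```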

This whole embedded systems concern does not really make much sense to me, since we are talking about text files here, and, like, embedded system developers do not have complete control over text files that will be present/used in their systems, so one way or another, almost certainly there will be some files containing non-ASCII characters in those systems. That’s just... not an issue, because one can simply avoid editing those files on embedded systems. But if there is control over text files, then, indeed, if you simply make sure that your text files do not go beyond the ASCII range, there will be no practical difference between UTF-8 and ASCII encoded text.

I’m sorry my comment ended up being so long, but I just wanted to clarify the situation for everyone once and for all, since any concerns over the use of UTF-8 are, frankly, very frustrating to me, at least because that is what is used in practice anyway.

@silverhook
Collaborator

Well put @kirelagin. I glossed over some details (and you clearly know more than I do as well). I agree that I don’t see a compelling reason to choose ASCII (or UTF-16 or UTF-32) over UTF-8 in practice.

But it is great to be equipped for the eventual comment (again) from the embedded community if it comes to that. I think we now understand enough to both 1) confidently demand UTF-8 and 2) defend that position.

@mxmehl mxmehl changed the title Specify format of .license files Specify encoding of .license files Mar 6, 2023
Member

@linozen linozen left a comment


LGTM 👍

@linozen linozen merged commit fe45bb6 into main Jun 21, 2023
@linozen linozen deleted the spec-dotlicense-files branch June 21, 2023 09:04
Labels
spec Specification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specify the format of .license files
4 participants