Geany encoding determination broken? #2910

Closed
drws opened this issue Sep 28, 2021 · 9 comments

drws commented Sep 28, 2021

Geany version: 1.37.1

I've enabled Files > Encodings > Use fixed encoding when opening non-Unicode files with default non-Unicode encoding set to None.

I'm attaching three examples that demonstrate the issue(s) involved. They contain 8-bit (extended ASCII) values and I am unsuccessfully trying to open these files in Geany.

The first file iso88591.txt opens as ISO-8859-1 (even though the default is None for non-Unicode!). The example file contains:
C0 61 0A

The file breaks.txt, well, breaks (cannot be opened). It might be detected as a 16-bit encoding, which then fails because of the odd number of bytes. Its contents are:
C0 61 00

And finally, the file utf16le.txt opens as UTF-16LE. It contains:
C0 61 00 0A

An expected outcome in the first two cases would be that the file opened with no encoding (due to default non-Unicode setting). In the third case, Unicode detection is expected, but there is another issue. When the encoding is set in Geany, the displayed output does not change for me (such as in the case of endline conversion for example). Is Geany really unable to properly open any of the above files?

Member

elextr commented Sep 28, 2021

You should include error messages so we can see where the failures happen, but some are obvious:

  1. None of the files are ASCII. ASCII only defines values below 128 (0x80), and all three files contain 0xC0.
  2. There is no such thing as "no encoding", so the selection in Preferences->Files->Encodings->Use fixed encoding might be seen as misleading in this case. The Geany buffer must be UTF-8, so loading without any encoding conversion still has to check for UTF-8. That check fails because none of your files are UTF-8, so to try to be helpful Geany falls back to finding an encoding that works and converts from it to UTF-8.
  3. The first working encoding it happens to find for iso88591.txt is ISO-8859-1.
  4. breaks.txt contains a null character, which is documented not to work. As the manual says: "Only text files are supported, i.e. opening files which contain NULL-bytes may fail. Geany will try to open the file anyway but it is likely that the file will be truncated because it can only be read up to the first occurrence of a NULL-byte. All characters after this position are lost and are not written when you save the file." Geany is C code and C strings are terminated by a NUL byte, so all operations stop at that point, as noted in the manual. If truncation could happen, Geany will not open the file, to avoid possible data loss.
  5. For utf16le.txt Geany also searches for an encoding that works and happens to find UTF-16LE. I'm not sure what you mean by "the displayed output does not change for me". The code points the file represents in UTF-16LE may not exist in your font, so the code point value may be shown instead. Here I get a CJK character for 0x61C0 and the hex value 0A00 for 0x0A00, since my font doesn't have whatever that is.
  6. Detection of encodings for files a few bytes long is entirely unreliable. Which encoding is found first is just deterministic randomness: it depends on the order of an internal list that can change in any release.
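The byte-level points in the list above can be checked outside Geany. This is a minimal sketch that assumes Python's built-in codecs as a stand-in for the system conversion libraries Geany actually uses. The `FILES` table and the `decodes_as` helper are hypothetical names for illustration only.

```python
# Contents of the three example files, as given in the report.
FILES = {
    "iso88591.txt": bytes([0xC0, 0x61, 0x0A]),
    "breaks.txt": bytes([0xC0, 0x61, 0x00]),
    "utf16le.txt": bytes([0xC0, 0x61, 0x00, 0x0A]),
}


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


for name, data in FILES.items():
    print(
        name,
        "ascii:", decodes_as(data, "ascii"),
        "utf-8:", decodes_as(data, "utf-8"),
        "iso-8859-1:", decodes_as(data, "iso-8859-1"),
        "utf-16-le:", decodes_as(data, "utf-16-le"),
        "contains NUL:", b"\x00" in data,
    )

# The two 16-bit code units that utf16le.txt represents in UTF-16LE.
print([hex(ord(ch)) for ch in FILES["utf16le.txt"].decode("utf-16-le")])
```

None of the three inputs is valid ASCII or UTF-8. Only the four-byte file has an even length, so it is the only one that can be read as UTF-16LE, where it yields the units 0x61C0 and 0x0A00.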

Is Geany really unable to properly open any of the above files?

The first and third open "properly" by the Geany definition of "properly", and the second is refused for the reason given.

elextr added the invalid label Sep 28, 2021
elextr closed this as completed Sep 28, 2021
Author

drws commented Sep 29, 2021

  1. There are no error messages displayed in my workflow. If I may be a bit ignorant, it just doesn't work. But it would be a very good idea to notify the user about all these limitations when they are encountered (a NULL value, for example). Since Geany is not ashamed of nagging the user (Geany does not respect write-protected .geany files #2863), it is only logical for such notifications (confirmation windows) to be included.

  2. I know it's not ASCII, that's why it's in brackets. I would only like to display ASCII symbols and all the other values with special (hex or so) symbols. Geany doesn't let me do that in any way. I also stated it's an 8-bit ASCII, which you conveniently overlooked. As a developer, I'm sure you know what I'm talking about, otherwise just reread my OP and this time overlook the word ASCII.

  3. I know all that, but what you are trying to present as my issue is actually Geany's shitty nomenclature, not mine. I only used it out of respect for the developers, who apparently use it. If I need to be more explicit, Geany has a Without encoding (None) option in Document > Set Encoding. Maybe that is another issue?

  4. And that does not seem broken to you, especially if non-Unicode default is set to no encoding?

  5. I agree. But I'd still expect better handling of this from a versatile text editor (see 0th point above).

  6. By "output not changing" I mean that I use the Document > Set Encoding function, select a new encoding and nothing ever changes after that. For example: If a file is opened as Unicode gibberish, it stays like that no matter which encoding is selected. I've tried this with different files/encodings, always the same, as if the function wasn't there. Can this functionality be broken from outside?

  7. I completely agree and that is why I think one should be able to disable this sporadic encoding-detection black magic altogether and just have a... you know... a working text editor...

Member

elextr commented Sep 29, 2021

  1. When opening breaks.txt, the message 'The file "breaks.txt" does not look like a text file or the file encoding is not supported.' appears in the status bar. That could be changed to a dialog, but compared to "Geany does not respect write-protected .geany files" (#2863), not opening a file is not a data loss situation, so I expect that's why no dialog has been created to date.

  2. Geany is a text editor, not a binary or hex editor. There are already hex editors available, and although it's been requested, nobody has found enough benefit to do the work of adding the facility to Geany (noting that it's likely to be a considerable amount of work). As I said, ASCII is not 8 bits, it's 7 bits. There are many 8-bit extensions, including ISO 8859, whose parts (numbered up to 16) each interpret the upper 128 values differently.

  3. Being rude is not a good way to get volunteers who build software in their own time to do something. I said that possibly the label on the setting might be considered as misleading, suggesting a constructive alternative would be useful, not complaining in bold type.

  4. As I have explained, there is no way to load files other than by conversion to UTF-8, anything else is a complete re-write of the editor and unlikely to happen. So trying to find an encoding that works is a useful process for most users.

  5. Whatever the expectations, it's not the case and isn't likely to change anytime soon.

  6. As I explained, the buffer in memory is always UTF-8. So encoding only comes into loading and saving files, nothing in the buffer will change when you select to save it as a different encoding so the display won't change.
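The load/convert/save split described above can be illustrated with a short sketch. The `load` and `save` helpers are hypothetical and written in Python for brevity. Geany's real code is C and is not reproduced here.

```python
def load(raw: bytes, file_encoding: str) -> bytes:
    """Convert file bytes in file_encoding to a UTF-8 buffer."""
    return raw.decode(file_encoding).encode("utf-8")


def save(buffer: bytes, file_encoding: str) -> bytes:
    """Convert the UTF-8 buffer back to file bytes in file_encoding."""
    return buffer.decode("utf-8").encode(file_encoding)


raw = bytes([0xC0, 0x61, 0x0A])

# Loading converts once; the buffer is UTF-8 from here on.
buffer = load(raw, "iso-8859-1")
print(buffer)

# Choosing another encoding only changes the bytes written on save.
saved_latin1 = save(buffer, "iso-8859-1")
saved_utf16le = save(buffer, "utf-16-le")
print(saved_latin1, saved_utf16le)
```

The buffer is identical whichever save encoding is chosen afterwards, which is why the display does not change when a different encoding is selected for an already loaded document.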

  7. As I said above, Geany edits text, not binary, and text is always encoded. There has to be some agreement that 0x20 is a space, and that's what an encoding is. There is no way of "disabling" encoding; that's not how text works. For text the reference is the Unicode standard, so the buffer is Unicode, specifically UTF-8. What you seem to want is a binary editor and, as I have said, Geany is not that.

I tried Gedit and it complains about the encoding of all three cases too.

Author

drws commented Sep 30, 2021

  1. While I see what you mean, there is still (too?) big a difference in how the user is notified between the two examples. In one case the user is bombarded with confirmation windows after every action, and in the other the Message Window has to be open for the message to be seen. All I'm saying is that, comparing these two examples, one or the other is probably exaggerated in its own way.

  2. I added an adjective extended in front of ASCII to the OP so as to stop these redundant ASCII reexplanations. By 8-bit ASCII I meant ASCII-compatible 8-bit extension the first time around.

  3. You are right in both my poor choice of a word and the fact that suggesting a better nomenclature would be more productive. I do have an idea: shouldn't Without encoding be simply named ASCII?

The examples I provided were minimal examples from an actual raw ASCII dump where only a few characters here and there were corrupted due to transient events on the RS-232 line during a legacy device bootup. While it is debatable whether such data calls for a binary editor, the fact remains that Geany is perfectly able to display similar files and fails only at some due to buggy encoding detection. If some, but not all values are removed (in a kilobyte-grade log, not minimal examples here), the file is sometimes opened properly, meaning 7-bit ASCII characters are displayed and other values are displayed with hex symbols. Other times it's an UTF-16 jungle because of a single 8-bit value occurrence or similar.

So the encoding detection logic works sporadically even with longer files. Why wouldn't user be able to manually set encoding if autodetection is not bulletproof and displayed encoding cannot be changed without a reload? Geany is perfectly able to handle some hybrid ASCII-binary files and its encoding detection logic is the only thing preventing it from handling the remaining ones.

Also, I tried to set something other than Without encoding in Preferences > Files > Encodings > Default encoding (existing non-Unicode files) and iso88591.txt still opens as ISO-8859-1. Does this also not add up to you or am I missing something?

Member

elextr commented Oct 1, 2021

By 8-bit ASCII I meant ASCII-compatible 8-bit extension the first time around.

As I explained, there is no such thing. ASCII only defines what values less than 128 mean; it does not define what values with the most significant bit set mean. So no, it's also not a suitable name for "no specified encoding", since the top 128 values are undefined. The ISO-8859 series encodings are examples of what the top 128 values can mean, with 15 published parts numbered up to 16. Actually "no specified encoding" might be a good label for the setting, since it alludes to falling back on searching for one.
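To make the point about the top 128 values concrete, here is a small sketch that decodes the single byte 0xC0 under a few ISO-8859 parts. It assumes Python's codec names for those encodings. The `interpretations` helper is a hypothetical name.

```python
import unicodedata

SAMPLE = bytes([0xC0])

# A few members of the ISO-8859 family, using Python's codec names.
PARTS = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]


def interpretations(data: bytes, encodings: list) -> dict:
    """Decode the same bytes under each encoding."""
    return {enc: data.decode(enc) for enc in encodings}


for enc, text in interpretations(SAMPLE, PARTS).items():
    print(f"{enc}: U+{ord(text):04X} {unicodedata.name(text)}")

# ASCII itself assigns no meaning to 0xC0.
try:
    SAMPLE.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii:", exc.reason)
```

Each part maps the same byte to a different character, so a label such as "ASCII" cannot say what the byte means.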

So the encoding detection logic works sporadically even with longer files.

What you are actually seeing is "sporadically a longer file with random errors happens to be a valid file in some encoding". So that encoding is found. This is also why "Geany is perfectly able to handle some hybrid ASCII-binary files", the file happened to be a valid encoding. The encoding detection is not buggy, the files are 😁

So the solution would be to fix the file, possibly with a script that replaces non-ASCII values with a selected ASCII value, or masks off the MSB, or replaces each non-ASCII value with a valid and very visible UTF-8 character, possibly an emoji like 👿. That's something only you as the programmer can decide and do.
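The three clean-up strategies mentioned above could look roughly like this. It is a sketch only: the function names are hypothetical, NUL bytes are deliberately left untouched (they count as ASCII here), and the substitute characters are arbitrary choices.

```python
def replace_with_ascii(data: bytes, substitute: bytes = b"?") -> bytes:
    """Replace every byte >= 0x80 with one chosen ASCII byte."""
    return bytes(b if b < 0x80 else substitute[0] for b in data)


def mask_msb(data: bytes) -> bytes:
    """Clear the most significant bit of every byte."""
    return bytes(b & 0x7F for b in data)


def replace_with_marker(data: bytes, marker: str = "\U0001F47F") -> bytes:
    """Replace every byte >= 0x80 with a visible character, output as UTF-8."""
    text = "".join(chr(b) if b < 0x80 else marker for b in data)
    return text.encode("utf-8")


RAW = bytes([0xC0, 0x61, 0x0A])
print(replace_with_ascii(RAW))
print(mask_msb(RAW))
print(replace_with_marker(RAW))
```

All three outputs are valid UTF-8, but all three discard the original high values, so they only suit cases where those values are not themselves of interest.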

the file is sometimes opened properly, meaning 7-bit ASCII characters are displayed and other values are displayed with hex symbols. Other times it's an UTF-16 jungle because of a single 8-bit value occurrence or similar.

That's again a manifestation of finding an encoding in which the file is valid and converting that to UTF-8 in the buffer.

The display as hex values is something the font management does, not Geany: either a synthetic glyph is generated when no font has one for that value, or a font in your stack does have a glyph for it. Sometimes missing glyphs are shown as squares rather than hex. (To be precise, it's done by Harfbuzz, used by Pango, which is part of GTK, which is used by Scintilla, which is used by Geany, so it's well-buried behaviour and controlled by "many" things.)

Why wouldn't user be able to manually set encoding

The current behaviour of searching for a valid encoding has evolved to handle a common use-case where files are in mixed encodings (is/was common on Windows in non-English speaking locales IIUC and many Geany contributors are in such locales). It would be possible to add a "use this encoding only" option (but somebody has to do it), but the result if the file was not valid in that encoding would have to be a refusal to load since Geany would have no idea what UTF-8 the file contents were meant to be converted to if the file was not valid in that encoding.

Also, I tried to set something other than Without encoding in Preferences > Files > Encodings > Default encoding (existing non-Unicode files) and iso88591.txt still opens as ISO-8859-1. Does this also not add up to you or am I missing something?

Again, if you selected something other than a valid encoding for that file, the behaviour is to fall back to searching for a working encoding. All the encoding settings are "try this first, then search" rather than "just try this or fail". That's probably where the "None" for the default encoding comes from, meaning "Don't try anything first, just search".
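The "try this first, then search" behaviour can be sketched as follows. The fallback list and the function name are hypothetical. Geany's internal list is different and, as noted earlier, its order can change between releases.

```python
# Hypothetical fallback order, for illustration only.
FALLBACK_ENCODINGS = ["utf-8", "iso-8859-1", "utf-16-le"]


def open_with_preference(data: bytes, preferred=None):
    """Try the preferred encoding first, then search the fallback list."""
    candidates = ([preferred] if preferred else []) + FALLBACK_ENCODINGS
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, keep searching
    raise ValueError("no candidate encoding accepts this data")


data = bytes([0xC0, 0x61, 0x0A])
print(open_with_preference(data)[0])               # "None": just search
print(open_with_preference(data, "utf-16-le")[0])  # preference fails, search
print(open_with_preference(data, "iso-8859-5")[0]) # preference is valid, used
```

With this model, a preference that does not fit the file gives the same result as no preference at all, which matches the behaviour reported for iso88591.txt.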

Just to finally re-emphasise, the Geany buffer has to be valid UTF-8 with no embedded NULLs, all the editing and other functions assume it, and depend on it, so loading invalid UTF-8 sequences "without encoding" is not possible, the input must be valid in UTF-8 or an encoding that can be converted to UTF-8.

Author

drws commented Oct 2, 2021

I explained there is no such thing

Of course there is backward compatibility in many 8-bit encodings. An ASCII-encoded file can be opened and treated as ISO-8859-1 for example (well, maybe not so easily in Geany...) since the former is a subset of the latter.

"no specified encoding" might be a good label for the setting

Maybe in the Set Encoding submenu, but as you already mentioned, it is still misleading in the Preferences. Since you described it as Don't try anything first, just search, could it simply be named Auto-detect, Auto-search or something like that?

Also, doesn't the Without encoding (None) setting effectively nullify its parent Use fixed encoding when opening non-Unicode files and if so, is actually redundant?

the file happened to be a valid encoding

Autodetection is particularly sensitive to the first few bytes, but not in every case. I still think there's something wrong with encoding detection in Geany, setting the 8-bit values aside. I hoped OP examples would be convincing enough, but I'll provide more relevant examples when I encounter that again.

The encoding detection is not buggy, the files are

Oh, come on. As if Geany was a sticky notes application, not a versatile developer's tool. I agreed in the very beginning that hex editor was the correct generic answer, but the question remains whether Geany is really going to refuse opening hybrid files that it potentially could. Although it is apparent now that some additional hackery would be needed for that.

the solution would be to fix the file

Of course and I'm doing that while also outputting an ordinary log file in the very same program. But this is a case of a debugging log with raw 8-bit values and 99.9% of content being ASCII and I was interested in the actual values above 128.

the Geany buffer has to be valid UTF-8 with no embedded NULLs

I understand the NULL value limitation and even then I think the user should be actively notified and also given a chance to load file up to the first NULL occurrence, with a red data-loss warning (and possibly a Save As... shortcut) included of course. But not having a chance at all is not really a solution.

the input must be valid in UTF-8

Setting the NULL limitation aside, we are talking about values 1-255, which are valid UTF-8. So theoretically any nonzero uint8_t data can be represented in such a buffer or am I mistaken somewhere?

Member

elextr commented Oct 3, 2021

All the discussion above has been about the Preferences->Files setting which sets a default.

That made me forget (well, it's hidden normally and I don't edit anything but UTF-8, so I don't use it ... ever :-) about the setting on the open dialog, which is enforced, even None. But it applies only to the selected file(s), not to any file opened from the command line, goto symbol or any other method; those continue to use the preference.

If it's set there, all your example files open, although breaks.txt and utf16le.txt appear to be truncated as expected (no endline).
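The truncation is what NUL-terminated string handling would produce. A minimal sketch of that effect, using a hypothetical `c_string_view` helper rather than anything from Geany:

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated C string consumer would see."""
    return data.split(b"\x00", 1)[0]


breaks_txt = bytes([0xC0, 0x61, 0x00])
utf16le_txt = bytes([0xC0, 0x61, 0x00, 0x0A])
iso88591_txt = bytes([0xC0, 0x61, 0x0A])

for name, data in [
    ("breaks.txt", breaks_txt),
    ("utf16le.txt", utf16le_txt),
    ("iso88591.txt", iso88591_txt),
]:
    seen = c_string_view(data)
    print(name, seen.hex(), "truncated" if seen != data else "complete")
```

Both files containing 0x00 end after C0 61, so the trailing 0x0A of utf16le.txt is lost, consistent with the missing endline.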

But as I said above, beware that it's unknown what functionality will work with invalid UTF-8 in the buffer. So if Geany or a plugin doesn't work or crashes, you have been warned (but do report Geany crashes, we try to fix those). The Scintilla editing widget claims to handle illegal bytes and show them as lozenge shapes with the hex in them, so actions executed directly by Scintilla will likely work fine, but you have no real way of telling which actions those are; mostly simple editing.

Oh, come on. As if Geany was a sticky notes application, not a versatile developer's tool.

As I said, Geany is a volunteer project. People do what they need or want to do, and nobody has needed or wanted this enough to do it. It's not a corporate-supported, designed tool, and it's not in competition with other IDEs or editors for feature completeness. But the bigger it becomes, and the more rarely used features it gathers, the more work there is to support it. So it's better that rare use-cases be handled by more appropriate tools.

I understand the NULL value limitation and even then I think the user should be actively notified and also given a chance to load file up to the first NULL occurrence, with a red data-loss warning (and possibly a Save As... shortcut) included of course.

That's what the setting on the open dialog does, and save as will work.

Setting the NULL limitation aside, we are talking about values 1-255, which are valid UTF-8. So theoretically any nonzero uint8_t data can be represented in such a buffer or am I mistaken somewhere?

Indeed I think you misunderstand UTF-8, yes it uses values between 0 and 255, but not randomly. All ASCII code points are their own value, but any code point 128 or greater is encoded as a sequence of more than one byte with a value >= 128. The number of bytes increases as the code point value increases, up to four bytes, all with values >= 128. See here

So your files with single bytes >= 128 are not UTF-8.
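The UTF-8 rule described above can be checked with a few code points from this thread. This is a sketch using Python's built-in UTF-8 codec. The helper names are hypothetical.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode a single code point as UTF-8."""
    return chr(code_point).encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for cp in (0x61, 0xC0, 0x0A00, 0x61C0, 0x1F47F):
    encoded = utf8_bytes(cp)
    print(f"U+{cp:04X} -> {encoded.hex()} ({len(encoded)} byte(s))")

# A lone 0xC0 byte followed by ASCII, as in iso88591.txt.
print(is_valid_utf8(bytes([0xC0, 0x61, 0x0A])))
# The code point U+00C0 encoded properly as two bytes.
print(is_valid_utf8(bytes([0xC3, 0x80, 0x61, 0x0A])))
```

Code points below 128 take one byte. Everything else takes two to four bytes, each of them 0x80 or greater, so an isolated high byte can never be valid.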

Member

elextr commented Oct 3, 2021

I still think there's something wrong with encoding detection in Geany, setting the 8-bit values aside. I hoped OP examples would be convincing enough, but I'll provide more relevant examples when I encounter that again.

As I said, encoding "detection" is "search for an encoding that will convert the file to UTF-8". It just iterates through the list of known encodings trying each, and the first that works wins. It's not possible to get it "wrong", but it also won't detect that there may be multiple "right"s. The actual conversion uses the system conversion libraries, so if that gets it wrong please report it there.
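The "multiple rights" problem is easy to reproduce. The sketch below lists every candidate that accepts the bytes of utf16le.txt and shows that the winner depends only on list order. The candidate lists are hypothetical, and the sketch does not model Geany's separate rejection of NUL bytes.

```python
def matching_encodings(data: bytes, candidates: list) -> list:
    """Return every candidate encoding in which data is valid, in order."""
    matches = []
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        matches.append(encoding)
    return matches


DATA = bytes([0xC0, 0x61, 0x00, 0x0A])

# Two hypothetical orderings of the same candidate list.
ORDER_A = ["utf-8", "utf-16-le", "iso-8859-1", "iso-8859-5"]
ORDER_B = list(reversed(ORDER_A))

matches_a = matching_encodings(DATA, ORDER_A)
matches_b = matching_encodings(DATA, ORDER_B)
print("all valid:", matches_a)
print("first match with order A:", matches_a[0])
print("first match with order B:", matches_b[0])
```

Three of the four candidates accept the data, so a first-match search is deterministic for a given list but says nothing about which encoding the file's author intended.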

Author

drws commented Oct 4, 2021

Thank you for the described workaround, I'll try it on real files.

Scintilla editing widget claims to handle illegal bytes and show them as lozenge shapes with the hex in them

And that is exactly what is needed here and sometimes already happens with Geany's encoding detection. I only think there could be a manual way in Geany to specify "display what you can in this encoding, otherwise print illegal symbol".
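As an illustration of what "display what you can, otherwise print an illegal symbol" could mean, Python's codec error handlers do something comparable. This is only an analogy and not a Geany or Scintilla feature.

```python
RAW = bytes([0xC0, 0x61, 0x0A])

# Keep the value visible: invalid bytes become a \xNN escape in the text.
as_hex_escapes = RAW.decode("ascii", errors="backslashreplace")

# Drop the value: invalid bytes become U+FFFD REPLACEMENT CHARACTER.
as_replacement = RAW.decode("ascii", errors="replace")

print(ascii(as_hex_escapes))
print(ascii(as_replacement))
```

Both results are ordinary valid text, but neither is reversible in general: the first cannot be told apart from a file that literally contained the characters `\xc0`, and the second loses the value entirely.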

any code point 128 or greater is encoded as a sequence of more than one byte with a value >= 128

I did not know that more than one byte was required in every case above 127, thank you for outlining it. But a UTF-8 buffer should be able to store values 128-255 with some hackery. Which might not be needed at all given the Scintilla-level solution above.
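One existing example of such hackery is Python's `surrogateescape` error handler, which smuggles undecodable bytes through text as lone surrogates. It is shown here only as an analogy. Nothing in this thread suggests Geany does anything similar.

```python
RAW = bytes([0xC0, 0x61, 0x0A])

# Undecodable byte 0xC0 is mapped to the lone surrogate U+DCC0.
text = RAW.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# The same handler restores the original bytes on the way out.
round_tripped = text.encode("utf-8", errors="surrogateescape")
print(round_tripped == RAW)

# But the intermediate text is not encodable as strict UTF-8.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict UTF-8 accepts it:", strict_ok)
```

The round trip is lossless, yet the in-memory form is not valid UTF-8, so every consumer of the buffer would have to know about the convention.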

encoding "detection" is "search for an encoding that will convert the file to UTF-8"

Maybe that could be better emphasized in the program somehow. Also Without encoding is still a misleading nomenclature. Wouldn't Auto-search or First found, while still generic, be a more appropriate name?
