Geany encoding determination broken? #2910
Comments
You should include error messages so we can see where the failures happen, but some are obvious:
The first and third open "properly" by the Geany definition of "properly", and the second is refused for the reason given.
I tried Gedit and it complains about the encoding of all three cases too.
The examples I provided were minimal examples from an actual raw ASCII dump where only a few characters here and there were corrupted due to transient events on the RS-232 line during a legacy device bootup. While it is debatable whether such data calls for a binary editor, the fact remains that Geany is perfectly able to display similar files and fails only at some due to buggy encoding detection. If some, but not all, values are removed (in a kilobyte-grade log, not the minimal examples here), the file is sometimes opened properly, meaning 7-bit ASCII characters are displayed and other values are displayed with hex symbols. Other times it's a UTF-16 jungle because of a single 8-bit value occurrence or similar. So the encoding detection logic works sporadically even with longer files. Why wouldn't the user be able to manually set the encoding if autodetection is not bulletproof and the displayed encoding cannot be changed without a reload? Geany is perfectly able to handle some hybrid ASCII-binary files, and its encoding detection logic is the only thing preventing it from handling the remaining ones. Also, I tried to set something other than Without encoding in
As I explained, there is no such thing: ASCII only defines what values less than 128 mean; it does not define what values with the most significant bit set mean. So no, it's also not a suitable name for "no specified encoding", since the top 128 values are undefined. The ISO-8859 series encodings are examples of what the top 128 values can mean, and there are 16 variants. Actually "no specified encoding" might be a good label for the setting, since it alludes to falling back on searching for one.
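To illustrate the point, a quick sketch (mine, not from the thread): the very same high byte means a different character depending on which ISO-8859 variant you assume, which is exactly why "ASCII" can't name it.

```python
# The byte 0xC0 is undefined in ASCII; each ISO-8859 variant assigns
# it a different character.
raw = b"\xC0"

for enc in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    print(enc, "->", raw.decode(enc))
```

Under ISO-8859-1 it is "À"; under the Cyrillic and Greek variants it is something else entirely.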
What you are actually seeing is "sporadically a longer file with random errors happens to be a valid file in some encoding", so that encoding is found. This is also why "Geany is perfectly able to handle some hybrid ASCII-binary files": the file happened to be valid in an encoding. The encoding detection is not buggy, the files are 😁 So the solution would be to fix the file, possibly with a script that replaces non-ASCII values with a selected ASCII value, or masks off the MSB, or replaces the non-ASCII with a valid and very visible UTF-8 character, possibly an emoji like 👿. That's something only you as the programmer can decide and do.
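A minimal sketch of the kind of cleanup script suggested above (the function name and `<XX>` convention are my own choices, not Geany's): keep 7-bit ASCII bytes and render everything else as a visible hex tag, so the log becomes plain ASCII and opens anywhere.

```python
def sanitize(data: bytes) -> bytes:
    """Replace every byte >= 0x80 with a visible <XX> hex tag."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)           # plain ASCII passes through
        else:
            out += b"<%02X>" % b    # e.g. 0xC0 becomes b"<C0>"
    return bytes(out)

print(sanitize(b"\xC0a\n"))  # b'<C0>a\n'
```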
That's again a manifestation of finding an encoding in which the file is valid and converting that to UTF-8 in the buffer. The display as hex values is something the font management does, not Geany: either a synthetic glyph is generated when no font has one for that value, or a font in your stack has that glyph. Sometimes missing glyphs are shown as squares, not hex. (To be precise it's done by HarfBuzz, used by Pango, which is part of GTK, which is used by Scintilla, which is used by Geany, so it's well-buried behaviour controlled by "many" things.)
The current behaviour of searching for a valid encoding has evolved to handle a common use-case where files are in mixed encodings (is/was common on Windows in non-English speaking locales IIUC and many Geany contributors are in such locales). It would be possible to add a "use this encoding only" option (but somebody has to do it), but the result if the file was not valid in that encoding would have to be a refusal to load since Geany would have no idea what UTF-8 the file contents were meant to be converted to if the file was not valid in that encoding.
Again, if you selected something other than a valid encoding for that file, the behaviour is to fall back to searching for a working encoding. All the encoding settings are "try this first, then search" rather than "just try this or fail". That's probably where the "None" for the default encoding comes from, meaning "don't try anything first, just search". Just to finally re-emphasise: the Geany buffer has to be valid UTF-8 with no embedded NULLs; all the editing and other functions assume it and depend on it, so loading invalid UTF-8 sequences "without encoding" is not possible. The input must be valid in UTF-8 or in an encoding that can be converted to UTF-8.
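The "try this first, then search" semantics could be sketched like this (the candidate list and function are illustrative assumptions, not Geany's actual code):

```python
# Fallback search: try the preferred encoding first, then a fixed
# candidate list; the first encoding that decodes cleanly wins.
CANDIDATES = ["utf-8", "utf-16", "iso-8859-1", "cp1252"]

def load_text(data, preferred=None):
    order = ([preferred] if preferred else []) + CANDIDATES
    for enc in order:
        try:
            return data.decode(enc), enc   # first one that works wins
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("no known encoding converts this file to UTF-8")

text, enc = load_text(b"\xC0a\n", preferred="utf-8")
print(enc)  # the preferred UTF-8 fails on 0xC0, so the search falls through
```

Note there is no "just try this or fail" path here; that is the missing option being discussed.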
Of course there is backward compatibility in many 8-bit encodings. An ASCII-encoded file can be opened and treated as ISO-8859-1 for example (well, maybe not so easily in Geany...) since the former is a subset of the latter.
Maybe in the Also, doesn't the
Autodetection is particularly sensitive to the first few bytes, but not in every case. I still think there's something wrong with encoding detection in Geany, setting the 8-bit values aside. I hoped the OP examples would be convincing enough, but I'll provide more relevant examples when I encounter that again.
Oh, come on. As if Geany were a sticky-notes application, not a versatile developer's tool. I agreed in the very beginning that a hex editor was the correct generic answer, but the question remains whether Geany is really going to refuse to open hybrid files that it potentially could. Although it is apparent now that some additional hackery would be needed for that.
Of course, and I'm doing that while also outputting an ordinary log file in the very same program. But this is a case of a debugging log with raw 8-bit values and 99.9% of the content being ASCII, and I was interested in the actual values above 128.
I understand the NULL value limitation, and even then I think the user should be actively notified and also given a chance to load the file up to the first NULL occurrence, with a red data-loss warning (and possibly a
Setting the NULL limitation aside, we are talking about values 1-255, which are valid UTF-8. So theoretically any nonzero uint8_t data can be represented in such a buffer, or am I mistaken somewhere?
All the discussion above has been about the Preferences->Files setting, which sets a default. That made me forget (well, it's hidden normally and I don't edit anything but UTF-8 so I don't use it ... ever :-) about the setting on the open dialog, which _is_ enforced. If it's set there, all your example files open. But as I said above, beware that it's unknown what functionality will work with invalid UTF-8 in the buffer. So if Geany or a plugin doesn't work or crashes, you have been warned (but report Geany crashes, we do try to fix those). The Scintilla editing widget claims to handle illegal bytes and show them as lozenge shapes with the hex in them, so actions executed directly by Scintilla will likely work fine, but you have no real way of telling which actions those are, mostly simple editing.
As I said, Geany is a volunteer project; people do what they need or want to do, and nobody needed or wanted this enough to do it. It's not a corporate-supported, designed tool, and it's not in competition with other IDEs or editors for feature completeness. But the bigger it becomes, and the more rarely used features it gathers, the more work there is to support it. So it's better that rare use-cases be handled by more appropriate tools.
That's what the setting on the open dialog does, and Save As will work.
Indeed I think you misunderstand UTF-8. Yes, it uses values between 0 and 255, but not randomly. All ASCII code points are their own value, but any code point 128 or greater is encoded as a sequence of more than one byte, each with a value >= 128. The number of bytes increases as the code point value increases, up to four bytes, all with values >= 128. See here. So your files with single bytes >= 128 are not UTF-8.
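This is easy to check directly; a minimal demonstration of the point:

```python
# A code point above 127 always becomes a multi-byte sequence in UTF-8:
print("\u00C0".encode("utf-8"))   # "À" encodes as two bytes, b'\xc3\x80'

# ...so a lone 0xC0 followed by an ASCII byte cannot be valid UTF-8:
try:
    b"\xC0a".decode("utf-8")
except UnicodeDecodeError as err:
    print("not UTF-8:", err.reason)
```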
As I said, encoding "detection" is "search for an encoding that will convert the file to UTF-8"; it just iterates through the list of known encodings trying each, and the first that works wins. It's not possible to get it "wrong", but it also won't detect that there may be multiple "right"s. The actual conversion uses the system conversion libraries, so if that gets it wrong please report it there.
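The "multiple rights" situation is easy to see with the minimal example bytes from this issue (candidate list chosen by me for illustration):

```python
# Several encodings all accept the same corrupted bytes; a first-match
# search has no way of knowing which one the author meant.
data = b"\xC0a\n"
accepting = []
for enc in ("utf-8", "utf-16-le", "iso-8859-1", "iso-8859-5", "cp1252"):
    try:
        data.decode(enc)
        accepting.append(enc)
    except UnicodeDecodeError:
        pass
print(accepting)  # more than one 8-bit encoding "works"
```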
Thank you for the described workaround, I'll try it on real files.
And that is exactly what is needed here, and sometimes it already happens with Geany's encoding detection. I only think there could be a manual way in Geany to specify "display what you can in this encoding, otherwise print an 'illegal' symbol".
I did not know that more than one byte was required in every case above 127, thank you for outlining it. But a UTF-8 buffer should be able to store values 128-255 with some hackery. Which might not be needed at all given the Scintilla-level solution above.
Maybe that could be better emphasized in the program somehow. Also, Without encoding is still misleading nomenclature. Wouldn't Auto-search or First found, while still generic, be a more appropriate name?
Geany version: 1.37.1
I've enabled Files > Encodings > Use fixed encoding when opening non-Unicode files with the default non-Unicode encoding set to None.
I'm attaching three examples that demonstrate the issue(s) involved. They contain 8-bit (extended ASCII) values and I am unsuccessfully trying to open these files in Geany.
The first file iso88591.txt opens as ISO-8859-1 (even though the default is None for non-Unicode!). The example file contains:
C0 61 0A
The file breaks.txt, well, breaks (cannot be opened). It might be detected with a 16-bit encoding which then breaks because of odd number of bytes. Its contents are:
C0 61 00
And finally, the file utf16le.txt opens as UTF-16LE. It contains:
C0 61 00 0A
An expected outcome in the first two cases would be that the file opened with no encoding (due to the default non-Unicode setting). In the third case, Unicode detection is expected, but there is another issue: when the encoding is set in Geany, the displayed output does not change for me (as it does in the case of end-of-line conversion, for example). Is Geany really unable to properly open any of the above files?
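The three byte sequences above can be checked against a few common codecs. This sketch uses Python's codecs as a stand-in for the system conversion libraries (the candidate list and order are my assumption, not Geany's actual list), and its output lines up with the reported behaviour: the first file is valid ISO-8859-1, the third is valid UTF-16LE, and the second is only "valid" in encodings that yield an embedded NULL, which Geany's buffer refuses.

```python
files = {
    "iso88591.txt": b"\xC0\x61\x0A",
    "breaks.txt":   b"\xC0\x61\x00",
    "utf16le.txt":  b"\xC0\x61\x00\x0A",
}

# Try a short candidate list in order and report the first hit,
# mimicking (not reproducing) the search for a usable encoding.
for name, data in files.items():
    for enc in ("utf-8", "utf-16-le", "iso-8859-1"):
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        nul = " (contains NULL)" if "\x00" in text else ""
        print(f"{name}: decodes as {enc}{nul}")
        break
```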