WIP: NUL byte support #1709

b4n · 2017-12-05T08:59:12Z

From my early tests it works fine loading and saving embedded NUL bytes, but I expect many features not to work properly (although e.g. regex search seems happy, yet displaying results not so much). It however should help getting to a point where these could be handled properly, and is handy also for slightly broken files and alike.
Anyway for now such files are loaded read-only and display a warning to the user to avoid most problems.

Related to (and should fix for the most part) #618 and #1708 (and possibly others).

Don't include the extra NUL in the length itself as it's actually not part of the data. Instead simply add it when necessary.

Like sci_test_text() but allows for embedded NUL bytes.

Such documents are still marked read-only because most editing features are still likely not to work as expected with NULs.

This is less annoying than a modal dialog while better showing which document is affected.

elextr · 2017-12-05T12:12:36Z

Havn't reviewed yet, but just on the point of actually doing this at all.

Ok, despite what g_utf8_validate () claims NUL is a UTF-8 code point. So I suppose there is an argument that it should be loaded. But due the number of C NTS functions used in Geany its pretty much going to be a lottery if it edits, so loading it read-only is a good idea (if display and parse and search and any other function that will access an readonly file is NUL safe). But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the g_convert()

Showing an infobar identifying that its readonly, and why, allows the user to understand why they can't edit it, and where the problem is an encoding error, to locate where that occurred so they might be able to fix it with a hex editor or similar seems useful.

But why saving it? Its readonly, so it won't change. If the problem is encoding, then LOAD it with the correct encoding, don't save a new encoding of an incorrectly decoded file. If the file has wrong encoding (or mixed encodings) then saving it again won't fix it.

As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE, and given the amount of changes to Geany it would take to safely allow editing I do not see the point of making any changes to support more than loading and displaying files with NULs until a real use-case is provided.

So unless a compelling use-case is presented I would say load it, and lock it read-only, and tell the user why to help them find the problem, but I do not support starting any changes to allow editing, or saving.

codebrainz · 2017-12-05T15:37:29Z

But due the number of C NTS functions used in Geany

Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.

As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE

The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.

b4n · 2017-12-05T19:44:50Z

But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the g_convert()

Indeed, but I feel like those are somewhat likely to support whichever bytes, as it's basically the point in converting encodings -- but indeed I don't know. And on my system it seems like it does work fine for what I tested.

But why saving it? Its readonly, so it won't change.

All I did is fix it so it works, mostly because it was so easy -- and there was a very simple but weird bug in there not only easy to fix but also very confusing when looking at the code. So no, my idea was not to suggest we can work fully with files embedding NUL bytes and that then we need to be able to save them, but that why not fix this very important part for any future possibility in that area if it's so easy anyway.
And encoding issue when saving will be the very same with NULs as any other byte: if the converter is happy with them it'll work, otherwise it won't. The problem was that UTF-8 files were artificially truncated at the first NUL, but actually non-UTF-8 were fine embedding NULs – which actually is likely to be important because I would imagine some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway).

Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.

The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.

Yes (like #1708), and yes (it displays it like any other control character, in a little box with the character name). And again, regex search actually works pretty OK with NULs, you can actually search for NUL bytes with them. Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).

Anyway, despite the "WIP" label, I don't really plan on working on fixing everything so it accepts NULs. I started it to see how hard it would be to at least simply load such files properly so one could look at them, also thinking that if that part is done it's more likely anyone would work on fixing some other code to handle those.
So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues. But if one day we end up with full support for those, I don't think it would be a bad thing.

elextr · 2017-12-05T23:09:53Z

@codebrainz

There are plenty of tools out there that can remove NULs if that is actually the correct thing to do to the file (which its probably not if its an incorrect encoding). So its not a good trade-off of effort vs return to add the capability to Geany for an occasional faulty file. Just because its possible to do, doesn't mean it should be done.

Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.

I don't support adding editing so I won't do that. :)

@b4n

some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway)

Sure, for example an ASCII file saved in UTF-16 will be full of NUL bytes, and it would be a bug if we did not allow for that, but thats different to NUL in the UTF-8. (and saving a file with "foo" as UTF-16 indeed works fine despite every second byte being NUL :)

Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.

Ok, a silent truncation on saving of a buffer with a NUL is a "bad thing" ™️ , for example if a buggy plugin made a NUL.

Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).

So long as there have been comprehensive tests that nothing that accesses the readonly buffer with NULs can (1) crash (2) hang (3) cause problems for other buffers (and that includes testing all plugins).

Note I didn't say anything had to work right, just not cause problems.

So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues.

And to be clear I support that if there is a proposal for how to do the comprehensive testing I mentioned above, and will review this when I get a chance.

But if one day we end up with full support for those, I don't think it would be a bad thing.

And as I said above, just because we can doesn't mean we should waste time on it when there are many other things to be done. And don't forget plugins.

codebrainz · 2017-12-06T01:41:47Z

And as I said above, just because we can doesn't mean we should waste time on it

There are quite a few bugs/issues/questions about this, so it wouldn't really be a waste of time, relative to any other changes/fixes/etc.

vkmaheshbhat · 2019-02-06T09:03:20Z

I really need this feature and will be happy to help with testing.
I have come across multiple occasions when Geany won't load a file because of some NUL byte within. I then had to open it in Notepad++ (Win) or Vim (Linux) and clean the file of these bytes.

b4n added 7 commits December 4, 2017 21:03

Avoid weird length value

a1ba91b

Don't include the extra NUL in the length itself as it's actually not part of the data. Instead simply add it when necessary.

Add sci_set_text_with_length()

a70456c

Like sci_test_text() but allows for embedded NUL bytes.

Don't cut the loaded data at the first NUL when passing it to Scintilla

c4606a1

Don't cut UTF-8 documents at the first NUL byte when saving them

168476b

encodings: Accept NULs when validating UTF-8

d2b1536

Properly load files with embedded NULs

d957d38

Such documents are still marked read-only because most editing features are still likely not to work as expected with NULs.

Show an infobar for files with embedded NULs

307335a

This is less annoying than a modal dialog while better showing which document is affected.

b4n added enhancement work in progress labels Dec 5, 2017

b4n requested review from codebrainz and elextr December 5, 2017 08:59

codebrainz mentioned this pull request May 6, 2018

Feature Request: opening binary files #1508

Closed

codebrainz mentioned this pull request Feb 6, 2019

Open files even if the encoding cannot be determined #2081

Closed

codebrainz mentioned this pull request Nov 10, 2020

Geany fails to open a file (silently) with trailing zero chars #2654

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: NUL byte support #1709

WIP: NUL byte support #1709

b4n commented Dec 5, 2017

elextr commented Dec 5, 2017

codebrainz commented Dec 5, 2017

b4n commented Dec 5, 2017

elextr commented Dec 5, 2017

codebrainz commented Dec 6, 2017

vkmaheshbhat commented Feb 6, 2019

WIP: NUL byte support #1709

Are you sure you want to change the base?

WIP: NUL byte support #1709

Conversation

b4n commented Dec 5, 2017

elextr commented Dec 5, 2017

codebrainz commented Dec 5, 2017

b4n commented Dec 5, 2017

elextr commented Dec 5, 2017

codebrainz commented Dec 6, 2017

vkmaheshbhat commented Feb 6, 2019