Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: NUL byte support #1709

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

WIP: NUL byte support #1709

wants to merge 7 commits into from

Conversation

b4n
Copy link
Member

@b4n b4n commented Dec 5, 2017

From my early tests it works fine loading and saving embedded NUL bytes, but I expect many features not to work properly (although e.g. regex search seems happy, yet displaying results not so much). It however should help getting to a point where these could be handled properly, and is handy also for slightly broken files and alike.
Anyway for now such files are loaded read-only and display a warning to the user to avoid most problems.

Related to (and should fix for the most part) #618 and #1708 (and possibly others).

Don't include the extra NUL in the length itself as it's actually not
part of the data.  Instead simply add it when necessary.
Like sci_test_text() but allows for embedded NUL bytes.
Such documents are still marked read-only because most editing features
are still likely not to work as expected with NULs.
This is less annoying than a modal dialog while better showing which
document is affected.
@elextr
Copy link
Member

elextr commented Dec 5, 2017

Havn't reviewed yet, but just on the point of actually doing this at all.

Ok, despite what g_utf8_validate () claims NUL is a UTF-8 code point. So I suppose there is an argument that it should be loaded. But due the number of C NTS functions used in Geany its pretty much going to be a lottery if it edits, so loading it read-only is a good idea (if display and parse and search and any other function that will access an readonly file is NUL safe). But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the g_convert()

Showing an infobar identifying that its readonly, and why, allows the user to understand why they can't edit it, and where the problem is an encoding error, to locate where that occurred so they might be able to fix it with a hex editor or similar seems useful.

But why saving it? Its readonly, so it won't change. If the problem is encoding, then LOAD it with the correct encoding, don't save a new encoding of an incorrectly decoded file. If the file has wrong encoding (or mixed encodings) then saving it again won't fix it.

As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE, and given the amount of changes to Geany it would take to safely allow editing I do not see the point of making any changes to support more than loading and displaying files with NULs until a real use-case is provided.

So unless a compelling use-case is presented I would say load it, and lock it read-only, and tell the user why to help them find the problem, but I do not support starting any changes to allow editing, or saving.

@codebrainz
Copy link
Member

But due the number of C NTS functions used in Geany

Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.

As for editing it, I have yet to see a real use-case for editing such files in a text editor/IDE

The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.

@b4n
Copy link
Member Author

b4n commented Dec 5, 2017

But thats only loading files as UTF-8, any other encoding is going to depend on the system iconv implementation underlying the g_convert()

Indeed, but I feel like those are somewhat likely to support whichever bytes, as it's basically the point in converting encodings -- but indeed I don't know. And on my system it seems like it does work fine for what I tested.

But why saving it? Its readonly, so it won't change.

All I did is fix it so it works, mostly because it was so easy -- and there was a very simple but weird bug in there not only easy to fix but also very confusing when looking at the code. So no, my idea was not to suggest we can work fully with files embedding NUL bytes and that then we need to be able to save them, but that why not fix this very important part for any future possibility in that area if it's so easy anyway.
And encoding issue when saving will be the very same with NULs as any other byte: if the converter is happy with them it'll work, otherwise it won't. The problem was that UTF-8 files were artificially truncated at the first NUL, but actually non-UTF-8 were fine embedding NULs – which actually is likely to be important because I would imagine some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway).

Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.

The obvious one that comes to mind is actually fixing the file by removing the NUL, which IIRC Scintilla helpfully displays using a special little icon/glyph.

Yes (like #1708), and yes (it displays it like any other control character, in a little box with the character name). And again, regex search actually works pretty OK with NULs, you can actually search for NUL bytes with them. Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).

Anyway, despite the "WIP" label, I don't really plan on working on fixing everything so it accepts NULs. I started it to see how hard it would be to at least simply load such files properly so one could look at them, also thinking that if that part is done it's more likely anyone would work on fixing some other code to handle those.
So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues. But if one day we end up with full support for those, I don't think it would be a bad thing.

@elextr
Copy link
Member

elextr commented Dec 5, 2017

@codebrainz

There are plenty of tools out there that can remove NULs if that is actually the correct thing to do to the file (which its probably not if its an incorrect encoding). So its not a good trade-off of effort vs return to add the capability to Geany for an occasional faulty file. Just because its possible to do, doesn't mean it should be done.

Have you actually checked how many functions used in Geany would be affected? I haven't, but I ask because it might be worth actually checking to see, maybe it's not so many more as one might assume.

I don't support adding editing so I won't do that. :)

@b4n

some encodings might generate NUL bytes in their encoded stream anyway (or at least could anyway)

Sure, for example an ASCII file saved in UTF-16 will be full of NUL bytes, and it would be a bug if we did not allow for that, but thats different to NUL in the UTF-8. (and saving a file with "foo" as UTF-16 indeed works fine despite every second byte being NUL :)

Anyway: what I fixed was simply that now we properly write the whole thing in any case, with no corner case doing otherwise.

Ok, a silent truncation on saving of a buffer with a NUL is a "bad thing" ™️ , for example if a buggy plugin made a NUL.

Non-regex search doesn't, however (could be interesting to fix, but for now it's out of the scope of this).

So long as there have been comprehensive tests that nothing that accesses the readonly buffer with NULs can (1) crash (2) hang (3) cause problems for other buffers (and that includes testing all plugins).

Note I didn't say anything had to work right, just not cause problems.

So @elextr I think the goal of this PR matches your hopes: loading files, but nothing else -- and warning users it's not properly supported and can cause issues.

And to be clear I support that if there is a proposal for how to do the comprehensive testing I mentioned above, and will review this when I get a chance.

But if one day we end up with full support for those, I don't think it would be a bad thing.

And as I said above, just because we can doesn't mean we should waste time on it when there are many other things to be done. And don't forget plugins.

@codebrainz
Copy link
Member

And as I said above, just because we can doesn't mean we should waste time on it

There are quite a few bugs/issues/questions about this, so it wouldn't really be a waste of time, relative to any other changes/fixes/etc.

@vkmaheshbhat
Copy link

I really need this feature and will be happy to help with testing.
I have come across multiple occasions when Geany won't load a file because of some NUL byte within. I then had to open it in Notepad++ (Win) or Vim (Linux) and clean the file of these bytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants