Add support for arbitrary encoding 🚀 #188

Alphare · 2020-05-13T08:06:22Z

Hi,
I've quickly looked at the project and it looks awesome. Also thanks for supporting Mercurial. :)

The project uses String for handling hunks, which forces the data to be UTF8, or since 9d1aafe, display garbage.

Some projects historically did not start with UTF8, either because they predated the standard or for other reasons, which makes it impossible for them to use VCS or VCS-related software that does not support handling arbitrary bytes without completely losing their (often decades-long) history.

Have you considered it? Is it a wontfix?

Thanks

The text was updated successfully, but these errors were encountered:

dandavison · 2020-05-13T12:46:31Z

Hi @Alphare thanks for getting in touch. Absolutely, let's fix this. I hasdn't thought about this previously but based on your message I'm thinking that we need to support multiple different encodings, in different files/commits within a repo. How does the following sound? cc @fgsch

We add a new command line option --encoding. For example, a user could pass --encoding=utf-8,another-encoding. (I.e. singular name "encoding", but supporting a comma-separated list; I am not sure if the command line app libraries in rust support passing an argument multiple times and assembling a list of values).
Delta will try each encoding in turn, until one succeeds.
If none succeed, then delta will do a lossy decoding, using the first encoding in the supplied list, producing output that will look like Hello �World (or just a lot of �).
The default value of --encoding will be utf-8.

Recently opened issues #187 and #150 led me to put in the quick fix 9d1aafe to avoid crashing in the face of invalid utf-8, but I was about to modify that in any case, so I think this work fits in. I also think we can do this without impacting performance on the common case (actually, it might improve it).

fgsch · 2020-05-13T13:06:35Z

I'd like to add that this is not necessarily historical. Utf-8 (or any encoding really) makes sense for text, but there are cases where you want to have arbitrary bytes.

Adding something like --encoding makes sense to me.
Would the retry be at line level or entire output?

dandavison · 2020-05-13T13:25:08Z

The retry would be at line level (delta only reads a limited number of lines into memory at a time -- basically one diff hunk).

I haven't looked into this properly yet: for entirely non-textual files, I was thinking that git (and mercurial?) would often output text like binary files a/mybinary b/mybinary differ, in which case there will be no issue (and there's a unit test that delta handles that output from git appropriately).

dandavison · 2020-05-13T13:26:43Z

@Alphare how does Mercurial handle this? Does it look at the first few bytes of a file to judge whether it appears to be text or not?

Alphare · 2020-05-13T14:40:43Z

@Alphare how does Mercurial handle this? Does it look at the first few bytes of a file to judge whether it appears to be text or not?

The encoding strategy is documented here and will do a better job than I can: https://www.mercurial-scm.org/wiki/EncodingStrategy

If anywhere in the code Mercurial does need to assess if any hunk is binary, in LFS context there is an override, otherwise it looks if it's not empty and has a null byte in it.

I unfortunately don't have much time to think about how to solve the issue today, but here's my question: do we have to care about the encoding at all? This is a very candid question, as in "is there an encoding-agnostic way of doing so"?

For example: I've been working a lot on the Rust version of hg status and I've defined an HgPath struct to represent any tracked filename/path, and it's just arbitrary bytes until we have to interact with the filesystem, in which case we either get an error - because it's invalid for the FS/OS pair - or we get an OsStr/Path back.

dandavison · 2020-05-13T16:01:23Z

here's my question: do we have to care about the encoding at all?
I've been working a lot on the Rust version of hg status and I've defined an HgPath struct to represent any tracked filename/path, and it's just arbitrary bytes until we have to interact with the filesystem,

I think that we will need to care about the encoding. The operations that delta needs to do on its input lines include parsing, running string alignment algorithms over them to infer within-line edits, inserting ANSI escape sequences. So although bytelines allows us to iterate over lines of stdin yielding each line as a pointer to the relevant raw bytes of stdin, it's going to be impossible to implement the rest of Delta without being able to to construct heap-allocated std::string::Strings. For example, I don't think that ansi_term can be used with undecoded bytes (&[u8]). Rust's String wraps valid utf-8 bytes, therefore we must construct valid utf-8 bytes. Therefore, we must deal with the original encoding, because we need to know what it is in order to translate the incoming bytes to utf-8.

Alphare · 2020-05-13T16:43:13Z

For example, I don't think that ansi_term can be used with undecoded bytes (&[u8]).

It looks like it does support bytes: https://github.com/ogham/rust-ansi-term#byte-strings.

I don't see how the tasks you've listed could not be done on arbitrary bytes, but maybe I don't understand the specific need for String vs Vec<u8> aside from the obviously better ergonomics.

dandavison · 2020-05-13T17:15:44Z

It looks like it does support bytes:

Ah, thanks!

Well, OK but apart from ansi_term :) ... what about syntect? It looks to me like syntect requires the input to be &str: https://github.com/trishume/syntect/blob/master/src/highlighting/highlighter.rs#L65

I'm new to Rust with this project, and not expert. Honestly, I think that I'd been thinking that, given the prominence of std::string::String in Rust documentation/code, it would be nuts to try to write the entire application eschewing it.

dandavison · 2020-05-19T00:13:30Z

@Alphare I've made some of the necessary changes to do this in #191 which provides lossy utf-8 decoding.

Some projects historically did not start with UTF8, either because they predated the standard or for other reasons,

Do you happen to have some examples of repos to hand which contain a mix of non-utf-8 encodings which I can test the fix for this issue against?

dandavison · 2020-05-29T14:48:54Z

@Alphare would you be able to point me to some example repos that have problematic encodings for me to test against?

Alphare · 2020-06-02T09:37:23Z

@dandavison sorry for the delay, I should have answered you when I first got the message.

I do not personally have any repos that are anything other than UTF8, but I'll ask around.
In the meantime, it's very easy to create a repo that would need both encodings in a diff output:

$ hg init test-repo
$ cd test-repo
$ echo -n "Raphaël Gomès" > foo  # assuming UTF8 default
$ hg ci -Am "UTF8"
$ iconv -f UTF-8 -t WINDOWS-1252 foo > foo2
$ mv foo2 foo
$ hg ci -Am "CP1252"

Running HGPLAIN= hg export here will show the correct bytes in each "half" of the diff if your terminal encoding is set to UTF8 or CP1252, no bytes are lost by Mercurial. Even without changing encodings in a commit, simply using another encoding like JIS would be unusable if not for the diff algorithm being encoding-agnostic.

I hope that helps, thanks for pinging me. :)

Alphare · 2020-06-02T09:42:21Z

Early revisions of the nginx repo have non-UTF8 encodings. hg up 2000 shows docs/xml/nginx/changes.xml using ISO-8859.

dandavison · 2020-06-02T13:01:38Z

@Alphare that's very useful, thanks a lot.

htharoldht · 2024-10-11T06:43:04Z

Hello, is there any new progress?
I have a repository and cannot modify the file encoding. Some files are in gb2312 and some are in utf8. When I use git-delta to diff, the garbled characters are displayed. Although it works, it is really not beautiful.
I tried diffFilter = delta --color-only | iconv -f GB2312 -t UTF-8//TRANSLIT or diffFilter = iconv -f GB2312 -t UTF-8//TRANSLIT | delta --color-only, but it didn't work

dandavison mentioned this issue Jun 4, 2020

🐛 error: "stream did not contain valid UTF-8" #218

Closed

dandavison mentioned this issue Mar 3, 2022

avoid the display issue with git add -p without losing CR #1000

Open

dandavison mentioned this issue Jan 29, 2024

🐛 delta causes fancy unicode symbols such as Nerd Font Icons to be displayed as escaped in angle brackets #1616

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for arbitrary encoding 🚀 #188

Add support for arbitrary encoding 🚀 #188

Alphare commented May 13, 2020 •

edited

Loading

dandavison commented May 13, 2020

fgsch commented May 13, 2020

dandavison commented May 13, 2020 •

edited

Loading

dandavison commented May 13, 2020

Alphare commented May 13, 2020 •

edited

Loading

dandavison commented May 13, 2020

Alphare commented May 13, 2020

dandavison commented May 13, 2020 •

edited

Loading

dandavison commented May 19, 2020 •

edited

Loading

dandavison commented May 29, 2020

Alphare commented Jun 2, 2020

Alphare commented Jun 2, 2020

dandavison commented Jun 2, 2020

htharoldht commented Oct 11, 2024 •

edited

Loading

Add support for arbitrary encoding 🚀 #188

Add support for arbitrary encoding 🚀 #188

Comments

Alphare commented May 13, 2020 • edited Loading

dandavison commented May 13, 2020

fgsch commented May 13, 2020

dandavison commented May 13, 2020 • edited Loading

dandavison commented May 13, 2020

Alphare commented May 13, 2020 • edited Loading

dandavison commented May 13, 2020

Alphare commented May 13, 2020

dandavison commented May 13, 2020 • edited Loading

dandavison commented May 19, 2020 • edited Loading

dandavison commented May 29, 2020

Alphare commented Jun 2, 2020

Alphare commented Jun 2, 2020

dandavison commented Jun 2, 2020

htharoldht commented Oct 11, 2024 • edited Loading

Alphare commented May 13, 2020 •

edited

Loading

dandavison commented May 13, 2020 •

edited

Loading

Alphare commented May 13, 2020 •

edited

Loading

dandavison commented May 13, 2020 •

edited

Loading

dandavison commented May 19, 2020 •

edited

Loading

htharoldht commented Oct 11, 2024 •

edited

Loading