Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for arbitrary encoding 🚀 #188

Open
Alphare opened this issue May 13, 2020 · 14 comments
Open

Add support for arbitrary encoding 🚀 #188

Alphare opened this issue May 13, 2020 · 14 comments

Comments

@Alphare
Copy link

Alphare commented May 13, 2020

Hi,
I've quickly looked at the project and it looks awesome. Also thanks for supporting Mercurial. :)

The project uses String for handling hunks, which forces the data to be UTF8, or since 9d1aafe, display garbage.

Some projects historically did not start with UTF8, either because they predated the standard or for other reasons, which makes it impossible for them to use VCS or VCS-related software that does not support handling arbitrary bytes without completely losing their (often decades-long) history.

Have you considered it? Is it a wontfix?

Thanks

@dandavison
Copy link
Owner

Hi @Alphare thanks for getting in touch. Absolutely, let's fix this. I hasdn't thought about this previously but based on your message I'm thinking that we need to support multiple different encodings, in different files/commits within a repo. How does the following sound? cc @fgsch

  • We add a new command line option --encoding. For example, a user could pass --encoding=utf-8,another-encoding. (I.e. singular name "encoding", but supporting a comma-separated list; I am not sure if the command line app libraries in rust support passing an argument multiple times and assembling a list of values).
  • Delta will try each encoding in turn, until one succeeds.
  • If none succeed, then delta will do a lossy decoding, using the first encoding in the supplied list, producing output that will look like Hello �World (or just a lot of ).
  • The default value of --encoding will be utf-8.

Recently opened issues #187 and #150 led me to put in the quick fix 9d1aafe to avoid crashing in the face of invalid utf-8, but I was about to modify that in any case, so I think this work fits in. I also think we can do this without impacting performance on the common case (actually, it might improve it).

@fgsch
Copy link

fgsch commented May 13, 2020

I'd like to add that this is not necessarily historical. Utf-8 (or any encoding really) makes sense for text, but there are cases where you want to have arbitrary bytes.

Adding something like --encoding makes sense to me.
Would the retry be at line level or entire output?

@dandavison
Copy link
Owner

dandavison commented May 13, 2020

The retry would be at line level (delta only reads a limited number of lines into memory at a time -- basically one diff hunk).

I haven't looked into this properly yet: for entirely non-textual files, I was thinking that git (and mercurial?) would often output text like binary files a/mybinary b/mybinary differ, in which case there will be no issue (and there's a unit test that delta handles that output from git appropriately).

@dandavison
Copy link
Owner

@Alphare how does Mercurial handle this? Does it look at the first few bytes of a file to judge whether it appears to be text or not?

@Alphare
Copy link
Author

Alphare commented May 13, 2020

@Alphare how does Mercurial handle this? Does it look at the first few bytes of a file to judge whether it appears to be text or not?

The encoding strategy is documented here and will do a better job than I can: https://www.mercurial-scm.org/wiki/EncodingStrategy

If anywhere in the code Mercurial does need to assess if any hunk is binary, in LFS context there is an override, otherwise it looks if it's not empty and has a null byte in it.

I unfortunately don't have much time to think about how to solve the issue today, but here's my question: do we have to care about the encoding at all? This is a very candid question, as in "is there an encoding-agnostic way of doing so"?

For example: I've been working a lot on the Rust version of hg status and I've defined an HgPath struct to represent any tracked filename/path, and it's just arbitrary bytes until we have to interact with the filesystem, in which case we either get an error - because it's invalid for the FS/OS pair - or we get an OsStr/Path back.

@dandavison
Copy link
Owner

here's my question: do we have to care about the encoding at all?
I've been working a lot on the Rust version of hg status and I've defined an HgPath struct to represent any tracked filename/path, and it's just arbitrary bytes until we have to interact with the filesystem,

I think that we will need to care about the encoding. The operations that delta needs to do on its input lines include parsing, running string alignment algorithms over them to infer within-line edits, inserting ANSI escape sequences. So although bytelines allows us to iterate over lines of stdin yielding each line as a pointer to the relevant raw bytes of stdin, it's going to be impossible to implement the rest of Delta without being able to to construct heap-allocated std::string::Strings. For example, I don't think that ansi_term can be used with undecoded bytes (&[u8]). Rust's String wraps valid utf-8 bytes, therefore we must construct valid utf-8 bytes. Therefore, we must deal with the original encoding, because we need to know what it is in order to translate the incoming bytes to utf-8.

@Alphare
Copy link
Author

Alphare commented May 13, 2020

For example, I don't think that ansi_term can be used with undecoded bytes (&[u8]).

It looks like it does support bytes: https://github.com/ogham/rust-ansi-term#byte-strings.

I don't see how the tasks you've listed could not be done on arbitrary bytes, but maybe I don't understand the specific need for String vs Vec<u8> aside from the obviously better ergonomics.

@dandavison
Copy link
Owner

dandavison commented May 13, 2020

It looks like it does support bytes:

Ah, thanks!

Well, OK but apart from ansi_term :) ... what about syntect? It looks to me like syntect requires the input to be &str: https://github.com/trishume/syntect/blob/master/src/highlighting/highlighter.rs#L65

I'm new to Rust with this project, and not expert. Honestly, I think that I'd been thinking that, given the prominence of std::string::String in Rust documentation/code, it would be nuts to try to write the entire application eschewing it.

@dandavison
Copy link
Owner

dandavison commented May 19, 2020

@Alphare I've made some of the necessary changes to do this in #191 which provides lossy utf-8 decoding.

Some projects historically did not start with UTF8, either because they predated the standard or for other reasons,

Do you happen to have some examples of repos to hand which contain a mix of non-utf-8 encodings which I can test the fix for this issue against?

@dandavison
Copy link
Owner

@Alphare would you be able to point me to some example repos that have problematic encodings for me to test against?

@Alphare
Copy link
Author

Alphare commented Jun 2, 2020

@dandavison sorry for the delay, I should have answered you when I first got the message.

I do not personally have any repos that are anything other than UTF8, but I'll ask around.
In the meantime, it's very easy to create a repo that would need both encodings in a diff output:

$ hg init test-repo
$ cd test-repo
$ echo -n "Raphaël Gomès" > foo  # assuming UTF8 default
$ hg ci -Am "UTF8"
$ iconv -f UTF-8 -t WINDOWS-1252 foo > foo2
$ mv foo2 foo
$ hg ci -Am "CP1252"

Running HGPLAIN= hg export here will show the correct bytes in each "half" of the diff if your terminal encoding is set to UTF8 or CP1252, no bytes are lost by Mercurial. Even without changing encodings in a commit, simply using another encoding like JIS would be unusable if not for the diff algorithm being encoding-agnostic.

I hope that helps, thanks for pinging me. :)

@Alphare
Copy link
Author

Alphare commented Jun 2, 2020

Early revisions of the nginx repo have non-UTF8 encodings. hg up 2000 shows docs/xml/nginx/changes.xml using ISO-8859.

@dandavison
Copy link
Owner

@Alphare that's very useful, thanks a lot.

@htharoldht
Copy link

htharoldht commented Oct 11, 2024

Hello, is there any new progress?
I have a repository and cannot modify the file encoding. Some files are in gb2312 and some are in utf8. When I use git-delta to diff, the garbled characters are displayed. Although it works, it is really not beautiful.
I tried diffFilter = delta --color-only | iconv -f GB2312 -t UTF-8//TRANSLIT or diffFilter = iconv -f GB2312 -t UTF-8//TRANSLIT | delta --color-only, but it didn't work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants