Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Pre-commit hook misbehaves for UTF-8 files #196

Closed
jakub-g opened this Issue · 6 comments

3 participants

@jakub-g
Collaborator

When the file to be committed is a "real" UTF-8 file (with or without BOM), i.e. it contains e.g. Russian cyryllic characters, then the CRLF replacement in the hook doubles the number of lines, by introducing empty lines.

This seems to be a bug in dos2unix bundled with Git.

dos2unix -D "$FILE" is meant to convert UNIX -> DOS. It works fine also if the file in already DOS (CRLF) but only for ASCII file. However, if the file is already DOS && the file is non-ASCII, then it converts CRLF to CRCRLF thus the output looks like having empty lines introduced every second line.

Using dos2unix -A (A stands for auto) doesn't introduce this problem. However dos2unix -A converts CRLF to LF or LF to CRLF, which is not what we want to achieve. We want to end up with CRLF always, whatever the input is.

The alternative would be to simply use sed:
sed -e 's/$/\r/' input.txt > output.txt
The above converts LF to CRLF, CRLF to CRLF and works also fine if the input is mixed (LF or CRLF -> CRLF).

We should however make sure that binary files (i.e. images) will not be corrupted.

From what I've seen by inspecting dos2unix -tv output, being binary or not is decided by having non-ASCII chars, which is bad if we want to have truly Unicode files (because they'll be classified as binary). Otherwise we'll have to stick to \uXXXX in our files.

TL;DR:

  1. The conversion method in the hook should convert LF -> CRLF, CRLF->CRLF, mixed -> CRLF (this is one of the cons of the agreement to have core.autocrlf=true)
  2. The conversion method should work for UTF-8 files.
  3. The hook should leave binary files untouched. It should be stated which files are binary and which not. Probably a whitelist of non-binary files will be created (txt, tpl, css, js...) on which to perform conversion.

Anyway, we should make sure that binary files (PNGs, GIFs etc.) are unaffected.

Probably the best option will be to use .gitattributes file to solve all this problems - to be investigated.

@jakub-g
Collaborator

If sed is to be used, it should be used with -i to do in-place change (instead of creating a new file).

@jakub-g
Collaborator

Yep I've seen that and that's why I've suggested using .gitattributes (but didn't have time back then to dig through it).

The article linked inside is a pretty good explanation of under-the-hood stuff:
http://timclem.wordpress.com/2012/03/01/mind-the-end-of-your-line/

Having read that, I think it's best to remove dos2unix at all, and add the following .gitattributes file in our repo, and it should be totally equivalent (and I hope will solve UTF8 problems with dos2unix):

# 'text' means: convert to CRLF when checking out, convert to LF when committing
# (equivalent to core.autocrlf=true)

*.js text
*.css text
*.txt text
*.tpl text
*.tml text
*.cml text
*.json text
*.htm text
*.html text

*.md text
*.yml text
*.yaml text
*.gitignore text

*.png binary
*.gif binary
*.swf binary
*.jpg binary

To be tested.

@jakub-g
Collaborator

Well, after some thought, in fact, it probably won't be equivalent entirely, but hopefully it will not matter.

AFAIU, let's assume that you have the settings as specified above (and removed dos2unix from pre-commit hook), and you've modified some file in a way that is has both LF and CRLF (mixed). Then you commit, it will store LF in the Git database, but mixed line endings will remain in your local file.

Then if someone else changes that file, and you update it, I'm not sure if the file after update will remain mixed, or will be CRLF-only. If the latter holds, then it's brilliant. If the former, I don't know honestly. Probably it shouldn't matter either. To be tested.

@jakub-g jakub-g was assigned
@jakub-g
Collaborator

I've played with the things related to .gitattributes and pre-commit hook.

sed -i 's/$/\r/' instead of dos2unix seems to do the trick.

I'll test things locally for a couple of days and if everything works fine, I'll push to master.

@jakub-g jakub-g referenced this issue from a commit in jakub-g/ariatemplates
@jakub-g jakub-g Added .gitattributes, improve pre-commit hook. Fixes #196. 274a777
@jakub-g jakub-g referenced this issue from a commit in jakub-g/ariatemplates
@jakub-g jakub-g fix #196 pre-commit hook misbehaving for UTF-8 files 5b6dc53
@jakub-g
Collaborator

The problem with .gitattributes is that it can make Git think that some changes happened in the repo, just after introducing this file and not changing other files. In other words, it will work seamlessly when you clone a repo from scratch, but can result in weird problem when working on existing repo.

Tried the hints provided here: https://help.github.com/articles/dealing-with-line-endings but with no luck -- the edge case is switching between your local branches that were touched before introducing this file.

So in the commit 5b6dc53 I only replace dos2unix with sed. We can introduce .gitattributes but then the best thing to do to avoid weird conflicts will be to clone the repo from scratch...

@piuccio piuccio closed this in ed322f3
@jakub-g jakub-g referenced this issue from a commit in jakub-g/ariatemplates
@jakub-g jakub-g fix #196 pre-commit hook improvements for UTF-8 / binary files
The previous commit ed322f3 fixed a problem for UTF-8 files but introduced
new issues -- for the files having no trailing newline and for binary
files. This commit addresses all those problems.

Close #249.
8e8ecbb
@piuccio piuccio referenced this issue from a commit
@jakub-g jakub-g fix #196 pre-commit hook improvements for UTF-8 / binary files
The previous commit ed322f3 fixed a problem for UTF-8 files but introduced
new issues -- for the files having no trailing newline and for binary
files. This commit addresses all those problems.

Close #249.
34c3bd8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.