Skip to content

Conversation

@jayrm
Copy link
Member

@jayrm jayrm commented Feb 10, 2019

This change is intended to normalize line endings for fbc source files.

Some source files were stored to the repository with LF line endings, apparently permanently, likely due to the patch, or perhaps the way the patch was applied at the time, last summer. For these specific files, they will not auto convert to CRLF line endings on windows even when autocrlf=true. This doesn't affect source code; if the IDE/editor correctly reads files with LF endings then no issue is noticed.

@countingpine
Copy link
Collaborator

Good catch.
I've checked out this patch on Linux though, and now these files just use CRLFs, while the other files use LFs. Not entirely sure how to fix this on Windows, but I can make a commit from Linux if need be.

@rversteegen
Copy link
Member

Yes, there are only a few lines in each file which are CRLF, but instead of fixing them, this patch changes every other line, which is going to be break plain git blame (you would have use git blame -w instead).

There are more files with mixed line endings or pure CRLF line endings:

file -N $(find . -name '*.bi' -o -name "*.bas")|grep CRLF
./src/compiler/ast-node-misc.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/ast.bi: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/edbg_stab.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/emit.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/emit.bi: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/emit_x86.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/emitdbg.bi: ASCII text, with CRLF, LF line terminators
./src/compiler/ir-llvm.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/ir-tac.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/ir.bi: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/parser-inlineasm.bas: C source, ASCII text, with CRLF, LF line terminators
./src/compiler/parser-toplevel.bas: C source, ASCII text, with CRLF, LF line terminators
./tests/wstring/inctest.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./tests/wstring/utf16be.bas: Big-endian UTF-16 Unicode text, with CRLF line terminators
./tests/wstring/utf16le.bas: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./examples/unicode/hello_UTF16BE.bas: C source, Big-endian UTF-16 Unicode text, with CRLF line terminators
./examples/unicode/hello_UTF16LE.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./examples/unicode/hello_chinese.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./examples/unicode/hello_greek.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./examples/unicode/hello_japanese.bas: C source, Little-endian UTF-16 Unicode text, with CRLF line terminators
./examples/unicode/hello_korean.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./examples/unicode/hello_russian.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./tests.clean/wstring/inctest.bas: C source, Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
./tests.clean/wstring/utf16be.bas: Big-endian UTF-16 Unicode text, with CRLF line terminators
./tests.clean/wstring/utf16le.bas: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

None of the C or asm files have mixed line endings.

…& LF line endings (as seen on Linux), likely due to the patch that was applied at the time. And will not auto convert to CRLF even when autocrlf=true
@jayrm
Copy link
Member Author

jayrm commented Feb 15, 2019

@countingpine, thanks. I've made the changes and commit from Linux, and the pull request makes more sense. Only a few lines are changed now. (I force pushed to this same pull request, as there was no real need to keep the initial version. Interesting, that github keeps track of that too now apparently.)

It looks like there was mixed CRLF and LF in the source files. I don't know, I guess that confuses autocrlf on windows? Here's how I normalized the line endings on Linux:

$ cd src/compiler
$ dos2unix *.bas *.bi
$ git add -A
$ git commit -m "fbc: ..."

Then, when checking out on windows, I tested, and seems everything gets converted to CRLF as expected.

@rversteegen, the UTF-16 & UTF-32 files appear to be stored as binary. It does not appear that git recognizes these files as text, likely due the null bytes, so no EOL conversions are performed. I would suspect, depending on which system the file was last modified will determine if it the line endings are found as LF or CRLF. I'm not sure it matters for these files on windows or linux, since the end user needs a capable editor for sure when working with these files.

$ git diff `git hash-object -t tree /dev/null` --numstat origin/master -- *.bas *.bi | grep '^\-'
-	-	examples/unicode/hello_UTF16BE.bas
-	-	examples/unicode/hello_UTF16LE.bas
-	-	examples/unicode/hello_UTF32BE.bas
-	-	examples/unicode/hello_UTF32LE.bas
-	-	examples/unicode/hello_chinese.bas
-	-	examples/unicode/hello_greek.bas
-	-	examples/unicode/hello_japanese.bas
-	-	examples/unicode/hello_korean.bas
-	-	examples/unicode/hello_russian.bas
-	-	tests/wstring/inctest.bas
-	-	tests/wstring/utf16be.bas
-	-	tests/wstring/utf16le.bas
-	-	tests/wstring/utf32be.bas
-	-	tests/wstring/utf32le.bas

I pieced this together from examples I found. git hash-object -t tree /dev/null just generates the empty tree hash 4b825dc642cb6eb9a060e54bf8d69288fbee4904 which every file in origin/master should be different from. And --numstat tells git diff to print the number of lines added/deleted, except if it is binary, then it just outputs - - instead of the line counts. This commend greps for a leading dash to determine if the file is saved as binary in the repo.

@rversteegen
Copy link
Member

rversteegen commented Feb 15, 2019

All files in git are "stored in binary". The line ending conversion is something that happens when going between blob storage and files on disk. I couldn't tell you whether git diff --stat thinks a file is binary iif the line ending conversion would occur. (That's an interesting one-liner though!) If git doesn't convert line endings in UTF16 and UTF32 files that's behaviour (a bug?) which might get fixed in future versions, but I'm happy to ignore that since it's not a problem now.

Speaking of bugs, the file output I posted earlier shows that tests/wstring/utf16le.bas and tests/wstring/inctest.bas have mixed line endings, but I inspected the files with hexdump and they don't! I doubt any of the other UTF16 files have a problem either.

@jayrm
Copy link
Member Author

jayrm commented Feb 16, 2019

@rversteegen

All files in git are "stored in binary"

Heh, that's not what I mean, and you know it. :) I think we could make that claim for any storage system. Thanks though, I agree, that is the correct way to think about this problem.

Looking around in the git source code, starting with convert.c & diff.c, I see heuristics for determining if a file gets EOL conversion or is treated as binary is different logic in each source.

The one-liner I posted, only determines if the files are seen as binary by git. And if the file is seen as binary, then no EOL conversion will occur. However, there are some cases where EOL conversion still does not occur, even though the file is not seen as binary (like mixed line endings, in "text" files). The "binary" files listed by the one-liner is only a subset of the files that do not get EOL conversions.

Looking in git source apply.c, I see there is some logic for keeping CRLF when applying a patch. So I suspect, when the patch was applied, there was a mix of line endings between the patch and what the patch was applied to, and no normalization occured, introducing the mixed line endings in some files, which are not seen as binary, but also do not get EOL converted due the mixed line endings. Something to watch out for in future when applying patches.

Maybe can use git add --renormalize in future to help keep line endings for "text" files normalized. Though really, users should be encouraged to use editors that are more liberal on what input they accept.

Anyway, that's enough of that!

@jayrm jayrm merged commit fabb56e into freebasic:master Feb 16, 2019
@jayrm jayrm deleted the whitespace branch February 16, 2019 14:45
@rversteegen
Copy link
Member

I was being a bit curt, though what I meant is that git doesn't store any metadata on a file indicating whether it's binary or not, unlike svn, which also has similar line-end handling options. Or at least, that's what I thought, but I've just discovered I'm completely wrong: git has text and eol attributes (gitattributes) which are exactly for this purpose. Though technically blobs don't have attributes, only paths have them, it's now practically equivalent to svn attibutes.

But thanks for that deep dive into the implementation details.
All these numerous conversion options and quirks are so complex that I hope to never in my life use them - svn was bad enough.

And I filed that bug against file, so at least something came of my complaints!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants