-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve binary file check (fixes #205) #206
Improve binary file check (fixes #205) #206
Conversation
fa3fb52
to
80365ae
Compare
47c245e
to
27d2587
Compare
27d2587
to
c0d0a17
Compare
Also changed download script to Makefile
chardet's DetectBest() returns multiple charsets with the same Confidence level. Unfortunately, it sometimes returns a different charset. This commit fixes this by sorting the results returned by DetectAll() and returning the first alphabetically. This could be certainly be approved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting some errors when running make test
- I'm not sure it is related to my environment or not.
if len(results) == 0 { | ||
return "", fmt.Errorf("Failed to determine charset") | ||
} | ||
confidence := -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I understand this correctly: it loops over every possible encoding and has a confidence how likely each of them matches this file?
If so, how does this affect performance in a large repository? I would assume that it will slow the execution time, could you provide some measurements (just barely, doesn't need to be to much scientific)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can do some benchmarking. Unfortunately, except for html files (which often embed their charset in themselves) there is no way to know what encoding a file is without this sort of empirical logic. And go expects UTF-8 for strings.
We might be able to speed things up by first scanning files to see if they contain binary characters, and then run the detect loop on only binary files to find UTF16s and UTF32s to convert them to UTF-8. But the line length check would probably miscount the length of lines on other non-UTF-8 files that are multi-byte encodings.
We could also simply skip processing any files that aren't valid UTF-8 but that's an ugly "solution."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that the current code reads each file twice (here and here).
This PR only reads each file once:
editorconfig-checker/pkg/validation/validation.go
Lines 39 to 60 in c1839f9
rawFileContent, err := os.ReadFile(filePath) | |
if err != nil { | |
panic(err) | |
} | |
fileContent := string(rawFileContent) | |
mime, err := files.GetContentTypeBytes(rawFileContent, config) | |
for _, regex := range textRegexes { | |
match, _ := regexp.MatchString(regex, mime) | |
if match { | |
var charset string | |
fileContent, charset, err = encoding.DecodeBytes(rawFileContent) | |
if err != nil { | |
if charset == "" { | |
charset = "unknown" | |
} | |
config.Logger.Error("Could not decode the %s encoded file: %s", charset, filePath) | |
config.Logger.Error(err.Error()) | |
} | |
break | |
} | |
} | |
lines := files.ReadLines(fileContent) |
I ran a quick test against https://github.com/torvalds/linux between main, and this PR: $ time ec --no-color >ec-main.out
real 5m23.737s
user 0m0.000s
sys 0m0.000s
$ time ec --no-color >ec-pr206.out
real 7m14.026s
user 0m0.000s
sys 0m0.000s And here's a diff of the results: ross@indra:/c/github/torvalds$ diff -u -w --color=never ec-main.out ec-pr206.out
--- ec-main.out 2022-07-02 17:23:37.824216200 -0700
+++ ec-pr206.out 2022-07-02 17:30:51.863097200 -0700
@@ -28486,23 +28486,6 @@
6: Trailing whitespace
tools/perf/tests/attr/test-record-spe-physical-address:
Wrong line endings or new final newline
-tools/perf/tests/pe-file.exe:
- Wrong line endings or new final newline
- Not all lines have the correct end of line character
- 30: Trailing whitespace
- 31: Trailing whitespace
- 32: Trailing whitespace
-tools/perf/tests/pe-file.exe.debug:
- Wrong line endings or new final newline
- Not all lines have the correct end of line character
- 149: Trailing whitespace
- 170: Trailing whitespace
- 171: Trailing whitespace
- 248: Trailing whitespace
- 269: Trailing whitespace
- 305: Trailing whitespace
- 306: Trailing whitespace
- 307: Trailing whitespace
tools/perf/trace/beauty/beauty.h:
69: Trailing whitespace
tools/perf/trace/beauty/mmap_prot.sh:
@@ -28926,4 +28909,4 @@
virt/kvm/kvm_main.c:
2554: Trailing whitespace
-26105 errors found
+26090 errors found per $ file tools/perf/tests/pe-file.exe*
tools/perf/tests/pe-file.exe: PE32+ executable (console) x86-64, for MS Windows
tools/perf/tests/pe-file.exe.debug: PE32+ executable (console) x86-64, for MS Windows |
Specifically: application/ecmascript application/json application/xml +xml
The file command's source has comments worth reading on this subject here. |
git's binary determination logic is at |
Thank you, I will try to get this in this weekend! |
The CI is failing due to the files not being found. All the tests pass when run locally, so I'm not sure what's up. Do they run fine locally for you too? |
No I was assuming it is related to my machine/setup but it seems it is the same for CI |
@mstruebing OK, lemme see if I can determine why they are failing, and fix it. On it now. |
@mstruebing OK, I fixed the issues. The first is I hadn't committed The fixes are in #210 . Thanks for merging, and sorry for the issues. |
Thanks for taking care of this feature and especially the issues after merging, I really appreciate that 🙏 |
And for completeness, the dos2unix's binary check logic is here. That's the logic I patterned our code after, as it's the simplest, and produces the last amount of false positives (imo). |
First draft. All tests pass. Let me know what you think.