Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve binary file check (fixes #205) #206

Merged
merged 9 commits into from
Jul 16, 2022

Conversation

rasa
Copy link
Contributor

@rasa rasa commented Jun 26, 2022

First draft. All tests pass. Let me know what you think.

@rasa rasa force-pushed the rasa-fix-205-binary-check branch from fa3fb52 to 80365ae Compare June 26, 2022 23:26
@rasa rasa marked this pull request as ready for review June 26, 2022 23:26
@rasa rasa force-pushed the rasa-fix-205-binary-check branch 4 times, most recently from 47c245e to 27d2587 Compare June 27, 2022 00:07
@rasa rasa force-pushed the rasa-fix-205-binary-check branch from 27d2587 to c0d0a17 Compare June 27, 2022 05:27
rasa added 4 commits June 27, 2022 08:49
Also changed download script to Makefile
chardet's DetectBest() returns multiple charsets with the same Confidence level.
Unfortunately, it sometimes returns a different charset. This commit
fixes this by sorting the results returned by DetectAll() and returning
the first alphabetically. This could be certainly be approved.
Copy link
Member

@mstruebing mstruebing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting some errors when running make test - I'm not sure it is related to my environment or not.

.gitattributes Show resolved Hide resolved
pkg/encoding/encoding.go Outdated Show resolved Hide resolved
if len(results) == 0 {
return "", fmt.Errorf("Failed to determine charset")
}
confidence := -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand this correctly: it loops over every possible encoding and has a confidence how likely each of them matches this file?

If so, how does this affect performance in a large repository? I would assume that it will slow the execution time, could you provide some measurements (just barely, doesn't need to be to much scientific)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can do some benchmarking. Unfortunately, except for html files (which often embed their charset in themselves) there is no way to know what encoding a file is without this sort of empirical logic. And go expects UTF-8 for strings.

We might be able to speed things up by first scanning files to see if they contain binary characters, and then run the detect loop on only binary files to find UTF16s and UTF32s to convert them to UTF-8. But the line length check would probably miscount the length of lines on other non-UTF-8 files that are multi-byte encodings.

We could also simply skip processing any files that aren't valid UTF-8 but that's an ugly "solution."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the current code reads each file twice (here and here).

This PR only reads each file once:

rawFileContent, err := os.ReadFile(filePath)
if err != nil {
panic(err)
}
fileContent := string(rawFileContent)
mime, err := files.GetContentTypeBytes(rawFileContent, config)
for _, regex := range textRegexes {
match, _ := regexp.MatchString(regex, mime)
if match {
var charset string
fileContent, charset, err = encoding.DecodeBytes(rawFileContent)
if err != nil {
if charset == "" {
charset = "unknown"
}
config.Logger.Error("Could not decode the %s encoded file: %s", charset, filePath)
config.Logger.Error(err.Error())
}
break
}
}
lines := files.ReadLines(fileContent)

@rasa
Copy link
Contributor Author

rasa commented Jul 3, 2022

I ran a quick test against https://github.com/torvalds/linux between main, and this PR:

$ time ec --no-color >ec-main.out

real    5m23.737s
user    0m0.000s
sys     0m0.000s

$ time ec --no-color >ec-pr206.out

real    7m14.026s
user    0m0.000s
sys     0m0.000s

And here's a diff of the results:

ross@indra:/c/github/torvalds$ diff -u -w --color=never ec-main.out ec-pr206.out
--- ec-main.out 2022-07-02 17:23:37.824216200 -0700
+++ ec-pr206.out 2022-07-02 17:30:51.863097200 -0700
@@ -28486,23 +28486,6 @@
        6: Trailing whitespace
 tools/perf/tests/attr/test-record-spe-physical-address:
        Wrong line endings or new final newline
-tools/perf/tests/pe-file.exe:
-       Wrong line endings or new final newline
-       Not all lines have the correct end of line character
-       30: Trailing whitespace
-       31: Trailing whitespace
-       32: Trailing whitespace
-tools/perf/tests/pe-file.exe.debug:
-       Wrong line endings or new final newline
-       Not all lines have the correct end of line character
-       149: Trailing whitespace
-       170: Trailing whitespace
-       171: Trailing whitespace
-       248: Trailing whitespace
-       269: Trailing whitespace
-       305: Trailing whitespace
-       306: Trailing whitespace
-       307: Trailing whitespace
 tools/perf/trace/beauty/beauty.h:
        69: Trailing whitespace
 tools/perf/trace/beauty/mmap_prot.sh:
@@ -28926,4 +28909,4 @@
 virt/kvm/kvm_main.c:
        2554: Trailing whitespace

-26105 errors found
+26090 errors found

per

$ file tools/perf/tests/pe-file.exe*
tools/perf/tests/pe-file.exe:       PE32+ executable (console) x86-64, for MS Windows
tools/perf/tests/pe-file.exe.debug: PE32+ executable (console) x86-64, for MS Windows

@rasa
Copy link
Contributor Author

rasa commented Jul 4, 2022

The file command's source has comments worth reading on this subject here.

@rasa
Copy link
Contributor Author

rasa commented Jul 15, 2022

@mstruebing
Copy link
Member

Thank you, I will try to get this in this weekend!

@mstruebing mstruebing merged commit 9d4703a into editorconfig-checker:main Jul 16, 2022
@rasa
Copy link
Contributor Author

rasa commented Jul 17, 2022

The CI is failing due to the files not being found. All the tests pass when run locally, so I'm not sure what's up. Do they run fine locally for you too?

@mstruebing
Copy link
Member

No I was assuming it is related to my machine/setup but it seems it is the same for CI

@rasa
Copy link
Contributor Author

rasa commented Jul 17, 2022

@mstruebing OK, lemme see if I can determine why they are failing, and fix it. On it now.

@rasa rasa mentioned this pull request Jul 17, 2022
@rasa
Copy link
Contributor Author

rasa commented Jul 17, 2022

@mstruebing OK, I fixed the issues. The first is I hadn't committed testdata/empty.txt as it was in .gitignore. The second issue is the TestGetRelativePath deleted the current working directory, so all disk-based tests after that would fail.

The fixes are in #210 . Thanks for merging, and sorry for the issues.

@mstruebing
Copy link
Member

Thanks for taking care of this feature and especially the issues after merging, I really appreciate that 🙏

@rasa
Copy link
Contributor Author

rasa commented Jul 18, 2022

And for completeness, the dos2unix's binary check logic is here. That's the logic I patterned our code after, as it's the simplest, and produces the last amount of false positives (imo).

@rasa rasa deleted the rasa-fix-205-binary-check branch September 5, 2022 23:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants