Improve binary file check (fixes #205) #206

rasa · 2022-06-26T20:51:44Z

First draft. All tests pass. Let me know what you think.

Also changed download script to Makefile

chardet's DetectBest() returns multiple charsets with the same Confidence level. Unfortunately, it sometimes returns a different charset. This commit fixes this by sorting the results returned by DetectAll() and returning the first alphabetically. This could be certainly be approved.

mstruebing

I'm getting some errors when running make test - I'm not sure it is related to my environment or not.

.gitattributes

pkg/encoding/encoding.go

mstruebing · 2022-07-02T09:50:54Z

pkg/encoding/encoding.go

+	if len(results) == 0 {
+		return "", fmt.Errorf("Failed to determine charset")
+	}
+	confidence := -1


Do I understand this correctly: it loops over every possible encoding and has a confidence how likely each of them matches this file?

If so, how does this affect performance in a large repository? I would assume that it will slow the execution time, could you provide some measurements (just barely, doesn't need to be to much scientific)

Yes, I can do some benchmarking. Unfortunately, except for html files (which often embed their charset in themselves) there is no way to know what encoding a file is without this sort of empirical logic. And go expects UTF-8 for strings.

We might be able to speed things up by first scanning files to see if they contain binary characters, and then run the detect loop on only binary files to find UTF16s and UTF32s to convert them to UTF-8. But the line length check would probably miscount the length of lines on other non-UTF-8 files that are multi-byte encodings.

We could also simply skip processing any files that aren't valid UTF-8 but that's an ugly "solution."

Also note that the current code reads each file twice (here and here).

This PR only reads each file once:

editorconfig-checker/pkg/validation/validation.go

Lines 39 to 60 in c1839f9

rawFileContent, err := os.ReadFile(filePath)

if err != nil {

panic(err)

}

fileContent := string(rawFileContent)

mime, err := files.GetContentTypeBytes(rawFileContent, config)

for _, regex := range textRegexes {

match, _ := regexp.MatchString(regex, mime)

if match {

var charset string

fileContent, charset, err = encoding.DecodeBytes(rawFileContent)

if err != nil {

if charset == "" {

charset = "unknown"

}

config.Logger.Error("Could not decode the %s encoded file: %s", charset, filePath)

config.Logger.Error(err.Error())

}

break

}

}

lines := files.ReadLines(fileContent)

rasa · 2022-07-03T01:08:15Z

I ran a quick test against https://github.com/torvalds/linux between main, and this PR:

$ time ec --no-color >ec-main.out

real    5m23.737s
user    0m0.000s
sys     0m0.000s

$ time ec --no-color >ec-pr206.out

real    7m14.026s
user    0m0.000s
sys     0m0.000s

And here's a diff of the results:

ross@indra:/c/github/torvalds$ diff -u -w --color=never ec-main.out ec-pr206.out
--- ec-main.out 2022-07-02 17:23:37.824216200 -0700
+++ ec-pr206.out 2022-07-02 17:30:51.863097200 -0700
@@ -28486,23 +28486,6 @@
        6: Trailing whitespace
 tools/perf/tests/attr/test-record-spe-physical-address:
        Wrong line endings or new final newline
-tools/perf/tests/pe-file.exe:
-       Wrong line endings or new final newline
-       Not all lines have the correct end of line character
-       30: Trailing whitespace
-       31: Trailing whitespace
-       32: Trailing whitespace
-tools/perf/tests/pe-file.exe.debug:
-       Wrong line endings or new final newline
-       Not all lines have the correct end of line character
-       149: Trailing whitespace
-       170: Trailing whitespace
-       171: Trailing whitespace
-       248: Trailing whitespace
-       269: Trailing whitespace
-       305: Trailing whitespace
-       306: Trailing whitespace
-       307: Trailing whitespace
 tools/perf/trace/beauty/beauty.h:
        69: Trailing whitespace
 tools/perf/trace/beauty/mmap_prot.sh:
@@ -28926,4 +28909,4 @@
 virt/kvm/kvm_main.c:
        2554: Trailing whitespace

-26105 errors found
+26090 errors found

per

$ file tools/perf/tests/pe-file.exe*
tools/perf/tests/pe-file.exe:       PE32+ executable (console) x86-64, for MS Windows
tools/perf/tests/pe-file.exe.debug: PE32+ executable (console) x86-64, for MS Windows

Specifically: application/ecmascript application/json application/xml +xml

rasa · 2022-07-04T09:39:21Z

The file command's source has comments worth reading on this subject here.

rasa · 2022-07-15T05:44:38Z

git's binary determination logic is at
https://github.com/git/git/blob/f9b95943b68b6b8ca5a6072f50a08411c6449b55/convert.c#L35

mstruebing · 2022-07-15T06:49:16Z

Thank you, I will try to get this in this weekend!

rasa · 2022-07-17T06:31:37Z

The CI is failing due to the files not being found. All the tests pass when run locally, so I'm not sure what's up. Do they run fine locally for you too?

mstruebing · 2022-07-17T10:20:27Z

No I was assuming it is related to my machine/setup but it seems it is the same for CI

rasa · 2022-07-17T12:28:46Z

@mstruebing OK, lemme see if I can determine why they are failing, and fix it. On it now.

rasa · 2022-07-17T15:48:51Z

@mstruebing OK, I fixed the issues. The first is I hadn't committed testdata/empty.txt as it was in .gitignore. The second issue is the TestGetRelativePath deleted the current working directory, so all disk-based tests after that would fail.

The fixes are in #210 . Thanks for merging, and sorry for the issues.

mstruebing · 2022-07-18T06:39:49Z

Thanks for taking care of this feature and especially the issues after merging, I really appreciate that 🙏

rasa · 2022-07-18T17:03:17Z

And for completeness, the dos2unix's binary check logic is here. That's the logic I patterned our code after, as it's the simplest, and produces the last amount of false positives (imo).

rasa force-pushed the rasa-fix-205-binary-check branch from fa3fb52 to 80365ae Compare June 26, 2022 23:26

rasa marked this pull request as ready for review June 26, 2022 23:26

rasa force-pushed the rasa-fix-205-binary-check branch 4 times, most recently from 47c245e to 27d2587 Compare June 27, 2022 00:07

fix(files): Improve binary file check (fixes editorconfig-checker#205)

c0d0a17

rasa force-pushed the rasa-fix-205-binary-check branch from 27d2587 to c0d0a17 Compare June 27, 2022 05:27

rasa added 4 commits June 27, 2022 08:49

Fix URL in download-test-files.sh

73fe942

Fix TestDecodeBytesText test

6b98edf

Also changed download script to Makefile

Remove dead code

146017c

mstruebing reviewed Jul 2, 2022

View reviewed changes

Remove commented out code

48d5755

rasa added 3 commits July 2, 2022 18:10

Add additional types to defaultAllowedContentTypes

f3b0a8a

Specifically: application/ecmascript application/json application/xml +xml

Add application/x-ndjson & +json, update readme

ebe0ab9

Update readme with improved grammar

9e01142

mstruebing merged commit 9d4703a into editorconfig-checker:main Jul 16, 2022

rasa mentioned this pull request Jul 17, 2022

Fix CI test failures #210

Merged

rasa deleted the rasa-fix-205-binary-check branch September 5, 2022 23:47

polarathene mentioned this pull request Oct 3, 2023

Could not decode the IBM420_rtl encoded file #252

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve binary file check (fixes #205) #206

Improve binary file check (fixes #205) #206

rasa commented Jun 26, 2022

mstruebing left a comment

mstruebing Jul 2, 2022

rasa Jul 2, 2022

rasa Jul 2, 2022

rasa commented Jul 3, 2022

rasa commented Jul 4, 2022

rasa commented Jul 15, 2022

mstruebing commented Jul 15, 2022

rasa commented Jul 17, 2022 •

edited

mstruebing commented Jul 17, 2022

rasa commented Jul 17, 2022

rasa commented Jul 17, 2022

mstruebing commented Jul 18, 2022

rasa commented Jul 18, 2022 •

edited

	rawFileContent, err := os.ReadFile(filePath)
	if err != nil {
	panic(err)
	}
	fileContent := string(rawFileContent)
	mime, err := files.GetContentTypeBytes(rawFileContent, config)
	for _, regex := range textRegexes {
	match, _ := regexp.MatchString(regex, mime)
	if match {
	var charset string
	fileContent, charset, err = encoding.DecodeBytes(rawFileContent)
	if err != nil {
	if charset == "" {
	charset = "unknown"
	}
	config.Logger.Error("Could not decode the %s encoded file: %s", charset, filePath)
	config.Logger.Error(err.Error())
	}
	break
	}
	}
	lines := files.ReadLines(fileContent)

Improve binary file check (fixes #205) #206

Improve binary file check (fixes #205) #206

Conversation

rasa commented Jun 26, 2022

mstruebing left a comment

Choose a reason for hiding this comment

mstruebing Jul 2, 2022

Choose a reason for hiding this comment

rasa Jul 2, 2022

Choose a reason for hiding this comment

rasa Jul 2, 2022

Choose a reason for hiding this comment

rasa commented Jul 3, 2022

rasa commented Jul 4, 2022

rasa commented Jul 15, 2022

mstruebing commented Jul 15, 2022

rasa commented Jul 17, 2022 • edited

mstruebing commented Jul 17, 2022

rasa commented Jul 17, 2022

rasa commented Jul 17, 2022

mstruebing commented Jul 18, 2022

rasa commented Jul 18, 2022 • edited

rasa commented Jul 17, 2022 •

edited

rasa commented Jul 18, 2022 •

edited