
Add test case and temp fix for issue72 #74

Merged
merged 10 commits into master from issue72 on May 9, 2019

Conversation

@boyter (Owner) commented May 2, 2019

Don't actually want to merge this yet, as I believe the solution should be more generic, but this does work to fix the issue in #72.

I am thinking that perhaps the full table of BOMs https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding should be included to ensure that it works for all encoding types. No idea what impact that would have though. Is it even worth doing for all of them?

Looking for ideas on this before it gets merged in for good, to ensure it's a solution that does not need to be revisited if possible.

@dbaggerman (Collaborator)

Checking for the BOM seems reasonable to me. Maybe look at https://golang.org/pkg/bytes/#HasPrefix for checking it though, rather than checking individual bytes.
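To illustrate the suggestion, here is a minimal sketch of a BOM check built on bytes.HasPrefix. The names (utf8Bom, stripBom) are illustrative, not scc's actual identifiers:

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8Bom is the three-byte UTF-8 byte order mark.
var utf8Bom = []byte{0xEF, 0xBB, 0xBF}

// stripBom returns content with a leading UTF-8 BOM removed, if present.
func stripBom(content []byte) []byte {
	if bytes.HasPrefix(content, utf8Bom) {
		return content[len(utf8Bom):]
	}
	return content
}

func main() {
	fmt.Println(string(stripBom([]byte("\xEF\xBB\xBFhello")))) // prints "hello"
}
```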

Non-UTF-8 encodings probably aren't ASCII-ish enough for scc to read them properly anyway. Maybe we could just use their presence to show a warning/explanation to the user about encoding support.

@dbaggerman (Collaborator)

Although, I suppose if we wanted to we could use https://golang.org/pkg/unicode/utf16 and https://golang.org/pkg/encoding/binary to populate a Trie with utf-16 encoded tokens and get UTF-16 support that way.

Not sure it's worthwhile, but it's interesting nonetheless.

@boyter (Owner, Author) commented May 3, 2019

I do like that it works on bytes at the moment, which was a deliberate choice to avoid ever worrying about the encoding (I just didn't expect BOMs for the most part). Encoding is a painful thing to work with without something like Python's UnicodeDammit.

I have the table mostly done now:

var ByteOrderMarks = [][]byte{
	{239, 187, 191},       // UTF-8 make this first as it is likely to be the most common case
	{254, 255},            // UTF-16 BE
	{255, 254},            // UTF-16 LE
	{0, 0, 254, 255},      // UTF-32 BE
	{255, 254, 0, 0},      // UTF-32 LE
	{43, 47, 118, 56},     // UTF-7
	{43, 47, 118, 57},     // UTF-7
	{43, 47, 118, 43},     // UTF-7
	{43, 47, 118, 47},     // UTF-7
	{43, 47, 118, 56, 45}, // UTF-7
	{247, 100, 76},        // UTF-1
	{221, 115, 102, 115},  // UTF-EBCDIC
	{14, 254, 255},        // SCSU
}

As you suggested, I have flipped over to https://golang.org/pkg/bytes/#HasPrefix. I had a feeling something like that existed, but wanted to see if I could resolve it myself first before looking at the API. Cheers for finding it.
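For detection, combining the table with bytes.HasPrefix is straightforward. A sketch (hasBom is an illustrative name, not the exact code in the PR):

```go
package main

import (
	"bytes"
	"fmt"
)

// The table from the comment above.
var ByteOrderMarks = [][]byte{
	{239, 187, 191},       // UTF-8
	{254, 255},            // UTF-16 BE
	{255, 254},            // UTF-16 LE
	{0, 0, 254, 255},      // UTF-32 BE
	{255, 254, 0, 0},      // UTF-32 LE
	{43, 47, 118, 56},     // UTF-7
	{43, 47, 118, 57},     // UTF-7
	{43, 47, 118, 43},     // UTF-7
	{43, 47, 118, 47},     // UTF-7
	{43, 47, 118, 56, 45}, // UTF-7
	{247, 100, 76},        // UTF-1
	{221, 115, 102, 115},  // UTF-EBCDIC
	{14, 254, 255},        // SCSU
}

// hasBom reports whether content begins with any known byte order mark.
func hasBom(content []byte) bool {
	for _, bom := range ByteOrderMarks {
		if bytes.HasPrefix(content, bom) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasBom([]byte{239, 187, 191, 'h', 'i'})) // prints "true"
	fmt.Println(hasBom([]byte("plain")))                 // prints "false"
}
```

One subtlety: the UTF-16 LE mark is a prefix of the UTF-32 LE mark, so a first-match loop like this is fine for a yes/no check but would need to test longer marks first if the exact BOM length mattered (e.g. for skipping it).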

@boyter (Owner, Author) commented May 3, 2019

Seems to happen more than you would expect.

[screenshot]

@boyter (Owner, Author) commented May 3, 2019

I think this is much closer to a proper solution. Going to add more tests for it to cover all the cases first. Will then check for any performance issues, which I doubt will be a problem.

@boyter (Owner, Author) commented May 3, 2019

github.com/boyter/scc/processor.checkBomSkip (0.18%, 0.02s)

when tested over a very large repository

Language                 Files     Lines     Code  Comments   Blanks Complexity
───────────────────────────────────────────────────────────────────────────────
Total                    40167   9506667  5699791   2839927   966949     751032

So next to no impact. It actually took me several runs to get the profiler to even pick it up, given the default sampling rate.

@boyter (Owner, Author) commented May 3, 2019

Pretty happy with this now.

@dbaggerman feel free to review at any point.

@dbaggerman (Collaborator)

Since we know that most of these encodings won't actually scan correctly, we should show a warning/error when we detect them via the BOM, not just skip the BOM and carry on.

A token like `for` or `{` will encode to the same sequence of bytes in ASCII or UTF-8, so skipping the BOM and carrying on is correct. But in most/all other encodings it will encode to a different byte sequence, so the complexity analysis would never be correct. Even single-byte tokens like quotes or newlines might not be the same in other encodings, so even the line count could be incorrect.

@dbaggerman (Collaborator)

Actually, it just occurred to me that UTF-16 content would almost certainly contain zero bytes and fail the isBinary check.
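scc's actual isBinary implementation may differ, but a common heuristic is to treat any content containing a NUL byte as binary, which UTF-16 text will almost always trip:

```go
package main

import (
	"bytes"
	"fmt"
)

// isBinary is a common heuristic (not necessarily scc's exact check):
// treat content containing a NUL byte as binary.
func isBinary(content []byte) bool {
	return bytes.IndexByte(content, 0) != -1
}

func main() {
	// "hi" in UTF-16 LE with BOM contains zero bytes.
	utf16le := []byte{0xff, 0xfe, 'h', 0, 'i', 0}
	fmt.Println(isBinary(utf16le))      // prints "true"
	fmt.Println(isBinary([]byte("hi"))) // prints "false"
}
```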

@boyter (Owner, Author) commented May 7, 2019

Yeah, the isBinary check does break on those. I agree, we should change this to emit a warning if a BOM is found for anything non-UTF-8 and then proceed as normal. I'll refactor.

@boyter (Owner, Author) commented May 7, 2019

Modified as per the above. It only skips the BOM if a UTF-8 BOM is found. Otherwise, if the -v verbose option is enabled, it prints a warning indicating the file might not be counted correctly.

A nice side effect is that this should reduce the run-time overhead, as the second check only needs to happen if verbose is enabled, in which case printing will probably be slower than the check itself.
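The described behaviour amounts to roughly the following. This is a sketch; the real checkBomSkip in scc takes different parameters and uses the full BOM table:

```go
package main

import (
	"bytes"
	"fmt"
)

var utf8Bom = []byte{239, 187, 191}

// Illustrative subset of the full BOM table above.
var otherBoms = [][]byte{
	{254, 255},       // UTF-16 BE
	{255, 254},       // UTF-16 LE
	{0, 0, 254, 255}, // UTF-32 BE
}

// checkBomSkip skips a UTF-8 BOM; for any other known BOM it leaves the
// content untouched and, when verbose is set, warns the user.
func checkBomSkip(content []byte, verbose bool) []byte {
	if bytes.HasPrefix(content, utf8Bom) {
		return content[len(utf8Bom):]
	}
	if verbose {
		for _, bom := range otherBoms {
			if bytes.HasPrefix(content, bom) {
				fmt.Println("WARNING: non UTF-8 BOM found, file may not be counted correctly")
				break
			}
		}
	}
	return content
}

func main() {
	fmt.Println(string(checkBomSkip([]byte("\xEF\xBB\xBFcode"), false))) // prints "code"
}
```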

@dbaggerman (Collaborator)

OK, looks good to me.

@boyter boyter merged commit a4fed34 into master May 9, 2019
@boyter boyter deleted the issue72 branch May 9, 2019 03:07