
Add test case and temp fix for issue72 #74

Merged
merged 10 commits into master from issue72 on May 9, 2019

Conversation

@boyter (Owner) commented May 2, 2019

Don't actually want to merge this yet, as I believe the solution should be more generic, but this does work to fix the issue in #72.

I am thinking that perhaps the full table of BOMs https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding should be included to ensure that it works for all encoding types. No idea what impact that would have though. Is it even worth doing for all of them?

Looking for ideas on this before it gets merged in for good, to ensure it's a solution that does not need to be revisited if possible.

@dbaggerman (Collaborator)

Checking for the BOM seems reasonable to me. Maybe look at https://golang.org/pkg/bytes/#HasPrefix for checking it though, rather than checking individual bytes.
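To illustrate the suggestion, here is a minimal sketch of a BOM check built on bytes.HasPrefix. The names (utf8Bom, stripBom) are illustrative, not scc's actual identifiers:

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8Bom is the three-byte UTF-8 byte order mark.
var utf8Bom = []byte{0xEF, 0xBB, 0xBF}

// stripBom returns content with a leading UTF-8 BOM removed, if present.
func stripBom(content []byte) []byte {
	if bytes.HasPrefix(content, utf8Bom) {
		return content[len(utf8Bom):]
	}
	return content
}

func main() {
	fmt.Println(string(stripBom([]byte("\xEF\xBB\xBFhello")))) // prints "hello"
}
```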

Non-UTF-8 encodings probably aren't ASCII-ish enough for scc to read them properly anyway. Maybe we could just use their presence to show a warning/explanation to the user about encoding support.

@dbaggerman (Collaborator)

Although, I suppose if we wanted to we could use https://golang.org/pkg/unicode/utf16 and https://golang.org/pkg/encoding/binary to populate a Trie with utf-16 encoded tokens and get UTF-16 support that way.

Not sure it's worthwhile, but it's interesting nonetheless.

@boyter (Owner, Author) commented May 3, 2019

I do like that it works on bytes at the moment, which was a deliberate choice to avoid ever worrying about the encoding (I just didn't expect BOMs for the most part). Encoding is a painful thing to work with without something like Python's UnicodeDammit.

I have the table mostly done now:

var ByteOrderMarks = [][]byte{
	{239, 187, 191},       // UTF-8 make this first as it is likely to be the most common case
	{254, 255},            // UTF-16 BE
	{255, 254},            // UTF-16 LE
	{0, 0, 254, 255},      // UTF-32 BE
	{255, 254, 0, 0},      // UTF-32 LE
	{43, 47, 118, 56},     // UTF-7
	{43, 47, 118, 57},     // UTF-7
	{43, 47, 118, 43},     // UTF-7
	{43, 47, 118, 47},     // UTF-7
	{43, 47, 118, 56, 45}, // UTF-7
	{247, 100, 76},        // UTF-1
	{221, 115, 102, 115},  // UTF-EBCDIC
	{14, 254, 255},        // SCSU
}

As you suggested, I have flipped over to https://golang.org/pkg/bytes/#HasPrefix. I had a feeling something like that existed, but wanted to see if I could resolve it myself first before looking at the API. Cheers for finding it.
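For detection, combining the table with bytes.HasPrefix is straightforward. A sketch (hasBom is an illustrative name, not the exact code in the PR):

```go
package main

import (
	"bytes"
	"fmt"
)

// The table from the comment above.
var ByteOrderMarks = [][]byte{
	{239, 187, 191},       // UTF-8
	{254, 255},            // UTF-16 BE
	{255, 254},            // UTF-16 LE
	{0, 0, 254, 255},      // UTF-32 BE
	{255, 254, 0, 0},      // UTF-32 LE
	{43, 47, 118, 56},     // UTF-7
	{43, 47, 118, 57},     // UTF-7
	{43, 47, 118, 43},     // UTF-7
	{43, 47, 118, 47},     // UTF-7
	{43, 47, 118, 56, 45}, // UTF-7
	{247, 100, 76},        // UTF-1
	{221, 115, 102, 115},  // UTF-EBCDIC
	{14, 254, 255},        // SCSU
}

// hasBom reports whether content begins with any known byte order mark.
func hasBom(content []byte) bool {
	for _, bom := range ByteOrderMarks {
		if bytes.HasPrefix(content, bom) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasBom([]byte{239, 187, 191, 'h', 'i'})) // prints "true"
	fmt.Println(hasBom([]byte("plain")))                 // prints "false"
}
```

One subtlety: the UTF-16 LE mark is a prefix of the UTF-32 LE mark, so a first-match loop like this is fine for a yes/no check but would need to test longer marks first if the exact BOM length mattered (e.g. for skipping it).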

@boyter (Owner, Author) commented May 3, 2019

Seems to happen more than you would expect.

[screenshot]

@boyter (Owner, Author) commented May 3, 2019

I think this is much closer to a proper solution. Going to add more tests for it to cover all the cases first. Will then check for any performance issues, which I doubt will be a problem.

@boyter (Owner, Author) commented May 3, 2019

github.com/boyter/scc/processor.checkBomSkip (0.18%, 0.02s)

when tested over a very large repository

Language                 Files     Lines     Code  Comments   Blanks Complexity
───────────────────────────────────────────────────────────────────────────────
Total                    40167   9506667  5699791   2839927   966949     751032

So next to no impact. It actually took me several runs to get the profiler to even pick it up, given the default sampling rate.

@boyter (Owner, Author) commented May 3, 2019

Pretty happy with this now.

@dbaggerman feel free to review at any point.

@dbaggerman (Collaborator)

Since we know that most of these encodings won't actually scan correctly, we should show a warning/error when we detect them via the BOM, not just skip the BOM and carry on.

A token like `for` or `{` will encode to the same sequence of bytes in ASCII or UTF-8, so skipping the BOM and carrying on is correct. But in most/all other encodings it will encode to a different byte sequence, so the complexity analysis would never be correct. Even single-byte tokens like quotes or newlines might not be the same in other encodings, so even the line count could be incorrect.

@dbaggerman (Collaborator)

Actually, it just occurred to me that UTF-16 content would almost certainly contain zero bytes and fail the isBinary check.
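scc's actual isBinary implementation may differ, but a common heuristic is to treat any content containing a NUL byte as binary, which UTF-16 text will almost always trip:

```go
package main

import (
	"bytes"
	"fmt"
)

// isBinary is a common heuristic (not necessarily scc's exact check):
// treat content containing a NUL byte as binary.
func isBinary(content []byte) bool {
	return bytes.IndexByte(content, 0) != -1
}

func main() {
	// "hi" in UTF-16 LE with BOM contains zero bytes.
	utf16le := []byte{0xff, 0xfe, 'h', 0, 'i', 0}
	fmt.Println(isBinary(utf16le))      // prints "true"
	fmt.Println(isBinary([]byte("hi"))) // prints "false"
}
```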

@boyter (Owner, Author) commented May 7, 2019

Yeah, the isBinary check does break on those. I agree, we should change this to emit a warning if a BOM is found for anything non-UTF-8 and then proceed as normal. I'll refactor.

@boyter (Owner, Author) commented May 7, 2019

Modified as per the above. It only skips the BOM if a UTF-8 BOM is found. Otherwise, if the -v verbose option is enabled, it prints a warning indicating the file might not be counted correctly.

A nice side effect is that this should reduce the run-time overhead, as the second check only needs to happen if verbose is enabled, in which case printing will probably be slower than the check itself.
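The described behaviour amounts to roughly the following. This is a sketch; the real checkBomSkip in scc takes different parameters and uses the full BOM table:

```go
package main

import (
	"bytes"
	"fmt"
)

var utf8Bom = []byte{239, 187, 191}

// Illustrative subset of the full BOM table above.
var otherBoms = [][]byte{
	{254, 255},       // UTF-16 BE
	{255, 254},       // UTF-16 LE
	{0, 0, 254, 255}, // UTF-32 BE
}

// checkBomSkip skips a UTF-8 BOM; for any other known BOM it leaves the
// content untouched and, when verbose is set, warns the user.
func checkBomSkip(content []byte, verbose bool) []byte {
	if bytes.HasPrefix(content, utf8Bom) {
		return content[len(utf8Bom):]
	}
	if verbose {
		for _, bom := range otherBoms {
			if bytes.HasPrefix(content, bom) {
				fmt.Println("WARNING: non UTF-8 BOM found, file may not be counted correctly")
				break
			}
		}
	}
	return content
}

func main() {
	fmt.Println(string(checkBomSkip([]byte("\xEF\xBB\xBFcode"), false))) // prints "code"
}
```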

@dbaggerman (Collaborator)

OK, looks good to me.

@boyter boyter merged commit a4fed34 into master May 9, 2019
@boyter boyter deleted the issue72 branch May 9, 2019 03:07