Add test case and temp fix for issue72 #74
Conversation
Checking for the BOM seems reasonable to me. Maybe look at https://golang.org/pkg/bytes/#HasPrefix for checking it though, rather than checking individual bytes. Non-UTF-8 encodings probably aren't ASCII-ish enough for |
Although, I suppose if we wanted to we could use Not sure it's worthwhile, but it's interesting nonetheless. |
I do like that it works on bytes at the moment. That was a deliberate choice to avoid ever worrying about the encoding (I just didn't expect a BOM for the most part), which is painful to work with without something like Python's unicode dammit. I have the table mostly done now,
As you suggested, I have flipped over to https://golang.org/pkg/bytes/#HasPrefix. I had a feeling something was there, but wanted to see if I could resolve it first before looking at the API. Cheers for finding it. |
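For reference, a minimal sketch of the prefix check being discussed, assuming only the UTF-8 BOM matters at this point. `bytes.HasPrefix` is the real standard library call; `utf8BOM` and `hasUTF8BOM` are hypothetical names used for illustration and are not taken from the project.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the three-byte UTF-8 byte order mark (EF BB BF).
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasUTF8BOM reports whether content starts with the UTF-8 BOM.
// The name is hypothetical; it only illustrates the HasPrefix approach.
func hasUTF8BOM(content []byte) bool {
	// HasPrefix handles inputs shorter than the prefix, so no
	// explicit length check or per-byte comparison is needed.
	return bytes.HasPrefix(content, utf8BOM)
}

func main() {
	withBOM := append([]byte{0xEF, 0xBB, 0xBF}, []byte("package main")...)
	fmt.Println(hasUTF8BOM(withBOM))                // true
	fmt.Println(hasUTF8BOM([]byte("package main"))) // false
	fmt.Println(hasUTF8BOM([]byte{0xEF}))           // false
}
```

Compared with indexing `content[0]`, `content[1]` and `content[2]` directly, this avoids an out-of-range panic on files shorter than three bytes.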
I think this is much closer to a proper solution. Going to add more tests for it to cover all the cases first. Will then check for any performance issues, which I doubt will be a problem. |
when tested over a very large repository
So next to no impact. Took me several runs to get the profiler to even pick it up based on the default sampling size actually. |
Pretty happy with this now. @dbaggerman feel free to review at any point. |
Since we know that most of these encodings won't actually scan correctly, we should show a warning/error when we detect them in the BOM, not just skip the BOM and carry on. A token like |
Actually, it just occurred to me that utf16 would almost certainly contain bytes containing zero and fail the |
Yeah, the isBinary check does break on those. I agree, I should change this to emit a warning if a BOM is found for anything non-UTF-8 and then proceed as normal. I'll refactor. |
Modified as per above. Only skips if UTF-8 BOM is found. Otherwise if Nice side effect is that this should reduce that run-time overhead as the second check only needs to happen if verbose is enabled in which case printing will probably be slower than the check itself. |
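A rough sketch of the behaviour described above, under stated assumptions: the UTF-8 BOM is skipped, and the more expensive scan for other BOMs only runs when verbose output is enabled. The names `bom`, `nonUTF8BOMs` and `checkBOM` are hypothetical, and the table holds only a subset of the BOMs listed on the Wikipedia page, not the full set.

```go
package main

import (
	"bytes"
	"fmt"
)

// bom pairs an encoding name with its byte order mark.
type bom struct {
	name   string
	prefix []byte
}

var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// nonUTF8BOMs is a partial table. UTF-32 (LE) must come before
// UTF-16 (LE) because FF FE 00 00 starts with FF FE.
var nonUTF8BOMs = []bom{
	{"UTF-32 (LE)", []byte{0xFF, 0xFE, 0x00, 0x00}},
	{"UTF-32 (BE)", []byte{0x00, 0x00, 0xFE, 0xFF}},
	{"UTF-16 (LE)", []byte{0xFF, 0xFE}},
	{"UTF-16 (BE)", []byte{0xFE, 0xFF}},
	{"GB-18030", []byte{0x84, 0x31, 0x95, 0x33}},
}

// checkBOM strips a UTF-8 BOM if present. For any other BOM the
// content is returned unchanged, and the encoding name is reported
// only when verbose is true, so the table scan is skipped otherwise.
func checkBOM(content []byte, verbose bool) (rest []byte, detected string) {
	if bytes.HasPrefix(content, utf8BOM) {
		return content[len(utf8BOM):], ""
	}
	if verbose {
		for _, b := range nonUTF8BOMs {
			if bytes.HasPrefix(content, b.prefix) {
				return content, b.name
			}
		}
	}
	return content, ""
}

func main() {
	utf16File := []byte{0xFF, 0xFE, 'a', 0x00}
	if _, name := checkBOM(utf16File, true); name != "" {
		fmt.Printf("warning: %s BOM found, file may not be counted correctly\n", name)
	}
}
```

The ordering of the table matters more than its size: a shorter BOM that is a prefix of a longer one has to be tested after it, otherwise UTF-32 (LE) files would be reported as UTF-16 (LE).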
Ok, looks good to me. |
I don't actually want to merge this yet, as I believe the solution should be more generic, but this does work to fix the issue in #72
I am thinking that perhaps the full table of BOMs at https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding should be included in this to ensure that it works for all encoding types. No idea what impact that would have though. Is it even worth doing it for all of them?
Looking for ideas on this before it gets merged in for good, to ensure it's a solution that does not need to be revisited if possible.