Test for degenerate character encoding case. #829
Closed
geoff-nixon wants to merge 1 commit into
This was referenced Dec 12, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly convert text to UTF-8, replacing invalid characters prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
- Add the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request on Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Basically:

- Explicitly convert text to UTF-8, replacing invalid characters prior to splitting into lines and/or parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb) and [sample.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/sample.rb).
- Add a test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) testsuite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request on Dec 23, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.7... But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this; b) close the pull if you think this will be resolved another way; or c) close them both if this is a wontfix? It's totally fine however you choose, your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
- Includes the following new test cases for the above, all taken from real repositories here on GitHub:
  - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
  - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
  - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
  - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
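The three conversion rules listed above can be sketched roughly as follows. This is an illustrative sketch only, not the code from `blob_helper.rb`: the helper name `to_valid_utf8` and its `detected` parameter are hypothetical, it assumes Ruby 2.0 or later (so it would not run on the 1.8.7 interpreter mentioned earlier), and it uses `UTF-16BE` as the intermediate encoding purely to keep the byte order explicit.

```ruby
# Hypothetical helper (not taken from the pull request): return a copy of
# `data` that is guaranteed to be valid UTF-8.
# `detected` stands in for whatever encoding name a detector reported.
def to_valid_utf8(data, detected = "UTF-8")
  encoding = begin
    Encoding.find(detected)
  rescue ArgumentError
    # The reported name is unknown to Ruby: treat the bytes as UTF-8 instead.
    Encoding::UTF_8
  end

  opts = { invalid: :replace, undef: :replace }
  text = data.dup.force_encoding(encoding)

  if [Encoding::UTF_8, Encoding::US_ASCII].include?(encoding)
    # Same-encoding conversion would skip validation, so round-trip through
    # UTF-16BE; invalid bytes are replaced on the way out.
    text.force_encoding(Encoding::UTF_8)
    text.encode(Encoding::UTF_16BE, **opts).encode(Encoding::UTF_8)
  else
    # Different source encoding: transcode directly, replacing anything
    # invalid or without a UTF-8 equivalent.
    text.encode(Encoding::UTF_8, **opts)
  end
end

cleaned = to_valid_utf8("abc\xFFdef".b)
puts cleaned.valid_encoding?  # safe to split into lines or tokenize now
```

On Ruby 2.1 and later, `String#scrub` covers the same-encoding case without the round trip; the sketch keeps the UTF-16 detour only because that is the approach the description names.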
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request on Dec 29, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.8... But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can a) close the issue if/when you choose to merge this; b) close the pull if you think this will be resolved another way; or c) close them both if this is a wontfix? It's totally fine however you choose, your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess? Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829. Addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
- Includes the following new test cases for the above, all taken from real repositories here on GitHub:
  - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
  - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
  - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
  - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
Doesn't look like it likes this very much at all.
Courtesy of the LLVM 'lit' test cases.