Test for degenerate character encoding case. #829

Closed
geoff-nixon wants to merge 1 commit into github-linguist:master from pullreq:master
@geoff-nixon

Contributor

Linguist doesn't seem to like this file very much at all.
Courtesy of the LLVM 'lit' test cases.

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Basically:
 - Explicitly converts text to UTF-8, replacing invalid characters
   prior to splitting into lines and parsing. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb).
 - Adds the test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) test suite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
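
For readers following along, the first bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in blob_helper.rb: the helper name `lines_as_utf8` is hypothetical, the source encoding is assumed to be known already, and it relies on `String#encode`, so it assumes Ruby 1.9 or later.

```ruby
# Hypothetical helper (not taken from the pull request).
# Reinterprets raw bytes in the given source encoding, transcodes them to
# UTF-8, and substitutes "?" for anything that cannot be converted, so the
# later split never sees an invalid byte sequence.
def lines_as_utf8(data, source_encoding)
  text = data.dup.force_encoding(source_encoding)
  utf8 = text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
  utf8.split(/\r?\n/)
end
```

Here `invalid: :replace` covers byte sequences that are malformed in the source encoding, and `undef: :replace` covers characters that have no mapping in the target encoding.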
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request Dec 16, 2013
This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Basically:
 - Explicitly converts text to UTF-8, replacing invalid characters prior to splitting into lines and/or parsing.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb) and [sample.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/sample.rb).
 - Adds a test case (from LLVM's [lit](http://llvm.org/docs/CommandGuide/lit.html) test suite) as [samples/Python/invalid-encoding.py](https://github.com/pullreq/linguist/blob/samples/Python/invalid-encoding.py).

Tested with Ruby 1.8.7p358 and 2.0.0p353 on Darwin.
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request Dec 23, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.7...

But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can close the issue if/when you choose to merge this, close the pull if you think this will be resolved another way, or close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess?

Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:
 - For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
 - For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
 - Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
 - Includes the following new test cases for the above, all taken from real repositories here on GitHub:
    - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
    - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
    - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
    - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
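
The UTF-16 round trip described in the second bullet above might look something like the sketch below. The method name `scrub_utf8` is hypothetical and this is not the code from the pull request; it only illustrates the idea of forcing a real transcoding step so that the replacement options take effect. It assumes Ruby 1.9 or later and uses UTF-16BE so that no byte order mark is involved.

```ruby
# Hypothetical helper (not taken from the pull request).
# Treats the input as UTF-8, transcodes it to UTF-16BE and back again.
# Invalid byte sequences are replaced during the first step with Ruby's
# default replacement character for Unicode targets (U+FFFD).
def scrub_utf8(data)
  data.dup.force_encoding("UTF-8")
      .encode("UTF-16BE", invalid: :replace, undef: :replace)
      .encode("UTF-8")
end
```

Text that was already valid UTF-8 comes back unchanged; only the broken sequences are substituted.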
geoff-nixon pushed a commit to pullreq/linguist that referenced this pull request Dec 29, 2013
…errors).

So I've gone ahead and rebased this onto 2.10.8...

But can I ask, um, what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can close the issue if/when you choose to merge this, close the pull if you think this will be resolved another way, or close them both if this is a wontfix. It's totally fine however you choose, it's your project after all... I just get a little antsy with a pull just sitting open while new revisions get released, I guess?

Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:
 - For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
 - For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
 - Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
 - Includes the following new test cases for the above, all taken from real repositories here on GitHub:
    - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
    - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
    - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
    - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
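
One way to picture the workaround for an encoding name Ruby does not recognise (such as the `IBM424_rtl` reported for xor-sanity.js) is the sketch below. Both helper names are hypothetical, and falling back to binary is an assumption made for illustration; the pull request may handle this case differently.

```ruby
# Hypothetical helpers (not taken from the pull request).
# Looks up an encoding by name and falls back to BINARY when Ruby does not
# know the name, instead of letting the ArgumentError propagate.
def safe_encoding(name, fallback = Encoding::BINARY)
  Encoding.find(name.to_s)
rescue ArgumentError
  fallback
end

# Decodes raw bytes using the reported encoding name where possible.
# Under the BINARY fallback, ASCII bytes pass through unchanged and every
# byte above 0x7F is replaced with "?".
def decode_with_fallback(data, reported_name)
  data.dup.force_encoding(safe_encoding(reported_name))
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
```

The trade-off in this sketch is that non-ASCII content is lost under the fallback, which seems acceptable when the goal is only to tokenize or classify the file without raising.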
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 18, 2024