Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force UTF-8 for filenames in breakdown analysis #4465

Merged
merged 11 commits into from
Apr 3, 2019

Conversation

lildude
Copy link
Member

@lildude lildude commented Mar 14, 2019

As reported in #4028 the JSON output for repos which contain filenames that contain unicode chars fails with and encoding error like this:

$ git init foo
Initialized empty Git repository in /Users/lildude/Downloads/trash/foo/.git/
$ cd foo
$ echo "foo" > foo/fileã.rb
$ git add fileã.rb
$ git commit -m 'Foo'
[master (root-commit) c1e9aa2] Foo
 1 file changed, 1 insertion(+)
 create mode 100644 "file\303\243.rb"
$ github-linguist --json
/Users/lildude/.rbenv/versions/2.4.3/lib/ruby/gems/2.4.0/gems/json-2.1.0/lib/json/common.rb:224:in `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
	from /Users/lildude/.rbenv/versions/2.4.3/lib/ruby/gems/2.4.0/gems/json-2.1.0/lib/json/common.rb:224:in `generate'
	from /Users/lildude/.rbenv/versions/2.4.3/lib/ruby/gems/2.4.0/gems/json-2.1.0/lib/json/common.rb:224:in `generate'
	from /Users/lildude/.rbenv/versions/2.4.3/lib/ruby/gems/2.4.0/gems/json-2.1.0/lib/json/common.rb:394:in `dump'
	from /Users/lildude/.rbenv/versions/2.4.3/lib/ruby/gems/2.4.0/gems/github-linguist-7.0.0/bin/github-linguist:47:in `<top (required)>'
	from /Users/lildude/.rbenv/versions/2.4.3/bin/github-linguist:23:in `load'
	from /Users/lildude/.rbenv/versions/2.4.3/bin/github-linguist:23:in `<main>'
$

Performing a breakdown analysis behind the scenes on GitHub.com will result in a 500 error there too.

As pointed out by @tenderlove:

It looks like we're getting this error because we're trying to dump out filesystem information (specifically file paths) as JSON. Unfortunately:

  1. JSON can only represent UTF-8 strings
  2. Filesystems can use any encoding for file names
  3. Git must support storing these paths
  4. Git doesn't store encoding information about the paths, they are just a list of bytes

... and he proposed two possible solutions. I've gone for the latter in this PR and forcing UTF-8 on all filenames when performing a breakdown.

@tenderlove do you see any issues with this approach? Seems too easy 😉.

Now analysis of my example produces:

$ bundle exec bin/github-linguist ~/tmp/trash/foo --json | jq
{
  "Ruby": [
    "fileã.rb"
  ]
}
$

I've added a new fixture file to the test/attributes branch in #4464 (will need to change the SHA in this PR when that PR has been merged) in order to test this encoding enforcement.

I've gone with this as we don't care about the content of the file and can't add it to the samples as the content is deliberately descriptive about the purpose of the file. As test/fixtures is vendored by default the normal testing wouldn't pick up the file, so I've piggy-backed onto the repo tests when we un-vendor test/fixtures.

If the enforcement doesn't happen, the test fails as such:

  1) Failure:
TestRepository#test_repo_git_attributes [/Users/lildude/github/linguist/test/test_repository.rb:69]:
Expected false to be truthy.

Fixes #4028

@lildude lildude changed the title Force UTF-8 for filenames in JSON breakdown Force UTF-8 for filenames in breakdown analysis Mar 14, 2019
@Alhadis
Copy link
Collaborator

Alhadis commented Apr 2, 2019

I'm not sure if this issue is connected with the encoding errors I get on OpenBSD. Here's what bundle exec rake looks like when I haven't overridden its environment with LANG:

$ type brake
brake is aliased to `LANG=en_AU.UTF-8 bundle exec rake'

This seems to fix the encoding errors. I'm still not sure what's going on, or why this issue only happens on OpenBSD...

@lildude lildude requested a review from tenderlove April 2, 2019 14:08
@lildude
Copy link
Member Author

lildude commented Apr 2, 2019

@tenderlove If you've got a free moment, I'd like your thoughts on this fix based on your comment at #4028 (comment) hence I've requested a review from you.

Copy link
Contributor

@tenderlove tenderlove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. As I said in a comment, we probably want to scrub the string after force encoding, but this will work.

@tenderlove
Copy link
Contributor

@Alhadis the LANG environment variable impacts the encoding of strings read in by the program. It may have been previously set to C which would cause strings read in (including the names of files) to be tagged as US-ASCII even though they contained unicode characters. Changing LANG to include UTF-8 would cause the strings read to be tagged as UTF-8 so it makes sense that your environment variable fixed this issue.

I hope that helps!

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 2, 2019

Ah, that would make sense. Still, isn't there any way we could detect problematic locale settings up-front, so as to not confuse the next user who's faced with a waterfall of unhelpful error messages...

@lildude lildude changed the title Force UTF-8 for filenames in breakdown analysis Force UTF-8 for filenames in JSON breakdown analysis Apr 3, 2019
@lildude
Copy link
Member Author

lildude commented Apr 3, 2019

Still, isn't there any way we could detect problematic locale settings up-front, so as to not confuse the next user who's faced with a waterfall of unhelpful error messages...

🤔 this is more of a user env issue than a Linguist issue. We could put some sort of locale check at the beginning of github-linguist like:

abort "A UTF-8 compatible locale is required to run github-linguist" unless Encoding.default_external.to_s =~ /UTF-8/

But we may need to think carefully about this. Def a topic for another issue, though you're the only person I know who's hit this issue 😉.

@lildude lildude changed the title Force UTF-8 for filenames in JSON breakdown analysis Force UTF-8 for filenames in breakdown analysis Apr 3, 2019
@lildude lildude merged commit bf95666 into master Apr 3, 2019
@lildude lildude deleted the lildude/force-utf-8-filenames branch April 3, 2019 10:59
@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2019

this is more of a user env issue than a Linguist issue.

Is it? Every time I see that warning emitters by charlock_holmes about the default encoding variable not being set, I can't help but wonder if that's got something to do with it.

In any case, I've grown so used to running my brake alias that I no longer think anything of it. 😝 Sometimes I use Docker to remind me that I've forgotten to override the LANG= variable again (where "docker" is really a shell-script in my ~/.files/bin directory on OpenBSD whose sole purpose is to cheat the grammar registration script.

... but don't blame me though, there's no Docker port for OpenBSD, and I barely know a single damn thing about Docker. 😢

@lildude
Copy link
Member Author

lildude commented Apr 3, 2019

Is it? Every time I see that warning emitters by charlock_holmes about the default encoding variable not being set, I can't help but wonder if that's got something to do with it.

Oh, I get those too and everything I do is in UTF-8 😀

@Alhadis
Copy link
Collaborator

Alhadis commented Apr 3, 2019

I know, I see the warnings emitted on TravisCI as well. When I did some digging, I remember reading something about it deferring to the first locale it could find in the site's iconv headers or... something along those lines. I came away with the impression that the locale handling was nondeterministic, and subject to variation between platforms.

ANYWAYYYYY, don't mind me. :D I'm happy with my Perl and JavaScript, where locale is more sane than, erm, Ruby's. 😀

(I also realised I've been side-tracked the whole day and haven't finished tending to #2988 yet. Time to get another oldie off our lingering issues list. 👍

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Linguist JSON output fails with ASCII-8BIT to UTF-8 encoding issue
3 participants