Files with non-English characters are treated as binary files #98

Closed
vsushkov opened this Issue Oct 17, 2012 · 18 comments

Files with non-English characters are treated as binary files. Test data set:

test
тест

Put these two lines in a file and search for "test".

Contributor

gjtorikian commented Oct 17, 2012

It's about percentage. If something like 10% of the bytes encountered are "suspicious," then it's flagged as binary.

I believe Perl's -B operates the exact same way, but with around 30% instead.

I'd like to search in Russian-only files too. Isn't it a good idea for ag?

Contributor

gjtorikian commented Oct 17, 2012

I'm not saying it's not a good idea--I'm not even a maintainer for the project!

Perhaps I am mistaken, though. ack is able to find the text, using test and тест. I always thought it stopped at a percentage threshold.

In any event, ag's problem is still about a percentage threshold. The following file, for example, does find matches:

test
т
test
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
testtest
test
test

(In other words, one Russian character amongst other "Latin" ones.)
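
The threshold behaviour described above can be sketched roughly as follows. This is not ag's actual is_binary() code; the function name `looks_binary`, the 10% default, and the definition of a "suspicious" byte are all assumptions made for illustration.

```python
def looks_binary(data: bytes, threshold: float = 0.10) -> bool:
    """Hypothetical heuristic: call data binary when the share of
    'suspicious' bytes is above `threshold` (assumed 10% here)."""
    if not data:
        return False
    if b"\x00" in data:
        # Assumption: a NUL byte is treated as a strong sign of binary data.
        return True
    # Assumption: "suspicious" means any non-ASCII byte, or an ASCII control
    # byte other than tab (9), line feed (10), or carriage return (13).
    suspicious = sum(
        1 for b in data if b >= 0x80 or (b < 0x20 and b not in (9, 10, 13))
    )
    return suspicious / len(data) > threshold


# The two-line sample from the report: 8 of its 14 bytes are non-ASCII.
two_lines = "test\nтест\n".encode("utf-8")
# Mostly Latin text with one Cyrillic letter, like the longer sample.
mostly_latin = ("test\n" * 30 + "т\n").encode("utf-8")

print(looks_binary(two_lines))
print(looks_binary(mostly_latin))
```

Under these assumptions the two-line file is flagged because each Cyrillic letter takes two non-ASCII bytes in UTF-8, while a single Cyrillic letter in a long Latin file stays well under the threshold.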

I suggest checking whether a file is binary by its filename extension. As I understand it, that should be even faster than the current approach.

Contributor

gjtorikian commented Oct 17, 2012

Eh, that's not really efficient. What about filenames without extensions, like ag, find, or Makefile? Which of these are binary?

Yep, checking the filename extension is not a good idea. What about the Linux file program? It does a good job of determining file types. I think ag could use it or borrow some of its logic.
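
For reference, one way to ask the file program about a single file is sketched below. It assumes a file(1) implementation that supports the `--brief` and `--mime-encoding` options; the helper name `mime_encoding` is hypothetical, and this is not something ag is known to do.

```python
import shutil
import subprocess
from typing import Optional


def mime_encoding(path: str) -> Optional[str]:
    """Return what `file --brief --mime-encoding` reports for `path`,
    or None when the file program is missing or fails."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        # "--" stops option parsing so paths starting with "-" are safe.
        ["file", "--brief", "--mime-encoding", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None
```

On typical installations this returns a label such as `utf-8` or `us-ascii` for text and `binary` for data it cannot classify. Spawning a process per file would be slow for a search tool, so this is only a way to experiment with what file decides.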

Owner

ggreer commented Oct 19, 2012

I tweaked is_binary() to be a little more forgiving. Try it out now. If my change doesn't do the trick, there are other ideas we can try. For example, if the file is UTF-8 encoded, there shouldn't be very many bytes above 0b10111111.
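
The UTF-8 idea can be tried out with a small validity check like the one below. It is a sketch of one possible approach, not the change that went into is_binary(); the name `is_valid_utf8` is hypothetical. It uses an incremental decoder so that a sample cut off in the middle of a multi-byte character is not rejected.

```python
import codecs


def is_valid_utf8(sample: bytes) -> bool:
    """Return True when `sample` decodes as UTF-8, tolerating a
    multi-byte sequence that is truncated at the end of the sample."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False buffers an incomplete trailing sequence instead of
        # raising, which matters when only the start of a file is read.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


russian = "тест".encode("utf-8")
print(is_valid_utf8(russian))                  # well-formed UTF-8
print(is_valid_utf8(russian[:-1]))             # last character cut in half
print(is_valid_utf8("тест".encode("cp1251")))  # same word in a legacy encoding
```

A check like this only says the bytes are well-formed UTF-8. Text in other encodings, such as Windows-1251, would still need a separate heuristic.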

Version compiled from git:master treats my LaTeX files (in Russian) as binary :(

 % ag usepackage Руководство_оператора.tex -a
Binary file Руководство_оператора.tex matches.
Owner

ggreer commented Nov 6, 2012

@krigstask can you give me some files that fail?

Here you are: https://gist.github.com/4029832

It's a sample rST file. I can't publish those LaTeX files, sorry.

Owner

ggreer commented Nov 7, 2012

A sample is fine. Thanks.

Owner

ggreer commented Nov 7, 2012

Try master now.

Great, seems to work now!

Owner

ggreer commented Nov 8, 2012


Victory!

ggreer closed this Nov 8, 2012

vsushkov commented Nov 8, 2012

@ggreer Could you please update the Homebrew formula?

Owner

ggreer commented Nov 8, 2012

Oh yeah I should tag a release at some point in the near future.

Owner

ggreer commented Nov 8, 2012

New release tagged. The homebrew pull request is here: mxcl/homebrew#15911

vsushkov commented Nov 8, 2012

Megacool. Thanks. 🍰
