Erlang escript bundle is treated as JavaScript #236

Closed
ztmr opened this Issue Aug 20, 2012 · 5 comments

@ztmr
Contributor
ztmr commented Aug 20, 2012

An escript bundle is a compressed Erlang script. Linguist incorrectly detects it as JavaScript:

$ file ./rebar
./rebar: a escript script text executable
$ linguist ./rebar
./rebar: 0 lines (0 sloc)
  type:      Binary
  mime type: text/plain
  language:  JavaScript
$

...so many Erlang projects that ship with the rebar build tool script may be detected as JavaScript projects although they are pure Erlang!

@asabil
asabil commented Oct 18, 2012

I am experiencing the same issue, but I also noticed recently that it will misdetect Erlang as Perl...

@stuartpb
Contributor

See https://github.com/basho/luke for an example of this behavior.

@ztmr
Contributor
ztmr commented Mar 16, 2013

Hm, it's strange that the problem still exists on some repositories, because a few weeks ago it was fixed, at least on the repository (http://github.com/ztmr/egtm) where I first discovered the issue. That's why I thought somebody had silently fixed it in the meantime...

@stuartpb stuartpb added a commit to stuartpb/linguist that referenced this issue Mar 16, 2013
@stuartpb stuartpb Add Erlang rebar escript bundles to vendor.yml
Fixes #236
ec786b7
@stuartpb
Contributor

Well, if the file doesn't have an extension, Linguist classifies it using a Bayesian analysis of the tokens in lib/linguist/samples.json, so which language it (mistakenly) assigns to the blob depends on the frequency of the tokens in that particular file.
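To make the token-frequency point concrete, here is a toy naive Bayes classifier in Python. It is only a sketch of the general technique, not Linguist's actual code: the `SAMPLES` table, its counts, and the `classify` function are all made up for illustration and are not taken from samples.json.

```python
import math
from collections import Counter

# Hypothetical per-language token counts (NOT real data from samples.json).
SAMPLES = {
    "JavaScript": Counter({"function": 40, "var": 30, "{": 50, "}": 50}),
    "Erlang": Counter({"-module": 10, "->": 60, "fun": 20, "end": 30}),
}


def classify(tokens):
    """Return (language, log_score) pairs, best match first."""
    vocab = set()
    for counts in SAMPLES.values():
        vocab.update(counts)

    scores = {}
    for lang, counts in SAMPLES.items():
        total = sum(counts.values())
        # Uniform prior over the known languages.
        score = math.log(1.0 / len(SAMPLES))
        for tok in tokens:
            # Laplace smoothing so unseen tokens do not zero out the score.
            score += math.log((counts[tok] + 1) / (total + len(vocab)))
        scores[lang] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that a classifier like this always returns some winner, even when none of the tokens were seen for any language, which is roughly the situation with a mostly binary blob like rebar.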

There are a few things that can be done to fix this bug and others of its type:

  1. Add rebar to lib/linguist/vendor.yml - Proposed as #443.
  2. Add a "SHEBANG#!escript" token to lib/linguist/samples.json. This should keep other Escript bundles from being recognized as something else in the future. I'm not familiar with Erlang or Linguist, so I don't know whether you want extensionless escript files other than rebar to be recognized in a project, or whether they should always be treated as vendored, and if so how that would be done in Linguist.
  3. Don't count any file whose type is determined with less than, say, 90% certainty in the language breakdown. This would fix any other misclassified files Linguist doesn't recognize.
  4. Maybe use file(1) or one of the at least 4 gems that bind to magic(4) before resorting to Bayesian classification(?!).
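
As a rough illustration of the shebang idea (2) and the certainty cutoff (3), here is a Python sketch. It is not how Linguist implements anything: the `INTERPRETERS` table, both function names, and the assumption that the classifier yields per-language log scores are hypothetical.

```python
import math
import posixpath

# Hypothetical interpreter-to-language table, for illustration only.
INTERPRETERS = {"escript": "Erlang", "perl": "Perl", "node": "JavaScript"}


def language_from_shebang(data: bytes):
    """Return a language if the first line is a known shebang, else None."""
    first_line = data.split(b"\n", 1)[0]
    if not first_line.startswith(b"#!"):
        return None
    parts = first_line[2:].decode("ascii", errors="replace").split()
    if not parts:
        return None
    name = posixpath.basename(parts[0])
    # Handle the "#!/usr/bin/env escript" form.
    if name == "env" and len(parts) > 1:
        name = parts[1]
    return INTERPRETERS.get(name)


def confident_language(log_scores, threshold=0.9):
    """Return the best language only if its normalised share >= threshold."""
    best = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {lang: math.exp(s - best) for lang, s in log_scores.items()}
    total = sum(weights.values())
    lang, weight = max(weights.items(), key=lambda kv: kv[1])
    return lang if weight / total >= threshold else None
```

With something like this, a blob would first be checked with `language_from_shebang`, and only if that returns None would the Bayesian result be consulted, with `confident_language` discarding any guess below the cutoff.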
@tnm tnm closed this in #443 Jul 8, 2013
@tnm
Contributor
tnm commented Jul 8, 2013

This is fixed with #443, and the fix will be out on the website soon.
