Disambiguate between OCaml and Standard ML #2227

samoht · 2015-03-13T15:07:37Z

samoht · 2015-03-16T18:26:34Z

Any feedback on that PR? Thanks!

pchaigno · 2015-03-16T19:44:01Z

We should first try with simpler regular expressions. Have you tested the one @raphael-proust proposed in #2208?

disambiguate "SML", "OCaml" do |data|
  if /=> /.match(data)
    Language["SML"]
  elsif /module|let rec /.match(data)
    Language["OCaml"]
  end
end
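The proposed heuristic can be tried outside Linguist with a small standalone sketch. It assumes nothing from the gem: `guess_ml_dialect` is a hypothetical helper that returns plain strings instead of `Language` objects, and the two sample snippets are made up for illustration.

```ruby
# Standalone sketch of the heuristic quoted above. It returns plain strings
# rather than Linguist's Language objects so it runs without the gem.
SML_PATTERN   = /=> /
OCAML_PATTERN = /module|let rec /

# guess_ml_dialect is a hypothetical helper, not part of Linguist.
def guess_ml_dialect(data)
  if SML_PATTERN.match(data)
    "Standard ML"
  elsif OCAML_PATTERN.match(data)
    "OCaml"
  end
end

# Made-up samples: the first uses SML's `fn x => ...`, the second OCaml's `let rec`.
sml_sample   = "fun fact 0 = 1\n  | fact n = n * fact (n - 1)\nval inc = fn x => x + 1\n"
ocaml_sample = "let rec fact n = if n = 0 then 1 else n * fact (n - 1)\n"

puts guess_ml_dialect(sml_sample).inspect
puts guess_ml_dialect(ocaml_sample).inspect
puts guess_ml_dialect("val x = 1\n").inspect # matches neither pattern, so nil
```

When neither pattern matches, the helper returns `nil`, which mirrors a heuristic that declines to decide.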

samoht · 2015-03-17T12:16:19Z

I've simplified the regular expression. It's hard to make it simpler, as Standard ML and OCaml share most of their core syntax. It would maybe be easier to revert your change :p or to deprecate the .ML extension for Standard ML.

larsbrinkhoff · 2015-03-17T12:45:22Z

You might request a set of 1000 .ml samples from @arfon, e.g. for this search:
http://github.com/search?q=extension%3Aml+NOT+slartibartfaxt&type=Code

Then you would run Linguist against this set, and note the results before and after your modifications. This might provide some insights on the performance of your suggested heuristics.
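A minimal sketch of that before/after comparison, assuming the per-file results have already been collected into hashes mapping path to detected language. The file names and labels below are invented, and the helpers `language_shares` and `changed_files` are hypothetical, not Linguist APIs. The sketch counts files, which is not necessarily how Linguist weights its own percentages.

```ruby
# Hypothetical per-file results (path => detected language) for two runs.
before = {
  "a.ml" => "Standard ML", "b.ml" => "Standard ML",
  "c.ml" => "OCaml",       "d.ml" => "Standard ML"
}
after = {
  "a.ml" => "OCaml", "b.ml" => "OCaml",
  "c.ml" => "OCaml", "d.ml" => "Standard ML"
}

# Share of files per language, as percentages rounded to two decimals.
def language_shares(results)
  counts = Hash.new(0)
  results.each_value { |language| counts[language] += 1 }
  total = results.size.to_f
  counts.map { |language, n| [language, (100.0 * n / total).round(2)] }.to_h
end

# Paths whose detected language differs between the two runs.
def changed_files(before, after)
  before.keys.select { |path| before[path] != after[path] }.sort
end

puts language_shares(before).inspect
puts language_shares(after).inspect
puts changed_files(before, after).inspect
```

The list of changed files is the part worth reading by hand, since each entry is either a fix or a new misclassification.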

samoht · 2015-03-17T15:04:01Z

I've gathered a few thousand .ml files that were on my hard drive into a Git repository and ran linguist on it with both master and master+my-patch. Here are the results:

$ ls *.ml  | wc -l
4206

$ [master] time bundle exec linguist
72.94%  Standard ML
27.06%  OCaml

real    0m29.786s
user    0m29.059s
sys     0m0.483s

$ [master+patch] time bundle exec linguist
98.60%  OCaml
1.40%   Standard ML

real    0m11.227s
user    0m10.847s
sys     0m0.342s

amirmc · 2015-03-17T16:46:05Z

I read that as over 60% faster, with substantially better classification (from 27% correct before to over 98% correct after).
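The "over 60% faster" reading can be reproduced from the two `real` timings reported above. This is only a worked calculation on those two numbers; the variable names are illustrative.

```ruby
# Wall-clock ("real") timings reported in the thread, in seconds.
master_seconds  = 29.786
patched_seconds = 11.227

# Relative reduction in run time, and the equivalent speedup factor.
reduction_percent = (master_seconds - patched_seconds) / master_seconds * 100
speedup_factor    = master_seconds / patched_seconds

puts "time reduced by #{reduction_percent.round(1)}%"
puts "speedup of #{speedup_factor.round(2)}x"
```

That works out to about 62% less wall-clock time, or roughly a 2.65x speedup, on a single run of each configuration.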

@arfon, is it ok now to merge this patch? As I mentioned in #2208, this is affecting the discoverability of OCaml projects, and not just on GitHub (cf. Libraries.io, which is also using linguist).

larsbrinkhoff · 2015-03-17T18:24:10Z

@samoht Great! Now you might ideally examine those files to verify that they are indeed correctly identified.

Note that I'm not part of GitHub staff, just a contributor. But I've done this exercise many times to provide evidence that my requests work correctly.

samoht · 2015-03-17T18:36:57Z

I don't program in Standard ML; all the files are OCaml code from projects I either develop or contribute to. So there is still a 1.4% error rate, which is much better than 72.9%, and I can (almost) live with it.

samoht · 2015-03-17T18:51:04Z

Also note that 15 days ago, linguist would have said (with reason) that these files were 100% OCaml files, as they have an .ml extension.

#2087 made .ML extensions equivalent to .ml, which completely broke the support for OCaml in linguist. But before April 2014, Standard ML was not associated with .ML files at all (and I'm not sure anyone really uses that extension in practice), so it was fine as well.

amirmc · 2015-03-17T18:54:39Z

So there is still a 1.4% error rate, which is much better than 72.9%, and I can (almost) live with it.

I'd like to reiterate that this is a major improvement on the current situation, in which OCaml-based repos are being misreported as Standard ML. This bug was amusing when it first hit but is detrimental the longer this takes to fix.

Please do understand that I'm seeing this very much from the perspective of a bugfix, not a feature request. A regression in Linguist has led to these issues being created (things were ok before). This patch doesn't have to be the last word, but it clearly alleviates the problem with better performance compared to the current master.

pchaigno · 2015-03-17T21:06:12Z

The detection of Standard ML .ml files should also be tested. We don't want to have a new regression ;)

Standard ML was not associated with .ML files at all (and I'm not sure anyone really uses that extension in practice), so it was fine as well.

@akissinger added the .ML extension in #1035. Maybe he could find us some .ML files for testing :-)

amirmc · 2015-03-17T23:50:42Z

The detection of Standard ML .ml files should also be tested. We don't want to have a new regression ;)

I feel this is completely missing the bigger picture. Right now a majority of active OCaml projects are utterly misclassified which is affecting two entire language communities on GitHub. No-one can find repos in either language the way they used to (one has all but disappeared and the other is swamped with erroneous projects).

Surely, it's preferable to deal with the larger problem now and deal with a potentially much smaller problem afterwards.

If you see the new comments on the other thread and also the PR you link to, it's quite clear that the SML community predominantly uses .sml and .sig and to a lesser extent .ML (capitalised). It seems ridiculous to me that the likely smaller fraction of SML users who might be using .ml files are somehow being prioritised over the rest of the two communities who continue to suffer from this error. 11 days ago this situation was highly amusing but it's not anymore.

arfon · 2015-03-17T23:58:45Z

Given this comment let's get this merged in as is.

@samoht would you object to testing your changes on a corpus of files as @larsbrinkhoff suggests before we cut a new gem?

pchaigno · 2015-03-18T08:40:34Z

I tested this pull request on seL4/isabelle which contains 678 Standard ML .ML files. 179 are recognized as OCaml (35% of the total number of .ML lines).

If you remove class from the OCaml heuristic, this number drops to 55 (6% of the total number of .ML lines).

Also, a test on HOL-Theorem-Prover/HOL (which contains a few Standard ML .ml files) seems to indicate that the heuristic for Standard ML is slightly better than the OCaml one. So we might want to put the Standard ML heuristic first:

if /=> |case\s\S+\sof/.match(data)
  Language["Standard ML"]
elsif /module|let rec |begin\s(\w+\s)+end|match\s\S+\swith/.match(data)
  Language["OCaml"]
end

@samoht Could you test again against your samples without the class keyword? (or I can do it if your files are available somewhere).
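Whether the order of the two checks matters can be explored in isolation. The sketch below reuses the two regular expressions from the snippet above but runs without Linguist; `classify`, the ordering constants, and both sample strings are hypothetical, chosen so that one OCaml snippet contains `=> ` inside a comment.

```ruby
# Patterns copied from the heuristic above; everything else is illustrative.
SML_PATTERN   = /=> |case\s\S+\sof/
OCAML_PATTERN = /module|let rec |begin\s(\w+\s)+end|match\s\S+\swith/

SML_FIRST   = [["Standard ML", SML_PATTERN], ["OCaml", OCAML_PATTERN]]
OCAML_FIRST = SML_FIRST.reverse

# Return the first language whose pattern matches, or nil if none does.
def classify(data, order)
  order.each do |language, pattern|
    return language if pattern.match(data)
  end
  nil
end

# Made-up OCaml snippet whose comment happens to contain "=> ".
ocaml_with_arrow = "(* decode: bytes => code points *)\n" \
                   "let rec loop d = match decode d with\n" \
                   "  | `End -> ()\n" \
                   "  | _ -> loop d\n"
# Made-up SML snippet that matches only the SML pattern.
sml_sample = "val inc = fn x => x + 1\n"

puts classify(ocaml_with_arrow, SML_FIRST).inspect
puts classify(ocaml_with_arrow, OCAML_FIRST).inspect
puts classify(sml_sample, OCAML_FIRST).inspect
```

A file that matches both patterns is labelled by whichever check runs first, so the order should be chosen by measuring both corpora rather than by inspection.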

Fix github-linguist#2208

samoht · 2015-03-18T09:59:10Z

@pchaigno putting the Standard ML heuristic first breaks samples/OCaml/uutf.ml and gives a 5% error rate on my sample. Removing/adding the class keyword doesn't seem to change the result much (which certainly says something about the use of the O in OCaml nowadays).

So I've tweaked the regexp to have acceptable results for both my corpus and isabelle:

$ time bundle exec linguist my-ocaml/
98.56%  OCaml
1.44%   Standard ML

real    0m12.616s
user    0m11.732s
sys     0m0.499s

$ time bundle exec linguist isabelle/
64.75%  Isabelle
26.66%  Standard ML
3.66%   TeX
2.14%   Scala
1.91%   OCaml
0.41%   Shell
0.20%   Java
0.05%   Perl
0.05%   HTML
0.04%   Ada
0.04%   Haskell
0.04%   Python
0.02%   Diff
0.02%   Makefile
0.00%   C
0.00%   CSS
0.00%   ApacheConf

real    0m11.810s
user    0m11.130s
sys     0m0.303s

samoht · 2015-03-18T10:01:28Z

@arfon let me know if I should do something more.

pchaigno · 2015-03-18T11:41:03Z

putting the Standard ML heuristic first breaks samples/OCaml/uutf.ml and gives a 5% error rate on my sample.

Okay. It's probably best to keep OCaml first then :-)
Just a question, was the 5% error rate due to => or to case\s+(\S+\s)+of?

samoht · 2015-03-18T13:46:03Z

It was due to =>

arfon · 2015-03-18T14:05:10Z

👍 thanks everyone for your work on this. I'll get a new gem cut today and let's iterate on this in future Pull Requests if necessary.

Disambiguate between OCaml and Standard ML

samoht · 2015-03-18T18:41:20Z

@arfon thanks for merging! Is it possible to run linguist on the 112 misclassified Standard ML repositories containing "OCaml" to reclassify them correctly?

https://github.com/search?l=Standard+ML&q=ocaml&type=Repositories&utf8=%E2%9C%93

larsbrinkhoff · 2015-03-18T18:45:28Z

Whoa, that would be cool. I could think of some searches to feed into that feature.

arfon · 2015-03-18T18:46:54Z

@arfon thanks for merging! Is it possible to run linguist on the 112 misclassified Standard ML repositories containing "OCaml" to reclassify them correctly?

Yes, that should be possible. Any chance you could extract this list as CSV and email it to me (arfon@github.com)? I've got a script to do this (using CSV input) and so could run this today. Otherwise I probably won't be able to get to this until Friday.

larsbrinkhoff · 2015-03-18T19:20:31Z

@arfon What's the upper limit on number of repositories for that? Per week? ;-)

arfon · 2015-03-18T19:21:35Z

@arfon What's the upper limit on number of repositories for that? Per week? ;-)

I once did it for ~3000 Prolog repos. If you can send me the list then I can run the job... within reason.

samoht force-pushed the OCaml branch 3 times, most recently from c516504 to 3decc81 Compare March 13, 2015 16:56

rootAvish mentioned this pull request Mar 13, 2015

More info/specifics on tokenization with grammar bundles in Linguist github/mentorships#18

Closed

samoht force-pushed the OCaml branch from 3decc81 to 8edef05 Compare March 13, 2015 19:24

pchaigno mentioned this pull request Mar 16, 2015

OCaml code is reported as standard ML #2208

Closed

samoht force-pushed the OCaml branch from 8edef05 to db2e728 Compare March 17, 2015 12:15

samoht force-pushed the OCaml branch from 4853119 to 8fc0c19 Compare March 18, 2015 09:56

Disambiguate between OCaml and Standard ML

e796073

Fix github-linguist#2208

samoht force-pushed the OCaml branch from 8fc0c19 to e796073 Compare March 18, 2015 09:58

pchaigno mentioned this pull request Mar 18, 2015

Handling misconecption of Ocaml and Standard ML in issue #2208 #2239

Closed

arfon added a commit that referenced this pull request Mar 18, 2015

Merge pull request #2227 from samoht/OCaml

3db6c4a

Disambiguate between OCaml and Standard ML

arfon merged commit 3db6c4a into github-linguist:master Mar 18, 2015

arfon mentioned this pull request Mar 18, 2015

v4.5.2 #2241

Merged
