Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguate between OCaml and Standard ML #2227

Merged
merged 1 commit into from Mar 18, 2015

Conversation

samoht
Copy link
Contributor

@samoht samoht commented Mar 13, 2015

Fix #2208

@samoht
Copy link
Contributor Author

samoht commented Mar 16, 2015

Any feebdack on that PR? Thanks!

@pchaigno
Copy link
Contributor

We should first try with simpler regular expressions. Have you tested the one @raphael-proust proposed in #2208?

disambiguate "SML", "OCaml" do |data|
  if /=> /.match(data)
    Language["SML"]
  elsif /module|let rec /.match(data)
    Language["OCaml"]
  end
end

@samoht
Copy link
Contributor Author

samoht commented Mar 17, 2015

I've simplified the regular expression. It's hard to make it simpler, as Standard ML and OCaml share most of their core syntax. Would be maybe easier to revert your change :p or to deprecate .ML extensions for Standard ML.

@larsbrinkhoff
Copy link
Contributor

You might request a set of 1000 .ml samples from @arfton, e.g. for this search:
http://github.com/search?q=extension%3Aml+NOT+slartibartfaxt&type=Code

Then you would run Linguist against this set, and note the results before and after your modifications. This might provide some insights on the performance of your suggested heuristics.

@samoht
Copy link
Contributor Author

samoht commented Mar 17, 2015

I've gathered a few thousands .ml files which were on my hard-drive into a Git repository, and run linguist on it with both master and master+my-patch. Here are the results:

$ ls *.ml  | wc -l
4206

$ [master] time bundle exec linguist
72.94%  Standard ML
27.06%  OCaml

real    0m29.786s
user    0m29.059s
sys 0m0.483s

$ [master+patch] time bundle exec linguist
time bundle exec linguist
98.60%  OCaml
1.40%   Standard ML

real    0m11.227s
user    0m10.847s
sys 0m0.342s

@amirmc
Copy link

amirmc commented Mar 17, 2015

I read that as over 60% faster, with substantially better classification (from 27% correct before to over 98% correct after).

@arfon, is it ok now to merge this patch? As I mentioned in #2208, this is affecting the discoverability of OCaml projects, and not just on GitHub (cf. Libraries.io, which is also using linguist).

@larsbrinkhoff
Copy link
Contributor

@samoht Great! Now you might ideally examine those files to verify that they are indeed correctly identified.

Note that I'm not part of GitHub staff, just a contributor. But I've done this exercise many times to provide evidence that my requests work correctly.

@samoht
Copy link
Contributor Author

samoht commented Mar 17, 2015

I don't program in Standard ML, all the files are OCaml code which are either part of the projects I develop or contribute. So there is still 1.4% error rate, which is much better than 72.9% but I can (almost) live with it.

@samoht
Copy link
Contributor Author

samoht commented Mar 17, 2015

Also note than 15 days ago, linguist would have said (with reason) that these files were 100% OCaml files as they have an .ml extension.

#2087 made .ML extensions equivalent to .ml which completely break the support for OCaml in linguist. But before April 2014, Standard ML was not being associated with .ML file at all (and I'm not sure anyone really use that extension in practice) so it was fine as well.

@amirmc
Copy link

amirmc commented Mar 17, 2015

So there is still 1.4% error rate, which is much better than 72.9% but I can (almost) live with it.

I'd like to reiterate that this is a major improvement on the current situation, in which OCaml-based repos are being misreported as Standard ML. This bug was amusing when it first hit but is detrimental the longer this takes to fix.

Please do understand that I'm seeing this very much from the perspective of a bugfix, not a feature request. A regression in Lingust has led to these issues being created (things were ok before). This patch doesn't have to be the last word, but it clearly alleviates the problem with better performance compared to the current master.

@pchaigno
Copy link
Contributor

The detection of Standard ML .ml files should also be tested. We don't want to have a new regression ;)

Standard ML was not being associated with .ML file at all (and I'm not sure anyone really use that extension in practice) so it was fine as well.

@akissinger added the .ML extension in #1035. Maybe he could find us some .ML files for test :-)

@amirmc
Copy link

amirmc commented Mar 17, 2015

The detection of Standard ML .ml files should also be tested. We don't want to have a new regression ;)

I feel this is completely missing the bigger picture. Right now a majority of active OCaml projects are utterly misclassified which is affecting two entire language communities on GitHub. No-one can find repos in either language the way they used to (one has all but disappeared and the other is swamped with erroneous projects).

Surely, it's preferable to deal with the larger problem now and deal with a potentially much smaller problem afterwards.

If you see the new comments on the other thread and also the PR you link to, it's quite clear that the SML community predominantly uses .sml and .sig and to a lesser extent .ML (capitalised). It seems ridiculous to me that the likely smaller fraction of SML users who might be using .ml files are somehow being prioritised over the rest of the two communities who continue to suffer from this error. 11 days ago this situation was highly amusing but it's not anymore.

@arfon
Copy link
Contributor

arfon commented Mar 17, 2015

Given this comment let's get this merged in as is.

@samoht would you object to testing your changes on a corpus of files as @larsbrinkhoff suggests before we cut a new gem?

@pchaigno
Copy link
Contributor

I tested this pull request on seL4/isabelle which contains 678 Standard ML .ML files. 179 are recognized as OCaml (35% of the total number of .ML lines).

If you remove class from the OCaml heuristic, this number drops to 55 (6% of the total number of .ML lines).

Also, a test on HOL-Theorem-Prover/HOL (which contains a few Standard ML .ml files) seems to indicate that the heuristic for Standard ML is slightly better than the OCaml one. So we might want to put the Standard ML heuristic first:

if /=> |case\s\S+\sof/.match(data)
  Language["Standard ML"]
elsif /module|let rec |begin\s(\w+\s)+end|match\s\S+\swith/.match(data)
  Language["OCaml"]
end

@samoht Could you test again against you samples without the class keyword? (or I can do it if your files are available somewhere).

@samoht
Copy link
Contributor Author

samoht commented Mar 18, 2015

@pchaigno putting the Standard ML heuristic first breaks samples/OCaml/uutf.ml and gives 5% error rate on my sample. Removing/adding the class keyword doesn't seem to change much to the result (which certainly says something about the use of the O in OCaml nowadays).

So I've tweaked the regexp to have acceptable results for both my corpus and isabelle:

$ time bundle exec linguist my-ocaml/
98.56%  OCaml
1.44%   Standard ML

real    0m12.616s
user    0m11.732s
sys 0m0.499s

$ time bundle exec linguist isabelle/
64.75%  Isabelle
26.66%  Standard ML
3.66%   TeX
2.14%   Scala
1.91%   OCaml
0.41%   Shell
0.20%   Java
0.05%   Perl
0.05%   HTML
0.04%   Ada
0.04%   Haskell
0.04%   Python
0.02%   Diff
0.02%   Makefile
0.00%   C
0.00%   CSS
0.00%   ApacheConf

real    0m11.810s
user    0m11.130s
sys 0m0.303s

@samoht
Copy link
Contributor Author

samoht commented Mar 18, 2015

@arfon let me know if I should do something more.

@pchaigno
Copy link
Contributor

putting the Standard ML heuristic first breaks samples/OCaml/uutf.ml and gives 5% error rate on my sample.

Okay. It's probably best to keep OCaml first then :-)
Just a question, was the 5% of error due to => or to case\s+(\S+\s)+of?

@samoht
Copy link
Contributor Author

samoht commented Mar 18, 2015

It was due to =>

@arfon
Copy link
Contributor

arfon commented Mar 18, 2015

👍 thanks everyone for your work on this. I'll get a new gem cut today and let's iterate on this in future Pull Requests if necessary.

arfon added a commit that referenced this pull request Mar 18, 2015
Disambiguate between OCaml and Standard ML
@arfon arfon merged commit 3db6c4a into github-linguist:master Mar 18, 2015
@arfon arfon mentioned this pull request Mar 18, 2015
@samoht
Copy link
Contributor Author

samoht commented Mar 18, 2015

@arfon thanks for merging! Is is possible to run linguist on the 112 misclasified Standard ML repositories containing "OCaml" to reclassify them correctly?

https://github.com/search?l=Standard+ML&q=ocaml&type=Repositories&utf8=%E2%9C%93

@larsbrinkhoff
Copy link
Contributor

Whoa, that would be cool. I could think of some searches to feed into that feature.

@arfon
Copy link
Contributor

arfon commented Mar 18, 2015

@arfon thanks for merging! Is is possible to run linguist on the 112 misclasified Standard ML repositories containing "OCaml" to reclassify them correctly?

Yes, that should be possible. Any chance you could extract this list as CSV and email it to me (arfon@github.com)? I've got a script to do this (using CSV input) and so could run this today. Otherwise I probably won't be able to get to this until Friday.

@larsbrinkhoff
Copy link
Contributor

@arfon What's the upper limit on number of repositories for that? Per week? ;-)

@arfon
Copy link
Contributor

arfon commented Mar 18, 2015

@arfon What's the upper limit on number of repositories for that? Per week? ;-)

I once did it for ~3000 Prolog repos. If you can send me the list then I can run the job... within reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OCaml code is reported as standard ML
5 participants