-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disambiguate between OCaml and Standard ML #2227
Conversation
c516504
to
3decc81
Compare
Any feebdack on that PR? Thanks! |
We should first try with simpler regular expressions. Have you tested the one @raphael-proust proposed in #2208? disambiguate "SML", "OCaml" do |data|
if /=> /.match(data)
Language["SML"]
elsif /module|let rec /.match(data)
Language["OCaml"]
end
end |
I've simplified the regular expression. It's hard to make it simpler, as Standard ML and OCaml share most of their core syntax. Would be maybe easier to revert your change :p or to deprecate |
You might request a set of 1000 Then you would run Linguist against this set, and note the results before and after your modifications. This might provide some insights on the performance of your suggested heuristics. |
I've gathered a few thousands
|
I read that as over 60% faster, with substantially better classification (from 27% correct before to over 98% correct after). @arfon, is it ok now to merge this patch? As I mentioned in #2208, this is affecting the discoverability of OCaml projects, and not just on GitHub (cf. Libraries.io, which is also using linguist). |
@samoht Great! Now you might ideally examine those files to verify that they are indeed correctly identified. Note that I'm not part of GitHub staff, just a contributor. But I've done this exercise many times to provide evidence that my requests work correctly. |
I don't program in Standard ML, all the files are OCaml code which are either part of the projects I develop or contribute. So there is still 1.4% error rate, which is much better than 72.9% but I can (almost) live with it. |
Also note than 15 days ago, linguist would have said (with reason) that these files were 100% OCaml files as they have an #2087 made |
I'd like to reiterate that this is a major improvement on the current situation, in which OCaml-based repos are being misreported as Standard ML. This bug was amusing when it first hit but is detrimental the longer this takes to fix. Please do understand that I'm seeing this very much from the perspective of a bugfix, not a feature request. A regression in Lingust has led to these issues being created (things were ok before). This patch doesn't have to be the last word, but it clearly alleviates the problem with better performance compared to the current master. |
The detection of Standard ML
@akissinger added the |
I feel this is completely missing the bigger picture. Right now a majority of active OCaml projects are utterly misclassified which is affecting two entire language communities on GitHub. No-one can find repos in either language the way they used to (one has all but disappeared and the other is swamped with erroneous projects). Surely, it's preferable to deal with the larger problem now and deal with a potentially much smaller problem afterwards. If you see the new comments on the other thread and also the PR you link to, it's quite clear that the SML community predominantly uses |
Given this comment let's get this merged in as is. @samoht would you object to testing your changes on a corpus of files as @larsbrinkhoff suggests before we cut a new gem? |
I tested this pull request on seL4/isabelle which contains 678 Standard ML If you remove Also, a test on HOL-Theorem-Prover/HOL (which contains a few Standard ML if /=> |case\s\S+\sof/.match(data)
Language["Standard ML"]
elsif /module|let rec |begin\s(\w+\s)+end|match\s\S+\swith/.match(data)
Language["OCaml"]
end @samoht Could you test again against you samples without the |
@pchaigno putting the Standard ML heuristic first breaks So I've tweaked the regexp to have acceptable results for both my corpus and isabelle:
|
@arfon let me know if I should do something more. |
Okay. It's probably best to keep OCaml first then :-) |
It was due to |
👍 thanks everyone for your work on this. I'll get a new gem cut today and let's iterate on this in future Pull Requests if necessary. |
Disambiguate between OCaml and Standard ML
@arfon thanks for merging! Is is possible to run https://github.com/search?l=Standard+ML&q=ocaml&type=Repositories&utf8=%E2%9C%93 |
Whoa, that would be cool. I could think of some searches to feed into that feature. |
Yes, that should be possible. Any chance you could extract this list as CSV and email it to me (arfon@github.com)? I've got a script to do this (using CSV input) and so could run this today. Otherwise I probably won't be able to get to this until Friday. |
@arfon What's the upper limit on number of repositories for that? Per week? ;-) |
I once did it for ~3000 Prolog repos. If you can send me the list then I can run the job... within reason. |
Fix #2208