-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
better heuristic distinction of .d files #3145
Conversation
03d0c9e to
4e336c0
Compare
| * Copyright: Copyright Martin Nowak 2016 -. | ||
| * License: $(LINK2 http://www.boost.org/LICENSE_1_0.txt, Boost License 1.0) | ||
| * Authors: Martin Nowak | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file being detected as Makefile brought me here.
|
Pinging this as I think the false-positive rate is a lot higher! e.g. https://github.com/libmir/mir/pull/281/files (most example scripts in and a lot more |
|
Thank you, @MartinNowak, this looks good! Could you please help by explaining the reasoning behind the regexp used to match function definitions? |
From above:
|
|
@wilzbach I'm asking for an explanation of how |
|
On August 23, 2016 8:38:48 AM GMT+02:00, Lars Brinkhoff notifications@github.com wrote:
Because that's the pattern for a function in D and I highly doubt DTrace has this? D allows templates, hence the optional look ahead: Or with a real function: auto myFun(T)(T paramA) {
return paramA + 1;
} For more information, you may also have a look at the spec: http://dlang.org/spec/function.html Edit: seems like using code examples from an email reply doesn't work, sorry about that. |
Didn't @MartinNowak already add multiple files? In any case a good repository for idiomatic D is of course it's standard library Phobos (about 200 kLoC iirc) and there are many more here on GitHub (the popularity and presence of D is rising slowly, but steadily) |
Oh I didn't read the guide as I am just following this PR. I could also submit more files as a follow-up to this PR? |
Sure thing @larsbrinkhoff. Do we have a search query I can use to grab these files? I don't see one in this thread. |
I am pretty sure you are more familiar with your search query, but I would start by fetching some of the most starred repos with language D that have been updated within this year and filter for all files with the suffix ".d" |
|
Thanks @wilzbach. I ended up going with this search string to return file results (we need to use the code search rather than the repository search here). @larsbrinkhoff - I just sent you a corpus - would you be able to take a look? |
Oh, one thing that you might be careful with is that even though D1 has been officially deprecated in 2010, there's still some code using the old language grammar and D2 constantly keeps improving, so I don't know how sensitive your classifier is, but you might want to use a minimum date. Secondly I noticed that your recognition of |
Thanks. It may take a while before I find a time slot for this. If @MartinNowak or @wilzbach (or anyone else) feel this is urgent, I can provide instructions for how to evalutate the corpous. |
Yes I consider this as urgent - how can I help? :) |
Sure, that's sounds easy (currently on my commute & can do this tonight) - did you see my comment about the search query finding also non D language files? |
|
@wilzbach Great, can I send the file to you gmail address? Right, |
Yes, please do :) |
|
You may get some files that are not D, but that's not a bad thing. |
|
Could you please send me corpora for D, DTrace, and Makefile @arfon @larsbrinkhoff? I generated some myself (get corpora for different languages), but using repositories isn't the best source and the code search itself isn't fully available via API (scoped on user/repo). BTW, there seems to be a fourth contender for the .d extension, apparently it's used by a few people as a game scripting language (currently classified as DTrace), but I don't plan to mess w/ that. |
|
Sure thing @MartinNowak: |
|
Thanks @arfon, unfortunately the zip files are empty. |
Oh wow. Sorry about that @MartinNowak. Totally should have checked... |
|
Did you have time to refetch them @arfon. It seems the rules in this PR could use a bit more tweaking. |
I did not I'm afraid. It's slightly more complex than I first though to fix this. We've been re-working how we run all of our asynchronous jobs in our platform and currently the admin tool I have to run these jobs is broken 😞. I'll report back here once this is fixed and we can pull the sample files again. |
|
Any other idea? I can try to make my script work a bit better, but using whole repos suffers badly from misclassification. |
|
Any workarounds @arfon? |
Not really sorry and I'm afraid since this Pull Request was opened I've left GitHub so we'll need to get another Linguist maintainer who's on GitHub staff to assist here. @MikeMcQuaid @lildude @brandonblack - there's an area of staff tools |
Ping @MikeMcQuaid @lildude @brandonblack - this PR is pending for quite a while now. Would you be so kind to find a bit of time to look at the search query? AFAICT we only need to testing this PR against your GitHub corpus. |
Friendly re-ping ;-) |
|
Polite reminder that we all have lives, and Mr McQuaid is responsible for maintaining the biggest (and also the best) package manager in the solar system. I might look into this later, but pinging my attention will yield the opposite effect. |
|
Sorry about the delay here. I started looking into the failing job off your last ping @wilzbach but then my own |
- require json for Hash.to_json
- properly recongnize dtrace probes - recongnize \ in Makefile paths - recongnize single line `file.ext : dep.ext` make targets - recognize D module, import, function, and unittest declarations - add more representative D samples D changed from 31.2% to 28.1% DTrace changed from 33.5% to 32.5% Makefile changed from 35.3% to 39.4% See https://gist.github.com/MartinNowak/fda24fdef64f2dbb05c5a5ceabf22bd3 for the scraper used to get a test corpus.
4e336c0 to
fe069d8
Compare
|
Build a scraper myself and improved all the existing heuristics for the |
|
Can we get a fresh review on this @Alhadis, @brandonblack? |
| # target : \ | ||
| # : dependency | ||
| # path/file.ext1 : some/path/../file.ext2 | ||
| elsif /([\/\\].*:\s+.*\s\\$|: \\$|^ : |^[\w\s\/\\.]+\w+\.\w+\s*:\s+[\w\s\/\\.]+\w+\.\w+)/.match(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last regex part for path/file.ext1 : some/path/../file.ext2 is specific to match only files with extensions b/c Symbol : Base is commonly used in the other languages as well.
|
Already rotting since 2 month. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems pretty solid. I don't know D, but I can't see this being anything short of an improvement.
I meant to get back to this earlier, but depression + busy life, etc... sorry.
|
I'm still in favour of this change. Good work. Patience and the occasional ping may be required. |
No worries, fairly busy myself :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this PR looks good, just a little question about what appears to be an unnecessary addition.
| @@ -4,6 +4,7 @@ require 'rake/testtask' | |||
| require 'yaml' | |||
| require 'yajl' | |||
| require 'open-uri' | |||
| require 'json' | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see any reason for this addition. Is there a reason for this? If not, can you please remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rake benchmark task fails w/o it, b/c it uses Hash.to_json. Apparently I was the first to use that task in a while. Also see the separate commit message 7980c3d.
I could for sure separate that out into a separate PR if wanted, but it seems like a tiny fix on-the-go.
@lildude
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could for sure separate that out into a separate PR if wanted, but it seems like a tiny fix on-the-go.
Nah, that's good enough explanation for me to do it here 😄.
* github/master: Fix heuristic for Unix Assembly with .ms extension (github-linguist#3550) better heuristic distinction of .d files (github-linguist#3145)
file.ext : dep.extmake targetsrake benchmark:compareD changed from 31.2% to 28.1%
DTrace changed from 33.5% to 32.5%
Makefile changed from 35.3% to 39.4%
diff of
rake benchmark:generatehttps://gist.github.com/MartinNowak/6fba5aed9401dba4d4bfd7969a168c61 (URLs relative to github.com/)
see https://gist.github.com/MartinNowak/fda24fdef64f2dbb05c5a5ceabf22bd3 for the scraper used to get test corpora for D, DTrace, and Makefile with
extension:d.