Fixing C/C++/Objective-C classifications #1626
@arfon The keywords have been carefully selected as being unique to the specific language they are detecting for - for example, you cannot use @ syntax outside of comments in C++ as this is a syntax violation, so we can use More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using Also, looking specifically for std:: is a good benchmark for C++ as this syntax is unique to the STL. |
Thanks @DX-MON - I've made a new PR with all of your commits in here: #1627. I made the Objective-C matchers a little more strict in this commit as
Yeah, excluding lines that begin with |
The other blocks of text to exclude if going after comments are lines enclosed in matching /* and */ pairs (hence the "and such"). I didn't know about \b so thank you for that :) |
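To make the comment-exclusion idea concrete, here is a minimal sketch in Python (not Linguist's actual Ruby code). The names `strip_comments` and `mentions_std` are hypothetical. The stripping is deliberately naive: it does not understand string literals, so a `//` inside a string would be treated as a comment.

```python
import re

# Block comments (/* ... */) can span several lines, so DOTALL is required.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# Line comments run from // to the end of the line.
LINE_COMMENT = re.compile(r"//[^\n]*")

def strip_comments(source: str) -> str:
    """Remove C-style comments. Naive: ignores string literals."""
    without_blocks = BLOCK_COMMENT.sub(" ", source)
    return LINE_COMMENT.sub("", without_blocks)

# \b stops "std::" from matching inside a longer identifier such as "mystd::".
CPP_STD = re.compile(r"\bstd::")

def mentions_std(source: str) -> bool:
    """True if std:: appears outside comments (an assumed C++ marker)."""
    return CPP_STD.search(strip_comments(source)) is not None
```

With this, `mentions_std("/* uses std::vector */ int x;")` is false while `mentions_std("std::vector<int> v;")` is true.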
@noxabellus yours is not so bad ;LOL; https://github.com/moe123/macadam/search?l=c ; c++ I could understand the dilemma; obj-c very not; linguist is just a front-end patch nothing else. The core problem is the github indexation model which is polluted. |
IMO |
Until an advanced fix is complete and well tested (which might not even be mathematically possible), why not just skip headers for projects with non-header files? The ratio between languages will be mostly the same from having mostly one header per implementation file. |
I found one of the C header files in my project was being misidentified as Objective-C, so thought I'd investigate. I first started deleting lines until I had a minimal reproduction case. Here is what I came up with (stored as test.h, in case filename is relevant):
As written, this identifies as Objective-C, but if I change the name of I understand this is an impossible decision to make automatically, as Objective-C is a superset of C, but imo there should be a higher threshold for marking a file as Objective-C than the naming of one struct member. Edit: Not to imply that such detection is intended behaviour; I don't see the word capacity mentioned in any of the heuristics linked to from this thread. |
capacity is a term that ObjC developers are perhaps more likely to use culturally, and you see this sort of naming in the example files in this repo. Of course, there’s no reason any language can’t use the word “capacity” to name something.
The longer I think about this example, the more convinced I am that this problem requires a text classification (e.g. neural network) approach. Regex classification works when the languages have radically different keyword tables, but C, C++ and ObjC have substantially similar keywords, and a lot of the facts you “learn” with simple systems are wrong, like “capacity is an ObjC keyword”. In reality it’s context-sensitive (does it occur within an ObjC selector?). That question itself involves a bunch of other language rules, which you can try to regex with lookarounds, but I suspect it’s a losing battle. I think this is why behavior for this language combination is poor: tools built to distinguish very different languages just can’t get a good result for languages that are this similar.
Understanding the context requires an embedding of the language grammar, that is, whether we are currently naming a struct field (as in @tombsar’s example), an ObjC method, a field in an ObjC object, a property, or something else. An NN would learn such an embedding as a matter of course, and would recognise that, although we may be using ObjC-friendly names, the context is a struct. Another benefit is easier maintenance: people could submit their misclassified code and the model would simply be retrained, no code changes required.
Unfortunately I know nothing about good Ruby NN implementations, how portable they are, etc. I just suspect that without one, we won’t be able to tell examples like this apart very reliably. |
It should be based on keywords.
|
I guessed it would be something like that. Note that "capacity" is a standard container member in the C++ STL, so a larger range of sample code might pick up that commonality.
I think what I'm seeing is that some of my files have none of those keywords, so a heuristic is being used to classify instead, leading to the surprising results. My preference would be that such files (ones that don't match any of the language-specific keywords or constructs) are classified as C, since that is the lowest common denominator between the three languages, but I understand why you might not want that as a solution. A properly-trained RNN classifier is probably the way to go if you want best-possible results, but I likewise don't know anything about the Ruby ecosystem to know what to suggest there. |
Shouldn't linguist just look if the majority of the code is C? If the majority of the code is C, the 'Objective-C' file should be marked as C, and vice-versa. For projects that use both C and C++, see what is used in the directory and assume that. |
Totally agree. This is what I suggested in this thread some time ago. I think linguist should define a threshold, let's say 80%, and when it detects that 80% or more of the files inside the project/directory are identified as C, for example, it should just assume the others are C as well. This would obviously not solve the issue completely, since somebody could, theoretically, have half of the files in their project/folder be C++ and the other half C. However, I think it would cover 99% of the cases, so in my opinion it is totally worth doing |
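The threshold idea above could look roughly like the following Python sketch. The function name `apply_directory_threshold`, the input shape (a dict of path to detected language), and the 0.8 default are all assumptions for illustration. Linguist itself classifies each file independently and has no such step.

```python
from collections import Counter

def apply_directory_threshold(labels, threshold=0.8):
    """Relabel every file with the dominant language if it reaches `threshold`.

    labels: dict mapping file path -> detected language (hypothetical input).
    Returns a new dict; the input is left unchanged.
    """
    if not labels:
        return {}
    counts = Counter(labels.values())
    language, count = counts.most_common(1)[0]
    if count / len(labels) >= threshold:
        # The dominant language wins: assume the outliers were misdetected.
        return {path: language for path in labels}
    # No language is dominant enough, so keep the per-file results.
    return dict(labels)
```

As the comment notes, a directory split evenly between C and C++ falls below the threshold and is returned unchanged.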
Interestingly, this minimal reproduction is the same number of SLOC as my production sample for this case: https://github.com/Starwort/NEA/blob/master/solver_c/memory.h - I don't want to repeatedly push changes to my repository and I don't want to install the numerous dependencies for running linguist myself, so I haven't been able to minimise it, but I don't see why it would identify as C++
|
What if we added a dedicated strategy for C-like header files? This is such a commonly reported problem, it must comprise at least 80% of complaints about misclassified files. I envision the strategy to behave like this:
If one of these steps is unsuccessful, fall through to the usual behaviour of heuristics, followed by Bayesian classification. Would this be a feasible approach, or am I being naïve here? |
Hello folks, Currently, there are 47 C headers, 11 C source files and 1 C++ file in my repository (https://github.com/r-lyeh/FWK). There are at least 7 C files tagged as C++ in my repository, which is wrong. Also, ratio is actually 58:1 in favor of C, or 49:8 according to linguist. However, this info is wrongly displayed as well. |
Also, my 2cents at identifying sources.
|
It probably isn't. Your comments suggest you're looking at the number of files, however as stated in How Linguist Works:
Your (granted, incorrectly classified) bytes of code work out to: C++ 12.2 MB ... with 10.7 MB of those 12.2 MB attributable to a single file: https://github.com/r-lyeh/FWK/blob/master/3rd/3rd_glfw3.h It doesn't show up in the search results because it's too big, so it is subject to the search restrictions mentioned in the troubleshooting doc and thus not indexed.
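To illustrate why counting files and counting bytes can give very different pictures, here is a small Python sketch. The `breakdown` helper and the file sizes are made up for illustration; they are not the real figures for any repository.

```python
def breakdown(files, by="bytes"):
    """Percentage share per language.

    files: list of (language, size_in_bytes) pairs (hypothetical input).
    by: "bytes" weights each file by its size; anything else counts files.
    """
    totals = {}
    for language, size in files:
        weight = size if by == "bytes" else 1
        totals[language] = totals.get(language, 0) + weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: 100 * weight / grand_total for lang, weight in totals.items()}

# Made-up repository: 58 small C files and one very large file labelled C++.
files = [("C", 10_000)] * 58 + [("C++", 10_000_000)]
by_count = breakdown(files, by="count")   # C dominates when counting files
by_bytes = breakdown(files, by="bytes")   # C++ dominates when counting bytes
```

In this made-up case a single large file labelled C++ outweighs 58 small C files, which is the same effect described above.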
This sounds like a reasonable approach to me. If you're confident about this and have the time, we'd appreciate a PR that enhances the current heuristics here with the regexes broken down here and here. |
Doh! It's all about file sizes then! I would have never guessed that. Yes, glfw3 is reported as C++ incorrectly atm. It could be marked as ObjC (there are a small % of lines in ObjC there), but 98% of it is pure C though. |
C is gaining [[attributes]] in c2x
… On Apr 13, 2021, at 5:06 AM, r-lyeh ***@***.***> wrote:
Also, my 2cents at identifying sources.
Any H file is C,
unless C++ is found:
references&, [lambdas]{}, <extensionless_includes>, templates<>, std::, namespace, using, new/delete, operator, class, public, private, override keywords.
unless ObjC is found
#import, [[attributes]], dangling +/- characters.
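A rough Python sketch of the rule quoted above ("any H file is C unless C++ or ObjC is found") might look like this. The marker lists are a small, assumed subset of the ones named. `[[attributes]]` is left out because, as noted in the reply, C is gaining that syntax too. ObjC is checked first here, which differs from the quoted order, because `@class` would otherwise trip the C++ `class` marker. Comments are not stripped, and valid C that uses `class` as an identifier would be misclassified, so treat this as an illustration only.

```python
import re

# Assumed subset of the C++ markers listed above.
CPP_MARKERS = re.compile(
    r"\bstd::|\btemplate\s*<|\bnamespace\b|\bclass\b|\boperator\b|\boverride\b"
    r"|^\s*(?:public|private)\s*:",
    re.MULTILINE,
)
# #import, or a line starting with +/- followed by "(" (ObjC method declaration).
OBJC_MARKERS = re.compile(r"^\s*#\s*import\b|^\s*[-+]\s*\(", re.MULTILINE)

def classify_header(source: str) -> str:
    """Default to C; upgrade only when a language-specific marker is found."""
    if OBJC_MARKERS.search(source):
        return "Objective-C"
    if CPP_MARKERS.search(source):
        return "C++"
    return "C"
```

Under these assumptions a header containing only `struct buf { size_t capacity; };` stays C, since no marker fires.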
|
Either I misunderstand this or this doesn't seem to work. I'd like to throw in an idea: Can we let users mark their code themselves? I.e.:
Or perhaps a special form of comments? |
@HiImJulien that statement refers to the breakdown of the repo as a whole after the language of each file has been identified, not the breakdown of the individual files. Each file is only associated with one language.
🤔 interesting idea though I'm not sure introducing a new approach to identify the language of a file is needed when users can already specify the language using an override in the |
I've started #5357 to implement what feels to be the general consensus of the comments in this issue: assume all I think this is a happy compromise that leads to more predictable behaviour considering Linguist detects the languages of files without considering any other files in the repo. |
#5357 has now been merged. The changes won't take effect until I make the next release. I've scheduled time for the week of 24 May to make and deploy the next release. |
The changes from #5357 have now been deployed to GitHub.com which means this section of the troubleshooting docs now applies:
Repositories will only be re-analysed when you next push to the repo so if you're seeing behaviour inconsistent with above, push another change to an affected file (only modifying If you still experience behaviour that is not consistent with above, please open a new discussion or open a new PR if you feel the heuristics that ensure this behaviour can be improved. |
A fair fraction of the outstanding issues with Linguist mis-classifications are to do with C, C++, Objective-C and any other languages that use .m or .h as extension.
While it's possible we can see further improvements by increasing our samples for the Bayesian classifier, I'm pretty convinced we need to craft a few reliable heuristics to shortcut the classifier.
We currently have the following in heuristics.rb but it's not being called.
@DX-MON had a good go at this in #1036 but at the time we didn't have a good way to benchmark the effects of these changes. With the new benchmarks we are now in a good place to tackle this problem.
The current heuristic is deliberately limited in scope, because any heuristics we define should be accurate, simple to understand and fast.
Basically I'm asking for some help with this. I'm more than happy to grab a bunch (1000s) of files to test any potential changes on. Questions for you all:
C in that disambiguate_c method too if possible.