Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing C/C++/Objective-C classifications #1626

Closed
arfon opened this issue Oct 23, 2014 · 148 comments · Fixed by #5357
Closed

Fixing C/C++/Objective-C classifications #1626

arfon opened this issue Oct 23, 2014 · 148 comments · Fixed by #5357

Comments

@arfon
Copy link
Contributor

arfon commented Oct 23, 2014

A fair fraction of the outstanding issues with Linguist mis-classifications are to do with C, C++, Objective-C and any other languages that use .m or .h as extension.

While it's possible we can see further improvements by increasing our samples for the Bayesian classifier I'm pretty convinced we need to craft a few reliable heuristics to shortcut the classifier.

We currently have the following in heuristics.rb but it's not being called.

def self.disambiguate_c(data, languages)
  matches = []
  matches << Language["Objective-C"] if data.include?("@interface")
  matches << Language["C++"] if data.include?("#include <cstdint>")
  matches
end

@DX-MON had a good go at this in #1036 but at the time we didn't have a good way to benchmark the effects of these changes. With the new benchmarks we are now in a good place to tackle this problem.

The current heuristic is deliberately limited in scope, that's because any heuristics we define should be accurate, simple to understand and fast.

Basically I'm asking for some help with this. I'm more than happy to grab a bunch (1000s) of files to test any potential changes on. Questions for you all:

  1. How do we feel about Fixed the worst of the bad that was the detection heuristic for Obj-C vs C++ vs C #1036 - I'd love to see a case for C in that disambiguate_c method too if possible.
  2. Fixed the worst of the bad that was the detection heuristic for Obj-C vs C++ vs C #1036 is matching on a bunch of keywords - are all of these necessary? Are we potentially sacrificing accuracy for completeness?
  3. Are there any good heuristics for pure C code?
@dragonmux
Copy link
Contributor

@arfon The keywords have been carefully selected as being unique to the specific language they are detecting for - for example, you cannot use @ syntax outside of comments in C++ as this is a syntax violation, so we can use @protocol, for example, to help detect Obj-C as protocol is an invalid C and C++ keyword and @ syntax is invalid for both too

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Also, looking specifically for std:: is a good benchmark for C++ as this syntax is unique to the STL.

@arfon
Copy link
Contributor Author

arfon commented Oct 24, 2014

Thanks @DX-MON - I've made a new PR with all of your commits in here: #1627. I made the Objective-C matchers a little more strict in this commit as @end was matching lines like these in a valid C file (and therefore classifying the file as Objective-C).

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Yeah, excluding lines that begin with // might be a good idea.

@dragonmux
Copy link
Contributor

The other blocks of text to exclude if going after comments are lines enclosed in matching /* and */ pairs (hence the "and such").

I didn't know about \b so thank you for that :)

@mu578
Copy link

mu578 commented Jun 29, 2020

@noxabellus yours is not so bad ;LOL; https://github.com/moe123/macadam/search?l=c ; c++ I could understand the dilemma; obj-c very not; linguist is just a front-end patch nothing else. The core problem is the github indexation model which is polluted.

@drewcrawford
Copy link
Contributor

IMO #import ought to be removed as a pattern for Objective-C. Although it began as an ObjC feature, it is implemented for C/C++ by clang and GCC, and (differently) for C++ by Microsoft. Detecting #import as objc is a significant source of misclassification for my headers.

@Dawoodoz
Copy link

Until an advanced fix is complete and well tested (which might not even be mathematically possible), why not just skip headers for projects with non-header files? The ratio between languages will be mostly the same from having mostly one header per implementation file.

@tombsar
Copy link

tombsar commented Dec 1, 2020

I found one of the C header files in my project was being misidentified as Objective-C, so thought I'd investigate. I first started deleting lines until I had a minimal reproduction case. Here is what I came up with (stored as test.h, in case filename is relevant):

struct V {
    int capacity;
};

As written, this identifies as Objective-C, but if I change the name of capacity to anything else, it identifies as C. I went back to my original file, and saw the same behaviour: a struct with a member named "capacity" causes the file to be identified as Objective-C, and renaming it to anything else gives C.

I understand this is an impossible decision to make automatically, as Objective-C is a superset of C, but imo there should be a higher threshold for marking a file as Objective-C than the naming of one struct member.

Edit: Not to imply that such detection is intended behaviour; I don't see the word capacity mentioned in any of the heuristics linked to from this thread.

@drewcrawford
Copy link
Contributor

capacity is a term that ObjC developers are perhaps more likely to use culturally, and you see this sort of naming in the example files in this repo. Of course there’s no reason any language can’t use the word “capacity” to name something.

The longer I think about this example, the more convinced I am that this problem requires a text classification (e.g. neural network) approach. Regex classification works when you have a keyword table in a language reference that is radically different, but between C/++/ObjC they have substantially similar keywords, and a lot of the facts you “learn” with simple systems is wrong facts like “capacity is an ObjC keyword”

In reality it’s context-sensitive (does it occur within an objc selector?). That question itself involves a bunch of other language rules, which you can try to regex with lookarounds but I suspect it’s a losing battle. This is I think why behavior for this language combination is poor, you just can’t get a good result from tools that distinguish very different languages, for languages that are this similar.

Understanding the context requires an embedding of the language grammar, that is whether we are naming a field right now (as in @tombsar ’s example) or an ObjC method, a field in an ObjC object, property, or something else. An NN would learn such an embedding as a matter of course, and would understand that while we may be using ObjC-friendly names it remembers that the context is a struct. Another benefit is you could maintain more easily by people submitting their mis classified code and just retraining the model, no code changes required.

Unfortunately I know nothing about good ruby NN implementations, how portable they are etc. I just suspect that without one, we won’t be able to tell examples like this apart very thoroughly.

@MarcusJohnson91
Copy link

MarcusJohnson91 commented Dec 3, 2020

As written, this identifies as Objective-C, but if I change the name of capacity to anything else, it identifies as C. I went back to my original file, and saw the same behaviour: a struct with a member named "capacity" causes the file to be identified as Objective-C, and renaming it to anything else gives C.

It should be based on keywords.

#import or @ Objective-C

template or class C++

_Generic, _Bool, _Complex C.

@tombsar
Copy link

tombsar commented Dec 3, 2020

capacity is a term that ObjC developers are perhaps more likely to use culturally, and you see this sort of naming in the example files in this repo.

I guessed it would be something like that. Note that "capacity" is a standard container member in the C++ STL, so a larger range of sample code might pick up that commonality.

It should be based on keywords.

#import or @ Objective-C

template or class C++

_Generic, _Bool, _Complex C.

I think what I'm seeing is that some of my files have none of those keywords, so a heuristic is being used to classify instead, leading to the surprising results.

My preference would be that such files (ones that don't match any of the language-specific keywords or constructs) are classified as C, since that is the lowest common denominator between the three languages, but I understand why you might not want that as a solution.

A properly-trained RNN classifier is probably the way to go if you want best-possible results, but I likewise don't know anything about the Ruby ecosystem to know what to suggest there.

@ItzSwirlz
Copy link

Shouldn't linguist just look if the majority of the code is C? If the majority of the code is C, the 'Objective-C' file should be marked as C, and vice-versa.

For projects that use both C and C++, see what is used in the directory and assume that.

@felipeek
Copy link

felipeek commented Dec 12, 2020

Shouldn't linguist just look if the majority of the code is C? If the majority of the code is C, the 'Objective-C' file should be marked as C, and vice-versa.

For projects that use both C and C++, see what is used in the directory and assume that.

Totally agree. This is what I suggested in this thread some time ago. I think linguist should define a threshold - let's say 80% - and when it detects that 80% or more of the files that are inside the project/directory are identified as C, for example, it should just assume the others are C as well.

This would obviously do not solve the issue, since somebody could, theoretically, have half of the files C++ and the other files C inside his project/folder. However, I think this would solve 99% of the cases, so in my opinion it is totally worth doing

@Starwort
Copy link

Starwort commented Jan 6, 2021

struct V {
    int capacity;
};

Interestingly, this minimal reproduction is the same number of SLOC as my production sample for this case: https://github.com/Starwort/NEA/blob/master/solver_c/memory.h - I don't want to repeatedly push changes to my repository and I don't want to install the numerous dependencies for running linguist myself, so I haven't been able to minimise it, but I don't see why it would identify as C++

memory.h
#include "types.h"

Board* allocate_board();
void free_board(Board* board);

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 6, 2021

What if we added a dedicated strategy for C-like header files? This is such a commonly reported problem, it must comprise at least 80% of complaints about misclassified files.

I envision the strategy to behave like this:

  1. For every .h file, check for a file with a matching path (sans file extension), ending in .c, .cpp, .cc (etc)
  2. Assert that only one such definition file exists (I assume authors don't have foo.h in a directory with foo.c and foo.cpp)
  3. Defer analysis of the .h file until its counterpart has been evaluated
  4. Assert that the counterpart's language is one of C, C++, or Objective-C.
  5. Use the language of the corresponding definition file as the language of the .h file

If one of these steps is unsuccessful, fall through to the usual behaviour of heuristics, followed by Bayesian classification.

Would this be a feasible approach, or am I being naïve here?

@r-lyeh
Copy link

r-lyeh commented Apr 13, 2021

Hello folks,

Currently, there are 47 C headers, 11 C source files and 1 C++ file in my repository (https://github.com/r-lyeh/FWK).

There are at least 7 C files tagged as C++ in my repository, which is wrong.
image

Also, ratio is actually 58:1 in favor of C, or 49:8 according to linguist. However, this info is wrongly displayed as well.
image

@r-lyeh
Copy link

r-lyeh commented Apr 13, 2021

Also, my 2cents at identifying sources.

  1. Any H file is C,
  2. unless C++ is found:
    references&, [lambdas]{}, <extensionless_includes>, templates<>, std::, namespace, using, new/delete, operator, class, public, private, protected, override keywords.
  3. unless ObjC is found
    #import, @interface, @end, [[attributes]], dangling +/- characters.

@lildude
Copy link
Member

lildude commented Apr 13, 2021

@r-lyeh

Also, ratio is actually 58:1 in favor of C, or 49:8 according to linguist. However, this info is wrongly displayed as well.

It probably isn't. Your comments suggest you're looking at the number of files, however as stated in How Linguist Works:

The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

Your, granted incorrectly classified, bytes of code works out to:

C++ 12.2 MB
C 7.14 MB
Lua 279 KB
Objective-C 133 KB
GLSL 39.5 KB
Shell 6.14 KB

... with 10.7MB of those 12.2MB attributable to a single file: https://github.com/r-lyeh/FWK/blob/master/3rd/3rd_glfw3.h

It doesn't show up in the search results because it's too big so subject the search restrictions mentioned in the troubleshooting doc and thus not indexed.

Also, my 2cents at identifying sources.

This sounds like a reasonable approach to me.

If you're confident about this and have the time, we'd appreciate a PR that enhances the current heuristics here with the regexes broken down here and here.

@r-lyeh
Copy link

r-lyeh commented Apr 13, 2021

Doh! It's all about file sizes then! I would have never guessed that. Yes, glfw3 is reported as C++ incorrectly atm. It could be marked as ObjC (there are a small % of lines in ObjC there), but 98% of it is pure C though.
Ok, I'll try to take a look on those links during next weekend.

@MarcusJohnson91
Copy link

MarcusJohnson91 commented Apr 15, 2021 via email

@HiImJulien
Copy link

HiImJulien commented May 4, 2021

The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

Either I misunderstand this or this doesn't seem to work.
I have a header, which is roughly 3k lines of C code, 4 lines of C++ and 20 lines of Objective-C. Yet my file is marked as Objective-C.
The code style mandates a column width of 80 characters.


I'd like to throw in an idea: Can we let users mark their code themselves? I.e.:

#pragma linguist(cxx)
// Everything below is considered cxx
#pragma linguist(c)
// Everything below is considered c 
#pragma linguist(objc)
// Everything below is considered objc

Or perhaps a special form of comments?
This would be a (seemingly easy) fix before a "proper" heuristic has been developed.

@lildude
Copy link
Member

lildude commented May 4, 2021

@HiImJulien that statement refers to the breakdown of the repo as a whole after the language of each file has been identified, not the breakdown of the individual files. Each file is only associated with one language.

Can we let users mark their code themselves?

🤔 interesting idea though I'm not sure introducing a new approach to identify the language of a file is needed when users can already specify the language using an override in the .gitattributes file. I think your suggestion may also be based on the misunderstanding around the percentages.

@lildude lildude pinned this issue May 6, 2021
@lildude
Copy link
Member

lildude commented May 6, 2021

I've started #5357 to implement what feels to be the general consensus of the comments in this issue: assume all .h header files are C unless they specifically match C++ or Objective-C specific syntax.

I think this is a happy compromise that leads to more predictable behaviour considering Linguist detects the languages of files without considering any other files in the repo.

@lildude
Copy link
Member

lildude commented May 14, 2021

#5357 has now been merged. The changes won't take effect until I make the next release. I've scheduled time for the week of 24 May to make and deploy the next release.

@lildude
Copy link
Member

lildude commented Jun 1, 2021

The changes from #5357 have now been deployed to GitHub.com which means this section of the troubleshooting docs now applies:

My C/C++/Objective-C .h header file is detected as the wrong language

Correctly detecting the language for the C-family .h header files is tough because Linguist detects the languages of files in isolation when analysing repositories and these header files, especially the smaller ones, can be used across all three languages without using any language-specific content.
To try and reduce the number of false positives and keep some degree of predicatability for users, Linguist will assume all .h header files are C by default and will only identify a file as C++ or Objective-C if the content matches the specific heuristic for that language.
This will mean you will need to implement an override for some of your header files if you wish for them to be classified as C++ or Objective-C if they do no contain language-specific content.

Repositories will only be re-analysed when you next push to the repo so if you're seeing behaviour inconsistent with above, push another change to an affected file (only modifying .gitattributes to reclassify may not be sufficient to trigger re-analysis) and wait for the re-analysis to complete (this can take a while during busier parts of the day).

If you still experience behaviour that is not consistent with above, please open a new discussion or open a new PR if you feel the heuristics that ensure this behaviour can be improved.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 1, 2021
@lildude lildude unpinned this issue Jun 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.