Fixing C/C++/Objective-C classifications #1626

arfon · 2014-10-23T10:22:21Z

A fair fraction of the outstanding issues with Linguist mis-classifications are to do with C, C++, Objective-C and any other languages that use .m or .h as extension.

While it's possible we can see further improvements by increasing our samples for the Bayesian classifier I'm pretty convinced we need to craft a few reliable heuristics to shortcut the classifier.

We currently have the following in heuristics.rb but it's not being called.

def self.disambiguate_c(data, languages)
  matches = []
  matches << Language["Objective-C"] if data.include?("@interface")
  matches << Language["C++"] if data.include?("#include <cstdint>")
  matches
end

@DX-MON had a good go at this in #1036 but at the time we didn't have a good way to benchmark the effects of these changes. With the new benchmarks we are now in a good place to tackle this problem.

The current heuristic is deliberately limited in scope, that's because any heuristics we define should be accurate, simple to understand and fast.

Basically I'm asking for some help with this. I'm more than happy to grab a bunch (1000s) of files to test any potential changes on. Questions for you all:

How do we feel about Fixed the worst of the bad that was the detection heuristic for Obj-C vs C++ vs C #1036 - I'd love to see a case for C in that disambiguate_c method too if possible.
Fixed the worst of the bad that was the detection heuristic for Obj-C vs C++ vs C #1036 is matching on a bunch of keywords - are all of these necessary? Are we potentially sacrificing accuracy for completeness?
Are there any good heuristics for pure C code?

The text was updated successfully, but these errors were encountered:

dragonmux · 2014-10-24T00:52:47Z

@arfon The keywords have been carefully selected as being unique to the specific language they are detecting for - for example, you cannot use @ syntax outside of comments in C++ as this is a syntax violation, so we can use @protocol, for example, to help detect Obj-C as protocol is an invalid C and C++ keyword and @ syntax is invalid for both too

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Also, looking specifically for std:: is a good benchmark for C++ as this syntax is unique to the STL.

arfon · 2014-10-24T08:26:58Z

Thanks @DX-MON - I've made a new PR with all of your commits in here: #1627. I made the Objective-C matchers a little more strict in this commit as @end was matching lines like these in a valid C file (and therefore classifying the file as Objective-C).

More testing would be a good way to verify that nobody is doing silly things in comments though, as I believe there is scope to include extra matching based on not being in comments to select out people using //! @class and such for documentation.

Yeah, excluding lines that begin with // might be a good idea.

dragonmux · 2014-10-24T08:37:23Z

The other blocks of text to exclude if going after comments are lines enclosed in matching /* and */ pairs (hence the "and such").

I didn't know about \b so thank you for that :)

mu578 · 2020-06-29T05:10:23Z

@noxabellus yours is not so bad ;LOL; https://github.com/moe123/macadam/search?l=c ; c++ I could understand the dilemma; obj-c very not; linguist is just a front-end patch nothing else. The core problem is the github indexation model which is polluted.

drewcrawford · 2020-07-26T01:15:47Z

IMO #import ought to be removed as a pattern for Objective-C. Although it began as an ObjC feature, it is implemented for C/C++ by clang and GCC, and (differently) for C++ by Microsoft. Detecting #import as objc is a significant source of misclassification for my headers.

Dawoodoz · 2020-11-22T16:04:36Z

Until an advanced fix is complete and well tested (which might not even be mathematically possible), why not just skip headers for projects with non-header files? The ratio between languages will be mostly the same from having mostly one header per implementation file.

tombsar · 2020-12-01T23:25:25Z

I found one of the C header files in my project was being misidentified as Objective-C, so thought I'd investigate. I first started deleting lines until I had a minimal reproduction case. Here is what I came up with (stored as test.h, in case filename is relevant):

struct V {
    int capacity;
};

As written, this identifies as Objective-C, but if I change the name of capacity to anything else, it identifies as C. I went back to my original file, and saw the same behaviour: a struct with a member named "capacity" causes the file to be identified as Objective-C, and renaming it to anything else gives C.

I understand this is an impossible decision to make automatically, as Objective-C is a superset of C, but imo there should be a higher threshold for marking a file as Objective-C than the naming of one struct member.

Edit: Not to imply that such detection is intended behaviour; I don't see the word capacity mentioned in any of the heuristics linked to from this thread.

drewcrawford · 2020-12-02T23:13:23Z

capacity is a term that ObjC developers are perhaps more likely to use culturally, and you see this sort of naming in the example files in this repo. Of course there’s no reason any language can’t use the word “capacity” to name something.

The longer I think about this example, the more convinced I am that this problem requires a text classification (e.g. neural network) approach. Regex classification works when you have a keyword table in a language reference that is radically different, but between C/++/ObjC they have substantially similar keywords, and a lot of the facts you “learn” with simple systems is wrong facts like “capacity is an ObjC keyword”

In reality it’s context-sensitive (does it occur within an objc selector?). That question itself involves a bunch of other language rules, which you can try to regex with lookarounds but I suspect it’s a losing battle. This is I think why behavior for this language combination is poor, you just can’t get a good result from tools that distinguish very different languages, for languages that are this similar.

Understanding the context requires an embedding of the language grammar, that is whether we are naming a field right now (as in @tombsar ’s example) or an ObjC method, a field in an ObjC object, property, or something else. An NN would learn such an embedding as a matter of course, and would understand that while we may be using ObjC-friendly names it remembers that the context is a struct. Another benefit is you could maintain more easily by people submitting their mis classified code and just retraining the model, no code changes required.

Unfortunately I know nothing about good ruby NN implementations, how portable they are etc. I just suspect that without one, we won’t be able to tell examples like this apart very thoroughly.

MarcusJohnson91 · 2020-12-03T00:00:34Z

As written, this identifies as Objective-C, but if I change the name of capacity to anything else, it identifies as C. I went back to my original file, and saw the same behaviour: a struct with a member named "capacity" causes the file to be identified as Objective-C, and renaming it to anything else gives C.

It should be based on keywords.

#import or @ Objective-C

template or class C++

_Generic, _Bool, _Complex C.

tombsar · 2020-12-03T00:38:25Z

capacity is a term that ObjC developers are perhaps more likely to use culturally, and you see this sort of naming in the example files in this repo.

I guessed it would be something like that. Note that "capacity" is a standard container member in the C++ STL, so a larger range of sample code might pick up that commonality.

It should be based on keywords.

#import or @ Objective-C

template or class C++

_Generic, _Bool, _Complex C.

I think what I'm seeing is that some of my files have none of those keywords, so a heuristic is being used to classify instead, leading to the surprising results.

My preference would be that such files (ones that don't match any of the language-specific keywords or constructs) are classified as C, since that is the lowest common denominator between the three languages, but I understand why you might not want that as a solution.

A properly-trained RNN classifier is probably the way to go if you want best-possible results, but I likewise don't know anything about the Ruby ecosystem to know what to suggest there.

ItzSwirlz · 2020-12-11T17:45:03Z

Shouldn't linguist just look if the majority of the code is C? If the majority of the code is C, the 'Objective-C' file should be marked as C, and vice-versa.

For projects that use both C and C++, see what is used in the directory and assume that.

felipeek · 2020-12-12T22:52:37Z

Shouldn't linguist just look if the majority of the code is C? If the majority of the code is C, the 'Objective-C' file should be marked as C, and vice-versa.

For projects that use both C and C++, see what is used in the directory and assume that.

Totally agree. This is what I suggested in this thread some time ago. I think linguist should define a threshold - let's say 80% - and when it detects that 80% or more of the files that are inside the project/directory are identified as C, for example, it should just assume the others are C as well.

This would obviously do not solve the issue, since somebody could, theoretically, have half of the files C++ and the other files C inside his project/folder. However, I think this would solve 99% of the cases, so in my opinion it is totally worth doing

Starwort · 2021-01-06T11:05:03Z

struct V {
    int capacity;
};

Interestingly, this minimal reproduction is the same number of SLOC as my production sample for this case: https://github.com/Starwort/NEA/blob/master/solver_c/memory.h - I don't want to repeatedly push changes to my repository and I don't want to install the numerous dependencies for running linguist myself, so I haven't been able to minimise it, but I don't see why it would identify as C++

memory.h

#include "types.h"

Board* allocate_board();
void free_board(Board* board);

Alhadis · 2021-01-06T11:21:55Z

What if we added a dedicated strategy for C-like header files? This is such a commonly reported problem, it must comprise at least 80% of complaints about misclassified files.

I envision the strategy to behave like this:

For every .h file, check for a file with a matching path (sans file extension), ending in .c, .cpp, .cc (etc)
Assert that only one such definition file exists (I assume authors don't have foo.h in a directory with foo.c and foo.cpp)
Defer analysis of the .h file until its counterpart has been evaluated
Assert that the counterpart's language is one of C, C++, or Objective-C.
Use the language of the corresponding definition file as the language of the .h file

If one of these steps is unsuccessful, fall through to the usual behaviour of heuristics, followed by Bayesian classification.

Would this be a feasible approach, or am I being naïve here?

r-lyeh · 2021-04-13T12:00:28Z

Hello folks,

Currently, there are 47 C headers, 11 C source files and 1 C++ file in my repository (https://github.com/r-lyeh/FWK).

There are at least 7 C files tagged as C++ in my repository, which is wrong.

Also, ratio is actually 58:1 in favor of C, or 49:8 according to linguist. However, this info is wrongly displayed as well.

r-lyeh · 2021-04-13T12:06:21Z

Also, my 2cents at identifying sources.

Any H file is C,
unless C++ is found:
references&, [lambdas]{}, <extensionless_includes>, templates<>, std::, namespace, using, new/delete, operator, class, public, private, protected, override keywords.
unless ObjC is found
#import, @interface, @end, [[attributes]], dangling +/- characters.

lildude · 2021-04-13T13:30:40Z

@r-lyeh

Also, ratio is actually 58:1 in favor of C, or 49:8 according to linguist. However, this info is wrongly displayed as well.

It probably isn't. Your comments suggest you're looking at the number of files, however as stated in How Linguist Works:

The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

Your, granted incorrectly classified, bytes of code works out to:

C++ 12.2 MB
C 7.14 MB
Lua 279 KB
Objective-C 133 KB
GLSL 39.5 KB
Shell 6.14 KB

... with 10.7MB of those 12.2MB attributable to a single file: https://github.com/r-lyeh/FWK/blob/master/3rd/3rd_glfw3.h

It doesn't show up in the search results because it's too big so subject the search restrictions mentioned in the troubleshooting doc and thus not indexed.

Also, my 2cents at identifying sources.

This sounds like a reasonable approach to me.

If you're confident about this and have the time, we'd appreciate a PR that enhances the current heuristics here with the regexes broken down here and here.

r-lyeh · 2021-04-13T14:11:46Z

Doh! It's all about file sizes then! I would have never guessed that. Yes, glfw3 is reported as C++ incorrectly atm. It could be marked as ObjC (there are a small % of lines in ObjC there), but 98% of it is pure C though.
Ok, I'll try to take a look on those links during next weekend.

MarcusJohnson91 · 2021-04-15T04:44:15Z

C is gaining [[attributes]] in c2x

…

On Apr 13, 2021, at 5:06 AM, r-lyeh ***@***.***> wrote: Also, my 2cents at identifying sources. Any H file is C, unless C++ is found: references&, [lambdas]{}, <extensionless_includes>, templates<>, std::, namespace, using, new/delete, operator, class, public, private, override keywords. unless ObjC is found #import, [[attributes]], dangling +/- characters. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

HiImJulien · 2021-05-04T01:06:19Z

The percentages are calculated based on the bytes of code for each language as reported by the List Languages API.

Either I misunderstand this or this doesn't seem to work.
I have a header, which is roughly 3k lines of C code, 4 lines of C++ and 20 lines of Objective-C. Yet my file is marked as Objective-C.
The code style mandates a column width of 80 characters.

I'd like to throw in an idea: Can we let users mark their code themselves? I.e.:

#pragma linguist(cxx)
// Everything below is considered cxx
#pragma linguist(c)
// Everything below is considered c 
#pragma linguist(objc)
// Everything below is considered objc

Or perhaps a special form of comments?
This would be a (seemingly easy) fix before a "proper" heuristic has been developed.

lildude · 2021-05-04T05:47:32Z

@HiImJulien that statement refers to the breakdown of the repo as a whole after the language of each file has been identified, not the breakdown of the individual files. Each file is only associated with one language.

Can we let users mark their code themselves?

🤔 interesting idea though I'm not sure introducing a new approach to identify the language of a file is needed when users can already specify the language using an override in the .gitattributes file. I think your suggestion may also be based on the misunderstanding around the percentages.

lildude · 2021-05-06T12:26:02Z

I've started #5357 to implement what feels to be the general consensus of the comments in this issue: assume all .h header files are C unless they specifically match C++ or Objective-C specific syntax.

I think this is a happy compromise that leads to more predictable behaviour considering Linguist detects the languages of files without considering any other files in the repo.

lildude · 2021-05-14T08:51:44Z

#5357 has now been merged. The changes won't take effect until I make the next release. I've scheduled time for the week of 24 May to make and deploy the next release.

lildude · 2021-06-01T10:51:25Z

The changes from #5357 have now been deployed to GitHub.com which means this section of the troubleshooting docs now applies:

My C/C++/Objective-C .h header file is detected as the wrong language

Correctly detecting the language for the C-family .h header files is tough because Linguist detects the languages of files in isolation when analysing repositories and these header files, especially the smaller ones, can be used across all three languages without using any language-specific content.
To try and reduce the number of false positives and keep some degree of predicatability for users, Linguist will assume all .h header files are C by default and will only identify a file as C++ or Objective-C if the content matches the specific heuristic for that language.
This will mean you will need to implement an override for some of your header files if you wish for them to be classified as C++ or Objective-C if they do no contain language-specific content.

Repositories will only be re-analysed when you next push to the repo so if you're seeing behaviour inconsistent with above, push another change to an affected file (only modifying .gitattributes to reclassify may not be sufficient to trigger re-analysis) and wait for the re-analysis to complete (this can take a while during busier parts of the day).

If you still experience behaviour that is not consistent with above, please open a new discussion or open a new PR if you feel the heuristics that ensure this behaviour can be improved.

arfon added C/C++/Obj-C Help Wanted labels Oct 23, 2014

XrXr mentioned this issue Oct 31, 2014

Disambiguate C, C++ and Objective-C header files #1651

Closed

smola mentioned this issue Sep 19, 2020

Detected language depends on file length or amount of data? #5012

Closed

lildude mentioned this issue Sep 21, 2020

Add a C sample defining a large initialized array #5015

Closed

3 tasks

smola mentioned this issue Oct 1, 2020

Misclassified C as C++ in data.table project #5028

Closed

foggynight mentioned this issue Jan 2, 2021

C header file detected as C# #5128

Closed

4 tasks

lildude pinned this issue May 6, 2021

lildude mentioned this issue May 6, 2021

Classify .h files as C by default #5357

Merged

3 tasks

lildude closed this as completed in #5357 May 14, 2021

github-linguist locked as resolved and limited conversation to collaborators Jun 1, 2021

lildude unpinned this issue Jun 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing C/C++/Objective-C classifications #1626

Fixing C/C++/Objective-C classifications #1626

arfon commented Oct 23, 2014

dragonmux commented Oct 24, 2014

arfon commented Oct 24, 2014

dragonmux commented Oct 24, 2014

mu578 commented Jun 29, 2020

drewcrawford commented Jul 26, 2020

Dawoodoz commented Nov 22, 2020

tombsar commented Dec 1, 2020 •

edited

drewcrawford commented Dec 2, 2020

MarcusJohnson91 commented Dec 3, 2020 •

edited

tombsar commented Dec 3, 2020 •

edited

ItzSwirlz commented Dec 11, 2020

felipeek commented Dec 12, 2020 •

edited

Starwort commented Jan 6, 2021

Alhadis commented Jan 6, 2021

r-lyeh commented Apr 13, 2021 •

edited

r-lyeh commented Apr 13, 2021 •

edited

lildude commented Apr 13, 2021 •

edited

r-lyeh commented Apr 13, 2021 •

edited

MarcusJohnson91 commented Apr 15, 2021 via email

HiImJulien commented May 4, 2021 •

edited

lildude commented May 4, 2021

lildude commented May 6, 2021

lildude commented May 14, 2021

lildude commented Jun 1, 2021

My C/C++/Objective-C `.h` header file is detected as the wrong language

Fixing C/C++/Objective-C classifications #1626

Fixing C/C++/Objective-C classifications #1626

Comments

arfon commented Oct 23, 2014

dragonmux commented Oct 24, 2014

arfon commented Oct 24, 2014

dragonmux commented Oct 24, 2014

mu578 commented Jun 29, 2020

drewcrawford commented Jul 26, 2020

Dawoodoz commented Nov 22, 2020

tombsar commented Dec 1, 2020 • edited

drewcrawford commented Dec 2, 2020

MarcusJohnson91 commented Dec 3, 2020 • edited

tombsar commented Dec 3, 2020 • edited

ItzSwirlz commented Dec 11, 2020

felipeek commented Dec 12, 2020 • edited

Starwort commented Jan 6, 2021

Alhadis commented Jan 6, 2021

r-lyeh commented Apr 13, 2021 • edited

r-lyeh commented Apr 13, 2021 • edited

lildude commented Apr 13, 2021 • edited

r-lyeh commented Apr 13, 2021 • edited

MarcusJohnson91 commented Apr 15, 2021 via email

HiImJulien commented May 4, 2021 • edited

lildude commented May 4, 2021

lildude commented May 6, 2021

lildude commented May 14, 2021

lildude commented Jun 1, 2021

My C/C++/Objective-C .h header file is detected as the wrong language

tombsar commented Dec 1, 2020 •

edited

MarcusJohnson91 commented Dec 3, 2020 •

edited

tombsar commented Dec 3, 2020 •

edited

felipeek commented Dec 12, 2020 •

edited

r-lyeh commented Apr 13, 2021 •

edited

r-lyeh commented Apr 13, 2021 •

edited

lildude commented Apr 13, 2021 •

edited

r-lyeh commented Apr 13, 2021 •

edited

HiImJulien commented May 4, 2021 •

edited

My C/C++/Objective-C `.h` header file is detected as the wrong language