Some .pm categorised as Perl6, some as Perl #2149

Closed
mintsoft opened this issue Feb 21, 2015 · 28 comments · Fixed by #2441

Comments

@mintsoft
Author

I've also just noticed it here:

Which is also Perl 5

@pchaigno
Contributor

I'm afraid there is not much that can be done with Linguist here.
As you can guess, distinguishing Perl 5 and Perl 6 is tricky. The way Linguist does it is by first using some heuristic; if that doesn't work the Bayesian classifier tries to guess the language. But since the two languages are really close, it will often fail.
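
The two-stage approach described above can be sketched roughly as follows. This is a minimal illustration, not Linguist's actual code: the `HEURISTICS` table, `heuristic_guess`, and `detect` are hypothetical names, and the statistical classifier is stood in for by any callable that returns a language name.

```ruby
# Hypothetical two-stage detector. Every name here is illustrative and is
# NOT Linguist's real API.

# Stage 1: cheap, high-confidence pattern rules, checked in order.
HEURISTICS = [
  [/^\s*use\s+v6\b/, "Perl6"],
  [/^\s*use\s+(?:v5\b|strict\b)/, "Perl"]
].freeze

# Returns the language of the first rule that matches, or nil if none does.
def heuristic_guess(source)
  rule = HEURISTICS.find { |pattern, _language| source =~ pattern }
  rule && rule[1]
end

# Stage 2: fall back to a statistical classifier only when no rule fired.
# `classifier` is assumed to be any callable taking the source text.
def detect(source, classifier)
  heuristic_guess(source) || classifier.call(source)
end
```

With a file containing `use strict;`, the first stage answers and the classifier is never consulted, which is why that workaround is reliable where the classifier alone is not.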

A few solutions to fix your case:

  • explicitly say you're using Perl 5 with use v5; (use strict; will also work in your case)
  • add *.pm linguist-language=perl in a .gitattributes file to override Linguist results (this will only affect the statistics)
  • use Vim modelines

@mintsoft
Author

@pchaigno I can certainly see how difficult this is. Just to clarify, if we use the .gitattributes then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?

@pchaigno
Contributor

Just to clarify, if we use the .gitattributes then it wouldn't fix the highlighting in the cases when it's incorrectly classified as 6?

That's correct. Unfortunately, the highlighting and the search results are currently not affected by the overrides :/

@mintsoft
Author

Dang, that's a real shame. Do you know if there's any plans to link the overrides in at some point?

@pchaigno
Contributor

I think it is considered. @arfon will know for sure.

@arfon
Contributor

arfon commented Feb 23, 2015

Do you know if there's any plans to link the overrides in at some point?

Yeah, there's an issue for this here. We'd definitely like to add this functionality - it's just a bunch of work on our end (for GitHub/infrastructure reasons).

@arfon
Contributor

arfon commented Feb 23, 2015

... for now, the Emacs and Vim overrides are probably your best bet: https://github.com/github/linguist#using-emacs-and-vim-modelines

@mintsoft
Author

Cool, thanks @arfon. We've actually opted for the use strict; for now.

@arfon
Contributor

arfon commented Feb 23, 2015

👍 ok thanks @mintsoft

@kraih

kraih commented Jun 4, 2015

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

@arfon
Contributor

arfon commented Jun 4, 2015

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

Because the 'likelihood' is determined by the Bayesian classifier, not by custom rules about which languages are more likely based purely on popularity.

As @pchaigno discussed, getting Linguist to do the right thing here is hard so I'd encourage you to take a look at the overrides to address this issue.

@jberger

jberger commented Jun 4, 2015

Based on the linked code, there seems to only be certain heuristics, but no way to provide a default 'else'?

  ...
  else
    Language["Perl"]
  end

@arfon
Contributor

arfon commented Jun 4, 2015

It is possible to provide a default 'else' but we try to avoid writing heuristics that way. Especially as we don't actually know the file extension we're working with when we're in the heuristic.

@zoffixznet

What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?

Regardless of Linguist's implementation, Perl5 has 150,717 modules (not all of them on GitHub though), while Perl6 has only 330. It's incredibly more likely that an ambiguous file is Perl5 code, not Perl6.

@arfon
Contributor

arfon commented Jun 4, 2015

What about adding popularity of a language, however vaguely it is defined, as part of the heuristics?

It's possible we could do something with popularity somewhere in the language classification, but the heuristics aren't the right place.
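
To illustrate what using popularity "somewhere in the language classification" could look like, here is a hedged sketch of a popularity prior added to per-language log-likelihoods, in the style of a naive Bayes classifier. It is not Linguist's implementation: `log_priors` and `most_probable` are hypothetical names, the likelihood scores are made up, and the module counts quoted earlier in this thread are used only as a crude stand-in for popularity.

```ruby
# Illustrative only; none of these names exist in Linguist.

# Module counts quoted in this thread, used as a rough popularity proxy.
MODULE_COUNTS = { "Perl" => 150_717, "Perl6" => 330 }.freeze

# Turn raw counts into log prior probabilities.
def log_priors(counts)
  total = counts.values.sum.to_f
  counts.transform_values { |n| Math.log(n / total) }
end

# log_likelihoods: { language => log P(tokens | language) }
# Picks the language maximising log-likelihood plus log-prior.
def most_probable(log_likelihoods, priors)
  log_likelihoods.max_by { |language, ll| ll + priors.fetch(language) }.first
end

# Made-up scores for an ambiguous file that leans slightly towards Perl 6.
scores  = { "Perl" => -42.0, "Perl6" => -40.5 }
uniform = { "Perl" => 0.0, "Perl6" => 0.0 }

puts most_probable(scores, uniform)                   # prints "Perl6"
puts most_probable(scores, log_priors(MODULE_COUNTS)) # prints "Perl"
```

With a flat prior the slightly better token match wins; with the popularity prior the small gap in likelihood is outweighed, which is the behaviour being asked for above.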

@jberger

jberger commented Jun 4, 2015

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

@smls

smls commented Jun 4, 2015

Perl 6's design docs have some thoughts on the matter: see point 8 in this bullet point list.

In practice, this means that it should be safe to assume a .pm file is Perl by default, and only treat it as Perl 6 if the first line which is not empty and not a comment starts with one of:

use v6
class
module
unit

PS: unit keyword is not mentioned in the above link, because it was only added recently. But unit class Foo; will likely become the "default" way to start a Perl 6 module file, so don't miss it.
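
The rule above could be sketched in Ruby along these lines. This is a minimal illustration under stated assumptions, not the heuristic Linguist ships: `PERL6_OPENERS` and `guess_pm_language` are hypothetical names, only `#` line comments are skipped (POD blocks are not handled), and everything that does not match falls back to Perl.

```ruby
# Hypothetical helper, not part of Linguist: classify a .pm file by its
# first line that is neither blank nor a '#' comment.
PERL6_OPENERS = /\A\s*(?:use\s+v6|class|module|unit)\b/

def guess_pm_language(source)
  # Skip blank lines and comment-only lines (a shebang counts as a comment).
  first = source.each_line.find { |line| line !~ /\A\s*(?:#.*)?$/ }
  # Default to Perl unless the first significant line opens like Perl 6.
  first && first.match?(PERL6_OPENERS) ? "Perl6" : "Perl"
end
```

For example, a file whose first significant line is `unit class Foo;` would be reported as Perl6, while one starting with `package Foo;` would be reported as Perl.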

@mintsoft
Author

mintsoft commented Jun 4, 2015

If Linguist can't distinguish between Perl 5 and Perl 6, why does it default to Perl 6 here? The odds of it being Perl 5 are much much much higher. I think that's the real bug.

I agree with this, it does seem odd that the weighting appears to be in favour of Perl6

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

This seems sensible to me

@zoffixznet

I could even suggest that, in the interim, the lack of a use v6 ought to be enough to disambiguate. I'm fairly sure that that is a Perl 6 best practice and will (or should) be for some time.

Just asked about this on the #perl6 IRC channel (irc://irc.freenode.net/#perl6) and FROGGS said that it likely won't work in this case, because we're also trying to decide whether the code is Prolog. Here is the full discussion: http://pastebin.com/5LMub5X0

Also, to add to the discussion, nwc10 had this to suggest:

alternative question - is it viable to help train the Bayesian filter - i.e. would github be happy with a button for "you're wrong, it's this other language?"

@mintsoft
Author

mintsoft commented Jun 4, 2015

because we're also trying to decide whether the code is Prolog.

Surely that's relatively straightforward: if there are $ in it, it's not Prolog?

@zoffixznet

Would there be any possibility of re-opening this issue so someone with more intimate knowledge of Ruby/Linguist could perhaps see if any of the suggestions above are viable and could be implemented, to make Linguist's guesses more accurate?

@mintsoft mintsoft reopened this Jun 4, 2015
@mintsoft
Author

mintsoft commented Jun 4, 2015

@zoffixznet I've reopened it for now, I'll leave it open unless the github staff close it

@jberger

jberger commented Aug 4, 2015

excellent, thanks!

@Grinnz

Grinnz commented Aug 16, 2015

This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.

@arfon
Contributor

arfon commented Aug 19, 2015

This is excellent, thank you, but just wondering, does this also help with .t files? I don't see it referenced in the commits, and these seem to be the most often marked as Perl6 mistakenly.

Not currently. We could write a heuristic that is only for .t extensions which uses the same matchers for .pm here. Note that .t isn't unique to Perl/Perl 6 - Turing also uses it so we'd need to make sure to include that also.

What do you think @pchaigno?

@zoffixznet

@arfon in my experience, it's mostly Perl 5's .t files that Linguist mistakenly thinks are Perl 6, so it would be very helpful if those had the heuristic evaluation as well.

mintsoft added a commit to duckduckgo/zeroclickinfo-goodies that referenced this issue Apr 10, 2016
dbsrgits-sync pushed a commit to Perl5/DBIx-Class that referenced this issue Apr 28, 2016
Port of mojolicious/mojo@19cdf772
Before this clarification the project listed as 7% Perl6 code >.<

The explicit listing is needed, as there apparently won't be a fix
within Github itself any time soon:
github-linguist/linguist#2149 (comment)
github-linguist/linguist#2781 (comment)
github-linguist/linguist#2074 (comment)

Language names sourced from:
https://github.com/github/linguist/blob/master/lib/linguist/languages.yml
@polamjag

polamjag commented Jan 24, 2024

FYI: I feel like the problems described here are pretty much resolved by #6264 (ref. #6263)

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jan 24, 2024
9 participants