Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override languages being included by language statistics #3807

Merged
merged 10 commits into from
Jan 23, 2018

Conversation

seppestas
Copy link
Contributor

@seppestas seppestas commented Aug 31, 2017

Adds a way to override languages and blobs being included in the language statistics.

This improves support for non-code languages (#3805) by making it possible to flag languages as being detectable in the languages.yml file. It is also possible to override the detectable flag for specific blobs in the repo using the .gitattributes file.

Note: the tests require #3806 to be merged in order to work properly.

@seppestas seppestas changed the title Feature detectable languages Override languages being included by language statistics Aug 31, 2017
@seppestas
Copy link
Contributor Author

/cc @Alhadis @lildude

@seppestas
Copy link
Contributor Author

Note: Travis Build 8308 will probably fail because of the weird way this library tests .gitattributes.

@valerionew
Copy link

This would also fix #3228 and probably many more past issues. For what i could read, this feature has been asked for a long time.
Let's hope that Github's staff and the contributors will agree with these changes!

@seppestas
Copy link
Contributor Author

So... any thoughts?

While I think this is the most flexible and forward looking approach, I understand it's also has the most impact on the rest on the Linguist system compared to the other approaches I described in #3805.

As stated by @5N44P this approach allows users that want XML, SVG, doc, ... files to represented in the language statistics of a project, like the cases stated in #3159 and #3228, to do so.

Note that the override in the .gitattributes works on a blob level. While flexible and simple, this might not be the most user-friendly approach. Having the option to override what languages are "detectable" in the .gitattributes file might be a more consistent solution, however this is harder to combine with the "sensible defaults" in the languages.yml file.

@lildude
Copy link
Member

lildude commented Sep 7, 2017

I'm not 100% sure this is the right approach to solve this problem as this paves the way for a million more PRs asking for similar functionality from other non-programming languages which don't show up in the colourful language bar. I'm also not sure of the impact allowing such an override will have on the performance of Linguist on GitHub.com - it's already a pretty hungry beast and I imagine opening the door to overriding things like XML & SQL will lead to heeeeawg problems.

Accordingly, I've deferred the judgment on how best to approach this to the team within GitHub that "owns" the Linguist product (I'm not part of that team) as they have a much clearer idea of the direction they'd like to take such things, in conjunction with future GitHub.com development plans, in the long run.

@seppestas
Copy link
Contributor Author

seppestas commented Sep 7, 2017

I'm not 100% sure this is the right approach to solve this problem as this paves the way for a million more PRs asking for similar functionality from other non-programming languages which don't show up in the colourful language bar.

Why is this not a good thing? Having people indicate what they expect their projects to be recognized as by Github is an awesome way to handle this IMO. Anyhow, I tried to find some other languages where it would make sense to have them "detectable" by default, but it wasn't so easy to find those since most languages in GH are either already programming languages or just plain data. The only languages I could think of (like music sheet files and music mastering settings) are not currently popular / available on Github. I think the amount of requests for other languages to be detectable by default will not be that high. The only problem might be people wanting to recognise generic files (like XML or SVG), for this reason I think having the option to override the detectable flag makes sense.

I'm also not sure of the impact allowing such an override will have on the performance of Linguist on GitHub.com

I don't think the detectable field for languages I added will have a big impact on performance, but the way the override works might since it uses file globbing. Having an override on a language level (or no override at all) would probably make it more performant. However most overrides in the .gitattributes file currently work on a file / blob level, so I felt the current approach was the most consistent.

Should I separate the addition of the "detectable" language flag (0f7c0df5c4683a134c8e0794d672f1aef4479aee) and the override feature into their own PR?

@lildude
Copy link
Member

lildude commented Sep 7, 2017

Should I separate the addition of the "detectable" language flag (0f7c0df) and the override feature into their own PR?

Nah, you can leave things as-is and we'll let my colleagues decide how best to approach this and advise further.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've discussed this PR with team within GitHub and we think implementing the linguist-detectable .gitattributes option is a good compromise to allow you to bring back the KiCad language stats.

However, we think setting a global detectable attribute within languages.yml is something that could pose a problem in the future so we'd rather hold off on that part for the moment. If we need to, we can always come back to the languages.yml side of things in future.

Can you please update your PR to remove the changes to languages.yml and remove the functionality that detects the detectable attribute within that file.

@@ -480,6 +482,15 @@ def searchable?
@searchable
end

# Public: Is it detectable?
#
# Detectable languages will by include in the language stats.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/by include/be included/

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 15, 2017

Personally, I believe giving users the option of setting their project's language classification a.l.a BitBucket is the ideal solution. It needn't be mandatory; by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner. This needn't affect the stats in the language breakdown, of course... but it would offer an intuitive way to affect how their projects are being advertised to other developers.

Just my 2 cents. 👍

@seppestas
Copy link
Contributor Author

I would really like to have languages like KiCad and Eagle design files and other languages that can be considered primary to a repository detectable by default.

Most users of these systems are not that intimately familiar with the workings of git and Github, let alone the role of Linguist in language detection. These users expect their repositories to be recognised correctly, and when this is not the case this causes confusion, and in a lot of cases frustration.

Best case this will result in a bunch of people creating issues on Linguist asking why their repo is being mis-identified, with a whole lot of "you should add a linguist-detectable line in your .gitattributes file" answers. Worst case people never find out about Linguist and/or leave things as-is, negatively impacting discoverability of their projects on Github.

I'll update this PR to only add the .gitattributes feature, but I would like to keep #3805 open until a solution is added that allows non-code languages to be included in the language statistics by default. Could #3806 already be merged so the tests pass when I update this PR?

@seppestas
Copy link
Contributor Author

Personally, I believe giving users the option of setting their project's language classification a.l.a BitBucket is the ideal solution.

While I agree this would be more intuitive for people that are not familiar with the .gitattributes file, I'm not a big fan of this solution because:

  • In the case of hardware projects repositories often contain a combination of hardware design files and firmware source files.
  • It's not automatic, meaning people can overlook / ignore this.
  • The language stats bar just looks cool and adds an easy way to navigate to different types of source files in the repo.

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 15, 2017

It's not automatic, meaning people can overlook / ignore this.

... erh, it literally says in the second sentence that the proposed default would be automatic:

It needn't be mandatory; by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner.

As for this:

negatively impacting discoverability of their projects on Github.

There is a topics feature, as I said. Furthermore, I've found that to be much more reliable than searching repositories by language. Searching for repositories that've been identified as PostScript has, more often than not, given me a bunch of misclassified projects. Same for Turing and various lesser-known assembler languages. Searching for something that's been explicitly tagged by humans has proven more useful than the results of a Bayesian classifier.

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 15, 2017

The language stats bar just looks cool

I agree, but don't get too attached to what you see in GitHub's UI. Design elements are subject to change without notice, and believe me, it often happens when you least expect it. 😜 If GitHub decide to do away with the language bar someday, for reasons only known to internally, well...

@seppestas
Copy link
Contributor Author

by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner

This would still require proper support for non-code languages in order to be recognised by Linguist though.

Having the option to manually change the language a project is classified as in a way that is more user-friendly than .gitattribute files would already be a big improvement. It might be interesting to tie this in to Github's topic feature. Anyhow, these are things that GH themselves need to implement, since it can't be done by Linguist...

@lildude
Copy link
Member

lildude commented Sep 18, 2017

Having the option to manually change the language a project is classified as in a way that is more user-friendly than .gitattribute files would already be a big improvement.

Please don't confuse the language bar showing the breakdown of languages found within a repository with a specific repository language. GitHub has no concept of a repository language and never has as I pointed out in #3805 (comment). I'm repeating this as I keep getting the feeling that you're thinking the language stats bar is an indicator of a repository's language rather than a breakdown of the languages within it.

The .gitattributes file is not for setting a repository language and nor should it considered a method of doing so either. The Linguist-specific options are meant as a tool to help Linguist identify the languages within a repo when it doesn't quite get it right; it's not perfect after all. If adjusting the content just so happens to result in a shift in the proportion of languages such that your favourite language is now the predominant language, then great, but there is still no concept of a repository language and using the .gitattributes file to try and set one is really missing the point.

If you really want the ability to set a repository language and you feel topics are't the right thing, you'll need to contact GitHub support and log a feature request.

@seppestas
Copy link
Contributor Author

seppestas commented Sep 18, 2017

I'm repeating this as I keep getting the feeling that you're thinking the language stats bar is an indicator of a repository's language rather than a breakdown of the languages within it.

I'm aware of this. The problem I'm trying to address is it not showing the languages I'm / (Ki)CAD users / maintainers of none code repos in general are interested in, and only the remaining, typically miss-identified files being exposed in the language stats causing confusion.

Anyhow, this PR is an implementation of one of my proposals on how this can be fixed. I think we should keep the higher level discussion on how Linguist could / should handle non-code languages in #3805.

@rafer
Copy link
Contributor

rafer commented Sep 18, 2017

Hi All, just wanted to provide some further context for the decisions that @lildude outlined. Basically, we're trying to get some of the functionality you want (the ability for your projects to have KiCad files show up in the language bar), without making significant changes to linguist (adding detectable to the languages.yml), because we'd need to further investigate the impact of this on internal github systems, and also it could open up a common vector for disputes (i.e. some folks want XML files to be detectable by default and some don't). Long story short we're trying to be cautious and make sure that we've fully thought out the repercussions before acting. That's why I think the project-level overrides solution is a good interim step.

  • It alleviates at least some of the pain y'all are currently experiencing.
  • We can 👍 this change without a lengthy investigation on the GitHub side.
  • We can continue to collect data. If we do indeed get a ton of "you should add a linguist-detectable line in your .gitattributes file" type issues as @seppestas suggests, that will be good to know and can help inform future decisions.

Anyway, looking forward to merging this in once @lildude's review comments are addressed.

@seppestas
Copy link
Contributor Author

Ok, I have a commit ready where I reverted the changes of 0f7c0df5c4683a134c8e0794d672f1aef4479aee and made some required adjustments to make this work with the .gitattributes related code, as well as fixing a problem with other blob types.

Having #3806 merged before pushing this commit would make the tests pass (if everything goes well). It's a bit of a chicken and egg story though. Should I go ahead and push my commit to update this PR and have you rerun the test once #3806 is merged, or do you want to merge #3806 first so the tests pass as soon as I update this PR?

@seppestas
Copy link
Contributor Author

Huh, looks like the required commit on the test branch (7af5303) is still not found... Am I missing something?

@seppestas
Copy link
Contributor Author

seppestas commented Sep 19, 2017

Ah, the commit hash got changed when the PR was merged -.-

@seppestas
Copy link
Contributor Author

Should be fixed now. Maybe it's a good idea to tag the commits in test/attributes and using the tag in the tests. This way the person merging PRs could tag the correct commit when a PR is merged. IMO this would also make the test appear a bit clearer, especially if the tags get namespaced to the branch, e.g test/attributes/override_detectable.

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 19, 2017

I believe that's what "lightweight" Git tags are for (as opposed to the annotated ones we use daily). ;-)

See git-tag(1).

@seppestas
Copy link
Contributor Author

Yeey, all green now ^^

I don't think the type of tag matters (though using a lightweight tag does make sense for this). It might even make more sense to use branches, this would make it possible to update the commit required for the tests without having to update the tests.

Either way, someone with merge access would have to create a tag or (update a) branch after a PR is merged into test/attributes for this to work. Also, I'm not sure if it's possible to do a git fetch using a wildcard in the branch name. Apparently this requires configuration of the refspec.

@seppestas
Copy link
Contributor Author

Hmm, I should probably add a test to see if detectable actually causes include_in_language_stats to be changed...

@Alhadis
Copy link
Collaborator

Alhadis commented Sep 19, 2017

Erh no, that's precisely what lightweight tags are for... keeping branches in-sync isn't just a pain, it also takes an ungodly amount of time to switch between for a project of Linguist's size.

Then, of course, there's the racket caused by submodules:

λ GitHub-Linguist (master): time git checkout origin/puppet-fix
warning: unable to rmdir vendor/grammars/Sublime-Pep8: Directory not empty
warning: unable to rmdir vendor/grammars/dartlang: Directory not empty
warning: unable to rmdir vendor/grammars/data-weave-tmLanguage: Directory not empty
warning: unable to rmdir vendor/grammars/language-closure-templates: Directory not empty
warning: unable to rmdir vendor/grammars/language-pan: Directory not empty
warning: unable to rmdir vendor/grammars/language-pcb: Directory not empty
warning: unable to rmdir vendor/grammars/language-webassembly: Directory not empty

M	vendor/CodeMirror
M	vendor/grammars/Elm
M	vendor/grammars/Handlebars
M	vendor/grammars/LiveScript.tmbundle
M	vendor/grammars/MagicPython
M	vendor/grammars/Modelica
M	vendor/grammars/NimLime
M	vendor/grammars/Sublime-SQF-Language
M	vendor/grammars/SublimeGDB
M	vendor/grammars/SublimePuppet
M	vendor/grammars/TLA
M	vendor/grammars/Terraform.tmLanguage
M	vendor/grammars/antlr.tmbundle
M	vendor/grammars/api-blueprint-sublime-plugin
M	vendor/grammars/atom-language-clean
M	vendor/grammars/atom-language-perl6
M	vendor/grammars/atom-language-rust
M	vendor/grammars/c.tmbundle
M	vendor/grammars/capnproto.tmbundle
M	vendor/grammars/chapel-tmbundle
M	vendor/grammars/cucumber-tmbundle
M	vendor/grammars/cython
M	vendor/grammars/d.tmbundle
Note: checking out 'origin/puppet-fix'.

real	0m2.866s
user	0m0.451s
sys	0m1.758s

Yeah, no thanks. A lightweight tag is designed for exactly this scenario.

@seppestas
Copy link
Contributor Author

Hmm, I agree keeping all those branches in sync is a pain, using tags is probably easier. Why specifically lightweight tags though? Just because there's not much point in using annotated tags or is there a reason lightweight tags must be used?

@seppestas
Copy link
Contributor Author

Finally gotten around to rebasing my branch onto master. Took me a longer than expected since I got confused because the tests were failing. Seems like running bundle exec rake samples fixed everything.

@lildude
Copy link
Member

lildude commented Jan 12, 2018

👍 Can you please add a little section for the new attribute like we've got for all the others, for example like the documentation attribute, to explain the usage and purpose.

README.md Outdated
@@ -164,6 +164,21 @@ $ cat .gitattributes
Api.elm linguist-generated=true
```

#### Detectable

Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this paragraph; it adds too much context for the introduction of the change rather than concentrating on the purpose of the attribute. The next paragraph, with a few tweaks, is probably sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I added this paragraph by accident... I just copy-pasted it to see what I wrote in my original issue, forgot to remove it and committed it (this is what you get when you start working on something one evening and only finish up quickly the next day) 😳

README.md Outdated

Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...

By default only programming languages are "detected", i.e counted towards a repository's language statistics. In some repositories it might be preferable to include some other files in the language statistics. In other cases source files that are not vendored, documentation or generated code might want to be be excluded from the language statistics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of this paragraph is good though it feels a little "off" to me, though I can't put my finger on why just yet. I'll need to mull it a little bit.

@Alhadis you're the better wordsmith, you may have some ideas on this too.

Copy link
Member

@lildude lildude left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wooohooo!!! We're finally there!! 🎉

This PR is looking very nice and I think many people will come to appreciate this change. Thank-you so much for your patience and dealing with all the back-and-forth. We really appreciate it.

I'm approving this now but it can't be merged until @pchaigno or @Alhadis have given it their 👍 too.

Once done, I'll start work on the next release of Linguist and hope to have this live on GitHub.com by the end of the month.

@seppestas
Copy link
Contributor Author

Great, thanks! It's been quite a journey, but it was also educative ^^.

@lildude lildude merged commit 8da6ddf into github-linguist:master Jan 23, 2018
@seppestas seppestas deleted the feature-detectable-languages branch January 23, 2018 13:11
@seppestas
Copy link
Contributor Author

🎉 Thanks!

@lildude lildude mentioned this pull request Jan 26, 2018
@lildude
Copy link
Member

lildude commented Jan 26, 2018

Linguist v6.0.0 is now live on GitHub.com. 🎉

@valerionew
Copy link

Thank you guys, you did an amazing job ❤

@whymarrh
Copy link

Hey, I just wanted to chime in and say thanks for this change! 🎉 ❤️ 🌟

(I had opened #3940 back in December and this allows me to have Protocol Buffer repositories again!)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants