-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Override languages being included by language statistics #3807
Override languages being included by language statistics #3807
Conversation
Note: Travis Build 8308 will probably fail because of the weird way this library tests .gitattributes. |
This would also fix #3228 and probably many more past issues. For what i could read, this feature has been asked for a long time. |
So... any thoughts? While I think this is the most flexible and forward looking approach, I understand it's also has the most impact on the rest on the Linguist system compared to the other approaches I described in #3805. As stated by @5N44P this approach allows users that want XML, SVG, doc, ... files to represented in the language statistics of a project, like the cases stated in #3159 and #3228, to do so. Note that the override in the .gitattributes works on a blob level. While flexible and simple, this might not be the most user-friendly approach. Having the option to override what languages are "detectable" in the .gitattributes file might be a more consistent solution, however this is harder to combine with the "sensible defaults" in the languages.yml file. |
I'm not 100% sure this is the right approach to solve this problem as this paves the way for a million more PRs asking for similar functionality from other non-programming languages which don't show up in the colourful language bar. I'm also not sure of the impact allowing such an override will have on the performance of Linguist on GitHub.com - it's already a pretty hungry beast and I imagine opening the door to overriding things like XML & SQL will lead to heeeeawg problems. Accordingly, I've deferred the judgment on how best to approach this to the team within GitHub that "owns" the Linguist product (I'm not part of that team) as they have a much clearer idea of the direction they'd like to take such things, in conjunction with future GitHub.com development plans, in the long run. |
Why is this not a good thing? Having people indicate what they expect their projects to be recognized as by Github is an awesome way to handle this IMO. Anyhow, I tried to find some other languages where it would make sense to have them "detectable" by default, but it wasn't so easy to find those since most languages in GH are either already programming languages or just plain data. The only languages I could think of (like music sheet files and music mastering settings) are not currently popular / available on Github. I think the amount of requests for other languages to be detectable by default will not be that high. The only problem might be people wanting to recognise generic files (like XML or SVG), for this reason I think having the option to override the detectable flag makes sense.
I don't think the detectable field for languages I added will have a big impact on performance, but the way the override works might since it uses file globbing. Having an override on a language level (or no override at all) would probably make it more performant. However most overrides in the .gitattributes file currently work on a file / blob level, so I felt the current approach was the most consistent. Should I separate the addition of the "detectable" language flag (0f7c0df5c4683a134c8e0794d672f1aef4479aee) and the override feature into their own PR? |
Nah, you can leave things as-is and we'll let my colleagues decide how best to approach this and advise further. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've discussed this PR with team within GitHub and we think implementing the linguist-detectable
.gitattributes
option is a good compromise to allow you to bring back the KiCad language stats.
However, we think setting a global detectable
attribute within languages.yml
is something that could pose a problem in the future so we'd rather hold off on that part for the moment. If we need to, we can always come back to the languages.yml
side of things in future.
Can you please update your PR to remove the changes to languages.yml
and remove the functionality that detects the detectable
attribute within that file.
lib/linguist/language.rb
Outdated
@@ -480,6 +482,15 @@ def searchable? | |||
@searchable | |||
end | |||
|
|||
# Public: Is it detectable? | |||
# | |||
# Detectable languages will by include in the language stats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/by include/be included/
Personally, I believe giving users the option of setting their project's language classification a.l.a BitBucket is the ideal solution. It needn't be mandatory; by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner. This needn't affect the stats in the language breakdown, of course... but it would offer an intuitive way to affect how their projects are being advertised to other developers. Just my 2 cents. 👍 |
I would really like to have languages like KiCad and Eagle design files and other languages that can be considered primary to a repository detectable by default. Most users of these systems are not that intimately familiar with the workings of git and Github, let alone the role of Linguist in language detection. These users expect their repositories to be recognised correctly, and when this is not the case this causes confusion, and in a lot of cases frustration. Best case this will result in a bunch of people creating issues on Linguist asking why their repo is being mis-identified, with a whole lot of "you should add a I'll update this PR to only add the .gitattributes feature, but I would like to keep #3805 open until a solution is added that allows non-code languages to be included in the language statistics by default. Could #3806 already be merged so the tests pass when I update this PR? |
While I agree this would be more intuitive for people that are not familiar with the .gitattributes file, I'm not a big fan of this solution because:
|
... erh, it literally says in the second sentence that the proposed default would be automatic:
As for this:
There is a topics feature, as I said. Furthermore, I've found that to be much more reliable than searching repositories by language. Searching for repositories that've been identified as PostScript has, more often than not, given me a bunch of misclassified projects. Same for Turing and various lesser-known assembler languages. Searching for something that's been explicitly tagged by humans has proven more useful than the results of a Bayesian classifier. |
I agree, but don't get too attached to what you see in GitHub's UI. Design elements are subject to change without notice, and believe me, it often happens when you least expect it. 😜 If GitHub decide to do away with the language bar someday, for reasons only known to internally, well... |
This would still require proper support for non-code languages in order to be recognised by Linguist though. Having the option to manually change the language a project is classified as in a way that is more user-friendly than .gitattribute files would already be a big improvement. It might be interesting to tie this in to Github's topic feature. Anyhow, these are things that GH themselves need to implement, since it can't be done by Linguist... |
Please don't confuse the language bar showing the breakdown of languages found within a repository with a specific repository language. GitHub has no concept of a repository language and never has as I pointed out in #3805 (comment). I'm repeating this as I keep getting the feeling that you're thinking the language stats bar is an indicator of a repository's language rather than a breakdown of the languages within it. The If you really want the ability to set a repository language and you feel topics are't the right thing, you'll need to contact GitHub support and log a feature request. |
I'm aware of this. The problem I'm trying to address is it not showing the languages I'm / (Ki)CAD users / maintainers of none code repos in general are interested in, and only the remaining, typically miss-identified files being exposed in the language stats causing confusion. Anyhow, this PR is an implementation of one of my proposals on how this can be fixed. I think we should keep the higher level discussion on how Linguist could / should handle non-code languages in #3805. |
Hi All, just wanted to provide some further context for the decisions that @lildude outlined. Basically, we're trying to get some of the functionality you want (the ability for your projects to have KiCad files show up in the language bar), without making significant changes to linguist (adding detectable to the
Anyway, looking forward to merging this in once @lildude's review comments are addressed. |
Ok, I have a commit ready where I reverted the changes of 0f7c0df5c4683a134c8e0794d672f1aef4479aee and made some required adjustments to make this work with the .gitattributes related code, as well as fixing a problem with other blob types. Having #3806 merged before pushing this commit would make the tests pass (if everything goes well). It's a bit of a chicken and egg story though. Should I go ahead and push my commit to update this PR and have you rerun the test once #3806 is merged, or do you want to merge #3806 first so the tests pass as soon as I update this PR? |
Huh, looks like the required commit on the test branch (7af5303) is still not found... Am I missing something? |
Ah, the commit hash got changed when the PR was merged -.- |
Should be fixed now. Maybe it's a good idea to tag the commits in test/attributes and using the tag in the tests. This way the person merging PRs could tag the correct commit when a PR is merged. IMO this would also make the test appear a bit clearer, especially if the tags get namespaced to the branch, e.g |
I believe that's what "lightweight" Git tags are for (as opposed to the annotated ones we use daily). ;-) See git-tag(1). |
Yeey, all green now ^^ I don't think the type of tag matters (though using a lightweight tag does make sense for this). It might even make more sense to use branches, this would make it possible to update the commit required for the tests without having to update the tests. Either way, someone with merge access would have to create a tag or (update a) branch after a PR is merged into test/attributes for this to work. Also, I'm not sure if it's possible to do a git fetch using a wildcard in the branch name. Apparently this requires configuration of the refspec. |
Hmm, I should probably add a test to see if |
Erh no, that's precisely what lightweight tags are for... keeping branches in-sync isn't just a pain, it also takes an ungodly amount of time to switch between for a project of Linguist's size. Then, of course, there's the racket caused by submodules: λ GitHub-Linguist (master): time git checkout origin/puppet-fix
warning: unable to rmdir vendor/grammars/Sublime-Pep8: Directory not empty
warning: unable to rmdir vendor/grammars/dartlang: Directory not empty
warning: unable to rmdir vendor/grammars/data-weave-tmLanguage: Directory not empty
warning: unable to rmdir vendor/grammars/language-closure-templates: Directory not empty
warning: unable to rmdir vendor/grammars/language-pan: Directory not empty
warning: unable to rmdir vendor/grammars/language-pcb: Directory not empty
warning: unable to rmdir vendor/grammars/language-webassembly: Directory not empty
…
M vendor/CodeMirror
M vendor/grammars/Elm
M vendor/grammars/Handlebars
M vendor/grammars/LiveScript.tmbundle
M vendor/grammars/MagicPython
M vendor/grammars/Modelica
M vendor/grammars/NimLime
M vendor/grammars/Sublime-SQF-Language
M vendor/grammars/SublimeGDB
M vendor/grammars/SublimePuppet
M vendor/grammars/TLA
M vendor/grammars/Terraform.tmLanguage
M vendor/grammars/antlr.tmbundle
M vendor/grammars/api-blueprint-sublime-plugin
M vendor/grammars/atom-language-clean
M vendor/grammars/atom-language-perl6
M vendor/grammars/atom-language-rust
M vendor/grammars/c.tmbundle
M vendor/grammars/capnproto.tmbundle
M vendor/grammars/chapel-tmbundle
M vendor/grammars/cucumber-tmbundle
M vendor/grammars/cython
M vendor/grammars/d.tmbundle
Note: checking out 'origin/puppet-fix'.
real 0m2.866s
user 0m0.451s
sys 0m1.758s Yeah, no thanks. A lightweight tag is designed for exactly this scenario. |
Hmm, I agree keeping all those branches in sync is a pain, using tags is probably easier. Why specifically lightweight tags though? Just because there's not much point in using annotated tags or is there a reason lightweight tags must be used? |
This key allows to override the language being included in the language stats of a repository.
Reverts changes in 0f7c0df5.
PR github-linguist#3806 changed the commit hash. The original commit was not actually merged into the test/attributes branch.
2a8c35d
to
794af60
Compare
Finally gotten around to rebasing my branch onto master. Took me a longer than expected since I got confused because the tests were failing. Seems like running |
👍 Can you please add a little section for the new attribute like we've got for all the others, for example like the documentation attribute, to explain the usage and purpose. |
README.md
Outdated
@@ -164,6 +164,21 @@ $ cat .gitattributes | |||
Api.elm linguist-generated=true | |||
``` | |||
|
|||
#### Detectable | |||
|
|||
Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this paragraph; it adds too much context for the introduction of the change rather than concentrating on the purpose of the attribute. The next paragraph, with a few tweaks, is probably sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I added this paragraph by accident... I just copy-pasted it to see what I wrote in my original issue, forgot to remove it and committed it (this is what you get when you start working on something one evening and only finish up quickly the next day) 😳
README.md
Outdated
|
||
Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ... | ||
|
||
By default only programming languages are "detected", i.e counted towards a repository's language statistics. In some repositories it might be preferable to include some other files in the language statistics. In other cases source files that are not vendored, documentation or generated code might want to be be excluded from the language statistics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content of this paragraph is good though it feels a little "off" to me, though I can't put my finger on why just yet. I'll need to mull it a little bit.
@Alhadis you're the better wordsmith, you may have some ideas on this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wooohooo!!! We're finally there!! 🎉
This PR is looking very nice and I think many people will come to appreciate this change. Thank-you so much for your patience and dealing with all the back-and-forth. We really appreciate it.
I'm approving this now but it can't be merged until @pchaigno or @Alhadis have given it their 👍 too.
Once done, I'll start work on the next release of Linguist and hope to have this live on GitHub.com by the end of the month.
Great, thanks! It's been quite a journey, but it was also educative ^^. |
🎉 Thanks! |
Linguist v6.0.0 is now live on GitHub.com. 🎉 |
Thank you guys, you did an amazing job ❤ |
Hey, I just wanted to chime in and say thanks for this change! 🎉 ❤️ 🌟 (I had opened #3940 back in December and this allows me to have Protocol Buffer repositories again!) |
Adds a way to override
languages andblobs being included in the language statistics.This improves support for non-code languages (#3805) by making it possible
to flag languages as beingto override the detectable flag for specific blobs in the repo using the .gitattributes file.detectable
in the languages.yml file. It is also possibleNote: the tests require #3806 to be merged in order to work properly.