Override languages being included by language statistics #3807

seppestas · 2017-08-31T16:04:54Z

Adds a way to override ~~languages and~~ blobs being included in the language statistics.

This improves support for non-code languages (#3805) by making it possible ~~to flag languages as being detectable in the languages.yml file. It is also possible~~ to override the detectable flag for specific blobs in the repo using the .gitattributes file.

Note: the tests require #3806 to be merged in order to work properly.

seppestas · 2017-08-31T16:07:24Z

/cc @Alhadis @lildude

seppestas · 2017-08-31T16:09:40Z

Note: Travis Build 8308 will probably fail because of the weird way this library tests .gitattributes.

valerionew · 2017-08-31T16:33:15Z

This would also fix #3228 and probably many more past issues. For what i could read, this feature has been asked for a long time.
Let's hope that Github's staff and the contributors will agree with these changes!

seppestas · 2017-09-06T10:15:42Z

So... any thoughts?

While I think this is the most flexible and forward looking approach, I understand it's also has the most impact on the rest on the Linguist system compared to the other approaches I described in #3805.

As stated by @5N44P this approach allows users that want XML, SVG, doc, ... files to represented in the language statistics of a project, like the cases stated in #3159 and #3228, to do so.

Note that the override in the .gitattributes works on a blob level. While flexible and simple, this might not be the most user-friendly approach. Having the option to override what languages are "detectable" in the .gitattributes file might be a more consistent solution, however this is harder to combine with the "sensible defaults" in the languages.yml file.

lildude · 2017-09-07T08:45:59Z

I'm not 100% sure this is the right approach to solve this problem as this paves the way for a million more PRs asking for similar functionality from other non-programming languages which don't show up in the colourful language bar. I'm also not sure of the impact allowing such an override will have on the performance of Linguist on GitHub.com - it's already a pretty hungry beast and I imagine opening the door to overriding things like XML & SQL will lead to heeeeawg problems.

Accordingly, I've deferred the judgment on how best to approach this to the team within GitHub that "owns" the Linguist product (I'm not part of that team) as they have a much clearer idea of the direction they'd like to take such things, in conjunction with future GitHub.com development plans, in the long run.

seppestas · 2017-09-07T09:42:15Z

I'm not 100% sure this is the right approach to solve this problem as this paves the way for a million more PRs asking for similar functionality from other non-programming languages which don't show up in the colourful language bar.

Why is this not a good thing? Having people indicate what they expect their projects to be recognized as by Github is an awesome way to handle this IMO. Anyhow, I tried to find some other languages where it would make sense to have them "detectable" by default, but it wasn't so easy to find those since most languages in GH are either already programming languages or just plain data. The only languages I could think of (like music sheet files and music mastering settings) are not currently popular / available on Github. I think the amount of requests for other languages to be detectable by default will not be that high. The only problem might be people wanting to recognise generic files (like XML or SVG), for this reason I think having the option to override the detectable flag makes sense.

I'm also not sure of the impact allowing such an override will have on the performance of Linguist on GitHub.com

I don't think the detectable field for languages I added will have a big impact on performance, but the way the override works might since it uses file globbing. Having an override on a language level (or no override at all) would probably make it more performant. However most overrides in the .gitattributes file currently work on a file / blob level, so I felt the current approach was the most consistent.

Should I separate the addition of the "detectable" language flag (0f7c0df5c4683a134c8e0794d672f1aef4479aee) and the override feature into their own PR?

lildude · 2017-09-07T09:46:02Z

Should I separate the addition of the "detectable" language flag (0f7c0df) and the override feature into their own PR?

Nah, you can leave things as-is and we'll let my colleagues decide how best to approach this and advise further.

lildude

I've discussed this PR with team within GitHub and we think implementing the linguist-detectable .gitattributes option is a good compromise to allow you to bring back the KiCad language stats.

However, we think setting a global detectable attribute within languages.yml is something that could pose a problem in the future so we'd rather hold off on that part for the moment. If we need to, we can always come back to the languages.yml side of things in future.

Can you please update your PR to remove the changes to languages.yml and remove the functionality that detects the detectable attribute within that file.

lildude · 2017-09-09T02:17:15Z

lib/linguist/language.rb

@@ -480,6 +482,15 @@ def searchable?
      @searchable
    end

+    # Public: Is it detectable?
+    #
+    # Detectable languages will by include in the language stats.


s/by include/be included/

Alhadis · 2017-09-15T08:18:53Z

Personally, I believe giving users the option of setting their project's language classification a.l.a BitBucket is the ideal solution. It needn't be mandatory; by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner. This needn't affect the stats in the language breakdown, of course... but it would offer an intuitive way to affect how their projects are being advertised to other developers.

Just my 2 cents. 👍

seppestas · 2017-09-15T09:36:43Z

I would really like to have languages like KiCad and Eagle design files and other languages that can be considered primary to a repository detectable by default.

Most users of these systems are not that intimately familiar with the workings of git and Github, let alone the role of Linguist in language detection. These users expect their repositories to be recognised correctly, and when this is not the case this causes confusion, and in a lot of cases frustration.

Best case this will result in a bunch of people creating issues on Linguist asking why their repo is being mis-identified, with a whole lot of "you should add a linguist-detectable line in your .gitattributes file" answers. Worst case people never find out about Linguist and/or leave things as-is, negatively impacting discoverability of their projects on Github.

I'll update this PR to only add the .gitattributes feature, but I would like to keep #3805 open until a solution is added that allows non-code languages to be included in the language statistics by default. Could #3806 already be merged so the tests pass when I update this PR?

seppestas · 2017-09-15T09:45:41Z

Personally, I believe giving users the option of setting their project's language classification a.l.a BitBucket is the ideal solution.

While I agree this would be more intuitive for people that are not familiar with the .gitattributes file, I'm not a big fan of this solution because:

In the case of hardware projects repositories often contain a combination of hardware design files and firmware source files.
It's not automatic, meaning people can overlook / ignore this.
The language stats bar just looks cool and adds an easy way to navigate to different types of source files in the repo.

Alhadis · 2017-09-15T10:03:12Z

It's not automatic, meaning people can overlook / ignore this.

... erh, it literally says in the second sentence that the proposed default would be automatic:

It needn't be mandatory; by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner.

As for this:

negatively impacting discoverability of their projects on Github.

There is a topics feature, as I said. Furthermore, I've found that to be much more reliable than searching repositories by language. Searching for repositories that've been identified as PostScript has, more often than not, given me a bunch of misclassified projects. Same for Turing and various lesser-known assembler languages. Searching for something that's been explicitly tagged by humans has proven more useful than the results of a Bayesian classifier.

Alhadis · 2017-09-15T10:07:12Z

The language stats bar just looks cool

I agree, but don't get too attached to what you see in GitHub's UI. Design elements are subject to change without notice, and believe me, it often happens when you least expect it. 😜 If GitHub decide to do away with the language bar someday, for reasons only known to internally, well...

seppestas · 2017-09-18T09:53:32Z

by default, the classification would be based on the codebase's language usage (as it is now), until manually set by the owner

This would still require proper support for non-code languages in order to be recognised by Linguist though.

Having the option to manually change the language a project is classified as in a way that is more user-friendly than .gitattribute files would already be a big improvement. It might be interesting to tie this in to Github's topic feature. Anyhow, these are things that GH themselves need to implement, since it can't be done by Linguist...

lildude · 2017-09-18T10:17:13Z

Having the option to manually change the language a project is classified as in a way that is more user-friendly than .gitattribute files would already be a big improvement.

Please don't confuse the language bar showing the breakdown of languages found within a repository with a specific repository language. GitHub has no concept of a repository language and never has as I pointed out in #3805 (comment). I'm repeating this as I keep getting the feeling that you're thinking the language stats bar is an indicator of a repository's language rather than a breakdown of the languages within it.

The .gitattributes file is not for setting a repository language and nor should it considered a method of doing so either. The Linguist-specific options are meant as a tool to help Linguist identify the languages within a repo when it doesn't quite get it right; it's not perfect after all. If adjusting the content just so happens to result in a shift in the proportion of languages such that your favourite language is now the predominant language, then great, but there is still no concept of a repository language and using the .gitattributes file to try and set one is really missing the point.

If you really want the ability to set a repository language and you feel topics are't the right thing, you'll need to contact GitHub support and log a feature request.

seppestas · 2017-09-18T14:25:55Z

I'm repeating this as I keep getting the feeling that you're thinking the language stats bar is an indicator of a repository's language rather than a breakdown of the languages within it.

I'm aware of this. The problem I'm trying to address is it not showing the languages I'm / (Ki)CAD users / maintainers of none code repos in general are interested in, and only the remaining, typically miss-identified files being exposed in the language stats causing confusion.

Anyhow, this PR is an implementation of one of my proposals on how this can be fixed. I think we should keep the higher level discussion on how Linguist could / should handle non-code languages in #3805.

rafer · 2017-09-18T23:36:30Z

Hi All, just wanted to provide some further context for the decisions that @lildude outlined. Basically, we're trying to get some of the functionality you want (the ability for your projects to have KiCad files show up in the language bar), without making significant changes to linguist (adding detectable to the languages.yml), because we'd need to further investigate the impact of this on internal github systems, and also it could open up a common vector for disputes (i.e. some folks want XML files to be detectable by default and some don't). Long story short we're trying to be cautious and make sure that we've fully thought out the repercussions before acting. That's why I think the project-level overrides solution is a good interim step.

It alleviates at least some of the pain y'all are currently experiencing.
We can 👍 this change without a lengthy investigation on the GitHub side.
We can continue to collect data. If we do indeed get a ton of "you should add a linguist-detectable line in your .gitattributes file" type issues as @seppestas suggests, that will be good to know and can help inform future decisions.

Anyway, looking forward to merging this in once @lildude's review comments are addressed.

seppestas · 2017-09-19T08:24:49Z

Ok, I have a commit ready where I reverted the changes of 0f7c0df5c4683a134c8e0794d672f1aef4479aee and made some required adjustments to make this work with the .gitattributes related code, as well as fixing a problem with other blob types.

Having #3806 merged before pushing this commit would make the tests pass (if everything goes well). It's a bit of a chicken and egg story though. Should I go ahead and push my commit to update this PR and have you rerun the test once #3806 is merged, or do you want to merge #3806 first so the tests pass as soon as I update this PR?

seppestas · 2017-09-19T19:03:06Z

Huh, looks like the required commit on the test branch (7af5303) is still not found... Am I missing something?

seppestas · 2017-09-19T19:08:16Z

Ah, the commit hash got changed when the PR was merged -.-

seppestas · 2017-09-19T19:20:00Z

Should be fixed now. Maybe it's a good idea to tag the commits in test/attributes and using the tag in the tests. This way the person merging PRs could tag the correct commit when a PR is merged. IMO this would also make the test appear a bit clearer, especially if the tags get namespaced to the branch, e.g test/attributes/override_detectable.

Alhadis · 2017-09-19T19:23:47Z

I believe that's what "lightweight" Git tags are for (as opposed to the annotated ones we use daily). ;-)

See git-tag(1).

seppestas · 2017-09-19T19:36:58Z

Yeey, all green now ^^

I don't think the type of tag matters (though using a lightweight tag does make sense for this). It might even make more sense to use branches, this would make it possible to update the commit required for the tests without having to update the tests.

Either way, someone with merge access would have to create a tag or (update a) branch after a PR is merged into test/attributes for this to work. Also, I'm not sure if it's possible to do a git fetch using a wildcard in the branch name. Apparently this requires configuration of the refspec.

seppestas · 2017-09-19T19:40:11Z

Hmm, I should probably add a test to see if detectable actually causes include_in_language_stats to be changed...

Alhadis · 2017-09-19T19:46:29Z

Erh no, that's precisely what lightweight tags are for... keeping branches in-sync isn't just a pain, it also takes an ungodly amount of time to switch between for a project of Linguist's size.

Then, of course, there's the racket caused by submodules:

λ GitHub-Linguist (master): time git checkout origin/puppet-fix
warning: unable to rmdir vendor/grammars/Sublime-Pep8: Directory not empty
warning: unable to rmdir vendor/grammars/dartlang: Directory not empty
warning: unable to rmdir vendor/grammars/data-weave-tmLanguage: Directory not empty
warning: unable to rmdir vendor/grammars/language-closure-templates: Directory not empty
warning: unable to rmdir vendor/grammars/language-pan: Directory not empty
warning: unable to rmdir vendor/grammars/language-pcb: Directory not empty
warning: unable to rmdir vendor/grammars/language-webassembly: Directory not empty
…
M	vendor/CodeMirror
M	vendor/grammars/Elm
M	vendor/grammars/Handlebars
M	vendor/grammars/LiveScript.tmbundle
M	vendor/grammars/MagicPython
M	vendor/grammars/Modelica
M	vendor/grammars/NimLime
M	vendor/grammars/Sublime-SQF-Language
M	vendor/grammars/SublimeGDB
M	vendor/grammars/SublimePuppet
M	vendor/grammars/TLA
M	vendor/grammars/Terraform.tmLanguage
M	vendor/grammars/antlr.tmbundle
M	vendor/grammars/api-blueprint-sublime-plugin
M	vendor/grammars/atom-language-clean
M	vendor/grammars/atom-language-perl6
M	vendor/grammars/atom-language-rust
M	vendor/grammars/c.tmbundle
M	vendor/grammars/capnproto.tmbundle
M	vendor/grammars/chapel-tmbundle
M	vendor/grammars/cucumber-tmbundle
M	vendor/grammars/cython
M	vendor/grammars/d.tmbundle
Note: checking out 'origin/puppet-fix'.

real	0m2.866s
user	0m0.451s
sys	0m1.758s

Yeah, no thanks. A lightweight tag is designed for exactly this scenario.

seppestas · 2017-09-19T21:28:55Z

Hmm, I agree keeping all those branches in sync is a pain, using tags is probably easier. Why specifically lightweight tags though? Just because there's not much point in using annotated tags or is there a reason lightweight tags must be used?

This key allows to override the language being included in the language stats of a repository.

Reverts changes in 0f7c0df5.

PR github-linguist#3806 changed the commit hash. The original commit was not actually merged into the test/attributes branch.

seppestas · 2018-01-12T15:13:38Z

Finally gotten around to rebasing my branch onto master. Took me a longer than expected since I got confused because the tests were failing. Seems like running bundle exec rake samples fixed everything.

lildude · 2018-01-12T17:32:33Z

👍 Can you please add a little section for the new attribute like we've got for all the others, for example like the documentation attribute, to explain the usage and purpose.

lildude · 2018-01-16T16:41:35Z

README.md

@@ -164,6 +164,21 @@ $ cat .gitattributes
 Api.elm linguist-generated=true
 ```

+#### Detectable
+
+Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...


I don't think we need this paragraph; it adds too much context for the introduction of the change rather than concentrating on the purpose of the attribute. The next paragraph, with a few tweaks, is probably sufficient.

Oops, I added this paragraph by accident... I just copy-pasted it to see what I wrote in my original issue, forgot to remove it and committed it (this is what you get when you start working on something one evening and only finish up quickly the next day) 😳

lildude · 2018-01-16T16:49:08Z

README.md

+
+Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...
+
+By default only programming languages are "detected", i.e counted towards a repository's language statistics. In some repositories it might be preferable to include some other files in the language statistics. In other cases source files that are not vendored, documentation or generated code might want to be be excluded from the language statistics.


The content of this paragraph is good though it feels a little "off" to me, though I can't put my finger on why just yet. I'll need to mull it a little bit.

@Alhadis you're the better wordsmith, you may have some ideas on this too.

lildude

Wooohooo!!! We're finally there!! 🎉

This PR is looking very nice and I think many people will come to appreciate this change. Thank-you so much for your patience and dealing with all the back-and-forth. We really appreciate it.

I'm approving this now but it can't be merged until @pchaigno or @Alhadis have given it their 👍 too.

Once done, I'll start work on the next release of Linguist and hope to have this live on GitHub.com by the end of the month.

seppestas · 2018-01-18T09:59:36Z

Great, thanks! It's been quite a journey, but it was also educative ^^.

seppestas · 2018-01-23T13:12:10Z

🎉 Thanks!

lildude · 2018-01-26T13:28:27Z

Linguist v6.0.0 is now live on GitHub.com. 🎉

valerionew · 2018-01-26T13:37:27Z

Thank you guys, you did an amazing job ❤

whymarrh · 2018-05-13T18:20:03Z

Hey, I just wanted to chime in and say thanks for this change! 🎉 ❤️ 🌟

(I had opened #3940 back in December and this allows me to have Protocol Buffer repositories again!)

seppestas changed the title ~~Feature detectable languages~~ Override languages being included by language statistics Aug 31, 2017

seppestas mentioned this pull request Aug 31, 2017

Add some linguist-detectable attributes #3806

Merged

valerionew mentioned this pull request Sep 7, 2017

KiCAD project now being labelled as SourcePawn. #3812

Closed

lildude requested changes Sep 15, 2017

View reviewed changes

seppestas mentioned this pull request Sep 19, 2017

Improve support for non-code languages #3805

Closed

lildude mentioned this pull request Jan 10, 2018

Can't set the language - Web Ontology Language #3980

Closed

Seppe Stas added 8 commits January 12, 2018 15:57

Add detectable key to languages

80cbaad

This key allows to override the language being included in the language stats of a repository.

Make detectable override-able using .gitattributes

f9a8f02

Mention linguist-detectable in README

2c62f86

Remove detectable key from languages

a4e3c11

Reverts changes in 0f7c0df5.

Update commit hash to the one that was merged

f44b663

PR github-linguist#3806 changed the commit hash. The original commit was not actually merged into the test/attributes branch.

Fix check to ensure detectable is defined

bd5bd00

Add include in language stats tests when detectable set

28f2c7f

Ignore detectable when vendored, documentation or overridden

794af60

seppestas force-pushed the feature-detectable-languages branch from 2a8c35d to 794af60 Compare January 12, 2018 15:09

Add documentation on detectable override in README

136e914

lildude reviewed Jan 16, 2018

View reviewed changes

Improve documentation on detectable override in README

5be62f6

lildude approved these changes Jan 18, 2018

View reviewed changes

lildude requested review from pchaigno and Alhadis January 19, 2018 10:36

Alhadis approved these changes Jan 19, 2018

View reviewed changes

lildude merged commit 8da6ddf into github-linguist:master Jan 23, 2018

seppestas deleted the feature-detectable-languages branch January 23, 2018 13:11

lildude mentioned this pull request Jan 26, 2018

Release v6.0.0 #4002

Merged

This was referenced Feb 26, 2023

folder ".override/" in gitattributes gitattributes/gitattributes#176

Closed

add file TextFiles.gitattributes and add folder ".override/" gitattributes/gitattributes#177

Closed

github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024


		Currently there are 4 different language types: data, programming, markup, prose or nill. When a language is not considered a programming, markup or prose language it is typically classified as data. Recently there have been a lot of changes to languages that have been incorrectly classified as programming or markup languages. E.g KiCad files used to be classified as programming language (PR #3743), Eagle files used to be classified as markup (PR #3751), ...

		By default only programming languages are "detected", i.e counted towards a repository's language statistics. In some repositories it might be preferable to include some other files in the language statistics. In other cases source files that are not vendored, documentation or generated code might want to be be excluded from the language statistics.

Override languages being included by language statistics #3807

Override languages being included by language statistics #3807

Conversation

seppestas commented Aug 31, 2017 • edited Loading

seppestas commented Aug 31, 2017

seppestas commented Aug 31, 2017

valerionew commented Aug 31, 2017

seppestas commented Sep 6, 2017

lildude commented Sep 7, 2017

seppestas commented Sep 7, 2017 • edited Loading

lildude commented Sep 7, 2017

lildude left a comment

Choose a reason for hiding this comment

lildude Sep 9, 2017

Choose a reason for hiding this comment

Alhadis commented Sep 15, 2017 • edited Loading

seppestas commented Sep 15, 2017

seppestas commented Sep 15, 2017

Alhadis commented Sep 15, 2017

Alhadis commented Sep 15, 2017

seppestas commented Sep 18, 2017

lildude commented Sep 18, 2017

seppestas commented Sep 18, 2017 • edited Loading

rafer commented Sep 18, 2017

seppestas commented Sep 19, 2017

seppestas commented Sep 19, 2017

seppestas commented Sep 19, 2017 • edited Loading

seppestas commented Sep 19, 2017

Alhadis commented Sep 19, 2017

seppestas commented Sep 19, 2017

seppestas commented Sep 19, 2017

Alhadis commented Sep 19, 2017

seppestas commented Sep 19, 2017

seppestas commented Jan 12, 2018

lildude commented Jan 12, 2018

lildude Jan 16, 2018

Choose a reason for hiding this comment

seppestas Jan 16, 2018

Choose a reason for hiding this comment

lildude Jan 16, 2018

Choose a reason for hiding this comment

lildude left a comment

Choose a reason for hiding this comment

seppestas commented Jan 18, 2018

seppestas commented Jan 23, 2018

lildude commented Jan 26, 2018

valerionew commented Jan 26, 2018

whymarrh commented May 13, 2018

seppestas commented Aug 31, 2017 •

edited

Loading

seppestas commented Sep 7, 2017 •

edited

Loading

Alhadis commented Sep 15, 2017 •

edited

Loading

seppestas commented Sep 18, 2017 •

edited

Loading

seppestas commented Sep 19, 2017 •

edited

Loading