Recommend using set/unset gitattributes (vs. true/false strings) #4624

cespare · 2019-08-25T01:55:27Z

Hi! I was going to file an issue, but the change I want is small (doc changes in README.md) so I just sent a PR for discussion.

For context, see the git documentation for gitattributes. There are four states for an attribute; "set", "unset", "set to a value", and "unspecified". All of linguist's special attributes except for linguist-language are boolean attributes. For these, linguist considers the option to be "on" if the attribute is set or set to the string "true". It considers the option to be "off" if the attribute is unset or set to the string "false".

That is, these are equivalent to linguist:

x.txt linguist-generated=true
x.txt linguist-generated

as are these:

x.txt linguist-generated=false
x.txt -linguist-generated

I found this confusing: linguist planted the notion that the strings "true" and "false" are special, so when I read

special value "true"

on the gitattributes man page I did not understand that those strings are not special to gitattributes, only linguist. (The gitattributes man page could probably be more clear too.)

It's obviously too late to change linguist's behavior, but I propose that we change the documentation to promote set/unset vs. "true"/"false". Set/unset is the functionality for boolean options built right into gitattributes; "true"/"false" is special to linguist. (I initially stumbled onto this wondering while looking at the gitattributes file at my company and wondering why binary is specified like that but linguist-generated=true uses "true".)

Furthermore, note that the examples for linguist-vendored and linguist-documentation in the README currently mix the set/unset and "true"/"false" styles:

special-vendored-path/* linguist-vendored
jquery.js linguist-vendored=false

This seems extra confusing.

Finally, I think the example on the help page Customizing how changed files appear on GitHub should be changed from

search/index.json linguist-generated=true

to

search/index.json linguist-generated

(This might not be the right place to request that change.)

lildude · 2019-08-27T14:22:39Z

This makes sense to me. I'm not sure what the historical reason for using the string "false" was though as it was made over 5 years ago. That said, we can't start suggesting this as it won't work for the negation thanks to:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L103-L106

That would need to be changed to account for the old behaviour and the correctly interpreted behaviour you're proposing. Additional tests would also need to be added too. Feel free to update this PR with those changes if you want.

(This might not be the right place to request that change.)

Technically, it isn't, but I can make the change once we've got Linguist working

cespare · 2019-08-27T14:40:23Z

That said, we can't start suggesting this as it won't work for the negation [...]

Can you clarify what doesn't work?

When I opened the issue, I wasn't able to test linguist behavior locally because I couldn't get it running after battling Ruby stuff for about 30 minutes. But I did test it on GitHub and setting/unsetting attributes does seem to work as expected (i.e., GitHub correctly interprets them the same way that git does).

Here's a demo: https://github.com/cespare/misc/pull/4/files

There, using -linguist-generated unsets the attribute no matter whether it was previously set with linguist-generated or linguist-generated=true and in all cases where that attribute is either false or unset, GitHub is showing the diff.

$ git check-attr --all -- *
a.md: linguist-generated: unset
a.txt: linguist-generated: unset
b.md: linguist-generated: false
b.txt: linguist-generated: false
c.md: linguist-generated: true
c.txt: linguist-generated: set

When I was trying to understand the code, it seemed that this work is done by the if attr = part here:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L49-L53

but then I couldn't figure out what the super call did (been a while since I used Ruby) so I switched to testing it empirically on GitHub.

lildude · 2019-08-27T15:15:27Z

Can you clarify what doesn't work?

Sure. The negation, eg -linguist-generated and -linguist-vendored has no effect on Linguist's behaviour when it comes to the actual language analysis. https://github.com/lildude/literate-octo-chainsaw is a repo I tested this on. There should be Go and Python in the list of languages.

The behaviour is confirmed locally too:

$ bundle exec bin/github-linguist ~/tmp/trash/literate-octo-chainsaw
100.00% Ruby
$

Here's a demo: https://github.com/cespare/misc/pull/4/files

There, using -linguist-generated unsets the attribute no matter whether it was previously set with linguist-generated or linguist-generated=true and in all cases where that attribute is either false or unset, GitHub is showing the diff.

You're not testing the Linguist functionality yet 😄. .gitattributes for language detection as used by Linguist only takes effect once the changes are committed to the default branch. What you're PR is showing is how the diffing functionality works which reads the .gitattributes itself in much the same way your CLI cmd does.

When I was trying to understand the code, it seemed that this work is done by the if attr = part here:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L49-L53

but then I couldn't figure out what the super call did (been a while since I used Ruby) so I switched to testing it empirically on GitHub.

If the linguist-generated attribute can't be found, regardless of the value, the assignment to attr will fail so we call to the "parent" generated? method, in this case: https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/generated.rb#L11-L13

... to then determine if Linguist should treat the file as generated or not when performing the language analysis.

lildude · 2019-08-27T15:29:01Z

🤔 I'm starting to question myself now I actually play with this directly 😁

cespare · 2019-08-27T15:38:03Z

What you're PR is showing is how the diffing functionality works which reads the .gitattributes itself in much the same way your CLI cmd does.

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

lildude · 2019-08-27T15:45:49Z

Oh, I see my mistake now.

If the linguist-generated attribute can't be found, regardless of the value, the assignment to attr will fail so we call to the "parent" generated? method

It's not regardless of the value... it needs to equate to true so a string assignment would result in boolean_attribute(attr) being called, but something that equates to false, which is the case for the negation attributes.

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

Yes. There are many places we use Linguist methods directly without running a full language analysis.

lildude · 2019-08-27T16:02:15Z

Yes. There are many places we use Linguist methods directly without running a full language analysis.

Oh yes, and I don't claim to have any knowledge of where all of these places are and why 😁

cespare · 2019-08-27T16:03:41Z

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

Yes. There are many places we use Linguist methods directly without running a full language analysis.

I'm not sure I follow. I was asking if GitHub is reading the linguist-generated attributes in my example demo code using non-linguist code, which would invalidate my experiment in https://github.com/cespare/misc/pull/4/files. Calling linguist methods is part of what I would consider linguist code.

You mentioned things like "language detection" and "full language analysis" a few times, but I'm not sure why they necessarily need to be involved. I used linguist-generated as my test attribute because it seemed easy to gauge the effect: if GitHub thinks that attribute is set, it should hide the diff. This seems separate from language detection to me.

It's not regardless of the value... it needs to equate to true so a string assignment would result in boolean_attribute(attr) being called, but something that equates to false, which is the case for the negation attributes.

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

lildude · 2019-08-27T17:27:29Z

I'm not sure I follow. I was asking if GitHub is reading the linguist-generated attributes in my example demo code using non-linguist code, which would invalidate my experiment in https://github.com/cespare/misc/pull/4/files. Calling linguist methods is part of what I would consider linguist code.

Sorry about the confusion. Yes, the diff code is taking advantage of some of the Linguist methods but it is reading the linguist-* gitattributes itself using a different method from that defined in lazy_blob.rb. It also provides its own overriding methods for generated?, vendored? and documentation?.

You mentioned things like "language detection" and "full language analysis" a few times, but I'm not sure why they necessarily need to be involved. I used linguist-generated as my test attribute because it seemed easy to gauge the effect: if GitHub thinks that attribute is set, it should hide the diff. This seems separate from language detection to me.

Yes it is, and it's not the problem I've highlighted. That's why I said your example is only showing how the diffing code behaves and not the language detection, which is really Linguist's primary function. I see my initial explanation didn't really make this clear. Sorry.

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

It does. I think the problem with the language analysis comes about because the - negation returns false and not "false", which nil effectively does too, so causes Linguist to go straight into determining if the file should be ignored or not completely ignoring the user's choice from the .gitattributes.

lildude · 2019-08-27T17:33:43Z

This quick change seems to do the trick for negation:

diff --git a/lib/linguist/lazy_blob.rb b/lib/linguist/lazy_blob.rb
index 389cc243..cf5a9294 100644
--- a/lib/linguist/lazy_blob.rb
+++ b/lib/linguist/lazy_blob.rb
@@ -38,7 +38,7 @@ module Linguist
     end
 
     def documentation?
-      if attr = git_attributes['linguist-documentation']
+      if attr = git_attributes['linguist-documentation'].to_s
         boolean_attribute(attr)
       else
         super
@@ -46,7 +46,7 @@ module Linguist
     end
 
     def generated?
-      if attr = git_attributes['linguist-generated']
+      if attr = git_attributes['linguist-generated'].to_s
         boolean_attribute(attr)
       else
         super
@@ -54,7 +54,7 @@ module Linguist
     end
 
     def vendored?
-      if attr = git_attributes['linguist-vendored']
+      if attr = git_attributes['linguist-vendored'].to_s
         return boolean_attribute(attr)
       else
         super
@@ -102,7 +102,7 @@ module Linguist
 
     # Returns true if the attribute is present and not the string "false".
     def boolean_attribute(attribute)
-      attribute != "false"
+      attribute.to_s.eql?('true') ? true : false
     end
 
     def load_blob!

... though I've not thoroughly tested this or thought this through too closely (it's the end of my day).

cespare · 2019-08-27T18:02:20Z

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

It does. I think the problem with the language analysis comes about because the - negation returns false and not "false", which nil effectively does too, so causes Linguist to go straight into determining if the file should be ignored or not completely ignoring the user's choice from the .gitattributes.

I was using "unset" in the specific sense of the gitattributes man page; that is, explicitly marked as unset via -foo. It sounds like you're saying that in this case, rugged returns the ruby value false. (But either nil or false will cause us to take the else super branch, as you're saying.)

I don't think your proposed changes really make sense to me. Strings, even empty strings, are always truthy in Ruby, so

if attr = git_attributes['linguist-generated'].to_s
  boolean_attribute(attr)
else
  super
end

will always take the first branch no matter what, correct?

I need to think about this more too. I'll try to get this ruby code working on my machine again so I can try stuff out locally...

stale · 2019-09-26T18:56:37Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

cespare · 2019-09-26T19:05:37Z

@lildude any thoughts about my last message?

lildude · 2019-10-03T14:02:00Z

I don't think your proposed changes really make sense to me. Strings, even empty strings, are always truthy in Ruby, so
if attr = git_attributes['linguist-generated'].to_s
  boolean_attribute(attr)
else
  super
end
will always take the first branch no matter what, correct?

Ah yes. Good point. As I said, I didn't put much thought or testing into my quick suggestion 😄 We're still going to need to come up with a solution for Linguist.

stale · 2019-11-02T14:54:59Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

stale · 2019-12-02T15:06:41Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

stale · 2020-01-01T16:02:50Z

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

lildude · 2020-01-23T09:20:56Z

Sorry about the delay in coming back to this. I've worked out the necessary code changes to that we can honour the -linguist-whatever method as detailed in the .gitattributes docs and you're proposing in this PR whilst still keeping our old linguist-whatever=false method.

PR coming 🔜 .

lildude · 2020-01-23T14:08:49Z

I'm going to pull your doc changes into my PR and make it a co-authored commit when I merge so you don't lose the credit.

lildude · 2020-01-23T14:18:11Z

Closing in favour of #4780.

cespare · 2020-01-23T16:39:50Z

@lildude thanks!

Recommend using set/unset gitattributes (vs. true/false strings)

d7846e6

cespare added a commit to cespare/misc that referenced this pull request Aug 27, 2019

Add a demo for github-linguist/linguist/pull/4624

9f757f0

stale bot added the Stale label Sep 26, 2019

stale bot removed the Stale label Sep 26, 2019

stale bot added the Stale label Nov 2, 2019

lildude removed the Stale label Nov 2, 2019

stale bot added the Stale label Dec 2, 2019

lildude removed the Stale label Dec 2, 2019

stale bot added the Stale label Jan 1, 2020

lildude removed the Stale label Jan 1, 2020

lildude self-assigned this Jan 6, 2020

lildude mentioned this pull request Jan 23, 2020

Accept and encourage "minus-prefix" for unset gitattributes as per gitattributes documentation #4780

Merged

2 tasks

lildude closed this Jan 23, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recommend using set/unset gitattributes (vs. true/false strings) #4624

Recommend using set/unset gitattributes (vs. true/false strings) #4624

cespare commented Aug 25, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

stale bot commented Sep 26, 2019

cespare commented Sep 26, 2019

lildude commented Oct 3, 2019

stale bot commented Nov 2, 2019

stale bot commented Dec 2, 2019

stale bot commented Jan 1, 2020

lildude commented Jan 23, 2020

lildude commented Jan 23, 2020

lildude commented Jan 23, 2020

cespare commented Jan 23, 2020

Recommend using set/unset gitattributes (vs. true/false strings) #4624

Recommend using set/unset gitattributes (vs. true/false strings) #4624

Conversation

cespare commented Aug 25, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

lildude commented Aug 27, 2019

lildude commented Aug 27, 2019

cespare commented Aug 27, 2019

stale bot commented Sep 26, 2019

cespare commented Sep 26, 2019

lildude commented Oct 3, 2019

stale bot commented Nov 2, 2019

stale bot commented Dec 2, 2019

stale bot commented Jan 1, 2020

lildude commented Jan 23, 2020

lildude commented Jan 23, 2020

lildude commented Jan 23, 2020

cespare commented Jan 23, 2020