Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommend using set/unset gitattributes (vs. true/false strings) #4624

Closed

Conversation

cespare
Copy link
Contributor

@cespare cespare commented Aug 25, 2019

Hi! I was going to file an issue, but the change I want is small (doc changes in README.md) so I just sent a PR for discussion.

For context, see the git documentation for gitattributes. There are four states for an attribute; "set", "unset", "set to a value", and "unspecified". All of linguist's special attributes except for linguist-language are boolean attributes. For these, linguist considers the option to be "on" if the attribute is set or set to the string "true". It considers the option to be "off" if the attribute is unset or set to the string "false".

That is, these are equivalent to linguist:

x.txt linguist-generated=true
x.txt linguist-generated

as are these:

x.txt linguist-generated=false
x.txt -linguist-generated

I found this confusing: linguist planted the notion that the strings "true" and "false" are special, so when I read

special value "true"

on the gitattributes man page I did not understand that those strings are not special to gitattributes, only linguist. (The gitattributes man page could probably be more clear too.)

It's obviously too late to change linguist's behavior, but I propose that we change the documentation to promote set/unset vs. "true"/"false". Set/unset is the functionality for boolean options built right into gitattributes; "true"/"false" is special to linguist. (I initially stumbled onto this wondering while looking at the gitattributes file at my company and wondering why binary is specified like that but linguist-generated=true uses "true".)

Furthermore, note that the examples for linguist-vendored and linguist-documentation in the README currently mix the set/unset and "true"/"false" styles:

special-vendored-path/* linguist-vendored
jquery.js linguist-vendored=false

This seems extra confusing.

Finally, I think the example on the help page Customizing how changed files appear on GitHub should be changed from

search/index.json linguist-generated=true

to

search/index.json linguist-generated

(This might not be the right place to request that change.)

@lildude
Copy link
Member

lildude commented Aug 27, 2019

This makes sense to me. I'm not sure what the historical reason for using the string "false" was though as it was made over 5 years ago. That said, we can't start suggesting this as it won't work for the negation thanks to:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L103-L106

That would need to be changed to account for the old behaviour and the correctly interpreted behaviour you're proposing. Additional tests would also need to be added too. Feel free to update this PR with those changes if you want.

(This might not be the right place to request that change.)

Technically, it isn't, but I can make the change once we've got Linguist working

cespare added a commit to cespare/misc that referenced this pull request Aug 27, 2019
@cespare
Copy link
Contributor Author

cespare commented Aug 27, 2019

That said, we can't start suggesting this as it won't work for the negation [...]

Can you clarify what doesn't work?

When I opened the issue, I wasn't able to test linguist behavior locally because I couldn't get it running after battling Ruby stuff for about 30 minutes. But I did test it on GitHub and setting/unsetting attributes does seem to work as expected (i.e., GitHub correctly interprets them the same way that git does).

Here's a demo: https://github.com/cespare/misc/pull/4/files

There, using -linguist-generated unsets the attribute no matter whether it was previously set with linguist-generated or linguist-generated=true and in all cases where that attribute is either false or unset, GitHub is showing the diff.

$ git check-attr --all -- *
a.md: linguist-generated: unset
a.txt: linguist-generated: unset
b.md: linguist-generated: false
b.txt: linguist-generated: false
c.md: linguist-generated: true
c.txt: linguist-generated: set

When I was trying to understand the code, it seemed that this work is done by the if attr = part here:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L49-L53

but then I couldn't figure out what the super call did (been a while since I used Ruby) so I switched to testing it empirically on GitHub.

@lildude
Copy link
Member

lildude commented Aug 27, 2019

Can you clarify what doesn't work?

Sure. The negation, eg -linguist-generated and -linguist-vendored has no effect on Linguist's behaviour when it comes to the actual language analysis. https://github.com/lildude/literate-octo-chainsaw is a repo I tested this on. There should be Go and Python in the list of languages.

The behaviour is confirmed locally too:

$ bundle exec bin/github-linguist ~/tmp/trash/literate-octo-chainsaw
100.00% Ruby
$

Here's a demo: https://github.com/cespare/misc/pull/4/files

There, using -linguist-generated unsets the attribute no matter whether it was previously set with linguist-generated or linguist-generated=true and in all cases where that attribute is either false or unset, GitHub is showing the diff.

You're not testing the Linguist functionality yet 😄. .gitattributes for language detection as used by Linguist only takes effect once the changes are committed to the default branch. What you're PR is showing is how the diffing functionality works which reads the .gitattributes itself in much the same way your CLI cmd does.

When I was trying to understand the code, it seemed that this work is done by the if attr = part here:

https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/lazy_blob.rb#L49-L53

but then I couldn't figure out what the super call did (been a while since I used Ruby) so I switched to testing it empirically on GitHub.

If the linguist-generated attribute can't be found, regardless of the value, the assignment to attr will fail so we call to the "parent" generated? method, in this case: https://github.com/github/linguist/blob/a5df9a00ab7c9828bd7038bb9f9bd5e56d325dc9/lib/linguist/generated.rb#L11-L13

... to then determine if Linguist should treat the file as generated or not when performing the language analysis.

@lildude
Copy link
Member

lildude commented Aug 27, 2019

🤔 I'm starting to question myself now I actually play with this directly 😁

@cespare
Copy link
Contributor Author

cespare commented Aug 27, 2019

What you're PR is showing is how the diffing functionality works which reads the .gitattributes itself in much the same way your CLI cmd does.

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

@lildude
Copy link
Member

lildude commented Aug 27, 2019

Oh, I see my mistake now.

If the linguist-generated attribute can't be found, regardless of the value, the assignment to attr will fail so we call to the "parent" generated? method

It's not regardless of the value... it needs to equate to true so a string assignment would result in boolean_attribute(attr) being called, but something that equates to false, which is the case for the negation attributes.

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

Yes. There are many places we use Linguist methods directly without running a full language analysis.

@lildude
Copy link
Member

lildude commented Aug 27, 2019

Yes. There are many places we use Linguist methods directly without running a full language analysis.

Oh yes, and I don't claim to have any knowledge of where all of these places are and why 😁

@cespare
Copy link
Contributor Author

cespare commented Aug 27, 2019

Are you saying that GitHub runs code which is not linguist but which uses the linguist-generated gitattribute?

Yes. There are many places we use Linguist methods directly without running a full language analysis.

I'm not sure I follow. I was asking if GitHub is reading the linguist-generated attributes in my example demo code using non-linguist code, which would invalidate my experiment in https://github.com/cespare/misc/pull/4/files. Calling linguist methods is part of what I would consider linguist code.

You mentioned things like "language detection" and "full language analysis" a few times, but I'm not sure why they necessarily need to be involved. I used linguist-generated as my test attribute because it seemed easy to gauge the effect: if GitHub thinks that attribute is set, it should hide the diff. This seems separate from language detection to me.

It's not regardless of the value... it needs to equate to true so a string assignment would result in boolean_attribute(attr) being called, but something that equates to false, which is the case for the negation attributes.

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

@lildude
Copy link
Member

lildude commented Aug 27, 2019

I'm not sure I follow. I was asking if GitHub is reading the linguist-generated attributes in my example demo code using non-linguist code, which would invalidate my experiment in https://github.com/cespare/misc/pull/4/files. Calling linguist methods is part of what I would consider linguist code.

Sorry about the confusion. Yes, the diff code is taking advantage of some of the Linguist methods but it is reading the linguist-* gitattributes itself using a different method from that defined in lazy_blob.rb. It also provides its own overriding methods for generated?, vendored? and documentation?.

You mentioned things like "language detection" and "full language analysis" a few times, but I'm not sure why they necessarily need to be involved. I used linguist-generated as my test attribute because it seemed easy to gauge the effect: if GitHub thinks that attribute is set, it should hide the diff. This seems separate from language detection to me.

Yes it is, and it's not the problem I've highlighted. That's why I said your example is only showing how the diffing code behaves and not the language detection, which is really Linguist's primary function. I see my initial explanation didn't really make this clear. Sorry.

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

It does. I think the problem with the language analysis comes about because the - negation returns false and not "false", which nil effectively does too, so causes Linguist to go straight into determining if the file should be ignored or not completely ignoring the user's choice from the .gitattributes.

@lildude
Copy link
Member

lildude commented Aug 27, 2019

This quick change seems to do the trick for negation:

diff --git a/lib/linguist/lazy_blob.rb b/lib/linguist/lazy_blob.rb
index 389cc243..cf5a9294 100644
--- a/lib/linguist/lazy_blob.rb
+++ b/lib/linguist/lazy_blob.rb
@@ -38,7 +38,7 @@ module Linguist
     end
 
     def documentation?
-      if attr = git_attributes['linguist-documentation']
+      if attr = git_attributes['linguist-documentation'].to_s
         boolean_attribute(attr)
       else
         super
@@ -46,7 +46,7 @@ module Linguist
     end
 
     def generated?
-      if attr = git_attributes['linguist-generated']
+      if attr = git_attributes['linguist-generated'].to_s
         boolean_attribute(attr)
       else
         super
@@ -54,7 +54,7 @@ module Linguist
     end
 
     def vendored?
-      if attr = git_attributes['linguist-vendored']
+      if attr = git_attributes['linguist-vendored'].to_s
         return boolean_attribute(attr)
       else
         super
@@ -102,7 +102,7 @@ module Linguist
 
     # Returns true if the attribute is present and not the string "false".
     def boolean_attribute(attribute)
-      attribute != "false"
+      attribute.to_s.eql?('true') ? true : false
     end
 
     def load_blob!

... though I've not thoroughly tested this or thought this through too closely (it's the end of my day).

@cespare
Copy link
Contributor Author

cespare commented Aug 27, 2019

Yeah, I was assuming/hoping that rugged returns nil for unset attributes.

It does. I think the problem with the language analysis comes about because the - negation returns false and not "false", which nil effectively does too, so causes Linguist to go straight into determining if the file should be ignored or not completely ignoring the user's choice from the .gitattributes.

I was using "unset" in the specific sense of the gitattributes man page; that is, explicitly marked as unset via -foo. It sounds like you're saying that in this case, rugged returns the ruby value false. (But either nil or false will cause us to take the else super branch, as you're saying.)

I don't think your proposed changes really make sense to me. Strings, even empty strings, are always truthy in Ruby, so

if attr = git_attributes['linguist-generated'].to_s
  boolean_attribute(attr)
else
  super
end

will always take the first branch no matter what, correct?

I need to think about this more too. I'll try to get this ruby code working on my machine again so I can try stuff out locally...

@stale
Copy link

stale bot commented Sep 26, 2019

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Sep 26, 2019
@cespare
Copy link
Contributor Author

cespare commented Sep 26, 2019

@lildude any thoughts about my last message?

@stale stale bot removed the Stale label Sep 26, 2019
@lildude
Copy link
Member

lildude commented Oct 3, 2019

I don't think your proposed changes really make sense to me. Strings, even empty strings, are always truthy in Ruby, so

if attr = git_attributes['linguist-generated'].to_s
  boolean_attribute(attr)
else
  super
end

will always take the first branch no matter what, correct?

Ah yes. Good point. As I said, I didn't put much thought or testing into my quick suggestion 😄 We're still going to need to come up with a solution for Linguist.

@stale
Copy link

stale bot commented Nov 2, 2019

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Nov 2, 2019
@lildude lildude removed the Stale label Nov 2, 2019
@stale
Copy link

stale bot commented Dec 2, 2019

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Dec 2, 2019
@lildude lildude removed the Stale label Dec 2, 2019
@stale
Copy link

stale bot commented Jan 1, 2020

This pull request has been automatically marked as stale because it has not had recent activity, and will be closed if no further activity occurs. If this pull request was overlooked, forgotten, or should remain open for any other reason, please reply here to call attention to it and remove the stale status. Thank you for your contributions.

@stale stale bot added the Stale label Jan 1, 2020
@lildude lildude removed the Stale label Jan 1, 2020
@lildude lildude self-assigned this Jan 6, 2020
@lildude
Copy link
Member

lildude commented Jan 23, 2020

Sorry about the delay in coming back to this. I've worked out the necessary code changes to that we can honour the -linguist-whatever method as detailed in the .gitattributes docs and you're proposing in this PR whilst still keeping our old linguist-whatever=false method.

PR coming 🔜 .

@lildude
Copy link
Member

lildude commented Jan 23, 2020

I'm going to pull your doc changes into my PR and make it a co-authored commit when I merge so you don't lose the credit.

@lildude
Copy link
Member

lildude commented Jan 23, 2020

Closing in favour of #4780.

@lildude lildude closed this Jan 23, 2020
@cespare
Copy link
Contributor Author

cespare commented Jan 23, 2020

@lildude thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants