Skip to content

Conversation

andrewshadura
Copy link

@andrewshadura andrewshadura commented Feb 19, 2021

It is possible for a user to disable attribute-based filtering when committing by doing one of the following:

  • Create .git/info/attributes unapplying all possible transforming attributes.
  • Use git hash-object and git update-index to stage files manually.

Doing the former requires keeping an up-to-date list of all attributes which can transform files when committing or checking out.
Doing the latter is difficult, error-prone and slow when done from scripts.

Instead, similarly to git hash-object, --no-filter can be added to git add to enable temporarily disabling filtering in an easy to use way.

The use case here is mostly non-interactive use in scripts creating Git trees that need to exactly correspond to a working directory regardless of whether or not they have any .gitattributes files.

For example, for git-buildpackage or dgit, which facilitate Git workflows with Debian packages, want to ensure the contents of the
packages can be exactly reproduced, which is difficult if the upstream’s tarball has .gitattributes. It is possible to "defuse" the attributes as demonstrated above, but this will break if the user modifies the .git/i/a file or if Git adds more attribute-based conversions. This is what dgit currently does, and this is what git-buildpackage will soon do too.

Of course, this patch set only addresses staging files. While working on a patch to git-buildpackage to reproducibly import the contents of tarballs, I realised that the only realistic way seem to do that is to use hash-object + update-index manually, which is likely to come with a performance drop compared to git add (which is what gbp currently uses).
A workaround might be to use Dulwich (which would allow to do hash-object without fork/exec) or perhaps GitPython (which I haven’t really looked into), or maybe to use git fast-import, but all of these alternatives are quite complex and don’t guarantee the same performance.

Adding a new option to git add allows to keep the performance without having to ensure attributes are set to the right values. The attributes will likely have to be adjusted anyway for user’s convenience, but at least if they modify them afterwards, the tools won’t break.

These patches:

  • Add new flag ADD_CACHE_RAW to add_to_index()
  • Add new flag HASH_RAW to index_fd()
  • Make git hash-object use the new HASH_RAW flag for consistency
  • Add tests for the new git-add option.
  • Update the git-add manpage describing the new option and pointing out it’s tricky to use.
  • Expand the relevant FAQ entry adding --no-filters as yet one more reason for perpetually modified files to appear.

Changes since v1:

  • Removed an extra space left in the option definition. Jessica Clarke on GitHub pointed out the inconsistent formatting, but contrary to the suggestion I updated the style to the more widespread variation: "(0," occurs at least 689 times across the codebase, while "( 0" only at most 23.
  • Expanded the option description to warn users about potentially confusing situations (pointed out by brian m. carlson)
  • Expanded the FAQ entry to mention the new option as one of the cause of perpetually modified files.

cc: "brian m. carlson" sandals@crustytoothpaste.net

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

Welcome to GitGitGadget

Hi @andrewshadura, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "commit:", and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@adlternative
Copy link

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

User andrewshadura is now allowed to use GitGitGadget.

WARNING: andrewshadura has no public email address set on GitHub

@andrewshadura andrewshadura force-pushed the git-add-no-filters branch 2 times, most recently from bb31b25 to 0e113e9 Compare February 19, 2021 17:32
@andrewshadura
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

Preview email sent as pull.880.git.1613758137.gitgitgadget@gmail.com

@andrewshadura
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

Submitted as pull.880.git.1613758333.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-880/andrewshadura/git-add-no-filters-v1

To fetch this version to local tag pr-880/andrewshadura/git-add-no-filters-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-880/andrewshadura/git-add-no-filters-v1

OPT_COUNTUP( 0 , "stdin", &hashstdin, N_("read the object from stdin")),
OPT_BOOL( 0 , "stdin-paths", &stdin_paths, N_("read file names from stdin")),
OPT_BOOL( 0 , "no-filters", &no_filters, N_("store file as is without filters")),
OPT_BIT(0 , "no-filters", &flags, N_("store file as is without filters"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style is:

Suggested change
OPT_BIT(0 , "no-filters", &flags, N_("store file as is without filters"),
OPT_BIT( 0 , "no-filters", &flags, N_("store file as is without filters"),

to pad where the 's go for short options, at least in this file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grepping buildins/*.c shows that (0 is used at least 689 times, while ( 0 at most 23.

The coding guidelines don’t mention any rule mandating this space, but say: Use whitespace around operators and keywords, but not inside parentheses and not around functions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I forgot to delete the other space.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

On the Git mailing list, "brian m. carlson" wrote (reply to this):


--mlB1r3e00IglsYfs
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2021-02-19 at 18:12:11, Andrej Shadura via GitGitGadget wrote:
> It is possible for a user to disable attribute-based filtering when
> committing by doing one of the following:
>=20
>  * Create .git/info/attributes unapplying all possible transforming
>    attributes.
>  * Use git hash-object and git update-index to stage files manually.
>=20
> Doing the former requires keeping an up-to-date list of all attributes wh=
ich
> can transform files when committing or checking out. Doing the latter is
> difficult, error-prone and slow when done from scripts.
>=20
> Instead, similarly to git hash-object, --no-filter can be added to git add
> to enable temporarily disabling filtering in an easy to use way.
>=20
> These patches:
>=20
>  * Add new flag ADD_CACHE_RAW to add_to_index()
>  * Add new flag HASH_RAW to index_fd()
>  * Make git hash-object use the new HASH_RAW flag for consistency
>  * Add tests for the new git-add option.

I'm interested in your use cases here.  While I agree that this is an
interesting feature, it also means that practically, a user who checks
out a file that's added this way may find that git status marks it as
perpetually modified until a properly cleaned version is committed.
Moreover, even "git reset --hard" won't fix this situation.

We see this problem extremely frequently with Git LFS where people
change the .gitattributes file but don't run "git add --renormalize ."
and then end up with this problem.  However, it's not limited to Git LFS
in particular; anything that uses filters, working tree encodings, or
end of line attributes can be affected.

So I think that while this might be a useful escape hatch for users, I
definitely want to see a compelling rationale for it and a big warning
in the documentation and an update to the relevant entry in the Git FAQ
before we accept such a patch.
--=20
brian m. carlson (he/him or they/them)
Houston, Texas, US

--mlB1r3e00IglsYfs
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.2.27 (GNU/Linux)

iHUEABYKAB0WIQQILOaKnbxl+4PRw5F8DEliiIeigQUCYDA9WQAKCRB8DEliiIei
gQWfAQCEXan4woKj6lQaIFHmMUzJEPPH1xzZ9VNO9bVUOk6WDgD/bJ7cLSmmbWsZ
VhZ+eDzSWGBdSG+SONPydienLaar/AI=
=D9QW
-----END PGP SIGNATURE-----

--mlB1r3e00IglsYfs--

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

User "brian m. carlson" <sandals@crustytoothpaste.net> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

On the Git mailing list, Andrej Shadura wrote (reply to this):

On 19/02/2021 23:36, brian m. carlson wrote:
> On 2021-02-19 at 18:12:11, Andrej Shadura via GitGitGadget wrote:
>> It is possible for a user to disable attribute-based filtering when
>> committing by doing one of the following:
>>
>>  * Create .git/info/attributes unapplying all possible transforming
>>    attributes.
>>  * Use git hash-object and git update-index to stage files manually.
>>
>> Doing the former requires keeping an up-to-date list of all attributes which
>> can transform files when committing or checking out. Doing the latter is
>> difficult, error-prone and slow when done from scripts.
>>
>> Instead, similarly to git hash-object, --no-filter can be added to git add
>> to enable temporarily disabling filtering in an easy to use way.

> I'm interested in your use cases here.  While I agree that this is an
> interesting feature, it also means that practically, a user who checks
> out a file that's added this way may find that git status marks it as
> perpetually modified until a properly cleaned version is committed.
> Moreover, even "git reset --hard" won't fix this situation.
> 
> We see this problem extremely frequently with Git LFS where people
> change the .gitattributes file but don't run "git add --renormalize ."
> and then end up with this problem.  However, it's not limited to Git LFS
> in particular; anything that uses filters, working tree encodings, or
> end of line attributes can be affected.
> 
> So I think that while this might be a useful escape hatch for users, I
> definitely want to see a compelling rationale for it and a big warning
> in the documentation and an update to the relevant entry in the Git FAQ
> before we accept such a patch.

My use case here is mostly non-interactive use in scripts creating Git
trees that need to exactly correspond to a working directory regardless
of whether or not they have any .gitattributes files.

For example, for git-buildpackage or dgit, which facilitate Git
workflows with Debian packages, want to ensure the contents of the
packages can be exactly reproduced, which is difficult if the upstream’s
tarball has .gitattributes. It is possible to "defuse" the attributes as
demonstrated above, but this will break if the user modifies the
.git/i/a file *or* if Git adds more attribute-based conversions. This is
what dgit currently does, and this is what git-buildpackage will soon do
too.

Of course, this patch set only addresses staging files. While working on
a patch to git-buildpackage to reproducibly import the contents of
tarballs, I realised that the only realistic way seem to do that is to
use hash-object + update-index manually, which is likely to come with a
performance drop compared to git add (which is what gbp currently uses).
A workaround might be to use dulwich (which would allow to do
hash-object without fork/exec) or perhaps GitPython (which I haven’t
really looked into), or maybe to use git fast-import, but all of these
alternatives are quite complex and don’t guarantee the same performance.

Adding a new option to git add allows to keep the performance without
having to ensure attributes are set to the right values. The attributes
will likely have to be adjusted anyway for user’s convenience, but at
least if they modify them afterwards, the tools won’t break.

-- 
Cheers,
  Andrej

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

On the Git mailing list, Andrej Shadura wrote (reply to this):

On 20/02/2021 09:06, Andrej Shadura wrote:
> On 19/02/2021 23:36, brian m. carlson wrote:
>> So I think that while this might be a useful escape hatch for users, I
>> definitely want to see a compelling rationale for it and a big warning
>> in the documentation and an update to the relevant entry in the Git FAQ
>> before we accept such a patch.

Here’s my proposal for the updated manpage description of the option:

--no-filters::

Add the contents as is, ignoring any input filter that would have been
chosen by the attributes mechanism, including the end-of-line
conversion. Note that this option is not intended for interactive use,
since files added this way will always show up as modified if Git were
to apply transformations to them, making the situation potentially very
confusing.

And here the FAQ entry extended:

It is also possible for perpetually modified files to occur on any
platform if a smudge or clean filter is in use on your system but a file
was previously committed without running the smudge or clean filter.  To
fix this, run the following on an otherwise clean working tree:
+
----
$ git add --renormalize .
----
+
Another situation where perpetually modified may appear on any platform
is when a file has been committed without running any filters (including
the end-of-line conversion), but the `.gitattributes` file states that
this file requires a conversion.  In this case, you can either
renormalize the files if this happened by mistake, or modify
`.gitattributes` or `$GIT_DIR/info/attributes` as described above to
exempt the file from the conversion if this was intentional.

(I will send an updated patch set when we agree on the wording.)

-- 
Cheers,
  Andrej

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

On the Git mailing list, "brian m. carlson" wrote (reply to this):


--xDkcWKnjSnipxv0a
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2021-02-20 at 09:30:58, Andrej Shadura wrote:
> On 20/02/2021 09:06, Andrej Shadura wrote:
> > On 19/02/2021 23:36, brian m. carlson wrote:
> >> So I think that while this might be a useful escape hatch for users, I
> >> definitely want to see a compelling rationale for it and a big warning
> >> in the documentation and an update to the relevant entry in the Git FAQ
> >> before we accept such a patch.
>=20
> Here=E2=80=99s my proposal for the updated manpage description of the opt=
ion:
>=20
> --no-filters::
>=20
> Add the contents as is, ignoring any input filter that would have been
> chosen by the attributes mechanism, including the end-of-line
> conversion. Note that this option is not intended for interactive use,
> since files added this way will always show up as modified if Git were
> to apply transformations to them, making the situation potentially very
> confusing.
>=20
> And here the FAQ entry extended:
>=20
> It is also possible for perpetually modified files to occur on any
> platform if a smudge or clean filter is in use on your system but a file
> was previously committed without running the smudge or clean filter.  To
> fix this, run the following on an otherwise clean working tree:
> +
> ----
> $ git add --renormalize .
> ----
> +
> Another situation where perpetually modified may appear on any platform
> is when a file has been committed without running any filters (including
> the end-of-line conversion), but the `.gitattributes` file states that
> this file requires a conversion.  In this case, you can either
> renormalize the files if this happened by mistake, or modify
> `.gitattributes` or `$GIT_DIR/info/attributes` as described above to
> exempt the file from the conversion if this was intentional.
>=20
> (I will send an updated patch set when we agree on the wording.)

This seems fine.  Thanks for being open to addressing my concerns, and I
agree that your use case seems like a good one and that this is a
valuable feature.
--=20
brian m. carlson (he/him or they/them)
Houston, Texas, US

--xDkcWKnjSnipxv0a
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.2.27 (GNU/Linux)

iHUEABYKAB0WIQQILOaKnbxl+4PRw5F8DEliiIeigQUCYDEYaQAKCRB8DEliiIei
gQ53AQDgt1oqqpObMS1fXEV0/STiGNwq4BgjrypSotm6JNxfdgD/fgOOriBP594p
3qlW2JGSxRydS8QcwuhwGExP+xK3CQQ=
=szv1
-----END PGP SIGNATURE-----

--xDkcWKnjSnipxv0a--

@andrewshadura
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

Preview email sent as pull.880.v2.git.1613840479.gitgitgadget@gmail.com

@andrewshadura
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

Submitted as pull.880.v2.git.1613840865.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-880/andrewshadura/git-add-no-filters-v2

To fetch this version to local tag pr-880/andrewshadura/git-add-no-filters-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-880/andrewshadura/git-add-no-filters-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 20, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Andrej Shadura via GitGitGadget" <gitgitgadget@gmail.com> writes:

> It is possible for a user to disable attribute-based filtering when
> committing by doing one of the following:
>
>  * Create .git/info/attributes unapplying all possible transforming
>    attributes.
>  * Use git hash-object and git update-index to stage files manually.
>
> Doing the former requires keeping an up-to-date list of all attributes which
> can transform files when committing or checking out. Doing the latter is
> difficult, error-prone and slow when done from scripts.
>
> Instead, similarly to git hash-object, --no-filter can be added to git add
> to enable temporarily disabling filtering in an easy to use way.

I think brian's review covered if such a feature is desirable to
sufficient level, and I do not have anything to add in that area,
so I'll limit my comment to the general design and implementation.

In general, think three times before introducing --no-something
option.  It often is much cleaner and futureproof if you instead
introduced --something option whose value defaults to true instead,
so that the end-user can say --no-something from the command line.


>       -		OPT_BOOL( 0 , "no-filters", &no_filters, N_("store file as is without filters")),
>      -+		OPT_BIT(0 , "no-filters", &flags, N_("store file as is without filters"),
>      ++		OPT_BIT(0, "no-filters", &flags, N_("store file as is without filters"),
>       +			HASH_RAW),

In other words, these should give "filters" option, and the code
should initialize the flags word with USE_CLEAN_FILTER bit on by
default (the use of "clean" here comes from "clean vs smudge", one
of the pair of filters end-user can customize the path the data
takes going into Git from the outside world; and the "clean" and
"smudge" datapaths also trigger non-custom standard ones like crlf
munging).

That way when a configuration variable support is introduced to
allow the users to say "I by default refuse to use the clean filters
when running 'git add'" by setting say "[add] cleanfilter = false",
the user can override that with "--filters" from the command line
"for just this time".  The same goes for an alias that hardcodes
"--no-filters" on the command line, where allowing "--filters" lets
the users override it.

Thanks.

It is possible for a user to disable attribute-based filtering when
committing by doing one of the following:

* Create .git/info/attributes unapplying all possible transforming
  attributes.
* Use git hash-object and git update-index to stage files manually.

Doing the former requires keeping an up-to-date list of all attributes
which can transform files when committing or checking out.
Doing the latter is difficult, error-prone and slow when done from
scripts.

Instead, similarly to git hash-object, --no-filter can be added to
git add to enable temporarily disabling filtering in an easy to use
way:

* Add new flag ADD_CACHE_RAW to add_to_index()
* Add new flag HASH_RAW to index_fd()

Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
While setting path to NULL works and flips the condition in the right
branch inside index_mem(), doing so isn’t obvious for the reader of
the code. Since index_mem() now has an additional flag to disable
filtering, use that instead.

Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants