Skip to content

Conversation

edigaryev
Copy link
Contributor

@edigaryev edigaryev commented Jan 14, 2024

f18b512 (bundle: create filtered bundles, 2022-03-09) introduced an incredibly useful ability to create filtered bundles, which advances the partial clone/promisor support in Git and allows for archiving large repositories to object storages like S3 in bundles that are:

  • easy to manage
    • bundle is just a single file, it's easier to guarantee atomic replacements in object storages like S3 and they are faster to fetch than a bare repository since there's only a single GET request involved
  • incredibly tiny
    • no indexes (which may be more than 10 MB for some repositories) and other fluff, compared to cloning a bare repository
    • bundle can be filtered to only contain the tips of refs neccessary for e.g. code-analysis purposes

However, in 86fdd94 (clone: fail gracefully when cloning filtered bundle, 2022-03-09) the cloning of such bundles was disabled, with a note that this behavior is not desired, and it the long-term this should be possible.

The commit above states that it's not possible to have this at the moment due to lack of remote and a repository-global config that specifies an object filter, yet it's unclear why a remote-specific config can't be used instead, which is what this change does.

CC: Derrick Stolee stolee@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com

Copy link

Welcome to GitGitGadget

Hi @edigaryev, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter. You can CC potential reviewers by adding a footer to the PR description with the following syntax:

CC: Revi Ewer <revi.ewer@example.com>, Ill Takalook <ill.takalook@example.net>

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "revisions:" to state which subsystem the change is about, and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the Libera Chat IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this. Note that any reviewers CC'd via the list in the PR description will not actually be sent emails.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Libera Chat. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@Ikke
Copy link
Contributor

Ikke commented Jan 14, 2024

/allow

Copy link

User edigaryev is now allowed to use GitGitGadget.

f18b512 (bundle: create filtered bundles, 2022-03-09) introduced an
incredibly useful ability to create filtered bundles, which advances
the partial clone/promisor support in Git and allows for archiving
large repositories to object storages like S3 in bundles that are:

* easy to manage
  * bundle is just a single file, it's easier to guarantee atomic
    replacements in object storages like S3 and they are faster to
    fetch than a bare repository since there's only a single GET
    request involved
* incredibly tiny
  * no indexes (which may be more than 10 MB for some repositories)
    and other fluff, compared to cloning a bare repository
  * bundle can be filtered to only contain the tips of refs neccessary
    for e.g. code-analysis purposes

However, in 86fdd94 (clone: fail gracefully when cloning filtered
bundle, 2022-03-09) the cloning of such bundles was disabled, with a
note that this behavior is not desired, and it the long-term this
should be possible.

The commit above states that it's not possible to have this at the
moment due to lack of remote and a repository-global config that
specifies an object filter, yet it's unclear why a remote-specific
config can't be used instead, which is what this change does.

Signed-off-by: Nikolay Edigaryev <edigaryev@gmail.com>
@edigaryev edigaryev force-pushed the clone-filtered-bundles branch from 96dd580 to fa4aa3b Compare January 14, 2024 10:46
@edigaryev
Copy link
Contributor Author

/preview

Copy link

Preview email sent as pull.1644.git.git.1705229489779.gitgitgadget@gmail.com

@edigaryev
Copy link
Contributor Author

/submit

Copy link

Submitted as pull.1644.git.git.1705231010118.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-git-1644/edigaryev/clone-filtered-bundles-v1

To fetch this version to local tag pr-git-1644/edigaryev/clone-filtered-bundles-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-git-1644/edigaryev/clone-filtered-bundles-v1

Misschevious1

This comment was marked as abuse.

Copy link

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Nikolay

On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> From: Nikolay Edigaryev <edigaryev@gmail.com>
> > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> incredibly useful ability to create filtered bundles, which advances
> the partial clone/promisor support in Git and allows for archiving
> large repositories to object storages like S3 in bundles that are:
> > * easy to manage
>    * bundle is just a single file, it's easier to guarantee atomic
>      replacements in object storages like S3 and they are faster to
>      fetch than a bare repository since there's only a single GET
>      request involved
> * incredibly tiny
>    * no indexes (which may be more than 10 MB for some repositories)
>      and other fluff, compared to cloning a bare repository
>    * bundle can be filtered to only contain the tips of refs neccessary
>      for e.g. code-analysis purposes
> > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> note that this behavior is not desired, and it the long-term this
> should be possible.
> > The commit above states that it's not possible to have this at the
> moment due to lack of remote and a repository-global config that
> specifies an object filter, yet it's unclear why a remote-specific
> config can't be used instead, which is what this change does.

As I understand it if you're cloning from a bundle file then then there is no remote so how can we set a remote-specific config?

I'm surprised that the proposed change does not require the user to pass "--filter" to "git clone" as I expected that we'd want to check that the filter on the command line was compatible with the filter used to create the bundle. Allowing "git clone" to create a partial clone without the user asking for it by passing the "--filter" option feels like is going to end up confusing users.

> +test_expect_success 'cloning from filtered bundle works' '
> +	git bundle create partial.bdl --all --filter=blob:none &&
> +	git clone --bare partial.bdl partial 2>err

The redirection hides any error message which will make debugging test failures harder. It would be nice to see this test check any config set when cloning and that git commands can run successfully in the repository.

Best Wishes

Phillip

Copy link

User Phillip Wood <phillip.wood123@gmail.com> has been added to the cc: list.

Copy link

On the Git mailing list, Nikolay Edigaryev wrote (reply to this):

Hello Phillip,

> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?

There is a remote, for more details see df61c88979 (clone: also
configure url for bare clones, 2010-03-29), which has the following
code:

strbuf_addf(&key, "remote.%s.url", remote_name);
git_config_set(key.buf, repo);
strbuf_reset(&key);

You can verify this by creating a bundle on Git 2.43.0 with "git create
bundle bundle.bundle --all" and then cloning it with "git clone
--bare /path/to/bundle.bundle", in my case the following repo-wide
configuration file was created:

[core]
repositoryformatversion = 0
filemode = true
bare = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = /Users/edi/src/cirrus-cli/cli.bundle

> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.

Note that currently, when you clone a normal non-filtered bundle with a
'--filter' argument specified, no filtering will take place and no error
will be thrown. "promisor = true" and "partialclonefilter = ..." options
will be set in the repo config, but no .promisor file will be created.
This is even more confusing IMO, but that's how it currently on
Git 2.43.0.

You have a good point, but I feel like completely preventing cloning of
filtered bundles and requiring a '--filter' argument is very taxing. If
you've already specified a '--filter' when creating a bundle (and thus
your intent to use partially cloned data), why do it multiple times?

What I propose as an alternative here is to act based on the user's
intent when cloning:

* when the user specifies no '--filter' argument, do nothing special,
   allow cloning both types of bundles: normal and filtered (with the
   logic from this patch)

* when the user does specify a '--filter' argument, either:
  * throw an error explaining that filtering of filtered bundles is not
    supported
  * or compare the user's filter specification and the one that is
    in the bundle and only throw an error if they mismatch

Let me know what you think about this (and perhaps you have a more
concrete example in mind where this will have negative consequences)
and I'll be happy to do a next iteration.


On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Nikolay
>
> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > From: Nikolay Edigaryev <edigaryev@gmail.com>
> >
> > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > incredibly useful ability to create filtered bundles, which advances
> > the partial clone/promisor support in Git and allows for archiving
> > large repositories to object storages like S3 in bundles that are:
> >
> > * easy to manage
> >    * bundle is just a single file, it's easier to guarantee atomic
> >      replacements in object storages like S3 and they are faster to
> >      fetch than a bare repository since there's only a single GET
> >      request involved
> > * incredibly tiny
> >    * no indexes (which may be more than 10 MB for some repositories)
> >      and other fluff, compared to cloning a bare repository
> >    * bundle can be filtered to only contain the tips of refs neccessary
> >      for e.g. code-analysis purposes
> >
> > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > note that this behavior is not desired, and it the long-term this
> > should be possible.
> >
> > The commit above states that it's not possible to have this at the
> > moment due to lack of remote and a repository-global config that
> > specifies an object filter, yet it's unclear why a remote-specific
> > config can't be used instead, which is what this change does.
>
> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?
>
> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.
>
> > +test_expect_success 'cloning from filtered bundle works' '
> > +     git bundle create partial.bdl --all --filter=blob:none &&
> > +     git clone --bare partial.bdl partial 2>err
>
> The redirection hides any error message which will make debugging test
> failures harder. It would be nice to see this test check any config set
> when cloning and that git commands can run successfully in the repository.
>
> Best Wishes
>
> Phillip

Copy link

On the Git mailing list, Nikolay Edigaryev wrote (reply to this):

> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.

I was wrong about this one: the .promisor pack file actually gets created.

And the filtering is not being done because the "uploadpack.allowFilter"
global option is not enabled by default.

Unfortunately the only way to figure this out is to run Git with
GIT_TRACE=2, which shows:

warning: filtering not recognized by server, ignoring

So please feel free to skip this part from the consideration.


On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>
> Hello Phillip,
>
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
>
> There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
>
> strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
>
> You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
>
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle
>
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
>
> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
>
> You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
>
> What I propose as an alternative here is to act based on the user's
> intent when cloning:
>
> * when the user specifies no '--filter' argument, do nothing special,
>    allow cloning both types of bundles: normal and filtered (with the
>    logic from this patch)
>
> * when the user does specify a '--filter' argument, either:
>   * throw an error explaining that filtering of filtered bundles is not
>     supported
>   * or compare the user's filter specification and the one that is
>     in the bundle and only throw an error if they mismatch
>
> Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
>
>
> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >
> > Hi Nikolay
> >
> > On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > > From: Nikolay Edigaryev <edigaryev@gmail.com>
> > >
> > > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > > incredibly useful ability to create filtered bundles, which advances
> > > the partial clone/promisor support in Git and allows for archiving
> > > large repositories to object storages like S3 in bundles that are:
> > >
> > > * easy to manage
> > >    * bundle is just a single file, it's easier to guarantee atomic
> > >      replacements in object storages like S3 and they are faster to
> > >      fetch than a bare repository since there's only a single GET
> > >      request involved
> > > * incredibly tiny
> > >    * no indexes (which may be more than 10 MB for some repositories)
> > >      and other fluff, compared to cloning a bare repository
> > >    * bundle can be filtered to only contain the tips of refs neccessary
> > >      for e.g. code-analysis purposes
> > >
> > > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > > note that this behavior is not desired, and it the long-term this
> > > should be possible.
> > >
> > > The commit above states that it's not possible to have this at the
> > > moment due to lack of remote and a repository-global config that
> > > specifies an object filter, yet it's unclear why a remote-specific
> > > config can't be used instead, which is what this change does.
> >
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
> >
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
> >
> > > +test_expect_success 'cloning from filtered bundle works' '
> > > +     git bundle create partial.bdl --all --filter=blob:none &&
> > > +     git clone --bare partial.bdl partial 2>err
> >
> > The redirection hides any error message which will make debugging test
> > failures harder. It would be nice to see this test check any config set
> > when cloning and that git commands can run successfully in the repository.
> >
> > Best Wishes
> >
> > Phillip

Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> diff --git a/builtin/clone.c b/builtin/clone.c
> index c6357af9498..4b3fedf78ed 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  
>  		if (fd > 0)
>  			close(fd);
> +
> +		if (has_filter) {
> +			strbuf_addf(&key, "remote.%s.promisor", remote_name);
> +			git_config_set(key.buf, "true");
> +			strbuf_reset(&key);
> +
> +			strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> +			git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> +			strbuf_reset(&key);
> +		}
> +

> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> -# bundle, but that requires a change to promisor/filter config options.

Sorry, but this "should be able to" does not make sense to me in the
first place.

I can understand how an operation to create a narrow clone of an
unfiltered bundle and then prepare the resulting repository for
future "fattening" by naming the unfiltered bundle file a
"promisor", and allow the user to lazily fetch objects that have
initially been filtered out of the bundle lazily.

But a bundle that were created with objects _omitted_ already?  It
obviously cannot "promise" to deliber any objects that ought to be
reachable from the objects contained in it, so setting the bundle
file as promisor in the configuration does not help the resulting
repository.  Those missing objects must be obtained from somewhere
other than that original "filtered" bundle, and if that source of
objects that can promise all the objects that are ought to be
reachable were specified as the promisor, it would make sense.  But
the source of this clone operation, i.e. the bundle file that is
pointed at by "remote.$remote_name.url", cannot be that promisor.

Copy link

On the Git mailing list, Junio C Hamano wrote (reply to this):

Junio C Hamano <gitster@pobox.com> writes:

> "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> diff --git a/builtin/clone.c b/builtin/clone.c
>> index c6357af9498..4b3fedf78ed 100644
>> --- a/builtin/clone.c
>> +++ b/builtin/clone.c
>> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>>  
>>  		if (fd > 0)
>>  			close(fd);
>> +
>> +		if (has_filter) {
>> +			strbuf_addf(&key, "remote.%s.promisor", remote_name);
>> +			git_config_set(key.buf, "true");
>> +			strbuf_reset(&key);
>> +
>> +			strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
>> +			git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
>> +			strbuf_reset(&key);
>> +		}
>> +
>
>> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
>> -# bundle, but that requires a change to promisor/filter config options.
> ...
> But a bundle that were created with objects _omitted_ already?
> ... the source of this clone operation, i.e. the bundle file that is
> pointed at by "remote.$remote_name.url", cannot be that promisor.

Extending the above a bit, one important way a bundle is used is as
a medium for sneaker-net.  Instead of making a full clone over the
network, if you can create a bundle that records all objects and all
refs out of the source repository and then unbundle it in a
different place to create a repository, you can tweak the resulting
repository by either adding a separete remote or changing the
remote.origin.url so that your subsequent fetch goes over the
network to the repository you took the initial bundle from.

The "tweak the resulting repository" part however MUST be done
manually with the current system.  If we can optionally record the
publically reachable URL of the source repository when we create a
bundle file, and "git clone" on the receiving side can read the URL
out of the bundle and act on it (e.g., show it to the user and offer
to record it as remote.origin.url in the resulting repository---I do
not think it is wise to do this silently without letting the user
know from security's point of view), then the use of bundle files as
a medium for sneaker-netting will become even easier.

And once that is done, perhaps allowing a filtered bundle to act as
a sneaker-net medium to simulate an initial filtered clone would
make sense.  The promisor as well as the origin will be the network
reachable URL and subsequent fetches (both deliberate ones via "git
fetch" as well as lazy on-demand ones that backfills missing objects
via the "promisor" access) would become possible.

But without such a change to the bundle file format, allowing
"clone" to finish and pretend the resulting repository is usable is
somewhat irresponsible to the users.  The on-demand lazy fetch would
fail after this code cloned from such a filtered bundle, no?

Copy link

On the Git mailing list, phillip.wood123@gmail.com wrote (reply to this):

Hi Nikolay

On 14/01/2024 19:39, Nikolay Edigaryev wrote:
> Hello Phillip,
> >> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
> > There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
> > strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
> > You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
> > [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle

Oh, thanks for clarifying that I didn't realize we set "origin" to point to the bundle. That means this patch creates a promisor remote config pointing to a bundle that does not contain the missing objects. As Junio said that doesn't make much sense to me as the point of the promisor config is to allow git to lazily fetch the missing objects.

Best Wishes

Phillip

>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
> > Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
> > You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
> > What I propose as an alternative here is to act based on the user's
> intent when cloning:
> > * when the user specifies no '--filter' argument, do nothing special,
>     allow cloning both types of bundles: normal and filtered (with the
>     logic from this patch)
> > * when the user does specify a '--filter' argument, either:
>    * throw an error explaining that filtering of filtered bundles is not
>      supported
>    * or compare the user's filter specification and the one that is
>      in the bundle and only throw an error if they mismatch
> > Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
> > > On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Nikolay
>>
>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>
>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>> incredibly useful ability to create filtered bundles, which advances
>>> the partial clone/promisor support in Git and allows for archiving
>>> large repositories to object storages like S3 in bundles that are:
>>>
>>> * easy to manage
>>>     * bundle is just a single file, it's easier to guarantee atomic
>>>       replacements in object storages like S3 and they are faster to
>>>       fetch than a bare repository since there's only a single GET
>>>       request involved
>>> * incredibly tiny
>>>     * no indexes (which may be more than 10 MB for some repositories)
>>>       and other fluff, compared to cloning a bare repository
>>>     * bundle can be filtered to only contain the tips of refs neccessary
>>>       for e.g. code-analysis purposes
>>>
>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>> note that this behavior is not desired, and it the long-term this
>>> should be possible.
>>>
>>> The commit above states that it's not possible to have this at the
>>> moment due to lack of remote and a repository-global config that
>>> specifies an object filter, yet it's unclear why a remote-specific
>>> config can't be used instead, which is what this change does.
>>
>> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
>>
>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
>>
>>> +test_expect_success 'cloning from filtered bundle works' '
>>> +     git bundle create partial.bdl --all --filter=blob:none &&
>>> +     git clone --bare partial.bdl partial 2>err
>>
>> The redirection hides any error message which will make debugging test
>> failures harder. It would be nice to see this test check any config set
>> when cloning and that git commands can run successfully in the repository.
>>
>> Best Wishes
>>
>> Phillip

Copy link

On the Git mailing list, phillip.wood123@gmail.com wrote (reply to this):

Hi Nikolay

On 14/01/2024 21:26, Nikolay Edigaryev wrote:
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
> > I was wrong about this one: the .promisor pack file actually gets created.
> > And the filtering is not being done because the "uploadpack.allowFilter"
> global option is not enabled by default.

That makes sense - if you try to make a partial clone from a remote that does not support object filtering we fallback to a full clone and print the warning you mention below.

> Unfortunately the only way to figure this out is to run Git with
> GIT_TRACE=2, which shows:

That seems strange, you should see the warning without having to set GIT_TRACE. I've certainly seen the warning in the past when trying to create a partial clone from a remote did not support them without me setting any special environment variables.

Best Wishes

Phillip

> warning: filtering not recognized by server, ignoring
> > So please feel free to skip this part from the consideration.
> > > On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>>
>> Hello Phillip,
>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>
>> There is a remote, for more details see df61c88979 (clone: also
>> configure url for bare clones, 2010-03-29), which has the following
>> code:
>>
>> strbuf_addf(&key, "remote.%s.url", remote_name);
>> git_config_set(key.buf, repo);
>> strbuf_reset(&key);
>>
>> You can verify this by creating a bundle on Git 2.43.0 with "git create
>> bundle bundle.bundle --all" and then cloning it with "git clone
>> --bare /path/to/bundle.bundle", in my case the following repo-wide
>> configuration file was created:
>>
>> [core]
>> repositoryformatversion = 0
>> filemode = true
>> bare = true
>> ignorecase = true
>> precomposeunicode = true
>> [remote "origin"]
>> url = /Users/edi/src/cirrus-cli/cli.bundle
>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
>>
>> You have a good point, but I feel like completely preventing cloning of
>> filtered bundles and requiring a '--filter' argument is very taxing. If
>> you've already specified a '--filter' when creating a bundle (and thus
>> your intent to use partially cloned data), why do it multiple times?
>>
>> What I propose as an alternative here is to act based on the user's
>> intent when cloning:
>>
>> * when the user specifies no '--filter' argument, do nothing special,
>>     allow cloning both types of bundles: normal and filtered (with the
>>     logic from this patch)
>>
>> * when the user does specify a '--filter' argument, either:
>>    * throw an error explaining that filtering of filtered bundles is not
>>      supported
>>    * or compare the user's filter specification and the one that is
>>      in the bundle and only throw an error if they mismatch
>>
>> Let me know what you think about this (and perhaps you have a more
>> concrete example in mind where this will have negative consequences)
>> and I'll be happy to do a next iteration.
>>
>>
>> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>> Hi Nikolay
>>>
>>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>>
>>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>>> incredibly useful ability to create filtered bundles, which advances
>>>> the partial clone/promisor support in Git and allows for archiving
>>>> large repositories to object storages like S3 in bundles that are:
>>>>
>>>> * easy to manage
>>>>     * bundle is just a single file, it's easier to guarantee atomic
>>>>       replacements in object storages like S3 and they are faster to
>>>>       fetch than a bare repository since there's only a single GET
>>>>       request involved
>>>> * incredibly tiny
>>>>     * no indexes (which may be more than 10 MB for some repositories)
>>>>       and other fluff, compared to cloning a bare repository
>>>>     * bundle can be filtered to only contain the tips of refs neccessary
>>>>       for e.g. code-analysis purposes
>>>>
>>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>>> note that this behavior is not desired, and it the long-term this
>>>> should be possible.
>>>>
>>>> The commit above states that it's not possible to have this at the
>>>> moment due to lack of remote and a repository-global config that
>>>> specifies an object filter, yet it's unclear why a remote-specific
>>>> config can't be used instead, which is what this change does.
>>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>>
>>>> +test_expect_success 'cloning from filtered bundle works' '
>>>> +     git bundle create partial.bdl --all --filter=blob:none &&
>>>> +     git clone --bare partial.bdl partial 2>err
>>>
>>> The redirection hides any error message which will make debugging test
>>> failures harder. It would be nice to see this test check any config set
>>> when cloning and that git commands can run successfully in the repository.
>>>
>>> Best Wishes
>>>
>>> Phillip

Copy link

On the Git mailing list, Nikolay Edigaryev wrote (reply to this):

Hello Junio and Phillip,

Thanks a lot for the explanations of how this is supposed to work. It
seems that to make this work properly, we'd need to:

(1) add an argument (or an option) to 'git bundle create', so that
    the user will be able to explicitly request the inclusion of a
    desired remote's URL

Without such mechanism in place data leak is possible, e.g. remote with
credentials hardcoded in it.

(2) extend the 'gitformat-bundle' to include 'url'

However, a remote can have multiple URLs and other remote-specific
options might be necessary to properly work with it.

(3) add an argument (or an option) to 'git clone', so that the user
    will be able to explicitly request the write of the URL contained
    in the bundle to the repository's config

Otherwise, it's insecure, e.g. someone might craft a bundle with a URL
that collects data from the user.

I don't want waste anyone's time on this anymore because I've toyed with
'git bundle' a bit more and realized that what I'm trying to accomplish
can be done the other way:

1. git init

2. git bundle unbundle <PATH> | <script that swaps hashes and refs in
   'git bundle unbundle output' and feeds them to 'git update-ref'>

Hopefully this discussion will be useful for people looking to
accomplish something similar to what I've described in the initial
message.

On Mon, Jan 15, 2024 at 6:09 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> > writes:
> >
> >> diff --git a/builtin/clone.c b/builtin/clone.c
> >> index c6357af9498..4b3fedf78ed 100644
> >> --- a/builtin/clone.c
> >> +++ b/builtin/clone.c
> >> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
> >>
> >>              if (fd > 0)
> >>                      close(fd);
> >> +
> >> +            if (has_filter) {
> >> +                    strbuf_addf(&key, "remote.%s.promisor", remote_name);
> >> +                    git_config_set(key.buf, "true");
> >> +                    strbuf_reset(&key);
> >> +
> >> +                    strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> >> +                    git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> >> +                    strbuf_reset(&key);
> >> +            }
> >> +
> >
> >> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> >> -# bundle, but that requires a change to promisor/filter config options.
> > ...
> > But a bundle that were created with objects _omitted_ already?
> > ... the source of this clone operation, i.e. the bundle file that is
> > pointed at by "remote.$remote_name.url", cannot be that promisor.
>
> Extending the above a bit, one important way a bundle is used is as
> a medium for sneaker-net.  Instead of making a full clone over the
> network, if you can create a bundle that records all objects and all
> refs out of the source repository and then unbundle it in a
> different place to create a repository, you can tweak the resulting
> repository by either adding a separete remote or changing the
> remote.origin.url so that your subsequent fetch goes over the
> network to the repository you took the initial bundle from.
>
> The "tweak the resulting repository" part however MUST be done
> manually with the current system.  If we can optionally record the
> publically reachable URL of the source repository when we create a
> bundle file, and "git clone" on the receiving side can read the URL
> out of the bundle and act on it (e.g., show it to the user and offer
> to record it as remote.origin.url in the resulting repository---I do
> not think it is wise to do this silently without letting the user
> know from security's point of view), then the use of bundle files as
> a medium for sneaker-netting will become even easier.
>
> And once that is done, perhaps allowing a filtered bundle to act as
> a sneaker-net medium to simulate an initial filtered clone would
> make sense.  The promisor as well as the origin will be the network
> reachable URL and subsequent fetches (both deliberate ones via "git
> fetch" as well as lazy on-demand ones that backfills missing objects
> via the "promisor" access) would become possible.
>
> But without such a change to the bundle file format, allowing
> "clone" to finish and pretend the resulting repository is usable is
> somewhat irresponsible to the users.  The on-demand lazy fetch would
> fail after this code cloned from such a filtered bundle, no?

@edigaryev edigaryev closed this Jan 16, 2024
@edigaryev edigaryev deleted the clone-filtered-bundles branch January 16, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants