-
Notifications
You must be signed in to change notification settings - Fork 26.6k
clone: support cloning of filtered bundles #1644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Welcome to GitGitGadgetHi @edigaryev, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that your Pull Request has a good description, as it will be used as cover letter. You can CC potential reviewers by adding a footer to the PR description with the following syntax:
Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code. Contributing the patchesBefore you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel
Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):
To send a new iteration, just add another PR comment with the contents: Need help?New contributors who want advice are encouraged to join git-mentoring@googlegroups.com, where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel, |
/allow |
User edigaryev is now allowed to use GitGitGadget. |
f18b512 (bundle: create filtered bundles, 2022-03-09) introduced an incredibly useful ability to create filtered bundles, which advances the partial clone/promisor support in Git and allows for archiving large repositories to object storages like S3 in bundles that are: * easy to manage * bundle is just a single file, it's easier to guarantee atomic replacements in object storages like S3 and they are faster to fetch than a bare repository since there's only a single GET request involved * incredibly tiny * no indexes (which may be more than 10 MB for some repositories) and other fluff, compared to cloning a bare repository * bundle can be filtered to only contain the tips of refs neccessary for e.g. code-analysis purposes However, in 86fdd94 (clone: fail gracefully when cloning filtered bundle, 2022-03-09) the cloning of such bundles was disabled, with a note that this behavior is not desired, and it the long-term this should be possible. The commit above states that it's not possible to have this at the moment due to lack of remote and a repository-global config that specifies an object filter, yet it's unclear why a remote-specific config can't be used instead, which is what this change does. Signed-off-by: Nikolay Edigaryev <edigaryev@gmail.com>
96dd580
to
fa4aa3b
Compare
/preview |
Preview email sent as pull.1644.git.git.1705229489779.gitgitgadget@gmail.com |
/submit |
Submitted as pull.1644.git.git.1705231010118.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Phillip Wood wrote (reply to this): Hi Nikolay
On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> From: Nikolay Edigaryev <edigaryev@gmail.com>
> > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> incredibly useful ability to create filtered bundles, which advances
> the partial clone/promisor support in Git and allows for archiving
> large repositories to object storages like S3 in bundles that are:
> > * easy to manage
> * bundle is just a single file, it's easier to guarantee atomic
> replacements in object storages like S3 and they are faster to
> fetch than a bare repository since there's only a single GET
> request involved
> * incredibly tiny
> * no indexes (which may be more than 10 MB for some repositories)
> and other fluff, compared to cloning a bare repository
> * bundle can be filtered to only contain the tips of refs neccessary
> for e.g. code-analysis purposes
> > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> note that this behavior is not desired, and it the long-term this
> should be possible.
> > The commit above states that it's not possible to have this at the
> moment due to lack of remote and a repository-global config that
> specifies an object filter, yet it's unclear why a remote-specific
> config can't be used instead, which is what this change does.
As I understand it if you're cloning from a bundle file then then there is no remote so how can we set a remote-specific config?
I'm surprised that the proposed change does not require the user to pass "--filter" to "git clone" as I expected that we'd want to check that the filter on the command line was compatible with the filter used to create the bundle. Allowing "git clone" to create a partial clone without the user asking for it by passing the "--filter" option feels like is going to end up confusing users.
> +test_expect_success 'cloning from filtered bundle works' '
> + git bundle create partial.bdl --all --filter=blob:none &&
> + git clone --bare partial.bdl partial 2>err
The redirection hides any error message which will make debugging test failures harder. It would be nice to see this test check any config set when cloning and that git commands can run successfully in the repository.
Best Wishes
Phillip |
User |
On the Git mailing list, Nikolay Edigaryev wrote (reply to this): Hello Phillip,
> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?
There is a remote, for more details see df61c88979 (clone: also
configure url for bare clones, 2010-03-29), which has the following
code:
strbuf_addf(&key, "remote.%s.url", remote_name);
git_config_set(key.buf, repo);
strbuf_reset(&key);
You can verify this by creating a bundle on Git 2.43.0 with "git create
bundle bundle.bundle --all" and then cloning it with "git clone
--bare /path/to/bundle.bundle", in my case the following repo-wide
configuration file was created:
[core]
repositoryformatversion = 0
filemode = true
bare = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = /Users/edi/src/cirrus-cli/cli.bundle
> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.
Note that currently, when you clone a normal non-filtered bundle with a
'--filter' argument specified, no filtering will take place and no error
will be thrown. "promisor = true" and "partialclonefilter = ..." options
will be set in the repo config, but no .promisor file will be created.
This is even more confusing IMO, but that's how it currently on
Git 2.43.0.
You have a good point, but I feel like completely preventing cloning of
filtered bundles and requiring a '--filter' argument is very taxing. If
you've already specified a '--filter' when creating a bundle (and thus
your intent to use partially cloned data), why do it multiple times?
What I propose as an alternative here is to act based on the user's
intent when cloning:
* when the user specifies no '--filter' argument, do nothing special,
allow cloning both types of bundles: normal and filtered (with the
logic from this patch)
* when the user does specify a '--filter' argument, either:
* throw an error explaining that filtering of filtered bundles is not
supported
* or compare the user's filter specification and the one that is
in the bundle and only throw an error if they mismatch
Let me know what you think about this (and perhaps you have a more
concrete example in mind where this will have negative consequences)
and I'll be happy to do a next iteration.
On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Nikolay
>
> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > From: Nikolay Edigaryev <edigaryev@gmail.com>
> >
> > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > incredibly useful ability to create filtered bundles, which advances
> > the partial clone/promisor support in Git and allows for archiving
> > large repositories to object storages like S3 in bundles that are:
> >
> > * easy to manage
> > * bundle is just a single file, it's easier to guarantee atomic
> > replacements in object storages like S3 and they are faster to
> > fetch than a bare repository since there's only a single GET
> > request involved
> > * incredibly tiny
> > * no indexes (which may be more than 10 MB for some repositories)
> > and other fluff, compared to cloning a bare repository
> > * bundle can be filtered to only contain the tips of refs neccessary
> > for e.g. code-analysis purposes
> >
> > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > note that this behavior is not desired, and it the long-term this
> > should be possible.
> >
> > The commit above states that it's not possible to have this at the
> > moment due to lack of remote and a repository-global config that
> > specifies an object filter, yet it's unclear why a remote-specific
> > config can't be used instead, which is what this change does.
>
> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?
>
> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.
>
> > +test_expect_success 'cloning from filtered bundle works' '
> > + git bundle create partial.bdl --all --filter=blob:none &&
> > + git clone --bare partial.bdl partial 2>err
>
> The redirection hides any error message which will make debugging test
> failures harder. It would be nice to see this test check any config set
> when cloning and that git commands can run successfully in the repository.
>
> Best Wishes
>
> Phillip |
On the Git mailing list, Nikolay Edigaryev wrote (reply to this): > Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
I was wrong about this one: the .promisor pack file actually gets created.
And the filtering is not being done because the "uploadpack.allowFilter"
global option is not enabled by default.
Unfortunately the only way to figure this out is to run Git with
GIT_TRACE=2, which shows:
warning: filtering not recognized by server, ignoring
So please feel free to skip this part from the consideration.
On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>
> Hello Phillip,
>
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
>
> There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
>
> strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
>
> You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
>
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle
>
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
>
> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
>
> You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
>
> What I propose as an alternative here is to act based on the user's
> intent when cloning:
>
> * when the user specifies no '--filter' argument, do nothing special,
> allow cloning both types of bundles: normal and filtered (with the
> logic from this patch)
>
> * when the user does specify a '--filter' argument, either:
> * throw an error explaining that filtering of filtered bundles is not
> supported
> * or compare the user's filter specification and the one that is
> in the bundle and only throw an error if they mismatch
>
> Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
>
>
> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >
> > Hi Nikolay
> >
> > On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > > From: Nikolay Edigaryev <edigaryev@gmail.com>
> > >
> > > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > > incredibly useful ability to create filtered bundles, which advances
> > > the partial clone/promisor support in Git and allows for archiving
> > > large repositories to object storages like S3 in bundles that are:
> > >
> > > * easy to manage
> > > * bundle is just a single file, it's easier to guarantee atomic
> > > replacements in object storages like S3 and they are faster to
> > > fetch than a bare repository since there's only a single GET
> > > request involved
> > > * incredibly tiny
> > > * no indexes (which may be more than 10 MB for some repositories)
> > > and other fluff, compared to cloning a bare repository
> > > * bundle can be filtered to only contain the tips of refs neccessary
> > > for e.g. code-analysis purposes
> > >
> > > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > > note that this behavior is not desired, and it the long-term this
> > > should be possible.
> > >
> > > The commit above states that it's not possible to have this at the
> > > moment due to lack of remote and a repository-global config that
> > > specifies an object filter, yet it's unclear why a remote-specific
> > > config can't be used instead, which is what this change does.
> >
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
> >
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
> >
> > > +test_expect_success 'cloning from filtered bundle works' '
> > > + git bundle create partial.bdl --all --filter=blob:none &&
> > > + git clone --bare partial.bdl partial 2>err
> >
> > The redirection hides any error message which will make debugging test
> > failures harder. It would be nice to see this test check any config set
> > when cloning and that git commands can run successfully in the repository.
> >
> > Best Wishes
> >
> > Phillip |
On the Git mailing list, Junio C Hamano wrote (reply to this): "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
writes:
> diff --git a/builtin/clone.c b/builtin/clone.c
> index c6357af9498..4b3fedf78ed 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>
> if (fd > 0)
> close(fd);
> +
> + if (has_filter) {
> + strbuf_addf(&key, "remote.%s.promisor", remote_name);
> + git_config_set(key.buf, "true");
> + strbuf_reset(&key);
> +
> + strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> + git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> + strbuf_reset(&key);
> + }
> +
> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> -# bundle, but that requires a change to promisor/filter config options.
Sorry, but this "should be able to" does not make sense to me in the
first place.
I can understand how an operation to create a narrow clone of an
unfiltered bundle and then prepare the resulting repository for
future "fattening" by naming the unfiltered bundle file a
"promisor", and allow the user to lazily fetch objects that have
initially been filtered out of the bundle lazily.
But a bundle that were created with objects _omitted_ already? It
obviously cannot "promise" to deliber any objects that ought to be
reachable from the objects contained in it, so setting the bundle
file as promisor in the configuration does not help the resulting
repository. Those missing objects must be obtained from somewhere
other than that original "filtered" bundle, and if that source of
objects that can promise all the objects that are ought to be
reachable were specified as the promisor, it would make sense. But
the source of this clone operation, i.e. the bundle file that is
pointed at by "remote.$remote_name.url", cannot be that promisor. |
On the Git mailing list, Junio C Hamano wrote (reply to this): Junio C Hamano <gitster@pobox.com> writes:
> "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> diff --git a/builtin/clone.c b/builtin/clone.c
>> index c6357af9498..4b3fedf78ed 100644
>> --- a/builtin/clone.c
>> +++ b/builtin/clone.c
>> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>>
>> if (fd > 0)
>> close(fd);
>> +
>> + if (has_filter) {
>> + strbuf_addf(&key, "remote.%s.promisor", remote_name);
>> + git_config_set(key.buf, "true");
>> + strbuf_reset(&key);
>> +
>> + strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
>> + git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
>> + strbuf_reset(&key);
>> + }
>> +
>
>> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
>> -# bundle, but that requires a change to promisor/filter config options.
> ...
> But a bundle that were created with objects _omitted_ already?
> ... the source of this clone operation, i.e. the bundle file that is
> pointed at by "remote.$remote_name.url", cannot be that promisor.
Extending the above a bit, one important way a bundle is used is as
a medium for sneaker-net. Instead of making a full clone over the
network, if you can create a bundle that records all objects and all
refs out of the source repository and then unbundle it in a
different place to create a repository, you can tweak the resulting
repository by either adding a separete remote or changing the
remote.origin.url so that your subsequent fetch goes over the
network to the repository you took the initial bundle from.
The "tweak the resulting repository" part however MUST be done
manually with the current system. If we can optionally record the
publically reachable URL of the source repository when we create a
bundle file, and "git clone" on the receiving side can read the URL
out of the bundle and act on it (e.g., show it to the user and offer
to record it as remote.origin.url in the resulting repository---I do
not think it is wise to do this silently without letting the user
know from security's point of view), then the use of bundle files as
a medium for sneaker-netting will become even easier.
And once that is done, perhaps allowing a filtered bundle to act as
a sneaker-net medium to simulate an initial filtered clone would
make sense. The promisor as well as the origin will be the network
reachable URL and subsequent fetches (both deliberate ones via "git
fetch" as well as lazy on-demand ones that backfills missing objects
via the "promisor" access) would become possible.
But without such a change to the bundle file format, allowing
"clone" to finish and pretend the resulting repository is usable is
somewhat irresponsible to the users. The on-demand lazy fetch would
fail after this code cloned from such a filtered bundle, no? |
On the Git mailing list, phillip.wood123@gmail.com wrote (reply to this): Hi Nikolay
On 14/01/2024 19:39, Nikolay Edigaryev wrote:
> Hello Phillip,
> >> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
> > There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
> > strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
> > You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
> > [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle
Oh, thanks for clarifying that I didn't realize we set "origin" to point to the bundle. That means this patch creates a promisor remote config pointing to a bundle that does not contain the missing objects. As Junio said that doesn't make much sense to me as the point of the promisor config is to allow git to lazily fetch the missing objects.
Best Wishes
Phillip
>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
> > Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
> > You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
> > What I propose as an alternative here is to act based on the user's
> intent when cloning:
> > * when the user specifies no '--filter' argument, do nothing special,
> allow cloning both types of bundles: normal and filtered (with the
> logic from this patch)
> > * when the user does specify a '--filter' argument, either:
> * throw an error explaining that filtering of filtered bundles is not
> supported
> * or compare the user's filter specification and the one that is
> in the bundle and only throw an error if they mismatch
> > Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
> > > On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Nikolay
>>
>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>
>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>> incredibly useful ability to create filtered bundles, which advances
>>> the partial clone/promisor support in Git and allows for archiving
>>> large repositories to object storages like S3 in bundles that are:
>>>
>>> * easy to manage
>>> * bundle is just a single file, it's easier to guarantee atomic
>>> replacements in object storages like S3 and they are faster to
>>> fetch than a bare repository since there's only a single GET
>>> request involved
>>> * incredibly tiny
>>> * no indexes (which may be more than 10 MB for some repositories)
>>> and other fluff, compared to cloning a bare repository
>>> * bundle can be filtered to only contain the tips of refs neccessary
>>> for e.g. code-analysis purposes
>>>
>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>> note that this behavior is not desired, and it the long-term this
>>> should be possible.
>>>
>>> The commit above states that it's not possible to have this at the
>>> moment due to lack of remote and a repository-global config that
>>> specifies an object filter, yet it's unclear why a remote-specific
>>> config can't be used instead, which is what this change does.
>>
>> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
>>
>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
>>
>>> +test_expect_success 'cloning from filtered bundle works' '
>>> + git bundle create partial.bdl --all --filter=blob:none &&
>>> + git clone --bare partial.bdl partial 2>err
>>
>> The redirection hides any error message which will make debugging test
>> failures harder. It would be nice to see this test check any config set
>> when cloning and that git commands can run successfully in the repository.
>>
>> Best Wishes
>>
>> Phillip |
On the Git mailing list, phillip.wood123@gmail.com wrote (reply to this): Hi Nikolay
On 14/01/2024 21:26, Nikolay Edigaryev wrote:
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
> > I was wrong about this one: the .promisor pack file actually gets created.
> > And the filtering is not being done because the "uploadpack.allowFilter"
> global option is not enabled by default.
That makes sense - if you try to make a partial clone from a remote that does not support object filtering we fallback to a full clone and print the warning you mention below.
> Unfortunately the only way to figure this out is to run Git with
> GIT_TRACE=2, which shows:
That seems strange, you should see the warning without having to set GIT_TRACE. I've certainly seen the warning in the past when trying to create a partial clone from a remote did not support them without me setting any special environment variables.
Best Wishes
Phillip
> warning: filtering not recognized by server, ignoring
> > So please feel free to skip this part from the consideration.
> > > On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>>
>> Hello Phillip,
>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>
>> There is a remote, for more details see df61c88979 (clone: also
>> configure url for bare clones, 2010-03-29), which has the following
>> code:
>>
>> strbuf_addf(&key, "remote.%s.url", remote_name);
>> git_config_set(key.buf, repo);
>> strbuf_reset(&key);
>>
>> You can verify this by creating a bundle on Git 2.43.0 with "git create
>> bundle bundle.bundle --all" and then cloning it with "git clone
>> --bare /path/to/bundle.bundle", in my case the following repo-wide
>> configuration file was created:
>>
>> [core]
>> repositoryformatversion = 0
>> filemode = true
>> bare = true
>> ignorecase = true
>> precomposeunicode = true
>> [remote "origin"]
>> url = /Users/edi/src/cirrus-cli/cli.bundle
>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
>>
>> You have a good point, but I feel like completely preventing cloning of
>> filtered bundles and requiring a '--filter' argument is very taxing. If
>> you've already specified a '--filter' when creating a bundle (and thus
>> your intent to use partially cloned data), why do it multiple times?
>>
>> What I propose as an alternative here is to act based on the user's
>> intent when cloning:
>>
>> * when the user specifies no '--filter' argument, do nothing special,
>> allow cloning both types of bundles: normal and filtered (with the
>> logic from this patch)
>>
>> * when the user does specify a '--filter' argument, either:
>> * throw an error explaining that filtering of filtered bundles is not
>> supported
>> * or compare the user's filter specification and the one that is
>> in the bundle and only throw an error if they mismatch
>>
>> Let me know what you think about this (and perhaps you have a more
>> concrete example in mind where this will have negative consequences)
>> and I'll be happy to do a next iteration.
>>
>>
>> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>> Hi Nikolay
>>>
>>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>>
>>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>>> incredibly useful ability to create filtered bundles, which advances
>>>> the partial clone/promisor support in Git and allows for archiving
>>>> large repositories to object storages like S3 in bundles that are:
>>>>
>>>> * easy to manage
>>>> * bundle is just a single file, it's easier to guarantee atomic
>>>> replacements in object storages like S3 and they are faster to
>>>> fetch than a bare repository since there's only a single GET
>>>> request involved
>>>> * incredibly tiny
>>>> * no indexes (which may be more than 10 MB for some repositories)
>>>> and other fluff, compared to cloning a bare repository
>>>> * bundle can be filtered to only contain the tips of refs neccessary
>>>> for e.g. code-analysis purposes
>>>>
>>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>>> note that this behavior is not desired, and it the long-term this
>>>> should be possible.
>>>>
>>>> The commit above states that it's not possible to have this at the
>>>> moment due to lack of remote and a repository-global config that
>>>> specifies an object filter, yet it's unclear why a remote-specific
>>>> config can't be used instead, which is what this change does.
>>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>>
>>>> +test_expect_success 'cloning from filtered bundle works' '
>>>> + git bundle create partial.bdl --all --filter=blob:none &&
>>>> + git clone --bare partial.bdl partial 2>err
>>>
>>> The redirection hides any error message which will make debugging test
>>> failures harder. It would be nice to see this test check any config set
>>> when cloning and that git commands can run successfully in the repository.
>>>
>>> Best Wishes
>>>
>>> Phillip |
On the Git mailing list, Nikolay Edigaryev wrote (reply to this): Hello Junio and Phillip,
Thanks a lot for the explanations of how this is supposed to work. It
seems that to make this work properly, we'd need to:
(1) add an argument (or an option) to 'git bundle create', so that
the user will be able to explicitly request the inclusion of a
desired remote's URL
Without such mechanism in place data leak is possible, e.g. remote with
credentials hardcoded in it.
(2) extend the 'gitformat-bundle' to include 'url'
However, a remote can have multiple URLs and other remote-specific
options might be necessary to properly work with it.
(3) add an argument (or an option) to 'git clone', so that the user
will be able to explicitly request the write of the URL contained
in the bundle to the repository's config
Otherwise, it's insecure, e.g. someone might craft a bundle with a URL
that collects data from the user.
I don't want waste anyone's time on this anymore because I've toyed with
'git bundle' a bit more and realized that what I'm trying to accomplish
can be done the other way:
1. git init
2. git bundle unbundle <PATH> | <script that swaps hashes and refs in
'git bundle unbundle output' and feeds them to 'git update-ref'>
Hopefully this discussion will be useful for people looking to
accomplish something similar to what I've described in the initial
message.
On Mon, Jan 15, 2024 at 6:09 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> > writes:
> >
> >> diff --git a/builtin/clone.c b/builtin/clone.c
> >> index c6357af9498..4b3fedf78ed 100644
> >> --- a/builtin/clone.c
> >> +++ b/builtin/clone.c
> >> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
> >>
> >> if (fd > 0)
> >> close(fd);
> >> +
> >> + if (has_filter) {
> >> + strbuf_addf(&key, "remote.%s.promisor", remote_name);
> >> + git_config_set(key.buf, "true");
> >> + strbuf_reset(&key);
> >> +
> >> + strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> >> + git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> >> + strbuf_reset(&key);
> >> + }
> >> +
> >
> >> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> >> -# bundle, but that requires a change to promisor/filter config options.
> > ...
> > But a bundle that were created with objects _omitted_ already?
> > ... the source of this clone operation, i.e. the bundle file that is
> > pointed at by "remote.$remote_name.url", cannot be that promisor.
>
> Extending the above a bit, one important way a bundle is used is as
> a medium for sneaker-net. Instead of making a full clone over the
> network, if you can create a bundle that records all objects and all
> refs out of the source repository and then unbundle it in a
> different place to create a repository, you can tweak the resulting
> repository by either adding a separete remote or changing the
> remote.origin.url so that your subsequent fetch goes over the
> network to the repository you took the initial bundle from.
>
> The "tweak the resulting repository" part however MUST be done
> manually with the current system. If we can optionally record the
> publically reachable URL of the source repository when we create a
> bundle file, and "git clone" on the receiving side can read the URL
> out of the bundle and act on it (e.g., show it to the user and offer
> to record it as remote.origin.url in the resulting repository---I do
> not think it is wise to do this silently without letting the user
> know from security's point of view), then the use of bundle files as
> a medium for sneaker-netting will become even easier.
>
> And once that is done, perhaps allowing a filtered bundle to act as
> a sneaker-net medium to simulate an initial filtered clone would
> make sense. The promisor as well as the origin will be the network
> reachable URL and subsequent fetches (both deliberate ones via "git
> fetch" as well as lazy on-demand ones that backfills missing objects
> via the "promisor" access) would become possible.
>
> But without such a change to the bundle file format, allowing
> "clone" to finish and pretend the resulting repository is usable is
> somewhat irresponsible to the users. The on-demand lazy fetch would
> fail after this code cloned from such a filtered bundle, no? |
f18b512 (bundle: create filtered bundles, 2022-03-09) introduced an incredibly useful ability to create filtered bundles, which advances the partial clone/promisor support in Git and allows for archiving large repositories to object storages like S3 in bundles that are:
However, in 86fdd94 (clone: fail gracefully when cloning filtered bundle, 2022-03-09) the cloning of such bundles was disabled, with a note that this behavior is not desired, and it the long-term this should be possible.
The commit above states that it's not possible to have this at the moment due to lack of remote and a repository-global config that specifies an object filter, yet it's unclear why a remote-specific config can't be used instead, which is what this change does.
CC: Derrick Stolee stolee@gmail.com
cc: Phillip Wood phillip.wood123@gmail.com