Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile-time extensions for list-object-filter #1031

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

olsen232
Copy link

@olsen232 olsen232 commented Sep 5, 2021

Adds an extension: option to list-object-filters,
these are implemented by static libraries that must be compiled into
Git. The Makefile argument FILTER_EXTENSIONS makes it easier
to compile these extensions into a custom build of Git. When no
custom filter-extensions are supplied, Git works as normal.

cc: Ævar Arnfjörð Bjarmason avarab@gmail.com

cc: Bagas Sanjaya bagasdotme@gmail.com
cc: Andrew Olsen andrew.olsen@koordinates.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Robert Coup robert.coup@koordinates.com

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2021

There is an issue in commit cce1da9:
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2021

There is an issue in commit 059c5a9:
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2021

There is an issue in commit ed9a5a1:
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2021

There is an issue in commit 8215e27:
Commit not signed off

@olsen232 olsen232 force-pushed the list-objects-filter-extensions branch from 8215e27 to 2247547 Compare September 5, 2021 22:58
@olsen232
Copy link
Author

olsen232 commented Sep 5, 2021

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 5, 2021

Submitted as pull.1031.git.1630885899.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-1031/koordinates/list-objects-filter-extensions-v1

To fetch this version to local tag pr-1031/koordinates/list-objects-filter-extensions-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-1031/koordinates/list-objects-filter-extensions-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Sun, Sep 05 2021, Andrew Olsen via GitGitGadget wrote:

> Adds an extension: option to list-object-filters, these are implemented by
> static libraries that must be compiled into Git. The Makefile argument
> FILTER_EXTENSIONS makes it easier to compile these extensions into a custom
> build of Git. When no custom filter-extensions are supplied, Git works as
> normal.

Having skimmed this and the added documentation I think what's really
missing is a "why"? What concrete use-case is this going to serve?

I.e. what is an extension you have in mind that's useful, but not so
useful as to even suggest it for inclusion in git.git before coming up
with this plug-in API mechanism?

Also, for such plug-ins the license is going to be GPL-v2 too I assume?
But that aspect isn't covered at all.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

Makefile Outdated
@@ -471,6 +471,11 @@ all::
# directory, and the JSON compilation database 'compile_commands.json' will be
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 06/09/21 06.51, Andrew Olsen via GitGitGadget wrote:
> From: Andrew Olsen <andrew.olsen@koordinates.com>
> 
> Custom list-object-filter extensions can be compiled into Git using the
> FILTER_EXTENSIONS Makefile argument.
> 

This can be squashed to previous patch.

-- 
An old man doll... just what I always wanted! - Clara

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2021

User Bagas Sanjaya <bagasdotme@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 6, 2021

On the Git mailing list, Bagas Sanjaya wrote (reply to this):

On 06/09/21 06.51, Andrew Olsen via GitGitGadget wrote:
> Adds an extension: option to list-object-filters, these are implemented by
> static libraries that must be compiled into Git. The Makefile argument
> FILTER_EXTENSIONS makes it easier to compile these extensions into a custom
> build of Git. When no custom filter-extensions are supplied, Git works as
> normal.

I don't see why this series is useful (use cases?).

-- 
An old man doll... just what I always wanted! - Clara

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 7, 2021

On the Git mailing list, Andrew Olsen wrote (reply to this):

Good point - sorry I sent this out without accompanying explanation. I'm still
learning about contributing to Git.

The filter extension that I want to implement is a spatial filter - it will
return blobs that store geometries that intersect with a given geometry, eg,
"only return blobs in North America". This is useful to us at kartproject.org,
"distributed version control for geospatial data", which is built on Git. But
safe to say that this functionality is not generally useful to Git users.

However, the idea we have is that there will be others who want to implement
custom filters also - perhaps like the spatial filter, these could be
domain-specific filters that are not useful to most Git users, but allow for
a custom Git to be more powerful when storing data from a particular domain.
We could just fork git and do what we want with the fork, but defining a plugin
interface makes it possible for us to keep using Git at master, instead of
maintaining a fork indefinitely.

My colleague Robert Coup coded this up once already as a plugin library
interface that could be loaded at runtime, and I've been tasked with rewriting
it as a compile-time interface, which he thought was "more likely" (but of
course not guaranteed) to be accepted as a worthwhile change to Git. He's
unfortunately on the other side of the world to me and not working today, but
I hope when he reappears he'll be able to say something more in defence of this
idea, and perhaps give a history of the reasoning for this particular solution.

Regarding licenses: the sample extensions I'm contributing will be covered by
Git's GPL-v2 (I assume), if they make it into the Git repository. Any other
extensions that may be written by third party authors and are maintained
elsewhere could be licensed as those authors see fit, as long as they take care
not to violate the terms of Git's GPL-v2 when they distribute the extension or
Git and the extension together. I could add a link to the GPL-v2 in the README
warning developers to check it before distributing any kind of extension to Git.
I'm not a lawyer and wouldn't want to give more specific advice than that.

On Tue, Sep 7, 2021 at 11:24 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> On 06/09/21 06.51, Andrew Olsen via GitGitGadget wrote:
> > Adds an extension: option to list-object-filters, these are implemented by
> > static libraries that must be compiled into Git. The Makefile argument
> > FILTER_EXTENSIONS makes it easier to compile these extensions into a custom
> > build of Git. When no custom filter-extensions are supplied, Git works as
> > normal.
>
> I don't see why this series is useful (use cases?).
>
> --
> An old man doll... just what I always wanted! - Clara
>

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 7, 2021

User Andrew Olsen <andrew.olsen@koordinates.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 7, 2021

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Sep 07 2021, Andrew Olsen wrote:

> Good point - sorry I sent this out without accompanying explanation. I'm still
> learning about contributing to Git.
>
> The filter extension that I want to implement is a spatial filter - it will
> return blobs that store geometries that intersect with a given geometry, eg,
> "only return blobs in North America". This is useful to us at kartproject.org,
> "distributed version control for geospatial data", which is built on Git. But
> safe to say that this functionality is not generally useful to Git users.

That's interesting. I think you're probably right that returning blobs
by a GEO filter is not going to be generally useful to git (I assume
it's aware of some specially-encoded blobs?), but a mechanism that
enables that might be...

> However, the idea we have is that there will be others who want to implement
> custom filters also - perhaps like the spatial filter, these could be
> domain-specific filters that are not useful to most Git users, but allow for
> a custom Git to be more powerful when storing data from a particular domain.
> We could just fork git and do what we want with the fork, but defining a plugin
> interface makes it possible for us to keep using Git at master, instead of
> maintaining a fork indefinitely.
>
> [...]
>
> My colleague Robert Coup coded this up once already as a plugin library
> interface that could be loaded at runtime, and I've been tasked with rewriting
> it as a compile-time interface, which he thought was "more likely" (but of
> course not guaranteed) to be accepted as a worthwhile change to Git. He's
> unfortunately on the other side of the world to me and not working today, but
> I hope when he reappears he'll be able to say something more in defence of this
> idea, and perhaps give a history of the reasoning for this particular solution.

While it would be easier for you it would leave this project stuck
maintaining a C API interface, and indeed your documentation suggests
that not only should users use the narrow C API provided here, but any
arbitrary internal structs in git.git.

Personally I'm not per-se opposed to such a thing, but I think that we
should really be considering and trying something like the clean/smudge
hook interface first rather than a full C API.

This seems like a perfect fit for such an IPC interface, i.e. we'd have
a hook to register custom filters, and when it came to filtering objects
git would communicate with that hook, which in turn would query data
with something like the "git cat-file --batch" interface.

> Regarding licenses: the sample extensions I'm contributing will be covered by
> Git's GPL-v2 (I assume), if they make it into the Git repository. Any other
> extensions that may be written by third party authors and are maintained
> elsewhere could be licensed as those authors see fit, as long as they take care
> not to violate the terms of Git's GPL-v2 when they distribute the extension or
> Git and the extension together. I could add a link to the GPL-v2 in the README
> warning developers to check it before distributing any kind of extension to Git.
> I'm not a lawyer and wouldn't want to give more specific advice than that.

I believe you've misunderstood how the GPL works, those third party
authors would not be free to license their plugins as they see fit. The
reason the LGPL license exists is to allow what you're describing, but
git uses the full GPL v2.

See https://en.wikipedia.org/wiki/GPL_linking_exception and
https://www.gnu.org/licenses/gpl-faq.html#LinkingWithGPL

The project you've linked to even has a GPL linking exception of its own
(but git.git does not):
https://github.com/koordinates/kart/blob/2934f2b951d61233cbaab9ff627aa3c8cbfb82bc/COPYING#L7-L16

> On Tue, Sep 7, 2021 at 11:24 AM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>>
>> On 06/09/21 06.51, Andrew Olsen via GitGitGadget wrote:
>> > Adds an extension: option to list-object-filters, these are implemented by
>> > static libraries that must be compiled into Git. The Makefile argument
>> > FILTER_EXTENSIONS makes it easier to compile these extensions into a custom
>> > build of Git. When no custom filter-extensions are supplied, Git works as
>> > normal.
>>
>> I don't see why this series is useful (use cases?).
>>
>> --
>> An old man doll... just what I always wanted! - Clara
>>

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 7, 2021

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 8, 2021

On the Git mailing list, Robert Coup wrote (reply to this):

Hi all,

Sorry, life got in the way at an unfortunate moment. And it should
very much be tagged "RFC" — thanks Ævar and Bagas for reading. Here's
the additional background you could have used earlier on — I've
bundled it together, but I'll happily follow up specific questions
individually. I've CCed in a couple of other people who might find it
interesting too.

So Andrew & my motivation here is to provide some specialised
filtering at clone/fetch time. In Kart[1] datasets are organised
(simplistically) by primary key, but for spatial data we want to
provide an orthogonal spatial extent filter which isn't part of the
tree path, so we can't reuse the work done in the sparse filters. For
a fetch obviously the server-side will require support for any
indexing and ultimately deciding whether a particular blob should be
part of the tree or not.

In the original filter implementation [2], various "profiles" were
alluded to as a case where the server operator might know a lot more
about how the developer would want to use the repository than the
client does, and a named profile for the server to interpret would be
a reasonably clean approach. Referred again in [3]. Sparse filters,
subject to the performance issues hopefully being improved by the
cone-mode changes, cater to a lot of them. The existing built-in
filters are fairly simple and there's a relatively simple interface
for them to implement, extending them seems like a reasonable approach
to me — potentially allowing people doing interesting things with
partial clones to take it and run in a general way without too much
effort.

So the key element to clarify/understand for this proposal is that the
main change to Git is the ability to use
`--filter=extension:<name>[=<param>]` which passes through to
git-upload-pack on the server side, to rev-list, which looks up /
validates the filter name/parameter and applies it. So if you want to
offer a custom filter, you build & set it up on the server and any Git
client (if this is merged) can make use of it without any additional
code.

Wrt IPC, my very first proof of concept used an external process that
rev-list launched, passed a series of oids/types via stdin, receiving
yes/no responses via stdout. Even after quite a lot of OS-specific
efforts to optimise the data flow across the pipes it was slow for
non-trivial sized repositories (where it matters) — essentially
boiling down to too much context switching between processes.
Reorganising the existing filtering approach to do batching with
deferred responses and parallelising the filtering into threads seemed
like an awful lot of effort for potentially little gain, in a niche
use case.

Moving it in-process made it perform well: CPU use moves into the
"deciding whether this object is in or out" phase rather than burning
it in IPC & context-switching. I did build up a basic runtime-loadable
plugin approach, but there was a reasonable amount of the internal git
API that the filters need/touched (even things like hash sizes add a
pile of complexity to it) unless it was reduced back to passing
oids+types. My approach for plugins was basically "could I potentially
implement the existing filters?" Without more of the git API I don't
think this would be feasible. Plus Git would have to agree on and
support a public ABI going forward, which for a potentially niche use
case didn't seem reasonable to propose.

Hence compile time: simpler; no ABI issues; the internal API doesn't
change that much wrt things that filters are likely to do — if someone
creates a plugin then it's on them to keep it building across git
upgrades on their server; platform support is simpler; and if others
find exciting uses for it then a runtime-loadable plugin API is always
possible in future. And only the server ever needs any custom
binaries.

Licensing — yes, any filters would need to be GPL-licensed since
they're compiled with Git. Only the server operator needs to concern
themselves with complying with this (& associated licensing for any
external libraries/etc a plugin might need) since that's where the
plugin code is linked & runs. With the usual issue around internal use
within an organisation not qualifying as "distribution" under the GPL.
FWIW, for Kart we'll be GPL-licensing the server-side spatial filter
plugin code for anyone who's interested.

Hope this clarifies a bit.

Rob :)

[1] https://kartproject.org — building on Git to version geospatial
datasets. Not sure if the videos ever got released (thanks Covid), but
I did a talk at Git Merge 2020 on it when we released the first alpha.
[2] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
[3] https://public-inbox.org/git/79b06312-75ca-5a50-c337-dc6715305edb@jeffhostetler.com/

@gitgitgadget
Copy link

gitgitgadget bot commented Sep 8, 2021

User Robert Coup <robert.coup@koordinates.com> has been added to the cc: list.

@rcoup rcoup force-pushed the list-objects-filter-extensions branch from e1645ec to 3cab392 Compare November 9, 2022 15:58
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 9, 2022

There are issues in commit 4c9b5ae:
Fix tabs issue in generate-list-objects-filter-extensions.sh
Commit checks stopped - the message is too short
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 9, 2022

There are issues in commit 3cab392:
Make linker configurable in Makefile
Commit not signed off

Adds an extension:<custom-filter> option to list-object-filters,
these are implemented by static libraries that must be compiled into
Git. C code changes only - Makefile changes follow.

Signed-off-by: Andrew Olsen <andrew.olsen@koordinates.com>
Custom list-object-filter extensions can be compiled into Git using the
FILTER_EXTENSIONS Makefile argument.

Signed-off-by: Andrew Olsen <andrew.olsen@koordinates.com>
Basic filter extension example which filters to a random subset of
blobs, and another example which shows how to do the same in C++ and
how to link in another library required by a filter extension.
Documentation changes follow.

Signed-off-by: Andrew Olsen <andrew.olsen@koordinates.com>
Explains how to develop a custom extension for list-objects-filter
behavior, and how to compile it into a custom build of Git using the
FILTER_EXTENSIONS Makefile argument.

Signed-off-by: Andrew Olsen <andrew.olsen@koordinates.com>
Specify via LINK=mylinker, and defaults to $(CC)
@rcoup rcoup force-pushed the list-objects-filter-extensions branch from 3cab392 to 3843f33 Compare November 9, 2022 20:51
@gitgitgadget
Copy link

gitgitgadget bot commented Nov 9, 2022

There are issues in commit fdc1707:
Fix tabs issue in generate-list-objects-filter-extensions.sh
Commit checks stopped - the message is too short
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Nov 9, 2022

There are issues in commit 3843f33:
Make linker configurable in Makefile
Commit not signed off

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant