Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using a managed Git backend #505

Closed
qmfrederik opened this issue Sep 4, 2020 · 13 comments · Fixed by #521
Closed

Consider using a managed Git backend #505

qmfrederik opened this issue Sep 4, 2020 · 13 comments · Fixed by #521
Milestone

Comments

@qmfrederik
Copy link
Contributor

TL;DR

I've done some experiments with a managed Git backend (instead of libgit2).

In all scenarios, performance is better with the managed backend. In the best case, throughput increases more than tenfold.
The prototype source code is available here: https://github.com/qmfrederik/quamotion.gitversioning.

There are other managed Git backends (GitNet, GitSharp, NGit) available.

Backend Repository Mean Error StdDev
managed Cuemon 6.585 ms 0.0977 ms 0.0914 ms
libgit2 Cuemon 12.946 ms 0.1425 ms 0.1333 ms
managed SuperSocket 7.583 ms 0.1067 ms 0.0998 ms
libgit2 SuperSocket 61.295 ms 0.8168 ms 0.7640 ms
managed local 1.938 ms 0.0161 ms 0.0151 ms
libgit2 local 24.805 ms 0.3896 ms 0.3253 ms
managed xunit 57.891 ms 0.6076 ms 0.5386 ms
libgit2 xunit 62.054 ms 0.9426 ms 0.7871 ms

Why

NerdBank.GitVersioning uses LibGit2Sharp as its back-end. It comes with a couple of drawbacks:

  • Performance - libgit2 is a general-purpose library and perhaps not geared towards read-only scenarios like nbgv, P/Invoke overhead,... .
  • Maintainability - it looks like development on LibGit2Sharp has stalled a bit (last commit is from April this year)
  • Portability - there's a very long list of issues related to nbgv not on various Linux distributions

What

My goal was to implement a minimal viable Git backend which you can use to calculate the Git height. That's all.

This includes

  • Read commits, trees and blobs from a local Git repository
  • Support for 'packed' Git repositories (i.e. what you get when you call git gc or after a fresh Git clone) and deltafied objects
  • In-memory caching

I've also applied some of the suggestions related to performance made by @filipnavara and @djluck, such as

  • Using tree IDs to check whether files have changed, instead of parsing the full version.json contents
  • Using the .NET Core JSON API instead of Newtonsoft.Json

I've not yet attempted / further exploration

  • Walking the Git commit graph (freshly cloned GitHub repositories do not appear to have a Git graph file)
  • The git tree and version.json objects are fully loaded into memory before parsing them; we can probably further improve performance by only reading the data we actually need.

Validation & lessons learned

  • It can be done I ran tests on three popular GitHub repositories which use nbgv, and Git height calculation seems to work.
  • Keep it simple Most of the GitHub repositories which use nbgv do so in a very simple way - a standard version.json file in the repository root, no path filters,... . nbgv has a lot of configuration knobs, which may impact performance.
  • Packed repositories have different performance characteristics than unpacked repositories. It turns out that performance of freshly cloned GitHub repositories is very different from local repositories, because all files are stored in git packs. It took some time to get the performance on par with libgit2 for repos with a large git height (like xunit); but it looks good now.
  • Room for improvement I'm sure there's still room for improvement if you want to squeeze out extra performance

What's next

Obviously, that's up to the maintainers of this repository. Personally, I've spent too much time on getting LibGit2Sharp working on the platforms I care about (Visual Studio Code on Ubuntu, to name one) and will want to move to a purely managed build task for calculating Git height. My preference is to keep using nbgv, so I can take this further and open a PR if there's interest in getting it merged.

@filipnavara mentioned he has a repository with a very large git height (> 1500 IIRC). It's be interesting to run the benchmarks on that repository, too, and see how the managed implementation holds up (I'm guessing running benchmarks will uncover some bugs, too), both in an 'unpacked' and a packed state of the repository.

@filipnavara
Copy link
Member

Thanks for giving this a try. In fact I was planning to prototype this myself so you saved me quite a bit of time. I'll likely get to try it early next week and I'll report back the results.

@AArnott
Copy link
Collaborator

AArnott commented Sep 4, 2020

Thanks for your investment and willingness to contribute more. I would love to switch to a managed git implementation to finally put all the linux distro compat issues behind me. Of course, libgit2sharp is on the cusp of fixing that by replacing their native https implementation with a managed one, but they've been on the cusp for probably over a year now. As you say, their investment is really light right now.

I do have some concerns/responses to some of your points however:

Some nbgv commands do actually author commits, checkout branches, etc. So unless a managed git implementation supported all that, we'd still have to keep our libgit2sharp dependency, albeit we could potentially use it only in those scenarios and that shouldn't hurt linux distro compat for all the scenarios that the managed git implementation does support.

Via libgit2sharp we also support worktrees in readonly scenarios. The managed git implementation would have to support that too.

I'm not sure what other git configs or .git folder states that we should consider here, so I would not likely switch to a managed git implementation unless it was either very mature or had a very responsive developer backing it who was committed to making nbgv work great on it, since I personally do not have time to invest in a managed git implementation. @qmfrederik what are your thoughts on this? Are you committed to reaching parity in at least the read-only git scenarios that nbgv currently requires (without feature cuts)? Or would you recommend one of the several other managed git implementations that you linked to as active and possibly willing?

no path filters,... . nbgv has a lot of configuration knobs, which may impact performance.

Path filters was a recent feature which I was only willing to accept by PR after conditions were put in place to avoid perf regressions when they weren't used. I'd be surprised (and disappointed) if they had negative perf impact after all.

But certainly walking up a project directory to the git root does add time. There are plenty of repos that do this, so while perf is nice, I believe there are ample ways to improve perf substantially without cutting this or other features.

@qmfrederik
Copy link
Contributor Author

Some nbgv commands do actually author commits, checkout branches, etc.

Yes, I noticed that. I first tried to refactor nbgv and put the Git API behind an interface.

My current thinking is that it'd be worth to make the read-only operations (everything required to calculate the Git height) use a managed implementation, and have nbgv rely on LibGit2Sharp.

I at first thought it'd be straightforward to extract an interface for the Git API and have nbgv use both, allowing users perhaps to choose between both with a feature flag. It is not as straightforward as I first thought (but probably doable). nbgv doesn't use DI much / uses statics / ... which does not simplify the process.

Are you committed to reaching parity in at least the read-only git scenarios that nbgv currently requires (without feature cuts)?

For read-only access, it looks like at least @filipnavara and I are willing to invest time in making it work. The format of the git object database and the pack files seems to be very stable over time, which is a bonus. The .NET projects I found on GitHub are all fairly old and seem unmaintained.

Calculating the commit height essentially only touches the Git object database & pack files, so worktree support should be fairly trivial.

Path filters and other features should be doable (especially since nbgv comes with a good set of unit tests) but I'd say there's be limited incentive to push the performance to its limits when those features are on.

Net, I think the next steps are probably something along the lines of:

  1. Decide whether using a read-only implementation for calculating version height is sufficient.
  2. Prototype a managed implementation and see if it meets the requirements for acceptance in this repo.
  3. Decide whether we want have both implementations (managed + libgit2) with a feature flag, or we can make the full switch at once.

@qmfrederik
Copy link
Contributor Author

@filipnavara

I'll likely get to try it early next week and I'll report back the results.

Looking forward to see what perf you can squeeze out of git!

PS: I apparently forgot to push a couple of commits to the repo, so you probably want to make sure you pull the latest changes. There were some changes related to working with pack files.

@filipnavara
Copy link
Member

filipnavara commented Sep 4, 2020

I found some missing parts in the Frederik's managed GIT implementation. For example, due to the sheer size of our repository some of our developers use git clone --shared to maintain more than one work tree but to share the object/pack files (needs working support for objects/info/alternates). I am not sure whether the newer git worktree uses the same mechanism or not but I know few of my colleagues use it too. This is generally not hard to implement and I would be happy to contribute it if the managed code approach turns out viable. All the other managed GIT implementations seemed to be abandoned.

In the past I contributed to go-git - a pure Go library for manipulating GIT - and from my experience the read-only implementation is fairly easy to get done right. It gets a little tricky with all the nooks and crannies around reference parsing, config file parsing and features like alternative object stores but once that gets done it requires nearly zero maintenance.

My main issue with libgit2 is the size, portability and lack of commitment to adding the new GIT features into the native library. For my use cases (GitVersioning, online GitHub-like browsing through Gitea) it's absolutely essential to get some performance aspects right. Since libgit2 lacks the support for commit graph files this turned out to be difficult. Looking up the most recent commit for every file in a specific directory listing took up to 30 times more without the commit graph files on our repository. The same thing with commit graph files could be done within a second or two even for huge repositories with 10+ years of history.

@AArnott
Copy link
Collaborator

AArnott commented Sep 4, 2020

Decide whether using a read-only implementation for calculating version height is sufficient.

I think this should be fine.

Prototype a managed implementation and see if it meets the requirements for acceptance in this repo.

Yes. We might hit snags around interesting niche features like #479 that may represent a good deal of extra work in the managed git library to support it. I'd be hesitant to take feature cuts. But deferring new feature work while the managed git impl catches up to libgit2sharp in those ways is tolerable.

Decide whether we want have both implementations (managed + libgit2) with a feature flag, or we can make the full switch at once.

I'd rather be incremental about it, where NBGV takes a hybrid approach that gradually migrates as the managed git implementation becomes more capable and we have time to migrate code.
I am not thrilled at the idea of a feature flag, since that would significantly increase the size of the source code (since it's all git interaction, pretty much) and make maintaining/testing harder. The upside to a feature flag is people could switch back to libgit2sharp if they hit a repo situation that the managed impl doesn't support. But I hope we can get ahead of that as much as possible through prereleases and testing, and just be committed to being super-responsive when the bug reports come in.

My main issue with libgit2 is the size

This work may not address that concern. I think we should plan on having libgit2sharp "in box" as a crutch so we can (continue to) support things that the managed git implementation does not (yet) support. I would only remove libgit2sharp once we're confident we don't and won't need it any more. But in the meantime we can/should avoid loading it until we absolutely need it, with the goal being that simply computing the version as is required in builds should not require it at all.

All the other managed GIT implementations seemed to be abandoned.

Are they decent bases though on which to continue work as opposed to starting fresh?
Also: if we do start fresh, I hope the API can be rich in Memory<T>/Span<T> use to provide maximum potential for great perf.

@qmfrederik
Copy link
Contributor Author

Are they decent bases though on which to continue work as opposed to starting fresh?
Also: if we do start fresh, I hope the API can be rich in Memory/Span use to provide maximum potential for great perf.

The open source libraries are fairly old (think NetFx-era), and at first sight it seemed like working with packfiles was not well supported by these libraries. That, paired with the absence of Span<T>-based code, made me decide to start fresh.

An option could be to fork one of the repositories and evolve it. My main concern there is that it might set expectations of a supported, general-purpose Git library for .NET. I'm not sure who would maintain that project - it's not something I'd take on at the moment.

@qmfrederik
Copy link
Contributor Author

For example, due to the sheer size of our repository some of our developers use git clone --shared to maintain more than one work tree but to share the object/pack files (needs working support for objects/info/alternates).

https://github.com/qmfrederik/Quamotion.GitVersioning/commit/fbbadbec36f92d91d685018f02f2dbc11dcea275 may help ;-)

@AArnott
Copy link
Collaborator

AArnott commented Sep 4, 2020

@qmfrederik It sounds like your interest is limited just to NB.GV scenarios. So would we put the managed git implementation directly into this repo? That is probably fine.

Another place to look for a good base is SourceLink, which I believe has a readonly managed git implementation already.

I haven't yet looked closely at your prototype code @qmfrederik, so I'm certainly not trying to diss your existing prototype. I just want to keep the up front and maintenance cost as low as possible. If we can share code with another project with similar interests, we may help each other. But if that isn't a good fit, and you're willing to put in the work, I'm good with a unique solution in this repo too.

@AArnott
Copy link
Collaborator

AArnott commented Sep 4, 2020

FYI, I think the perf improvements coming for #114 will be very significant, and are relatively cheap compared to rewriting portions of libgit2sharp. So the long-term value I see in a managed git impl is primarily around working everywhere instead of being limited to where libgit2sharp has native binary support. And if libgit2sharp ever finishes removing their native HTTPS dependency, that would be resolved too.
So I wonder... might we be putting in a lot of work for your proposal for a short-term-only gain? If you share the concern about this, let's push on libgit2sharp one more time to see what they're thinking, and give me a few more days to put together all the nbgv perf improvements already in the pipe. Then you can see if you're still motivated to put in the work.

@filipnavara
Copy link
Member

filipnavara commented Sep 7, 2020

For example, due to the sheer size of our repository some of our developers use git clone --shared to maintain more than one work tree but to share the object/pack files (needs working support for objects/info/alternates).

qmfrederik/Quamotion.GitVersioning@fbbadbe may help ;-)

That's not exactly it. It should act as a fallback. First try the regular objects directory and only if that fails then try the alternate one. The objects could be present in either one. There could also be more than one alternate directory and the paths can be relative.

I'll file any problem I find as an issue on your repository to keep the discussion here cleaner.

@qmfrederik
Copy link
Contributor Author

@AArnott

So would we put the managed git implementation directly into this repo? That is probably fine.

That's what I would suggest at the moment.

Another place to look for a good base is SourceLink, which I believe has a readonly managed git implementation already.

Yes, I initially started from the code which is in the SourceLink repository. But it's a very light implementation - it supports reading the URL of the remote repository and the name of the branch you're on, and that's pretty much it. There's no support for reading commits, trees or blobs.

So the long-term value I see in a managed git impl is primarily around working everywhere instead of being limited to where libgit2sharp has native binary support. And if libgit2sharp ever finishes removing their native HTTPS dependency, that would be resolved too.

Yes, but there are other issues caused by libgit2sharp:

  1. Support for Visual Studio Code on Linux is still broken (because it uses Omnisharp which uses Mono which has different native library loading rules - Visual Studio Code/OmniSharp on Ubuntu 18.04: DllNotFoundException #417)
  2. There's the libcurl dependency, too.
  3. Then there's support for new architectures (e.g. macOS on ARM, Windows on ARM, Alpline Linux on ARM, perhaps Android if it evolves into a desktop OS,...). You'll get it all for free when you do a managed implementation (and dotnet/core follows).

Here's what the timeline for linux-arm64 looked like:

So, that's 7 months to get it done. You'd get it 'for free' with a managed implementation.

Honestly, I never want to go there again. The complexity of having to raise PRs across three repositories with two different owners, the complexity in debugging this (a MSBuild task which runs a managed wrapper around native code, which can execute in .NET, .NET Core or Mono, with all the different loading rules,...).

I think that time is better spent building a performant, managed, read-only Git implementation.

If you share the concern about this, let's push on libgit2sharp one more time to see what they're thinking, and give me a few more days to put together all the nbgv perf improvements already in the pipe. Then you can see if you're still motivated to put in the work.

It's always good to reach out to libgit2sharp and see what they are up to. libgit2/libgit2sharp.nativebinaries#96 was merged a while ago, but not sure what the impact is.

@AArnott
Copy link
Collaborator

AArnott commented Sep 8, 2020

You make a strong argument, @qmfrederik. I'm in favor of your continuing your prototyping. Please raise architectural or other significant questions early to avoid lost work. I'm listening for a response to the "managed HTTP stack" PR over at libgit2sharp, but at the moment I'm inclined to go with your proposal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants