Improve publish performance, especially for prefixes with a large number of snapshots #1273

Merged
neolynx merged 6 commits into master from feature/publish-perf
Apr 24, 2024

Conversation

neolynx
Member

@neolynx neolynx commented Apr 17, 2024

Replaces #1222

Requirements

All new code should be covered with tests, documentation should be updated. CI should pass.

Description of the Change

This contains a variety of improvements for publish performance, specifically speeding up the cleanup operation by:

  • using a separate, faster package to walk the filesystem
  • avoiding re-loading the same packages repeatedly (this will happen if a single prefix has a large number of snapshots or repos that share most of their packages)
  • slightly improving the package list loading performance w/ zero-copy deserialization

Benchmarks were added for all of these, and some unit tests were added specifically to test aspects of cleanup.

We have a relatively large aptly repository with >90 repositories, ~207k packages across all of the repositories, and >3.5k snapshots; a testing version of that repository was used to measure the publishing performance. Prior to these changes, publishing took >9 minutes, with over 8 minutes of that time just in the cleanup phase. With these, the cleanup time goes down to ~13 seconds, for a total publish time of a little under a minute.

Checklist

  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)
  • man page updated (if applicable)
  • bash completion updated (if applicable)
  • documentation updated
  • author name in AUTHORS

codecov bot commented Apr 17, 2024

Codecov Report

Attention: Patch coverage is 92.94118%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 74.78%. Comparing base (ff8f79f) to head (289c843).

Files            Patch %   Lines
deb/publish.go   79.31%    4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1273      +/-   ##
==========================================
+ Coverage   74.70%   74.78%   +0.08%     
==========================================
  Files         144      144              
  Lines       16216    16246      +30     
==========================================
+ Hits        12114    12150      +36     
+ Misses       3158     3154       -4     
+ Partials      944      942       -2     


@neolynx neolynx changed the title feature/publish perf Improve publish performance, especially for prefixes with a large number of snapshots Apr 20, 2024
@neolynx neolynx added needs review Ready for review & merge needs rebase The PR needs to be rebased on master labels Apr 20, 2024
refi64 and others added 5 commits April 21, 2024 11:21
In some local tests with a slowed-down filesystem, this cut the time to
clean up a repository by ~3x, bringing the total 'publish update' time
from ~16s to ~13s.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
When merging reflists with ignoreConflicting set to true and
overrideMatching set to false, the individual ref components are never
examined, but the refs are still split anyway. Avoiding the split when
we never use the components brings a massive speedup: on my system, the
included benchmark goes from ~1500 us/it to ~180 us/it.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
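
To illustrate the idea in the commit above, here is a minimal sketch of a merge that only splits refs when the components are actually needed. It is not aptly's merge implementation: the `"P<arch> <name> <version>"` ref layout, the `mergeRefs`, `splitRef`, and `sameArchAndName` names, and the simplified override rule are all assumptions made for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitCalls counts how often a ref is split, to make the saving visible.
var splitCalls int

// splitRef breaks a hypothetical "P<arch> <name> <version>" ref into fields.
func splitRef(ref []byte) [][]byte {
	splitCalls++
	return bytes.Split(ref, []byte(" "))
}

// sameArchAndName reports whether two refs share their first two fields.
func sameArchAndName(a, b []byte) bool {
	pa, pb := splitRef(a), splitRef(b)
	return len(pa) >= 2 && len(pb) >= 2 &&
		bytes.Equal(pa[0], pb[0]) && bytes.Equal(pa[1], pb[1])
}

// mergeRefs merges two sorted ref lists and drops exact duplicates.
// Only when overrideMatching is true are refs split, so that a right-hand
// ref can replace a left-hand ref for the same arch and name.
func mergeRefs(left, right [][]byte, overrideMatching bool) [][]byte {
	out := make([][]byte, 0, len(left)+len(right))
	i, j := 0, 0
	for i < len(left) && j < len(right) {
		l, r := left[i], right[j]
		cmp := bytes.Compare(l, r)
		switch {
		case cmp == 0: // identical ref: keep a single copy
			out = append(out, l)
			i++
			j++
		case overrideMatching && sameArchAndName(l, r):
			i++ // same package, different version: the right side wins
		case cmp < 0:
			out = append(out, l)
			i++
		default:
			out = append(out, r)
			j++
		}
	}
	out = append(out, left[i:]...)
	out = append(out, right[j:]...)
	return out
}

func main() {
	left := [][]byte{[]byte("Pamd64 a 1.0"), []byte("Pamd64 b 1.0")}
	right := [][]byte{[]byte("Pamd64 a 2.0"), []byte("Pamd64 c 1.0")}

	merged := mergeRefs(left, right, false)
	fmt.Println(len(merged), "refs,", splitCalls, "splits")

	merged = mergeRefs(left, right, true)
	fmt.Println(len(merged), "refs,", splitCalls, "splits")
}
```

The short-circuit in the second `case` is the whole trick: with `overrideMatching` set to false, `sameArchAndName` is never called and the merge works on the raw byte slices alone.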
The cleanup phase needs to list out all the files in each component in
order to determine what's still in use. When there's a large number of
sources (e.g. from having many snapshots), the time spent just loading
the package information becomes substantial. However, in many cases,
most of the packages being loaded are actually shared across the
sources; if you're taking frequent snapshots, for instance, most of the
packages in each snapshot will be the same as other snapshots. In these
cases, re-reading the packages repeatedly is just a waste of time.

To improve this, we maintain a list of refs that we know were processed
for each component. When listing the refs from a source, only the ones
that have not yet been processed will be examined. Some tests were also
added specifically to check listing the files in a component.

With this change, listing the files in components on a copy of our
production database went from >10 minutes to ~10 seconds, and the newly
added benchmark went from ~300ms to ~43ms.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
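
The deduplication described in the commit above can be sketched roughly as follows. This is an illustration only, not the code in deb/publish.go: `loadFiles` stands in for the expensive package load, and `listComponentFiles`, the ref strings, and the `pool/` paths are hypothetical.

```go
package main

import "fmt"

// loadCalls counts simulated package loads.
var loadCalls int

// loadFiles stands in for the expensive database read of one package.
func loadFiles(ref string) []string {
	loadCalls++
	return []string{"pool/" + ref + ".deb"}
}

// listComponentFiles lists the files referenced by all sources of one
// component, loading each distinct ref only once.
func listComponentFiles(sources [][]string) []string {
	processed := make(map[string]struct{}) // refs already handled for this component
	var files []string
	for _, refs := range sources {
		for _, ref := range refs {
			if _, seen := processed[ref]; seen {
				continue // shared with an earlier source: skip the reload
			}
			processed[ref] = struct{}{}
			files = append(files, loadFiles(ref)...)
		}
	}
	return files
}

func main() {
	// Three snapshots that share most of their packages.
	sources := [][]string{
		{"a_1.0", "b_1.0"},
		{"a_1.0", "b_1.0", "c_1.0"},
		{"b_1.0", "c_1.0"},
	}
	files := listComponentFiles(sources)
	fmt.Println(len(files), "files,", loadCalls, "loads")
}
```

With frequent snapshots the number of distinct refs grows far more slowly than the number of sources, so the work scales with the former instead of the sum over all sources.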
Reflists are basically stored as arrays of strings, which are quite
space-efficient in MessagePack. Thus, using zero-copy decoding results
in nice performance and memory savings, because the overhead of separate
allocations ends up far exceeding the overhead of the original slice.

With the included benchmark run for 20s with -benchmem, the runtime,
memory usage, and allocations go from ~740us/op, ~192KiB/op, and 4100
allocs/op to ~240us/op, ~97KiB/op, and 13 allocs/op, respectively.

Signed-off-by: Ryan Gonzalez <ryan.gonzalez@collabora.com>
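
For readers unfamiliar with the term, the sketch below contrasts copying and zero-copy decoding on a toy format. It does not use MessagePack or the codec library from this PR: the one-byte length prefix and the `encode`, `decodeCopy`, and `decodeZeroCopy` helpers are assumptions chosen to keep the example self-contained, and the decoders assume well-formed input.

```go
package main

import "fmt"

// encode packs items as a 1-byte length followed by the raw bytes.
// Items must be shorter than 256 bytes in this toy format.
func encode(items []string) []byte {
	var buf []byte
	for _, it := range items {
		buf = append(buf, byte(len(it)))
		buf = append(buf, it...)
	}
	return buf
}

// decodeCopy allocates a fresh slice for every item.
func decodeCopy(buf []byte) [][]byte {
	var out [][]byte
	for len(buf) > 0 {
		n := int(buf[0])
		item := make([]byte, n)
		copy(item, buf[1:1+n])
		out = append(out, item)
		buf = buf[1+n:]
	}
	return out
}

// decodeZeroCopy returns sub-slices that point into buf itself, so the
// only allocations are for the outer slice. The capacity is capped so an
// append to one item cannot overwrite its neighbour.
func decodeZeroCopy(buf []byte) [][]byte {
	var out [][]byte
	for len(buf) > 0 {
		n := int(buf[0])
		out = append(out, buf[1:1+n:1+n])
		buf = buf[1+n:]
	}
	return out
}

func main() {
	buf := encode([]string{"Pamd64 a 1.0", "Pamd64 b 1.0"})
	copied := decodeCopy(buf)
	shared := decodeZeroCopy(buf)
	fmt.Printf("%s %s\n", copied[0], shared[1])
}
```

The trade-off is that the decoded items alias the input buffer: the buffer stays alive as long as any item does, and it must not be modified or reused while the items are in use.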
let's be compatible with debian/bookworm
@neolynx neolynx added needs rebase The PR needs to be rebased on master and removed needs rebase The PR needs to be rebased on master labels Apr 21, 2024
@neolynx neolynx removed the needs rebase The PR needs to be rebased on master label Apr 21, 2024
@neolynx neolynx removed the needs review Ready for review & merge label Apr 24, 2024
@neolynx neolynx merged commit 27013c0 into master Apr 24, 2024
9 checks passed
@neolynx neolynx deleted the feature/publish-perf branch April 24, 2024 14:50