Extraction/buffer sizing improvements #887

Merged
egibs merged 12 commits into chainguard-dev:main from egibs:optimize-extractions on Apr 27, 2025

@egibs egibs commented Apr 27, 2025

Follow-up for #867.

It turns out that extracting many small files with ExtractZip was slow, especially when handling thousands of .jar files. This was due partly to an incorrectly sized buffer and partly to the CPU overhead of calling io.CopyBuffer.

This PR replaces io.CopyBuffer with a manual in-line loop in each extraction function; the loop handles the writing and error checking itself. (I tried a helper function for the loops, but it was just as slow.) In my testing, this was ~3x faster (for example, all Trino packages for a single architecture can be extracted in ~11 seconds).

Anecdotally, scanning a single SonarQube package takes 28-30 seconds with these changes (I ran this at least a dozen times to double-check), as opposed to 82-85 seconds on HEAD, and scanning all of the Trino packages for a single architecture takes ~9 minutes as opposed to ~15.

This PR also replaces the standard library archive functions with popular, performant third-party alternatives (we were already using klauspost/compress) and handles some cleanup/refactoring elsewhere. I addressed what #885 was trying to do as well.

I don't see much effort left in speeding up extractions after this merges. Future optimizations will come down to rules (condition/pattern matching).

Bonus -- I updated make install-yara-x to enable native-code-serialization which measurably speeds up scans.

Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs force-pushed the optimize-extractions branch from 4f115cf to 8b019e1 Compare April 27, 2025 18:41
egibs added 2 commits April 27, 2025 14:08
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs requested a review from Copilot April 27, 2025 19:19

egibs added 2 commits April 27, 2025 14:20
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs requested a review from Copilot April 27, 2025 19:22

egibs added 2 commits April 27, 2025 14:25
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs requested a review from Copilot April 27, 2025 19:28

Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs requested a review from Copilot April 27, 2025 19:38

Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
@egibs egibs requested a review from Copilot April 27, 2025 19:42

Copilot AI left a comment

Pull Request Overview

This PR improves extraction performance by replacing io.CopyBuffer with custom in-line loops and by tuning buffer pool sizing. Key changes include:

  • Adopting manual loops in archive extractors (zstd, zlib, zip, tar, rpm, gzip, deb, bz2) to reduce CPU overhead.
  • Refactoring buffer pool initialization to pre-populate buffers based on dynamic counts and updated size limits.
  • Switching to third‐party archive libraries and adjusting error handling and resource cleanup.

Reviewed Changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/report/strings.go | Updates buffer pool creation with pre-size based on match count. |
| pkg/programkind/programkind.go | Refactors header reading and file type detection logic. |
| pkg/pool/pool.go | Modifies NewBufferPool signature; pre-populates buffers and calls clear(buf). |
| pkg/archive/*.go | Replaces io.CopyBuffer with manual loops for file extraction. |
| pkg/action/scan.go | Updates file scanning logic to use manual file reading loops. |
| (Other archive files) | Applies similar extraction loop and buffer management improvements. |

Files not reviewed (1)
  • go.mod: Language not supported
Comments suppressed due to low confidence (2)

pkg/archive/archive.go:97

  • Switching from os.Remove to os.RemoveAll may remove directories in addition to files. Please confirm that fullPath will always refer to a file or that removal of directories is the intended behavior.
if err := os.RemoveAll(fullPath); err != nil {
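
For context on the difference raised here, the following is a small standalone sketch using only the standard library. It is not taken from pkg/archive/archive.go, and `compareRemoval` and the temporary paths are hypothetical. It shows that os.Remove refuses a non-empty directory while os.RemoveAll deletes it recursively.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// compareRemoval builds a non-empty directory and tries both removal calls
// on it. It returns the error from os.Remove, the error from os.RemoveAll,
// and any error hit while setting up the directory.
func compareRemoval() (removeErr, removeAllErr, setupErr error) {
	dir, setupErr := os.MkdirTemp("", "remove-sketch-*")
	if setupErr != nil {
		return nil, nil, setupErr
	}
	defer os.RemoveAll(dir) // clean up the scratch directory

	target := filepath.Join(dir, "extracted")
	if setupErr = os.Mkdir(target, 0o700); setupErr != nil {
		return nil, nil, setupErr
	}
	file := filepath.Join(target, "a.txt")
	if setupErr = os.WriteFile(file, []byte("x"), 0o600); setupErr != nil {
		return nil, nil, setupErr
	}

	removeErr = os.Remove(target)       // fails: the directory is not empty
	removeAllErr = os.RemoveAll(target) // removes the directory and its contents
	return removeErr, removeAllErr, nil
}

func main() {
	removeErr, removeAllErr, setupErr := compareRemoval()
	fmt.Println("setup:", setupErr)
	fmt.Println("os.Remove:", removeErr)
	fmt.Println("os.RemoveAll:", removeAllErr)
}
```

os.RemoveAll also returns nil when the path does not exist, so it is more forgiving than os.Remove in that case as well.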

pkg/pool/pool.go:66

  • Ensure that the clear(buf) function is properly defined and that its behavior is appropriate for reinitializing buffers without inadvertently affecting buffer capacity or content retention.
clear(buf)
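
On the question above: `clear` is a Go built-in (Go 1.21 and later) that zeroes a slice's elements without changing its length or capacity. The sketch below pairs it with a pre-populated pool. `sketchPool` and `newSketchPool` are hypothetical names; the real NewBufferPool signature is not shown in this conversation and may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// sketchPool is a hypothetical pre-populated buffer pool, not pkg/pool.
type sketchPool struct {
	pool sync.Pool
}

// newSketchPool seeds the pool with count buffers of the given size.
// sync.Pool may drop idle items, so New is still set as a fallback.
func newSketchPool(count, size int) *sketchPool {
	p := &sketchPool{}
	p.pool.New = func() any {
		buf := make([]byte, size)
		return &buf
	}
	for i := 0; i < count; i++ {
		buf := make([]byte, size)
		p.pool.Put(&buf)
	}
	return p
}

// Get returns a buffer from the pool.
func (p *sketchPool) Get() *[]byte {
	return p.pool.Get().(*[]byte)
}

// Put zeroes the buffer before returning it to the pool.
func (p *sketchPool) Put(buf *[]byte) {
	clear(*buf) // zeroes every element; len and cap are unchanged
	p.pool.Put(buf)
}

func main() {
	// clear on a plain slice: contents are zeroed, size is kept.
	b := make([]byte, 4, 16)
	copy(b, "abcd")
	clear(b)
	fmt.Println(len(b), cap(b), b)

	p := newSketchPool(2, 8)
	buf := p.Get()
	copy(*buf, "data")
	p.Put(buf) // buf must not be used after this point
}
```

Storing `*[]byte` rather than `[]byte` in sync.Pool avoids an extra allocation on each Put; that is a choice made for this sketch, not something stated in the PR.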

@egibs egibs marked this pull request as ready for review April 27, 2025 19:46
egibs added 3 commits April 27, 2025 14:52
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Comment thread Makefile
  cd out/$(YARAX_REPO) && \
  cargo install cargo-c --locked && \
- cargo cinstall -p yara-x-capi --release --prefix="$(LINT_ROOT)/out" --libdir="$(LINT_ROOT)/out/lib"
+ cargo cinstall -p yara-x-capi --features=native-code-serialization --release --prefix="$(LINT_ROOT)/out" --libdir="$(LINT_ROOT)/out/lib"
Member Author

Not even joking, this shaved off 3-4 minutes of additional scan time when I scanned Trino (~5 minutes instead of 9).

@egibs egibs merged commit 39bbdab into chainguard-dev:main Apr 27, 2025
9 checks passed
@egibs egibs deleted the optimize-extractions branch April 27, 2025 23:05
2 participants