Extraction/buffer sizing improvements#887
Merged
Merged
Conversation
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
4f115cf to
8b019e1
Compare
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
There was a problem hiding this comment.
Pull Request Overview
This PR improves extraction performance by replacing io.CopyBuffer with custom in-line loops and by tuning buffer pool sizing. Key changes include:
- Adopting manual loops in archive extractors (zstd, zlib, zip, tar, rpm, gzip, deb, bz2) to reduce CPU overhead.
- Refactoring buffer pool initialization to pre-populate buffers based on dynamic counts and updated size limits.
- Switching to third‐party archive libraries and adjusting error handling and resource cleanup.
Reviewed Changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/report/strings.go | Updates buffer pool creation with pre-size based on match count. |
| pkg/programkind/programkind.go | Refactors header reading and file type detection logic. |
| pkg/pool/pool.go | Modifies NewBufferPool signature; pre-populates buffers and calls clear(buf). |
| pkg/archive/*.go | Replaces io.CopyBuffer with manual loops for file extraction. |
| pkg/action/scan.go | Updates file scanning logic to use manual file reading loops. |
| (Other archive files) | Applies similar extraction loop and buffer management improvements. |
Files not reviewed (1)
- go.mod: Language not supported
Comments suppressed due to low confidence (2)
pkg/archive/archive.go:97
- Switching from os.Remove to os.RemoveAll may remove directories in addition to files. Please confirm that fullPath will always refer to a file or that removal of directories is the intended behavior.
if err := os.RemoveAll(fullPath); err != nil {
pkg/pool/pool.go:66
- Ensure that the clear(buf) function is properly defined and that its behavior is appropriate for reinitializing buffers without inadvertently affecting buffer capacity or content retention.
clear(buf)
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
Signed-off-by: egibs <20933572+egibs@users.noreply.github.com>
egibs
commented
Apr 27, 2025
| cd out/$(YARAX_REPO) && \ | ||
| cargo install cargo-c --locked && \ | ||
| cargo cinstall -p yara-x-capi --release --prefix="$(LINT_ROOT)/out" --libdir="$(LINT_ROOT)/out/lib" | ||
| cargo cinstall -p yara-x-capi --features=native-code-serialization --release --prefix="$(LINT_ROOT)/out" --libdir="$(LINT_ROOT)/out/lib" |
Member
Author
There was a problem hiding this comment.
Not even joking, this shaved off 3-4 minutes of additional scan time when I scanned Trino (~5 minutes instead of 9).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up for #867.
Turns out, extracting many small files with
ExtractZipwas taking a long time, especially in the context of handling thousands of.jarfiles. This was partly due to the incorrectly-sized buffer and also the CPU overhead of callingio.CopyBuffer.This PR replaces our reliance on
io.CopyBufferwith a more manual in-line loop for each extraction function (I tried a helper function for the loops but it was just as slow) which handles the writing and error checking. In my testing, this was ~3x faster (for example, all Trino packages for a single architecture can be extracted in ~11 seconds).Anecdotally, scanning a single Sonarqube package takes 28-30 seconds (I did this at least a dozen times to double-check) with these changes as opposed to 82-85 seconds in
HEADand scanning all of the Trino packages for a single architecture takes ~nine minutes as opposed to ~fifteen.This PR also replaces the standard library archive functions with popular, performant third-party alternatives (we were already using
klauspost/compress) and handles some cleanup/refactoring elsewhere. I addressed what #885 was trying to do as well.I don't see much effort left in speeding up extractions after this merges. Future optimizations will come down to rules (condition/pattern matching).
Bonus -- I updated
make install-yara-xto enablenative-code-serializationwhich measurably speeds up scans.