Skip to content

Conversation

kriswest
Copy link
Contributor

@kriswest kriswest commented Sep 4, 2025

resolves #1040

Reimplements parsing of PACK file contents to resolve a number of issues that were causing a loss of sync with the headers and compressed data streams that make up the pack file, resulting in Z_DATA_ERROR being thrown and the rest of the pack file failing to decode. This results in partial content being extracted from push pack files and commits being missed/not processed by later steps. The missing commits can also interact with the checkHiddenCommits task to block a subset of pushes, while others proceed with missing data.

After this PR, Git Proxy will correctly process a handful of pushes I have queued up that either parse with missing data or are blocked by checkHiddenCommits.

This PR:

  • Corrects issues with the reading of the PACK file metadata (correct number of entries)
  • Corrects parsing of the git object headers, including the variable length size field and variable length offset value for ofs_daelta (type 6) object headers
  • Reimplements the decompression such that the input is fed into the inflator a byte at a time, until an end event fires. This appears to happen one byte after the end of the compressed data (and any trailing bytes in its stream). This replaces the current approach of inflating the object and then deflating it again (possibly with different settings) to estimate the length of the compressed data and better replicates how git handles the operation.
  • Implements two tests for decompressing multi-object pack files that confirm correct handling of the variable length headers and implements a number of utility functions that accurately mock git pack file encoding.
  • Removes the previous implementation and tests for it - the updated tests don't mock zlib anymore, instead producing realistic mock packfile data encoded and decoded with zlib.

@kriswest kriswest requested review from coopernetes, jescalada and a team September 4, 2025 00:34
Copy link

netlify bot commented Sep 4, 2025

Deploy Preview for endearing-brigadeiros-63f9d0 canceled.

Name Link
🔨 Latest commit e54f255
🔍 Latest deploy log https://app.netlify.com/projects/endearing-brigadeiros-63f9d0/deploys/68d406587e935500089ebe82

@github-actions github-actions bot added the fix label Sep 4, 2025
Copy link

codecov bot commented Sep 4, 2025

Codecov Report

❌ Patch coverage is 94.28571% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.11%. Comparing base (dcf154f) to head (e54f255).
⚠️ Report is 40 commits behind head on main.

Files with missing lines Patch % Lines
src/proxy/processors/push-action/parsePush.ts 94.28% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1187      +/-   ##
==========================================
+ Coverage   83.88%   84.11%   +0.22%     
==========================================
  Files          68       68              
  Lines        2904     2958      +54     
  Branches      364      373       +9     
==========================================
+ Hits         2436     2488      +52     
- Misses        409      410       +1     
- Partials       59       60       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kriswest kriswest changed the title fix: reimplement push parsing to prevent z data errors fix: reimplement push parsing to prevent Z_DATA_ERROR Sep 4, 2025
Copy link
Contributor

@jescalada jescalada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a low-level code expert, so I'll just leave some general suggestions!

I've experimented a bit to verify whether the current code slows down the push process somewhat, and it seems that it's slightly slower (10~20%) for very large pushes:

Basic timing method

Image

Timing scripts

Regular commit

main

Image

This PR

Image

Massive commit (~12 MB of text changes)

main

Image

This PR

Image

This might not be something we want to optimize right now, though.

@kriswest
Copy link
Contributor Author

kriswest commented Sep 5, 2025

I think there's not a lot more I can do on this PR, without heading into a port to fast-zlib or zlib-sync, which I think is a separate challenge - we probably need to do something similar to fast-zlib (using the internals of zlib, such as _processChunk directly) but will need to figure out how to detect that we are done. I think that's best left to subsequent work as the push parsing time is currently dominated by the repo checkout - decompression is just a small part last milliseconds.

Copy link
Contributor

@coopernetes coopernetes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One very minor nit but otherwise LGTM. I think @kriswest knows the packet format best at this point now that this refactoring has been done lol.

My only suggestion would be to add some real (binary) receive-pack files to use as input for the tests.

    # Create some new content to generate a pack
    echo "New content for pack testing $(date)" > new-file.txt
    echo "# Updated README $(date)" >> README.md
    git add .
    git commit -m "Test commit for pack capture $(date)"
    
    echo "Capturing receive-pack output to $output_file..."
    
    # Method 1: Capture the actual pack data being sent
    # This captures the raw pack file that would be sent to receive-pack
    git format-patch --stdout HEAD~1..HEAD > "$output_file.patch"
    
    # Method 2: Capture pack file from push operation with verbose output
    GIT_TRACE_PACKET=1 git push origin main 2> "$output_file.trace" || true
    
    # Method 3: Create a pack file directly
    git pack-objects "$output_file" < <(git rev-list --objects HEAD)

@kriswest
Copy link
Contributor Author

kriswest commented Sep 24, 2025

@coopernetes test added that uses a captured push. I ended up popping a temporary change into parsePush to capture it as I had trouble with the methods given (and alternatives that an AI came up with):

Method 1: git format-patch --stdout HEAD~1..HEAD > "$output_file.patch"

Produces a patch file rather than a pack file.

Method 2: GIT_TRACE_PACKET=1 git push origin main 2> "$output_file.trace" || true

...and a number of variations I tried, produce a trace file that you can extract the pack data from in theory. The output didn't quite match what I was expecting and was missing the packet lines that proceed the PACK data in a push

Method 3: git pack-objects "$output_file" < <(git rev-list --objects HEAD)

Creates PACK and IDX files, but for the entire content of the repo (so about 16 mb for git proxy). Again is missing the packet lines that a push will have preceding the PACK file data. However, there is probably a variation on this one that would have been quite close.

In the end, just capturing a request body as a buffer and writing it to a file worked and is easy to replicate. Added details to a readme file adjacent to the captured binary file.

@kriswest kriswest requested a review from jescalada September 24, 2025 09:31
@kriswest kriswest enabled auto-merge September 24, 2025 14:53
@kriswest kriswest merged commit 2ea1bcd into finos:main Sep 24, 2025
14 checks passed
@kriswest kriswest deleted the 1040-Z-DATA-ERROR-during-push-parsing branch September 24, 2025 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Z_DATA_ERROR during push parsing
3 participants