Skip to content

Conversation

leodido
Copy link

@leodido leodido commented Sep 22, 2025

Description

This PR implements comprehensive SLSA Level 3 cache verification for Leeway, ensuring cached artifacts are verified before use. The implementation provides production-ready security with zero breaking changes to existing workflows.

🏗️ Implementation Details

Phase 1: Core Infrastructure

  • SLSA verifier integration with slsa-verifier v2.6.0
  • RemoteConfig extension with verification settings
  • Backward compatibility preservation

Phase 2: Production-Ready S3Cache

  • Fixed critical channel reading race conditions
  • Added comprehensive timeout protection
  • Implemented rate limiting and resource management
  • Created mock verifier for comprehensive testing

Phase 3: CLI Integration

  • Added environment variable constants following Leeway patterns
  • Implemented CLI flags with proper validation
  • Added help documentation and GCP compatibility warnings

🛡️ Security & Reliability

Concurrency Safety

  • Fixed channel reading race conditions in parallel operations
  • Added mutex protection for file cleanup operations
  • Implemented proper context cancellation patterns

Resource Management

  • Rate limiting prevents S3 API abuse
  • Goroutine limiting prevents memory exhaustion
  • Timeout protection prevents indefinite hangs

Error Handling

  • Graceful degradation when attestations missing
  • Clear validation messages for configuration errors
  • GCP compatibility warnings with automatic fallback

📈 Testing

  • Unit tests: All existing tests pass + new SLSA test suite
  • Integration tests: CLI flags, environment variables, validation
  • Race detection: No data races detected with -race flag
  • Mock verification: Comprehensive mock verifier for testing

🔄 Backward Compatibility

  • Zero breaking changes - existing workflows unchanged
  • Verification disabled by default - opt-in security
  • Existing environment variables continue to work
    -All cache modes (none, local, remote) preserved

🚦 Validation

  • Required source URI: Clear error when --slsa-source-uri missing
  • GCP compatibility: Automatic warning and graceful degradation
  • Flag precedence: CLI flags properly override environment variables

Related Issue(s)

Closes https://linear.app/ona-team/issue/CLC-1961/implement-cache-verification-during-download-leeway

How to test

Environment Variables

export LEEWAY_SLSA_CACHE_VERIFICATION=true
export LEEWAY_SLSA_SOURCE_URI=github.com/myorg/myrepo
leeway build //:app

CLI Flags

leeway build --slsa-cache-verification --slsa-source-uri=github.com/myorg/myrepo //:app

CI/CD Ready

# Production deployment
LEEWAY_SLSA_CACHE_VERIFICATION=true \
LEEWAY_SLSA_SOURCE_URI=github.com/gitpod-io/gitpod-next \
leeway build --cache=remote //:all

Documentation

/hold

leodido and others added 10 commits September 22, 2025 15:35
- Add github.com/slsa-framework/slsa-verifier/v2 v2.6.0
- Enables SLSA Level 3 verification for cached artifacts
- Direct Go API integration without external processes

Co-authored-by: Ona <no-reply@ona.com>
- Add SLSAVerification bool for enabling verification
- Add TrustedRoots []string for CA configuration
- Add RequireAttestation bool for strict mode
- Add SourceURI string for repository validation
- Support YAML/JSON serialization for configuration files

Co-authored-by: Ona <no-reply@ona.com>
Features:
- SLSA Level 3 verification using slsa-verifier/v2 Go API
- SHA256 artifact hash calculation and validation
- Source URI verification for supply chain security
- Attestation key generation (.att file naming)
- Comprehensive error handling with context support

Security:
- Prevents repository substitution attacks via SourceURI validation
- Validates artifact integrity through cryptographic hashing
- Follows official slsa-verifier CLI implementation patterns

Testing:
- 90% test coverage with unit tests
- Exact hash verification for deterministic testing
- Error handling validation for missing files
- Edge case coverage for robustness

Co-authored-by: Ona <no-reply@ona.com>
- Define VerifierInterface for testable SLSA verification
- Implement MockVerifier with configurable behavior for testing
- Add helper functions for simulating verification scenarios
- Enable comprehensive testing of SLSA verification logic

Co-authored-by: Ona <no-reply@ona.com>
- Add direct dependency on golang.org/x/time/rate
- Required for S3 API rate limiting to prevent service abuse
- Ensures proper dependency management for production deployment

Co-authored-by: Ona <no-reply@ona.com>
Critical production fixes:
- Fix channel reading race conditions in parallel operations
- Add context timeout protection (30-60s) for all storage calls
- Implement rate limiting (100 RPS, 200 burst) to prevent API abuse
- Add goroutine count limiting (max 50) for scalability
- Add file cleanup race condition protection with mutex
- Use VerifierInterface for testable SLSA verification
- Add proper error handling and graceful degradation

Performance improvements:
- Maintain backward compatibility for non-SLSA workflows
- Preserve throughput while adding safety mechanisms
- Add monitoring for high resource usage detection

Co-authored-by: Ona <no-reply@ona.com>
- Test successful SLSA verification with mock verifier
- Test missing attestation scenarios (required vs optional)
- Test backward compatibility when verification disabled
- Test network error handling during attestation download
- Test mock verifier integration and behavior
- Add race-safe mock storage with mutex protection
- Verify performance overhead targets and logging

Co-authored-by: Ona <no-reply@ona.com>
- Initialize rateLimiter and semaphore in all test S3Cache instances
- Add required imports for golang.org/x/time/rate
- Fix nil pointer dereferences that caused segmentation violations
- Ensure all tests properly initialize production-ready S3Cache fields
- Maintain test compatibility with new concurrency safety features

Fixes critical issue where tests created S3Cache directly without
going through constructor, leaving essential fields uninitialized.

Co-authored-by: Ona <no-reply@ona.com>
- Add EnvvarSLSACacheVerification and EnvvarSLSASourceURI constants
- Update CLI help text to document new environment variables
- Follow existing Leeway patterns for environment variable naming

Co-authored-by: Ona <no-reply@ona.com>
- Add --slsa-cache-verification and --slsa-source-uri flags
- Implement flag override logic with proper error handling
- Add validation requiring source-uri when verification enabled
- Support GCP compatibility with graceful degradation warning
- Maintain backward compatibility with getRemoteCacheFromEnv()

Priority order: CLI flags → environment variables → defaults

Co-authored-by: Ona <no-reply@ona.com>
@leodido leodido self-assigned this Sep 22, 2025
leodido and others added 2 commits September 22, 2025 19:23
Remove unused golang.org/x/time v0.11.0 entry, keep only v0.13.0

Co-authored-by: Ona <no-reply@ona.com>
- Add proper error handling for os.Remove calls during cleanup
- Add proper error handling for file.Close calls
- Remove unused createInvalidAttestation function

Co-authored-by: Ona <no-reply@ona.com>
@leodido leodido marked this pull request as ready for review September 22, 2025 21:32
}

// Initialize rate limiter with default limits
rateLimiter := rate.NewLimiter(rate.Limit(defaultRateLimit), defaultBurstLimit)
Copy link
Member

@geropl geropl Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation to add a rateLimiter and the semaphore to limit max concurrent Goroutines?
Sounds sensible in general, but I wonder why now and in this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this may look like a nice-to-have, but the reasoning that led me here is simple: we are doubling the S3 traffic, and that creates real production risk.

I'll try to answer your "why now" question:

  1. Traffic Impact: SLSA verification downloads both artifact + artifact.att (2x S3 requests per package)
  2. Performance Requirements: We download concurrently to avoid impacting build times
  3. AWS Limits: Large builds could easily exceed AWS S3 rate limits (5,500 RPS per prefix)
  4. Failure Mode: S3 rate limiting -> 503 errors -> broken builds for everyone

The semaphore controls concurrency specifically to prevent AWS from throttling us during peak usage.

This was the reasoning that convinced me it's as essential as the verification logic itself—preventing SLSA verification from overwhelming S3 and breaking builds system-wide.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could implement this as a separate PR, but since the rate limiting is specifically triggered by the 2x request pattern of SLSA verification, it felt more cohesive to include the protection mechanism with the feature that creates the risk.

@geropl
Copy link
Member

geropl commented Sep 23, 2025

Overall the changes make sense, but the PR is quite meaty, so hard to tell for sure. I'd say whatever we can do to simplify/reduce it would be a big plus! 👍

leodido and others added 4 commits September 23, 2025 10:40
- Create dedicated SLSAConfig struct to encapsulate SLSA settings
- Extract parseSLSAConfig() function with proper CLI/env priority handling
- Update getRemoteCache() to use clean configuration structure
- Add comprehensive documentation for RequireAttestation behavior
- Preserve RequireAttestation field for future CLI flag extensibility

This refactoring addresses architectural feedback while maintaining
backward compatibility and production readiness.

Co-authored-by: Ona <no-reply@ona.com>
…hitecture

- Extract downloadFileAsync() function to eliminate duplication between artifact/attestation downloads
- Refactor downloadBothParallel() to use shared download logic
- Update constructor to use new SLSA configuration structure
- Simplify verification check to use s.slsaVerifier != nil pattern
- Add comprehensive documentation for RequireAttestation behavior and graceful fallback design
- Maintain all rate limiting, concurrency controls, and error handling

This reduces code duplication while preserving all production safety features.

Co-authored-by: Ona <no-reply@ona.com>
- Create helper functions createTestConfig() and createTestConfigWithRequiredAttestation()
- Update all test cases to use new SLSAConfig structure
- Add comprehensive documentation for RequireAttestation test behavior
- Document expected behavior: RequireAttestation=true + missing attestation → skip download, allow local build fallback
- Maintain all existing test coverage and expectations

All tests pass, confirming the architectural changes preserve functionality.

Co-authored-by: Ona <no-reply@ona.com>
- Add downloadResult struct with proper error attribution by kind
- Update downloadFileAsync to use structured results instead of plain errors
- Refactor downloadBothParallel with improved context cancellation handling
- Replace fragile string matching with reliable kind-based error attribution
- Improve error messages for context cancellation scenarios

Benefits:
✅ Reliable Error Attribution: No more fragile string matching
✅ Better Context Cancellation: Proper error reporting when operations are cancelled
✅ Cleaner Code: Structured approach is more maintainable
✅ Better Debugging: Clear error messages show exactly what failed and why

Co-authored-by: Ona <no-reply@ona.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants