
feat: adding ready handler #2676

Merged
chatton merged 8 commits into main from cian/add-readiness-endpoint
Sep 22, 2025

Conversation

@chatton
Contributor

@chatton chatton commented Sep 18, 2025

Overview

Adds the readiness check suggested in the linked issue.

closes #2673

@github-actions
Contributor

github-actions bot commented Sep 18, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Sep 22, 2025, 8:23 AM

@codecov

codecov bot commented Sep 18, 2025

Codecov Report

❌ Patch coverage is 70.58824% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.47%. Comparing base (844053c) to head (75d1ed6).
⚠️ Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
pkg/rpc/server/http.go | 73.91% | 9 Missing and 3 partials ⚠️
node/full.go | 16.66% | 4 Missing and 1 partial ⚠️
node/light.go | 50.00% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2676      +/-   ##
==========================================
- Coverage   66.48%   66.47%   -0.02%     
==========================================
  Files          76       76              
  Lines        7657     7714      +57     
==========================================
+ Hits         5091     5128      +37     
- Misses       2097     2112      +15     
- Partials      469      474       +5     
Flag | Coverage Δ
combined | 66.47% <70.58%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@claude
Contributor

claude bot commented Sep 18, 2025

🔍 Code Review for PR #2676: Adding Ready Handler

Summary

This PR successfully implements a readiness endpoint to determine if a node is safe to receive client traffic. The implementation is well-structured with good test coverage and follows project patterns effectively.

✅ Strengths

  • Well-designed abstraction: The BestKnownHeightProvider interface cleanly abstracts height tracking across different node types
  • Comprehensive readiness logic: Properly checks peer connectivity, synchronization status, and node type-specific requirements
  • Excellent test coverage: Tests cover edge cases including being ahead of network, no peers, and various error conditions
  • Follows project patterns: Adheres to ev-node's modular architecture and coding conventions from CLAUDE.md
  • Production-ready: Appropriate HTTP status codes and error handling for load balancer integration

⚠️ Issues to Address

High Priority

  1. Precision issue in full.go:68-74: Using math.Min with float64 conversion can cause precision loss. Use direct uint64 comparison instead:

    hHeight := n.hSyncService.Store().Height()
    dHeight := n.dSyncService.Store().Height()
    if hHeight < dHeight {
        return hHeight
    }
    return dHeight
  2. Hard-coded threshold: The 3-block tolerance at service.go:195 should be configurable for different deployment requirements
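
A minimal sketch of what a configurable tolerance could look like inside the handler, assuming the config field is ReadinessMaxBlocksBehind (the name used later in this PR); the surrounding variable names are illustrative, not this PR's exact code:

    // Hypothetical handler fragment: take the tolerance from config instead of a literal 3.
    allowedBlocksBehind := cfg.Node.ReadinessMaxBlocksBehind
    // Written as an addition to avoid uint64 underflow when the local node is ahead.
    if bestKnownHeight > localHeight+allowedBlocksBehind {
        http.Error(w, "UNREADY: behind best-known head", http.StatusServiceUnavailable)
        return
    }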

Medium Priority

  1. Function complexity: NewServiceHandler now takes 6 parameters. Consider a config struct for better maintainability (see the sketch after this list)
  2. Missing test coverage: Add tests for aggregator nodes and error scenarios (e.g., store.Height() failures)
  3. Error context: When height retrieval fails, include the actual error in the response for better debugging
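
A rough sketch of the config-struct idea from point 1 above; every type and field name here is a placeholder chosen for illustration, not ev-node's actual API:

    package server

    import (
        "context"
        "net/http"
    )

    // Placeholder interfaces standing in for the real store and peer manager types.
    type heightStore interface {
        Height(ctx context.Context) (uint64, error)
    }
    type peerLister interface {
        GetPeers() ([]string, error)
    }

    // ServiceHandlerConfig groups what are currently six positional parameters.
    type ServiceHandlerConfig struct {
        Store                    heightStore
        Peers                    peerLister
        ProposerAddress          []byte
        ReadinessMaxBlocksBehind uint64
        BestKnownHeight          func() uint64
    }

    // NewServiceHandlerFromConfig shows how the constructor could accept the struct.
    func NewServiceHandlerFromConfig(c ServiceHandlerConfig) http.Handler {
        mux := http.NewServeMux()
        // the readiness and other endpoints would be registered on mux using c's fields
        return mux
    }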

🛡️ Security Assessment

No security concerns identified. The endpoint appropriately avoids exposing sensitive information.

⚡ Performance Assessment

Minimal overhead with quick, non-blocking operations suitable for frequent health checks.

📊 Overall Rating: 8/10

✅ Recommendation: APPROVE with minor changes

This is a valuable feature that addresses real operational needs. The core implementation is solid, and the identified issues are minor and can be addressed before merging.

Priority fixes before merge:

  1. Fix the float64 precision issue
  2. Make the block tolerance configurable
  3. Add aggregator node test coverage

Great work on this implementation! The readiness probe will significantly improve production deployments.

@claude
Contributor

claude bot commented Sep 18, 2025

Code Review for PR #2676: Adding Ready Handler

Thank you for implementing the readiness endpoint! This is a valuable addition for production deployments.

Strengths

  • Well-structured implementation following codebase patterns
  • Good test coverage with comprehensive scenarios
  • Configurable ReadinessMaxBlocksBehind parameter
  • Proper HTTP semantics (200/503 status codes)

Code Quality Concerns

1. Potential Race Condition (node/full.go:284-290)
The bestKnownHeightProvider reads from two sync services without synchronization. Heights could change between reads. Consider adding synchronization or documenting that eventual consistency is acceptable.

2. Error Response Structure
Consider returning JSON responses instead of plain text for better observability and monitoring integration.
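
For illustration, one JSON shape the endpoint could return instead of plain text; the field names are hypothetical, not part of this PR (imports: "encoding/json", "net/http"):

    // Hypothetical JSON payload for the readiness endpoint.
    type readinessResponse struct {
        Ready           bool   `json:"ready"`
        LocalHeight     uint64 `json:"local_height"`
        BestKnownHeight uint64 `json:"best_known_height"`
        Reason          string `json:"reason,omitempty"`
    }

    func writeReadiness(w http.ResponseWriter, resp readinessResponse) {
        w.Header().Set("Content-Type", "application/json")
        status := http.StatusOK
        if !resp.Ready {
            status = http.StatusServiceUnavailable
        }
        w.WriteHeader(status)
        _ = json.NewEncoder(w).Encode(resp)
    }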

Potential Issues

1. Nil Provider Documentation
BestKnownHeightProvider can be nil but this isn't documented in the function signature comment.

2. Test Setup Inconsistency
da_visualization_test.go passes nil parameters to RegisterCustomHTTPEndpoints which could cause issues if tests hit the readiness endpoint.

Performance Considerations

  • The endpoint will be called frequently. Consider caching height values with a short TTL (see the sketch after this list).
  • GetPeers() might be expensive with many peers. Consider if only peer count is needed.
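
A small sketch of the TTL-caching idea from the first bullet, wrapping the existing func() uint64 provider; the wrapper name and TTL value are illustrative (imports: "sync", "time"):

    // cachedHeightProvider memoizes the wrapped provider for a short, fixed TTL.
    func cachedHeightProvider(inner func() uint64, ttl time.Duration) func() uint64 {
        var (
            mu      sync.Mutex
            last    uint64
            fetched time.Time
        )
        return func() uint64 {
            mu.Lock()
            defer mu.Unlock()
            if time.Since(fetched) > ttl {
                last = inner()
                fetched = time.Now()
            }
            return last
        }
    }

    // usage (illustrative): provider := cachedHeightProvider(bestKnownHeightProvider, 2*time.Second)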

Test Coverage

Excellent table-driven tests! Consider adding a test case for when Height() returns an error.

Suggestions

  • Add deployment guidance comments for ReadinessMaxBlocksBehind values
  • Define the default value 3 as a named constant
  • Consider logging readiness state transitions

Summary

Solid implementation addressing issue #2673 requirements. Ready to merge after addressing the race condition concern.

Great work!

@claude
Contributor

claude bot commented Sep 18, 2025

Code Review for PR #2676: Adding Ready Handler

Thank you for implementing the readiness endpoint! This is a valuable addition for production deployments. Here's my comprehensive review:

✅ Strengths

  1. Well-structured implementation: The readiness endpoint properly checks multiple conditions (peer connectivity, local height, sync status)
  2. Good test coverage: Comprehensive test cases covering various scenarios
  3. Follows existing patterns: Consistent with the codebase's architecture and style
  4. Configurable tolerance: The ReadinessMaxBlocksBehind parameter allows operators to tune readiness based on their needs

🔍 Areas for Improvement

1. Race Condition in bestKnownHeightProvider

In node/full.go:284-291, the height provider function accesses sync service stores without synchronization:

bestKnownHeightProvider := func() uint64 {
    hHeight := n.hSyncService.Store().Height()  // Potential race
    dHeight := n.dSyncService.Store().Height()  // Potential race
    if hHeight < dHeight {
        return hHeight
    }
    return dHeight
}

Recommendation: Consider adding mutex protection or ensuring the Store().Height() methods are thread-safe.

2. Error Handling Enhancement

The readiness endpoint returns generic "UNREADY" messages. Consider adding more specific error details:

// Current
http.Error(w, "UNREADY: behind best-known head", http.StatusServiceUnavailable)

// Suggested
msg := fmt.Sprintf("UNREADY: local height %d is %d blocks behind best-known %d (max allowed: %d)", 
    localHeight, bestKnownHeight-localHeight, bestKnownHeight, allowedBlocksBehind)
http.Error(w, msg, http.StatusServiceUnavailable)

3. Missing Context Timeout

In pkg/rpc/server/http.go:46, the store height query uses the request context without timeout:

localHeight, err := s.Height(r.Context())

Recommendation: Add a timeout to prevent hanging requests:

ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
localHeight, err := s.Height(ctx)

4. Documentation Improvements

  • The BestKnownHeightProvider type comment could clarify what "best-known" means in different node types
  • Consider adding a comment explaining why we return min(headerHeight, dataHeight) for full nodes

🐛 Potential Issues

  1. Nil Check Order: In the readiness handler, you check pm != nil but don't handle the case where s (store) might be nil before calling s.Height().

  2. Edge Case: When bestKnownHeight == localHeight == 0, the node is considered unready. However, this might be valid for a fresh network. Consider special-casing genesis.

💡 Suggestions

  1. Add Metrics: Consider exposing the readiness status and block lag as Prometheus metrics for monitoring (see the sketch after this list)
  2. Grace Period: For nodes that just started, consider a grace period before marking as unready
  3. Health Status Endpoint: Consider a /health/status endpoint that returns JSON with detailed health information
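
A minimal sketch of the metrics idea from point 1 above, using the standard Prometheus client; the metric name and where it would be updated are assumptions (import: "github.com/prometheus/client_golang/prometheus"):

    // Hypothetical gauge for the lag between best-known and local height.
    var readinessBlockLag = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "evnode_readiness_block_lag",
        Help: "Blocks the local node is behind the best-known network height.",
    })

    func init() {
        prometheus.MustRegister(readinessBlockLag)
    }

    // inside the readiness handler, after computing the heights:
    // if bestKnownHeight > localHeight {
    //     readinessBlockLag.Set(float64(bestKnownHeight - localHeight))
    // } else {
    //     readinessBlockLag.Set(0)
    // }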

✨ Security & Performance

  • ✅ No security concerns identified
  • ✅ Minimal performance impact (simple height comparisons)
  • ✅ No credentials or sensitive data exposed

📋 Test Coverage

The tests are comprehensive but could benefit from:

  • Testing concurrent access to the height provider
  • Testing behavior during active sync (heights changing during request)
  • Testing with nil store or peer manager

Overall Assessment

This is a solid implementation that addresses issue #2673 effectively. The code is clean, well-tested, and follows the project's patterns. With the suggested improvements around race conditions and error handling, this will be production-ready.

Verdict: Approved with suggestions

The implementation is functional and safe to merge, but please consider addressing the race condition concern before deploying to production environments.

@chatton chatton marked this pull request as ready for review September 18, 2025 09:30

bestKnownHeight := bestKnownHeightProvider()
if bestKnownHeight == 0 {
    http.Error(w, "UNREADY: best-known height unknown", http.StatusServiceUnavailable)

Contributor Author

We can turn this into structured output if that makes it easier to work with. cc @auricom WDYT? A liveness probe could then just check whether the response contains UNREADY / READY, like this one does.

@claude
Contributor

claude bot commented Sep 18, 2025

Comprehensive Pull Request Review

Overview

This PR implements a readiness endpoint (/health/ready) as requested in issue #2673. The implementation adds functionality to distinguish between nodes that are up (liveness) and nodes that are safe to query (readiness), which is crucial for load balancers and Kubernetes deployments.

✅ Positive Aspects

1. Well-Structured Implementation

  • Follows the modular design pattern mentioned in CLAUDE.md
  • Properly extends the existing HTTP endpoint registration system
  • Uses dependency injection pattern with the BestKnownHeightProvider function type
  • Maintains separation of concerns between node types (full vs light)

2. Good Test Coverage

  • Comprehensive table-driven tests covering various scenarios
  • Tests edge cases like no blocks, no peers, and varying block heights
  • Follows the testing patterns described in CLAUDE.md
  • Uses mocks appropriately for external dependencies

3. Configuration Management

  • Adds configurable ReadinessMaxBlocksBehind parameter with sensible default (3 blocks)
  • Properly updates configuration tests and flag management
  • Follows existing configuration patterns

🔧 Areas for Improvement

1. Error Handling and Context Usage

The readiness endpoint doesn't follow the error handling patterns mentioned in CLAUDE.md. Current implementation lacks context wrapping:

localHeight, err := s.Height(r.Context())
if err != nil {
    http.Error(w, "UNREADY: state unavailable", http.StatusServiceUnavailable)
    return
}

Suggestion: Consider providing more specific error context:

localHeight, err := s.Height(r.Context())
if err != nil {
    http.Error(w, fmt.Sprintf("UNREADY: state unavailable: %v", err), http.StatusServiceUnavailable)
    return
}

2. Function Signature Design

The BestKnownHeightProvider function type could be more robust:

type BestKnownHeightProvider func() uint64

Issues:

  • No error handling capability
  • No context support for cancellation
  • No way to distinguish between "unknown" (0) and "genesis" (0)

Suggestion:

type BestKnownHeightProvider func(ctx context.Context) (uint64, error)

3. Logic Edge Cases

The best-known height calculation in full nodes could be more sophisticated:

bestKnownHeightProvider := func() uint64 {
    hHeight := n.hSyncService.Store().Height()
    dHeight := n.dSyncService.Store().Height()
    if hHeight < dHeight {
        return hHeight
    }
    return dHeight
}

Potential Issues:

  • No error handling if either sync service fails
  • Uses minimum height, but this might not always be the best indicator
  • Could lead to false negatives if one sync service is lagging

4. HTTP Response Consistency

The endpoint mixes error messages in HTTP errors and success responses. Consider a more structured response format or at least consistent messaging.

5. Aggregator Node Logic

The peer check is skipped for aggregator nodes, but this might not be sufficient. Should aggregator nodes have different readiness criteria? They might need connectivity to DA layers or other specific requirements.

🔒 Security Considerations

Positive:

  • ✅ Error messages don't expose sensitive internal state
  • ✅ No private keys or sensitive data leaked

Areas to address:

  • ⚠️ The HTTP handler doesn't validate the request method (it should be GET only; see the sketch after this list)
  • ⚠️ No rate limiting considerations for health endpoints
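
A tiny sketch of the GET-only check; the wrapper is hypothetical, not code from this PR:

    // getOnly rejects anything other than GET before running the readiness handler.
    func getOnly(next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if r.Method != http.MethodGet {
                http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
                return
            }
            next(w, r)
        }
    }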

⚡ Performance Considerations

1. Sync Service Access

  • The bestKnownHeightProvider function calls Height() on sync services without error handling
  • These calls happen on every readiness check and could block

2. P2P Peer Query

  • GetPeers() call could be expensive for nodes with many peers
  • No caching or throttling of peer queries

📋 Recommendations

High Priority:

  • Add method validation (GET only)
  • Improve BestKnownHeightProvider signature with context and error handling

Medium Priority:

  • Add timeout/context to sync service height calls
  • Consider caching peer count for performance
  • Add structured logging for readiness check results

Low Priority:

  • Consider more sophisticated "best known height" calculation
  • Add metrics for readiness check success/failure rates

🎯 Overall Assessment

This is a well-implemented feature that addresses the requirements effectively. The code follows most of the project's patterns and includes good test coverage. The main areas for improvement are around error handling robustness and performance considerations for the sync service calls.

Recommendation: ✅ Approve with minor suggested improvements

The implementation successfully addresses the issue requirements and provides a solid foundation for production readiness checks. The suggested improvements would enhance robustness but don't block the core functionality.


@tac0turtle tac0turtle requested a review from auricom September 18, 2025 18:53
@claude
Contributor

claude bot commented Sep 18, 2025

Pull Request Review: Readiness Handler Implementation

Thank you for implementing the readiness check handler! This is a valuable addition for deployment health monitoring. Here's my comprehensive review:

Strengths

  • Clean implementation following the existing patterns in the codebase
  • Good separation of concerns with the BestKnownHeightProvider abstraction
  • Comprehensive test coverage with multiple edge cases
  • Follows the ev-node architecture principles from CLAUDE.md
  • Proper configuration with sensible defaults (3 blocks behind)

🎯 Code Quality & Best Practices

  1. Function Signature Consistency: The addition of bestKnownHeightProvider parameter to NewServiceHandler is well-placed at the end, maintaining backward compatibility where possible.

  2. Error Handling: Good error handling with descriptive messages that will help with debugging:

    • "UNREADY: failed to query peers"
    • "UNREADY: no peers connected"
    • "UNREADY: state unavailable"
  3. Configuration: The default value of 3 blocks for ReadinessMaxBlocksBehind seems reasonable for most deployments.

🐛 Potential Issues

  1. Race Condition in Height Calculation: In full.go, the bestKnownHeightProvider function reads from two different sync services without synchronization:

    hHeight := n.hSyncService.Store().Height()
    dHeight := n.dSyncService.Store().Height()

    Consider if these calls need atomic access or if the slight inconsistency is acceptable.

  2. Nil Check Order: The readiness handler checks pm != nil but doesn't guard against s (store) being nil. While unlikely in practice, consider adding a nil check for consistency.

Performance Considerations

  1. Function Call Overhead: The bestKnownHeightProvider is called on every readiness check. Since these checks can be frequent (every few seconds from k8s), consider if caching with a short TTL would be beneficial.

  2. GetPeers() Call: The pm.GetPeers() call might be expensive depending on the implementation. Consider if peer count could be cached or if there's a lighter-weight way to check connectivity.

🔒 Security Considerations

  1. Information Disclosure: The readiness endpoint reveals some internal state (behind by how many blocks). This is generally acceptable but ensure this aligns with your security posture.

  2. No Authentication: The health endpoints are unauthenticated, which is standard practice but worth noting.

🧪 Test Coverage

Excellent test coverage in server_test.go! The table-driven tests cover:

  • Node at head
  • Node within acceptable lag
  • Node too far behind
  • Local node ahead of network
  • No blocks yet
  • Unknown best-known height
  • No peers connected

Suggestion: Consider adding a test case for aggregator nodes (where peer checking is skipped).

💡 Minor Suggestions

  1. Logging: Consider adding debug logging when readiness status changes, which could help with troubleshooting:

    if bestKnownHeight-localHeight > allowedBlocksBehind {
        // Consider logging: logger.Debug().Uint64("local", localHeight).Uint64("best", bestKnownHeight).Msg("Node unready: too far behind")
        http.Error(w, "UNREADY: behind best-known head", http.StatusServiceUnavailable)
        return
    }
  2. Metrics: Consider exposing the block lag as a metric for monitoring systems that prefer numeric values over binary ready/not-ready.

  3. Documentation: The comment for BestKnownHeightProvider is excellent. Consider adding a similar comment in CLAUDE.md about the readiness endpoint for future developers.

Overall Assessment

This is a solid implementation that addresses issue #2673 effectively. The code is clean, well-tested, and follows the project's patterns. The minor suggestions above are optional improvements rather than blocking issues.

Recommendation: APPROVE

The implementation is production-ready and will provide valuable health monitoring capabilities for deployments.

@julienrbrt julienrbrt changed the title from "chore: adding ready handler" to "feat: adding ready handler" on Sep 19, 2025
node/full.go Outdated
bestKnownHeightProvider := func() uint64 {
    hHeight := n.hSyncService.Store().Height()
    dHeight := n.dSyncService.Store().Height()
    if hHeight < dHeight {
Member

@julienrbrt julienrbrt Sep 19, 2025


nit, return min(hHeight, dHeight)
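
Applied to the snippet above, the nit would look like this (assuming Go 1.21+ for the built-in min):

    bestKnownHeightProvider := func() uint64 {
        return min(n.hSyncService.Store().Height(), n.dSyncService.Store().Height())
    }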

@claude
Contributor

claude bot commented Sep 19, 2025

Code Review for PR #2676: Adding Ready Handler

Thank you for implementing the readiness endpoint! This is a valuable addition for production deployments. Here's my review:

Strengths

  • Clean implementation following existing patterns in the codebase
  • Good test coverage with comprehensive edge cases
  • Proper configuration with sensible defaults (3 blocks tolerance)
  • Clear separation between full and light node behavior
  • Follows Go conventions and error handling patterns

🔍 Code Quality Observations

  1. Architecture & Design: The implementation is well-structured with the BestKnownHeightProvider abstraction allowing different node types to provide their own height calculation logic.

  2. Error Messages: The UNREADY messages are descriptive and helpful for debugging (e.g., "UNREADY: no peers connected", "UNREADY: behind best-known head").

  3. Configuration: Good addition to the config system with proper flag registration and test updates.

⚠️ Potential Issues & Suggestions

  1. Race Condition Risk: In node/full.go:284-288, the bestKnownHeightProvider function accesses sync services without synchronization. Consider if these Store() calls are thread-safe or if synchronization is needed.

  2. Missing Metrics: Consider adding metrics/logging when readiness state changes. This would be valuable for monitoring and alerting.

  3. Edge Case: When bestKnownHeight == 0 and localHeight == 0, the endpoint returns UNREADY with "no blocks yet". This is correct, but during initial chain startup, both aggregator and non-aggregator nodes might remain unready for extended periods. Consider if this is the desired behavior.

  4. Documentation: The BestKnownHeightProvider type comment is clear, but consider adding a comment explaining why we use min(headerHeight, dataHeight) for full nodes - this ensures we only consider ourselves ready when both header and data are sufficiently synced.

🚀 Performance Considerations

  • The readiness check is lightweight and suitable for frequent health checks
  • No blocking operations or expensive computations in the critical path

🔒 Security Considerations

  • No security concerns - the endpoint exposes minimal information appropriate for health monitoring
  • Properly returns 503 status codes for unhealthy states

Test Coverage

  • Excellent test coverage with table-driven tests covering all edge cases
  • Tests properly mock dependencies and verify both success and failure scenarios
  • Config test count properly updated

💡 Minor Suggestions

  1. Consider making the readiness check more granular by returning JSON with details for debugging (could be a future enhancement if needed).

  2. The ReadinessMaxBlocksBehind default of 3 seems reasonable, but you might want to document the rationale in CLAUDE.md or a comment.

Overall Assessment

This is a well-implemented feature that addresses the requirements from issue #2673. The code is production-ready with only minor suggestions for improvement. The implementation correctly handles the different node types (full vs light) and provides appropriate readiness criteria.

Great work! 🎉

julienrbrt
julienrbrt previously approved these changes Sep 19, 2025
@claude
Contributor

claude bot commented Sep 19, 2025

Code Review: Readiness Handler Implementation

Thank you for this contribution! The readiness endpoint is a valuable addition for production deployments.

Strengths

  • Clean Architecture: Well-integrated with existing codebase patterns
  • Excellent Test Coverage: Comprehensive test cases covering edge scenarios
  • Good Separation of Concerns: Proper use of BestKnownHeightProvider function type
  • Follows Go Conventions: Clean error handling and HTTP patterns

Critical Issues (Must Fix)

1. Nil Pointer Safety

The code does not validate parameters before use, which could cause panics in pkg/rpc/server/http.go:319

  • localHeight, err := s.Height(r.Context()) could panic if s is nil
  • cfg.Node.ReadinessMaxBlocksBehind could panic if cfg is nil
    Fix: Add nil checks at the beginning of the handler.

2. Context Timeout

The s.Height() call could hang indefinitely. Consider adding a timeout with context.WithTimeout.
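
A sketch of both fixes at the top of the readiness handler, reusing the variable names quoted in this review; the 5-second timeout is just an example value (imports: "context", "net/http", "time"):

    // Guard against missing dependencies before touching them.
    if s == nil || bestKnownHeightProvider == nil {
        http.Error(w, "UNREADY: readiness dependencies not configured", http.StatusServiceUnavailable)
        return
    }
    // Bound the store query so a stuck store cannot hang the health check.
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()
    localHeight, err := s.Height(ctx)
    if err != nil {
        http.Error(w, "UNREADY: state unavailable", http.StatusServiceUnavailable)
        return
    }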

Suggested Improvements

  1. More Descriptive Error Messages - provide actual block heights and differences
  2. Additional Test Coverage - error scenarios, context cancellation, nil provider testing
  3. Documentation Enhancement - more comprehensive docs for BestKnownHeightProvider

Security & Performance

  • Security: Low risk, minimal information disclosure
  • Performance: Minimal overhead, efficient early returns
  • Consider rate limiting at infrastructure level

Overall Assessment

Recommendation: APPROVE after addressing critical issues

This is a well-implemented feature with high code quality. Once nil pointer safety issues are addressed, this will be production-ready.

Great work on this implementation!

// Start RPC server
handler, err := rpcserver.NewServiceHandler(n.Store, n.p2pClient, n.genesis.ProposerAddress, n.Logger, n.nodeConfig)
bestKnownHeightProvider := func() uint64 {
    hHeight := n.hSyncService.Store().Height()
Contributor


The idea here is to push the check of whether the node is at the highest height onto the operator, correct?

Contributor Author


yeah exactly

tac0turtle
tac0turtle previously approved these changes Sep 22, 2025
@chatton chatton dismissed stale reviews from tac0turtle and julienrbrt via 893c571 September 22, 2025 08:13
@claude
Contributor

claude bot commented Sep 22, 2025

Code Review for PR #2676: Adding Ready Handler

Overall, this is a well-implemented feature that adds a valuable readiness check endpoint. The implementation is clean and follows good practices. Here's my detailed feedback:

✅ Strengths

  1. Clean Architecture: The BestKnownHeightProvider function type is a good abstraction that allows different node types to provide their best-known height appropriately.

  2. Comprehensive Test Coverage: Excellent test coverage with various edge cases tested, including scenarios where the node is ahead, behind, has no peers, or has no blocks.

  3. Configuration Flexibility: The ReadinessMaxBlocksBehind configuration parameter allows operators to tune readiness checks based on their requirements.

  4. Proper Error Messages: Clear and descriptive error messages that explain why the node is not ready.

🔧 Suggestions for Improvement

  1. Default Value for ReadinessMaxBlocksBehind: Consider setting a sensible default value (e.g., 3-5 blocks) in config.DefaultConfig(). Currently it appears to default to 0, which might be too strict for production environments where brief lags are acceptable.

  2. Race Condition Mitigation: The current implementation uses min(hHeight, dHeight) for full nodes. Consider adding a comment explaining why the minimum is used, as this might not be immediately obvious to future maintainers.

  3. Configuration Comment Enhancement: The comment for ReadinessMaxBlocksBehind could be more descriptive. Consider:

    comment: "Maximum number of blocks the node can lag behind the best-known network height while still being considered ready for serving traffic. Set to 0 for strict synchronization requirements."

🐛 Minor Issues

  1. Missing Default Config Test: While the flag is tested, consider adding an explicit test for the default configuration value to ensure it's properly initialized.

  2. Aggregator Mode Edge Case: For aggregator nodes, the peer check is skipped, but there's no readiness check specific to aggregators (e.g., checking if they can produce blocks). Consider if additional checks are needed.

💡 Performance Considerations

The readiness check is lightweight and should have minimal performance impact. The use of function providers for height calculations is efficient.

🔒 Security

No security concerns identified. The endpoint properly exposes only necessary information without leaking sensitive details.

📊 Test Coverage

Test coverage is comprehensive with good edge case handling. The test cases cover all critical paths and failure scenarios.

✨ Additional Recommendations

Consider adding metrics/observability around readiness state changes to help operators monitor when nodes transition between ready/unready states.

Overall, this is a solid implementation that addresses the requirements effectively. Great work! 👍

@claude
Contributor

claude bot commented Sep 22, 2025

Code Review for PR #2676: feat: adding ready handler

Thank you for implementing the readiness probe! This is a valuable addition for production deployments. I've reviewed the implementation and have the following feedback:

Strengths

  1. Clean implementation - The readiness endpoint is well-structured and follows the established patterns in the codebase
  2. Good test coverage - Comprehensive test cases covering various scenarios (at head, blocks behind, no peers, etc.)
  3. Flexible configuration - The ReadinessMaxBlocksBehind config option provides good operational flexibility
  4. Proper separation of concerns - Using BestKnownHeightProvider function type allows full and light nodes to provide their own logic

Suggestions for Improvement

1. Missing Default Value for ReadinessMaxBlocksBehind

The new config field doesn't have a default value set in DefaultConfig() (pkg/config/defaults.go:40-79). Consider adding a sensible default (e.g., 3 blocks) to match the test expectations.

2. Edge Case: bestKnownHeight == 0

When bestKnownHeight == 0, the endpoint returns UNREADY: best-known height unknown. However, during initial startup or network bootstrap, this might be a valid state. Consider whether the node should be ready if both local and best-known heights are 0 (initial network state).

3. Improve Error Messages

The error messages could be more descriptive to help operators diagnose issues. For example, include the actual heights and allowed lag in the behind best-known head message to aid debugging.

4. Performance Consideration

The readiness check queries peers on every request for non-aggregator nodes. In high-traffic scenarios, consider caching the peer count for a short duration (e.g., 1-5 seconds) to reduce overhead.

5. Documentation Enhancement

The comment for ReadinessMaxBlocksBehind could be clearer. Consider: Maximum number of blocks the node can lag behind the best-known network height while still being considered ready for serving queries. Set to 0 to require exact synchronization with the network head.

Potential Issues

1. Race Condition Risk

The bestKnownHeightProvider in full.go uses min(hHeight, dHeight) without synchronization. While likely safe due to atomic operations on the underlying stores, consider documenting this assumption or adding explicit synchronization if needed.

2. Aggregator Node Behavior

Aggregator nodes skip the peer check entirely. This might be intentional, but consider whether aggregators should have different readiness criteria (e.g., checking DA layer connectivity).

Security Considerations

  • No security issues identified. The endpoint doesn't expose sensitive information and properly validates all inputs.

Test Coverage

  • Excellent test coverage with multiple scenarios
  • Consider adding a test for the aggregator node case (where peer check is skipped)
  • Consider adding integration tests to verify the endpoint behavior during actual sync operations

Overall Assessment

This is a solid implementation that addresses the requirements from issue #2673. With the suggested improvements (especially adding the default config value), this PR is ready for merge. The readiness probe will significantly improve operational reliability by preventing traffic routing to nodes that aren't fully synced.

Great work on this feature!

@claude
Contributor

claude bot commented Sep 22, 2025

Code Review

This PR adds a readiness handler that monitors node synchronization state - a valuable addition for health monitoring and orchestration. The implementation is clean and follows the codebase patterns well.

✅ Strengths

  1. Clean Architecture: The BestKnownHeightProvider abstraction nicely separates concerns between full and light nodes
  2. Good Test Coverage: Comprehensive test cases covering edge conditions and different node states
  3. Proper Configuration: Well-integrated config parameter with sensible default (3 blocks)
  4. Follows Patterns: Consistent with the existing codebase structure and conventions

🔍 Minor Suggestions

  1. Error Message Consistency: Consider standardizing the error format. Some return "UNREADY: " while the success just returns "READY". Consider "READY: synced" for consistency.

  2. Documentation Enhancement: The comment for ReadinessMaxBlocksBehind in config.go could clarify the behavior when set to 0 (e.g., "0 means must be exactly at head - no tolerance")

  3. Height Provider Naming: Consider renaming bestKnown variable to bestKnownProvider in light.go:81 for clarity

✅ Security & Performance

  • No security concerns identified
  • Minimal performance impact - simple height comparisons
  • No resource leaks or concurrency issues

✅ Test Coverage

Excellent coverage including:

  • Various sync states (at head, behind, ahead)
  • Peer connectivity checks
  • Error conditions
  • Both aggregator and non-aggregator nodes

Overall, this is a solid implementation that addresses #2673 effectively. The code is production-ready with only minor cosmetic suggestions. Great work!

@claude
Contributor

claude bot commented Sep 22, 2025

Code Review for PR #2676: Adding Ready Handler

Summary

This PR adds a readiness health check endpoint (/health/ready) to the ev-node system, which is essential for Kubernetes deployments and general operational monitoring. The implementation looks solid overall with good test coverage and sensible defaults.

Strengths

  • Clean implementation: The readiness check logic is well-structured and follows existing patterns
  • Good test coverage: Comprehensive unit tests covering various scenarios
  • Proper configuration: Added as configurable parameter with sensible default (3 blocks behind)
  • Appropriate checks: Verifies peer connectivity, local height, and compares with best-known height

Code Quality

Minor improvements suggested:

  1. In pkg/rpc/server/http.go:47, error message could be more informative
  2. The bestKnownHeightProvider nil check could be moved earlier for fail-fast behavior

Potential Issues

  1. The bestKnownHeightProvider in node/full.go reads from two sync services without explicit synchronization. While likely safe with atomic uint64 operations, consider documenting this assumption.
  2. Edge case handling when both heights are 0 could distinguish between "not started" vs "genesis state"

Performance

  • Lightweight and efficient implementation
  • Consider caching peer count check if endpoint gets hit frequently

Security

  • No security concerns identified
  • Appropriately exposes only operational status

Test Coverage

  • Excellent coverage with multiple scenarios
  • Consider adding integration tests for sync operation behavior

Suggestions

  1. Add Prometheus metrics for height delta monitoring
  2. Document typical ReadinessMaxBlocksBehind values
  3. Consider a degraded state for slightly behind but functional nodes

Overall Assessment

Well-implemented feature following project patterns. Clean code with good tests. Ready for merge with minor improvements.

Approved with suggestions

Member

@julienrbrt julienrbrt left a comment


utACK

@chatton chatton added this pull request to the merge queue Sep 22, 2025
Merged via the queue into main with commit 727e591 Sep 22, 2025
29 of 31 checks passed
@chatton chatton deleted the cian/add-readiness-endpoint branch September 22, 2025 09:31
@github-project-automation github-project-automation bot moved this to Done in Evolve Sep 22, 2025


Development

Successfully merging this pull request may close these issues.

[FEATURE] Add readiness probe to exclude syncing nodes from client traffic
