Add exponential backoff retry logic to Netlify builds #20409
base: main
Conversation
- Add test-remote-failure.md with multiple failure scenarios
- Configure netlify.toml for retry testing environment
- Keep cache plugin for performance during testing
- Ready for manual PR creation to trigger deploy preview

- Create comprehensive build test script with retry capabilities
- Add detailed logging and build statistics
- Support both retry testing and stress testing modes
- Include build timing and attempt tracking
- Make script executable for Netlify deployment

- Remove test-remote-failure.md with intentional failures
- Update netlify.toml for production retry configuration
- Rename build-test.sh to build.sh and remove test-specific code
- Configure 3 retries with 30s base exponential backoff delay
- Simplify logging for production use while keeping retry functionality
✅ Deploy Preview for cockroachdb-interactivetutorials-docs canceled.
✅ Deploy Preview for cockroachdb-api-docs canceled.
✅ Netlify Preview
@ebembi-crdb I like the logging and exponential backoff. I tested the PR with some common error types (via a few commits on a branch off your branch) and found the following:
- Good behavior: htmltest failures (bad anchor links) do not retry - this is correct since these are permanent errors (commit/log)
- Problematic behavior:
Both of these are permanent errors that won't resolve on retry.
I would suggest adding error classification to only retry transient network errors (and in the future, we could always add additional possible criteria).
Specifically, per Claude, I believe we could modify the build_with_monitoring function to capture and analyze Jekyll's output:
```bash
function build_with_monitoring {
  local config=$1
  local build_log="build_${ATTEMPT_COUNT}.log"

  # Capture Jekyll output for analysis; PIPESTATUS gives Jekyll's exit
  # status rather than tee's, so a failed build isn't masked by the pipe
  bundle exec jekyll build --trace --config _config_base.yml,"$config" 2>&1 | tee "$build_log"
  local exit_code=${PIPESTATUS[0]}

  if [[ $exit_code -eq 0 ]]; then
    return 0
  fi

  # Only retry on transient network errors
  if grep -qE "Temporary failure in name resolution|SocketError|Connection refused|Connection reset|Failed to open TCP connection" "$build_log"; then
    return 2  # Retryable error
  else
    # Permanent errors: Liquid syntax, missing files, ArgumentError
    echo "❌ Permanent build error detected - not retrying"
    return 1  # Non-retryable
  fi
}
```
Then update build_with_retries to check the return code:
```bash
if build_with_monitoring "$config"; then
  success=true
  break
else
  local result=$?
  if [[ $result -eq 1 ]]; then
    echo "Permanent error - failing immediately"
    break  # Don't retry
  fi
  # Only retry if result == 2 (transient error)
fi
```
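Putting those two pieces together, here is a rough, untested sketch of what the surrounding build_with_retries loop could look like, assuming the MAX_RETRIES and BASE_RETRY_DELAY environment variables described in the PR; the loop structure and any other variable names are illustrative, not taken from the PR:

```bash
# Untested sketch: retry loop with exponential backoff around build_with_monitoring.
# Assumes MAX_RETRIES (default 3) and BASE_RETRY_DELAY (default 30s) are set in the environment.
function build_with_retries {
  local config=$1
  local success=false

  for (( ATTEMPT_COUNT = 1; ATTEMPT_COUNT <= MAX_RETRIES; ATTEMPT_COUNT++ )); do
    if build_with_monitoring "$config"; then
      success=true
      break
    else
      local result=$?
      if [[ $result -eq 1 ]]; then
        echo "Permanent error - failing immediately"
        break  # Don't retry
      fi
      # Transient error (result == 2): back off exponentially before the next attempt
      if (( ATTEMPT_COUNT < MAX_RETRIES )); then
        local delay=$(( BASE_RETRY_DELAY * 2 ** (ATTEMPT_COUNT - 1) ))
        echo "Transient error - retrying in ${delay}s (attempt ${ATTEMPT_COUNT}/${MAX_RETRIES})"
        sleep "$delay"
      fi
    fi
  done

  [[ $success == true ]]  # Function's exit status reflects overall success
}
```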
I have not tested this, but I wanted to share it in case it is helpful.
One way or another, I do think it is worth retrying only on DNS/network failures: I can't think of any other failure where a retry would have benefited us, and I know that retrying in the other cases I mentioned would waste writers' time when they aren't closely watching their build logs.
Summary
Implements robust retry logic with exponential backoff for Netlify documentation builds to handle transient network failures and remote include errors.
Changes
Benefits
Configuration
- MAX_RETRIES: Number of retry attempts (default: 3)
- BASE_RETRY_DELAY: Base delay in seconds for exponential backoff (default: 30)
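For illustration (not from the PR), the delay before a given attempt, assuming it doubles on each successive retry, would be:

```bash
# With the defaults above: attempt 1 waits 30s, attempt 2 waits 60s, attempt 3 waits 120s
# ("attempt" is a hypothetical 1-based attempt counter)
delay=$(( BASE_RETRY_DELAY * 2 ** (attempt - 1) ))
```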
Testing
Validated with intentional failure scenarios including:
The retry mechanism successfully recovers from transient failures while failing fast on persistent issues.
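As a quick, hypothetical way to sanity-check the transient-vs-permanent classification suggested in the review above (not part of the PR), the grep pattern can be exercised directly against a sample log line; the file name and error message here are made up for illustration:

```bash
# Feed a sample transient-error line through the classification regex
echo "SocketError: Failed to open TCP connection to example.com:443" > sample_build.log
if grep -qE "Temporary failure in name resolution|SocketError|Connection refused|Connection reset|Failed to open TCP connection" sample_build.log; then
  echo "classified as transient (would retry)"
else
  echo "classified as permanent (would fail fast)"
fi
```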