Add timeout and retry logic to Azure token fetch #3113

fivetran-rahulprakash · 2025-11-21T05:35:43Z

Problem

The getAccessToken() method in AzureCredentialsStorageIntegration used an unbounded blocking call which could hang indefinitely if Azure's token endpoint was slow or unresponsive. This could lead to:

- Thread pool exhaustion in high-concurrency scenarios
- Cascading failures when Azure AD experiences degraded performance
- Poor user experience with no visibility into token fetch failures

Solution

This PR adds defensive timeout and retry mechanisms using Project Reactor's built-in capabilities:

- 15-second timeout per individual token request attempt to prevent indefinite blocking
- Exponential backoff retry (up to 3 retries with base delays of 2s, 4s, 8s) with 50% jitter to prevent a thundering herd during mass failures
- 90-second overall timeout as a safety net for the complete operation
- Intelligent retry filtering for known transient Azure AD errors:
  - AADSTS50058 - Token endpoint timeout
  - AADSTS50078 - Service temporarily unavailable
  - AADSTS700084 - Token refresh required
  - 503 - Service unavailable
  - 429 - Too many requests
- Enhanced logging for better observability (warnings on errors, info on retries)
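As a rough illustration of the backoff schedule described above, the following plain-JDK sketch computes the delay window before each retry. It does not use Reactor; the class name `BackoffSchedule`, the `Window` record, and the clamping of the lower bound to the initial delay are assumptions made for this example and are only meant to approximate how a jittered exponential backoff behaves.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: computes the delay window before each retry. */
final class BackoffSchedule {

  /** Inclusive lower and upper bound of the delay before one retry. */
  record Window(Duration min, Duration max) {}

  /**
   * Doubles the base delay for each retry and widens it by +/- jitterFactor.
   * Assumption: the jittered delay never drops below the initial delay.
   */
  static List<Window> windows(int retries, Duration initialDelay, double jitterFactor) {
    List<Window> result = new ArrayList<>();
    long initialMillis = initialDelay.toMillis();
    for (int i = 0; i < retries; i++) {
      long baseMillis = initialMillis << i; // 2s, 4s, 8s, ... for a 2s initial delay
      long offset = Math.round(baseMillis * jitterFactor);
      long low = Math.max(initialMillis, baseMillis - offset);
      long high = baseMillis + offset;
      result.add(new Window(Duration.ofMillis(low), Duration.ofMillis(high)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Values from the PR description: 3 retries, 2s initial delay, 50% jitter.
    windows(3, Duration.ofSeconds(2), 0.5)
        .forEach(w -> System.out.println(w.min().toMillis() + " ms .. " + w.max().toMillis() + " ms"));
  }
}
```

With the values from the description, the base delays of 2s, 4s, and 8s become windows rather than fixed waits, which is what spreads out simultaneous retries from many callers.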

Testing

- Code leverages existing reactor dependencies (no new dependencies)
- Follows existing Polaris patterns for reactive error handling

Benefits

- Improves system resilience to transient Azure service issues
- Prevents indefinite blocking that could cascade to request timeouts
- Provides better observability with structured logging
- Uses well-established retry patterns with exponential backoff and jitter

Checklist

🛡️ Don't disclose security issues! (Not applicable - this is a resilience improvement)
🔗 Clearly explained why the changes are needed
🧪 Manually tested via compilation; reactive behavior follows reactor-core semantics
💡 Added comprehensive Javadoc explaining the retry strategy
🧾 Updated CHANGELOG.md (awaiting maintainer guidance on format)
📚 Updated documentation (no user-facing config changes)

Previously, the getAccessToken method used an unbounded blocking call which could hang indefinitely if Azure's token endpoint was slow or unresponsive. This change adds defensive timeout and retry mechanisms:

- 15-second timeout per individual token request attempt
- Exponential backoff retry (up to 3 retries: 2s, 4s, 8s) with 50% jitter to prevent a thundering herd during mass failures
- 90-second overall timeout as a safety net
- Specific retry logic for known transient Azure AD errors (AADSTS50058, AADSTS50078, AADSTS700084, 503, 429)

This makes the system more resilient to transient Azure service issues and prevents indefinite blocking that could cascade to request timeouts or service degradation.

dimas-b

Nice improvement! Thanks for your contribution, @fivetran-rahulprakash!

dimas-b · 2025-11-21T17:46:56Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

        defaultAzureCredential
            .getToken(new TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout


We have RealmConfig here. Could you add a general setting in FeatureConfiguration for this timeout? I suppose it could be applicable to other integrations too (but, of course, in this PR we can concentrate on Azure only).

I've added four new configuration constants in FeatureConfiguration:

- CLOUD_API_TIMEOUT_SECONDS (default: 15) - Per-attempt timeout
- CLOUD_API_RETRY_COUNT (default: 3) - Number of retry attempts
- CLOUD_API_RETRY_DELAY_SECONDS (default: 2) - Initial delay for exponential backoff
- CLOUD_API_RETRY_JITTER_MILLIS (default: 500) - Maximum jitter to prevent thundering herd

These use generic naming (CLOUD_API_*) rather than Azure-specific names, making them reusable for future implementations.

dimas-b · 2025-11-21T17:47:40Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

+                        tenantId,
+                        error.getMessage()))
+            .retryWhen(
+                Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 8s


Having backoff settings configurable could also be helpful.

dimas-b · 2025-11-21T17:48:42Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

+                    .jitter(0.5) // ±50% jitter to prevent thundering herd
+                    .filter(
+                        throwable ->
+                            throwable instanceof java.util.concurrent.TimeoutException


TimeoutException is already handled by isRetriableAzureException, right?

Yes, you're absolutely right! I missed that. Removed the duplicate check now. Thank you for catching that!
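For readers following along, a predicate in the spirit of isRetriableAzureException might look like the sketch below. The actual method body is not shown in this thread, so the class name `RetriableErrors`, the marker list, and the substring matching are all assumptions made for illustration. The point it demonstrates is the one raised in the review: once the predicate itself recognizes `TimeoutException`, a separate `instanceof` check in the retry filter is redundant.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for isRetriableAzureException; not the Polaris implementation. */
final class RetriableErrors {

  // Markers taken from the PR description; matching them as substrings is an assumption.
  private static final List<String> TRANSIENT_MARKERS =
      List.of("AADSTS50058", "AADSTS50078", "AADSTS700084", "503", "429");

  /** Returns true if the throwable or any of its causes looks transient. */
  static boolean isRetriable(Throwable throwable) {
    for (Throwable t = throwable; t != null; t = t.getCause()) {
      if (t instanceof TimeoutException) {
        return true; // timeouts are covered here, so the filter needs no separate check
      }
      String message = t.getMessage();
      if (message != null && TRANSIENT_MARKERS.stream().anyMatch(message::contains)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(isRetriable(new TimeoutException("token endpoint too slow")));
    System.out.println(isRetriable(new IllegalArgumentException("bad tenant id")));
  }
}
```

Substring matching on bare status codes such as "503" is deliberately crude here; a real implementation would more likely inspect a structured status or error code if the SDK exposes one.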

dimas-b · 2025-11-21T17:49:56Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

+                                    "Azure token fetch exhausted after %d attempts for tenant %s",
+                                    retrySignal.totalRetries(), tenantId),
+                                retrySignal.failure())))
+            .blockOptional(Duration.ofSeconds(90)) // Maximum total wait time


Why do we need this on top of .timeout() (line 337)?

Good point! I initially added the overall timeout as a safety net to ensure we never block indefinitely, but you're right it's unnecessary. The combination of per-attempt timeout and .retryWhen() with exponential backoff already provides sufficient protection. Removed it now. Thanks for the feedback!
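The claim that the per-attempt timeout plus bounded retries already caps the total wait can be checked with a little arithmetic. The sketch below is a plain-JDK calculation, not Reactor code. It assumes that the timeout applies to every attempt, that 3 retries means one initial attempt plus three more, that jitter adds at most the configured fraction to each base delay, and that no maximum backoff cap is set. The class name `WorstCaseWait` is hypothetical.

```java
import java.time.Duration;

/** Illustrative upper bound on total blocking time for a timeout-plus-backoff pipeline. */
final class WorstCaseWait {

  static Duration upperBound(
      int retries, Duration perAttemptTimeout, Duration initialDelay, double jitterFactor) {
    // Every attempt (the initial one plus each retry) may run until it times out.
    long totalMillis = perAttemptTimeout.toMillis() * (retries + 1L);
    for (int i = 0; i < retries; i++) {
      long baseMillis = initialDelay.toMillis() << i; // doubles on each retry
      totalMillis += baseMillis + Math.round(baseMillis * jitterFactor); // longest jittered delay
    }
    return Duration.ofMillis(totalMillis);
  }

  public static void main(String[] args) {
    Duration bound = upperBound(3, Duration.ofSeconds(15), Duration.ofSeconds(2), 0.5);
    System.out.println(bound.getSeconds() + " seconds");
  }
}
```

With the original hardcoded values (15s timeout, 3 retries, 2s initial delay, 50% jitter) this comes to 4 × 15s of attempts plus 3s + 6s + 12s of waiting, or 81 seconds in total, which is already below the removed 90-second cap. Once the values become configurable, the bound changes with them.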

- Add 4 generic cloud provider API configuration constants:
  - CLOUD_API_TIMEOUT_SECONDS (default: 15)
  - CLOUD_API_RETRY_COUNT (default: 3)
  - CLOUD_API_RETRY_DELAY_SECONDS (default: 2)
  - CLOUD_API_RETRY_JITTER_MILLIS (default: 500)
- Update AzureCredentialsStorageIntegration to use configurable values
- Remove hardcoded 90s overall timeout (per-attempt timeout + retries sufficient)
- Improve error logging and retry logic documentation
- Generic naming allows future reuse by AWS/GCP storage integrations

Addresses review comments from dimas-b on PR 3113

fivetran-rahulprakash · 2025-11-24T06:45:15Z

Thank you @dimas-b for the thorough review and excellent suggestions! I've addressed all your comments.

fivetran-rahulprakash · 2025-11-24T06:55:29Z

polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java

+          .defaultValue(2)
+          .buildFeatureConfiguration();
+
+  public static final FeatureConfiguration<Integer> CLOUD_API_RETRY_JITTER_MILLIS =


I chose to use milliseconds instead of a 0-1 jitter factor for a few reasons:

- User clarity - It's more intuitive for operators to specify "500 milliseconds of jitter" rather than understanding what "0.5 jitter factor" means (50% of the retry delay)
- Concrete vs relative - Millis gives direct control over the maximum random delay added, while a factor requires understanding how it interacts with the exponential backoff delays
- Consistency - All other time-based configs use concrete units (seconds/millis) rather than abstract factors
- Predictability - With millis, the max jitter is always clear regardless of retry delay values

The small conversion cost (jitterMillis / 1000.0) is negligible compared to the benefits of making the config more operator friendly. Happy to change to 0-1 factor if you prefer that approach though!

dimas-b

LGTM overall, just some minor comments about the new config.

dimas-b · 2025-11-24T19:00:40Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

+  private AccessToken getAccessToken(RealmConfig realmConfig, String tenantId) {
+    int timeoutSeconds = realmConfig.getConfig(CLOUD_API_TIMEOUT_SECONDS);
+    int retryCount = realmConfig.getConfig(CLOUD_API_RETRY_COUNT);
+    int initialDelaySeconds = realmConfig.getConfig(CLOUD_API_RETRY_DELAY_SECONDS);


Would you mind using millis for initialDelaySeconds... in some cases even 1 sec may be too long. Let's delegate what the min delay should be to the admin user who configures it.

Same for timeoutSeconds... I hope Azure SDK supports millis.

Thanks for the suggestion!

Changed both to milliseconds:
- CLOUD_API_TIMEOUT_MILLIS (default: 15000ms)
- CLOUD_API_RETRY_DELAY_MILLIS (default: 2000ms)

- Rename CLOUD_API_TIMEOUT_SECONDS to CLOUD_API_TIMEOUT_MILLIS (default: 15000ms)
- Rename CLOUD_API_RETRY_DELAY_SECONDS to CLOUD_API_RETRY_DELAY_MILLIS (default: 2000ms)
- Update AzureCredentialsStorageIntegration to use Duration.ofMillis()
- Allows admins to configure sub-second timeouts for finer control

Addresses review feedback from dimas-b

dimas-b

Sorry, one more minor comment from my side... Otherwise LGTM 👍

dimas-b · 2025-11-25T16:26:30Z

.../src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java

+    int retryCount = realmConfig.getConfig(CLOUD_API_RETRY_COUNT);
+    int initialDelayMillis = realmConfig.getConfig(CLOUD_API_RETRY_DELAY_MILLIS);
+    int jitterMillis = realmConfig.getConfig(CLOUD_API_RETRY_JITTER_MILLIS);
+    double jitter = jitterMillis / 1000.0; // Convert millis to fraction for jitter factor


I'm not sure I fully understand this logic... Per the javadoc of reactor.util.retry.RetryBackoffSpec.jitter(), the factor applies to the "computed delay", which may not be 1000 ms 🤔 How can the user reason about what a CLOUD_API_RETRY_JITTER_MILLIS value of 750 (for example) means?

Would it not be simpler to use the 0.0-1.0 factor value in the config?
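To make the concern concrete, the sketch below reproduces the `jitterMillis / 1000.0` conversion from the diff and shows what the resulting factor means once it is applied to the computed backoff delay rather than to a fixed 1000 ms. It is a plain-JDK calculation. The class name `JitterConversion` and the helper `maxOffsetMillis` are hypothetical, and the offset formula (factor times computed delay) follows the javadoc wording quoted above rather than Reactor's source.

```java
/** Illustrates why a millisecond setting divided by 1000 is not a millisecond bound. */
final class JitterConversion {

  /** The conversion quoted in the diff: millis divided by 1000. */
  static double factorFromMillis(int jitterMillis) {
    return jitterMillis / 1000.0;
  }

  /** Largest +/- offset a jitter factor yields for a given computed delay. */
  static long maxOffsetMillis(double factor, long computedDelayMillis) {
    return Math.round(computedDelayMillis * factor);
  }

  public static void main(String[] args) {
    double factor = factorFromMillis(750); // the example value from the review comment
    for (long delay : new long[] {2000, 4000, 8000}) {
      System.out.println(
          "delay " + delay + " ms -> jitter up to +/- " + maxOffsetMillis(factor, delay) + " ms");
    }
  }
}
```

Under these assumptions a setting of 750 "milliseconds" becomes a factor of 0.75, which allows up to 1500 ms of jitter on a 2000 ms delay and 6000 ms on an 8000 ms delay, so the configured number never equals the actual bound unless the computed delay happens to be exactly 1000 ms. A setting above 1000 would also produce a factor greater than 1.0, outside the 0.0-1.0 range mentioned in the review.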

RahulPrakash96 added 2 commits November 21, 2025 10:54

Merge latest upstream changes from apache/polaris main

b0a2714

github-project-automation bot added this to Basic Kanban Board Nov 21, 2025

github-project-automation bot moved this to PRs In Progress in Basic Kanban Board Nov 21, 2025

dimas-b reviewed Nov 21, 2025

fivetran-rahulprakash requested a review from dimas-b November 24, 2025 06:45

fivetran-rahulprakash commented Nov 24, 2025

dimas-b reviewed Nov 24, 2025

fivetran-rahulprakash requested a review from dimas-b November 25, 2025 07:43

dimas-b reviewed Nov 25, 2025
