[GOBBLIN-2257] Fix thread-safety of shared maps after parallel onAddSpec #4183
Closed
DaisyModi wants to merge 1 commit into
GOBBLIN-2257 removed synchronized from onAddSpec and introduced a multi-threaded executor for parallel flow compilation. However, the listener callbacks and FlowCatalog internals that were implicitly protected by that synchronized block now run concurrently without thread-safe data structures.

GobblinServiceJobScheduler.onAddSpec() reads and writes scheduledFlowSpecs and lastUpdatedTimeForFlowSpec (plain HashMaps) from multiple concurrent callback threads, and NonScheduledJobRunner also removes entries from a separate thread pool. Concurrent HashMap modifications cause structural corruption: lost entries, orphaned DAGs, and LaunchDagProc/DagNode errors in production.

Fix:
- GobblinServiceJobScheduler: HashMap -> ConcurrentHashMap for scheduledFlowSpecs and lastUpdatedTimeForFlowSpec
- FlowCatalog: HashMap -> ConcurrentHashMap for specSyncObjects (also accessed concurrently from updateOrAddSpecHelper and NonScheduledJobRunner)
- FlowCatalog: Downgrade "SpecStore is missing in FlowCatalog" log from ERROR to WARN; this is an expected transient condition with concurrent SpecStore writes, already handled by exponential backoff retries

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
GOBBLIN-2257 removed synchronized from onAddSpec and introduced a multi-threaded executor for parallel flow compilation, improving flowConfigsV2 GET API P99 latency. However, the listener callbacks and FlowCatalog internals that were implicitly protected by that synchronized block now run concurrently without thread-safe data structures.
The bug
GobblinServiceJobScheduler.onAddSpec() reads and writes scheduledFlowSpecs and lastUpdatedTimeForFlowSpec, both plain HashMaps, from multiple concurrent callback threads. NonScheduledJobRunner also removes entries from a separate thread pool. Concurrent HashMap modifications cause structural corruption, which surfaces as errors such as "LaunchDagProc - error" and "DagNode or its job status not found".
FlowCatalog.specSyncObjects (also a plain HashMap) is similarly accessed concurrently from updateOrAddSpecHelper (multiple API request threads) and NonScheduledJobRunner.
The fix
- GobblinServiceJobScheduler: Maps.newHashMap() → Maps.newConcurrentMap() for both scheduledFlowSpecs and lastUpdatedTimeForFlowSpec
- FlowCatalog: HashMap → ConcurrentHashMap for specSyncObjects
- FlowCatalog: Downgrade the "SpecStore is missing in FlowCatalog" log from ERROR to WARN. With concurrent SpecStore writes, getSpecURIs() can list a URI before addSpec() fully commits. This transient condition is already handled by exponential backoff retries; ERROR level creates false alarms in monitoring.
All changes are drop-in replacements with no API changes and no impact on the P99 latency improvement from GOBBLIN-2257.
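For reviewers who want to see the shape of the swap, here is a minimal sketch. It is not the Gobblin code: the class SharedStateSketch, its method names, the String/Long/Object value types, and the "keep the newest timestamp" rule are all assumptions made for illustration. It uses java.util.concurrent.ConcurrentHashMap directly so that it compiles without Guava. The point it tries to show is that a concurrent map makes each individual call safe, and that a read-then-write sequence is only atomic when it goes through a single call such as computeIfAbsent or merge.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the shared state; not the actual Gobblin classes. */
class SharedStateSketch {
  // Before the fix these would have been plain HashMaps.
  // Value types here are assumptions chosen to keep the sketch small.
  private final Map<String, Long> lastUpdatedTimeForFlowSpec = new ConcurrentHashMap<>();
  private final Map<String, Object> specSyncObjects = new ConcurrentHashMap<>();

  /** Returns the lock object for a spec URI, creating it atomically on first use. */
  Object syncObjectFor(String specUri) {
    return specSyncObjects.computeIfAbsent(specUri, uri -> new Object());
  }

  /** Records a modification time and returns the stored value (the newest one seen). */
  long recordUpdate(String specUri, long modifiedTime) {
    // merge() is a single atomic step, unlike a get() followed by a put().
    return lastUpdatedTimeForFlowSpec.merge(specUri, modifiedTime, Long::max);
  }

  public static void main(String[] args) {
    SharedStateSketch state = new SharedStateSketch();
    // Both lookups return the same lock instance.
    System.out.println(state.syncObjectFor("flow-a") == state.syncObjectFor("flow-a")); // prints "true"
    state.recordUpdate("flow-a", 200L);
    // An older timestamp does not overwrite a newer one.
    System.out.println(state.recordUpdate("flow-a", 100L)); // prints "200"
  }
}
```

Whether the real onAddSpec needs atomic compound updates like these depends on code not shown in this description; the sketch only illustrates what the replacement type does and does not guarantee.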
Test plan
- FlowCatalogTest and GobblinServiceJobSchedulerTest pass
- "LaunchDagProc - error" and "DagNode not found" errors should stop occurring
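A standalone stress check along the following lines could complement the test plan. It is only a sketch under stated assumptions: ConcurrentMapStressSketch, runOnce, the pool sizes, and the key names are hypothetical, and the two pools merely stand in for the onAddSpec callback threads and NonScheduledJobRunner. One pool adds new keys while a second pool removes pre-existing ones, and the final size is compared with the expected count.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stress check; the pools stand in for callback threads and NonScheduledJobRunner. */
class ConcurrentMapStressSketch {

  /** Adds `adds` new keys on one pool while another pool removes `removes` pre-existing keys. */
  static int runOnce(Map<String, Long> scheduledFlowSpecs, int adds, int removes) {
    // Pre-populate the entries that the remover pool will delete.
    for (int i = 0; i < removes; i++) {
      scheduledFlowSpecs.put("run-once-" + i, 0L);
    }
    ExecutorService adders = Executors.newFixedThreadPool(4);
    ExecutorService removers = Executors.newFixedThreadPool(2);
    // Interleave submissions so adds and removes overlap in time.
    for (int i = 0; i < Math.max(adds, removes); i++) {
      final int n = i;
      if (n < adds) {
        adders.execute(() -> scheduledFlowSpecs.put("scheduled-" + n, (long) n));
      }
      if (n < removes) {
        removers.execute(() -> scheduledFlowSpecs.remove("run-once-" + n));
      }
    }
    shutdownAndWait(adders);
    shutdownAndWait(removers);
    return scheduledFlowSpecs.size();
  }

  /** Stops accepting tasks and blocks until the queued ones finish. */
  private static void shutdownAndWait(ExecutorService pool) {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("tasks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // With a concurrent map, every add and every remove takes effect.
    System.out.println(runOnce(new ConcurrentHashMap<>(), 5000, 5000)); // prints "5000"
  }
}
```

Passing a plain HashMap instead is not guaranteed to fail on any given run, but it may lose entries or throw, which is why a check like this is better treated as a smoke test than as proof of correctness.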