Fix GitDagBundle race condition in concurrent task execution#61489
Fix GitDagBundle race condition in concurrent task execution#61489Arunodoy18 wants to merge 2 commits intoapache:mainfrom
Conversation
When LocalExecutor runs multiple tasks concurrently that use the same GitDagBundle version, the existing _clone_repo_if_required() method had a race condition where: 1. Multiple processes check if version directory exists (all see False) 2. All processes start cloning to the same directory simultaneously 3. Partial clones become visible to other processes mid-operation 4. FileNotFoundError or corrupted repos can occur during checkout This fix implements atomic clone operations: - Clone to temporary directory with PID-based naming - Use atomic os.rename() to make clone visible only when complete - Validate existing repos before reuse (check repo.head) - Handle rename race gracefully when another process wins - Clean up temp directories on any failure This ensures each version directory is created atomically and tasks safely reuse valid existing clones without race conditions. Test coverage includes: - Concurrent clone attempts - Existing repo reuse - Corrupted repo detection and replacement - Temp directory cleanup on failure - Atomic rename race handling - Multiple version isolation
| Clone repository if required, with atomic directory creation to prevent races. | ||
|
|
||
| This method ensures that: | ||
| 1. Only one process clones a specific version at a time (version-level locking) |
There was a problem hiding this comment.
We already have this:
|
|
||
| This method ensures that: | ||
| 1. Only one process clones a specific version at a time (version-level locking) | ||
| 2. Clones are atomic - no partial clones visible to other processes |
There was a problem hiding this comment.
And that lock should prevent others from seeing partials too.
| except (InvalidGitRepositoryError, GitCommandError, ValueError) as e: | ||
| # Repo exists but is invalid/corrupted - clean it up | ||
| self._log.warning( | ||
| "Existing repo is invalid, will re-clone", |
There was a problem hiding this comment.
And, afaik this is the root cause of the issue you picked up - with prune_dotgit_folder true you always end up with an "invalid" repo, leading to churn. That's the thing we need to ultimately fix - prune_dotgit_folder should not result in needing to reclone to reuse the bundle.
|
cc @ephraimbuddy since you added the problematic |
I'll take a look tomorrow. However, the prune_dotgit_folder is configurable. This PR doesn't look nice to me. I'll take a better look tomorrow |
|
Thanks — that makes sense. |
Root Cause:
Versioned bundles (using versions/{SHA}/ directories) are designed to be
shared across concurrent tasks to avoid redundant cloning. However, when
prune_dotgit_folder=True (the default), the .git folder was removed after
initialization, breaking subsequent reuse attempts:
1. Task A initializes version ABC123, prunes .git
2. Task B tries to reuse ABC123, validation fails (not a git repo)
3. Task B deletes and re-clones entire version directory
4. Race conditions and performance degradation ensue
Evidence: ALL existing tests for versioned bundles explicitly set
prune_dotgit_folder=False, indicating maintainers understood this
constraint but didn't enforce it in the code.
Fix:
Only prune non-versioned (tracking) bundles. Versioned bundles must keep
.git to enable:
- Safe validation via Repo(path) before reuse
- Shared version directories across concurrent tasks
- Elimination of unnecessary re-clones
The --local clone strategy already uses hardlinks to the bare repo, so
disk usage impact is minimal despite keeping .git intact.
Changes:
- Modified prune condition: \if prune_dotgit_folder and not self.version\
- Added debug logging to explain prune decisions
- Added test verifying versioned bundles keep .git with default settings
- Updated docstring to document design rationale
Design Alignment:
This fix aligns with Airflow's KISS principle - minimal invasive change
that addresses the root cause rather than working around symptoms. The
solution enforces what test patterns already demonstrated was correct.
Backward Compatibility:
Fully compatible - tracking repos (no version) still get pruned as before.
Versioned repos now work correctly with the default prune_dotgit_folder=True
setting that users expect.
SUMMARY
Fix GitDagBundle performing a full git clone into a new directory for each task execution when used with LocalExecutor, which can lead to inode changes and result in FileNotFoundError during task runtime.
This change ensures that an existing valid bundle clone is reused when safe, while preserving executor isolation, bundle lifecycle guarantees, and correctness across parallel task execution.
ROOT CAUSE
The current GitDagBundle behavior triggers a fresh git clone into a new directory for each task execution.
This results in inode changes between DAG parsing and task execution phases, which can cause FileNotFoundError when tasks reference paths from the previously resolved bundle location.
SOLUTION
Introduce safe bundle reuse logic by:
The implementation ensures no partial clone states are exposed and preserves cleanup lifecycle behavior.
TESTING
Added tests covering:
Functional:
Regression:
Edge Cases:
All existing tests pass, and new tests validate bundle reuse correctness.
PERFORMANCE IMPACT
Reduces unnecessary full git clones per task execution and avoids redundant filesystem operations while maintaining correctness guarantees.
closes: #61396