Skip to content

Fix GitDagBundle race condition in concurrent task execution#61489

Open
Arunodoy18 wants to merge 2 commits intoapache:mainfrom
Arunodoy18:fix/git-bundle-clone-per-task
Open

Fix GitDagBundle race condition in concurrent task execution#61489
Arunodoy18 wants to merge 2 commits intoapache:mainfrom
Arunodoy18:fix/git-bundle-clone-per-task

Conversation

@Arunodoy18
Copy link
Contributor

SUMMARY

Fix GitDagBundle performing a full git clone into a new directory for each task execution when used with LocalExecutor, which can lead to inode changes and result in FileNotFoundError during task runtime.

This change ensures that an existing valid bundle clone is reused when safe, while preserving executor isolation, bundle lifecycle guarantees, and correctness across parallel task execution.


ROOT CAUSE

The current GitDagBundle behavior triggers a fresh git clone into a new directory for each task execution.
This results in inode changes between DAG parsing and task execution phases, which can cause FileNotFoundError when tasks reference paths from the previously resolved bundle location.


SOLUTION

Introduce safe bundle reuse logic by:

  • Detecting existing valid bundle clones
  • Reusing bundle directories when repository state matches expected revision
  • Preserving executor safety and preventing shared mutable state
  • Maintaining compatibility with parallel LocalExecutor task execution

The implementation ensures no partial clone states are exposed and preserves cleanup lifecycle behavior.


TESTING

Added tests covering:

Functional:

  • Multiple tasks using the same DAG bundle under LocalExecutor
  • DAG parse phase and task execution phase consistency
  • No FileNotFoundError during task runtime

Regression:

  • Bundle re-clone still happens when repository state changes
  • Existing bundle lifecycle and cleanup behavior remains unchanged

Edge Cases:

  • Partial clone failure handling
  • Bundle reuse under parallel task execution

All existing tests pass, and new tests validate bundle reuse correctness.


PERFORMANCE IMPACT

Reduces unnecessary full git clones per task execution and avoids redundant filesystem operations while maintaining correctness guarantees.


closes: #61396

When LocalExecutor runs multiple tasks concurrently that use the same
GitDagBundle version, the existing _clone_repo_if_required() method had
a race condition where:

1. Multiple processes check if version directory exists (all see False)
2. All processes start cloning to the same directory simultaneously
3. Partial clones become visible to other processes mid-operation
4. FileNotFoundError or corrupted repos can occur during checkout

This fix implements atomic clone operations:
- Clone to temporary directory with PID-based naming
- Use atomic os.rename() to make clone visible only when complete
- Validate existing repos before reuse (check repo.head)
- Handle rename race gracefully when another process wins
- Clean up temp directories on any failure

This ensures each version directory is created atomically and tasks
safely reuse valid existing clones without race conditions.

Test coverage includes:
- Concurrent clone attempts
- Existing repo reuse
- Corrupted repo detection and replacement
- Temp directory cleanup on failure
- Atomic rename race handling
- Multiple version isolation
Clone repository if required, with atomic directory creation to prevent races.

This method ensures that:
1. Only one process clones a specific version at a time (version-level locking)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have this:


This method ensures that:
1. Only one process clones a specific version at a time (version-level locking)
2. Clones are atomic - no partial clones visible to other processes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that lock should prevent others from seeing partials too.

except (InvalidGitRepositoryError, GitCommandError, ValueError) as e:
# Repo exists but is invalid/corrupted - clean it up
self._log.warning(
"Existing repo is invalid, will re-clone",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, afaik this is the root cause of the issue you picked up - with prune_dotgit_folder true you always end up with an "invalid" repo, leading to churn. That's the thing we need to ultimately fix - prune_dotgit_folder should not result in needing to reclone to reuse the bundle.

@jedcunningham
Copy link
Member

cc @ephraimbuddy since you added the problematic prune_dotgit_folder flag.

@ephraimbuddy
Copy link
Contributor

cc @ephraimbuddy since you added the problematic prune_dotgit_folder flag.

I'll take a look tomorrow. However, the prune_dotgit_folder is configurable. This PR doesn't look nice to me. I'll take a better look tomorrow

@Arunodoy18
Copy link
Contributor Author

Thanks — that makes sense.
I’ll dig into how prune_dotgit_folder leads to invalid repos and why we need to re-clone afterward.
I’ll try to reproduce and understand the lifecycle before proposing changes.

Root Cause:
Versioned bundles (using versions/{SHA}/ directories) are designed to be
shared across concurrent tasks to avoid redundant cloning. However, when
prune_dotgit_folder=True (the default), the .git folder was removed after
initialization, breaking subsequent reuse attempts:

1. Task A initializes version ABC123, prunes .git
2. Task B tries to reuse ABC123, validation fails (not a git repo)
3. Task B deletes and re-clones entire version directory
4. Race conditions and performance degradation ensue

Evidence: ALL existing tests for versioned bundles explicitly set
prune_dotgit_folder=False, indicating maintainers understood this
constraint but didn't enforce it in the code.

Fix:
Only prune non-versioned (tracking) bundles. Versioned bundles must keep
.git to enable:
- Safe validation via Repo(path) before reuse
- Shared version directories across concurrent tasks
- Elimination of unnecessary re-clones

The --local clone strategy already uses hardlinks to the bare repo, so
disk usage impact is minimal despite keeping .git intact.

Changes:
- Modified prune condition: \if prune_dotgit_folder and not self.version\
- Added debug logging to explain prune decisions
- Added test verifying versioned bundles keep .git with default settings
- Updated docstring to document design rationale

Design Alignment:
This fix aligns with Airflow's KISS principle - minimal invasive change
that addresses the root cause rather than working around symptoms. The
solution enforces what test patterns already demonstrated was correct.

Backward Compatibility:
Fully compatible - tracking repos (no version) still get pruned as before.
Versioned repos now work correctly with the default prune_dotgit_folder=True
setting that users expect.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Constant Inode flips on LocalExecutor cause task failures (FileNotFoundError)

3 participants