Skip to content

fix(python apps): fork child + exec appcmd in child to avoid issues#1580

Merged
cicoyle merged 56 commits intodapr:masterfrom
sicoyle:fix-python-hangs
Feb 24, 2026
Merged

fix(python apps): fork child + exec appcmd in child to avoid issues#1580
cicoyle merged 56 commits intodapr:masterfrom
sicoyle:fix-python-hangs

Conversation

@sicoyle
Copy link
Contributor

@sicoyle sicoyle commented Jan 29, 2026

Description

Fork is unsafe after threading has been used, at least for python. This is a known issue in the python community. See:

Forking after threading has been used can cause deadlocks/hangs when using threading primitives like Lock, Condition, or Queue in a process created via fork. All of this occurs within the internals of durabletask-python. This means it will impact any app using the python sdk, be it dapr agents or general python workflows. I tried various python versions and they all experience hangs, at various points due to this.

This PR corrects this by using syscall.ForkExec to fork a child, then execs python or whatever app language within that child so the app language starts via exec, avoiding it knowing about the fork. I will say a downfall of this is I lose the blue APP prefix stuff on the log lines. So this means changing the UX via terminals unfortunately. I tried adding that back with pipes, but then adding the pipes to give me the log prefix then broke my python app again causing the hang.

This is what it looks like btw with this PR. Blue INFO is from the sidecar and the app logs are in white without the ===APP=== prefix that we used to have:

image

I need to test this on a windows machine to make sure I don't leave zombies, and also I think at this point this is good enough to fix other OS's. I could leave a note in the docs that running python apps on windows could lead to unpredictable behavior for now and create a follow up issue for that. Thoughts?

Issue reference

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

  • Code compiles correctly
  • Created/updated tests
  • Extended the documentation

…problems

Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
@sicoyle sicoyle requested review from a team as code owners January 29, 2026 21:32
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
@JoshVanL
Copy link
Contributor

JoshVanL commented Feb 4, 2026

@sicoyle tests are very red

JoshVanL and others added 14 commits February 10, 2026 17:07
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
@codecov
Copy link

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 0% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 15.45%. Comparing base (9780aab) to head (fd34ce5).
⚠️ Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
pkg/standalone/run.go 0.00% 17 Missing ⚠️
pkg/standalone/stop.go 0.00% 10 Missing ⚠️
pkg/standalone/process_group_unix.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1580      +/-   ##
==========================================
- Coverage   15.74%   15.45%   -0.29%     
==========================================
  Files          64       64              
  Lines        7090     7220     +130     
==========================================
  Hits         1116     1116              
- Misses       5888     6018     +130     
  Partials       86       86              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}

cli, err := dockerClient.NewClientWithOpts(dockerClient.FromEnv)
cli, err := dockerClient.NewClientWithOpts(dockerClient.FromEnv, dockerClient.WithVersion("1.48"))
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a hardcoded Docker API version "1.48" may cause compatibility issues with older Docker installations. The Docker client should ideally negotiate the API version automatically, or this version should be made configurable. Consider whether this version requirement is documented and whether it's the minimum supported version for all features used in these tests.

Suggested change
cli, err := dockerClient.NewClientWithOpts(dockerClient.FromEnv, dockerClient.WithVersion("1.48"))
cli, err := dockerClient.NewClientWithOpts(dockerClient.FromEnv)

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this as I was getting a compatibility issue due to versions so pinned on one that would work.

Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
@sicoyle
Copy link
Contributor Author

sicoyle commented Feb 23, 2026

ready again @JoshVanL pls

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Samantha Coyle <sam@diagrid.io>
Signed-off-by: Samantha Coyle <sam@diagrid.io>
cicoyle
cicoyle previously approved these changes Feb 24, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +80 to +92
err = syscall.Kill(-pgid, syscall.SIGTERM)
if err != nil {
// If process group doesn't exist (ESRCH), treat it as already stopped.
if err == syscall.ESRCH {
return nil
}
errKill := syscall.Kill(-pgid, syscall.SIGKILL)
// If process group doesn't exist, treat it as already stopped.
if errKill != nil && errKill != syscall.ESRCH {
return errKill
}
}
return nil
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After sending SIGTERM (line 80), if it succeeds, the function returns immediately without waiting for the process to actually exit. This means the caller has no guarantee that the process group has terminated. Consider documenting that this is a fire-and-forget termination, or implementing a wait-and-verify mechanism if process termination confirmation is needed.

Copilot uses AI. Check for mistakes.
}
return nil
}
// Kill the whole process group.
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed signal from SIGINT to SIGTERM, which is more conventional for graceful shutdown. However, this could potentially affect applications that specifically handle SIGINT differently than SIGTERM. Consider documenting this change in user-facing documentation or release notes to inform users about the signal change.

Suggested change
// Kill the whole process group.
// Kill the whole process group.
// Use SIGTERM for graceful shutdown of the process group. This differs from using SIGINT
// and may affect applications that handle SIGINT and SIGTERM differently.

Copilot uses AI. Check for mistakes.
cmd/run_unix.go Outdated
return fmt.Errorf("failed to fork/exec app: %w", err)
}
if output.AppCMD == nil {
return fmt.Errorf("app command is nil")
Copy link

Copilot AI Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After successfully calling ForkExec and before assigning the process, there's a check for output.AppCMD == nil. If this is nil, the function returns an error but the forked child process has already been started (via ForkExec at line 59) and is now orphaned. Consider either ensuring AppCMD is never nil before calling ForkExec, or cleaning up the forked process if AppCMD is unexpectedly nil by calling syscall.Kill on the returned pid.

Suggested change
return fmt.Errorf("app command is nil")
// Clean up the forked child process to avoid leaving it orphaned.
if killErr := syscall.Kill(pid, syscall.SIGKILL); killErr != nil {
return fmt.Errorf("app command is nil; failed to kill child process %d: %w", pid, killErr)
}
return fmt.Errorf("app command is nil; killed child process %d", pid)

Copilot uses AI. Check for mistakes.
Signed-off-by: Samantha Coyle <sam@diagrid.io>
@cicoyle cicoyle merged commit 26d9692 into dapr:master Feb 24, 2026
35 of 38 checks passed
@sicoyle sicoyle deleted the fix-python-hangs branch February 24, 2026 20:35
sicoyle pushed a commit to sicoyle/dapr-cli that referenced this pull request Feb 24, 2026
fix(python apps): fork child + exec appcmd in child to avoid issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants