Child process termination not known by Parent error in concurrent.futures #29

Closed
GoogleCodeExporter opened this issue Mar 14, 2015 · 1 comment

Comments

@GoogleCodeExporter

What steps will reproduce the problem?

1. Submit a task using ProcessPoolExecutor
2. kill -9 <one_of_childrens_pid>
3. parent process gets blocked forever.

What is the expected output? What do you see instead?

When a child process dies or crashes, the parent process is not notified and 
blocks forever. The other children end up either blocked or repeatedly 
timing out.

We were able to reproduce this with the following code by killing one of the 
child processes.

#!/home/y/bin64/python2.7

import concurrent.futures
import time
import signal
import os
import sys
import traceback


def just_wait(identifier):
    time.sleep(20)
    return identifier

def signal_handler(sig, stack):
    try:
        result = os.waitpid(-1, os.WNOHANG)
        while result[0]:
            print("Reaped child process %s" % result[0])
            result = os.waitpid(-1, os.WNOHANG)
        traceback.print_stack()
        sys.exit()    
    except (OSError):
        pass

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=30) as executor:
        future_to_id = [executor.submit(just_wait, i) for i in range(1, 31)]
        for future in concurrent.futures.as_completed(future_to_id):
            returned_id = future.result()
            print "Process Id: ", returned_id

if __name__=='__main__':
    signal.signal(signal.SIGCHLD, signal_handler)
    main()

The status of one of the child processes:
$ sudo strace -p 30974
Password: 
Process 30974 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1410964539, 104107}, NULL) = 0
gettimeofday({1410964539, 104165}, NULL) = 0
futex(0x7f3e698e7000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1410964539, 204165000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1410964539, 204812}, NULL) = 0
gettimeofday({1410964539, 204845}, NULL) = 0

The status for parent process:
$ sudo strace -p 30948
Process 30948 attached - interrupt to quit
futex(0x1addc30, FUTEX_WAIT_PRIVATE, 0, NULL

What version of the product are you using? On what operating system?
RHEL - 6.4.

Please provide any additional information below.
Here's the related issue that was fixed in Python 3.3:
http://bugs.python.org/issue9205
Since we are using Python 2.7.5, would it be possible to backport this fix to 
futures for 2.7?

Original issue reported on code.google.com by immil...@yahoo-inc.com on 17 Sep 2014 at 9:25

@agronholm
Owner

The relevant fix upstream uses Python 3 features and cannot be backported. If you have a suggestion, I'm all ears.
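
One possible stopgap that needs no backport is to bound the wait in the parent, so a killed worker shows up as futures that never finish instead of an indefinite hang. The sketch below is a mitigation, not a fix, and is not part of this package: `collect_with_deadline` is a hypothetical helper name, and the timeout value is an assumption that would have to exceed the longest legitimate task.

```python
import concurrent.futures


def collect_with_deadline(futures, timeout):
    """Wait at most `timeout` seconds, then report what finished and what did not.

    Hypothetical helper (not part of concurrent.futures). Returns a tuple
    (results, stuck): `results` holds the return values of the futures that
    completed, and `stuck` is the set of futures still pending at the deadline.
    """
    # wait() returns as soon as everything is done or the timeout expires,
    # so the parent never blocks indefinitely on a worker that was killed.
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)
    # result() re-raises any exception raised inside a completed task.
    results = [f.result() for f in done]
    return results, not_done
```

In the reproduction script above, the `as_completed` loop could be replaced with something like `results, stuck = collect_with_deadline(future_to_id, timeout=60)`; a non-empty `stuck` set then signals that a worker probably died. Note that leaving the `with` block still calls `shutdown(wait=True)`, which may itself block on a wedged pool, so this only lets the parent detect the problem and decide how to exit.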
