[Python] Hdfs client isn't fork-safe #18057

Closed
asfimport opened this issue Feb 1, 2018 · 4 comments
asfimport commented Feb 1, 2018

Given the following script:

 

import multiprocessing as mp
import pyarrow as pa


def ls(h):
    print("calling ls")
    return h.ls("/tmp")


if __name__ == '__main__':
    h = pa.hdfs.connect()
    print("Using 'spawn'")
    pool = mp.get_context('spawn').Pool(2)
    results = pool.map(ls, [h, h])
    sol = h.ls("/tmp")
    for r in results:
        assert r == sol
    print("'spawn' succeeded\n")

    print("Using 'fork'")
    pool = mp.get_context('fork').Pool(2)
    results = pool.map(ls, [h, h])
    sol = h.ls("/tmp")
    for r in results:
        assert r == sol
    print("'fork' succeeded")

 

Results in the following output:

 

$ python test.py
Using 'spawn'
calling ls
calling ls
'spawn' succeeded

Using 'fork'

 

The process then hangs, and I have to kill -9 the forked worker processes.

 

I'm unable to get the libhdfs3 driver to work, so I'm unsure whether this is a problem with libhdfs itself or just with Arrow's use of it (a quick Google search didn't turn up anything useful).

Reporter: Jim Crist / @jcrist


Note: This issue was originally created as ARROW-2081. Please see the migration documentation for further details.


Wes McKinney / @wesm:
I think this has to do with the general policy around forking with an embedded JVM. It may not be supported, but I didn't turn up any immediate references.


Wes McKinney / @wesm:
Is there a way we can detect the fork in the child process(es) and at least avoid a hang or segfault?
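
One possible way to detect the fork, sketched here as an illustration only: record the creating process's PID and compare it on every call, so a forked child fails fast with an error instead of hanging. `ForkGuard` and `GuardedClient` are hypothetical names, not part of pyarrow, and the wrapped client is assumed only to expose the `ls` method used in the script above.

```python
import os


class ForkGuard:
    """Remembers which process created it and rejects use from any other."""

    def __init__(self):
        # PID of the process that opened the connection.
        self._owner_pid = os.getpid()

    def check(self):
        pid = os.getpid()
        if pid != self._owner_pid:
            raise RuntimeError(
                f"handle created in process {self._owner_pid} "
                f"used in process {pid}; reconnect in the child instead"
            )


class GuardedClient:
    """Hypothetical wrapper: validates the PID before delegating to the client."""

    def __init__(self, client):
        self._client = client
        self._guard = ForkGuard()

    def ls(self, path):
        self._guard.check()  # raises in a forked child instead of hanging
        return self._client.ls(path)
```

Note that a pickled copy sent to a 'spawn' worker carries the parent's PID as well, so a real implementation would probably need to re-record the PID when the handle is unpickled and reconnected.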


Antoine Pitrou / @pitrou:
For the record, if you want decent multiprocessing performance together with fork safety, I would suggest using the "forkserver" method, not "spawn".

(Note the C libhdfs3 library isn't fork-safe, so no need to try it out IMHO :-))
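
A minimal sketch of the "forkserver" suggestion, under the assumption that each worker opens its own connection through a pool initializer instead of inheriting the parent's handle. The helper names (`init_worker`, `list_paths`) are hypothetical, and `connect` stands for any picklable zero-argument callable returning a client with an `ls` method.

```python
import multiprocessing as mp

_client = None  # per-process client, set by the pool initializer


def init_worker(connect):
    """Open a fresh connection inside the worker process."""
    global _client
    _client = connect()


def ls(path):
    """List `path` using this worker's own client."""
    return _client.ls(path)


def list_paths(connect, paths, processes=2):
    """Map `ls` over `paths` in workers started via 'forkserver'."""
    # 'forkserver' is only available on Unix platforms.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes, initializer=init_worker, initargs=(connect,)) as pool:
        return pool.map(ls, paths)
```

With the script from the report, this would presumably be called as `list_paths(pa.hdfs.connect, ["/tmp", "/tmp"])` under the `if __name__ == '__main__':` guard. The workers are forked from a clean server process that has not loaded the JVM, which is what avoids the inherited state.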


Antoine Pitrou / @pitrou:
Closing as won't fix for several reasons:

  1. we don't have any HDFS expertise among the developer team (AFAIK)
  2. it is likely that the fork-safety issue resides in the underlying C library (libhdfs)
  3. no interested party has been willing to investigate and propose a fix
