[Python] pyarrow.hdfs.connect crashes when executed asynchronously in processes #23721

Closed
asfimport opened this issue Dec 20, 2019 · 1 comment

asfimport commented Dec 20, 2019

When connecting to HDFS from a ProcessPoolExecutor, the first call raises an exception and the function never returns (potential deadlock?). With a ThreadPoolExecutor, the same code works as expected.

Sample code that reproduces the problem follows:

import pyarrow as pa

from concurrent.futures import (
        ThreadPoolExecutor,
        ProcessPoolExecutor,
        wait,
        ALL_COMPLETED)

def ls():
    fs = pa.hdfs.connect('hdfs://host')
    print(fs.ls('/'))

# This works as expected
ls()

# Running in parallel
thread_pool = ThreadPoolExecutor(max_workers=4)
process_pool = ProcessPoolExecutor(max_workers=4)

def run(pool):
    futures = [pool.submit(ls) for _ in range(5)]
    wait(futures, return_when=ALL_COMPLETED)

# The thread_pool works as expected
run(thread_pool)

# The process_pool raises an exception
run(process_pool)

The following exception is raised:


java.lang.ClassFormatError: Incompatible magic value 1347093252 in class file org/xml/sax/helpers/LocatorImpl
        at java.lang.ClassLoader.findBootstrapClass(Native Method)
        at java.lang.ClassLoader.findBootstrapClassOrNull(ClassLoader.java:1015)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:413)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2684)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2672)
        at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2746)
        at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2696)
        at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2579)
        at org.apache.hadoop.conf.Configuration.get(Configuration.java:1091)
        at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:404)
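
A note on the "magic value" in the first line of the trace: a valid Java class file starts with the bytes 0xCAFEBABE, so it can help to look at what the JVM actually read. The small sketch below (the helper name `decode_magic` is made up for illustration) decodes the reported integer as a big-endian 32-bit value. The result is the signature that begins a ZIP/JAR local file header, which suggests the class loader read the start of an archive rather than a class entry. Why that happens in the child processes is not established here.

```python
import struct

# Magic number every valid Java class file starts with.
JAVA_CLASS_MAGIC = 0xCAFEBABE


def decode_magic(value):
    """Return the four bytes of a 32-bit magic value in big-endian order."""
    return struct.pack(">I", value)


reported = 1347093252  # value taken from the ClassFormatError message
raw = decode_magic(reported)

print(hex(reported))                 # 0x504b0304
print(raw)                           # b'PK\x03\x04'
print(reported == JAVA_CLASS_MAGIC)  # False
```
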

Reporter: Panagiotis Nezis

Note: This issue was originally created as ARROW-7451. Please see the migration documentation for further details.

Wes McKinney / @wesm:
This may be related to the os.fork-related issues discussed in ARROW-2081.
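
If fork is indeed the cause, one thing worth trying is a process pool that uses the "spawn" start method, so each worker is a fresh interpreter instead of a forked copy of the parent. The following is only a sketch under that assumption, not a confirmed fix: `make_spawn_pool` is a hypothetical helper, `hdfs://host` is the placeholder from the report, and `pyarrow` is imported inside the worker so nothing HDFS-related is loaded before the worker process exists.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, wait, ALL_COMPLETED


def ls(uri="hdfs://host"):
    # Import and connect inside the worker process, so the connection is
    # created there rather than inherited from the parent.
    import pyarrow as pa

    fs = pa.hdfs.connect(uri)
    return fs.ls("/")


def make_spawn_pool(max_workers=4):
    # Hypothetical helper: "spawn" launches new interpreters and avoids
    # os.fork(), which is the suspected source of the problem.
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


def run(pool, n=5):
    # Submit n listing tasks and collect their results once all have finished.
    futures = [pool.submit(ls) for _ in range(n)]
    wait(futures, return_when=ALL_COMPLETED)
    return [f.result() for f in futures]
```

With "spawn", the code that creates the pool and calls `run(pool)` has to sit behind an `if __name__ == "__main__":` guard in a script, because each worker re-imports the main module. The `mp_context` argument requires Python 3.7 or later.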
