[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

febinct · 2023-06-14T16:56:56Z

Describe the bug, including details regarding any error messages, version, and platform.

/Users/voltrondata/github-actions-runner/_work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598: arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit

i think the java client is not closing the s3 client gracefully because of that memory leak happening

Component(s)

Java

westonpace · 2023-06-14T21:54:23Z

CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release so it is very possible the Java implementation has never been calling finalize)

davisusanibar · 2023-06-14T22:29:45Z

CC @davisusanibar @lidavidm (note that this warning was newly added to the S3 filesystem in the previous release so it is very possible the Java implementation has never been calling finalize)

Just able to reproduce this warning with:

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

public class DatasetModule {
    public static void main(String[] args) {
        String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // AWS S3
        // String uri = "hdfs://{hdfs_host}:{port}/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // HDFS
        // String uri = "gs://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; // Google Cloud Storage
        ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
        try (
            BufferAllocator allocator = new RootAllocator();
            DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
            Dataset dataset = datasetFactory.finish();
            Scanner scanner = dataset.newScan(options);
            ArrowReader reader = scanner.scanBatches()
        ) {
            Schema schema = scanner.schema();
            System.out.println(schema);
            while (reader.loadNextBatch()) {
                System.out.println("RowCount: " + reader.getVectorSchemaRoot().getRowCount());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output messages:

RowCount: 2979
/Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/filesystem/s3fs.cc:2598:  arrow::fs::FinalizeS3 was not called even though S3 was initialized.  This could lead to a segmentation fault at exit

Next step:

Review reason of error messages
Add Arrow Java cookbook to cover S3 integration

febinsathar · 2023-06-15T05:28:54Z

while running s3 file reader using java after for sometime the machine is crash looping due to this AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument
this generally happens when aws s3 connection pool are not closed AFAIK

AcidChristLab · 2023-07-08T20:35:23Z

A simple solution to fix this error:

pip install --upgrade --force-reinstall pyarrow==11.0.0

febinct · 2023-07-21T09:45:25Z

this is in java not python

AcidChristLab · 2023-07-22T23:36:52Z

this is in java not python

Are you not using the pyarrow package for java?

mapleFU · 2023-07-23T06:27:56Z

I guess the problem in introduced in #33858 . The description contains:

BREAKING CHANGE: S3 can only be initialized and finalized once.
BREAKING CHANGE: S3 (the AWS SDK) will not be finalized until after all CPU & I/O threads are finished.

A helpful way is calling FinalizeS3 when exit. This patch is in arrow-12.0

However, I guess this patch might fixed it: #36442 . Maybe you can confirm it latter?

kou · 2023-07-24T02:50:04Z

AFAIK, there is no pyarrow package for Java.

Java's dataset API uses JNI. It means that Java's dataset API calls C++ implementation directly. (It doesn't use Python.)

We need to implement a Java binding for arrow::fs::FinalizeS3() and users need to call it explicitly.
Or we may be able to do it implicitly if Java provides an atexit() like hook.

lidavidm · 2023-07-24T12:24:24Z

Java has Runtime#addShutdownHook

lidavidm · 2023-07-24T12:24:48Z

@davisusanibar @danepitkin any interest in this?

danepitkin · 2023-07-24T18:27:50Z

Yes! We can take this on. Thanks for the ping.

### Rationale for this change Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime: `arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit` ### What changes are included in this PR? * Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a noop if S3 is uninitialized or already finalized. ### Are these changes tested? Yes, reproduced with: ``` import org.apache.arrow.dataset.file.FileFormat; import org.apache.arrow.dataset.file.FileSystemDatasetFactory; import org.apache.arrow.dataset.jni.NativeMemoryPool; import org.apache.arrow.dataset.source.DatasetFactory; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; public class DatasetModule { public static void main(String[] args) { String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; try ( BufferAllocator allocator = new RootAllocator(); DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri); ) { // S3 is initialized } catch (Exception e) { e.printStackTrace(); } } } ``` I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning: * We can't test the actual shutdown since thats a JVM thing. * We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook. Or using reflection to obtain it. Not worth it IMO. * No need to test the functionality inside the hook, its just a wrapper around a single C++ API with no params/retval. ### Are there any user-facing changes? No * Closes: #36069 Authored-by: Dane Pitkin <dane@voltrondata.com> Signed-off-by: David Li <li.davidm96@gmail.com>

### Rationale for this change Java datasets can implicitly create an S3 filesystem, which will initialize S3 APIs. There is currently no explicit call to shutdown S3 APIs in Java, which results in a warning message being printed at runtime: `arrow::fs::FinalizeS3 was not called even though S3 was initialized. This could lead to a segmentation fault at exit` ### What changes are included in this PR? * Add a Java runtime shutdown hook that calls `EnsureS3Finalized()` via JNI. This is a noop if S3 is uninitialized or already finalized. ### Are these changes tested? Yes, reproduced with: ``` import org.apache.arrow.dataset.file.FileFormat; import org.apache.arrow.dataset.file.FileSystemDatasetFactory; import org.apache.arrow.dataset.jni.NativeMemoryPool; import org.apache.arrow.dataset.source.DatasetFactory; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; public class DatasetModule { public static void main(String[] args) { String uri = "s3://voltrondata-labs-datasets/nyc-taxi-tiny/year=2022/month=2/part-0.parquet"; try ( BufferAllocator allocator = new RootAllocator(); DatasetFactory datasetFactory = new FileSystemDatasetFactory(allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri); ) { // S3 is initialized } catch (Exception e) { e.printStackTrace(); } } } ``` I didn't think a unit test was worth adding. Let me know if you think otherwise. Reasoning: * We can't test the actual shutdown since thats a JVM thing. * We could test to see if the hook is registered, but that involves exposing the API and having access to the thread object registered with the hook. Or using reflection to obtain it. Not worth it IMO. * No need to test the functionality inside the hook, its just a wrapper around a single C++ API with no params/retval. ### Are there any user-facing changes? No * Closes: apache#36069 Authored-by: Dane Pitkin <dane@voltrondata.com> Signed-off-by: David Li <li.davidm96@gmail.com>

febinct added the Type: bug label Jun 14, 2023

github-actions bot added the Component: Java label Jun 14, 2023

westonpace changed the title ~~while using s3 FileSystemDatasetFactory getting this exception~~ [Java] while using s3 FileSystemDatasetFactory getting this exception Jun 14, 2023

cjackal mentioned this issue Jun 23, 2023

libcurl error on scanning huge pyarrow dataset over s3 pola-rs/polars#9505

Closed

2 tasks

AlUlkesh mentioned this issue Jul 17, 2023

[Bug]: arrow::fs::FinalizeS3 was not called even though S3 was initialized. AlUlkesh/stable-diffusion-webui-images-browser#216

Closed

danepitkin added this to the 14.0.0 milestone Jul 24, 2023

danepitkin added a commit to danepitkin/arrow that referenced this issue Jul 28, 2023

apacheGH-36069: [Java] Ensure S3 is finalized on shutdown

6058031

github-actions bot mentioned this issue Jul 28, 2023

GH-36069: [Java] Ensure S3 is finalized on shutdown #36934

Merged

github-actions bot assigned danepitkin Jul 28, 2023

lidavidm closed this as completed in #36934 Aug 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

febinct commented Jun 14, 2023

westonpace commented Jun 14, 2023 •

edited

Loading

davisusanibar commented Jun 14, 2023

febinsathar commented Jun 15, 2023 •

edited

Loading

AcidChristLab commented Jul 8, 2023

febinct commented Jul 21, 2023

AcidChristLab commented Jul 22, 2023

mapleFU commented Jul 23, 2023 •

edited

Loading

kou commented Jul 24, 2023

lidavidm commented Jul 24, 2023

lidavidm commented Jul 24, 2023

danepitkin commented Jul 24, 2023

[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

[Java] while using s3 FileSystemDatasetFactory getting this exception #36069

Comments

febinct commented Jun 14, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

westonpace commented Jun 14, 2023 • edited Loading

davisusanibar commented Jun 14, 2023

febinsathar commented Jun 15, 2023 • edited Loading

AcidChristLab commented Jul 8, 2023

febinct commented Jul 21, 2023

AcidChristLab commented Jul 22, 2023

mapleFU commented Jul 23, 2023 • edited Loading

kou commented Jul 24, 2023

lidavidm commented Jul 24, 2023

lidavidm commented Jul 24, 2023

danepitkin commented Jul 24, 2023

westonpace commented Jun 14, 2023 •

edited

Loading

febinsathar commented Jun 15, 2023 •

edited

Loading

mapleFU commented Jul 23, 2023 •

edited

Loading