
Conversation

@dongjoon-hyun (Member) commented on Aug 8, 2025

What changes were proposed in this pull request?

This PR aims to use Java's InputStream.readAllBytes instead of Guava's ByteStreams.toByteArray in order to improve performance.

Why are the changes needed?

Since Java 9, we can use InputStream.readAllBytes, which is roughly 30% faster than ByteStreams.toByteArray in this case.

BEFORE (ByteStreams.toByteArray)

scala> spark.time(com.google.common.io.ByteStreams.toByteArray(new java.io.FileInputStream("/tmp/1G.bin")).length)
Time taken: 386 ms
val res0: Int = 1073741824

AFTER (InputStream.readAllBytes)

scala> spark.time(new java.io.FileInputStream("/tmp/1G.bin").readAllBytes().length)
Time taken: 248 ms
val res0: Int = 1073741824
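
For reference, the replacement follows a simple mechanical pattern. The sketch below is only illustrative; the helper name ReadAllBytesExample and its call site are hypothetical and are not the actual files touched by this PR:

import java.io.{FileInputStream, InputStream}

// Hypothetical helper illustrating the change; not an actual Spark method.
object ReadAllBytesExample {
  // BEFORE: Guava-based helper.
  // def readFully(in: InputStream): Array[Byte] =
  //   com.google.common.io.ByteStreams.toByteArray(in)

  // AFTER: JDK 9+ InputStream.readAllBytes, no Guava call needed.
  def readFully(in: InputStream): Array[Byte] = in.readAllBytes()

  def main(args: Array[String]): Unit = {
    val in = new FileInputStream(args(0))
    try {
      println(s"Read ${readFully(in).length} bytes")
    } finally {
      in.close()
    }
  }
}
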

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun (Member, Author) commented:

Thank you, @yaooqinn ~

@dongjoon-hyun (Member, Author) commented:

Merged to master for Apache Spark 4.1.0.

@dongjoon-hyun deleted the SPARK-53191 branch on August 8, 2025, 07:03.