
HADOOP-18543. AliyunOSSFileSystem#open(Path path, int bufferSize) use buffer size as its downloadPartSize #5172

Open · wants to merge 1 commit into trunk
Conversation

@masteryhx commented Nov 28, 2022

Description of PR

In our application, different components each have their own suitable buffer size for downloads.

But currently, AliyunOSSFileSystem#open(Path path, int bufferSize) just gets downloadPartSize from the configuration.

We cannot use different values for different components in our programs.

I think the method should use the buffer size from the parameter.

AliyunOSSFileSystem#open(Path path) could keep the current default downloadPartSize as its default value.
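
A minimal sketch of the idea, assuming a simplified constructor (the real AliyunOSSInputStream takes more arguments; the signature here is for illustration only, not the exact patch):

```java
// Hypothetical sketch: open(Path, int) forwards the caller's bufferSize as
// the download part size instead of reading the value of
// MULTIPART_DOWNLOAD_SIZE_KEY from the configuration.
@Override
public FSDataInputStream open(Path path, int bufferSize) throws IOException {
  FileStatus status = getFileStatus(path);
  return new FSDataInputStream(new AliyunOSSInputStream(getConf(),
      store, pathToKey(path), status.getLen(), statistics,
      bufferSize /* was: conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY, ...) */));
}
```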

How was this patch tested?

Added the method TestAliyunOSSInputStream#testConfiguration to test this.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 4s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 19s trunk passed
+1 💚 compile 0m 39s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 44s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 39s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 46s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 45s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 6s trunk passed
+1 💚 shadedclient 20m 47s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 24s the patch passed
+1 💚 compile 0m 25s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 25s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 24s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 22s /results-checkstyle-hadoop-tools_hadoop-aliyun.txt hadoop-tools/hadoop-aliyun: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 mvnsite 0m 28s the patch passed
+1 💚 javadoc 0m 27s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 26s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 0m 50s the patch passed
+1 💚 shadedclient 20m 21s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 31s hadoop-aliyun in the patch passed.
+1 💚 asflicense 0m 45s The patch does not generate ASF License warnings.
96m 17s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/artifact/out/Dockerfile
GITHUB PR #5172
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 5bc2bfc76972 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 7580a7c
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/testReport/
Max. process+thread count 731 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aliyun U: hadoop-tools/hadoop-aliyun
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

The diff under review (AliyunOSSInputStream constructor):

```java
this.store = store;
this.key = key;
this.statistics = statistics;
this.contentLength = contentLength;
// Before the patch, the part size came from the configuration:
//   downloadPartSize = conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY,
//       MULTIPART_DOWNLOAD_SIZE_DEFAULT);
// After the patch, it is taken from a constructor parameter:
this.downloadPartSize = downloadPartSize;
```
Contributor:

you might want to have a minimum size for this

Author:

Good point. I'd like to use IO_FILE_BUFFER_SIZE_DEFAULT (4 KB) as its minimum size, WDYT?
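
A minimal sketch of that clamp, assuming the constructor keeps a bufferSize parameter (IO_FILE_BUFFER_SIZE_DEFAULT is 4096 in CommonConfigurationKeysPublic):

```java
import static org.apache.hadoop.fs.CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_DEFAULT;

// Hypothetical clamp: never let a tiny caller-supplied buffer size become the
// download part size; fall back to the 4 KB floor discussed above.
long downloadPartSize = Math.max(bufferSize, IO_FILE_BUFFER_SIZE_DEFAULT);
```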

Contributor:

I worry that a download part size of a few KB is way too small for efficient HTTP GET requests; hundreds of kilobytes to megabytes are better.

Author:

I think we could see different performance between requesting 4 KB and 4 MB, right?
In some of my cases, data is organized in units of ~16 KB, and I read those units randomly.
In such cases, I am sure what I need is just those few KB; downloading more data costs more time and bandwidth.
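
The access pattern being described, as a positioned read on an already-open stream (sizes and offsets are illustrative):

```java
// ~16 KB records read at arbitrary offsets: a positioned read pulls exactly
// one record, so any extra bytes fetched by a larger ranged GET are wasted
// time and bandwidth. `in` is an FSDataInputStream, `offset` a record offset.
byte[] record = new byte[16 * 1024];
in.readFully(offset, record, 0, record.length);
```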

Contributor:

It's about the efficiency of a GET call, the overhead of creating HTTPS connections, etc., which comes down to:

  • Reading a whole file is the wrong strategy for random-IO formats (ORC, Parquet).
  • Random IO/small ranged GETs are wrong for whole-file reads.
  • Even with random IO, a few KB is way too small.

This is why the stores tend to adapt ("a first backwards/big forward seek means random IO") and, since HADOOP-16202, let the caller declare a read policy.

If the existing code were changed to say "we set the GET range to the buffer size you passed in on open()", then everyone's existing code would suffer really badly on performance.
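
For reference, a caller-side sketch of the HADOOP-16202 openFile() builder; the option key is the standard one honoured by s3a/abfs, which hadoop-aliyun would pick up once it implements openFile():

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileExample {
  // The caller declares its read policy up front instead of overloading the
  // open(path, bufferSize) buffer size.
  static FSDataInputStream openForRandomIO(FileSystem fs, Path path)
      throws IOException, InterruptedException, ExecutionException {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "random")
        .build()   // returns CompletableFuture<FSDataInputStream>
        .get();
  }
}
```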

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 59s trunk passed
+1 💚 compile 0m 40s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 0m 40s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 0m 36s trunk passed
+1 💚 mvnsite 0m 37s trunk passed
+1 💚 javadoc 0m 49s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 42s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 1m 0s trunk passed
+1 💚 shadedclient 20m 32s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 23s the patch passed
+1 💚 compile 0m 23s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javac 0m 23s the patch passed
+1 💚 compile 0m 25s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 javac 0m 25s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 22s /results-checkstyle-hadoop-tools_hadoop-aliyun.txt hadoop-tools/hadoop-aliyun: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 mvnsite 0m 28s the patch passed
+1 💚 javadoc 0m 22s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 26s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 0m 50s the patch passed
+1 💚 shadedclient 20m 11s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 31s hadoop-aliyun in the patch passed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
93m 23s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/artifact/out/Dockerfile
GITHUB PR #5172
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 2276ff2601e8 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 76ce3cb
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/testReport/
Max. process+thread count 679 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aliyun U: hadoop-tools/hadoop-aliyun
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran (Contributor)

Sorry, but I'm going to say -1 to using the normal IO buffer size as the GET range. The default value of 4 KB is way too small even for Parquet/ORC reads; it will break all existing apps in performance terms: distcp, the Parquet library, Avro, ORC, everything, as they all use the default value.

  1. There is a configuration option for the multipart download size, which is filesystem-wide. Not as flexible, but something everyone will expect to work (sketched below).
  2. If you want better control of read policy, buffer sizes, etc., then this connector needs to implement openFile(), as s3a and abfs do. That will let you add a new option to specify the range for GET calls.
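
A sketch of option 1; the key string matches hadoop-aliyun's MULTIPART_DOWNLOAD_SIZE_KEY, and the 4 MB value is purely an example:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DownloadPartSizeExample {
  // Filesystem-wide setting: every stream opened by this FileSystem instance
  // uses the same download part size for its ranged GETs.
  static FileSystem ossWithLargerParts() throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("fs.oss.multipart.download.size", 4 * 1024 * 1024); // example value
    return FileSystem.get(URI.create("oss://example-bucket/"), conf);
  }
}
```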

@steveloughran (Contributor)

(Note also that openFile() lets you declare a read policy: whole-file, sequential, random, vectored... That can be used to change the default block size too.)

@masteryhx (Author)

> Sorry, but I'm going to say -1 to using the normal IO buffer size as the GET range. The default value of 4 KB is way too small even for Parquet/ORC reads; it will break all existing apps in performance terms: distcp, the Parquet library, Avro, ORC, everything, as they all use the default value.
>
>   1. There is a configuration option for the multipart download size, which is filesystem-wide. Not as flexible, but something everyone will expect to work.
>   2. If you want better control of read policy, buffer sizes, etc., then this connector needs to implement openFile(), as s3a and abfs do. That will let you add a new option to specify the range for GET calls.

Thanks a lot for your detailed explanation and suggestions.
I agree that it's better not to change the behaviour of the current interface.
I'd like to make OSS also implement openFile() in this PR, as s3a does, which could also meet our needs.
WDYT?

@steveloughran (Contributor)

> I'd like to make OSS also implement openFile() in this PR, as s3a does, which could also meet our needs.

This is exactly what the API was designed for: to let people provide extra hints/options to the object stores.

Do read the filesystem.md and related docs on the topic, and look at the abfs implementation as well as the s3a one.

If your implementation takes a FileStatus or length option, it can then skip the HEAD request on opening and save time and money. All the Hadoop-internal uses of openFile() do this; my own copies of the Parquet and Avro readers also do it for better cloud reading performance. For example:
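
A sketch of that optimisation using the existing builder API (the sequential read policy is chosen only for illustration):

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class OpenFileWithStatusExample {
  // Passing the FileStatus from a prior listing hands the store the length
  // up front, so an openFile() implementation can skip the HEAD request.
  static FSDataInputStream openKnownFile(FileSystem fs, FileStatus status)
      throws IOException, InterruptedException, ExecutionException {
    return fs.openFile(status.getPath())
        .withFileStatus(status)
        .opt("fs.option.openfile.read.policy", "sequential")
        .build()
        .get();
  }
}
```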
