
HADOOP-18543. AliyunOSSFileSystem#open(Path path, int bufferSize) use buffer size as its downloadPartSize #5172

Open · wants to merge 1 commit into trunk
Conversation

@masteryhx commented Nov 28, 2022

Description of PR

In our application, different components each have their own suitable buffer size for downloads.

But currently, AliyunOSSFileSystem#open(Path path, int bufferSize) just gets downloadPartSize from the configuration.

We cannot use different values for different components in our programs.

I think the method should use the buffer size from the parameter.

AliyunOSSFileSystem#open(Path path) could keep the current default downloadPartSize as its default value.
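
A minimal sketch of the idea, assuming a simplified constructor (the real AliyunOSSInputStream takes more arguments; the signature here is for illustration only, not the exact patch):

```java
// Hypothetical sketch: open(Path, int) forwards the caller's bufferSize as
// the download part size instead of reading the value of
// MULTIPART_DOWNLOAD_SIZE_KEY from the configuration.
@Override
public FSDataInputStream open(Path path, int bufferSize) throws IOException {
  FileStatus status = getFileStatus(path);
  return new FSDataInputStream(new AliyunOSSInputStream(getConf(),
      store, pathToKey(path), status.getLen(), statistics,
      bufferSize /* was: conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY, ...) */));
}
```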

How was this patch tested?

Added the method TestAliyunOSSInputStream#testConfiguration to test this.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 1m 4s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 41m 19s trunk passed
+1 💚 compile 0m 39s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 44s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 0m 39s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 46s trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 45s trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 1m 6s trunk passed
+1 💚 shadedclient 20m 47s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 24s the patch passed
+1 💚 compile 0m 25s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 25s the patch passed
+1 💚 compile 0m 24s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 javac 0m 24s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 22s /results-checkstyle-hadoop-tools_hadoop-aliyun.txt hadoop-tools/hadoop-aliyun: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 mvnsite 0m 28s the patch passed
+1 💚 javadoc 0m 27s the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 26s the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 0m 50s the patch passed
+1 💚 shadedclient 20m 21s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 31s hadoop-aliyun in the patch passed.
+1 💚 asflicense 0m 45s The patch does not generate ASF License warnings.
96m 17s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/artifact/out/Dockerfile
GITHUB PR #5172
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 5bc2bfc76972 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 7580a7c
Default Java Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/testReport/
Max. process+thread count 731 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aliyun U: hadoop-tools/hadoop-aliyun
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

The diff under review (AliyunOSSInputStream constructor):

```java
this.store = store;
this.key = key;
this.statistics = statistics;
this.contentLength = contentLength;
// Before the patch, the part size came from the configuration:
//   downloadPartSize = conf.getLong(MULTIPART_DOWNLOAD_SIZE_KEY,
//       MULTIPART_DOWNLOAD_SIZE_DEFAULT);
// After the patch, it is taken from a constructor parameter:
this.downloadPartSize = downloadPartSize;
```
Contributor:

you might want to have a minimum size for this

Author:

Good point. I'd like to use IO_FILE_BUFFER_SIZE_DEFAULT (4 KB) as its minimum size, WDYT?
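
A minimal sketch of that clamp, assuming the constructor keeps a bufferSize parameter (IO_FILE_BUFFER_SIZE_DEFAULT is 4096 in CommonConfigurationKeysPublic):

```java
import static org.apache.hadoop.fs.CommonConfigurationKeysPublic.IO_FILE_BUFFER_SIZE_DEFAULT;

// Hypothetical clamp: never let a tiny caller-supplied buffer size become the
// download part size; fall back to the 4 KB floor discussed above.
long downloadPartSize = Math.max(bufferSize, IO_FILE_BUFFER_SIZE_DEFAULT);
```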

Contributor:

I worry that a download part size of a few KB is way too small for efficient HTTP GET requests; hundreds of kilobytes to megabytes are better.

Author:

I think we could see different performance between requesting 4 KB and 4 MB, right?
In some of my cases, data is organized in units of ~16 KB, and I read those units randomly.
In such cases, I am sure what I need is just those few KB; downloading more data costs more time and bandwidth.
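
The access pattern being described, as a positioned read on an already-open stream (sizes and offsets are illustrative):

```java
// ~16 KB records read at arbitrary offsets: a positioned read pulls exactly
// one record, so any extra bytes fetched by a larger ranged GET are wasted
// time and bandwidth. `in` is an FSDataInputStream, `offset` a record offset.
byte[] record = new byte[16 * 1024];
in.readFully(offset, record, 0, record.length);
```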

Contributor:

It's about the efficiency of a GET call, the overhead of creating HTTPS connections, etc., which comes down to:

  • Reading a whole file is the wrong strategy for random-IO formats (ORC, Parquet).
  • Random IO/small ranged GETs are wrong for whole-file reads.
  • Even with random IO, a few KB is way too small.

This is why the stores tend to adapt ("a first backwards/big forward seek means random IO") and, since HADOOP-16202, let the caller declare a read policy.

If the existing code were changed to say "we set the GET range to the buffer size you passed in on open()", then everyone's existing code would suffer really badly on performance.
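
For reference, a caller-side sketch of the HADOOP-16202 openFile() builder; the option key is the standard one honoured by s3a/abfs, which hadoop-aliyun would pick up once it implements openFile():

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileExample {
  // The caller declares its read policy up front instead of overloading the
  // open(path, bufferSize) buffer size.
  static FSDataInputStream openForRandomIO(FileSystem fs, Path path)
      throws IOException, InterruptedException, ExecutionException {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "random")
        .build()   // returns CompletableFuture<FSDataInputStream>
        .get();
  }
}
```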

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 38s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 59s trunk passed
+1 💚 compile 0m 40s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 0m 40s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 0m 36s trunk passed
+1 💚 mvnsite 0m 37s trunk passed
+1 💚 javadoc 0m 49s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 42s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 1m 0s trunk passed
+1 💚 shadedclient 20m 32s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 23s the patch passed
+1 💚 compile 0m 23s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javac 0m 23s the patch passed
+1 💚 compile 0m 25s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 javac 0m 25s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 22s /results-checkstyle-hadoop-tools_hadoop-aliyun.txt hadoop-tools/hadoop-aliyun: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)
+1 💚 mvnsite 0m 28s the patch passed
+1 💚 javadoc 0m 22s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 26s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 0m 50s the patch passed
+1 💚 shadedclient 20m 11s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 0m 31s hadoop-aliyun in the patch passed.
+1 💚 asflicense 0m 41s The patch does not generate ASF License warnings.
93m 23s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/artifact/out/Dockerfile
GITHUB PR #5172
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 2276ff2601e8 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 76ce3cb
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/testReport/
Max. process+thread count 679 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-aliyun U: hadoop-tools/hadoop-aliyun
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5172/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran (Contributor)

Sorry, but I'm going to say -1 to using the normal IO buffer size as the GET range. The default value of 4 KB is way too small even for Parquet/ORC reads; it will break all existing apps in performance terms: distcp, the Parquet library, Avro, ORC, everything, as they all use the default value.

  1. There is a configuration option for the multipart download size, which is filesystem-wide. Not as flexible, but something everyone will expect to work (sketched below).
  2. If you want better control of read policy, buffer sizes, etc., then this connector needs to implement openFile(), as s3a and abfs do. That will let you add a new option to specify the range for GET calls.
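
A sketch of option 1; the key string matches hadoop-aliyun's MULTIPART_DOWNLOAD_SIZE_KEY, and the 4 MB value is purely an example:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DownloadPartSizeExample {
  // Filesystem-wide setting: every stream opened by this FileSystem instance
  // uses the same download part size for its ranged GETs.
  static FileSystem ossWithLargerParts() throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("fs.oss.multipart.download.size", 4 * 1024 * 1024); // example value
    return FileSystem.get(URI.create("oss://example-bucket/"), conf);
  }
}
```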

@steveloughran (Contributor)

(Note also that openFile() lets you declare a read policy: whole-file, sequential, random, vectored... That can be used to change the default block size too.)

@masteryhx (Author)

> Sorry, but I'm going to say -1 to using the normal IO buffer size as the GET range. The default value of 4 KB is way too small even for Parquet/ORC reads; it will break all existing apps in performance terms: distcp, the Parquet library, Avro, ORC, everything, as they all use the default value.
>
>   1. There is a configuration option for the multipart download size, which is filesystem-wide. Not as flexible, but something everyone will expect to work.
>   2. If you want better control of read policy, buffer sizes, etc., then this connector needs to implement openFile(), as s3a and abfs do. That will let you add a new option to specify the range for GET calls.

Thanks a lot for your detailed explanation and suggestions.
I agree that it's better not to change the behaviour of the current interface.
I'd like to make OSS also implement openFile() in this PR, as s3a does, which could also meet our needs.
WDYT?

@steveloughran (Contributor)

> I'd like to make OSS also implement openFile() in this PR, as s3a does, which could also meet our needs.

This is exactly what the API was designed for: to let people provide extra hints/options to the object stores.

Do read the filesystem.md and related docs on the topic, and look at the abfs implementation as well as the s3a one.

If your implementation takes a FileStatus or length option, it can then skip the HEAD request on opening and save time and money. All the Hadoop-internal uses of openFile() do this; my own copies of the Parquet and Avro readers also do it for better cloud reading performance. For example:
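
A sketch of that optimisation using the existing builder API (the sequential read policy is chosen only for illustration):

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class OpenFileWithStatusExample {
  // Passing the FileStatus from a prior listing hands the store the length
  // up front, so an openFile() implementation can skip the HEAD request.
  static FSDataInputStream openKnownFile(FileSystem fs, FileStatus status)
      throws IOException, InterruptedException, ExecutionException {
    return fs.openFile(status.getPath())
        .withFileStatus(status)
        .opt("fs.option.openfile.read.policy", "sequential")
        .build()
        .get();
  }
}
```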
