
HADOOP-16202. Enhanced openFile() -branch-3.3 backport #4238

Conversation

steveloughran
Contributor

Description of PR

Backport of HADOOP-16202. Enhanced openFile() to branch-3.3, plus a couple of other cherrypicks from trunk to ease the backporting.

If Yetus is happy I will merge the entire sequence in as the ordered chain of commits.

How was this patch tested?

Cloud store testing in progress against AWS London and Azure Cardiff.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

sumangala-patki and others added 5 commits April 27, 2022 12:34
…1)

This defines standard option and values for the
openFile() builder API for opening a file:

fs.option.openfile.read.policy
 A list of the desired read policies, in order of preference.
 Standard values are
 adaptive, default, random, sequential, vector, whole-file

fs.option.openfile.length
 How long the file is.

fs.option.openfile.split.start
 start of a task's split

fs.option.openfile.split.end
 end of a task's split

These can be used by filesystem connectors to optimize their
reading of the source file, including but not limited to
* skipping existence/length probes when opening a file
* choosing a policy for prefetching/caching data

The hadoop shell commands which read files all declare "whole-file"
and "sequential", as appropriate.

Contributed by Steve Loughran.

Change-Id: Ia290f79ea7973ce8713d4f90f1315b24d7a23da1
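The options listed above are passed through the openFile() builder. A minimal sketch of a caller using them, assuming Hadoop 3.3+ with this patch on the classpath (the bucket, path, and split bounds are hypothetical placeholders):

```java
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("s3a://example-bucket/data/part-0000.csv"); // hypothetical
    FileSystem fs = path.getFileSystem(conf);

    // Declare how the file will be read and which split this task owns;
    // connectors may use these hints to choose fetch/cache/seek policy.
    CompletableFuture<FSDataInputStream> future = fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "sequential")
        .opt("fs.option.openfile.split.start", "0")
        .opt("fs.option.openfile.split.end", "1048576")
        .build();

    try (FSDataInputStream in = future.get()) {
      // process the split
    }
  }
}
```

The options are hints, so unknown keys passed via opt() are ignored by stores which do not understand them; must() would instead fail the open.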
…e#2584/2)

These changes ensure that sequential files are opened with the
right read policy, and split start/end is passed in.

As well as offering opportunities for filesystem clients to
choose fetch/cache/seek policies, the settings ensure that
processing text files on an s3 bucket where the default policy
is "random" will still be processed efficiently.

This commit depends on the associated hadoop-common patch,
which must be committed first.

Contributed by Steve Loughran.

Change-Id: Ic6713fd752441cf42ebe8739d05c2293a5db9f94
S3A input stream support for the few fs.option.openfile settings.
As well as supporting the read policy option and values,
if the file length is declared in fs.option.openfile.length
then no HEAD request will be issued when opening a file.
This can cut a few tens of milliseconds off the operation.

The patch adds a new openfile parameter/FS configuration option
fs.s3a.input.async.drain.threshold (default: 16000).
It declares the number of bytes remaining in the http input stream
above which any operation to read and discard the rest of the stream,
"draining", is executed asynchronously.
This asynchronous draining offers some performance benefit on seek-heavy
file IO.

Contributed by Steve Loughran.

Change-Id: I9b0626bbe635e9fd97ac0f463f5e7167e0111e39
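A sketch of the length hint described above: if the file's length is already known (for example from an earlier listing), declaring it lets S3A open the object without the HEAD probe. Variable names are illustrative, not from the patch:

```java
// FileStatus from an earlier listing; passing its length in
// fs.option.openfile.length lets S3A skip the HEAD request.
FileStatus st = fs.getFileStatus(path);
FSDataInputStream in = fs.openFile(path)
    .opt("fs.option.openfile.read.policy", "random")
    .opt("fs.option.openfile.length", Long.toString(st.getLen()))
    .build()
    .get();
```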
Stops the abfs connector warning if openFile().withFileStatus()
is invoked with a FileStatus that is not an abfs VersionedFileStatus.

Contributed by Steve Loughran.

Change-Id: I85076b365eb30aaef2ed35139fa8714efd4d048e
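The call path in question, sketched with illustrative names: the FileStatus handed to withFileStatus() may come from any listing, and after this change abfs accepts it silently rather than warning that it is not its own VersionedFileStatus:

```java
// Any FileStatus may be passed; abfs no longer logs a warning
// when it is not a VersionedFileStatus.
FileStatus st = fs.getFileStatus(path);
FSDataInputStream in = fs.openFile(path)
    .withFileStatus(st)
    .build()
    .get();
```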
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 10m 15s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 2s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 23 new or modified test files.
_ branch-3.3 Compile Tests _
+0 🆗 mvndep 14m 55s Maven dependency ordering for branch
+1 💚 mvninstall 26m 52s branch-3.3 passed
+1 💚 compile 18m 45s branch-3.3 passed
+1 💚 checkstyle 3m 21s branch-3.3 passed
+1 💚 mvnsite 10m 43s branch-3.3 passed
+1 💚 javadoc 9m 48s branch-3.3 passed
+1 💚 spotbugs 15m 21s branch-3.3 passed
+1 💚 shadedclient 27m 13s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 26s Maven dependency ordering for patch
+1 💚 mvninstall 6m 11s the patch passed
+1 💚 compile 18m 34s the patch passed
-1 ❌ javac 18m 34s /results-compile-javac-root.txt root generated 1 new + 1926 unchanged - 0 fixed = 1927 total (was 1926)
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 3m 12s root: The patch generated 0 new + 696 unchanged - 2 fixed = 696 total (was 698)
+1 💚 mvnsite 10m 49s the patch passed
+1 💚 xml 0m 1s The patch has no ill-formed XML file.
+1 💚 javadoc 1m 57s hadoop-common in the patch passed.
+1 💚 javadoc 1m 21s hadoop-yarn-common in the patch passed.
+1 💚 javadoc 0m 51s hadoop-mapreduce-client-core in the patch passed.
+1 💚 javadoc 0m 55s hadoop-mapreduce-client-app in the patch passed.
+1 💚 javadoc 0m 53s hadoop-mapreduce-examples in the patch passed.
+1 💚 javadoc 0m 52s hadoop-streaming in the patch passed.
+1 💚 javadoc 0m 51s hadoop-distcp in the patch passed.
+1 💚 javadoc 0m 57s hadoop-tools_hadoop-aws generated 0 new + 38 unchanged - 1 fixed = 38 total (was 39)
+1 💚 javadoc 0m 55s hadoop-azure in the patch passed.
+1 💚 spotbugs 17m 44s the patch passed
+1 💚 shadedclient 28m 7s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 17m 46s hadoop-common in the patch passed.
+1 💚 unit 5m 3s hadoop-yarn-common in the patch passed.
+1 💚 unit 6m 30s hadoop-mapreduce-client-core in the patch passed.
+1 💚 unit 8m 47s hadoop-mapreduce-client-app in the patch passed.
+1 💚 unit 1m 13s hadoop-mapreduce-examples in the patch passed.
+1 💚 unit 6m 53s hadoop-streaming in the patch passed.
+1 💚 unit 15m 40s hadoop-distcp in the patch passed.
+1 💚 unit 2m 39s hadoop-aws in the patch passed.
+1 💚 unit 2m 38s hadoop-azure in the patch passed.
+1 💚 asflicense 1m 18s The patch does not generate ASF License warnings.
Total: 304m 3s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4238/1/artifact/out/Dockerfile
GITHUB PR #4238
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell markdownlint xml
uname Linux 8bcdd5b97190 4.15.0-153-generic #160-Ubuntu SMP Thu Jul 29 06:54:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision branch-3.3 / 74d3b18
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~18.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4238/1/testReport/
Max. process+thread count 2240 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-examples hadoop-tools/hadoop-streaming hadoop-tools/hadoop-distcp hadoop-tools/hadoop-aws hadoop-tools/hadoop-azure U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4238/1/console
versions git=2.17.1 maven=3.6.0 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org

This message was automatically generated.

@steveloughran
Contributor Author

merged locally; closing
