MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing #4677

Closed
wants to merge 2 commits into from

Conversation

ashutoshcipher
Contributor

@ashutoshcipher ashutoshcipher commented Aug 2, 2022

Description of PR

Optimize liststatus for better performance by using recursive listing.

JIRA - MAPREDUCE-7401

How was this patch tested?

Unit tests

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@ashutoshcipher ashutoshcipher changed the title MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing [Draft] MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing Aug 2, 2022
@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 23m 37s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 5 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 14m 57s Maven dependency ordering for branch
+1 💚 mvninstall 29m 5s trunk passed
+1 💚 compile 25m 34s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 22m 7s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 28s trunk passed
+1 💚 mvnsite 3m 18s trunk passed
+1 💚 javadoc 2m 30s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 2m 0s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 4m 58s trunk passed
+1 💚 shadedclient 25m 30s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 55s Maven dependency ordering for patch
+1 💚 mvninstall 1m 47s the patch passed
+1 💚 compile 24m 35s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 24m 35s the patch passed
+1 💚 compile 21m 57s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 21m 57s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 4m 28s /results-checkstyle-root.txt root: The patch generated 12 new + 438 unchanged - 5 fixed = 450 total (was 443)
+1 💚 mvnsite 3m 17s the patch passed
+1 💚 javadoc 2m 21s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 2m 1s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 5m 9s the patch passed
+1 💚 shadedclient 25m 28s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 18m 22s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch passed.
+1 💚 unit 7m 31s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 1m 17s The patch does not generate ASF License warnings.
281m 8s
Reason Tests
Failed junit tests hadoop.fs.TestFilterFileSystem
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/artifact/out/Dockerfile
GITHUB PR #4677
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux bdb77f7fe201 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / e8cc761
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/testReport/
Max. process+thread count 3144 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@steveloughran steveloughran left a comment

-1, sorry.

Do not go near this unless you can show that the current `listFiles(path, recursive)` is inadequate, which I do not believe it is.

If you can make that case, then you have to look very closely at the Javadocs at the top of FileSystem and at any recent changes to the API to see how they are managed, Vectored IO for example. Also look at HADOOP-16898 to see its listing changes, including my unhappiness about something going in without more publicity across the different teams.

Any change in that API is public facing and has to be maintained forever. It needs to be supported effectively in HDFS and in cloud storage. That means you're going to have to do a full API specification, write contract tests, and implement those contract tests in hadoop-aws and azure, and ideally anywhere else (google gcs). Then make sure that you don't break the external libs named in the Javadocs.

Assume that I will automatically veto any new list method returning an array. It hits scale problems on HDFS (lock duration, size of responses to marshal) and prevents us doing things in the object stores, including prefetching, IOStatistics collection and supporting close(). Also look at using builder APIs and returning a CompletableFuture.

Look at the s3a and abfs listing code to see how they implement listFiles, and the s3a and manifest committers to see how they are effectively used. We kick off operations (treewalk, file loading) while waiting for the next page of responses to come in, ideally swallowing the entire latency of each list call.
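
To make the prefetching idea concrete, here is a minimal sketch using only the JDK. It is not the s3a or abfs code: `PrefetchingListing`, `pagesOf` and `drain` are hypothetical names, and the "store" is an in-memory list split into pages. The point it illustrates is that the request for page n+1 is already in flight while the caller is still consuming page n.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Hypothetical model of a paged listing that prefetches the next page. */
class PrefetchingListing implements Iterator<String> {
  // Assumed page source: returns page n, or an empty list once the listing is exhausted.
  private final IntFunction<List<String>> pageSource;
  private Iterator<String> current;
  private CompletableFuture<List<String>> next;
  private int nextPageIndex;

  PrefetchingListing(IntFunction<List<String>> pageSource) {
    this.pageSource = pageSource;
    this.current = pageSource.apply(0).iterator();
    this.nextPageIndex = 1;
    this.next = fetchAsync(nextPageIndex);
  }

  private CompletableFuture<List<String>> fetchAsync(int index) {
    // Runs on the common pool, so the caller keeps working on the current page.
    return CompletableFuture.supplyAsync(() -> pageSource.apply(index));
  }

  @Override
  public boolean hasNext() {
    while (!current.hasNext()) {
      List<String> page = next.join(); // blocks only if the prefetch has not finished yet
      if (page.isEmpty()) {
        return false;
      }
      current = page.iterator();
      nextPageIndex++;
      next = fetchAsync(nextPageIndex);
    }
    return true;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  /** Hypothetical in-memory store, split into pages of pageSize entries. */
  static IntFunction<List<String>> pagesOf(List<String> files, int pageSize) {
    return n -> {
      int from = Math.min(n * pageSize, files.size());
      int to = Math.min(from + pageSize, files.size());
      return new ArrayList<>(files.subList(from, to));
    };
  }

  /** Consumes the iterator fully; stands in for the per-file work a caller would do. */
  static List<String> drain(Iterator<String> it) {
    List<String> out = new ArrayList<>();
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }
}
```

An array-returning method cannot be shaped this way: the whole result has to exist before the caller sees the first entry, so there is nothing left to overlap with.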

Note also that because listFiles only returns files, not directories, we can do O(files/page size) deep list calls against s3.
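
As a rough worked example of that bound, the sketch below compares request counts under stated assumptions. `ListCallEstimate` and its methods are hypothetical, the treewalk figure is only a lower bound (one list call per directory, ignoring directories that span several pages), and the numbers in the note that follows are illustrative, not measurements.

```java
/** Hypothetical back-of-envelope estimate of list requests; not a Hadoop API. */
final class ListCallEstimate {
  private ListCallEstimate() {
  }

  /** Deep listing: one request per page of files, whatever the directory layout. */
  static long deepListCalls(long files, long pageSize) {
    // Ceiling division; an empty tree still costs one request.
    return Math.max(1, (files + pageSize - 1) / pageSize);
  }

  /** Treewalk: at least one request for every directory visited. */
  static long treewalkCallsLowerBound(long directories) {
    return directories;
  }
}
```

With an assumed 1,000,000 files spread over 10,000 partition directories and a page size of 1,000, `deepListCalls` gives 1,000 requests, while a treewalk needs at least 10,000.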

If the justification is that we need path filtering, see HADOOP-16673 Add filter parameter to FileSystem>>listFiles to see why that doesn't work in cloud and hence closed as WONTFIX.

I think a more manageable focus of this work would be to see how FileInputFormat could be sped up by using the existing APIs, with all work done knowing that many external libraries subclass it, for example Parquet, Avro and ORC. Any incompatible change will stop them upgrading, and we cannot do that.

Am I being very negative here? Yes I am. If you do want to change the APIs then you need to start talking about it on the HDFS and common lists, show that it delivers tangible benefit on-prem and in cloud, and undertake the extensive piece of work needed to implement in the primary cloud stores to show it is performant.

Finally, when you consider that the future of tables is one of manifest files (iceberg, hudi, delta lake), IMO it is better to focus on making working with those formats faster. Treewalk listing may be slow with hive partitioned data, but such tables are so pathologically bad in cloud for commit as well as query planning that new code is moving beyond them.

* The input filter that can be used to filter files/dirs.
* @throws IOException
*/
protected void addInputPathRecursively(List<FileStatus> result,
Contributor


You can't remove this, as it breaks methods external classes may use.
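
The compatibility concern can be sketched with JDK-only stand-ins. Everything here is hypothetical and simplified: `BaseInputFormat` is not the Hadoop FileInputFormat, paths are plain strings, and the tree is an in-memory map. What it shows is that a protected hook with an unchanged signature, still on the listing path, keeps an external subclass's override working.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for an input format; not the Hadoop class. */
class BaseInputFormat {
  // Assumed in-memory tree: directory path -> children. A child that is also a key is a directory.
  private final Map<String, List<String>> tree;

  BaseInputFormat(Map<String, List<String>> tree) {
    this.tree = tree;
  }

  List<String> listStatus(String root) {
    List<String> result = new ArrayList<>();
    // Listing still goes through the protected hook, so overrides keep taking effect.
    addInputPathRecursively(result, root);
    return result;
  }

  /** Kept with its original signature: external subclasses may call or override it. */
  protected void addInputPathRecursively(List<String> result, String path) {
    for (String child : tree.getOrDefault(path, Collections.<String>emptyList())) {
      if (tree.containsKey(child)) {
        addInputPathRecursively(result, child);
      } else {
        result.add(child);
      }
    }
  }
}

/** Hypothetical third-party subclass that relies on the protected hook. */
class ExternalInputFormat extends BaseInputFormat {
  ExternalInputFormat(Map<String, List<String>> tree) {
    super(tree);
  }

  @Override
  protected void addInputPathRecursively(List<String> result, String path) {
    List<String> found = new ArrayList<>();
    super.addInputPathRecursively(found, path);
    for (String file : found) {
      // Skip marker files such as _SUCCESS.
      String name = file.substring(file.lastIndexOf('/') + 1);
      if (!name.startsWith("_")) {
        result.add(file);
      }
    }
  }
}
```

If the base class dropped the method, `ExternalInputFormat` would no longer compile against it; if the base class kept the method but stopped calling it, the filter would silently stop applying.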

@hadoop-yetus

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 40s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 5 new or modified test files.
_ trunk Compile Tests _
+0 🆗 mvndep 15m 22s Maven dependency ordering for branch
+1 💚 mvninstall 28m 23s trunk passed
+1 💚 compile 23m 29s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 compile 20m 44s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 checkstyle 4m 28s trunk passed
+1 💚 mvnsite 3m 44s trunk passed
+1 💚 javadoc 2m 53s trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 2m 32s trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 5m 17s trunk passed
+1 💚 shadedclient 23m 45s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+0 🆗 mvndep 0m 31s Maven dependency ordering for patch
+1 💚 mvninstall 1m 44s the patch passed
+1 💚 compile 24m 36s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javac 24m 36s the patch passed
+1 💚 compile 22m 51s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 javac 22m 51s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 4m 13s root: The patch generated 0 new + 437 unchanged - 6 fixed = 437 total (was 443)
+1 💚 mvnsite 3m 33s the patch passed
+1 💚 javadoc 2m 18s the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1
+1 💚 javadoc 2m 13s the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
+1 💚 spotbugs 5m 19s the patch passed
+1 💚 shadedclient 23m 13s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 19m 11s hadoop-common in the patch passed.
+1 💚 unit 7m 36s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 1m 21s The patch does not generate ASF License warnings.
253m 52s
Subsystem Report/Notes
Docker ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/artifact/out/Dockerfile
GITHUB PR #4677
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux a2681ddc1b47 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4843c21
Default Java Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/testReport/
Max. process+thread count 3103 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: .
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4677/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@ashutoshcipher ashutoshcipher marked this pull request as ready for review August 3, 2022 21:01
@ashutoshcipher ashutoshcipher changed the title [Draft] MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing MAPREDUCE-7401. Optimize liststatus for better performance by using recursive listing Aug 3, 2022
3 participants