
GH-34213: [C++] Use recursive calls without a delimiter if the user is doing a recursive GetFileInfo #35440

Merged

Conversation

westonpace (Member) commented May 5, 2023

### Rationale for this change

The old model of "walk"ing the directory could lead to a large number of calls. If someone is fully listing a bucket they will need to make one S3 API call for every single directory in the bucket. With this approach there is only 1 call made for every 1000 files, regardless of how they are spread across directories.

The only potential regression would be if max_recursion was set to something > 1. For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

Then if they make a request for `bucket` with `max_recursion=2`, the new approach will list all 10,000 files and then eliminate the files that don't match.

However, I believe these cases (using max_recursion) are much rarer than the typical case of listing all files (which is what dataset discovery does).
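
To illustrate the difference at the S3 API level, here is a minimal sketch using the AWS SDK for C++ directly (this is not the code in the PR; the helper names are only illustrative):

```cpp
#include <aws/s3/model/ListObjectsV2Request.h>

// Old "walk" style: one delimited request per directory level. Deeper keys are
// collapsed into CommonPrefixes, so every subdirectory requires its own request.
Aws::S3::Model::ListObjectsV2Request MakeWalkRequest(const Aws::String& bucket,
                                                     const Aws::String& dir_prefix) {
  Aws::S3::Model::ListObjectsV2Request req;
  req.SetBucket(bucket);
  req.SetPrefix(dir_prefix);
  req.SetDelimiter("/");
  req.SetMaxKeys(1000);
  return req;
}

// New recursive style: no delimiter, so a single paginated listing returns every
// key under the prefix -- roughly one request per 1000 objects, however deep.
Aws::S3::Model::ListObjectsV2Request MakeRecursiveRequest(const Aws::String& bucket,
                                                          const Aws::String& dir_prefix) {
  Aws::S3::Model::ListObjectsV2Request req;
  req.SetBucket(bucket);
  req.SetPrefix(dir_prefix);
  req.SetMaxKeys(1000);
  return req;
}
```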

### What changes are included in this PR?

The algorithm behind GetFileInfo and DeleteDirContents in S3FileSystem has changed.

### Are these changes tested?

Yes, there should be no behavior change. All of the existing filesystem tests will test this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.

github-actions bot commented May 5, 2023

⚠️ GitHub issue #34213 has been automatically assigned in GitHub to PR creator.

westonpace (Member Author)

I'm leaving this in draft while I do more profiling. I have already tested the worst case scenario (10k files spread across 10k directories) and it improves performance by 10-15x when testing from my desktop to S3. I've also tested the flat scenario (10k files in the bucket with no directories) and there is no regression.

I also want to test running from within EC2. I expect the performance gains to be smaller since the request latency is smaller but there should still be some gain.

Finally, I want to run some local perf tests with minio.

dalbani commented Jun 5, 2023

Thanks for the report @westonpace 👍
I can't wait to see this improvement in a future release, should the final tests confirm what you found so far.

westonpace (Member Author)

I've gone through and done some more thorough testing now. I was surprised to see that we still get a big improvement when running in EC2, even when we're in the same datacenter (where the latency should be very low). I tested with two test datasets.

The first test dataset (s3://ursa-qa/wide-partition) was 10,000 files split across 10,000 folders nested two levels deep:

```
x=0/y=0/file.parquet
x=0/y=1/file.parquet
...
x=0/y=99/file.parquet
x=1/y=0/file.parquet
...
x=99/y=99/file.parquet
```

The second dataset (s3://ursa-qa/flat-partition) was 10,000 files in the root folder:

```
file-0.parquet
file-1.parquet
...
file-9999.parquet
```

For all of my tests I timed how long it took to recursively list all files in the dataset. I ran the tests on my local desktop (outside of EC2), on an EC2 server in a different region (us-west-2), and on an EC2 server in the same region (us-east-2). All times are in seconds.
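
Each timing run was essentially a recursive GetFileInfo over the dataset prefix. A minimal sketch of that shape against the public C++ API (the region setting and the lack of error handling are assumptions; the actual benchmark script may differ):

```cpp
#include <chrono>
#include <iostream>

#include "arrow/filesystem/filesystem.h"
#include "arrow/filesystem/s3fs.h"

int main() {
  // Assumption: default credentials and the us-east-2 region; adjust as needed.
  (void)arrow::fs::InitializeS3(arrow::fs::S3GlobalOptions{});
  auto options = arrow::fs::S3Options::Defaults();
  options.region = "us-east-2";
  auto fs = arrow::fs::S3FileSystem::Make(options).ValueOrDie();

  arrow::fs::FileSelector selector;
  selector.base_dir = "ursa-qa/wide-partition";  // or "ursa-qa/flat-partition"
  selector.recursive = true;

  auto start = std::chrono::steady_clock::now();
  auto infos = fs->GetFileInfo(selector).ValueOrDie();
  std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;

  std::cout << infos.size() << " entries listed in " << elapsed.count() << " s\n";
  (void)arrow::fs::FinalizeS3();
  return 0;
}
```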

There were (as hoped) significant performance improvements in the wide-partition dataset:

[Chart: recursive listing times for the wide-partition dataset, old vs. new approach]

Regrettably, there might be a slight regression in the flat-partition dataset, although it is largely within the noise. I have run the tests enough times that I feel the result is stable:

[Chart: recursive listing times for the flat-partition dataset, old vs. new approach]

I've verified that, with both the old and new approaches, we are sending the same number of HTTP messages to S3 and the payloads are very close in size (less than a 300-byte difference). I don't think the difference is additional compute time (or else I'd expect to see a worse regression on the AWS servers).

We could keep both styles (tree walking and recursive listing) but I don't think this regression is significant enough to justify the complexity.

There is one other case that would likely regress: deeply partitioned data (e.g. each file is 4 or 5 folders deep) combined with a low max_recursion. For example:

```
a=0/b=0/c=0/d=0/e=0/file-0.parquet
...
a=0/b=0/c=0/d=0/e=0/file-9999.parquet
```

I would expect no regression if I fully listed the above dataset. However, if I listed it with a max_recursion of 2, the old approach would likely be much faster since it only needs to return one file info (the new approach would return all 10k file infos and then pare them down in memory). I'm not aware of anyone using this case (pyarrow doesn't even expose max_recursion), so I'm not sure it is worth the complexity of keeping both approaches. Even in this case I suspect we would be looking at a difference of 0.3 vs 3 seconds, which is still better than our current worst case (~100 seconds).
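
For illustration, the in-memory pare-down amounts to something like the following depth filter (a hypothetical sketch only — the exact off-by-one semantics follow FileSelector::max_recursion and may differ by one level):

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include "arrow/filesystem/filesystem.h"

// Hypothetical illustration: after a full recursive listing, drop entries that are
// more than `max_recursion` levels below `base_dir`.
std::vector<arrow::fs::FileInfo> PruneByDepth(std::vector<arrow::fs::FileInfo> infos,
                                              const std::string& base_dir,
                                              int max_recursion) {
  auto depth = [](const std::string& path) {
    return static_cast<int>(std::count(path.begin(), path.end(), '/'));
  };
  const int max_depth = depth(base_dir) + max_recursion;
  infos.erase(std::remove_if(infos.begin(), infos.end(),
                             [&](const arrow::fs::FileInfo& info) {
                               return depth(info.path()) > max_depth;
                             }),
              infos.end());
  return infos;
}
```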

cboettig (Contributor)

Thanks @westonpace, this matches what we see as well. We're really looking forward to seeing this merged; it will be hugely beneficial for our current workflows.

@westonpace westonpace marked this pull request as ready for review June 23, 2023 23:52
@westonpace westonpace requested a review from pitrou June 23, 2023 23:52
pitrou (Member) commented Jun 26, 2023

@westonpace Thanks for the analysis. I agree this looks very beneficial.

```
  if (offset >= static_cast<int>(components.size())) {
    return "";
  }
  int end = length;
```
Member:

Why not use length directly?

Member Author:

Actually, this should be length + offset. I've fixed it.

```
ARROW_EXPORT
std::string GetAbstractPathExtension(const std::string& s);
std::string SliceAbstractPath(const std::string& path, int offset, int length,
```
Member:

Can you add basic tests for this in filesystem_test.cc?

Member Author:

Done.
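
For reference, the kind of basic test meant here might look roughly like the sketch below; the expected values reflect one reading of offset/length as a segment index and count, and are not the exact assertions that were added to filesystem_test.cc:

```cpp
#include <gtest/gtest.h>

#include "arrow/filesystem/path_util.h"

namespace arrow::fs::internal {

TEST(PathUtil, SliceAbstractPathSketch) {
  // Illustrative expectations only: offset/length are read as segment indices.
  ASSERT_EQ(SliceAbstractPath("a/b/c/d", 0, 2), "a/b");
  ASSERT_EQ(SliceAbstractPath("a/b/c/d", 1, 2), "b/c");
  // Per the snippet quoted above, an out-of-range offset yields an empty string.
  ASSERT_EQ(SliceAbstractPath("a/b", 5, 1), "");
}

}  // namespace arrow::fs::internal
```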

```
@@ -204,6 +204,29 @@ TEST(AsyncTaskScheduler, InitialTaskFails) {
  ASSERT_FINISHES_AND_RAISES(Invalid, finished);
}

TEST(AsyncTaskScheduler, TaskDestroyedBeforeSchedulerEnds) {
  bool my_task_destroyed = false;
```
Member:

Should it be an atomic?

Member Author:

No. The AsyncTaskScheduler itself does not actually spawn thread tasks. It relies on the tasks themselves to do this (in fact, it doesn't include executor.h or have any knowledge of thread pools). So this test does not involve threads at all and is entirely synchronous.

cpp/src/arrow/filesystem/s3fs.cc (resolved conversation)
```
  int base_depth =
      (prefix.empty())
          ? 0
          : static_cast<int>(std::count(prefix.begin(), prefix.end(), kSep));
```
Member:

Write a small helper for this?

```
int GetAbstractPathDepth(util::string_view path) {
  if (path.empty()) { return 0; };
  return static_cast<int>(std::count(path.begin(), path.end(), kSep));
}
```

Member Author:

I did, and I put it in path_util.h.
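
For illustration, under that counting-separators definition the helper behaves as follows (a self-contained sketch, not the exact code that went into path_util.h):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Same counting-separators definition as the snippet above, repeated here only so
// the example stands alone.
int GetAbstractPathDepth(const std::string& path) {
  if (path.empty()) return 0;
  return static_cast<int>(std::count(path.begin(), path.end(), '/'));
}

int main() {
  assert(GetAbstractPathDepth("") == 0);
  assert(GetAbstractPathDepth("bucket") == 0);          // no separators
  assert(GetAbstractPathDepth("bucket/x=0/y=0") == 2);  // one per separator
  return 0;
}
```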

```
// no way we can hit "not found"
// * If they key is not empty, then it's possible
// that the file itself didn't exist and there
// were not files under it. In that case we will hit this if statement and
```
Member:

What does "the file itself didn't exist and there were not files under it" mean?

Member Author:

I've reworded the comment. Please take a look.

```
    files_queue.Push(PathNotFound(req.GetBucket(), req.GetPrefix()));
  }
  if (close_sink) {
    files_queue.Close();
```
Member:

Does this mean we're in the top-level task? Why not do it from the task continuation?

Member Author:

Not necessarily the top-level task, but it does mean that we've finished all tasks; this happens as the final task is finishing. However, I agree this makes more sense in the task continuation and I've moved it there (this code is no longer in a destructor, which was kind of confusing).

Member Author:

Also, in the process I ended up getting rid of `close_sink` and moved the logic higher up. I think it has better symmetry now, since closing the sink happens at the same level as creating the sink.

cpp/src/arrow/filesystem/s3fs.cc (resolved conversation)
```
    if (parent_base.first.empty()) {
      break;
    }
    const std::string& parent_dir = parent_base.first;
```
Member:

Why not do this above?

```
const auto parent_dir = internal::GetAbstractPathParent(current).first;
```

Member Author:

I've switched to this.

```
  sink.Push(std::move(buckets_as_directories));

  if (recursive) {
    // Recursively list each bucket (these will run in parallel but out_gen
```
Member:

Suggested change:

```
- // Recursively list each bucket (these will run in parallel but out_gen
+ // Recursively list each bucket (these will run in parallel but sink
```

Member Author:

Fixed.

github-actions bot added the 'awaiting changes' label and removed the 'awaiting committer review' label on Jun 27, 2023
westonpace (Member Author)

@pitrou thanks for the review! I believe I've addressed your feedback.

pitrou (Member) left a comment

Just a few more comments. Thanks a lot for the update.

cpp/src/arrow/filesystem/path_util.cc (outdated, resolved conversation)
```
  const bool allow_not_found;
  const int max_recursion;

  const bool include_virtual;
```
Member:

Perhaps name this include_implicit_directories? It seems more descriptive than include_virtual.

Member Author:

I changed to include_implicit_dirs and updated the wording in comments from "virtual" to "implicit".

```
  RETURN_NOT_OK(TreeWalker::Walk(client_, io_context_, bucket, key, kListObjectsMaxKeys,
                                 handle_results, handle_error, handle_recursion));
  void ListAsync(const FileSelector& select, const std::string& bucket,
                 const std::string& key, bool include_virtual,
```
Member:

Thanks for the explanation. Perhaps make the name more descriptive as already suggested above?

```
  }
  producer.Close();
  return DeferNotOk(SubmitIO(ctx, [self]() { return self->client_->ListBuckets(); }))
      // TODO(ARROW-12655) Change to Then(Impl::ProcessListBuckets)
```
Member:

Suggested change:

```
- // TODO(ARROW-12655) Change to Then(Impl::ProcessListBuckets)
+ // TODO(GH-18652) Change to Then(Impl::ProcessListBuckets)
```

Member Author:

Done.

@westonpace westonpace force-pushed the experiment/better-s3-recursive-list-perf branch from 2d9f78b to f0e2afd Compare July 7, 2023 00:35
github-actions bot added the 'awaiting change review' label and removed the 'awaiting changes' label on Jul 7, 2023
@westonpace westonpace force-pushed the experiment/better-s3-recursive-list-perf branch from 229d3ff to 2fa4013 Compare July 7, 2023 00:52
github-actions bot added the 'awaiting changes' label and removed the 'awaiting change review' label on Jul 7, 2023
westonpace (Member Author)

I added a stress test for GetFileInfoGenerator and DeleteDirContents. It was very useful (it detected two bugs, one in this PR and one that existed before); however, it is pretty slow (it doubles the runtime of s3fs_test on my system). The main slow part is that I need to create more than 1000 files so that the kListObjectsMaxKeys and kMultipleDeleteMaxKeys limits are applied.

@pitrou, I'd be curious to know whether you think this test is worth keeping or if I should remove it.

github-actions bot added the 'awaiting change review' label and removed the 'awaiting changes' label on Jul 7, 2023
pitrou (Member) commented Jul 7, 2023

> The main slow part is that I need to create more than 1000 files so that the kListObjectsMaxKeys and kMultipleDeleteMaxKeys limits are applied.

Have you tried creating those files in parallel using a ThreadPool?
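
For example, a rough sketch of that idea using arrow::internal::ThreadPool is below; `write_one` is a stand-in for whatever the test actually uses to create a single object, so this is illustrative rather than the eventual test code:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/future.h"
#include "arrow/util/thread_pool.h"

// Hypothetical helper: create `n` small test objects in parallel rather than
// one at a time, then wait for all of them and propagate any error.
arrow::Status CreateTestFilesInParallel(
    int n, std::function<arrow::Status(const std::string&)> write_one) {
  ARROW_ASSIGN_OR_RAISE(auto pool, arrow::internal::ThreadPool::Make(/*threads=*/16));
  std::vector<arrow::Future<>> futures;
  futures.reserve(n);
  for (int i = 0; i < n; ++i) {
    std::string key = "stress/file-" + std::to_string(i);
    ARROW_ASSIGN_OR_RAISE(auto fut,
                          pool->Submit([write_one, key] { return write_one(key); }));
    futures.push_back(std::move(fut));
  }
  for (auto& fut : futures) {
    ARROW_RETURN_NOT_OK(fut.status());  // status() waits for the task to finish
  }
  return arrow::Status::OK();
}
```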

westonpace (Member Author)

> @westonpace Did you rebase this on the latest S3 fixes?

Yes, it should have the latest client_lock fixes (using Move).

@pitrou pitrou force-pushed the experiment/better-s3-recursive-list-perf branch from 5219550 to c9ec4d3 Compare July 18, 2023 16:14
Comment on lines 2312 to 2315
```
  if (result.GetContentLength() > 0 || key[key.size() - 1] != '/') {
    return Status::IOError("Cannot delete directory contents at ", bucket, kSep,
                           key, " because it is a file");
  }
```
Member:

It seems a bit weird to have this code in a helper function named EnsureFileAsync.
Perhaps make this a private helper inside DoDeleteDirContentsAsync, or have this return a Result<FileType> instead?

Member Author:

I renamed the method to EnsureIsDirAsync, changed it to return Result<bool>, and moved the error message into DeleteDirContentsAsync.

```
  [](const Status& st) {
    // No need for special abort logic.
  },
  StopToken::Unstoppable());
```
Member:

Pass io_context().stop_token() instead?

Member Author:

Done. I also added a test for cancellation to make sure it works correctly. There was one other spot I had to change to get it to pass.
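
For context, such a cancellation check has roughly the following shape (an illustrative sketch, not the exact test added in this PR; the bucket name and the collect step are made up):

```cpp
#include <string>

#include "arrow/filesystem/filesystem.h"
#include "arrow/filesystem/s3fs.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/util/async_generator.h"
#include "arrow/util/cancel.h"

// Build the filesystem with an IOContext holding our stop token, start a recursive
// listing, request a stop, and expect the listing to finish with a Cancelled status.
arrow::Status SketchCancelRecursiveList(const arrow::fs::S3Options& options) {
  arrow::StopSource stop_source;
  arrow::io::IOContext io_context(stop_source.token());
  ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::S3FileSystem::Make(options, io_context));

  arrow::fs::FileSelector selector;
  selector.base_dir = "stress-bucket";  // hypothetical bucket name
  selector.recursive = true;

  auto gen = fs->GetFileInfoGenerator(selector);
  stop_source.RequestStop();  // request cancellation while the listing is running
  auto result = arrow::CollectAsyncGenerator(std::move(gen)).result();
  return result.status();  // expected to be a Cancelled status
}
```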

…ries we now rely on S3's ability to do a recursive find. This significantly reduces the number of requests made to S3.

Fix an invalid lifecycle issue.

Minor tweak to the async task scheduler.  Tasks should all be destroyed before the scheduler ends

Add missing headers

Need to use auto to avoid API inconsistency in S3

Addressing PR review comments

Add a stress test for GetFileInfoGenerator.  Fix an old bug in DeleteDirContents.  Fix a bug in GetFileInfoGenerator.

Addressing review comments

Applying suggestion from pitrou

Remove yields added for debugging

Remove accidentally added include

…ectly hooked up. Refactored EnsureNotFileAsync into EnsureIsDirAsync and changed it to return a bool
@westonpace westonpace force-pushed the experiment/better-s3-recursive-list-perf branch from c9ec4d3 to 56bed6a Compare July 18, 2023 23:14
github-actions bot added the 'awaiting changes' label and removed the 'awaiting change review' label on Jul 18, 2023
pitrou (Member) commented Jul 19, 2023

@westonpace Can you reintegrate the changes from c9ec4d3? They were clobbered when you force-pushed.

pitrou (Member) left a comment

Just some minor questions. Thanks for the update.

```
  void ListAsync(const FileSelector& select, const std::string& bucket,
                 const std::string& key, bool include_implicit_dirs,
                 util::AsyncTaskScheduler* scheduler, FileInfoSink sink,
                 S3FileSystem::Impl* self) {
```
Member:

Why do we pass a S3FileSystem::Impl* explicitly here? It is just this, right?

Member Author:

Yes. Good catch. I'm not sure what I was thinking. I've removed these arguments.

```
  // Fully list all files from all buckets
  void FullListAsync(bool include_implicit_dirs, util::AsyncTaskScheduler* scheduler,
                     FileInfoSink sink, io::IOContext io_context, bool recursive,
                     S3FileSystem::Impl* self) {
```
Member:

Same question. Also, the IOContext needn't be passed explicitly either (it's just this->io_context_).

Member Author:

I've removed this argument as well.

westonpace (Member Author)

> @westonpace Can you reintegrate the changes from c9ec4d3? They were clobbered when you force-pushed.

Sorry, I missed that when I rebased. I've restored the commit.

github-actions bot added the 'awaiting change review' and 'awaiting changes' labels and removed the 'awaiting changes' and 'awaiting change review' labels on Jul 20, 2023
@westonpace westonpace requested a review from pitrou July 20, 2023 18:36
rqthomas

I want to echo @cboettig's excitement about this upgrade to Arrow and hope to see it in the next release. It looks like it will greatly accelerate our data pipelines! Thanks so much for your work developing and testing!

pitrou (Member) commented Jul 25, 2023

@github-actions crossbow submit -g cpp

github-actions bot

Revision: 7edea03

Submitted crossbow builds: ursacomputing/crossbow @ actions-21347af314

| Task | Status |
| --- | --- |
| test-alpine-linux-cpp | Github Actions |
| test-build-cpp-fuzz | Github Actions |
| test-conda-cpp | Github Actions |
| test-conda-cpp-valgrind | Azure |
| test-cuda-cpp | Github Actions |
| test-debian-11-cpp-amd64 | Github Actions |
| test-debian-11-cpp-i386 | Github Actions |
| test-fedora-35-cpp | Github Actions |
| test-ubuntu-20.04-cpp | Github Actions |
| test-ubuntu-20.04-cpp-bundled | Github Actions |
| test-ubuntu-20.04-cpp-minimal-with-formats | Github Actions |
| test-ubuntu-20.04-cpp-thread-sanitizer | Github Actions |
| test-ubuntu-22.04-cpp | Github Actions |
| test-ubuntu-22.04-cpp-20 | Github Actions |

@pitrou pitrou merged commit 3ac880d into apache:main Jul 25, 2023
36 of 37 checks passed
pitrou removed the 'awaiting changes' label on Jul 25, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
Successfully merging this pull request may close these issues.

[C++] Performance issue listing files over S3