GH-38704: [C++] Implement Azure FileSystem Move() via Azure DataLake Storage Gen 2 API #39904

Merged: 25 commits merged into apache:main on Feb 10, 2024

Contributor

@felipecrv felipecrv commented Feb 2, 2024

Rationale for this change

We need to move directories and files via the arrow::FileSystem interface.

What changes are included in this PR?

  • A few filesystem error reporting improvements
  • A helper class to deal with Azure Storage leases [1]
  • The Move() implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabled
  • Lots of tests

[1]: https://learn.microsoft.com/en-us/rest/api/storageservices/lease-blob

Are these changes tested?

Yes, by existing tests and by a large number of new tests added in this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations.

Are there any user-facing changes?

No breaking changes, only new functionality.

Member

@kou kou left a comment

I took a look at all changes but I don't fully understand yet. I'll review again later.

// "subdir0/file-at-subdir" exists

// src is a directory and dest does not exists
CreateDirectory(adlfs_client, "subdir0");
Member

It seems that this is needless.

Contributor Author

It's necessary because I'm testing the scenario where the src exists. The next line moves the subdir0 to subdir1.

Member

Contributor Author

Right. Right. Removing it.

Comment on lines 1275 to 1276
CreateDirectory(adlfs_client, "subdir1");
CreateDirectory(adlfs_client, "subdir2");
Member

It seems that they are needless.

latest < break_or_expires_at_ &&
!latest_known_expiry_time_.compare_exchange_weak(latest, break_or_expires_at_)) {
}
DCHECK_GE(latest_known_expiry_time_.load(), break_or_expires_at_);
Member

Is it safe?
I think that latest_known_expiry_time_ may be changed between latest_known_expiry_time_.compare_exchange_weak() and latest_known_expiry_time_.load().

Contributor Author

@felipecrv felipecrv Feb 2, 2024

It's safe because latest_known_expiry_time_ increases monotonically (it never goes down). So even if it changes between the compare_exchange_weak() and the load(), the DCHECK_GE ([G]reater than or [E]qual) check will still hold.
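The monotonic-maximum argument can be illustrated with a minimal sketch. It uses a plain integer tick count instead of the time point type in the PR, and the function name RaiseToAtLeast is hypothetical; only the loop shape mirrors the quoted snippet.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper: raise `latest_known` to at least `candidate`.
// Assumption: every writer goes through this function, so the stored
// value only ever increases.
inline void RaiseToAtLeast(std::atomic<std::int64_t>& latest_known,
                           std::int64_t candidate) {
  std::int64_t latest = latest_known.load();
  // On failure, compare_exchange_weak reloads `latest` with the current
  // value, so the condition is re-checked against what another thread stored.
  while (latest < candidate &&
         !latest_known.compare_exchange_weak(latest, candidate)) {
  }
  // Another thread may store between the loop and this load, but it can
  // only store a larger value, so the check still holds.
  assert(latest_known.load() >= candidate);
}
```

Calling it with a smaller candidate leaves the stored value untouched, which is what makes the final check independent of interleaving.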

Member

I see!

/// doesn't exist, otherwise a PathNotFound(location) error is produced right away
/// \return A BlobLeaseClient is wrapped as a unique_ptr so it's moveable and
/// optional (nullptr denotes blob not found)
Result<std::unique_ptr<Blobs::BlobLeaseClient>> AcquireBlobLease(
Member

It seems that most of the code is duplicated from AcquireContainerLease(). Can we unify them?

Contributor Author

They use different SDK classes and the error handling is subtly different between each. Unifying these would require a lot of templating that would obfuscate the code more than clarify it.

static constexpr std::chrono::seconds kMaxLeaseDuration{60};

public:
LeaseGuard(std::unique_ptr<Blobs::BlobLeaseClient> lease_client,
Member

Does it really make sense for this to be controlled by a consumer vs being controlled more internally? Is this a common pattern with Azure outside of our usage?

Contributor Author

What do you mean by "consumer" and "more internally" here? The Arrow implementation is internal compared to the software using Arrow.

Leases are a common Distributed Systems pattern [1] and the multi-step operations being performed here would have almost unpredictable outcomes in the presence of concurrent clients. Without concurrent mutators, they are very cheap (lead to no delays at all) and with concurrent mutators, they lead to outcomes we and users can reason about. Note that I often use the lease acquisition as an existence check I would have to do anyways.

[1] https://martinfowler.com/articles/patterns-of-distributed-systems/lease.html
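As a rough illustration of the lease pattern itself (not of the Azure lease API), here is a minimal in-memory sketch. The class InMemoryLease and its methods are hypothetical; a real lease is granted and expired by the storage service rather than by local state.

```cpp
#include <chrono>
#include <optional>
#include <string>

// Hypothetical in-memory lease, for illustration only.
class InMemoryLease {
 public:
  using Clock = std::chrono::steady_clock;

  // Grants the lease if it is free, expired, or already held by `holder`.
  bool TryAcquire(const std::string& holder, Clock::time_point now,
                  std::chrono::seconds duration) {
    const bool held_by_other =
        holder_.has_value() && *holder_ != holder && now < expires_at_;
    if (held_by_other) return false;
    holder_ = holder;
    expires_at_ = now + duration;
    return true;
  }

  // Releasing is a no-op unless `holder` currently owns the lease.
  void Release(const std::string& holder) {
    if (holder_.has_value() && *holder_ == holder) holder_.reset();
  }

 private:
  std::optional<std::string> holder_;
  Clock::time_point expires_at_{};
};
```

With a single client, TryAcquire always succeeds immediately. With concurrent clients, the second one is refused until the first releases the lease or it expires, so a multi-step operation never interleaves with another mutator.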

Member

> What you mean by "consumer" and "more internally" here? The Arrow implementation is internal compared to the software using Arrow.

I'm referring to anyone using arrow::filesystem::AzureFileSystem directly, whether inside the arrow library (datasets) or not.

> Leases are a common Distributed Systems pattern [1] and the multi-step operations being performed here would have almost unpredictable outcomes in the presence of concurrent clients

Yup, I know. I'm just referring to where control of the lease is managed. But I also just realized that the entire LeaseGuard class isn't publicly exposed, haha, which makes my question moot. So we're all good.

Contributor Author

A-ha. Yes, the class is totally private. It will get more use-cases but they will all be within this file.

Contributor Author

I might move it to a separate internal.h/cc, but I would prefer doing it later to reduce noise in this PR, as a lot of code would be moving.

//
// NOTE: The initial constant values were chosen conservatively. If we learn,
// from experience, that they are causing issues, we can increase them. And if
// broadly applicable values aren't possible, we can make them configurable.
Contributor Author

@zeroshade There isn't much to these numbers, but what I can say is that they work well for a client running in Brazil talking to a storage account in a US east coast zone replicated across the US.

Member

It might make sense to make them configurable right off the bat?

Contributor Author

I think that would be premature. My plan if these constants are not good enough:

  • Raise lease time to 30s.
  • Raise operation times to 15s.

Network slow downs are unbounded, but failing without data loss risk and allowing a retry would be the way to go here IMO.

(I hardcoded GetUrl on the SDK class to debug my changes :)
Member

@kou kou left a comment

+1

Comment on lines +2070 to +2072
ARROW_ASSIGN_OR_RAISE(auto src_lease_client,
AcquireContainerLease(src, kLeaseDuration));
LeaseGuard src_lease_guard{std::move(src_lease_client), kLeaseDuration};
Member

We must pass the same lease duration to AcquireContainerLease() and LeaseGuard::LeaseGuard(), right?
That is easy to misuse.
Can AcquireContainerLease() return a std::unique_ptr<LeaseGuard> so that callers don't have to create a LeaseGuard manually?

Contributor Author

I considered this, but the problem is that the LeaseGuard should often live in an outer scope relative to the AcquireContainerLease call, so I considered this misuse trap less bad than writing code that declares the guard far from where it's used. I sometimes do that now with optionals, and I think that communicates the intent more clearly.
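The scoping trade-off can be sketched as follows. Everything here is a hypothetical stand-in (FakeLeaseClient, FakeLeaseGuard, FakeAcquireLease, MoveWithLease, and the 15-second duration are assumptions for illustration), not the Arrow or Azure SDK types. The point is only that a std::optional lets the guard be declared in the outer scope and constructed right where the lease is acquired.

```cpp
#include <chrono>
#include <memory>
#include <optional>
#include <utility>

// Hypothetical stand-in for a lease client; `released` records the release.
struct FakeLeaseClient {
  bool* released;
};

// Hypothetical RAII guard: releases the lease when it goes out of scope.
class FakeLeaseGuard {
 public:
  FakeLeaseGuard(std::unique_ptr<FakeLeaseClient> client,
                 std::chrono::seconds duration)
      : client_(std::move(client)), duration_(duration) {}
  ~FakeLeaseGuard() {
    if (client_) *client_->released = true;
  }
  FakeLeaseGuard(const FakeLeaseGuard&) = delete;
  FakeLeaseGuard& operator=(const FakeLeaseGuard&) = delete;

  std::chrono::seconds duration() const { return duration_; }

 private:
  std::unique_ptr<FakeLeaseClient> client_;
  std::chrono::seconds duration_;
};

// Returns nullptr when the resource does not exist.
inline std::unique_ptr<FakeLeaseClient> FakeAcquireLease(bool exists,
                                                         bool* released) {
  if (!exists) return nullptr;
  return std::make_unique<FakeLeaseClient>(FakeLeaseClient{released});
}

// Returns true if the work ran while the lease was still held.
inline bool MoveWithLease(bool src_exists, bool* released) {
  constexpr std::chrono::seconds kLeaseDuration{15};  // illustrative value
  std::optional<FakeLeaseGuard> guard;  // lives in the outer scope
  {
    auto client = FakeAcquireLease(src_exists, released);
    if (!client) return false;
    // The same duration is passed at the point of acquisition.
    guard.emplace(std::move(client), kLeaseDuration);
  }
  // ... multi-step work would go here, protected by the lease ...
  return guard.has_value() && !*released;
}
```

The guard outlives the inner block where the lease was acquired and releases the lease only when MoveWithLease returns.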

try {
auto src_list_response = src_container_client.ListBlobs(list_blobs_options);
if (!src_list_response.Blobs.empty()) {
return Status::IOError("Unable to replace empty container: '", dest.all, "'");
Member

Suggested change
- return Status::IOError("Unable to replace empty container: '", dest.all, "'");
+ return Status::IOError("Unable to replace by non empty container: '", src.all, "'");

Contributor Author

I added a comment explaining why dest is the correct one here.

}
try {
src_lease_guard.BreakBeforeDeletion(kTimeNeededForContainerDeletion);
src_container_client.DeleteIfExists(options);
Member

Why do we need to use DeleteIfExists() here? Can we use Delete() here?

Contributor Author

Yes, Delete is enough. I'm changing it.

// These functions are marked ARROW_NOINLINE because they are called from
// multiple locations, but are not performance-critical.

Member

Suggested change

Contributor Author

This comment is about the 3 functions below, not only the immediately next one.

Comment on lines 1013 to 1014
// TODO(felipecrv): investigate why this can't be false
select.allow_not_found = true;
Member

Do you want to solve this in this PR?
If you want to defer this to a follow-up task, could you create an issue for it?

Contributor Author

Comment on lines 1205 to 1206
GTEST_SKIP()
<< "The rest of TestMovePaths is not implemented for non-HNS scenarios";
Member

Can we just return here?
Should we use GTEST_SKIP() here? We have some tests for non-HNS case.

Contributor Author

I would prefer to more loudly communicate that MOST of the tests are not in fact running. The return is too subtle.

@felipecrv felipecrv merged commit 0ce54b6 into apache:main Feb 10, 2024
33 of 34 checks passed
@felipecrv felipecrv removed the awaiting change review Awaiting change review label Feb 10, 2024
@felipecrv felipecrv deleted the azure_move branch February 10, 2024 02:14
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0ce54b6.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…aLake Storage Gen 2 API (apache#39904)

### Rationale for this change

We need to move directories and files via the `arrow::FileSystem` interface.

### What changes are included in this PR?

 - A few filesystem error reporting improvements
 - A helper class to deal with Azure Storage leases [1]
 - The `Move()` implementation that can move files and directories within the same container on storage accounts with Hierarchical Namespace Support enabled
 - Lots of tests

[1]: https://learn.microsoft.com/en-us/rest/api/storageservices/lease-blob

### Are these changes tested?

Yes, by existing and a huge number of tests added by this PR. The test code introduced here should be extracted to a reusable test module that we can use to test move in other file system implementations.

### Are there any user-facing changes?

No breaking changes, only new functionality.
* Closes: apache#38704

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…aLake Storage Gen 2 API (apache#39904)

thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…aLake Storage Gen 2 API (apache#39904)

Successfully merging this pull request may close these issues.

[C++][FS][Azure] Implement Move()
3 participants