Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38597: [C++] Implement GetFileInfo(selector) for Azure filesystem #39009

Merged
merged 13 commits into from
Dec 8, 2023

Conversation

felipecrv
Copy link
Contributor

@felipecrv felipecrv commented Nov 30, 2023

Rationale for this change

Part of Azure FS implementation.

What changes are included in this PR?

The version of GetFileInfo that takes a prefix and can optionally recurse into directories.

Are these changes tested?

By unit tests present in this PR. Separate from this PR, I'm thinking of way to fuzz-test the FS API.

@felipecrv
Copy link
Contributor Author

@kou @Tom-Newton

@felipecrv
Copy link
Contributor Author

TODO:

  • Leverage hierarchical support if enabled for the account
  • Introduce an explicit std::stack to remove stack overflow risk when listing very deep filesystems

@felipecrv felipecrv requested a review from kou November 30, 2023 18:05
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'll review GetFileInfoWithSelector() and GetFileInfoWithSelectorFromContainer() logic carefully later.)

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
Comment on lines 861 to 866
static std::string_view BasenameView(std::string_view s) {
auto offset = s.find_last_of(internal::kSep);
auto tail = (offset == std::string_view::npos) ? s : s.substr(offset);
return internal::RemoveTrailingSlash(tail, /*preserve_root=*/false);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this to path_util.{cc,h}?
Because extracting basename from a path is a generic path operation.
We will be able to use GetAbstractPathParent() to implement this like return GetAbstractPathParent(s).second;.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for factoring out generic path operations, and also unit-test them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with path_util.h functions is that they don't follow any concept of path normalization for inputs and have to check the input for many variations, this BasenameView works well in this context, but not in a general context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm changing it to this:

  static std::string_view BasenameView(std::string_view s) {
    DCHECK(!internal::HasTrailingSlash(s));
    auto offset = s.find_last_of(internal::kSep);
    auto result = (offset == std::string_view::npos) ? s : s.substr(offset);
    DCHECK(!result.empty() && result.back() != internal::kSep);
    return result;
  }

cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/path_util.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
Comment on lines +1333 to +1403
Azure::Core::Context context;
Azure::Nullable<int32_t> page_size_hint; // unspecified
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't object them but why do you want to specify them explicitly?
Our other methods don't specify them explicitly. (They use the default value.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want all the wiring to be in place if we need to specify the page_size_hint which I'm pretty sure we will need to tweak when this is used in practice.

}
return Status::OK();
};
return ListContainers(context, std::move(on_container));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass page_size_hint here too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

page_size_hint only applies to the pagination of blobs. If I were to pass page size here, it would be a separate constant, but I thought this would be an overkill.

cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
} else if (cmp > 0) {
RETURN_NOT_OK(process_prefix(prefix));
blob_prefix_index += 1;
} else { // there is a blob (empty dir marker) and a prefix with the same name
Copy link
Member

@kou kou Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really happen?
emptydir/ is a case of this in the test, right?
In this case, emptydir/ may exist in Blobs (with options.Prefix = emptydir/) or BlobPrefixes (with no options.Prefix) but doesn't exit in both of them. (I checked this with gdb.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the direction the whole implementation will move towards. CreateDir should create a marker blob, so we can check that an empty directory exists after it's created or after its contents are deleted.

As @pitrou said in another comment, all the Filesystem implementations should behave the same way.

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved

try {
auto list_response =
container_client.ListBlobsByHierarchy(/*delimiter=*/"/", options, context);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use ListBlobs() instead when select.max_recursion == INT32_MAX as an optimization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require that we build a trie of all the blob names to list them hierarchically. I'm open to doing it, but I think we can delay doing it until there is clear indication that it would help a concrete use-case.

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@Tom-Newton
Copy link
Contributor

I reviewed the tests, so I'm happy with the behaviour. I haven't had a chance to understand and review the implementation yet.

options.Prefix = internal::EnsureTrailingSlash(base_location.path);
}
options.PageSizeHint = page_size_hint;
options.Include = Azure::Storage::Blobs::Models::ListBlobsIncludeFlags::Metadata;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question: was it considered to have a namespace alias at the top of this file such as namespace BlobsModels = Azure::Storage::Blobs::Models;?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I prefer to do once all PRs land because it would be a noisy change.

@felipecrv
Copy link
Contributor Author

I rebased to fix a conflict, but the new changes are on the last two commits.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 5, 2023
@felipecrv felipecrv requested review from kou and pitrou December 5, 2023 02:02
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Dec 5, 2023
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review and removed awaiting change review Awaiting change review awaiting review Awaiting review awaiting changes Awaiting changes awaiting committer review Awaiting committer review labels Dec 7, 2023
@github-actions github-actions bot added the awaiting changes Awaiting changes label Dec 8, 2023
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Could you fix a lint failure?

cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Dec 8, 2023
@felipecrv
Copy link
Contributor Author

I will rebase and force-push to see if all CI passes now.

@felipecrv
Copy link
Contributor Author

CI failure is unrelated so I'm merging this.

@felipecrv felipecrv merged commit 476c78f into apache:main Dec 8, 2023
32 of 33 checks passed
@felipecrv felipecrv deleted the azure_selector branch December 8, 2023 18:48
@kou
Copy link
Member

kou commented Dec 8, 2023

@felipecrv Did you use our merge script?
https://github.com/apache/arrow/tree/main/dev#how-to-merge-a-pull-request

GH-38597's milestone isn't set. In general, it's set by the merge script.

Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 476c78f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 32 possible false positives for unstable benchmarks that are known to sometimes produce them.

@felipecrv
Copy link
Contributor Author

@felipecrv Did you use our merge script? https://github.com/apache/arrow/tree/main/dev#how-to-merge-a-pull-request

GH-38597's milestone isn't set. In general, it's set by the merge script.

Oh no. Sorry. I had no idea. I will use it from now on. Anything we can do to make this right for this PR?

@kou
Copy link
Member

kou commented Dec 10, 2023

I set GH-37597's milestone manually. :-)

mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Dec 13, 2023
…ystem (apache#39009)

### Rationale for this change

Part of Azure FS implementation.

### What changes are included in this PR?

The version of `GetFileInfo` that takes a prefix and can optionally
recurse into directories.

### Are these changes tested?

By unit tests present in this PR. Separate from this PR, I'm thinking of
way to fuzz-test the FS API.
* Closes: apache#38597
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…ystem (apache#39009)

### Rationale for this change

Part of Azure FS implementation.

### What changes are included in this PR?

The version of `GetFileInfo` that takes a prefix and can optionally
recurse into directories.

### Are these changes tested?

By unit tests present in this PR. Separate from this PR, I'm thinking of
way to fuzz-test the FS API.
* Closes: apache#38597
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ystem (apache#39009)

### Rationale for this change

Part of Azure FS implementation.

### What changes are included in this PR?

The version of `GetFileInfo` that takes a prefix and can optionally
recurse into directories.

### Are these changes tested?

By unit tests present in this PR. Separate from this PR, I'm thinking of
way to fuzz-test the FS API.
* Closes: apache#38597
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Implement GetFileInfo selector for Azure filesystem
5 participants