Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35903: [C++] Skeleton for Azure Blob Storage filesystem implementation #35701

Merged
merged 8 commits into from
Jul 12, 2023

Conversation

srilman
Copy link
Contributor

@srilman srilman commented May 20, 2023

What changes are included in this PR?

This PR splits out the overall skeleton of #12914 in order to make merging of the overall Azure Filesystem easier to do.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@srilman srilman marked this pull request as ready for review May 20, 2023 22:07
@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@srilman srilman changed the title ARROW-2034: [C++] Skeleton for Azure Filesystem Implementation GH-18014: [C++] Skeleton for Azure Filesystem Implementation May 20, 2023
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #18014 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-18014: [C++] Skeleton for Azure Filesystem Implementation GH-18014: [C++] Skeleton for Azure filesystem implementation May 21, 2023
@github-actions
Copy link

⚠️ GitHub issue #18014 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-18014: [C++] Skeleton for Azure filesystem implementation GH-18014: [C++] Skeleton for Azure Blob Storage filesystem implementation May 21, 2023
@kou
Copy link
Member

kou commented May 21, 2023

Thanks!
Could you open a new sub issue of #18014 for each step like #29804 for the GCS filesystem implementation instead of associating to #18014? Because we need multiple steps to complete #18014.
And could you use our pull request template instead of removing it next time?

@github-actions
Copy link

⚠️ GitHub issue #18014 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #18014 has been automatically assigned in GitHub to PR creator.

@srilman srilman changed the title GH-18014: [C++] Skeleton for Azure Blob Storage filesystem implementation GH-35903: [C++] Skeleton for Azure Blob Storage filesystem implementation Jun 4, 2023
@github-actions
Copy link

github-actions bot commented Jun 4, 2023

⚠️ GitHub issue #35903 has been automatically assigned in GitHub to PR creator.

@srilman
Copy link
Contributor Author

srilman commented Jun 4, 2023

@kou Thank you for the suggestions! This is my first time opening a PR for Arrow, so I'm a bit unfamiliar with the codebase infrastructure.

  • I looked at other PRs and tried to emulate the template
  • I created a subissue for the skeleton
  • I copied the infrastructure changes you recommended from the original PR to this one.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you enable GitHub Actions in your fork?

@@ -226,6 +227,7 @@ jobs:
- os: windows-2019
name: Windows 2019
env:
ARROW_AZURE: OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revert this because ARROW_AZURE is OFF by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I reverted it.

@@ -329,6 +331,7 @@ jobs:
- msystem_lower: clang64
msystem_upper: CLANG64
env:
ARROW_AZURE: OFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reverted this.

// -----------------------------------------------------------------------
// AzureFilesystem Implementation

class AzureFileSystem::Impl : public std::enable_shared_from_this<AzureFileSystem::Impl> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove std::enable_shared_from_this?

Suggested change
class AzureFileSystem::Impl : public std::enable_shared_from_this<AzureFileSystem::Impl> {
class AzureFileSystem::Impl {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed std::enable_shared_from_this is probably not useful if the Impl is stored by its parent using a std::unique_ptr.

That said, once we start implementing async APIs, a std::shared_ptr will probably be required.

Comment on lines 60 to 83
struct ARROW_EXPORT AzureOptions {
std::string scheme;
std::string account_dfs_url;
std::string account_blob_url;
bool is_azurite = false;
AzureCredentialsKind credentials_kind = AzureCredentialsKind::Anonymous;

std::string sas_token;
std::string connection_string;
std::shared_ptr<Azure::Storage::StorageSharedKeyCredential>
storage_credentials_provider;
std::shared_ptr<Azure::Core::Credentials::TokenCredential>
service_principle_credentials_provider;

AzureOptions();

bool Equals(const AzureOptions& other) const;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing AzureOptions and related codes for now and adding it as the next step?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since they're already there and not terribly long, I think we can keep them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to keep AzureOptions in this PR, we should pick up tests for AzureOptions from the original PR too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will add some of the basic tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I looked at the tests for AzureOptions and all of them require a lot of other functions to be implemented first. I think adding those functions to this PR would make it too complicated. Personally, I don't feel like it is horrible to keep AzureOptions in this PR without adding tests, but it is up to you. Alternatively, I can remove some of the additional fields that we don't immediately need, like sas_token, connection_string, storage_credentials_provider, and service_principle_credentials_provider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about removing AzureOptions related codes from this PR to merge this PR quickly?

diff --git a/cpp/src/arrow/filesystem/azurefs.cc b/cpp/src/arrow/filesystem/azurefs.cc
index 0158c0cec..9fdf903b0 100644
--- a/cpp/src/arrow/filesystem/azurefs.cc
+++ b/cpp/src/arrow/filesystem/azurefs.cc
@@ -23,43 +23,21 @@
 namespace arrow {
 namespace fs {
 
-// -----------------------------------------------------------------------
-// AzureOptions Implementation
-
-AzureOptions::AzureOptions() {}
-
-bool AzureOptions::Equals(const AzureOptions& other) const {
-  return (account_dfs_url == other.account_dfs_url &&
-          account_blob_url == other.account_blob_url &&
-          credentials_kind == other.credentials_kind);
-}
-
 // -----------------------------------------------------------------------
 // AzureFilesystem Implementation
 
 class AzureFileSystem::Impl {
  public:
   io::IOContext io_context_;
-  bool is_hierarchical_namespace_enabled_;
-  AzureOptions options_;
 
-  explicit Impl(AzureOptions options, io::IOContext io_context)
-      : io_context_(io_context), options_(std::move(options)) {}
+  explicit Impl(io::IOContext io_context)
+      : io_context_(io_context) {}
 
   Status Init() {
-    if (options_.backend == AzureBackend::Azurite) {
-      // gen1Client_->GetAccountInfo().Value.IsHierarchicalNamespaceEnabled
-      // throws error in azurite
-      is_hierarchical_namespace_enabled_ = false;
-    }
     return Status::OK();
   }
-
-  const AzureOptions& options() const { return options_; }
 };
 
-const AzureOptions& AzureFileSystem::options() const { return impl_->options(); }
-
 bool AzureFileSystem::Equals(const FileSystem& other) const {
   if (this == &other) {
     return true;
@@ -67,8 +45,7 @@ bool AzureFileSystem::Equals(const FileSystem& other) const {
   if (other.type_name() != type_name()) {
     return false;
   }
-  const auto& azure_fs = ::arrow::internal::checked_cast<const AzureFileSystem&>(other);
-  return options().Equals(azure_fs.options());
+  return true;
 }
 
 Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
@@ -138,15 +115,14 @@ Result<std::shared_ptr<io::OutputStream>> AzureFileSystem::OpenAppendStream(
 }
 
 Result<std::shared_ptr<AzureFileSystem>> AzureFileSystem::Make(
-    const AzureOptions& options, const io::IOContext& io_context) {
-  std::shared_ptr<AzureFileSystem> ptr(new AzureFileSystem(options, io_context));
+    const io::IOContext& io_context) {
+  std::shared_ptr<AzureFileSystem> ptr(new AzureFileSystem(io_context));
   RETURN_NOT_OK(ptr->impl_->Init());
   return ptr;
 }
 
-AzureFileSystem::AzureFileSystem(const AzureOptions& options,
-                                 const io::IOContext& io_context)
-    : FileSystem(io_context), impl_(std::make_unique<Impl>(options, io_context)) {
+AzureFileSystem::AzureFileSystem(const io::IOContext& io_context)
+    : FileSystem(io_context), impl_(std::make_unique<Impl>(io_context)) {
   default_async_is_sync_ = false;
 }
 
diff --git a/cpp/src/arrow/filesystem/azurefs.h b/cpp/src/arrow/filesystem/azurefs.h
index 45087668d..31c014dcb 100644
--- a/cpp/src/arrow/filesystem/azurefs.h
+++ b/cpp/src/arrow/filesystem/azurefs.h
@@ -25,63 +25,9 @@
 #include "arrow/util/macros.h"
 #include "arrow/util/uri.h"
 
-namespace Azure {
-namespace Core {
-namespace Credentials {
-
-class TokenCredential;
-
-}  // namespace Credentials
-}  // namespace Core
-namespace Storage {
-
-class StorageSharedKeyCredential;
-
-}  // namespace Storage
-}  // namespace Azure
-
 namespace arrow {
 namespace fs {
 
-enum class AzureCredentialsKind : int8_t {
-  /// Anonymous access (no credentials used), public
-  Anonymous,
-  /// Use explicitly-provided access key pair
-  StorageCredentials,
-  /// Use ServicePrincipleCredentials
-  ServicePrincipleCredentials,
-  /// Use Sas Token to authenticate
-  Sas,
-  /// Use Connection String
-  ConnectionString
-};
-
-enum class AzureBackend : bool {
-  /// Official Azure Remote Backend
-  Azure,
-  /// Local Simulated Storage
-  Azurite
-};
-
-/// Options for the AzureFileSystem implementation.
-struct ARROW_EXPORT AzureOptions {
-  std::string account_dfs_url;
-  std::string account_blob_url;
-  AzureBackend backend = AzureBackend::Azure;
-  AzureCredentialsKind credentials_kind = AzureCredentialsKind::Anonymous;
-
-  std::string sas_token;
-  std::string connection_string;
-  std::shared_ptr<Azure::Storage::StorageSharedKeyCredential>
-      storage_credentials_provider;
-  std::shared_ptr<Azure::Core::Credentials::TokenCredential>
-      service_principle_credentials_provider;
-
-  AzureOptions();
-
-  bool Equals(const AzureOptions& other) const;
-};
-
 /// \brief Azure-backed FileSystem implementation for ABFS and ADLS.
 ///
 /// ABFS (Azure Blob Storage - https://azure.microsoft.com/en-us/products/storage/blobs/)
@@ -94,7 +40,7 @@ struct ARROW_EXPORT AzureOptions {
 /// compatibility. Gen1 exists as a separate object that will retired
 /// on Feb 29, 2024. New ADLS accounts will use Gen2 instead, which is
 /// implemented on top of ABFS.
-/// 
+///
 /// TODO: GH-18014 Complete the internal implementation
 /// and review the documentation
 class ARROW_EXPORT AzureFileSystem : public FileSystem {
@@ -103,9 +49,6 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {
 
   std::string type_name() const override { return "abfs"; }
 
-  /// Return the original Azure options when constructing the filesystem
-  const AzureOptions& options() const;
-
   bool Equals(const FileSystem& other) const override;
 
   Result<FileInfo> GetFileInfo(const std::string& path) override;
@@ -146,10 +89,10 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {
       const std::shared_ptr<const KeyValueMetadata>& metadata = {}) override;
 
   static Result<std::shared_ptr<AzureFileSystem>> Make(
-      const AzureOptions& options, const io::IOContext& = io::default_io_context());
+      const io::IOContext& = io::default_io_context());
 
  private:
-  explicit AzureFileSystem(const AzureOptions& options, const io::IOContext& io_context);
+  explicit AzureFileSystem(const io::IOContext& io_context);
 
   class Impl;
   std::unique_ptr<Impl> impl_;
diff --git a/cpp/src/arrow/filesystem/azurefs_test.cc b/cpp/src/arrow/filesystem/azurefs_test.cc
index 0f03e8839..d7f97d1e4 100644
--- a/cpp/src/arrow/filesystem/azurefs_test.cc
+++ b/cpp/src/arrow/filesystem/azurefs_test.cc
@@ -34,11 +34,9 @@ using ::testing::IsEmpty;
 using ::testing::Not;
 using ::testing::NotNull;
 
-// Placeholder test for file structure
-// TODO: GH-18014 Remove once a proper test is added
-TEST(AzureFileSystem, OptionsCompare) {
-  AzureOptions options;
-  EXPECT_TRUE(options.Equals(options));
+TEST(AzureFileSystem, type_name) {
+  ASSERT_OK_AND_ASSIGN(auto fs, AzureFileSystem::Make());
+  ASSERT_EQ(fs->type_name(), "abfs");
 }
 
 }  // namespace

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if this PR is ready, it probably doesn't make sense to remove AzureOptions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srilman It seems that you can choose whichever you want. If you remove AzureOptions, I'll merge this PR, pitrou will merge this PR otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou @pitrou I'm ok to remove it, but if the PR is ready to merge, I'd prefer to keep it to avoid making further changes.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jun 6, 2023
@zeroshade
Copy link
Member

@srilman Are you still working on this? Would you be opposed to some assistance with getting this up and running in the coming weeks?


const AzureOptions& options() const { return options_; }

protected:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but since it's a private implementation class you don't really need to differentiate between public and protected members.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, removed the protected section and moved AzureOptions options_ above

std::string scheme;
std::string account_dfs_url;
std::string account_blob_url;
bool is_azurite = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several comments / questions:

  1. Will this command special behaviour? Are there other possible backends?
  2. Does this need to be parametered by the user or can it be auto-detected, e.g. from response headers?
  3. A boolean is not future-proof if we need to support more backends in the future. Our S3 internals use an enum:
    enum class S3Backend { Amazon, Minio, Other };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. There are a couple of areas with some special behavior. In particular,
  • Azurite does not support hierarchical namespaces, then AzureFileSystem::Impl::is_hierarchical_namespace_enabled_ needs to always be set to false
  • In the original PR, when is_azurite is set, then it will configure the following with account key credentials:
    account_blob_url = "http://127.0.0.1:10000/" + account_name + "/";
    account_dfs_url = "http://127.0.0.1:10000/" + account_name + "/";

This is instead of the official URLs. I'm don't like the built-in localhost URLs, so I'll probably modify it a bit. I'm not aware of any other backends, unless ADLS or ADLS Gen1 happens to differ significantly from ABFS.

  1. I'm not aware if it can be detected from response headers or not. With the hierarchical namespaces issue, it seems like this might be necessary even for setting up the connection.

  2. Good point, will use a enum instead.

/// ADLS provides filesystem semantics, file-level security, and Hadoop
/// compatibility. Gen1 exists as a separate object that will retired
/// on Feb 29, 2024. New ADLS accounts will use Gen2 instead, which is
/// implemented on top of ABFS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the filesystem implementation can provide access to both storage types? If so, then nice!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, it should. I did an initial test a while ago on the original PR and was able to.

std::string type_name() const override { return "abfs"; }

/// Return the original Azure options when constructing the filesystem
AzureOptions options() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can probably return a const-ref:

Suggested change
AzureOptions options() const;
const AzureOptions& options() const;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. AzureFileSystem::Impl::options already returns a const-ref anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@pitrou
Copy link
Member

pitrou commented Jun 21, 2023

@srilman Sorry if this was discussed already before, but what will be the testing strategy? Will we run Azurite from the C++ unit tests (as we already do e.g. for S3 using Minio)?

@kou
Copy link
Member

kou commented Jun 21, 2023

Will we run Azurite from the C++ unit tests (as we already do e.g. for S3 using Minio)?

I think so.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Jul 3, 2023
@srilman
Copy link
Contributor Author

srilman commented Jul 4, 2023

@pitrou It looks like CI is failing due to flaky tests. Is there anything I should do, or do you need to manually merge it?

@pitrou
Copy link
Member

pitrou commented Jul 5, 2023

@srilman Some of the flakiness should be fixed by rebasing/merging latest git main.

Also, we are nearing a release, so we are prioritizing fixing current CI issues. I'll take a look again at this PR a bit later. Thanks a lot for your patience!

@pitrou
Copy link
Member

pitrou commented Jul 12, 2023

I rebased on latest git main.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 12, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll merge this in the interesting of moving things forward.
The next step should be to start adding some code that makes use of the Azure SDK, even something minimal such as actually instantiating the filesystem internals.

@pitrou
Copy link
Member

pitrou commented Jul 12, 2023

CI failures are unrelated.

@pitrou pitrou merged commit 085a0ba into apache:main Jul 12, 2023
50 of 53 checks passed
@pitrou pitrou removed the awaiting change review Awaiting change review label Jul 12, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
…ementation (apache#35701)

### What changes are included in this PR?
This PR splits out the overall skeleton of apache#12914 in order to make merging of the overall Azure Filesystem easier to do.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Yes.

* Closes: apache#35903

Authored-by: Srinivas Lade <srinulade1@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 085a0ba.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

kou added a commit that referenced this pull request Aug 7, 2023
…C++ filesystem (#36988)

### Rationale for this change
We need to write tests for #18014. azurite is like a fake Azure blob storage so it can be used to write integration tests

### What changes are included in this PR?
Extract the `azurite` related changes from #12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. 

Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by #35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. 

### Are these changes tested?
Its tested by there aren't really any good tests in this PR. I used this `azurite` config in #36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. 

### Are there any user-facing changes?
No

* Closes: #36886

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…ementation (apache#35701)

### What changes are included in this PR?
This PR splits out the overall skeleton of apache#12914 in order to make merging of the overall Azure Filesystem easier to do.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Yes.

* Closes: apache#35903

Authored-by: Srinivas Lade <srinulade1@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…Azure C++ filesystem (apache#36988)

### Rationale for this change
We need to write tests for apache#18014. azurite is like a fake Azure blob storage so it can be used to write integration tests

### What changes are included in this PR?
Extract the `azurite` related changes from apache#12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. 

Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by apache#35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. 

### Are these changes tested?
Its tested by there aren't really any good tests in this PR. I used this `azurite` config in apache#36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. 

### Are there any user-facing changes?
No

* Closes: apache#36886

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Skeleton for Azure Filesystem Implementation
4 participants