Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add azure-sdk-for-cpp to ThirdpartyToolchain #29847

Closed
asfimport opened this issue Oct 9, 2021 · 5 comments · Fixed by #36835
Closed

[C++] Add azure-sdk-for-cpp to ThirdpartyToolchain #29847

asfimport opened this issue Oct 9, 2021 · 5 comments · Fixed by #36835

Comments

@asfimport
Copy link
Collaborator

asfimport commented Oct 9, 2021

This is a requirement to be able to make progress on ARROW-9611

Reporter: Yesh

Related issues:

Note: This issue was originally created as ARROW-14270. Please see the migration documentation for further details.

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Jul 22, 2023

It looks like the azure file system skeleton has merged #35903. So I think this is probably the next step.

#12914 did a lot of the work but when I came to using this I had to change a few things:

  1. ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage #12914 created an external project for each relevant component of the Azure C++ SDK. However each release of the Azure SDK actually contains the entire Azure C++ SDK, despite the confusing naming convention. For example azure-identity_1.5.1 is actually just a newer release of azure-core_1.10.1. In the original PR it was actually a race condition for which version of the Azure SDK was used.
  2. Its valuable to use a very recent version of the Azure SDK. azure-identity_1.5.1 from 2023-07-06 fixed a bug in managed identity authentication. Managed identity is supposed to be the preferred auth method for Azure so I think its important that it works. Additionally the build seemed broken for the released used by ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage #12914 due to a new release of https://www.zlib.net/ leading to https://zlib.net/zlib-1.2.12.tar.gz no longer existing.
  3. There is a new dependency on xml2. This can be installed straightforwardly from apt in the ubuntu docker builds but I did not find a simple solution for the manylinux build. Not sure its a good solution but, I was able to get this working by adding a bundled cmake build for xml2. I mostly copied how this was done with nlohmann_json and crc32c for the google cloud SDK.
  4. There are difficulties around OpenSSL versions. The Azure C++ SDK seems to build for OpenSSL 3 by default and the way to configure it is quite annoying https://github.com/Azure/azure-sdk-for-cpp/blob/main/README.md#openssl-version. I feel like there must be a better way to do this but I have not found it. For builds using vcpkg to install OpenSSL we get version 3 so that works but I had problems with the more pure cmake build since the requirement is just OpenSSL>=1.0.2

I will rebase my changes described above onto main and open a draft PR. I have basically no experience with C++ but I have built a python wheel using these changes that we've been using in production for a while now. If someone who knows what they are talking about can point me in the right direction maybe I can get something ready for review.

@kou
Copy link
Member

kou commented Jul 24, 2023

I think that preparing Azurite is the next step. Because we can't run our tests without Azurite.

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Jul 24, 2023

I opened the draft PR #36835. This PR does actually include azurite setup and a simple test that writes to azurite and reads it back. I just used the azurite changes direct from #12914 and it seems to work great.

If we want to split the azurite setup to a different PR though, that makes sense to me.

@kou
Copy link
Member

kou commented Jul 24, 2023

Yes. We should create a separated PR for Azurite.

@Tom-Newton
Copy link
Contributor

I've opened a PR for review #36835. I don't have experience with C++ so hopefully what I've done makes sense. I've just tried to copy the patterns I can see in the repo and debug when I problems.

kou added a commit that referenced this issue Aug 30, 2023
### Rationale for this change
We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. 

### What changes are included in this PR?
Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments.

I started with the build setup from #12914 but I did make few changes.

1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 
2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from.
3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1.  [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792).

There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723).

### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. 

### Are there any user-facing changes?
No
* Closes: #29847

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: shefali singh <shefalisingh@microsoft.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@kou kou added this to the 14.0.0 milestone Aug 30, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
### Rationale for this change
We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. 

### What changes are included in this PR?
Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments.

I started with the build setup from apache#12914 but I did make few changes.

1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 
2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from.
3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1.  [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792).

There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723).

### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. 

### Are there any user-facing changes?
No
* Closes: apache#29847

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: shefali singh <shefalisingh@microsoft.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
### Rationale for this change
We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. 

### What changes are included in this PR?
Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments.

I started with the build setup from apache#12914 but I did make few changes.

1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 
2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from.
3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1.  [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792).

There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723).

### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. 

### Are there any user-facing changes?
No
* Closes: apache#29847

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: shefali singh <shefalisingh@microsoft.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants