Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement file reads for Azure filesystem #37511

Closed
Tom-Newton opened this issue Sep 1, 2023 · 8 comments · Fixed by #38269
Closed

[C++] Implement file reads for Azure filesystem #37511

Tom-Newton opened this issue Sep 1, 2023 · 8 comments · Fixed by #38269

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Sep 1, 2023

Describe the enhancement requested

Read support probably requires an Azure implementation for arrow::io::RandomAccessFile then that can be used to implement the OpenInputStream and OpenInputFile methods of the AzureFileSystem.

#12914 implemented all of these features so this will be largely a case of just extracting the relevant parts from there. One modification I would suggest compared to that would be to avoid branching logic based on whether the Azure storage account has the hierarchical namespace enabled. Utilising features of the hierarchical namespace can make renames and listing tasks faster but for just reading blobs it shouldn't make any difference.

If we want to use features of the hierarchical namespace that adds some complexities:

  1. Makes things harder to test because its not supported by azurite Is there a plan to support AdlsGen2 (Datalake Storage) on top of blobstore emulator ?  Azure/Azurite#553
  2. Its a bit difficult to query the storage account to determine if it supports hierarchical namespace. ServiceClient::GetAccountInfo() requires Storage Account Contributor permissions (https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization) which is quite significantly elevated. Hadoop solves this by essentially calling PathClient::GetAccessControlList() and if it raises an exception hierarchical namespace is not supported https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356-L385.

Related Issues:

Component(s)

C++

@Tom-Newton Tom-Newton changed the title Implement file reads for Azure filesystem [C++] Implement file reads for Azure filesystem Sep 1, 2023
@Tom-Newton
Copy link
Contributor Author

cc @srilman since you mentioned you might be able to help out
If nobody else is able to pick it up I will probably start working on this within a couple of weeks.

@felipecrv
Copy link
Contributor

@Tom-Newton are you taking over the work on #12914?

cc @zeroshade @bkietz

@Tom-Newton
Copy link
Contributor Author

@Tom-Newton are you taking over the work on #12914?

I wouldn't say I'm taking over but I'm keen to push it along. So far @srilman and I have both merged PRs that implement a subset of what #12914 implemented.

@felipecrv
Copy link
Contributor

@Tom-Newton are you taking over the work on #12914?

I wouldn't say I'm taking over but I'm keen to push it along. So far @srilman and I have both merged PRs that implement a subset of what #12914 implemented.

Thank you! Would you mind rebasing that PR to incorporate the work that has been done on the other PRs?

@Tom-Newton
Copy link
Contributor Author

Thank you! Would you mind rebasing that PR to incorporate the work that has been done on the other PRs?

Plan was to keep merging small sections until it's feature complete. I think that's more likely to be done by extracting small sections from #12914 rather than rebasing it and trying to merge it all in one go. This was the approach taken for the GCS filesystem and recommend by @kou.

@felipecrv
Copy link
Contributor

felipecrv commented Sep 13, 2023

@Tom-Newton ok. This makes sense. Just make sure you make any work in progress you have as visible as possible so we can help along the way.

@Tom-Newton
Copy link
Contributor Author

I'm going to start working on this.

@Tom-Newton
Copy link
Contributor Author

I think the PR is ready for review #38269.
Hopefully what I've done makes sense, I'm still very inexperienced writing C++.

bkietz added a commit that referenced this issue Oct 19, 2023
### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from #12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on #12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: #37511

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@bkietz bkietz added this to the 15.0.0 milestone Oct 19, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants