Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python][FS][Azure] Minimal python bindings for AzureFilesystem #39968

Closed
Tom-Newton opened this issue Feb 6, 2024 · 3 comments · Fixed by #40021
Closed

[Python][FS][Azure] Minimal python bindings for AzureFilesystem #39968

Tom-Newton opened this issue Feb 6, 2024 · 3 comments · Fixed by #40021

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Feb 6, 2024

Describe the enhancement requested

Child issue of: #39317

Blocked by: #39352

Create minimal pybindings such that the Azure filesystem is useable. Probably this will have no configuration options exposed and will always just use default credential auth.

I think it makes sense to circle back later to fill out all the configuration options that exist and may be added to AzureFilesystem

Component(s)

Python

@Tom-Newton Tom-Newton changed the title [Python][FS][Azure] Minimal pybindings for AzureFilesystem [Python][FS][Azure] Minimal python bindings for AzureFilesystem Feb 6, 2024
@Tom-Newton
Copy link
Contributor Author

take

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Feb 9, 2024

I want to enable the python side tests for Azure but there are a couple of failures.

  1. Move operations are not supported yet. GH-38704: [C++] Implement Azure FileSystem Move() via Azure DataLake Storage Gen 2 API #39904 looks like its almost done I believe this will implement file and directory moves for hierarchical namespace accounts but not for flat namespace. These tests run against azurite, which is flat namespace so GH-38704: [C++] Implement Azure FileSystem Move() via Azure DataLake Storage Gen 2 API #39904 actually won't solve this. We should at least add support for file moves to get parity with S3 and GCS on flat namespace accounts.
  2. Currently we can't read metadata set by the user, only system metadata generated by azure. This is explained here https://github.com/Tom-Newton/arrow/blob/8d6b96d7269a57317068cbaa59d2ff80d65be1a7/cpp/src/arrow/filesystem/azurefs.cc#L334-L346

I'm hoping its ok to skip these for Azure initially. I can create Github issues to get these ironed out.

@felipecrv
Copy link
Contributor

felipecrv commented Feb 9, 2024

I plan to add ::Move() for flat-namespace accounts as well. I started with HNS because it's easier, but I hope the one based on Blobs will be helped by the infrastructure I've built for Move on HNS accounts. I won't have to write SO MANY tests, but I will have to fix all the issues they catch.

UPDATE: I ended up not implementing Move for flat-namespace accounts. More info at #40405

Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Feb 10, 2024
…thub.com:Tom-Newton/arrow into tomnewton/minimal_python_bindings/apacheGH-39968
jorisvandenbossche added a commit that referenced this issue Mar 13, 2024
…ystem` (#40021)

### Rationale for this change
We want to use the new `AzureFileSystem` in `pyarrow`. 

### What changes are included in this PR?
- Add minimal python bindings for `AzureFileSystem`. This includes just enough to run the python tests against azurite plus default credential auth to enable real use of this once this PR merges. 
- Adding additional configuration options and remaining authentication options can be done as a follow up. 
- I tried to copy the existing pybinds for GCS and S3
- Explicitly set `ARROW_AZURE=OFF` rather than relying on defaults. The defaults are different for builds vs tests so this was causing tests to be enabled while Azure was disabled during the build.

### Are these changes tested?
Enabled the the python filesystem tests for the new filesystem. I had to skip azure in a couple of the tests though because they are not yet working on the C++ side. I created Github issues to resolve these #40025 and #40026 and added TODO comments where relevant, that reference these Github issues. 

### Are there any user-facing changes?
`pyarrow` users can now use the native `AzureFileSystem` to get much better reliability and performance compared to `adlfs` based options. 

* Closes: #39968
* GitHub Issue: #39968

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 16.0.0 milestone Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants