Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][FS][Azure] Should GetFileInfo() against a directory always return true without hierarchical namespace support? #38772

Closed
kou opened this issue Nov 18, 2023 · 9 comments · Fixed by #39361

Comments

@kou
Copy link
Member

kou commented Nov 18, 2023

Describe the enhancement requested

Because CreateDir() without hierarchical namespace support does nothing and returns arrow::Status::OK().

See also the discussion: #38708 (comment)

Component(s)

C++

@felipecrv
Copy link
Contributor

felipecrv commented Nov 21, 2023

I thought a bit more about this idea and realized always returning OK() on "directory exists?" queries would contradict the post-condition of DeleteDir() – after DeleteDir() the caller expects DirectoryExists() to return false.

I think our approach of doing nothing on CreateDir has to be reconsidered. s3fs.cc creates an empty object with a path ending in / to indicate a directory was created:

  Status CreateEmptyDir(const std::string& bucket, const std::string& key) {
    DCHECK(!key.empty());
    return CreateEmptyObject(bucket, key + kSep);
  }

cc @Tom-Newton

@Tom-Newton
Copy link
Contributor

Yeah, I saw the "directory markers" thing on the S3 filesystem. Does anyone know if that is an arrow thing or more common pattern with S3? Personally I assumed that was a common pattern in S3 and I'm not aware of the same pattern existing in Azure.

My only concern with this approach would be when listing blobs. We could handle it in the arrow filesystem but if anybody tries to list with a different blob storage client they might get a confusing result.

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Nov 21, 2023

It looks like the arrow GCS filesystem also creates directory markers in the same way as S3

google::cloud::StatusOr<gcs::ObjectMetadata> CreateDirMarker(const std::string& bucket,
. The azure fsspec filesystem though, does a noop for create directory https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L1060C6-L1060C6

If we chose to create directory markers it would probably be good to add some metadata to the blob to indicate that it's actually a directory. I believe hierarchical namespace accounts add hdi_isfolder but looking at the fsspec implementation it checks for several metadata fields https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L851. I think the arrow filesystem should be compatible with the fsspec one so we should use one of those metadata fields.

@felipecrv
Copy link
Contributor

felipecrv commented Nov 21, 2023

Personally I assumed that was a common pattern in S3 and I'm not aware of the same pattern existing in Azure.

We're trying to impose filesystem semantics on top of a blob store and filesystem semantics require that after a CreateDir and DeleteDir you get directory exists and directory doesn't exist outcomes respectively.

The azure fsspec filesystem though, does a noop for create directory https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L1060C6-L1060C6

This might not be very relevant here because the goal of the Arrow Filesystem API is to provide an uniform interface on top of regular filesystems and blob storage systems. It's pretty clear that the Azure fsspec is violating filesystem semantics of directory management in this case.

@felipecrv
Copy link
Contributor

If we chose to create directory markers it would probably be good to add some metadata to the blob to indicate that it's actually a directory. I believe hierarchical namespace accounts add hdi_isfolder but looking at the fsspec implementation it checks for several metadata fields https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L851. I think the arrow filesystem should be compatible with the fsspec one so we should use one of those metadata fields.

If the Azure fsspec has ways of providing believable filesystem semantics, then we could adopt their patterns. The main thing is: we should provide filesystem semantics and no-op for CreateDir/DirExists would not achieve that.

@Tom-Newton
Copy link
Contributor

I think I'm in agreement. If we want to be rigorous we need to add directory markers in CreateDir.

I think the current implementation of GetFileInfo will already return that a directory is present but probably we should make sure to add a test case for empty directory with just a directory marker.

Probably I should also have added checks for metadata indicating a blob is a directory marker on ObjectInputFile::Init and ObjectAppendStream::Init.

@kou
Copy link
Member Author

kou commented Dec 5, 2023

@felipecrv @Tom-Newton Do either of you want to work on this?

@Tom-Newton
Copy link
Contributor

@felipecrv @Tom-Newton Do either of you want to work on this?

I might be able to do it in about a week

@felipecrv
Copy link
Contributor

@felipecrv @Tom-Newton Do either of you want to work on this?

I will take this one.

@felipecrv felipecrv self-assigned this Dec 5, 2023
felipecrv added a commit that referenced this issue Jan 5, 2024
…ccount doesn't support HNS (#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: #38772

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
@felipecrv felipecrv added this to the 15.0.0 milestone Jan 5, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…rage account doesn't support HNS (apache#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

 - Skip fewer tests
 - Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
 - Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
 - Add comprehensive tests to `CreateDir`
 - Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
 - Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)
 - 

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.
* Closes: apache#38772

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants