[C++] GCS: report common prefixes as directories #32403

Open
asfimport opened this issue Jul 15, 2022 · 4 comments

Comments

@asfimport
Collaborator

I was confused by the behavior differences between S3 and GCS until I realized that GCS only reports special directory markers as "directories", not common prefixes. This can make a directory look empty in GCS when it in fact contains many folders (see example below).

We currently use the ListObjects method, but it may be more appropriate to use ListObjectsAndPrefixes. Since objects and prefixes are returned by the same API call, it shouldn't add much overhead.

```r
library(arrow)

bucket <- gs_bucket("voltrondata-labs-datasets", retry_limit_seconds = 3, anonymous = TRUE)
s3_bucket <- s3_bucket("voltrondata-labs-datasets", endpoint_override = "https://storage.googleapis.com")

# We did not create directory markers when uploading the data
# https://github.com/apache/arrow/pull/11842#discussion_r764204767

# The directory appears empty to GCSFileSystem...
bucket$ls("nyc-taxi")
#> character(0)

# ... but S3FileSystem knows otherwise!
s3_bucket$ls("nyc-taxi")
#>  [1] "nyc-taxi/year=2009" "nyc-taxi/year=2010" "nyc-taxi/year=2011"
#>  [4] "nyc-taxi/year=2012" "nyc-taxi/year=2013" "nyc-taxi/year=2014"
#>  [7] "nyc-taxi/year=2015" "nyc-taxi/year=2016" "nyc-taxi/year=2017"
#> [10] "nyc-taxi/year=2018" "nyc-taxi/year=2019" "nyc-taxi/year=2020"
#> [13] "nyc-taxi/year=2021" "nyc-taxi/year=2022"

# Using GCS API, we only get files!
bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009/month=1/part-0.parquet"
#>   [2] "nyc-taxi/year=2009/month=10/part-0.parquet"
#> ...
#> [157] "nyc-taxi/year=2022/month=1/part-0.parquet"
#> [158] "nyc-taxi/year=2022/month=2/part-0.parquet"

# Using S3 API, we can get directories!
s3_bucket$ls("nyc-taxi", recursive = TRUE)
#>   [1] "nyc-taxi/year=2009"
#>   [2] "nyc-taxi/year=2009/month=1"
#>   [3] "nyc-taxi/year=2009/month=1/part-0.parquet"
#>   [4] "nyc-taxi/year=2009/month=10"
#>   [5] "nyc-taxi/year=2009/month=10/part-0.parquet"
#>   [6] "nyc-taxi/year=2009/month=11"
#> ...
#> [329] "nyc-taxi/year=2022/month=2"
#> [330] "nyc-taxi/year=2022/month=2/part-0.parquet"
```
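
To illustrate what "reporting common prefixes as directories" could mean, here is a minimal C++ sketch that emulates a delimiter-based, single-level listing over a flat list of object names. `Listing` and `ListOneLevel` are hypothetical names chosen for illustration; they are not Arrow or google-cloud-cpp APIs, and the sketch assumes `/` is the delimiter.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Result of a single-level listing: objects directly under the prefix, plus
// the common prefixes that stand in for subdirectories.
struct Listing {
  std::set<std::string> files;
  std::set<std::string> directories;
};

// Hypothetical helper (assumption, not a library API): emulate a
// delimiter-based listing of `dir` over a flat list of object names.
Listing ListOneLevel(const std::vector<std::string>& object_names,
                     const std::string& dir) {
  const std::string prefix = dir.empty() ? "" : dir + "/";
  Listing out;
  for (const std::string& name : object_names) {
    // Skip objects outside the requested prefix, and the prefix itself.
    if (name.size() <= prefix.size() ||
        name.compare(0, prefix.size(), prefix) != 0) {
      continue;
    }
    const std::size_t slash = name.find('/', prefix.size());
    if (slash == std::string::npos) {
      out.files.insert(name);  // direct child object
    } else {
      // Everything up to the next delimiter is a common prefix; report it
      // as a directory even though no marker object exists.
      out.directories.insert(name.substr(0, slash));
    }
  }
  return out;
}
```

With the object names from the example above, listing `nyc-taxi` this way would report `nyc-taxi/year=2009` and its siblings as directories even though no directory markers were uploaded.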

Reporter: Will Jones / @wjones127

Note: This issue was originally created as ARROW-17097. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Will Jones / @wjones127:
@coryan You may have had a good reason for implementing this differently in #11842, but I thought I might ask :)

@asfimport
Collaborator Author

Carlos O'Ryan / @coryan:
The details are a bit fuzzy at the moment. At a high level, all approaches to simulating folders over GCS will fail, but they will fail in different ways. You can make something like ListObjects() return directory markers for common prefixes, but then calling GetFileInfo() on those markers will either fail or need to be very expensive. In hindsight, I should have written a design doc outlining the tradeoffs and the decisions, but when I started the project I did not realize that the API (and its tests) would involve so many of them.
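
One way to picture this tradeoff is the following C++ sketch over an in-memory stand-in for a bucket. Everything here (`FakeBucket`, `FileType`, `GetFileInfo`, the `check_implied_directories` flag) is a hypothetical illustration rather than the Arrow or GCS client API. It assumes that point lookups are cheap and that detecting a directory implied only by a common prefix costs an extra list request.

```cpp
#include <set>
#include <string>

enum class FileType { NotFound, File, Directory };

// Hypothetical in-memory stand-in for a bucket; not a real GCS client.
struct FakeBucket {
  std::set<std::string> objects;  // flat, sorted object names
  int list_requests = 0;          // counts simulated prefix-list calls

  // Point lookup: analogous to fetching metadata for one object name.
  bool HasObject(const std::string& name) const {
    return objects.count(name) > 0;
  }

  // Prefix scan: analogous to a list request limited to one result.
  bool HasAnyWithPrefix(const std::string& prefix) {
    ++list_requests;
    // Names sharing a prefix are contiguous in sorted order, so the first
    // name >= prefix is the only candidate that needs checking.
    auto it = objects.lower_bound(prefix);
    return it != objects.end() &&
           it->compare(0, prefix.size(), prefix) == 0;
  }
};

FileType GetFileInfo(FakeBucket& bucket, const std::string& path,
                     bool check_implied_directories) {
  if (bucket.HasObject(path)) return FileType::File;
  // An explicit directory marker can be found with a point lookup.
  if (bucket.HasObject(path + "/")) return FileType::Directory;
  // A directory implied only by a common prefix needs a list request.
  if (check_implied_directories && bucket.HasAnyWithPrefix(path + "/")) {
    return FileType::Directory;
  }
  return FileType::NotFound;
}
```

With the flag off, a path that a prefix-aware listing would report as a directory comes back as `NotFound`. With the flag on, the answer is consistent with the listing, but every miss on the point lookups pays for an additional list call.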

 

@asfimport asfimport added this to the 11.0.0 milestone Jan 11, 2023
@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
@drauschenbach

I wanted to leave this breadcrumb somewhere, but wasn't sure where. I noticed a discrepancy between "directories" created via Arrow and directories created via the GCS cloud console: one uses a trailing slash while the other does not.

In my C++ code, I have to defensively call GetFileInfo() twice, once with and once without a trailing slash.
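
A rough sketch of that defensive double lookup, written against a generic callable so that it stays self-contained. `EntryType`, `Lookup`, and `StatEitherSpelling` are hypothetical names; in real code the callable would wrap a GetFileInfo()-style call, and the order of the two attempts is an arbitrary choice.

```cpp
#include <functional>
#include <string>

enum class EntryType { NotFound, File, Directory };

// Hypothetical lookup signature standing in for a GetFileInfo()-style call.
using Lookup = std::function<EntryType(const std::string&)>;

// Try the path without a trailing slash first, then with one, and return
// the first result that is not NotFound.
EntryType StatEitherSpelling(const Lookup& lookup, std::string path) {
  // Normalize so both spellings are derived from the same base path.
  while (!path.empty() && path.back() == '/') path.pop_back();
  const EntryType bare = lookup(path);
  if (bare != EntryType::NotFound) return bare;
  return lookup(path + "/");
}
```

This costs two lookups whenever the first spelling misses, which is the price of not knowing which convention created the directory.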

@wjones127
Member

> You can make something like ListObjects() return directory markers for common prefixes, but then trying to call GetFileInfo() on those markers will fail.

I've run into this again, and the tradeoff of making ListObjects() work as expected while GetFileInfo() behaves surprisingly makes more sense to me. I think people expect common prefixes to work on object stores without special markers, but would understand that directories aren't "real" on object stores.
