Is there a way to use object_store with google cloud storage / gcs local emulators or custom gcs endpoint? #5263

Closed
daviskirk opened this issue Dec 31, 2023 · 15 comments
Labels: development-process (Related to development process of arrow-rs), question (Further information is requested)

Comments

@daviskirk

Which part is this question about
object_store, gcs interface

Describe your question

Does object_store support the STORAGE_EMULATOR_HOST environment variable for google cloud (or any other way of setting the google cloud endpoint for emulation support)?

Additional context

I am using object_store within the polars library and would like to use the google cloud storage emulator endpoint so that I can run tests on our code.

I test applications that use GCS against fake-gcs-server (https://github.com/fsouza/fake-gcs-server). However, I can't find a way to use it with object_store.
I was previously using Python's fsspec (https://filesystem-spec.readthedocs.io/en/latest/) and the GCS client directly, both of which support this. Now that polars uses object_store exclusively as its cloud access layer, this is no longer possible.

Is there any recommended solution for this? Or is this just missing functionality in object_store?

My previous workflow was to run:

docker run -d --name fake-gcs-server -p 127.0.0.1:9090:9090 -v ${PWD}/examples/data:/data fsouza/fake-gcs-server -scheme http -port 9090 -external-url http://127.0.0.1:9090

and then either set the environment variable STORAGE_EMULATOR_HOST or configure the Python fsspec client to use this emulator URL.
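
A minimal sketch of the environment-variable route in a Python test, assuming the emulator address from the docker command above. The helper name `storage_emulator` is hypothetical, and the variable only affects clients that actually read STORAGE_EMULATOR_HOST:

```python
import os
from contextlib import contextmanager

# Assumed emulator address, taken from the docker command above.
EMULATOR_URL = "http://127.0.0.1:9090"


@contextmanager
def storage_emulator(url: str = EMULATOR_URL):
    """Temporarily set STORAGE_EMULATOR_HOST for emulator-aware GCS clients."""
    previous = os.environ.get("STORAGE_EMULATOR_HOST")
    os.environ["STORAGE_EMULATOR_HOST"] = url
    try:
        yield url
    finally:
        # Restore the previous state so other tests are unaffected.
        if previous is None:
            os.environ.pop("STORAGE_EMULATOR_HOST", None)
        else:
            os.environ["STORAGE_EMULATOR_HOST"] = previous
```

Clients would need to be created inside the `with storage_emulator():` block, since the variable is restored on exit.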

Here is the related polars issue where it was suggested I come here: pola-rs/polars#13085 (comment)

@daviskirk added the question (Further information is requested) label Dec 31, 2023
@tustvold
Contributor

The GitHub CI contains an example of using an emulator. We run a fork of fake-gcs-server that supports the XML APIs; after that, it is just a case of overriding the endpoint URL.

@daviskirk
Author

Thanks, the override of the endpoint URL works well, but the emulator gives me a 404 any time I try to read it.

These are the requests I make (this is the write from fsspec):

127.000.000.001.09090: {"name": "test.parquet"}
127.000.000.001.09090-127.000.000.001.34408: HTTP/1.1 200 OK
Content-Type: application/json
Location: http://127.0.0.1:9090/upload/storage/v1/b/bla/o?uploadType=resumable&name=test.parquet&upload_id=386b51ebd4daca9eb150216d11ba7811
Date: Thu, 11 Jan 2024 11:26:43 GMT
Content-Length: 233

127.000.000.001.09090: PAR1...4...Polars.+...PAR1
HTTP/1.1 200 OK
Content-Type: application/json
Range: bytes=0-476
Date: Thu, 11 Jan 2024 11:26:43 GMT
Content-Length: 470

{"kind":"storage#object","name":"test.parquet","id":"bla/test.parquet","bucket":"bla","size":"477","contentType":"application/octet-stream","crc32c":"RrP1+Q==","acl":[{"bucket":"bla","entity":"projectOwner-test-project","object":"test.parquet","projectTeam":{},"role":"OWNER"}],"md5Hash":"8JNyEbW3APYXEj7Aine/3g==","etag":"\"8JNyEbW3APYXEj7Aine/3g==\"","timeCreated":"2024-01-11T11:26:43.876526Z","updated":"2024-01-11T11:26:43.876531Z","generation":"1704972403876533"}

Reading directly with fsspec works fine as well:

GET /download/storage/v1/b/bla/o/test.parquet?alt=media HTTP/1.1
Host: 127.0.0.1:9090
User-Agent: python-gcsfs/2023.10.0
Accept: */*
Accept-Encoding: gzip, deflate


HTTP/1.1 200 OK
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Content-Length: 477
Content-Type: application/octet-stream
Last-Modified: Thu, 11 Jan 2024 11:26:43 GMT
X-Goog-Generation: 1704972403876533
X-Goog-Stored-Content-Encoding: identity
Date: Thu, 11 Jan 2024 11:31:19 GMT

PAR1...4.>....PAR1

But the "read" using object_store/polars gives me a 404. It issues a HEAD request first, since object_store/polars needs more than a plain download, whereas fsspec just downloads the file directly. I don't understand why this is failing either:

127.000.000.001.09090: HEAD /bla/test%2Eparquet HTTP/1.1
accept: */*
user-agent: object_store/0.8.0
host: 127.0.0.1:9090


127.000.000.001.09090-127.000.000.001.54214: HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 11 Jan 2024 11:26:47 GMT
Content-Length: 10

@tustvold
Contributor

You need to run a container with the changes in fsouza/fake-gcs-server#1164, for example tustvold/fake-gcs-server. We make use of the XML APIs, as certain functionality is missing from the GCS JSON API (#4207), but fake-gcs-server currently doesn't support them.

@daviskirk
Author

daviskirk commented Jan 11, 2024

Yes thank you, I am using the image with the changes in the example above:

    image: tustvold/fake-gcs-server
    command: "-scheme http -port 9090 -external-url http://127.0.0.1:9090"
    ports:
      - 127.0.0.1:9090:9090

Perhaps it is some sort of problem with the way polars uses object_store.

@tustvold
Contributor

tustvold commented Jan 11, 2024

The URLs in your requests refer to different files:

test.parquet
test%2Eparquet

It could be a red herring depending on what the logging is actually printing, but is it possible you are constructing the object store path in such a way that it URL-encodes the dot twice?
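
To make the encoding question concrete, here is a small sketch using Python's standard library rather than object_store's own path encoder (whose behavior is not shown in this thread). It illustrates how the two logged names relate and what encoding an already-encoded name would produce:

```python
from urllib.parse import quote, unquote

plain = "test.parquet"
encoded_once = "test%2Eparquet"  # the form seen in the HEAD request

# Percent-decoding the logged path yields the original name.
decoded = unquote(encoded_once)

# Encoding an already-encoded path escapes the "%" itself.
encoded_twice = quote(encoded_once)

print(decoded)        # test.parquet
print(encoded_twice)  # test%252Eparquet
print(quote(plain))   # test.parquet ("." is unreserved and left as-is)
```

If the server percent-decodes request paths, `test%2Eparquet` and `test.parquet` would name the same object; a path showing `%252E` would be the sign of a second round of encoding.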

@daviskirk
Author

daviskirk commented Jan 11, 2024

Just to be sure, I tried it with just "test" as well (after re-uploading the file as "test", of course), unfortunately with the same result:

HEAD /bla/test HTTP/1.1
accept: */*
user-agent: object_store/0.8.0
host: 127.0.0.1:9090


404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 11 Jan 2024 12:14:39 GMT
Content-Length: 10

@tustvold
Contributor

Can you run curl -v --head http://localhost:9090/bla/test and curl -v http://localhost:9090/bla/test and post the output here?

@daviskirk
Author

daviskirk commented Jan 11, 2024

❯ curl -v --head http://127.0.0.1:9090/bla/test
*   Trying 127.0.0.1:9090...
* Connected to 127.0.0.1 (127.0.0.1) port 9090 (#0)
> HEAD /bla/test HTTP/1.1
> Host: 127.0.0.1:9090
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 404 Not Found
HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< Date: Thu, 11 Jan 2024 13:07:45 GMT
Date: Thu, 11 Jan 2024 13:07:45 GMT
< Content-Length: 10
Content-Length: 10

< 
* Connection #0 to host 127.0.0.1 left intact

❯ curl -v http://127.0.0.1:9090/bla/test 
*   Trying 127.0.0.1:9090...
* Connected to 127.0.0.1 (127.0.0.1) port 9090 (#0)
> GET /bla/test HTTP/1.1
> Host: 127.0.0.1:9090
> User-Agent: curl/7.88.1
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 11 Jan 2024 13:07:54 GMT
< Content-Length: 10
< 
Not Found
* Connection #0 to host 127.0.0.1 left intact

Within the container I get: INFO[0028] 172.17.0.1 - - [11/Jan/2024:13:10:12 +0000] "HEAD /bla/test HTTP/1.1" 404 10

The image I'm using is just the one I got from:

docker pull tustvold/fake-gcs-server && docker run --rm -it -p 127.0.0.1:9090:9090 tustvold/fake-gcs-server -scheme http -port 9090 -external-url http://127.0.0.1:9090 -backend memory

@Xuanwo
Member

Xuanwo commented Jan 11, 2024

> The image I'm using is just the one I got from:
>
> docker pull tustvold/fake-gcs-server && docker run --rm -it -p 127.0.0.1:9090:9090 tustvold/fake-gcs-server -scheme http -port 9090 -external-url http://127.0.0.1:9090 -backend memory

Based on my experience with fake-gcs-server, the bucket is not created automatically. You will need to mount a subpath into the container.

For example, I test with fake-gcs-server via this docker-compose file:

services:
  fake-gcs-server:
    image: fsouza/fake-gcs-server:${FAKE_GCS_SERVER_VERSION:-1.47.7}
    container_name: fake-gcs-server
    ports:
      - "${MAP_HOST_FAKE_GCS_SERVER:-127.0.0.1}:4443:4443"
    volumes:
      - fake_gcs_server_data:/data/sample-bucket  # <-- look here!
    command: -scheme http

volumes:
  fake_gcs_server_data:

@tustvold
Contributor

tustvold commented Jan 11, 2024

My only guess is that the bucket doesn't exist. The CI creates it via a curl request, but mounting a path may also work, as @Xuanwo suggests.
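
The exact CI request is not reproduced in this thread. As a rough sketch, a bucket-creation request following the GCS JSON API's bucket insert endpoint (POST /storage/v1/b) could be built like this, assuming the emulator accepts it without a project parameter or credentials. The function name, host, and port are illustrative:

```python
import json
import urllib.request


def create_bucket_request(base_url: str, bucket: str) -> urllib.request.Request:
    """Build a JSON API 'insert bucket' request for an emulator at base_url."""
    body = json.dumps({"name": bucket}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/storage/v1/b",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed emulator address and the bucket name used in this thread.
req = create_bucket_request("http://localhost:4443", "bla")

# To send it against a running emulator:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```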

I don't know why fsspec would work, but possibly it buffers to the local filesystem or automatically creates buckets.

The only other option I can think of is that the container isn't running the image we think it is, but that is a bit of a stretch.

@daviskirk
Author

daviskirk commented Jan 11, 2024

I tried the explicit mount @Xuanwo mentioned, but it made no difference (I was already creating the bucket explicitly beforehand, so I don't think that was the issue, but I wanted to try everything).

Perhaps I'm somehow still not using the right image. However, I went to the CI of this project, picked out the hash from the last CI run, and used that, with the same result.

Naively trying, I can't get any of the XML API endpoints (i.e. HEAD /bucketname/objectname) to work (I guess this is the part where I'm doing something wrong, but I don't know what). The /upload/storage/v1/ and /download/storage/v1/ endpoints work fine, also using curl.
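
For comparison, the two path styles seen in the requests in this thread can be sketched as follows. The helper names are hypothetical, and the encoding rules are a simplification using Python's standard library:

```python
from urllib.parse import quote


def xml_api_path(bucket: str, obj: str) -> str:
    """Path style used by object_store in this thread: /<bucket>/<object>."""
    # Slashes in the object name stay as path separators (quote's default).
    return f"/{quote(bucket, safe='')}/{quote(obj)}"


def json_api_download_path(bucket: str, obj: str) -> str:
    """Path style used by gcsfs in this thread for downloads."""
    # The object name is a single path segment, so slashes are encoded too.
    return (
        f"/download/storage/v1/b/{quote(bucket, safe='')}"
        f"/o/{quote(obj, safe='')}?alt=media"
    )


print(xml_api_path("bla", "test"))            # /bla/test
print(json_api_download_path("bla", "test"))  # /download/storage/v1/b/bla/o/test?alt=media
```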

@tustvold
Contributor

Could you perhaps provide the logs of the container? Another possibility is that the requests are actually going to a different instance of fake-gcs-server.

@daviskirk
Author

daviskirk commented Jan 12, 2024

Sure, thanks for taking so much time to help with this.

# this is uploading the polars dataframe "test" to the bucket "bla"
[12/Jan/2024:12:40:58 +0000] \"POST /upload/storage/v1/b/bla/o?uploadType=resumable HTTP/1.1\" 200 209"
[12/Jan/2024:12:40:58 +0000] \"POST /upload/storage/v1/b/bla/o?uploadType=resumable&name=test&upload_id=2597bfde7e1e17b6f4ee27bef237f478 HTTP/1.1\" 200 445"
# This is the download/reading with fsspec
[12/Jan/2024:12:41:39 +0000] \"GET /storage/v1/b/bla/o/test HTTP/1.1\" 200 445"
[12/Jan/2024:12:41:39 +0000] \"GET /download/storage/v1/b/bla/o/test?alt=media HTTP/1.1\" 206 0"
# This is trying to read the file using polars read_parquet using
# pl.read_parquet("gs://bla/test", storage_options={"service_account": "./google-client-data.json"})
# and the service account file:
# {"gcs_base_url": "http://127.0.0.1:9090", "disable_oauth": true, "client_email": "test.example.com", "private_key": "", "private_key_id": ""}
[12/Jan/2024:12:42:05 +0000] \"HEAD /bla/test HTTP/1.1\" 404 10"
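
For anyone wanting to reproduce the read, the service account file from the log comments above can be generated with a short script. This is only a sketch: the helper name and the temporary directory are illustrative, and the polars call is left commented out because it needs a running emulator:

```python
import json
import tempfile
from pathlib import Path


def write_emulator_credentials(path: Path, base_url: str) -> Path:
    """Write the fake service account file described in the log comments."""
    config = {
        "gcs_base_url": base_url,  # points object_store at the emulator
        "disable_oauth": True,     # the emulator needs no real credentials
        "client_email": "test.example.com",
        "private_key": "",
        "private_key_id": "",
    }
    path.write_text(json.dumps(config))
    return path


cred_path = write_emulator_credentials(
    Path(tempfile.mkdtemp()) / "google-client-data.json",
    "http://127.0.0.1:9090",
)

# With the emulator running and the object uploaded:
# import polars as pl
# df = pl.read_parquet("gs://bla/test", storage_options={"service_account": str(cred_path)})
```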

By the way, this is the running docker container:

tustvold/fake-gcs-server@sha256:dcd3aeacc07c731f1336e90c2889be2af8626ae993ee1fe2c0ba042ebbeb5a04   "/bin/fake-gcs-server -data /data -scheme http -port 9090 -external-url http://127.0.0.1:9090"   23 hours ago   Up 23 hours   4443/tcp, 127.0.0.1:9090->9090/tcp   excavator-gcs-1

@tustvold
Contributor

tustvold commented Jan 12, 2024

Can you instead start the server with this exact command, the same as in the CI configuration:

docker run -d -p 4443:4443 tustvold/fake-gcs-server -scheme http -backend memory -public-host localhost:4443

And communicate with it over port 4443 within your application.

When I run the image using the arguments you provided, they do not appear to have the intended effect.

@daviskirk
Author

It's working now!! For anyone else that might be having this problem:

The -public-host argument did the trick. I had not properly seen the difference between this and the -external-url argument.

For anyone doing this with polars and fsspec: -external-url is also still needed for some of the other requests being sent (file uploads get a 405 response otherwise).
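
A sketch of a launch command that passes both flags, assembled only from flags that appear earlier in this thread. The helper name and the default host, port, and image are assumptions rather than the exact command that ended up working:

```python
import shlex


def fake_gcs_server_command(
    host: str = "127.0.0.1",
    port: int = 9090,
    image: str = "tustvold/fake-gcs-server",
) -> list[str]:
    """Build a docker command that sets both -public-host and -external-url."""
    address = f"{host}:{port}"
    return [
        "docker", "run", "--rm",
        "-p", f"{address}:{port}",
        image,
        "-scheme", "http",
        "-port", str(port),
        "-backend", "memory",
        "-public-host", address,               # host:port form, as in the CI command
        "-external-url", f"http://{address}",  # full URL form
    ]


# Print the command as a single shell-quoted line.
print(shlex.join(fake_gcs_server_command()))
```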

Thank you for the patience and time in particular on this issue but also for your work on arrow-rs in general!

@tustvold added the development-process (Related to development process of arrow-rs) label Jan 12, 2024