Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to map to pyarrow.hdfs.connect? #1113

Open
Jeffwan opened this issue Nov 12, 2022 · 4 comments
Open

Is there a way to map to pyarrow.hdfs.connect? #1113

Jeffwan opened this issue Nov 12, 2022 · 4 comments

Comments

@Jeffwan
Copy link

Jeffwan commented Nov 12, 2022

Let's say we have a file system called bfs:// which is an equivalent implementation of HDFS. We can support pyarrow operation like following way.

import pyarrow as pa
from pyarrow import fs
bfs = pa.hdfs.connect('bfs://service-endpoint')             # --->  work
# bfs, _ = fs.FileSystem.from_uri('bfs://service-endpoint') # ---> doesn't work
local, _ = fs.FileSystem.from_uri('file:///')
with local.open_input_stream('/tmp/license.txt') as src:
    with bfs.open('/tmp-test.txt', mode='wb') as dest:
        dest.upload(src)

Could I know whether fsspec can support our protocol?

@martindurant
Copy link
Member

Could you please explain how you'd like your alternative implementation to be handled by fsspec? I am understanding, that you would like the "bfs" protocol to be registered with fsspec, and have it create the arrow fs, wrap it, and return an fsspec instance. Is this right?

@Jeffwan
Copy link
Author

Jeffwan commented Nov 14, 2022

@martindurant Yes. That's exact what I want. Do you have any suggestion for those custom protocols? Should we do it downstream or upstream?

@martindurant
Copy link
Member

Could you please try:

fsspec.implementations.arrow.HadoopFileSystem("bfs://service-endpoint")

to see if this does this right thing?

If so, the following change

--- a/fsspec/implementations/arrow.py
+++ b/fsspec/implementations/arrow.py
@@ -260,6 +260,8 @@ class HadoopFileSystem(ArrowFSWrapper):
         out = {}
         if ops.get("host", None):
             out["host"] = ops["host"]
+            if ops.get("protocol"):
+                out["host"] = f"{ops.get('protocol')}://{out['host']}"
         if ops.get("username", None):
             out["user"] = ops["username"]

should allow for

fsspec.register_implementation("bfs", fsspec.implementations.arrow.HadoopFileSystem)
with fsspec.open("bfs://service-endpoint/tmp/license.txt") as f:
    use_somehow(f)

@martindurant
Copy link
Member

@Jeffwan , did my proposed solution work for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants