ARROW-8766: [Python] Allow implementing filesystems in Python #7349

Closed
wants to merge 2 commits into from

pitrou
Member

@pitrou pitrou commented Jun 4, 2020

No description provided.

@jorisvandenbossche
Member

A first test case is working, using fsspec's in-memory filesystem:

In [1]: import pyarrow as pa
   ...: import pyarrow.parquet as pq
   ...: import pyarrow.dataset as ds
   ...:
   ...: from pyarrow.fs import PyFileSystem
   ...: from pyarrow.tests.test_fs import FSSpecHandler

In [2]: import fsspec

In [3]: memfs = fsspec.filesystem("memory")

In [4]: table = pa.table({'a': [1, 2, 3]})

In [5]: with memfs.open("test", "wb") as f:
   ...:     pq.write_table(table, f)
   ...:

In [6]: dataset = ds.dataset("test", filesystem=PyFileSystem(FSSpecHandler(memfs)))

In [7]: dataset.to_table(filter=ds.field('a') > 1).to_pandas()
Out[7]:
   a
0  2
1  3

I only have a bit of trouble finding a robust way to get a file handle for an open file from the fsspec filesystem, but that is not related to this PR ;) For the rest, it was quite easy to get this working!

            self.info.set_mtime(PyDateTime_to_TimePoint(
                <PyDateTime_DateTime*> mtime))
        else:
            self.info.set_mtime(TimePoint_from_ns(mtime))
Member Author

Note to self: should instead add a separate mtime_ns argument.
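
To illustrate the idea of a separate argument, here is a minimal pure-Python sketch. The helper name `resolve_mtime_ns` and its behavior are hypothetical and are not the code in this PR; it also assumes that a naive datetime means UTC. It accepts either `mtime` (a datetime or a number of seconds) or `mtime_ns` (integer nanoseconds), so the type of one argument no longer decides how it is interpreted.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def resolve_mtime_ns(mtime=None, mtime_ns=None):
    """Return a modification time as integer nanoseconds since the epoch.

    Hypothetical helper: `mtime` is a datetime or seconds since the epoch,
    `mtime_ns` is integer nanoseconds. Returns None if neither is given.
    """
    if mtime is not None and mtime_ns is not None:
        raise ValueError("Only one of mtime and mtime_ns can be given")
    if mtime_ns is not None:
        return int(mtime_ns)
    if mtime is None:
        return None
    if isinstance(mtime, datetime):
        if mtime.tzinfo is None:
            # Assumption for this sketch: naive datetimes are UTC.
            mtime = mtime.replace(tzinfo=timezone.utc)
        delta = mtime - _EPOCH
        # Integer arithmetic avoids float rounding on large timestamps.
        seconds = delta.days * 86400 + delta.seconds
        return seconds * 10**9 + delta.microseconds * 1000
    # Otherwise treat the value as (possibly fractional) seconds.
    return int(round(mtime * 1e9))
```

With two arguments, passing both at once can be rejected explicitly instead of guessing from the type.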

@jorisvandenbossche
Member

The WIP attempt to create a Handler for fsspec is here (which the example I posted above is using): jorisvandenbossche@14ac33d. With that, reading files is working, although there are some inconsistencies in the file-like objects that the open() method creates, so it will probably not work for all fsspec filesystems.
(Writing is still TODO.)

Member

@jorisvandenbossche jorisvandenbossche left a comment

@pitrou this all looks good to me. I went through the code and tested it further with the fsspec handler, and all seems to be working nicely.
The only part I am not yet sure how to implement (if it is possible at all) is open_append_stream. The PythonFile class might not yet support an append (mode="a") writable mode?

open_append_stream;
};

class ARROW_PYTHON_EXPORT PyFileSystem : public arrow::fs::FileSystem {
Member

Does this need some doc comments? (I don't know how "public" this is. In the end it is only to be used in the Python bindings, I think, and there the PyFileSystem class has docstrings for Python users, so that might be sufficient.)

Member Author

It's not public as a C++ API, indeed, but if there's some stuff there that needs explaining, feel free to point it out and I'll add some comments.

Member

Maybe just a one-liner indicating that this is only used to implement pyarrow.fs.PyFileSystem could be added then, but it's not that important.

    @abstractmethod
    def move(self, src, dest):
        """
        Implement PyFileSystem.move(...).
Member

I suppose methods like this are expected to just raise the appropriate FileNotFoundError in case src does not exist?

Member Author

Right.
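
As a rough illustration of that expectation, here is a toy, dict-backed store whose `move` raises `FileNotFoundError` when the source is missing. `DictBackedStore` is a hypothetical stand-in written for this example; it is not a pyarrow class and does not implement the full handler interface.

```python
class DictBackedStore:
    """Toy in-memory store mapping paths to bytes (hypothetical example)."""

    def __init__(self):
        self.files = {}

    def move(self, src, dest):
        # Raise the standard Python exception for a missing source path,
        # as discussed above, instead of failing silently.
        if src not in self.files:
            raise FileNotFoundError(src)
        self.files[dest] = self.files.pop(src)


store = DictBackedStore()
store.files["a.bin"] = b"data"
store.move("a.bin", "b.bin")

try:
    store.move("missing.bin", "c.bin")
    missing_raised = False
except FileNotFoundError:
    missing_raised = True
```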

@jorisvandenbossche
Member

The PythonFile class might not yet support an append (mode="a") writable mode?

Recognizing append mode as a "write" mode seems to work:

--- a/python/pyarrow/io.pxi
+++ b/python/pyarrow/io.pxi
@@ -654,6 +654,8 @@ cdef class PythonFile(NativeFile):
             kind = 'w'
         elif inferred_mode.startswith('r'):
             kind = 'r'
+        elif inferred_mode.startswith('a'):
+            kind = 'w'
         else:
             raise ValueError('Invalid file mode: {0}'.format(mode))

(If that's a proper fix, I can include it in my follow-up PR.)
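
The logic of the patch can be tried in isolation with a standalone sketch. The function name `infer_kind` is hypothetical, and the first branch (checking for 'w') is an assumption, since the diff context above does not show it.

```python
def infer_kind(mode):
    """Map a Python file mode string to 'r' or 'w' (hypothetical helper)."""
    if mode.startswith('w'):  # assumed: this branch is not visible in the diff
        kind = 'w'
    elif mode.startswith('r'):
        kind = 'r'
    elif mode.startswith('a'):
        # Proposed change: treat append mode as a writable mode.
        kind = 'w'
    else:
        raise ValueError('Invalid file mode: {0}'.format(mode))
    return kind
```

This only classifies the mode string; whether writes actually land at the end of the file depends on the underlying file object.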

@pitrou
Member Author

pitrou commented Jun 9, 2020

Recognizing append mode as a "write" mode seems to work:

You'll have to check that writing indeed appends at the end rather than e.g. truncating.
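
A minimal sketch of such a check, using only the built-in `open()` on a local temporary file. This is an assumption-laden stand-in: a real test would go through the filesystem handler under test instead, and the helper name `appends_without_truncating` is made up for this example.

```python
import os
import tempfile


def appends_without_truncating(path):
    """Write, reopen in append mode, write again, and verify both parts remain."""
    with open(path, "wb") as f:
        f.write(b"first")
    with open(path, "ab") as f:
        f.write(b"second")
    with open(path, "rb") as f:
        return f.read() == b"firstsecond"


with tempfile.TemporaryDirectory() as tmpdir:
    result = appends_without_truncating(os.path.join(tmpdir, "data.bin"))
```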

More generally, while I originally added the append-open method, I'm not sure it will ever be useful, so in the meantime, if an implementation wants to raise NotImplementedError, that's fine with me.

@jorisvandenbossche
Member

You'll have to check that writing indeed appends at the end rather than e.g. truncating.

At least the tests pass (I added an fsspec local filesystem handler to the list of filesystem fixtures), and from a small local test it seems to be appending.

Now, if it's not actually used within Arrow at the moment (e.g. if we don't need it for reading/writing Parquet files or datasets with those filesystems), then it is also not very important to get this working.
