-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11238: [Python] Make SubTreeFileSystem print method more informative #11447
Conversation
|
You can resolve the lint errors using Archery. I will send you some useful commands. |
python/pyarrow/_fs.pyx
Outdated
@@ -833,6 +833,12 @@ cdef class SubTreeFileSystem(FileSystem): | |||
FileSystem.init(self, wrapped) | |||
self.subtreefs = <CSubTreeFileSystem*> wrapped.get() | |||
|
|||
def __str__(self): | |||
return f'SubTreeFileSystem: file:{self.base_path}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all objects in the file system physical files? or can they be an URI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this can be a general URI as well (eg since you can use it with S3). In any case it's good to not use file:
because also for local file paths that would be confusing (it results in file:/base/
, which looks like a URI, but is not a valid one).
The repr uses base_path=
. Maybe we can do the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or, we could also make str and repr just the same output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, __repr__
differs from __str__
in that it allows "reconstruction" (eval) of an equivalent object. But this would be difficult for SubTreeFileSystem
and I do not think Arrow has an established convention for representing Python objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to omit subfs.base_fs
from __repr__
in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think a better route would be for SubTreeFileSystem
to print the base_path
combined with output of the __str__/__repr__
methods of FileSystem
class. But it seems that FileSystem
doesn't have __str__/__repr__
...yet.
Similar to the following:
Note: Notice the call to str(self.base_fs)
which would in turn call the __str__
method of base class FileSystem
.
def __repr__(self):
return f'SubTreeFileSystem(base_path={self.base_path}, base_fs={str(self.base_fs)})'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SubTreeFileSystem
is a simple class representing a base path and a FileSystem
, so it is difficult to be too informative asides from this.
Would it make sense to put this PR on hold, and first add __str__/__repr__
to FileSystem
. Then circle back to this PR and build on top of that information? cc @jorisvandenbossche @ianmcook
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to omit
subfs.base_fs
from__repr__
in this case?
I would certainly keep it, as that seems an essential part to understand a SubtreeFilesystem object (i.e. what kind of filesystem is it wrapping)
Ideally,
__repr__
differs from__str__
in that it allows "reconstruction" (eval) of an equivalent object. But this would be difficult forSubTreeFileSystem
and I do not think Arrow has an established convention for representing Python objects.
Indeed, if making a distinction, that's the typical rule to differentiate repr and str. But making an "eval"-able repr is not always easy / possible. We don't generally do that in pyarrow (eg Array, Table, RecordBatch etc don't have a separate repr). For FileSystems it might be possible though. But I would defer that to later (because it requires updating the repr of other filesystems, see note below).
Would it make sense to put this PR on hold, and first add
__str__/__repr__
toFileSystem
. Then circle back to this PR and build on top of that information? cc @jorisvandenbossche @ianmcook
We could certainly improve the str/reprs of the other FileSystems as well (and we should open a JIRA for it). But I don't think that need to hold up this PR. For example, the repr of LocalFileSystem is already informative, but only contains some noise. While SubtreeFileSystem is really lacking information.
python/pyarrow/tests/test_fs.py
Outdated
|
||
subfs = SubTreeFileSystem('/another/base/', LocalFileSystem()) | ||
assert subfs.base_path == '/another/base/' | ||
assert subfs.base_fs == localfs | ||
|
||
assert str(subfs) == 'SubTreeFileSystem: file:/another/base/' | ||
assert repr(subfs) == f'SubTreeFileSystem(base_path=/another/base/, base_fs={subfs.base_fs})' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the failure on CI is like this:
> assert repr(subfs) == f'SubTreeFileSystem(base_path=/another/base/, base_fs={subfs.base_fs})'
E AssertionError: assert 'SubTreeFileS...f66c4759fb0>)' == 'SubTreeFileS...f66c4759e70>)'
E - SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem object at 0x7f66c4759e70>)
E ? ^^
E + SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem object at 0x7f66c4759fb0>)
E ?
So because the subfs.base_fs
is recreaed each time you access it, the repr of those differ. But I don't really understand why it actually passes in some builds (and also for you locally?), and not in others.
(but so we will need to make the testing a bit more robust against this small expected difference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly. Even locally on my M1 it first showed the diff and at the end it passed. I don't really understand how it works. It looks like it is interchanging between two numbers every time you call it.
I don't have an idea how to make it more robust in this case. Can't seem to find a sense in the behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for getting different object IDs is that the property base_fs
creates another instance of base class FileSystem
via the wrap()
method. __repr__
uses base_fs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have an idea how to make it more robust in this case.
I think the trick could be to avoid comparing that part of the string. For example, we know that it always is at the end, so we could slice off the last part, and only compare the remainder. Or something like assert repr(subfs).startswith("SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem")
.
6bbb87d
to
5c6106b
Compare
I would maybe suggest to leave out the There is also a linting failure (I think there are still some too long lines) |
Benchmark runs are scheduled for baseline = 3da6600 and contender = fd6cb3a. fd6cb3a is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Thanks @AlenkaF ! |
…ative Added `__str__` and `__repr__` methods for `SubTreeFileSystem` class. Closes apache#11447 from AlenkaF/ARROW-11238 Authored-by: Alenka Frim <frim.alenka@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Added
__str__
and__repr__
methods forSubTreeFileSystem
class.