Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11238: [Python] Make SubTreeFileSystem print method more informative #11447

Closed
wants to merge 4 commits into from

Conversation

AlenkaF
Copy link
Member

@AlenkaF AlenkaF commented Oct 18, 2021

Added __str__ and __repr__ methods for SubTreeFileSystem class.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@edponce
Copy link
Contributor

edponce commented Oct 18, 2021

You can resolve the lint errors using Archery. I will send you some useful commands.
https://arrow.apache.org/docs/developers/archery.html

@@ -833,6 +833,12 @@ cdef class SubTreeFileSystem(FileSystem):
FileSystem.init(self, wrapped)
self.subtreefs = <CSubTreeFileSystem*> wrapped.get()

def __str__(self):
return f'SubTreeFileSystem: file:{self.base_path}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all objects in the file system physical files? or can they be an URI?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this can be a general URI as well (eg since you can use it with S3). In any case it's good to not use file: because also for local file paths that would be confusing (it results in file:/base/, which looks like a URI, but is not a valid one).

The repr uses base_path=. Maybe we can do the same here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, we could also make str and repr just the same output.

Copy link
Contributor

@edponce edponce Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, __repr__ differs from __str__ in that it allows "reconstruction" (eval) of an equivalent object. But this would be difficult for SubTreeFileSystem and I do not think Arrow has an established convention for representing Python objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to omit subfs.base_fs from __repr__ in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I think a better route would be for SubTreeFileSystem to print the base_path combined with output of the __str__/__repr__ methods of FileSystem class. But it seems that FileSystem doesn't have __str__/__repr__...yet.

Similar to the following:
Note: Notice the call to str(self.base_fs) which would in turn call the __str__ method of base class FileSystem.

def __repr__(self):
        return f'SubTreeFileSystem(base_path={self.base_path}, base_fs={str(self.base_fs)})'

Copy link
Contributor

@edponce edponce Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubTreeFileSystem is a simple class representing a base path and a FileSystem, so it is difficult to be too informative asides from this.

Would it make sense to put this PR on hold, and first add __str__/__repr__ to FileSystem. Then circle back to this PR and build on top of that information? cc @jorisvandenbossche @ianmcook

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to omit subfs.base_fs from __repr__ in this case?

I would certainly keep it, as that seems an essential part to understand a SubtreeFilesystem object (i.e. what kind of filesystem is it wrapping)

Ideally, __repr__ differs from __str__ in that it allows "reconstruction" (eval) of an equivalent object. But this would be difficult for SubTreeFileSystem and I do not think Arrow has an established convention for representing Python objects.

Indeed, if making a distinction, that's the typical rule to differentiate repr and str. But making an "eval"-able repr is not always easy / possible. We don't generally do that in pyarrow (eg Array, Table, RecordBatch etc don't have a separate repr). For FileSystems it might be possible though. But I would defer that to later (because it requires updating the repr of other filesystems, see note below).

Would it make sense to put this PR on hold, and first add __str__/__repr__ to FileSystem. Then circle back to this PR and build on top of that information? cc @jorisvandenbossche @ianmcook

We could certainly improve the str/reprs of the other FileSystems as well (and we should open a JIRA for it). But I don't think that need to hold up this PR. For example, the repr of LocalFileSystem is already informative, but only contains some noise. While SubtreeFileSystem is really lacking information.


subfs = SubTreeFileSystem('/another/base/', LocalFileSystem())
assert subfs.base_path == '/another/base/'
assert subfs.base_fs == localfs

assert str(subfs) == 'SubTreeFileSystem: file:/another/base/'
assert repr(subfs) == f'SubTreeFileSystem(base_path=/another/base/, base_fs={subfs.base_fs})'
Copy link
Member

@jorisvandenbossche jorisvandenbossche Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the failure on CI is like this:

 >       assert repr(subfs) == f'SubTreeFileSystem(base_path=/another/base/, base_fs={subfs.base_fs})'
E       AssertionError: assert 'SubTreeFileS...f66c4759fb0>)' == 'SubTreeFileS...f66c4759e70>)'
E         - SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem object at 0x7f66c4759e70>)
E         ?                                                                                                       ^^
E         + SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem object at 0x7f66c4759fb0>)
E         ?            

So because the subfs.base_fs is recreaed each time you access it, the repr of those differ. But I don't really understand why it actually passes in some builds (and also for you locally?), and not in others.

(but so we will need to make the testing a bit more robust against this small expected difference)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. Even locally on my M1 it first showed the diff and at the end it passed. I don't really understand how it works. It looks like it is interchanging between two numbers every time you call it.

I don't have an idea how to make it more robust in this case. Can't seem to find a sense in the behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for getting different object IDs is that the property base_fs creates another instance of base class FileSystem via the wrap() method. __repr__ uses base_fs.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Oct 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an idea how to make it more robust in this case.

I think the trick could be to avoid comparing that part of the string. For example, we know that it always is at the end, so we could slice off the last part, and only compare the remainder. Or something like assert repr(subfs).startswith("SubTreeFileSystem(base_path=/another/base/, base_fs=<pyarrow._fs.LocalFileSystem").

@AlenkaF AlenkaF force-pushed the ARROW-11238 branch 2 times, most recently from 6bbb87d to 5c6106b Compare October 19, 2021 09:05
@jorisvandenbossche
Copy link
Member

I would maybe suggest to leave out the __str__ (and only have __repr__, in which case that will be used for str). We can open a JIRA to discuss if we want a different string representation for FileSystems in general.

There is also a linting failure (I think there are still some too long lines)

python/pyarrow/_fs.pyx Outdated Show resolved Hide resolved
@ursabot
Copy link

ursabot commented Oct 21, 2021

Benchmark runs are scheduled for baseline = 3da6600 and contender = fd6cb3a. fd6cb3a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Scheduled] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@jorisvandenbossche
Copy link
Member

Thanks @AlenkaF !

@AlenkaF AlenkaF deleted the ARROW-11238 branch October 21, 2021 12:33
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Nov 3, 2021
…ative

Added `__str__` and `__repr__` methods for `SubTreeFileSystem` class.

Closes apache#11447 from AlenkaF/ARROW-11238

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants