[Python] Support __sizeof__ protocol for Python objects #23248

Closed
asfimport opened this issue Oct 17, 2019 · 9 comments

@asfimport

It would be helpful if PyArrow objects implemented the __sizeof__ protocol to give other libraries hints about how much data they have allocated. This helps systems like Dask, which have to make judgements about whether or not something is cheap to move or taking up a large amount of space.

Reporter: Matthew Rocklin / @mrocklin
Assignee: Joris Van den Bossche / @jorisvandenbossche

Note: This issue was originally created as ARROW-6926. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
Makes sense to me, should be easy-ish for us to calculate that

@asfimport

Matthew Rocklin / @mrocklin:
Someone ended up contributing these to Dask (we have a dispatch mechanism to work around these not being implemented upstream). Obviously it would have been nicer for this code to be implemented in Arrow originally, but I thought I'd point to it here in case it's helpful to others.

https://github.com/dask/dask/blob/539d1e27a8ccce01de5f3d49f1748057c27552f2/dask/sizeof.py#L115-L145
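For readers unfamiliar with the idea, here is a minimal sketch of what a type-based size dispatch can look like. It uses only the standard library's functools.singledispatch and is not Dask's actual implementation; estimate_size and ExternalTable are hypothetical names made up for illustration.

```python
import sys
from functools import singledispatch


@singledispatch
def estimate_size(obj):
    """Fallback: trust whatever the object reports through __sizeof__."""
    return sys.getsizeof(obj)


class ExternalTable:
    """Hypothetical third-party type that does not report its data size."""

    def __init__(self, columns):
        # Mapping of column name -> bytes-like payload.
        self.columns = columns


@estimate_size.register(ExternalTable)
def _estimate_external_table(obj):
    # Workaround registered outside the library: shallow object size
    # plus the length of every column payload.
    return sys.getsizeof(obj) + sum(len(buf) for buf in obj.columns.values())


table = ExternalTable({"a": bytes(1000), "b": bytes(2000)})
print(estimate_size(table))    # uses the registered handler
print(estimate_size(b"abc"))   # uses the sys.getsizeof fallback
```

If the type implemented __sizeof__ itself, the fallback branch would already give a useful answer and no registration would be needed.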

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
I started with implementing the nbytes attribute last week (ARROW-3444, which is merged now), with the idea of afterwards looking at sizeof.

Main question is if we just want to return what nbytes does (the number of bytes in the buffers), which is what the dask approximation does, or if we also want to include the size of the cython + C++ object.

sys.getsizeof works out of the box for the cython object (but it ignores the relevant buffers):

In [38]: a = pa.array([1, 2])

In [39]: import sys

In [40]: sys.getsizeof(a)
Out[40]: 96

but when overriding __sizeof__ in Array, I am not sure how to get to this number so I can add the nbytes of the buffers to it.

@asfimport

Antoine Pitrou / @pitrou:
You can call object.__sizeof__ to get the baseline, for example:

>>> v = set(range(500))
>>> type(v).__sizeof__(v)
32968
>>> object.__sizeof__(v)
200
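
Applied to the question above, the pattern could look roughly like the sketch below. This is a pure-Python illustration, not the PyArrow implementation: BufferBackedArray is a hypothetical stand-in for an object whose data lives in separate buffers, and its nbytes property only mimics the idea of summing buffer sizes.

```python
import sys


class BufferBackedArray:
    """Hypothetical stand-in for an object whose data lives in external buffers."""

    def __init__(self, buffers):
        self._buffers = list(buffers)

    @property
    def nbytes(self):
        # Total number of bytes held in the data buffers.
        return sum(len(buf) for buf in self._buffers)

    def __sizeof__(self):
        # Baseline size of the Python object itself, plus the buffer payload.
        return object.__sizeof__(self) + self.nbytes


arr = BufferBackedArray([bytearray(1024), bytearray(4096)])
baseline = object.__sizeof__(arr)   # object only, buffers ignored
reported = arr.__sizeof__()         # object + buffers
total = sys.getsizeof(arr)          # what callers such as Dask would see
print(baseline, reported, total)
```

The exact baseline depends on the interpreter build, so only the difference between reported and baseline (the buffer bytes) is stable across Python versions.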

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Ah, thanks. But it seems cython is adding a bit more still:

In [21]: a = pa.array([1]*10)

In [22]: sys.getsizeof(a)
Out[22]: 96

In [23]: object.__sizeof__(a)
Out[23]: 72

(not sure how much we care about those small numbers, in reality users will mainly care for big arrays where the nbytes dominates the result)

@asfimport

Antoine Pitrou / @pitrou:
sys.getsizeof automatically adds other factors such as the GC overhead.
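
A small sketch of how to observe that extra amount with built-in types only. The helper name getsizeof_overhead is made up for this example, and the concrete numbers depend on the Python version and build, so none are shown.

```python
import sys


def getsizeof_overhead(obj):
    """Bytes that sys.getsizeof adds on top of the type's own __sizeof__."""
    return sys.getsizeof(obj) - type(obj).__sizeof__(obj)


tracked = []        # list is a GC-managed container type
untracked = 12345   # int is not GC-managed

print(getsizeof_overhead(tracked))    # GC header size on typical CPython builds
print(getsizeof_overhead(untracked))  # no GC header is added for ints
```

This is why the value returned by sys.getsizeof can exceed what __sizeof__ returns for the same object.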

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
OK, thanks!

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 5879
#5879
