Inconsistent behaviour when calculating sizeof GeoDataFrame and GeoSeries #3976

Open
avnovikov opened this issue Sep 12, 2018 · 5 comments

@avnovikov

avnovikov commented Sep 12, 2018

As found in the discussion in #3972 (comment); this is probably the root cause of #3972:

AttributeError: 'GeometryArray' object has no attribute 'nbytes'
distributed.sizeof - WARNING - Sizeof calculation failed.  Defaulting to 1MB
Traceback (most recent call last):
  File "/anaconda3/envs/mariquant/lib/python3.6/site-packages/distributed/sizeof.py", line 16, in safe_sizeof
    return sizeof(obj)
  File "/anaconda3/envs/mariquant/lib/python3.6/site-packages/dask/utils.py", line 415, in __call__
    return meth(arg)
  File "/anaconda3/envs/mariquant/lib/python3.6/site-packages/dask/sizeof.py", line 63, in sizeof_pandas_series
    p = int(s.memory_usage(index=True))
  File "/anaconda3/envs/mariquant/lib/python3.6/site-packages/pandas/core/series.py", line 3503, in memory_usage
    v = super(Series, self).memory_usage(deep=deep)
  File "/anaconda3/envs/mariquant/lib/python3.6/site-packages/pandas/core/base.py", line 1143, in memory_usage
    v = self.values.nbytes

However, the GeoDataFrame size seems to be calculated correctly. Looking at the code of dask/sizeof.py (the file the traceback points to), the only suspicious line is p = int(s.memory_usage(index=True)), yet calling that directly on the GeoSeries succeeds:

type(anchorage_buffer)
Out[55]:
geopandas.geoseries.GeoSeries
In [57]:
anchorage_buffer.memory_usage(index=True)
I am densified (external_values, 20353 elements)
I am densified (external_values, 20353 elements)
Out[57]:
325648

Unfortunately I can't trace the origin of this error.

This issue seems to be mostly GeoPandas-related, so I posted a companion issue, #819.
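
For readers following the traceback, here is a minimal stdlib-only sketch of the failure path. It is not the dask or pandas code: ToySeries, ArrayWithoutNbytes, toy_sizeof and toy_safe_sizeof are hypothetical stand-ins, and the 1,000,000-byte fallback is an assumption based on the "Defaulting to 1MB" warning above. It only illustrates how a values container without nbytes turns into an AttributeError that a "safe" wrapper then swallows.

```python
import logging

logger = logging.getLogger("sizeof_sketch")

# Assumed stand-in for the "1MB" default mentioned in the warning.
DEFAULT_SIZE = 1_000_000


class ArrayWithoutNbytes:
    """Hypothetical values container that does not define `nbytes`."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)


class ToySeries:
    """Hypothetical series whose memory_usage reads values.nbytes."""

    def __init__(self, values):
        self.values = values

    def memory_usage(self):
        # Mirrors the last traceback frame: v = self.values.nbytes
        return self.values.nbytes


def toy_sizeof(obj):
    """Size estimate that relies entirely on memory_usage()."""
    return int(obj.memory_usage())


def toy_safe_sizeof(obj, default_size=DEFAULT_SIZE):
    """Return toy_sizeof(obj), falling back to default_size on any error."""
    try:
        return toy_sizeof(obj)
    except Exception:
        logger.warning("Sizeof calculation failed. Defaulting to 1MB", exc_info=True)
        return int(default_size)


series = ToySeries(ArrayWithoutNbytes(range(5)))
fallback = toy_safe_sizeof(series)  # logs a warning and returns the default
```

The wrapper hides the AttributeError behind a default, which is consistent with the job continuing after the warning, although the thread does not establish why the real code path reaches nbytes only in some call sites.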

@jorisvandenbossche
Member

I can reproduce this, but I am also a bit lost about the cause at this moment (meaning: we can add an nbytes attribute to GeometryArray, which will solve this, but the code should still not end up there).

So when calling client.scatter(buffer) I get the same warning, but when doing distributed.safe_sizeof(buffer) (which is what is getting called according to the traceback), it works fine.

@avnovikov
Author

avnovikov commented Sep 13, 2018

@TomAugspurger can you please comment on this.

@jorisvandenbossche - sure, I'll try my best to find out what happened. (I hate leaving things unfixed after spending four days trying to get to the cause and creating a duct-tape workaround.)
Please note that the problem is with GeoSeries; GeoDataFrame works fine.

@TomAugspurger
Member

TomAugspurger commented Sep 13, 2018 via email

@jorisvandenbossche
Member

@avnovikov I think the quick fix would be to add nbytes to GeometryArray (it is still strange why we run into this issue, but adding that attribute is something that we need to do anyway to satisfy the new pandas extension interface)
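
As a rough illustration of that quick fix (not the GeoPandas implementation), the sketch below gives a toy array an nbytes property. ToyGeometryArray and its internal layout are hypothetical: it assumes one fixed-width 8-byte handle per geometry and an 8-byte integer index entry per row, and it reports only the shallow size of the handle buffer, not the geometries themselves.

```python
from array import array


class ToyGeometryArray:
    """Hypothetical stand-in: one fixed-width handle slot per geometry."""

    def __init__(self, handles):
        # 'Q' is an unsigned 64-bit integer on common platforms (assumption).
        self._handles = array("Q", handles)

    def __len__(self):
        return len(self._handles)

    @property
    def nbytes(self):
        # Shallow size only: bytes of the handle buffer, not the geometries.
        return self._handles.itemsize * len(self._handles)


# Same element count as the GeoSeries in the report above.
geoms = ToyGeometryArray(range(20353))
values_bytes = geoms.nbytes
index_bytes = 8 * len(geoms)  # assuming an 8-byte integer index entry per row
total = values_bytes + index_bytes
```

Under these assumptions the total is 162824 + 162824 = 325648, the same number that memory_usage(index=True) returned in Out[57]; the thread does not confirm that GeoPandas arrives at it this way.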

@mrocklin
Member

Any update on this? It appears to have gone quiet.
