BUG: set_index results in invalid dask GeoDataFrame (partitions are DataFrames) #59

Closed

jorisvandenbossche opened this issue Jun 17, 2021 · 4 comments · Fixed by #142

@jorisvandenbossche (Member)
Using the dask `set_index` method results in an "invalid" `dask_geopandas.GeoDataFrame`: the partitions are no longer GeoDataFrames but plain pandas DataFrames, which then causes errors when computing spatial operations.

import geopandas
import dask_geopandas

gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(gdf, npartitions=4)

>>> type(ddf.partitions[0].compute())
geopandas.geodataframe.GeoDataFrame

>>> ddf2 = ddf.set_index("continent")
# still a dask_geopandas GeoDataFrame
>>> type(ddf2)   
dask_geopandas.core.GeoDataFrame
# but partitions under the hood no longer a geopandas.GeoDataFrame
>>> type(ddf2.partitions[0].compute())   
pandas.core.frame.DataFrame
# which results in errors when doing actual computations
>>> ddf2.area.compute()
....
AttributeError: 'DataFrame' object has no attribute 'area'
@DahnJ (Contributor) commented Jun 18, 2021

Interestingly, this is not the case for all the partitions:

>>> [type(partition.compute()) for partition in ddf2.partitions]

[pandas.core.frame.DataFrame,
 geopandas.geodataframe.GeoDataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame]

The change seems to happen in `rearrange_by_column`.

@jorisvandenbossche (Member, Author)
Ah, interesting. Then it might also be a bug in GeoPandas (if some operation on a GeoDataFrame results in a pandas DataFrame where it could have preserved the GeoDataFrame type)

@DahnJ (Contributor) commented Jun 19, 2021

I believe the problem is that Dask uses `partd.pandas` to (de)serialize the dataframes:

import geopandas as gpd
from partd.pandas import serialize, deserialize
df = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
type(deserialize(serialize(df)))  # pandas.core.frame.DataFrame

The reason we sometimes get a GeoDataFrame back is that `dask.shuffle.collect` returns the `meta` object for empty partitions, and the `meta` is a GeoDataFrame.
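The mechanism can be illustrated without partd or geopandas at all. The sketch below uses a hypothetical `MyFrame` subclass as a stand-in for `geopandas.GeoDataFrame`: a pickle round trip preserves the subclass, but rebuilding a frame column-by-column from raw values, which is essentially what `partd.pandas` does, yields a plain `pandas.DataFrame`. This is an assumed simplification of the partd code path, not the actual implementation:

```python
import pickle

import pandas as pd


class MyFrame(pd.DataFrame):
    """Hypothetical stand-in for a DataFrame subclass such as GeoDataFrame."""

    @property
    def _constructor(self):
        # Tell pandas to propagate the subclass through ordinary operations
        return MyFrame


df = MyFrame({"a": [1, 2], "b": [3.0, 4.0]})

# A pickle round trip keeps the subclass intact ...
assert type(pickle.loads(pickle.dumps(df))) is MyFrame

# ... but reconstructing from raw column values (as a column-oriented
# serializer effectively does) comes back as a plain pandas DataFrame
rebuilt = pd.DataFrame({col: df[col].to_numpy() for col in df.columns})
assert type(rebuilt) is pd.DataFrame
```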

@jorisvandenbossche
Copy link
Member Author

@DahnJ indeed, good catch. I opened an issue on the dask/partd side about this: dask/partd#52

An alternative for now is to specify `set_index(..., shuffle="tasks")`, so it doesn't use the default `"disk"` shuffle.
