-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COMPAT: add pandas blockmanager alternative apis for _constructor like things #3080
COMPAT: add pandas blockmanager alternative apis for _constructor like things #3080
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While my knowledge of pandas internals is limited, this looks like a plausible solution to me.
geopandas/geodataframe.py
Outdated
def _constructor_sliced_from_mgr(self, mgr, axes): | ||
is_row_proxy = mgr.index.is_(self.columns) | ||
|
||
assert isinstance(mgr, SingleBlockManager) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this line for any reason? If we need to do this check, can it be a TypeError instead? Removing it would also allow use removing an import from pandas.core.internals
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think it is needed, this assertion should be guaranteed by pandas (I wrote it in my PR, so I assume that's why Matt copied it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, borrowed from Joris, but also the check below only makes sense if len(mgr.blocks)==1
i.e. a singleblockmanager. (if the assertion ever not to hold, I don't think we gain a lot by having the assertion vs having some exotic crash in the subsequent pandas code. Happy to remove.
geopandas/geoseries.py
Outdated
@@ -74,6 +68,17 @@ def _geoseries_expanddim(data=None, *args, **kwargs): | |||
return df | |||
|
|||
|
|||
def _geoseries_expanddim(data=None, *args, **kwargs): | |||
# pd.Series._constructor_expanddim == pd.DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# pd.Series._constructor_expanddim == pd.DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is fine to keep to understand what logic _geoseries_expanddim
is supposed to emulate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just add a second sentence to give some context to it.
geopandas/geoseries.py
Outdated
df = pd.DataFrame(data, *args, **kwargs) | ||
if isinstance(data, GeoSeries): | ||
# pandas default column name is 0, keep convention | ||
geo_col_name = data.name if data.name is not None else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test covering this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also doesn't seem to be covered on main. This block was added in #2296, which was a complex PR and includes a bunch of tests (including a test that ensures to_frame()
returns a GeoDataFrame with "0" as column name), but so maybe this wasn't needed in the final state of that PR (it went through some iterations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: appears to be redundant in my local tests. Seems plausible to me it could have been redundant in that PR in the end.
Had a quick look and it appears we still have codecov corresponding to that branch, which indicates it was covered then (https://app.codecov.io/gh/geopandas/geopandas/commit/aa80aa851ad5ee385aa67e7c153c9f3626b11951/blob/geopandas/geoseries.py).
I'm guessing this was then some compatibility thing for an old version of pandas which we no longer test against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, #3001 dropped coverage of this https://app.codecov.io/gh/geopandas/geopandas/commit/b51d43ba182d776cbf073c7a796010efc63eee1d/indirect-changes
if not any(isinstance(block.dtype, GeometryDtype) for block in mgr.blocks): | ||
return pd.DataFrame._from_mgr(mgr, axes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is essentially the same logic as what we have in _geodataframe_constructor_with_fallback
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the intent.
I suppose you could alternatively write this as something like
def _constructor_from_mgr(self, mgr, axes):
df = GeoDataFrame._from_mgr(mgr, axes)
return _geodataframe_constructor_with_fallback(df)
to make that more explicit (and/ or less confusing) - I suppose that's effectively what pandas is doing right now when it throws the warnings. I suppose this is arguably better? (though if the recommendation for a subclass to implement _constructor_from_mgr
is to just call the subclasses _constructor
, it seems like the process of splitting this out into its own method is superfluous)
I did also wonder if we now have enough _constructor
variants where it makes sense to collect them together into a separate module but I think that would be a mess because of partially initalised imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is fine to leave it as is. In any case, your snippet above would pass df
to the GeoDataFrame constructor again (inside _geodataframe_constructor_with_fallback
), so that would be unnecessary for this case.
though if the recommendation for a subclass to implement
_constructor_from_mgr
is to just call the subclasses_constructor
, it seems like the process of splitting this out into its own method is superfluous
I think the problem is that pandas doesn't necessarily know that (and it also doesn't know if the subclass' _from_mgr
will return a fully instantiated object)
geopandas/geodataframe.py
Outdated
def _constructor_sliced_from_mgr(self, mgr, axes): | ||
is_row_proxy = mgr.index.is_(self.columns) | ||
|
||
assert isinstance(mgr, SingleBlockManager) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think it is needed, this assertion should be guaranteed by pandas (I wrote it in my PR, so I assume that's why Matt copied it)
geopandas/geoseries.py
Outdated
@@ -74,6 +68,17 @@ def _geoseries_expanddim(data=None, *args, **kwargs): | |||
return df | |||
|
|||
|
|||
def _geoseries_expanddim(data=None, *args, **kwargs): | |||
# pd.Series._constructor_expanddim == pd.DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is fine to keep to understand what logic _geoseries_expanddim
is supposed to emulate
geopandas/geoseries.py
Outdated
df = pd.DataFrame(data, *args, **kwargs) | ||
if isinstance(data, GeoSeries): | ||
# pandas default column name is 0, keep convention | ||
geo_col_name = data.name if data.name is not None else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also doesn't seem to be covered on main. This block was added in #2296, which was a complex PR and includes a bunch of tests (including a test that ensures to_frame()
returns a GeoDataFrame with "0" as column name), but so maybe this wasn't needed in the final state of that PR (it went through some iterations)
geopandas/geoseries.py
Outdated
@@ -629,6 +634,11 @@ def _constructor_from_mgr(self, mgr, axes): | |||
def _constructor_expanddim(self): | |||
return _geoseries_expanddim | |||
|
|||
def _constructor_expanddim_from_mgr(self, mgr, axes): | |||
df = pd.DataFrame._from_mgr(mgr, axes) | |||
existing_name = mgr.axes[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected this needs to be mgr.axes[0][0]
(first [0] to get the first axis (i.e. the columns' Index, in the BlockManager twisted order of (columns, index)) and the second [0] to get the single scalar name out of that Index object).
But if the tests are not failing ..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check, I think this is the same point as the missing coverage above, that this argument/ value might be redundant.
…ndas into pandas_constructor_apis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm and get green CI again! 🚀
Thanks @m-richards! I believe that all of the @jorisvandenbossche's comments were resolved so merging to get it all green again. |
@m-richards we should evaluate how safe it is to backport to this to release in 0.14.x. It probably then also requires to backport one of your open PRs to address the regression? |
…e things (geopandas#3080) Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
I created #3159 to have a look at how safely this could be done (I imagine we don't want to merge that PR though because it might be confusing since it's a collection of PRs on main). But if it looks good I'll split out the relevant commit to add. |
…e things (geopandas#3080) Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
…e things (#3080) Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
xref #3060
With 6b4857e we are down to a mere 34992 warnings (commits after are fixes for old pandas).
With 5d7d146 we are down to 346
482c989 is a bit more convoluted than the others. Originally I didn't think we needed this, but there are warnings about this in e.g. test_clip.py (and with it we are down to 311). Since we have quite some logic for
GeoSeries._constructor_expanddim
, I opted to directly construct the dataframe (and fortunately we can actually do this in the DataFrame case as theSeries._name
being undefined as part offrom_mgr
is not an issue)Would not surprise me if this is not the 'right' way of doing some of this, I've kept some tabs on some of the relevant pandas PRs but not across it all.