Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT: add pandas blockmanager alternative apis for _constructor like things #3080

Merged
merged 10 commits into from
Dec 30, 2023

Conversation

m-richards
Copy link
Member

@m-richards m-richards commented Nov 20, 2023

xref #3060

With 6b4857e we are down to a mere 34992 warnings (commits after are fixes for old pandas).

With 5d7d146 we are down to 346

482c989 is a bit more convoluted than the others. Originally I didn't think we needed this, but there are warnings about this in e.g. test_clip.py (and with it we are down to 311). Since we have quite some logic for GeoSeries._constructor_expanddim, I opted to directly construct the dataframe (and fortunately we can actually do this in the DataFrame case as the Series._name being undefined as part of from_mgr is not an issue)

Would not surprise me if this is not the 'right' way of doing some of this, I've kept some tabs on some of the relevant pandas PRs but not across it all.

@m-richards m-richards marked this pull request as ready for review November 20, 2023 11:27
@martinfleis martinfleis reopened this Nov 20, 2023
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While my knowledge of pandas internals is limited, this looks like a plausible solution to me.

def _constructor_sliced_from_mgr(self, mgr, axes):
is_row_proxy = mgr.index.is_(self.columns)

assert isinstance(mgr, SingleBlockManager)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this line for any reason? If we need to do this check, can it be a TypeError instead? Removing it would also allow use removing an import from pandas.core.internals.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it is needed, this assertion should be guaranteed by pandas (I wrote it in my PR, so I assume that's why Matt copied it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, borrowed from Joris, but also the check below only makes sense if len(mgr.blocks)==1 i.e. a singleblockmanager. (if the assertion ever not to hold, I don't think we gain a lot by having the assertion vs having some exotic crash in the subsequent pandas code. Happy to remove.

@@ -74,6 +68,17 @@ def _geoseries_expanddim(data=None, *args, **kwargs):
return df


def _geoseries_expanddim(data=None, *args, **kwargs):
# pd.Series._constructor_expanddim == pd.DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# pd.Series._constructor_expanddim == pd.DataFrame

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is fine to keep to understand what logic _geoseries_expanddim is supposed to emulate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just add a second sentence to give some context to it.

df = pd.DataFrame(data, *args, **kwargs)
if isinstance(data, GeoSeries):
# pandas default column name is 0, keep convention
geo_col_name = data.name if data.name is not None else 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test covering this line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also doesn't seem to be covered on main. This block was added in #2296, which was a complex PR and includes a bunch of tests (including a test that ensures to_frame() returns a GeoDataFrame with "0" as column name), but so maybe this wasn't needed in the final state of that PR (it went through some iterations)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: appears to be redundant in my local tests. Seems plausible to me it could have been redundant in that PR in the end.

Had a quick look and it appears we still have codecov corresponding to that branch, which indicates it was covered then (https://app.codecov.io/gh/geopandas/geopandas/commit/aa80aa851ad5ee385aa67e7c153c9f3626b11951/blob/geopandas/geoseries.py).

I'm guessing this was then some compatibility thing for an old version of pandas which we no longer test against.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1598 to +1599
if not any(isinstance(block.dtype, GeometryDtype) for block in mgr.blocks):
return pd.DataFrame._from_mgr(mgr, axes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is essentially the same logic as what we have in _geodataframe_constructor_with_fallback right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the intent.
I suppose you could alternatively write this as something like

def _constructor_from_mgr(self, mgr, axes):
    df = GeoDataFrame._from_mgr(mgr, axes)
    return _geodataframe_constructor_with_fallback(df)

to make that more explicit (and/ or less confusing) - I suppose that's effectively what pandas is doing right now when it throws the warnings. I suppose this is arguably better? (though if the recommendation for a subclass to implement _constructor_from_mgr is to just call the subclasses _constructor, it seems like the process of splitting this out into its own method is superfluous)

I did also wonder if we now have enough _constructor variants where it makes sense to collect them together into a separate module but I think that would be a mess because of partially initalised imports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to leave it as is. In any case, your snippet above would pass df to the GeoDataFrame constructor again (inside _geodataframe_constructor_with_fallback), so that would be unnecessary for this case.

though if the recommendation for a subclass to implement _constructor_from_mgr is to just call the subclasses _constructor, it seems like the process of splitting this out into its own method is superfluous

I think the problem is that pandas doesn't necessarily know that (and it also doesn't know if the subclass' _from_mgr will return a fully instantiated object)

def _constructor_sliced_from_mgr(self, mgr, axes):
is_row_proxy = mgr.index.is_(self.columns)

assert isinstance(mgr, SingleBlockManager)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think it is needed, this assertion should be guaranteed by pandas (I wrote it in my PR, so I assume that's why Matt copied it)

@@ -74,6 +68,17 @@ def _geoseries_expanddim(data=None, *args, **kwargs):
return df


def _geoseries_expanddim(data=None, *args, **kwargs):
# pd.Series._constructor_expanddim == pd.DataFrame
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is fine to keep to understand what logic _geoseries_expanddim is supposed to emulate

df = pd.DataFrame(data, *args, **kwargs)
if isinstance(data, GeoSeries):
# pandas default column name is 0, keep convention
geo_col_name = data.name if data.name is not None else 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also doesn't seem to be covered on main. This block was added in #2296, which was a complex PR and includes a bunch of tests (including a test that ensures to_frame() returns a GeoDataFrame with "0" as column name), but so maybe this wasn't needed in the final state of that PR (it went through some iterations)

@@ -629,6 +634,11 @@ def _constructor_from_mgr(self, mgr, axes):
def _constructor_expanddim(self):
return _geoseries_expanddim

def _constructor_expanddim_from_mgr(self, mgr, axes):
df = pd.DataFrame._from_mgr(mgr, axes)
existing_name = mgr.axes[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected this needs to be mgr.axes[0][0] (first [0] to get the first axis (i.e. the columns' Index, in the BlockManager twisted order of (columns, index)) and the second [0] to get the single scalar name out of that Index object).
But if the tests are not failing ..?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check, I think this is the same point as the missing coverage above, that this argument/ value might be redundant.

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm and get green CI again! 🚀

@martinfleis martinfleis merged commit 1b3a305 into geopandas:main Dec 30, 2023
19 checks passed
@martinfleis
Copy link
Member

Thanks @m-richards! I believe that all of the @jorisvandenbossche's comments were resolved so merging to get it all green again.

@jorisvandenbossche
Copy link
Member

@m-richards we should evaluate how safe it is to backport to this to release in 0.14.x.

It probably then also requires to backport one of your open PRs to address the regression?

m-richards added a commit to m-richards/geopandas that referenced this pull request Jan 26, 2024
…e things (geopandas#3080)

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
@m-richards
Copy link
Member Author

I created #3159 to have a look at how safely this could be done (I imagine we don't want to merge that PR though because it might be confusing since it's a collection of PRs on main). But if it looks good I'll split out the relevant commit to add.

jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Jan 27, 2024
…e things (geopandas#3080)

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
jorisvandenbossche pushed a commit that referenced this pull request Jan 31, 2024
…e things (#3080)

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants