BUG: pd.concat([..], axis=1) fails #1230

Closed
jorisvandenbossche opened this issue Dec 1, 2019 · 6 comments · Fixed by #2046

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 1, 2019

This "works":

In [86]: geopandas.__version__
Out[86]: '0.5.1'

In [87]: countries = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))

In [89]: cities = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))

In [90]: pd.concat([cities, countries], axis=1) 
Out[90]: 
             name                                      geometry      pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
0    Vatican City   POINT (12.45338654497177 41.90328217996012)     920938.0        Oceania                      Fiji    FJI      8374.0  (POLYGON ((180 -16.06713266364245, 180 -16.555...
1      San Marino     POINT (12.44177015780014 43.936095834768)   53950935.0         Africa                  Tanzania    TZA    150600.0  POLYGON ((33.90371119710453 -0.950000000000000...
2           Vaduz   POINT (9.516669472907267 47.13372377429357)     603253.0         Africa                 W. Sahara    ESH       906.5  POLYGON ((-8.665589565454809 27.65642588959236...
3      Luxembourg   POINT (6.130002806227083 49.61166037912108)   35623680.0  North America                    Canada    CAN   1674000.0  (POLYGON ((-122.84 49.00000000000011, -122.974...
4         Palikir   POINT (158.1499743237623 6.916643696007725)  326625791.0  North America  United States of America    USA  18560000.0  (POLYGON ((-122.84 49.00000000000011, -120 49....
..            ...                                           ...          ...            ...                       ...    ...         ...                                                ...
[202 rows x 8 columns]

But once you do something spatial with the resulting GeoDataFrame (anything that accesses the "geometry column"), things break, because there are two columns named "geometry".
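The underlying pandas behaviour can be sketched without GeoPandas. This is a minimal illustration, assuming plain DataFrames with string placeholders in place of real geometries: after a column-wise concat the "geometry" label is duplicated, so selecting it returns a DataFrame rather than a Series.

```python
import pandas as pd

# Plain DataFrames stand in for the two GeoDataFrames (assumption: strings
# replace real geometry objects, which is enough to show the label clash).
left = pd.DataFrame({"name": ["a", "b"], "geometry": ["POINT (0 0)", "POINT (1 1)"]})
right = pd.DataFrame({"continent": ["x", "y"], "geometry": ["POLYGON A", "POLYGON B"]})

combined = pd.concat([left, right], axis=1)

# With a duplicated label, column selection yields a two-column DataFrame.
selected = combined["geometry"]
print(type(selected).__name__)
print(combined.columns.is_unique)
```

Code that expects a single geometry column has nothing sensible to work with at that point.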

In 0.6.0 this already started failing when doing the concat:

In [10]: pd.concat([cities, pd.DataFrame(countries)], axis=1)  
---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-10-e059be43b4e0> in <module>
----> 1 pd.concat([cities, pd.DataFrame(countries)], axis=1)

~/scipy/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    253     )
    254 
--> 255     return op.get_result()
    256 
    257 

~/scipy/pandas/pandas/core/reshape/concat.py in get_result(self)
    474 
    475             cons = self.objs[0]._constructor
--> 476             return cons._from_axes(new_data, self.new_axes).__finalize__(
    477                 self, method="concat"
    478             )

~/scipy/pandas/pandas/core/generic.py in _from_axes(cls, data, axes, **kwargs)
    407         # for construction from BlockManager
    408         if isinstance(data, BlockManager):
--> 409             return cls(data, **kwargs)
    410         else:
    411             if cls._AXIS_REVERSED:

~/scipy/geopandas/geopandas/geodataframe.py in __init__(self, *args, **kwargs)
     74             index = self.index
     75             try:
---> 76                 self["geometry"] = _ensure_geometry(self["geometry"].values)
     77             except TypeError:
     78                 pass

~/scipy/geopandas/geopandas/geodataframe.py in __getitem__(self, key)
    554         GeoDataFrame.
    555         """
--> 556         result = super(GeoDataFrame, self).__getitem__(key)
    557         geo_col = self._geometry_column_name
    558         if isinstance(key, str) and key == geo_col:

~/scipy/pandas/pandas/core/frame.py in __getitem__(self, key)
   2693             indexer = np.where(indexer)[0]
   2694 
-> 2695         data = self.take(indexer, axis=1)
   2696 
   2697         if is_single_key:

~/scipy/pandas/pandas/core/generic.py in take(self, indices, axis, is_copy, **kwargs)
   3432             indices, axis=self._get_block_manager_axis(axis), verify=True
   3433         )
-> 3434         result = self._constructor(new_data).__finalize__(self)
   3435 
   3436         # Maybe set copy if we didn't actually change the index.

... last 4 frames repeated, from the frame below ...

~/scipy/geopandas/geopandas/geodataframe.py in __init__(self, *args, **kwargs)
     74             index = self.index
     75             try:
---> 76                 self["geometry"] = _ensure_geometry(self["geometry"].values)
     77             except TypeError:
     78                 pass

RecursionError: maximum recursion depth exceeded
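The loop in the traceback can be modelled with a toy class. This is only a sketch of the call pattern, not the GeoPandas code: `ToyFrame` is a hypothetical stand-in whose `__init__` reads `self["geometry"]`, and whose `__getitem__` builds a new frame when the label is duplicated.

```python
class ToyFrame:
    """Hypothetical stand-in for a frame whose __init__ reads self["geometry"]."""

    def __init__(self, columns):
        self.columns = list(columns)
        if "geometry" in self.columns:
            # Mirrors the self["geometry"] access in the constructor.
            self["geometry"]

    def __getitem__(self, key):
        matches = [col for col in self.columns if col == key]
        if len(matches) == 1:
            # A unique label: return a plain value (stands in for a Series).
            return matches[0]
        # A duplicated label: the result is itself a frame, so the
        # constructor runs again and reads self["geometry"] once more.
        return type(self)(matches)


ToyFrame(["name", "geometry"])  # unique label: constructs fine

try:
    ToyFrame(["geometry", "geometry"])
except RecursionError as exc:
    print("RecursionError:", exc)
```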
@machow

machow commented Dec 4, 2019

This appears to be causing plotnine to break on 0.6.* versions. For example...

import geopandas
from plotnine import *

ne = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ggplot() + geom_map(ne)

Raises: AttributeError: 'GeometryArray' object has no attribute 'view'

Stacktrace

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/ggplot.py in __repr__(self)
86 # in the jupyter notebook.
87 if not self.figure:
---> 88 self.draw()
89 plt.show()
90 return '<ggplot: (%d)>' % self.__hash__()

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/ggplot.py in draw(self, return_ggplot)
179 # new frames knowing that they are separate from the original.
180 with pd.option_context('mode.chained_assignment', None):
--> 181 return self._draw(return_ggplot)
182
183 def _draw(self, return_ggplot=False):

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/ggplot.py in _draw(self, return_ggplot)
186 # assign a default theme
187 self = deepcopy(self)
--> 188 self._build()
189
190 # If no theme we use the default

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/ggplot.py in _build(self)
305 # Prepare data in geoms
306 # e.g. from y and width to ymin and ymax
--> 307 layers.setup_data()
308
309 # Apply position adjustments

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/layer.py in setup_data(self)
70 def setup_data(self):
71 for l in self:
---> 72 l.setup_data()
73
74 def draw(self, layout, coord):

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/layer.py in setup_data(self)
416 return type(data)()
417
--> 418 data = self.geom.setup_data(data)
419
420 check_required_aesthetics(

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/plotnine/geoms/geom_map.py in setup_data(self, data)
89 inplace=True)
90
---> 91 data = pd.concat([data, bounds], axis=1, copy=False)
92 return data
93

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
227 verify_integrity=verify_integrity,
228 copy=copy, sort=sort)
--> 229 return op.get_result()
230
231

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/pandas/core/reshape/concat.py in get_result(self)
424 new_data = concatenate_block_managers(
425 mgrs_indexers, self.new_axes, concat_axis=self.axis,
--> 426 copy=self.copy)
427 if not self.copy:
428 new_data._consolidate_inplace()

~/.virtualenvs/tidytuesday/lib/python3.6/site-packages/pandas/core/internals/managers.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
2052 values = values.copy()
2053 elif not copy:
-> 2054 values = values.view()
2055 b = b.make_block_same_class(values, placement=placement)
2056 elif is_uniform_join_units(join_units):

AttributeError: 'GeometryArray' object has no attribute 'view'

@Sangarshanan
Contributor

Since GeoPandas allows only one geometry column to be specified, I was wondering what the ideal solution for this would be. I have been working around it by converting to a DataFrame.

@m-richards
Member

I'd like to work on fixing this. As far as I can see, there are two options:

  1. Raise an exception when concat is called with arguments where the geometry column is not unique
  2. Mangle the repeated columns with a suffix (i.e. geometry, geometry_1, geometry_2, ...) and show a warning that these columns have been altered
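Both options can be sketched on a plain list of column names. The helpers `ensure_unique_geometry` and `mangle_duplicates` are hypothetical names used for illustration only; neither is an existing GeoPandas function.

```python
import warnings


def ensure_unique_geometry(columns, name="geometry"):
    """Option 1 (hypothetical helper): raise if the geometry label repeats."""
    count = list(columns).count(name)
    if count > 1:
        raise ValueError(
            f"column {name!r} appears {count} times; geometry column must be unique"
        )


def mangle_duplicates(columns, name="geometry"):
    """Option 2 (hypothetical helper): suffix repeats as name_1, name_2, ...

    Note: this sketch does not guard against an existing 'geometry_1' column.
    """
    seen = 0
    result = []
    for col in columns:
        if col == name:
            result.append(name if seen == 0 else f"{name}_{seen}")
            seen += 1
        else:
            result.append(col)
    if seen > 1:
        warnings.warn(f"renamed {seen - 1} duplicate {name!r} column(s)")
    return result


cols = ["name", "geometry", "pop_est", "geometry", "geometry"]

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mangled = mangle_duplicates(cols)
print(mangled)

try:
    ensure_unique_geometry(cols)
except ValueError as exc:
    print(exc)
```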

I'd appreciate some input on which approach is best, or whether there are other alternatives.
From a quick look, I think this would be handled in GeoDataFrame.__init__, which would then also catch this (more contrived) variant of the error:

from geopandas import GeoDataFrame
from shapely.geometry import Point
import pandas as pd

geoms = [Point(0, 0), Point(1, 1)]
df = pd.DataFrame({"col1": [0, 1], "geometry": geoms})
df['geometry2'] = df['geometry']
df = df.rename(columns={'geometry2': 'geometry'})

gdf = GeoDataFrame(df)

@martinfleis
Member

@m-richards we also have to figure out which of the geometry columns, in case of unique names, should be set as the active geometry after concat (none?)

In any case, I think that raising is a good way forward. If you rename, you again need to figure out which column should be active and which one came from which gdf. I'd say that leaving that for the user to resolve is the safer option.
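Leaving the decision to the user could look roughly like this. It is a sketch on plain pandas DataFrames with placeholder strings instead of geometries, and the `geometry_countries` name is an arbitrary choice for illustration.

```python
import pandas as pd

# Plain DataFrames with string placeholders stand in for the two GeoDataFrames.
cities = pd.DataFrame({"name": ["Vaduz", "Palikir"], "geometry": ["POINT A", "POINT B"]})
countries = pd.DataFrame({"iso_a3": ["FJI", "TZA"], "geometry": ["POLYGON A", "POLYGON B"]})

# The caller decides which geometry keeps the plain name before concatenating.
countries_renamed = countries.rename(columns={"geometry": "geometry_countries"})
combined = pd.concat([cities, countries_renamed], axis=1)

print(list(combined.columns))
```

With unique labels, `combined["geometry"]` is a single column again, and it is unambiguous which frame each geometry came from.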

@m-richards
Member

Currently, the geometry column of the first geodataframe gets set in __finalize__:

def __finalize__(self, other, method=None, **kwargs):
    """propagate metadata from other to self"""
    self = super().__finalize__(other, method=method, **kwargs)
    # merge operation: using metadata of the left object
    if method == "merge":
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other.left, name, None))
    # concat operation: using metadata of the first object
    elif method == "concat":
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other.objs[0], name, None))
    return self

So it turns out that there's another curious edge case:

In [3]: cities = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities")).rename_geometry('geom')

In [4]: countries = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres")).rename_geometry('geom')

In [5]: gdf = pd.concat([countries, cities], axis=1)

In [6]: gdf.geometry
Out[6]:
                                                  geom                         geom
0    MULTIPOLYGON (((180.00000 -16.06713, 180.00000...    POINT (12.45339 41.90328)
1    POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...    POINT (12.44177 43.93610)
2    POLYGON ((-8.66559 27.65643, -8.66512 27.58948...     POINT (9.51667 47.13372)
3    MULTIPOLYGON (((-122.84000 49.00000, -122.9742...     POINT (6.13000 49.61166)
4    MULTIPOLYGON (((-122.84000 49.00000, -120.0000...    POINT (158.14997 6.91664)
..                                                 ...                          ...
197                                               None    POINT (31.24802 30.05191)
198                                               None   POINT (139.74946 35.68696)
199                                               None     POINT (2.33139 48.86864)
200                                               None  POINT (-70.66899 -33.44807)
201                                               None    POINT (103.85387 1.29498)

[202 rows x 2 columns]

I don't know whether that should become a separate issue, but it's certainly not great: if the geometry column had been set normally, the _ensure_geometry check would have failed here. Any method acting on the geometry will fail as well.

@m-richards
Member

The "use the metadata of the first geodataframe" approach in __finalize__ also permits this:

In [14]: cities = geopandas.read_file(geopandas.datasets.get_path("naturalearth_cities"))
In [16]: cities2 = cities.to_crs(crs=27700)

In [19]: gdf = pd.concat([cities, cities2], axis=0)

In [20]: gdf.geometry
Out[20]:
0                   POINT (12.45339 41.90328)
1                   POINT (12.44177 43.93610)
2                    POINT (9.51667 47.13372)
3                    POINT (6.13000 49.61166)
4                   POINT (158.14997 6.91664)
                        ...
197      POINT (3692494.09369 -1687352.07231)
198       POINT (3930565.10961 9764719.36170)
199        POINT (717570.12268 -105552.10645)
200    POINT (-6222461.46130 -12320935.04244)
201     POINT (13067799.34178 13920712.76573)
Name: geometry, Length: 404, dtype: geometry

In [21]: gdf.crs
Out[21]:
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

I think it probably makes sense to open a separate issue for that though.
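One possible guard for that case is to compare the CRS of every input before concatenating. The sketch below uses `SimpleNamespace` objects with a `crs` attribute as stand-ins for GeoDataFrames, and `common_crs` is a hypothetical helper, not an existing GeoPandas function.

```python
from types import SimpleNamespace


def common_crs(frames):
    """Return the shared CRS, or raise if the frames disagree (hypothetical helper)."""
    first = frames[0].crs
    for position, frame in enumerate(frames[1:], start=1):
        if frame.crs != first:
            raise ValueError(
                f"frame {position} has CRS {frame.crs!r}, expected {first!r}"
            )
    return first


# Stand-ins for cities (EPSG:4326) and cities2 (reprojected to EPSG:27700).
cities = SimpleNamespace(crs="EPSG:4326")
cities2 = SimpleNamespace(crs="EPSG:27700")

try:
    common_crs([cities, cities2])
except ValueError as exc:
    print(exc)
```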
