Value error using sjoin with pandas v0.23 #731

Closed
uvchik opened this issue May 24, 2018 · 13 comments

uvchik commented May 24, 2018

I use the sjoin function to add the region name (polygons) to every point within the region. Some points are not in any region, so I filter these points out and buffer them step by step. The layer of points without an intersection therefore becomes smaller and smaller. If only one row is left, I get the following error with pandas v0.23, which I did not get before (pandas < v0.23). I am using geopandas v0.3.0.

My call:

new = gpd.sjoin(rest_points, polygons, how='left', op='intersects')

Error message:

ValueError: You are trying to merge on object and int64 columns.
If you wish to proceed you should use pd.concat

class: GeoDataFrame
method: merge(self, *args, **kwargs)
line: result = DataFrame.merge(self, *args, **kwargs)

I do not understand the error and why it happens only with the last point (last row) and only with the newest pandas version. I had a look at "What's New" but could not find anything.

Full message:

  File "virtualenv/lib/python3.5/site-packages/geopandas/tools/sjoin.py", line 140, in sjoin
    suffixes=('_%s' % lsuffix, '_%s' % rsuffix))
  File "virtualenv/lib/python3.5/site-packages/geopandas/geodataframe.py", line 418, in merge
    result = DataFrame.merge(self, *args, **kwargs)
  File "virtualenv/lib/python3.5/site-packages/pandas/core/frame.py", line 6379, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "virtualenv/lib/python3.5/site-packages/pandas/core/reshape/merge.py", line 60, in merge
    validate=validate)
  File "virtualenv/lib/python3.5/site-packages/pandas/core/reshape/merge.py", line 554, in __init__
    self._maybe_coerce_merge_keys()
  File "virtualenv/lib/python3.5/site-packages/pandas/core/reshape/merge.py", line 980, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns.
If you wish to proceed you should use pd.concat
@jorisvandenbossche
Member

@uvchik Thanks for the report!

This is related to a change in pandas to prevent merging on incompatible columns: pandas-dev/pandas#18352 (but the detection of those cases seems a bit too eager).

Would it be possible to make a small reproducible example for this case? (when you only have a single row). Does the row actually fall within any of the polygons?

@jorisvandenbossche
Member

OK, it seems that it gives this error if there is no matching row. E.g.:

In [67]: from shapely.geometry import Point, Polygon

In [68]: import geopandas

In [74]: polygons = geopandas.GeoDataFrame({'col2': [1, 2], 'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]), Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])]})

In [75]: rest_points = geopandas.GeoDataFrame({'col1': [1], 'geometry': [Point(0.5, 0.5)]})

In [76]: geopandas.sjoin(rest_points, polygons, how='left', op='intersects')
Out[76]: 
   col1         geometry  index_right  col2
0     1  POINT (0.5 0.5)            0     1

In [77]: rest_points = geopandas.GeoDataFrame({'col1': [1], 'geometry': [Point(-0.5, 0.5)]})

In [78]: geopandas.sjoin(rest_points, polygons, how='left', op='intersects')
...
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

The underlying reason is that the "key column" that gets created under the hood is of object dtype if it is empty. So we would need to ensure it is of float dtype.
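The dtype difference described above can be checked in isolation. This is a minimal sketch using plain pandas only; the column names `_key_left` and `_key_right` are taken from the sjoin traceback and patch in this thread, and the variable names are just for illustration. It does not try to reproduce the ValueError itself, since whether that is raised depends on the pandas version.

```python
import pandas as pd

# An empty frame built from column names alone: every column is object dtype.
empty_default = pd.DataFrame(columns=['_key_left', '_key_right'])

# The same empty frame with an explicit dtype: every column is float64.
empty_float = pd.DataFrame(columns=['_key_left', '_key_right'], dtype=float)

print(empty_default.dtypes)
print(empty_float.dtypes)
```

With the explicit dtype, the empty key column is numeric, so it is no longer an object column being merged against an int64 index.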

@jorisvandenbossche jorisvandenbossche added this to the 0.4 milestone May 24, 2018
@vedal

vedal commented Jun 24, 2018

In case someone else encounters this error: I got the same error when my longitude/latitude coordinates were in the wrong order.

import geopandas
from geopandas.tools import sjoin
from shapely.geometry import Point

point_df = geopandas.GeoDataFrame({'geometry': [Point(69.905930,17.169982)]})
poly_df = geopandas.GeoDataFrame({'geometry': shapes})
pointInPolys = sjoin(point_df, poly_df, how='right', op='intersects')

This only worked after changing the order of the Point coordinates to Point(17.169982,69.905930). Here, shapes is a list of Polygon objects. A correct output for the point Point(69.905930,17.169982) should have been "out of range" or something similar, since it ends up far from all the Polygons in shapes.

@uvchik
Author

uvchik commented Jun 25, 2018

This confirms the bug because as @jorisvandenbossche already pointed out, the error occurs if there is no matching row. It does not matter for what reason no match can be found (thank you @jorisvandenbossche for pointing that out).

Both points are within the valid range. One is near Norway, the other is near India. You could expect geopandas to raise an out-of-range error if a latitude is greater than 90. Only in that case could a wrong order be detected automatically; otherwise it is not possible.

You could write your own test against your own bounding box, but this is not the topic of this issue.
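A check of that kind could look like the following. This is only a sketch in plain Python: the function names and the bounding box values are hypothetical and not part of geopandas, and the box is a rough illustrative rectangle rather than real region data.

```python
# Hypothetical bounding box as (min_lon, min_lat, max_lon, max_lat).
# The numbers are illustrative only.
NORTHERN_EUROPE = (0.0, 50.0, 35.0, 75.0)


def in_bounds(lon, lat, bounds):
    """Return True if (lon, lat) lies inside the box, edges included."""
    min_lon, min_lat, max_lon, max_lat = bounds
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def check_order(lon, lat, bounds):
    """Classify a coordinate pair as 'ok', 'swapped' or 'outside'."""
    if in_bounds(lon, lat, bounds):
        return "ok"
    # The pair only fits the box when read as (lat, lon): probably swapped.
    if in_bounds(lat, lon, bounds):
        return "swapped"
    return "outside"


print(check_order(17.169982, 69.905930, NORTHERN_EUROPE))
print(check_order(69.905930, 17.169982, NORTHERN_EUROPE))
```

Such a test can only flag a swap relative to the box you expect your data to fall in; it says nothing about points that are valid in both orders.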

@bnaul
Contributor

bnaul commented Jul 2, 2018

Does anyone have a proposed solution for this? I would say geopandas is basically incompatible with the latest pandas since this bug affects a common core use case.

One approach I can see is to just add a temporary column w/ the right dtype enforced to use for the join. Kinda gross but would get the job done:

result = result.set_index('_key_left')
joined = left_df.merge(result, left_index=True, right_index=True, how='left')
# temporary key, cast to the dtype of '_key_right' so both merge keys match
right_df['_key'] = right_df.index.values.astype(joined['_key_right'].dtype)
joined = joined.merge(
    right_df.drop(right_df.geometry.name, axis=1),
    how='left', left_on='_key_right', right_on='_key',
    suffixes=('_%s' % lsuffix, '_%s' % rsuffix))
right_df.drop('_key', axis=1, inplace=True)
joined = joined.set_index(index_left).drop(['_key_right'], axis=1)

@jorisvandenbossche
Member

I think a fix would be:

--- a/geopandas/tools/sjoin.py
+++ b/geopandas/tools/sjoin.py
@@ -114,7 +114,7 @@ def sjoin(left_df, right_df, how='inner', op='intersects',
 
     else:
         # when output from the join has no overlapping geometries
-        result = pd.DataFrame(columns=['_key_left', '_key_right'])
+        result = pd.DataFrame(columns=['_key_left', '_key_right'], dtype=float)
 
     if op == "within":
         # within implemented as the inverse of contains; swap names

Can you check if that solves the issue for you?

@bnaul
Contributor

bnaul commented Jul 3, 2018

I think that does it! dtype=object does not since I guess it gets coerced back into an int at some subsequent step. But I can confirm the example you posted above works now w/ this patch and pandas 0.23.

I thought this might cause one of the columns to stay a float but it seems like in the case where everything can remain an int the behavior is the same. Any other downsides you can think of?
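The int-versus-float behaviour mentioned above can be seen with plain pandas merges. This is a small sketch with made-up frames, not the sjoin code itself: a left merge keeps an integer column as int64 when every row finds a match, and upcasts it to float64 only when unmatched rows introduce NaN.

```python
import pandas as pd

left = pd.DataFrame({'key': [0, 1]})

# Every left row has a partner: the joined column can stay an integer.
right_all = pd.DataFrame({'key': [0, 1], 'index_right': [10, 11]})
all_matched = left.merge(right_all, on='key', how='left')

# One left row has no partner: NaN is inserted and the column becomes float.
right_some = pd.DataFrame({'key': [0], 'index_right': [10]})
some_unmatched = left.merge(right_some, on='key', how='left')

print(all_matched['index_right'].dtype)
print(some_unmatched['index_right'].dtype)
```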

ljwolf added a commit to ljwolf/geopandas that referenced this issue Jul 7, 2018
@uvchik
Author

uvchik commented Jul 16, 2018

Thank you for fixing this bug 😄 🎉

@grant-smittkamp

The new pandas update is very bad. Whatever this merge issue in version 0.23.0 is, it sucks. It breaks something to do with merging strings, floats and objects.

@jorisvandenbossche
Member

@grant-smittkamp This should be fixed with the new geopandas 0.4 release

@uvchik
Author

uvchik commented Aug 20, 2018

Is this fix backward compatible? Should it work with pandas v0.22 and geopandas v0.4.0? We have some problems with this combination but I am not sure if it is caused by this fix or not.

@Malouke

Malouke commented Sep 6, 2018

I am sorry, but version 0.23 sucks.
I love using pandas, but sorry guys, sometimes you decide to replace great features with bad ones.

@Popebl

Popebl commented Oct 11, 2018

pd.merge is a great feature for processing data. I developed a tool on 0.22 that uses many pd.merge calls, but it does not work on 0.23. I love using pandas, but sorry guys, sometimes you decide to replace great features with bad ones.


7 participants