-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: check CRS authority via SRID number when reading from postgis #3329
BUG: check CRS authority via SRID number when reading from postgis #3329
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nicholas-ys-tan, this approach looks sound to me. To me it makes sense that we treat postgis rather than the proj.db as the point of truth for data coming out of postgis. I think passing in the dataframe rather than the connection makes sense - especially for the chunksize case as it means this is only queried once upfront.
geopandas/io/sql.py
Outdated
crs = "epsg:{}".format(srid) | ||
|
||
if spatial_ref_sys_df is not None: | ||
entry = spatial_ref_sys_df.loc[spatial_ref_sys_df["srid"] == srid] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that spatial_ref_sys is not exhaustive, in principle this could be an empty match. It could be worth checking the length of this is zero and falling back to epsg as the authority with a warning. (I expect it's probably not straightfoward to find such an example for test coverage, but I think that's okay as this is an edge case of CRS in proj but not postgis, or messed up postgis where spatial_ref_sys is missing data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @m-richards ,
I've added an additional check that the srid
is in the spatial_ref_sys
table, otherwise falling back to epsg and added a user warning when it does this. I've written a test to check this using mock.patch
geopandas/io/sql.py
Outdated
spatial_ref_sys_sql = "SELECT * FROM spatial_ref_sys" | ||
spatial_ref_sys_df = ( | ||
pd.read_sql(spatial_ref_sys_sql, con) if crs is None else None | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an idea how large this table typically is? Because we are now reading this full table every time when calling read_postgis
, and ideally we would already do the filtering step here in the query, instead of doing it on the dataframe. I know we don't know the srid
value at this point for specifying it in the query, but I wonder if we could maybe pass con
to _df_to_geodf
and then do this query there with directly filtering on srid. We would then only have to "cache" this somehow for repeated calls to _df_to_geodf
from a single read_postgis
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know for sure if it's typical, but my test postgis database seems to populate this with 8500 rows, not ideally to be fetching on every call.
Perhaps it would be cleaner to try and pre-process the first row to determine the srid so it / the crs can be passed in? - though this would be a bit messy in the generator case I guess.
(We could maybe try and parse the input sql and try and paste on a limit clause to do a single fetch for the srid and then the full query, but at that point we may as well be trying to parse the sql to call Find_SRID directly. I'd prefer if we didn't have any sql parsing though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mine is 8000 rows with approx 7.3mb
postgres=# SELECT pg_size_pretty( pg_total_relation_size('spatial_ref_sys') );
pg_size_pretty
----------------
7280 kB
(1 row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a decent chunk of data to have to transfer each time. I would try to look into querying only the one srid
value that is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche and @m-richards
I've pushed new commits where the connection is passed in and the SRID is queried from the SRS table with a filter in the sql statement, and cached.
geopandas/io/.sql.py.swp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file was added by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apologies, missed that, deleted
geopandas/io/sql.py
Outdated
warnings.warn(warning_msg, UserWarning, stacklevel=2) | ||
crs = "epsg:{}".format(srid) | ||
|
||
if not spatial_ref_sys_df.empty: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will error if spatial_ref_sys_df = None
?
I think you can put this part in an else
clause of try/except/else
, so that it is only executed if there was no exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I missed that would happen
added else
and removed the consequent redundant spatial_ref_sys_df = None
assert df.crs == "ESRI:54052" | ||
|
||
@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed") | ||
@mock.patch("shapely.get_srid") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe could also add a test mocking _get_spatial_ref_sys_df
to raise an error? (to improve test coverage, otherwise would have to delete the spatial_ref_sys table to get something similiar?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added additional test that would raise pandas.errors.DatabaseError
Can you also update with latest main to get the CI fix from #3339? |
Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
Co-authored-by: Matt Richards <45483497+m-richards@users.noreply.github.com>
73e5935
to
0c13933
Compare
Rebased but still seem to have one failing CI due to codecov in windows, is this from my end? |
ignore that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, looks good!
Should resolve #2545
It looks at the
spatial_ref_sys
table in postgis to look up the auth name instead of assuming EPSG. However, if for whatever reason the table does not exist, it should default back to current behaviour of assuming epsg.Not sure if this is the best way though, I didn't want to pass the connection data into
_df_to_geodf
, so read it into a dataframe first. Would be nice if such a look up for the authority name could be done purely inpyproj.CRS
but I couldn't find anything.Please advise if there is a more aesthetic approach to this.