Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: check CRS authority via SRID number when reading from postgis #3329

Merged
merged 10 commits into from
Jun 17, 2024

Conversation

nicholas-ys-tan
Copy link
Contributor

@nicholas-ys-tan nicholas-ys-tan commented Jun 6, 2024

Should resolve #2545

It looks at the spatial_ref_sys table in postgis to look up the auth name instead of assuming EPSG. However, if for whatever reason the table does not exist, it should default back to current behaviour of assuming epsg.

Not sure if this is the best way though, I didn't want to pass the connection data into _df_to_geodf, so read it into a dataframe first. Would be nice if such a look up for the authority name could be done purely in pyproj.CRS but I couldn't find anything.

Please advise if there is a more aesthetic approach to this.

@nicholas-ys-tan nicholas-ys-tan changed the title BUGL check for authority of CRS based on SRID number BUG: check for authority of CRS based on SRID number Jun 6, 2024
@nicholas-ys-tan nicholas-ys-tan marked this pull request as ready for review June 7, 2024 12:50
Copy link
Member

@m-richards m-richards left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nicholas-ys-tan, this approach looks sound to me. To me it makes sense that we treat postgis rather than the proj.db as the point of truth for data coming out of postgis. I think passing in the dataframe rather than the connection makes sense - especially for the chunksize case as it means this is only queried once upfront.

crs = "epsg:{}".format(srid)

if spatial_ref_sys_df is not None:
entry = spatial_ref_sys_df.loc[spatial_ref_sys_df["srid"] == srid]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that spatial_ref_sys is not exhaustive, in principle this could be an empty match. It could be worth checking the length of this is zero and falling back to epsg as the authority with a warning. (I expect it's probably not straightfoward to find such an example for test coverage, but I think that's okay as this is an edge case of CRS in proj but not postgis, or messed up postgis where spatial_ref_sys is missing data)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @m-richards ,

I've added an additional check that the srid is in the spatial_ref_sys table, otherwise falling back to epsg and added a user warning when it does this. I've written a test to check this using mock.patch

geopandas/io/sql.py Outdated Show resolved Hide resolved
@m-richards m-richards changed the title BUG: check for authority of CRS based on SRID number BUG: check CRS authority via SRID number when reading from postgis Jun 9, 2024
Comment on lines 177 to 180
spatial_ref_sys_sql = "SELECT * FROM spatial_ref_sys"
spatial_ref_sys_df = (
pd.read_sql(spatial_ref_sys_sql, con) if crs is None else None
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea how large this table typically is? Because we are now reading this full table every time when calling read_postgis, and ideally we would already do the filtering step here in the query, instead of doing it on the dataframe. I know we don't know the srid value at this point for specifying it in the query, but I wonder if we could maybe pass con to _df_to_geodf and then do this query there with directly filtering on srid. We would then only have to "cache" this somehow for repeated calls to _df_to_geodf from a single read_postgis call.

Copy link
Member

@m-richards m-richards Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know for sure if it's typical, but my test postgis database seems to populate this with 8500 rows, not ideally to be fetching on every call.

Perhaps it would be cleaner to try and pre-process the first row to determine the srid so it / the crs can be passed in? - though this would be a bit messy in the generator case I guess.

(We could maybe try and parse the input sql and try and paste on a limit clause to do a single fetch for the srid and then the full query, but at that point we may as well be trying to parse the sql to call Find_SRID directly. I'd prefer if we didn't have any sql parsing though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mine is 8000 rows with approx 7.3mb

postgres=# SELECT pg_size_pretty( pg_total_relation_size('spatial_ref_sys') );
 pg_size_pretty 
----------------
 7280 kB
(1 row)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a decent chunk of data to have to transfer each time. I would try to look into querying only the one srid value that is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jorisvandenbossche and @m-richards

I've pushed new commits where the connection is passed in and the SRID is queried from the SRS table with a filter in the sql statement, and cached.

geopandas/io/sql.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was added by accident?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apologies, missed that, deleted

warnings.warn(warning_msg, UserWarning, stacklevel=2)
crs = "epsg:{}".format(srid)

if not spatial_ref_sys_df.empty:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will error if spatial_ref_sys_df = None ?
I think you can put this part in an else clause of try/except/else, so that it is only executed if there was no exception

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I missed that would happen

added else and removed the consequent redundant spatial_ref_sys_df = None

assert df.crs == "ESRI:54052"

@pytest.mark.skipif(not HAS_PYPROJ, reason="pyproj not installed")
@mock.patch("shapely.get_srid")
Copy link
Member

@jorisvandenbossche jorisvandenbossche Jun 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could also add a test mocking _get_spatial_ref_sys_df to raise an error? (to improve test coverage, otherwise would have to delete the spatial_ref_sys table to get something similiar?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added additional test that would raise pandas.errors.DatabaseError

@jorisvandenbossche
Copy link
Member

Can you also update with latest main to get the CI fix from #3339?

@nicholas-ys-tan
Copy link
Contributor Author

nicholas-ys-tan commented Jun 17, 2024

Can you also update with latest main to get the CI fix from #3339?

Rebased but still seem to have one failing CI due to codecov in windows, is this from my end?

@martinfleis
Copy link
Member

is this from my end?

ignore that.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, looks good!

geopandas/io/sql.py Outdated Show resolved Hide resolved
geopandas/io/sql.py Outdated Show resolved Hide resolved
@jorisvandenbossche jorisvandenbossche merged commit fcf9bc2 into geopandas:main Jun 17, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: CRS not found exception loading data frame
4 participants