TST: Proof of concept reducing IO dependence in tests #3250

m-richards · 2024-04-14T06:22:57Z

In #3237 @jorisvandenbossche left the following comment re the new tests I added there:

I know this is copied from the notebook, but would prefer for new tests to use a small dummy geodataframe (to avoid having
our tests depend too much on IO). Could reuse the dfs fixture and just use one of both dataframes.

Originally posted by @jorisvandenbossche in #3237 (comment)

I decided to have a look and see if it is possible to to this in a more generic way, since we use the nybb dataset quite a lot throughout the tests. (now it's also possible Joris meant this from a test runtime perspective as well, in which case I expect moving things from fiona / pyogrio to json will actually slow things down - but it could be possible to instead load files from an IO engine if present)

Here I've looked at if we were to store a copy of nybb in geojson (which we can read without fiona/pyogrio) for usage in the the non IO tests. We also need to then pass down the CRS, and the data dtype schema of the columns since geojson is not strongly typed. Having a quick look, this seems fairly feasible - I've just converted the usages in test_geodataframe to get some thoughts.

Just realised now, it may be better to target geoarrow instead of geojson? Pyarrow is currently an optional dependency, but pandas will make it required in 3.0 and thus it will transitively be required.

One thing I am not sure about is if this needs to handle platform specific int widths, not sure what shapefile does (I've generated the geojsons on windows, so if all the linux tests fail I expect that is why)

[pull] main from geopandas:main

martinfleis · 2024-04-14T08:36:32Z

I suppose there are two different cases where we may want to change the dependency on IO. One is to ensure that we don't need to skip stuff if GDAL is not available via pyogrio or Fiona. Your suggestion resolves that. The other is to speed up tests by depending on small data created on the fly in cases, where we don't need to use real-world geometries. From that perspective, this goes in the wrong direction.

I think that we could eventually try to do both, with preference for the latter when we have a choice.

m-richards · 2024-04-14T11:46:39Z

The other is to speed up tests by depending on small data created on the fly in cases, where we don't need to use real-world geometries. From that perspective, this goes in the wrong direction.

Agree that ideally we wouldn't use any real world data except where we absolutely have to. But there's a trade off between maintainer time to convert existing tests to use new data (and reviewer time to make sure the new data is suitably equivalent) and CI/ test runtime. If I compare the test time of this PR vs main on CI (just looking at 312-latest-conda-forge as a sample, it's not significantly different and if I run test_geodataframe locally it's between 1-3% slower from the few quick tests comparisons I've done - which aren't going to control for background cpu differences all that well. So if it's a step in the wrong direction in terms of performance it's hopefully not a very big one.

martinfleis · 2024-04-14T11:58:56Z

Yeah, the trade-off needs to be taken into account. That is partially why I said we could try to do that rather than should. I don't feel strongly about the need to change how our tests currently work and would personally try to focus our attention on different things than that.

m-richards and others added 6 commits April 5, 2024 21:29

Merge pull request #13 from geopandas/main

7e6250d

[pull] main from geopandas:main

Merge branch 'main' of github.com:m-richards/geopandas

33c729d

trial generating geojson test data

cd0b15a

structure in place

44c7282

switch to use nybb fixture

d0b36bd

more cleanup

b95e314

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TST: Proof of concept reducing IO dependence in tests #3250

TST: Proof of concept reducing IO dependence in tests #3250

m-richards commented Apr 14, 2024 •

edited

martinfleis commented Apr 14, 2024

m-richards commented Apr 14, 2024

martinfleis commented Apr 14, 2024

TST: Proof of concept reducing IO dependence in tests #3250

Are you sure you want to change the base?

TST: Proof of concept reducing IO dependence in tests #3250

Conversation

m-richards commented Apr 14, 2024 • edited

martinfleis commented Apr 14, 2024

m-richards commented Apr 14, 2024

martinfleis commented Apr 14, 2024

m-richards commented Apr 14, 2024 •

edited