Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TST: Proof of concept reducing IO dependence in tests #3250

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

m-richards
Copy link
Member

@m-richards m-richards commented Apr 14, 2024

In #3237 @jorisvandenbossche left the following comment re the new tests I added there:

I know this is copied from the notebook, but would prefer for new tests to use a small dummy geodataframe (to avoid having
our tests depend too much on IO). Could reuse the dfs fixture and just use one of both dataframes.

Originally posted by @jorisvandenbossche in #3237 (comment)

I decided to have a look and see if it is possible to to this in a more generic way, since we use the nybb dataset quite a lot throughout the tests. (now it's also possible Joris meant this from a test runtime perspective as well, in which case I expect moving things from fiona / pyogrio to json will actually slow things down - but it could be possible to instead load files from an IO engine if present)

Here I've looked at if we were to store a copy of nybb in geojson (which we can read without fiona/pyogrio) for usage in the the non IO tests. We also need to then pass down the CRS, and the data dtype schema of the columns since geojson is not strongly typed. Having a quick look, this seems fairly feasible - I've just converted the usages in test_geodataframe to get some thoughts.

Just realised now, it may be better to target geoarrow instead of geojson? Pyarrow is currently an optional dependency, but pandas will make it required in 3.0 and thus it will transitively be required.

One thing I am not sure about is if this needs to handle platform specific int widths, not sure what shapefile does (I've generated the geojsons on windows, so if all the linux tests fail I expect that is why)

@martinfleis
Copy link
Member

I suppose there are two different cases where we may want to change the dependency on IO. One is to ensure that we don't need to skip stuff if GDAL is not available via pyogrio or Fiona. Your suggestion resolves that. The other is to speed up tests by depending on small data created on the fly in cases, where we don't need to use real-world geometries. From that perspective, this goes in the wrong direction.

I think that we could eventually try to do both, with preference for the latter when we have a choice.

@m-richards
Copy link
Member Author

The other is to speed up tests by depending on small data created on the fly in cases, where we don't need to use real-world geometries. From that perspective, this goes in the wrong direction.

Agree that ideally we wouldn't use any real world data except where we absolutely have to. But there's a trade off between maintainer time to convert existing tests to use new data (and reviewer time to make sure the new data is suitably equivalent) and CI/ test runtime. If I compare the test time of this PR vs main on CI (just looking at 312-latest-conda-forge as a sample, it's not significantly different and if I run test_geodataframe locally it's between 1-3% slower from the few quick tests comparisons I've done - which aren't going to control for background cpu differences all that well. So if it's a step in the wrong direction in terms of performance it's hopefully not a very big one.

@martinfleis
Copy link
Member

Yeah, the trade-off needs to be taken into account. That is partially why I said we could try to do that rather than should. I don't feel strongly about the need to change how our tests currently work and would personally try to focus our attention on different things than that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants