Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Method to sample points randomly from within geometries #2860

Merged
merged 19 commits into from
May 1, 2023

Conversation

martinfleis
Copy link
Member

@martinfleis martinfleis commented Apr 4, 2023

Ad discussed in person and during the dev call, to help the review process of #2363, it was decided to split the PR into multiple smaller ones dealing with one task per PR.

This PR implements the sampling based on samples either from a uniform distribution or using pointpats. Mostly based off #2363, with some minor changes and exposure of seed and random generator for better control.

@martinfleis martinfleis added this to the 0.13 milestone Apr 4, 2023
@martinfleis
Copy link
Member Author

The CI failure is unrelated and is also on main.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

geopandas/base.py Outdated Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
geopandas/base.py Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
geopandas/tools/_random.py Outdated Show resolved Hide resolved
geopandas/tests/test_geom_methods.py Show resolved Hide resolved
Copy link
Member

@m-richards m-richards left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A couple of minor documentation comments.

This is something I will be quite keen to use myself, and replace where I've written sampling code by hand.

doc/source/docs/user_guide/sampling.ipynb Outdated Show resolved Hide resolved
doc/source/docs/user_guide/sampling.ipynb Outdated Show resolved Hide resolved
doc/source/docs/user_guide/sampling.ipynb Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
geopandas/base.py Outdated Show resolved Hide resolved
xmin, ymin, xmax, ymax = geom.bounds
candidates = []
while len(candidates) < size:
batch = points_from_xy(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as a first implementation, but this sampling is potentially quite wasteful, if you have a lot of points, and your first sample gets say 95% of size, you would only need to target for another 5% but this will then try to draw the full length of size again. (But this is perhaps better than drawing too few points by guessing how many sampled points will be accepted on the next iteration).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is tricky to get the heuristic right if we wanted to use different size. It super depends on the convexity of each polygon. Maybe a less wasteful option would be to go with a number larger that size initially to have a higher chance of hitting the size at one go. But I'd leave that for later if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, sounds good

Copy link
Member

@ljwolf ljwolf Apr 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I kept the chunk size constant as a first pass to be conservative.

I think the statistically efficient method is to sample each round proportional to the hit rate times the remaining sample size, but that can cause the size of a round to get very large very quickly.

"source": [
"## Variable number of points\n",
"\n",
"You can also sample different number of points from different geometries if you pass an array specifying the size of the sample per geometry."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow to pass a column name as well? (instead of just the values, so I assume gdf.sample_points(gdf["col"]) works for sure)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have in plotting? Not now. Do we want to?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just add a changelog note?

@jorisvandenbossche jorisvandenbossche changed the title ENH: random sampling based on #2363 ENH: Method to sample points randomly from within geometries May 1, 2023
@martinfleis
Copy link
Member Author

Can you just add a changelog note?

Done.

@jorisvandenbossche
Copy link
Member

Failures are unrelated: the py38 is one is related to pyogrio (if that keeps failing, might be something with the 0.6.0 release), and the dev build is failing because of scikit-learn/scikit-learn#26290

@jorisvandenbossche jorisvandenbossche merged commit e8ddf25 into geopandas:main May 1, 2023
15 of 17 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @ljwolf and @martinfleis!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants