ENH: Method to sample points randomly from within geometries #2860
The CI failure is unrelated and is also on main.
Looking good!
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…andas into simple_sampling
Looks good! A couple of minor documentation comments.
This is something I will be quite keen to use myself, and replace where I've written sampling code by hand.
xmin, ymin, xmax, ymax = geom.bounds
candidates = []
while len(candidates) < size:
    batch = points_from_xy(
I think this is fine as a first implementation, but this sampling is potentially quite wasteful. If you have a lot of points and your first sample gets, say, 95% of `size`, you would only need to target another 5%, but this will then try to draw the full length of `size` again. (But this is perhaps better than drawing too few points by guessing how many sampled points will be accepted on the next iteration.)
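To make the concern concrete, here is a minimal sketch of constant-batch rejection sampling in plain Python. It is not the PR's implementation: `sample_in_region`, the `contains` predicate, and the unit-disc test region are hypothetical stand-ins for the geometry and `points_from_xy` machinery in the diff. It counts how many candidates are drawn in total.

```python
import random


def sample_in_region(contains, bounds, size, seed=None):
    """Rejection-sample `size` points, drawing `size` candidates every round."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    accepted = []
    drawn = 0
    while len(accepted) < size:
        # Each round draws a full batch of `size` candidates from the bounding
        # box, even when only a handful of points are still missing.
        batch = [
            (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            for _ in range(size)
        ]
        drawn += len(batch)
        accepted.extend(p for p in batch if contains(p))
    # Surplus accepted points are discarded.
    return accepted[:size], drawn


def in_unit_disc(p):
    # Hypothetical test region: the unit disc inside the box [-1, 1] x [-1, 1].
    return p[0] ** 2 + p[1] ** 2 <= 1.0


points, drawn = sample_in_region(in_unit_disc, (-1.0, -1.0, 1.0, 1.0), 1000, seed=0)
```

For a disc the acceptance rate is about 79%, so the first round falls short and the second round draws another full 1000 candidates to fill a gap of roughly 200 points.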
It is tricky to get the heuristic right if we wanted to use a different size. It depends heavily on the convexity of each polygon. Maybe a less wasteful option would be to start with a number larger than `size` to have a higher chance of hitting `size` in one go. But I'd leave that for later if needed.
Agree, sounds good
Yes, I kept the chunk size constant as a first pass to be conservative.
I think the statistically efficient method is to sample each round proportional to the hit rate times the remaining sample size, but that can cause the size of a round to get very large very quickly.
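One way to read that suggestion is to size each round as the remaining count divided by the hit rate observed so far; that reading is an assumption on my part. A rough sketch in plain Python follows. The function name, the `max_batch` cap, and the add-one smoothing are all hypothetical choices, not part of the PR. The cap is there because a low observed hit rate would otherwise make a round very large.

```python
import math
import random


def sample_adaptive(contains, bounds, size, seed=None, max_batch=100_000):
    """Rejection-sample `size` points, sizing each round from the hit rate."""
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bounds
    accepted = []
    drawn = 0
    hits = 0
    batch_size = size  # first round: no hit-rate estimate is available yet
    while len(accepted) < size:
        batch = [
            (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            for _ in range(batch_size)
        ]
        kept = [p for p in batch if contains(p)]
        drawn += len(batch)
        hits += len(kept)
        accepted.extend(kept)
        remaining = size - len(accepted)
        if remaining > 0:
            # Add-one smoothing keeps the estimate positive when nothing was hit.
            hit_rate = (hits + 1) / (drawn + 1)
            # Cap the round so a tiny hit rate cannot request a huge batch.
            batch_size = min(max_batch, math.ceil(remaining / hit_rate))
    return accepted[:size], drawn


def in_unit_disc(p):
    # Hypothetical test region: the unit disc inside the box [-1, 1] x [-1, 1].
    return p[0] ** 2 + p[1] ** 2 <= 1.0


points, drawn = sample_adaptive(in_unit_disc, (-1.0, -1.0, 1.0, 1.0), 1000, seed=0)
```

On the disc example the later rounds shrink to roughly the number of points still missing, so the total number of draws stays well below the two full batches a constant round size would need.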
"source": [ | ||
"## Variable number of points\n", | ||
"\n", | ||
"You can also sample different number of points from different geometries if you pass an array specifying the size of the sample per geometry." |
Do we allow passing a column name as well? (Instead of just the values; I assume `gdf.sample_points(gdf["col"])` works for sure.)
As we have in plotting? Not now. Do we want to?
Can you just add a changelog note?
Done.
Failures are unrelated: the py38 one is related to pyogrio (if that keeps failing, it might be something with the 0.6.0 release), and the dev build is failing because of scikit-learn/scikit-learn#26290.
Thanks @ljwolf and @martinfleis!
As discussed in person and during the dev call, to help the review process of #2363, it was decided to split the PR into multiple smaller ones, each dealing with one task.
This PR implements the sampling based on samples either from a uniform distribution or using pointpats. Mostly based off #2363, with some minor changes and exposure of seed and random generator for better control.