Fix duplicate context point sampling in "int" strategy for pandas #153
Conversation
tom-andersson left a comment:
Hi @jonas-scholz123, thanks for the PR! Happy to merge, but some initial thoughts:
I wouldn't characterise this as a bug per se, because we may want the model to learn to handle multiple complementary observations at the same spatial location (which get encoded as a larger density channel blob).
That said, we should be more explicit about the replacement behaviour. I can imagine the `replace=True` case being problematic when the number of points being sampled is close to the dataset size (so duplicates are likely) and the user needs exactly N context points (e.g. for a sensor placement experiment).
My main concern is breaking backwards compatibility, with the `TaskLoader` now triggering a numpy `ValueError` if the user requests more points than the dataset contains. Please could you either:
- make `replace` an attribute of the `TaskLoader` with an easy-to-understand variable name, which defaults to `False` but users can set to `True` if needed, or
- explicitly catch `N > df.index.size` and raise a `SamplingTooManyPointsError` with an intuitive error message, which you `assertRaises` in your unit test (a rough sketch follows below).
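A minimal sketch of that second option (illustrative only; `SamplingTooManyPointsError` is a proposed name, not an existing deepsensor class, and the check would live wherever the pandas "int" sampling happens):

```python
class SamplingTooManyPointsError(Exception):
    """Raised when more context points are requested than the dataset contains."""


# Sketch of the check inside the pandas "int" sampling branch (not the actual TaskLoader code):
if N > df.index.size:
    raise SamplingTooManyPointsError(
        f"Requested {N} points but only {df.index.size} are available; "
        "sampling without replacement cannot return more points than exist."
    )
```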
tests/test_task_loader.py (Outdated)
    x_coords = task["X_c"][0][0]
    y_coords = task["X_c"][0][1]
Call these `x1` and `x2` to follow the notation in the rest of the codebase.
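i.e., presumably:

```python
x1 = task["X_c"][0][0]
x2 = task["X_c"][0][1]
```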
tests/test_task_loader.py (Outdated)
    num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)
    task = tl("2020-01-01", num_unique_coords, 10)
Can you use explicit keyword arguments here? Took me a minute to remember what the 10 refers to from the position, haha. Could also do something like `not_relevant_for_this_test=10`.
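Presumably something along these lines (the keyword names `context_sampling` and `target_sampling` are assumed from the TaskLoader call signature):

```python
task = tl(
    "2020-01-01",
    context_sampling=num_unique_coords,  # assumed keyword name
    target_sampling=10,  # the "10" that isn't relevant to this test
)
```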
Sure, will do
tests/test_task_loader.py (Outdated)
    num_unique_coords = len(self.df.xs("2020-01-01", level="time").index)
    task = tl("2020-01-01", num_unique_coords, 10)
Just noting two small issues with this test:
- even with `replace=True` as we had before, we could by chance not sample any duplicates and pass this test. p(no_duplicates) goes down as N approaches the size of the possible values though, which you've done here, so maybe add a comment to that effect.
- you aren't setting the random seed of the TaskLoader with `seed_override` in the call site here, so the test will be stochastic; is that deliberate? (See the sketch after this list.)
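For the second point, pinning the seed would presumably look something like this (positional arguments as in the call quoted above; `seed_override` is the TaskLoader argument named in the comment):

```python
task = tl("2020-01-01", num_unique_coords, 10, seed_override=42)
```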
Yeah I noticed those, but the chance that you don't hit a single duplicate point when sampling 1000x from a dataset with 1000 locations is basically 0, so I didn't bother to make it more robust!
Happy to set a random seed
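For reference, a quick back-of-the-envelope check of that claim (plain Python, not part of the PR):

```python
import math

# P(no duplicates when drawing 1000 samples *with* replacement from 1000 locations)
# = 1000! / 1000**1000, computed in log space to avoid underflow.
n = 1000
log10_p = (math.lgamma(n + 1) - n * math.log(n)) / math.log(10)
print(f"P(no duplicates) ~= 10^{log10_p:.0f}")  # roughly 10^-432
```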
Thanks for your CR! Some thoughts:
I'm not sure I follow: do we want the same data frame row to appear multiple times in the same task/context set/encoding? Why? I would understand if we had different measurements at the same location and wanted to be able to sample those, but in what scenario would users want the same observation twice?
I think this is quite unintuitive; e.g. when I specify a "split" of 80% context / 20% target, I would expect
I'm not sure which is the case now, but I think 3 and either 1 or 2 are currently not the case.
Does deepsensor guarantee backwards compatibility at this point? I would argue that
554cd32 to 4a3f9de (force-push)
I've now made the changes that I think lead to the most intuitive API. Let me know if breaking backwards compatibility is a hard no; happy to change it to your approach (1) in that case.
4a3f9de to 0a09d63 (force-push)
Thanks Jonas, and sorry for the delay. LGTM!
📝 Description
Bugfix: previously, when sampling an integer number of context points from a pandas DataFrame, duplicate points could be sampled. This led to unexpected behaviour such as duplicate context points, and not all points being sampled when passing N equal to the total number of available context points.
This fixes the issue by passing `replace=False` in the appropriate function call, and adds a unit test to cover the new behaviour.
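As a standalone illustration of the difference (not the actual TaskLoader code, just the sampling primitive involved):

```python
import numpy as np

rng = np.random.default_rng(42)
locations = np.arange(5)  # five available context points

with_dupes = rng.choice(locations, size=5, replace=True)   # may repeat points
no_dupes = rng.choice(locations, size=5, replace=False)    # each point exactly once

print(with_dupes)  # e.g. [0 3 3 1 4] -- duplicates possible
print(no_dupes)    # a permutation of all five locations
```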
✅ Checklist before requesting a review
- [x] dev environment set up with `pip install .[dev]` and `pre-commit install` (or alternatively, manually running `ruff format` before committing)

If changing or adding source code:
- [x] unit tests pass (`pytest`).

If changing or adding documentation:
- [x] the docs build successfully (`jupyter-book build docs --all`) and the changes look good from a manual inspection of the HTML in `docs/_build/html/`.