New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX: Adding issubset function for Dependent Coordinates #414
Conversation
…sing and breaks the Sentinel datalib (and possibly others)
…rdinates.issubset".
Looking at this now. |
This is what I would have expected for def issubset(self, other):
mine = zip(*[self[dim].coordinates.ravel() for dim in self.dims])
other = zip(*[other[dim].coordinates.ravel() for dim in self.dims])
return set(mine).issubset(other) I don't really follow what you mean by "From this point forward, it gets expensive. We're essentially doing NN interpolation here." I think there is some more work to do here.
|
I can look in more detail (basically write the tests that I think are failing) in the morning, and probably fix them before lunch. |
Please do. As you can see I really don't know what I'm doing. mine = zip(*[self[dim].coordinates.ravel() for dim in self.dims])
other = zip(*[other[dim].coordinates.ravel() for dim in self.dims])
return set(mine).issubset(other) is a much cleaner implementation of what I was trying to do near the end of the function. But it could be very expensive computationally for cases with lots of coordinates. You are essentially finding the subset of overlapping data points which you want to pull out of the dataset if the result is True. If False, hopefully it will shortcut very rapidly. So, I chose to do that last, and catch any of the easier-to-find falses before that. Although arguably the 1-D comparisons (not pairwise) might be just as expensive... |
I don't think you save any time by checking dimensions separately in the False case, and obviously it takes a lot longer in the True case. Each one of these:
takes about the same amount of time as one of these:
You can do a million coordinates with 3 dimensions in less than a second. |
Ok. I'll fix it up. |
Or... are you working on this right now? |
There are some difficult cases with other of type Coordinates. The easiest is when it contains stacked coordinates with the same dimensions: if self.name in other.dims: return self.issubset(other[self.name]) Another easy case is when the necessary dims are all unstacked: if all(dim in other.dims for dim in self.dims): return all(self[dim].issubset(other[dim]) for dim in self.dims) A more challenging case is when there are stacked or dependent coordinates that match but may be transposed or contain additional dimensions: for dim, coords in other.items(): if all(d in coords.dims for d in self.dims): if isinstance(coords, StackedCoordinates): o = StackedCoordinates([other[dim] for dim in self.dims]) elif isinstance(coords, DependentCoordinates): o = StackedCoordinates([other[dim].coordinates.ravel() for dim in self.dims], dims=self.dims) return self.issubset(o) The solution here covers all of the above cases, plus additional cases where the necessary dimensions are mixed (stacked/dependent and unstacked).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed changes StackedCoordinates and DependentCoordinates issubset fixes and tests.
Each issubset
method accepts either a Coordinates object (so that Coordinates.issubset
can call it with enough information for mixed coordinates types) or an object of the same type, which is where the real subset check occurs.
This emphasizes how #268 would be a nice thing to get around to!
@jmilloy Is this ready to go? |
Yes, you could re-review, but I think it is good to go. |
It was missing and breaks the Sentinel datalib (and possibly others).
@jmilloy Would you mind quickly reviewing this -- I want to merge it in for the newest release.