Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Adding issubset function for Dependent Coordinates #414

Merged
merged 5 commits into from Jul 13, 2020

Conversation

mpu-creare
Copy link
Contributor

It was missing and breaks the Sentinel datalib (and possibly others).

@jmilloy Would you mind quickly reviewing this -- I want to merge it in for the newest release.

…sing and breaks the Sentinel datalib (and possibly others)
@mpu-creare mpu-creare requested a review from jmilloy July 8, 2020 20:44
@mpu-creare mpu-creare self-assigned this Jul 8, 2020
@mpu-creare mpu-creare added the bug Something isn't working label Jul 8, 2020
@mpu-creare mpu-creare added this to In progress in 2.2.0 via automation Jul 8, 2020
@jmilloy
Copy link
Collaborator

jmilloy commented Jul 8, 2020

Looking at this now.

@coveralls
Copy link

coveralls commented Jul 8, 2020

Coverage Status

Coverage decreased (-0.1%) to 91.148% when pulling d59561d on feature/DependentCoords-issubset into 7e9268f on develop.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 8, 2020

This is what I would have expected for DependentCoordinates.issubset, and passes all of your tests.

    def issubset(self, other):
        mine = zip(*[self[dim].coordinates.ravel() for dim in self.dims])
        other = zip(*[other[dim].coordinates.ravel() for dim in self.dims])
        return set(mine).issubset(other)

I don't really follow what you mean by "From this point forward, it gets expensive. We're essentially doing NN interpolation here."

I think there is some more work to do here.

  • DependentCoordinates.issubset needs to handle other of type Coordinates
  • I don't think that StackedCoordinates.issubset is correct.
  • I'm not sure that Coordinates.issubset handles DependentCoordinates correctly.

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 8, 2020

I can look in more detail (basically write the tests that I think are failing) in the morning, and probably fix them before lunch.

@mpu-creare
Copy link
Contributor Author

I can look in more detail (basically write the tests that I think are failing) in the morning, and probably fix them before lunch.

Please do. As you can see I really don't know what I'm doing.

mine = zip(*[self[dim].coordinates.ravel() for dim in self.dims])
other = zip(*[other[dim].coordinates.ravel() for dim in self.dims])
return set(mine).issubset(other)

is a much cleaner implementation of what I was trying to do near the end of the function. But it could be very expensive computationally for cases with lots of coordinates. You are essentially finding the subset of overlapping data points which you want to pull out of the dataset if the result is True. If False, hopefully it will shortcut very rapidly. So, I chose to do that last, and catch any of the easier-to-find falses before that. Although arguably the 1-D comparisons (not pairwise) might be just as expensive...

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 9, 2020

I don't think you save any time by checking dimensions separately in the False case, and obviously it takes a lot longer in the True case. Each one of these:

    if not all([self[dim].issubset(other) for dim in self.dims]):
            return False

takes about the same amount of time as one of these:

mine = zip(*[self[dim].coordinates.ravel() for dim in self.dims])
other = zip(*[other[dim].coordinates.ravel() for dim in self.dims])
mine.issubset(theirs)

You can do a million coordinates with 3 dimensions in less than a second.

@mpu-creare
Copy link
Contributor Author

Ok. I'll fix it up.

@mpu-creare
Copy link
Contributor Author

Or... are you working on this right now?

There are some difficult cases with other of type Coordinates.

The easiest is when it contains stacked coordinates with the same dimensions:

    if self.name in other.dims:
        return self.issubset(other[self.name])

Another easy case is when the necessary dims are all unstacked:

    if all(dim in other.dims for dim in self.dims):
        return all(self[dim].issubset(other[dim]) for dim in self.dims)

A more challenging case is when there are stacked or dependent coordinates
that match but may be transposed or contain additional dimensions:

    for dim, coords in other.items():
        if all(d in coords.dims for d in self.dims):
            if isinstance(coords, StackedCoordinates):
                o = StackedCoordinates([other[dim] for dim in self.dims])
            elif isinstance(coords, DependentCoordinates):
                o = StackedCoordinates([other[dim].coordinates.ravel() for dim in self.dims], dims=self.dims)
            return self.issubset(o)

The solution here covers all of the above cases, plus additional cases where
the necessary dimensions are mixed (stacked/dependent and unstacked).
Copy link
Collaborator

@jmilloy jmilloy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed changes StackedCoordinates and DependentCoordinates issubset fixes and tests.

Each issubset method accepts either a Coordinates object (so that Coordinates.issubset can call it with enough information for mixed coordinates types) or an object of the same type, which is where the real subset check occurs.

This emphasizes how #268 would be a nice thing to get around to!

@mpu-creare
Copy link
Contributor Author

@jmilloy Is this ready to go?

@jmilloy
Copy link
Collaborator

jmilloy commented Jul 13, 2020

Yes, you could re-review, but I think it is good to go.

@mpu-creare mpu-creare merged commit 9b9022b into develop Jul 13, 2020
2.2.0 automation moved this from In progress to Done Jul 13, 2020
@mpu-creare mpu-creare deleted the feature/DependentCoords-issubset branch July 13, 2020 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
No open projects
2.2.0
  
Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants