Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate datatree.py module into xarray.core. #6

Closed

Conversation

owenlittlejohns
Copy link
Collaborator

@owenlittlejohns owenlittlejohns commented Feb 19, 2024

This draft PR shows initial work to migrate the datatree.py module to xarray/core/datatree.py. Once the treenode.py PR is merged, I'll likely create a new branch putting the new work here on top of the latest from the main xarray repository main branch.

Most of the changes are import path changes, and type-hints, but there are a couple of things I wanted to ask about, so I'll leave comments in the code relating to those.

  • completes migration step for datatree/datatree.py Track merging datatree into xarray pydata/xarray#8572
  • Tests added
  • [N/A] User visible changes (including notable bug fixes) are documented in whats-new.rst
    Internal Changes (including notable bug fixes) are documented in whats-new.rst
  • [N/A] New functions/methods are listed in api.rst

@@ -1,15 +1,11 @@
# import public API
from .datatree import DataTree
from .extensions import register_datatree_accessor
Copy link
Collaborator Author

@owenlittlejohns owenlittlejohns Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without these changes, migrating datatree.py this early in proceedings led to circular dependencies:

  • xarray.core.datatree
  • ➡️ a few things, e.g., xarary.datatree_.datatree.formatting
  • ➡️ xarray.datatree_.datatree.__init__.py
  • ➡️ xarray.datatree_.datatree.extensions
  • ➡️ xarray.core.datatree.

I am a bit wary here, though, as I don't want to lose track of the things being pulled out of the public API. I'm guessing everything that was originally in here should ultimately end up in the main xarray public API? And that this is essentially the final step in the migration (ignoring future refactoring).

@TomNicholas - Maybe all I'm asking here is: should we make the addition of the datatree public API items into the xarray public API an explicit step on the main xarray issue?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing everything that was originally in here should ultimately end up in the main xarray public API?

Yes, I don't think there is anything superfluous.

And that this is essentially the final step in the migration (ignoring future refactoring).

should we make the addition of the datatree public API items into the xarray public API an explicit step on the pydata#8572?

Yep - I'll add it to the list.

@@ -1,6 +1,7 @@
import pytest

from xarray.datatree_.datatree import DataTree, register_datatree_accessor
from xarray.core.datatree import DataTree
from xarray.datatree_.datatree.extensions import register_datatree_accessor
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the minor change to avoid circular dependencies (register_datatree_accessor is no longer part of the public datatree API).

@@ -166,8 +167,7 @@ def test_assign_when_already_child_with_variables_name(self):
dt.ds = new_ds


class TestGet:
...
class TestGet: ...
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stuff at the bottom here is just from me running pre-commit run --all-files.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of wish I had thought of that. 🙃

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah - I regularly use that command whilst developing

from .ops import (
from xarray.datatree_.datatree.formatting import datatree_repr
from xarray.datatree_.datatree.formatting_html import (
datatree_repr as datatree_repr_html,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comments on the datatree public API, this is the only other change that isn't really import paths or type hints. It's incredibly minor, happy to revert back to importing the whole datatree.formatting and datatree.formatting_html modules if that is preferred.

@@ -108,6 +108,10 @@ Internal Changes
`Matt Savoie <https://github.com/flamingbear>`_ and `Tom Nicholas
<https://github.com/TomNicholas>`_.

- Migrates ``datatree`` functionality into ``xarray/core``. (:pull: `8757`)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you won't know this actual PR value until you open against pydata/xarray (draft or real)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah - good catch!

@TomNicholas
Copy link

FYI the treenode.py module was merged now, so you should be able to open this against xarray main

@owenlittlejohns
Copy link
Collaborator Author

Thanks for the heads up @TomNicholas - I will get a PR up against the latest HEAD of the main xarray repository today. (I might clean up the commit history a bit and cherry pick the commits on a fresh branch, but will get that squared away ASAP)

Copy link
Owner

@flamingbear flamingbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just submitting so you can see what I found. No approval or not since you're going to put it up to xarray/main

xarray/core/datatree.py Show resolved Hide resolved
@@ -1027,7 +1021,7 @@ def drop_nodes(
if extra:
raise KeyError(f"Cannot drop all nodes - nodes {extra} not present")

children_to_keep = OrderedDict(
children_to_keep = dict(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately on this line, you can get rid of the dict wrapper and just use

children_to_keep = {name: child for name, child in self.children.items() if name not in names}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue.

https://github.com/flamingbear/xarray/blob/DAS-2062-migrate-datatree-module/xarray/core/datatree.py#L1447

This method as_array, I don't think exists on Dataset or DatasetView.

def as_array(self) -> DataArray:
    return self.ds.as_dataarray()

I can follow it back to Dataset, but I think that's a numpy method. My guess is that it should be self.ds.to_dataarray() but that's a guess.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset used to have a to_array method but it got renamed to the clearer to_dataarray sometime after I wrote to_array into datatree.

Copy link
Collaborator Author

@owenlittlejohns owenlittlejohns Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, given that the underlying method in Dataset has changed, would the recommendation here be to rename this method to to_dataarray, to provide consistency between the classes?

Copy link

@TomNicholas TomNicholas Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely. I don't think we need to be particularly concerned with exact backwards compatibility with a prototype, though it might be nice to keep a record of any small "breaking changes" like this so we can include it when we make xr.DataTree public.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue, you didn't cause.

https://github.com/flamingbear/xarray/blob/DAS-2062-migrate-datatree-module/xarray/core/datatree.py#L1508

This doc string looks like it was pulled from another location (Dataset and Dataarray's to_zarr) and then formatted badly. The issue is that In this case, the mode is given a default value of "w-" so that unless None is specified, the following remarks are incorrect.
I would probably update this below.

The default mode is “a” if append_dim is set. Otherwise, it is “r+” if region is set and w- otherwise.

to be

The default mode is “w-".

The to_zarr in dataset/dataarray does handle the None case*, but this currently does not. in create empty zarr group it just uses the mode passed in and there no logic to determine the correct type.

  • actually I thought that was in the dataset/dataarray, but it's in the api and now I'm second guessing myself, maybe that should be fixed when _datatree_to_zarr is migrated from io.py

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@owenlittlejohns I think something here needs updating.

@owenlittlejohns
Copy link
Collaborator Author

Here's the real PR for this: pydata#8789!

@flamingbear flamingbear deleted the branch mhs/migrate_treenode February 28, 2024 22:02
@owenlittlejohns owenlittlejohns deleted the DAS-2062-migrate-datatree-module branch March 26, 2024 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants