Combine2 #122

Merged: 19 commits, Mar 11, 2022

Conversation

@martindurant (Member) commented Feb 3, 2022

  • combine on multiple dimensions at once
  • allow for files with a variable number of chunks (so long as the chunk size is consistent)
  • allow merging into multiple variable names
  • derive the coordinate variable in different ways, including from the file name or a dataset/variable attribute
  • cftime decoding (again!)
  • pre-processing of the input JSONs
  • post-processing of the output
  • Don't concatenate arrays that are the same in every dataset and don't depend on the dimensions being concatenated.
  • Fix coords that should be int being converted to float somewhere.
  • Get inlining to work

(inlining broke when running after the previous test, due to leftover state)
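
Taken together, the list above suggests a call roughly like the following. This is a non-authoritative sketch: the file names, the "realization" attribute, and the identity pre/post hooks are placeholders, and the keyword names are inferred from the bullet list rather than quoted from the code.

```python
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["run1_jan.json", "run1_feb.json",   # hypothetical per-file reference JSONs
     "run2_jan.json", "run2_feb.json"],
    concat_dims=["member", "time"],      # combine on multiple dimensions at once
    coo_map={
        "time": "cf:time",               # decode the time coordinate with cftime
        "member": "attr:realization",    # derive a coord from a dataset attribute
    },
    coo_dtypes={"member": "int32"},      # keep integer coords from becoming float
    preprocess=lambda refs: refs,        # placeholder: pre-process each input's references
    postprocess=lambda refs: refs,       # placeholder: post-process the combined output
    inline_threshold=100,                # inline small chunks into the output JSON
)
combined = mzz.translate()
```
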
@lsterzinger (Collaborator) left a comment

This is awesome Martin, good work! The docs didn't render well for me; I moved the selector list under the coo_map parameter section and made a couple of changes that helped the docs build properly.

@martindurant (Member, Author)

Yeah, we have some mix of numpydoc and whatever my IDE did automatically. Should have marked this WIP...

@lsterzinger (Collaborator)

I don't seem to know how to use GitHub properly 😆 do you mind if I commit directly to the PR? All that changed was the docstring.

@martindurant (Member, Author)

Go ahead. Interestingly, I got a notice that you requested changes, but I saw no changes :)

Coordinates to be stored internally as sets, since duplicates are always fine and make no difference for the coord of interest
@martindurant mentioned this pull request on Feb 21, 2022
@martindurant (Member, Author)

Question:

Don't concatenate arrays that are the same in every dataset and don't depend on the dimensions being concatenated.

Should we avoid guessing this, and require the user to specify which arrays not to concatenate? Then we can avoid all those xarray compat arguments.

@keewis (Contributor) commented Mar 4, 2022

Should we avoid guessing this, and require the user to specify which arrays not to concatenate?

I'd very much support this. However, is there a reason why you want to specify which arrays to skip? At least for the datasets I'm working with, it is most common to concatenate variables that already have concat_dim (so something like xarray's data_vars="minimal" works pretty well).

Otherwise, what if that parameter, in addition to (or instead of) a sequence, could also be a callable that decides, based on the name, attrs, and dims (or .zattrs and .zarray), whether to skip that variable?
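
For concreteness, such a callable might look like the sketch below; the signature and the hard-coded "time" dim are assumptions, not an API this PR defines:

```python
def concat_this(name, zattrs, zarray):
    """Return True if the variable should be concatenated along the concat dim."""
    # In the spirit of xarray's data_vars="minimal": only stack variables
    # that already carry the concat dimension in their metadata.
    return "time" in zattrs.get("_ARRAY_DIMENSIONS", [])
```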

@martindurant (Member, Author)

is there a reason why you want to specify which arrays to skip?

I didn't mean to skip. But let's say there is a 1xN variable that is the same in every dataset: there would be no good reason to produce an MxN 2-d array in the output. Sometimes you can know this, and xarray has a number of ways of guessing, including the "minimal" compat options, but my idea is to require the user to say so rather than rely on obscure flags.

Note that combine2 is designed to be able to get coordinates for the concat dimension(s) from a number of different sources, so you don't necessarily know which variables depend on those dimensions and which don't.
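
For illustration, the explicit form could look like this sketch; identical_dims is an assumed keyword name for "copy these once, never stack them", not something this PR necessarily defines:

```python
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["jan.json", "feb.json"],       # hypothetical inputs
    concat_dims=["time"],
    coo_map={"time": "cf:time"},
    identical_dims=["lat", "lon"],  # assumed keyword: these arrays are the same everywhere
)
```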

@keewis (Contributor) commented Mar 5, 2022

combine2 is designed to be able to get coordinates for the concat dimension(s) from a number of different sources, so you don't necessarily know which variables depend on those dimensions and which don't.

Right. My only concern was that this would make the workflow currently supported by "minimal" more difficult. I guess I can extract that information from the metadata / zarr objects, but I wonder if this could be a helper function provided by kerchunk?
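
Such a helper might just walk a reference set's .zattrs entries, something like this sketch (vars_with_dim is a hypothetical name; it assumes the usual {"version": 1, "refs": {...}} layout and the _ARRAY_DIMENSIONS convention):

```python
import json

def vars_with_dim(refs, dim):
    """Names of variables whose .zattrs lists `dim` under _ARRAY_DIMENSIONS."""
    found = []
    for key, val in refs.items():
        if key.endswith("/.zattrs"):
            attrs = json.loads(val) if isinstance(val, str) else val
            if dim in attrs.get("_ARRAY_DIMENSIONS", []):
                found.append(key.rsplit("/", 1)[0])
    return found

# usage: vars_with_dim(json.load(open("jan.json"))["refs"], "time")
```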

Edit: in case it helps, import pdb; pdb.set_trace() can be replaced by the builtin breakpoint(), and there's the debug-statements hook in pre-commit/pre-commit-hooks to make sure those don't get committed.

@martindurant (Member, Author)

I am wondering whether it isn't a good thing to force users to choose the treatment of each of their data variables. In some cases there will be no way to guess without downloading whole arrays' worth of data (or even then); xarray actually gets it wrong sometimes too.

@martindurant (Member, Author)

(to be sure, this is for cases where the concat coordinate is not already an array in the data, so that its relationship with the other variables is not clear)

@rsignell-usgs (Collaborator)

I like the idea of at least allowing users to specify explicitly what to do with variables, avoiding obscure flags. I think it will be easy for most use cases, and it would make the workflows more transparent.

@martindurant (Member, Author)

@rabernat, all tasks here are done, so I would like to merge. It will break pangeo-forge-recipes' calls to combine. We could add a version check or similar until I get to fixing the other side, or just ignore it.

@rabernat (Contributor)

I think we are ok as long as you don't make a release. If you are going to make a release that is not backwards compatible, it would be great if you could add a version check to the imports in Pangeo Forge Recipes (or even better, just update PFR to use the new code path 😁 ).
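
A sketch of such a version check (the threshold "0.0.6" is a placeholder, since the thread doesn't name the breaking release):

```python
from packaging.version import Version

import kerchunk

if Version(kerchunk.__version__) >= Version("0.0.6"):  # placeholder threshold
    raise ImportError(
        "this pangeo-forge-recipes version predates kerchunk's new combine; "
        "pin kerchunk to an older release or upgrade pangeo-forge-recipes"
    )
```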
