Add data_vars support #73

Merged
JoelLucaAdams merged 10 commits into epochpic:main from Chrimspie:open_mfdataset_variable
Nov 4, 2025

@Chrimspie
Contributor

Using open_mfdataset() can make the machine allocate an infeasible amount of memory to hold the dataset built from the SDF files. This PR adds open_mfdataset_variable(), which avoids requesting excess memory by extracting only a single variable, given as an argument, from the files. The function opens the SDF files one by one, extracts the requested variable, and appends it to a list of data arrays. The output dataset is created only after the variable has been extracted from all relevant SDF files, and it is much smaller than the one open_mfdataset() would create.

Example:

ds = open_mfdataset_variable("path/to/sim/*.sdf", "Derived_Number_Density_electron")
ds["Derived_Number_Density_electron"]
<xarray.DataArray 'Derived_Number_Density_electron' (time: 5, X_Grid_mid: 100,
                                                     Y_Grid_mid: 100)> Size: 400kB
array([[[0.74796254, 0.92079052, 0.83932642, ..., 0.83889195,
         0.8076792 , 0.74392353],
        [1.07479298, 1.18991263, 1.07810109, ..., 1.05977413,
         1.00096084, 0.91650156],
        [1.1129185 , 0.90517832, 1.1042725 , ..., 1.0131306 ,
         0.96613728, 1.13011126],
        ...,
        [1.16194319, 1.06974929, 1.11065085, ..., 0.89871079,
         0.8788494 , 1.05707098],
        [0.89594484, 0.96290852, 1.19858532, ..., 1.04243784,
         1.17190321, 1.2202328 ],
        [0.61944508, 0.74224332, 0.89385901, ..., 1.03116218,
         0.79594316, 0.81182455]]], shape=(5, 100, 100))
Coordinates:
  * time        (time) float64 40B 1.12e-11 2.51e-09 ... 7.506e-09 1.002e-08
  * X_Grid_mid  (X_Grid_mid) float64 800B 0.005 0.015 0.025 ... 0.985 0.995
  * Y_Grid_mid  (Y_Grid_mid) float64 800B 0.005 0.015 0.025 ... 0.985 0.995
Attributes:
    units:       1/m^3
    point_data:  False
    full_name:   Derived/Number_Density/electron
    long_name:   Derived Number Density electron
ds = open_mfdataset_variable("path/to/sim/*.sdf", "Particles_Px_proton", keep_particles=True)
ds["Particles_Px_proton"]
<xarray.DataArray 'Particles_Px_proton' (time: 1, ID_proton: 1920)> Size: 15kB
array([[-1.63747966e-22, -1.35334162e-22, -1.46445120e-22, ...,
         5.64178958e-22, -2.42414743e-21, -1.11331956e-21]],
      shape=(1, 1920))
Coordinates:
  * time                (time) float64 8B 2.417e-09
    X_Particles_proton  (ID_proton) float64 15kB 5.042e-05 ... 0.0005519
Dimensions without coordinates: ID_proton
Attributes:
    units:       kg.m/s
    point_data:  True
    full_name:   Particles/Px/proton
    long_name:   Particles $P_x$ proton
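
The approach described above can be sketched roughly as follows. This is not the code from this PR: load_single_variable and make_fake_file are hypothetical helpers, and small in-memory datasets stand in for SDF files so that the snippet runs without any simulation output.

```python
import numpy as np
import xarray as xr


def make_fake_file(time: float) -> xr.Dataset:
    """Stand-in for one opened SDF file (hypothetical, for illustration only)."""
    x = np.linspace(0.005, 0.995, 100)
    return xr.Dataset(
        {
            "Derived_Number_Density_electron": ("X_Grid_mid", np.ones(100)),
            "Electric_Field_Ez": ("X_Grid_mid", np.zeros(100)),
        },
        coords={"X_Grid_mid": x, "time": time},
    )


def load_single_variable(files, name: str) -> xr.Dataset:
    """Collect one variable from each file, then build the combined dataset."""
    arrays = []
    for ds in files:  # in practice each file would be opened and closed here
        if name in ds.data_vars:
            arrays.append(ds[name])  # keep only the requested variable
    # The combined dataset is created only after every file has been visited.
    return xr.concat(arrays, dim="time").to_dataset(name=name)


files = [make_fake_file(t) for t in (0.0, 1.0, 2.0)]
combined = load_single_variable(files, "Derived_Number_Density_electron")
```

Only the requested variable and its coordinates survive into combined, which is where the memory saving would come from when the files hold many large variables.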

@LiamPattinson
Collaborator

Thanks @Chrimspie!

@JoelLucaAdams Would it be possible to implement this as an optional argument to open_mfdataset of type data_vars: Iterable[str] | None = None? If supplied with None (the default), it could continue to load everything. If supplied with a list/tuple/etc of strings, it could only load those variables and discard everything else.

I think this is roughly how it's handled in plain Xarray, though it also has some extra options.
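
A minimal sketch of the semantics proposed here, written against plain dictionaries rather than SDF files. select_variables is a hypothetical name chosen for illustration, and the variable names in the usage lines are placeholders.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def select_variables(
    variables: Mapping[str, object],
    data_vars: Iterable[str] | None = None,
) -> dict[str, object]:
    """Return every variable when data_vars is None, else only the named ones."""
    if data_vars is None:
        return dict(variables)
    wanted = set(data_vars)  # accepts a list, tuple, set, generator, ...
    return {name: value for name, value in variables.items() if name in wanted}


loaded = {"Electric_Field_Ez": [0.0, 0.1], "Magnetic_Field_Bz": [1.0, 1.1]}
everything = select_variables(loaded)
only_ez = select_variables(loaded, data_vars=("Electric_Field_Ez",))
```

One thing to watch with an Iterable[str] parameter is that a bare string is itself iterable, so passing "Electric_Field_Ez" without wrapping it in a list would be treated as a collection of single characters.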

Co-authored-by: Chrimspie <chrimspie@gmail.com>
@JoelLucaAdams
Collaborator

JoelLucaAdams commented Nov 3, 2025

Thanks again for writing this along with the tests @Chrimspie! (As we discussed, documentation can come later!)

@LiamPattinson That's a great suggestion to stay more in line with how xarray works!

I have refactored the code (and tests) to utilise data_vars instead of a whole new function and included @Chrimspie as a co-author since this was his original idea.

We still load all the data with the SDFDataStore method. SDFPreprocess now takes a new parameter called data_vars and removes every variable not named in that list before the dataset is passed on to xarray's open_mfdataset function.
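
The filtering step could look something like the sketch below. KeepOnly is a hypothetical stand-in written for illustration, not the package's SDFPreprocess, and the dataset is built in memory rather than read from an SDF file.

```python
import numpy as np
import xarray as xr


class KeepOnly:
    """Hypothetical preprocess callable; not the package's SDFPreprocess."""

    def __init__(self, data_vars=None):
        self.data_vars = None if data_vars is None else set(data_vars)

    def __call__(self, ds: xr.Dataset) -> xr.Dataset:
        if self.data_vars is None:
            return ds  # default: keep everything
        unwanted = [name for name in ds.data_vars if name not in self.data_vars]
        return ds.drop_vars(unwanted)


ds = xr.Dataset(
    {
        "Electric_Field_Ez": ("X_Grid_mid", np.zeros(16)),
        "Derived_Number_Density_electron": ("X_Grid_mid", np.ones(16)),
    },
    coords={"X_Grid_mid": np.arange(16.0)},
)
filtered = KeepOnly(["Electric_Field_Ez"])(ds)
nothing_matched = KeepOnly(["Not_A_Variable"])(ds)
```

A callable of this shape can be handed to xarray.open_mfdataset through its preprocess argument, which applies it to each file's dataset before combining. The sketch leaves coordinates untouched, and when no requested name matches it returns a dataset with no data variables without raising.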

A couple of things to note:

  • This new option fails silently if you provide data_vars that don't exist and will just return time. I'm not sure how else to get around this behaviour when the variable you are looking for might be missing from one particular dataset but present in a later file.
  • data_vars doesn't work on particle data with ID_... since those are only dimensions and not coordinates. This behaviour is somewhat expected, as the ID_... dimension can change size over time, so there's no way to make it a coordinate :(
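
One possible way around the silent failure, sketched under the assumption that the variable names in each file can be gathered before deciding: compare the requested names against the union of names seen across all files, and warn only about names found in none of them. warn_if_never_found is a hypothetical helper, not part of sdf-xarray.

```python
import warnings


def warn_if_never_found(requested, names_per_file):
    """Warn about requested names that appear in none of the files."""
    seen = set().union(*names_per_file)
    missing = sorted(set(requested) - seen)
    if missing:
        warnings.warn(f"data_vars not found in any file: {missing}", stacklevel=2)
    return missing


# The second file contains a variable that the first one lacks.
names_per_file = [
    {"Electric_Field_Ez"},
    {"Electric_Field_Ez", "dist_fn_x_px_proton"},
]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = warn_if_never_found(
        ["dist_fn_x_px_proton", "Electric_Field_Ex"], names_per_file
    )
```

Because the check runs over every file, a variable that only appears in a later file is not reported, while a name that matches nothing anywhere is.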

Examples:

import sdf_xarray as sdfxr

ds = sdfxr.open_mfdataset(
    "/Users/joel/Source/sdf-xarray/tests/example_files_1D/*.sdf",
    keep_particles=True,
    data_vars=["Particles_Particles_Per_Cell_proton", "Electric_Field_Ez", "dist_fn_x_px_proton"],
)
<xarray.Dataset> Size: 143kB
Dimensions:                              (time: 11, X_x_px_proton: 16,
                                          Px_x_px_proton: 100, X_Grid_mid: 16)
Coordinates:
  * time                                 (time) float64 88B 5.467e-14 ... 2.4...
  * X_x_px_proton                        (X_x_px_proton) float64 128B 1.725e-...
  * Px_x_px_proton                       (Px_x_px_proton) float64 800B -2.97e...
  * X_Grid_mid                           (X_Grid_mid) float64 128B 1.725e-05 ...
Data variables:
    dist_fn_x_px_proton                  (time, X_x_px_proton, Px_x_px_proton) float64 141kB dask.array<chunksize=(1, 16, 100), meta=np.ndarray>
    Particles_Particles_Per_Cell_proton  (time) float64 88B nan nan ... 120.0
    Electric_Field_Ez                    (time, X_Grid_mid) float64 1kB dask.array<chunksize=(1, 16), meta=np.ndarray>
Attributes: (12/21)
    filename:         /Users/joel/Source/sdf-xarray/tests/example_files_1D/00...
    file_version:     1
    file_revision:    4
    code_name:        Epoch1d
    step:             0
    time:             5.466992913512341e-14
    ...               ...
    compile_machine:  noether
    compile_flags:    unknown
    defines:          0
    compile_date:     Mon Jul 29 09:55:15 2024
    run_date:         Thu Oct 17 11:08:44 2024
    io_date:          Thu Oct 17 11:08:44 2024

@JoelLucaAdams
Collaborator

Potentially fixes #57, since you could load the data one distribution function at a time. FYI @LucyMeganArmitage

Collaborator

@LiamPattinson LiamPattinson left a comment

I've added a few suggestions, and the main issue that stands out to me is that I don't think the case in which separate_times == True and data_vars is not None is being handled.

JoelLucaAdams and others added 6 commits November 3, 2025 16:36
Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>
Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>
Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>
Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>
@JoelLucaAdams JoelLucaAdams changed the title from "Add open_mfdataset_variable" to "Add data_vars support" Nov 4, 2025
@JoelLucaAdams JoelLucaAdams changed the title Add data_vars support Add data_vars support Nov 4, 2025
Collaborator

@LiamPattinson LiamPattinson left a comment

Nice, this looks really clean now 👍

@JoelLucaAdams
Collaborator

@Chrimspie Thanks again for the original code implementation, I hope the new way it works isn't too confusing!

@LiamPattinson Thanks again for reviewing it. You've reviewed so many of these PRs now that I think it's high time you added your name to the list of contributors in pyproject.toml and citation.cff :)

@JoelLucaAdams JoelLucaAdams merged commit 331520e into epochpic:main Nov 4, 2025
5 checks passed
JoelLucaAdams added a commit that referenced this pull request Nov 13, 2025
@Chrimspie Chrimspie deleted the open_mfdataset_variable branch November 25, 2025 14:22
@Chrimspie Chrimspie restored the open_mfdataset_variable branch November 25, 2025 14:23

3 participants