Add data_vars support by Chrimspie · Pull Request #73 · epochpic/sdf-xarray

Chrimspie · 2025-10-31T14:28:27Z

Using open_mfdataset() can make the machine try to allocate an infeasible amount of memory to hold the dataset built from the SDF files. This function avoids requesting excess memory by extracting data from the files for a single variable only, given as an argument. It opens the SDF files one by one, extracts the requested variable, and appends it to a list of data arrays. The output dataset is created only after this has been done for all relevant SDF files, and it is much smaller than the one open_mfdataset() would create.

Example:

ds = open_mfdataset_variable("path/to/sim/*.sdf", "Derived_Number_Density_electron")
ds["Derived_Number_Density_electron"]
<xarray.DataArray 'Derived_Number_Density_electron' (time: 5, X_Grid_mid: 100,
                                                     Y_Grid_mid: 100)> Size: 400kB
array([[[0.74796254, 0.92079052, 0.83932642, ..., 0.83889195,
         0.8076792 , 0.74392353],
        [1.07479298, 1.18991263, 1.07810109, ..., 1.05977413,
         1.00096084, 0.91650156],
        [1.1129185 , 0.90517832, 1.1042725 , ..., 1.0131306 ,
         0.96613728, 1.13011126],
        ...,
        [1.16194319, 1.06974929, 1.11065085, ..., 0.89871079,
         0.8788494 , 1.05707098],
        [0.89594484, 0.96290852, 1.19858532, ..., 1.04243784,
         1.17190321, 1.2202328 ],
        [0.61944508, 0.74224332, 0.89385901, ..., 1.03116218,
         0.79594316, 0.81182455]]], shape=(5, 100, 100))
Coordinates:
  * time        (time) float64 40B 1.12e-11 2.51e-09 ... 7.506e-09 1.002e-08
  * X_Grid_mid  (X_Grid_mid) float64 800B 0.005 0.015 0.025 ... 0.985 0.995
  * Y_Grid_mid  (Y_Grid_mid) float64 800B 0.005 0.015 0.025 ... 0.985 0.995
Attributes:
    units:       1/m^3
    point_data:  False
    full_name:   Derived/Number_Density/electron
    long_name:   Derived Number Density electron

ds = open_mfdataset_variable("path/to/sim/*.sdf", "Particles_Px_proton", keep_particles=True)
ds["Particles_Px_proton"]
<xarray.DataArray 'Particles_Px_proton' (time: 1, ID_proton: 1920)> Size: 15kB
array([[-1.63747966e-22, -1.35334162e-22, -1.46445120e-22, ...,
         5.64178958e-22, -2.42414743e-21, -1.11331956e-21]],
      shape=(1, 1920))
Coordinates:
  * time                (time) float64 8B 2.417e-09
    X_Particles_proton  (ID_proton) float64 15kB 5.042e-05 ... 0.0005519
Dimensions without coordinates: ID_proton
Attributes:
    units:       kg.m/s
    point_data:  True
    full_name:   Particles/Px/proton
    long_name:   Particles $P_x$ proton
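
The one-file-at-a-time approach described at the top of this comment can be sketched roughly as follows. This is not the code from the PR: `collect_variable`, `open_one`, `make_snapshot` and the variable names are hypothetical, in-memory datasets stand in for SDF files, and each opened dataset is assumed to already carry a length-one `time` dimension.

```python
from __future__ import annotations

from collections.abc import Callable, Iterable

import numpy as np
import xarray as xr


def collect_variable(
    paths: Iterable[str],
    name: str,
    open_one: Callable[[str], xr.Dataset],
) -> xr.Dataset:
    """Open each file in turn, keep only `name`, then concatenate along time."""
    pieces = []
    for path in paths:
        ds = open_one(path)  # hypothetical opener for a single file
        try:
            if name in ds.data_vars:
                # Load just this one variable; the rest is never kept around.
                pieces.append(ds[name].load())
        finally:
            ds.close()
    if not pieces:
        raise ValueError(f"{name!r} was not found in any file")
    return xr.concat(pieces, dim="time").to_dataset(name=name)


def make_snapshot(t: float) -> xr.Dataset:
    """Stand-in for one SDF file: two variables at a single time."""
    return xr.Dataset(
        {
            "n_e": (("time", "x"), np.full((1, 3), t)),
            "unused": (("time", "x"), np.zeros((1, 3))),
        },
        coords={"time": [t], "x": [0.0, 1.0, 2.0]},
    )


fake_files = {"0000.sdf": make_snapshot(0.0), "0001.sdf": make_snapshot(1.0)}
result = collect_variable(sorted(fake_files), "n_e", fake_files.__getitem__)
print(result)
```

Only the list of single-variable arrays is held in memory while looping, which is where the saving over building the full multi-file dataset would come from.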

LiamPattinson · 2025-10-31T14:55:15Z

Thanks @Chrimspie!

@JoelLucaAdams Would it be possible to implement this as an optional argument to open_mfdataset of type data_vars: Iterable[str] | None = None? If supplied with None (the default), it could continue to load everything. If supplied with a list/tuple/etc of strings, it could only load those variables and discard everything else.

I think this is roughly how it's handled in plain Xarray, though it also has some extra options.

Co-authored-by: Chrimspie <chrimspie@gmail.com>

JoelLucaAdams · 2025-11-03T15:21:25Z

Thanks again for writing this along with the tests @Chrimspie! (As we discussed, documentation can come later!)

@LiamPattinson That's a great suggestion for staying more in line with how xarray works!

I have refactored the code (and tests) to utilise data_vars instead of a whole new function and included @Chrimspie as a co-author since this was his original idea.

We still load all the data with SDFDataStore. SDFPreprocess now takes a new parameter called data_vars and removes every variable not named in that list before the dataset is handed over to xarray's open_mfdataset function.
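
As a rough illustration of that filtering step (not the actual SDFPreprocess code), a per-file filter could look like the sketch below. `keep_data_vars` and `fake_snapshot` are hypothetical names, and in-memory datasets stand in for SDF files. xarray's `open_mfdataset` accepts a `preprocess` callable that is applied to each dataset before combining, which is the hook a filter like this would plug into.

```python
from __future__ import annotations

from collections.abc import Iterable
from functools import partial

import numpy as np
import xarray as xr


def keep_data_vars(
    ds: xr.Dataset, data_vars: Iterable[str] | None = None
) -> xr.Dataset:
    """Hypothetical helper: drop every data variable not named in data_vars."""
    if data_vars is None:
        return ds  # default: keep everything
    wanted = set(data_vars)
    unwanted = [name for name in ds.data_vars if name not in wanted]
    return ds.drop_vars(unwanted)


def fake_snapshot(t: float) -> xr.Dataset:
    """Stand-in for one file's dataset with two field variables."""
    return xr.Dataset(
        {
            "Electric_Field_Ez": (("time", "x"), np.zeros((1, 4))),
            "Electric_Field_Ex": (("time", "x"), np.ones((1, 4))),
        },
        coords={"time": [t], "x": np.arange(4)},
    )


# Bind the selection once, then apply it to every per-file dataset.
preprocess = partial(keep_data_vars, data_vars=["Electric_Field_Ez"])
combined = xr.concat(
    [preprocess(fake_snapshot(t)) for t in (0.0, 1.0)], dim="time"
)
print(list(combined.data_vars))
```

Names in `data_vars` that a given file does not contain are simply ignored here, which mirrors the silent behaviour noted in this comment.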

A couple of things to note:

This new function silently fails if you provide data_vars that don't exist, and will just return time. I'm not sure how else to handle this when the variable you are looking for might be missing from one particular dataset but present in a later file.
data_vars doesn't work on particle data with ID_..., since those are only dimensions and not coordinates. This behaviour is somewhat expected, as the ID_... dimension can change size over time, so there's no way to make it a coordinate :(
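
One possible way to make the first case less silent, offered only as a sketch and not as part of this PR, is to compare the requested names against the combined result rather than against each file. That way a variable missing from early files but present in a later one is not flagged. `warn_on_missing` is a hypothetical helper.

```python
from __future__ import annotations

import warnings
from collections.abc import Iterable


def warn_on_missing(requested: Iterable[str], available: Iterable[str]) -> list[str]:
    """Warn about requested names that are absent from the combined dataset."""
    available_set = set(available)
    missing = [name for name in requested if name not in available_set]
    if missing:
        warnings.warn(
            f"Requested data_vars not found in any file: {missing}",
            UserWarning,
            stacklevel=2,
        )
    return missing
```

After combining, this could be called as `warn_on_missing(data_vars, ds.data_vars)`, since iterating over `ds.data_vars` yields the variable names.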

Examples:

import sdf_xarray as sdfxr

ds = sdfxr.open_mfdataset(
    "/Users/joel/Source/sdf-xarray/tests/example_files_1D/*.sdf",
    keep_particles=True,
    data_vars=["Particles_Particles_Per_Cell_proton", "Electric_Field_Ez", "dist_fn_x_px_proton"],
)
<xarray.Dataset> Size: 143kB
Dimensions:                              (time: 11, X_x_px_proton: 16,
                                          Px_x_px_proton: 100, X_Grid_mid: 16)
Coordinates:
  * time                                 (time) float64 88B 5.467e-14 ... 2.4...
  * X_x_px_proton                        (X_x_px_proton) float64 128B 1.725e-...
  * Px_x_px_proton                       (Px_x_px_proton) float64 800B -2.97e...
  * X_Grid_mid                           (X_Grid_mid) float64 128B 1.725e-05 ...
Data variables:
    dist_fn_x_px_proton                  (time, X_x_px_proton, Px_x_px_proton) float64 141kB dask.array<chunksize=(1, 16, 100), meta=np.ndarray>
    Particles_Particles_Per_Cell_proton  (time) float64 88B nan nan ... 120.0
    Electric_Field_Ez                    (time, X_Grid_mid) float64 1kB dask.array<chunksize=(1, 16), meta=np.ndarray>
Attributes: (12/21)
    filename:         /Users/joel/Source/sdf-xarray/tests/example_files_1D/00...
    file_version:     1
    file_revision:    4
    code_name:        Epoch1d
    step:             0
    time:             5.466992913512341e-14
    ...               ...
    compile_machine:  noether
    compile_flags:    unknown
    defines:          0
    compile_date:     Mon Jul 29 09:55:15 2024
    run_date:         Thu Oct 17 11:08:44 2024
    io_date:          Thu Oct 17 11:08:44 2024

JoelLucaAdams · 2025-11-03T15:24:30Z

Potentially fixes #57, as you could load the data one distribution function at a time. FYI @LucyMeganArmitage

LiamPattinson

I've added a few suggestions, and the main issue that stands out to me is that I don't think the case in which separate_times == True and data_vars is not None is being handled.

src/sdf_xarray/__init__.py

Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>

LiamPattinson

Nice, this looks really clean now 👍

JoelLucaAdams · 2025-11-04T15:37:21Z

@Chrimspie Thanks again for the original code implementation, I hope the new way it works isn't too confusing!

@LiamPattinson Thanks again for reviewing it. You've reviewed so many of these PRs now that I think it's high time you added your name to the list of contributors in pyproject.toml and citation.cff :)

Add data_vars support

Chrimspie added 3 commits October 31, 2025 13:57

Add open_mfdataset_variable

7ea9c5b

Check for no data in variable

35176ba

Add name to contributors

7811353

Refactor open_mfdataset_variable to data_vars

f47eaaa

Co-authored-by: Chrimspie <chrimspie@gmail.com>

LiamPattinson reviewed Nov 3, 2025

View reviewed changes

JoelLucaAdams and others added 6 commits November 3, 2025 16:36

Update src/sdf_xarray/__init__.py

4432bd7

Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>

Update src/sdf_xarray/__init__.py

a120451

Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>

Update src/sdf_xarray/__init__.py

68ffbdb

Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>

Update src/sdf_xarray/__init__.py

b6b2ab7

Co-authored-by: Liam Pattinson <LiamPattinson@users.noreply.github.com>

refactor data_vars logic to purge_unselected_data_vars

21c3057

Add data_vars support to separate_times

9e33b2f

JoelLucaAdams mentioned this pull request Nov 4, 2025

Add documentation for data_vars #74

Closed

JoelLucaAdams changed the title ~~Add open_mfdataset_variable~~ Add data_vars support Nov 4, 2025

JoelLucaAdams changed the title ~~Add data_vars support~~ Add data_vars support Nov 4, 2025

LiamPattinson approved these changes Nov 4, 2025

View reviewed changes

JoelLucaAdams merged commit 331520e into epochpic:main Nov 4, 2025
5 checks passed

LiamPattinson mentioned this pull request Nov 4, 2025

Add author Liam Pattinson #75

Merged

JoelLucaAdams added a commit that referenced this pull request Nov 13, 2025

Merge pull request #73 from Chrimspie/open_mfdataset_variable

5d260de

Add data_vars support

Chrimspie deleted the open_mfdataset_variable branch November 25, 2025 14:22

Chrimspie restored the open_mfdataset_variable branch November 25, 2025 14:23

Add data_vars support #73
JoelLucaAdams merged 10 commits into epochpic:main from Chrimspie:open_mfdataset_variable
