

Incompatibility of SD1D output data format with xBOUT #3

Closed
TomNicholas opened this issue Mar 1, 2019 · 11 comments


TomNicholas commented Mar 1, 2019

So @omyatra asked me for help loading some SD1D data with xBOUT, but it's exposed a problem with the way SD1D (and possibly other modules) write out their data.

Omkar ran one of the example cases, which creates output data split along the y-direction over 4 dump files. When I try to collect this with xBOUT I get an error:

ValueError: variable flux_ion not equal across datasets

SD1D writes the variable 'flux_ion' out to all dump files, but it's a boundary value so is only actually calculated in the processor which contains the target. This means that processor 3 has an array 'flux_ion' with non-zero values, which varies over time, whereas processors 0 through 2 have the same kind of array but populated with zeros.

The problem is that, from the point of view of the dump-file-joining algorithm in xBOUT, this data is fundamentally ambiguous. The datasets read from the dump files need to be combined along the dimension 'y'. If we just consider dump files 2 & 3, then we are asking it to take two 1D arrays, each with just a 't' dimension, and combine them along 'y'. The combining algorithm (xarray.auto_combine) will look at the values in these two 'flux_ion' arrays, and if they are the same it will just keep one (because xarray.auto_combine uses xarray.merge). However, the values aren't the same - they're zero in one and non-zero in the other - and xBOUT has no way to know whether you want to keep one array or the other, or whether you actually want to stack the arrays up by adding a 'y' dimension to the 'flux_ion' variable (using xarray.concat).

If we compare against the BOUT++ default output variable 't_array' it becomes even clearer why flux_ion is ambiguous. 't_array' is also 1D along 't' and is also written to all files. But as simulation time is clearly a global quantity, it would be nonsense for t_array to take different values in different processors.
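
To see the ambiguity concretely, here is a minimal sketch (with made-up values) of the choice xarray faces for dump files 2 & 3:

import numpy as np
import xarray as xr

# Stand-ins for 'flux_ion' as read from dump files 2 and 3:
# each is 1D along 't' only
ds2 = xr.Dataset({'flux_ion': ('t', np.zeros(4))})
ds3 = xr.Dataset({'flux_ion': ('t', np.full(4, 1.445e-4))})

xr.merge([ds2, ds3])            # raises MergeError: conflicting values
xr.concat([ds2, ds3], dim='y')  # succeeds, but invents a new 'y' dimension

Neither operation alone is obviously the right one, which is why the combining algorithm cannot resolve this automatically.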

There are a few ways we could deal with this:

  1. Change BOUT++ to write out variables consistently in all files. If the values in the arrays were consistent across different processors then it would be fine. I guess this would require extra processor communications though.

  2. Change SD1D to never write out flux_ion. I think that flux_ion can be rederived from the other variables, in which case it never needs to be written at all - it can be recreated in post-processing instead.

  3. Change SD1D to only write out flux_ion to a single dump file. I don't know whether BOUT's Field objects have the ability to handle that.

  4. Load flux_ion from the data separately in an open_sd1ddataset function. This would be a workaround in this specific case, but it really doesn't solve the underlying issue.

@bendudson How does boutdata.collect handle this ambiguity? Is this likely to be a problem in other BOUT++ modules too? What do you think the best way of dealing with this in general is? I actually think that it should not be possible to write output like this in BOUT++ - perhaps a processor communication to verify consistency before writing would be the simplest general solution?

@ZedThree @d7919 @johnomotani

@dschwoerer (Contributor)

  1. That would inherently turn this into a runtime error - especially during dumping, this can be really bad if we are throwing, as the dump files may end up in an inconsistent state, thus making them mostly unusable. See e.g. wall_limit not working as expected BOUT-dev#1557
    Further, there may be other cases where a single BoutReal per process is sufficient for dumping, but the values may differ between processes. Disallowing that use case seems detrimental.

  2. does not improve the current state. Whether I cannot read it because it is not dumped, or because my reader isn't happy with the value, doesn't change anything, or am I missing the point?

  3. Field isn't involved at all. The datafiles handle that just fine: just conditionally dump that value on the last processor.

collect just takes the variable from the first dump file ...

In [1]: collect('flux_ion')
mxsub = 1 mysub = 100 mz = 1

nxpe = 1, nype = 2, npe = 2

Out[1]: BoutArray([0., 0., 0., 0.])

In [2]: DataFile('BOUT.dmp.1.nc')['flux_ion'][...]
Out[2]: BoutArray([0.0001445 , 0.0001445 , 0.00014449, 0.00014449])

In [3]: DataFile('BOUT.dmp.0.nc')['flux_ion'][...]
Out[3]: BoutArray([0., 0., 0., 0.])

Similarly, with (3) applied, collect fails (as it only checks the first file) and only reading directly from that file works.

@TomNicholas (Author)

I think we may have closed this issue prematurely - this actually still won't work even with @dschwoerer's patch (it's my fault; this is more complex to solve than I originally thought).

That would inherently turn this into a runtime error - especially during dumping, this can be really bad if we are throwing, as the dump files may end up in an inconsistent state, thus making them mostly unusable.

I'm not sure I understand your point here. I get that if BOUT crashes before all variables have been communicated properly then you would end up with inconsistent dump files, but we have to communicate anyway so how would communicating this BoutReal make that any worse?

Further, there may be other cases where a single BoutReal per process is sufficient for dumping, but the values may differ between processes. Disallowing that use case seems detrimental.

Do we really need to support writing different variables to different files at all though? It makes the problem of concatenation considerably more complicated (I will explain why in a sec), and doesn't seem to give much benefit over just defining the variable globally. The cost of communicating the scalar will never be a limiting factor in performance scaling. If you want to print one number or string per process to a file for debugging purposes then you can always put it in the attributes, which will then basically be ignored on concatenation.

Also it seems to me that there is no physically-meaningful variable which is only defined on one process. You either have global quantities or spatially-varying ones; it doesn't make physical sense to have quantities which are only defined on part of the domain. Essentially, flux_ion should either be considered to be a global scalar and communicated, or the ion flux should be calculated (or at least written as a zero or NaN) for all grid cells so that flux_ion = flux_ion(y), and you select the value at y=wall if that's what you're interested in. @bendudson what do you think about that?
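
For illustration, if flux_ion were written as flux_ion(t, y) then selecting the target value in xarray would just be (a sketch, assuming the target sits at the last y index):

flux_ion_target = ds['flux_ion'].isel(y=-1)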

Why different variables in different files is so inconvenient for xBOUT

Firstly, I am assuming that we eventually want a replacement for boutdata.collect which uses a function from the xarray top-level API, without having to do anything particularly complex or special-cased for BOUT++, and without using lower-level private xarray functions either. (Guard cells would be trimmed using the preprocess argument to xarray.open_mfdataset.) We want this because our new collection routine would then benefit from all the optimization, parallelization, tests, bugfixing etc. that the rest of the xarray community provides, and the behaviour we want in xarray would be explicitly supported indefinitely.
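
As a sketch of what that guard-cell preprocessing might look like (assuming, for illustration, 2 guard cells in y and concatenation along 'y' only):

import xarray as xr

def trim_guards(ds, myg=2):
    # Remove the y guard cells from each processor's dataset before combining
    return ds.isel(y=slice(myg, -myg))

ds = xr.open_mfdataset('BOUT.dmp.*.nc', concat_dim='y', preprocess=trim_guards)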

That means the output of BOUT++ has to conform to some restrictions. Firstly it has to conform to the netCDF data model, because xarray is trying to combine all the dump files into a single netCDF dataset. That was the original problem with flux_ion - it violates the netCDF data model to have the same scalar variable take different values in different files: you either have to have a globally-consistent scalar, or flux_ion = flux_ion('y'). This was what was fixed by #4.

What I didn't realise until now was that the exact xarray API imposes other restrictions on the output format of BOUT++ as well. In particular, the planned API for xarray prevents you from combining along a dimension which has some variables present in some datasets but not in others. Here it was decided that the two public-facing multidimensional combining functions should be xarray.auto_combine and xarray.manual_combine. The idea was that auto_combine could deal with any situation as long as there were global dimension coordinates to use to arrange the datasets, whereas manual_combine would use a series of successive xarray.concat and xarray.merge operations.

Unfortunately BOUT++ does not have global coordinates so we can't use auto_combine, and if we allow writing out BoutReals to one processor's file but not the others then we can't use manual_combine for those cases either. The reason we can't use manual_combine is that the current situation with flux_ion can't be joined with a concat alone or a merge alone along 'y'. That means we either have to:

  1. Not support writing out variables to one dump file and not others in BOUT++ so that we can always use manual_combine like xBOUT currently does (my preferred solution).

  2. Add global dimension coordinates to BOUT++ so that we can use auto_combine. This obviously wouldn't be backwards-compatible though.

  3. Try to persuade the xarray devs to change the planned API to something which supports having this type of output. They might, but I don't think they will, because the choice of auto_combine and manual_combine has already been discussed a fair amount and I think it makes a lot of sense; I will ask though.


d7919 commented Mar 14, 2019

Is it possible to ask xarray to ignore certain variables? This would seem like the easiest and least restrictive "fix".

@TomNicholas (Author)

Is it possible to ask xarray to ignore certain variables? This would seem like the easiest and least restrictive "fix".

Yes, you could pass a function to the preprocess argument of xarray.open_mfdataset (which calls the combining function) which looks like

def ignore(ds, vars):
    return ds.drop(vars)
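
Since preprocess is called with a single dataset argument, you would bind the variable list first; a sketch, assuming flux_ion is the only variable to drop:

from functools import partial
import xarray as xr

# Drop flux_ion from each dump file before the datasets are combined
ds = xr.open_mfdataset('BOUT.dmp.*.nc', preprocess=partial(ignore, vars=['flux_ion']))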

You still can't use this method unless you know in advance that this data isn't just BOUT data but SD1D data for which flux_ion must be dropped, so this isn't a general solution.

I don't think solving this for SD1D is the hard bit; I'm more trying to formalise how BOUT should behave in order to make a sensible and general replacement for collect using xarray possible.

@dschwoerer (Contributor)

Is boutproject/BOUT-dev#1531 a more interesting example?

It adds performance information for each proc to the dump file - which would vary from proc to proc.

I think a general solution is not possible - as SD1D uses that to dump basically just a FieldPerp, while BOUT++ now uses it to store non-physical data.

Can we try - and on a failure add that variable to the list of non-constant variables?

Then (potentially) BOUT++-specific functions like getting the average, all values, or the value from a specific proc could be implemented ...

Disallowing that use in BOUT++ seems to me like a significant disadvantage, and I am failing to see why we should disallow ...


d7919 commented Mar 14, 2019

Yes, I think it's best if analysis code doesn't impose restrictions on the simulation. Presumably the new "collect"-like interface could take a list of variables to ignore in order to make this more general (this isn't exclusive to SD1D)? If it really is meant to act like collect, such that you ask it to read a specific variable, then we could perhaps auto-generate the ignore list by getting a list of all variables and then just removing the one we asked for (this is said without looking at any of the code)?

@TomNicholas (Author)

Is boutproject/BOUT-dev#1531 a more interesting example?

It adds performance information for each proc to the dump file - which would vary from proc to proc.

It is more interesting, but it sort of highlights my point that it never makes sense to do this with physically-meaningful variables. In the case of performance information it makes sense to write it differently to each file, but only because these aren't physical values. This would just be a good example of a set of variables to drop before concatenation - you want them in the dump files, but not in the final concatenated dataset.

If it really is meant to act like collect, such that you ask it to read a specific variable, then we could perhaps auto-generate the ignore list by getting a list of all variables and then just removing the one we asked for (this is said without looking at any of the code)?

It's not quite the same as collect because you join up all the dump files and then choose a variable afterwards, not the other way around. (This is still fast because with xarray the variable values are lazily loaded.) We could easily have a list of variables to drop though. In fact you could label them as ones you wanted to drop by using the netCDF attributes of the variable (like we do for staggered grids now). And it would also be easy to have a module-specific open_sd1ddataset which wraps open_boutdataset and tells open_boutdataset which variables to ignore.
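
A sketch of what that attribute-based dropping might look like in the preprocess step (the attribute name 'drop_on_collect' is hypothetical, not an existing BOUT++ convention):

def drop_flagged(ds):
    # Drop any variable whose (hypothetical) attribute flags it as per-file only
    to_drop = [name for name, var in ds.variables.items()
               if var.attrs.get('drop_on_collect', False)]
    return ds.drop(to_drop)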

Can we try - and on a failure add that variable to the list of non-constant variables?

This could work but would likely be slow (you're going through all of the opening and concatenation logic again for every variable that you end up ignoring).

I think a general solution is not possible

Yes, I think it's best if analysis code doesn't impose restrictions on the simulation.

The thing is you can't have it both ways: you can't replace collect with something that uses xarray's (still extremely general!) data structures while at the same time retaining total flexibility in the output of BOUT.

Disallowing that use in BOUT++ seems to me like a significant disadvantage, and I am failing to see why we should disallow ...

I'm not suggesting actually restricting what BOUT can do, just not explicitly supporting everything it can possibly write out with the new collect function, so individual cases (like this one) might have to be worked around (e.g. with an open_sd1ddataset function).

Then (potentially) BOUT++-specific functions like getting the average, all values, or the value from a specific proc could be implemented ...

If you really wanted to retain the information about which processor certain points were calculated by, then you could just write out a Field3D (or whatever) which contains the process number. As long as it has the full-length (x, y, z) dimensions then that's fine. Then in xarray an average over all points in processor 5 might look like

avg = ds.where(ds['proc'] == 5, drop=True).mean()

@dschwoerer (Contributor)

So this isn't replacing collect, but rather squashoutput?

@dschwoerer (Contributor)

I'm not suggesting actually restricting what BOUT can do, just not explicitly supporting everything it can possibly write out with the new collect function, so individual cases (like this one) might have to be worked around (e.g. with an open_sd1ddataset function).

Maybe we could add a Field0D - which would wrap a BoutReal - that adds (if dumped) metadata to that variable saying the value is the same on all processors; other BoutReals which are getting dumped could thus be ignored?


TomNicholas commented Mar 14, 2019

So this isn't replacing collect, but rather squashoutput?

Sort of. Really it's not exactly analogous to either, but a more powerful tool which allows you to use xarray methods to get behaviour like collect:

T = open_boutdataset('BOUT.dmp.*.nc')['T'].values

or like squashoutput:

open_boutdataset('BOUT.dmp.*.nc').to_netcdf('single_file.nc')

Maybe we could add a Field0D - which would wrap a BoutReal - that adds (if dumped) metadata to that variable saying the value is the same on all processors; other BoutReals which are getting dumped could thus be ignored?

We could potentially do this... I am essentially treating BoutReals as Field0Ds at the moment, so a distinction might be useful. You would do this by having the preprocessing function ignore any variable which is scalar and doesn't have the attribute flag saying to collect it. This wouldn't be backwards-compatible though.
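
A sketch of that preprocessing rule (here both the 'field0d' attribute name and the 't' dimension name are assumptions):

def drop_unflagged_scalars(ds):
    # Treat "scalar" as having no spatial dimensions (a time-evolving
    # BoutReal would still carry a 't' dimension), and keep it only if
    # flagged as a globally-consistent Field0D
    # (globals like t_array would need the flag too)
    to_drop = [name for name, var in ds.variables.items()
               if set(var.dims) <= {'t'} and not var.attrs.get('field0d', False)]
    return ds.drop(to_drop)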

I should also probably mention that I'm already special-casing some of the BoutReals which are written to the dump files, such as 'MXG', 'ixseps' and 'nx'. I currently just have an explicit list of those "metadata" variables and add them as attributes to the final returned xarray.Dataset object.

@dschwoerer (Contributor)

dschwoerer commented Mar 14, 2019 via email
