Merge 3a3fe9c into 875ede9
OriolAbril committed Nov 10, 2019
2 parents 875ede9 + 3a3fe9c commit 58e4b7d
Showing 3 changed files with 5,366 additions and 82 deletions.
177 changes: 95 additions & 82 deletions schema.md
@@ -1,82 +1,95 @@
# Design for a NetCDF storage format for MCMC traces

/
All data relating to an MCMC run.

attrs:
backend: "stan" or "pymc3"
version: str
model_name?: str
comment?: str
model_version?: str
timestamp: int
author?: str

/coords
For each dimension, we store a corresponding variable here that contains the labels for that dimension.

/model
Each backend can store a representation of the model in here. (Probably the
source code for Stan; it is unclear what this would be for PyMC3 at this point.)

/sample_stats
Statistics computed while sampling, like step size (step_size), depth, diverging, energy (for HMC), log probability (lp), and log likelihood (log_likelihood).

/initial_point
One point in parameter space for each chain, where that chain started.
TODO We could also store that as an attribute of each var in /trace

/tuning_trace?
Same as /trace, but optional. It contains the trace collected during tuning (only stored if store_tune=True or similar?).

/tuning_advi?
Some data about an initialization ADVI run? History of convergence stats, final sd and mu parameters? Optional

/divergences
Points in parameter space where the leapfrog integration started before leading to a divergence (excluding tuning).
Does Stan have access to that info? We could also just store the accepted point of a divergent trajectory.

/observed_data
All data that is used in observed variables (or the data / transformed data sections in Stan).

/warnings
A list of warnings raised during sampling, e.g. low effective_n, divergences, ...
TODO Not sure about the format. Can we somehow share at least part of that between Stan/PyMC3?
They mostly produce the same warnings, I think.

/prior?
Samples from the prior distribution. Same shapes as in /trace, except for the (chain, sample) dimensions.

/prior_predictive?
Samples from the prior predictive distribution. Same vars as in /observed_data

/posterior_predictive?
Samples from the posterior predictive distribution. Same vars as in /observed_data

/trace
TODO We could call this /posterior

attrs:
The final parameters for the sampler, i.e. the final mass matrix and step size.

/trace/var1
One entry for each variable. The first two dimensions should always be
`(chain, sample)`. The decision whether or not to expose a stacked version `draw=('chain', 'sample')`
is probably up to ArviZ.

Variable names must not clash with coordinate names.

attrs:
is_free: Whether the variable is a free variable for the sampler or a transformed one
domain: One of ("reals", "pos-reals", "integers", "sym-pos-def", "interval", ...)
TODO For stuff like sym-pos-def we need to know along which dims it is a matrix.
TODO This data could also be stored in /model
sym_pos_axis: [dim_idx1, dim_idx2]
interval_lower:
interval_upper:
TODO How would this deal with cases where the lower and upper bounds depend on the index?
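
As an illustration (not part of the original draft), one such variable with the proposed `(chain, sample)` dimensions and attrs could be written with xarray roughly as follows; the variable name `theta` and the attribute values are hypothetical:

```python
import numpy as np
import xarray as xr

chains, samples = 4, 1000
theta = xr.DataArray(
    np.random.randn(chains, samples),
    dims=("chain", "sample"),     # first two dims as proposed above
    attrs={
        "is_free": 1,             # free variable for the sampler
        "domain": "reals",        # one of the proposed domain labels
    },
)
# store the variable under the proposed /trace group of a netCDF file
xr.Dataset({"theta": theta}).to_netcdf("mcmc_run.nc", group="trace")
```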



TODO: In order to reproduce the run, it may make sense to also store some data on the random state (in NumPy, this is a tuple of arrays), as
well as some version info. Hopefully just `PyMC3, (3, 4, 1)` or similar works.
# InferenceData schema specification
The `InferenceData` schema defines a data structure compatible with [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) with three goals in mind: usefulness in the analysis of Bayesian inference results, reproducibility of Bayesian inference analyses, and interoperability between different inference backends and programming languages.

Currently, there are two beta implementations of this design:
* [ArviZ](https://arviz-devs.github.io/arviz/) in Python, which integrates with:
- [emcee](https://emcee.readthedocs.io/en/stable/)
- [PyMC3](https://docs.pymc.io)
- [pyro](https://pyro.ai/)
and [numpyro](https://pyro.ai/numpyro/)
- [PyStan](https://pystan.readthedocs.io/en/latest/index.html),
[CmdStan](https://mc-stan.org/users/interfaces/cmdstan)
and [CmdStanPy](https://cmdstanpy.readthedocs.io/en/latest/index.html)
- [tensorflow-probability](https://www.tensorflow.org/probability)
* [ArviZ.jl](https://github.com/sethaxen/ArviZ.jl) in Julia, which integrates with:
- [Turing.jl](https://turing.ml/dev/) and indirectly any package using [MCMCChains.jl](https://github.com/TuringLang/MCMCChains.jl) to store results
- [CmdStan.jl](https://github.com/StanJulia/CmdStan.jl), [StanSample.jl](https://github.com/StanJulia/StanSample.jl) and [Stan.jl](https://github.com/StanJulia/Stan.jl)

## Contents
1. [Current design](#current-design)
1. [`posterior`](#posterior)
1. [`sample_stats`](#sample_stats)
1. [`posterior_predictive`](#posterior_predictive)
1. [`observed_data`](#observed_data)
1. [`constant_data`](#constant_data)
1. [`prior`](#prior)
1. [`sample_stats_prior`](#sample_stats_prior)
1. [`prior_predictive`](#prior_predictive)
1. [Planned features](#planned-features)
1. [Sampler parameters](#sampler-parameters)
1. [Out of sample posterior_predictive samples](#out-of-sample-posterior_predictive-samples)
1. [Examples](#examples)

## Current design
`InferenceData` stores, in separate groups, all the quantities relevant to fulfilling its goals. Each group, described below, stores a conceptually different quantity, generally represented by several multidimensional labeled variables.

Each group should have one entry per variable. When relevant, the first two dimensions of each variable should be the sample identifiers (`chain`, `draw`). For groups like `observed_data` or `constant_data` these two initial dimensions are omitted. Dimensions must be named and have their index values, called coordinates, specified. Coordinate values can be repeated and need not be numerical. Variable names must not share names with dimensions.

Moreover, each group contains the following attributes:
* `created_at`: the date of creation of the group.
* `inference_library`: the library used to run the inference.
* `inference_library_version`: version of the inference library used.

`InferenceData` objects can contain any combination of the groups described below.
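
To make the structure above concrete, here is a minimal sketch using the ArviZ Python implementation; the variable names (`theta`, `y`, `y_hat`), the `school` dimension and its coordinate values are invented for illustration:

```python
import numpy as np
import arviz as az

chains, draws, n_schools = 4, 500, 8
idata = az.from_dict(
    posterior={"theta": np.random.randn(chains, draws, n_schools)},
    posterior_predictive={"y_hat": np.random.randn(chains, draws, n_schools)},
    observed_data={"y": np.random.randn(n_schools)},
    # named dimension shared by the three variables, with its coordinate values
    coords={"school": ["school {}".format(i) for i in range(n_schools)]},
    dims={"theta": ["school"], "y_hat": ["school"], "y": ["school"]},
)
print(idata)  # each group is an xarray.Dataset with named dims and coords
```

When the object is created through one of the backend converters instead of `from_dict`, the `inference_library` attributes listed above are filled in automatically.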

#### `posterior`
Samples from the posterior distribution p(theta|y).

#### `sample_stats`
Information and diagnostics for each `posterior` sample, provided by the inference backend. It may vary depending on the algorithm used by the backend (e.g. an affine invariant sampler has no associated energy). The naming convention used for `sample_stats` variables is the following:
* `lp`: (unnormalized) log probability of the sample
* `step_size`: current integrator step size
* `step_size_bar`: (time-)averaged integrator step size
* `tune`: boolean variable indicating whether the sampler is tuning or sampling
* `depth`: HMC-NUTS only, depth of the trajectory tree
* `tree_size`: HMC-NUTS only, number of leapfrog steps in the trajectory
* `mean_tree_accept`: HMC-NUTS only, mean acceptance probability over the trajectory
* `diverging`: HMC-NUTS only, boolean variable indicating divergent transitions
* `energy`: HMC-NUTS only, Hamiltonian energy of the sample
* `energy_error`: HMC-NUTS only, energy difference between the initial point and the accepted proposal
* `max_energy_error`: HMC-NUTS only, maximum energy error along the trajectory
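
Because these names are shared across backends, downstream diagnostics can be written once. A small sketch using one of the example datasets shipped with ArviZ (`centered_eight`):

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")
diverging = idata.sample_stats["diverging"]
print(int(diverging.sum()), "divergent transitions out of", diverging.size, "draws")
```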

#### `posterior_predictive`
Posterior predictive samples p(y|y), corresponding to the posterior predictive distribution evaluated at the `observed_data`. Samples should match the `posterior` samples, and each variable should have a counterpart in `observed_data`. The `observed_data` counterpart variable may have a different name.

#### `observed_data`
Observed data on which the `posterior` is conditioned. It should only contain data which is modeled as a random variable. Each variable should have a counterpart in `posterior_predictive`; however, the `posterior_predictive` counterpart variable may have a different name.
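
For instance, when the names differ (here a hypothetical observed variable `y` paired with a predictive variable `y_hat`), the mapping can be stated explicitly where needed, e.g. through the `data_pairs` argument of `az.plot_ppc`:

```python
import numpy as np
import arviz as az

idata = az.from_dict(
    posterior_predictive={"y_hat": np.random.randn(4, 500, 8)},
    observed_data={"y": np.random.randn(8)},
    coords={"obs_id": np.arange(8)},
    dims={"y_hat": ["obs_id"], "y": ["obs_id"]},
)
# map the observed variable name to its differently named predictive counterpart
az.plot_ppc(idata, data_pairs={"y": "y_hat"})
```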

#### `constant_data`
Model constants: data included in the model that is not modeled as a random variable. It should be the data used to generate the `posterior` and `posterior_predictive` samples.

#### `prior`
Samples from the prior distribution p(theta). These samples do not need to match the `posterior` samples; however, this group still follows the convention of using `chain` and `draw` as the first dimensions. Each variable should have a counterpart in `posterior`.

#### `sample_stats_prior`
Information and diagnostics for each `prior` sample, provided by the inference backend. It may vary depending on the algorithm used by the backend (e.g. an affine invariant sampler has no associated energy). Variable names follow the same convention defined in [`sample_stats`](#sample_stats).

#### `prior_predictive`
Samples from the prior predictive distribution. Samples should match `prior` samples and each variable should have a counterpart in `posterior_predictive`/`observed_data`.

## Planned features

### Sampler parameters

### Out of sample posterior_predictive samples
#### `predictions`
Out of sample posterior predictive samples p(y'|y). Samples should match the `posterior` samples. Its variables should have a counterpart in `posterior_predictive`. However, variables in `predictions` and their counterparts in `posterior_predictive` may not share coordinates.

#### `predictions_constant_data`
Model constants used to generate the `predictions` samples. Its variables should have a counterpart in `constant_data`. However, variables in `predictions_constant_data` and their counterparts in `constant_data` may not share coordinates.
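
Since these groups are not implemented yet, the following is only a hypothetical sketch of what they could contain, built directly with xarray; the `obs_id` dimension, the variable names and the coordinate values are all invented:

```python
import numpy as np
import xarray as xr

chains, draws = 4, 500
new_ids = [101, 102, 103]  # out-of-sample points, distinct from the fitted ones
# predictions reuse the (chain, draw) sampling dimensions of the posterior,
# but carry their own coordinate values for the predicted points
predictions = xr.Dataset(
    {"y_new": (("chain", "draw", "obs_id"),
               np.random.randn(chains, draws, len(new_ids)))},
    coords={"chain": np.arange(chains), "draw": np.arange(draws), "obs_id": new_ids},
)
# the constants (e.g. predictors) used to generate those predictions
predictions_constant_data = xr.Dataset(
    {"x_new": (("obs_id",), np.array([0.1, 0.5, 0.9]))},
    coords={"obs_id": new_ids},
)
```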

## Examples
In order to clarify the definitions above, an example of `InferenceData` generation for a 1D linear regression is available in several programming languages and probabilistic programming frameworks. This particular inference task has been chosen because it is widely known while still being very useful, and it also allows all the fields in the `InferenceData` object to be populated.
* Python
- PyMC3
- [PyStan](schema/PyStan_schema_example.ipynb)
