Clarify schema specification #913

OriolAbril · 2019-11-23T20:43:09Z

No description provided.

rpgoldman · 2019-11-24T19:29:10Z

doc/schema/schema.md

@@ -95,7 +95,7 @@ Parameters of the sampling algorithm and sampling backend to be used for analysi

 ### Out of sample `posterior_predictive` samples
 #### `predictions`
-Out of sample posterior predictive samples p(y'|y). Sample should match `posterior` samples. Its variables should have a counterpart in `posterior_predictive`. However, variables in `predictions` and their counterpart in `posterior_predictive` may not share coordinates.
+Out of sample posterior predictive samples p(y'|y). Samples should match `posterior samples. Its variables should have a counterpart in `posterior_predictive`. However, variables in `predictions` and their counterpart in `posterior_predictive` can have different coordinates.


Missing close backtick around "posterior".

To further clarify, suggest we change the final phrase to "can have different coordinate values. E.g., the posterior might have values [1, 2, 3] along coordinate x, while the posterior predictive might have [4, 5, 6]."

I like the simple name "predictions" as an alternative to "out of sample posterior predictive samples." That's an improvement!

Thanks. I would change the example wording in order to keep the notation: "e.g., the posterior predictive might have coordinates [0, 1, 2] along dimension x, while the predictions may have [3, 4, 5]".

As I understand xarray notation, x is not a coordinate but a dimension.

In xarray output, the first thing to be printed are the dimensions, then the coordinates (which in arviz are assigned to a dimension but in general xarray objects, it is possible to have coordinates without dimension assigned) and eventuallt data variables which have some dimensions assigned

Thanks. I would change the example wording in order to keep the notation: "e.g., the posterior predictive might have coordinates [0, 1, 2] along dimension x, while the predictions may have [3, 4, 5]".

If you are going to introduce the posterior predictive in this way, I think we should address all three:

"e.g., the posterior distribution and the posterior predictive might have coordinates [0, 1, 2] along dimension x, while the predictions might have [3, 4, 5]".

This captures the distinction between posterior predictive and predictions. [2, 3, 4, 5] would be a better choice for example than [3, 4, 5], because it shows that the coordinate values might be different, but they may overlap.

As I understand xarray notation, x is not a coordinate but a dimension.

I don't think their terminology is very good. It seems to me that there are multiple things:

A dimension, which is a property of a data variable

A dimension definition, which is a mapping of dimension name to value -- a dimension definition is what I believe an xarray Coordinate object refers to (this is sometimes what they mean when they say "coordinate").

A coordinate value ("the x coordinate is 3").

I find their use of these terms imprecise and confusing. It seems lie the same thing, x is sometimes referred to as a "dimension" and sometimes a "coordinate." This makes me 🤪

The point about the posterior is to distinguish the relationship between the posterior and posterior predictive samples (must have the same dimensions and coordinates) and between the posterior and (out of sample) predictions (may have different coordinates, but the same dimensions).

I don't think the posterior and the posterior predictive must share dimensions and coordinates. Take the linear regression used as schema example for instance. The posterior has 3 variables (intercept, slope and sigma) and 2 dimensions (chain and draw), whereas the posterior predictive has one variable (slack_comments) and 3 dimensions (chain, draw, developer)

Regarding terminology of dimension and coordinate it could be interesting to add a terminology section. I will reread xarray terminology section to see if we can base it on theirs. I think we only need a subset of xarray functionality (dimensions with multidimensional coordinates could be interesting, but I don't see the use of dimensions without coordinates nor of coordinates without dimensions) so let's hope we can achieve something we all agree and understand in the same way.

Right, that's correct, because the posterior predictive by default contains the observed RVs only, and those are in observed_data rather than posterior group.

But you aren't allowed to change the existing shape of these observed RVs in the posterior predictive.

I was thinking of this in terms of the posterior, because the observed_data group is created as a side-effect of creating the posterior group.

Oohh, I see now. I think this is in the posterior predictive section, though it may not be very well explained.

Maybe we could include "matching samples" and "matching variables" in the terminology

rpgoldman · 2019-11-24T19:31:38Z

doc/schema/schema.md

@@ -95,7 +95,7 @@ Parameters of the sampling algorithm and sampling backend to be used for analysi

 ### Out of sample `posterior_predictive` samples
 #### `predictions`
-Out of sample posterior predictive samples p(y'|y). Sample should match `posterior` samples. Its variables should have a counterpart in `posterior_predictive`. However, variables in `predictions` and their counterpart in `posterior_predictive` may not share coordinates.
+Out of sample posterior predictive samples p(y'|y). Samples should match `posterior samples. Its variables should have a counterpart in `posterior_predictive`. However, variables in `predictions` and their counterpart in `posterior_predictive` can have different coordinates.

 #### `predictions_constant_data`
 Model constants used to get the `predictions` samples. Its variables should have a counterpart in `constant_data`. However, variables in `predictions_constant_data` and their counterpart in `constant_data` may not share coordinates.


Here again, "may not share coordinates" is not the same as "possibly do not share coordinates": "may not" is the same as "are forbidden to." So I think that you should change that to "may have different coordinates."

As before, I think it would be helpful to distinguish between having different coordinates and having different coordinate values. Or possibly "different coordinate definitions." The xarray terminology is not helpful here.

rpgoldman

Sorry about all the little editing comments -- GitHub wouldn't let me edit myself, so I had to put them in comments.

This is very helpful. My only concern has to do with it seeming to rely on notions like "counterpart" and "matching" that are not sufficiently well-defined, where "sufficiently well-defined" would involve the ability to write code that would test whether these notions are satisfied.

rpgoldman · 2019-12-02T22:12:59Z

doc/schema/schema.md

+## Terminology
+The terminology used in this specification is based on [xarray's terminology](http://xarray.pydata.org/en/stable/terminology.html), however, no xarray knowledge is assumed in this description. There are also some extensions particular to  the InferenceData case.
+
+* **Variable**: NetCDF-like variables are multidimensional labeled arrays representing a single quantity. Variables and their dimensions must be named. They can also have attributes describing it. Relevant terms related to InferenceData variables are: *variable_name*, *values* (its data), *dimensions*, *coordinates*, and *attributes*.


"...describing them."

Should we have a sentence describing attributes, as well? I think they are just an arbitrary dictionary of added information, right?

Yes, I will add it whenever I have time

rpgoldman · 2019-12-02T22:14:33Z

doc/schema/schema.md

+The terminology used in this specification is based on [xarray's terminology](http://xarray.pydata.org/en/stable/terminology.html), however, no xarray knowledge is assumed in this description. There are also some extensions particular to  the InferenceData case.
+
+* **Variable**: NetCDF-like variables are multidimensional labeled arrays representing a single quantity. Variables and their dimensions must be named. They can also have attributes describing it. Relevant terms related to InferenceData variables are: *variable_name*, *values* (its data), *dimensions*, *coordinates*, and *attributes*.
+* **Dimension**: The dimensions of an object are its named axes. A variable containing 3D data can have dimensions `[chain, draw, dim0]`, thus, its `0th`-dimension is `chain`, its `1st`-dimension is `draw` and so on. Every dimension present in an InferenceData variable must share names with a *coordinate*. Given that dimensions must be named, dimension and dimension name are used equivalents.


Nit: is it required that each dimension correspond to a coordinate? I think not in the sense that xarray won't raise an exception if this doesn't hold. Suggestion: replace "must" with "should."

xarray supports both dimensions without coordinates and coordinates without dimensions. I do not see any usecase for them in InferenceData/ArviZ (ArviZ does not support any of the two and I don't recall anybody complaining either) so even if xarray supports it, I would rather InferenceData "not supporting" it.

rpgoldman · 2019-12-02T22:17:28Z

doc/schema/schema.md

+
+* **Variable**: NetCDF-like variables are multidimensional labeled arrays representing a single quantity. Variables and their dimensions must be named. They can also have attributes describing it. Relevant terms related to InferenceData variables are: *variable_name*, *values* (its data), *dimensions*, *coordinates*, and *attributes*.
+* **Dimension**: The dimensions of an object are its named axes. A variable containing 3D data can have dimensions `[chain, draw, dim0]`, thus, its `0th`-dimension is `chain`, its `1st`-dimension is `draw` and so on. Every dimension present in an InferenceData variable must share names with a *coordinate*. Given that dimensions must be named, dimension and dimension name are used equivalents.
+* **Coordinate**: A named array that labels a dimension. A coordinate named `chain` with values `[0, 1, 2, 3]` would label the `chain` dimension. Coordinate names and values can be loosely though of as labels and tick labels along a dimension. 


I think this is a horrible thing about xarray -- we have dimensions and coordinates that are named by the same string. I think of the named array not as labeling a dimension, but as defining that dimension. I.e., the Coordinate object for chain defines the chain dimension.

But even that is not quire right, since we can use a coords argument to restrict attention to a subset of a dimension...

It's not the fault of Arviz, of course, that xarray adopted such misleading terminology.

rpgoldman · 2019-12-02T22:19:09Z

doc/schema/schema.md

+* **Variable**: NetCDF-like variables are multidimensional labeled arrays representing a single quantity. Variables and their dimensions must be named. They can also have attributes describing it. Relevant terms related to InferenceData variables are: *variable_name*, *values* (its data), *dimensions*, *coordinates*, and *attributes*.
+* **Dimension**: The dimensions of an object are its named axes. A variable containing 3D data can have dimensions `[chain, draw, dim0]`, thus, its `0th`-dimension is `chain`, its `1st`-dimension is `draw` and so on. Every dimension present in an InferenceData variable must share names with a *coordinate*. Given that dimensions must be named, dimension and dimension name are used equivalents.
+* **Coordinate**: A named array that labels a dimension. A coordinate named `chain` with values `[0, 1, 2, 3]` would label the `chain` dimension. Coordinate names and values can be loosely though of as labels and tick labels along a dimension. 
+* **Group**: Dataset containing one or several variables with a conceptual link between them. Variables inside a group will generally share some dimensions too.


Group: In Arviz's InferenceData objects, a group is an xarray.Dataset containing...
For example, the posterior group contains a representation of the posterior distribution conditioned on the observatinos in the observed_data group.

rpgoldman · 2019-12-02T22:23:35Z

doc/schema/schema.md

+* **Dimension**: The dimensions of an object are its named axes. A variable containing 3D data can have dimensions `[chain, draw, dim0]`, thus, its `0th`-dimension is `chain`, its `1st`-dimension is `draw` and so on. Every dimension present in an InferenceData variable must share names with a *coordinate*. Given that dimensions must be named, dimension and dimension name are used equivalents.
+* **Coordinate**: A named array that labels a dimension. A coordinate named `chain` with values `[0, 1, 2, 3]` would label the `chain` dimension. Coordinate names and values can be loosely though of as labels and tick labels along a dimension. 
+* **Group**: Dataset containing one or several variables with a conceptual link between them. Variables inside a group will generally share some dimensions too.
+* **Matching samples**: Two variables (or groups) whose samples match are the ones generated with the same set of samples. Therefore, they will share dimensions and coordinates relative to samples. Sample dimensions (generally `(chain, draw)`) are the ones introduced by the sampling process. 


nit: should be just "are ones generated" not "the ones generated".

It's not really clear what "relative to samples" means. "Therefore, they will share the dimensions and coordinates corresponding to the sampling process."

Maybe an example, showing two variables that have different shapes but that match because they come from the same sampling process?

I'm not sure how groups in the same InferenceData object would match. E.g., even in posterior_predictive sampling, PyMC3 might make just a number of draws corresponding to the number of draws times the number of chains in the posterior, so that the posterior and posterior predictive would not match.

rpgoldman · 2019-12-02T22:24:47Z

doc/schema/schema.md

+* **Coordinate**: A named array that labels a dimension. A coordinate named `chain` with values `[0, 1, 2, 3]` would label the `chain` dimension. Coordinate names and values can be loosely though of as labels and tick labels along a dimension. 
+* **Group**: Dataset containing one or several variables with a conceptual link between them. Variables inside a group will generally share some dimensions too.
+* **Matching samples**: Two variables (or groups) whose samples match are the ones generated with the same set of samples. Therefore, they will share dimensions and coordinates relative to samples. Sample dimensions (generally `(chain, draw)`) are the ones introduced by the sampling process. 
+* **Matching variables**: Two groups with matching variables are groups that conceptually share variables, variable dimensions and coordinates of the variable dimensions but do not necessarily share variable names nor sample dimensions. Variable dimensions are the ones intrinsic to the data and model as opposed to sample dimensions which are the ones relative to the sampling process. When talking about specific variables, this same idea is expressed as one variable being the counterpart of the other. 


I really didn't understand this. What does it mean to conceptually share variables, if the variables can have different names?

This seems like an informal notion that's not capable of crisp definition, so maybe drop it?

We can try to define this better in a following PR maybe.

I am trying to include cases like PyStan, where the observed variable and the posterior predictive cannot have the same name. This is the reason for example that the argument data_pairs exists in plot_ppc. #843 may provide some context.

rpgoldman · 2019-12-02T22:25:28Z

doc/schema/schema.md

 ## Current design
-`InferenceData` stores all quantities relevant to fulfilling its goals in different groups. Each group, described below, stores a conceptually different quantity. In general, each quantity will be represented by several multidimensional labeled variables.
+`InferenceData` stores all quantities relevant to fulfilling its goals in different groups. Different groups generally distinguish conceptually different quantities in Bayesian inference, however, convenience in creation and usage of InferenceData objects also plays a role. In general, each quantity will be represented by several multidimensional labeled variables.


Could you add an example of what a "quantity" is here?

rpgoldman · 2019-12-02T22:27:54Z

doc/schema/schema.md

@@ -70,7 +81,7 @@ Information and diagnostics for each `posterior` sample, provided by the inferen
 * `max_energy_error`

 ### `posterior_predictive`
-Posterior predictive samples p(y|y) corresponding to the posterior predictive distribution evaluated at the `observed_data`. Samples should match with `posterior` ones and each variable should have a counterpart in `observed_data`. The `observed_data` counterpart variable may have a different name.
+Posterior predictive samples p(y|y) corresponding to the posterior predictive distribution evaluated at the `observed_data`. Samples should match with `posterior` ones and its variables should match `observed_data` variables. The `observed_data` counterpart variable may have a different name.


I followed this until the last sentence. How would the observed data counterpart variable have a different name? And if it did have a different name, in what way would it be a counterpart?

An example might clarify this, but IMO we should not rely on a notion of "counterpart" if we don't have a definition, in code, of how one could test two variables to see if they are counterparts.

I think a code returning True if two variables can be counterparts is possible, however, it would never be able to reject variables having the same shape by chance.

The "same variable/different name" already works in ArviZ without problem (see data_pairs argument in plot_ppc), I think only objects coming from Stan world need it.

rpgoldman · 2019-12-02T22:30:48Z

doc/schema/schema.md

@@ -79,7 +90,7 @@ Observed data on which the `posterior` is conditional. It should only contain da
 Model constants, data included in the model which is not modeled as a random variable. It should be the data used to generate the `posterior` and `posterior_predictive` samples.


Wouldn't these also be used to generate samples from the prior definition described immediately below?

rpgoldman · 2019-12-02T22:32:30Z

doc/schema/schema.md

@@ -79,7 +90,7 @@ Observed data on which the `posterior` is conditional. It should only contain da
 Model constants, data included in the model which is not modeled as a random variable. It should be the data used to generate the `posterior` and `posterior_predictive` samples.

 ### `prior`
-Samples from the prior distribution p(theta). Samples need not match `posterior` samples. However, this group will still follow the convention on `chain` and `draw` as first dimensions. Each variable should have a counterpart in `posterior`.
+Samples from the prior distribution p(theta). Samples need not match `posterior` samples. However, this group will still follow the convention on `chain` and `draw` as first dimensions. It should have matching variables with the `posterior` group.

 ### `sample_stats_prior`
 Information and diagnostics for each `prior` sample, provided by the inference backend. It may vary depending on the algorithm used by the backend. Variable names follow the same convention defined in [`sample_stats`](#sample_stats).


I don't think "for each prior sample" is correct, is it? Shouldn't it be

"Information and diagnostics for the samples in the prior group, provided by the inference backend."?

rpgoldman · 2019-12-28T21:16:23Z

For a project I very much want to make InferenceData to hold predictions. I was starting to edit io_pymc3.py to do this, and realized that there is a real problem with the way that this is done in PyMC3:

In PyMC3 in order to do out of sample predictions, you must make a new pm.Model object. The reason for this is that if you modify the constant data to make predictions, you may be changing the shape of variables in the model. This cannot be done, because those shapes are compiled into the model (even if they are held in pm.Data).

Therefore, the procedure for doing out-of-sample predictions in PyMC3 is as follows:

Create a model, m1, with observations and constant data.
Using m1 and pm.sample, generate a posterior trace t
Create a model m2 with new constant data.
Take trace t and thin it to remove all variables that are functions of the original constant data, giving t'.
Using m2, t', and pm.sample_posterior_predictive generate a predictive trace, t3.

So this means if we are to have az.from_pymc3 hold both the posterior and a predictive trace, we have to somehow give it two models from which to extract the two different sets of constant data.

This seems likely to be complex enough that I think we should consider the possibility that users will simply have two have two InferenceData objects, one for the posterior model (and posterior predictive and prior predictive), and one for the predictions. Two other reasons for doing this are as follows:

from_pymc3 takes only one set of coords, which is incompatible with having different shapes in the posterior and prediction groups.
a user who wants to make more than one prediction will still need to have multiple InferenceData objects, since there's only one predictions group.

It might make sense for prediction InferenceData objects to have some kind of cross-reference with a posterior InferenceData on which it bases its predictions.

One more thought: when building an InferenceData object for predictions, one could use the thinned trace (my t', above) as the trace argument to from_pymc3, instead of using the raw posterior trace. This would still mean one would need a separate InferenceData for the original, full posterior trace.

OriolAbril · 2019-12-28T21:49:18Z

I currently don't have much time, I'll try to answer properly whenever I have, I need some time to propery follow PyMC3 internals. This looks much more relevant to #916 than here though. I think some of your concerns are already adressed there, not all of them though. If #916 were to be of any help feel free to build on top of it.

rpgoldman · 2019-12-28T21:51:15Z

@OriolAbril Thanks! If I can get my initial version turning over, I will push it as a WIP merge request so you can see what I'm doing and check and see if it makes sense to you.

OriolAbril · 2020-01-16T19:23:14Z

How do you feel about merging this and continue improving the schema in other PRs?

We will have to update the file for predictions and for log likelihoods, having this here will only complicate things.

rpgoldman · 2020-01-16T21:06:49Z

How do you feel about merging this and continue improving the schema in other PRs?

We will have to update the file for predictions and for log likelihoods, having this here will only complicate things.

Works for me. I'm not a huge fan of long-lived feature branches, anyway.

Let's get your work so far in, and then see if we can digest the predictions support for PyMC3 and fit it in lter.

try to clarify schema specification

ed72625

rpgoldman reviewed Nov 24, 2019

View reviewed changes

terminology section

11729f8

rpgoldman reviewed Dec 2, 2019

View reviewed changes

rpgoldman mentioned this pull request Dec 30, 2019

Proposal for PyMC3 predictions support #916

Closed

addressed some comments

831e034

OriolAbril merged commit f92e9cc into master Jan 17, 2020

OriolAbril deleted the schema_wording branch January 17, 2020 11:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify schema specification #913

Clarify schema specification #913

OriolAbril commented Nov 23, 2019

rpgoldman Nov 24, 2019

OriolAbril Nov 24, 2019

OriolAbril Nov 24, 2019

rpgoldman Nov 24, 2019

rpgoldman Nov 24, 2019

rpgoldman Nov 24, 2019

OriolAbril Nov 24, 2019

OriolAbril Nov 24, 2019

rpgoldman Nov 24, 2019

OriolAbril Nov 24, 2019

rpgoldman Nov 24, 2019

rpgoldman left a comment

rpgoldman Dec 2, 2019

OriolAbril Dec 2, 2019

rpgoldman Dec 2, 2019

OriolAbril Dec 2, 2019

rpgoldman Dec 2, 2019

rpgoldman Dec 2, 2019

rpgoldman Dec 2, 2019

rpgoldman Dec 2, 2019

OriolAbril Jan 16, 2020

rpgoldman Dec 2, 2019

rpgoldman Dec 2, 2019

OriolAbril Dec 2, 2019

rpgoldman Dec 2, 2019

rpgoldman Dec 2, 2019

rpgoldman commented Dec 28, 2019 •

edited

OriolAbril commented Dec 28, 2019

rpgoldman commented Dec 28, 2019

OriolAbril commented Jan 16, 2020

rpgoldman commented Jan 16, 2020

		@@ -79,7 +90,7 @@ Observed data on which the `posterior` is conditional. It should only contain da
		Model constants, data included in the model which is not modeled as a random variable. It should be the data used to generate the `posterior` and `posterior_predictive` samples.

Clarify schema specification #913

Clarify schema specification #913

Conversation

OriolAbril commented Nov 23, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rpgoldman left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rpgoldman commented Dec 28, 2019 • edited

OriolAbril commented Dec 28, 2019

rpgoldman commented Dec 28, 2019

OriolAbril commented Jan 16, 2020

rpgoldman commented Jan 16, 2020

rpgoldman commented Dec 28, 2019 •

edited