Add an `InferenceData` object #173

ColCarroll · 2018-08-23T23:38:39Z

This is a pretty big change - see also #169. I am sure there are some rough edges, but this seems like a flexible way to get to feature parity with current PyMC3 plotting.

InferenceData is the object that carries all the schema data that is available. Having worked with it a little bit, I like it: it is just a light wrapper on a netCDF.Dataset that accesses xarray.Dataset objects. It supports tab completion, and usage looks like:

```
data = az.InferenceData('my_analysis.nc')
data.posterior.mu.mean()   # 42.0
data.sample_stats.diverging.sum()  # 2
data.prior.mu.mean()  # AttributeError until it gets implemented
print(data)
```

```
Inference data from "/home/my_analysis.nc" with groups:
	> posterior
	> sample_stats
```
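
For readers who want to see how a wrapper like this can behave, here is a minimal stdlib-only sketch. `GroupedData` is a hypothetical name and is not the ArviZ implementation: it stores plain Python objects per group instead of reading `xarray.Dataset` objects from a netCDF file, and only imitates the attribute access, tab completion, and printed summary described above.

```python
class GroupedData:
    """Toy stand-in for the wrapper described above (hypothetical, not ArviZ code)."""

    def __init__(self, filename, groups):
        # `groups` maps a group name to whatever object represents it;
        # in the PR each group would be an xarray.Dataset read from the file.
        self._filename = filename
        self._groups = dict(groups)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        # Go through __dict__ so the lookup cannot recurse before _groups is set.
        groups = self.__dict__.get("_groups", {})
        if name in groups:
            return groups[name]
        raise AttributeError(f"no group named {name!r}")

    def __dir__(self):
        # Advertise group names so interactive tab completion can find them.
        return sorted(set(super().__dir__()) | set(self._groups))

    def __repr__(self):
        lines = [f'Inference data from "{self._filename}" with groups:']
        lines += [f"\t> {name}" for name in self._groups]
        return "\n".join(lines)


data = GroupedData("my_analysis.nc", {"posterior": {"mu": 42.0}, "sample_stats": {}})
print(data)
```

Accessing a group that was never registered (for example `data.prior`) raises `AttributeError`, which mirrors the "until it gets implemented" comment in the usage above.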

I added sample_stats to the PyMC3 extractor. It was pretty easy, and I can follow up with a similar job on PyStan (or @ahartikainen can!). We should argue about names for those sample_stats as well as the required/optional stats then, and update the schema accordingly.
A funny thing about InferenceData is that it is file based, not memory based. I did not want every plotting function to require a filename, and I want them to work out of the box with PyMC3 or PyStan objects, so making a plot with one of these objects will write a file to disk. It uses tempfile to get a unique filename, and always writes into the same directory. By default, it writes to .arviz_data/, but that can be changed with
```
import arviz as az
az.config['default_data_directory'] = 'somewhere_else'
```
If anyone has a more elegant way of handling this, I'm all ears. I was thinking of at least adding a warning every once in a while about the existence of this folder along with a suggestion to clean it out. Maybe every time the number of files in the directory is a multiple of 10, spawn a warning?
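
To make the proposal concrete, here is a rough sketch of the temp-file scheme with the "warn on every multiple of 10 files" idea. It is an illustration under assumptions, not the code in this PR: `config` is a plain dict standing in for `az.config`, and `new_data_file` and `warn_every` are hypothetical names.

```python
import os
import tempfile
import warnings

# Hypothetical stand-in for the `az.config` mapping mentioned above.
config = {"default_data_directory": ".arviz_data"}


def new_data_file(warn_every=10):
    """Create a uniquely named, empty .nc file in the configured directory."""
    directory = config["default_data_directory"]
    os.makedirs(directory, exist_ok=True)
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, path = tempfile.mkstemp(suffix=".nc", dir=directory)
    os.close(fd)
    n_files = len(os.listdir(directory))
    # Nudge the user whenever the file count hits a multiple of `warn_every`.
    if n_files % warn_every == 0:
        warnings.warn(
            f"{directory!r} now holds {n_files} files; consider cleaning it out."
        )
    return path
```

Counting directory entries on every write is cheap for a handful of files, but it also counts anything else the user has put in that folder, which may or may not be the desired behavior.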
I updated the sample data to use InferenceData. az.load_arviz_data('centered_eight') is a nice way to start playing with this.
I tried to update documentation and function names, and got most of the way there.

ahartikainen · 2018-08-23T23:50:12Z

I can update PyStan stuff, I just need the normalized names.

sample_stats == rhat, etc ?

And then we also have sampler_stats or diagnostics?

ColCarroll · 2018-08-23T23:55:15Z

Heh, this maybe means we need better names. I took these straight from their names in pymc3 after a NUTS run:

data.sample_stats

<xarray.Dataset>
Dimensions:           (chain: 4, draw: 500)
Coordinates:
  * chain             (chain) int64 0 1 2 3
  * draw              (draw) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
    depth             (chain, draw) int64 ...
    diverging         (chain, draw) bool False False False False False False ...
    energy            (chain, draw) float64 ...
    energy_error      (chain, draw) float64 ...
    max_energy_error  (chain, draw) float64 ...
    mean_tree_accept  (chain, draw) float64 ...
    step_size         (chain, draw) float64 ...
    step_size_bar     (chain, draw) float64 ...
    tree_size         (chain, draw) float64 ...
    tune              (chain, draw) bool ...

canyon289 · 2018-08-25T03:48:22Z

arviz/utils/xarray_utils.py

@@ -1,18 +1,21 @@
 from abc import ABC, abstractmethod, abstractstaticmethod
-import re


Does it make sense to call this xarray_utils now?

ColCarroll · 2018-08-25

Good point. I think we're close to getting rid of some of the original dataframe functions in utils.py, so there should probably be a refactor then. Really, these classes that convert objects should probably live in their own module, with the inference_data object.

ahartikainen · 2018-08-25T06:47:44Z

I have updated the PyStan code (I also updated some dim calculations).
I can't push to your branch, so maybe I'll wait until this is merged and create another PR.

ahartikainen · 2018-08-25T06:49:56Z

Also, I found that only `divergent__` --> `diverging` and `energy__` --> `energy` are the "same". Not sure about the rest.
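
A small sketch of what that renaming could look like on the PyStan side. Only the two mappings mentioned in this comment are included; every other name is passed through untouched because no normalized name has been agreed on. `PYSTAN_TO_NORMALIZED` and `normalize_stat_names` are hypothetical names, and the values are made-up placeholders.

```python
# Only these two renames are confirmed in the discussion above.
PYSTAN_TO_NORMALIZED = {"divergent__": "diverging", "energy__": "energy"}


def normalize_stat_names(stats):
    """Return a copy of `stats` with the known PyStan names renamed."""
    return {
        PYSTAN_TO_NORMALIZED.get(name, name): values
        for name, values in stats.items()
    }


# Placeholder values for illustration; `stepsize__` has no agreed mapping yet,
# so it keeps its original name.
raw = {
    "divergent__": [0, 1, 0],
    "energy__": [3.2, 3.1, 3.4],
    "stepsize__": [0.1, 0.1, 0.1],
}
normalized = normalize_stat_names(raw)
print(list(normalized))
```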

ColCarroll · 2018-08-25T12:19:29Z

Sounds good to me! Most sampler stats will have to be optional (`diverging` doesn't make much sense for Metropolis-Hastings samples, for example), so we will also have to implement some schema checking after this.
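
As a starting point for that discussion, here is one way the check could be shaped. This is a sketch only: `check_sample_stats` is a hypothetical helper, and the `OPTIONAL` set below is an assumption for illustration, since which stats are required or optional has not been decided in this thread.

```python
def check_sample_stats(present, required=(), optional=()):
    """Return (missing required names, names not in the schema), both sorted."""
    present = set(present)
    missing = sorted(set(required) - present)
    unknown = sorted(present - set(required) - set(optional))
    return missing, unknown


# Hypothetical schema built from some of the PyMC3 NUTS names listed earlier.
OPTIONAL = {"diverging", "energy", "depth", "step_size", "tree_size"}

# A NUTS-like run reports some of the optional stats: nothing to complain about.
print(check_sample_stats(["diverging", "energy"], optional=OPTIONAL))

# A sampler with no divergence information only fails if `diverging` is required.
print(check_sample_stats([], required={"diverging"}))
```

Returning both lists instead of raising lets the caller decide whether an unknown stat is an error or just something to warn about.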

ColCarroll added 6 commits August 24, 2018 12:01

Change base data structure to be netcdf with groups

36fc431

Update documentation, examples

50a2a10

Make sure docs still build

e1b7db0

Update index language

65ead32

Fix lint errors

fbc23f8

Rebase, update violinplot to netcdf

8c60516

ColCarroll force-pushed the move_to_netcdf branch from 59b16f0 to 8c60516 Compare August 24, 2018 16:05

canyon289 reviewed Aug 25, 2018

canyon289 mentioned this pull request Aug 25, 2018

Gallery plot for violintraceplot #174

Closed

ColCarroll merged commit affe364 into arviz-devs:master Aug 26, 2018

ColCarroll mentioned this pull request Aug 28, 2018

Representing schema in xarray #169

Closed

ColCarroll deleted the move_to_netcdf branch July 4, 2019 02:55
