(working_with_InferenceData)=

# Working with InferenceData

Here we present a collection of common manipulations you can use while working with `InferenceData`.

In [1]:
import arviz as az
import numpy as np

In [2]:
idata = az.load_arviz_data("centered_eight")
idata

## Get the dataset corresponding to a single group

In [3]:
post = idata.posterior
post

:::{tip} 
You'll have noticed we stored the posterior group in a new variable: `post`. Because `.copy()` was not called, `post` and `idata.posterior` refer to the same object, so using either one is equivalent.

Use this to keep your code short yet easy to read. Store the groups you need most often as separate variables and use them explicitly, but don't delete the parent InferenceData. Many ArviZ functions need it to work properly. For example, {func}`~arviz.plot_pair` needs data from the `sample_stats` group to show divergences, {func}`~arviz.compare` needs data from both the `log_likelihood` and `posterior` groups, and {func}`~arviz.plot_loo_pit` needs three groups: `log_likelihood`, `posterior_predictive` and `posterior`.
:::

## Add a new variable


In [4]:
post["log_tau"] = np.log(post["tau"])
idata.posterior
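
Because `post` refers to the group itself rather than to a copy, the assignment above also changes `idata.posterior`. The sketch below contrasts that with working on an explicit copy. The names `post_copy` and `tau_squared` are illustrative only and are not used elsewhere in this guide.

```python
import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")

# Alias: changes made through `post` show up in idata.posterior.
post = idata.posterior
post["log_tau"] = np.log(post["tau"])

# Copy: new variables stay in the copy and leave idata.posterior untouched.
post_copy = idata.posterior.copy()
post_copy["tau_squared"] = post_copy["tau"] ** 2  # hypothetical extra variable

print("log_tau" in idata.posterior)
print("tau_squared" in post_copy, "tau_squared" in idata.posterior)
```

Working on a copy is a reasonable choice when you want scratch variables that should not end up in the `InferenceData` object.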

## Combine chains and draws

In [5]:
stacked = az.extract_dataset(idata)
stacked

You can also use {meth}`xarray.Dataset.stack` if you only want to combine the chain and draw dimensions. {func}`arviz.extract_dataset` is a convenience function aimed at taking care of the most common subsetting operations with MCMC samples. It can:
- Combine chains and draws
- Return a subset of variables (with optional filtering with regular expressions or string matching)
- Return a subset of samples. By default the subset is random, which prevents getting non-representative samples due to bad mixing.
- Access any group
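
As a rough illustration of the first two points, the sketch below compares stacking by hand with `az.extract_dataset` restricted to two variables. It assumes the `centered_eight` posterior contains `mu` and `tau`, as used elsewhere in this guide. The names `manual` and `subset` are illustrative.

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

# Manual route: flatten chain and draw into a single "sample" dimension.
manual = idata.posterior.stack(sample=("chain", "draw"))

# Convenience route: same stacking, keeping only the requested variables.
subset = az.extract_dataset(idata, var_names=["mu", "tau"])

print(manual.sizes["sample"], subset.sizes["sample"])
print(sorted(subset.data_vars))
```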

(idata/random_subset)=
## Get a random subset of the samples

In [6]:
az.extract_dataset(idata, num_samples=100)

:::{tip}
Use a random seed to get the same subset from multiple groups: `az.extract_dataset(idata, num_samples=100, rng=3)` and `az.extract_dataset(idata, group="log_likelihood", num_samples=100, rng=3)` will return matching samples.
:::
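
One way to check that two seeded subsets line up is to compare the `chain` and `draw` labels of the selected samples. This is only a sketch: `post_sub` and `ll_sub` are illustrative names, and it assumes both groups hold the same number of chains and draws.

```python
import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")

post_sub = az.extract_dataset(idata, num_samples=100, rng=3)
ll_sub = az.extract_dataset(idata, group="log_likelihood", num_samples=100, rng=3)

# Each selected sample keeps its original chain and draw labels.
same_chain = np.array_equal(post_sub["chain"].values, ll_sub["chain"].values)
same_draw = np.array_equal(post_sub["draw"].values, ll_sub["draw"].values)
print(same_chain and same_draw)
```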

## Obtain a NumPy array for a given parameter

Let's say we want to get the values for `mu` as a NumPy array.

In [7]:
stacked.mu.values

array([-3.47698606, -2.45587061, -2.82625433, ...,  4.59705819,
        5.89850592,  0.16138927])
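
The stacked array is one-dimensional because `chain` and `draw` were combined. If you need to keep the chains apart, you can take `.values` from the unstacked group instead. A small sketch comparing the two shapes, with `mu_by_chain` and `mu_flat` as illustrative names:

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

# Unstacked: one row per chain, one column per draw.
mu_by_chain = idata.posterior["mu"].values

# Stacked: chain and draw flattened into a single axis.
mu_flat = idata.posterior["mu"].stack(sample=("chain", "draw")).values

print(mu_by_chain.shape, mu_flat.shape)
```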

## Get the dimension lengths

Let’s check how many groups (schools) are in our hierarchical model.

In [8]:
len(idata.observed_data.school)

8

## Get coordinate values

What are the names of the groups in our hierarchical model? In this case, you can access them from the `school` coordinate.

In [9]:
idata.observed_data.school
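
The coordinate is returned as an xarray object. If you want plain values, for instance to use them as labels or to select one school, something like the following should work. The names `school_names`, `n_schools` and `first_school_theta` are illustrative.

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

# Coordinate labels as a NumPy array, and the length of the dimension.
school_names = idata.observed_data["school"].values
n_schools = idata.observed_data.sizes["school"]
print(n_schools, list(school_names))

# Labels can be passed back to .sel to pick a single school.
first_school_theta = idata.posterior["theta"].sel(school=school_names[0])
print(first_school_theta.dims)
```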

## Get a subset of chains

Let’s evaluate only chains 0 and 2 here.

In [10]:
idata.sel(chain=[0, 2]).posterior
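
A quick way to confirm the selection is to inspect the `chain` coordinate of the result. In this sketch `subset` is an illustrative name, and the original `idata` is expected to stay unchanged because `sel` is not called with `inplace=True`.

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

subset = idata.sel(chain=[0, 2])

# Remaining chain labels in the selection.
print(subset.posterior["chain"].values)

# Number of chains before and after the selection.
print(idata.posterior.sizes["chain"], subset.posterior.sizes["chain"])
```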

## Remove the first n draws (burn-in)

Let’s say we want to remove the first 100 draws from every chain, in all `InferenceData` groups that have a `draw` dimension.

In [11]:
burnin = idata.sel(draw=slice(100, None))

If you check the `burnin` object you will see that the groups `posterior`, `posterior_predictive`, `prior` and `sample_stats` now have 400 draws, compared to 500 in `idata`. The group `observed_data` has not been affected because it does not have the `draw` dimension. Alternatively, you can specify which group or groups you want to change.

In [12]:
burnin_posterior = idata.sel(draw=slice(100, None), groups="posterior")
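
To see the difference between the two calls, you can compare the length of the `draw` dimension across groups. A minimal check, assuming the 500 draws per chain mentioned above:

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

burnin = idata.sel(draw=slice(100, None))
burnin_posterior = idata.sel(draw=slice(100, None), groups="posterior")

# Without `groups`, every group that has a draw dimension is trimmed.
print(burnin.posterior.sizes["draw"], burnin.sample_stats.sizes["draw"])

# With groups="posterior", only the posterior group is trimmed.
print(
    burnin_posterior.posterior.sizes["draw"],
    burnin_posterior.sample_stats.sizes["draw"],
)
```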

## Compute posterior mean values along draw and chains dimensions

If you want to compute the mean value of the posterior samples, you can simply do the following:


In [13]:
idata.posterior.mean()

This computes the mean along all dimensions. That is probably what you want for `mu` and `tau`, which have two dimensions (`chain` and `draw`), but maybe not what you expected for `theta`, which has an additional `school` dimension. You can specify the dimensions along which to compute the mean (or any other reduction).

In [14]:
idata.posterior.mean(dim=['chain', 'draw'])
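
The difference is easiest to see by looking at which dimensions of `theta` survive each reduction. A short sketch, with `overall` and `per_school` as illustrative names:

```python
import arviz as az

idata = az.load_arviz_data("centered_eight")

# Reduce over every dimension: theta collapses to a single number.
overall = idata.posterior.mean()

# Reduce over chain and draw only: theta keeps one value per school.
per_school = idata.posterior.mean(dim=["chain", "draw"])

print(overall["theta"].dims, per_school["theta"].dims)
```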