# Section 5.1 Model Comparison Methods
*How can we determine which model better fits our needs?*

In [1]:
import os
import time

import arviz as az
import numpy as np
import pymc3 as pm
import scipy.stats as stats

from utils import metropolis_hastings

np.random.seed(0)

# Path Constants
if os.path.split(os.getcwd())[-1] != "notebooks":
    os.chdir(os.path.join(".."))
    
NETCDF_DIR = "inference_data"



## Learning Objectives
* Understand how to interpret the WAIC and PSIS-LOO numerical metrics
* Understand how to interpret the output of `az.plot_compare`

## Infinite Parameters and Infinite Models
As Bayesian modelers we have to handle not only infinitely many parameter values, but also infinitely many model definitions.

Take our water example from Section 1.3. 
$$
\theta = Uniform(0,1) \\
p_{water} = Binom(\theta)
$$
In this model we're evaluating not just one possible proportion of water on a planet, but infinitely many proportions between 0 and 1.
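
One way to make "infinitely many proportions" concrete is to approximate the posterior on a fine grid of candidate values of $\theta$. The sketch below is illustrative only: the data (6 water results in 9 tosses), the helper names `binom_pmf` and `grid_posterior`, and the grid size are assumptions, not taken from Section 1.3.

```python
import math

# Hypothetical data (an assumption for illustration): 6 "water" results in 9 tosses.
WATER, TOSSES = 6, 9

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def grid_posterior(water, tosses, n_grid=101):
    """Posterior over theta on an evenly spaced grid, under a Uniform(0, 1) prior."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    prior = [1.0] * n_grid  # flat prior: every theta is equally plausible
    unnormalized = [pr * binom_pmf(water, tosses, th) for pr, th in zip(prior, grid)]
    total = sum(unnormalized)
    return grid, [u / total for u in unnormalized]

grid, posterior = grid_posterior(WATER, TOSSES)
best = grid[posterior.index(max(posterior))]
print(f"Most plausible theta on the grid: {best:.2f}")
```

Every grid point gets a posterior weight, so the model scores all candidate proportions at once rather than committing to a single value.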

But this raises the question: why 0 to 1, or why even this model? This is also a valid model

$$
\theta = Beta(1,5) \\
p_{water} = Binom(\theta)
$$

as is this model

$$ 
\theta = Uniform(0,1) \\
\sigma = Uniform(0,100) \\
p_{water} = Norm(\theta, \sigma)
$$
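
To see that the choice of model matters, the first two models above can be compared in closed form: Uniform(0, 1) is the same distribution as Beta(1, 1), and a Beta prior combined with binomial data gives a Beta posterior. This is a minimal sketch under assumed data (6 water results in 9 tosses); the helper names are hypothetical.

```python
# Hypothetical data (an assumption for illustration): 6 "water" results in 9 tosses.
WATER, TOSSES = 6, 9

def beta_posterior(alpha, beta, water, tosses):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data gives a Beta posterior."""
    return alpha + water, beta + (tosses - water)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

priors = {"Uniform(0, 1) = Beta(1, 1)": (1, 1), "Beta(1, 5)": (1, 5)}
posterior_means = {}
for name, (a, b) in priors.items():
    post_a, post_b = beta_posterior(a, b, WATER, TOSSES)
    posterior_means[name] = beta_mean(post_a, post_b)
    print(f"{name}: posterior Beta({post_a}, {post_b}), mean {posterior_means[name]:.3f}")
```

With the same data, the two priors lead to noticeably different posterior means, which is why we need a principled way to choose between models.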

### How do we pick?
The flexibility to design whatever model you like is simultaneously the beauty and the challenge of Bayesian modelling. While this philosophy is nice, the practitioner still faces the question of which model to choose, and luckily there are tools that help. In particular we'll cover the Widely Applicable Information Criterion (WAIC) and how it's used in conjunction with `az.plot_compare`.

## Information Criterion
When you hear Information Criterion you usually also hear the words entropy, divergence, and deviance. While the details are too much for this tutorial, the general idea behind all these concepts is the same: "How can we quantify the uncertainty of the model?" and, in the case of divergence, "How can we quantify the difference between two distributions?"

This question is particularly challenging in Bayesian statistics because we don't get just one prediction; we get a whole distribution of predictions. However, this does not mean that Bayesian analysis is immune to phenomena such as overfitting or excess complexity.

$$AIC = D_{train} + 2k$$ 

* Information criteria balance how well a model fits the training data against the number of parameters it uses
* AIC is a well-known and simple example: in the formula above, $D_{train}$ is the deviance on the training data and $k$ is the number of parameters
* AIC has limitations, though, so WAIC is usually the better choice. Its formula is more complicated, but the idea is the same
* A good reference: https://www.youtube.com/watch?v=gjrsYDJbRh0
* Practical advice: on the deviance scale used above, lower tends to be better

**Need to fill out this section**
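
As a rough illustration of the two criteria, the sketch below computes AIC and a deviance-scale WAIC by hand for a one-parameter Bernoulli model. Everything here is an assumption made for illustration: the nine observations, the helper names `aic` and `waic`, and the shortcut of drawing posterior samples directly from the Beta posterior implied by a Uniform(0, 1) prior instead of running a sampler. It is not how ArviZ computes these quantities internally.

```python
import math
import random
import statistics

# Hypothetical observations (an assumption): 1 = water, 0 = land.
DATA = [1, 0, 1, 1, 1, 0, 1, 0, 1]

def bernoulli_logpmf(y, theta):
    """Log probability of a single 0/1 observation given theta."""
    return math.log(theta if y == 1 else 1.0 - theta)

def aic(data):
    """AIC = D_train + 2k for a one-parameter Bernoulli model fit by maximum likelihood."""
    theta_hat = sum(data) / len(data)  # maximum likelihood estimate
    deviance = -2.0 * sum(bernoulli_logpmf(y, theta_hat) for y in data)
    k = 1  # one free parameter: theta
    return deviance + 2 * k

def waic(data, theta_samples):
    """WAIC on the deviance scale, computed from posterior samples of theta."""
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for y in data:
        log_liks = [bernoulli_logpmf(y, th) for th in theta_samples]
        lppd += math.log(statistics.fmean(math.exp(ll) for ll in log_liks))
        p_waic += statistics.variance(log_liks)
    return -2.0 * (lppd - p_waic)

# With a Uniform(0, 1) prior the posterior is Beta(1 + successes, 1 + failures),
# so we can draw posterior samples directly instead of running MCMC.
rng = random.Random(0)
successes = sum(DATA)
failures = len(DATA) - successes
samples = [rng.betavariate(1 + successes, 1 + failures) for _ in range(4000)]

aic_value = aic(DATA)
waic_value = waic(DATA, samples)
print(f"AIC:  {aic_value:.2f}")
print(f"WAIC: {waic_value:.2f}")
```

The fixed penalty `2 * k` in AIC is replaced in WAIC by `p_waic`, which is estimated from how much each observation's log-likelihood varies across the posterior. Both values here are on the deviance scale, where lower is better; some tools report the log scale instead, where the direction is reversed, so it is worth checking which scale is in use.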



## Additional Methods: Pareto Smoothed Importance Sampling

**Determine example data to demonstrate these**
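
Until example data is chosen, the idea behind importance-sampling leave-one-out cross-validation can be sketched on a toy problem. Note that this is plain importance sampling, not PSIS: the Pareto smoothing step, which fits a generalized Pareto distribution to the largest importance ratios to stabilize them, is omitted. The observations and helper names are hypothetical, and the conjugate Beta posterior is used so that an exact answer is available for comparison.

```python
import math
import random

# Hypothetical observations (an assumption): 1 = water, 0 = land.
DATA = [1, 0, 1, 1, 1, 0, 1, 0, 1]

def bernoulli_pmf(y, theta):
    """Probability of a single 0/1 observation given theta."""
    return theta if y == 1 else 1.0 - theta

def is_loo_elpd(data, theta_samples):
    """Leave-one-out ELPD via raw importance sampling (no Pareto smoothing)."""
    elpd = 0.0
    n_samples = len(theta_samples)
    for y in data:
        # Importance ratios for leaving out y are proportional to 1 / p(y | theta_s).
        mean_inverse_lik = sum(1.0 / bernoulli_pmf(y, th) for th in theta_samples) / n_samples
        elpd += -math.log(mean_inverse_lik)
    return elpd

def exact_loo_elpd(data):
    """Closed-form leave-one-out ELPD for a Bernoulli model with a Uniform(0, 1) prior."""
    n, successes = len(data), sum(data)
    elpd = 0.0
    for y in data:
        # Posterior without y is Beta(1 + s, 1 + f); its mean is the predictive P(y = 1).
        s = successes - y
        p_one = (1 + s) / (2 + n - 1)
        elpd += math.log(p_one if y == 1 else 1.0 - p_one)
    return elpd

# Posterior samples using all the data: Beta(1 + successes, 1 + failures).
rng = random.Random(0)
successes = sum(DATA)
samples = [rng.betavariate(1 + successes, 1 + len(DATA) - successes) for _ in range(4000)]

estimate = is_loo_elpd(DATA, samples)
exact = exact_loo_elpd(DATA)
print(f"Importance-sampling LOO ELPD: {estimate:.3f}")
print(f"Exact LOO ELPD:               {exact:.3f}")
```

The appeal of this approach is that it reuses the samples from a single fit instead of refitting the model once per held-out observation. In less well-behaved models the raw ratios can have very heavy tails, which is the problem Pareto smoothing is designed to address.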