In [49]:
import pymc as pm
from pymc.math import dot
import pandas as pd
import numpy as np
import arviz as az
import aesara.tensor as at

%load_ext lab_black
%load_ext watermark


# Brozek index prediction

This example goes over linear regression and Bayesian $R^2$.

Adapted from [unit 7: fat2d.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fat2d.odc) and [fatmulti.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fatmulti.odc).

The first lecture uses fat1.odc, which was not provided with the course files, although the code is visible in the Lesson 11 video. That model uses only the X8 (abdomen) predictor.

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/fat.tsv).

## Associated lecture videos:
### Unit 7 Lesson 11

In [1]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed?v=xomK4tcePmc&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=73" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

### Unit 7 Lesson 13

In [2]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed?v=xomK4tcePmc&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=75" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## Problem statement

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.

The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained using scales and a measuring tape.

- y = Brozek index
- X1 = 1 (intercept)
- X2 = age
- X3 = weight
- X4 = height
- X5 = adipose
- X6 = neck  
- X7 = chest
- X8 = abdomen
- X9 = hip
- X10 = thigh
- X11 = knee   
- X12 = ankle
- X13 = bicep
- X14 = forearm
- X15 = wrist

These anthropometric measurements are less intrusive to collect than underwater weighing, but they are also less reliable for assessing body fat.

Set up a linear regression to predict the Brozek index from these body measurements.
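
In the notation of the variable list above, the regression can be sketched as follows. This is only the general form of the likelihood, with $\tau$ denoting the error precision as in the code; the priors differ between the two models in this notebook.

$$
y_i = \sum_{j=1}^{15} \beta_j X_{ij} + \epsilon_i, \qquad \epsilon_i \sim N\left(0, \frac{1}{\tau}\right), \qquad i = 1, \dots, 252,
$$

where $X_{i1} = 1$ for every observation, so $\beta_1$ is the intercept. The single-predictor model keeps only the intercept and the X8 (abdomen) term.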

## Single predictor (X8)

This is a recreation of fat1.odc.

In [5]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2

In [6]:
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]


Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 12 seconds.
The acceptance probability does not match the target. It is 0.8854, but should be close to 0.8. Try to increase the number of tuning steps.


In [7]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -34.999 | 2.559 | -39.751 | -29.76 | 0.049 | 0.034 | 2772.0 | 2685.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.027 | 0.528 | 0.635 | 0.001 | 0.0 | 2752.0 | 2680.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.041 | 0.058 | 0.0 | 0.0 | 3730.0 | 3367.0 | 1.0 |
| variance | 20.527 | 1.829 | 17.313 | 24.401 | 0.03 | 0.022 | 3730.0 | 3367.0 | 1.0 |
| br2 | 0.66 | 0.03 | 0.595 | 0.713 | 0.001 | 0.0 | 3730.0 | 3367.0 | 1.0 |


This matches the results from the U7 L11 video.
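
The `br2` quantity in the model is $1 - (n - p)\,\sigma^2 / SST$, evaluated for each posterior draw of the variance. The sketch below uses NumPy and synthetic stand-in data (not fat.tsv) to show why this form is a natural choice. Plugging in the classical variance estimate $SSE / (n - p)$ recovers the ordinary $R^2$, while plugging in draws gives a distribution. The `bayes_r2` helper and the lognormal "draws" are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the fat.tsv data): intercept plus one predictor.
n, p = 252, 2
x = rng.normal(90.0, 10.0, size=n)
y = -35.0 + 0.6 * x + rng.normal(0.0, 4.5, size=n)

# Total sum of squares around the mean of y.
cy = y - y.mean()
sst = cy @ cy

# Classical least-squares fit and its unbiased variance estimate.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
sse = resid @ resid
s2 = sse / (n - p)


def bayes_r2(variance):
    """1 - (n - p) * variance / SST, applied elementwise to variance values."""
    return 1.0 - (n - p) * np.asarray(variance) / sst


# Plugging in s2 recovers the classical R^2.
r2_classical = 1.0 - sse / sst
r2_plugin = bayes_r2(s2)

# Hypothetical "posterior draws" of the variance, only to show that br2
# becomes a distribution rather than a single number.
variance_draws = s2 * rng.lognormal(mean=0.0, sigma=0.09, size=4000)
br2_draws = bayes_r2(variance_draws)

print(r2_classical, r2_plugin, br2_draws.mean())
```

In the PyMC model the same expression is wrapped in `pm.Deterministic`, so the summary table reports its posterior mean and interval.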

Another way to calculate $R^2$ is from the posterior predictive samples. Keep in mind that there is no single standard "Bayesian $R^2$", so different definitions can give slightly different results:

In [8]:
# posterior predictive mean of y, averaged over all chains and draws
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.660744
r2_std    0.000000
dtype: float64

In this case the two approaches agree, but that won't always be true.
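
The `r2_std` of zero above comes from passing a single averaged prediction vector. The sketch below assumes that `az.r2_score` uses the variance-ratio definition, $\mathrm{Var}(\hat{y}) / (\mathrm{Var}(\hat{y}) + \mathrm{Var}(y - \hat{y}))$, which is worth checking against the ArviZ documentation for your version. It is a NumPy-only illustration on synthetic data. `r2_samples_sketch` is a hypothetical helper, not an ArviZ function.

```python
import numpy as np


def r2_samples_sketch(y_true, y_pred):
    """Variance-ratio R^2: var(pred) / (var(pred) + var(residual)).

    y_pred may be 1-D (one prediction vector) or 2-D with shape
    (n_draws, n_obs), in which case one R^2 is returned per draw.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    axis = None if y_pred.ndim == 1 else 1
    var_pred = np.var(y_pred, axis=axis)
    var_resid = np.var(y_true - y_pred, axis=axis)
    return var_pred / (var_pred + var_resid)


rng = np.random.default_rng(1)
n_obs, n_draws = 252, 500

# Synthetic data (NOT fat.tsv): a straight line plus unit-variance noise.
x = rng.normal(size=n_obs)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n_obs)

# Hypothetical posterior predictive draws: fitted line plus observation noise.
fitted = 2.0 + 1.5 * x
draws = fitted + rng.normal(scale=1.0, size=(n_draws, n_obs))

r2_from_mean = r2_samples_sketch(y, draws.mean(axis=0))  # a single number
r2_per_draw = r2_samples_sketch(y, draws)  # one value per draw

print(r2_from_mean, r2_per_draw.mean(), r2_per_draw.std())
```

Averaging the draws first yields one number with no spread, whereas keeping the draws separate yields a distribution of $R^2$ values with a nonzero standard deviation.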

## All predictors

Based on fat2d.odc.

In [21]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)

# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape

# Zellner's g
g = p**2

n, p, g

(252, 15, 225)

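
Zellner's g-prior places $\beta \mid \tau \sim N\left(0, \frac{g}{\tau}(X^\top X)^{-1}\right)$ on the coefficients, so the prior precision is $\frac{\tau}{g} X^\top X$, a $p \times p$ matrix. The sketch below checks the shapes with NumPy on a synthetic design matrix. The fixed value of `tau` is hypothetical and stands in for a single posterior draw.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic design matrix with an intercept column (stand-in for X_aug).
n, p = 252, 15
X_aug = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

g = p**2
tau = 0.05  # hypothetical fixed value of the error precision

# Precision of the g-prior: one row and one column per coefficient.
tau_beta = tau / g * (X_aug.T @ X_aug)

# The transposed product has one row per observation instead.
wrong = X_aug @ X_aug.T

# Equivalent covariance form, g / tau * (X'X)^{-1}, symmetrized to
# remove floating-point asymmetry from the inversion.
cov_beta = np.linalg.inv(tau_beta)
cov_beta = (cov_beta + cov_beta.T) / 2

# One prior draw of the coefficient vector.
beta_draw = rng.multivariate_normal(np.zeros(p), cov_beta)

print(tau_beta.shape, wrong.shape, beta_draw.shape)
# (15, 15) (252, 252) (15,)
```

Only the $X^\top X$ ordering gives a coefficient vector whose length matches the number of columns in the design matrix.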

In [59]:
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.001, 0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    # Zellner's g-prior: the precision is tau / g * X'X, a (p, p) matrix.
    # dot(X_aug, X_aug.T) would be (n, n) and give beta the wrong length.
    tau_beta = tau / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu=np.zeros(p), tau=tau_beta)
    mu = dot(X_aug, beta)
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...

In [17]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -11.891 | 14.161 | -39.766 | 15.355 | 0.245 | 0.175 | 3341.0 | 3959.0 | 1.0 |
| beta[1] | 0.057 | 0.03 | -0.002 | 0.116 | 0.0 | 0.0 | 5126.0 | 5215.0 | 1.0 |
| beta[2] | -0.072 | 0.045 | -0.158 | 0.017 | 0.001 | 0.001 | 3331.0 | 4692.0 | 1.0 |
| beta[3] | -0.062 | 0.101 | -0.255 | 0.137 | 0.002 | 0.001 | 4318.0 | 5189.0 | 1.0 |
| beta[4] | 0.064 | 0.282 | -0.477 | 0.629 | 0.004 | 0.003 | 4322.0 | 4952.0 | 1.0 |
| beta[5] | -0.461 | 0.214 | -0.874 | -0.041 | 0.003 | 0.002 | 5295.0 | 5463.0 | 1.0 |
| beta[6] | -0.039 | 0.096 | -0.233 | 0.145 | 0.001 | 0.001 | 6016.0 | 5835.0 | 1.0 |
| beta[7] | 0.879 | 0.087 | 0.702 | 1.04 | 0.001 | 0.001 | 5416.0 | 5564.0 | 1.0 |
| beta[8] | -0.221 | 0.133 | -0.492 | 0.027 | 0.002 | 0.001 | 4588.0 | 5134.0 | 1.0 |
| beta[9] | 0.227 | 0.137 | -0.038 | 0.502 | 0.002 | 0.001 | 5869.0 | 5810.0 | 1.0 |


In [18]:
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.748573
r2_std    0.000000
dtype: float64

## Reading on g-priors

- https://arxiv.org/abs/1702.01201
- https://towardsdatascience.com/linear-regression-model-selection-through-zellners-g-prior-da5f74635a03
- https://en.wikipedia.org/wiki/G-prior

Original paper:

Zellner, A. (1986). "On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions". In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.

In [19]:
%watermark --iversions -v

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.3.0

pymc  : 4.0.0b5
numpy : 1.22.3
pandas: 1.4.2
arviz : 0.12.1

