In [2]:
import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
import aesara.tensor as at

%load_ext lab_black
%load_ext watermark

# Brozek index prediction

This example goes over linear regression and Bayesian $R^2$.

Adapted from [unit 7: fat1.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fat1.odc), [unit 7: fat2d.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fat2d.odc) and [fatmulti.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fatmulti.odc).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/fat.tsv).

## Associated lecture videos:
### Unit 7 Lesson 11

In [3]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed?v=xomK4tcePmc&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=73" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

### Unit 7 Lesson 13

In [4]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed?v=xomK4tcePmc&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=75" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

## Problem statement

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.

The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained using scales and a measuring tape.

- y = Brozek index
- X1 = 1 (intercept)
- X2 = age
- X3 = weight
- X4 = height
- X5 = adipose
- X6 = neck  
- X7 = chest
- X8 = abdomen
- X9 = hip
- X10 = thigh
- X11 = knee   
- X12 = ankle
- X13 = bicep
- X14 = forearm
- X15 = wrist

These anthropometric variables are less intrusive but also less reliable in assessing the body fat index.

Set up a linear regression to predict the Brozek index from these body measurements.

## Single predictor (X8)

This is a recreation of fat1.odc.

In [5]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2

In [6]:
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = np.dot(cy, cy)  # total sum of squares; plain NumPy because y is observed data
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]


Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 12 seconds.
The acceptance probability does not match the target. It is 0.8854, but should be close to 0.8. Try to increase the number of tuning steps.


In [7]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -34.999 | 2.559 | -39.751 | -29.76 | 0.049 | 0.034 | 2772.0 | 2685.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.027 | 0.528 | 0.635 | 0.001 | 0.0 | 2752.0 | 2680.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.041 | 0.058 | 0.0 | 0.0 | 3730.0 | 3367.0 | 1.0 |
| variance | 20.527 | 1.829 | 17.313 | 24.401 | 0.03 | 0.022 | 3730.0 | 3367.0 | 1.0 |
| br2 | 0.66 | 0.03 | 0.595 | 0.713 | 0.001 | 0.0 | 3730.0 | 3367.0 | 1.0 |


This matches the results from the U7 L11 video.
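For reference, the `br2` node in the model block corresponds to the expression below, written with $\sigma^2 = 1/\tau$, $n$ observations, and $p$ regression coefficients (intercept included). This is only a transcription of the code into math, not an additional result. The adjusted variant is the one tracked as `br2_adj` in the all-predictors model further down.

$$
\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2_B = 1 - \frac{(n - p)\,\sigma^2}{\text{SST}}, \qquad
R^2_{B,\text{adj}} = 1 - \frac{(n - 1)\,\sigma^2}{\text{SST}}
$$

Because $\sigma^2$ is a random variable in the model, $R^2_B$ has a posterior distribution of its own, which is why the summary table reports a mean and an HDI for `br2`.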

Another way to calculate $R^2$ is from the posterior predictive mean. Keep in mind that there is no standard "Bayesian $R^2$", so the two approaches can give slightly different results:

In [8]:
# mean posterior predictive y for each observation, averaged over all chains and draws
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.660744
r2_std    0.000000
dtype: float64

In this case they agree, but that won't always be true.
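To make the comparison less opaque, here is a minimal NumPy sketch of the variance-ratio definition of Bayesian $R^2$ (Gelman et al.), which, to my understanding, is what `az.r2_score` computes. Treat that as an assumption about ArviZ internals rather than a reference. The function `bayes_r2` and the synthetic data are made up for illustration and are not part of the original notebook.

```python
import numpy as np


def bayes_r2(y_true, y_pred):
    """Var(fit) / (Var(fit) + Var(residual)), one value per row of y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    # Accept either a single prediction vector (n,) or draws of shape (draws, n).
    y_pred = np.atleast_2d(np.asarray(y_pred, dtype=float))
    var_fit = y_pred.var(axis=1)
    var_res = (y_true - y_pred).var(axis=1)
    return var_fit / (var_fit + var_res)


# Synthetic stand-in for the data and for posterior draws of the fitted line.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_demo = 1.0 + 2.0 * x + rng.normal(size=200)
slopes = rng.normal(2.0, 0.05, size=500)  # hypothetical posterior slope draws
draws = 1.0 + slopes[:, None] * x[None, :]  # shape (500, 200)

r2_draws = bayes_r2(y_demo, draws)  # one R2 per draw
r2_point = bayes_r2(y_demo, draws.mean(axis=0))  # single value from the mean prediction

print(r2_draws.mean(), r2_draws.std(), r2_point)
```

Passing only the mean prediction yields a single value, which is consistent with the `r2_std` of 0 in the output above. Passing the individual draws instead gives a distribution of $R^2$ values and a nonzero spread.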

## Multiple regression with all predictors

Based on fat2d.odc or fatmulti.odc (they appear to be identical).

In [6]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)

# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape

# Zellner's g
g = p**2

n, p, g

(252, 15, 225)
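The value `g` is the scale of Zellner's g-prior, which the model further down places on the coefficients. As a sketch of what that prior looks like, assuming a zero prior mean, precision $\tau = 1/\sigma^2$, and the choice $g = p^2$ used here:

$$
\beta \mid \tau \sim \mathcal{N}\!\left(0,\ \frac{g}{\tau}\,(X^T X)^{-1}\right)
\quad\Longleftrightarrow\quad
\text{prior precision of } \beta = \frac{\tau}{g}\, X^T X
$$

The prior covariance therefore has the same shape as the sampling covariance of the least-squares estimate, inflated by the factor $g$. Larger $g$ means a weaker prior.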

In [10]:
X_aug.shape

(252, 15)

In [24]:
new_matrix = np.zeros((15, 15))
for i in range(15):
    for j in range(15):
        new_matrix[i, j] = np.inner(X_aug[:, i], X_aug[:, j])
new_matrix.shape

(15, 15)

In [27]:
np.equal(np.dot(X_aug.T, X_aug), new_matrix)

array([[ True,  True,  True,  True, False, False, False, False,  True,
         True, False, False, False, False, False],
       [ True,  True, False,  True, False, False, False,  True, False,
        False, False, False,  True, False, False],
       [ True, False,  True,  True, False, False,  True,  True, False,
        False, False, False,  True, False, False],
       [ True,  True,  True,  True, False, False, False,  True, False,
        False, False, False, False, False, False],
       [False, False, False, False, False, False,  True, False, False,
        False, False, False, False, False,  True],
       [False, False, False, False, False, False, False, False, False,
        False, False, False, False, False,  True],
       [False, False,  True, False,  True, False, False, False, False,
        False, False, False, False, False, False],
       [False,  True,  True,  True, False, False, False, False,  True,
         True, False, False, False, False,  True],
       [ True, False, Fa

In [29]:
np.allclose(np.dot(X_aug.T, X_aug), new_matrix)

True
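The element-wise `np.equal` comparison above reports `False` in many cells even though both matrices represent the same $X^T X$. The most likely explanation is floating-point rounding: the two computations may accumulate their sums in a different order, so entries can differ in the last few bits. A small self-contained sketch of the same effect on made-up data (the names `A`, `gram_loop`, and `gram_matmul` are illustrative and not from the notebook):

```python
import numpy as np

# Classic rounding example: exact comparison fails, tolerance-based comparison passes.
exact_match = np.equal(0.1 + 0.2, 0.3)
close_match = np.isclose(0.1 + 0.2, 0.3)

# Same idea for a Gram matrix built two different ways on synthetic data.
rng = np.random.default_rng(42)
A = rng.normal(loc=50.0, scale=10.0, size=(252, 5))

gram_loop = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        gram_loop[i, j] = np.inner(A[:, i], A[:, j])

gram_matmul = A.T @ A

# Compare with a tolerance rather than bit for bit.
grams_agree = np.allclose(gram_loop, gram_matmul)
print(exact_match, close_match, grams_agree)
```

For checks like this, `np.allclose` (or `np.testing.assert_allclose` in tests) is usually the appropriate tool; exact equality is rarely meaningful for floating-point results.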

In [30]:
np.dot(X_aug.T, X_aug).shape

(15, 15)

In [14]:
mu_beta = np.zeros(p)

In [38]:
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

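    # broadcast the scalar tau to a (p, p) matrix
    tau_matrix = at.fill(at.zeros((p, p)), tau)
    # g-prior precision: tau / g * X'X (element-wise product with the constant X'X)
    tau_beta = tau_matrix / g * np.dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = at.dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = np.dot(cy, cy)  # total sum of squares; plain NumPy because y is observed data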
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 546 seconds.
There was 1 divergence after tuning. Increase `target_accept` or reparameterize.
There were 36 divergences after tuning. Increase `target_accept` or reparameterize.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.


In [39]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -15.242 | 16.608 | -47.655 | 16.373 | 0.439 | 0.318 | 1438.0 | 2268.0 | 1.0 |
| beta[1] | 0.057 | 0.031 | -0.005 | 0.118 | 0.001 | 0.0 | 2260.0 | 2449.0 | 1.0 |
| beta[2] | -0.081 | 0.051 | -0.186 | 0.012 | 0.001 | 0.001 | 1527.0 | 2464.0 | 1.0 |
| beta[3] | -0.051 | 0.105 | -0.264 | 0.157 | 0.002 | 0.002 | 2434.0 | 2632.0 | 1.0 |
| beta[4] | 0.066 | 0.288 | -0.507 | 0.605 | 0.006 | 0.005 | 2446.0 | 2441.0 | 1.0 |
| beta[5] | -0.44 | 0.224 | -0.877 | -0.008 | 0.004 | 0.003 | 3017.0 | 3011.0 | 1.0 |
| beta[6] | -0.031 | 0.1 | -0.224 | 0.168 | 0.002 | 0.001 | 2531.0 | 2725.0 | 1.0 |
| beta[7] | 0.875 | 0.086 | 0.714 | 1.044 | 0.002 | 0.001 | 2436.0 | 2804.0 | 1.0 |
| beta[8] | -0.203 | 0.139 | -0.465 | 0.076 | 0.003 | 0.002 | 1854.0 | 2860.0 | 1.0 |
| beta[9] | 0.23 | 0.143 | -0.035 | 0.522 | 0.003 | 0.003 | 1803.0 | 650.0 | 1.0 |


In [40]:
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.748129
r2_std    0.000000
dtype: float64
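A quick sanity check on the g-prior: conditional on $\tau$ and with a zero prior mean, the posterior mean of $\beta$ has a closed form, namely the least-squares estimate shrunk by the factor $g / (1 + g)$. The sketch below verifies this numerically on synthetic data. All names (`X_demo`, `y_demo`, `g_demo`, `tau_demo`) are made up for illustration, and $\tau$ is held fixed, so this is not a substitute for the full model above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_coef = 100, 3

# Synthetic design matrix with an intercept column, and a synthetic response.
X_demo = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_coef - 1))])
y_demo = X_demo @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n_obs)

g_demo = n_coef**2  # same g = p**2 rule as in the notebook
tau_demo = 1.0  # precision held fixed, for illustration only

XtX = X_demo.T @ X_demo
Xty = X_demo.T @ y_demo

# Ordinary least-squares estimate.
beta_ols = np.linalg.solve(XtX, Xty)

# Posterior precision = likelihood precision + g-prior precision.
prior_prec = tau_demo / g_demo * XtX
post_prec = tau_demo * XtX + prior_prec
post_mean = np.linalg.solve(post_prec, tau_demo * Xty)

# Closed-form shrinkage factor.
shrink = g_demo / (1 + g_demo)
print(np.allclose(post_mean, shrink * beta_ols))
```

With the notebook's $g = 225$, the factor is $225/226 \approx 0.996$, so under these assumptions the prior pulls the coefficients only very slightly toward zero.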

## Reading on g-priors

- https://arxiv.org/abs/1702.01201
- https://towardsdatascience.com/linear-regression-model-selection-through-zellners-g-prior-da5f74635a03
- https://en.wikipedia.org/wiki/G-prior

Original paper:

Zellner, A. (1986). "On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions". In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.

In [19]:
%watermark --iversions -v

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.3.0

pymc  : 4.0.0b5
numpy : 1.22.3
pandas: 1.4.2
arviz : 0.12.1

