In [1]:
import pymc as pm
import pandas as pd
import numpy as np
import arviz as az
from pymc.math import dot
import aesara.tensor as at

%load_ext lab_black
%load_ext watermark

# Brozek index prediction

This example goes over linear regression and Bayesian $R^2$.

Adapted from Unit 7: [fat1.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fat1.odc), [fat2d.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fat2d.odc), and [fatmulti.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit7/fatmulti.odc).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/fat.tsv).

Associated lecture videos:
Unit 7 lessons 11 and 13

## Problem statement

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) were recorded for 252 men. Percentage of body fat is estimated through an underwater weighing technique.

The data set has 252 observations and 15 variables. The Brozek index (Brozek et al., 1963) is obtained by underwater weighing, while the other 14 anthropometric variables are obtained using scales and a measuring tape.

- y = Brozek index
- X1 = 1 (intercept)
- X2 = age
- X3 = weight
- X4 = height
- X5 = adipose
- X6 = neck  
- X7 = chest
- X8 = abdomen
- X9 = hip
- X10 = thigh
- X11 = knee   
- X12 = ankle
- X13 = bicep
- X14 = forearm
- X15 = wrist

These anthropometric variables are less intrusive but also less reliable in assessing the body fat index.

Set up a linear regression to predict the Brozek index from these body measurements.

## Single predictor (X8)

This is a recreation of fat1.odc.

In [2]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data["X8"].to_numpy(copy=True)

# p will be the number of predictors + intercept (1 + 1 in this case)
n, p = X.shape[0], 2

In [3]:
with pm.Model() as m:
    tau = pm.Gamma("tau", 0.001, 0.001)
    beta0 = pm.Normal("beta0_intercept", 0, tau=0.001)
    beta1 = pm.Normal("beta1_abdomen", 0, tau=0.001)
    variance = pm.Deterministic("variance", 1 / tau)

    mu = beta0 + beta1 * X
    likelihood = pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat1.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)

    trace = pm.sample(2000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta0_intercept, beta1_abdomen]


Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 16 seconds.
The acceptance probability does not match the target. It is 0.8849, but should be close to 0.8. Try to increase the number of tuning steps.


In [4]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0_intercept | -34.996 | 2.465 | -39.989 | -30.284 | 0.047 | 0.033 | 2744.0 | 3032.0 | 1.0 |
| beta1_abdomen | 0.583 | 0.026 | 0.53 | 0.634 | 0.001 | 0.0 | 2753.0 | 2952.0 | 1.0 |
| tau | 0.049 | 0.004 | 0.041 | 0.058 | 0.0 | 0.0 | 3852.0 | 3859.0 | 1.0 |
| variance | 20.524 | 1.84 | 17.079 | 24.191 | 0.03 | 0.021 | 3852.0 | 3859.0 | 1.0 |
| br2 | 0.66 | 0.031 | 0.599 | 0.717 | 0.0 | 0.0 | 3852.0 | 3859.0 | 1.0 |


This matches the results from the U7 L11 video.
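The `br2` node can be read as the classical $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ with the residual sum of squares replaced by $(n - p)\,\sigma^2$, evaluated at each posterior draw of $\sigma^2 = 1/\tau$. The following is a restatement of the model code in the cell above, not a formula taken from the lecture:

$$
R^2_B = 1 - \frac{(n - p)\,\sigma^2}{\mathrm{SST}}, \qquad \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad \sigma^2 = \frac{1}{\tau}.
$$

Since $\mathrm{SST}$ is fixed by the data, $R^2_B$ is a decreasing linear function of $\sigma^2$, which is consistent with `tau`, `variance`, and `br2` reporting identical effective sample sizes in the summary.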

Another way to calculate $R^2$ is from the posterior predictive samples (keeping in mind that there is no single standard "Bayesian $R^2$"). The result can differ slightly from the one above:

In [5]:
# average the posterior predictive samples over chains and draws
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.660282
r2_std    0.000000
dtype: float64

In this case they agree, but that won't always be true.
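To make the comparison more concrete, here is a minimal NumPy sketch of the variance-ratio definition of $R^2$, $\operatorname{Var}(\text{fit}) / (\operatorname{Var}(\text{fit}) + \operatorname{Var}(\text{residual}))$, which to my understanding is what `az.r2_score` computes. The function name `variance_ratio_r2`, the synthetic data, and the fake "posterior draws" are all assumptions made for illustration; none of it comes from the original notebook.

```python
import numpy as np


def variance_ratio_r2(y_true, y_pred):
    """Var(fit) / (Var(fit) + Var(residual)) for each row of y_pred.

    y_pred may have shape (n_obs,) or (n_draws, n_obs).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.atleast_2d(np.asarray(y_pred, dtype=float))
    var_fit = y_pred.var(axis=1)
    var_res = (y_true - y_pred).var(axis=1)
    return var_fit / (var_fit + var_res)


# Synthetic data standing in for the real measurements
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)

# Hypothetical posterior draws of the slope, turned into fitted values
slopes = rng.normal(2.0, 0.1, size=500)
fitted_draws = 1.0 + slopes[:, None] * x[None, :]  # shape (500, 200)

r2_per_draw = variance_ratio_r2(y, fitted_draws)  # one value per draw
r2_of_mean = variance_ratio_r2(y, fitted_draws.mean(axis=0))  # single value

print(r2_per_draw.mean(), r2_per_draw.std())
print(r2_of_mean)
```

Passing a single averaged prediction vector yields one $R^2$ value, so there is no spread to report; that would explain the `r2_std` of zero above. Passing one row per draw gives a distribution of $R^2$ values instead.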

## Multiple regression with all predictors

Based on fat2d.odc or fatmulti.odc (they appear to be identical).

In [6]:
data = pd.read_csv("../data/fat.tsv", sep="\t")

y = data["y"].to_numpy(copy=True)
X = data.iloc[:, 1:].to_numpy(copy=True)

# add intercept
X_aug = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
n, p = X_aug.shape

# Zellner's g
g = p**2

n, p, g

(252, 15, 225)

In [7]:
X_aug.shape

(252, 15)

In [8]:
mu_beta = np.zeros(p)

In [9]:
with pm.Model() as m2d:
    tau = pm.Gamma("tau", 0.01, 0.01)
    variance = pm.Deterministic("variance", 1 / tau)

    # g-prior precision for beta: (tau / g) * X'X
    tau_matrix = at.fill(at.zeros((p, p)), tau)
    tau_beta = tau_matrix / g * dot(X_aug.T, X_aug)
    beta = pm.MvNormal("beta", mu_beta, tau=tau_beta)

    mu = dot(X_aug, beta)
    pm.Normal("likelihood", mu=mu, tau=tau, observed=y)

    # Bayesian R2 from fat2d.odc
    sse = (n - p) * variance
    cy = y - y.mean()
    sst = dot(cy, cy)
    br2 = pm.Deterministic("br2", 1 - sse / sst)
    br2_adj = pm.Deterministic("br2_adj", 1 - (n - 1) * variance / sst)

    trace = pm.sample(1000)
    ppc = pm.sample_posterior_predictive(trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [tau, beta]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 694 seconds.
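
The prior on `beta` in the model above appears to be Zellner's g-prior written in precision form. As a sketch of what `tau_beta` encodes, with $X$ denoting `X_aug` and $g = p^2$ as set earlier:

$$
\beta \mid \tau \sim \mathcal{N}_p\!\left(\mathbf{0},\ \frac{g}{\tau}\,(X^\top X)^{-1}\right)
\quad\Longleftrightarrow\quad
\text{precision} = \frac{\tau}{g}\, X^\top X .
$$

The two $R^2$ nodes follow the same pattern as in the single-predictor model, with $\sigma^2 = 1/\tau$:

$$
R^2_B = 1 - \frac{(n - p)\,\sigma^2}{\mathrm{SST}}, \qquad
R^2_{B,\mathrm{adj}} = 1 - \frac{(n - 1)\,\sigma^2}{\mathrm{SST}} .
$$

The second has the form of the classical adjusted $R^2$, $1 - \dfrac{\mathrm{SSE}/(n - p)}{\mathrm{SST}/(n - 1)}$, with the residual variance estimate replaced by posterior draws of $\sigma^2$. This is a reading of the model code, not a statement taken from fat2d.odc.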


In [10]:
az.summary(trace, hdi_prob=0.95)

| | mean | sd | hdi_2.5% | hdi_97.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta[0] | -14.8 | 16.532 | -46.372 | 18.975 | 0.376 | 0.268 | 1926.0 | 2198.0 | 1.0 |
| beta[1] | 0.056 | 0.031 | -0.006 | 0.113 | 0.001 | 0.0 | 3057.0 | 2687.0 | 1.0 |
| beta[2] | -0.08 | 0.051 | -0.177 | 0.024 | 0.001 | 0.001 | 1864.0 | 2074.0 | 1.0 |
| beta[3] | -0.054 | 0.107 | -0.262 | 0.162 | 0.002 | 0.002 | 2807.0 | 2607.0 | 1.0 |
| beta[4] | 0.055 | 0.284 | -0.484 | 0.63 | 0.005 | 0.004 | 3169.0 | 2751.0 | 1.0 |
| beta[5] | -0.446 | 0.225 | -0.902 | -0.022 | 0.004 | 0.003 | 3643.0 | 2980.0 | 1.0 |
| beta[6] | -0.032 | 0.102 | -0.24 | 0.162 | 0.002 | 0.002 | 3183.0 | 2763.0 | 1.0 |
| beta[7] | 0.878 | 0.087 | 0.709 | 1.05 | 0.002 | 0.001 | 2984.0 | 2321.0 | 1.0 |
| beta[8] | -0.205 | 0.14 | -0.486 | 0.068 | 0.003 | 0.002 | 2404.0 | 2692.0 | 1.0 |
| beta[9] | 0.222 | 0.138 | -0.036 | 0.504 | 0.002 | 0.002 | 3154.0 | 2882.0 | 1.0 |


In [11]:
y_pred = np.array(ppc.posterior_predictive.likelihood.mean(axis=(0, 1)))

az.r2_score(y, y_pred)

r2        0.746982
r2_std    0.000000
dtype: float64
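
One known consequence of a zero-mean g-prior is that, conditional on the noise precision, the posterior mean of the coefficients is the least-squares estimate shrunk by $g / (1 + g)$. The sketch below checks this numerically. It uses synthetic data with the same shape as `X_aug` and a fixed, assumed value of `tau` (in the model above `tau` is sampled, not fixed), so it illustrates the prior's effect and does not reproduce the fitted results.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic design matrix with an intercept column; same shape as X_aug
n, p = 252, 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=2.0, size=n)

g = p**2  # same choice of g as in the notebook
tau = 0.25  # assumed fixed noise precision, for illustration only

XtX = X.T @ X
beta_ols = np.linalg.solve(XtX, X.T @ y)

# Conjugate update with prior precision (tau / g) * X'X and zero prior mean
prior_precision = tau / g * XtX
post_precision = tau * XtX + prior_precision
post_mean = np.linalg.solve(post_precision, tau * (X.T @ y))

shrinkage = g / (1 + g)
print(shrinkage)
print(np.allclose(post_mean, shrinkage * beta_ols))
```

With $g = 225$ the shrinkage factor is $225/226$, so under this prior the coefficients are pulled only very slightly toward zero.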

Need to do some more reading on g-priors:

- https://arxiv.org/abs/1702.01201
- https://towardsdatascience.com/linear-regression-model-selection-through-zellners-g-prior-da5f74635a03
- https://en.wikipedia.org/wiki/G-prior

Original paper:

Zellner, A. (1986). "On Assessing Prior Distributions and Bayesian Regression Analysis with g Prior Distributions". In Goel, P.; Zellner, A. (eds.). Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. Studies in Bayesian Econometrics and Statistics. Vol. 6. New York: Elsevier. pp. 233–243. ISBN 978-0-444-87712-3.

In [12]:
%watermark -n -u -v -iv -p aesara,aeppl

Last updated: Thu Aug 11 2022

Python implementation: CPython
Python version       : 3.10.5
IPython version      : 8.4.0

aesara: 2.7.7
aeppl : 0.0.32

pymc  : 4.1.3
pandas: 1.4.3
numpy : 1.23.1
aesara: 2.7.7
arviz : 0.12.1

