In [1]:
import pymc as pm
import numpy as np
import arviz as az
import pandas as pd
import aesara.tensor as at
from aesara.tensor.subtensor import set_subtensor as set_st

%load_ext lab_black
%load_ext watermark

# Hald

This example demonstrates ...

Adapted from [Unit 9: Hald.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit9/Hald.odc) and [Unit 9: Haldssvs.odc](https://raw.githubusercontent.com/areding/6420-pymc/main/original_examples/Codes4Unit9/Haldssvs.odc).

Associated lecture videos: Unit 9 Lesson 4 and Unit 9 Lesson 10.

## Problem statement

A dataset on Portland cement, originally due to Woods, Steinour and Starke (1932) and widely analysed since, is now referred to as the Hald data; cf., e.g., Hald (1952, pp. 635–652).

These data come from an experimental investigation of the heat evolved during the setting and hardening of Portland cements of varied composition and the dependence of this heat on the percentages of four compounds in the clinkers from which the cement was produced.

As observed by Woods, Steinour and Starke (1932, p. 1207): 
>This property is of interest in the construction of massive works such as dams, in which the great thicknesses severely hinder the outflow of the heat. The consequent rise in temperature while the cement is hardening may result in contractions and cracking when the eventual cooling to the surrounding temperature takes place. 

The four compounds considered by Woods, Steinour and Starke (1932) are tricalcium aluminate: 3CaO·Al2O3, tricalcium silicate: 3CaO·SiO2, tetracalcium aluminoferrite: 4CaO·Al2O3·Fe2O3, and beta-dicalcium silicate: 2CaO·SiO2, which we will denote by x1, x2, x3, and x4, respectively. The heat evolved after 180 days of curing, which we will denote by y, is measured in calories per gram of cement.

REFS:

Hald, Anders (1952). Statistical Theory with Engineering Applications. Wiley, New York.

Woods, H., Steinour, H. H., and Starke, H. R. (1932). Effect of composition of Portland cement on heat evolved during hardening. Industrial and Engineering Chemistry, 24, 1207–1214.




In [4]:
data = pd.read_csv("../data/hald_data.csv")
y = data["y"].to_numpy()
X = data.drop("y", axis=1).to_numpy()

In [5]:
Y = y.repeat(8).reshape(13, 8)
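
The cell above copies the 13 observed responses into 8 identical columns, one per candidate model, so that a single likelihood can be written against a 13 x 8 array. A minimal sketch of what `repeat` followed by `reshape` produces, using a hypothetical stand-in vector `y_demo` rather than the real Hald responses:

```python
import numpy as np

# Hypothetical stand-in for the 13 observed responses (not the real Hald values).
y_demo = np.arange(13, dtype=float)

# The notebook's approach: repeat each element 8 times, then fold into 13 rows.
# Row i ends up holding 8 copies of y_demo[i].
Y_repeat = y_demo.repeat(8).reshape(13, 8)

# Two equivalent spellings that make the "one column per model" intent explicit.
Y_tile = np.tile(y_demo[:, None], (1, 8))
Y_broadcast = np.broadcast_to(y_demo[:, None], (13, 8))

print(Y_repeat.shape)
print(np.array_equal(Y_repeat, Y_tile), np.array_equal(Y_repeat, Y_broadcast))
```

Every column of the result equals the original vector, which is why the model later passes `mu.T` (shape 13 x 8) to line up with it.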

## Model 1

Open question: I don't understand why the original example generates Y_new for comparison with Y. It is a posterior predictive check between y (observed) and Y_new, but why not just use Y? I should probably watch the lecture.

Coming back to this later.

In [6]:
with pm.Model() as m:
    a = pm.Normal("a", 0, tau=0.00001, shape=4)
    b = pm.Normal("b", 0, tau=0.00001, shape=4)
    c = pm.Normal("c", 0, tau=0.00001, shape=4)
    d = pm.Normal("d", 0, tau=0.00001, shape=5)
    e = pm.Normal("e", 0, tau=0.00001, shape=3)
    f = pm.Normal("f", 0, tau=0.00001, shape=3)
    g = pm.Normal("g", 0, tau=0.00001, shape=4)
    h = pm.Normal("h", 0, tau=0.00001, shape=3)
    tau = pm.Gamma("tau", 12.5, 62.5, shape=8)
    # fmt: off
    _mu = [
        a[0] + a[1] * X[:, 0] + a[2] * X[:, 1] + a[3] * X[:, 3],  # i013
        b[0] + b[1] * X[:, 0] + b[2] * X[:, 1] + b[3] * X[:, 2],  # i012
        c[0] + c[1] * X[:, 0] + c[2] * X[:, 2] + c[3] * X[:, 3],  # i023
        d[0] + d[1] * X[:, 0] + d[2] * X[:, 1] + d[3] * X[:, 2] + d[4] * X[:, 3],  # i0123
        e[0] + e[1] * X[:, 0] + e[2] * X[:, 1],  # i01
        f[0] + f[1] * X[:, 0] + f[2] * X[:, 2],  # i02 (the original listed i03 twice)
        g[0] + g[1] * X[:, 1] + g[2] * X[:, 2] + g[3] * X[:, 3],  # i123
        h[0] + h[1] * X[:, 0] + h[2] * X[:, 3],  # i03
    ]
    # fmt: on
    mu = pm.math.stack(_mu, name="mu")
    lik = pm.Normal("lik", mu=mu.T, tau=tau, observed=Y)

    trace = pm.sample(2000)

    pm.sample_posterior_predictive(trace, extend_inferencedata=True)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, b, c, d, e, f, g, h, tau]


Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 431 seconds.
There were 830 divergences after tuning. Increase `target_accept` or reparameterize.
There were 74 divergences after tuning. Increase `target_accept` or reparameterize.
There were 32 divergences after tuning. Increase `target_accept` or reparameterize.
There were 613 divergences after tuning. Increase `target_accept` or reparameterize.
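
The priors in the model above are written in terms of precision (`tau`) rather than standard deviation. As a rough aid for reading them, the sketch below converts the values used in the model to the standard-deviation scale. It assumes the two positional arguments of `pm.Gamma` are the shape `alpha` and the rate `beta`; the variable names are illustrative only. It does not attempt to explain the divergences reported by the sampler.

```python
import numpy as np

# Coefficient priors: Normal(0, tau=0.00001). Precision is 1 / variance,
# so the implied standard deviation is 1 / sqrt(tau).
tau_coef = 0.00001
sd_coef = 1 / np.sqrt(tau_coef)

# Precision prior: Gamma(12.5, 62.5), assumed here to be (alpha=shape, beta=rate).
alpha, beta = 12.5, 62.5
tau_mean = alpha / beta      # prior mean of the precision
tau_var = alpha / beta**2    # prior variance of the precision

# Residual standard deviation implied by a precision equal to its prior mean.
sd_at_mean = 1 / np.sqrt(tau_mean)

print(f"coefficient prior sd: {sd_coef:.1f}")
print(f"prior mean of tau: {tau_mean:.3f}, variance: {tau_var:.4f}")
print(f"sd at prior-mean precision: {sd_at_mean:.2f}")
```

Under these assumptions the coefficient priors are very diffuse (a standard deviation in the hundreds), while the prior on the precision is comparatively concentrated.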


In [7]:
ppc = az.summary(trace.posterior_predictive)["mean"].values.reshape(13, 8)

In [8]:
D2 = (Y - ppc) ** 2
L = np.sqrt(np.sum(D2, axis=0) + np.std(ppc, axis=0) ** 2)
L

array([15.90341627, 15.84237464, 15.97925794, 15.8988387 , 16.18672951,
       36.64429925, 16.61402571, 16.69703505])
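
The two cells above take the posterior predictive mean for each observation and model, then combine the squared deviations from the observed values into one score per model. The sketch below reproduces the same arithmetic on synthetic draws using NumPy only, so the array shapes are easier to follow. The array `draws` and all `*_demo` names are hypothetical; with the real trace, an array of the same layout (chain, draw, 13, 8) could presumably be taken from `trace.posterior_predictive["lik"].values`, which would also avoid the rounding applied by `az.summary`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chains, n_draws, n_obs, n_models = 4, 50, 13, 8

# Synthetic observed responses, copied into one column per model.
y_demo = rng.normal(95.0, 15.0, size=n_obs)
Y_demo = np.tile(y_demo[:, None], (1, n_models))

# Synthetic stand-in for posterior predictive draws: (chain, draw, obs, model).
draws = Y_demo + rng.normal(0.0, 2.0, size=(n_chains, n_draws, n_obs, n_models))

# Posterior predictive mean per observation and model, averaging over
# chains and draws. Shape: (13, 8).
ppc_demo = draws.mean(axis=(0, 1))

# Same formula as the notebook: squared deviations summed over observations,
# plus the squared standard deviation of the predictive means, per model.
D2_demo = (Y_demo - ppc_demo) ** 2
L_demo = np.sqrt(D2_demo.sum(axis=0) + np.std(ppc_demo, axis=0) ** 2)

# Index of the model with the smallest score.
best = int(np.argmin(L_demo))
print(L_demo.shape, best)
```

Note that `np.std(ppc_demo, axis=0)` measures the spread of the predictive means across the 13 observations, not the posterior predictive variance of each observation. Whether that matches the intent of the original example is not settled here.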

## Model 2 (SSVS)

In [9]:
%watermark --iversions -v

Python implementation: CPython
Python version       : 3.10.4
IPython version      : 8.4.0

arviz : 0.12.1
numpy : 1.22.4
aesara: 2.6.6
pymc  : 4.0.0
pandas: 1.4.2

