In [1]:
import arviz as az
import numpy as np
import pymc as pm
from pymc.math import exp

np.set_printoptions(suppress=True)

# Gastric Cancer Data 

Adapted from [Codes for Unit 8: gastric.odc](https://www2.isye.gatech.edu/isye6420/supporting.html).

Associated lecture video: [Unit 8 Lesson 6](https://www.youtube.com/watch?v=t4pHpZxtC0U&list=PLv0FeK5oXK4l-RdT6DWJj0_upJOG2WKNO&index=87).

Data can be found [here](https://raw.githubusercontent.com/areding/6420-pymc/main/data/gastric.txt).

Stablein et al. (1981) provide data on 90 patients affected by locally advanced, nonresectable gastric carcinoma. The patients are randomized to two treatments: chemotherapy alone (coded as 0) and chemotherapy plus radiation (coded as 1). Survival time is reported in days. Recorded times are censored if the patient stopped participating in the study before it finished.

Stablein, D. M., Carter, W. H., Novak, J. W. (1981). Analysis of survival data with nonproportional hazard functions. Control. Clin. Trials, 2(2), 149–159.


## Data
Columns are, from left to right:
- type: Treatment type, chemotherapy (0) or chemotherapy + radiation (1)
- censored: For a censored observation (the patient survived the observation period), the censoring time in days appears here rather than in the times column. 0 if not censored.
- times: Recorded survival time in days. NaN if censored.

## Censoring
PyMC's approach to censoring is described in some detail in [this notebook](https://docs.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-truncated-censored-regression.html#censored-regression-model). The implementation is [here](https://github.com/aesara-devs/aeppl/blob/751979802f1aef5478fdbf7cc1839df07df60825/aeppl/truncation.py#L79) if you want to take a look. For right-censoring, use ```pm.Censored("name", dist, lower=None, upper=censored, observed=y)```. The ```upper``` argument can be an array of the same shape as the y values, so each observation gets its own censoring bound.

If the y value equals its right-censoring bound, [```pm.Censored```](https://docs.pymc.io/en/latest/api/distributions/generated/pymc.Censored.html#pymc.Censored) contributes the log of the complement of the CDF (that is, the log survival function) evaluated at that bound. If the y value is greater than the bound, the log-probability is ```-np.inf```. Otherwise, the distribution passed to the ```dist``` parameter is evaluated as usual. What I've been doing is setting each entry of the censored array to ```np.inf``` if the corresponding y value is not censored, and to the y value itself if it is censored.
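The three cases described above can be sketched in plain NumPy. This is only an illustration of the logic, assuming a Weibull latent distribution with shape `alpha` and scale `beta`; the function names (`weibull_logpdf`, `weibull_logsf`, `right_censored_logp`) and the toy values are hypothetical and are not part of PyMC.

```python
import numpy as np


def weibull_logpdf(y, alpha, beta):
    """Log density of a Weibull with shape alpha and scale beta."""
    z = y / beta
    return np.log(alpha / beta) + (alpha - 1) * np.log(z) - z**alpha


def weibull_logsf(y, alpha, beta):
    """Log survival function, log(1 - CDF), of the same Weibull."""
    return -((y / beta) ** alpha)


def right_censored_logp(y, upper, alpha, beta):
    """Hypothetical helper mirroring the three cases for right-censoring."""
    y = np.asarray(y, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return np.where(
        y < upper,
        weibull_logpdf(y, alpha, beta),  # not censored: ordinary log density
        np.where(
            y == upper,
            weibull_logsf(upper, alpha, beta),  # censored: log P(T >= upper)
            -np.inf,  # y above its bound: impossible under the model
        ),
    )


# Toy values (assumptions): one uncensored, one censored, one invalid row.
y = np.array([100.0, 500.0, 600.0])
upper = np.array([np.inf, 500.0, 550.0])
logp = right_censored_logp(y, upper, alpha=1.0, beta=400.0)
```

Setting `upper` to `np.inf` for uncensored rows means the first branch always applies to them, which is why that encoding works.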

## Model changes
I didn't implement S, f, or h from the original model. They should be simple enough, but I really just wanted to get another example of censoring up before HW6 is released. I will add those later.
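As a rough sketch of what those quantities could look like, the functions below use the standard Weibull survival, hazard, and density formulas in the BUGS-style (v, λ) parameterization. The exact definitions in the original gastric.odc file are not reproduced here, so treat these forms, the function names, and the example values as assumptions.

```python
import math


def weibull_survival(t, v, lam):
    """S(t) = exp(-lam * t**v): probability of surviving past time t."""
    return math.exp(-lam * t**v)


def weibull_hazard(t, v, lam):
    """h(t) = lam * v * t**(v - 1): instantaneous failure rate at time t."""
    return lam * v * t ** (v - 1)


def weibull_density(t, v, lam):
    """f(t) = h(t) * S(t)."""
    return weibull_hazard(t, v, lam) * weibull_survival(t, v, lam)


# Illustrative values only, not fitted results.
v, lam, t = 1.0, 0.001, 500.0
S = weibull_survival(t, v, lam)
h = weibull_hazard(t, v, lam)
f = weibull_density(t, v, lam)
```

Inside the model, the same expressions could presumably be wrapped in `pm.Deterministic`, as is done for the medians.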

PyMC really did not like the noninformative exponential prior on v (α in this model). To avoid the divide-by-zero errors, I kept increasing the rate parameter lambda of that prior until the model ran all the way through. This is not ideal, but I haven't had time to look into it further. The results still came out fairly close to the BUGS results.
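To get a feel for how much a larger rate tightens the prior, the sketch below computes the mean of an Exponential(rate) prior and the prior mass below a threshold. The rate of 3 is the one used in this notebook's model; the rate of 0.001 is only a hypothetical stand-in for a near-flat prior, and `exponential_prior_summary` is an illustrative helper name.

```python
import math


def exponential_prior_summary(rate, threshold=1.0):
    """Return the prior mean and the prior mass below threshold for Exponential(rate)."""
    mean = 1 / rate
    mass_below = 1 - math.exp(-rate * threshold)
    return mean, mass_below


# Rate used for the α prior in this notebook.
mean_used, mass_used = exponential_prior_summary(3)

# Hypothetical near-flat rate, for comparison only.
mean_flat, mass_flat = exponential_prior_summary(0.001)
```

With a rate of 3, most of the prior mass sits below 1, so this prior is far from noninformative and it is worth checking how sensitive the posterior for α is to that choice.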

## Tips
If your model is not working, keep making it simpler and simpler until it runs. Then add pieces of it back one at a time until you identify where the problem is.

```{note}
I haven't been able to get this method working on HW6 Q2. 

If someone gets it working please let me know!

I'm currently working on a different method for the homework question, and I hope to get it done in the next couple of days.

```

In [2]:
data = np.loadtxt("./data/gastric.txt")
data.shape

(90, 3)

In [3]:
x = data[:, 0].copy()
censored = data[:, 1].copy()
y = data[:, 2].copy()
# For PyMC, a right-censored observation must equal its "upper" bound, so copy the censoring time into y
y[np.isnan(y)] = censored[np.isnan(y)]
censored[censored == 0] = np.inf

In [4]:
y

array([  17.,   42.,   44.,   48.,   60.,   72.,   74.,   95.,  103.,
        108.,  122.,  144.,  167.,  170.,  183.,  185.,  193.,  195.,
        197.,  208.,  234.,  235.,  254.,  307.,  315.,  401.,  445.,
        464.,  484.,  528.,  542.,  567.,  577.,  580.,  795.,  855.,
        882.,  892., 1031., 1033., 1306., 1335., 1366., 1452., 1472.,
          1.,   63.,  105.,  129.,  182.,  216.,  250.,  262.,  301.,
        301.,  342.,  354.,  356.,  358.,  380.,  381.,  383.,  383.,
        388.,  394.,  408.,  460.,  489.,  499.,  524.,  529.,  535.,
        562.,  675.,  676.,  748.,  748.,  778.,  786.,  797.,  945.,
        955.,  968., 1180., 1245., 1271., 1277., 1397., 1512., 1519.])

In [5]:
censored

array([  inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
        882.,  892., 1031., 1033., 1306., 1335.,   inf, 1452., 1472.,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,  381.,   inf,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,  529.,   inf,
         inf,   inf,   inf,   inf,   inf,   inf,   inf,   inf,  945.,
         inf,   inf, 1180.,   inf,   inf, 1277., 1397., 1512., 1519.])

```{warning}
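PyMC and BUGS do not parameterize the Weibull distribution in the same way! BUGS uses a shape v and a rate-like parameter λ, while PyMC uses a shape α and a scale β. To convert:

- α = v
- β = λ ** (-1 / α)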

```
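The conversion in the warning above can be checked numerically. The sketch below writes out both densities by hand, assuming the usual BUGS form (shape v, rate-like λ) and the usual PyMC form (shape α, scale β), and compares them at a few time points. The function names and parameter values are arbitrary illustrations.

```python
import math


def bugs_weibull_pdf(t, v, lam):
    """BUGS-style density: v * lam * t**(v - 1) * exp(-lam * t**v)."""
    return v * lam * t ** (v - 1) * math.exp(-lam * t**v)


def pymc_weibull_pdf(t, alpha, beta):
    """PyMC-style density with shape alpha and scale beta."""
    z = t / beta
    return (alpha / beta) * z ** (alpha - 1) * math.exp(-(z**alpha))


# Arbitrary illustrative values, not estimates from the data.
v, lam = 1.3, 0.002
alpha = v
beta = lam ** (-1 / alpha)

# Largest relative difference between the two densities over a few time points.
max_rel_diff = max(
    abs(bugs_weibull_pdf(t, v, lam) - pymc_weibull_pdf(t, alpha, beta))
    / bugs_weibull_pdf(t, v, lam)
    for t in (1.0, 50.0, 400.0)
)
```

The two densities agree up to floating-point error because β ** (-α) equals λ under this conversion.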

In [6]:
with pm.Model() as m:
    beta0 = pm.Normal("beta0", 0, tau=0.0001)
    beta1 = pm.Normal("beta1", 0, tau=0.0001)
    α = pm.Exponential("α", 3)

    λ = exp(beta0 + beta1 * x)
    β = λ ** (-1 / α)

    obs_latent = pm.Weibull.dist(alpha=α, beta=β)
    likelihood = pm.Censored(
        "likelihood",
        obs_latent,
        lower=None,
        upper=censored,
        observed=y,
    )

    median0 = pm.Deterministic("median0", (np.log(2) * exp(-beta0)) ** (1 / α))
    median1 = pm.Deterministic("median1", (np.log(2) * exp(-beta0 - beta1)) ** (1 / α))

    trace = pm.sample(
        10000, tune=2000, cores=4, init="auto", step=[pm.NUTS(target_accept=0.9)]
    )

Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta0, beta1, α]


Sampling 4 chains for 2_000 tune and 10_000 draw iterations (8_000 + 40_000 draws total) took 27 seconds.


In [7]:
az.summary(trace, hdi_prob=0.9)

| | mean | sd | hdi_5% | hdi_95% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta0 | -6.629 | 0.668 | -7.688 | -5.499 | 0.006 | 0.004 | 12248.0 | 13567.0 | 1.0 |
| beta1 | 0.263 | 0.234 | -0.118 | 0.652 | 0.002 | 0.001 | 17278.0 | 16915.0 | 1.0 |
| α | 1.003 | 0.098 | 0.843 | 1.163 | 0.001 | 0.001 | 12434.0 | 13362.0 | 1.0 |
| median0 | 520.721 | 90.1 | 374.131 | 662.282 | 0.622 | 0.44 | 20964.0 | 24066.0 | 1.0 |
| median1 | 400.311 | 70.611 | 285.317 | 512.527 | 0.465 | 0.33 | 23123.0 | 25433.0 | 1.0 |
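As a sanity check on the median formula used in the model, the sketch below plugs the posterior means of beta0, beta1, and α from the summary into the same expression. This is only a plug-in approximation: the median is a nonlinear function of the parameters, so the results should be close to, but not equal to, the posterior means reported for median0 and median1. The helper name `weibull_median` is illustrative.

```python
import math

# Posterior means copied from the summary output.
beta0, beta1, alpha = -6.629, 0.263, 1.003


def weibull_median(linear_predictor, alpha):
    """Median for lam = exp(linear_predictor): (log(2) / lam) ** (1 / alpha)."""
    return (math.log(2) * math.exp(-linear_predictor)) ** (1 / alpha)


median0_plugin = weibull_median(beta0, alpha)  # chemotherapy alone
median1_plugin = weibull_median(beta0 + beta1, alpha)  # chemotherapy plus radiation
```

Because the posterior mean of beta1 is positive, the plug-in median for the combined treatment is lower than for chemotherapy alone, although the 90% interval for beta1 in the summary includes zero.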
