# **IEOR E4650  Business Analytics (Fall 2019)**

## **Lecture 11: Weibull Model with Covariates**

In this lecture, we discuss how to add covariates to the Weibull model.

Learning objective:

* Understand how to add covariates to the Weibull model when modeling timing data



In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
link="https://drive.google.com/open?id=1uNSppU66HLaLHNM1noTFJ4s_9TQMbJDa"
# Extract the Drive file ID from the share link (avoids shadowing the built-in id)
_,file_id=link.split("=")
downloaded = drive.CreateFile({'id':file_id})
downloaded.GetContentFile('myfile.csv')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as spst
from scipy.optimize import minimize
import scipy.special as spsp

# Suppress warnings to keep the notebook output readable
import warnings
warnings.simplefilter("ignore")

Patients = pd.read_csv('myfile.csv')
Patients.head(10)


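| Unnamed: 0 | id | age | censor | race | time | time_in_month |
|---|---|---|---|---|---|---|
| 0 | 1 | 39.0 | 0 | 0.0 | 188 | 7 |
| 1 | 2 | 33.0 | 0 | 0.0 | 26 | 1 |
| 2 | 3 | 33.0 | 0 | 0.0 | 207 | 7 |
| 3 | 4 | 32.0 | 0 | 0.0 | 144 | 5 |
| 4 | 5 | 24.0 | 1 | 1.0 | 551 | 19 |
| 5 | 6 | 30.0 | 0 | 0.0 | 32 | 2 |
| 6 | 7 | 39.0 | 0 | 0.0 | 459 | 16 |
| 7 | 8 | 27.0 | 0 | 0.0 | 22 | 1 |
| 8 | 9 | 40.0 | 0 | 0.0 | 210 | 7 |
| 9 | 10 | 36.0 | 0 | 0.0 | 184 | 7 |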


## Baseline model

Recall that the pdf of the Weibull distribution is:

$$f(t)=(c\lambda)(t\lambda)^{c-1}e^{-(t\lambda )^c}$$

where $c>0$, $\lambda>0$, and $t>0$. The mean is $E(t)=\frac{\Gamma(1+\frac{1}{c})}{\lambda}$.

The cdf of the Weibull distribution is:
$$F(t)=1-e^{-(t\lambda )^c}$$
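As a quick numerical check of the formulas above, the sketch below codes the pdf, cdf, and mean directly and compares them with `scipy.stats.weibull_min`. It assumes that SciPy's shape parameter plays the role of $c$ and that its `scale` equals $1/\lambda$. The function names `weibull_pdf` and `weibull_cdf` and the parameter values are illustrative only.

```python
import numpy as np
import scipy.stats as spst
import scipy.special as spsp

def weibull_pdf(t, c, lmbda):
    # f(t) = (c*lambda) * (t*lambda)^(c-1) * exp(-(t*lambda)^c)
    return (c * lmbda) * (t * lmbda) ** (c - 1) * np.exp(-(t * lmbda) ** c)

def weibull_cdf(t, c, lmbda):
    # F(t) = 1 - exp(-(t*lambda)^c)
    return 1 - np.exp(-(t * lmbda) ** c)

c, lmbda = 1.5, 0.2              # illustrative values, not estimates
t = np.linspace(0.1, 20, 50)

# Assumed mapping to SciPy: shape = c, scale = 1/lambda
ref = spst.weibull_min(c, scale=1 / lmbda)

pdf_ok = np.allclose(weibull_pdf(t, c, lmbda), ref.pdf(t))
cdf_ok = np.allclose(weibull_cdf(t, c, lmbda), ref.cdf(t))
mean_ok = np.isclose(spsp.gamma(1 + 1 / c) / lmbda, ref.mean())
print(pdf_ok, cdf_ok, mean_ok)
```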


To add covariates $x_1$ and $x_2$, we have two options. In both, the exponential keeps the parameter positive.

(1) Add covariates to $\lambda$:

$\lambda=\exp(\beta_0+\beta_1 x_1+\beta_2 x_2)$

(2) Add covariates to $c$:

$c=\exp(\beta_0+\beta_1 x_1+\beta_2 x_2)$
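To make option (1) concrete, here is a minimal sketch of a negative log-likelihood with covariates in $\lambda$ and a common shape $c$. It treats time as observed in whole periods, so an event in period $y$ contributes $F(y)-F(y-1)$ and a censored observation contributes $1-F(y)$. The data are synthetic, and the function name `neg_LL_lambda_cov`, the true coefficients, and the censoring point of 24 periods are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_LL_lambda_cov(betas, y, censor, x1, x2):
    # Option (1): lambda depends on the covariates; c is shared by everyone
    lmbda = np.exp(betas[0] + betas[1] * x1 + betas[2] * x2)
    c = np.exp(betas[3])
    surv_y = np.exp(-(y * lmbda) ** c)             # S(y) = 1 - F(y)
    surv_prev = np.exp(-((y - 1) * lmbda) ** c)    # S(y-1)
    # Censored: S(y); event in period y: F(y) - F(y-1) = S(y-1) - S(y)
    lik = np.where(censor == 1, surv_y, surv_prev - surv_y)
    return -np.sum(np.log(np.clip(lik, 1e-300, None)))

# Synthetic data (hypothetical coefficients, for illustration only)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
true_betas = np.array([-2.0, 0.3, 0.5, np.log(1.2)])
true_lmbda = np.exp(true_betas[0] + true_betas[1] * x1 + true_betas[2] * x2)
t = rng.weibull(1.2, size=n) / true_lmbda          # continuous event times
y = np.ceil(t)                                     # period in which the event occurs
censor = (y > 24).astype(int)                      # still event-free after 24 periods
y = np.minimum(y, 24)

nll_zero = neg_LL_lambda_cov(np.zeros(4), y, censor, x1, x2)
nll_true = neg_LL_lambda_cov(true_betas, y, censor, x1, x2)
fit = minimize(neg_LL_lambda_cov, np.zeros(4), args=(y, censor, x1, x2), method="BFGS")
print(nll_zero, nll_true, fit.fun)
print(fit.x)
```

Option (2) would follow the same pattern, with the linear predictor moved from `lmbda` to `c`.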



## Two segment Weibull model 

Previously, we assumed that we have two segments of people, one segment with higher $\lambda$ and one segment with lower $\lambda$. In this case, we will estimate $\lambda_1$, $\lambda_2$, $c$, and $p$. 

To incorporate $x_1$ and $x_2$ into this model, we have four options:

(1) incorporate $x_1$ and $x_2$ into $\lambda_1$

(2) incorporate $x_1$ and $x_2$ into $\lambda_2$

(3) incorporate $x_1$ and $x_2$ into $c$

(4) incorporate $x_1$ and $x_2$ into $p$

We saw the same cases when we discussed the Poisson model. $\lambda_1$, $\lambda_2$, and $c$ are all positive parameters, so we can use $\exp(\beta_0+\beta_1 x_1+\beta_2 x_2)$. $p$ lies between 0 and 1, so we can use $\frac{\exp(\beta_0+\beta_1 x_1+\beta_2 x_2)}{1+\exp(\beta_0+\beta_1 x_1+\beta_2 x_2)}$.
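The two link functions described above can be sketched as follows. The helper names `positive_link` and `probability_link`, the covariate values, and the coefficients are made up for illustration. `scipy.special.expit` computes the same logistic function $\frac{e^z}{1+e^z}$, which the last line checks against the explicit formula.

```python
import numpy as np
import scipy.special as spsp

def positive_link(beta0, beta1, beta2, x1, x2):
    # exp(.) maps any real-valued linear predictor to a positive parameter
    return np.exp(beta0 + beta1 * x1 + beta2 * x2)

def probability_link(beta0, beta1, beta2, x1, x2):
    # expit(z) = exp(z) / (1 + exp(z)) maps the linear predictor into (0, 1)
    return spsp.expit(beta0 + beta1 * x1 + beta2 * x2)

x1 = np.array([24.0, 33.0, 40.0])    # illustrative ages
x2 = np.array([1.0, 0.0, 0.0])       # illustrative race indicator

lam = positive_link(-1.7, 0.01, 0.2, x1, x2)     # hypothetical coefficients
p = probability_link(-2.6, 0.03, 0.5, x1, x2)    # hypothetical coefficients

z = -2.6 + 0.03 * x1 + 0.5 * x2
same_as_formula = np.allclose(p, np.exp(z) / (1 + np.exp(z)))
print(lam, p, same_as_formula)
```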








In [0]:
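# Count missing values in the two covariates (age and race)
print(np.sum(np.isnan(Patients["age"].values)))
print(np.sum(np.isnan(Patients["race"].values)))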

5
6


In [0]:
# Fill in missing values using column means

Patients=Patients.fillna(Patients.mean())

In [0]:
y=Patients["time_in_month"].values
censor=Patients["censor"].values
x1=Patients["age"].values
x2=Patients["race"].values
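def neg_LL(betas):
  # Segment 1 never experiences the event (lambda_1 = 0); segment 2 follows a Weibull
  lmbda1=0
  lmbda2=np.exp(betas[0])
  c=np.exp(betas[1])
  # Probability of belonging to segment 1, a logistic function of age (x1) and race (x2)
  p=np.exp(betas[2]+betas[3]*x1+betas[4]*x2)/(1+np.exp(betas[2]+betas[3]*x1+betas[4]*x2))

  # Event observed in period y: F(y)-F(y-1) within each segment
  ind_L1r=(1-np.exp(-(y*lmbda1)**c))-(1-np.exp(-((y-1)*lmbda1)**c))
  ind_L2r=(1-np.exp(-(y*lmbda2)**c))-(1-np.exp(-((y-1)*lmbda2)**c))
  ind_Lr=ind_L1r*p+ind_L2r*(1-p)

  # Censored at y: survival probability 1-F(y) within each segment
  ind_L1c=1-(1-np.exp(-(y*lmbda1)**c))
  ind_L2c=1-(1-np.exp(-(y*lmbda2)**c))
  ind_Lc=ind_L1c*p+ind_L2c*(1-p)

  # censor=1 selects the survival term; censor=0 selects the event term
  ind_L=ind_Lr*(1-censor)+ind_Lc*censor

  return -np.sum(np.log(ind_L))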

guess=-np.random.rand(5)*0.01
model1=minimize(neg_LL,guess,method="BFGS")
print(model1.fun)
print(model1.x)

1620.651206516237
[-1.74162909  0.15107808 -2.6146115   0.02846304  0.53380578]


In [0]:
betas=model1.x
lmbda1=0
lmbda2=np.exp(betas[0])
c=np.exp(betas[1])
p=np.exp(betas[2]+betas[3]*x1+betas[4]*x2)/(1+np.exp(betas[2]+betas[3]*x1+betas[4]*x2))

print(betas[3])
print(betas[4])

print(lmbda1)
print(lmbda2)
print(c)
print(p)


0.028463038001646453
0.5338057833057609
0
0.1752346946487912
1.1630874740625388
[0.18174744 0.15771468 0.15771468 0.15397036 0.19818248 0.14670014
 0.18174744 0.13633034 0.18601873 0.16939204 0.25263661 0.17755283
 0.14317287 0.23685706 0.19036705 0.15029907 0.13633034 0.13971652
 0.13971652 0.2580485  0.15397036 0.15771468 0.13971652 0.14317287
 0.16542493 0.20853566 0.16542493 0.1265848  0.16939204 0.15276373
 0.18174744 0.16939204 0.17343451 0.14670014 0.20387684 0.21689717
 0.20274432 0.14670014 0.1265848  0.13633034 0.2267219  0.16153267
 0.15771468 0.16153267 0.16939204 0.13633034 0.25263661 0.1265848
 0.21689717 0.21327242 0.13301353 0.19479269 0.14670014 0.24204039
 0.2267219  0.13633034 0.17755283 0.22297971 0.16939204 0.13971652
 0.15029907 0.13971652 0.13971652 0.207384   0.16153267 0.13301353
 0.15397036 0.15029907 0.28621855 0.14670014 0.16542493 0.18174744
 0.22137378 0.15771468 0.15771468 0.22177064 0.16939204 0.13301353
 0.17343451 0.21650661 0.14317287 0.16153267 0.226

## Continuous observed heterogeneity

Everything here parallels the Poisson case. We again assume $\lambda=\lambda_0\exp(\beta_1 x_1+ \beta_2 x_2)$ with $\lambda_0 \sim Gamma(\gamma, \alpha)$.

This again gives $\lambda \sim Gamma(\gamma, \frac{\alpha}{\exp(\beta_1 x_1+ \beta_2 x_2)})$.

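Integrating over $\lambda$, with the Weibull survival function written as $e^{-\lambda t^c}$ (rather than the $e^{-(t\lambda)^c}$ form of the baseline model), we have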

$f(t)=\frac{c\gamma t^c(\frac{\alpha'}{\alpha'+t^c})^\gamma}{t(\alpha'+t^c)}$

$F(t)=1-(\frac{\alpha'}{\alpha'+t^c})^\gamma$

and the posterior distribution of $\lambda$ after observing an event at time $t$ is

$\lambda|t \sim Gamma(\gamma+1, \alpha'+t^c)$

where $\alpha'=\frac{\alpha}{\exp(\beta_1 x_1+ \beta_2 x_2)}$.
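The formulas above can be sketched in code as follows. The function names (`alpha_prime`, `wg_pdf`, `wg_cdf`, `posterior_mean_lambda`) and all parameter values are hypothetical. The pdf uses $t^{c-1}$, which is the same as the $t^c/t$ in the expression for $f(t)$. As a sanity check, the sketch integrates the pdf numerically and compares the result with the closed-form cdf. The posterior mean uses the standard fact that a Gamma distribution with shape $\gamma+1$ and rate $\alpha'+t^c$ has mean $\frac{\gamma+1}{\alpha'+t^c}$.

```python
import numpy as np
from scipy.integrate import quad

def alpha_prime(alpha, b1, b2, x1, x2):
    # alpha' = alpha / exp(beta1*x1 + beta2*x2)
    return alpha / np.exp(b1 * x1 + b2 * x2)

def wg_pdf(t, c, gamma, alpha, b1, b2, x1, x2):
    a = alpha_prime(alpha, b1, b2, x1, x2)
    # f(t) = c*gamma*t^(c-1) * (a/(a+t^c))^gamma / (a+t^c)
    return c * gamma * t ** (c - 1) * (a / (a + t ** c)) ** gamma / (a + t ** c)

def wg_cdf(t, c, gamma, alpha, b1, b2, x1, x2):
    a = alpha_prime(alpha, b1, b2, x1, x2)
    # F(t) = 1 - (a/(a+t^c))^gamma
    return 1 - (a / (a + t ** c)) ** gamma

def posterior_mean_lambda(t, c, gamma, alpha, b1, b2, x1, x2):
    a = alpha_prime(alpha, b1, b2, x1, x2)
    # Mean of Gamma(shape = gamma+1, rate = a + t^c)
    return (gamma + 1) / (a + t ** c)

# Hypothetical parameter values for one individual
params = dict(c=1.2, gamma=2.0, alpha=50.0, b1=0.03, b2=0.5, x1=30.0, x2=1.0)

# Integrating the pdf from 0 to 12 should reproduce F(12)
area, _ = quad(lambda t: wg_pdf(t, **params), 0, 12.0)
cdf_12 = wg_cdf(12.0, **params)
post_mean = posterior_mean_lambda(12.0, **params)
print(area, cdf_12, post_mean)
```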


