# **IEOR E4650  Business Analytics (Fall 2019)**

## **Lecture 10: Weibull Model**

In this lecture, we discuss how to build a timing model.  

Learning objectives:

* Understand how to use the Weibull model to model timing data
* Understand the concept of hazard



## Baseline Model

In this lecture, we turn to a new topic. Previously, our models focused on the number of occurrences of an event. Another commonly used type of model describes the timing of the occurrence. This type of model is called a timing model.

To illustrate the timing model, we study a new dataset related to healthcare management. In this dataset, we track each patient who is released from the hospital and record after how many days or months the patient relapses. By the end of the observation period, some patients may not have relapsed. They are recorded as "censored".



In [0]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
link="https://drive.google.com/open?id=1uNSppU66HLaLHNM1noTFJ4s_9TQMbJDa"
_,id=link.split("=")
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('myfile.csv')  
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt

import scipy.stats as spst
from scipy.optimize import minimize
import scipy.special as spsp




import warnings
warnings.simplefilter("ignore")

Patients = pd.read_csv('myfile.csv')
Patients.head(10)


## Baseline model

For now, let's not incorporate any covariates. We will focus on the outcome variables, which are

(1) the number of months before a patient relapses.

(2) whether the observation is censored.


Usually, people treat time as a continuous variable. We therefore need a continuous distribution that is defined on the positive range and has a flexible shape. The Weibull distribution is one of the most commonly used choices.

The pdf of the Weibull distribution is:

$$f(t)=(c\lambda)(t\lambda)^{c-1}e^{-(t\lambda )^c}$$

where $c>0$, $\lambda>0$, and $t>0$. The mean is $E(t)=\frac{\Gamma(1+\frac{1}{c})}{\lambda}$.

The cdf of the Weibull distribution is:
$$F(t)=1-e^{-(t\lambda )^c}$$

The exponential distribution is the special case of the Weibull distribution with $c=1$.
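
As a quick numerical check of these formulas, the sketch below codes the pdf and cdf directly and compares them with `scipy.stats.weibull_min`, assuming the correspondence shape `c` and `scale=1/lambda`. The function names and the parameter values are illustrative and are not taken from the dataset.

```python
import numpy as np
import scipy.stats as spst
import scipy.special as spsp

def weibull_pdf(t, c, lam):
    # f(t) = c*lam*(t*lam)^(c-1) * exp(-(t*lam)^c)
    return c * lam * (t * lam) ** (c - 1) * np.exp(-(t * lam) ** c)

def weibull_cdf(t, c, lam):
    # F(t) = 1 - exp(-(t*lam)^c)
    return 1 - np.exp(-(t * lam) ** c)

c, lam = 1.5, 0.2            # hypothetical values
t = np.linspace(0.5, 20, 40)
ref = spst.weibull_min(c, scale=1 / lam)

# Largest absolute gap between the hand-coded formulas and scipy
pdf_gap = np.max(np.abs(weibull_pdf(t, c, lam) - ref.pdf(t)))
cdf_gap = np.max(np.abs(weibull_cdf(t, c, lam) - ref.cdf(t)))

# Mean from the formula E(t) = Gamma(1 + 1/c) / lambda
mean_formula = spsp.gamma(1 + 1 / c) / lam
print(pdf_gap, cdf_gap, mean_formula, ref.mean())
```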


### Modeling the likelihood

(1) Treating the observed timing as discrete. In this case, the outcome is itself "censored", since we only observe integer outcomes. What we have is the time rounded up: $t=1$ means the event happened between $0$ and $1$. Similarly, $t=4$ means the event happened between $3$ and $4$. In this case, instead of using $f(t)$, we should use $F(t)-F(t-1)$ to model the likelihood.

(2) We also have observations that are censored at $t$. Those people might relapse later, but we could not follow them long enough. The probability of observing a censored patient is $1-F(t)$. 

In [0]:
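y=Patients["time_in_month"].values
censor=Patients["censor"].values
def neg_LL(betas):
  # Assumption: betas=(log c, log lambda), so that c>0 and lambda>0 during the search.
  c,lam=np.exp(betas)
  F=lambda t: 1-np.exp(-(lam*t)**c)   # Weibull cdf
  # censor==0 marks a relapse (as in the plotting cells): F(y)-F(y-1); censored at y: 1-F(y)
  ind_L=np.where(censor==0,F(y)-F(y-1),1-F(y))
  return -np.sum(np.log(ind_L))

guess=np.zeros(2)   # starts at c=1, lambda=1 (an arbitrary starting point)
model1=minimize(neg_LL,guess,method="BFGS")
print(model1.fun)
print(model1.x)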

### Model fitting

Here, let's plot the actual vs. predicted distribution of relapse times for those patients who relapsed. 



In [0]:
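counts=Patients[Patients["censor"]==0].groupby("time_in_month").count()

# Assumption: model1.x holds (log c, log lambda) from the baseline fit.
c,lam=np.exp(model1.x)
y1=counts.index.values
# Expected number of relapses in month t: N*(F(t)-F(t-1)), with N the total number of patients
F=lambda t: 1-np.exp(-(lam*t)**c)
prediction=len(Patients)*(F(y1)-F(y1-1))

plt.bar(counts.index.values, counts.iloc[:,4].values)
plt.scatter(y1,prediction,zorder=2)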





### Hazard

One special feature of a timing model is the concept of "hazard". It measures the likelihood of an event happening at $t$, conditional on it not having happened before $t$.

Mathematically, it is defined as $$h(t)=\frac{f(t)}{1-F(t)}$$

For the Weibull distribution defined above, it is $$h(t)=c\lambda (t\lambda)^{c-1}$$

We can see that $\lambda$ determines the scale of the hazard: when $\lambda$ is higher, the hazard is higher in general. $c$ determines whether the hazard is increasing over time ($c>1$) or decreasing over time ($c<1$). When $c=1$ (an exponential distribution), we have a constant hazard equal to $\lambda$. 

For the discrete case, we can define hazard using 

$$h(t)=\frac{F(t)-F(t-1)}{1-F(t-1)}$$

This gives the probability of an event happening in period $t$ given that it has not happened before.
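
To see how $c$ shapes the hazard, the sketch below evaluates the continuous hazard $f(t)/(1-F(t))$ and the discrete hazard on a grid of months. The parameter values are hypothetical and the helper names are illustrative.

```python
import numpy as np

def weibull_cdf(t, c, lam):
    return 1 - np.exp(-(t * lam) ** c)

def weibull_pdf(t, c, lam):
    return c * lam * (t * lam) ** (c - 1) * np.exp(-(t * lam) ** c)

def hazard(t, c, lam):
    # Continuous hazard: f(t) / (1 - F(t))
    return weibull_pdf(t, c, lam) / (1 - weibull_cdf(t, c, lam))

def discrete_hazard(t, c, lam):
    # P(event in period t | no event before period t)
    F_prev = weibull_cdf(t - 1, c, lam)
    return (weibull_cdf(t, c, lam) - F_prev) / (1 - F_prev)

months = np.arange(1, 13)
lam = 0.1                      # hypothetical value
for c in (0.7, 1.0, 1.5):      # decreasing, constant, and increasing hazard
    print(c, np.round(hazard(months, c, lam), 4),
          np.round(discrete_hazard(months, c, lam), 4))
```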

### Business Recommendations

Using this model, we can predict what is going to happen in the future. For example, if we have a person who has not relapsed at the end of the second month, can we know the probability for this person to relapse by the end of the third month?
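
The question above is a conditional probability: $P(\text{relapse by month } 3 \mid \text{no relapse by month } 2)=\frac{F(3)-F(2)}{1-F(2)}$, which is the discrete hazard at $t=3$. The sketch below computes it, along with the more general probability of relapsing within the next $k$ months, for hypothetical values of $c$ and $\lambda$. In practice these would be replaced by the estimates from the fitted model. The helper name `prob_relapse_within` is illustrative.

```python
import numpy as np

def weibull_cdf(t, c, lam):
    return 1 - np.exp(-(t * lam) ** c)

def prob_relapse_within(k, survived, c, lam):
    # P(relapse in (survived, survived + k] | no relapse by `survived`)
    F_now = weibull_cdf(survived, c, lam)
    F_later = weibull_cdf(survived + k, c, lam)
    return (F_later - F_now) / (1 - F_now)

c, lam = 0.8, 0.05   # hypothetical parameter values, not estimates from the data
p_month3 = prob_relapse_within(1, survived=2, c=c, lam=lam)
p_next_year = prob_relapse_within(12, survived=2, c=c, lam=lam)
print(p_month3, p_next_year)
```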

## Two segment Weibull model 

Here, we assume that there are two segments of people, one segment with higher $\lambda$ and one segment with lower $\lambda$. In this case, we will estimate $\lambda_1$, $\lambda_2$, $c$, and $p$, where $p$ is the probability that a patient belongs to the first segment. 

In [0]:
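y=Patients["time_in_month"].values
censor=Patients["censor"].values
def neg_LL(betas):
  # Assumption: betas=(log c, log lambda_1, log lambda_2, logit p).
  c,lam1,lam2=np.exp(betas[:3])
  p=spsp.expit(betas[3])   # probability of segment 1, kept between 0 and 1
  F=lambda t,lam: 1-np.exp(-(lam*t)**c)
  # Likelihood within a segment: F(y)-F(y-1) for a relapse (censor==0), 1-F(y) if censored
  seg_L=lambda lam: np.where(censor==0,F(y,lam)-F(y-1,lam),1-F(y,lam))
  ind_L=p*seg_L(lam1)+(1-p)*seg_L(lam2)
  return -np.sum(np.log(ind_L))

guess=np.array([0.0,0.0,-1.0,0.0])   # different starting lambdas so the segments can separate (arbitrary choice)
model1=minimize(neg_LL,guess,method="BFGS")
print(model1.fun)
print(model1.x)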

### Model fitting

In [0]:
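counts=Patients[Patients["censor"]==0].groupby("time_in_month").count()
# Assumption: model1.x holds (log c, log lambda_1, log lambda_2, logit p) from the two-segment fit.
c,lam1,lam2=np.exp(model1.x[:3]); p=spsp.expit(model1.x[3])
y1=counts.index.values
F=lambda t: 1-p*np.exp(-(lam1*t)**c)-(1-p)*np.exp(-(lam2*t)**c)   # mixture cdf
prediction=len(Patients)*(F(y1)-F(y1-1))
plt.bar(counts.index.values, counts.iloc[:,4].values)
plt.scatter(y1,prediction,zorder=2)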





### Hazard

### Posterior analysis

In this model, we are able to identify two groups of people. 

* Group 1: with high $\lambda$. People in this group have, on average, a higher risk of relapsing.
* Group 2: with low $\lambda$. People in this group have, on average, a lower risk of relapsing.

By observing $t$, we can update our belief about which risk group a patient belongs to. Thus, we will have a better idea of when a relapse might happen.

In [0]:
#Consider the following two patients
#Patient who relapsed in month 5.
#Patient who is censored and has not relapsed by the end of month 7.
#Do the posterior analysis for these two patients. 
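
One way to approach this exercise is Bayes' rule: the posterior probability of segment 1 is $p$ times the segment-1 likelihood of the observation, divided by the mixture likelihood. The sketch below applies this to the two patients, using made-up parameter values in place of the fitted ones. The helper names `obs_likelihood` and `posterior_seg1` are illustrative.

```python
import numpy as np

def weibull_cdf(t, c, lam):
    return 1 - np.exp(-(t * lam) ** c)

def obs_likelihood(t, censored, c, lam):
    # 1 - F(t) for a censored patient, F(t) - F(t-1) for a relapse in month t
    if censored:
        return 1 - weibull_cdf(t, c, lam)
    return weibull_cdf(t, c, lam) - weibull_cdf(t - 1, c, lam)

def posterior_seg1(t, censored, c, lam1, lam2, p):
    # Bayes' rule: P(segment 1 | observation)
    l1 = p * obs_likelihood(t, censored, c, lam1)
    l2 = (1 - p) * obs_likelihood(t, censored, c, lam2)
    return l1 / (l1 + l2)

# Hypothetical parameters: segment 1 is the high-risk group with prior probability p
c, lam1, lam2, p = 1.2, 0.30, 0.03, 0.4

post_relapsed = posterior_seg1(5, False, c, lam1, lam2, p)   # relapsed in month 5
post_censored = posterior_seg1(7, True, c, lam1, lam2, p)    # no relapse by end of month 7
print(post_relapsed, post_censored)
```

With these made-up values, an early relapse shifts weight toward the high-risk segment, while seven months without a relapse shifts weight toward the low-risk segment.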

## Continuous unobserved heterogeneity

We can move on to assume that $\lambda \sim Gamma (\gamma, \alpha)$, with shape $\gamma$ and rate $\alpha$. Here the Weibull distribution is parametrized so that $F(t|\lambda)=1-e^{-\lambda t^c}$.

The distribution of $t$ is called the Weibull-Gamma distribution and has the following pdf and cdf:

$$f(t)=\frac{c\gamma t^c(\frac{\alpha}{\alpha+t^c})^\gamma}{t(\alpha+t^c)}$$

$$F(t)=1-(\frac{\alpha}{\alpha+t^c})^\gamma$$
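
These closed forms come from averaging the Weibull survival function $e^{-\lambda t^c}$ over the gamma distribution of $\lambda$. As a sanity check, the sketch below does that integral numerically with `scipy.integrate.quad` and compares it with the closed-form cdf, assuming $\gamma$ is the gamma shape and $\alpha$ the rate. All parameter values are hypothetical and the function names are illustrative.

```python
import numpy as np
import scipy.stats as spst
from scipy.integrate import quad

def wg_cdf(t, c, gam, alpha):
    # Weibull-Gamma cdf: F(t) = 1 - (alpha / (alpha + t^c))^gamma
    return 1 - (alpha / (alpha + t ** c)) ** gam

def wg_pdf(t, c, gam, alpha):
    # Weibull-Gamma pdf, written as in the text
    return c * gam * t ** c * (alpha / (alpha + t ** c)) ** gam / (t * (alpha + t ** c))

def mixed_cdf(t, c, gam, alpha):
    # 1 - E[exp(-lambda * t^c)] with lambda ~ Gamma(shape=gam, rate=alpha)
    prior = spst.gamma(a=gam, scale=1 / alpha)
    surv, _ = quad(lambda lam: np.exp(-lam * t ** c) * prior.pdf(lam), 0, np.inf)
    return 1 - surv

c, gam, alpha = 1.3, 2.0, 5.0   # hypothetical values
for t in (1.0, 3.0, 6.0):
    print(t, wg_cdf(t, c, gam, alpha), mixed_cdf(t, c, gam, alpha))
```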



In [0]:
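##y=Patients["time"].values
y=Patients["time_in_month"].values
censor=Patients["censor"].values
def neg_LL(betas):
  # Assumption: betas=(log c, log gamma, log alpha). This order is a choice made here;
  # the printed output may correspond to a different order.
  c,gam,alpha=np.exp(betas)
  F=lambda t: 1-(alpha/(alpha+t**c))**gam   # Weibull-Gamma cdf
  # F(y)-F(y-1) for a relapse (censor==0), 1-F(y) for a censored patient
  ind_L=np.where(censor==0,F(y)-F(y-1),1-F(y))
  return -np.sum(np.log(ind_L))

guess=np.zeros(3)   # starts at c=gamma=alpha=1 (an arbitrary starting point)
model1=minimize(neg_LL,guess,method="BFGS")
print(model1.fun)
print(model1.x)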

1633.7212721995447
[ 1.86109188 -0.34711302  0.34191669]


### Model fitting

In [0]:
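counts=Patients[Patients["censor"]==0].groupby("time_in_month").count()
betas=model1.x
# Assumption: betas holds (log c, log gamma, log alpha) from the Weibull-Gamma fit.
c,gam,alpha=np.exp(betas)
y1=counts.index.values
F=lambda t: 1-(alpha/(alpha+t**c))**gam   # Weibull-Gamma cdf
prediction=len(Patients)*(F(y1)-F(y1-1))   # expected number of relapses per month

plt.bar(counts.index.values, counts.iloc[:,4].values)
plt.scatter(y1,prediction,zorder=2)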





### Hazard

### Posterior Analysis

For Weibull-Gamma, the posterior distribution of $\lambda$ for a patient is 

$$\lambda|t \sim Gamma(\gamma+1, \alpha+t^c)$$

That is, if we see a patient relapse at time $t$, then $\lambda$ for this patient again follows a gamma distribution, with updated parameters.
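
A short sketch of this update, using `scipy.stats.gamma` with shape $\gamma$ and rate $\alpha$ (so `scale=1/alpha`). The parameter values are hypothetical and `posterior_lambda` is an illustrative name. The censored case is an addition not stated above: for a patient who has not relapsed by $t$, the likelihood is $e^{-\lambda t^c}$, so the shape stays at $\gamma$ and only the rate is updated.

```python
import scipy.stats as spst

def posterior_lambda(t, c, gam, alpha, relapsed=True):
    # Relapse at t:   lambda | t ~ Gamma(gam + 1, alpha + t^c)
    # Censored at t:  lambda | t ~ Gamma(gam,     alpha + t^c)  (extension, see lead-in)
    shape = gam + 1 if relapsed else gam
    rate = alpha + t ** c
    return spst.gamma(a=shape, scale=1 / rate)

c, gam, alpha = 1.3, 2.0, 5.0    # hypothetical values
prior_mean = gam / alpha
early = posterior_lambda(1.0, c, gam, alpha)                     # relapsed at t = 1
late = posterior_lambda(8.0, c, gam, alpha)                      # relapsed at t = 8
censored = posterior_lambda(8.0, c, gam, alpha, relapsed=False)  # no relapse by t = 8
print(prior_mean, early.mean(), late.mean(), censored.mean())
```

With these values, an early relapse raises the expected $\lambda$ above the prior mean, while a late relapse or a long relapse-free period lowers it.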