## Name: Andrew Caide
### CSCI S-89C Deep Reinforcement Learning  
### Part II of Assignment 5

# Problem 1 (25 points)

In this problem we consider patients with end-stage liver disease (ESLD). We assume that a patient's health condition is fully characterized by the Model for End-stage Liver Disease (MELD) score (Jae-Hyeon Ahn and John Hornberger, Involving patients in the cadaveric kidney transplant allocation process: a decision-theoretic perspective. Manage Sci. 1996;42(5):629–41).

The MELD score ranges from 6 to 40 and is derived based on the probability of survival at 3 months for patients with ESLD. Data in ESLD is usually sparse and often aggregated into Stages. We assume that there are 18 stages based on the ESLD: Stage 1, Stage 2, ..., Stage 18. The time step is 1 year and the actions in Stages 1 through 18 are "wait" (denoted by 0) and "transplant" (denoted by 1).

We assume that the Markov property holds. There are two additional states of the Markov Decision Process: "Posttransplant Life" (denoted by 19) and "Death" (which is denoted by 20 and combines the so-called "Pretransplant Death" and "Posttransplant Death"). The only action available in state "Posttransplant Life" is "wait", and "Death" is the terminal state with no actions. Assume that the length of an episode is T=50, unless it terminates earlier due to the transition to the absorbing state "Death."

We do not know the transition probabilities, but if a patient selects "wait," the possible transitions are   
1) Stage 1->Stage 1, Stage 1->Stage 2, Stage 1->Death  
2) For k in {2,3,4,...17}, Stage k->Stage (k-1), Stage k->Stage k, Stage k->Stage (k+1), Stage k->Death    
3) Stage 18->Stage 17, Stage 18->Stage 18, Stage 18->Death    

If a patient selects "transplant" at Stage k, k=1,2,...,18, the only possible transition is  
4) Stage k->"Posttransplant Life"

Finally, there are two more possible transitions:  
5) "Posttransplant Life"->"Posttransplant Life" and "Posttransplant Life"->"Death"  


The patient gets reward 1 in all states "Stage k" (k=1,2,...,18) and reward 0.2 in the "Posttransplant Life" state. Assume that the patient gets these rewards on "exit" from the states, i.e. after we observe the corresponding stage. We assume the discounting parameter $\gamma=0.97$, one of the most common discounting rates used in medical decision making (Gold MR, Siegel JE, Russell LB, Weinstein MC. Cost-Effectiveness in Health and Medicine. Oxford University Press; New York: 1996).
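
As a small worked example of the reward and discounting rules above, the sketch below computes the discounted return of one made-up trajectory. The helper names `reward_on_exit` and `discounted_return` and the trajectory itself are hypothetical; the sketch assumes that "reward on exit" means the reward $R_{t+1}$ is determined by the state occupied at time $t$.

```python
GAMMA = 0.97  # discount factor from the problem statement

def reward_on_exit(state):
    """Reward earned when leaving `state` (assumed reading of 'reward on exit')."""
    if 1 <= state <= 18:   # Stage 1..18
        return 1.0
    if state == 19:        # Posttransplant Life
        return 0.2
    return 0.0             # Death (20) is terminal and pays nothing

def discounted_return(states, gamma=GAMMA):
    """G_0 = sum over t of gamma^t * R_{t+1}, with R_{t+1} earned on exit from states[t]."""
    return sum(gamma ** t * reward_on_exit(s) for t, s in enumerate(states))

# Hypothetical patient: two years in Stage 5, one in Stage 6 (then transplant),
# two years of Posttransplant Life, then Death.
trajectory = [5, 5, 6, 19, 19, 20]
g0 = discounted_return(trajectory)
print(round(g0, 4))
```

The three stage years contribute $1 + 0.97 + 0.97^2$ and the two posttransplant years contribute $0.2\,(0.97^3 + 0.97^4)$.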


Please consider statistics on 8,000 patients with ESLD saved in the 'ESLD_statistics.csv' file. Each row represents an episode (i.e. one patient) and the columns are the sequences of the patients' states and actions. These data were generated under the behavior policy:

$b(1|k)=0.02$ for $k\in\{1,2,3,4,5,6,7,8,9,10,11,12,13\}$;   
$b(1|14)=0.05$;   
$b(1|15)=0.10$;   
$b(1|16)=0.20$;   
$b(1|17)=0.40$;  
$b(1|18)=0.60$;  

which means that, for example, 5% of patients at stage 14 received a transplant.

---
---
## TASK:

Please use off-policy MC control (for estimating $\pi_*$), which corresponds to weighted importance sampling, to obtain the optimal policy. Please be specific and **answer at which stages it is worth considering a transplant and at which stages it is not**.
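
For reference, the textbook form of off-policy MC control with weighted importance sampling (as presented in Sutton and Barto's *Reinforcement Learning: An Introduction*) processes each episode backward, $t = T-1, T-2, \dots, 0$, starting from $G = 0$ and $W = 1$. The symbols $C$ (cumulative weights) and $W$ (importance weight) are not defined in the problem statement and are introduced here as the usual notation:

$$G \leftarrow \gamma G + R_{t+1}$$

$$C(S_t, A_t) \leftarrow C(S_t, A_t) + W$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)}\left[G - Q(S_t, A_t)\right]$$

$$\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$$

If $A_t \neq \pi(S_t)$, the episode is abandoned at this step; otherwise

$$W \leftarrow W \cdot \frac{1}{b(A_t \mid S_t)}$$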

In [1]:
import random
import numpy as np
import pandas as pd

data = pd.read_csv("ESLD_statistics.csv")

---

### Notes

#### Discount:    
$\gamma = 0.97$

#### Actions:
$A \in \{0: \text{Wait},\ 1: \text{Transplant}\}$;    

#### Rewards:    
$R(S_k, A) = 1$ for $k \in \{1,\dots,18\}$    
$R(S_{19}, A) = 0.2$    
$R(S_{20}, A) = 0$


#### Behavior Policy:    
$b(1|k)=0.02$ for $k\in\{1,2,3,4,5,6,7,8,9,10,11,12,13\}$;   
$b(1|14)=0.05$;   
$b(1|15)=0.10$;   
$b(1|16)=0.20$;   
$b(1|17)=0.40$;  
$b(1|18)=0.60$;  

---
#### Possible Transition States:     
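Given $A_t = 0$ (wait):  
$S_t = S_{1} \Rightarrow S_{t+1} \in \{S_{1}, S_{2}, \text{Death}\}$;  
$S_t = S_{k},\ k \in \{2,\dots,17\} \Rightarrow S_{t+1} \in \{S_{k-1}, S_{k}, S_{k+1}, \text{Death}\}$;  
$S_t = S_{18} \Rightarrow S_{t+1} \in \{S_{17}, S_{18}, \text{Death}\}$; 

**If there's a transplant**    
Given $A_t = 1$ and $S_t = S_{k},\ k \in \{1,\dots,18\}$: $S_{t+1} = S_{PTL} = S_{19}$; 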

Note Important States:    
$S_{19}:$ Post-Transplant Life   
$S_{20}:$ Death
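
The transition structure summarized in these notes can be written as a small lookup function. This is only a sketch for checking an observed trajectory against the allowed moves; the name `possible_next_states` and the constants `PTL` and `DEATH` are hypothetical helpers, not part of the assignment code.

```python
PTL, DEATH = 19, 20  # state codes for Posttransplant Life and Death

def possible_next_states(state, action):
    """Set of states reachable in one step from `state` under `action`."""
    if state == DEATH:
        return set()                      # terminal state, no transitions
    if state == PTL:
        if action != 0:
            raise ValueError("only 'wait' (0) is available in Posttransplant Life")
        return {PTL, DEATH}
    if not 1 <= state <= 18:
        raise ValueError(f"unknown state: {state}")
    if action == 1:                       # transplant always leads to Posttransplant Life
        return {PTL}
    # wait: stay, move one stage down or up (where that stage exists), or die
    neighbours = {s for s in (state - 1, state, state + 1) if 1 <= s <= 18}
    return neighbours | {DEATH}

print(sorted(possible_next_states(1, 0)))   # [1, 2, 20]
print(sorted(possible_next_states(18, 0)))  # [17, 18, 20]
```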

---

In [2]:
def get_reward(n):
    # n is the next state S_{t+1}: Stages 1-18 pay 1,
    # Posttransplant Life (19) pays 0.2, Death (20) pays 0
    if n < 19:
        return 1
    elif n == 19:
        return .2
    else:
        return 0

def behavior_policy_from_assignment(state, action):
    # Returns b(action | state) for state in 1..20 and action in {0, 1}
    behaviors = {13: .02, 14: .05, 15: .1, 16: .2, 17: .4, 18: .6}

    if state <= 13:
        behavior = [1 - behaviors[13], behaviors[13]]
    elif state < 19:
        behavior = [1 - behaviors[state], behaviors[state]]
    else:
        behavior = [1, 0]  # In states 19 and 20 the action is always 0
    return behavior[action]

def e_soft_policy(state, action, eta=0.7):
    # State-independent epsilon-soft probabilities (eta plays the role of epsilon):
    # action 0 has probability 1 - eta + eta/2, action 1 has probability eta/2
    return [1 - eta + eta/2, eta/2][action]
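
To see how much the choice of $b$ affects the importance weight $W = \prod_t 1/b(A_t|S_t)$, the sketch below compares the two behavior policies on a made-up run of ten consecutive "wait" years in Stage 5. The function names `assignment_b`, `eps_soft_b` and `importance_weight` are hypothetical stand-ins that mirror the two functions above. Note that, according to the problem statement, the data were generated under the assignment's policy.

```python
def assignment_b(state, action):
    """b(action | state) as stated in the assignment, for Stages 1-18 only."""
    p_transplant = {14: 0.05, 15: 0.10, 16: 0.20, 17: 0.40, 18: 0.60}.get(state, 0.02)
    return p_transplant if action == 1 else 1 - p_transplant

def eps_soft_b(state, action, eps=0.7):
    """State-independent eps-soft probabilities: 0.65 for wait, 0.35 for transplant."""
    return eps / 2 if action == 1 else 1 - eps + eps / 2

def importance_weight(pairs, b):
    """W = product of 1 / b(a | s) over the given (state, action) pairs."""
    w = 1.0
    for s, a in pairs:
        w /= b(s, a)
    return w

ten_waits = [(5, 0)] * 10  # hypothetical: ten consecutive "wait" years in Stage 5
w_assignment = importance_weight(ten_waits, assignment_b)  # (1 / 0.98) ** 10
w_eps_soft = importance_weight(ten_waits, eps_soft_b)      # (1 / 0.65) ** 10
print(w_assignment, w_eps_soft)
```

Under the assignment's policy the weight stays close to 1, while under the $\epsilon$-soft probabilities it grows by a factor of about 1.54 per "wait" step.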
    

In [3]:
def offpolicy_MC_control(data, behavior_policy):
    # Returns [target_policy, Q]: the greedy policy (Series indexed by state)
    # and the action-value table (rows = states 1..20, columns = actions 0, 1)
    gamma = 0.97
    # Initialize Q(s, a) and the cumulative weights C(s, a) for all S, A
    Q = pd.DataFrame(0.0, index=range(1, 21), columns=[0, 1])
    C = pd.DataFrame(0.0, index=range(1, 21), columns=[0, 1])
    target_policy = Q.idxmax(axis=1)  # Consistency rule: first max value (ties go to action 0)

    for ep in range(len(data)):
        episode = data.loc[ep, :]
        G = 0
        W = 1

        # Steps are visited in forward order, t = 0, 1, ..., 49
        for step in range(0, 50):
            # Row layout: S_0, A_0, S_1, A_1, ...
            st = episode.iloc[step*2]
            at = episode.iloc[step*2 + 1]
            st1 = episode.iloc[step*2 + 2]

            G = gamma*G + get_reward(st1)
            C.loc[st, at] = C.loc[st, at] + W
            Q.loc[st, at] = Q.loc[st, at] + W/C.loc[st, at]*(G - Q.loc[st, at])
            target_policy[st] = Q.loc[st, :].idxmax()

            # Stop once the observed action departs from the greedy target policy
            if at != target_policy[st]:
                break
            W = W/behavior_policy(st, at)
    return [target_policy, Q]
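
For comparison, the textbook form of this algorithm walks each episode backward from its last step, so that $G$ at step $t$ is the discounted return that follows $t$. The sketch below is a minimal pure-Python variant under two assumptions that differ from the cell above: the reward $R_{t+1}$ is attached to the state being exited, and each episode is given as a list of `(state, action)` pairs that excludes the terminal state. The names `mc_control_backward`, `reward_on_exit`, `uniform_b` and the toy episodes are hypothetical, and this is not the code that produced the results reported here.

```python
from collections import defaultdict

def reward_on_exit(state):
    """Reward earned when leaving `state` (assumed reading of 'reward on exit')."""
    if 1 <= state <= 18:
        return 1.0
    return 0.2 if state == 19 else 0.0

def mc_control_backward(episodes, behavior_prob, gamma=0.97):
    """Off-policy MC control with weighted importance sampling, backward pass."""
    Q = defaultdict(lambda: [0.0, 0.0])   # Q[state][action]
    C = defaultdict(lambda: [0.0, 0.0])   # cumulative importance weights
    policy = defaultdict(int)             # greedy action per state, default "wait"
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action in reversed(episode):
            G = gamma * G + reward_on_exit(state)
            C[state][action] += W
            Q[state][action] += W / C[state][action] * (G - Q[state][action])
            # Greedy improvement; ties resolve to action 0
            policy[state] = 0 if Q[state][0] >= Q[state][1] else 1
            if action != policy[state]:
                break                     # earlier steps would get zero weight
            W /= behavior_prob(state, action)
    return dict(policy), {s: tuple(q) for s, q in Q.items()}

def uniform_b(state, action):
    return 0.5  # hypothetical behavior policy for the toy example

toy_episodes = [
    [(1, 0), (1, 0)],   # two "wait" years in Stage 1, then Death
    [(1, 1), (19, 0)],  # transplant in Stage 1, one posttransplant year, then Death
]
policy, Q_toy = mc_control_backward(toy_episodes, uniform_b)
print(policy)
```

On real data, `behavior_prob` would be the policy that actually generated the episodes, and each row of the CSV would first need to be converted into `(state, action)` pairs.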

In [4]:
esoft = offpolicy_MC_control(data, behavior_policy = e_soft_policy)
provided_policy = offpolicy_MC_control(data, behavior_policy = behavior_policy_from_assignment)

---
Using an $\epsilon$-soft behavior policy $b(A|S)$ (with `eta = 0.7`, i.e. "wait" with probability 0.65 and "transplant" with probability 0.35 in every state), the greedy target policy selects "transplant" in five stages: $1, 13, 14, 15, 17$. The full policy and the action-value estimates are stored in the `esoft` variable.

In [5]:
esoft[0][esoft[0]==1].index.tolist()

[1, 13, 14, 15, 17]

Out of curiosity, I also ran the off-policy MC control with the behavior policy provided in the assignment. I wasn't sure what the purpose of providing this policy was, so I understood it as the policy we had to use. 

With this behavior policy, the greedy target policy selects "transplant" in stages $1, 3, 6, 9, 10, 14$. Interestingly, stages $1$ and $14$ are selected under both behavior policies, so these are the two stages where a transplant is most consistently recommended.

In [6]:
provided_policy[0][provided_policy[0]==1].index.tolist()

[1, 3, 6, 9, 10, 14]
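
The overlap between the two runs can be read off directly from the two output lists shown above. The variable names in this short check are hypothetical; the stage lists are copied from the outputs of cells 5 and 6.

```python
# Stages where each run's greedy policy selects "transplant" (copied from the outputs above)
esoft_transplant = [1, 13, 14, 15, 17]
provided_transplant = [1, 3, 6, 9, 10, 14]

# Stages recommended by both runs, and by only one of them
in_both = sorted(set(esoft_transplant) & set(provided_transplant))
in_one_only = sorted(set(esoft_transplant) ^ set(provided_transplant))

print(in_both)      # [1, 14]
print(in_one_only)  # [3, 6, 9, 10, 13, 15, 17]
```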

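Under the $\epsilon$-soft run, the target policy selects "wait" in the remaining states, so a transplant is less worth considering there (states 19 and 20 are Posttransplant Life and Death, where no transplant is available):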

In [7]:
esoft[0][esoft[0]==0].index.tolist()

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 18, 19, 20]

--- 

---