# DX 704 Week 6 Project

This project will develop a treatment plan for a fictitious illness, "Twizzleflu".
Twizzleflu is a mild illness caused by a virus.
The main symptoms are a mild fever, fidgeting, and kicking the blankets off the bed or couch.
Mild dehydration has also been reported in more severe cases.
These symptoms typically last 1-2 weeks without treatment.
Word on the internet says that Twizzleflu can be cured faster by drinking copious orange juice, but this has not been supported by evidence so far.
You will be provided with a theoretical model of Twizzleflu, formulated as a Markov decision process.
Based on the model, you will compute optimal treatment plans to optimize different criteria, and compare patient discomfort with the different plans.

The full project description, a template notebook, and raw data are available on GitHub: [Project 6 Materials](https://github.com/bu-cds-dx704/dx704-project-06).

We will model Twizzleflu as a Markov decision process.
The model transition probabilities are provided in the file "twizzleflu-transitions.tsv" and the expected rewards are in "twizzleflu-rewards.tsv".
The goal for Twizzleflu is to minimize the patient's expected discomfort, which is expressed as negative rewards in the file.
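
As a rough sketch of how two such tables can be turned into lookup structures, the snippet below parses a tiny made-up model instead of the real files. The column names (`action`, `state`, `next_state`, `probability`, `reward`) and the state and action names are assumptions for illustration only; check the headers of the actual TSV files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature tables standing in for the real TSV files.
transitions_tsv = (
    "action\tstate\tnext_state\tprobability\n"
    "do-nothing\tsick\tsick\t0.8\n"
    "do-nothing\tsick\trecovered\t0.2\n"
    "do-nothing\trecovered\trecovered\t1.0\n"
)
rewards_tsv = (
    "action\tstate\treward\n"
    "do-nothing\tsick\t-1.0\n"
    "do-nothing\trecovered\t0.0\n"
)

# P[(state, action)] -> list of (next_state, probability)
P = defaultdict(list)
for row in csv.DictReader(io.StringIO(transitions_tsv), delimiter="\t"):
    P[(row["state"], row["action"])].append(
        (row["next_state"], float(row["probability"]))
    )

# R[(state, action)] -> expected reward (negative discomfort)
R = {
    (row["state"], row["action"]): float(row["reward"])
    for row in csv.DictReader(io.StringIO(rewards_tsv), delimiter="\t")
}

# Sanity check: outgoing probabilities of every (state, action) pair sum to 1.
for key, outcomes in P.items():
    total = sum(p for _, p in outcomes)
    assert abs(total - 1.0) < 1e-9, key
```

Looking rows up by column name rather than by position keeps the parsing correct even if the columns appear in a different order than expected.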

## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Evaluate a Do Nothing Plan

One of the treatment actions is to do nothing.
Calculate the expected discomfort (not rewards) of a policy that always does nothing.

Hint: for this value calculation and later ones, use value iteration.
The analytical solution has difficulties in practice when there is no discount factor.
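
One way to read this hint: without a discount factor, the closed-form evaluation requires inverting I - P, and that matrix is singular whenever P is a full transition matrix (an absorbing state contributes a row of zeros). The sketch below illustrates this on a hypothetical two-state chain under a fixed policy and then evaluates the same chain by value iteration with gamma = 1. The state names, probabilities, and costs are invented for illustration.

```python
# Hypothetical chain under a fixed policy: P[state] -> {next_state: probability}.
# "recovered" is absorbing and costs nothing.
P = {
    "sick": {"sick": 0.8, "recovered": 0.2},
    "recovered": {"recovered": 1.0},
}
cost = {"sick": 1.0, "recovered": 0.0}  # per-step discomfort = -reward


def evaluate_policy(P, cost, gamma=1.0, tol=1e-10, max_iters=10_000):
    """Iterate V <- cost + gamma * P V until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        V_new = {
            s: cost[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
            for s in P
        }
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new
    return V


V = evaluate_policy(P, cost)
# V["sick"] approaches 1 / 0.2 = 5, the expected number of unit-cost steps.

# Determinant of the 2x2 matrix I - P: the absorbing row is all zeros,
# so the determinant is 0 and the matrix cannot be inverted directly.
a, b = 1.0 - P["sick"]["sick"], -P["sick"]["recovered"]
c, d = 0.0, 1.0 - P["recovered"]["recovered"]
det = a * d - b * c
```

Value iteration sidesteps the inversion entirely. It converges here because the zero-cost absorbing state is eventually reached with probability 1.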

In [None]:
# YOUR CHANGES HERE

# Computing expected discomfort under "do-nothing"
import pandas as pd
from collections import defaultdict

rewards = pd.read_csv("twizzleflu-rewards.tsv", sep="\t")
trans   = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")

states = sorted(set(trans["state"]).union(trans["next_state"]))
R = {(s, a): r for a, s, r in rewards.itertuples(index=False)}
P = defaultdict(list)
for a, s, s2, p in trans.itertuples(index=False):
    P[(s, a)].append((s2, p))

gamma, tol = 1.0, 1e-10  # gamma = 1: no discount factor, as the hint above implies
V = {s: 0.0 for s in states}
C = {s: -R.get((s, "do-nothing"), 0.0) for s in states}

for _ in range(10000):
    V_new = {s: C[s] + gamma * sum(p * V[s2] for s2, p in P[(s, "do-nothing")]) for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < tol:
        V = V_new
        break
    V = V_new

do_nothing_df = pd.DataFrame({"state": states, "expected_discomfort": [V[s] for s in states]})


Save the expected discomfort by state to a file "do-nothing-discomfort.tsv" with columns state and expected_discomfort.

In [2]:
# YOUR CHANGES HERE

do_nothing_df.to_csv("do-nothing-discomfort.tsv", sep="\t", index=False)
print("Saved do-nothing-discomfort.tsv")


Saved do-nothing-discomfort.tsv


Submit "do-nothing-discomfort.tsv" in Gradescope.

## Part 2: Compute an Optimal Treatment Plan

Compute an optimal treatment plan for Twizzleflu.
It should minimize the expected discomfort (maximize the rewards).
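
A minimal sketch of value iteration with a maximum over actions, followed by greedy policy extraction, on a hypothetical two-action toy model. The action `rest`, the probabilities, and the rewards are invented for illustration and are not taken from the Twizzleflu files.

```python
# Hypothetical toy MDP: P[state][action] -> {next_state: probability}
P = {
    "sick": {
        "do-nothing": {"sick": 0.8, "recovered": 0.2},
        "rest": {"sick": 0.5, "recovered": 0.5},
    },
    "recovered": {
        "do-nothing": {"recovered": 1.0},
        "rest": {"recovered": 1.0},
    },
}
# R[(state, action)] -> expected reward (negative discomfort)
R = {
    ("sick", "do-nothing"): -1.0,
    ("sick", "rest"): -1.5,
    ("recovered", "do-nothing"): 0.0,
    ("recovered", "rest"): 0.0,
}


def q_value(s, a, V, gamma):
    """One-step lookahead: immediate reward plus expected value of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(gamma=1.0, tol=1e-10, max_iters=10_000):
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        V_new = {s: max(q_value(s, a, V, gamma) for a in P[s]) for s in P}
        converged = max(abs(V_new[s] - V[s]) for s in P) < tol
        V = V_new
        if converged:
            break
    # Greedy policy: in each state pick the action with the best lookahead value.
    policy = {s: max(P[s], key=lambda a: q_value(s, a, V, gamma)) for s in P}
    return V, policy


V, policy = value_iteration()
```

In this toy model `rest` costs more per step but halves the expected time spent sick, so it wins in the `sick` state: always resting gives a value of -1.5 / 0.5 = -3, against -1 / 0.2 = -5 for always doing nothing.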

In [3]:
# YOUR CHANGES HERE
# Computing the optimal treatment plan (minimizing discomfort = maximizing reward)
import pandas as pd
from collections import defaultdict

rewards = pd.read_csv("twizzleflu-rewards.tsv", sep="\t")
trans   = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")

states  = sorted(set(trans["state"]).union(trans["next_state"]))
actions = sorted(trans["action"].unique())

# Build R(s,a) and P(s'|s,a)
R = {(s, a): r for a, s, r in rewards.itertuples(index=False)}
P = defaultdict(list)
for a, s, s2, p in trans.itertuples(index=False):
    P[(s, a)].append((s2, p))

# Value iteration (maximize expected total reward; gamma = 1 means no discounting)
gamma, tol, max_iters = 1.0, 1e-10, 10000
V = {s: 0.0 for s in states}

for _ in range(max_iters):
    V_new = {
        s: max(R.get((s,a),0.0) + gamma*sum(p*V[s2] for s2,p in P[(s,a)]) for a in actions)
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < tol:
        V = V_new
        break
    V = V_new

# Greedy policy from V
best = {}
for s in states:
    qa = [(a, R.get((s,a),0.0) + gamma*sum(p*V[s2] for s2,p in P[(s,a)])) for a in actions]
    best[s] = max(qa, key=lambda t: t[1])[0]

minimum_discomfort_actions = pd.DataFrame(
    {"state": states, "action": [best[s] for s in states]}
)


Save the optimal actions for each state to a file "minimum-discomfort-actions.tsv" with columns state and action.

In [4]:
# YOUR CHANGES HERE

minimum_discomfort_actions.to_csv("minimum-discomfort-actions.tsv", sep="\t", index=False)
print("Saved minimum-discomfort-actions.tsv")


Saved minimum-discomfort-actions.tsv


Submit "minimum-discomfort-actions.tsv" in Gradescope.

## Part 3: Expected Discomfort

Using your previous optimal policy, compute the expected discomfort for each state.

In [5]:
# YOUR CHANGES HERE

# Computing expected discomfort under the optimal policy
import pandas as pd
from collections import defaultdict

# Inputs
rewards = pd.read_csv("twizzleflu-rewards.tsv", sep="\t")
trans   = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")
policy  = pd.read_csv("minimum-discomfort-actions.tsv", sep="\t")  # from Part 2

states  = sorted(set(trans["state"]).union(trans["next_state"]))

# Build R(s,a) and P(s'|s,a)
R = {(s, a): r for a, s, r in rewards.itertuples(index=False)}
P = defaultdict(list)
for a, s, s2, p in trans.itertuples(index=False):
    P[(s, a)].append((s2, p))

# Policy mapping: a*(s)
pi = dict(zip(policy["state"], policy["action"]))

# Evaluate *cost* (discomfort) with value iteration on costs
gamma, tol, max_iters = 1.0, 1e-10, 10000            # gamma = 1: no discount factor
C = {s: -R.get((s, pi[s]), 0.0) for s in states}     # per-step discomfort = -reward
V = {s: 0.0 for s in states}

for _ in range(max_iters):
    V_new = {
        s: C[s] + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])])
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < tol:
        V = V_new
        break
    V = V_new

minimum_discomfort_values = pd.DataFrame(
    {"state": states, "expected_discomfort": [V[s] for s in states]}
)


Save your results in a file "minimum-discomfort-values.tsv" with columns state and expected_discomfort.

In [6]:
# YOUR CHANGES HERE

minimum_discomfort_values.to_csv("minimum-discomfort-values.tsv", sep="\t", index=False)
print("Saved minimum-discomfort-values.tsv")

Saved minimum-discomfort-values.tsv


Submit "minimum-discomfort-values.tsv" in Gradescope.

## Part 4: Minimizing Twizzleflu Duration

Modify the Markov decision process to minimize the days until the Twizzleflu is over.
To do so, change the reward function to always be -1 if the current state corresponds to being sick and 0 if the current state corresponds to being better.
To be clear, the action does not matter for this reward function.
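
To illustrate why this reward counts days, here is a small sketch with hypothetical state and action names. It builds an action-independent reward table and then checks, for an invented per-day recovery probability of 0.25, that accumulating one unit per sick day without discounting converges to the expected duration 1 / 0.25 = 4 days.

```python
# Hypothetical names; which states count as "better" is an assumption here.
states = ["sick", "recovered"]
actions = ["do-nothing", "rest"]
better_states = {"recovered"}

# Reward depends only on the state: -1 while sick, 0 once better.
duration_reward = {
    (s, a): (0.0 if s in better_states else -1.0)
    for s in states
    for a in actions
}

# With recovery probability p each day, the expected number of sick days D
# satisfies D = 1 + (1 - p) * D, whose solution is D = 1 / p.
p_recover = 0.25
expected_days = 0.0
for _ in range(1000):
    expected_days = 1.0 + (1.0 - p_recover) * expected_days
```

The negated value of a state under this reward is therefore the expected number of sick days starting from that state, provided no discount factor is applied.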


In [7]:
# YOUR CHANGES HERE
# Building duration-focused reward function
import pandas as pd

trans = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")

states  = sorted(set(trans["state"]).union(trans["next_state"]))
actions = sorted(trans["action"].unique())

def is_sick(s):  # only 'recovered' is better
    return s != "recovered"

rows = []
for s in states:
    r = -1 if is_sick(s) else 0
    for a in actions:
        rows.append({"action": a, "state": s, "reward": float(r)})

duration_rewards = pd.DataFrame(rows, columns=["action","state","reward"])


Save your new reward function in a file "duration-rewards.tsv" in the same format as "twizzleflu-rewards.tsv".

In [8]:
# YOUR CHANGES HERE

duration_rewards.to_csv("duration-rewards.tsv", sep="\t", index=False)
print("Saved duration-rewards.tsv")


Saved duration-rewards.tsv


Submit "duration-rewards.tsv" in Gradescope.

## Part 5: Optimize for Shorter Twizzleflu

Compute an optimal policy to minimize the duration of Twizzleflu.

In [9]:
# YOUR CHANGES HERE

# Computing optimal policy for minimum duration
import pandas as pd
from collections import defaultdict

# Inputs
trans   = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")
dreward = pd.read_csv("duration-rewards.tsv",      sep="\t")

states  = sorted(set(trans["state"]).union(trans["next_state"]))
actions = sorted(trans["action"].unique())

# R(s,a) and P(s'|s,a)
R = {(s, a): r for a, s, r in dreward.itertuples(index=False)}
P = defaultdict(list)
for a, s, s2, p in trans.itertuples(index=False):
    P[(s, a)].append((s2, p))

# Value iteration (maximize reward = minimize expected duration)
gamma, tol, max_iters = 1.0, 1e-10, 10000  # gamma = 1 so that -V counts expected days
V = {s: 0.0 for s in states}

for _ in range(max_iters):
    V_new = {
        s: max(R.get((s,a),0.0) + gamma*sum(p*V[s2] for s2,p in P[(s,a)]) for a in actions)
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < tol:
        V = V_new
        break
    V = V_new

# Greedy policy
best = {}
for s in states:
    qa = [(a, R.get((s,a),0.0) + gamma*sum(p*V[s2] for s2,p in P[(s,a)])) for a in actions]
    best[s] = max(qa, key=lambda t: t[1])[0]

minimum_duration_actions = pd.DataFrame({"state": states, "action": [best[s] for s in states]})


Save the optimal actions for each state to a file "minimum-duration-actions.tsv" with columns state and action.

In [10]:
# YOUR CHANGES HERE

minimum_duration_actions.to_csv("minimum-duration-actions.tsv", sep="\t", index=False)
print("Saved minimum-duration-actions.tsv")

Saved minimum-duration-actions.tsv


Submit "minimum-duration-actions.tsv" in Gradescope.

## Part 6: Shorter Twizzleflu?

Compute the expected number of days sick for each state under the minimum-duration policy.

In [12]:
# YOUR CHANGES HERE

# Computing expected sick days under the minimum-duration policy
import pandas as pd
import numpy as np

trans  = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")
policy = pd.read_csv("minimum-duration-actions.tsv", sep="\t")

states = sorted(set(trans["state"]).union(trans["next_state"]))
pi = dict(zip(policy["state"], policy["action"]))

def is_sick(s):  # only 'recovered' is healthy
    return s != "recovered"

# Focus on transient (sick) states; set v(recovered)=0 as boundary condition
sick_states = [s for s in states if is_sick(s)]
healthy_states = [s for s in states if not is_sick(s)]  # should be ["recovered"]

idx_sick = {s:i for i,s in enumerate(sick_states)}
n = len(sick_states)

# Build P_pi restricted to sick->sick transitions
Ppi_ss = np.zeros((n, n), dtype=float)
for _, row in trans.iterrows():
    s, a, s2, p = row["state"], row["action"], row["next_state"], row["probability"]
    if s in idx_sick and a == pi[s] and s2 in idx_sick:
        Ppi_ss[idx_sick[s], idx_sick[s2]] += p

# Cost vector: 1 per day while sick
c_sick = np.ones(n, dtype=float)

# Solve (I - P_ss) v_sick = c_sick
I = np.eye(n)
try:
    v_sick = np.linalg.solve(I - Ppi_ss, c_sick)
except np.linalg.LinAlgError:
    # Numerical fallback if nearly singular
    v_sick, *_ = np.linalg.lstsq(I - Ppi_ss, c_sick, rcond=None)

# Assemble full vector (recovered = 0)
v_all = {s: 0.0 for s in states}
for s in sick_states:
    v_all[s] = float(v_sick[idx_sick[s]])

minimum_duration_days = pd.DataFrame(
    {"state": states, "expected_sick_days": [v_all[s] for s in states]}
)



Save the expected sick days for each state to a file "minimum-duration-days.tsv" with columns state and expected_sick_days.

In [13]:
# YOUR CHANGES HERE

minimum_duration_days.to_csv("minimum-duration-days.tsv", sep="\t", index=False)
print("Saved minimum-duration-days.tsv")

Saved minimum-duration-days.tsv


Submit "minimum-duration-days.tsv" in Gradescope.

## Part 7: Speed vs Pampering

Compute the expected discomfort using the policy to minimize days sick, and compare the results to the expected discomfort when optimizing to minimize discomfort.

In [14]:
# YOUR CHANGES HERE

# Computing expected discomfort under two policies (speed vs pampering)
import pandas as pd
from collections import defaultdict

# Inputs
rewards = pd.read_csv("twizzleflu-rewards.tsv", sep="\t")              # original rewards
trans   = pd.read_csv("twizzleflu-transitions.tsv", sep="\t")
pi_speed = pd.read_csv("minimum-duration-actions.tsv", sep="\t")       # Part 5
pi_comfort = pd.read_csv("minimum-discomfort-actions.tsv", sep="\t")   # Part 2

states  = sorted(set(trans["state"]).union(trans["next_state"]))

# Build R(s,a) and P(s'|s,a)
R = {(s, a): r for a, s, r in rewards.itertuples(index=False)}
P = defaultdict(list)
for a, s, s2, p in trans.itertuples(index=False):
    P[(s, a)].append((s2, p))

# Policy maps
pi_speed_map   = dict(zip(pi_speed["state"],   pi_speed["action"]))
pi_comfort_map = dict(zip(pi_comfort["state"], pi_comfort["action"]))

# Evaluate expected DISCOMFORT (negative reward) for a fixed policy using value iteration
def eval_discomfort(pi_map, gamma=1.0, tol=1e-10, max_iters=10000):  # gamma = 1: no discounting
    C = {s: -R.get((s, pi_map[s]), 0.0) for s in states}  # per-step discomfort
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        V_new = {
            s: C[s] + gamma * sum(p * V[s2] for s2, p in P[(s, pi_map[s])])
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            V = V_new
            break
        V = V_new
    return V

V_speed   = eval_discomfort(pi_speed_map)     # expected discomfort under min-duration policy
V_comfort = eval_discomfort(pi_comfort_map)   # expected discomfort under min-discomfort policy

policy_comparison = pd.DataFrame({
    "state": states,
    "speed_discomfort":   [V_speed[s] for s in states],
    "minimize_discomfort":[V_comfort[s] for s in states],
})


Save the results to a file "policy-comparison.tsv" with columns state, speed_discomfort, and minimize_discomfort.

In [15]:
# YOUR CHANGES HERE

policy_comparison.to_csv("policy-comparison.tsv", sep="\t", index=False)
print("Saved policy-comparison.tsv")

Saved policy-comparison.tsv


Submit "policy-comparison.tsv" in Gradescope.

## Part 8: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.

## Part 9: Acknowledgements

If you discussed this assignment with anyone, please acknowledge them here.
If you did this assignment completely on your own, simply write none below.

If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for. If you did not use any other libraries, simply write none below.

If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy. If you did not use any generative AI tools, simply write none below.

In [16]:
ack_text = """Acknowledgments

I used ChatGPT as a study partner to clarify and check errors and to simplify code while completing this project.


For transparency, I am including links to the transcripts of these interactions:

1. ChatGPT guidance: [https://chatgpt.com/share/68d20a8d-0870-8007-bb79-080efab54c33]
These resources document the support I received and ensure compliance with the class generative AI policy.
"""

with open("acknowledgments.txt","w") as f:
    f.write(ack_text)
