# Stanford CME 241 (Winter 2024) - Assignment 3

**Due: Jan 29 @ 11:59pm Pacific Time on Gradescope.**

Assignment instructions:
- **Please solve questions 1 and 2, and choose one of questions 3 or 4.**
- Empty code blocks are for your use. Feel free to create more under each section as needed.

Submission instructions:
- When complete, fill out your publicly available GitHub repo file URL and group members below, then export or print this .ipynb file to PDF and upload the PDF to Gradescope.

*Link to this ipynb file in your public GitHub repo (replace below URL with yours):* 

https://github.com/my-username/my-repo/assignment-file-name.ipynb

*Group members (replace below names with people in your group):* 

-Handi Zhao(hdzhao@stanford.edu);

-Sylvia Sun(ys3835@stanford.edu);

-Zhengji Yang(yangzj@stanford.edu)

## Imports

In [1]:
from dataclasses import dataclass
from rl.markov_decision_process import FiniteMarkovDecisionProcess
from rl.distribution import Categorical
from rl.dynamic_programming import policy_iteration_result, value_iteration_result
from scipy.stats import poisson
from typing import Tuple, Dict, Mapping



## Question 1
**Analytic Optimal Actions and Cost.** 
Consider a continuous-states, continuous-actions, discrete-time, non-terminating MDP with state space as $\mathbb{R}$ and action space as $\mathbb{R}$. When in state $s\in \mathbb{R}$, upon taking action $a\in \mathbb{R}$, one transitions to next state $s' \in \mathbb{R}$ according to a normal distribution $s' \sim \mathcal{N}(s, \sigma^2)$ for a fixed variance $\sigma^2 \in \mathbb{R}^+$. The corresponding cost associated with this transition is $e^{as'}$, i.e., the cost depends on the action $a$ and the state $s'$ one transitions to. The problem is to minimize the infinite-horizon **Expected Discounted-Sum of Costs** (with discount factor $\gamma < 1$). For this assignment, solve this problem just for the special case of $\gamma = 0$ (i.e., the myopic case) using elementary calculus. Derive an analytic expression for the optimal action in any state and the corresponding optimal cost.


In this problem, we define the reward function as the negative of expected transition cost, i.e. 
\begin{align}
\mathcal{R}(s, a) &= \mathbb{E}_{s'|s, a}[-e^{as'}] = -\int_\mathbb{R} e^{as'} \dfrac{1}{\sqrt{2\pi \sigma^2}} e^{-(s' - s)^2/2\sigma^2} ds'\\
&= -\int_\mathbb{R}\dfrac{1}{\sqrt{2\pi \sigma^2}}\text{exp}(-\dfrac{s'^2 - 2ss' - 2\sigma^2 as' + s^2}{2\sigma^2}) ds'\\
&= - \text{exp}(\dfrac{(s + \sigma^2 a)^2 - s^2}{2\sigma^2}) \int_\mathbb{R}\dfrac{1}{\sqrt{2\pi \sigma^2}}
e^{-(s' - s - \sigma^2 a)/2\sigma^2} ds'\\
&= - \text{exp}(\dfrac{(s + \sigma^2 a)^2 - s^2}{2\sigma^2})
\end{align}

Hence, the optimal value is determined by
\begin{align}
\mathcal{V}^*(s) &= \max_{a\in \mathbb{R}}\,\, \mathcal{R}(s, a) + \gamma\mathbb{E}_{s'|s, a}[\mathcal{V}^*(s')]\\
&= \max_{a\in \mathbb{R}}\,\, \mathcal{R}(s, a)\\
&= -\text{exp}(- s^2/2\sigma^2),
\end{align}
where $a^*(s) = -s/\sigma^2$.

So the optimal action for state $s$ is $-s/\sigma^2$ and the corresponding optimal cost for state $s$ is $\text{exp}(- s^2/2\sigma^2)$.

## Question 2
**Manual Value Iteration.** 
Consider a simple MDP with $\mathcal{S} = \{s_1, s_2, s_3\}, \mathcal{T} =\{s_3\}, \mathcal{A} = \{a_1, a_2\}$. The State Transition Probability function
$$\mathcal{P}: \mathcal{N} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$$
is defined as:
$$\mathcal{P}(s_1, a_1, s_1) = 0.2, \mathcal{P}(s_1, a_1, s_2) = 0.6, \mathcal{P}(s_1, a_1, s_3) = 0.2$$
$$\mathcal{P}(s_1, a_2, s_1) = 0.1, \mathcal{P}(s_1, a_2, s_2) = 0.2, \mathcal{P}(s_1, a_2, s_3) = 0.7$$
$$\mathcal{P}(s_2, a_1, s_1) = 0.3, \mathcal{P}(s_2, a_1, s_2) = 0.3, \mathcal{P}(s_2, a_1, s_3) = 0.4$$
$$\mathcal{P}(s_2, a_2, s_1) = 0.5, \mathcal{P}(s_2, a_2, s_2) = 0.3, \mathcal{P}(s_2, a_2, s_3) = 0.2$$
The Reward Function 
$$\mathcal{R}: \mathcal{N} \times \mathcal{A} \rightarrow \mathbb{R}$$
is defined as:
$$\mathcal{R}(s_1, a_1) = 8.0, \mathcal{R}(s_1, a_2) = 10.0$$
$$\mathcal{R}(s_2, a_1) = 1.0, \mathcal{R}(s_2, a_2) = -1.0$$
Assume discount factor $\gamma = 1$.

Your task is to determine an Optimal Deterministic Policy **by manually working out** (not with code) simply the first two iterations of Value Iteration algorithm. 

- Initialize the Value Function for each state to be it's $\max$ (over actions) reward, i.e., we initialize the Value Function to be $v_0(s_1) = 10.0, v_0(s_2) = 1.0, v_0(s_3) = 0.0$. Then manually calculate $q_k(\cdot, \cdot)$ and $v_k(\cdot)$ from $v_{k - 1}( \cdot)$ using the Value Iteration update, and then calculate the greedy policy $\pi_k(\cdot)$ from $q_k(\cdot, \cdot)$ for $k = 1$ and $k = 2$ (hence, 2 iterations).
- Now argue that $\pi_k(\cdot)$ for $k > 2$ will be the same as $\pi_2(\cdot)$. Hint: You can make the argument by examining the structure of how you get $q_k(\cdot, \cdot)$ from $v_{k-1}(\cdot)$. With this argument, there is no need to go beyond the two iterations you performed above, and so you can establish $\pi_2(\cdot)$ as an Optimal Deterministic Policy for this MDP.

### Iteration 1

Calculation of $q_1(s, a)$ for each state-action pair:

1. For $s_1$:
   - $$q_1(s_1, a_1) = 8.0 + 0.2 \cdot 10.0 + 0.6 \cdot 1.0 = 10.6$$
   - $$q_1(s_1, a_2) = 10.0 + 0.1 \cdot 10.0 + 0.2 \cdot 1.0 = 11.2$$
   - $$v_1(s_1) = \max \{ q_1(s_1, a_1), q_1(s_1, a_2) \} = 11.2$$
   - The optimal action for $s_1$ is $a_2$. So $\pi_1(s_1) = a_2$.

2. For $s_2$:
   - $$q_1(s_2, a_1) = 1.0 + 0.3 \cdot 10.0 + 0.3 \cdot 1.0 = 4.3$$
   - $$q_1(s_2, a_2) = -1.0 + 0.5 \cdot 10.0 + 0.3 \cdot 1.0 = 4.3$$
   - $$v_1(s_2) = \max \{ q_1(s_2, a_1), q_1(s_2, a_2) \} = 4.3$$
   - The optimal action for $s_2$ can be either $a_1$ or $a_2$. So $\pi_1(s_2) = a_1  or  a_2$


### Iteration 2

Updating the Value Function:

1. For $s_1$:
   - $$q_2(s_1, a_1) = 8.0 + 0.2 \cdot 11.2 + 0.6 \cdot 4.3 = 12.82$$
   - $$q_2(s_1, a_2) = 10.0 + 0.1 \cdot 11.2 + 0.2 \cdot 4.3 = 11.98$$
   - $$v_2(s_1) = \max \{ q_2(s_1, a_1), q_2(s_1, a_2) \} = 12.82$$
   - The optimal action for $s_1$ becomes $a_1$. So $\pi_2(s_1) = a_1$

2. For $s_2$:
   - $$q_2(s_2, a_1) = 1.0 + 0.3 \cdot 11.2 + 0.3 \cdot 4.3 = 5.65$$
   - $$q_2(s_2, a_2) = -1.0 + 0.5 \cdot 11.2 + 0.3 \cdot 4.3 = 5.89$$
   - $$v_2(s_2) = \max \{ q_2(s_2, a_1), q_2(s_2, a_2) \} = 5.89$$
   - The optimal action for $s_2$ becomes $a_2$. So $\pi_2(s_2) = a_2$

### Argument for Optimality Beyond Iteration 2

The structure of $q_k(s, a)$ calculations suggests a pattern:

- $$q_k(s_1, a_1) - q_k(s_1, a_2) = -2.0 + 0.1 \cdot v_{k-1}(s_1) + 0.4 \cdot v_{k-1}(s_2)$$
- $$q_k(s_2, a_2) - q_k(s_2, a_1) = -2.0 + 0.2 \cdot v_{k-1}(s_1)$$

Given $v_{k-1}(s_1) \geq 12.82$ and $v_{k-1}(s_2) \geq 5.89$ for all $k \geq 3$, it follows that:

- $q_k(s_1, a_1) - q_k(s_1, a_2) \geq -2 + 0.1 \cdot 12.82 + 0.4 /cdot 5.89 = 1.638 >0$ for all $k \geq 3$
- $q_k(s_2, a_2) - q_k(s_2, a_1) \geq -2 + 12.89*0.2 = 0.578 > 0$ for all $k \geq 3$

Therefore, the policy obtained in the second iteration, $\pi_2$, remains optimal for all subsequent iterations, establishing $\pi_2$ as an Optimal Deterministic Policy for this MDP.

## Question 3

**Job-Hopping and Wages-Utility-Maximization.** 
You are a worker who starts every day either employed or unemployed. If you start your day employed, you work on your job for the day (one of $n$ jobs, as elaborated later) and you get to earn the wage of the job for the day. However, at the end of the day, you could lose your job with probability $\alpha \in [0,1]$, in which case you start the next day unemployed. If at the end of the day, you do not lose your job (with probability $1-\alpha$), then you will start the next day with the same job (and hence, the same daily wage). On the other hand, if you start your day unemployed, then you will be randomly offered one of $n$ jobs with daily wages $w_1, w_2, \ldots w_n \in \mathbb{R}^+$ with respective job-offer probabilities $p_1, p_2, \ldots p_n \in [0,1]$ (with $\sum_{i=1}^n p_i = 1$). You can choose to either accept or decline the offered job. If you accept the job-offer, your day progresses exactly like the **employed-day** described above (earning the day's job wage and possibly (with probability $\alpha$) losing the job at the end of the day). However, if you decline the job-offer, you spend the day unemployed, receive the unemployment wage $w_0 \in \mathbb{R}^+$ for the day, and start the next day unemployed. The problem is to identify the optimal choice of accepting or rejecting any of the job-offers the worker receives, in a manner that maximizes the infinite-horizon **Expected Discounted-Sum of Wages Utility**. Assume the daily discount factor for wages (employed or unemployed) is $\gamma \in [0,1)$. Assume Wages Utility function to be $U(w) = \log(w)$ for any wage amount $w \in \mathbb{R}^+$. So you are looking to maximize
$$\mathbb{E}[\sum_{u=t}^\infty \gamma^{u-t} \cdot \log(w_{i_u})]$$
at the start of a given day $t$ ($w_{i_u}$ is the wage earned on day $u$, $0\leq i_u \leq n$ for all $u\geq t$).

- Express with clear mathematical notation the state space, action space, transition function, reward function, and write the Bellman Optimality Equation customized for this MDP.
- You can solve this Bellman Optimality Equation (hence, solve for the Optimal Value Function and the Optimal Policy) with a numerical iterative algorithm (essentially a Dynamic Programming algorithm customized to this problem). Write Python code for this numerical algorithm. Clearly define the inputs and outputs of your algorithm with their types (int, float, List, Mapping etc.). For this problem, don't use any of the MDP/DP code from the git repo, write this customized algorithm from scratch.

## Question 4

**Two-Stores Inventory Control.** 
We extend the capacity-constrained inventory example implemented in [rl/chapter3/simple_inventory_mdp_cap.py](https://github.com/TikhonJelvis/RL-book/blob/master/rl/chapter3/simple_inventory_mdp_cap.py) as a `FiniteMarkovDecisionProcess` (the Finite MDP model for the capacity-constrained inventory example is described in detail in Chapters 1 and 2 of the RLForFinanceBook). Here we assume that we have two different stores, each with their own separate capacities $C_1$ and $C_2$, their own separate Poisson probability distributions of demand (with means $\lambda_1$ and $\lambda_2$), their own separate holding costs $h_1$ and $h_2$, and their own separate stockout costs $p_1$ and $p_2$. At 6pm upon stores closing each evening, each store can choose to order inventory from a common supplier (as usual, ordered inventory will arrive at the store 36 hours later). We are also allowed to transfer inventory from one store to another, and any such transfer happens overnight, i.e., will arrive by 6am next morning (since the stores are fairly close to each other). Note that the orders are constrained such that following the orders on each evening, each store's inventory position (sum of on-hand inventory and on-order inventory) cannot exceed the store's capacity (this means the action space is constrained to be finite). Each order made to the supplier incurs a fixed transportation cost of $K_1$ (fixed-cost means the cost is the same no matter how many units of non-zero inventory a particular store orders). Moving any non-zero inventory between the two stores incurs a fixed transportation cost of $K_2$. 

Model this as a derived class of `FiniteMarkovDecisionProcess` much like we did for `SimpleInventoryMDPCap` in the code repo. Set up instances of this derived class for different choices of the problem parameters (capacities, costs etc.), and determine the Optimal Value Function and Optimal Policy by invoking the function `value_iteration` (or `policy_iteration`) from file [rl/dynamic_programming.py](https://github.com/TikhonJelvis/RL-book/blob/master/rl/dynamic_programming.py).

Analyze the obtained Optimal Policy and verify that it makes intuitive sense as a function of the problem parameters.

In [2]:
@dataclass(frozen=True)
class InventoryState:
    store1_on_hand: int  # on hand inventory before transfer between stores
    store1_on_order: int
    store2_on_hand: int
    store2_on_order: int

    def inventory_position(self, store1) -> int:
        return self.store1_on_hand + self.store1_on_order if store1 else self.store2_on_hand + self.store2_on_order


InvOrderMapping = Mapping[
    InventoryState,
    Mapping[int, Categorical[Tuple[InventoryState, float]]]
]


class SimpleInventoryMDPCap(FiniteMarkovDecisionProcess[InventoryState, int]):

    def __init__(
        self,
        capacity_1: int,
        capacity_2: int,
        poisson_lambda_1: float,
        poisson_lambda_2: float,
        holding_cost_1: float,
        holding_cost_2: float,
        stockout_cost_1: float,
        stockout_cost_2: float,
        supplier_transportation_cost: int,
        between_store_transportation_cost: int,
    ):
        self.capacity_1: int = capacity_1
        self.capacity_2: int = capacity_2
        self.poisson_lambda_1: float = poisson_lambda_1
        self.poisson_lambda_2: float = poisson_lambda_2
        self.holding_cost_1: float = holding_cost_1
        self.holding_cost_2: float = holding_cost_2
        self.stockout_cost_1: float = stockout_cost_1
        self.stockout_cost_2: float = stockout_cost_2
        self.supplier_transportation_cost = supplier_transportation_cost
        self.store_transportation_cost = between_store_transportation_cost

        self.store1_poisson_distr = poisson(poisson_lambda_1)
        self.store2_poisson_distr = poisson(poisson_lambda_2)
        super().__init__(self.get_action_transition_reward_map())

    def get_action_transition_reward_map(self) -> InvOrderMapping:
        d: Dict[InventoryState, Dict[int, Categorical[Tuple[InventoryState, float]]]] = {}

        for alpha1 in range(self.capacity_1 + 1):
            for alpha2 in range(self.capacity_2 + 1):
                for beta1 in range(self.capacity_1 + 1 - alpha1):
                    for beta2 in range(self.capacity_2 + 1 - alpha2):
                        state: InventoryState = InventoryState(alpha1, beta1, alpha2, beta2)
                        ip1: int = state.inventory_position(store1=True)
                        ip2: int = state.inventory_position(store1=False)
                        d1: Dict[int, Categorical[Tuple[InventoryState, float]]] = {}

                        for order_transfer in range(-min(alpha2, self.capacity_1 - ip1), min(alpha1, self.capacity_2 - ip2) + 1):
                            ip1: int = ip1 - order_transfer
                            ip2: int = ip2 + order_transfer
                            for order1 in range(self.capacity_1 - ip1 + 1):
                                for order2 in range(self.capacity_2 - ip2 + 1):
                                    between_store_transportation_cost: float = - self.store_transportation_cost * abs(order_transfer)
                                    base_reward: float = - self.holding_cost_1 * (alpha1 - order_transfer) - self.holding_cost_2 * alpha2 + between_store_transportation_cost if order_transfer >= 0 else - self.holding_cost_1 * alpha1 - self.holding_cost_2 * (alpha2 + order_transfer) + between_store_transportation_cost
                                    supplier_transportation_cost = - self.supplier_transportation_cost * (order1 + order2)
                                    # None out of stock
                                    sr_probs_dict: Dict[Tuple[InventoryState, float], float] = {(InventoryState(ip1 - i, ip2 - j, order1, order2), base_reward + supplier_transportation_cost):
                         self.store1_poisson_distr.pmf(i) * self.store2_poisson_distr.pmf(j) for i in range(ip1) for j in range(ip2)}
                                    # Store 2 OOS
                                    for i in range(ip1):
                                        p1 = self.store1_poisson_distr.pmf(i)
                                        p2 = 1 - self.store2_poisson_distr.cdf(ip2 - 1)
                                        probability: float = p1 * p2
                                        reward: float = base_reward + supplier_transportation_cost - self.stockout_cost_2 * (self.poisson_lambda_2 - ip2 * (1 - self.store2_poisson_distr.pmf(ip2) / p2))
                                        sr_probs_dict[(InventoryState(ip1 - i, 0, order1, order2), reward)] = probability
                                    # Store 1 OOS
                                    for j in range(ip2):
                                        p1 = 1 - self.store1_poisson_distr.cdf(ip1 - 1)
                                        p2 = self.store2_poisson_distr.pmf(j)
                                        probability: float = p1 * p2
                                        reward: float = base_reward + supplier_transportation_cost - self.stockout_cost_1 * (self.poisson_lambda_1 - ip1 * (1 - self.store1_poisson_distr.pmf(ip1) / p1))
                                        sr_probs_dict[(InventoryState(0, ip2 - j, order1, order2), reward)] = probability
                                    # Both stores OOS
                                    p1 = 1 - self.store1_poisson_distr.cdf(ip1 - 1)
                                    p2 = 1 - self.store2_poisson_distr.cdf(ip2 - 1)
                                    probability: float = p1 * p2
                                    reward: float = base_reward + supplier_transportation_cost - self.stockout_cost_1 * (self.poisson_lambda_1 - ip1 * (1 - self.store1_poisson_distr.pmf(ip1) / p1)) - self.stockout_cost_2 * (self.poisson_lambda_2 - ip2 * (1 - self.store2_poisson_distr.pmf(ip2) / p2))
                                    sr_probs_dict[(InventoryState(0, 0, order1, order2), reward)] = probability

                                    d1[(order1, order2, order_transfer)] = Categorical(sr_probs_dict)
                        d[state] = d1
        return d

In [8]:
from pprint import pprint

user_capacity = 5
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_supplier_transportation_cost = 2
user_between_store_transportation_cost = 2

user_gamma = 0.9

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity_1=user_capacity,
        capacity_2=user_capacity,
        poisson_lambda_1=user_poisson_lambda,
        poisson_lambda_2=user_poisson_lambda,
        holding_cost_1=user_holding_cost,
        holding_cost_2=user_holding_cost,
        stockout_cost_1=user_stockout_cost,
        stockout_cost_2=user_stockout_cost,
        supplier_transportation_cost=user_supplier_transportation_cost,
        between_store_transportation_cost=user_between_store_transportation_cost,
    )

print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
print()

MDP Value Iteration Optimal Value Function and Optimal Policy
--------------
{NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=4, store2_on_hand=3, store2_on_order=2)): -6.081715060022783,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=1, store2_on_hand=4, store2_on_order=0)): -35.26938818183663,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=1, store2_on_hand=4, store2_on_order=1)): -28.0358613140987,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=2, store2_on_hand=4, store2_on_order=0)): -27.035861314098696,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=2, store2_on_hand=4, store2_on_order=1)): -20.681737794365898,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=0)): -32.0,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=5, store2_on_hand=3, store2_on_order=0)): -10.108537118482502,
 NonTerminal(state=Inven

In [9]:
user_capacity = 5
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_supplier_transportation_cost = 2
user_between_store_transportation_cost = 10

user_gamma = 0.9

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity_1=user_capacity,
        capacity_2=user_capacity,
        poisson_lambda_1=user_poisson_lambda,
        poisson_lambda_2=user_poisson_lambda,
        holding_cost_1=user_holding_cost,
        holding_cost_2=user_holding_cost,
        stockout_cost_1=user_stockout_cost,
        stockout_cost_2=user_stockout_cost,
        supplier_transportation_cost=user_supplier_transportation_cost,
        between_store_transportation_cost=user_between_store_transportation_cost,
    )

print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
print()

MDP Value Iteration Optimal Value Function and Optimal Policy
--------------
{NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=1)): -25.678794411714428,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=2)): -23.03638323514327,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=3)): -22.233369264429328,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=4)): -22.04348769566779,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=0, store2_on_hand=0, store2_on_order=5)): -31.926305221587896,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=1, store2_on_hand=0, store2_on_order=0)): -25.678794411714428,
 NonTerminal(state=InventoryState(store1_on_hand=0, store1_on_order=1, store2_on_hand=0, store2_on_order=1)): -19.35758882342885,
 NonTermi

In [10]:
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 1.0
user_stockout_cost = 10.0
user_supplier_transportation_cost = 5
user_between_store_transportation_cost = 2

user_gamma = 0.9

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity_1=user_capacity,
        capacity_2=user_capacity,
        poisson_lambda_1=user_poisson_lambda,
        poisson_lambda_2=user_poisson_lambda,
        holding_cost_1=user_holding_cost,
        holding_cost_2=user_holding_cost,
        stockout_cost_1=user_stockout_cost,
        stockout_cost_2=user_stockout_cost,
        supplier_transportation_cost=user_supplier_transportation_cost,
        between_store_transportation_cost=user_between_store_transportation_cost,
    )

print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
print()

MDP Value Iteration Optimal Value Function and Optimal Policy
--------------
{NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=2, store2_on_order=0)): -25.314779580545142,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=1, store2_on_order=1)): -24.314779580545142,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=1, store2_on_order=0)): -34.78444290780128,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=0, store2_on_order=2)): -23.31477958054515,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=0, store2_on_order=1)): -33.78444290780127,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=0, store2_on_order=0)): -39.895420509122715,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=1, store2_on_hand=2, store2_on_order=0)): -24.314779580545142,
 NonTermi

In [11]:
user_capacity = 2
user_poisson_lambda = 1.0
user_holding_cost = 10.0
user_stockout_cost = 1.0
user_supplier_transportation_cost = 5
user_between_store_transportation_cost = 2

user_gamma = 0.9

si_mdp: FiniteMarkovDecisionProcess[InventoryState, int] =\
    SimpleInventoryMDPCap(
        capacity_1=user_capacity,
        capacity_2=user_capacity,
        poisson_lambda_1=user_poisson_lambda,
        poisson_lambda_2=user_poisson_lambda,
        holding_cost_1=user_holding_cost,
        holding_cost_2=user_holding_cost,
        stockout_cost_1=user_stockout_cost,
        stockout_cost_2=user_stockout_cost,
        supplier_transportation_cost=user_supplier_transportation_cost,
        between_store_transportation_cost=user_between_store_transportation_cost,
    )

print("MDP Value Iteration Optimal Value Function and Optimal Policy")
print("--------------")
opt_vf_vi, opt_policy_vi = value_iteration_result(si_mdp, gamma=user_gamma)
pprint(opt_vf_vi)
print(opt_policy_vi)
print()

MDP Value Iteration Optimal Value Function and Optimal Policy
--------------
{NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=0, store2_on_hand=0, store2_on_order=0)): -18.367879441171443,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=1, store2_on_hand=0, store2_on_order=2)): -20.834529414704445,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=1, store2_on_hand=0, store2_on_order=1)): -16.610066130330264,
 NonTerminal(state=InventoryState(store1_on_hand=2, store1_on_order=0, store2_on_hand=0, store2_on_order=0)): -28.354757330803537,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=1, store2_on_hand=2, store2_on_order=0)): -40.83452941470445,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=1, store2_on_hand=1, store2_on_order=0)): -26.610066130330264,
 NonTerminal(state=InventoryState(store1_on_hand=1, store1_on_order=0, store2_on_hand=1, store2_on_order=1)): -28.473313734199206,
 NonTer

> The obtained optimal policy makes intuitive sense. For instance, when $K_2$ is substantially higher than $K_1$, both stores tend to order from supplier rather than between store transfering as it doesn't worth the cost even if transfering between store is quicker. When store capacities are small, between stores transfering is also less likely to happen as the probability of having leftover inventory is comparably small. Also, when holding cost is higher than stockout cost, between store transfering is more likely to happen so as to make sure that less inventory is on hold. On the other hand, when stockout cost is higher than holding cost, between store transfering is less likely to happen so as to make sure that there's always enough inventory in stock.