In [1]:
import numpy as np
import scipy.linalg

# Question - 1

### Part - a
The reduced state set is
$$\{S,1,7,3,8,5,6,W\}$$
Squares that behave identically are merged into a single state, with 7, 3 and 8 kept as the representatives:
$$2 \equiv 7, \quad 3 \equiv 9, \quad 4 \equiv 8$$

The state transition matrix is

$$
P = \begin{bmatrix}
                  0 & 0.25 & 0.25 & 0.25 & 0.25 & 0 & 0 & 0 \\
                  0 & 0 & 0.25 & 0.25 & 0.25 & 0.25 & 0 & 0 \\
                  0 & 0 & 0.25 & 0.25 & 0.25 & 0 & 0 & 0.25 \\
                  0 & 0 & 0.25 & 0 & 0.25 & 0.25 & 0.25 & 0 \\
                  0 & 0 & 0 & 0.25 & 0.5 & 0 & 0 & 0.25 \\
                  0 & 0 & 0.25 & 0.25 & 0.25 & 0 & 0.25 & 0 \\
                  0 & 0 & 0.25 & 0.25 & 0.25 & 0 & 0 & 0.25 \\
                  0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
    \end{bmatrix}
$$
The rows (current state) and columns (next state) of $P$ follow the state order listed above.
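
As a quick sanity check, a valid transition matrix must have non-negative entries, each row must sum to 1, and the winning state should map to itself. The sketch below assumes the state order given above; the names `states`, `is_stochastic` and `is_absorbing` are introduced here only for illustration.

```python
import numpy as np

# State order assumed from the text: S, 1, 7, 3, 8, 5, 6, W
states = ["S", "1", "7", "3", "8", "5", "6", "W"]

P = np.array([
    [0, 0.25, 0.25, 0.25, 0.25, 0,    0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0.25, 0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0.25, 0,    0.25, 0.25, 0.25, 0],
    [0, 0,    0,    0.25, 0.5,  0,    0,    0.25],
    [0, 0,    0.25, 0.25, 0.25, 0,    0.25, 0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0,    0,    0,    0,    0,    1],
])

# Every row is a probability distribution over next states
is_stochastic = bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0))

# W is absorbing: once reached, the chain stays there
w = states.index("W")
is_absorbing = bool(P[w, w] == 1.0)

print(f"row-stochastic: {is_stochastic}, W absorbing: {is_absorbing}")
```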

### Part - b
The reward function is
$$
R(s) = \begin{cases}
                        0 & \text{if } s = W \\
                        -1 & \text{otherwise}
       \end{cases}
$$
In vector form, using the same state order as $P$:
$$
R = \begin{pmatrix}
                -1 \\
                -1 \\
                -1 \\
                -1 \\
                -1 \\
                -1 \\
                -1 \\
                0
    \end{pmatrix}
$$
Ideally the discount factor is $\gamma = 1$. However, $W$ is absorbing, so the last row of $I - P$ is zero and the matrix is singular. We therefore take $\gamma$ very close to 1 instead of exactly 1, for example $\gamma = 1 - 10^{-6}$.

*Why?* Each throw from a non-winning state earns a reward of $-1$ and the winning state earns 0, so without discounting the value of a state is the negative of the expected number of throws needed to reach $W$ from it. The discount factor should be 1 for the same reason: every throw has to count equally.
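
The singularity can be checked numerically. The sketch below also shows one alternative that keeps $\gamma = 1$ exactly: drop the absorbing state $W$ and solve on the transient states only. This is a standard absorbing-chain technique, not the approach used in this notebook, and the names `Q` and `expected_throws` are introduced here for illustration.

```python
import numpy as np

# Transition matrix, state order: S, 1, 7, 3, 8, 5, 6, W
P = np.array([
    [0, 0.25, 0.25, 0.25, 0.25, 0,    0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0.25, 0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0.25, 0,    0.25, 0.25, 0.25, 0],
    [0, 0,    0,    0.25, 0.5,  0,    0,    0.25],
    [0, 0,    0.25, 0.25, 0.25, 0,    0.25, 0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0,    0,    0,    0,    0,    1],
])

# I - P has a zero row for W, so its rank is below 8 and it cannot be inverted
rank = np.linalg.matrix_rank(np.eye(8) - P)
print(f"rank of I - P: {rank} (matrix size is 8)")

# Alternative with gamma = 1: restrict to the 7 transient states.
# Q holds the transitions among non-winning states only.
Q = P[:-1, :-1]

# Expected throws t satisfy t = 1 + Q t, i.e. (I - Q) t = 1
expected_throws = np.linalg.solve(np.eye(7) - Q, np.ones(7))
print(f"expected throws from S: {expected_throws[0]:.4f}")
```

With this restriction there is no need to perturb $\gamma$, at the cost of handling $W$ separately.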

Now we can use the matrix form derived from the Bellman equation to solve for the value vector

$$ V = \left( I - \gamma P \right)^{-1}R $$
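
For reference, the matrix form follows from the Bellman equation of a Markov reward process in a few lines. This is the standard derivation, written with the symbols used above:

$$
\begin{aligned}
V &= R + \gamma P V \\
\left( I - \gamma P \right) V &= R \\
V &= \left( I - \gamma P \right)^{-1} R
\end{aligned}
$$

The inverse in the last step exists whenever $\gamma < 1$, because the eigenvalues of a stochastic matrix $P$ have modulus at most 1.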

The calculation is in the cell below.

The expected number of die throws from the starting state is $-V(S)$, which should come out slightly above 7.08.

In [18]:
# Discount factor: slightly below 1 so that I - gamma * P is invertible
gamma = 1 - 1e-6

# Reward vector: -1 per throw, 0 in the winning state W (last entry)
R = -np.ones(8)
R[-1] = 0

# Transition matrix, state order: S, 1, 7, 3, 8, 5, 6, W
P = np.array(
            [[0 , 0.25 , 0.25 , 0.25 , 0.25 , 0 , 0 , 0],
             [0 , 0 , 0.25 , 0.25 , 0.25 , 0.25 , 0 , 0],
             [0 , 0 , 0.25 , 0.25 , 0.25 , 0 , 0 , 0.25],
             [0 , 0 , 0.25 , 0 , 0.25 , 0.25 , 0.25 , 0],
             [0 , 0 , 0 , 0.25 , 0.5 , 0 , 0 , 0.25],
             [0 , 0 , 0.25 , 0.25 , 0.25 , 0 , 0.25 , 0],
             [0 , 0 , 0.25 , 0.25 , 0.25 , 0 , 0 , 0.25],
             [0 , 0 , 0 , 0 , 0 , 0 , 0 , 1]]
)

def ValueVector(P: np.ndarray, R: np.ndarray, gamma: float) -> np.ndarray:
    """
    Given the transition matrix, reward vector and discount factor, this
    function calculates the value vector V = (I - gamma * P)^{-1} R

    Args:
    -----
        P (np.ndarray): transition matrix, shape (n, n)
        R (np.ndarray): reward vector, shape (n,)
        gamma (float): the discount factor; must be below 1 here because
            I - P is singular

    Returns:
    --------
        np.ndarray: the value vector, shape (n,)
    """
    # The identity matrix
    I = np.eye(P.shape[0])
    # Solve (I - gamma * P) V = R directly instead of forming the inverse
    return scipy.linalg.solve(I - gamma * P, R)

V = ValueVector(P, R, gamma)
print(f'The Value vector is {V}')
print(f'The Value vector is {V}')

The Value vector is [-7.08329803 -6.99996533 -5.33330844 -6.66663422 -5.33330844 -6.66663422
 -5.33330844  0.        ]
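
The result can be cross-checked independently by simulating the chain and averaging the number of throws per game. The sketch below is a Monte Carlo estimate, not part of the original solution; the helper `throws_until_win`, the seed and the number of games are arbitrary choices made for illustration.

```python
import numpy as np

# Transition matrix, state order: S, 1, 7, 3, 8, 5, 6, W
P = np.array([
    [0, 0.25, 0.25, 0.25, 0.25, 0,    0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0.25, 0,    0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0.25, 0,    0.25, 0.25, 0.25, 0],
    [0, 0,    0,    0.25, 0.5,  0,    0,    0.25],
    [0, 0,    0.25, 0.25, 0.25, 0,    0.25, 0],
    [0, 0,    0.25, 0.25, 0.25, 0,    0,    0.25],
    [0, 0,    0,    0,    0,    0,    0,    1],
])

def throws_until_win(P, rng, start=0, goal=7):
    """Play one game from `start` and count the throws until `goal` is reached."""
    state, throws = start, 0
    while state != goal:
        # Sample the next state from the row of the current state
        state = rng.choice(len(P), p=P[state])
        throws += 1
    return throws

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_games = 5000                  # arbitrary sample size

mean_throws = np.mean([throws_until_win(P, rng) for _ in range(n_games)])
print(f"Simulated mean number of throws from S: {mean_throws:.2f}")
```

The simulated mean should land close to $-V(S)$ from the value vector above, up to sampling noise.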


# Question-2