## Policy Iteration

Policy iteration is a classic dynamic programming algorithm for searching over the policy space of a Markov Decision Process (MDP).
It builds on a theorem which states that, given any stationary policy $π$, we can find a deterministic stationary policy that is no worse than the existing policy. Policy iteration repeatedly improves an existing policy until it converges to a *global optimum*, which is valuable since many reinforcement learning algorithms only converge to a local optimum.
However, policy iteration is rarely used in practice, since it requires *full knowledge of all states and transition dynamics*, which are usually not given to us.
Nevertheless, it is a classic algorithm that is covered in most introductory texts.

### Characteristics of Policy Iteration:

##### Model based
Policy iteration is model-based, i.e. it requires full knowledge of all states, as well as the transition dynamics.

##### Convergence
There exists a theorem (the policy improvement theorem) which states that given any stationary policy $π$, we can find a deterministic stationary policy that is no worse than the existing policy. This is done in two steps:
 - policy evaluation - applying the Bellman expectation backup operator to the state values of the existing policy
 - policy improvement - choosing the actions that lead to the highest state values

Repeatedly alternating between the policy evaluation step and the policy improvement step guarantees the convergence of the algorithm to a *global optimum* for both finite and infinite horizon settings.
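
For reference, the improvement guarantee can be sketched in symbols. This is the standard form of the policy improvement theorem, written with the notation of the policy evaluation formula further down ($\alpha$ is the discount factor). $Q^{π}$ and $V^{π}$, the action value and state value under policy $π$, are symbols introduced here only for illustration:

$Q^{π}(S, A) = R(S, A) + \alpha \sum_{q} P(S_q | S, A) \cdot V^{π}(S_q)$

$π'(S) = \arg\max_{A} Q^{π}(S, A) \;\Rightarrow\; V^{π'}(S) \geq V^{π}(S) \text{ for all } S$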

##### Bellman expectation backup operator
The Bellman expectation backup operator is used during policy evaluation to calculate the expected future sum of rewards for a given state.

##### Discount factor
The discount factor must take a value less than 1. In our case: `self.discount_factor = 0.9`
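
As a rough illustration of why a value below 1 matters, the discounted sum of a constant reward stays bounded. The sketch below assumes a reward of 1 at every step; `discounted_return` is a hypothetical helper and not part of the implementation in this notebook.

```python
def discounted_return(rewards, discount_factor):
    """Sum of rewards, each weighted by discount_factor ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (discount_factor ** t) * reward
    return total


# a constant reward of 1 over many steps approaches 1 / (1 - 0.9) = 10
approx = discounted_return([1.0] * 1000, 0.9)
```

With a discount factor of exactly 1 the same sum would grow without bound as the number of steps increases.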

##### Initialization
For policy iteration we keep track of both state value functions and policies in a table.
 - State values are initialized to 0 for all states.
 - The policy is initialized to a uniform distribution over actions for every state.


In [6]:
import random
from environment import GraphicDisplay, Env


class PolicyIteration:
    def __init__(self, env):
        self.env = env
        # 2D list for the value function
        self.value_table = [[0.0] * env.width for _ in range(env.height)]
        # uniform random policy (same probability of up, down, left, right);
        # every cell gets its own list so that no two states share one object
        self.policy_table = [[[0.25, 0.25, 0.25, 0.25]
                              for _ in range(env.width)]
                             for _ in range(env.height)]
        # setting terminal state
        self.policy_table[2][2] = []
        self.discount_factor = 0.9

### Policy Evaluation

Policy evaluation **maps states** to **state values**.
Given a state $S$, it calculates the expected future sum of rewards for that state via the following formula:

$V'(S) = \sum_{k=1}^n π(S, A_k) \cdot [R(S, A_k) + \alpha \cdot \sum_{q=1}^m P(S_q | S, A_k) \cdot V(S_{q})]$

where:
 - $S$ - current state
 - $S_{q}$ - possible next state when action $A_k$ is taken in state $S$
 - $π(S, A_k)$ - probability of taking action $A_k$ in state $S$ according to policy $π$
 - $R(S, A_k)$ - reward from the environment after taking action $A_k$ in state $S$
 - $\alpha$ - discount factor
 - $P(S_q | S, A_k)$ - transition probability to state $S_q$ when action $A_k$ is taken in state $S$
 - $V'(S)$ - updated value function given state $S$
 - $V(S_{q})$ - current value function given next state

In our example, however, the environment is deterministic rather than stochastic: each action leads to exactly one next state, so our implementation omits the transition probabilities $P(S_q | S, A_k)$.
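
To make the formula concrete for the deterministic case, here is a minimal sketch of one backup for a single state. The rewards and next-state values are made-up numbers for illustration, and `expected_backup` is a hypothetical helper rather than a method of the class used in this notebook.

```python
def expected_backup(policy, rewards, next_values, discount_factor):
    """One Bellman expectation backup for a single state.

    policy[k]      - probability of taking action k
    rewards[k]     - reward for taking action k
    next_values[k] - current value of the state that action k leads to
    """
    value = 0.0
    for prob, reward, next_value in zip(policy, rewards, next_values):
        value += prob * (reward + discount_factor * next_value)
    return value


# uniform policy over four actions; only the last action earns a reward
new_value = expected_backup(
    policy=[0.25, 0.25, 0.25, 0.25],
    rewards=[0.0, 0.0, 0.0, 1.0],
    next_values=[0.0, 0.0, 0.0, 0.0],
    discount_factor=0.9,
)
# new_value is 0.25 * (1.0 + 0.9 * 0.0) = 0.25
```

Because every action has a single successor state, the inner sum over $S_q$ collapses to one term per action.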

In [7]:
class PolicyIteration(PolicyIteration):
    def policy_evaluation(self):
        next_value_table = [[0.00] * self.env.width
                                    for _ in range(self.env.height)]

        # Bellman expectation equation for every state
        for state in self.env.get_all_states():
            value = 0.0
            # keep the value function of terminal states as 0
            if state == [2, 2]:
                next_value_table[state[0]][state[1]] = value
                continue

            for action in self.env.possible_actions:
                next_state = self.env.state_after_action(state, action)
                reward = self.env.get_reward(state, action)
                next_value = self.get_value(next_state)
                value += (self.get_policy(state)[action] *
                          (reward + self.discount_factor * next_value))

            next_value_table[state[0]][state[1]] = round(value, 2)

        self.value_table = next_value_table

##### Multiple global optima

Notice that in policy evaluation, the state value $V(S)$ is calculated over all possible actions in that state, each weighted by the probability that the policy assigns to it.
As we are going to see below, the improved policy is a uniform distribution over the actions that attain the maximal expected future sum of rewards, i.e. the reward plus the discounted state value of the next state.
This means that if several optimal solutions exist, the policy evaluation step carries all of them forward, i.e. alternative solutions that are also global optima do not get lost.

### Policy improvement

Policy improvement is the step that comes after policy evaluation. It has the following key aspects:
 - It uses the state values to **extract** the best actions from them and update the policy.
 - As mentioned above, we allow for multiple optimal actions that lead to multiple global optima.
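
The tie handling can be sketched in isolation. The snippet below assumes the action values of one state have already been computed; `greedy_policy` is a hypothetical helper and the numbers are made up.

```python
def greedy_policy(action_values):
    """Spread probability uniformly over all actions with the maximal value."""
    best = max(action_values)
    best_indices = [i for i, v in enumerate(action_values) if v == best]
    prob = 1.0 / len(best_indices)
    return [prob if i in best_indices else 0.0
            for i in range(len(action_values))]


# actions 1 and 3 tie for the best value, so each gets probability 0.5
policy = greedy_policy([0.0, 0.9, -1.0, 0.9])
```

Ties are detected with exact equality here, which is reasonable for values rounded to two decimals; unrounded floats may call for a small tolerance instead.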

In [8]:
class PolicyIteration(PolicyIteration):
    def policy_improvement(self):
        next_policy = self.policy_table
        for state in self.env.get_all_states():
            if state == [2, 2]:
                continue
            value = -99999
            max_index = []
            result = [0.0, 0.0, 0.0, 0.0]  # initialize the policy

            # for each action, calculate: V(S) = reward + (discount factor) * (next state value function)
            for index, action in enumerate(self.env.possible_actions):
                next_state = self.env.state_after_action(state, action)
                reward = self.env.get_reward(state, action)
                next_value = self.get_value(next_state)
                temp = reward + self.discount_factor * next_value

                # Here we allow multiple actions with same max values, in order to find many global optima
                if temp == value:
                    max_index.append(index)
                elif temp > value:
                    value = temp
                    max_index.clear()
                    max_index.append(index)

            # probability of action
            prob = 1 / len(max_index)

            for index in max_index:
                result[index] = prob

            next_policy[state[0]][state[1]] = result

        self.policy_table = next_policy

Alternating between one policy evaluation step and one policy improvement step is guaranteed to converge, and hence to terminate.
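
The alternation can be sketched end to end on a toy problem. Everything below is an assumption made for illustration and is independent of the grid world environment: a four-state chain in which moving into the rightmost (terminal) state yields a reward of 1. The stopping rule (policy unchanged and values changed by less than a small tolerance) is also our own choice.

```python
N_STATES = 4          # states 0..3; state 3 is terminal (assumed toy setup)
ACTIONS = [-1, +1]    # move left or move right
DISCOUNT = 0.9


def step(state, action):
    """Deterministic transition; bumping into the left wall keeps the state."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def evaluate(values, policy):
    """One sweep of the Bellman expectation backup over non-terminal states."""
    new_values = [0.0] * N_STATES          # the terminal value stays 0
    for state in range(N_STATES - 1):
        for k, action in enumerate(ACTIONS):
            next_state, reward = step(state, action)
            new_values[state] += policy[state][k] * (
                reward + DISCOUNT * values[next_state])
    return new_values


def improve(values):
    """Greedy policy per non-terminal state, uniform over tied best actions."""
    policy = []
    for state in range(N_STATES - 1):
        q = []
        for action in ACTIONS:
            next_state, reward = step(state, action)
            q.append(reward + DISCOUNT * values[next_state])
        best = max(q)
        ties = [k for k, v in enumerate(q) if v == best]
        policy.append([1.0 / len(ties) if k in ties else 0.0
                       for k in range(len(ACTIONS))])
    return policy


values = [0.0] * N_STATES
policy = [[0.5, 0.5] for _ in range(N_STATES - 1)]
for _ in range(1000):                      # generous cap on iterations
    new_values = evaluate(values, policy)
    new_policy = improve(new_values)
    delta = max(abs(a - b) for a, b in zip(new_values, values))
    stable = new_policy == policy
    values, policy = new_values, new_policy
    if stable and delta < 1e-9:
        break
```

On this chain the loop ends with every non-terminal state moving right, and the state values decay by a factor of 0.9 per step of distance from the goal.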

### Other functions

##### Helper functions

In [9]:
class PolicyIteration(PolicyIteration):
    # get action according to the current policy
    def get_action(self, state):
        random_pick = random.randrange(100) / 100

        policy = self.get_policy(state)
        policy_sum = 0.0
        # return the index of the first action whose cumulative
        # probability exceeds the random pick
        for index, value in enumerate(policy):
            policy_sum += value
            if random_pick < policy_sum:
                return index

    # get policy of specific state
    def get_policy(self, state):
        if state == [2, 2]:
            return 0.0
        return self.policy_table[state[0]][state[1]]

    def get_value(self, state):
        return round(self.value_table[state[0]][state[1]], 2)
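
Sampling an action from a probability list via cumulative sums can also be tried out on its own. The sketch below is a standalone variant with a made-up policy; `sample_action` is a hypothetical name, and the final `return` is an added guard against floating-point round-off.

```python
import random


def sample_action(policy):
    """Pick an index with probability policy[index] via cumulative sums."""
    pick = random.random()         # uniform float in [0, 1)
    cumulative = 0.0
    for index, prob in enumerate(policy):
        cumulative += prob
        if pick < cumulative:
            return index
    return len(policy) - 1         # guard against floating-point round-off


random.seed(0)
counts = [0, 0, 0, 0]
for _ in range(10000):
    counts[sample_action([0.0, 0.5, 0.0, 0.5])] += 1
```

Actions with probability 0 are never selected, and the two remaining actions are each drawn about half of the time. The standard library's `random.choices` with its `weights` argument offers the same kind of weighted sampling.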

##### Main method

In [10]:
if __name__ == "__main__":
    env = Env()
    policy_iteration = PolicyIteration(env)
    grid_world = GraphicDisplay(policy_iteration)
    grid_world.mainloop()

## Results

<h3 style="text-align:center">Initially</h3>
<img src="ipynb_results/initial.png" alt="initial.png" width="50%" />

<h3 style="text-align:center">Midway</h3>
<img src="ipynb_results/midway.png" alt="midway.png" width="50%" />

<h3 style="text-align:center">Converged</h3>
<img src="ipynb_results/converged_wa.png" alt="converged_wa.png" width="50%" />

<h3 style="text-align:center">Final</h3>
<img src="ipynb_results/final.png" alt="final.png" width="50%" />