## Monte Carlo on policy evaluation

Monte Carlo on policy evaluation is a model free policy evaluation algorithm built on the Monte Carlo method: it estimates the value of a state by averaging the returns observed in sampled episodes.
It is usually the first model free algorithm studied in reinforcement learning.
Model free algorithms are the ones that do not need full knowledge of all states and transition dynamics.
This lets Monte Carlo on policy evaluation be applied to a wide range of real-world scenarios in which no model is available.
It is also agnostic to the Markov Decision Process (MDP) setting, i.e. it can be applied to reinforcement learning problems that do not follow the MDP setting.
Given enough sampled episodes, its estimates converge to the true value function of the policy being evaluated.
Monte Carlo on policy evaluation can be implemented in three versions, which differ in how they handle multiple visits to the same state within an episodic (terminating) history: first visit MC, every visit MC and incremental MC.

### Characteristics of Monte Carlo on policy evaluation

##### Model free
Monte Carlo methods are model free, i.e. they do not require full knowledge of all states or transition dynamics.

##### On policy / Off policy
On policy methods attempt to evaluate or improve the policy that is used to make decisions.
Off policy methods evaluate or improve a policy different from that used to generate the data.

The Monte Carlo methods we will see here are on policy.
Nevertheless there are also off policy versions of Monte Carlo which we will not show here.

##### Convergence
Monte Carlo policy evaluation converges to the *true* value function $V^{\pi}$ of the evaluated policy due to the law of large numbers.

##### Sample based
Monte Carlo methods are sample based.
They estimate values from many sampled episodes (trajectories), which frees us from needing a model.
As there is no bootstrapping and the return of a state can only be calculated once the episode has ended, one sample in the case of Monte Carlo methods is the full episode.
This means that the state values are only updated after the current episode has been completely sampled.

##### Unbiased estimation of state values
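Because each observed return is a sample from the true distribution of returns under the policy, averaging them gives an unbiased estimator of the state value.
Strictly speaking this holds for first visit Monte Carlo; every visit Monte Carlo is biased for a finite number of episodes, but its bias vanishes as the number of episodes grows.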

##### Finite Horizon
Monte Carlo methods can only be used in a finite horizon setting, i.e. with episodic (terminating) domains only.
This follows from the fact that the Monte Carlo update of the state value function only happens at the end of each episode, once the full return is known.

##### Epsilon greedy policy
Monte Carlo methods are best used with epsilon greedy policies.
We set an epsilon value for the policy that decays with each episode, denoting the probability that the next action will be random.
This way, we allow our agent to explore more in the beginning, where epsilon is near 1, and to exploit what it has learned during the later episodes, where epsilon is near 0.

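In our case, epsilon is recomputed from the following rule:
$ \epsilon = 1 / ( C_{\epsilon} * F_{\epsilon})$, where $ C_{\epsilon} $ is a counter that starts at $1$ and increments after each episode has ended, whereas $ F_{\epsilon} $ is a constant factor ($0.2$ in the code of this notebook).
With these constants the formula gives $ \epsilon \geq 1 $ for the first five episodes, so the agent acts fully at random at the start.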
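
To make the schedule concrete, the sketch below evaluates the formula for a few values of the episode counter, using the constants of this notebook. The helper name `epsilon_for` and the clamp to 1 are additions made here for illustration; they are not part of the agent's code.

```python
def epsilon_for(counter, factor=0.2):
    """Exploration probability 1 / (counter * factor), clamped to at most 1."""
    return min(1.0, 1.0 / (counter * factor))

# the counter starts at 1 and grows by one per finished episode
schedule = [round(epsilon_for(c), 2) for c in (1, 5, 10, 50, 100)]
print(schedule)  # [1.0, 1.0, 0.5, 0.1, 0.05]
```

The clamp only affects readability: a raw value above 1 behaves exactly like 1, because a uniform random number in $[0, 1)$ is always below it.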

##### Markov Decision Process agnostic
Monte Carlo methods can be applied in non-MDP settings, i.e. they are MDP agnostic.

##### Discount factor
The discount factor takes a value in $[0, 1]$ (a value of exactly $1$ is admissible here only because every episode terminates), and in our case: `self.discount_factor = 0.9`

##### Initialization
For Monte Carlo on policy evaluation we keep track of the following:
 - state values, initially set to $0$
 - for internal calculations, the total return accumulated for each state as well as the number of times that state was visited
 - the `samples` array, which contains the latest sampled episode. It is initially empty and is cleared after each episode.

In [1]:
import numpy as np
import random
from collections import defaultdict
from environment import Env

# Monte Carlo Agent which learns every episode from the sample
class MCAgent:
    def __init__(self, actions):
        self.width = 5
        self.height = 5
        self.actions = actions
        self.discount_factor = 0.9
        self.decaying_epsilon_counter = 1
        self.decaying_epsilon_mul_factor = 0.2
        self.epsilon = None
        self.samples = []
        self.value_table = defaultdict(VisitState)

# class containing information for visited states
class VisitState:
    def __init__(self, total_G = 0, N = 0, V = 0):
        self.total_G = total_G
        self.N = N
        self.V = V

### Monte Carlo on policy evaluation

Monte Carlo methods sample an episode *first* and only after that do they update the V value function.
The class `MCAgent` is a parent class for the three versions of Monte Carlo on policy evaluation: first visit Monte Carlo, every visit Monte Carlo and incremental Monte Carlo.
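
The "sample first, update afterwards" pattern can be sketched independently of the grid world. The example below is a minimal illustration under stated assumptions: the corridor environment, the random policy and the names `sample_episode` and `evaluate` are hypothetical and do not come from `environment.Env` or `MCAgent`. As in this notebook, each sample stores the state that was entered together with the reward received on entering it.

```python
import random
from collections import defaultdict

GAMMA = 0.9  # same discount factor as in this notebook

def sample_episode(rng, n_states=5, start=0):
    """Random walk on a corridor; reaching the right end pays +1 and terminates."""
    state, episode = start, []
    while state != n_states - 1:
        state = max(0, state + rng.choice([-1, 1]))  # the left wall blocks the move
        reward = 1.0 if state == n_states - 1 else 0.0
        episode.append((state, reward))  # (state entered, reward received)
    return episode

def evaluate(num_episodes=2000, seed=0):
    rng = random.Random(seed)
    total_G, visits = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode(rng)            # 1) sample a full episode first
        G = 0.0
        for state, reward in reversed(episode):  # 2) only then update the averages
            G = reward + GAMMA * G
            total_G[state] += G
            visits[state] += 1
    return {s: total_G[s] / visits[s] for s in sorted(visits)}

values = evaluate()
print({s: round(v, 2) for s, v in values.items()})
```

The estimated values grow towards the rewarding end of the corridor, and the terminal state gets exactly 1 because its return is always the final reward.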

##### Calculating the discounted returns

At the end of an episode, we start by calculating the discounted returns for each visited state.
We implement the method `preprocess_visited_states()` that calculates the discounted future sum of rewards $G_t$ for each state.
Notice that the calculation of $G_t$ for each visited state is a common process for any version of Monte Carlo methods.
The episode is traversed in reverse because this simplifies the calculation: each return follows from the one after it, $G_t = r_t + \gamma G_{t+1}$, so we never need to compute high powers of the discount factor.
In the end the method returns the states and their discounted sums in the original episode order.
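
As a small worked example of the backward pass, the sketch below computes the returns of a hypothetical three-step episode. The function name `discounted_returns` and the state labels are illustrative only, and the discount factor is set to 0.5 instead of 0.9 so that the arithmetic is exact.

```python
def discounted_returns(samples, gamma):
    """Return [state, G_t] pairs in episode order from (state, reward) samples."""
    returns, G = [], 0.0
    for state, reward in reversed(samples):
        G = reward + gamma * G  # G_t = r_t + gamma * G_{t+1}
        returns.append([state, G])
    returns.reverse()
    return returns

# Hypothetical episode: only the last step is rewarded.
episode = [("A", 0.0), ("B", 0.0), ("C", 1.0)]
print(discounted_returns(episode, gamma=0.5))
# [['A', 0.25], ['B', 0.5], ['C', 1.0]]
```

Each return is obtained from its successor with one multiplication, which is why no explicit power of the discount factor appears.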

In [2]:
class MCAgent(MCAgent):
    # for each episode calculate discounted returns and return info
    def preprocess_visited_states(self):
        # state name and G for each state as appeared in the episode
        all_states = []
        G = 0
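        # walk the episode backwards: G_t = r_t + discount_factor * G_{t+1}
        for state, reward, _done in reversed(self.samples):
            state_name = str(state)
            G = reward + self.discount_factor * G
            all_states.append([state_name, G])
        all_states.reverse()

        # one more episode has finished: decay epsilon for the next one
        self.decaying_epsilon_counter = self.decaying_epsilon_counter + 1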

        return all_states

##### Abstract methods

We define the following two abstract methods:
 - `mc()`
 - `update_global_value_table()`

These have to be implemented by each specific version of the Monte Carlo method.

In [3]:
class MCAgent(MCAgent):
    # to be defined in children classes
    def mc(self):
        pass

    # update visited states for first visit or every visit MC
    def update_global_value_table(self, state_name, G_t):
        pass

#### First Visit Monte Carlo

First visit Monte Carlo is a Monte Carlo method that considers only the first visits to a state *in one episode*.
Notice that a state can still contribute multiple visits, but not within the same episode.

We define a child class for the First Visit Monte Carlo agent.
 - in the method `mc()` we first call the `preprocess_visited_states()` method that will give us an array of visited states and their returns.
 - we check whether a state has already been visited in the current episode.
 If it has, we skip that occurrence and do not update the V values with it.
 - in the method `update_global_value_table()` we update the V values according to textbook update formulas.
 Notice that the visited states are saved in a dictionary.

##### Update rule

The update rule for V values in the First Visit Monte Carlo is the following:

$ V^{\pi}(S) = G_{total}(S) / N(S) $ where:
 - $ N(S) $ - the number of times the state has been visited during multiple episodes.
 Notice that although we are in the first visit case, the number of times a state has been visited can be more than 1.
 That same state could have been visited multiple times in *different episodes*.
 - $ G_{total}(S) $ - cumulative return of multiple visits to that state
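
The interplay between "first visit" and a visit count above 1 can be seen in a short sketch. The helper `first_visit_update`, the plain `[G_total, N]` table and the two episodes are hypothetical; the returns are assumed to be already discounted.

```python
from collections import defaultdict

def first_visit_update(table, returns):
    """Add one episode's first-visit returns to table[state] = [G_total, N]."""
    seen = set()
    for state, G in returns:
        if state in seen:  # later visits within this episode are skipped
            continue
        seen.add(state)
        table[state][0] += G
        table[state][1] += 1

table = defaultdict(lambda: [0.0, 0])
# Hypothetical (state, G_t) pairs for two episodes.
first_visit_update(table, [("A", 0.25), ("B", 0.5), ("A", 1.0)])
first_visit_update(table, [("A", 0.75), ("B", 1.0)])

values = {s: g_total / n for s, (g_total, n) in table.items()}
print(values)  # {'A': 0.5, 'B': 0.75}
```

State `A` ends with $N = 2$ even though its second visit in the first episode was ignored: one visit is counted per episode.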

In [4]:
from mc_agent import MCAgent, VisitState
from environment import Env

class FVMCAgent(MCAgent):
    def __init__(self, actions):
        super(FVMCAgent, self).__init__(actions)

    # for every episode, update V values of visited states
    def mc(self):
        all_states = super(FVMCAgent, self).preprocess_visited_states()
        visited = set()  # states already seen in this episode
        for state_name, G_t in all_states:
            if state_name not in visited:
                visited.add(state_name)
                self.update_global_value_table(state_name, G_t)

    # update V values of visited states for first visit or every visit MC
    def update_global_value_table(self, state_name, G_t):
        # value_table is a defaultdict, so an unseen state starts at total_G = N = 0
        state = self.value_table[state_name]
        state.total_G = state.total_G + G_t
        state.N = state.N + 1
        state.V = state.total_G / state.N

#### Every Visit Monte Carlo

Every Visit Monte Carlo is a Monte Carlo method that counts every visit to a state, whether or not the state was already visited earlier in the same episode.

We define a child class for the Every Visit Monte Carlo agent.
 - in the method `mc()` we first call the `preprocess_visited_states()` method that will give us an array of visited states and their returns.
 - this time we do not check whether that state has already been visited or not. We update our V values with every state in the array.
 - in the method `update_global_value_table()` we update the V values according to textbook update formulas.
 Notice that the visited states are saved in a dictionary.

##### Update rule

The update rule for V values in the Every Visit Monte Carlo is the following:

$ V^{\pi}(S) = G_{total}(S) / N(S) $ where:
 - $ N(S) $ - the number of times the state has been visited during multiple episodes.
 One state can be visited multiple times in the same episode or in different episodes.
 - $ G_{total}(S) $ - cumulative return of multiple visits to that state.
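
The difference to the first visit rule shows up as soon as a state repeats within an episode. The sketch below applies both rules to the same hypothetical episode; the function `mc_estimate` and its `first_visit` flag are illustrative names, not part of the agents in this notebook.

```python
def mc_estimate(returns, first_visit):
    """Average the returns per state, optionally keeping only first visits."""
    total, count, seen = {}, {}, set()
    for state, G in returns:
        if first_visit and state in seen:
            continue
        seen.add(state)
        total[state] = total.get(state, 0.0) + G
        count[state] = count.get(state, 0) + 1
    return {s: total[s] / count[s] for s in total}

# Hypothetical episode in which state "A" is visited twice.
returns = [("A", 0.25), ("B", 0.5), ("A", 1.0)]
print(mc_estimate(returns, first_visit=True))   # {'A': 0.25, 'B': 0.5}
print(mc_estimate(returns, first_visit=False))  # {'A': 0.625, 'B': 0.5}
```

With every visit counted, the later and less discounted return of `A` raises its estimate.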

In [5]:
from mc_agent import MCAgent, VisitState
from environment import Env


class EVMCAgent(MCAgent):
    def __init__(self, actions):
        super(EVMCAgent, self).__init__(actions)

    # for every episode, update V values of visited states
    def mc(self):
        all_states = super(EVMCAgent, self).preprocess_visited_states()
        for state_name, G_t in all_states:
            self.update_global_value_table(state_name, G_t)

    # update V values of visited states for first visit or every visit MC
    def update_global_value_table(self, state_name, G_t):
        # value_table is a defaultdict, so an unseen state starts at total_G = N = 0
        state = self.value_table[state_name]
        state.total_G = state.total_G + G_t
        state.N = state.N + 1
        state.V = state.total_G / state.N

#### Incremental Monte Carlo

Incremental Monte Carlo is a Monte Carlo method that introduces a new update rule. It has the following key characteristics:
 - most importantly, it introduces the notion of a **learning rate**, which we will see below.
 - it comes in two versions: Incremental First Visit Monte Carlo and Incremental Every Visit Monte Carlo.
 We will see the latter, although the former can be easily derived.

We define a child class for the Incremental Monte Carlo agent.
 - in the method `mc()` we first call the `preprocess_visited_states()` method that will give us an array of visited states and their returns.
 - We do not check whether that state has already been visited or not. We update our V values with every state in the array.
 - in the method `update_global_value_table()` we update the V values according to textbook update formulas.
 Notice that the visited states are saved in a dictionary.
 - `update_global_value_table()` is different for Incremental Monte Carlo.

##### Update rule

The update rule for V values in the Incremental Monte Carlo is the following:

$ V^{\pi}(S) = V^{\pi}(S) + \alpha [ G(S) - V^{\pi}(S) ] $ where:
 - $V^{\pi}(S)$ - state value of current state following the policy $\pi$
 - $ \alpha $ - the **learning rate**.
 In our case, we use a **decaying learning rate** that shrinks with the visit count: $ \alpha = 0.5 / N(S) $
 - $ N(S) $ - the number of times the state has been visited during multiple episodes.
 Since we are in the every visit case, one state can be visited multiple times in the same episode or in different episodes.
 - $ G(S) $ - return until the end of the episode of current state.
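
The shape of this rule follows from rewriting the running average used by the two previous methods. As a short derivation, write $V_N(S)$ for the average of the first $N$ returns $G_1, \dots, G_N$ observed for $S$ (the subscripts are notation introduced here for illustration):

$$
\begin{aligned}
V_N(S) &= \frac{1}{N} \sum_{i=1}^{N} G_i \\
       &= \frac{1}{N} \left( G_N + (N-1) \, V_{N-1}(S) \right) \\
       &= V_{N-1}(S) + \frac{1}{N} \left( G_N - V_{N-1}(S) \right)
\end{aligned}
$$

The plain average is therefore the incremental rule with step size $1 / N(S)$, and the incremental method replaces that step size by a general $\alpha$.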

##### Setting the learning rate

Incremental Monte Carlo can be thought of as a general case of the previous two methods.
 - setting $\alpha = 1 / N(S)$ recovers the original Monte Carlo on policy evaluation algorithms.
 - setting $\alpha < 1 / N(S)$ gives a higher weight to older data
 - setting $\alpha > 1 / N(S)$ gives a higher weight to newer data, which can help learning in non-stationary domains.
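
The effect of the three settings can be checked numerically. The sketch below applies the incremental rule to a hypothetical stream of returns for a single state, starting from $V = 0$; the helper `incremental` and the data are assumptions made for illustration.

```python
def incremental(returns, alpha_fn):
    """Apply V <- V + alpha(N) * (G - V) to a stream of returns for one state."""
    V = 0.0
    for n, G in enumerate(returns, start=1):
        V = V + alpha_fn(n) * (G - V)
    return V

# Hypothetical returns for one state; the newest one is much larger.
returns = [1.0, 1.0, 1.0, 5.0]

mean_like = incremental(returns, lambda n: 1.0 / n)  # alpha = 1/N: plain average
slow = incremental(returns, lambda n: 0.5 / n)       # alpha < 1/N, as in this notebook
fast = incremental(returns, lambda n: 0.5)           # alpha > 1/N once N > 2
print(mean_like, slow, fast)
```

On this data `mean_like` equals the arithmetic mean 2.0, while `slow` stays below it and `fast` ends above it, because a larger step size lets the newest return pull the estimate further. Note that with $\alpha < 1 / N(S)$ the "older data" includes the initial value of $0$.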

If we are in a truly Markovian domain, Every Visit Monte Carlo will be more data efficient, because we update our average return for a state every time we visit the state.

In [6]:
from mc_agent import MCAgent, VisitState
from environment import Env

class IMCAgent(MCAgent):
    def __init__(self, actions):
        super(IMCAgent, self).__init__(actions)

    # for every episode, update V values of visited states
    def mc(self):
        all_states = super(IMCAgent, self).preprocess_visited_states()
        for state_name, G_t in all_states:
            self.update_global_value_table(state_name, G_t)

    # redefined V value update of visited states for incremental MC
    def update_global_value_table(self, state_name, G_t):
        if state_name in self.value_table:
            state = self.value_table[state_name]
            state.N = state.N + 1
            learning_rate = 0.5 * 1 / state.N
            state.V = state.V + learning_rate * (G_t - state.V)
        else:
            # first time this state enters the table: start from its return
            self.value_table[state_name] = VisitState(total_G=G_t, N=1, V=G_t)

### Other methods

##### Helper methods

In [7]:
class MCAgent(MCAgent):
    # pick an action for the given state with an epsilon-greedy policy:
    # random with probability epsilon, otherwise greedy on the V table
    def get_action(self, state):
        self.epsilon = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)
        if np.random.rand() < self.epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the v function table
            next_state = self.possible_next_state(state)
            action = self.arg_max(next_state)
        return int(action)

In [8]:
class MCAgent(MCAgent):
    # append sample to memory(state, reward, done)
    def save_sample(self, state, reward, done):
        self.samples.append([state, reward, done])

    # compute arg_max; if multiple candidates exist, pick one randomly
    @staticmethod
    def arg_max(next_state):
        max_index_list = []
        max_value = next_state[0]
        for index, value in enumerate(next_state):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

In [9]:
class MCAgent(MCAgent):
    # get the V values of the four neighbouring states
    # (at a wall, the value of the current state is used instead)
    def possible_next_state(self, state):
        col, row = state
        next_state = [0.0] * 4

        if row != 0:
            next_state[0] = self.value_table[str([col, row - 1])].V
        else:
            next_state[0] = self.value_table[str(state)].V
        if row != self.height - 1:
            next_state[1] = self.value_table[str([col, row + 1])].V
        else:
            next_state[1] = self.value_table[str(state)].V
        if col != 0:
            next_state[2] = self.value_table[str([col - 1, row])].V
        else:
            next_state[2] = self.value_table[str(state)].V
        if col != self.width - 1:
            next_state[3] = self.value_table[str([col + 1, row])].V
        else:
            next_state[3] = self.value_table[str(state)].V

        return next_state

##### Main loop

Since all Monte Carlo methods are closely related, we define a common function called `mainloop()` in the parent class `MCAgent`.
All child MC agents inherit this method and call it from their main blocks.

In [10]:
class MCAgent(MCAgent):
    # to be called in a main loop
    def mainloop(self, env, verbose=False):
        for episode in range(1000):
            state = env.reset()
            action = self.get_action(state)

            while True:
                env.render()

                # forward to next state. reward is number and done is boolean
                next_state, reward, done = env.step(action)
                self.save_sample(next_state, reward, done)

                # get next action
                action = self.get_action(next_state)

                # at the end of each episode, update the v function table
                if done:
                    if verbose:
                        print("episode : ", episode, "\t[3, 2]: ", round(self.value_table["[3, 2]"].V, 2),
                              "\t[2, 3]:", round(self.value_table["[2, 3]"].V, 2),
                              "\t[2, 2]:", round(self.value_table["[2, 2]"].V, 2),
                              "\t\tepsilon: ", round(self.epsilon, 2))
                    self.mc()
                    self.samples.clear()
                    break

Implementing the main functions for the three Monte Carlo agents is pretty straightforward now.

##### First Visit Monte Carlo agent

In [11]:
if __name__ == "__main__":
    env = Env()
    agent = FVMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

##### Every Visit Monte Carlo agent

In [12]:
if __name__ == "__main__":
    env = Env()
    agent = EVMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

##### Incremental Monte Carlo agent

In [13]:
if __name__ == "__main__":
    env = Env()
    agent = IMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

### Results

All Monte Carlo agents usually find the solution within 70 episodes.
The most effective agents to solve our problem seem to be the following:
 - First Visit Monte Carlo - In First Visit Monte Carlo, when a state is visited several times inside an episode we discard its later visits, even though those tend to have higher returns.
 We only consider the return of the first visit, which is typically lower (more discounted) than the returns of later visits.
 This in turn seems to discourage our agent from wasting time going back and forth, which also helps it avoid being penalized by the triangles.