## Monte Carlo on policy Q evaluation
Monte Carlo Q evaluation is a Monte Carlo method for evaluating a policy.
It follows the same principles as other [Monte Carlo on policy evaluation](../3-monte-carlo/mc_agent.ipynb) algorithms, with the distinction that its update steps use state-action pair values (Q values) instead of state values (V values).
Most importantly, for the incremental MC Q evaluation the update formula is the following:
$Q^{\pi}(s_t, a_t) \gets\ Q^{\pi}(s_t, a_t) + \alpha(G_t - Q^{\pi}(s_t, a_t))$.
Because it learns from sampled returns only, MC Q evaluation is a reasonable choice for problem settings that are not strictly Markovian or that have complex transition dynamics.

### Characteristics of Monte Carlo on policy evaluation:

##### Model free
Monte Carlo methods are model free, i.e. they do not require full knowledge of all states or transition dynamics.

##### On policy / Off policy
On policy methods attempt to evaluate or improve the policy that is used to make decisions.
Off policy methods evaluate or improve a policy different from that used to generate the data.

The Monte Carlo methods we will see here are on policy.
Nevertheless there are also off policy versions of Monte Carlo which we will not show here.

##### Q values: state-action pair values
Monte Carlo on policy Q evaluation evaluates and updates Q values, i.e. state-action pair values, instead of V values, i.e. state values.
This means that the expected discounted sum of future rewards is not attached to a state alone, but to a state-action pair.

##### Lower data efficiency
Agents that work with state-action pair values have lower data efficiency than their counterparts working with state values.
Differentiating between actions for an expected return has the following drawbacks:
 - more memory is used to represent state-action pair values than state values
 - more data is needed to train the agent, i.e. the agent needs to spend more time interacting with the environment to learn a good policy
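
To make the memory cost concrete, here is a minimal sketch that counts table entries for the grid used in this notebook. The grid size comes from `self.width` and `self.height`; the action count of 4 is an assumption taken from `range(4)` in `possible_Q_values()`, and the variable names are illustrative only.

```python
# Illustrative count of table entries: V table vs. Q table.
width, height = 5, 5      # grid size used by MCAgent
n_actions = 4             # assumed from range(4) in possible_Q_values()

n_states = width * height
v_entries = n_states                # one value per state
q_entries = n_states * n_actions    # one value per state-action pair

print(v_entries, q_entries)  # 25 100
```

Every one of these entries needs its own visits before its estimate becomes reliable, which is why the Q table also needs more interaction data.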

##### Convergence
Monte Carlo policy evaluation converges to the *true* value function of the evaluated policy, $Q^{\pi}$, as the number of visits to each state-action pair grows, due to the law of large numbers.

##### Sample based
Monte Carlo methods are sample based.
Monte Carlo samples complete trajectories by interacting with the environment, which frees us from using a model.
As there is no bootstrapping and we need to calculate the return of a state-action pair until the end of an episode, one sample in the case of Monte Carlo methods is the full episode.
This means that the update of the Q values only happens after the current episode has been completely sampled.

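##### Unbiased estimation of Q values
Because we average returns that are sampled from the true distribution of returns, we obtain an unbiased estimator of the Q value of each state-action pair.
Strictly, this holds for first visit Monte Carlo; the every visit estimate is biased but still consistent.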

##### Finite Horizon
Monte Carlo methods can only be used in a finite horizon setting, i.e. with episodic (terminating) domains only.
This follows from the fact that the Monte Carlo update of the value function only happens at the end of each episode, because the full return must be known first.

##### Epsilon greedy policy
Monte Carlo methods are commonly used with epsilon greedy policies.
We set an epsilon value for the policy that decays with each episode, denoting the probability that the next action will be random.
This way, we allow our agent to explore more in the beginning, where epsilon is high, and to exploit what it has learned during the late episodes, where epsilon is near 0.

In our case, the update rule for our epsilon is the following:
$ \epsilon \gets 1 / ( c_{\epsilon} \cdot f_{\epsilon})$, where $ c_{\epsilon} $ is a counter that increments after each episode has ended, whereas $ f_{\epsilon} $ is a constant factor.
For $ f_{\epsilon} < 1 $ this value exceeds 1 during the first episodes, which simply means that every action is random.
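
As a rough illustration of this schedule, the sketch below evaluates the formula for a few counter values with the factor $0.05$ used in this notebook. The helper names `epsilon_for` and `random_action_probability` are hypothetical and not part of the agent; the clipping to 1 only mirrors the fact that `np.random.rand()` returns values in $[0, 1)$, so any epsilon of 1 or more always triggers a random action.

```python
def epsilon_for(counter, mul_factor=0.05):
    """Raw epsilon from the schedule 1 / (counter * mul_factor)."""
    return 1.0 / (counter * mul_factor)


def random_action_probability(counter, mul_factor=0.05):
    """Effective probability of a random action (cannot exceed 1)."""
    return min(1.0, epsilon_for(counter, mul_factor))


# The agent acts fully at random until the counter reaches 20,
# after which the probability of a random action starts to fall.
for c in (1, 20, 40, 100, 1000):
    print(c, round(random_action_probability(c), 3))
```

With a factor of $0.2$ the same formula drops below 1 after only 5 episodes, so the smaller factor keeps the agent exploring for longer.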

##### Markov Decision Process agnostic
Monte Carlo methods can be applied in non-MDP settings, i.e. they are MDP agnostic.

##### Discount factor
We use a discount factor less than $1$; in our case: `self.discount_factor = 0.9`


##### Initialization
For Monte Carlo on policy evaluation we keep track of the following:
 - the Q value table, whose entries are initially set to $0$
 - for internal calculations we keep track of the total return accumulated for a specific state-action pair as well as the number of times that pair was visited
 - the samples array contains the latest sampled episode. It is initially set to an empty array and it is cleared after each episode.
 - we set `self.decaying_epsilon_mul_factor` to a value of $0.05$, whereas for normal Monte Carlo the value was set to $0.2$.
 This is done to allow the agent to explore longer, because, as we said, algorithms that work with Q values are less data efficient than their V value counterparts.

In [1]:
import numpy as np
import random
from collections import defaultdict
from environment import Env


# Monte Carlo agent which learns from the sampled episode at the end of every episode
class MCAgent:
    def __init__(self, actions):
        self.width = 5
        self.height = 5
        self.actions = actions
        self.discount_factor = 0.9
        self.decaying_epsilon_counter = 1
        self.decaying_epsilon_mul_factor = 0.05
        self.samples = []
        self.q_value_table = defaultdict(VisitStateAction)

class VisitStateAction:
    def __init__(self, total_G = 0, N = 0, Q = 0):
        self.total_G = total_G
        self.N = N
        self.Q = Q


### Monte Carlo on policy Q evaluation

Monte Carlo methods sample an episode *first* and only after that do they update the Q value function.
The class `MCAgent` is a parent class for the three versions of Monte Carlo on policy Q evaluation: first visit Monte Carlo, every visit Monte Carlo and incremental Monte Carlo.

##### Calculating the discounted returns

At the end of an episode, we start by calculating the discounted returns for each visited *state-action* pair.
We implement the method `preprocess_visited_state_actions()` that calculates the discounted future sum of rewards $G_t$ for each state-action pair.
Notice that the calculation of $G_t$ for each visited state-action pair is a common process for any version of Monte Carlo Q evaluation methods.
During the calculations, the sample is reversed since it simplifies the calculations, i.e. the discount factor can be applied more easily to the $G_t$ sums in reverse and we do not need to calculate high powers of the discount factor.
In the end it returns the state-action pairs and their discounted sums in the correct order.

Notice that, compared to the state value version, each sample processed by `preprocess_visited_state_actions()` carries an additional entry.
It holds the *action* taken, so that actions become part of the "identification" for the returns.
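
The backward pass described above can be sketched as a standalone function. This is a minimal illustration, not the agent's method: the function name `discounted_returns`, the tuple keys and the toy episode are assumptions, while the sample layout `[state, action, reward, done]` follows `save_sample()`.

```python
def discounted_returns(samples, discount_factor=0.9):
    """Return [(state, action), G_t] for each step, in episode order.

    `samples` is a list of [state, action, reward, done] entries.
    """
    returns = []
    G = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1};
    # no explicit powers of the discount factor are needed.
    for state, action, reward, _done in reversed(samples):
        G = reward + discount_factor * G
        returns.append([(state, action), G])
    returns.reverse()  # restore the order in which the pairs were visited
    return returns


# Toy episode: three steps, reward 1 only on the last step.
episode = [
    [(0, 0), 1, 0, False],
    [(1, 0), 2, 0, False],
    [(1, 1), 1, 1, True],
]
for pair, G in discounted_returns(episode):
    print(pair, round(G, 2))
```

For this toy episode the returns are 0.81, 0.9 and 1.0: each earlier step sees the final reward discounted once more.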

In [2]:
class MCAgent(MCAgent):

    # for each episode, calculate discounted returns and return info
    def preprocess_visited_state_actions(self):
        # state-action name and G for each state-action pair as it appeared in the episode
        all_states = []
        G = 0
        for reward in reversed(self.samples):
            # each sample is [state, action, reward, done]
            state_action_name = str([reward[0], reward[1]])
            G = reward[2] + self.discount_factor * G
            all_states.append([state_action_name, G])
        all_states.reverse()

        self.decaying_epsilon_counter = self.decaying_epsilon_counter + 1

        return all_states

##### Abstract methods

We define the following two abstract methods:
 - `mc()`
 - `update_global_q_value_table()`

These have to be implemented by the specific version of the Monte Carlo method.

In [3]:
class MCAgent(MCAgent):
    # to be defined in children classes
    def mc(self):
        pass

    # update visited state-action pairs for first visit or every visit MC
    def update_global_q_value_table(self, state_action_name, G_t):
        pass

#### First Visit Monte Carlo Q evaluation

First visit Monte Carlo is a Monte Carlo method that considers only the first visits to a state-action pair *in one episode*.
Notice that multiple visits to a state-action pair are still counted, as long as they occur in different episodes.

We define a child class for the First Visit Monte Carlo Q evaluation agent.
 - in the method `mc()` we first call the `preprocess_visited_state_actions()` method that will give us an array of visited state-action pairs and their returns.
 - we make sure to check whether a state-action pair has already been visited in the current episode.
 If it has been visited, we skip it and do not update the Q values with it.
 - in the method `update_global_q_value_table()` we update the Q values according to textbook update formulas.
 Notice that the Q values of the visited state-action pairs are saved in a dictionary.

##### Update rule

The update rule for Q values in the First Visit Monte Carlo Q evaluation is the following:

$ Q^{\pi}(s_t, a_t) \gets G_{total}(s_t, a_t) / N(s_t, a_t) $ where:
 - $ N(s_t, a_t) $ - the number of times the state-action pair has been visited during multiple episodes.
 Notice that although we are in the first visit case, the number of times a state-action pair has been visited can be more than 1.
 That same state-action pair could have been visited multiple times in *different episodes*.
 - $ G_{total}(s_t, a_t) $ - cumulative return of multiple visits to that state-action pair.
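
A small worked sketch of this rule, under simplifying assumptions: the table is a plain `defaultdict` of `[total_G, N]` lists instead of `VisitStateAction` objects, the keys `"A"` and `"B"` stand in for state-action names, and the function name `first_visit_update` is hypothetical.

```python
from collections import defaultdict


def first_visit_update(table, episode_returns):
    """Update running averages using only the first visit to each pair.

    `table` maps a state-action key to [total_G, N];
    `episode_returns` is a list of (key, G_t) in episode order.
    """
    seen = set()
    for key, G in episode_returns:
        if key in seen:
            continue          # later visits in the same episode are ignored
        seen.add(key)
        table[key][0] += G    # G_total
        table[key][1] += 1    # N
    return {key: total / n for key, (total, n) in table.items()}


table = defaultdict(lambda: [0.0, 0])
# Episode 1 visits pair "A" twice; only the first return (2.0) counts.
first_visit_update(table, [("A", 2.0), ("B", 1.0), ("A", 5.0)])
# Episode 2 visits "A" once more, so N("A") becomes 2.
q = first_visit_update(table, [("A", 4.0)])
print(q)  # {'A': 3.0, 'B': 1.0}
```

The sketch tracks the pairs seen in the current episode in a set, which gives constant-time membership checks; a list works as well for short episodes.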

In [4]:
from mc_q_eval_agent import MCAgent, VisitStateAction
from environment import Env


class FVMCAgent(MCAgent):
    def __init__(self, actions):
        super(FVMCAgent, self).__init__(actions)

    # for every episode, agent updates q function of visited state action pairs
    def mc(self):
        all_state_actions = super(FVMCAgent, self).preprocess_visited_state_actions()
        visit_state_action = []
        for state_action in all_state_actions:
            if state_action[0] not in visit_state_action:
                visit_state_action.append(state_action[0])
                self.update_global_q_value_table(state_action[0], state_action[1])

    # update visited states for first visit or every visit MC
    def update_global_q_value_table(self, state_action_name, G_t):
        updated = False
        if state_action_name in self.q_value_table:
            state_action = self.q_value_table[state_action_name]
            state_action.total_G = state_action.total_G + G_t
            state_action.N = state_action.N + 1
            state_action.Q = state_action.total_G / state_action.N
            updated = True
        if not updated:
            self.q_value_table[state_action_name] = VisitStateAction(total_G=G_t, N=1, Q=G_t)

#### Every Visit Monte Carlo Q evaluation

Every Visit Monte Carlo Q evaluation is a Monte Carlo method that does not differentiate whether a state-action pair has been visited once or multiple times during an episode.

We define a child class for the Every Visit Monte Carlo agent.
 - in the method `mc()` we first call the `preprocess_visited_state_actions()` method that will give us an array of visited state-action pairs and their returns.
 - this time we do not check whether that state-action pair has already been visited or not. We update our Q values with every state-action pair in the array.
 - in the method `update_global_q_value_table()` we update the Q values according to textbook update formulas.

 Notice that the visited state-action pairs are saved in a dictionary.

##### Update rule

The update rule for Q values in the Every Visit Monte Carlo Q evaluation is the following:

$ Q^{\pi}(s_t, a_t) \gets G_{total}(s_t, a_t) / N(s_t, a_t) $ where:
 - $ N(s_t, a_t) $ - the number of times the state-action pair has been visited during multiple episodes.
 One state-action pair can be visited multiple times in the same episode or in different episodes.
 - $ G_{total}(s_t, a_t) $ - cumulative return of multiple visits to that state-action pair.
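
The following sketch illustrates the every visit rule on a toy example. It assumes a simplified table of `[total_G, N]` lists rather than `VisitStateAction` objects; the keys `"A"` and `"B"` and the function name `every_visit_update` are hypothetical.

```python
from collections import defaultdict


def every_visit_update(table, episode_returns):
    """Update running averages with every visit to each pair.

    `table` maps a state-action key to [total_G, N];
    `episode_returns` is a list of (key, G_t) in episode order.
    """
    for key, G in episode_returns:
        table[key][0] += G    # G_total grows on every visit
        table[key][1] += 1    # so does N
    return {key: total / n for key, (total, n) in table.items()}


table = defaultdict(lambda: [0.0, 0])
# Pair "A" is visited twice in this episode; both returns are counted.
every_visit_update(table, [("A", 2.0), ("B", 1.0), ("A", 5.0)])
q = every_visit_update(table, [("A", 4.0)])
print(table["A"][1], round(q["A"], 3))
```

Here $N$ for pair `"A"` reaches 3 after two episodes, and its Q value is the mean of all three returns, $(2 + 5 + 4) / 3$.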

In [5]:
from mc_q_eval_agent import MCAgent, VisitStateAction
from environment import Env


class EVMCAgent(MCAgent):
    def __init__(self, actions):
        super(EVMCAgent, self).__init__(actions)

    # for every episode, agent updates q function of visited state action pairs
    def mc(self):
        all_state_actions = super(EVMCAgent, self).preprocess_visited_state_actions()
        for state_action in all_state_actions:
            self.update_global_q_value_table(state_action[0], state_action[1])

    # update visited states for first visit or every visit MC
    def update_global_q_value_table(self, state_action_name, G_t):
        updated = False
        if state_action_name in self.q_value_table:
            state_action = self.q_value_table[state_action_name]
            state_action.total_G = state_action.total_G + G_t
            state_action.N = state_action.N + 1
            state_action.Q = state_action.total_G / state_action.N
            updated = True
        if not updated:
            self.q_value_table[state_action_name] = VisitStateAction(total_G=G_t, N=1, Q=G_t)


#### Incremental Monte Carlo Q evaluation

Incremental Monte Carlo Q evaluation is a Monte Carlo method that introduces a new update rule. It has the following key characteristics:
 - most importantly, it introduces the notion of a **learning rate**, which we will see below.
 - it can take two versions: Incremental First Visit Monte Carlo Q evaluation and Incremental Every Visit Monte Carlo Q evaluation.
 We will see the latter one, although the first one can be easily derived.

We define a child class for the Incremental Monte Carlo Q evaluation agent.
 - in the method `mc()` we first call the `preprocess_visited_state_actions()` method that will give us an array of visited state-action pairs and their returns.
 - We do not check whether that state-action pair has already been visited or not. We update our Q values with every state-action in the array.
 - in the method `update_global_q_value_table()` we update the Q values according to textbook update formulas.
 Notice that the visited state-action pairs are saved in a dictionary.
 - `update_global_q_value_table()` is different for Incremental Monte Carlo Q evaluation.

##### Update rule

The update rule for Q values in the Incremental Monte Carlo Q evaluation is the following:

$ Q^{\pi}(s_t, a_t) \gets Q^{\pi}(s_t, a_t) + \alpha [ G(s_t, a_t) - Q^{\pi}(s_t, a_t) ] $ where:
 - $Q^{\pi}(s_t, a_t)$ - Q value of the current state-action pair following the policy $\pi$
 - $ \alpha $ - the **learning rate**.
 In our case, we use a **decaying learning rate** based on the visit count, which takes the value of $ \alpha = 0.5 \cdot 1 / N(s_t, a_t) $
 - $ N(s_t, a_t) $ - the number of times the state-action pair has been visited during multiple episodes.
 Since we are in the every visit case, the same state-action pair can be counted multiple times in one episode as well as in *different episodes*.
 - $ G(s_t, a_t) $ - return from the current state-action pair until the end of the episode.

##### Setting the learning rate

Incremental Monte Carlo can be thought of as a general case of the previous two methods.
 - setting $\alpha = 1 / N(s_t, a_t)$ recovers the original Monte Carlo on policy evaluation algorithms.
 - setting $\alpha < 1 / N(s_t, a_t)$ gives a higher weight to older data
 - setting $\alpha > 1 / N(s_t, a_t)$ gives a higher weight to newer data, which can help learning in non-stationary domains.
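
The effect of the learning rate can be checked numerically. The sketch below applies the incremental rule to a single state-action pair, starting from $Q = 0$; the function name `incremental_q` and the sample returns are made up for illustration.

```python
def incremental_q(returns, alpha_fn):
    """Apply Q <- Q + alpha(N) * (G - Q) over a sequence of returns."""
    Q, N = 0.0, 0
    for G in returns:
        N += 1
        Q += alpha_fn(N) * (G - Q)
    return Q


returns = [2.0, 4.0, 9.0]

# alpha = 1 / N reproduces the plain average of the returns.
q_mean = incremental_q(returns, lambda n: 1.0 / n)

# A constant alpha = 0.5 exceeds 1 / N once N > 2,
# so the latest return (9.0) pulls the estimate above the average.
q_const = incremental_q(returns, lambda n: 0.5)

print(round(q_mean, 6), round(q_const, 6))  # 5.0 5.75
```

The notebook's own choice, $0.5 / N$, falls into the $\alpha < 1 / N$ case and therefore leans towards older data.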

If we are in a truly Markovian domain, Every Visit Monte Carlo will be more data efficient, because we update our average return for a state-action pair every time we visit it.

In [6]:
from mc_q_eval_agent import MCAgent, VisitStateAction
from environment import Env


class IMCAgent(MCAgent):
    def __init__(self, actions):
        super(IMCAgent, self).__init__(actions)

    # for every episode, agent updates q function of visited state action pairs
    def mc(self):
        all_state_actions = super(IMCAgent, self).preprocess_visited_state_actions()
        for state_action in all_state_actions:
            self.update_global_q_value_table(state_action[0], state_action[1])

    # redefined update visited states for incremental MC
    def update_global_q_value_table(self, state_action_name, G_t):
        updated = False
        if state_action_name in self.q_value_table:
            state_action = self.q_value_table[state_action_name]
            state_action.N = state_action.N + 1
            learning_rate = 0.5 * 1 / state_action.N
            state_action.Q = state_action.Q + learning_rate * (G_t - state_action.Q)
            updated = True
        if not updated:
            self.q_value_table[state_action_name] = VisitStateAction(total_G=G_t, N=1, Q=G_t)


### Other methods

##### Helper methods

In [7]:
class MCAgent(MCAgent):

    # get action for the state according to the q function table
    # agent pick action of epsilon-greedy policy
    def get_action(self, state):
        epsilon = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)
        if np.random.rand() < epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the q function table
            q_values = self.possible_Q_values(state)
            action = self.arg_max(q_values)
        return int(action)

In [8]:
class MCAgent(MCAgent):
    # append sample to memory (state, action, reward, done)
    def save_sample(self, state, action, reward, done):
        self.samples.append([state, action, reward, done])

    # compute arg_max; if multiple candidates exist, pick one randomly
    @staticmethod
    def arg_max(next_state):
        max_index_list = []
        max_value = next_state[0]
        for index, value in enumerate(next_state):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)


In [9]:
class MCAgent(MCAgent):
    # get the Q values of all actions in the given state
    def possible_Q_values(self, state):
        Q_values = [self.q_value_table[str([state, x])].Q for x in range(4)]
        return Q_values

##### Main loop

Since all Monte Carlo methods are closely related, we define a common function called `mainloop()` in the parent class `MCAgent`.
All child MC agents inherit this method and can execute it in their main functions.

In [10]:
class MCAgent(MCAgent):
    # to be called in a main loop
    def mainloop(self, env, verbose=False):
        for episode in range(1000):
            state = env.reset()
            action = self.get_action(state)

            while True:
                env.render()

                # forward to next state. reward is number and done is boolean
                next_state, reward, done = env.step(action)

                self.save_sample(state, action, reward, done)

                # update state
                state = next_state
                # get next action
                action = self.get_action(next_state)

                # at the end of each episode, update the q function table
                if done:
                    self.mc()
                    self.samples.clear()

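                    if verbose:
                        # self.epsilon is never set, so recompute the current value
                        epsilon = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)
                        print("episode : ", episode, "\tepsilon: ", epsilon)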
                    break

Implementing the main functions for the three Monte Carlo agents is pretty straightforward now.

##### First Visit Monte Carlo Q evaluation agent

In [11]:
if __name__ == "__main__":
    env = Env()
    agent = FVMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

##### Every Visit Monte Carlo Q evaluation agent

In [12]:
if __name__ == "__main__":
    env = Env()
    agent = EVMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

##### Incremental Monte Carlo Q evaluation agent

In [13]:
if __name__ == "__main__":
    env = Env()
    agent = IMCAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

### Results

 - Crucial for making Monte Carlo Q evaluation work is lowering `self.decaying_epsilon_mul_factor` to a value of $0.05$, whereas for normal Monte Carlo the value was set to $0.2$.
 This is done to allow the agent to explore longer, because, as we said, algorithms that work with Q values are less data efficient than their V value counterparts.
 - All Monte Carlo agents will usually converge to an optimal policy within 300 iterations.
 - Crucial for making Incremental Monte Carlo Q evaluation find a solution in Grid World is the **decaying learning rate**, which decays with the number of visits to each state-action pair.