## Deep Monte Carlo Q Evaluation
We have already seen [Monte Carlo Q Evaluation](../5-monte-carlo-q-evaluation/mc_q_eval_agent.ipynb) in the previous examples.
It estimates state-action values by averaging the returns observed in sampled episodes.
Deep Monte Carlo Q evaluation extends it by using a neural network, i.e. a *non-linear* function, as the state-action value function approximator.
This lets the method scale to decision making in domains with very large state spaces.
Since the network weights are updated by gradient descent with a specified *learning rate*, the version of Monte Carlo here resembles Incremental Monte Carlo.
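
To make the resemblance concrete, here is a sketch of the two updates side by side. The notation is assumed here rather than taken from the notebook: $\alpha$ is the learning rate, $G_t$ the sampled return, and $\hat{q}(s, a; w)$ the network output with weights $w$. Incremental Monte Carlo moves a table entry towards the return, and a plain stochastic gradient step on the squared error $\frac{1}{2}(G_t - \hat{q}(s_t, a_t; w))^2$ moves the weights in the same way. The Adam optimizer used later rescales this step, so the correspondence is only approximate.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( G_t - Q(s_t, a_t) \right)$$

$$w \leftarrow w + \alpha \left( G_t - \hat{q}(s_t, a_t; w) \right) \nabla_w \hat{q}(s_t, a_t; w)$$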

As we will see, deep MC Q evaluation makes acceptable progress towards learning a good policy in the extended Grid World environment.

### Characteristics of deep Monte Carlo Q evaluation:
Deep Monte Carlo Q evaluation inherits all the characteristics of normal [Monte Carlo Q evaluation](../5-monte-carlo-q-evaluation/mc_q_eval_agent.ipynb).

##### Neural Network as value function approximator
Any neural network architecture that we see fit can be used.
In this example, we use a very simple dense neural network.
A dense neural network is simply a *stack of weight matrices (plus bias vectors) with activation functions in between*.

##### Batch training
Each episode sampled by the agent is used to train it.
Gradient descent is not applied to each tuple individually, but to batches of tuples.

##### Continuous or very large state and action spaces
Using a neural network as a value function approximator has its benefits.
It allows us to solve problems with continuous state and action spaces.
Such problems would be almost impossible to solve with tabular methods, because a table would need one entry for every state-action pair.


##### Initialization
For the Q evaluation aspect we keep track of the following:
 - `self.learning_rate` is initially set to $0.1$ and decays with decay factor `self.learning_rate_decay = 0.001` according to the formula: $lr_t = lr_0 \times \frac{1}{1 + decay\_factor \times iteration}$
 - `self.epsilon` is set to $1.0$ and is multiplied by `self.epsilon_decay = 0.9999` at every step taken until a minimum of `self.epsilon_min = 0.01` is reached
 - `self.discount_factor` is set to $0.99$
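
As a rough illustration of how quickly these two schedules move, the sketch below evaluates them directly. The function names are hypothetical and not part of the agent, and it assumes that `iteration` counts optimizer updates.

```python
import math

def decayed_learning_rate(lr0, decay_factor, iteration):
    """Time-based decay: lr_t = lr_0 / (1 + decay_factor * iteration)."""
    return lr0 / (1.0 + decay_factor * iteration)

def steps_until_epsilon_min(epsilon, epsilon_decay, epsilon_min):
    """Number of multiplicative decay steps until epsilon is no longer above epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))

# values taken from the initialization above
lr_after_1000 = decayed_learning_rate(0.1, 0.001, 1000)  # half of the initial rate
eps_steps = steps_until_epsilon_min(1.0, 0.9999, 0.01)   # roughly 46,000 steps
```

With these settings the agent keeps exploring for tens of thousands of environment steps before `epsilon` reaches its minimum.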

For the neural network we keep track of the following:
 - the model (neural network), built with `self.model = self.build_model()`


In [None]:
import copy
import pylab
import random
import numpy as np
from environment import Env
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import Sequential

EPISODES = 1000

# this is Deep MC Q Evaluation Agent for the GridWorld
# Utilize Neural Network as q function approximator
class DeepMCQEvalAgent:
    def __init__(self):
        self.load_model = False
        # actions which agent can do
        self.action_space = [0, 1, 2, 3, 4]
        # get size of state and action
        self.action_size = len(self.action_space)
        self.state_size = 15
        self.discount_factor = 0.99
        self.learning_rate = 0.1
        self.learning_rate_decay = 0.001

        self.epsilon = 1.  # exploration
        self.epsilon_decay = .9999
        self.epsilon_min = 0.01
        self.model = self.build_model()

        self.samples = []

        if self.load_model:
            self.epsilon = 0.05
            self.model.load_weights('./save_model/deep_mc_q_eval.h5')


### Deep Monte Carlo Q evaluation

The process for deep Monte Carlo Q evaluation is the following:
 1. Interact with the environment and sample an episode
 2. Calculate returns for each state-action pair
 3. Choose whether to keep only the first visits to state-action pairs or every visit
 4. For each state visited, predict Q values from the model for all possible actions
 5. Update the Q value of the chosen action from the sample with the return calculated in step 2
 6. Feed the new updated Q values to the network and do gradient descent

Let us now see the code snippets that make the above recipe work:

##### 2. Calculating returns

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    # for every episode, calculate the return of each visited state-action pair
    def calculate_returns(self):
        # [state, action, G, done] for each step, in the order visited
        all_states = []
        G = 0
        # walk the episode backwards so that G accumulates discounted future rewards
        for sample in reversed(self.samples):
            # samples are stored as [state, action, reward, done]
            state_info, action, reward, done = sample
            G = reward + self.discount_factor * G
            all_states.append([state_info, action, G, done])
        all_states.reverse()

        return all_states
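
The backward recursion $G_t = r_t + \gamma G_{t+1}$ is easiest to check on a tiny example. The following is a minimal standalone sketch. The function name and the three-step episode are made up for illustration, and a discount factor of $0.5$ is used instead of the notebook's $0.99$ so that the numbers are easy to verify by hand.

```python
def discounted_returns(rewards, discount_factor):
    """Return G_t for every time step, computed backwards from the end of the episode."""
    returns = []
    G = 0.0
    for reward in reversed(rewards):
        G = reward + discount_factor * G
        returns.append(G)
    returns.reverse()
    return returns

# hypothetical 3-step episode that only pays off at the end
example_returns = discounted_returns([0.0, 0.0, 1.0], 0.5)
print(example_returns)  # [0.25, 0.5, 1.0]
```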

##### 3. First-visit or Every-visit

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    def first_or_every_visit_mc(self, first_visit=True):
        all_states = self.calculate_returns()
        visit_state_batch = []
        action_batch = []
        G_t_batch = []

        visited = set()
        for state_info, action, G_t, done in all_states:
            # a visit is identified by its state-action pair
            key = (str(state_info), action)
            if not first_visit or key not in visited:
                visited.add(key)

                visit_state_batch.append(state_info)
                action_batch.append(action)
                G_t_batch.append(G_t)

        visit_state_batch = np.array(visit_state_batch, dtype=np.float32).reshape(-1, self.state_size)

        #print(np.shape(visit_state_batch))
        #print(np.shape(action_batch))
        #print(np.shape(G_t_batch))
        self.train_model(visit_state_batch, action_batch, G_t_batch)
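
The difference between the two modes can be seen on a toy episode. The sketch below is independent of the agent: `filter_visits` and the string state names are hypothetical, and visits are keyed on the state-action pair as described in step 3.

```python
def filter_visits(episode, first_visit=True):
    """Keep every (state, action, G) tuple, or only the first one per state-action pair."""
    seen = set()
    kept = []
    for state, action, G in episode:
        key = (state, action)
        if first_visit and key in seen:
            continue
        seen.add(key)
        kept.append((state, action, G))
    return kept

# the pair ("s0", 1) occurs twice in this made-up episode
episode = [("s0", 1, 0.25), ("s1", 0, 0.5), ("s0", 1, 1.0)]
first = filter_visits(episode, first_visit=True)    # drops the second ("s0", 1)
every = filter_visits(episode, first_visit=False)   # keeps all three tuples
```

Every-visit yields more training tuples per episode, while first-visit avoids feeding the network several different returns for the same pair within one episode.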

##### 4-6. Training the model

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    def train_model(self, visit_state_batch, action_batch, G_t_batch):
        target_batch = self.model.predict(visit_state_batch)
        # update target with observed G_t
        for target, action, G_t in zip(target_batch, action_batch, G_t_batch):
            target[action] = G_t

        # make batches with target G_t (returns)
        # and do the model fit!
        self.model.fit(visit_state_batch, target_batch, epochs=2, verbose=0, batch_size=32)
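
The target construction in steps 4 and 5 can be illustrated with plain NumPy, without a model. In this sketch `build_targets` is a hypothetical helper and the zero-filled array stands in for the output of `self.model.predict`. It uses integer-array indexing instead of the Python loop, which has the same effect.

```python
import numpy as np

def build_targets(predicted_q, actions, returns):
    """Copy the predictions and overwrite the taken action's entry with the observed return."""
    targets = np.array(predicted_q, dtype=np.float32, copy=True)
    rows = np.arange(len(actions))
    targets[rows, actions] = returns
    return targets

# stand-in predictions for 2 states and 5 actions
predicted_q = np.zeros((2, 5), dtype=np.float32)
targets = build_targets(predicted_q, actions=[3, 0], returns=[0.5, 1.0])
```

Entries for actions that were not taken stay equal to the prediction, so with a squared-error loss they contribute no error and only the taken action's output is pulled towards its return.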

##### The model: neural network
Neural networks are built one layer after the other.
Our model is a dense neural network, i.e. a neural network composed only of dense layers.
A dense layer is simply a *weight matrix* (plus a bias vector) followed by an *activation function*.
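
To show what "a matrix followed by an activation" means in practice, here is a minimal NumPy sketch of a forward pass with the same layer widths as the model in this notebook (15 inputs, two hidden layers of 30 units, 5 outputs). The helper names and the random weights are assumptions for illustration only. Keras handles initialization and training itself.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_forward(state, params):
    """Forward pass through two ReLU layers and a linear output layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    hidden1 = relu(state @ W1 + b1)
    hidden2 = relu(hidden1 @ W2 + b2)
    return hidden2 @ W3 + b3  # linear output: one Q value per action

rng = np.random.default_rng(0)
sizes = [15, 30, 30, 5]  # state size, two hidden layers, action size
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
q_values = dense_forward(np.ones((1, 15)), params)  # shape (1, 5)
```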

Notice that we are using the following:
 - Mean squared error as a loss function, since the network regresses real-valued returns through a linear output layer
 - Adam optimizer
 - decaying time-based learning rate

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    # approximate Q function using Neural Network
    # state is input and Q Value of each action is output of network
    def build_model(self):
        model = Sequential()
        model.add(Dense(30, input_dim=self.state_size, activation='relu'))
        model.add(Dense(30, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.summary()
        # mean squared error suits regression on real-valued returns
        model.compile(loss='mse', optimizer=Adam(lr=self.learning_rate, decay=self.learning_rate_decay))
        return model

### Other methods

##### Helper methods

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    # append sample (state, action, reward, done) to the episode memory
    def save_sample(self, state, action, reward, done):
        self.samples.append([state, action, reward, done])

In [None]:
class DeepMCQEvalAgent(DeepMCQEvalAgent):
    # get action from model using epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() <= self.epsilon:
            # The agent acts randomly
            return random.randrange(self.action_size)
        else:
            # Predict the Q values of all actions for the given state and act greedily
            state = np.array(state, dtype=np.float32).reshape(-1, self.state_size)
            q_values = self.model.predict(state)
            return np.argmax(q_values[0])
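
The epsilon-greedy rule itself does not depend on the network. The sketch below applies it to a fixed vector of Q values. The function name, the example values, and the use of a NumPy `Generator` are assumptions made for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2, 0.0, 0.3])
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always the argmax, index 1
random_action = epsilon_greedy(q_values, epsilon=1.0, rng=rng)  # uniform over the 5 actions
```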

##### Main loop

In [None]:
if __name__ == "__main__":
    env = Env()
    agent = DeepMCQEvalAgent()

    global_step = 0
    scores, episodes = [], []

    for e in range(EPISODES):
        done = False
        score = 0
        state = env.reset()
        state = np.reshape(state, [15])

        while not done:
            if agent.epsilon > agent.epsilon_min:
                agent.epsilon *= agent.epsilon_decay

            global_step += 1

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done = env.step(action)
            next_state = np.reshape(next_state, [15])

            # save tuple to episode
            agent.save_sample(state, action, reward, False)

            score += reward

            # training happens once per episode, after it terminates
            state = copy.deepcopy(next_state)

            if done:
                scores.append(score)
                episodes.append(e)
                pylab.plot(episodes, scores, 'b')
                pylab.savefig("./save_graph/deep_mc_q_eval2.png")

                # last tuple
                action = agent.get_action(state)
                agent.save_sample(state, action, 1, True)

                agent.first_or_every_visit_mc(first_visit=False)
                agent.samples.clear()

                print("episode:", e,
                      "\tscore:", score,
                      "\tglobal_step:", global_step,
                      "\tepsilon:", agent.epsilon,
                      "\tlearning_rate_decay:", agent.learning_rate_decay
                      )

        if e % 100 == 0:
            agent.model.save_weights("./save_model/deep_mc_q_eval2.h5")


<br/>
<h3 style="text-align:center">Results</h3>
<img src="./save_graph/deep_mc_q_eval.png" alt="deep_mc_q_eval.png" width="70%" />

From the graph above we can see the following:
 - the agent's scores tend to improve over the course of training.
 - smoothing the plot would make the improvement in scores much clearer.
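
One simple way to smooth such a curve is a moving average over the episode scores. This is a minimal sketch with a hypothetical helper and made-up scores, not the procedure used to produce the figure.

```python
import numpy as np

def moving_average(scores, window):
    """Average each run of `window` consecutive scores."""
    kernel = np.ones(window) / window
    return np.convolve(scores, kernel, mode="valid")

# made-up scores; with window=3 the result is approximately [2, 3, 4]
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
```

Because `mode="valid"` drops the first `window - 1` points, the smoothed values would be plotted against `episodes[window - 1:]`.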

We can conclude that deep Monte Carlo Q evaluation has the capability of learning and shows acceptable results in our extended Grid World environment.