# <center> CartPole, Deep Q-learning guided exercise

In [None]:
import numpy as np
import gym

## Using a neural network as a quality function approximator

The continuous state space of the CartPole environment does not allow for tabular learning of the quality function.
A typical approach in such cases is to use a neural network as a function approximator for the quality function: it takes the state as input (4 input nodes, corresponding to the 4 dimensions of the state space) and returns the quality of each action as output (2 output nodes, i.e. push left or push right).

<center> $Q(\; \vec{s} \;, \text{push left} \; | \;\vec{w}) = \text{first output node of the neural network with weights } \vec{w} \text{ and input } \vec{s}$
    
<center> $Q(\; \vec{s} \;, \text{push right} \; | \;\vec{w}) = \text{second output node of the neural network with weights } \vec{w} \text{ and input } \vec{s}$

## General idea of deep-Q-learning

As in a generic tabular reinforcement learning algorithm, the goal is to find a good approximation of the quality function, so that the policy can be chosen by reliably knowing the expected gain of every available action.

For example, in the case of Q-learning, given an *experience* consisting of a starting state $s$, the action taken $a$, the reward obtained $r$, and the new state reached $s'$, the quality associated with $s$ and $a$ is "corrected" (up to a factor given by the learning rate) by the following temporal difference error:

$$
r + \gamma \max_{b} Q(s', b) - Q(s, a) = \hat{Q}(s, a) - Q(s, a)
$$

where $\hat{Q}(s, a)$ is the target of the quality function at that step.
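
As a worked instance of the update above, the snippet below applies one Q-learning correction to a toy tabular quality function. The state names, the numerical values, and the choices `gamma = 0.9` and `alpha = 0.5` are assumptions made only for illustration.

```python
# Tabular Q-values for a toy problem (hypothetical numbers, for illustration only).
Q = {
    ("s", "left"): 1.0,
    ("s", "right"): 2.0,
    ("s_next", "left"): 3.0,
    ("s_next", "right"): 5.0,
}
actions = ("left", "right")
gamma = 0.9  # discount factor (assumed value)
alpha = 0.5  # learning rate (assumed value)

# One experience (s, a, r, s').
s, a, r, s_next = "s", "left", 1.0, "s_next"

# Target Q-hat(s, a) = r + gamma * max_b Q(s', b)
target = r + gamma * max(Q[(s_next, b)] for b in actions)
# Temporal difference error: Q-hat(s, a) - Q(s, a)
td_error = target - Q[(s, a)]
# Correct the stored value by a fraction alpha of the error.
Q[(s, a)] += alpha * td_error

print(target, td_error, Q[(s, a)])
```

Here the target is $1 + 0.9 \cdot 5 = 5.5$, so the stored value $Q(s, \text{left}) = 1$ moves halfway towards it.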

In the case of a quality function approximator, e.g. our NN, the algorithm can no longer directly change the specific number $Q(s, a)$, but it has to change the parameters/weights of the approximator affecting the shape of $Q(s,a| \vec{w})$.
A possible choice is to define the following mean square error as a loss function, for a given sample of experiences $\lbrace (s, a, r, s') \rbrace = \lbrace e \rbrace$

$$
\begin{align}
L(\vec{w}) = & \frac{1}{2} \sum_{\lbrace e \rbrace} (\hat{Q}(s, a | \vec{w}) - Q(s, a | \vec{w}))^2
\\
= & \frac{1}{2} \sum_{\lbrace e \rbrace} (r + \gamma \max_{b} Q(s', b | \vec{w}) - Q(s, a | \vec{w}))^2
\end{align}
$$

which uses the same off-policy estimate as Q-learning for the target of the quality function.
Hopefully, by accumulating several experiences $(s, a, r, s')$, and minimizing the loss function with respect to the parameters $\vec{w}$, the "true" quality function will be well approximated, and the final best policy will reach a high return.
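
To make the loss concrete, here is a minimal sketch that evaluates $L(\vec{w})$ on a small sample of experiences. A linear function with one weight vector per action stands in for the neural network; the helper names `q_values` and `loss`, the two-dimensional states, the value `gamma = 0.99`, and the sample data are all hypothetical. Terminal states are not treated separately here.

```python
def q_values(state, w):
    """Linear stand-in for the network: returns one quality value per action."""
    return [sum(w_i * s_i for w_i, s_i in zip(w_a, state)) for w_a in w]


def loss(w, experiences, gamma=0.99):
    """Half the sum of squared TD errors over a sample of (s, a, r, s') tuples."""
    total = 0.0
    for s, a, r, s_next in experiences:
        target = r + gamma * max(q_values(s_next, w))  # Q-hat(s, a | w)
        total += (target - q_values(s, w)[a]) ** 2
    return 0.5 * total


# Two experiences with 2-dimensional states and actions 0 / 1 (made-up numbers).
experiences = [
    ((1.0, 0.0), 0, 1.0, (0.0, 1.0)),
    ((0.0, 1.0), 1, 1.0, (1.0, 1.0)),
]

w_zero = [[0.0, 0.0], [0.0, 0.0]]   # all quality values are 0, so each target is r
w_other = [[0.5, 0.0], [0.0, 0.5]]  # another arbitrary choice of weights

print(loss(w_zero, experiences))
print(loss(w_other, experiences))
```

Minimizing this quantity with respect to `w` is what the optimizer of the neural network does at each training step.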

## Building the neural network

Below we define a simple network with two hidden layers that can approximate the quality function of the cart-pole problem.

In [None]:
from keras.layers import Input, Dense
from keras.optimizers import Adam
from keras.models import Model


def build_NN(input_dim, output_dim, adam_learning_rate):
    """
    Build the neural network as a keras model. The network has the specified number of input and output nodes.
    It has 2 hidden layers of 100 units each. All the layers are dense.
    Activation functions are relu for the hidden layers, and linear for the output layer.
    The loss function that the network tries to minimize is a mean squared error.
    The weights are updated with the Adam optimizer (an improvement of GD which
    also uses running estimates of the first and second moments of the gradient
    to adapt the step size of each weight).
    """
    
    # Define the four layers of the neural network
    input_layer = Input(shape=(input_dim,))
    h1 = Dense(100, activation="relu")(input_layer)
    h2 = Dense(100, activation="relu")(h1)        
    output_layer = Dense(output_dim, activation="linear")(h2)
    
    # Putting the layers together and specifying the loss function and the GD algorithm.
    model = Model(input_layer, output_layer) 
    model.compile(loss="mse", optimizer=Adam(learning_rate=adam_learning_rate))
    
    return model

The two methods that you need to train the network in our reinforcement learning setting are *predict* and *fit*. For the documentation see: https://keras.io/models/model/.
They are used within the "brain" presented below.

*Predict(state)* returns the output associated to the input, i.e. the quality of the two actions given the state, according to the current configuration of weights.

*Fit(state, $\hat{Q}$)* trains the network by performing one step of minimization of the loss function (the algorithm used for minimization is specified in the keras model construction). This requires the states, which are the "inputs" of the NN, and the estimates of the quality function from those states, i.e. the "labels".

## Building the brain that will perform the training

The following class will perform the crucial steps during the learning cycle. 

An important trick used here, which crucially improves the performance, is to keep a memory of experiences, $\lbrace (s, a, r, s') \rbrace$, and to perform the training (i.e. a minimization step of the loss function) over samples of these experiences.
The exact procedure is explained in the description below.

Summary of the methods of the brain:
- *act(state, $\epsilon$)*: returns the action given the state and the exploration probability. The choice is made with epsilon-greedy exploration: with probability $\epsilon$ the action is chosen at random, otherwise the argmax of the estimated quality is selected.
- *remember(state, action, next_state, reward)*: adds the *experience* to the memory. The memory has a size $\text{mem_size}$: when more experiences are obtained, the oldest ones are removed.
- *train()*:
> - sample a mini-batch of experiences uniformly from the memory. Minibatch size: $\text{batch_size}$
> - for each experience compute the new Q-learning estimate: $\hat{Q} = r + \gamma \max_{a'} Q(s',a' | \textbf{w})$ (you need the *Predict* method of the keras NN to get all the $Q$s). Note that if the state $s'$ is terminal, the estimate is just the obtained reward.
> - SGD step to adjust the $\textbf{w}$ by minimizing $L(\textbf{w}) = \frac{1}{2}(\hat{Q} - Q)^2$ over the sampled experiences (this uses the *Fit* method of the keras model)
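
The behaviour of the memory and of the mini-batch sampling described above can be illustrated in isolation. This is a small sketch using only the standard library; the capacity of 3, the placeholder state labels, and the use of `random.sample` are assumptions for illustration and not the settings of the brain.

```python
import random
from collections import deque

# A memory that holds at most 3 experiences (illustrative capacity).
memory = deque(maxlen=3)

# Store 5 experiences (state, action, next_state, reward) with placeholder labels.
for t in range(5):
    memory.append(("s{}".format(t), 0, "s{}".format(t + 1), 1.0))

# Only the 3 most recent experiences survive: the oldest ones were dropped.
print([e[0] for e in memory])

# Sample a mini-batch of 2 experiences uniformly, without replacement.
minibatch = random.sample(list(memory), 2)
print(minibatch)
```

Sampling uniformly from the memory mixes old and recent experiences in each training step, rather than always training on the latest transition.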

In [None]:
from collections import deque


class DQN_Brain:
    
    def __init__(self, input_dim, output_dim, learning_rate=.005, mem_size=5000, batch_size=64, gamma=1.):
        
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.batch_size = batch_size
        self.gamma = gamma
        self.memory = deque(maxlen=mem_size) # Define our experience replay bucket as a deque with size mem_size.
        self.model = build_NN(input_dim, output_dim, learning_rate)
        
        
    def act(self, state, explore_p):
        # With probability explore_p, randomly pick an action
        if explore_p > np.random.rand():
            return np.random.randint(self.output_dim)
        # Otherwise, find the action that should maximize future rewards according to our current Q-function policy.
        else:
            return np.argmax(self.model.predict(np.array([state])))
            
        
    def remember(self, state, action, next_state, reward):
        # A terminal transition is passed with next_state=None: it is stored with a
        # zero placeholder state and a boolean flag marking the end of the episode.
        terminal = next_state is None
        if terminal:
            next_state = np.zeros(self.input_dim)
        # Add experience tuple to bucket. Bucket is a deque, so the oldest tuple falls out on overflow.
        self.memory.append((state, action, next_state, reward, terminal))
        
        
    def train(self):

        # Only conduct a replay if we have enough experience to sample from.
        if len(self.memory) < self.batch_size:
            return

        # Pick random indices from the bucket without replacement. batch_size determines number of samples.
        idx = np.random.choice(len(self.memory), size=self.batch_size, replace=False)
        minibatch = [self.memory[i] for i in idx]

        # Extract the experience from our sample
        states = np.array([e[0] for e in minibatch], dtype=float)
        actions = np.array([e[1] for e in minibatch], dtype=int)
        next_states = np.array([e[2] for e in minibatch], dtype=float)
        rewards = np.array([e[3] for e in minibatch], dtype=float)
        terminal = np.array([e[4] for e in minibatch], dtype=bool)
        
        # Compute a new estimate for each Q-value: r + gamma * max_b Q(s', b).
        # Terminal states have no future, so their estimate is just the immediate reward.
        future = np.amax(self.model.predict(next_states), axis=1)
        estimate = np.where(terminal, rewards, rewards + self.gamma * future)

        # Get the network's current Q-value predictions for the states in this sample.
        predictions = self.model.predict(states)
        # Overwrite only the entry of the action actually taken with the new estimate.
        predictions[np.arange(len(predictions)), actions] = estimate

        # Perform one training pass of the network towards the updated targets.
        self.model.fit(states, predictions, epochs=1, verbose=0)

## Pseudocode

Now you have all the ingredients to write down the learning cycle that can solve the cart-pole problem.
The pseudocode below provides a possible implementation of the algorithm.

- construct the brain object with all the desired hyperparameters. The input and output dimensions have to correspond to the state space dimension and the number of possible actions.
- for episode until a given rule is satisfied (e.g. run the cycle for a given number of episodes):
> - reset the environment to the starting state: *starting_state = env.reset()*
> - for each step in the episode (until any terminal state has been reached)   
> > - choose an exploration probability according to a given epsilon-greedy scheduling
> > - select an explorative action or a greedy one $a_t$ according to the quality estimate of the NN. You just need to call the *act(state, $\epsilon$)* method of the brain.
> > - execute the action $a_t$ and observe the reward $r_{t+1}$, the new state $s_{t+1}$, and whether a terminal state is reached: *next_state, reward, done, _ = env.step(action)*
> > - if a terminal state is reached (*done = True*), set *next_state = None* so that the brain stores the experience as terminal
> > - save the experience $e_t = \{s_t,a_t,r_{t+1},s_{t+1}\}$ in the memory: *brain.remember(state, action, next_state, reward)*
> > - train the brain according to the stored experiences: *brain.train()*
> > - if a terminal state was reached, break the cycle; otherwise continue from the new state
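
One possible reading of the pseudocode above is sketched below as a plain function. It assumes the classic gym interface used in this notebook (`env.reset()` returns the state, `env.step(action)` returns four values) and a brain exposing `act`, `remember`, and `train`; it stores the terminal transition with `next_state=None` before leaving the episode. The names `run_training`, `epsilon_of_episode`, `CountdownEnv`, and `RecordingBrain` are hypothetical, and the last two are toy stand-ins that only show the call pattern without gym or keras.

```python
def run_training(env, brain, n_episodes, epsilon_of_episode):
    """Run the learning cycle and return the total reward of each episode."""
    episode_rewards = []
    for episode in range(n_episodes):
        state = env.reset()
        total_reward, done = 0.0, False
        while not done:
            # Exploration probability from the chosen schedule.
            epsilon = epsilon_of_episode(episode)
            action = brain.act(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            total_reward += reward
            # A terminal transition is stored with next_state=None.
            brain.remember(state, action, None if done else next_state, reward)
            brain.train()
            state = next_state
        episode_rewards.append(total_reward)
    return episode_rewards


class CountdownEnv:
    """Toy environment (hypothetical): every episode lasts exactly 3 steps."""

    def reset(self):
        self.t = 0
        return (0.0,)

    def step(self, action):
        self.t += 1
        return (float(self.t),), 1.0, self.t >= 3, {}


class RecordingBrain:
    """Toy brain (hypothetical): always picks action 0 and only stores experiences."""

    def __init__(self):
        self.memory = []

    def act(self, state, explore_p):
        return 0

    def remember(self, state, action, next_state, reward):
        self.memory.append((state, action, next_state, reward))

    def train(self):
        pass


brain = RecordingBrain()
rewards = run_training(CountdownEnv(), brain, n_episodes=2,
                       epsilon_of_episode=lambda episode: 0.1)
print(rewards)
```

With the real objects, the same function would be called with the gym CartPole environment and a `DQN_Brain(4, 2)` instance in place of the toy classes.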

## Hyperparameters

As often happens in machine learning, you have a lot of freedom in choosing the hyperparameters and adding possible tricks to improve the performance.
The right way to choose those parameters strongly depends on your specific problem, and usually there are no general and correct recipes.

Here we just list the crucial parameters and rules of this algorithm.

- **Discount factor**. It is suggested to choose a value close to 1. In this way, since CartPole gives a reward of 1 at every step, the return from the step $t$ is approximately the number of steps from $t$ to the termination of the episode.

- **Learning rate**. A constant rate for the adam optimizer can do the job, but maybe the efficiency can be increased if the learning rate decreases with time.

- **Batch size**. Number of experiences sampled from the memory over which the algorithm is trained at each step.

- **Exploration rate**. Here you can employ an epsilon-greedy strategy. It is suggested to have a decreasing probability of exploration with the number of steps or episodes.

- **Max steps per episode**. You can tune this with *env.\_max\_episode\_steps*. You can also consider increasing this variable as the training advances.

- **Stopping rule**. You can stop the training after a given number of steps / episodes, but maybe it is better to define something that quantifies if the algorithm has learned (e.g. stop if the last 15 episodes have a reward equal to *\_max\_episode\_steps*).

- **Neural network**. You can also play with the network, e.g. modifying the units in the hidden layers, or the number of hidden layers...
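
Two of the items above, the exploration rate and the stopping rule, can be sketched as small helper functions. The names `exploration_rate` and `has_learned`, the exponential form of the decay, and all default values are assumptions chosen for illustration; other schedules and criteria are equally valid.

```python
def exploration_rate(episode, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exponentially decaying exploration probability, bounded below by eps_min."""
    return max(eps_min, eps_start * decay ** episode)


def has_learned(episode_rewards, target_reward, window=15):
    """True if each of the last `window` episodes reached target_reward."""
    if len(episode_rewards) < window:
        return False
    return all(r >= target_reward for r in episode_rewards[-window:])


# The exploration probability starts at eps_start and shrinks with the episode index.
print([round(exploration_rate(ep), 3) for ep in (0, 100, 1000)])

# The stopping rule needs a full window of successful episodes.
print(has_learned([500.0] * 15, target_reward=500.0))
print(has_learned([500.0] * 14 + [499.0], target_reward=500.0))
```

In a training loop, `exploration_rate(episode)` would supply the $\epsilon$ passed to *act*, and `has_learned` would be checked after each episode with *target_reward* set to *env.\_max\_episode\_steps*.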


In [None]:
## Write your algorithm here

## Test the model

You have created a cart pole that is able to balance the pole for a number of steps equal to *env.\_max\_episode\_steps*.
How does it behave for longer times?

Test it!

In [None]:
# This is for testing a trained model. 
# NOTE: the second argument is the keras model, not the brain object!
def Test(env, NN_keras_model, max_n_steps, render=True):
    
    env._max_episode_steps = max_n_steps
    state = env.reset()
    ep_reward = 0
    
    while True: # Cycle over the episode steps
        if render:
            env.render()
        action = np.argmax(NN_keras_model.predict(np.array([state]))) # Get action without exploration
        next_state, reward, done, _ = env.step(action) # Take action
        ep_reward += reward # Accumulate reward
        state = next_state # Advance state
        if done: # Episode is completed -- failure or max number of steps reached (success)
            print("Total reward: {}".format(ep_reward))
            break

In [1]:
# Test here

### Testing an already trained network

In [None]:
from keras.models import model_from_json

def load_NN(directory, name):
    # load json and create model
    with open(directory + name + '.json', 'r') as json_file:
        loaded_model_json = json_file.read()
    loaded_model = model_from_json(loaded_model_json)
    # load weights into new model
    loaded_model.load_weights(directory + name + '.h5')
    print("Loaded model from disk")
    return loaded_model

trained_model = load_NN('', 'trained_keras_model')

In [None]:
env = gym.make("CartPole-v1")
Test(env, trained_model, 10000, render=True)