# Deep Reinforcement Learning (DRL) 

Reinforcement Learning (RL) is an important area of Machine Learning. RL studies how an agent should take actions in an environment to maximize a cumulative reward. The agent-environment interaction is
represented as a Markov Decision Process (MDP). If a model of the environment is available, the MDP can be solved using dynamic programming. In most RL problems, however, no such model is given, and the
agent has to learn from samples collected while interacting with the environment. Depending on what is estimated from these samples, approaches are classified as
model-based (the agent learns a model of the environment) or model-free (the agent learns a value function or policy directly). The Deep Q-Network (DQN) is a model-free approach that uses a greedy policy derived from the value function. DQN follows the same principles as Q-Learning,
but approximates the Q-value function using Deep Neural Networks (DNNs). It is therefore an example of Deep Reinforcement Learning (DRL).

In this tutorial, we will use DRL to solve the CartPole problem, a classical control task (watch a real cart pole system in this [Video](https://www.youtube.com/watch?v=XiigTGKZfks&feature=youtu.be)). The goal is to control the cart so that the pole remains upright while the cart stays close to its initial position. The state of the cart pole system is four-dimensional: the position and velocity of the cart, and the angle and angular velocity of the pole. RL methods aim to find the optimal actions, i.e. the forces applied to the cart that achieve the goal. To that end, the reward function is defined in terms of the cart pole position.

We will apply DQN to the CartPole task, using Gym [1] to simulate the cart pole and Keras to build the DNN. Gym is a package for developing and comparing RL agents on predefined environments.

*Please install Gym in your workspace!*


Moreover, you also need to install the package jdc (Jupyter Dynamic Classes). This package allows us to define a class across different cells in a Jupyter Notebook.

*Please install jdc in your workspace!* (`pip install jdc`)

**Now import keras, gym and jdc**

In [1]:
pip install Gym

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install jdc

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install PyOpenGL PyOpenGL_accelerate

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install python-opengl

ERROR: Could not find a version that satisfies the requirement python-opengl
ERROR: No matching distribution found for python-opengl
Note: you may need to restart the kernel to use updated packages.


In [5]:

import keras
# RL environment
import gym 
#pip install jdc (Jupyter Dynamic Classes)
import jdc 

We will also need some packages for visualization, random number generation, and the replay memory mechanism [4].

In [6]:
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt

# Memory replay
import collections

## 1. Gym environment for CartPole

In this tutorial, we will use the environment CartPole-v1 [2] already configured in Gym.
The system has the following features:

- **Goal**: Keep the pole upright. Reach an accumulated reward of 500, i.e. an episode in which the pole stays upright for all 500 time steps.
- **State**: [position of cart, velocity of cart, angle of pole, angular velocity of pole]
- **Action**: The cart is controlled by applying a force of +1 or -1, encoded in Gym as the discrete actions 1 (push right) and 0 (push left).
- **Reward**: A reward of +1 is provided for every timestep that the pole remains upright.
- **Initial Condition**: The pole starts approximately upright.
- **Final Conditions**: The episode ends when the pole is more than 12 degrees from vertical (the threshold used in the Gym implementation), the cart moves more than 2.4 units from the center, or 500 time steps have elapsed.

Before trying to control the cart pole, let's get familiar with some basic Gym functions to instantiate an environment, execute an action, and render it.

Define the number of `episodes` for which you will run the cart pole and the number of `timesteps` that make up an episode.
After executing the cell below, an external window will open and display the cart pole while it executes random actions.
The pole will fall down!

In [7]:
def random_runs():
    
    episodes = 5 # TO DO: Run it for more episodes
    timesteps = 100 # TO DO: Add more steps
    
    env = gym.make("CartPole-v1") # instantiate which env to use
    
    for e in range(episodes): 
        env.reset()
        for t in range(timesteps):
            # Render the cartpole
            env.render(mode='rgb_array')

            # This will sample an action of the action set
            # Here only two options, move cart to left or to right
            action = env.action_space.sample()

            # Execute the previously sampled action
            next_state, reward, done, info = env.step(action)

            # Stop stepping once the episode has terminated
            if done:
                break

        print("Episode= {} Act= {}, Next_state= {}".format(e, action, next_state))
    env.close()

random_runs()

ImportError: 
    Error occurred while running `from pyglet.gl import *`
    HINT: make sure you have OpenGL install. On Ubuntu, you can run 'apt-get install python-opengl'.
    If you're running on a server, you may need a virtual frame buffer; something like this should work:
    'xvfb-run -s "-screen 0 1400x900x24" python <your_script.py>'
    

## 2. Define a DNN:

Create a function that takes the `state_size` and `action_size` as input parameters.
Inside the function, build a DNN with 4 Dense layers, where `input_size = state_size` and `output_size = action_size` 

This DNN can be declared with Keras as in previous tutorials. Notice that it is not a Sequential model: it uses an alternative way to define models in Keras, with a separate Input layer.

This DNN will learn to predict the Q-value (the expected discounted future reward) of each action in the current state, based on the data collected during training. Thus, it is a regression problem!

**Complete the function DNN() with the parameters that are missing**

In [None]:
# Neural Network model for Deep Q Learning

def DNN(input_size, action_size):

    input_sample = keras.layers.Input( ) # TO DO: specify the input size
    
    x = keras.layers.Dense(512, input_shape=input_size, activation="relu", kernel_initializer='he_uniform')(input_sample) 
    x = keras.layers.Dense(256, activation="relu", kernel_initializer='he_uniform')(x)
    x = keras.layers.Dense(64, activation="relu", kernel_initializer='he_uniform')(x)
    
    x = keras.layers.Dense(             , # TO DO: specify the number of neurons given the output_size
                           activation=  ,  # TO DO: specify a proper activation function for a regression problem
                           kernel_initializer='he_uniform')(x) 

    model = keras.models.Model(inputs=input_sample, outputs=x)
    
    model.compile(loss=                  , #TO DO: Choose a proper loss for regression 
                  optimizer=keras.optimizers.RMSprop(lr=0.00025, rho=0.95, epsilon=0.01),
                  metrics=["accuracy"])

    return model

**Instantiate a DNN for a system with state_size=4 and action_size=2 and print the model summary**

In [None]:
test_model = DNN(input_size=(4,), action_size=2)
test_model.summary()
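
To make the input and output shapes concrete, here is a NumPy-only sketch of the forward pass of a network with the same layer widths (4 → 512 → 256 → 64 → 2). It is not the Keras model from the exercise: the helper names `he_uniform`, `init_q_network` and `q_values` are hypothetical, and the linear output layer is an assumption that reflects the regression setting (Q-values are unbounded real numbers).

```python
import numpy as np

def he_uniform(rng, fan_in, fan_out):
    # He-uniform initialisation: U(-limit, limit) with limit = sqrt(6 / fan_in)
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def init_q_network(state_size, action_size, hidden=(512, 256, 64), seed=0):
    # One (weights, bias) pair per Dense layer
    rng = np.random.default_rng(seed)
    sizes = (state_size, *hidden, action_size)
    return [(he_uniform(rng, m, n), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def q_values(params, states):
    # Hidden layers use ReLU; the last layer is linear (assumed for regression)
    x = np.asarray(states, dtype=float)
    for W, b in params[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = params[-1]
    return x @ W + b

params = init_q_network(state_size=4, action_size=2)
q = q_values(params, np.zeros((1, 4)))  # a batch with one 4-dimensional state
print(q.shape)  # (1, 2): one Q-value per action
```

Each row of the output holds one Q-value per action, which is why the number of output neurons equals `action_size`.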

## 3. Implementing DQN

The easiest way to implement DQN is to create a class `DQNAgent`, which contains the following functions:

    1. __init__() (initializes the agent)
    
    2. greedy_exploration() 
    
    3. fill_memory()
    
    4. train()
    
    5. run()
    
    6. test_episode()
    

#### 3.1 Initialize Agent

The constructor sets up the environment, the DNN, and the hyperparameters.

In [None]:
class DQNAgent:
    def __init__(self):
        # Define the env
        self.env = gym.make( ) # TO DO: Select the CartPole-v1
        
        self.state_size = self.env.observation_space.shape[0]
        self.action_size = self.env.action_space.n
        
        # Set hyperparameters
        self.runs = 1
        self.episodes = 1000 # number of training episodes (CartPole-v1 caps each episode at 500 steps)
        self.memory = collections.deque(maxlen=2000) # replay memory
        self.gamma = 0.95   # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.001
        self.epsilon_decay = 0.999
        self.batch_size = 64
        self.train_start = 1000
        self.R_final = []
        self.acc_reward = []

        # Create our main model
        self.model = DNN(input_size=(self.state_size,), action_size = self.action_size)
        self.model.summary()

#### 3.2 Greedy Exploration of the action space

It selects the next action to execute using the epsilon-greedy strategy. Epsilon-greedy exploration picks a random action (exploration) with probability `p=epsilon` and the action that the current policy considers best (exploitation) with probability `p=1-epsilon`, where `epsilon` is a hyperparameter. 
Epsilon thus controls the trade-off between exploration, i.e. choosing actions at random, and exploitation, i.e. following the current policy.

In [None]:
%%add_to DQNAgent 
# magic cell %%add_to from jdc. It adds the greedy_exploration() function to the class DQNAgent()

import random
import numpy as np

def greedy_exploration(self, state):
        p =   # TO DO: generate a random number between 0 and 1
        
        if  p <= self.epsilon: # exploration
            return random.randrange(self.action_size)
        else: # exploitation
            return np.argmax(self.model.predict(state)) # action with maximum predicted Q value given state
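
As a rough illustration of what epsilon-greedy selection does in practice, the sketch below applies it to a fixed, made-up Q-value vector instead of a trained DNN. The function `epsilon_greedy` and the values in `q` are hypothetical. With two actions and `epsilon = 0.1`, the greedy action should be chosen in about `1 - epsilon + epsilon / 2 = 95%` of the draws, because a random draw also hits the greedy action half of the time.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # With probability epsilon explore (uniform random action),
    # otherwise exploit (action with the largest Q-value)
    if rng.random() <= epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.2, 0.8])  # made-up Q-values; action 1 is the greedy one
actions = [epsilon_greedy(q, 0.1, rng) for _ in range(10_000)]
greedy_fraction = float(np.mean(np.array(actions) == 1))
print(greedy_fraction)  # close to 0.95
```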

#### 3.3 Fill replay memory

It stores [state, action, reward, next_state, done_flag] of the current timestep in the replay memory.
Moreover, we can boost our algorithm using an epsilon decay strategy: we want to explore more at the beginning and then gradually reduce the amount of exploration.

In [None]:
%%add_to DQNAgent

def fill_memory(self, state, action, reward, next_state, done):
    
    self.memory.append((  )) # TO DO: append input arguments to the memory
    
    # Once training has started, decay epsilon so that exploration decreases over time
    if len(self.memory) > self.train_start:
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
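
Two properties of this mechanism are easy to check in isolation. The sketch below is standalone and does not use the agent class: it first shows that a bounded `deque` silently drops the oldest transitions once `maxlen` is reached (the tiny `maxlen=3` is only for illustration), and then counts how many decay steps it takes for epsilon to fall from 1.0 to `epsilon_min` with the hyperparameters from section 3.1.

```python
import collections

# A bounded replay memory: the oldest entries are discarded automatically
memory = collections.deque(maxlen=3)
for step in range(5):
    memory.append((step, step + 1))  # placeholder (state, next_state) pairs
print(list(memory))  # [(2, 3), (3, 4), (4, 5)]

# Multiplicative epsilon decay with the values used by the agent
epsilon, epsilon_min, epsilon_decay = 1.0, 0.001, 0.999
steps = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    steps += 1
print(steps)  # roughly 6,900 decay steps until epsilon_min is reached
```

Since epsilon only decays after the memory holds more than `train_start` samples, the agent keeps acting fully at random during the first 1000 stored timesteps.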

#### 3.4 Train function for the agent

It constructs training data from the memory to train our DNN. It uses our DNN to predict the Q-values of the sampled states and constructs the truth labels (targets) for the training data. 
Finally, it trains the DNN on this batch.

In [None]:
%%add_to DQNAgent

def train(self):
    # First fill memory with enough samples (1000) and then start training
    if len(self.memory) < self.train_start:
        return

    # construct training data from memory
    # randomly select samples from the memory to construct a batch
    memory_batch = random.sample(self.memory, min(len(self.memory), self.batch_size))

    state = np.zeros((self.batch_size, self.state_size))
    next_state = np.zeros((self.batch_size, self.state_size))

    action, reward, done = [], [], []
        
    for ind in range(self.batch_size):
        state[ind] = memory_batch[ind][0]
        action.append(memory_batch[ind][1])
        reward.append(memory_batch[ind][2])
        next_state[ind] = memory_batch[ind][3]
        done.append(memory_batch[ind][4])

    # Use our DNN to predict the Q-values of the current states
    # and of the next states (needed for max_a' Q(s', a'))
    target = self.model.predict(state)
    target_next = self.model.predict( ) #TO DO predict target_next

    # construct truth labels for training data 
    for i in range(self.batch_size):
        if done[i]:
            target[i][action[i]] = reward[i]
        else:
            # Standard DQN
            # The target uses the maximum Q-value over the next actions.
            # Note: no separate target network is used here; the same model
            # both selects and evaluates the next action: max_a' Q(s', a')
            
            target[i][action[i]] = reward[i] + self.gamma * (np.amax(target_next[i]))

    # Train our DNN with batches
    # Use verbose=1 to see the training progress
    self.model.fit(state, target, batch_size=self.batch_size, verbose=0)
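
The label construction in the loop above can also be written in vectorized form. The following is a minimal NumPy sketch with made-up Q-values standing in for the network predictions; `dqn_targets` is a hypothetical helper and not part of the agent class. Only the entry of the action that was actually taken is overwritten with `reward + gamma * max_a' Q(s', a')`, and the bootstrap term is dropped for terminal transitions.

```python
import numpy as np

def dqn_targets(q_current, q_next, actions, rewards, dones, gamma):
    # Start from the current predictions so untouched actions give zero error
    targets = q_current.copy()
    rows = np.arange(len(actions))
    # The bootstrap term vanishes for terminal transitions (done == 1)
    bootstrap = gamma * q_next.max(axis=1) * (1.0 - dones)
    targets[rows, actions] = rewards + bootstrap
    return targets

q_current = np.array([[1.0, 2.0], [3.0, 4.0]])  # stand-in for predict(state)
q_next = np.array([[0.5, 1.5], [2.0, 1.0]])     # stand-in for predict(next_state)
actions = np.array([0, 1])
rewards = np.array([1.0, -100.0])
dones = np.array([0.0, 1.0])

targets = dqn_targets(q_current, q_next, actions, rewards, dones, gamma=0.95)
print(targets)
# row 0: [1 + 0.95 * 1.5, 2.0] = [2.425, 2.0]
# row 1: [3.0, -100.0] (terminal transition, no bootstrap)
```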

#### 3.5 Run Agent and train it

This is the main function. It runs each episode and each timestep. 
It selects the action for each timestep using greedy_exploration().
It executes the action and fetches the reward and next state for that action.
The transition is stored in the replay memory.
After every timestep it calls train(). Once an episode reaches an accumulated reward of 500, the trained DNN is saved and the training is finished.


In [None]:
%%add_to DQNAgent

def run(self):
    for e in range(self.episodes):
        # Go to initial position
        state = self.env.reset()
        done = False
        # get state
        state = np.reshape(state, [1, self.state_size])
        i = 0
        r_test = []
        r_test.append(0)

        while not done:
            #self.env.render()
            
            # select action randomly or following policy
            action = self.greedy_exploration(state)
                
            # perform  action
            next_state, reward, done, _ = self.env.step(action)
            next_state = np.reshape(next_state, [1, self.state_size])
            r_test.append(reward)
            
            # Penalize failure: the episode ended before the
            # maximum episode length was reached
            if done and i < self.env._max_episode_steps - 1:
                reward = -100
            
            
            # Experience replay.
            # Save the transition in the replay memory D
            self.fill_memory(state, action, reward, next_state, done)

            # Update state
            state = next_state
            i = i + 1
            
            # If done, record the accumulated reward (episode length) for plotting
            if done:
                self.acc_reward.append(i)
                print("Episode:{}/{}, Accumulated Reward:{}, eps: {:.2}".format(e, self.episodes, i, self.epsilon))
                
                # Goal reached: save the trained model and stop training
                if i == 500 :
                    print("Saving trained model as cartpole-dqn.h5")
                    self.model.save("cartpole-dqn.h5") 
                    return
                    
            # Train model using replay memory
            self.train()

            #End of simulation step
        # Accumulated reward
        R = 0
        
        for t in range(len(r_test)-1):
            R = R + (self.gamma ** (t)) * r_test[t + 1]

        self.R_final.append(R)
            
        # End of an episode
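
The sum stored in `R_final` is the discounted return of the episode, `R = sum_t gamma^t * r_(t+1)`. A small standalone sketch (the helper name `discounted_return` is hypothetical) computes the same quantity and shows why the curve cannot grow without bound: with a reward of +1 per step and `gamma = 0.95`, the return approaches `1 / (1 - gamma) = 20` as the episode gets longer.

```python
def discounted_return(rewards, gamma):
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_(t+1)
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
full_episode = discounted_return([1] * 500, gamma=0.95)
print(round(full_episode, 6))                   # 20.0
```

This is also why `acc_reward` (the number of timesteps survived) is the more informative curve for judging progress towards the goal of 500.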

#### 3.6 Test Agent

The DNN predicts the Q-value of each action, and the agent picks the action with the highest one. It always follows the learned policy and never uses greedy_exploration().

In [None]:
%%add_to DQNAgent

def test_episode(self, e=None, plot_test=False):
    episodes = 50
    for e in range(episodes):
        # Reset environment
        state = self.env.reset()
        state = np.reshape(state, [1, self.state_size])
        done = False
        i = 0
        
        while not done:
            # Optionally render the environment
            if plot_test:
                self.env.render(mode='rgb_array')
                
            
            # TO DO: select action following learned policy
            # not from the greedy exploration!
            action = 

            # perform action
            next_state, reward, done, _ = self.env.step(action)
            state = np.reshape(next_state, [1, self.state_size])
            
            i += 1

            if done:
                print("Episode: {}/{}, Accumulated Reward: {}".format(e, episodes, i))
                break

    self.env.close()

## 4.  It is time to use the DQNAgent() Class!

First, we need to create an instance of the class, which we call `agent`.
We train this agent using the member function `run()`.
It will take some time until we reach the goal of a reward equal to 500.
Once it is done, you will have a `*.h5` file located in the same folder as this notebook.
This `*.h5` file contains the trained weights of our DNN.

In [None]:
# First instantiate an agent from class DQNAgent()
agent = DQNAgent()
# Run it
agent.run()


After some episodes, the reward will reach 500 and the model is saved. Now you can use your agent!

**Plot the sum of future discounted rewards R_final**

In [None]:
%matplotlib inline

actual_episodes = len(agent.R_final)

plt.plot(np.arange(actual_episodes), agent.R_final)
plt.show()

**Plot the accumulated reward of each episode, acc_reward**

In [None]:
%matplotlib inline
actual_episodes = len(agent.acc_reward)
plt.plot(np.arange(actual_episodes), agent.acc_reward)
plt.show()

## 5. Testing your agent

Now that you have trained the agent, load the pre-trained weights and run the agent using test_episode(). In this way, the selected action always comes from the prediction of the DNN and not from the greedy exploration-exploitation function. During training, we achieved our goal (a reward of 500). Therefore, the cart pole should now be under control, with the pole upright! Run it and see for yourself!

In [None]:
# Load trained model
agent.model = keras.models.load_model("cartpole-dqn.h5")
agent.test_episode(plot_test=True)

Moreover, we see that a reward of 500 is achieved after fewer episodes (approx. 10) than during training. Our agent learned the best actions to keep the pole upright and the cart in the center! For better results, train for more episodes!

### Final Comments

This was a simple example of DQN. There are many possible applications, environments [3], and different rules for each environment. For instance, in the cart pole problem, we could define a larger, continuous action space, and the initial and final conditions could also be different. We could also have added friction to the environment. Moreover, Gym not only provides pre-defined environments but also the possibility to create your own.

### *References* 

[1] Gym: http://gym.openai.com/

[2] Cart pole: http://gym.openai.com/envs/CartPole-v1/

[3] List of Gym Environments: https://github.com/openai/gym/wiki/Table-of-environments

[4] Collections: https://docs.python.org/3/library/collections.html 

[5] Online Tutorial: https://pylessons.com/CartPole-reinforcement-learning/ 