This is Part 11 of the Deep Reinforcement Learning Notebook series. In this Notebook I introduce Noisy Nets.



This Notebook series is about Deep RL algorithms, so it excludes all other techniques that can be used to learn functions in reinforcement learning. The series is also not exhaustive, i.e. it contains only the most widely used Deep RL algorithms.

##What are Noisy Nets?

Exploration is one of the major problems in reinforcement learning. Common exploration strategies such as epsilon-greedy and Boltzmann exploration work by adding noise to the agent's actions. Exploration can also be improved by adding noise to the weights of the network instead. A Noisy Net is a deep reinforcement learning agent with parametric noise added to its weights; the paper shows that the induced stochasticity of the agent's policy can be used to aid efficient exploration.

The key insight is that a single change to the weight vector can induce a consistent, and potentially very complex, state-dependent change in policy over multiple time steps – unlike dithering approaches where decorrelated (and, in the case of ε-greedy, state-independent) noise is added to the policy at every step.

For the details of the intuition, you can read the paper Noisy Networks for Exploration (https://arxiv.org/abs/1706.10295).
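
For reference, the noisy linear layer can be written as follows. This is a summary in the notation of the paper linked above, not something defined elsewhere in this notebook: $\mu$ and $\sigma$ are the learnable parameters, $\varepsilon$ is the sampled noise, and $\odot$ is element-wise multiplication.

$$
y = (\mu^w + \sigma^w \odot \varepsilon^w)\,x + \mu^b + \sigma^b \odot \varepsilon^b
$$

With factorised Gaussian noise, one noise sample $\varepsilon_i$ is drawn per input unit and one sample $\varepsilon_j$ per output unit, and they are combined as

$$
\varepsilon^w_{i,j} = f(\varepsilon_i)\,f(\varepsilon_j), \qquad \varepsilon^b_j = f(\varepsilon_j), \qquad f(x) = \operatorname{sgn}(x)\sqrt{|x|}
$$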

Here I have implemented Noisy Nets on top of Double DQN with Prioritized Experience Replay (PER) and target networks.

(Refer to https://github.com/Rahul-Choudhary-3614/Deep-Reinforcement-Learning-Notebooks/blob/master/Deep_Reinforcement_Learning_Part_5_.ipynb for a better understanding of Double DQN with Prioritized Experience Replay (PER) and target networks.)

##The Algorithm Implementation

The code below sets up the environment required to run and record the game, and loads the required libraries.

In [None]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Conv2D,Flatten,Input,MaxPooling2D,Activation
from tensorflow.keras.models import Model
import gym
import numpy as np
import pickle
import random
from collections import deque
import cv2
from skimage.transform import resize
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython.display import clear_output
from IPython import display as ipythondisplay

In [None]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

This part sets random seeds to make the code below reproducible and sets up the environment.

In [None]:
RANDOM_SEED=1

# random seeds (reproducibility); `random` is seeded too because get_action uses it
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# set the env
env = (gym.make("UpNDown-v0")) # env to import
env.seed(RANDOM_SEED)
env.reset(); # reset to env 

Next we construct the SumTree and Memory classes, which hold the priority sum tree and the stored experiences.

In [None]:
class SumTree(object):

  data_pointer = 0
  
  """
  Here we initialize the tree with all nodes = 0, and initialize the data with all values = 0
  """
  def __init__(self, capacity):
      self.capacity = capacity # Number of leaf nodes (final nodes) that contains experiences
      
      # Generate the tree with all nodes values = 0
      # Remember we are in a binary node (each node has max 2 children) so 2x size of leaf (capacity) - 1 (root node)
      # Parent nodes = capacity - 1
      # Leaf nodes = capacity
      self.tree = np.zeros(2 * capacity - 1)
      
      """ tree:
          0
          / \
        0   0
        / \ / \
      0  0 0  0  [Size: capacity] it's at this line that there is the priorities score (aka pi)
      """
      
      # Contains the experiences (so the size of data is capacity)
      self.data = np.zeros(capacity, dtype=object)
  
  
  """
  Here we add our priority score in the sumtree leaf and add the experience in data
  """
  def add(self, priority, data):
      # Look at what index we want to put the experience
      tree_index = self.data_pointer + self.capacity - 1
      
      """ tree:
          0
          / \
        0   0
        / \ / \
tree_index  0 0  0  We fill the leaves from left to right
      """
      
      # Update data frame
      self.data[self.data_pointer] = data
      
      # Update the leaf
      self.update (tree_index, priority)
      
      # Add 1 to data_pointer
      self.data_pointer += 1
      
      if self.data_pointer >= self.capacity:  # If we're above the capacity, you go back to first index (we overwrite)
          self.data_pointer = 0
          
  
  """
  Update the leaf priority score and propagate the change through tree
  """
  def update(self, tree_index, priority):
      # Change = new priority score - former priority score
      change = priority - self.tree[tree_index]
      self.tree[tree_index] = priority
      
      # then propagate the change through tree
      while tree_index != 0:    # this method is faster than the recursive loop in the reference code
          
          """
          Here we want to access the line above
          THE NUMBERS IN THIS TREE ARE THE INDEXES NOT THE PRIORITY VALUES
          
              0
              / \
            1   2
            / \ / \
          3  4 5  [6] 
          
          If we are in leaf at index 6, we updated the priority score
          We need then to update index 2 node
          So tree_index = (tree_index - 1) // 2
          tree_index = (6-1)//2
          tree_index = 2 (because // round the result)
          """
          tree_index = (tree_index - 1) // 2
          self.tree[tree_index] += change
  
  
  """
  Here we get the leaf_index, priority value of that leaf and experience associated with that index
  """
  def get_leaf(self, v):
      """
      Tree structure and array storage:
      Tree index:
            0         -> storing priority sum
          / \
        1     2
        / \   / \
      3   4 5   6    -> storing priority for experiences
      Array type for storing:
      [0,1,2,3,4,5,6]
      """
      parent_index = 0
      
      while True: # the while loop is faster than the method in the reference code
          left_child_index = 2 * parent_index + 1
          right_child_index = left_child_index + 1
          
          # If we reach bottom, end the search
          if left_child_index >= len(self.tree):
              leaf_index = parent_index
              break
          
          else: # downward search, always search for a higher priority node
              
              if v <= self.tree[left_child_index]:
                  parent_index = left_child_index
                  
              else:
                  v -= self.tree[left_child_index]
                  parent_index = right_child_index
          
      data_index = leaf_index - self.capacity + 1

      return leaf_index, self.tree[leaf_index], self.data[data_index]
    
  @property
  def total_priority(self):
    return self.tree[0] # Returns the root node
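
To see how a sampled value is mapped to a leaf, here is a small standalone sketch. It uses a simplified list-based tree with four leaves; the names `set_priority` and `find_leaf` are hypothetical helpers for illustration and are not part of the class above.

```python
# Minimal sum tree over 4 leaves, stored as a flat list (illustrative only).
capacity = 4
tree = [0.0] * (2 * capacity - 1)   # 3 internal nodes followed by 4 leaves

def set_priority(leaf, priority):
    """Write a leaf priority and propagate the change up to the root."""
    idx = leaf + capacity - 1
    change = priority - tree[idx]
    tree[idx] = priority
    while idx != 0:
        idx = (idx - 1) // 2
        tree[idx] += change

def find_leaf(v):
    """Walk down from the root to the leaf whose cumulative range contains v."""
    idx = 0
    while 2 * idx + 1 < len(tree):
        left = 2 * idx + 1
        if v <= tree[left]:
            idx = left
        else:
            v -= tree[left]
            idx = left + 1
    return idx - (capacity - 1)

# Leaves get priorities 1, 2, 3, 4, so the root holds their sum
for leaf, p in enumerate([1.0, 2.0, 3.0, 4.0]):
    set_priority(leaf, p)

total = tree[0]
picks = [find_leaf(v) for v in (0.5, 2.5, 5.5, 9.5)]
print(total, picks)
```

Each leaf owns a slice of the range [0, total] proportional to its priority, so leaves with larger priorities are hit more often when values are drawn uniformly.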

In [None]:
class Memory(object):  # stored as ( s, a, r, s_ ) in SumTree

    PER_e = 0.01  # Hyperparameter that we use to avoid some experiences to have 0 probability of being taken
    PER_a = 0.6  # Hyperparameter that we use to make a tradeoff between taking only exp with high priority and sampling randomly
    PER_b = 0.4  # importance-sampling, from initial value increasing to 1
    
    PER_b_increment_per_sampling = 0.001
    
    absolute_error_upper = 1.  # clipped abs error

    def __init__(self, capacity):
        # Making the tree 
        """
        Remember that our tree is composed of a sum tree that contains the priority scores at his leaf
        And also a data array
        We don't use deque because it means that at each timestep our experiences change index by one.
        We prefer to use a simple array and to overwrite when the memory is full.
        """
        self.tree = SumTree(capacity)
        
    """
    Store a new experience in our tree
    Each new experience has a score of max_priority (it will then be updated when we use this exp to train our DDQN)
    """
    def store(self, experience):
        # Find the max priority
        max_priority = np.max(self.tree.tree[-self.tree.capacity:])
        
        # If the max priority = 0 we can't put priority = 0 since this exp will never have a chance to be selected
        # So we use a minimum priority
        if max_priority == 0:
            max_priority = self.absolute_error_upper
        
        self.tree.add(max_priority, experience)   # set the max p for new p

        
    """
    - First, to sample a minibatch of size k, the range [0, priority_total] is divided into k ranges.
    - Then a value is uniformly sampled from each range.
    - We search the sumtree and retrieve the experience whose priority range contains each sampled value.
    - Then, we calculate IS weights for each minibatch element.
    """
    def sample(self, n):
        # Create a sample array that will contains the minibatch
        memory_b = []
        b_idx, b_ISWeights = np.empty((n,), dtype=np.int32), np.empty((n, 1), dtype=np.float32)
      
        # Calculate the priority segment
        # Here, as explained in the paper, we divide the Range[0, ptotal] into n ranges
        priority_segment = self.tree.total_priority / n       # priority segment

        # Here we increasing the PER_b each time we sample a new minibatch
        self.PER_b = np.min([1., self.PER_b + self.PER_b_increment_per_sampling])  # max = 1
        
        for i in range(n):
            """
            A value is uniformly sample from each range
            """
            a, b = priority_segment * i, priority_segment * (i + 1)
            value = np.random.uniform(a, b)
            """
            Experience that correspond to each value is retrieved
            """
            index, priority, data = self.tree.get_leaf(value)
            #P(j)
            sampling_probabilities = priority / self.tree.total_priority
            #  IS = (1/N * 1/P(i))**b /max wi == (N*P(i))**-b  /max wi
            b_ISWeights[i] = np.power(n * sampling_probabilities, -self.PER_b)
                                   
            b_idx[i]= index
            
            experience = [data]
            
            memory_b.append(experience)

        
        b_ISWeights=b_ISWeights/np.max(b_ISWeights)  # normalise so the largest weight is 1 (stays a NumPy array)
        
        return b_idx,memory_b, b_ISWeights
    
    """
    Update the priorities on the tree
    """
    def batch_update(self, tree_idx, abs_errors):
        abs_errors += self.PER_e  # convert to abs and avoid 0
        clipped_errors = np.minimum(abs_errors, self.absolute_error_upper)
        ps = np.power(clipped_errors, self.PER_a)

        for ti, p in zip(tree_idx, ps):
            self.tree.update(ti, p)
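
The priority and importance-sampling formulas used by the Memory class can be checked on a tiny example. This is a minimal sketch with made-up TD errors; it reuses the hyperparameter values from the class but none of its code.

```python
import numpy as np

PER_e, PER_a, PER_b = 0.01, 0.6, 0.4    # same values as in the Memory class
td_errors = np.array([0.0, 0.5, 2.0])   # made-up absolute TD errors

# priority p_i = min(|delta_i| + e, 1) ** a
priorities = np.minimum(td_errors + PER_e, 1.0) ** PER_a
probs = priorities / priorities.sum()   # sampling probability P(i)

# IS weight w_i = (N * P(i)) ** -b, normalised by the largest weight
N = len(td_errors)
weights = (N * probs) ** -PER_b
weights /= weights.max()

print(probs.round(3), weights.round(3))
```

The experience with the smallest priority is sampled least often, so it receives the largest weight, which compensates for the bias introduced by non-uniform sampling.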

Next we implement the Noisy Dense layer. This is the core of this notebook: it is the layer that improves exploration.

In [None]:
class noisy_dense(tf.keras.layers.Layer):  # a single layer, so it subclasses Layer rather than Model
  def __init__(self,units=32,activation_fn=None):
    super(noisy_dense,self).__init__()
    self.units = units
    self.activation_fn = activation_fn

  def f(self,x):
    return tf.multiply(tf.sign(x), tf.pow(tf.abs(x), 0.5))
  
  def build(self, input_shape):
    # Initialize the mean (mu) and noise-scale (sigma) parameters
    last_dim = int(input_shape[-1])

    self.w_mu = self.add_weight(
        shape=(last_dim, self.units),
        initializer="random_normal",
        trainable=True,
        )
    self.w_sigma = self.add_weight(
        shape=(last_dim, self.units), initializer="random_normal", trainable=True
        )
    
    self.b_mu = self.add_weight(
        shape=(self.units,),
        initializer="random_normal",
        trainable=True,
        )
    self.b_sigma = self.add_weight(
        shape=(self.units,), initializer="random_normal", trainable=True
        )

  def call(self,inputs):
    p = tf.random.normal((tf.shape(inputs)[1],1))
    q = tf.random.normal((1, self.units))
    f_p = self.f(p)
    f_q = self.f(q)
    w_epsilon = f_p*f_q
    b_epsilon = tf.squeeze(f_q)
     # w = w_mu + w_sigma*w_epsilon
    self.w = self.w_mu + tf.multiply(self.w_sigma, w_epsilon)
    ret = tf.matmul(inputs, self.w)

    # b = b_mu + b_sigma*b_epsilon
    self.b = self.b_mu + tf.multiply(self.b_sigma, b_epsilon)
    if self.activation_fn is None:
      return ret + self.b
    else:
      return self.activation_fn(ret + self.b)

  def get_config(self):
    config = super(noisy_dense, self).get_config()
    config.update({
        'units':
            self.units,
        'activation_fn':
            (tf.keras.activations.serialize(self.activation_fn)
             if self.activation_fn is not None else None)
    })
    return config

  def compute_output_shape(self, input_shape):
    input_shape = tf.TensorShape(input_shape)
    input_shape = input_shape.with_rank_at_least(2)
    if tf.compat.dimension_value(input_shape[-1]) is None:
      raise ValueError(
          'The innermost dimension of input_shape must be defined, but saw: %s'
          % input_shape)
    return input_shape[:-1].concatenate(self.units)
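
The factorised noise built in `call` can be reproduced outside TensorFlow. The following is a NumPy sketch under assumed toy sizes (3 inputs, 2 outputs) and made-up parameter values; it mirrors the shapes used by the layer but is not the layer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = 3, 2   # toy sizes chosen for illustration

def f(x):
    # f(x) = sign(x) * sqrt(|x|), the same transform as the layer's f method
    return np.sign(x) * np.sqrt(np.abs(x))

p = f(rng.standard_normal((in_dim, 1)))    # one noise sample per input unit
q = f(rng.standard_normal((1, out_dim)))   # one noise sample per output unit
w_epsilon = p * q                          # broadcast to shape (in_dim, out_dim)
b_epsilon = q.squeeze(0)                   # shape (out_dim,)

# Made-up parameters: mu is the mean weight, sigma scales the noise
w_mu = np.ones((in_dim, out_dim))
w_sigma = np.full((in_dim, out_dim), 0.1)
b_mu = np.zeros(out_dim)
b_sigma = np.full(out_dim, 0.1)

x = np.ones((1, in_dim))
y = x @ (w_mu + w_sigma * w_epsilon) + (b_mu + b_sigma * b_epsilon)
y_no_noise = x @ w_mu + b_mu
print(y, y_no_noise)
```

Only `in_dim + out_dim` random numbers are drawn instead of `in_dim * out_dim`, which is the main reason for factorising the noise.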

Creating a Noisy Net Model Class. 

In [None]:
class _noisy_dense_model(tf.keras.Model):
  
  def __init__(self,state_shape=(60,60,4),action_shape=None):
    super(_noisy_dense_model,self).__init__()
    # Default to the action count of the global env created above
    self.action_shape = env.action_space.n if action_shape is None else action_shape

    # padding='same' keeps the feature maps large enough for three conv/pool stages on 60x60 inputs
    self.layer_1 = Conv2D(32,kernel_size=8,strides=4,padding='same',activation='relu',input_shape=state_shape)
    self.layer_2 = MaxPooling2D(pool_size=(2,2))
    self.layer_3 = Activation('relu')

    self.layer_4 = Conv2D(64,kernel_size=4,strides=2,padding='same',activation='relu')
    self.layer_5 = MaxPooling2D(pool_size=(2,2))
    self.layer_6 = Activation('relu')

    self.layer_7 = Conv2D(64,kernel_size=3,strides=1,padding='same',activation='relu')
    self.layer_8 = MaxPooling2D(pool_size=(2,2))
    self.layer_9 = Activation('relu')

    self.layer_10 = Flatten()
    self.layer_11 = noisy_dense(512,tf.keras.activations.relu)
    # Linear output: one unbounded Q-value per action (a softmax would squash them into [0, 1])
    self.layer_12 = noisy_dense(self.action_shape)
  
  def call(self,x):
    x = self.layer_3(self.layer_2(self.layer_1(x)))
    x = self.layer_6(self.layer_5(self.layer_4(x)))
    x = self.layer_9(self.layer_8(self.layer_7(x)))
    x = self.layer_12(self.layer_11(self.layer_10(x)))
    return x


Defining the Noisy DDQN Class

In [None]:
class Noisy_DDQN:

  def __init__(self,memory_size,path_1=None,path_2=None):
    self.memory=Memory(memory_size)
    self.state_shape= (60, 60, 4) # the state space
    self.action_shape=env.action_space.n # the action space
    self.gamma=0.99 # discount factor for future rewards
    self.learning_rate= 0.001 # learning rate in deep learning
    self.epsilon_initial_value=1.0 # initial value of epsilon
    self.epsilon_current_value=1.0# current value of epsilon
    self.epsilon_final_value=0.001 # final value of epsilon
    self.observing_episodes=10    #No of observations before training the training model
    self.observing_episodes_target_model=200
    self.batch_size=64
    if not path_1:
      self.target_model=_noisy_dense_model()    #Target Model is model used to calculate target values
      self.training_model=_noisy_dense_model()  #Training Model is model to predict q-values to be used.
    else:
      self.training_model = _noisy_dense_model()
      self.training_model.load_weights(path_1)
      self.target_model = _noisy_dense_model()
      self.target_model.load_weights(path_2)

Action Selection: The get_action method guides our action choice. Early in training we mostly explore; as epsilon decays, we increasingly exploit the learned Q-values.

In [None]:
  def get_action(self, state,status='Training'):
    '''samples the next action based on the E-greedy policy'''
    if status=='Testing':
      q_values=(self.training_model(state))[0]   #Exploitation
      return np.argmax(q_values)
    if random.random() < self.epsilon_current_value:                                    #Exploration
      action=random.randrange(self.action_shape)
    else:
      q_values=(self.training_model(state))[0]   #Exploitation
      max_Q = np.argmax(q_values)
      action = max_Q
    return action

This is the preprocessing applied to each image obtained from the environment. I crop the image to remove the game score and areas I found unnecessary for training the agent, and convert it to grayscale. Then I downscale the image. This speeds up training.

In [None]:
  def get_frame(self,frame):
    frame=frame[25:-15,10:]
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=(resize(frame,(60,60)))/255.
    return frame

Updating the training_model

The update_training_model method updates the training model weights.

This is the same update method used in the DDQN with PER notebook.

In [None]:
  def update_training_model(self):
    '''
    Updates the training network on one prioritized minibatch.
    '''
    tree_idx, batch, ISWeights_mb = self.memory.sample(self.batch_size)

    states_mb = np.zeros((self.batch_size, *self.state_shape), dtype=np.float32)
    targets = np.zeros((self.batch_size, self.action_shape), dtype=np.float32)
    absolute_errors = np.zeros((self.batch_size, 1))
    gamma = float(np.squeeze(self.gamma))  # accepts a scalar or a one-element list

    for i in range(self.batch_size):
      state, action, reward, next_state, terminal = batch[i][0]
      states_mb[i] = state[0]
      preds_ = self.training_model(state).numpy()
      targets[i] = preds_[0]

      if terminal:  # terminal state: the target is just the reward
        targets[i, action] = reward
      else:
        # Double DQN: the training model selects a', the target model evaluates it
        best_action = int(np.argmax(self.training_model(next_state)[0]))
        q_next = float(self.target_model(next_state)[0, best_action])
        targets[i, action] = reward + gamma * q_next

      absolute_errors[i] = np.abs(np.sum(targets[i] - preds_[0]))

    # Update the priorities once per minibatch (not once per sample)
    absolute_errors = absolute_errors / np.amax(absolute_errors)
    self.memory.batch_update(tree_idx, absolute_errors)

    # Create the optimizer once so Adam keeps its moment estimates between updates
    if not hasattr(self, 'optimizer'):
      self.optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)

    with tf.GradientTape() as tape:
      preds = self.training_model(states_mb, training=True)  # Q(s, a)
      # Importance-sampling weighted squared TD error
      loss = tf.reduce_mean(ISWeights_mb * tf.square(targets - preds))

    grads = tape.gradient(loss, self.training_model.trainable_variables)
    self.optimizer.apply_gradients(zip(grads, self.training_model.trainable_variables))
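
As a reminder of what a Double DQN target looks like, here is a small numeric sketch. The Q-values are made up for illustration; the point is only that the training model selects the next action while the target model evaluates it.

```python
import numpy as np

gamma = 0.99
reward = 1.0
q_online_next = np.array([0.2, 0.9, 0.5])   # training model Q(s', .), made-up values
q_target_next = np.array([0.3, 0.4, 0.8])   # target model Q(s', .), made-up values

# Double DQN: the action is selected by the training model...
best_action = int(np.argmax(q_online_next))
# ...and its value is read from the target model
double_dqn_target = reward + gamma * q_target_next[best_action]

# Plain DQN would take the max of the target model directly
vanilla_dqn_target = reward + gamma * q_target_next.max()

print(best_action, double_dqn_target, vanilla_dqn_target)
```

Decoupling selection from evaluation tends to reduce the overestimation that comes from taking a max over noisy estimates, which is why the Double DQN target here is the smaller of the two.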

Updating the target_model
The update_target_model method copies the training model weights into the target model.

In [None]:
  def update_target_model(self):
    self.target_model.set_weights(self.training_model.get_weights())

Training the model
This method creates a training environment for the model. Iterating through a set number of episodes, it uses the model to sample actions and play them. After each step, the resulting transition is stored in the replay memory, and the stored experiences are later used to update the model.
In a dynamic game we cannot predict the action from a single observation (one frame of the game in this case), so we use a stack of 4 frames as the input.

We can also downscale the rewards to help the model learn faster.
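
The frame stacking described above can be tried in isolation. This is a minimal sketch with dummy frames (all zeros, then all ones) in place of real game images; it uses the same `np.stack` and `np.append` pattern as the training loop.

```python
import numpy as np

# Start of an episode: the first frame is repeated four times
frame0 = np.zeros((60, 60))
stack = np.stack((frame0, frame0, frame0, frame0), axis=2)
stack = stack.reshape(1, 60, 60, 4)            # add the batch dimension

# A new frame arrives: put it first and drop the oldest (last) channel
new_frame = np.ones((60, 60)).reshape(1, 60, 60, 1)
stack = np.append(new_frame, stack[:, :, :, :3], axis=3)

print(stack.shape, stack[0, 0, 0])
```

The newest frame always sits in channel 0, and the shape stays (1, 60, 60, 4), which is what the network expects as input.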

In [None]:
  def train(self, episodes):
    '''
    train the model
    episodes - number of training iterations
    ''' 
    for episode in range(episodes):
      # each episode is a new game env
      state=env.reset()
      done=False
      state= self.get_frame(state)
      stacked_frame=np.stack((state,state,state,state),axis=2)
      stacked_frame=stacked_frame.reshape(1,stacked_frame.shape[0],stacked_frame.shape[1],stacked_frame.shape[2])
      episode_reward=0 #record episode reward
      lives = 5
      while not done:
        # play an action and record the game state & reward per episode
        action=self.get_action(stacked_frame)
        next_state, reward, done, info=env.step(action)
        reward = reward/10.0
        next_state=self.get_frame(next_state)
        next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
        stacked_frames_1 = np.append(next_state_, stacked_frame[:, :, :, :3], axis=3)
        experience = stacked_frame, action, reward, stacked_frames_1, 1*done
        self.memory.store(experience)
        stacked_frame=stacked_frames_1
        episode_reward+=reward
      print("Episode:{}  reward:{}".format(episode,episode_reward))
      if episode%50==0:
        self.evaluate(episode)
      if episode%self.observing_episodes==0 and episode!=0:
        self.update_training_model()
      if episode%self.observing_episodes_target_model==0 and episode!=0:
        self.update_target_model()
      if episode%500==0 and episode!=0:
        weights = self.training_model.get_weights()
        with open("training_model_{}.txt".format(episode), "wb") as fp:
          pickle.dump(weights, fp)
        self.target_model.save_weights("target_model_{}".format(episode))
      if self.epsilon_current_value > self.epsilon_final_value:
        self.epsilon_current_value=self.epsilon_current_value-(self.epsilon_initial_value-self.epsilon_final_value)/1000
        print('Current Epsilon Value:',self.epsilon_current_value)
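
The epsilon schedule at the end of the loop is a linear decay over roughly 1000 episodes. The standalone sketch below replays that arithmetic with the same initial and final values; the 1200-episode horizon is an arbitrary choice for illustration.

```python
eps_initial, eps_final, decay_episodes = 1.0, 0.001, 1000
step = (eps_initial - eps_final) / decay_episodes

eps = eps_initial
history = []
for episode in range(1200):
    history.append(eps)          # epsilon used during this episode
    if eps > eps_final:          # same condition as in train()
        eps -= step

print(history[0], history[500], history[-1])
```

Epsilon starts at 1.0, is close to 0.5 halfway through the decay, and stays near its final value once the decay has finished.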

In [None]:
memory_size=1000
no_of_episodes=1000

Agent=Noisy_DDQN(memory_size)
Agent.train(no_of_episodes)

The code below runs the trained agent so we can see how well it performs.

In [None]:
class tester:

  def __init__(self,path):
    self.model = _noisy_dense_model()
    with open(path, "rb") as fp:
      weights = pickle.load(fp)
    self.model(np.zeros((1,60,60,4)))  # dummy forward pass to build the weights before loading them
    self.model.set_weights(weights)
      
  def get_action(self, state):
        '''picks the greedy action (highest Q-value) for the given state'''
        q_values=(self.model(state))[0]    #Exploitation
        action = np.argmax(q_values)
        return action
  
  def get_frame(self,frame):
    frame=frame[25:-15,10:]
    frame=cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    frame=(resize(frame,(60,60)))/255.
    return frame

In [None]:
env=(wrap_env(gym.make("UpNDown-v0")))
state=env.reset()
test=tester("Actor.h5")
state=test.get_frame(state)
stacked_frames = np.stack((state,state,state,state),axis=2)
stacked_frames = stacked_frames.reshape(1,stacked_frames.shape[0],stacked_frames.shape[1],stacked_frames.shape[2]) 
while True:
  env.render('ipython')
  action = test.get_action(stacked_frames)
  next_state, reward, done, _=env.step(action)
  print(action,reward)
  next_state=test.get_frame(next_state)
  next_state_ = next_state.reshape(1,next_state.shape[0],next_state.shape[1],1)
  stacked_frames = np.append(next_state_, stacked_frames[:, :, :, :3], axis=3)
  if done:
    break
env.close()
show_video()