<a href="https://colab.research.google.com/github/asia281/rl2023/blob/main/Asia_of_lab_4_DQN_%2C_Function_approximation_lab_(students_version).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DQN, Function Approximation, Performance tricks

In this lab we study the basics of Q-learning with function approximation by neural networks.

In [None]:
# Installing dependencies for visualization
!apt-get -qq -y install libcusparse8.0 libnvrtc8.0 libnvtoolsext1 > /dev/null
!ln -snf /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so.8.0 /usr/lib/x86_64-linux-gnu/libnvrtc-builtins.so
!apt-get -qq -y install xvfb freeglut3-dev ffmpeg> /dev/null
!pip -q install gym
!pip -q install pyglet
!pip -q install pyopengl
!pip -q install pyvirtualdisplay

E: Unable to locate package libcusparse8.0
E: Couldn't find any package by glob 'libcusparse8.0'
E: Couldn't find any package by regex 'libcusparse8.0'
E: Unable to locate package libnvrtc8.0
E: Couldn't find any package by glob 'libnvrtc8.0'
E: Couldn't find any package by regex 'libnvrtc8.0'

In [None]:
import collections
import glob
import random
import time

import gym
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model, clone_model
from tensorflow.keras.optimizers import Adam
# Use the public Keras path rather than the private tensorflow.python one
from tensorflow.keras.regularizers import l2

from base64 import b64encode
from IPython.display import HTML
from pyvirtualdisplay import Display

# Start virtual display
display = Display(visible=0, size=(1024, 768))
display.start()

def show_video(file_name):
    # Embed an mp4 file in the notebook as a base64-encoded data URL
    with open(file_name, 'rb') as f:
        mp4 = f.read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=480 controls>
        <source src="%s" type="video/mp4">
    </video>
    """ % data_url)

We will start by defining a useful data structure:

In [None]:
Transition = collections.namedtuple('transition', ['state', 'action', 'reward', 'done', 'next_state'])
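
As a quick illustration of how such a record behaves, here is a minimal sketch. The numbers in the transition are made up for the example and do not come from a real episode.

```python
import collections

# Same definition as above, repeated so that the snippet is self-contained.
Transition = collections.namedtuple(
    'transition', ['state', 'action', 'reward', 'done', 'next_state'])

# A made-up transition: the values are illustrative only.
t = Transition(state=[0.0, 0.1, 0.0, -0.1], action=1, reward=0.1,
               done=False, next_state=[0.0, 0.3, 0.0, -0.4])

# Fields can be read by name...
print(t.action, t.reward, t.done)
# ...or unpacked like an ordinary tuple, in the order of the field list.
state, action, reward, done, next_state = t
```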

## CartPole
Debugging DQN is typically a complicated process, so we start with a simple environment that allows quick iteration. Let's first build a working DQN for the CartPole problem. We will use a small modification of the original CartPole environment: we reshape the reward to make the problem easier for DQN. Precisely, we add a penalty for ending the episode:

In [None]:
class ModifiedCartPole:
    def __init__(self):
        self.env = gym.make('CartPole-v0')

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, _ = self.env.step(action)
        if done:
            reward = -10
        return obs, reward/10, done, {}
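
To make the effect of the reshaping concrete, the sketch below isolates the reward logic of `ModifiedCartPole.step` in a standalone helper. The name `shaped_reward` is hypothetical and used only for this illustration. It assumes the usual CartPole reward of 1 per step.

```python
def shaped_reward(reward, done):
    # Mirrors ModifiedCartPole.step: replace the reward with a penalty when
    # the episode ends, then rescale everything by a factor of 10.
    if done:
        reward = -10
    return reward / 10

print(shaped_reward(1.0, False))  # ordinary step
print(shaped_reward(1.0, True))   # step that ends the episode
```

An ordinary step is therefore worth 0.1 and a terminating step costs 1. Note that the wrapper applies the penalty whenever `done` is true, whatever the reason the episode ended.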

## Q-network
First we must create a network to approximate Q(s, a). We have two natural design choices:
- Q-network takes two inputs: state s and action a and predicts one value Q(s,a)
- Q-network takes one input: state s, and predicts a vector of Q(s, a) for all possible actions.

We will follow the second design choice. One of the reasons is that such a network finds the best action with a single forward pass, instead of one forward pass per action.
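
The difference between the two designs can be sketched with toy linear "networks" in NumPy. Everything here (the random weights, `q_pair`, `q_vector`) is a stand-in invented for the illustration, not the Keras model used in the lab.

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions = 2
state = rng.normal(size=4)

# Toy linear models with random weights (hypothetical stand-ins for Q-networks).
W_pair = rng.normal(size=4 + num_actions)   # design 1: (state, one-hot action) -> scalar
W_vec = rng.normal(size=(4, num_actions))   # design 2: state -> vector of Q-values

def q_pair(state, action):
    # Design 1: the action is part of the input, the output is a single Q(s, a).
    one_hot = np.eye(num_actions)[action]
    return float(np.concatenate([state, one_hot]) @ W_pair)

def q_vector(state):
    # Design 2: the output contains Q(s, a) for every action at once.
    return state @ W_vec

# Design 1 needs one forward pass per action to find the best one.
best_1 = max(range(num_actions), key=lambda a: q_pair(state, a))
# Design 2 needs a single forward pass followed by an argmax.
best_2 = int(np.argmax(q_vector(state)))
# The two toy models have unrelated weights, so best_1 and best_2 may differ.
```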

**Exercise: fill in the code below to create the Q-network**. Create a simple fully connected network with num_layers layers, each with 64 neurons. The input is a vector of size 4, and the output is a vector of size 2 (we have 2 actions in CartPole).

In [None]:
def make_cartpole_network(input_size=4, num_action=2, num_layers=3, learning_rate=1e-4, weight_decay=0.):
  input_state = Input(batch_shape=(None, input_size))
  x = input_state
  for i in range(num_layers):
    x = Dense(64, activation='relu')(x)
  output = Dense(num_action)(x)
  model = Model(inputs=input_state, outputs=output)
  model.compile(
      loss='mse',
      optimizer=Adam(learning_rate=learning_rate)
  )
  return model

## Building DQN

We will start with some utility functions:

**Exercise: read the following functions, to understand them** (will be used later).

In [None]:
def predict_q_values(q_network, state):
  # Makes a prediction for a single state and returns array of Q-values
  return q_network.predict(np.array([state]))[0]

def choose_best_action(q_network, state):
  # Chooses best action according to Q-network
  action_values = predict_q_values(q_network, state)
  best_action = np.argmax(action_values)
  return best_action

def evaluate_state_batch(target_network, state_batch):
  '''This function evaluates a whole batch of states at once, which
  is very useful to speed up the training when we calculate targets.
  Arguments:
    - target_network: network used to evaluate the states
    - state_batch: list of states to evaluate
  Returns:
    - best_actions: list of best actions, one for every state
    - best_vals: list of best state-action values, one for every state
    - action_values: list of all action-values for each state

  Here we named the argument target_network instead of q_network, because this
  function will be used with the target network.
  '''
  action_values = target_network.predict(np.array(state_batch))
  best_actions = np.argmax(action_values, axis=-1)
  best_vals = np.max(action_values, axis=-1)
  return best_actions, best_vals, action_values


def choose_action(q_network, state, epsilon):
  if random.random() < epsilon:
      return random.randint(0, 1)
  else:
      return choose_best_action(q_network, state)
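
The epsilon-greedy rule in `choose_action` can be checked in isolation. The sketch below replaces the network with a fixed, made-up vector of Q-values; `epsilon_greedy` is a hypothetical helper written for this example.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon explore uniformly at random,
    # otherwise exploit the action with the highest Q-value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

random.seed(0)
q_values = np.array([0.2, 0.7])  # made-up Q-values; action 1 is the greedy one
actions = [epsilon_greedy(q_values, epsilon=0.6) for _ in range(10000)]
greedy_fraction = sum(a == 1 for a in actions) / len(actions)
print(greedy_fraction)
```

With two actions and epsilon = 0.6, the greedy action should be chosen in roughly (1 - 0.6) + 0.6 / 2 = 0.7 of the steps, because random exploration also hits the greedy action half of the time.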

While running the episodes we will collect transitions and store them in a replay_buffer, which is just a list of transitions. Before we write the code for running episodes, we must first prepare two functions that are used while the game is running: one that prepares the training targets and one that does the training.

**Exercise: the training protocol is the heart of DQN. Fill the gaps in the following function**

In [19]:
def prepare_train_targets(target_network, replay_buffer, mini_batch_size, q_learning_rate=0.5, gamma=0.99):
  ''' Write code that chooses random samples from replay_buffer.
  Choose mini_batch_size samples and collect them in replay_batch.
  replay_batch must be a list of transitions. '''

  # Hint: you can use the random.sample method
  replay_batch = random.sample(replay_buffer, mini_batch_size) #<YOUR CODE HERE>

  # We collect all states and next_states from the batch of transitions
  # to evaluate them at once (optimization!)
  next_state_batch = []
  state_batch = []
  train_x, train_y = [], []

  for transition in replay_batch:
    next_state_batch.append(transition.next_state)
    state_batch.append(transition.state)

  _, next_state_values, _ = evaluate_state_batch(target_network, next_state_batch)
  _, _, state_action_vals = evaluate_state_batch(target_network, state_batch)

  for transition, state_vals, next_state_value in zip(replay_batch, state_action_vals, next_state_values):
    action = transition.action
    
    # Create the training datapoint (x, y).
    # x is a copy of transition.state.
    x = transition.state.copy()
    # In our setting y must be a vector of two values (Q-values for the 2 actions).
    # We want to update only one of the y values (the one corresponding to the
    # chosen action), so we start from a copy of state_vals.
    y = state_vals.copy()

    # Calculate the target value for y[action] with the Q-iteration formula:
    #   Q(s,a) <- q_learning_rate * (reward + gamma * max_a' Q_target(s', a'))
    #             + (1 - q_learning_rate) * Q(s,a)
    # The last transition of an episode (transition.done is True) has no
    # successor state, so the term gamma * max_a' Q_target(s', a') is dropped.
    target = transition.reward
    if not transition.done:
      target += gamma * next_state_value
    y[action] = q_learning_rate * target + (1 - q_learning_rate) * y[action]
    train_x.append(x)
    train_y.append(y)

  return np.array(train_x), np.array(train_y)
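
The Q-iteration update used for the targets, Q(s,a) <- q_learning_rate * (reward + gamma * max_a' Q_target(s', a')) + (1 - q_learning_rate) * Q(s,a), can be worked through on single numbers. The helper `q_target` and the input values below are made up for this sketch.

```python
def q_target(q_sa, reward, next_state_value, done, q_learning_rate=0.5, gamma=0.99):
    # Bootstrapped return; a terminal transition has no next state to bootstrap from.
    bootstrapped = reward if done else reward + gamma * next_state_value
    # Move the old estimate Q(s, a) part of the way towards the bootstrapped return.
    return q_learning_rate * bootstrapped + (1 - q_learning_rate) * q_sa

# Non-terminal step: 0.5 * (0.1 + 0.99 * 2.0) + 0.5 * 1.0
non_terminal = q_target(q_sa=1.0, reward=0.1, next_state_value=2.0, done=False)
# Terminal step: 0.5 * (-1.0) + 0.5 * 1.0, next_state_value is ignored
terminal = q_target(q_sa=1.0, reward=-1.0, next_state_value=2.0, done=True)
print(non_terminal, terminal)
```

The non-terminal target comes out close to 1.54 and the terminal one close to 0, which shows how the end-of-episode penalty pulls the estimate down without any contribution from the next state.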

**Exercise: fill the gaps in the training function**

In [21]:
def train(q_network, target_network, replay_buffer, train_epochs, mini_batch_size, q_learning_rate, gamma):
  train_x, train_y = prepare_train_targets(target_network, replay_buffer, mini_batch_size, q_learning_rate, gamma)
  '''For models in keras you can use the fit() method, which takes x and y as inputs and returns a history object.
      Hint: fit() has a verbose argument that determines what this function prints on the screen.
      You can set verbose=0 to have no prints.'''
  # Pass train_epochs so that the argument of train() is actually used
  history = q_network.fit(train_x, train_y, epochs=train_epochs, verbose=0)  # (here use keras.fit method)
  return history.history['loss']

Now, let us code the heart of the DQN algorithm: the function that runs an episode and trains the Q-network.

**Exercise: fill the code in run_one_episode function**

In [17]:
def run_one_episode(q_network, target_network, env, epsilon, steps_so_far, replay_buffer,
                    train_epochs, mini_batch_size, train_every_n_steps, update_target_every_n_steps,
                    q_learning_rate, gamma):
    done = False
    episode_steps = 0
    state = env.reset()
    ep_actions = []
    loss_history = []
    while not done:
        action = choose_action(q_network, state, epsilon) # (use choose_action function)
        ep_actions.append(action)
        next_state, reward, done, _ = env.step(action)
        episode_steps += 1
        steps_so_far += 1

        new_transition = Transition(state, action, reward, done, next_state)# use state, action, reward, done, next_state to create new transition 
        replay_buffer.append(new_transition)

        state = next_state
        
        if steps_so_far % train_every_n_steps == 0 and len(replay_buffer) > mini_batch_size:
            history = train(q_network, target_network, replay_buffer, train_epochs, mini_batch_size, q_learning_rate, gamma) # use the train function from above here to train q_network
            loss_history.extend(history)
            if steps_so_far % update_target_every_n_steps == 0:
                #<YOUR CODE HERE> update the weights of the target network, i.e. make them equal to the weights of q_network.
                # Hint: you can use the set_weights and get_weights methods
                target_network.set_weights(q_network.get_weights())

    return episode_steps, loss_history

Finally, we can complete the full DQN algorithm.

**Exercise: complete the full training loop code**

In [14]:
def run_dqn(train_steps=10000):
    env = ModifiedCartPole()
    
    # Here is a set of default parameters (tested), you can try to find better values
    epsilon = 0.6
    min_epsilon = 0.1
    epsilon_decay=0.995
    gamma=0.99
    q_learning_rate = 0.6
    train_every_n_steps=32
    mini_batch_size=128
    update_target_every_n_steps=128
    train_epochs=2

    replay_buffer = []

    q_network = make_cartpole_network() # create the Q-network using the make_cartpole_network function
    target_network = make_cartpole_network() # create the target network with the same architecture
    
    steps_so_far = 0

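    episode_lengths, loss_history = [], []
    # Snapshots of the Q-network weights, used later to record videos of the agent.
    # Assumption: the lab does not say how often to save; here every 10 episodes.
    checkpoint_every_n_episodes = 10
    q_checkpoints = [q_network.get_weights()]
    episode_num = 0

    while steps_so_far < train_steps:
        episode_length, loss = run_one_episode(q_network, target_network, env, epsilon, steps_so_far, replay_buffer,
                    train_epochs, mini_batch_size, train_every_n_steps, update_target_every_n_steps,
                    q_learning_rate, gamma) # run episode using run_one_episode function
        if epsilon > min_epsilon:
            epsilon *= epsilon_decay
        episode_num += 1
        episode_lengths.append(episode_length)
        if loss is not None:
            loss_history.extend(loss)
        if episode_num % checkpoint_every_n_episodes == 0:
            q_checkpoints.append(q_network.get_weights())
        steps_so_far += episode_length
        print(f'Episode = {episode_num} | steps =  {steps_so_far} | episode_length = {episode_length} | epsilon = {epsilon}')

    # Always keep the final weights as the last checkpoint
    q_checkpoints.append(q_network.get_weights())
    return episode_lengths, loss_history, q_checkpoints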
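
To get a feeling for the exploration schedule, the sketch below replays only the epsilon update from the training loop, using the default values epsilon = 0.6, min_epsilon = 0.1 and epsilon_decay = 0.995. It is a standalone illustration and involves no environment or network.

```python
epsilon, min_epsilon, epsilon_decay = 0.6, 0.1, 0.995

# Same rule as in run_dqn: one multiplicative decay per episode,
# applied as long as epsilon is still above min_epsilon.
episodes = 0
while epsilon > min_epsilon:
    epsilon *= epsilon_decay
    episodes += 1

print(episodes, epsilon)
```

With these defaults the decay stops after about 360 episodes, and because the check happens before the multiplication, the final epsilon ends up slightly below min_epsilon rather than exactly at it.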

Let us now run the training (training for 5000-8000 steps may take several minutes). Do not expect the reward to grow monotonically. The training typically looks like a noisy process with some drift towards higher returns.

In [None]:
progress, loss_history, q_checkpoints = run_dqn(8000)


Episode = 1 | steps =  17 | episode_length = 17 | epsilon = 0.597
Episode = 2 | steps =  29 | episode_length = 12 | epsilon = 0.594015
Episode = 3 | steps =  41 | episode_length = 12 | epsilon = 0.5910449249999999
Episode = 4 | steps =  53 | episode_length = 12 | epsilon = 0.588089700375
Episode = 5 | steps =  65 | episode_length = 12 | epsilon = 0.5851492518731249
Episode = 6 | steps =  75 | episode_length = 10 | epsilon = 0.5822235056137594
Episode = 7 | steps =  88 | episode_length = 13 | epsilon = 0.5793123880856905
Episode = 8 | steps =  101 | episode_length = 13 | epsilon = 0.5764158261452621
Episode = 9 | steps =  111 | episode_length = 10 | epsilon = 0.5735337470145357
Episode = 10 | steps =  126 | episode_length = 15 | epsilon = 0.570666078279463
Episode = 11 | steps =  138 | episode_length = 12 | epsilon = 0.5678127478880657
Episode = 12 | steps =  149 | episode_length = 11 | epsilon = 0.5649736841486254
Episode = 13 | steps =  161 | episode_length = 12 | epsilon = 0.56214881

Let us now plot graphs:

In [None]:
def visualize_progress(progress, loss_history):
  plt.clf()
  plt.plot(progress, label="DQN progress")
  smoothed_progress = [0]
  for x in progress:
    smoothed_progress.append(0.8*smoothed_progress[-1] + 0.2*x)
  plt.plot(smoothed_progress, label="DQN learning (smoothed)")
  plt.legend(loc="upper left")
  plt.show()

  plt.clf()
  plt.plot(loss_history, label="Loss")
  plt.legend(loc="upper left")
  plt.show()

In [None]:
visualize_progress(progress, loss_history)

Let us see how the agent performs across the training:

In [None]:
def record_checkpoint(checkpoint):
  # Records one episode played greedily by the agent with the given checkpoint weights
  env = gym.make("CartPole-v0")
  model = make_cartpole_network()
  model.set_weights(checkpoint)
  max_ep_len = 200
  envw = gym.wrappers.Monitor(env, "./", force=True)
  o, d, ep_len = envw.reset(), False, 0
  while not (d or (ep_len == max_ep_len)):
    envw.render()
    action = choose_best_action(model, o)
    o, r, d, info = envw.step(action)
    ep_len += 1  # count steps so that the max_ep_len limit takes effect
  envw.close()

Let's take a look at the first saved checkpoint:

In [None]:
record_checkpoint(q_checkpoints[0])
file_name = glob.glob('openaigym.video.*.mp4')[0]
show_video(file_name)

And the last:

In [None]:
record_checkpoint(q_checkpoints[-1])
file_name = glob.glob('openaigym.video.*.mp4')[0]
show_video(file_name)

# Ablation study
Let's see what happens to DQN performance after turning off some of its mechanisms:
- target network
- sampling from replay_buffer

**Exercise: turn off the usage of target networks.** You can, for example, modify the code of run_dqn() and set target_network = q_network. Compare the results with the previous run.

**Exercise: add a size limit to the replay buffer.** Add code to run_dqn() that clips its size to a given limit. What happens if the replay buffer is very small?
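
One possible way to bound a buffer, shown here only as a sketch, is a `collections.deque` with `maxlen`, which silently drops the oldest element when a new one is appended to a full buffer. The name `max_size` is a hypothetical parameter that does not exist in run_dqn(), and plain integers stand in for transitions.

```python
import collections
import random

# Assumption: max_size is a hypothetical limit, not a parameter of the original run_dqn().
max_size = 5
replay_buffer = collections.deque(maxlen=max_size)

for step in range(8):
    replay_buffer.append(step)  # integers stand in for Transition objects

# Only the most recent max_size items survive.
print(list(replay_buffer))

# Sampling works as before; converting to a list keeps random.sample
# independent of the container type.
batch = random.sample(list(replay_buffer), 3)
```

Slicing a plain list (for example keeping only its last `max_size` elements after each append) would be an alternative with the same effect.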