<a href="https://colab.research.google.com/github/asrjy/ldrl/blob/main/Chapter%207%20-%20Deep%20Q-Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Deep Q-Networks

### Real-life value iteration

both value iteration and its equivalent, q iteration, loop over all states and, for each state (or state-action pair), calculate the state's value or the q-value. this assumes we know all the states beforehand so that we can iterate over them. 

even if we know all states beforehand, storing every transition with its value, action and destination state requires a lot of memory, and iterating over them requires a lot of computational power as well. 

take atari as an example. the screen has a resolution of 210x160 and each pixel holds one of 128 colors. so every frame has 210x160 = 33600 pixels, and the total number of possible screens (states) is 128^33600. no computer can enumerate that many states even once. value iteration would still try to go over all of them just in case, even though most of them will never show up in practice. 
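
to get a feel for how large 128^33600 is, the quick check below counts its decimal digits with a logarithm instead of building the number. the variable names are illustrative and not part of the original notebook.

```python
import math

HEIGHT, WIDTH = 210, 160   # Atari screen resolution from the text
COLORS = 128               # colors available per pixel

pixels = HEIGHT * WIDTH    # 33600 pixels per frame

# The number of distinct screens is COLORS ** pixels. Instead of building that
# integer, count its decimal digits: floor(pixels * log10(COLORS)) + 1.
digits = math.floor(pixels * math.log10(COLORS)) + 1

print(f"{pixels} pixels per frame")
print(f"128^{pixels} has {digits} decimal digits")
```

a number with tens of thousands of digits is far beyond anything a table or a loop can cover.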

it also limits us to discrete action spaces. 


### Tabular Q-Learning

the intuition behind this is that we don't really need all of the states in the environment. we just need the ones obtained by interacting with the environment. if a state is never shown to us by the environment, we don't really care about its value. 

this modification to the value iteration method is called Q-learning. it works in the following way:
- start with an empty table that maps states to the values of actions.
- obtain the current state (s), action performed (a), reward obtained for the action (r) and the new state (s') by interacting with the environment. this step does not prescribe how the action is picked. 
- update the Q(s, a) value using the bellman approximation

    ![bellman](https://static.packt-cdn.com/products/9781838826994/graphics/Images/B14854_06_001.png)

repeat the above two steps until the updates become smaller than some threshold. alternatively, stop after a fixed number of iterations and run test episodes to estimate the reward the policy obtains. 

in the above algorithm, every sample from the environment overwrites the old q value with the new one. this is a bad idea and can make training unstable. so in practice we blend the old and new values using a learning rate, changing the update equation to 
![tabular q learning rate](https://static.packt-cdn.com/products/9781838826994/graphics/Images/B14854_06_003.png)

this allows the values of q to converge smoothly even if the environment is noisy. 
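
as a small worked example of the blended update, the sketch below applies it once to made-up numbers. the function name and the sample values are hypothetical. GAMMA and ALPHA use the same values as the FrozenLake code in this notebook.

```python
GAMMA = 0.9   # discount factor, same value as in the notebook
ALPHA = 0.2   # learning rate, same value as in the notebook

def blended_update(old_q, reward, best_next_q, alpha=ALPHA, gamma=GAMMA):
    """Return the new Q(s, a) after one learning-rate-weighted update."""
    target = reward + gamma * best_next_q          # Bellman approximation
    return (1 - alpha) * old_q + alpha * target    # move only part of the way

# Hypothetical numbers chosen for illustration only.
old_q, reward, best_next_q = 0.5, 1.0, 0.8
new_q = blended_update(old_q, reward, best_next_q)
print(round(new_q, 3))  # 0.8 * 0.5 + 0.2 * 1.72 = 0.744
```

with alpha = 1 the update reduces to plain overwriting, which is the unstable variant described above.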

### Tabular Q-Learning on FrozenLake


In [1]:
!pip install tensorboardX

Collecting tensorboardX
  Downloading tensorboardX-2.5-py2.py3-none-any.whl (125 kB)

In [2]:
import gym
import collections
from tensorboardX import SummaryWriter

In [3]:
ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
ALPHA = 0.2 # Learning Rate
TEST_EPISODES = 20

In [5]:
class Agent:
  def __init__(self):
    self.env = gym.make(ENV_NAME)
    self.state = self.env.reset()
    self.values = collections.defaultdict(float)
  
  def sample_env(self):
    """
    Performs a random action on the environment and returns old state (s), action taken (a), reward(r)
    and new state (s')
    """
    action = self.env.action_space.sample()
    old_state = self.state
    new_state, reward, is_done, _ = self.env.step(action)
    self.state = self.env.reset() if is_done else new_state
    return old_state, action, reward, new_state 

  def best_value_and_action(self, state):
    """
    This method takes the state of the environment and picks the best action to take in this state
    by choosing the action with the largest value. If there is no value for the state-action pair in 
    the value table, its value is taken as 0. This method is used in two places:
      - in play_episode, to choose the action during test episodes
      - in the method that performs value update to get value of the next state
    """
    best_value, best_action = None, None
    for action in range(self.env.action_space.n):
      action_value = self.values[(state, action)]
      if best_value is None or best_value < action_value:
        best_value = action_value
        best_action = action 
    return best_value, best_action
  
  def value_update(self, s, a, r, next_s):
    """
    Updates the value table with the blended Bellman approximation for state s, action a, reward r and next state next_s
    """
    best_v, _ = self.best_value_and_action(next_s)
    new_v = r + GAMMA * best_v
    old_v = self.values[(s, a)]
    self.values[(s, a)] = old_v * (1 - ALPHA) + new_v * ALPHA
  
  def play_episode(self, env):
    """
    Plays one full episode using the provided test environment. The action at every step is the best one
    according to the current Q-table. This method is used to evaluate the current policy and check the
    progress of learning. It does not alter the value table; it only uses it to find the best action. 
    """
    total_reward = 0.0
    state = env.reset()
    while True:
      _, action = self.best_value_and_action(state)
      new_state, reward, is_done, _ = env.step(action)
      total_reward += reward
      if is_done:
        break
      state = new_state
    return total_reward

In [7]:
if __name__ == "__main__":
  test_env = gym.make(ENV_NAME)
  agent = Agent()
  writer = SummaryWriter(comment = '-q-learning')
  iter_no = 0
  best_reward = 0.0
  while True:
    iter_no += 1
    s, a, r, next_s = agent.sample_env()
    agent.value_update(s, a, r, next_s)
    reward = 0.0
    for _ in range(TEST_EPISODES):
      reward += agent.play_episode(test_env)
    reward /= TEST_EPISODES
    writer.add_scalar("reward", reward, iter_no)
    if reward > best_reward:
      print(f"Best reward updated: {best_reward:.3f} -> {reward:.3f}")
      best_reward = reward 
    if reward > 0.80:
      print(f"Solved in {iter_no} iterations!")
      break
  writer.close()

Best reward updated: 0.000 -> 0.050
Best reward updated: 0.050 -> 0.100
Best reward updated: 0.100 -> 0.150
Best reward updated: 0.150 -> 0.200
Best reward updated: 0.200 -> 0.250
Best reward updated: 0.250 -> 0.350
Best reward updated: 0.350 -> 0.400
Best reward updated: 0.400 -> 0.550
Best reward updated: 0.550 -> 0.800
Best reward updated: 0.800 -> 0.900
Solved in 6142 iterations!


### Deep Q-Learning

the above approach solves the issue of going through all states in the state space by only going through observed states. but it can still be an issue when the number of observable states is very large. in some environments, the number of observable states is infinite (continuous states).

we could use a neural net to approximate the Q-function and train it by minimizing a loss, as follows:
- initialize Q(s, a) with some initial approximation
- interact with environment and obtain the tuple (s, a, r, s')
- calculate loss L = (Q(s, a) - r)^2 if episode has ended, or ![loss when episode has not ended](https://static.packt-cdn.com/products/9781838826994/graphics/Images/B14854_06_007.png) if episode has not ended. 
- minimize this loss using SGD and update Q(s, a)
- repeat from step 2 till convergence. 
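
the two cases of the loss can be sketched for a single transition as below. this is a minimal illustration in plain python, not the notebook's training code. `td_loss`, its arguments and the sample numbers are hypothetical, and the GAMMA value is an assumption.

```python
GAMMA = 0.99  # discount factor (assumed value for this sketch)

def td_loss(q_sa, reward, done, max_next_q, gamma=GAMMA):
    """Squared error between Q(s, a) and its one-step target.

    q_sa       -- current estimate Q(s, a)
    max_next_q -- max over a' of Q(s', a'); ignored when the episode ended
    """
    target = reward if done else reward + gamma * max_next_q
    return (q_sa - target) ** 2

# Episode ended: the target is just the reward.
terminal_loss = td_loss(q_sa=0.3, reward=1.0, done=True, max_next_q=5.0)
# Episode continues: the target bootstraps from the next state's best value.
running_loss = td_loss(q_sa=0.3, reward=0.0, done=False, max_next_q=1.0)
print(terminal_loss, running_loss)
```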

#### Interaction with the Environment

the above approach needs experience from the environment to train on, so we have to decide how the agent should act. random actions work fine in small environments like frozen lake, but in complex environments like pong a random policy almost never scores a point, so we would wait a very long time for useful samples. as an alternative, we could use the q-function approximation as a source of behaviour. 

but if our q function approximation is not good, the agent will be stuck with bad actions in some states without ever trying to behave differently. 

this is the explore-exploit dilemma faced in reinforcement learning. on one hand, the agent needs to explore the environment to get a complete picture of transitions and action outcomes. on the other hand, we shouldn't waste time randomly trying actions we have already tried and know the outcomes of. 

a method that mixes these two extreme behaviours is known as the epsilon-greedy method: with probability epsilon the agent takes a random action, otherwise it takes the best action according to the q function. we start with epsilon set to 1, which means 100% random actions, and slowly reduce its value to 0.05, which means 5% random actions. there are other solutions apart from epsilon-greedy, and this problem is one of the fundamental questions in RL. 
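
a minimal sketch of epsilon-greedy with a linear decay from 1.0 to 0.05 follows. the decay length `DECAY_STEPS` and the function names are hypothetical. the schedule used later in this notebook has its own constants.

```python
import random

EPSILON_START = 1.0    # 100% random actions at the beginning
EPSILON_FINAL = 0.05   # 5% random actions once the decay is finished
DECAY_STEPS = 10_000   # hypothetical length of the decay, in steps

def epsilon_at(step):
    """Linearly decay epsilon from EPSILON_START to EPSILON_FINAL."""
    fraction = min(step / DECAY_STEPS, 1.0)
    return EPSILON_START + fraction * (EPSILON_FINAL - EPSILON_START)

def select_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

greedy = select_action([0.1, 0.9, 0.3], epsilon=0.0)   # always exploits
explore = select_action([0.1, 0.9, 0.3], epsilon=1.0)  # always explores
print(epsilon_at(0), epsilon_at(5_000), epsilon_at(20_000), greedy)
```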


#### SGD Optimization
one of the fundamental requirements of SGD is that the training data be independent and identically distributed (iid). the data in our current situation satisfies neither condition:
- the training data is not independent. even if we accumulate a large number of samples, they will all be very close to each other as they belong to the same episode. 
- the training data is not distributed like the data the optimal policy would produce, because it is generated by some other policy that is not optimal (random or epsilon-greedy). we don't want to learn a random policy; we want to learn the policy that gives the most reward. 

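to deal with this, we use a replay buffer: a large fixed-size buffer of past experience from which we sample training data, instead of training only on the latest experience. new experience is added at the end of the buffer, pushing out the same amount of the oldest data. 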
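
the eviction and sampling behaviour can be seen on a toy buffer. this is only a sketch with a tiny capacity and placeholder transitions. the notebook's real buffer is the ExperienceBuffer class defined in the training section.

```python
import collections
import random

buffer = collections.deque(maxlen=3)  # tiny capacity, for illustration only

for step in range(5):
    # Each entry stands in for a transition (s, a, r, s').
    buffer.append((f"s{step}", "a", 0.0, f"s{step + 1}"))

# Only the three most recent transitions survive; steps 0 and 1 were evicted.
oldest_state = buffer[0][0]

# Uniform sampling without replacement breaks up the order of the episode.
batch = random.sample(list(buffer), 2)
print(oldest_state, batch)
```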

#### Correlation between steps
another issue with the lack of iid data is that we update the value of Q(s, a) using Q(s', a'). the states s and s' are only one step apart, which makes them very similar, and it's hard for a NN to distinguish between them. when we alter the network's parameters to move Q(s, a) closer to the desired value, we can indirectly alter the value produced for Q(s', a') and other nearby states. this makes training very unstable, like chasing our own tail. 

a workaround is to keep a copy of our network and use it for the Q(s', a') value in the bellman equation. this copy is synchronized with the main network only periodically. it is called the target network. 
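
the periodic sync can be illustrated without any deep learning library. in the toy sketch below a list of floats stands in for the network weights, and `SYNC_EVERY` is a hypothetical sync period. it only shows that the target copy stays frozen between syncs.

```python
import copy

SYNC_EVERY = 3  # hypothetical sync period N, tiny for illustration

online_weights = [0.0, 0.0]                      # stands in for the trained network
target_weights = copy.deepcopy(online_weights)   # frozen copy used for Q(s', a')

history = []
for step in range(1, 7):
    # Pretend one optimizer step nudged every online weight.
    online_weights = [w + 0.1 for w in online_weights]
    if step % SYNC_EVERY == 0:
        target_weights = copy.deepcopy(online_weights)  # periodic sync
    history.append((step, round(online_weights[0], 1), round(target_weights[0], 1)))

for row in history:
    print(row)  # (step, online weight, target weight)
```

with PyTorch modules the sync itself is usually a single call, `tgt_net.load_state_dict(net.state_dict())`, where `net` and `tgt_net` are two instances of the same model class.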

#### The Markov Property
a fundamental assumption made in our rl approach is the markov property: the observation from the environment is all we need to act optimally, i.e. observations let us distinguish states from one another. this is often not true. for example, a single pong screenshot does not tell us the speed or direction of the ball; for that we need the preceding few screenshots. these sorts of problems fall into the area of partially observable MDPs (POMDPs). another example of a POMDP is a card game where you don't see your opponent's cards: the same observation (the cards you hold and the cards on the table) could correspond to many different hands for your opponent.

one workaround is to use a set of consecutive observations as one observation. for example, in the case of pong, we stack k subsequent frames and use them as the observation at every step. the classic number of stacked frames for atari is 4. 

#### Final form of DQN Training
there are many more tricks for overcoming the problems faced by dqn models, but epsilon-greedy, the replay buffer and the target network alone allowed DeepMind to successfully train DQNs on 49 Atari games. 

the dqn algorithm for the above is as follows:
- initialize parameters for Q(s, a) and Q'(s, a) with random weights, epsilon = 1 and empty replay buffer
- with probability epsilon, select a random action a, otherwise a = argmax a [Q(s, a)]
- execute action a in emulator and observe reward r, and next state s'. 
- store transition (s, a, r, s') in replay buffer
- sample a random mini-batch of transitions from the replay buffer
- for every transition in the mini-batch, calculate the target value y = r (reward) if the episode has ended or 
![y if episode has not ended](https://static.packt-cdn.com/products/9781838826994/graphics/Images/B14854_06_016.png)
- calculate loss as L = (Q(s, a) - y)^2 
- update Q(s, a) using SGD by minimizing loss 
- every N steps, copy weights from Q to Q'
- repeat till convergence

### DQN on Pong

even though rl models are not as compute hungry as state-of-the-art imagenet models, the dqn model from 2015 still has more than 1.5M parameters, so implementation details matter for speed: care should be taken, for example, not to copy weights to the target network too frequently. a naive version of the dqn loss that loops over every sample of the mini-batch is about twice as slow as the parallel version, and a single extra copy of the data batch could make the code 13 times slower. 

#### Wrappers
tackling atari games with rl is quite demanding. so openai provides gym wrappers that apply several transformations to the environment. some of them only affect performance, while others address atari platform features that make learning long and unstable. 

some helpful wrappers used are:
- converting individual lives in a game into separate episodes. only useful in some environments. 
- performing a random number of no-op actions at the beginning of each episode to skip intros etc.
- making an action decision every k steps instead of every single step, repeating the chosen action in between. this reduces the computation required, since consecutive frames usually differ very little.
- taking the pixel-wise maximum of the last two frames, as some atari games have a flicker effect. the flicker is not visible to the human eye but could confuse the neural net. 
- pressing FIRE at the beginning of the game. some games require this to start. 
- scaling every frame down from 210x160 with three color channels to a single-channel 84x84 image. some researchers convert to grayscale, others take the Y channel of the YCbCr color space. 
- clipping the reward to the values -1, 0 and 1. different games can have very different score scales. 
- converting observations from unsigned bytes to float32. 

we don't really need all of these wrappers for simple pong. but when a dqn is not converging, the problem could be a wrongly wrapped environment. 
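
reward clipping is not implemented by the wrappers below, so here is a minimal sketch of the idea as a plain function. `clip_reward` and the sample scores are hypothetical. in a real setup the same logic would live inside a gym reward wrapper.

```python
def clip_reward(reward):
    """Map any reward to -1, 0, or +1 according to its sign."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return -1.0
    return 0.0

raw_rewards = [200, -5, 0, 0.5]  # hypothetical scores from different games
clipped = [clip_reward(r) for r in raw_rewards]
print(clipped)
```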

In [9]:
import cv2
import gym
import gym.spaces
import numpy as np
import collections


In [11]:
class FireResetEnv(gym.Wrapper):
  """
  This wrapper presses FIRE button in environments that require game to start. In addition to pressing FIRE,
  this wrapper checks for some corner cases present in some games. 
  """
  def __init__(self, env = None):
    super(FireResetEnv, self).__init__(env)
    assert env.unwrapped.get_action_meanings()[1] == "FIRE"
    assert len(env.unwrapped.get_action_meanings()) >= 3

  def step(self, action):
    return self.env.step(action)
  
  def reset(self):
    self.env.reset()
    obs, _, done, _ = self.env.step(1)
    if done:
      self.env.reset()
    obs, _, done, _ = self.env.step(2)
    if done:
      self.env.reset()
    return obs

In [12]:
class MaxAndSkipEnv(gym.Wrapper):
  """
  Combines the repetition of actions during k frames and the pixel-wise maximum of two consecutive frames
  """
  def __init__(self, env=None, skip = 4):
    super(MaxAndSkipEnv, self).__init__(env)
    self._obs_buffer = collections.deque(maxlen = 2)
    self._skip = skip
  def step(self, action):
    total_reward = 0.0
    done = None 
    for _ in range(self._skip):
      obs, reward, done, info = self.env.step(action)
      self._obs_buffer.append(obs)
      total_reward += reward
      if done:
        break
    max_frame = np.max(np.stack(self._obs_buffer), axis=0)
    return max_frame, total_reward, done, info 
  def reset(self):
    self._obs_buffer.clear()
    obs = self.env.reset()
    self._obs_buffer.append(obs)
    return obs

In [13]:
class ProcessFrame84(gym.ObservationWrapper):
  """
  converts color 210x160 image to grayscale 84x84. conversion to grayscale is done using colorimetric grayscale
  conversion which is closer to human color perception than simple averaging of color channels. converts to
  grayscale, resizes image to 84x84, crops top and bottom parts as a result.
  """
  def __init__(self, env = None):
    super(ProcessFrame84, self).__init__(env)
    self.observation_space = gym.spaces.Box(low = 0, high = 255, shape = (84, 84, 1), dtype = np.uint8)

  def observation(self, obs):
    return ProcessFrame84.process(obs)

  @staticmethod
  def process(frame):
    if frame.size == 210 * 160 * 3:
      img = np.reshape(frame, [210, 160, 3]).astype(np.float32)
    elif frame.size == 250 * 160 * 3:
      img = np.reshape(frame, [250, 160, 3]).astype(np.float32)
    else:
      assert False, "Unknown resolution."
    # colorimetric grayscale: weighted sum of the R, G and B channels
    img = img[:, :, 0] * 0.299 + img[:, :, 1] * 0.587 + img[:, :, 2] * 0.114
    resized_screen = cv2.resize(img, (84, 110), interpolation = cv2.INTER_AREA)
    # crop the 110-row image to 84 rows, dropping the top and bottom parts
    x_t = resized_screen[18:102, :]
    x_t = np.reshape(x_t, [84, 84, 1])
    return x_t.astype(np.uint8)

In [None]:
class BufferWrapper(gym.ObservationWrapper):
  """
  creates a stack of subsequent frames along the first dimension and returns them as an observation. 
  """
  def __init__(self, env, n_steps, dtype = np.float32):
    super(BufferWrapper, self).__init__(env)
    self.dtype = dtype
    old_space = env.observation_space
    self.observation_space = gym.spaces.Box(
        old_space.low.repeat(n_steps, axis = 0),
        old_space.high.repeat(n_steps, axis = 0), dtype = dtype)
  
  def reset(self):
    self.buffer = np.zeros_like(
        self.observation_space.low, dtype = self.dtype
    )
    return self.observation(self.env.reset())
  
  def observation(self, observation):
    self.buffer[:-1] = self.buffer[1:]
    self.buffer[-1] = observation
    return self.buffer

In [None]:
class ImageToPyTorch(gym.ObservationWrapper):
  """
  this wrapper changes the observation from HWC(height, width, channel) to CHW(channel, height, width) 
  required by pytorch. 
  """
  def __init__(self, env):
    super(ImageToPyTorch, self).__init__(env)
    old_shape = self.observation_space.shape
    new_shape = (old_shape[-1], old_shape[0], old_shape[1])
    self.observation_space = gym.spaces.Box(
        low = 0.0, high = 1.0, shape = new_shape, dtype = np.float32
    )
  
  def observation(self, observation):
    return np.moveaxis(observation, 2, 0)

In [None]:
class ScaledFloatFrame(gym.ObservationWrapper):
  """
  this wrapper converts observation data from byte to float and scales every pixel value to the range 0.0 to 1.0
  """
  def observation(self, obs):
    return np.array(obs).astype(np.float32)/255.0

In [None]:
def make_env(env_name):
  """
  This function creates the environment and applies all wrappers on it
  """
  env = gym.make(env_name)
  env = MaxAndSkipEnv(env)
  env = FireResetEnv(env)
  env = ProcessFrame84(env)
  env = ImageToPyTorch(env)
  env = BufferWrapper(env, 4)
  return ScaledFloatFrame(env)

#### The DQN Model

In [None]:
import torch
import torch.nn as nn
import numpy as np

In [None]:
class DQN(nn.Module):
  def __init__(self, input_shape, n_actions):
    super(DQN, self).__init__()
    self.conv = nn.Sequential(
      nn.Conv2d(input_shape[0], 32, kernel_size = 8, stride = 4),
      nn.ReLU(),
      nn.Conv2d(32, 64, kernel_size = 4, stride = 2),
      nn.ReLU(),
      nn.Conv2d(64, 64, kernel_size = 3, stride = 1),
      nn.ReLU()
    )
    conv_out_size = self._get_conv_out(input_shape)
    self.fc = nn.Sequential(
      nn.Linear(conv_out_size, 512),
      nn.ReLU(),
      nn.Linear(512, n_actions)
    )
  
  def _get_conv_out(self, shape):
    """
    we don't know in advance how many values the convolution layers produce for a given input shape.
    since we need to pass this number to the first fully connected layer, this function runs a fake
    tensor of zeros of that shape through the conv layers and returns the number of values in the output. 
    """
    o = self.conv(torch.zeros(1, *shape))
    return int(np.prod(o.size()))
  
  def forward(self, x):
    """
    this function does a forward pass through both the conv net and the fully connected network.
    the conv output is a 4d tensor: the first dimension is the batch size, the second is the channel
    dimension (at the input, our stack of subsequent frames), and the third and fourth are the image
    dimensions. the fully connected layers expect a 2d tensor, so we flatten everything except the batch
    dimension first. .view() reshapes the tensor without copying the underlying memory. 
    """
    conv_out = self.conv(x).view(x.size()[0], -1)
    return self.fc(conv_out)
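
as a sanity check on the shapes, the sketch below works out by hand what `_get_conv_out` would return for 84x84 inputs, and the resulting number of parameters. the formula assumes no padding and no dilation, which matches the Conv2d calls above. the 4 stacked input frames come from make_env, and the 6 actions are an assumption about pong's action space.

```python
def conv_out_size(size, kernel_size, stride):
    """Output width/height of a Conv2d layer with no padding or dilation."""
    return (size - kernel_size) // stride + 1

size = 84  # input frames are 84x84
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:  # the three conv layers above
    size = conv_out_size(size, kernel_size, stride)    # 84 -> 20 -> 9 -> 7

channels = 64                          # output channels of the last conv layer
flat_features = channels * size * size

# Parameter count, assuming 4 stacked input frames and 6 actions (assumptions).
in_frames, n_actions = 4, 6
params = (
    (in_frames * 8 * 8 + 1) * 32       # conv1 weights + biases
    + (32 * 4 * 4 + 1) * 64            # conv2
    + (64 * 3 * 3 + 1) * 64            # conv3
    + (flat_features + 1) * 512        # first linear layer
    + (512 + 1) * n_actions            # output layer
)
print(size, flat_features, params)
```

almost all of the parameters sit in the first linear layer, which is why the flattened size matters.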

#### Training


In [None]:
!pip install tensorboardX

In [None]:
from lib import wrappers
from lib import dqn_model
import argparse
import time
import numpy as np
import collections
import torch
import torch.nn as nn
import torch.optim as optim
from tensorboardX import SummaryWriter

In [None]:
DEFAULT_ENV_NAME = "PongNoFrameskip-v4"
MEAN_REWARD_BOUND = 19.0

In [None]:
GAMMA = 0.99
BATCH_SIZE = 32 # Batch Size sampled from replay buffer
REPLAY_SIZE = 10000 # Maximum Capacity of the buffer
REPLAY_START_SIZE = 10000 # Count of frames we wait before starting training to populate buffer
LEARNING_RATE = 1e-4 # Learning Rate for Adam Optimizer
SYNC_TARGET_FRAMES = 1000 # Frequency of model weight sync from training model to target model

In [None]:
EPSILON_DECAY_LAST_FRAME = 150000 # During first 150000 frames, epsilon is linearly reduced to 0.01.
# In the original paper this value was 1000000 (1 Million)
EPSILON_START = 1.0
EPSILON_FINAL = 0.01

In [None]:
Experience = collections.namedtuple('Experience', field_names = ['state', 'action', 'reward', 'done', 'new_state'])

In [None]:
class ExperienceBuffer:
  def __init__(self, capacity):
    self.buffer = collections.deque(maxlen=capacity)
  
  def __len__(self):
    return len(self.buffer)
  
  def append(self, experience):
    self.buffer.append(experience)
  
  def sample(self, batch_size):
    indices = np.random.choice(len(self.buffer), batch_size, replace = False)
    states, actions, rewards, dones, next_states = zip(*[self.buffer[idx] for idx in indices])
    return np.array(states), np.array(actions), np.array(rewards, dtype = np.float32), np.array(dones, dtype = np.uint8), np.array(next_states)    

In [None]:
class Agent:
  def __init__(self, env, exp_buffer):
    self.env = env
    self.exp_buffer = exp_buffer
    self._reset()
  
  def _reset(self):
    self.state = self.env.reset()
    self.total_reward = 0.0
  
  @torch.no_grad()
  def play_step(self, net, epsilon = 0.0, device = 'cpu'):
    """
    Plays a step in the environment and stores its result in the buffer. Using epsilon, we either take a random
    step or use the given model to obtain q values of all possible actions and choose the best
    """
    done_reward = None
    if np.random.random() < epsilon:
      action = self.env.action_space.sample()
    else:
      # add a batch dimension of size 1 in front of the observation
      state_a = np.asarray([self.state])
      state_v = torch.tensor(state_a).to(device)
      q_vals_v = net(state_v)
      _, act_v = torch.max(q_vals_v, dim = 1)
      action = int(act_v.item())
    new_state, reward, is_done, _ = self.env.step(action)
    self.total_reward += reward
    exp = Experience(self.state, action, reward, is_done, new_state)
    self.exp_buffer.append(exp)
    self.state = new_state
    if is_done:
      done_reward = self.total_reward
      self._reset()
    return done_reward
  
  def calc_loss(batch, net, tgt_net, device = "cpu"):
    """
    Calculates the loss for the sampled batch. 
    batch: tuple of arrays repacked by sample() in the experience buffer
    net: used to calculate gradients
    tgt_net: used to calculate values for the next states and this won't affect gradients. We use detach()
    to prevent gradients from flowing into the target network's graph. 
    """
    states, actions, rewards, dones, next_states = batch
    states_v = torch.tensor(np.array(states, copy = False)).to(device)
    next_states_v = torch.tensor(np.array(next_states, copy = False)).to(device)
    actions_v = torch.tensor(actions).to(device)
    rewards_v = torch.tensor(rewards).to(device)
    done_mask = torch.BoolTensor(dones).to(device)
    state_action_values = net(states_v).gather(1, actions_v.unsqueeze(-1)).squeeze(-1)
    
