# Playing Atari using Deep Q-Learning

## Problem Statement

Train an agent to play Atari Breakout using Deep Q-Learning.

## Given

The original game frames, each of shape (210, 160, 3): height, width, and RGB channels.

## Goal

Maximize the score by designing a DQN agent that chooses actions based on the observed game frames.

## Brainstorming

[Whiteboard](https://www.tutorialspoint.com/whiteboard.htm)

# Project Pipeline

## A1: Import Dependencies

This activity involves importing the necessary dependencies for the project.

In [1]:
# TASK BLOCK
# TASK Import gymnasium, numpy, random, skimage, cv2

### **A**1.1 Import RL framework, OpenCV and relevant libraries

In [5]:
# SOLUTION BLOCK
import gymnasium as gym
import random
import numpy as np
from collections import deque
from skimage.color import rgb2gray
from skimage.transform import resize
import cv2

### **A**1.2 Import tensorflow modules

In [6]:
# TASK BLOCK
# TASK Import tensorflow models, layers, optimizers, etc.

In [7]:
# SOLUTION BLOCK
import tensorflow.compat.v1 as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers.legacy import RMSprop
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras import backend as K
from tensorflow.python.framework.ops import disable_eager_execution
tf.disable_v2_behavior()
disable_eager_execution()

### Assessment

1. Which TensorFlow modules are essential for any deep learning project?
2. How do you import the convolution layers module in TensorFlow?
3. What is `Sequential` in TensorFlow Keras?

## A2: Define gym environment

This activity involves defining the constants for the project, creating the Gymnasium environment, and building the basic episode loop.

### **A**2.1 Define RL environment

In [5]:
# TASK BLOCK
# TASK Initialize episodes

In [6]:
# SOLUTION BLOCK
EPISODES = 50000

In [7]:
# TASK BLOCK
# TASK Create gym loop and make gym environment

In [9]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        observe = env.reset()
        while not done:
            env.render()

NameError: name 'done' is not defined

In [None]:
# TASK BLOCK
# TASK Initialize done, dead, step, score, start_life
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        # Initialize the variables
        observe = env.reset()
        while not done:
            env.render()

In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            env.render()

In [1]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            # Take a random action
            env.render()

### **A**2.2 Take action

In [2]:
# SOLUTION BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            env.render()

In [3]:
# TASK BLOCK
# TASK Define start_life and dead variables
# if the agent missed the ball, agent is dead but episode is not over
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            # TASK Define start_life and dead variables
            env.render()

In [4]:
# SOLUTION BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            env.render()

In [None]:
# TASK BLOCK
# TASK Clip reward between -1 and 1
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            # Clip reward between -1 and 1
            env.render()

In [None]:
# SOLUTION BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()
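
The loop above stops only on `done` and leaves `truncate` unused. In Gymnasium, `env.reset()` returns an `(observation, info)` pair and `env.step()` returns five values, so an episode can also end through truncation. The sketch below illustrates that control flow together with life-loss detection and reward clipping. `StubEnv` is a hypothetical stand-in written for this example (it is not part of Gymnasium or ALE), used so the loop runs without an emulator.

```python
import random


class StubEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, lives=5, max_steps=50, seed=0):
        self.start_lives = lives
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.lives = self.start_lives
        self.t = 0
        return "frame", {"lives": self.lives}

    def step(self, action):
        self.t += 1
        if self.rng.random() < 0.2:          # the ball is missed now and then
            self.lives -= 1
        reward = self.rng.choice([0.0, 1.0, 4.0, 7.0])
        terminated = self.lives == 0          # game over
        truncated = self.t >= self.max_steps  # time limit reached
        return "frame", reward, terminated, truncated, {"lives": self.lives}


def clip(value, low=-1.0, high=1.0):
    """Scalar equivalent of np.clip(value, low, high)."""
    return max(low, min(high, value))


def run_episode(env):
    observe, info = env.reset()               # reset returns (observation, info)
    start_life = info["lives"]
    done, score, lives_lost = False, 0.0, 0
    while not done:
        observe, reward, terminated, truncated, info = env.step(0)
        done = terminated or truncated        # either flag ends the episode
        dead = start_life > info["lives"]     # a life was lost on this step
        if dead:
            lives_lost += 1
            start_life = info["lives"]
        score += clip(reward)                 # clipped reward, as in the solution
    return score, lives_lost, info["lives"]


score, lives_lost, lives_left = run_episode(StubEnv())
```

Here `dead` is recomputed on every step, so it marks only the transition on which a life was lost.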

### Assessment

1. How do you take a random action in a gymnasium environment?
2. What are the conditions under which the episode ends for the Atari Breakout scenario?
3. Define how to display the frames in Atari.

## A3: Preprocessing Functions

This activity involves defining the preprocessing functions used to transform the game frames.

### **A**3.1 Define pre-processing function

In [1]:
# TASK BLOCK
# TASK Define a function to pre-process the image frames from (210, 160, 3) to (84, 84, 1)

In [8]:
# SOLUTION BLOCK
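def pre_processing(observe):
    # env.reset() returns (observation, info); env.step() returns the frame itself
    if isinstance(observe, tuple):
        observe = observe[0]
    observe = np.asarray(observe)
    # A3.1 Convert to gray scale, resize to 84x84, and rescale to 0-255
    processed_observe = np.uint8(
        resize(rgb2gray(observe), (84, 84), mode='constant') * 255)
    return processed_observe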
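
To make the shape and dtype contract of the pre-processing step concrete, here is a NumPy-only sketch. It is not the `skimage` pipeline used above: the luminance weights and the nearest-neighbour downsampling are assumptions swapped in so the example runs without extra dependencies, and `to_gray_84` is a hypothetical helper name.

```python
import numpy as np


def to_gray_84(frame, size=84):
    """Convert an (H, W, 3) uint8 frame to a (size, size) uint8 grayscale image."""
    frame = np.asarray(frame, dtype=np.float64)
    # Weighted sum over the RGB axis (assumed luminance weights, summing to 1).
    gray = frame @ np.array([0.2125, 0.7154, 0.0721])
    # Nearest-neighbour downsampling: pick evenly spaced rows and columns.
    rows = np.arange(size) * gray.shape[0] // size
    cols = np.arange(size) * gray.shape[1] // size
    small = gray[rows][:, cols]
    return np.clip(np.round(small), 0, 255).astype(np.uint8)


frame = np.random.default_rng(0).integers(0, 256, size=(210, 160, 3), dtype=np.uint8)
state = to_gray_84(frame)
white = to_gray_84(np.full((210, 160, 3), 255, dtype=np.uint8))
```

Storing the result as `uint8` rather than float keeps each processed frame at 84 x 84 = 7,056 bytes, which matters once frames are kept in a large replay memory.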

In [None]:
# TASK BLOCK
# TASK Update the gym environment to include pre-processed image
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        # TASK: Add pre-processed state here
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            # TASK: Add pre-processed state here 
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

In [9]:
# SOLUTION BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        state = pre_processing(observe)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

### **A**3.2 Define history

In [None]:
# TASK BLOCK
# TASK Initialize initial history
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        state = pre_processing(observe)
        # TASK: Create a stack of 4 states to define the history
        # TASK: Re-shape the history to (1, 84, 84, 4)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

In [None]:
# TASK BLOCK
# TASK Initialize next history
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            # Re-shape next_state to (1 , 84, 84, 1)
            # Define next history
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

In [None]:
# SOLUTION BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()
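
The solution computes `next_history` but never copies it back into `history`, so the stack would not advance between steps. The sketch below shows the rolling update in isolation, assuming the newest frame is placed in channel 0 as in the `np.append` call above. `init_history` and `push_frame` are hypothetical helper names, and the constant-valued frames are dummy data.

```python
import numpy as np


def init_history(state):
    """Repeat the first frame four times and add a batch axis: (1, 84, 84, 4)."""
    history = np.stack((state, state, state, state), axis=2)
    return np.reshape(history, (1, 84, 84, 4))


def push_frame(history, next_state):
    """Put the newest frame in channel 0 and drop the oldest frame."""
    next_state = np.reshape(next_state, (1, 84, 84, 1))
    return np.append(next_state, history[:, :, :, :3], axis=3)


# Dummy frames whose pixel value equals their time index.
frames = [np.full((84, 84), i, dtype=np.uint8) for i in range(6)]
history = init_history(frames[0])
for frame in frames[1:]:
    history = push_frame(history, frame)   # carry the stack forward each step

newest_first = [int(history[0, 0, 0, c]) for c in range(4)]
```

After five pushes the four channels hold frames 5, 4, 3, and 2, in that order.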

### Assessment

1. What is the purpose of the `pre_processing()` function?
2. Explain why four consecutive frames are stacked into `history`.
3. How is the state defined in the Atari Breakout environment?

## A4: Create DQNAgent Class

This activity involves defining the `DQNAgent` class, which implements the Deep Q-Network (DQN) algorithm.

### **A**4.1 Initialize DQN Agent

In [None]:
# TASK BLOCK
# Define a class DQNAgent

In [24]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        pass  # placeholder; the body is filled in step by step below

In [25]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        # TASK - Initialize render and load_model variables
        pass

In [26]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings

In [27]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK - Define environment settings

In [28]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size

In [29]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # TASK: Define exploration parameters

In [30]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
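
As a worked check of the exploration schedule, the following sketch writes the linear decay in closed form. `linear_epsilon` is a hypothetical helper; the agent itself subtracts `epsilon_decay_step` once per training step instead of calling a function like this.

```python
def linear_epsilon(step, start=1.0, end=0.1, exploration_steps=1_000_000):
    """Epsilon after `step` decay steps, floored at `end`."""
    decay = (start - end) / exploration_steps   # 9e-7 per step with these defaults
    return max(end, start - decay * step)


halfway = linear_epsilon(500_000)
floor = linear_epsilon(2_000_000)
```

With the values above, epsilon falls from 1.0 to 0.1 over one million decay steps and stays at 0.1 afterwards.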

In [None]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters

In [None]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        # parameters about training
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
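
A replay memory of 400,000 transitions is large. The arithmetic below gives a rough lower bound on its size, assuming each transition stores two `uint8` histories of shape (84, 84, 4) and ignoring Python object overhead. `replay_bytes` is a hypothetical helper for this estimate.

```python
def replay_bytes(capacity=400_000, state_shape=(84, 84, 4), bytes_per_value=1):
    """Bytes needed for `history` and `next_history` across a full memory."""
    values_per_history = 1
    for dim in state_shape:
        values_per_history *= dim               # 84 * 84 * 4 = 28,224
    return capacity * 2 * values_per_history * bytes_per_value


total_bytes = replay_bytes()
total_gib = total_bytes / 2**30
```

Under these assumptions a full memory needs about 22.6 GB (roughly 21 GiB), so `maxlen` may need to be reduced on machines with less RAM.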

In [None]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        # parameters about training
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        # TASK- Build model

In [None]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        # parameters about training
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        self.model = self.build_model()
        self.target_model = self.build_model()
        self.update_target_model()

In [1]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        # parameters about training
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        self.model = self.build_model()
        self.target_model = self.build_model()
        self.update_target_model()
        # TASK- Initialize optimizer, sess, avg_q_max, etc.

In [53]:
# SOLUTION BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        self.model = self.build_model()
        self.target_model = self.build_model()
        self.update_target_model()
        # optimizer, session, and running statistics
        # (self.optimizer is rebound from the method to the training function it returns)
        self.optimizer = self.optimizer()
        self.sess = tf.InteractiveSession()
        self.avg_q_max, self.avg_loss = 0, 0
        self.summary_placeholders, self.update_ops, self.summary_op = \
            self.setup_summary()
        self.summary_writer = tf.summary.FileWriter(
            'summary/breakout_dqn', self.sess.graph)
        self.sess.run(tf.global_variables_initializer())

        if self.load_model:
            self.model.load_weights("./save_model/breakout_dqn.h5")

### **A**4.2 Initialize functions for DQN Agent

In [54]:
# TASK BLOCK
# TASK - Define optimizer function
# If the error is in [-1, 1], the cost is quadratic in the error.
# Outside that interval, the cost is linear in the error.

In [55]:
# SOLUTION BLOCK
class DQNAgent:
    def optimizer(self):
        a = K.placeholder(shape=(None,), dtype='int32') # Input values for action indices
        y = K.placeholder(shape=(None,), dtype='float32') # Target Q-Value

        py_x = self.model.output # Predicted Q-Value

        a_one_hot = K.one_hot(a, self.action_size) # One-hot encoding of action indices
        q_value = K.sum(py_x * a_one_hot, axis=1) # Q-Value for the selected action
        error = K.abs(y - q_value) # Absolute difference between the Target Q-Value and the predicted Q-Value

        quadratic_part = K.clip(error, 0.0, 1.0)# Error clipped to limit the impact of large values on the training process
        linear_part = error - quadratic_part # Linear part of the error
        loss = K.mean(0.5 * K.square(quadratic_part) + linear_part)

        optimizer = RMSprop(learning_rate=0.00025, epsilon=0.01) # The RMSprop optimizer is used to update the weights of the neural network during training
        updates = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)
        train = K.function([self.model.input, a, y], [loss], updates=updates)

        return train
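
The quadratic and linear split in `optimizer()` can be checked on plain numbers. This sketch reproduces the per-sample loss for scalar inputs; `clipped_loss` is a hypothetical name, and the batch mean taken by `K.mean` is left out.

```python
def clipped_loss(target, q_value):
    """Per-sample loss: quadratic for |error| <= 1, linear beyond that."""
    error = abs(target - q_value)
    quadratic_part = min(max(error, 0.0), 1.0)   # same role as K.clip(error, 0.0, 1.0)
    linear_part = error - quadratic_part
    return 0.5 * quadratic_part ** 2 + linear_part


small_error = clipped_loss(1.0, 0.5)   # error 0.5 lies in the quadratic region
large_error = clipped_loss(3.0, 0.0)   # error 3.0 lies in the linear region
```

An error of 0.5 costs 0.125, while an error of 3.0 costs 2.5 rather than the 4.5 a purely quadratic loss would give, which limits the gradient contributed by outliers.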

In [56]:
# TASK BLOCK
    # approximate Q function using Convolution Neural Network
    # state is input and Q Value of each action is output of network

In [57]:
# SOLUTION BLOCK
class DQNAgent:
    def build_model(self):
        model = Sequential()
        model.add(Conv2D(32, (8, 8), strides=(4, 4), activation='relu',
                         input_shape=self.state_size))
        model.add(Conv2D(64, (4, 4), strides=(2, 2), activation='relu'))
        model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu'))
        model.add(Flatten())
        model.add(Dense(512, activation='relu'))
        model.add(Dense(self.action_size))
        model.summary()
        return model
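
The layer shapes that `model.summary()` prints can be derived by hand. The sketch below does the arithmetic, assuming the Keras default `padding='valid'` for `Conv2D` and, for the parameter count only, a hypothetical `action_size` of 4.

```python
def conv_out(size, kernel, stride):
    """Output width/height of a 'valid' convolution."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


sizes = []
size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)                       # 20, then 9, then 7

flat = sizes[-1] * sizes[-1] * 64            # input width of the Dense(512) layer

action_size = 4                              # assumption for illustration only
total_params = (
    conv_params(8, 4, 32)
    + conv_params(4, 32, 64)
    + conv_params(3, 64, 64)
    + (flat * 512 + 512)                     # Dense(512)
    + (512 * action_size + action_size)      # output layer, one Q-value per action
)
```

Most of the parameters sit in the first dense layer, which maps the 3,136 flattened features to 512 units.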

In [58]:
# TASK BLOCK
# TASK - After a fixed number of steps, update the target model to match the model

In [59]:
# SOLUTION BLOCK
class DQNAgent:
    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

In [60]:
# TASK BLOCK
   # get action from model using epsilon-greedy policy

In [1]:
# SOLUTION BLOCK
class DQNAgent:
    def get_action(self, history):
        history = np.float32(history / 255.0)
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        else:
            q_value = self.model.predict(history)
            return np.argmax(q_value[0])
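
The epsilon-greedy rule can be exercised without a network. In this sketch the Q-values are a plain list standing in for `self.model.predict(history)[0]`; `epsilon_greedy` is a hypothetical helper and the seeded `random.Random` instance is only there to make the run repeatable.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest Q-value."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


q_values = [0.1, 0.9, 0.3]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=random.Random(0))
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(s))
                  for s in range(20)]
```

With `epsilon=0.0` the call returns the index of the largest Q-value; with `epsilon=1.0` every call is a uniform random action.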

In [62]:
# TASK BLOCK
    # save sample <s,a,r,s'> to the replay memory

In [2]:
# SOLUTION BLOCK
class DQNAgent:
    def replay_memory(self, history, action, reward, next_history, dead):
        self.memory.append((history, action, reward, next_history, dead))
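
Because the memory is a `deque` with `maxlen`, appending to a full memory silently discards the oldest transition. A small sketch with a capacity of 3 and placeholder strings in place of frame stacks:

```python
import random
from collections import deque

memory = deque(maxlen=3)
for t in range(5):
    # (history, action, reward, next_history, dead) with dummy values
    memory.append((f"s{t}", t % 2, 0.0, f"s{t + 1}", False))

oldest_state = memory[0][0]            # the first two transitions were evicted
mini_batch = random.sample(memory, 2)  # uniform sampling without replacement
```

Uniform sampling breaks the correlation between consecutive frames, which is the reason transitions are stored and replayed instead of being trained on in order.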

In [64]:
# TASK BLOCK
    # pick samples randomly from replay memory (with batch_size)

In [65]:
# SOLUTION BLOCK
class DQNAgent:
    def train_replay(self):
        if len(self.memory) < self.train_start:
            return
        if self.epsilon > self.epsilon_end:
            self.epsilon -= self.epsilon_decay_step

        mini_batch = random.sample(self.memory, self.batch_size)

        history = np.zeros((self.batch_size, self.state_size[0],
                            self.state_size[1], self.state_size[2]))
        next_history = np.zeros((self.batch_size, self.state_size[0],
                                 self.state_size[1], self.state_size[2]))
        target = np.zeros((self.batch_size,))
        action, reward, dead = [], [], []

        for i in range(self.batch_size):
            history[i] = np.float32(mini_batch[i][0] / 255.)
            next_history[i] = np.float32(mini_batch[i][3] / 255.)
            action.append(mini_batch[i][1])
            reward.append(mini_batch[i][2])
            dead.append(mini_batch[i][4])

        target_value = self.target_model.predict(next_history)

        # like Q Learning, get maximum Q value at s'
        # But from target model
        for i in range(self.batch_size):
            if dead[i]:
                target[i] = reward[i]
            else:
                target[i] = reward[i] + self.discount_factor * \
                                        np.amax(target_value[i])

        loss = self.optimizer([history, action, target])
        self.avg_loss += loss[0]
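
The target computation in `train_replay()` is the one-step Q-learning target, with the bootstrap term dropped when a life was lost. A minimal sketch on plain lists, where `td_targets` is a hypothetical helper and `next_q_values` stands in for the output of `self.target_model.predict(next_history)`:

```python
def td_targets(rewards, next_q_values, dead_flags, discount=0.99):
    """Return r for terminal samples, else r + discount * max_a Q_target(s', a)."""
    targets = []
    for reward, q_next, dead in zip(rewards, next_q_values, dead_flags):
        if dead:
            targets.append(reward)
        else:
            targets.append(reward + discount * max(q_next))
    return targets


targets = td_targets(
    rewards=[1.0, 0.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dead_flags=[False, True],
)
```

The first sample bootstraps from the best next action (1.0 + 0.99 * 2.0), while the second ignores its next-state values because `dead` is set.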

In [66]:
# TASK BLOCK
# TASK - Define save_model function

In [67]:
# SOLUTION BLOCK
class DQNAgent:
    def save_model(self, name):
        self.model.save_weights(name)

In [68]:
# TASK BLOCK
    # make summary operators for tensorboard

In [69]:
# SOLUTION BLOCK
class DQNAgent:
    def setup_summary(self):
        episode_total_reward = tf.Variable(0.)
        episode_avg_max_q = tf.Variable(0.)
        episode_duration = tf.Variable(0.)
        episode_avg_loss = tf.Variable(0.)

        tf.summary.scalar('Total Reward/Episode', episode_total_reward)
        tf.summary.scalar('Average Max Q/Episode', episode_avg_max_q)
        tf.summary.scalar('Duration/Episode', episode_duration)
        tf.summary.scalar('Average Loss/Episode', episode_avg_loss)

        summary_vars = [episode_total_reward, episode_avg_max_q,
                        episode_duration, episode_avg_loss]
        summary_placeholders = [tf.placeholder(tf.float32) for _ in
                                range(len(summary_vars))]
        update_ops = [summary_vars[i].assign(summary_placeholders[i]) for i in
                      range(len(summary_vars))]
        summary_op = tf.summary.merge_all()
        return summary_placeholders, update_ops, summary_op

### Assessment

1. What is the purpose of the `build_model()` method in the `DQNAgent` class?
2. Explain the purpose of the `Conv2D` layers in the neural network.
3. What is the activation function used in the last `Dense` layer of the model, and why is it chosen?

## A5: Main Training Loop

This activity involves defining the main training loop for the reinforcement learning agent.

### **A**5.1 Predict action

In [50]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        # TASK- Take empty steps in the beginning
        # This is one of DeepMind's ideas: do nothing for a random number of
        # steps at the start of each episode so that the agent does not always
        # begin from the same state.
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            env.render()
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)


In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            env.render()
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)


In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        observe = env.reset()
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            env.render()
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            # Update steps
            # TASK- get action for the current history and go one step in environment


In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
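        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()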
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
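
The `history` tensor in the loop above always holds the four most recent frames, with the newest frame in channel 0. A minimal NumPy-only sketch of that bookkeeping follows, using constant 84x84 arrays in place of real preprocessed frames. The helper names `init_history` and `push_frame` are illustrative and not part of the notebook.

```python
import numpy as np


def init_history(frame):
    """Stack the first 84x84 frame four times -> shape (1, 84, 84, 4)."""
    history = np.stack((frame, frame, frame, frame), axis=2)
    return np.reshape(history, (1, 84, 84, 4))


def push_frame(history, frame):
    """Put the newest frame in channel 0 and drop the oldest channel."""
    next_state = np.reshape(frame, (1, 84, 84, 1))
    return np.append(next_state, history[:, :, :, :3], axis=3)


first = np.zeros((84, 84), dtype=np.uint8)      # stand-in for the reset frame
newest = np.full((84, 84), 7, dtype=np.uint8)   # stand-in for the next frame

history = init_history(first)
next_history = push_frame(history, newest)
print(next_history.shape)
```

Keeping the newest frame at index 0 matches the `np.append(next_state, history[:, :, :, :3], axis=3)` call in the loop, so the three most recent old frames shift back by one channel on every step.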


In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
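        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()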
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            # TASK- Change the action from random action to DQN policy
            # action = 
            # action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1


In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
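        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()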
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            # TASK- Change the action from random action to DQN policy
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
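
The `if`/`elif` chain above translates the agent's three outputs (0, 1, 2) into Breakout's environment actions, where 0 is NOOP, 1 is FIRE, 2 is RIGHT and 3 is LEFT. The same mapping can be sketched as a lookup table. `ACTION_MAP` and `to_env_action` are hypothetical names introduced only for this illustration.

```python
# Agent action index -> Breakout action id (1 FIRE, 2 RIGHT, 3 LEFT).
ACTION_MAP = {0: 1, 1: 2, 2: 3}


def to_env_action(action):
    """Map the agent's output to an environment action.

    Anything other than 0 or 1 falls back to 3, like the else branch above.
    """
    return ACTION_MAP.get(int(action), 3)


real_actions = [to_env_action(a) for a in range(3)]
print(real_actions)
```

Skipping NOOP keeps the action space at three, so the network only needs three Q-value outputs.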

### **A**5.2 Perform Q-Learning

In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
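        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()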
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1

In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
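        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()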
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
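
The average Q-max statistic accumulates the largest predicted Q-value at every step and is later divided by the number of steps in the episode. The sketch below reproduces that arithmetic with a random linear function standing in for `agent.model.predict`. The stand-in `predict`, its weights, and the random input frames are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the Q-network: one linear layer with 3 outputs.
weights = rng.normal(size=(84 * 84 * 4, 3))


def predict(history):
    """Return Q-values of shape (1, 3) for a (1, 84, 84, 4) input."""
    return history.reshape(1, -1) @ weights


avg_q_max, step = 0.0, 0
for _ in range(5):
    # Random uint8 frames in place of real preprocessed game frames.
    history = rng.integers(0, 256, size=(1, 84, 84, 4), dtype=np.uint8)
    q_values = predict(np.float32(history / 255.))[0]
    avg_q_max += np.amax(q_values)   # best Q-value at this step
    step += 1

average_q = avg_q_max / float(step)  # value reported at the end of an episode
print(average_q)
```

A rising average Q-max over episodes is often used as a rough sign that the network's value estimates are growing, although it does not by itself prove that the policy is improving.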

### **A**5.3 Sample from replay memory

In [None]:
# TASK BLOCK
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
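        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()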
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # TASK- Save the sample <s, a, r, s'> to the replay memory
            

In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
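        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()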
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # TASK- Save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model at regular intervals
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()

            score += reward
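
The loop above stores each transition, trains from stored samples, and periodically syncs the target network. One common way to implement the storage and the training targets is sketched below. It is not necessarily how `agent.replay_memory` and `agent.train_replay` are written in this notebook. The capacity, the discount factor `GAMMA`, and the helper names `remember` and `td_targets` are all illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99                 # illustrative discount factor
memory = deque(maxlen=1000)  # illustrative capacity; oldest samples fall off


def remember(history, action, reward, next_history, dead):
    """Store one <s, a, r, s'> transition plus the life-lost flag."""
    memory.append((history, action, reward, next_history, dead))


def td_targets(rewards, dead, q_next):
    """y = r when a life was lost, else r + GAMMA * max_a' Q_target(s', a')."""
    rewards = np.asarray(rewards, dtype=np.float32)
    alive = ~np.asarray(dead, dtype=bool)
    return rewards + GAMMA * np.max(q_next, axis=1) * alive


# Fill the memory with dummy transitions and draw a random mini-batch.
for i in range(10):
    frame = np.zeros((1, 84, 84, 4), dtype=np.uint8)
    remember(frame, i % 3, float(i % 2), frame, i == 9)
batch = random.sample(memory, 4)

# Targets for two hand-made transitions with made-up target-network outputs.
q_next = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32)
targets = td_targets([1.0, 0.0], [False, True], q_next)
print(len(batch), targets)
```

Sampling uniformly from the buffer breaks the correlation between consecutive frames, and computing `q_next` with a separate, slowly updated target network keeps the regression targets from shifting on every gradient step.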

In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
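        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()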
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model at regular intervals
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            # TASK- If agent is dead, then reset the history

            score += reward

In [None]:
# SOLUTION BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
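        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()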
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model at regular intervals
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            # TASK- If agent is dead, then reset the history
            if dead:
                dead = False
            else:
                history = next_history
            score += reward
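
The life-tracking and history update at the end of each step can be read as one small pure function: a life is lost when the stored life count exceeds the count reported in `info['lives']`, and the history only advances when no life was lost. The sketch below mirrors that logic with plain values. `end_of_step` is a hypothetical helper, and the strings stand in for real history arrays.

```python
def end_of_step(history, next_history, start_life, lives):
    """Return (history for the next step, updated life count, life-lost flag)."""
    dead = start_life > lives
    # Mirror the loop: on a lost life the history is left as it was,
    # otherwise it advances to next_history.
    new_history = history if dead else next_history
    return new_history, lives, dead


no_loss = end_of_step("h0", "h1", 5, 5)
life_lost = end_of_step("h0", "h1", 5, 4)
print(no_loss, life_lost)
```

In the loop itself the `dead` flag is passed to `agent.replay_memory` before being reset to `False`, so the stored transition still records that the step ended a life.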

In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
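        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()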
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model at regular intervals
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            if dead:
                dead = False
            else:
                history = next_history
            score += reward
            # TASK - Plot the score over episodes

In [None]:
# SOLUTION BLOCK
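EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    # total environment steps across all episodes; initialized here in case
    # no earlier cell has defined it
    global_step = 0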
    for e in range(EPISODES):
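        done = False
        dead = False
        # per-episode counters; Breakout starts each episode with 5 lives
        step, score, start_life = 0, 0, 5
        # gymnasium's reset() returns (observation, info)
        observe, _ = env.reset()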
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # treat a time-limit truncation as the end of the episode too
            done = done or truncate
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model at regular intervals
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            if dead:
                dead = False
            else:
                history = next_history
            score += reward

            # if done, plot the score over episodes
            if done:
                if global_step > agent.train_start:
                    stats = [score, agent.avg_q_max / float(step), step,
                             agent.avg_loss / float(step)]
                    for i in range(len(stats)):
                        agent.sess.run(agent.update_ops[i], feed_dict={
                            agent.summary_placeholders[i]: float(stats[i])
                        })
                    summary_str = agent.sess.run(agent.summary_op)
                    agent.summary_writer.add_summary(summary_str, e + 1)

                print("episode:", e, "  score:", score, "  memory length:",
                      len(agent.memory), "  epsilon:", agent.epsilon,
                      "  global_step:", global_step, "  average_q:",
                      agent.avg_q_max / float(step), "  average loss:",
                      agent.avg_loss / float(step))

                agent.avg_q_max, agent.avg_loss = 0, 0

        if e % 1000 == 0:
            agent.model.save_weights("./save_model/breakout_dqn.h5")

### Assessment

1. How does the DQN agent update its Q-network during training?
2. What is the purpose of the train_replay method in the DQN loop, and when is it called?
3. Explain the role of the loss variable in the DQN loop and how it is calculated during training.

# References

* https://github.com/rlcode/reinforcement-learning