# Playing Atari using Deep Q-Learning

## Problem Statement

Train an agent to play Atari Breakout using Deep Q-Learning.

## Given

The original game frames, each of shape (210, 160, 3):
<img src="https://upload.wikimedia.org/wikipedia/en/thumb/2/2b/Breakout2600.svg/1920px-Breakout2600.svg.png" alt="Atari Breakout" width="25%">

## Goal

Maximize the score by designing a DQN agent that makes optimal decisions based on the observed game frames.

## Brainstorming

[Whiteboard](https://www.tutorialspoint.com/whiteboard.htm)

# Project Pipeline

## A1: Import Dependencies

This activity involves importing the necessary dependencies for the project.

### **A**1.1 Import RL framework, OpenCV and relevant libraries

In [1]:
# TASK BLOCK
# TASK Import gymnasium, numpy, random, skimage, cv2

### **A**1.2 Import tensorflow modules

In [27]:
# TASK BLOCK
# TASK Import tensorflow models, layers, optimizers, etc.

### Assessment

1. Which TensorFlow modules are essential for any deep learning project?
2. How do you import the convolution layers module in TensorFlow?
3. What is `Sequential` in TensorFlow Keras?

## A2: Define gym environment

This activity involves creating the Gymnasium environment and the episode loop for the project.

### **A**2.1 Define RL environment

In [5]:
# TASK BLOCK
# TASK Initialize episodes

In [7]:
# TASK BLOCK
# TASK Create gym loop and make gym environment

In [None]:
# TASK BLOCK
# TASK Initialize done, dead, step, score, start_life
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        # Initialize the variables
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        while not done:
            env.render()

In [3]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        while not done:
            # Take a random action
            env.render()

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]



### **A**2.2 Take action

In [3]:
# TASK BLOCK
# TASK Define start_life and dead variables
# if the agent missed the ball, agent is dead but episode is not over
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            # TASK Define start_life and dead variables
            env.render()


In [None]:
# TASK BLOCK
# TASK Clip reward between -1 and 1
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            # Clip reward between -1 and 1
            env.render()

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
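
The two bookkeeping steps of this activity (detecting a lost life and clipping the reward) can be tried without the emulator. The following is a minimal sketch, assuming a hypothetical helper `process_step` that is not part of the notebook; it receives the raw reward and the `lives` value that the loop reads from `info['lives']`.

```python
import numpy as np

def process_step(reward, lives, start_life):
    """Hypothetical helper: returns (clipped_reward, dead, start_life)."""
    # The agent is "dead" when a life was lost, even though the episode goes on.
    dead = start_life > lives
    if dead:
        start_life = lives
    # Clip the reward to the range [-1, 1], as in the loop above.
    clipped = float(np.clip(reward, -1.0, 1.0))
    return clipped, dead, start_life

# A brick worth 4 points is hit on the same step in which a life is lost.
print(process_step(4.0, lives=4, start_life=5))
# No reward and no life lost.
print(process_step(0.0, lives=4, start_life=4))
```

Keeping `dead` separate from `done` lets later steps treat a lost life as the end of a trajectory without ending the episode.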


### Assessment

1. How do you take a random action in a gymnasium environment?
2. What are the conditions under which the episode ends for the Atari Breakout scenario?
3. Define how to display the frames in Atari.

## A3: Preprocessing Functions

This activity involves defining the preprocessing functions used to transform the game frames.

### **A**3.1 Define pre-processing function

In [1]:
# TASK BLOCK
# TASK Define a function to pre-process the image frames from (210, 160, 3) to (84, 84, 1)
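
One possible shape for this function is sketched below. It is a dependency-light stand-in that uses only NumPy (luminance-weighted grayscale plus nearest-neighbour sampling); the imports in A1 suggest that `skimage` or `cv2` would normally do the conversion and resizing, so treat the method here as an assumption. The function returns an (84, 84) array; the trailing channel axis is added later by the reshape in the loop.

```python
import numpy as np

def pre_processing(observe):
    """Convert an RGB frame (210, 160, 3) to an 84x84 uint8 grayscale image."""
    frame = np.asarray(observe, dtype=np.float32)
    # Luminance-weighted grayscale (ITU-R BT.601 weights).
    gray = frame @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Nearest-neighbour downsampling: pick 84 evenly spaced rows and columns.
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    small = gray[np.ix_(rows, cols)]
    # Store as uint8 to keep the replay memory small.
    return np.uint8(np.clip(np.rint(small), 0, 255))

frame = np.zeros((210, 160, 3), dtype=np.uint8)  # stand-in for an emulator frame
state = pre_processing(frame)
print(state.shape, state.dtype)
```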

In [None]:
# TASK BLOCK
# TASK Update the gym environment to include pre-processed image
EPISODES = 50000
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        # TASK: Add pre-processed state here
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            # TASK: Add pre-processed state here 
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]


### **A**3.2 Define history

In [None]:
# TASK BLOCK
# TASK Initialize initial history
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        state = pre_processing(observe)
        # TASK: Create a stack of 4 states to define the history
        # TASK: Re-shape the history to (1, 84, 84, 4)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()

In [None]:
# TASK BLOCK
# TASK Initialize next history
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            # Re-shape next_state to (1 , 84, 84, 1)
            # Define next history
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            env.render()
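
The history update can be checked in isolation with dummy frames. The sketch below uses placeholder arrays instead of emulator output and follows the same stacking convention as the training loop in A5: the newest frame is placed in channel 0 and the oldest frame is dropped.

```python
import numpy as np

state = np.zeros((84, 84), dtype=np.uint8)            # first pre-processed frame
history = np.stack((state, state, state, state), axis=2)
history = np.reshape([history], (1, 84, 84, 4))

next_state = np.full((84, 84), 7, dtype=np.uint8)     # stand-in for the next frame
next_state = np.reshape([next_state], (1, 84, 84, 1))
# Newest frame goes to channel 0; the oldest frame (channel 3) is dropped.
next_history = np.append(next_state, history[:, :, :, :3], axis=3)
print(next_history.shape)
```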

### Assessment

1. What is the purpose of the `pre_processing()` function?
2. Explain the use of the stacked `history` array that holds the last four frames.
3. How is the state defined in the Atari Breakout environment?

## A4: Create DQNAgent Class

This activity involves defining the `DQNAgent` class, which implements the Deep Q-Network (DQN) algorithm.

### **A**4.1 Initialize DQN Agent

In [None]:
# TASK BLOCK
# Define a class DQNAgent

In [25]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        # TASK - Initialize render and load_model variables
        pass  # placeholder so the empty method body is valid Python

In [27]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK - Define environment settings

In [29]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        # TASK: Define exploration parameters

In [None]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters
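
The epsilon parameters above describe a linear annealing schedule. The sketch below evaluates it in closed form, assuming epsilon is lowered by `epsilon_decay_step` once per step until it reaches `epsilon_end`; the helper name `epsilon_at` is hypothetical and not part of the class.

```python
epsilon_start, epsilon_end = 1.0, 0.1
exploration_steps = 1_000_000
epsilon_decay_step = (epsilon_start - epsilon_end) / exploration_steps

def epsilon_at(step):
    """Epsilon after `step` decrements, floored at epsilon_end."""
    return max(epsilon_end, epsilon_start - epsilon_decay_step * step)

for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, round(epsilon_at(step), 3))
```

With these values the agent acts fully at random at the start and still explores 10% of the time after one million steps.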

In [None]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        # TASK- Build model

In [1]:
# TASK BLOCK
class DQNAgent:
    def __init__(self, action_size):
        self.render = False
        self.load_model = False
        # environment settings
        # TASK: Define environment settings
        self.state_size = (84, 84, 4)
        self.action_size = action_size
        # parameters about epsilon
        self.epsilon = 1.
        self.epsilon_start, self.epsilon_end = 1.0, 0.1
        self.exploration_steps = 1000000.
        self.epsilon_decay_step = (self.epsilon_start - self.epsilon_end) \
                                  / self.exploration_steps
        # parameters about training
        # TASK- Define training parameters        
        self.batch_size = 32
        self.train_start = 50000
        self.update_target_rate = 10000
        self.discount_factor = 0.99
        self.memory = deque(maxlen=400000)
        self.no_op_steps = 30
        # build model
        self.model = self.build_model()
        self.target_model = self.build_model()
        self.update_target_model()
        # TASK- Initialize optimizer, sess, avg_q_max, etc.

### **A**4.2 Initialize functions for DQN Agent

In [7]:
# TASK - Define optimizer function
    # if the error is in [-1, 1], then the cost is quadratic to the error
    # But outside the interval, the cost is linear to the error
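
The cost described in these comments is the Huber loss. A minimal NumPy sketch is shown below, written in the "clip then split" form; the function name `huber_loss` is an assumption, and the optimizer task would express the same arithmetic with TensorFlow operations.

```python
import numpy as np

def huber_loss(error):
    """Quadratic for |error| <= 1, linear beyond that."""
    abs_error = np.abs(error)
    quadratic_part = np.clip(abs_error, 0.0, 1.0)
    linear_part = abs_error - quadratic_part
    return 0.5 * quadratic_part ** 2 + linear_part

print(huber_loss(np.array([0.5, -2.0, 3.0])))
```

Because the cost grows only linearly for large errors, a single surprising transition cannot produce an arbitrarily large gradient.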

In [9]:
# TASK BLOCK
    # approximate Q function using Convolution Neural Network
    # state is input and Q Value of each action is output of network
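
Before building the network it can help to trace the tensor shapes. The sketch below assumes the three unpadded convolution layers used in the referenced rlcode implementation (32 filters 8x8 stride 4, 64 filters 4x4 stride 2, 64 filters 3x3 stride 1); this notebook does not state the layer sizes, so treat them as an assumption.

```python
def conv_output_size(size, kernel, stride):
    """Output width/height of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

# Assumed layers as (filters, kernel, stride); not specified in this notebook.
conv_layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # the input history is (84, 84, 4)
for filters, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    channels = filters
    print(f"Conv {kernel}x{kernel}/{stride}: ({size}, {size}, {channels})")

flattened = size * size * channels
print("flattened features:", flattened)
```

The flattened vector then feeds the dense layers, and the final dense layer has one output per action.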

In [11]:
    # after some time interval update the target model to be same with model
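
The periodic synchronisation can be illustrated without TensorFlow. In the sketch below, `TinyNet` is a hypothetical stand-in that mimics the Keras `get_weights()` / `set_weights()` pair; with real Keras models the update would be `target_model.set_weights(model.get_weights())`.

```python
import copy

class TinyNet:
    """Hypothetical stand-in exposing Keras-style get_weights/set_weights."""
    def __init__(self, weights):
        self._weights = copy.deepcopy(weights)

    def get_weights(self):
        return copy.deepcopy(self._weights)

    def set_weights(self, weights):
        self._weights = copy.deepcopy(weights)

model = TinyNet([[0.0, 0.0]])
target_model = TinyNet([[0.0, 0.0]])
update_target_rate = 3  # 10000 in the notebook; small here for illustration

for global_step in range(1, 7):
    # Pretend that a training step changed the online network's weights.
    model.set_weights([[float(global_step), 0.0]])
    if global_step % update_target_rate == 0:
        target_model.set_weights(model.get_weights())
    print(global_step, target_model.get_weights())
```

Between synchronisations the target network stays fixed, which keeps the training targets stable while the online network changes.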

In [13]:
# TASK BLOCK
    # get action from model using epsilon-greedy policy
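
A minimal sketch of the epsilon-greedy rule follows. It takes the Q-values directly instead of calling the model on a history, so the signature is an assumption made for illustration and differs from the method the task asks for.

```python
import random

def get_action(q_values, epsilon, action_size, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if rng.random() < epsilon:
        return rng.randrange(action_size)
    # Greedy choice: index of the largest Q-value.
    return max(range(action_size), key=lambda a: q_values[a])

q_values = [0.1, 0.9, 0.3]
print(get_action(q_values, epsilon=0.0, action_size=3))  # always the greedy action
print(get_action(q_values, epsilon=1.0, action_size=3))  # always a random action
```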

In [15]:
# TASK BLOCK
    # save sample <s,a,r,s'> to the replay memory
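
Storing transitions needs nothing more than a bounded `deque`. The sketch below uses a tiny capacity and string placeholders for the histories so that the eviction behaviour is visible; a module-level `memory` stands in for `self.memory`.

```python
from collections import deque

memory = deque(maxlen=3)  # tiny capacity for illustration; the notebook uses 400000

def replay_memory(history, action, reward, next_history, dead):
    """Store one transition; the oldest one is evicted when the deque is full."""
    memory.append((history, action, reward, next_history, dead))

for t in range(5):
    replay_memory(f"s{t}", 0, 0.0, f"s{t + 1}", False)

print(len(memory), memory[0][0])
```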

In [17]:
# TASK BLOCK
    # pick samples randomly from replay memory (with batch_size)
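
Once a mini-batch has been sampled, each transition needs a training target. The sketch below shows the standard Q-learning target with plain Python lists; the helper `q_targets` and the stand-in Q-values are assumptions for illustration, and in the agent the next-state values would come from `target_model`.

```python
import random

discount_factor = 0.99

def q_targets(batch, next_q_values):
    """Targets: r if a life was lost, else r + gamma * max_a' Q_target(s', a')."""
    targets = []
    for (_, _, reward, _, dead), next_q in zip(batch, next_q_values):
        if dead:
            targets.append(reward)
        else:
            targets.append(reward + discount_factor * max(next_q))
    return targets

memory = [("s0", 0, 1.0, "s1", False),
          ("s1", 2, 0.0, "s2", True),
          ("s2", 1, 0.0, "s3", False)]
mini_batch = random.sample(memory, 2)              # batch_size is 32 in the notebook
next_q = [[0.5, 2.0, 1.0]] * len(mini_batch)       # stand-in for target_model output
print(q_targets(mini_batch, next_q))
```

Sampling at random breaks the correlation between consecutive frames, which is the main reason for keeping a replay memory.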

In [19]:
# TASK BLOCK
# TASK - Define save_model function

In [21]:
# TASK BLOCK
    # make summary operators for tensorboard

### Assessment

1. What is the purpose of the `build_model()` method in the `DQNAgent` class?
2. Explain the purpose of the `Conv2D` layers in the neural network.
3. What is the activation function used in the last `Dense` layer of the model, and why is it chosen?

## A5: Main Training Loop

This activity involves defining the main training loop for the reinforcement learning agent.

### **A**5.1 Predict action

In [50]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
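        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)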
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        # TASK- Take empty steps in the beginning
        # This is one of DeepMind's ideas: take a random number of steps at the
        # start of each episode, before the agent acts, to avoid a sub-optimal start
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            env.render()
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)



In [None]:
# TASK BLOCK
if __name__ == "__main__":
    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    for e in range(EPISODES):
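        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)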
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            env.render()
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            # Update steps
            # TASK- get action for the current history and go one step in environment


In [None]:
# TASK BLOCK
if __name__ == "__main__":
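    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    global_step = 0  # environment steps counted across all episodes
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)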
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            # TASK- Change the action from random action to DQN policy
            # action = 
            # action = env.action_space.sample()
            observe, reward, done, truncate, info = env.step(action)
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1


### **A**5.2 Perform Q-Learning

In [None]:
# TASK BLOCK
if __name__ == "__main__":
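    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    # TASK- Define DQN agent
    global_step = 0  # environment steps counted across all episodes
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)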
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
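
The `if`/`elif` chain in the loop shifts the agent's three outputs onto emulator action ids 1 to 3, skipping id 0. The sketch below restates that mapping; the action names are the usual meanings of Breakout's minimal action set and are an assumption here (they can be checked with `env.unwrapped.get_action_meanings()`), and `to_real_action` is a hypothetical helper.

```python
# Assumed meanings of the Breakout action ids (minimal action set).
ACTION_MEANINGS = {0: "NOOP", 1: "FIRE", 2: "RIGHT", 3: "LEFT"}

def to_real_action(action):
    """Map the agent's output index (0, 1, 2) to the emulator action id (1, 2, 3)."""
    return action + 1

for action in range(3):
    real_action = to_real_action(action)
    print(action, "->", real_action, ACTION_MEANINGS[real_action])
```

Dropping the no-op action is why the agent is later created with `action_size=3` even though the environment exposes four actions.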

### **A**5.3 Sample from replay memory

In [None]:
# TASK BLOCK
EPISODES = 50000
if __name__ == "__main__":
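    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    agent = DQNAgent(action_size=3)
    global_step = 0  # environment steps counted across all episodes
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)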
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # TASK- Save the sample <s, a, r, s'> to the replay memory
            

In [None]:
# TASK BLOCK
if __name__ == "__main__":
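    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    agent = DQNAgent(action_size=3)
    global_step = 0  # environment steps counted across all episodes
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)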
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model on a batch sampled from the replay memory
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            # TASK- If agent is dead, then reset the history

            score += reward

In [None]:
# TASK BLOCK
if __name__ == "__main__":
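    env = gym.make("ALE/Breakout-v5", render_mode = 'human')
    agent = DQNAgent(action_size=3)
    global_step = 0  # environment steps counted across all episodes
    for e in range(EPISODES):
        done = False
        dead = False
        step, score, start_life = 0, 0, 5
        observe, _ = env.reset()  # Gymnasium's reset() returns (observation, info)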
        state = pre_processing(observe)
        history = np.stack((state, state, state, state), axis=2)
        history = np.reshape([history], (1, 84, 84, 4))
        for _ in range(random.randint(1, agent.no_op_steps)):
            observe, _, _, _,_ = env.step(1)
        while not done:
            action = agent.get_action(history)
            if action == 0:
                real_action = 1
            elif action == 1:
                real_action = 2
            else:
                real_action = 3
            observe, reward, done, truncate, info = env.step(real_action)
            # TASK- Define average Q-max
            agent.avg_q_max += np.amax(
                agent.model.predict(np.float32(history / 255.))[0])
            next_state = pre_processing(observe)
            next_state = np.reshape([next_state], (1, 84, 84, 1))
            next_history = np.append(next_state, history[:, :, :, :3], axis=3)
            if start_life > info['lives']:
                dead = True
                start_life = info['lives']
            reward = np.clip(reward, -1., 1.)
            if agent.render:
                env.render()
            global_step += 1
            step += 1
            # save the sample <s, a, r, s'> to the replay memory
            agent.replay_memory(history, action, reward, next_history, dead)
            # train the model on a batch sampled from the replay memory
            agent.train_replay()
            # update the target model with model
            if global_step % agent.update_target_rate == 0:
                agent.update_target_model()
            if dead:
                dead = False
            else:
                history = next_history
            score += reward
            # TASK - Plot the score over episodes

### Assessment

1. How does the DQN agent update its Q-network during training?
2. What is the purpose of the `train_replay` method in the DQN loop, and when is it called?
3. Explain the role of the loss variable in the DQN loop and how it is calculated during training.

# References

* https://github.com/rlcode/reinforcement-learning