# 4. Challenge - Beating Atari Breakout with DQN

![Atari breakout game](https://upload.wikimedia.org/wikipedia/commons/5/53/Atari_breakout.jpg)


## Background

One of the main differences between Q-learning and DQN is that DQN uses a neural network to approximate the value of an action in a given state, instead of trying to memorize the value of every state-action pair directly.

In this challenge, we're going to build an agent that solves the **Atari Breakout** game. As it has many different states and possible actions, it would be unreasonable to build tabular Q-learning or SARSA models, which brings us to the DQN implementation.

The theory behind DQN has already been explained in the previous notebook, but the key takeaways are the following:

- DQN consists of two neural networks: the **main neural network** is updated frequently (every few steps), while the **target neural network** is updated only every Nth step and is used to produce the target Q-value.
- The Q-values are updated with $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$.
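
To make the update rule above concrete, here is a minimal sketch that applies it once to a tiny hand-written Q-table. The table, the function name `q_update` and the values of `alpha` and the toy rewards are assumptions chosen for illustration only; in DQN the table is replaced by a neural network.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update in place and return the TD error."""
    # Target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    # Move the current estimate a fraction alpha towards the target
    q[state][action] += alpha * td_error
    return td_error


# Toy table (hypothetical): 3 states, 2 actions
q_table = [[0.0, 0.0], [0.5, 2.0], [0.0, 0.0]]
td_error = q_update(q_table, state=0, action=1, reward=1.0, next_state=1)
print(q_table[0][1], td_error)
```

Here the target is $1.0 + 0.99 \cdot 2.0 = 2.98$, so the estimate for state 0, action 1 moves from 0 to roughly 0.298.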

Another important concept used in DQN implementation is the **experience replay**. Since the neural network "wants" steady and consistent desired output, we are not going to train our neural network on every step as it happens. Rather, we are going to store the state, action, reward and next-state information and train on batches sampled from it after some time.

## Setup

OpenAI Gym used to provide an environment for Breakout, but now we have to use baselines. :)
To use it, we should run the following commands in the terminal:

```
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
git clone https://github.com/openai/atari-py
wget http://www.atarimania.com/roms/Roms.rar  # or curl -O for mac
unrar x Roms.rar .
python -m atari_py.import_roms .
```

### Mac

```
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
git clone https://github.com/openai/atari-py
brew install wget
wget http://www.atarimania.com/roms/Roms.rar  # or curl -O for mac
brew install rar
unrar x Roms.rar .
python -m atari_py.import_roms .
```


## Implementation

### Parameters

In [1]:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
# Configuration parameters for the whole setup
seed = 42
# Discount factor for future rewards
gamma = 0.99
# Epsilon greedy parameter
epsilon = 1.0
# Minimum epsilon greedy parameter
epsilon_min = 0.1
# Maximum epsilon greedy parameter
epsilon_max = 1.0

# Rate at which to reduce chance of random action being taken
epsilon_interval = (epsilon_max - epsilon_min)

# Size of batch taken from replay buffer
batch_size = 32
max_steps_per_episode = 10000

# Use the Baseline Atari environment because of Deepmind helper functions ("BreakoutNoFrameskip-v4")
env = make_atari(___)
# Warp the frames to 84x84 grayscale, stack four frames and scale pixel values to [0, 1]
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)



[42, 742738649]

### Models

As mentioned in the previous part, DQN uses two neural networks with identical architectures: **main** and **target**. In contrast to the previous RL models, the input to our DQN is going to be stacks of image frames with shape (84, 84, 4); therefore, convolutional layers are going to be used.
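
As a sanity check on the architecture, the spatial size of the feature maps can be traced by hand. The sketch below assumes the kernel sizes and strides used in the next cell (8/4, 4/2, 3/1) and Keras' default `"valid"` padding; the helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding ("valid")."""
    return (size - kernel) // stride + 1


size = 84  # height and width of the input frames
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)  # [20, 9, 7]
```

So the last convolution produces 7x7 feature maps, which are then flattened before the dense layers.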

In [3]:
num_actions = 4

def create_q_model():
    # Network defined by the Deepmind paper
    inputs = layers.Input(shape=(___))

    # Convolutions on the frames on the screen
    layer1 = layers.Conv2D(___, 8, strides=4, activation=___)(inputs)
    layer2 = layers.Conv2D(___, 4, strides=2, activation=___)(layer1)
    layer3 = layers.Conv2D(___, 3, strides=1, activation=___)(layer2)
    
    # Flatten the convolutional feature maps
    layer4 = layers.___()(layer3)

    layer5 = layers.Dense(___, activation="relu")(layer4)
    action = layers.Dense(num_actions, activation="linear")(layer5)

    return keras.Model(inputs=inputs, outputs=action)


# Main model predicts Q-values for action
model = ___

# Target model predicts future rewards and is updated every 10000 steps
model_target = ___

In [4]:
# Train the model after 4 actions
update_after_actions = 4

# How often to update the target network
update_target_network = ___

### Experience replay

As mentioned before, we are going to add experience replay to our DQN system. In other words, our algorithm will store transitions as it plays and, after a certain number of steps, train on batches sampled from that store instead of training on every single step.
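
The idea can be sketched independently of the notebook's code. The class below is a hypothetical, minimal replay buffer built on `collections.deque`; it is not the structure used in the cells that follow, which keep separate Python lists per field. Note that it samples without replacement, which is one possible choice.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, max_length):
        # deque drops the oldest transition automatically once it is full
        self.buffer = deque(maxlen=max_length)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Pick distinct random positions, then regroup the tuples by field
        indices = random.sample(range(len(self.buffer)), batch_size)
        batch = [self.buffer[i] for i in indices]
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(max_length=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)

states, actions, rewards, next_states, dones = buffer.sample(2)
print(len(buffer), states)
```

Sampling at random breaks the correlation between consecutive frames, which is what gives the network the steadier training signal described above.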

In [5]:
optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)

# Experience replay buffers
action_history = []
state_history = []
state_next_history = []
rewards_history = []
done_history = []
episode_reward_history = []

running_reward = 0
episode_count = 0
frame_count = 0

# Number of frames to take random action and observe output
epsilon_random_frames = 50000

# Number of frames for exploration
epsilon_greedy_frames = 1000000.0

# Maximum replay length
# Note: The Deepmind paper suggests 1000000 however this causes memory issues
max_memory_length = 100000
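
The exploration parameters defined so far imply a linear epsilon schedule. As a rough sketch, assuming epsilon is reduced by the same amount on every frame and clipped at its minimum (as the training loop below does), the value after a given number of frames can be computed in closed form. The function name `epsilon_at` is an assumption for illustration.

```python
def epsilon_at(frame, eps_max=1.0, eps_min=0.1, decay_frames=1_000_000):
    """Epsilon after `frame` frames of linear decay, clipped at eps_min."""
    decayed = eps_max - (eps_max - eps_min) * frame / decay_frames
    return max(decayed, eps_min)


for frame in (0, 500_000, 1_000_000, 2_000_000):
    print(frame, round(epsilon_at(frame), 3))
```

With these values epsilon is about 0.55 halfway through the decay and stays at 0.1 from one million frames onwards. Independently of this schedule, the first `epsilon_random_frames` frames always use random actions.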

### Loss function

DQN wants to minimise the difference between $Q(s, a)$ and $R + \gamma \max_{a'} Q^-(s', a')$, where $Q^-$ is the Q-value predicted by the target network. To stabilise the training, the authors used a Huber loss:

![](https://miro.medium.com/max/1400/1*Z5rSREWD-60Mr_uiEPlvvg.png)

$$
L_\delta(a) = 
\begin{cases}
{1 \over 2} a^2 & \text{for } |a| \leq \delta \\
\delta(|a| - {1 \over 2} \delta) & \text{otherwise}
\end{cases}
$$

The loss is quadratic for small $a$ and linear for large $a$, so large errors produce bounded gradients instead of dramatic changes in the loss.
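
To see the two regimes numerically, here is a small plain-Python sketch of the formula above for a single error value. It is only an illustration with $\delta = 1$ assumed, not the loss object used for training.

```python
def huber(a, delta=1.0):
    """Huber loss of a single error `a`, following the piecewise formula."""
    if abs(a) <= delta:
        return 0.5 * a * a                    # quadratic region
    return delta * (abs(a) - 0.5 * delta)     # linear region


for error in (0.5, 1.0, 3.0):
    print(error, huber(error), 0.5 * error ** 2)
```

For an error of 3.0 the Huber loss is 2.5, while the squared loss $\frac{1}{2} a^2$ would be 4.5, so a single large error pulls on the network less strongly.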

In [6]:
# Using Huber loss for stability
loss_function = ___

### Training

In [7]:
while True:
    state = np.array(env.reset())
    episode_reward = 0

    for timestep in range(1, max_steps_per_episode):
        # Uncomment the next line to visualize the agent's attempts
        # env.render()
        frame_count += 1

        # Use epsilon-greedy policy
        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:
            
            # Take random action
            action = ___
        else:
            # Predict action Q-values from states

            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_probs = model(state_tensor, training=False)
            
            # Take best action
            action = tf.argmax(action_probs[0]).numpy()

        # Decay probability of taking random action
        epsilon -= epsilon_interval / epsilon_greedy_frames
        epsilon = max(epsilon, epsilon_min)

        # Apply the action in our environment
        state_next, reward, done, _ = ___
        #Convert the next state to numpy array
        state_next = np.array(state_next)

        episode_reward += reward

        # Save actions and states in replay buffer
        action_history.append(___)
        state_history.append(___)
        state_next_history.append(___)
        done_history.append(___)
        rewards_history.append(___)
        state = state_next

        # Update every fourth frame and once batch size is over 32
        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:

            # Get indices of samples for replay buffers
            indices = np.random.choice(range(len(done_history)), size=batch_size)

            # Using list comprehension to sample from replay buffer
            state_sample = np.array([state_history[i] for i in indices])
            state_next_sample = np.array([state_next_history[i] for i in indices])
            rewards_sample = [rewards_history[i] for i in indices]
            action_sample = [action_history[i] for i in indices]
            done_sample = tf.convert_to_tensor(
                [float(done_history[i]) for i in indices]
            )

            # Using target neural network to predict target Q-value
            future_rewards = ___
            # Q value = reward + discount factor * expected future reward
            updated_q_values = rewards_sample + gamma * tf.reduce_max(
                future_rewards, axis=1
            )

            # If the transition ended the episode, replace its target value with -1
            updated_q_values = updated_q_values * (1 - done_sample) - done_sample

            # Create a mask so we only calculate loss on the updated Q-values
            masks = tf.one_hot(action_sample, num_actions)

            with tf.GradientTape() as tape:
                # Train the model on the states and updated Q-values
                q_values = model(state_sample)

                # Apply the masks to the Q-values to get the Q-value for action taken
                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)
                
                # Calculate loss between new Q-value and old Q-value
                loss = loss_function(updated_q_values, q_action)

            # Backpropagation
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if frame_count % update_target_network == 0:
            # Update the target network with the main network's weights
            model_target.set_weights(model.get_weights())
            # Log details
            template = "running reward: {:.2f} at episode {}, frame count {}"
            print(template.format(running_reward, episode_count, frame_count))

        # Limit the state and reward history
        if len(rewards_history) > ___:
            del rewards_history[:1]
            del state_history[:1]
            del state_next_history[:1]
            del action_history[:1]
            del done_history[:1]

        if done:
            break

    # Update running reward to check condition for solving
    episode_reward_history.append(episode_reward)
    if len(episode_reward_history) > 100:
        del episode_reward_history[:1]
    running_reward = np.mean(episode_reward_history)

    episode_count += 1

    if running_reward > 40:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break

running reward: 0.29 at episode 300, frame count 10000


KeyboardInterrupt: 