# Deep Q-Learning in Python

## Introduction

Deep Q-Learning is a reinforcement learning technique that uses a deep
neural network to approximate the optimal Q-function. The Q-function
gives the expected total (discounted) future reward for taking a given
action in a given state and acting optimally afterwards. A Deep
Q-Network (DQN) takes a state as input and outputs an estimated Q-value
for every possible action in that state.
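
Written out, and assuming the usual notation ($s$ and $a$ for the current
state and action, $r$ for the reward, $s'$ for the next state, and a
discount factor $\gamma \in [0, 1)$, none of which the text above defines
explicitly), the optimal Q-function satisfies the Bellman optimality
equation:

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
$$

DQN training nudges the network's output for $(s, a)$ toward a sampled
estimate of the right-hand side.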

In this Jupyter Notebook, we will implement a basic DQN algorithm in
Python using the popular deep learning library, Keras. We will train our
DQN on the classic Atari game, Breakout.

## Prerequisites

Before you start working on this notebook, make sure you have the
following libraries installed:

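-   `tensorflow`
-   `keras`
-   `numpy`
-   `gym` (with the Atari extras, which provide the Breakout
    environment)
-   `tqdm` (used for the progress bar in the training loop)

You can install the libraries by running the following command.
Depending on your `gym` version, the Atari ROMs may need to be installed
separately.

In [None]:
%pip install tensorflow keras numpy "gym[atari]" tqdm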

## Understanding the Environment

We will use the `gym` library to create and interact with the Breakout
environment. The Breakout environment is a 2D video game in which the
player moves a paddle to hit a ball and break bricks. The goal of the
game is to break as many bricks as possible without letting the ball
fall off the screen.

The environment has the following attributes:

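-   `observation_space`: a `Box` object representing the observation
    space (a 210x160 RGB image, i.e. an array of shape `(210, 160, 3)`).
-   `action_space`: a `Discrete` object representing the action space.
    Breakout has four actions: do nothing, fire (launch the ball), move
    right, and move left.
-   `reward_range`: a tuple giving the minimum and maximum reward for a
    single step. `gym` typically leaves this at its default of
    `(-inf, inf)`; in Breakout the reward is 0 on most steps and
    positive when a brick breaks, with bricks in higher rows worth more
    than 1 point.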

We can create the environment as follows:

In [None]:
import gym

env = gym.make('Breakout-v0')

## Implementing the DQN Algorithm

### Step 1: Initialize the DQN model

We will create a deep neural network with three convolutional layers and
two fully connected layers. The input to the network will be a stack of
four 84x84 grayscale frames (shape `(84, 84, 4)`), each a preprocessed
version of the original observation. The output of the network will be
one Q-value for each possible action.
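
The training and evaluation loops below rely on a `preprocess_state`
helper to turn a raw frame into the network input. One possible sketch,
using only NumPy, is shown here. The averaging-based grayscale
conversion, the nearest-neighbour resize, the `new_episode` flag, and
the module-level `_frames` history are all assumptions made for
illustration; a production pipeline would more likely use an image
library for resizing.

```python
from collections import deque

import numpy as np

# Hypothetical module-level history of the last four processed frames
_frames = deque(maxlen=4)

def preprocess_state(frame, new_episode=False):
    """Convert a (210, 160, 3) RGB frame into an (84, 84, 4) float32 stack."""
    # Grayscale by averaging the colour channels, scaled to [0, 1]
    gray = frame.astype(np.float32).mean(axis=2) / 255.0
    # Nearest-neighbour downsampling to 84x84 (an assumption, not a tuned choice)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    small = gray[np.ix_(rows, cols)]
    if new_episode or not _frames:
        # Start of an episode: fill the history with copies of the first frame
        _frames.clear()
        _frames.extend([small] * 4)
    else:
        # Otherwise drop the oldest frame and append the newest
        _frames.append(small)
    return np.stack(list(_frames), axis=-1)
```

With this sketch, passing `new_episode=True` for the first frame after
`env.reset()` keeps frames from the previous episode out of the new
stack.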

In [None]:
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

def create_model(input_shape, num_actions):
    model = Sequential()
    model.add(Conv2D(32, 8, strides=(4, 4), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, 4, strides=(2, 2), activation='relu'))
    model.add(Conv2D(64, 3, strides=(1, 1), activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(num_actions, activation='linear'))
    return model

input_shape = (84, 84, 4)
num_actions = env.action_space.n

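model = create_model(input_shape, num_actions)
# The model must be compiled before model.fit() is called in the training loop.
# Adam and mean squared error are assumed defaults here, not tuned choices.
model.compile(optimizer='adam', loss='mse')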
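
To see what the convolutional stack above does to the 84x84 input, the
following sketch applies the standard output-size formula for a
convolution without padding (the Keras default, `padding='valid'`):
`(size - kernel) // stride + 1`. The helper name `conv_output_size` is
hypothetical.

```python
def conv_output_size(size, kernel, stride):
    # Output width/height of an unpadded convolution
    return (size - kernel) // stride + 1

size = 84
sizes = []
# (kernel, stride) pairs of the three Conv2D layers in create_model
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

print(sizes)  # → [20, 9, 7]

# The last conv layer has 64 filters, so Flatten() yields this many features
flattened = size * size * 64
print(flattened)  # → 3136
```

The first `Dense` layer therefore maps 3136 features to 512 units.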

### Step 2: Define the target network

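The target network is a copy of the DQN model that is used to compute
the target Q-values during training. Keeping it fixed between updates
stops the targets from shifting after every gradient step. We will copy
the DQN weights into the target network every `tau` steps.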

In [None]:
def create_target_model(model):
    target_model = create_model(model.input_shape[1:], model.output_shape[-1])
    target_model.set_weights(model.get_weights())
    return target_model

target_model = create_target_model(model)

tau = 10000

### Step 3: Define the replay buffer

The replay buffer is a data structure that stores the agent’s
experiences (state, action, reward, next state, done flag) during
training. We will use a deque with a maximum length to store the
experiences, so the oldest ones are discarded once the buffer is full,
and we will sample random batches from it to train the DQN model.
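
As a small illustration of why a bounded deque suits this job, the
sketch below appends five placeholder items to a deque limited to three
entries. The names `demo_buffer` and `experience_id` are hypothetical
and stand in for real experience tuples.

```python
from collections import deque

# A bounded deque silently drops its oldest item once maxlen is reached
demo_buffer = deque(maxlen=3)
for experience_id in range(5):
    demo_buffer.append(experience_id)

print(list(demo_buffer))  # → [2, 3, 4]
```

The same mechanism lets the replay buffer keep only the most recent
experiences without any manual bookkeeping.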

In [None]:
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, max_size):
        self.buffer = deque(maxlen=max_size)
        
    def add(self, experience):
        self.buffer.append(experience)
        
    def sample(self, batch_size):
        idxs = np.random.choice(len(self.buffer), size=batch_size, replace=False)
        states, actions, rewards, next_states, dones = [], [], [], [], []
        for i in idxs:
            state, action, reward, next_state, done = self.buffer[i]
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            next_states.append(next_state)
            dones.append(done)
        return np.array(states), np.array(actions), np.array(rewards), np.array(next_states), np.array(dones)

max_size = 1000000

buffer = ReplayBuffer(max_size)

### Step 4: Define the epsilon-greedy policy

The epsilon-greedy policy is used to balance exploration and
exploitation during training. We will start with a high exploration rate
(`epsilon`) and gradually decrease it over time.

In [None]:
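def epsilon_greedy_policy(state, epsilon):
    if np.random.rand() < epsilon:
        # Explore: pick a random action
        return np.random.randint(num_actions)
    else:
        # Exploit: add a batch dimension, since the model expects (batch, 84, 84, 4)
        Q_values = model.predict(state[np.newaxis, ...], verbose=0)
        return np.argmax(Q_values[0])

epsilon_start = 1.0
epsilon_end = 0.1
epsilon_decay = 1000000  # number of environment steps over which epsilon is annealed

def get_epsilon(step):
    # Linear decay from epsilon_start to epsilon_end, then constant
    return max(epsilon_end, epsilon_start - (epsilon_start - epsilon_end) * step / epsilon_decay)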
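
As a quick sanity check of the linear schedule above, the standalone
sketch below re-implements it under a hypothetical name,
`linear_epsilon`, with the same start, end, and decay values passed as
default arguments, and evaluates it at a few step counts.

```python
def linear_epsilon(step, start=1.0, end=0.1, decay_steps=1_000_000):
    # Linear interpolation from start to end, clamped at end
    return max(end, start - (start - end) * step / decay_steps)

for step in (0, 500_000, 1_000_000, 2_000_000):
    # Rounded for display: roughly 1.0, 0.55, 0.1, 0.1
    print(step, round(linear_epsilon(step), 3))
```

Epsilon falls halfway between its start and end values after half of
the decay steps and stays at the floor once the decay period is over.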

### Step 5: Implement the training loop

In the training loop, we will do the following:

-   Reset the environment and preprocess the initial observation.
-   Repeat the following until the episode is done:
    -   Choose an action using the epsilon-greedy policy.
    -   Execute the action and observe the next state, reward, and
        whether the episode is done.
    -   Store the experience in the replay buffer.
    -   Sample a batch of experiences from the replay buffer.
    -   Compute the target Q-values using the target network.
    -   Compute the predicted Q-values using the DQN model.
    -   Compute the loss between the target Q-values and the predicted
        Q-values.
    -   Backpropagate the loss and update the DQN model.
    -   Every `tau` steps, update the target network weights with the
        DQN weights.
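
The target computation in the steps above can be checked on a tiny
hand-made batch, independent of Keras. In this sketch the arrays
`next_q_target` and `q_online` are hypothetical stand-ins for the
outputs of the target network and the DQN model, and the batch contains
three made-up transitions with two actions each.

```python
import numpy as np

gamma = 0.99  # discount factor

rewards = np.array([1.0, 0.0, 0.0])
dones = np.array([0.0, 0.0, 1.0])        # the last transition ends the episode
actions = np.array([0, 1, 0])            # actions that were actually taken
next_q_target = np.array([[0.5, 2.0],    # stand-in for the target network's output
                          [1.0, 0.0],
                          [3.0, 4.0]])
q_online = np.zeros((3, 2))              # stand-in for the DQN model's output

# Target: r + gamma * max over next actions, with no bootstrap on terminal steps
targets = rewards + (1.0 - dones) * gamma * next_q_target.max(axis=1)
print(targets)  # approximately [2.98, 0.99, 0.0]

# Regression labels: copy the predictions and overwrite only the taken action
labels = q_online.copy()
labels[np.arange(len(actions)), actions] = targets
```

Note that the maximum is taken along `axis=1` without `keepdims`, so
`targets` stays one-dimensional and lines up with `rewards` and
`dones` element by element.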

In [None]:
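from tqdm import trange

num_episodes = 10000
batch_size = 32
gamma = 0.99

episode_rewards = []
step = 0  # total environment steps, used for the epsilon schedule and target updates

# Note: this loop uses the classic gym API, where reset() returns an
# observation and step() returns four values.
for i in trange(num_episodes):
    state = env.reset()
    state = preprocess_state(state)
    done = False
    episode_reward = 0
    while not done:
        # Epsilon is annealed per environment step, not per episode
        epsilon = get_epsilon(step)
        action = epsilon_greedy_policy(state, epsilon)
        next_state, reward, done, info = env.step(action)
        next_state = preprocess_state(next_state)
        buffer.add((state, action, reward, next_state, done))
        episode_reward += reward
        state = next_state
        step += 1

        if len(buffer.buffer) >= batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            # Bellman targets; (1 - dones) removes the bootstrap term on terminal transitions.
            # No keepdims here, so the result keeps shape (batch_size,) like rewards.
            next_Q = np.amax(target_model.predict(next_states, verbose=0), axis=1)
            targets = rewards + (1 - dones) * gamma * next_Q
            # Only the Q-value of the action actually taken is moved toward its target
            Q_values = model.predict(states, verbose=0)
            Q_values[np.arange(batch_size), actions] = targets
            model.fit(states, Q_values, batch_size=batch_size, verbose=0)

            # Sync the target network every tau steps
            if step % tau == 0:
                target_model.set_weights(model.get_weights())

    episode_rewards.append(episode_reward)
    print('Episode {}/{}: reward={}'.format(i+1, num_episodes, episode_reward))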

### Step 6: Evaluate the performance

We can evaluate the performance of the trained DQN by running the
following code:

In [None]:
num_eval_episodes = 10

eval_rewards = []

for i in range(num_eval_episodes):
    state = env.reset()
    state = preprocess_state(state)
    done = False
    eval_reward = 0
    while not done:
        # Add a batch dimension, since the model expects (batch, 84, 84, 4)
        Q_values = model.predict(state[np.newaxis, ...], verbose=0)
        action = np.argmax(Q_values[0])
        next_state, reward, done, info = env.step(action)
        next_state = preprocess_state(next_state)
        eval_reward += reward
        state = next_state
    eval_rewards.append(eval_reward)
    print('Evaluation Episode {}/{}: reward={}'.format(i+1, num_eval_episodes, eval_reward))
    
print('Average reward =', np.mean(eval_rewards))

## Conclusion

In this notebook, we have implemented a basic DQN algorithm in Python
using Keras. We have trained our DQN on the Atari game, Breakout, and
evaluated its performance. We hope this notebook has helped you
understand how DQN works and how to implement it in Python.