## Jupyter Notebook for the Deep Q-Learning Algorithm

DQN is a model-free, value-based, off-policy method:

- Model-free: it doesn't build any model of the environment.
- Value-based: it approximates the policy indirectly by estimating the value of taking each action in a state.
- Off-policy: it trains on historical data obtained from the environment, not only on data generated by the current policy.
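
The third point is commonly realised with a replay buffer that stores past transitions and hands back random minibatches. The sketch below is one minimal way to do that with the standard library. `Transition` and `ReplayBuffer` are hypothetical names chosen for illustration, and the extra `done` field is an assumption; the buffer inside `Agents.dqn_agent` may be organised differently.

```python
import collections
import random

# Hypothetical container for illustration; not taken from Agents.dqn_agent.
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions
        return random.sample(list(self.buffer), batch_size)
```

Because sampling is uniform over everything still in the buffer, a minibatch mixes transitions collected under many earlier versions of the policy, which is what makes the training off-policy.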

### Algorithm Steps:

1. Initialise $Q_{policynet}(s,a)$ and $Q_{targetnet}(s,a)$ with random weights, set $\epsilon = 1.0$, and create an empty replay buffer.
2. With probability $\epsilon$, select a random action $a$; otherwise select $a = \arg\max_a Q_{policynet}(s,a)$.
3. Execute action $a$ and observe the reward $r$ and next state $s'$.
4. Store the transition $\langle s, a, r, s' \rangle$ in the replay buffer.
5. Sample a random minibatch of transitions from the replay buffer.
6. For every transition in the minibatch, calculate the target: $y = r$ if $s'$ is a terminal state, otherwise $y = r + \gamma \max_{a'} Q_{targetnet}(s', a')$.
7. Calculate the loss $L = (Q_{policynet}(s,a) - y)^2$.
8. Update the parameters of $Q_{policynet}$ using stochastic gradient descent by minimising the loss.
9. After every $N$ steps, copy the parameters from $Q_{policynet}$ to $Q_{targetnet}$.
10. Repeat from step 2 until convergence.
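
Steps 2 and 6 to 9 can be sketched in PyTorch as follows. This is a minimal illustration, not the code in `Agents.dqn_agent`: the function names `select_action` and `dqn_loss`, the discount `gamma=0.99`, the batch layout, and the small linear layers standing in for the convolutional network are all assumptions made for the example.

```python
import random

import torch
import torch.nn as nn


def select_action(policy_net, state, epsilon, n_actions):
    """Step 2: epsilon-greedy action selection for a single state tensor."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = policy_net(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())


def dqn_loss(policy_net, target_net, batch, gamma=0.99):
    """Steps 6-7: TD targets from the target network, then the squared error."""
    states, actions, rewards, next_states, dones = batch
    # Q_policynet(s, a) for the actions that were actually taken
    q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        next_q[dones] = 0.0  # terminal transitions: y = r
        targets = rewards + gamma * next_q
    # Squared error averaged over the minibatch
    return nn.functional.mse_loss(q_sa, targets)


# Toy stand-ins for the two networks (assumed: 4 state features, 2 actions)
policy_net = nn.Linear(4, 2)
target_net = nn.Linear(4, 2)
target_net.load_state_dict(policy_net.state_dict())  # step 9: copy parameters

optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)
batch = (
    torch.randn(8, 4),                 # states
    torch.randint(0, 2, (8,)),         # actions
    torch.randn(8),                    # rewards
    torch.randn(8, 4),                 # next states
    torch.zeros(8, dtype=torch.bool),  # done flags
)
loss = dqn_loss(policy_net, target_net, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # step 8: one gradient step on the policy network only
```

Computing the targets under `torch.no_grad()` keeps gradients from flowing into the target network, so only $Q_{policynet}$ is changed by the optimiser.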

In [None]:
import gym
import time
import numpy as np

import torch
import torch.nn as nn

from utility import env_wrappers
from Agents.dqn_agent import DQN

import collections

import warnings
warnings.filterwarnings('ignore')

### Create Model Network.

In [None]:
class Net(nn.Module):
    def __init__(self, input_shape, n_actions):
        super(Net, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )

        conv_out_size = self._get_conv_out(input_shape)
        self.fc = nn.Sequential(
            nn.Linear(conv_out_size, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions)
        )
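
    def _get_conv_out(self, shape):
        # Pass a dummy input through the conv stack to infer the flattened feature size
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        # Flatten the conv features to (batch, features) before the fully connected layers
        conv_out = self.conv(x).view(x.size()[0], -1)
        return self.fc(conv_out)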
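
The value that `_get_conv_out` infers can also be checked by hand. The sketch below assumes an 84×84 input, which is a common size for preprocessed Atari frames but is not confirmed by this notebook; the actual shape depends on what `env_wrappers.make_env` returns. `conv_out_side` is a hypothetical helper written for this example.

```python
def conv_out_side(size, kernel_size, stride):
    """Output side length of a convolution with no padding and no dilation."""
    return (size - kernel_size) // stride + 1


side = 84  # assumed input height and width
# (kernel_size, stride) of the three Conv2d layers in Net
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    side = conv_out_side(side, kernel_size, stride)

conv_out_size = 64 * side * side  # 64 channels after the last conv layer
print(side, conv_out_size)  # prints "7 3136"
```

Under that assumption the first `nn.Linear` layer would receive 3136 input features.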

### Agent-Environment Setup

In [None]:
environment = env_wrappers.make_env("PongNoFrameskip-v4")

# Observation shape and number of discrete actions
obs_shape = environment.observation_space.shape
n_actions = environment.action_space.n

# Two networks with the same architecture: one is trained, the other provides the targets
policy_net = Net(obs_shape, n_actions)
target_net = Net(obs_shape, n_actions)

agent = DQN(env=environment, network_policy=policy_net, network_target=target_net, memory_size=10000, 
            learning_rate=1e-4, target_perf=19.5, replay_warmup=10000, target_update=1000)

In [None]:
agent.run(32)

In [None]:
environment = env_wrappers.make_env("PongNoFrameskip-v4")

In [None]:
environment.reset().shape

In [None]:
environment.step(1)

In [None]:
env = gym.make("PongNoFrameskip-v4")

In [None]:
env.reset().shape

In [None]:
env.step(1)