# A first stab: DQN

[DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf) is a classical RL algorithm which should provide a nice baseline for further work.

Classical RL techniques would probably not work very well without further feature engineering, because the current state space is quite large.

In [1]:
import tianshou as ts 
from tianshou.utils import TensorboardLogger

import torch
from torch import nn
from torch.utils.tensorboard import SummaryWriter

import numpy as np

import os
from datetime import datetime

In [2]:
from utils_preprocess import compute_frame_features, compute_foa_features

from env_base import BaseEnvironment



## Data and environment initialisation

In [3]:
vid_filename = "001"
mat_filename = vid_filename + ".mat"
target_subject = 0

In [4]:
patch_bounding_boxes_per_frame, patch_centres_per_frame, speaker_info_per_frame = compute_frame_features(
    vid_filename
)

foa_centres_per_frame_per_subject, patch_weights_per_frame = compute_foa_features(
    mat_filename, patch_centres_per_frame
)
foa_centres_per_frame = [frame[target_subject] for frame in foa_centres_per_frame_per_subject]

In [5]:
markov_env = BaseEnvironment(
    1,
    patch_bounding_boxes_per_frame,
    patch_centres_per_frame,
    speaker_info_per_frame,
    foa_centres_per_frame,
    patch_weights_per_frame,
    frame_width=320, # from data_utils.py
    frame_height=180,
)

In [6]:
# markov_env.observation_space.sample(), markov_env.action_space.sample()

For efficiency, it's a good idea to set up some vectorized environments.

In [7]:
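import copy

num_train_envs = 5
num_test_envs = 10

# Each factory returns its own copy: `lambda: markov_env` would hand the same object to every slot.
# deepcopy assumes BaseEnvironment only holds plain Python/NumPy data.
train_envs = ts.env.DummyVectorEnv([lambda: copy.deepcopy(markov_env) for _ in range(num_train_envs)])
test_envs = ts.env.DummyVectorEnv([lambda: copy.deepcopy(markov_env) for _ in range(num_test_envs)])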
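
One detail to check in a cell like the one above: every factory handed to a vector environment should return its own environment object, since a lambda that closes over an existing instance gives that same object to every slot. A minimal sketch of the difference, using a hypothetical `ToyEnv` stand-in rather than `BaseEnvironment`:

```python
import copy


class ToyEnv:
    """Hypothetical stand-in for an environment with internal state."""

    def __init__(self):
        self.frame = 0

    def step(self):
        self.frame += 1
        return self.frame


template = ToyEnv()

# Factories that close over one object all return that same object.
shared = [fn() for fn in [lambda: template for _ in range(3)]]

# Factories that copy (or construct) give each slot an independent environment.
independent = [fn() for fn in [lambda: copy.deepcopy(template) for _ in range(3)]]

shared[0].step()
independent[0].step()

shared_frames = [env.frame for env in shared]            # every slot advanced together
independent_frames = [env.frame for env in independent]  # only slot 0 advanced
print(shared_frames, independent_frames)
```

With shared state, stepping or resetting one slot silently moves all the others, which can distort episode lengths and rewards.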

## DQN

First, let's construct the network.

The biggest headache comes from the observations: each one is a dictionary with several components. So, we build one sub-network per component and combine their outputs at the end!

In [8]:
class Net(nn.Module):
    def __init__(self, observation_space, action_shape):
        super().__init__()

        self.num_patches = observation_space['patch_centres'].shape[0]

        # network for patch_centres
        self.patch_centres_net = nn.Sequential(
            nn.Linear(np.prod(observation_space['patch_centres'].shape), 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 64),
            nn.ReLU(inplace=True)
        )

        # network for patch_bounding_boxes
        self.patch_bboxes_net = nn.Sequential(
            nn.Linear(np.prod(observation_space['patch_bounding_boxes'].shape), 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 64),
            nn.ReLU(inplace=True)
        )

        # network for speaker_info
        self.speaker_info_net = nn.Sequential(
            nn.Linear(np.prod(observation_space['speaker_info'].shape), 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 32),
            nn.ReLU(inplace=True)
        )

        # combining the outputs of all networks
        self.combined_net = nn.Sequential(
            nn.Linear(64 + 64 + 32, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, np.prod(action_shape))
        )

    def forward(self, obs, state=None, info=None):
        # obs is a batch of dict observations; convert each component to a float tensor
        patch_centres = torch.as_tensor(obs['patch_centres'], dtype=torch.float32)
        patch_bboxes = torch.as_tensor(obs['patch_bounding_boxes'], dtype=torch.float32)
        speaker_info = torch.as_tensor(obs['speaker_info'], dtype=torch.float32)

        # flatten everything except the batch dimension
        patch_centres = patch_centres.flatten(start_dim=1)
        patch_bboxes = patch_bboxes.flatten(start_dim=1)
        speaker_info = speaker_info.flatten(start_dim=1)

        # pass through respective networks
        patch_centres_out = self.patch_centres_net(patch_centres)
        patch_bboxes_out = self.patch_bboxes_net(patch_bboxes)
        speaker_info_out = self.speaker_info_net(speaker_info)

        # combine outputs
        combined = torch.cat([patch_centres_out, patch_bboxes_out, speaker_info_out], dim=1)

        logits = self.combined_net(combined)

        return logits, state

In [9]:
state_shape = markov_env.observation_space
action_shape = markov_env.action_space.n

net = Net(state_shape, action_shape)
optim = torch.optim.Adam(net.parameters(), lr=1e-3)

In [10]:
net

Net(
  (patch_centres_net): Sequential(
    (0): Linear(in_features=4, out_features=64, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU(inplace=True)
  )
  (patch_bboxes_net): Sequential(
    (0): Linear(in_features=8, out_features=64, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU(inplace=True)
  )
  (speaker_info_net): Sequential(
    (0): Linear(in_features=2, out_features=32, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): ReLU(inplace=True)
  )
  (combined_net): Sequential(
    (0): Linear(in_features=160, out_features=128, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=128, out_features=128, bias=True)
    (3): ReLU(inplace=True)
    (4): Linear(in_features=128, out_features=2, bias=True)
  )
)
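
As a quick sanity check on the size of this model, the number of trainable parameters can be worked out by hand from the layer sizes in the printout. This is only a sketch: the `(in_features, out_features)` pairs are copied from the output above, and `count_parameters` is a helper name made up for the example.

```python
# (in_features, out_features) of every Linear layer, read off the printout above.
linear_layers = {
    "patch_centres_net": [(4, 64), (64, 64)],
    "patch_bboxes_net": [(8, 64), (64, 64)],
    "speaker_info_net": [(2, 32), (32, 32)],
    "combined_net": [(160, 128), (128, 128), (128, 2)],
}


def count_parameters(layers):
    """Weights (in * out) plus biases (out) for a stack of Linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in layers)


per_branch = {name: count_parameters(layers) for name, layers in linear_layers.items()}
total = sum(per_branch.values())
print(per_branch, total)
```

In the notebook itself, `sum(p.numel() for p in net.parameters())` should report the same total. Most of the parameters sit in the combining network, not in the three input branches.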

## Setting up DQN

First, we need to set up the policy, which is readily done in Tianshou.

In [11]:
policy = ts.policy.DQNPolicy(
    model=net, 
    optim=optim, 
    discount_factor=0.99,
    estimation_step=1,
    target_update_freq=50
)
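
To make the three hyperparameters above concrete, here is a rough sketch of the regression target DQN trains towards. `discount_factor` plays the role of `gamma`, `estimation_step=1` means the target bootstraps after a single step, and `target_update_freq=50` controls how often the target network (the source of `next_q_target`) is synced with the online one. The function `td_target` is hypothetical and written for a single transition. The double-DQN branch is included because, to my understanding, Tianshou's `DQNPolicy` uses that variant by default; treat that as an assumption.

```python
def td_target(reward, done, next_q_target, gamma=0.99, next_q_online=None):
    """One-step TD target for a single transition.

    next_q_target: Q-values of the next state from the periodically synced target network.
    next_q_online: if given, the online network chooses the action (double DQN).
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    chooser = next_q_online if next_q_online is not None else next_q_target
    best_action = max(range(len(chooser)), key=lambda a: chooser[a])
    return reward + gamma * next_q_target[best_action]


plain = td_target(reward=0.5, done=False, next_q_target=[1.0, 2.0])
double = td_target(0.5, False, next_q_target=[1.0, 2.0], next_q_online=[3.0, 0.0])
terminal = td_target(0.5, True, next_q_target=[1.0, 2.0])
print(plain, double, terminal)
```

The loss is then the squared (or Huber) difference between this target and the online network's Q-value for the action that was actually taken.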

Then, we need to set up the collectors, i.e., the objects that interact with the environment according to the above policy and collect the generated data.

In classical DQN fashion, we store the data in a replay buffer.

In [12]:
train_collector = ts.data.Collector(policy, train_envs, ts.data.VectorReplayBuffer(6000, num_train_envs))

test_collector = ts.data.Collector(policy, test_envs)
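
For readers new to replay buffers, the idea can be sketched in a few lines: a fixed-size store where new transitions overwrite the oldest ones, sampled uniformly at random. The `RingReplayBuffer` class below is a hypothetical toy, not Tianshou's implementation. As far as I understand it, `VectorReplayBuffer(6000, num_train_envs)` additionally splits the total capacity across the training environments.

```python
import random


class RingReplayBuffer:
    """Hypothetical fixed-size buffer: new transitions overwrite the oldest ones."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []
        self.next_index = 0

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_index] = transition
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation between consecutive frames.
        return rng.sample(self.storage, batch_size)


buffer = RingReplayBuffer(capacity=4)
for step in range(6):
    buffer.add(step)

contents = sorted(buffer.storage)  # the two oldest transitions have been overwritten
batch = buffer.sample(2, random.Random(0))
print(contents, batch)
```

Because batches are drawn at random from the whole buffer, a batch is a mix of transitions from different episodes, not a contiguous stretch of video.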

## Training

In [13]:
num_epochs = 20
num_steps_per_epoch = 3000
step_per_collect = 10
episode_per_test = 5
batch_size = 30 # as many transitions as one second of video (videos are at 30FPS)

timestamp = datetime.now().strftime("%d%m%Y-%H%M%S")
log_path = os.path.join("logs", "dqn", "base", f"video_{vid_filename}", f"subject_{target_subject}", timestamp)
writer = SummaryWriter(log_path)
logger = TensorboardLogger(writer)
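
It helps to spell out what these settings add up to. The arithmetic below uses only the values from the cells above (the buffer size of 6000 comes from the collector cell); the variable names on the left are made up for the example.

```python
num_epochs = 20
num_steps_per_epoch = 3000
step_per_collect = 10
buffer_size = 6000

# Total number of environment steps over the whole run.
total_env_steps = num_epochs * num_steps_per_epoch

# Number of collect phases per epoch, each followed by gradient updates.
collects_per_epoch = num_steps_per_epoch // step_per_collect

# Share of the run's transitions that fits in the replay buffer at any one time.
buffer_fraction = buffer_size / total_env_steps

print(total_env_steps, collects_per_epoch, buffer_fraction)
```

So the buffer holds about two epochs' worth of experience, and the total should match the final `env_step` counter the trainer reports.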

In [14]:
result = ts.trainer.offpolicy_trainer(
    policy, 
    train_collector, 
    test_collector,
    max_epoch=num_epochs,
    step_per_epoch=num_steps_per_epoch,
    step_per_collect=step_per_collect,
    episode_per_test=episode_per_test,
    batch_size=batch_size,
    logger=logger,
)

Epoch #1: 3001it [00:06, 445.45it/s, env_step=3000, len=120, loss=1.504, n/ep=2, n/st=10, rew=56.35]                          


Epoch #1: test_reward: 153.505796 ± 126.639527, best_reward: 154.035783 ± 125.348117 in #0


Epoch #2: 3001it [00:06, 454.86it/s, env_step=6000, len=120, loss=1.710, n/ep=2, n/st=10, rew=47.47]                          


Epoch #2: test_reward: 153.505796 ± 124.430770, best_reward: 154.035783 ± 125.348117 in #0


Epoch #3: 3001it [00:06, 445.62it/s, env_step=9000, len=120, loss=2.095, n/ep=2, n/st=10, rew=49.39]                          


Epoch #3: test_reward: 153.505796 ± 124.688519, best_reward: 154.035783 ± 125.348117 in #0


Epoch #4: 3001it [00:06, 442.91it/s, env_step=12000, len=120, loss=2.792, n/ep=2, n/st=10, rew=53.12]                          


Epoch #4: test_reward: 153.505796 ± 119.897613, best_reward: 154.035783 ± 125.348117 in #0


Epoch #5: 3001it [00:06, 449.41it/s, env_step=15000, len=120, loss=2.323, n/ep=2, n/st=10, rew=56.92]                          


Epoch #5: test_reward: 153.505796 ± 126.568137, best_reward: 154.035783 ± 125.348117 in #0


Epoch #6: 3001it [00:06, 453.59it/s, env_step=18000, len=120, loss=3.602, n/ep=2, n/st=10, rew=46.87]                          


Epoch #6: test_reward: 153.505796 ± 126.425746, best_reward: 154.035783 ± 125.348117 in #0


Epoch #7: 3001it [00:06, 452.84it/s, env_step=21000, len=120, loss=4.078, n/ep=2, n/st=10, rew=53.51]                          


Epoch #7: test_reward: 153.505796 ± 122.082291, best_reward: 154.035783 ± 125.348117 in #0


Epoch #8: 3001it [00:07, 380.55it/s, env_step=24000, len=120, loss=2.532, n/ep=2, n/st=10, rew=52.90]                          


Epoch #8: test_reward: 153.505796 ± 122.433377, best_reward: 154.035783 ± 125.348117 in #0


Epoch #9: 3001it [00:06, 451.10it/s, env_step=27000, len=120, loss=3.774, n/ep=2, n/st=10, rew=46.32]                          


Epoch #9: test_reward: 153.505796 ± 120.639651, best_reward: 154.035783 ± 125.348117 in #0


Epoch #10: 3001it [00:06, 456.66it/s, env_step=30000, len=120, loss=4.525, n/ep=2, n/st=10, rew=52.45]                          


Epoch #10: test_reward: 153.505796 ± 124.869452, best_reward: 154.035783 ± 125.348117 in #0


Epoch #11: 3001it [00:06, 454.32it/s, env_step=33000, len=120, loss=2.671, n/ep=2, n/st=10, rew=54.12]                          


Epoch #11: test_reward: 153.505796 ± 124.981207, best_reward: 154.035783 ± 125.348117 in #0


Epoch #12: 3001it [00:06, 453.76it/s, env_step=36000, len=120, loss=4.172, n/ep=2, n/st=10, rew=50.39]                          


Epoch #12: test_reward: 153.505796 ± 123.342576, best_reward: 154.035783 ± 125.348117 in #0


Epoch #13: 3001it [00:06, 452.23it/s, env_step=39000, len=120, loss=3.957, n/ep=2, n/st=10, rew=49.02]                          


Epoch #13: test_reward: 153.505796 ± 116.355178, best_reward: 154.035783 ± 125.348117 in #0


Epoch #14: 3001it [00:06, 449.24it/s, env_step=42000, len=120, loss=4.048, n/ep=2, n/st=10, rew=49.77]                          


Epoch #14: test_reward: 153.505796 ± 119.584744, best_reward: 154.035783 ± 125.348117 in #0


Epoch #15: 3001it [00:06, 451.30it/s, env_step=45000, len=120, loss=2.976, n/ep=2, n/st=10, rew=64.22]                          


Epoch #15: test_reward: 153.505796 ± 120.258150, best_reward: 154.035783 ± 125.348117 in #0


Epoch #16: 3001it [00:06, 471.52it/s, env_step=48000, len=120, loss=3.944, n/ep=2, n/st=10, rew=49.51]                          


Epoch #16: test_reward: 153.505796 ± 130.642619, best_reward: 154.035783 ± 125.348117 in #0


Epoch #17: 3001it [00:06, 472.22it/s, env_step=51000, len=120, loss=3.543, n/ep=2, n/st=10, rew=53.90]                          


Epoch #17: test_reward: 153.505796 ± 126.683213, best_reward: 154.035783 ± 125.348117 in #0


Epoch #18: 3001it [00:06, 453.12it/s, env_step=54000, len=120, loss=2.790, n/ep=2, n/st=10, rew=52.41]                          


Epoch #18: test_reward: 153.505796 ± 125.293721, best_reward: 154.035783 ± 125.348117 in #0


Epoch #19: 3001it [00:06, 454.56it/s, env_step=57000, len=120, loss=4.958, n/ep=2, n/st=10, rew=55.25]                          


Epoch #19: test_reward: 153.505796 ± 115.376002, best_reward: 154.035783 ± 125.348117 in #0


Epoch #20: 3001it [00:06, 454.52it/s, env_step=60000, len=120, loss=2.159, n/ep=2, n/st=10, rew=53.47]                          


Epoch #20: test_reward: 153.505796 ± 127.634869, best_reward: 154.035783 ± 125.348117 in #0


In [15]:
result

{'duration': '138.74s',
 'train_time/model': '125.34s',
 'test_step': 37779,
 'test_episode': 105,
 'test_time': '5.06s',
 'test_speed': '7465.91 step/s',
 'best_reward': 154.03578341935645,
 'best_result': '154.04 ± 125.35',
 'train_step': 60000,
 'train_episode': 200,
 'train_time/collector': '8.34s',
 'train_speed': '448.83 step/s'}

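Well, that's quite a letdown: the test reward stays at about 153.5 for all 20 epochs, and the best result is still the one recorded before training (epoch #0).

Then again, there's not much to be surprised about: very little information is passed to the networks!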

Plus, I'm still not too sure that the problem is really amenable to RL in the first place.

At this point, I have two choices:
1. I keep fine-tuning hyperparameters until I get an acceptable result,
2. I try a different approach.

I'll opt for the second option, but I made this code into a notebook precisely so that tinkering would be easier. So, if you wish to tune and fine-tune things, go ahead!
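
For anyone taking up that invitation, exploration is probably the first knob to check, since the training cells above never set an exploration rate. A common recipe is to anneal epsilon linearly during training. The sketch below is standalone: the function names and the `start`, `end` and `decay_steps` values are illustrative assumptions, not tuned settings.

```python
import random


def linear_epsilon(env_step, start=1.0, end=0.05, decay_steps=30_000):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` environment steps."""
    fraction = min(env_step / decay_steps, 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


schedule = [round(linear_epsilon(step), 3) for step in (0, 15_000, 30_000, 60_000)]
greedy_action = epsilon_greedy([0.2, 0.7], epsilon=0.0)
print(schedule, greedy_action)
```

As far as I know from Tianshou's DQN examples, such a schedule is hooked in by passing something like `train_fn=lambda epoch, env_step: policy.set_eps(linear_epsilon(env_step))` to the trainer and creating the training collector with `exploration_noise=True`. Check both against the installed Tianshou version before relying on them.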

In [16]:
policy.eval()
policy.set_eps(0.05)

collector = ts.data.Collector(policy, train_envs)
collector.collect(n_episode=10)

{'n/ep': 10,
 'n/st': 3000,
 'rews': array([ 55.09495263,  50.14317183,  53.25710873,  58.55555754,
         59.01095125,  47.6988985 , 220.29657369,  52.25722118,
        368.95651349, 316.79558103]),
 'lens': array([120, 120, 120, 120, 120, 120, 510, 150, 810, 810]),
 'idxs': array([3, 4, 3, 4, 3, 4, 2, 4, 0, 1]),
 'rew': 128.20665298717904,
 'len': 300.0,
 'rew_std': 118.72347264807837,
 'len_std': 279.4995527724508}

It does perform quite well, every once in a while...
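
One caveat when reading these numbers: the episodes differ a lot in length, so it is worth normalising the returns before comparing them. A small check, with the values copied from the collector output above:

```python
# Values copied from the collector output above.
rews = [55.09495263, 50.14317183, 53.25710873, 58.55555754, 59.01095125,
        47.6988985, 220.29657369, 52.25722118, 368.95651349, 316.79558103]
lens = [120, 120, 120, 120, 120, 120, 510, 150, 810, 810]

# Reward earned per environment step in each episode.
reward_per_step = [round(r / n, 3) for r, n in zip(rews, lens)]

mean_reward = sum(rews) / len(rews)
mean_length = sum(lens) / len(lens)
print(reward_per_step, mean_reward, mean_length)
```

The per-step reward lands in a fairly narrow band for short and long episodes alike, which suggests that the high-return episodes are mostly the long ones, not ones where the policy behaved better.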

### TensorBoard visualisation

In [19]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [20]:
%tensorboard --logdir logs/dqn

Reusing TensorBoard on port 6006 (pid 3699), started 0:00:20 ago. (Use '!kill 3699' to kill it.)