# Q-Learning & DQNs (12 regular points + 2 extra credit points for both CS4803 and CS7643)

In this section, we will implement a few key parts of the Q-Learning algorithm for two cases: (1) a Q-network that is a single linear layer (referred to in RL literature as "Q-learning with linear function approximation") and (2) a deep (convolutional) Q-network, for some Atari game environments where the states are images.

Optional Readings: 
- **Playing Atari with Deep Reinforcement Learning**, Mnih et al., https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
- **The PyTorch DQN Tutorial** https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
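
As a rough illustration of what "Q-learning with linear function approximation" means, the sketch below applies the standard Q-learning update to a linear Q-function in plain NumPy. It is not the assignment's `LinearQNet` or `DQNTrain` code; the names `q_values` and `td_update`, the toy dimensions, and the values of `gamma` and `lr` are all assumptions made for this example.

```python
import numpy as np

def q_values(W, b, state):
    """Linear Q-function: one score per action, Q(s, .) = W @ s + b."""
    return W @ state + b

def td_update(W, b, state, action, reward, next_state, done, gamma=0.99, lr=0.1):
    """Apply one semi-gradient Q-learning step in place and return the TD error."""
    # Bootstrapped target; no future value is added once the episode has ended.
    target = reward
    if not done:
        target += gamma * np.max(q_values(W, b, next_state))
    td_error = target - q_values(W, b, state)[action]
    # The gradient of Q(s, a) w.r.t. row `action` of W is the state itself.
    W[action] += lr * td_error * state
    b[action] += lr * td_error
    return td_error

# Toy problem (hypothetical): 3 state features, 2 actions.
W = np.zeros((2, 3))
b = np.zeros(2)
s = np.array([1.0, 0.0, 0.5])
s_next = np.array([0.0, 1.0, 0.5])

first_error = td_update(W, b, s, action=1, reward=1.0, next_state=s_next, done=False)
second_error = td_update(W, b, s, action=1, reward=1.0, next_state=s_next, done=False)
```

With all weights starting at zero, the first TD error equals the reward, and repeating the same transition shrinks the error as Q(s, a) moves toward the target. Only the row of `W` for the chosen action changes.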


In [5]:
%load_ext autoreload
%autoreload 2

import numpy as np
import gym

import torch
import torch.nn as nn
import torch.optim as optim

from core.dqn_train import DQNTrain
from utils.test_env import EnvTest
from utils.schedule import LinearExploration, LinearSchedule
from utils.preprocess import greyscale
from utils.wrappers import PreproWrapper, MaxAndSkipEnv

from linear_qnet import LinearQNet
from cnn_qnet import ConvQNet

if torch.cuda.is_available():
    device = torch.device('cuda', 0)
else:
    device = torch.device('cpu')
    
import minerl  # importing minerl registers the MineRL environments with gym

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload




In [7]:
from configs.p1_linear import config as config_lin

env = gym.make('MineRLTreechop-v0')

# exploration strategy
exp_schedule = LinearExploration(env, config_lin.eps_begin,
        config_lin.eps_end, config_lin.eps_nsteps)

# learning rate schedule
lr_schedule  = LinearSchedule(config_lin.lr_begin, config_lin.lr_end,
        config_lin.lr_nsteps)

# train model
model = DQNTrain(LinearQNet, env, config_lin, device)
model.run(exp_schedule, lr_schedule)

Starting Minecraft process: ['/var/folders/_5/9n4k6nj50nb48gfy7_x3brnr0000gn/T/tmpjwt17x8w/Minecraft/launchClient.sh', '-port', '9016', '-env', '-runDir', '/var/folders/_5/9n4k6nj50nb48gfy7_x3brnr0000gn/T/tmpjwt17x8w/Minecraft/run']
Starting process watcher for process 33998 @ localhost:9016
This mapping 'snapshot_20161220' was designed for MC 1.11! Use at your own peril.
#################################################
         ForgeGradle 2.2-SNAPSHOT-3966cea        
  https://github.com/MinecraftForge/ForgeGradle  
#################################################
               Powered by MCP unknown               
             http://modcoderpack.com             
         by: Searge, ProfMobius, Fesh0r,         
         R4wk, ZeuX, IngisKahn, bspkrs           
#################################################
Found AccessTransformer: malmomod_at.cfg
:deobfCompileDummyTask
:deobfProvidedDummyTask
:getVersionJson
:extractUserdev
:downloadClient SKIPPED
:downloadServer SKIPPED
:spl

[09:57:35] [main/INFO]: [com.microsoft.Malmo.OverclockingClassTransformer:overclockRenderer:150]: MALMO: Hooked into call to Minecraft.updateDisplay()
[09:57:35] [main/INFO]: A re-entrant transformer '$wrapper.com.microsoft.Malmo.OverclockingClassTransformer' was detected and will no longer process meta class data
[09:57:35] [main/INFO]: Launching wrapped minecraft {net.minecraft.client.main.Main}
[09:57:35] [main/INFO]: [com.microsoft.Malmo.OverclockingClassTransformer:transform:58]: MALMO: Attempting to transform MinecraftServer
[09:57:35] [main/INFO]: [com.microsoft.Malmo.OverclockingClassTransformer:overclockRenderer:129]: MALMO: Found Minecraft, attempting to transform it
[09:57:35] [main/INFO]: [com.microsoft.Malmo.OverclockingClassTransformer:overclockRenderer:135]: MALMO: Found Minecraft.runGameLoop() method, attempting to transform it
[09:57:35] [main/INFO]: [com.microsoft.Malmo.OverclockingClassTransformer:overclockRenderer:150]: MALMO: Hooked into call to Minecraft.updateDis

AttributeError: 'Dict' object has no attribute 'n'

[09:57:51] [Realms Notification Availability checker #1/INFO]: Could not authorize you against Realms server: Invalid session id


You should get a final average reward of over 4.0 on the test environment.
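
The training cells in this notebook pass a `LinearExploration` and a `LinearSchedule` to `model.run`. Their implementation in `utils/schedule.py` is not shown here, so the following is only a sketch of how a linearly annealed value and epsilon-greedy action selection are commonly written; `LinearDecay`, `epsilon_greedy`, and the numbers used are hypothetical.

```python
import random

class LinearDecay:
    """Hypothetical stand-in for a linear schedule: interpolate, then hold."""

    def __init__(self, begin, end, nsteps):
        self.begin, self.end, self.nsteps = begin, end, nsteps

    def value(self, t):
        # Fraction of the annealing period completed, clipped to [0, 1].
        frac = min(max(t / self.nsteps, 0.0), 1.0)
        return self.begin + frac * (self.end - self.begin)

def epsilon_greedy(best_action, num_actions, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return best_action

# Anneal epsilon from 1.0 to 0.1 over the first 1000 steps (made-up values).
eps = LinearDecay(begin=1.0, end=0.1, nsteps=1000)
action = epsilon_greedy(best_action=2, num_actions=6, epsilon=eps.value(500))
```

The same `LinearDecay` shape works for a learning rate: start high, decay linearly, and stay at the final value once `nsteps` has passed.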

## Part 2: Q-Learning with Deep Q-Networks

In `cnn_qnet.py`, implement the initialization and forward pass of a convolutional Q-network with architecture as described in this DeepMind paper:
    
"Playing Atari with Deep Reinforcement Learning", Mnih et al. (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf)
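
When sizing the fully connected layer, it can help to work out the spatial dimensions by hand. The sketch below assumes the architecture from the linked paper (16 filters of 8x8 with stride 4, then 32 filters of 4x4 with stride 2, followed by a 256-unit fully connected layer) and no padding; `conv_out` and `flattened_features` are hypothetical helper names, and your own `ConvQNet` may differ.

```python
def conv_out(size, kernel, stride):
    """Spatial size after an unpadded convolution (floor division, as in nn.Conv2d)."""
    return (size - kernel) // stride + 1

def flattened_features(height, width):
    """Features entering the fully connected layer for the two-layer conv stack."""
    # Layer 1: 16 filters, 8x8, stride 4. Layer 2: 32 filters, 4x4, stride 2.
    for kernel, stride in [(8, 4), (4, 2)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return 32 * height * width  # 32 output channels from the second layer

paper_features = flattened_features(84, 84)     # input size used in the paper
notebook_features = flattened_features(80, 80)  # frame size used in this notebook
```

Under these assumptions an 84x84 input yields 2592 features and an 80x80 input yields 2048, so for 80x80 frames the first fully connected layer would map 2048 inputs to 256 units.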

### Deliverable 2 (4 points)

Run the following block of code to train our Deep Q-Network. You should get an average reward of ~4.0; full credit will be given if the average reward at the final evaluation is above 3.5.

In [10]:
from configs.p2_cnn import config as config_cnn

env = EnvTest((80, 80, 1))

# exploration strategy
exp_schedule = LinearExploration(env, config_cnn.eps_begin,
        config_cnn.eps_end, config_cnn.eps_nsteps)

# learning rate schedule
lr_schedule  = LinearSchedule(config_cnn.lr_begin, config_cnn.lr_end,
        config_cnn.lr_nsteps)

# train model
model = DQNTrain(ConvQNet, env, config_cnn, device)
model.run(exp_schedule, lr_schedule)

Evaluating...
Average reward: 0.50 +/- 0.00

Populating the memory 150/200...

Evaluating...
Average reward: 0.50 +/- 0.00

Evaluating...
Average reward: -0.80 +/- 0.00

Evaluating...
Average reward: -1.00 +/- 0.00

Evaluating...
Average reward: 1.40 +/- 0.00

Evaluating...
Average reward: 2.00 +/- 0.00

Evaluating...
Average reward: 4.10 +/- 0.00

Evaluating...
Average reward: 4.10 +/- 0.00

Evaluating...
Average reward: 4.00 +/- 0.00

- Training done.
Evaluating...
Average reward: 4.00 +/- 0.00

You should get a final average reward of about 4.0 on the test environment, similar to the previous case.

## Part 3: Playing Atari Games from Pixels - using Linear Function Approximation

Now that we have set up our Q-Learning algorithm and tested it on a simple test environment, we will shift to a harder environment: an Atari 2600 game from OpenAI Gym, Pong-v0 (https://gym.openai.com/envs/Pong-v0/), where we will use RGB images of the game screen as our observations for state.
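
To make the observation pipeline concrete, here is a sketch of the kind of preprocessing that turns a raw 210x160 RGB Atari frame into an 80x80x1 greyscale image. The actual `greyscale` function and `MaxAndSkipEnv` wrapper live in `utils/` and are not reproduced here; the crop rows, the luma weights, and the helper names below are assumptions for illustration.

```python
import numpy as np

def to_greyscale_80x80(frame):
    """Convert a (210, 160, 3) RGB frame to (80, 80, 1) uint8 (illustrative only)."""
    frame = frame.astype(np.float32)
    # Standard luma weights for RGB to grey conversion.
    grey = 0.299 * frame[:, :, 0] + 0.587 * frame[:, :, 1] + 0.114 * frame[:, :, 2]
    grey = grey[35:195]    # assumed crop: keep 160 rows around the playing field
    grey = grey[::2, ::2]  # downsample by 2 in both directions -> 80 x 80
    return grey[:, :, np.newaxis].astype(np.uint8)

def max_over_last_two(frames):
    """Pixel-wise max of the last two frames, as a max-and-skip wrapper typically does."""
    return np.maximum(frames[-2], frames[-1])

# Random stand-ins for four consecutive emulator frames.
rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8) for _ in range(4)]
obs = to_greyscale_80x80(max_over_last_two(raw))
```

Taking the maximum over consecutive frames is a common way to counter sprite flicker on the Atari, and shrinking to a single 80x80 channel cuts each frame from 100,800 values to 6,400.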

No additional implementation is required for this part; just run the block of code below (training takes around 1 hour). We don't expect a simple linear Q-network to do well on such a hard environment, so full credit will be given simply for running the training to completion, irrespective of the final average reward obtained.

You may edit `configs/p3_train_atari_linear.py` if you wish to play around with hyperparameters to improve the performance of the linear Q-network on Pong-v0, or try another Atari environment by changing the `env_name` hyperparameter. The list of all Gym Atari environments is available here: https://gym.openai.com/envs/#atari

### Deliverable 3 (2 points)

Run the following block of code to train a linear Q-network on Atari Pong-v0. We don't expect the linear Q-network to learn anything meaningful, so full credit will be given for simply running this training to completion (without errors), irrespective of the final average reward.

In [30]:
from configs.p3_train_atari_linear import config as config_lina

# make env
env = gym.make(config_lina.env_name)
env = MaxAndSkipEnv(env, skip=config_lina.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
                    overwrite_render=config_lina.overwrite_render)

# exploration strategy
exp_schedule = LinearExploration(env, config_lina.eps_begin,
        config_lina.eps_end, config_lina.eps_nsteps)

# learning rate schedule
lr_schedule  = LinearSchedule(config_lina.lr_begin, config_lina.lr_end,
        config_lina.lr_nsteps)

# train model
model = DQNTrain(LinearQNet, env, config_lina, device)
print("Linear Q-Net Architecture:\n", model.q_net)
model.run(exp_schedule, lr_schedule)

Evaluating...

Linear Q-Net Architecture:
 LinearQNet(
  (fc_layer): Linear(in_features=25600, out_features=6, bias=True)
)

Average reward: -20.86 +/- 0.06

Evaluating...
Average reward: -20.96 +/- 0.03

- Training done.
Evaluating...
Average reward: -20.38 +/- 0.10
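
The printed layer sizes can be sanity-checked with a little arithmetic. The sketch below only uses numbers visible in the printout and the `PreproWrapper` call; reading the quotient as a number of stacked frames is an assumption, since the config file is not shown here.

```python
frame_shape = (80, 80, 1)  # shape passed to PreproWrapper in the cell above
in_features = 25600        # reported by the printed LinearQNet
num_actions = 6            # out_features in the same printout

pixels_per_frame = frame_shape[0] * frame_shape[1] * frame_shape[2]
stacked_frames = in_features // pixels_per_frame
num_parameters = in_features * num_actions + num_actions  # weight matrix plus bias
```

The quotient is 4, which would be consistent with the network seeing four stacked 80x80x1 frames per state, and the single linear layer then holds 153,606 parameters.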


## Part 4: Playing Atari Games from Pixels - using Deep Q-Networks

This part is extra credit and worth 2 bonus points. We will now train our deep Q-Network from Part 2 on Pong-v0.

Again, no additional implementation is required, but you may wish to tweak your CNN architecture in `cnn_qnet.py` and the hyperparameters in `configs/p4_train_atari_cnn.py`. Only evaluations up to the default 5 million steps will be considered, so you are not allowed to train for longer. Please note that this training may take a very long time (we tested this on a single GPU and it took around 6 hours).

The bonus points for this question will be allotted based on the best evaluation average reward (EAR) before 5 million time steps:

1. EAR >= 0.0 : 4/4 points
2. EAR >= -5.0 : 3/4 points
3. EAR >= -10.0 : 2/4 points
4. EAR >= -15.0 : 1/4 points

### Deliverable 4: (2 points. Extra Credit for both CS4803 and CS7643)

Run the following block of code to train your DQN:

In [None]:
from configs.p4_train_atari_cnn import config as config_cnna


# make env
env = gym.make(config_cnna.env_name)
env = MaxAndSkipEnv(env, skip=config_cnna.skip_frame)
env = PreproWrapper(env, prepro=greyscale, shape=(80, 80, 1),
                    overwrite_render=config_cnna.overwrite_render)

# exploration strategy
exp_schedule = LinearExploration(env, config_cnna.eps_begin,
        config_cnna.eps_end, config_cnna.eps_nsteps)

# learning rate schedule
lr_schedule  = LinearSchedule(config_cnna.lr_begin, config_cnna.lr_end,
        config_cnna.lr_nsteps)

# train model
model = DQNTrain(ConvQNet, env, config_cnna, device)
print("CNN Q-Net Architecture:\n", model.q_net)
model.run(exp_schedule, lr_schedule)