# 2019 Winter STAT 231B --- Final Deep Q-Network (DQN)
---
In this notebook, you will implement a DQN agent with a replay buffer on one of OpenAI Gym's Atari or Box2D games.

The official PyTorch tutorial is a good starting point: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html


You can choose any Atari or Box2D game you like from these two pages:

https://gym.openai.com/envs/#box2d

- [EASY] Box2D games have the smallest state. For example, LunarLander-v2 has only 8 dimensions.

https://gym.openai.com/envs/#atari

Each Atari game has two kinds of input.
- [MEDIUM] The RAM version has a small state of only 128 dimensions, so you can train it with fully connected layers.
- [HELL] The screen version takes an image as the state, which is around 200*200*3 dimensions, so you need convolutional layers to train it.

I recommend trying [EASY] first. If everything is implemented correctly, you will typically train a good agent within 1000 episodes. Then try [MEDIUM].

[HELL] The screen version typically needs about 10 hours to train. (I failed on the screen version.) If you train it successfully, I will definitely give you the highest bonus.

Definition of solved: see https://github.com/openai/gym/wiki/Leaderboard

There is no specific definition of solved for Atari games.

We only require you to implement [EASY]. Challenge yourself on an Atari game.

Upload two files for the coding part of the Final.

- A PDF file: your report. Please describe the specific algorithm, the implementation details, and the results (include a sample game screenshot and a reward-vs-episode plot). Also attach all the code at the end of the PDF.

- This ipynb file. (If you use Google Colab for training, you do not need to upload it. Instead, share this ipynb with me: press "Share" in the top-right and copy the link into your readme.md.)


- PS. You do not need to follow my template if you prefer to implement it your own way.
- PPS. You can use TensorFlow if you prefer.
- However, please define the same classes as this template. Include at least:
an agent class with act and learn; a replay class with push and sample; a Q-function class with a deep network structure; and a train function.

### 1. Import the Necessary Packages

In [0]:
!pip install box2d-py
!apt-get install -y xvfb python-opengl > /dev/null 2>&1
!pip install gym pyvirtualdisplay > /dev/null 2>&1



In [1]:
import gym
from gym import wrappers
import random
import torch
import numpy as np
from collections import deque, namedtuple
import matplotlib.pyplot as plt
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
%matplotlib inline

def show_video(folder):
    mp4list = glob.glob('%s/*.mp4' % folder)
    if len(mp4list) > 0:
        with io.open(mp4list[0], 'rb') as f:  # binary read; the file is closed on exit
            encoded = base64.b64encode(f.read())
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay loop controls style="height: 400px;"> 
        <source src="data:video/mp4;base64,{0}" type="video/mp4" /> </video>'''.format(encoded.decode('ascii'))))
        
display = Display(visible=0, size=(400, 300))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

### 2. Try it

The following code outputs a sample video in which the actions are sampled at random.

In [2]:

atari_game = "LunarLander-v2"
env = gym.wrappers.Monitor(gym.make(atari_game), 'sample', force=True)
env.seed(0)
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)

state = env.reset()

cr = 0
for j in range(2000):
    action = env.action_space.sample()
    env.render()
    state, reward, done, _ = env.step(action)
    
    cr += reward
    print('\r %.5f' % cr, end="")
    if done:
        break 
env.close()
show_video('sample')

State shape:  (8,)
Number of actions:  4
 -143.59446

In [3]:
print (state.shape)

(8,)


### 3. Define QNetwork, agent and replay buffer

In [6]:
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 1e-3               # learning rate 
UPDATE_EVERY = 5        # how often to update the network

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print (device)

class QNetwork(nn.Module):
    """Q-network: maps a state to one Q-value per action."""

    def __init__(self, state_size, action_size, seed, fc1_units=256, fc2_units=256):
        super(QNetwork, self).__init__()
        self.seed   = torch.manual_seed(seed)
        self.state_size  = state_size
        self.action_size = action_size
        self.fc1_units   = fc1_units
        self.fc2_units   = fc2_units
        self.layer1 = nn.Linear(self.state_size, self.fc1_units, bias=True)
        self.layer2 = nn.Linear(self.fc1_units,  self.fc2_units, bias=True)
        self.layer3 = nn.Linear(self.fc2_units,  self.action_size, bias=True)
        
    def forward(self, state):
        """Build a network that maps state -> action values."""
        layer1 = F.relu(self.layer1(state))
        layer2 = F.relu(self.layer2(layer1))
        layer3 = self.layer3(layer2)
        return layer3
        



cuda:0
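
As a quick sanity check on the architecture above, the sketch below rebuilds the same three-layer fully connected network in compact form and confirms that a batch of 8-dimensional LunarLander states maps to one Q-value per action. The class name `TinyQNetwork` and the batch size of 5 are illustrative assumptions, not part of the template.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyQNetwork(nn.Module):
    """Hypothetical stand-in mirroring QNetwork: state -> one Q-value per action."""

    def __init__(self, state_size=8, action_size=4, hidden=256):
        super().__init__()
        self.layer1 = nn.Linear(state_size, hidden)
        self.layer2 = nn.Linear(hidden, hidden)
        self.layer3 = nn.Linear(hidden, action_size)

    def forward(self, state):
        x = F.relu(self.layer1(state))
        x = F.relu(self.layer2(x))
        return self.layer3(x)  # raw Q-values, no activation on the output


torch.manual_seed(0)
net = TinyQNetwork()
states = torch.randn(5, 8)                # batch of 5 fake states
with torch.no_grad():
    q_values = net(states)                # one row of Q-values per state
greedy_actions = q_values.argmax(dim=1)   # greedy action index per state
print(q_values.shape, greedy_actions.shape)
```

The output layer is deliberately left linear: Q-values are unbounded returns (LunarLander rewards can be strongly negative), so squashing them with an activation would distort the targets.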


In [0]:
class Agent():
    """Interacts with and learns from the environment."""

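    def __init__(self, state_size, action_size, seed, buffer_size, batch_size, learning_rate, update_every, gamma, tau):
        """Initialize an Agent object.
        
        Params
        ======
            state_size (int): dimension of each state
            action_size (int): number of discrete actions
            seed (int): random seed
            buffer_size (int): maximum number of transitions kept in the replay buffer
            batch_size (int): minibatch size sampled for each update
            learning_rate (float): Adam learning rate
            update_every (int): number of environment steps between updates
            gamma (float): discount factor
            tau (float): interpolation factor for the soft target update
        """
        self.state_size = state_size
        self.action_size = action_size
        self.seed = seed
        random.seed(seed)  # random.seed() returns None, so store the seed value itself
        self.buffer_size = buffer_size
        self.batch_size  = batch_size
        self.learning_rate = learning_rate
        self.update_every = update_every
        self.gamma        = gamma
        self.tau          = tau
        # Online network (trained) and target network (used for the TD targets).
        # Both are built with the same seed, so they start with identical weights.
        self.Q_network     = QNetwork(self.state_size, self.action_size, seed)
        self.Q_network_val = QNetwork(self.state_size, self.action_size, seed)
        
        # Replay memory
        self.memory        = ReplayBuffer(self.action_size, self.buffer_size, self.batch_size, seed)
        self.optimizer     = optim.Adam(self.Q_network.parameters(), lr=self.learning_rate)
        
        self.steps_until_update = 0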
       
    def step(self, state, action, reward, next_state, done):
        # Save experience in replay memory
        self.memory.push(state, action, reward, next_state, done)
        self.steps_until_update = (self.steps_until_update + 1) % self.update_every

        # Learn every `update_every` steps, once the buffer holds more than one batch
        if self.steps_until_update == 0 and len(self.memory) > self.batch_size:
            states, actions, rewards, next_states, dones = self.memory.sample()

            # TD target from the target network: r + gamma * max_a' Q_target(s', a'),
            # with the bootstrap term zeroed for terminal transitions
            self.Q_network_val.eval()
            with torch.no_grad():
                max_next_q = self.Q_network_val(next_states).max(dim=1, keepdim=True)[0]
                target_rewards = rewards + self.gamma * max_next_q * (1 - dones)

            # Q-values predicted by the online network for the actions actually taken
            self.Q_network.train()
            expected_rewards = self.Q_network(states).gather(1, actions)
            loss = F.mse_loss(expected_rewards, target_rewards)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            # Soft update: target <- tau * online + (1 - tau) * target
            for target_param, online_param in zip(self.Q_network_val.parameters(), self.Q_network.parameters()):
                target_param.data.copy_(self.tau * online_param.data + (1.0 - self.tau) * target_param.data)

    def act(self, state, eps=0.1):
        """Returns actions for given state as per current policy.
        
        Params
        ======
            state (array_like): current state
            eps (float): epsilon, for epsilon-greedy action selection
        """

        state = torch.from_numpy(state).float().unsqueeze(0)
        with torch.no_grad():
            action_values = self.Q_network(state)

        # Epsilon-greedy: exploit with probability 1 - eps, otherwise explore.
        # Both draws use the `random` module so that random.seed(seed) covers them.
        if random.random() > eps:
            action = int(np.argmax(action_values.cpu().numpy()))
        else:
            action = random.randrange(self.action_size)
        return action
 

           

class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, action_size, buffer_size, batch_size, seed):
        """Initialize a ReplayBuffer object.
        Params
        ======
            action_size (int): dimension of each action
            buffer_size (int): maximum size of buffer
            batch_size (int): size of each training batch
            seed (int): random seed
        """
        self.action_size = action_size
        self.buffer_size = buffer_size
        self.batch_size  = batch_size
        self.seed        = seed
        if seed is not None:
            random.seed(seed)  # random.seed() returns None; reseed only for an explicit seed
        self.position    = 0
        self.memory      = []
        self.transition = namedtuple("Transition", field_names=["state", "action", "reward", "next_state", "done"])
        
    def push(self, state, action, reward, next_state, done):
        """Add a new experience to memory."""
        self.current_transition = self.transition(state, action, reward, next_state, done)
        if(len(self.memory)< self.buffer_size):
          self.memory.append(None)
        self.memory[self.position] = self.current_transition
        self.position = (self.position + 1) % self.buffer_size
          
          
    def sample(self):
        """Randomly sample a batch of experiences from memory."""

        sample = random.sample(self.memory, self.batch_size)
        sample_list = self.transition(*zip(*sample))
        states = torch.from_numpy(np.vstack(sample_list.state)).float()
        actions = torch.from_numpy(np.vstack(sample_list.action)).long()
        rewards = torch.from_numpy(np.vstack(sample_list.reward)).float()
        next_states = torch.from_numpy(np.vstack(sample_list.next_state)).float()
        dones = torch.from_numpy(np.vstack(sample_list.done).astype(np.uint8)).float()
        
        return (states, actions, rewards, next_states, dones)
      

      

    def __len__(self):
        """Return the current size of internal memory."""
        return len(self.memory)
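
To make the update inside `Agent.step` concrete, here is a small worked example of its three tensor operations: the TD target, the `gather` that selects the Q-value of the action taken, and the soft target update. The numbers are hand-picked toy values rather than outputs of the networks above, and the soft update is shown on a single scalar weight for readability.

```python
import torch

gamma, tau = 0.99, 1e-3

# Toy batch of two transitions (values chosen by hand for illustration)
rewards = torch.tensor([[1.0], [-100.0]])
dones = torch.tensor([[0.0], [1.0]])      # the second transition is terminal
next_q = torch.tensor([[0.5, 2.0],        # stand-in for Q_target(s', .)
                       [3.0, 1.0]])

# TD target: y = r + gamma * max_a' Q_target(s', a') * (1 - done)
max_next_q = next_q.max(dim=1, keepdim=True)[0]
targets = rewards + gamma * max_next_q * (1 - dones)
# Row 0 bootstraps: 1 + 0.99 * 2.0 = 2.98; row 1 is terminal, so y = r = -100

# gather(1, actions) picks, per row, the Q-value of the action that was taken
q_all = torch.tensor([[2.0, 0.0],
                      [0.0, -90.0]])      # stand-in for Q(s, .)
actions = torch.tensor([[0], [1]])
q_taken = q_all.gather(1, actions)        # [[2.0], [-90.0]]

# Soft update on one scalar weight: target <- tau * online + (1 - tau) * target
online_w, target_w = torch.tensor(1.0), torch.tensor(0.0)
target_w = tau * online_w + (1.0 - tau) * target_w   # moves 0.1% of the way

print(targets, q_taken, target_w)
```

The `(1 - dones)` factor is what prevents the agent from bootstrapping past the end of an episode, and the small `tau` keeps the target network changing slowly so the regression targets stay stable between updates.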

### 4. Train the Agent with DQN



In [0]:
def train_dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon
    """
    scores = []                        # list containing scores from each episode
    scores_window = deque(maxlen=100)  # last 100 scores
    eps = eps_start                    # initialize epsilon
    
    env = gym.make(atari_game)         # plain environment; recorded episodes swap in a Monitor
    
    render = True
    for i_episode in range(0, n_episodes):
        record = render and i_episode % 100 == 0
        if record:
            # Close the previous environment before wrapping a fresh one for recording
            env.close()
            env = gym.wrappers.Monitor(gym.make(atari_game), 'output_%d' % i_episode, force=True)
        state = env.reset()
        score = 0
        for t in range(max_t):
            action = agent.act(state, eps)
            if record:
                env.render()
            next_state, reward, done, _ = env.step(action)
            agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)), end="")
        if i_episode % 100 == 0:
            if record:
                env.close()               # closing the Monitor finalizes the video file
                show_video('output_%d' % i_episode)
                env = gym.make(atari_game)
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_window)))
        if np.mean(scores_window)>=200.0: # You can change for different game
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.Q_network.state_dict(), 'checkpoint.pth')
            break
    env.close()
    return scores

# Hyperparameters for this run (LR and UPDATE_EVERY differ from the values set in Section 3)
BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 1        # how often to update the network
agent = Agent(state_size=8, action_size=4, seed=0, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE, learning_rate=LR, update_every=UPDATE_EVERY, gamma=GAMMA, tau=TAU)

scores = train_dqn()

# plot the scores
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

Episode 0	Average Score: -408.14
Episode 100	Average Score: -136.08
Episode 200	Average Score: -43.57
Episode 300	Average Score: 65.75
Episode 352	Average Score: 111.29
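
When reading a log like this, it helps to know how much exploration is still happening at a given episode. The sketch below replays the multiplicative epsilon decay from `train_dqn` with its default arguments. The helper name `epsilon_schedule` is hypothetical and is introduced only for this illustration.

```python
def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps, values = eps_start, []
    for _ in range(n_episodes):
        values.append(eps)                   # epsilon used during this episode
        eps = max(eps_end, eps_decay * eps)  # same update rule as train_dqn
    return values


eps_values = epsilon_schedule(2000)
# First episode at which epsilon has reached its floor
first_floor = next(i for i, e in enumerate(eps_values) if e <= 0.01)
print('eps at episode 352: %.3f' % eps_values[352])
print('eps reaches eps_end at episode', first_floor)
```

With these defaults, epsilon is still roughly 0.17 at episode 352 and only reaches `eps_end` after a little more than 900 episodes, so the scores logged during training include a noticeable share of random actions and understate how well the greedy policy performs.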

You can load the saved parameters with this line.