# Deep Q-Network (DQN)
---
In this notebook, you will implement a DQN agent for the LunarLander-v2 environment, loaded here through Gymnasium (the maintained successor to OpenAI Gym).

### 1. Import the Necessary Packages

In [None]:
# render ai gym environment
#!pip install gymnasium[box2d]
import gymnasium as gym

# install package for displaying animation
#!pip install JSAnimation
from JSAnimation.IPython_display import display_animation

#!pip install progressbar
#import progressbar as pb

from collections import deque
import random
import time
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import animation
%matplotlib inline

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    print("IPython")
    from IPython import display
else:  
    print("PVD")
    #!python -m pip install pyvirtualdisplay
    from pyvirtualdisplay import Display
    display = Display(visible=True, size=(1400, 900))
    display.start()
    
plt.ion()


import torch
import torch.nn as nn
import torch.nn.functional as F

from DQN_utils import *
from agent_W import *

#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print("using device: ", device) 

In [None]:
from Q_network import QNetwork

### 2. Instantiate the Environment and Agent

Initialize the environment in the code cell below.

In [None]:
#print([k for k in gym.envs.registry.keys() if "Lunar" in k])  #.all().keys()  #.make
#gym.envs.registry['LunarLanderContinuous-v2']

In [None]:
OLD_GYM = False
if OLD_GYM:
    #import gym
    env = gym.make("LunarLander-v2", options={'continuous': False,
                                              'gravity': -9.81,
                                              'enable_wind': True,
                                              'wind_power': 1.5,
                                              'turbulence_power': 0.15}
                  )
    env.seed(123)
    state, info = env.reset()
    obs = env.render()

    done = False
    while not done:
        action = env.action_space.sample()  # agent policy that uses the observation and info
        state, reward, done, trun, info = env.step(action)
        print(action, reward)
        obs = env.render(mode="rgb_array")
        plt.imshow(obs)

    obs_space = env.observation_space.shape
    action_size = env.action_space.n
    print('State shape: ', obs_space)
    print('Number of actions: ', action_size)

In [None]:
##### NEW GYM = GYMNASIUM
#!pip install gymnasium[box2d]
#import gymnasium as gym
env = gym.make("LunarLander-v2", render_mode="rgb_array",    #"human",       #
                                 continuous= False,
                                 gravity= -9.81,
                                 enable_wind= True,
                                 wind_power= 0.01,
                                 turbulence_power= 0.001)
state, info = env.reset(seed = 1234)
obs = env.render()
    
done = False
while not done:
    action = env.action_space.sample()  # agent policy that uses the observation and info
    state, reward, done, trun, info = env.step(action)
    done = done or trun
    #print(action, reward)
    obs = env.render()
    plt.imshow(obs)

state_shape = env.observation_space.shape
state_size = state_shape[0]
action_size = env.action_space.n
print('State shape: ', state_shape)
print('Number of actions: ', action_size)
plt.imshow(obs)

In [None]:
#env.observation_space.sample()

#### Observation Space

The observation space is an 8-dimensional vector: 
* the coordinates of the lander (x & y), 
* its linear velocities (x & y), 
* its angle (radians), 
* its angular velocity, 
* and two booleans indicating whether each leg is in contact with the ground.

Observation Highs:
* [1.5  1.5  5.0  5.0  3.14  5.0  True  True ]

Observation Lows:
* [-1.5  -1.5  -5.0  -5.0  -3.14  -5.0  False  False ]
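As a reading aid, the 8-dimensional observation described above can be given named fields. This is only a sketch: `LanderObservation` and `unpack_observation` are hypothetical helpers that are not part of this notebook or of Gymnasium, and the leg fields are numbered rather than labelled left/right because the text above does not state their order.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation vector."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float             # radians
    angular_velocity: float
    leg1_contact: bool       # order of the two legs is not asserted here
    leg2_contact: bool


def unpack_observation(obs: Sequence[float]) -> LanderObservation:
    """Convert a raw observation (list or array of 8 numbers) to named fields."""
    if len(obs) != 8:
        raise ValueError("expected 8 values, got {}".format(len(obs)))
    x, y, vx, vy, angle, ang_vel, leg1, leg2 = (float(v) for v in obs)
    # The contact flags arrive as 0.0 / 1.0 floats, so threshold them into booleans
    return LanderObservation(x, y, vx, vy, angle, ang_vel, leg1 > 0.5, leg2 > 0.5)


obs = unpack_observation([0.0, 1.4, 0.1, -0.2, 0.05, 0.0, 1.0, 0.0])
print(obs.y, obs.leg1_contact, obs.leg2_contact)
```

In the notebook, the `state` returned by `env.reset()` or `env.step()` could be passed to such a helper when inspecting individual transitions.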

Wind function:

`tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`, where `k` is set to 0.01 and `C` is sampled randomly between -9999 and 9999.
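The wind formula can be evaluated directly to get a feel for how it varies over time. The sketch below uses only the standard library; the function name `wind_factor` is hypothetical, and the assumption that the result is later scaled by `wind_power` is mine rather than something stated in the text.

```python
import math


def wind_factor(t: int, C: int, k: float = 0.01) -> float:
    """Evaluate tanh(sin(2 k (t+C)) + sin(pi k (t+C))) for step t and offset C."""
    phase = k * (t + C)
    return math.tanh(math.sin(2 * phase) + math.sin(math.pi * phase))


# Hypothetical offset; the text says C is sampled between -9999 and 9999
C = 1234
samples = [wind_factor(t, C) for t in range(0, 1000, 100)]
print(["{:+.3f}".format(s) for s in samples])
```

Because the outer function is `tanh`, the factor always stays strictly between -1 and 1, and the two sine terms with incommensurate frequencies keep it from repeating with a short period.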


#### Discrete Action Space

There are four discrete actions available:

* 0: do nothing
* 1: fire left orientation engine
* 2: fire main engine
* 3: fire right orientation engine


#### Rewards

After every step a reward is granted. The total reward of an episode is the sum of the rewards for all the steps within that episode.

For each step, the reward:

* is increased/decreased the closer/further the lander is to the landing pad.
* is increased/decreased the slower/faster the lander is moving.
* is decreased the more the lander is tilted (angle not horizontal).
* is increased by 10 points for each leg that is in contact with the ground.
* is decreased by 0.03 points each frame a side engine is firing.
* is decreased by 0.3 points each frame the main engine is firing.

The episode receives an additional reward of -100 points for crashing or +100 points for landing safely.

An episode is considered a solution if it scores at least 200 points.
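The scoring rule can be written down in a few lines. This is a minimal sketch with hypothetical helper names (`episode_return`, `is_solution`), and the reward trace is made up for illustration; real episodes also include the distance, speed, and tilt shaping terms listed above.

```python
SOLVED_SCORE = 200.0   # threshold quoted in the text above


def episode_return(step_rewards):
    """Total reward of one episode: the sum of its per-step rewards."""
    return float(sum(step_rewards))


def is_solution(step_rewards, threshold=SOLVED_SCORE):
    """True if the episode's total reward reaches the solution threshold."""
    return episode_return(step_rewards) >= threshold


# Hypothetical trace: three main-engine frames, two leg contacts, one safe landing
rewards = [-0.3, -0.3, -0.3, 10.0, 10.0, 100.0]
print(episode_return(rewards), is_solution(rewards))
```

The training loop later in this notebook applies the same 200-point threshold to the mean score of the most recent 100 episodes rather than to a single episode.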


Before running the next code cell, familiarize yourself with the code in **Step 2** and **Step 3** of this notebook, along with the code in `dqn_agent.py` and `model.py`.  Once you have an understanding of how the different files work together, 
- Define a neural network architecture in `model.py` that maps states to action values.  This file is mostly empty - it's up to you to define your own deep Q-network!
- Finish the `learn` method in the `Agent` class in `dqn_agent.py`.  The sampled batch of experience tuples is already provided for you; you need only use the local and target Q-networks to compute the loss, before taking a step towards minimizing the loss.
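
The two tasks above can be sketched as follows. This is an illustration under stated assumptions, not the contents of the solution files: the names `SketchQNetwork` and `dqn_loss` are hypothetical, the two hidden layers of 64 units mirror the `fc1_units` and `fc2_units` arguments passed to `Agent` later in this notebook, and `gamma=0.99` is an assumed discount factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchQNetwork(nn.Module):
    """Hypothetical MLP mapping a state vector to one Q-value per action."""

    def __init__(self, state_size=8, action_size=4, fc1_units=64, fc2_units=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.out = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.out(x)   # raw action values, no activation


def dqn_loss(local_net, target_net, states, actions, rewards,
             next_states, dones, gamma=0.99):
    """MSE between Q_local(s, a) and r + gamma * max_a' Q_target(s', a') * (1 - done)."""
    with torch.no_grad():
        # Greedy value of the next state, taken from the target network
        q_next = target_net(next_states).max(dim=1, keepdim=True)[0]
        q_target = rewards + gamma * q_next * (1.0 - dones)
    # Q-value the local network assigns to the action that was actually taken
    q_expected = local_net(states).gather(1, actions)
    return F.mse_loss(q_expected, q_target)


# Random batch of 5 transitions, purely to exercise the shapes
local, target = SketchQNetwork(), SketchQNetwork()
batch = 5
loss = dqn_loss(local, target,
                torch.randn(batch, 8), torch.randint(0, 4, (batch, 1)),
                torch.randn(batch, 1), torch.randn(batch, 8),
                torch.zeros(batch, 1))
loss.backward()   # gradients flow into the local network only
print(loss.item())
```

Computing the target under `torch.no_grad()` keeps the target network out of the backward pass, so only the local network is updated by the optimizer step that would follow.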

Once you have completed the code in `dqn_agent.py` and `model.py`, run the code cell below.  (_If you end up needing to make multiple changes and get unexpected behavior, please restart the kernel and run the cells from the beginning of the notebook!_)

You can find the solution files, along with saved model weights for a trained agent, in the `solution/` folder.  (_Note that there are many ways to solve this exercise, and the "solution" is just one way of approaching the problem, to yield a trained agent._)

### 3. Train the Agent with DQN

Run the code cell below to train the agent from scratch.  You are welcome to amend the supplied values of the parameters in the function, to try to see if you can get better performance!
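
Before changing `eps_start`, `eps_end`, or `eps_decay`, it can help to see how long exploration lasts under a given setting. The sketch below reproduces the per-episode update `eps = max(eps_end, eps_decay*eps)` used by the training function in this section; the helper names `epsilon_schedule` and `episodes_until_floor` are hypothetical.

```python
def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Return the epsilon value used in each of the first n_episodes episodes."""
    eps = eps_start
    values = []
    for _ in range(n_episodes):
        values.append(eps)
        eps = max(eps_end, eps_decay * eps)   # multiplicative decay with a floor
    return values


def episodes_until_floor(eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Count the decay steps needed for epsilon to reach eps_end."""
    if not 0.0 < eps_decay < 1.0:
        raise ValueError("eps_decay must be strictly between 0 and 1")
    eps, n = eps_start, 0
    while eps > eps_end:
        eps = max(eps_end, eps_decay * eps)
        n += 1
    return n


print(episodes_until_floor())          # notebook defaults
print(epsilon_schedule(5))             # first few epsilon values
```

With the default values, epsilon reaches its floor of 0.01 after 919 episodes, so a run much shorter than that never becomes close to fully greedy.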

In [None]:
#from dqn_agent import Agent
#torch.save(agent.qnetwork_local.state_dict(), 'dqn_windypoint.pth')

seed = 1234
agent = Agent(state_size=state_size, action_size=action_size, seed=seed, 
              fc1_units=64, fc2_units=64, learn_every=4)

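# Resume from previously saved weights if they exist; otherwise train from scratch
try:
    agent.qnetwork_local.load_state_dict(torch.load('data/slvdpnt.pth'))
except FileNotFoundError:
    pass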

In [None]:
def dqn(n_episodes=2000, max_t=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.995):
    """Deep Q-Learning.
    
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        eps_start (float): starting value of epsilon, for epsilon-greedy action selection
        eps_end (float): minimum value of epsilon
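        eps_decay (float): multiplicative factor (per episode) for decreasing epsilon

    Returns
    =======
        scores (list): total reward of each episode
        episode_lengths (list): number of steps taken in each episode
    """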
    FIRST = True
    episode_lengths = []
    scores = []                        # list containing scores from each episode
    window_size = 100                  # scores to rolling-remember
    scores_window = deque(maxlen=window_size)
    eps = eps_start                    # initialize epsilon
    for i_episode in range(1, n_episodes+1):
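        # NOTE: passing the same seed on every reset re-seeds the environment's RNG,
        # so every episode starts from the same initial conditions
        state, _ = env.reset(seed=SEED)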
        score = 0
        episteps = 0
        for t in range(max_t):  #episteps):
            action = agent.act(state, eps)
            next_state, reward, done, trun, _ = env.step(action)
            agent.step(state, action, reward, next_state, done or trun)
            state = next_state
            score += reward
            episteps += 1
            if done or trun:
                break 
        scores_window.append(score)       # save most recent score
        scores.append(score)              # save most recent score
        episode_lengths.append(episteps)
        eps = max(eps_end, eps_decay*eps) # decrease epsilon
        
        cycle_steps = agent.steps%BUFFER_SIZE
        buffer_cycle = agent.steps//BUFFER_SIZE

        print("\rEpisode {:4d} | Episode Score: {:7.2f} | Eps Steps: {:4d} | Epsilon: {:1.3f} | Buffer cycle:{:7d} +{:3d}".format(i_episode, 
                                                                                                                            scores_window[-1], 
                                                                                                                            episode_lengths[-1],
                                                                                                                            eps,
                                                                                                                            cycle_steps,
                                                                                                                            buffer_cycle), end="")
                                                                                                                                        
        if i_episode % 100 == 0:
            chkpntname = "data/chkpnt{}.pth".format(i_episode)
            torch.save(agent.qnetwork_local.state_dict(), chkpntname)   
            print("\rEpisode {:4d} | Average Score: {:7.2f} | Avg Steps: {:4d} | Epsilon: {:1.3f} | Buffer cycle:{:7d} +{:3d}".format(i_episode, 
                                                                                                                            np.mean(scores_window), 
                                                                                                                            int(round(np.mean(episode_lengths))),
                                                                                                                            eps,
                                                                                                                            cycle_steps,
                                                                                                                            buffer_cycle))
        #episteps = (episteps - 1) if episteps>=100 else max_t
        
        if np.mean(scores_window)>=200. and FIRST:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:5.2f}'.format(i_episode-100, np.mean(scores_window)))
            torch.save(agent.qnetwork_local.state_dict(), 'data/slvdpnt.pth')
            FIRST = False
        elif np.mean(scores_window)>=250. :
            print("\nHigh Score!")
            torch.save(agent.qnetwork_local.state_dict(), 'data/hipnt.pth')
            break
             
    return scores, episode_lengths


In [None]:
scores, steps  = dqn(n_episodes=1200, max_t=1000)

##### Plots

In [None]:
# plot the scores
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), scores, 'b,--', linewidth=0.25, markersize=1.0,)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

In [None]:
# plot the normed scores
fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
plt.plot(np.arange(len(scores)), norm(np.asarray(scores)), 'r,--', linewidth=0.25, markersize=1.0,)
plt.ylabel('Normed Score')
plt.xlabel('Episode #')
plt.show()

In [None]:
agent.memory.rewards[0]

In [None]:
# plot the normed rewards
Mrewards = list(agent.memory.rewards)
Mreward_mean = np.mean(Mrewards)
Mreward_std = np.std(Mrewards)
NMrewards = (Mrewards - Mreward_mean)/Mreward_std if Mreward_std!=0. else Mrewards

fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
plt.plot(np.arange(len(NMrewards)), NMrewards, 'g,', linewidth=0.1, markersize=1.0,)
#plt.ylim(0, 0.25)
plt.ylabel('Normed Rewards')
plt.xlabel('Transition #')   # one point per stored transition, not per episode
plt.show()

##### Stats

Nrewards: 
* 103542

Mreward_mean: 
* -3.834

Mreward_std: 
* 11.025

In [None]:
len(agent.memory.memory)

In [None]:
states, actions, rewards, next_states, dones = disentangle(agent.memory.memory)
#sum(dones), 
#np.asarray(rewards).where(dones==1)
#finrews = [r for r,d in zip(rewards,dones) if d!=0]
[(list(s[-2:]), num2act[a], r) for s, a, r in zip(states, actions, rewards) if r>-100. and r<10]#[-300:-200]

In [None]:
#[(list(s[-2:]), num2act[a], r) for s, a, r in zip(states, actions, rewards)])


In [None]:
if True:
#if False:
    # Expects data to be pre-torched, normed or scaled, and well-shaped
    states, actions, rewards, next_states, dones = disentangle(agent.memory.memory)
    print("Initial states:", np.asarray(states).shape, np.asarray(states)[0])

        
    agent.qnetwork_local.eval()
    agent.qnetwork_target.eval()
    with torch.no_grad():    
   
        # To tensors
        Tstates = torch.from_numpy(np.vstack(states)).float().to(device)
        Tactions = torch.from_numpy(np.vstack(actions)).long().to(device)
        Trewards = torch.from_numpy(np.vstack(rewards)).float().to(device)
        Tnext_states = torch.from_numpy(np.vstack(next_states)).float().to(device)
        Tdones = torch.from_numpy(np.vstack(dones).astype(np.uint8)).float().to(device)

        # Get max predicted Q values (for next states) from target model
        Q_target_next = agent.qnetwork_target(Tnext_states)
        print("Q_target(Tnext_states) = Q_target_next:", Q_target_next.shape, Q_target_next[0])
        Q_target_next = Q_target_next.detach().max(1)[0]
        print("Q_target_next MAX:", Q_target_next.shape, Q_target_next[0])
        Q_target_next = Q_target_next.unsqueeze(1)
        print("Q_target_next UNSQUEEZED:", Q_target_next.shape, Q_target_next[0])
        
        # Q_target_next.shape, Q_target_next[0]

        # Compute Q targets for current states 
        Q_target = Trewards + (GAMMA * Q_target_next * (1 - Tdones))
        print("Q_target = Trewards + (GAMMA * Q_target_next * (1 - Tdones)):\n", Q_target.shape, Q_target[0])

        # Get expected Q values from local model
        Q_expected = agent.qnetwork_local(Tstates)
        print("Q_local(Tstates) = Q_expected:", Q_expected.shape, Q_expected[0])
        Q_expected = Q_expected.gather(1, Tactions) ## Local Q-value of the action taken
        print("Q_expected Gathered:", Q_expected.shape, Q_expected[0])
        print("Tactions:", Tactions.shape, Tactions[0])

        # Compute loss
        loss = F.mse_loss(Q_expected, Q_target)
        print("loss:", loss)  
        
    agent.qnetwork_target.train()
    agent.qnetwork_local.train()

In [None]:
# plot the normed rewards
Mrewards = list(agent.memory.rewards)
Mreward_mean = np.mean(Mrewards)
Mreward_std = np.std(Mrewards)
NMrewards = (Mrewards - Mreward_mean)/Mreward_std if Mreward_std!=0. else Mrewards

fig = plt.figure(figsize=(12,6))
ax = fig.add_subplot(111)
plt.plot(np.arange(len(NMrewards)), NMrewards, 'g,-', linewidth=0.1, markersize=1.0,)
plt.ylabel('Normed Rewards')
plt.xlabel('Transition #')   # one point per stored transition, not per episode
plt.show()

##### Watch

In [None]:
# watch a pre-trained agent

for chkpnt in range(200,1400,200):
    chkpntname = "data/chkpnt{}.pth".format(chkpnt)
    agent.qnetwork_local.load_state_dict(torch.load(chkpntname))

    state, _ = env.reset()
    img = plt.imshow(env.render())
    for j in range(200):
        action = agent.act(state)
        state, reward, done, trun, _ = env.step(action)
        img.set_data(env.render()) 
        plt.axis('off')
        title = "{:4d} | {:-4.2f} | {:s}".format(chkpnt, reward, num2act[action])
        plt.title(title)
        display.display(plt.gcf())
        display.clear_output(wait=True)
        if done or trun:
            break 
      
env.close()

### 4. Explore

In this exercise, you have implemented a DQN agent and demonstrated how to use it to solve an OpenAI Gym environment.  To continue your learning, you are encouraged to complete any (or all!) of the following tasks:
- Amend the various hyperparameters and network architecture to see if you can get your agent to solve the environment faster.  Once you build intuition for the hyperparameters that work well with this environment, try solving a different OpenAI Gym task with discrete actions!
- You may like to implement some improvements such as prioritized experience replay, Double DQN, or Dueling DQN! 
- Write a blog post explaining the intuition behind the DQN algorithm and demonstrating how to use it to solve an RL environment of your choosing.  