# LunarLander with DQN (Deep Q-Network)

In this lesson, you'll train a DQN agent on `LunarLander-v3` using Stable-Baselines3. We'll:
- Understand the environment
- Define a DQN model and key hyperparameters
- Train with evaluation callbacks
- Visualize learning curves and watch the agent land!


In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback
import numpy as np
import os

print("Libraries imported.")


## Environment and Wrappers
We'll wrap the environment with `Monitor` to log episode rewards/lengths and set up an evaluation environment.


In [None]:
env = Monitor(gym.make("LunarLander-v3"))
eval_env = Monitor(gym.make("LunarLander-v3"))
log_dir = "./logs_dqn"
os.makedirs(log_dir, exist_ok=True)

print(env.observation_space, env.action_space)
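
To make the printed spaces easier to read, here is a minimal sketch that attaches names to the 8 observation components and the 4 discrete actions, following the Gymnasium documentation for `LunarLander`. The helper `describe_observation` and the example values are hypothetical and only serve as illustration; the sketch does not need the environment to be installed.

```python
# Component and action names follow the Gymnasium LunarLander documentation.
# The helper name and the example observation are illustrative assumptions.
OBS_LABELS = (
    "x", "y",                 # lander position
    "vx", "vy",               # linear velocity
    "angle",                  # lander angle
    "angular_velocity",
    "left_leg_contact",       # 1.0 if the leg touches the ground, else 0.0
    "right_leg_contact",
)

ACTION_LABELS = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe_observation(obs):
    """Map an 8-element observation to a {name: value} dictionary."""
    if len(obs) != len(OBS_LABELS):
        raise ValueError(f"expected {len(OBS_LABELS)} values, got {len(obs)}")
    return {label: float(value) for label, value in zip(OBS_LABELS, obs)}


# Made-up observation, roughly "high above the pad, drifting right, falling".
example_obs = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]
described = describe_observation(example_obs)
for name, value in described.items():
    print(f"{name:>18}: {value:+.2f}")
```

In the notebook, the same helper could be applied to the `obs` returned by `env.reset()`.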


## Define the DQN Model
The key hyperparameters below cover the replay buffer size, learning rate, discount factor (`gamma`), exploration schedule, and target network update interval. The network architecture is left at the `MlpPolicy` default.


In [None]:
model = DQN(
    "MlpPolicy",
    env,
    learning_rate=2.5e-4,
    buffer_size=100_000,
    learning_starts=10_000,
    batch_size=64,
    tau=1.0,
    gamma=0.99,
    train_freq=4,
    target_update_interval=10_000,
    exploration_fraction=0.1,
    exploration_initial_eps=1.0,
    exploration_final_eps=0.05,
    verbose=1,
)

callback = EvalCallback(
    eval_env,
    best_model_save_path=log_dir,
    log_path=log_dir,
    eval_freq=10_000,
    deterministic=True,
    render=False,
)
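
The exploration settings above describe a linear decay of ε. As a rough sketch (a plain-Python stand-in, not the library's own code), the function below shows how `exploration_fraction`, `exploration_initial_eps`, and `exploration_final_eps` could combine with the 200k-step training budget used in this lesson. The function name `linear_epsilon` is hypothetical.

```python
def linear_epsilon(step, total_timesteps=200_000, fraction=0.1,
                   initial_eps=1.0, final_eps=0.05):
    """Linearly decay epsilon over the first `fraction` of training.

    Illustrative assumption: epsilon depends only on training progress,
    defined as step / total_timesteps.
    """
    progress = step / total_timesteps
    if progress >= fraction:
        # Decay phase is over: stay at the final value.
        return final_eps
    # Interpolate between the initial and final values.
    return initial_eps + (progress / fraction) * (final_eps - initial_eps)


for step in (0, 10_000, 20_000, 100_000):
    print(f"step {step:>7}: epsilon = {linear_epsilon(step):.3f}")
```

With `exploration_fraction=0.1` and 200,000 total steps, the decay phase covers the first 20,000 steps; after that the agent acts randomly only 5% of the time.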


## Train and Evaluate
We'll train for 200k steps. The callback evaluates the agent every 10,000 steps and saves the best model to `log_dir`; after training, we compute the mean reward over 20 evaluation episodes.


In [None]:
model.learn(total_timesteps=200_000, callback=callback)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"Mean reward: {mean_reward:.2f} ± {std_reward:.2f}")


# Lunar Lander with Gymnasium and Deep Q-Learning (DQN)

DQN is an off-policy reinforcement learning algorithm that learns an action-value function $Q(s,a)$: an estimate of the expected discounted return from taking action $a$ in state $s$ and acting greedily afterward.
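
To make the role of $Q(s,a)$ concrete, the sketch below computes the one-step temporal-difference target that DQN regresses toward, $y = r + \gamma \max_{a'} Q_{\text{target}}(s', a')$, using plain Python lists in place of network outputs. The function names and the numbers are illustrative assumptions, and the Q-values for the next state are assumed to come from the target network.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target: r + gamma * max_a' Q_target(s', a').

    If the episode terminated at this step there is no future return,
    so the target is just the immediate reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, done, gamma=0.99):
    """Difference between the TD target and the current estimate Q(s, a)."""
    return td_target(reward, next_q_values, done, gamma) - q_value


# Made-up values for a single transition with 4 possible next actions.
next_q = [0.5, 2.0, -1.0, 0.0]
target = td_target(reward=1.0, next_q_values=next_q, done=False)
error = td_error(q_value=2.5, reward=1.0, next_q_values=next_q, done=False)
print(f"target = {target:.2f}, TD error = {error:.2f}")
```

Training reduces this error on mini-batches sampled from the replay buffer, which is what makes the method off-policy: the sampled transitions may have been generated by an older, more exploratory policy.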


In [None]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

## Create the Environment

In [None]:
env = gym.make("LunarLander-v3", render_mode="rgb_array")

## Train the Model

In [None]:
# Initialize the DQN model
model = DQN(
    policy="MlpPolicy",         # Use a Multi-Layer Perceptron (MLP) neural network
    env=env,                    # The Gym environment
    
    # === Learning Parameters ===
    learning_rate=1e-3,         # Step size for optimizer (Adam by default). Higher = faster but riskier updates.

    # === Replay Buffer ===
    buffer_size=50_000,         # Max number of past transitions to store. Larger = more stable learning but more memory.
    learning_starts=1000,       # Number of steps before learning starts (helps fill the buffer with diverse experience).
    batch_size=64,              # Number of samples per training update from the buffer.

    # === Discounting and Target Network ===
    gamma=0.99,                 # Discount factor for future rewards. Close to 1 = long-term reward focus.
    tau=1.0,                    # Soft update rate for the target network. 1.0 = hard update every target_update_interval.

    # === Training Frequency ===
    train_freq=4,               # Train the model every 4 environment steps.
    target_update_interval=1000,  # Number of environment steps between target network updates (delayed updates stabilize learning).

    # === Misc Settings ===
    verbose=1,                  # Verbosity level: 0 = silent, 1 = training info, 2 = debug.
)

# Train the model
model.learn(total_timesteps=200_000)

# Save the model
model.save("dqn_lunarlander")

# Evaluate the model
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

env.close()
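
The `tau` comment in the model definition distinguishes soft from hard target updates. A minimal sketch of that idea, using plain lists of floats in place of network weights (a simplified stand-in, not the library's implementation):

```python
def polyak_update(online_params, target_params, tau):
    """Blend online weights into target weights.

    new_target = tau * online + (1 - tau) * target
    tau = 1.0 copies the online weights (hard update);
    smaller values move the target only part of the way (soft update).
    """
    return [tau * o + (1.0 - tau) * t
            for o, t in zip(online_params, target_params)]


online = [1.0, 2.0]   # made-up online-network weights
target = [0.0, 0.0]   # made-up target-network weights

hard = polyak_update(online, target, tau=1.0)
soft = polyak_update(online, target, tau=0.5)
print("hard update:", hard)
print("soft update:", soft)
```

With `tau=1.0`, as configured here, each update simply replaces the target weights with the online weights, so `target_update_interval` alone controls how stale the target network is allowed to become.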

## Interpreting the Training Log

| Metric                        | Meaning                                                                                                                                                                                           |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`ep_len_mean`** | Average number of steps per episode. `LunarLander-v3` truncates episodes at 1000 steps, so a value near 1000 usually means the agent is hovering until the time limit rather than landing or crashing. |
| **`ep_rew_mean`** | Average reward per episode over recent episodes (the last 100 by default). In `LunarLander`, an average of **200 or more** counts as solved, while a random agent scores well **below 0**. An average below 200 suggests your agent hasn't converged yet. |
| **`exploration_rate`** | Current ε in ε-greedy exploration. It starts at `exploration_initial_eps` (1.0 by default) and decays to `exploration_final_eps` (0.05 by default). A value of 0.05 means the agent is now mostly exploiting the learned policy. |
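
The `ep_rew_mean` and `ep_len_mean` values are rolling averages over recent episodes. The sketch below shows one way such statistics could be maintained with a fixed-size window; the class `RollingEpisodeStats` is a hypothetical illustration rather than the logger's actual code, and the window size of 100 is an assumption matching the default mentioned above.

```python
from collections import deque
from statistics import mean


class RollingEpisodeStats:
    """Keep the most recent episodes and report their averages."""

    def __init__(self, window_size=100):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.rewards = deque(maxlen=window_size)
        self.lengths = deque(maxlen=window_size)

    def record(self, episode_reward, episode_length):
        self.rewards.append(float(episode_reward))
        self.lengths.append(int(episode_length))

    def summary(self):
        if not self.rewards:
            return None  # no finished episodes yet
        return {
            "ep_rew_mean": mean(self.rewards),
            "ep_len_mean": mean(self.lengths),
        }


# Made-up episode results with a small window to show old entries dropping out.
stats = RollingEpisodeStats(window_size=3)
for reward, length in [(-100, 100), (0, 200), (100, 300), (200, 400)]:
    stats.record(reward, length)
print(stats.summary())
```

Because the window only covers recent episodes, the reported means lag behind the agent's current behavior: an early run of crashes keeps pulling `ep_rew_mean` down until those episodes leave the window.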


## Render the Trained Agent

In [None]:
model = DQN.load("dqn_lunarlander")

# Create environment with human rendering
env = gym.make("LunarLander-v3", render_mode="human")

obs, _ = env.reset()
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

env.close()
