# Lunar Lander with Gymnasium and Deep Q-Learning (DQN)

DQN is an off-policy reinforcement learning algorithm that learns an action-value function $Q(s, a)$, which estimates the expected discounted future reward of taking action $a$ in state $s$ and acting greedily afterwards.
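
To make "learning $Q(s, a)$" more concrete: DQN trains its network toward a one-step bootstrapped target built from the observed reward and the best estimated value in the next state. The sketch below uses a plain Python list in place of a neural network output. `td_target` is a hypothetical helper written for illustration and is not part of Stable-Baselines3.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    next_q_values: Q-value estimates for every action in the next state
                   (in DQN these come from the target network).
    done:          True if the episode terminated on this transition,
                   in which case there is no future value to bootstrap from.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical transition: reward 1.0, four actions available in the next state.
target = td_target(reward=1.0, next_q_values=[0.5, 2.0, -1.0, 0.0], done=False)
print(target)  # 1.0 + 0.99 * 2.0
```

The network's prediction for the action actually taken is then regressed toward this target, one sampled batch at a time.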

In [5]:
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

## Create the Environment

In [6]:
env = gym.make("LunarLander-v3", render_mode="rgb_array")

## Train the Model

In [None]:
# Initialize the DQN model
model = DQN(
    policy="MlpPolicy",         # Use a Multi-Layer Perceptron (MLP) neural network
    env=env,                    # The Gym environment
    
    # === Learning Parameters ===
    learning_rate=1e-3,         # Step size for optimizer (Adam by default). Higher = faster but riskier updates.

    # === Replay Buffer ===
    buffer_size=50_000,         # Max number of past transitions to store. Larger = more stable learning but more memory.
    learning_starts=1000,       # Number of steps before learning starts (helps fill the buffer with diverse experience).
    batch_size=64,              # Number of samples per training update from the buffer.

    # === Discounting and Target Network ===
    gamma=0.99,                 # Discount factor for future rewards. Close to 1 = long-term reward focus.
    tau=1.0,                    # Soft update rate for the target network. 1.0 = hard update every target_update_interval.

    # === Training Frequency ===
    train_freq=4,               # Train the model every 4 environment steps.
    target_update_interval=1000,  # Number of environment steps between target network updates (delayed updates stabilize learning).

    # === Misc Settings ===
    verbose=1,                  # Verbosity level: 0 = silent, 1 = training info, 2 = debug.
)

# Train the model
model.learn(total_timesteps=200_000)

# Save the model
model.save("dqn_lunarlander")

# Evaluate the model
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

env.close()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 95.5     |
|    ep_rew_mean      | -154     |
|    exploration_rate | 0.982    |
| time/               |          |
|    episodes         | 4        |
|    fps              | 2452     |
|    time_elapsed     | 0        |
|    total_timesteps  | 382      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 102      |
|    ep_rew_mean      | -162     |
|    exploration_rate | 0.961    |
| time/               |          |
|    episodes         | 8        |
|    fps              | 2106     |
|    time_elapsed     | 0        |
|    total_timesteps  | 813      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 95.3     |
|    ep_rew_mean      | -146   
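
The `tau` and `target_update_interval` arguments in the training cell control how the target network follows the online network. Below is a minimal sketch of that blend, with plain Python lists standing in for network weights. `blend_target` is a hypothetical helper for illustration, not the Stable-Baselines3 implementation.

```python
def blend_target(online_params, target_params, tau):
    """Return new target weights: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for w, t in zip(online_params, target_params)]


online = [1.0, 2.0]   # stand-in for the online network's weights
target = [0.0, 0.0]   # stand-in for the target network's weights

hard = blend_target(online, target, tau=1.0)  # exact copy of the online weights
soft = blend_target(online, target, tau=0.5)  # moves halfway toward the online weights
print(hard, soft)
```

With `tau=1.0`, as configured here, each update is a full copy, so the target network only changes once per `target_update_interval`.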

## Interpreting the Training Log

| Metric                        | Meaning                                                                                                                                                                                           |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **`ep_len_mean`**       | Average number of steps per episode. `LunarLander-v3` truncates episodes at 1000 steps, so values near 1000 typically mean the agent hovers without landing, while short episodes early in training typically end in a crash. |
| **`ep_rew_mean`**       | Average reward per episode over recent episodes (the last 100 by default in Stable-Baselines3). An average of about **200** is the usual threshold for considering `LunarLander` solved; a near-random policy scores well below 0, as in the log above. |
| **`exploration_rate`**  | Current ε in ε-greedy exploration. It starts at 1.0 (fully random actions) and decays to a floor (0.05 by default in Stable-Baselines3). Once it reaches the floor, the agent mostly exploits its learned policy. |
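
The `exploration_rate` column can be reproduced by hand. The sketch below assumes the Stable-Baselines3 defaults, which the training cell does not override: ε decays linearly from 1.0 to 0.05 over the first 10% of `total_timesteps`. `exploration_rate` here is a hypothetical helper, not a library function.

```python
def exploration_rate(step, total_timesteps,
                     initial_eps=1.0, final_eps=0.05, fraction=0.1):
    """Linear epsilon decay over the first `fraction` of training, then constant."""
    progress = min(step / (fraction * total_timesteps), 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


# 382 and 813 are the `total_timesteps` values from the first two log entries.
for step in (382, 813, 20_000, 100_000):
    print(step, round(exploration_rate(step, 200_000), 3))
```

Under these assumptions the first two values come out as 0.982 and 0.961, matching the log, and ε reaches its floor of 0.05 after 20,000 of the 200,000 steps.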


## Render the Trained Agent

In [None]:
model = DQN.load("dqn_lunarlander")

# Create environment with human rendering
env = gym.make("LunarLander-v3", render_mode="human")

obs, _ = env.reset()
done = False

while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

env.close()
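
The render loop above discards the reward. To also report how well a single rendered episode went, the same loop can be wrapped in a function that accumulates the return. This is a sketch: `run_episode` and `CountdownEnv` are hypothetical names, and the stub environment only mimics the Gymnasium `reset`/`step` signatures so the example runs without Box2D installed.

```python
def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    obs, _ = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += float(reward)
        steps += 1
        done = terminated or truncated
    return total_reward, steps


class CountdownEnv:
    """Hypothetical stand-in env: pays +1 per step and terminates after 3 steps."""

    def reset(self):
        self.remaining = 3
        return self.remaining, {}

    def step(self, action):
        self.remaining -= 1
        return self.remaining, 1.0, self.remaining == 0, False, {}


print(run_episode(CountdownEnv(), policy=lambda obs: 0))  # (3.0, 3)
```

With the trained agent, the policy argument would be something like `lambda obs: model.predict(obs, deterministic=True)[0]`, passed together with the `LunarLander-v3` environment before it is closed.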