# **Lunar Lander DQN Simulation**

This simulation uses a Deep Q-Network (DQN) to train an agent to land a spacecraft on the lunar surface. A DQN combines a deep neural network with Q-learning, a reinforcement learning method, to choose among a discrete set of actions.

## **Overview**

The simulation uses the `LunarLander-v2` environment from the `gym` library, a toolkit for developing and comparing reinforcement learning algorithms. The aim is to guide a lunar lander to a safe landing on a landing pad. The environment provides a state containing positional and velocity information about the lander and returns rewards based on the lander's actions. The DQN agent learns from these rewards to improve its landing strategy over time.
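As a rough illustration of what the state contains, the sketch below names the eight observation components and the four discrete actions as they are commonly documented for `LunarLander-v2`. The `LanderState` tuple and the `describe` helper are hypothetical names introduced here for illustration; they are not part of `gym` or of this notebook.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical helper: names for the eight observation components."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: float
    right_leg_contact: float


# Discrete actions, listed in the order of their integer indices (0 to 3)
ACTIONS = ("do nothing", "fire left engine", "fire main engine", "fire right engine")


def describe(observation: Sequence[float]) -> LanderState:
    """Wrap a raw 8-element observation in named fields."""
    return LanderState(*(float(v) for v in observation))


# Made-up observation values, only to show the field layout
example = describe([0.0, 1.4, 0.1, -0.5, 0.05, 0.0, 0.0, 0.0])
print(example.y, example.angle, len(ACTIONS))
```

The notebook itself reads `x`, `y`, and `angle` from indices 0, 1, and 4 of the state, which matches this layout.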

The user interface for the simulation is created using the `pygame` library, providing a visual representation of the lunar lander's position, angle, and landing success rate.

## **Key Components**

1. **Environment Initialization:** The `LunarLander-v2` environment from `gym` is initialized, providing the framework for the DQN agent's interactions.
  
2. **Neural Network Model:** The DQN consists of a neural network model, implemented using `tensorflow` and `keras`. This model takes the state of the lunar lander as input and outputs one estimated Q-value per action; the agent then picks the action with the highest value.

3. **Experience Replay:** As the agent interacts with the environment, it stores its experiences in a memory buffer. Periodically, the agent samples from this buffer to train its neural network, improving its decision-making abilities.

4. **Exploration vs. Exploitation:** The agent employs an epsilon-greedy strategy, where it occasionally takes a random action (exploration) but predominantly chooses the action recommended by its neural network (exploitation). Over time, the agent reduces its exploration rate.

5. **Visual Simulation with Pygame:** The position, angle, and landing success rate of the lunar lander are visualized using the `pygame` library. This provides an interactive display, allowing users to watch the agent's progress in real-time.
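The experience replay component (item 3) relies on a bounded buffer that discards its oldest entries once full. A minimal sketch of that behavior, assuming a tiny capacity of 5 instead of the 2000 used later in the notebook, and placeholder integers in place of real states:

```python
import random
from collections import deque

# Tiny capacity for illustration only; the notebook's DQN class uses maxlen=2000
buffer = deque(maxlen=5)

for step in range(8):
    # (state, action, reward, next_state, done) with placeholder values
    buffer.append((step, 0, 1.0, step + 1, False))

# Steps 0, 1 and 2 were evicted when newer experiences arrived
oldest_state = buffer[0][0]

# Uniform random minibatch, drawn without replacement
batch = random.sample(list(buffer), 3)
print(len(buffer), oldest_state, len(batch))
```

Sampling at random rather than replaying the most recent steps in order is what breaks the correlation between consecutive experiences.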

---

## **Getting Started:**

**Install the required packages**

These packages provide the reinforcement learning environment, the deep learning model, and the interactive visualization:

- **Gym**  # Reinforcement learning environment
- **Pygame**   # Game development and visualization
- **Numpy**  # Numerical computations
- **Tensorflow**  # Deep learning framework
- **Keras**  # High-level neural networks API

In [1]:
# Install the required packages
## -- pip install gym box2d-py tensorflow pygame -- ##
import gym
import pygame
import numpy as np
import random
import tensorflow as tf
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers.legacy import Adam

**Pygame Setup and Display Configuration**

This section initializes the Pygame library, which is essential for creating interactive games and simulations. It then defines the display settings, setting the screen's width and height to 600x400 pixels. A window is created using these dimensions, and it's titled "Lunar Lander DQN Simulation." For visualization, background and rocket images are loaded and resized to fit appropriately on the screen. Lastly, standard colors (WHITE and GREEN) and a font for displaying iteration numbers are defined for further use in the simulation's user interface.

In [2]:
pygame.init()

# Display settings
SCREEN_WIDTH = 600
SCREEN_HEIGHT = 400
screen = pygame.display.set_mode((SCREEN_WIDTH, SCREEN_HEIGHT))
pygame.display.set_caption("Lunar Lander DQN Simulation")

# Load and resize images
background_image = pygame.transform.scale(pygame.image.load("moon.png"), (SCREEN_WIDTH, SCREEN_HEIGHT))
rocket_image = pygame.transform.scale(pygame.image.load("rocket.png"), (150, 150))

# Define colors and fonts
WHITE = (255, 255, 255)
GREEN = (0, 255, 0)
iteration_font = pygame.font.Font(None, 24)

**User Interface Rendering for Lunar Lander Simulation**

This set of functions is dedicated to rendering the visual components of the Lunar Lander DQN Simulation. The `draw_lander` function takes the x and y coordinates, along with the angle of the rocket, to draw a rotated rocket image on the screen. In contrast, the `draw_ui_text` function displays the current episode number and the success rate of landings to keep the user informed about the simulation's progress. The `draw_landing_pad` function visualizes the landing pad; its color indicates whether the landing is successful (GREEN) or not (WHITE). Finally, the `draw_lunar_ui` function integrates all these components to create the complete user interface for the simulation, showcasing the lander, the landing pad, the episode information, and the success rate on a lunar background.

In [5]:
def draw_lander(x, y, angle):
    """Draw the rocket on the screen with a given position and angle."""
    screen_x = int((x + 1) * SCREEN_WIDTH / 2)
    screen_y = int(SCREEN_HEIGHT - y * SCREEN_HEIGHT / 2)
    rotated_rocket = pygame.transform.rotate(rocket_image, -np.degrees(angle))
    rocket_rect = rotated_rocket.get_rect(center=(screen_x, screen_y))
    screen.blit(rotated_rocket, rocket_rect.topleft)

def draw_ui_text(episodeIndex, success_percentage):
    """Draw episode and success rate text on the screen."""
    episode_text = iteration_font.render(f"Episode: {episodeIndex+1}", True, WHITE)
    success_text = iteration_font.render(f"Success Rate: {success_percentage:.2f}%", True, WHITE)
    screen.blit(episode_text, (int(SCREEN_WIDTH * 0.07), int(SCREEN_HEIGHT * 0.9)))
    screen.blit(success_text, (int(SCREEN_WIDTH * 0.7), int(SCREEN_HEIGHT * 0.9)))

def draw_landing_pad(successful_landing):
    """Draw the landing pad on the screen."""
    pad_width = 100
    pad_height = 15
    pad_x = (SCREEN_WIDTH - pad_width) // 2
    pad_y = SCREEN_HEIGHT - pad_height
    color = GREEN if successful_landing else WHITE
    pygame.draw.rect(screen, color, (pad_x, pad_y, pad_width, pad_height))

def draw_lunar_ui(x, y, angle, episodeIndex, success_percentage, successful_landing):
    """Draw the entire user interface for the lunar lander simulation."""
    screen.blit(background_image, (0, 0))
    draw_lander(x, y, angle)
    draw_ui_text(episodeIndex, success_percentage)
    draw_landing_pad(successful_landing)
    pygame.display.flip()
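The coordinate mapping inside `draw_lander` can be checked without opening a window. The sketch below repeats the same arithmetic in a standalone function; `world_to_screen` is a hypothetical name used only for this illustration.

```python
SCREEN_WIDTH = 600
SCREEN_HEIGHT = 400


def world_to_screen(x, y):
    """Same arithmetic as draw_lander: x in [-1, 1] spans the full width,
    and y = 0 maps to the bottom edge of the window."""
    screen_x = int((x + 1) * SCREEN_WIDTH / 2)
    screen_y = int(SCREEN_HEIGHT - y * SCREEN_HEIGHT / 2)
    return screen_x, screen_y


print(world_to_screen(0.0, 0.0))   # (300, 400): horizontally centered, at the bottom
print(world_to_screen(-1.0, 1.0))  # (0, 200): left edge, mid-height
print(world_to_screen(1.0, 2.0))   # (600, 0): right edge, top
```

Pygame's y axis grows downward, which is why the environment's `y` is subtracted from `SCREEN_HEIGHT`.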

**Deep Q-Network (DQN) Class Initialization**

This `DQN` class encapsulates the Deep Q-Network implementation tailored for the lunar lander's training. Upon initialization, the class sets up the environment, determining the size of the state and possible actions from the provided environment. A memory buffer is initialized with a maximum length of 2000 to store experiences for training. Several parameters, crucial for the Q-learning algorithm, are set, such as the discount factor (`gamma`), exploration rate (`epsilon`), its minimum value (`epsilon_min`), and its decay rate (`epsilon_decay`). The learning rate for the neural network training is also specified. The class then constructs two neural network models: the main model (`model`) used for decision-making and a target model (`target_model`) that aids in stabilizing the learning process. The target model's weights are initialized to match the main model's weights.

In [6]:
class DQN:
    """Deep Q-Network implementation for training the lunar lander."""
    
    def __init__(self, env):
        self.env = env
        self.state_size = env.observation_space.shape[0]
        self.action_size = env.action_space.n
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95  # discount factor
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()
        self.target_model = self._build_model()
        self.update_target_model()
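To get a feel for the exploration schedule these hyperparameters imply, the sketch below counts how many decay steps it takes for `epsilon` to fall from 1.0 to `epsilon_min`. It assumes one multiplication by `epsilon_decay` per training call, as the class's `replay` method does, and uses no notebook code.

```python
import math

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

# Simulate the decay rule: multiply while epsilon is still above the floor
decay_steps = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    decay_steps += 1

# Closed form: smallest n such that epsilon_decay ** n <= epsilon_min
closed_form = math.ceil(math.log(epsilon_min) / math.log(epsilon_decay))

print(decay_steps, closed_form)  # 919 919
```

With these values the agent needs roughly nine hundred training calls before it acts almost entirely greedily.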

**Neural Network Construction for DQN**

The `_build_model` function constructs the neural network architecture that underpins the Deep Q-Network (DQN). The model is built using a sequential arrangement of layers. The first layer consists of 24 neurons with a ReLU activation function, taking the state's size as its input dimension. This is followed by another dense layer with 24 neurons and a ReLU activation function. The final layer has a neuron count equal to the number of possible actions and uses a linear activation function, outputting the Q-values for each action. The model is compiled using the Mean Squared Error (MSE) as its loss function and the Adam optimizer, with the specified learning rate, to adjust the network's weights during training.

In [7]:
    def _build_model(self):
        """Build the neural network model for the DQN."""
        model = Sequential()
        model.add(Dense(24, input_dim=self.state_size, activation='relu'))
        model.add(Dense(24, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))

        return model
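The size of this network is easy to work out by hand. The sketch below counts trainable parameters for the three dense layers, assuming a state size of 8 and 4 actions (the usual `LunarLander-v2` values); `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units


# Assumed LunarLander-v2 dimensions: 8 state values, 4 discrete actions
state_size, action_size = 8, 4
layer_sizes = [state_size, 24, 24, action_size]

per_layer = [dense_params(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [216, 600, 100] 916
```

The same total should appear at the bottom of `model.summary()` when the model is built with these dimensions.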

**DQN Model Operations and Decision-Making**

Within these methods, core operations of the DQN are defined:

- `update_target_model`: This function synchronizes the weights of the target model with those of the primary DQN model. This step is essential in Q-learning with neural networks to stabilize the learning process by having a fixed target for Q-value estimations.
  
- `remember`: This function captures and stores an experience in the agent's memory. An experience, in this context, comprises the state, the action taken, the reward received, the next state, and a flag indicating if the episode ended. By storing these experiences, the agent can later sample from them to train its neural network, a process known as experience replay.

- `act`: This function determines the action the agent should take. It uses an epsilon-greedy approach: with a probability of `epsilon`, it chooses a random action (exploration), and with a probability of \(1 - \epsilon\), it queries the DQN model to select the action with the highest predicted Q-value (exploitation). This blend of exploration and exploitation ensures the agent can discover new strategies while also capitalizing on what it has already learned.

In [8]:
    def update_target_model(self):
        """Update the target model with weights from the main model."""
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        """Store the experience in memory."""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """Select an action using the DQN or a random action."""
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        # verbose=0 suppresses the progress bar Keras prints on every call
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0])
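The epsilon-greedy rule in `act` can be tried on a fixed list of Q-values without a neural network. This is a minimal sketch; `epsilon_greedy` is a hypothetical standalone function, and the Q-values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


rng = random.Random(0)            # seeded so the run is repeatable
q_values = [0.1, 0.7, -0.2, 0.3]  # made-up Q-values for four actions

greedy_action = epsilon_greedy(q_values, 0.0, rng)   # never explores here
random_action = epsilon_greedy(q_values, 1.0, rng)   # always explores
print(greedy_action, random_action)
```

With `epsilon` at 0.0 the function returns index 1, the action with the largest Q-value; with `epsilon` at 1.0 it ignores the Q-values entirely.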

**Experience Replay and DQN Training**

The `replay` function facilitates the training of the DQN model using a concept known as experience replay. By randomly sampling a batch of experiences from the agent's memory, the function breaks correlations between consecutive experiences, enhancing the stability of the training process. For each experience in the minibatch:

1. The current and next states are passed through the DQN and the target model, respectively, to obtain predicted Q-values.

2. For the given action, if the episode ended (`done` is True), the target Q-value is simply the received reward. If the episode continued, the target Q-value is computed as the sum of the received reward and the discounted maximum Q-value predicted for the next state by the target model.

3. With these target Q-values and the original states, the DQN model is trained for one epoch, adjusting its weights to minimize the difference between its predictions and the computed target Q-values.

4. Lastly, the exploration rate (`epsilon`) decays by a factor (`epsilon_decay`), promoting a gradual shift from exploration to exploitation as the agent learns more about the environment. However, the decay stops once `epsilon` reaches a minimum threshold (`epsilon_min`), ensuring some level of exploration is always maintained.

In [9]:
    def replay(self, batch_size):
        """Train the DQN using experiences from the memory."""
        minibatch = random.sample(self.memory, batch_size)
        states = np.array([experience[0][0] for experience in minibatch])
        next_states = np.array([experience[3][0] for experience in minibatch])
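        # Current Q-value estimates from the main model (progress bar suppressed)
        targets = self.model.predict(states, verbose=0)
        # Next-state Q-values from the target model, used to build the targets
        next_state_targets = self.target_model.predict(next_states, verbose=0)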
        
        for i, (state, action, reward, next_state, done) in enumerate(minibatch):
            if done:
                targets[i][action] = reward
            else:
                targets[i][action] = reward + self.gamma * np.amax(next_state_targets[i])
                
        self.model.fit(states, targets, epochs=1, verbose=0)
        
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
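The target computation inside `replay` can also be written as a single vectorized expression. The sketch below works through three made-up experiences with NumPy only; the reward and Q-value numbers are invented for illustration and do not come from a trained model.

```python
import numpy as np

gamma = 0.95  # same discount factor as the DQN class

# Three made-up experiences: rewards, episode-ended flags, target-model Q-values
rewards = np.array([1.0, -0.5, -10.0])
dones = np.array([False, False, True])
next_q = np.array([
    [0.2, 1.0, 0.4, 0.0],
    [2.0, 0.5, 0.1, 0.3],
    [9.0, 9.0, 9.0, 9.0],  # ignored, because this experience ended the episode
])

# target = reward                         if done
# target = reward + gamma * max_a Q(s',a) otherwise
targets = rewards + gamma * next_q.max(axis=1) * (~dones)

print(targets)  # values: 1.95, 1.4, -10.0
```

Only the entry for the action actually taken is overwritten with this target; the other outputs keep the model's own predictions, so they contribute no error to the MSE loss.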

**Model Persistence: Saving and Loading Weights**

These functions, `load` and `save`, provide the capability for model persistence. The `load` function allows for the restoration of previously trained weights into the DQN model, facilitating the continuation of training or policy deployment without starting from scratch. Conversely, the `save` function offers the ability to store the current weights of the DQN model, ensuring that the knowledge acquired by the agent during its training sessions can be saved and reused at a later time. This ability is crucial for scenarios where training is time-consuming, or the learned policy needs to be transferred or deployed elsewhere.

In [10]:
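    def load(self, name):
        """Load previously saved weights into the main model."""
        self.model.load_weights(name)

    def save(self, name):
        """Save the main model's current weights to a file."""
        self.model.save_weights(name)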

**Setting Up the Environment and Agent Initialization**

In this segment, the Lunar Lander environment is initialized using the `gym` library, which provides a standardized interface for reinforcement learning tasks. An instance of the DQN agent is then created, taking this environment as an argument, allowing it to interact and learn from it. Additionally, a batch size of 32 is specified for experience replay, indicating that 32 experiences will be randomly sampled from memory for each training iteration. The simulation is set to run for 1,000 episodes, and a counter (`successful_landings`) is initialized to zero to keep track of the number of times the agent successfully lands the lunar lander during these episodes.

In [11]:
env = gym.make('LunarLander-v2')
agent = DQN(env)
batch_size = 32
num_episodes = 1000
successful_landings = 0

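**Note:** If the methods shown in cells In [7] through In [10] are not part of the `class DQN:` body from In [6], creating the agent fails with `AttributeError: 'DQN' object has no attribute '_build_model'`. Define the whole class in a single cell before running this one.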

**Training Loop Initialization**

This code initiates the main training loop for the agent over a specified number of episodes. For each episode:

- The cumulative reward, represented by `total_reward`, is initialized to zero. This will accumulate the rewards the agent receives during the episode, providing a measure of the agent's performance.

- A boolean flag, `successful_landing`, is set to `False` by default. This flag will be used later to determine if the agent achieved a successful landing in the current episode.

- The environment is reset using the `env.reset()` method, signifying the start of a new episode. This method returns the initial state of the environment.

- The obtained state is then reshaped to fit the expected input shape of the DQN model. This ensures compatibility when the state is passed to the neural network for action prediction.

In [None]:
# Loop over all episodes for training
for e in range(num_episodes):
    # Initialize the reward for this episode
    total_reward = 0
    # Flag to check if the landing was successful in this episode
    successful_landing = False
    # Reset the environment for a new episode and get initial state
    state, _ = env.reset()
    # Reshape the state to match the expected input shape for the DQN model
    state = np.reshape(state, [1, agent.state_size])

**Episode Time-Step Loop**

This code segment introduces a loop that processes each time step within an episode, with a maximum of 500 time steps per episode. For every time step:

- Positional and angular data are extracted from the current state of the environment. Specifically, `x` and `y` represent the horizontal and vertical positions of the lunar lander, respectively, while `angle` denotes its orientation. These values are used only for drawing the lander; the agent still receives the full state when it chooses an action.

In [None]:
    # Loop for each time step in the episode
    for time in range(500):
        # Extract position and angle information from the state
        x = state[0][0]
        y = state[0][1]
        angle = state[0][4]

**Time-Step Processing and Interaction with Environment**

In this section, the agent interacts with the environment at each time step, making decisions and receiving feedback:

- The success percentage is calculated by dividing the number of successful landings by the total episodes so far. This gives a running measure of the agent's performance.

- The `draw_lunar_ui` function is called to visualize the lunar lander's current state, including its position, angle, episode number, and success percentage.

- The agent then decides on an action based on the current state by calling the `act` method. 

- This chosen action is executed in the environment using the `env.step(action)` method, which returns five values: the next state, the reward for the action, two flags indicating whether the episode has ended (termination and time-limit truncation), and an info dictionary.

- The received reward is added to the cumulative `total_reward` for the episode. If the step ends the episode, the reward stored in memory is then replaced with -10, whether the lander crashed or landed; `total_reward` keeps the original value.

- The next state is reshaped to fit the input format expected by the DQN model.

- The experience, which comprises the current state, action taken, received reward, next state, and the `done` flag, is stored in the agent's memory using the `remember` method. This memory will be used later for experience replay, a technique to improve the stability of DQN training.

- The next state becomes the current state for the subsequent time step, preparing the agent for the next iteration of the loop.

In [None]:
        # Calculate the success percentage so far
        success_percentage = (successful_landings / (e + 1)) * 100
        # Draw the UI with current state information
        draw_lunar_ui(x, y, angle, e, success_percentage, successful_landing)
        # Get an action from the DQN agent
        action = agent.act(state)
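        # Perform the action in the environment (five-value step API, as in gym 0.26+)
        next_state, reward, terminated, truncated, _ = env.step(action)
        # The episode is over if the lander terminated or the time limit was hit
        done = terminated or truncated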
        # Accumulate the reward
        total_reward += reward
        # Modify the reward if the episode is done
        reward = reward if not done else -10
        # Reshape the next state to match the expected input shape for the DQN model
        next_state = np.reshape(next_state, [1, agent.state_size])
        # Store this experience in the agent's memory
        agent.remember(state, action, reward, next_state, done)
        # Set the current state for the next iteration
        state = next_state

**User Interaction and Exit Handling**

This segment of the code handles user interactions within the `pygame` window. Specifically, it listens for events in the pygame window, and if a user requests to close the window or exit the simulation (e.g., by clicking the close button), the code responds by gracefully shutting down the `pygame` environment and then exiting the program entirely. This ensures a smooth user experience by allowing the user to terminate the simulation at any point.

In [None]:
        # Check for user exit request
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                exit()

**Episode Termination and Performance Evaluation**

This section addresses the scenario when an episode concludes, either due to the agent achieving its objective or failing in its task. 

- The `done` flag indicates whether the episode has ended. If it's set to `True`, further checks are conducted to evaluate the agent's performance.

- The agent's cumulative reward (`total_reward`) for the episode is examined. If it surpasses a threshold of 50, it implies that the agent has successfully landed the lunar lander, and thus the count of `successful_landings` is incremented.

- Subsequently, the target model of the DQN is updated to synchronize with the weights of the main model using the `update_target_model` method. This step is crucial to maintain stability during the Q-learning process.

- The `break` statement then exits the time-step loop prematurely, marking the end of the current episode and paving the way for the next episode to begin.

In [None]:
        # Check if episode is done
        if done:
            # If the total reward is above a threshold, consider it a successful landing
            if total_reward > 50:
                successful_landings += 1
            # Update the target model of the DQN
            agent.update_target_model()
            break

**Periodic Training, Landing Evaluation, and Inter-Episode Delay**

This section of the code encompasses several key actions that occur during and after each episode:

- The agent's DQN model is trained only during every tenth episode. The check sits inside the time-step loop: when the current episode index (`e`) is a multiple of 10 and the agent's memory holds more than `batch_size` experiences, the `replay` method is called after each time step that does not end the episode. This method trains the DQN model on randomly sampled experiences from the agent's memory, using experience replay.

- After all the time-steps of an episode are processed, the agent evaluates whether it achieved a successful landing. The criteria for success are that the cumulative reward (`total_reward`) for the episode exceeds 50 and the vertical position (`y`) of the lander is less than 0.1. If both conditions are met, the `successful_landing` flag is set to `True`.

- Finally, before transitioning to the next episode, there's a brief pause (or wait time) of 500 milliseconds. This delay, implemented using `pygame.time.wait(500)`, provides a short break, allowing users to visually process the outcome of the current episode before the next one commences.

In [None]:
        # Every 10 episodes, train the DQN with experiences from memory
        if e % 10 == 0:
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)

    # Check the landing conditions at the end of an episode
    if total_reward > 50 and y < 0.1:
        successful_landing = True

    # Wait for a short duration before starting the next episode
    pygame.time.wait(500)

**Training Conclusion and Cleanup**

This segment marks the end of the training process:

- A message, "Training completed!", is printed to the console, signaling to the user that all episodes have been processed, and the training phase is over.
- The `pygame.quit()` function is called to gracefully shut down the `pygame` environment, ensuring all resources are freed and no lingering processes remain. This cleanup step is crucial to maintain system stability and avoid potential issues, especially when running extensive simulations.

In [None]:
print("Training completed!")
pygame.quit()