In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**The Lunar Lander Environment (Reinforcement Learning)**

The lunar lander environment is a reinforcement learning problem where the goal is to train an agent (a lander) to land on a designated landing pad on the surface of the moon. Reinforcement learning is a type of machine learning technique where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties for its actions. The goal is to maximize the cumulative reward over time. The environment is provided by OpenAI Gym, which is a toolkit for developing and comparing reinforcement learning algorithms.

**Problem Statement**

The goal of this project is to implement a Deep Q-Learning algorithm using TensorFlow that lands the lunar lander safely on the landing pad on the surface of the moon. The landing pad is designated by two flag poles and its center is at coordinates (0,0), but the lander is also allowed to land outside of the landing pad. The lander starts at the top center of the environment with a random initial force applied to its center of mass and has infinite fuel.

**Markov Decision Process (MDP)**

The lunar lander environment is modeled as a Markov Decision Process, which is a formal framework for describing reinforcement learning problems. In an MDP, the future depends only on the current state and the action taken, not on the history of how the agent got there. The agent interacts with the environment by choosing actions, observing new states, and receiving rewards. The lunar lander environment consists of the following:

* **State**- The current position, velocity, and other relevant variables that describe the lander's situation.

* **Action**- The agent's choice of applying force to the lander's engines in a particular direction.

* **Reward**- The feedback signal that the agent receives based on its actions (e.g., a positive reward for a successful landing, a negative reward for crashing). Fuel is infinite in this environment, but each engine firing carries a small penalty.

* **Transition Function**- The dynamics that determine how the lander's state changes based on the current state and the agent's action.

**Deep Q-Learning**

Deep Q-Learning is a popular reinforcement learning algorithm that combines Q-learning (a value-based method) with deep neural networks. The agent learns to estimate the expected future rewards (Q-values) for each state-action pair, and then takes the action with the highest Q-value in a given state.
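
As a minimal illustration of that last step, the greedy choice is just an argmax over the Q-values for the current state. The numbers below are made up for the example, and `greedy_action` is a hypothetical helper, not part of any library.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Made-up Q-values for one state, one entry per action
example_q_values = [0.1, -0.4, 1.7, 0.3]
best_action = greedy_action(example_q_values)
print(best_action)  # 2
```

In the actual agent the Q-values would come from the neural network rather than a hand-written list.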

**TensorFlow**

TensorFlow is a popular open-source machine learning library developed by Google. In the lunar lander environment, TensorFlow can be used to build and train the deep neural networks that approximate the Q-function for the agent.

**Agent Actions**

The agent has four discrete actions available, each corresponding to a numerical value:

1. Do nothing = 0
2. Fire left orientation engine = 1
3. Fire main engine = 2
4. Fire right orientation engine = 3

In [2]:
# Import all necessary packages and libraries
import numpy as np
import time
import random
from collections import deque, namedtuple

import gym
from gym.wrappers import NormalizeObservation
from gym.wrappers import TimeLimit
from gym.wrappers import ClipAction
import cv2

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.losses import MSE
from tensorflow.keras.optimizers import Adam

import matplotlib.pyplot as plt



In [3]:
# Setting the random seed for TensorFlow
tf.random.set_seed(42)

In [4]:
# Hyperparameter selection
MEMORY_SIZE = 100_000
GAMMA = 0.995
ALPHA = 1e-3
NUM_STEPS_FOR_UPDATE = 4

**Observation Space & Action Space**

In reinforcement learning, the observation space defines what information the agent receives from the environment at each step. It can include things like the position, velocity, and orientation of the Lunar Lander. 

The action space defines the set of actions the agent can take. For the Lunar Lander, there are 4 discrete actions, which were already mentioned previously. 

This information is crucial for designing the neural network. The size of the state vector (8 in this case) determines the number of input neurons, and the number of possible actions (4) determines the number of output neurons, where each output neuron gives the estimated Q-value of one action.

In [5]:
# Create the Lunar Lander environment
!apt-get update
!apt-get install -y swig
!pip install gym[box2d]
env = gym.make('LunarLander-v2')

# Get state and action sizes
state_size = env.observation_space.shape[0]  # Number of state values
num_actions = env.action_space.n  # Number of possible actions

print('State size:', state_size)
print('Number of actions:', num_actions)

# Do not close the environment here: the cells below reset and step it


**State Size (8)**

The Lunar Lander's state is a vector that describes its current situation. It consists of 8 components:

1. **x position**- The horizontal position of the lander.

2. **y position**- The vertical position of the lander.

3. **x velocity**- The horizontal velocity of the lander.

4. **y velocity**- The vertical velocity of the lander.

5. **angle**- The angle of the lander in radians.

6. **angular velocity**- The angular velocity of the lander in radians per second.

7. **left leg contact**- A boolean (1 or 0) indicating if the left leg is in contact with the ground.

8. **right leg contact**- A boolean (1 or 0) indicating if the right leg is in contact with the ground.
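
To make the ordering concrete, the raw observation can be wrapped so that each component is readable by name. This is only a sketch: `LanderState` and `label_state` are hypothetical helpers with field names chosen here, and the sample values are taken from the single-step output shown later in this notebook.

```python
from collections import namedtuple

# Hypothetical field names for the 8 state components, in the order listed above
LanderState = namedtuple(
    "LanderState",
    ["x", "y", "vx", "vy", "angle", "angular_velocity", "left_contact", "right_contact"],
)


def label_state(observation):
    """Wrap a raw 8-value observation so its components can be read by name."""
    values = [float(v) for v in observation]
    if len(values) != len(LanderState._fields):
        raise ValueError(f"expected 8 values, got {len(values)}")
    return LanderState(*values)


# Sample observation from the single-step output in this notebook
obs = [-0.01184988, 1.3984901, -0.59748363, -0.27277595, 0.01385155, 0.13912511, 0.0, 0.0]
labeled = label_state(obs)
print(labeled.y, labeled.vy)
```

Here `labeled.left_contact` and `labeled.right_contact` are both `0.0`, which matches a lander that is still in the air.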


**Resetting the Environment**

I'm going to reset the environment to its initial state using *env.reset()*. This ensures that each episode begins from a defined starting point rather than from wherever the previous episode ended. *env.reset()* also returns the initial observation for the new episode, which serves as the starting point for the agent's decision-making process.

If the environment isn't reset between episodes, the agent would start from an arbitrary state reached in the previous episode. This can introduce bias into the learning process, because the agent's actions and the resulting rewards would depend on the final state of the previous episode. Resetting ensures that each episode starts fresh.
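
One practical detail: what `env.reset()` returns depends on the Gym version. To my understanding, older releases return the observation alone, while Gym 0.26 and later return an `(observation, info)` tuple. A small helper can hide the difference. `reset_observation` and `FakeEnv` below are hypothetical names used only for this sketch.

```python
def reset_observation(env, **reset_kwargs):
    """Reset env and return only the observation, whichever API style it uses."""
    result = env.reset(**reset_kwargs)
    # Newer API: (observation, info) where info is a dict
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


class FakeEnv:
    """Stand-in environment used only to exercise the helper."""

    def __init__(self, new_api):
        self.new_api = new_api

    def reset(self):
        observation = [0.0] * 8
        return (observation, {}) if self.new_api else observation


print(len(reset_observation(FakeEnv(new_api=True))))   # 8
print(len(reset_observation(FakeEnv(new_api=False))))  # 8
```

Storing the whole tuple as the state is an easy mistake to make: converting it to a NumPy array later fails, because the first element is an 8-value array and the second is a dictionary.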

In [6]:
# Reset the environment (Gym 0.26+ returns the observation and an info dict)
initial_state, info = env.reset()

**Action**

Once the environment is reset, the agent can start taking action. It's important to note that the agent can only take one action per time step. 

In [7]:
# Select an action
action= 2  # Fire main engine

# Run a single time step of the environment
next_state, reward, terminated, truncated, info = env.step(action)

# Print results
print('Next state: ', next_state)
print('Reward: ', reward)
print('Terminated: ', terminated)  # True if episode is over due to termination
print('Truncated: ', truncated)  # True if the episode is over due to truncation
print('Info: ', info)  # Diagnostic information

# Update the current state
current_state = next_state

Next state:  [-0.01184988  1.3984901  -0.59748363 -0.27277595  0.01385155  0.13912511
  0.          0.        ]
Reward:  0.3355417830521333
Terminated:  False
Truncated:  False
Info:  {}


**Analyzing the Output**

The 'Next state' values are the 8 components of the state vector after the action was taken (in this case, firing the main engine).

The 'Reward' received here is a small positive value (about 0.34). As far as I understand the environment's reward function, each step's reward combines shaping terms (distance to the landing pad, speed, and tilt) with a small penalty for firing the engines, so a positive value suggests that this step improved the lander's situation by more than the engine firing cost.

The 'Terminated' flag indicates whether the episode has ended. In this case, the result is 'False', meaning that the lander has neither landed successfully nor crashed yet, and the episode is still ongoing.

The 'Truncated' flag indicates whether the episode was cut short (due to reaching a time limit, for example). In this case, it's 'False', meaning the episode is progressing naturally.

**Core Concepts**

The following information includes core concepts pertaining to the lunar lander environment. These concepts are crucial for the Deep Q-Learning algorithm because they address challenges in reinforcement learning, such as instability, inefficient learning, and the need for accurate Q-value approximation.

1. **Q-Networks**- In reinforcement learning, the Q-function, denoted as Q(s, a), estimates the expected future reward for taking action a in state s. In essence, it says how "good" it is to take a specific action in a given situation. In DQN, a neural network is used to approximate this Q-function. This network takes the state as input and outputs Q-values for each possible action. In the lunar lander environment, the Q-Network takes the current state (which could be the position, velocity, angle, etc.) as input and outputs the estimated Q-values for each possible action (applying force in different directions, for example). The goal is to learn the optimal Q-values, which will guide the agent to take actions that maximize the cumulative reward (in other words, achieve a successful landing).

2. **The Bellman Equation**- The Bellman equation is a fundamental equation in reinforcement learning that relates the Q-value of a state-action pair to the Q-values of the next state. It expresses the idea that the optimal Q-value for a state-action pair equals the immediate reward received for taking that action plus the discounted maximum Q-value of the next state over all possible actions. In the lunar lander environment, the Bellman equation allows us to update the Q-values based on the observed rewards and transitions, enabling the Q-Network to learn the optimal policy iteratively. This equation helps the agent understand the value of taking certain actions in specific situations.

3. **Target Networks**- A second neural network, identical in structure to the Q-network, is introduced. This is called the target network. Its purpose is to provide stable target values for training the Q-network. The target network's weights are updated periodically by copying the weights of the Q-network. In this environment, using a Target Network helps stabilize the training process and improves the convergence of the Q-Network towards the optimal policy.

4. **Experience Replay**- This is a technique where the agent stores its experiences (state, action, reward, next state) in a memory buffer. During training, batches of experiences are randomly sampled from this buffer to update the Q-network. This helps break correlations between consecutive experiences and improves learning stability. Experience Replay improves the efficiency of learning by reusing past experiences and decorrelating the samples used for training the Q-Network.
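
Written out, the relationship described in item 2 is usually stated as below, followed by the training target and per-sample loss that DQN builds from it. The symbols are the standard ones and are assumptions of this sketch rather than notation defined elsewhere in the notebook: $s'$ is the next state, $a'$ ranges over the actions available there, $r$ is the immediate reward, $\gamma$ is the discount factor (`GAMMA` in the hyperparameters), $d$ is 1 when the episode ended and 0 otherwise, $\theta$ are the Q-network weights, and $\theta^-$ are the target network weights.

```latex
\begin{aligned}
Q^*(s, a) &= \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big] \\
y &= r + \gamma \,(1 - d)\, \max_{a'} \hat{Q}(s', a'; \theta^-) \\
L(\theta) &= \big( Q(s, a; \theta) - y \big)^2
\end{aligned}
```

The first line is the Bellman optimality equation, the second is the target value computed with the target network, and the third is the squared error that is averaged over a mini-batch.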

Deep Q-Learning faces instability issues due to the constantly changing nature of the Q-values during learning. By using a target network and experience replay, it stabilizes the learning process. The target network provides consistent target values, and experience replay helps decorrelate updates and reduce variance.

In simpler terms, through iterative updates using the Bellman equation and the target network, the Q-network gradually learns to approximate the optimal Q-function, leading to better decision-making by the agent.

Basically, the training process goes as follows:

1. The agent starts by exploring randomly, collecting experiences.

2. Experiences are stored in the replay buffer.

3. Batches of experiences are sampled from the buffer.

4. For each experience, the target network is used to calculate the target Q-value.

5. The Q-network is updated to minimize the difference between its predicted Q-value and the target Q-value.

6. Periodically, the target network's weights are updated to match the Q-network's weights.
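
Steps 2 and 3 above can be sketched with the standard library alone. This is a simplified illustration, not the buffer used in the training loop: `Transition` and `ReplayBuffer` are hypothetical names, and the transitions are fake.

```python
import random
from collections import deque, namedtuple

# One stored transition; field names are illustrative
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])


class ReplayBuffer:
    """Fixed-size buffer that drops the oldest transitions once it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, *fields):
        self.memory.append(Transition(*fields))

    def sample(self, batch_size):
        # Sampling without replacement breaks up consecutive, correlated steps
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=5)
for step in range(8):  # 8 fake transitions into a buffer that holds 5
    buffer.add([float(step)], step % 4, -1.0, [float(step + 1)], False)

batch = buffer.sample(batch_size=3)
print(len(buffer), len(batch))  # 5 3
```

Because the deque has a maximum length, the three oldest transitions are discarded automatically, which is the same behavior `deque(maxlen=MEMORY_SIZE)` provides at a much larger scale.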

With these concepts in mind, let's proceed to implement the DQN algorithm. 

In [8]:
# Create the Q-Network
q_network = Sequential([
    Input(shape=(state_size,)),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])

# Create the target Q-Network
target_q_network = Sequential([
    Input(shape=(state_size,)),
    Dense(units=64, activation='relu'),
    Dense(units=64, activation='relu'),
    Dense(units=num_actions, activation='linear'),
])

# Set up Adam optimizer with learning rate
optimizer = Adam(learning_rate=ALPHA)

What I've done here is define both the main Q-Network and the Target Network. In addition to this, the Adam optimizer is now ready to update the Q-Network's weights during training.

**Unit Test**

Now I'm going to create a unit test to verify the code. This is an important step because it allows me to catch errors in the development process and it lets me ensure that my code is behaving as expected. So before moving on with the rest of the DQN algorithm, the unit test will verify if the neural networks and the optimizer have been correctly implemented. 

In [9]:
# Unit test
assert q_network.count_params()== target_q_network.count_params() # Networks should have the same number of parameters
assert isinstance(optimizer, Adam) # Optimizer should be Adam
assert optimizer.learning_rate== ALPHA # Learning rate should be ALPHA

# Print model summaries (summary() prints the table itself and returns None)
q_network.summary()
target_q_network.summary()

**Model Summaries** 

The model summaries confirm both the Q-network and target Q-network have identical architectures: sequential models with an input layer of 8 neurons (matching the state size), two hidden layers of 64 neurons each using ReLU activation, and an output layer of 4 neurons (matching the number of actions) using linear activation. Both models have the same number of trainable parameters, indicating correct implementation for the DQN algorithm!
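
The parameter count that `count_params()` compares can also be checked by hand. The sketch below assumes standard fully connected layers with one bias per unit, which is the Keras `Dense` default; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units


layer_sizes = [8, 64, 64, 4]  # input, two hidden layers, output
total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)  # 4996
```

The three layers contribute 576, 4,160, and 260 parameters respectively.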

In [10]:
# Defining experience
class Experience:
    def __init__(self, state, action, reward, next_state, terminated, truncated):
        # Check if the state is a scalar or a nested sequence
        if isinstance(state, (list, tuple, np.ndarray)):
            self.state = np.array(state).flatten()  # Convert state to a flattened NumPy array
        else:
            self.state = np.array([state])  # Convert scalar state to a NumPy array

        self.action = action
        self.reward = reward

        # Check if the next_state is a scalar or a nested sequence
        if isinstance(next_state, (list, tuple, np.ndarray)):
            self.next_state = np.array(next_state).flatten()  # Convert next_state to a flattened NumPy array
        else:
            self.next_state = np.array([next_state])  # Convert scalar next_state to a NumPy array

        self.terminated = terminated
        self.truncated = truncated

**Deep Q-Network Loss**

Now, I'm going to calculate the loss for the algorithm. This loss measures how well the Q-network predicts the expected future reward (Q-values) for given states and actions.

Basically, the following function takes a batch of experiences as input, unpacks the relevant information (states, actions, rewards, etc.), and calculates the target Q-values using the Bellman equation and the target network. The target Q-values are then compared with the Q-values predicted by the main Q-network, and the mean squared error (MSE) between them is computed. This MSE loss is the output of the function and is used to update the Q-network's weights during training, ultimately improving the agent's decision-making capabilities.
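
A tiny worked example may help before reading the TensorFlow version. It uses plain Python lists, a made-up batch of two transitions, and a discount of 0.5 chosen only to keep the arithmetic easy (the notebook itself uses `GAMMA = 0.995`). The helper names are hypothetical.

```python
def td_targets(rewards, max_next_q, dones, gamma):
    """Bellman targets: r + gamma * max_a' Q_target(s', a'), with no bootstrap when done."""
    return [r + gamma * q * (0.0 if d else 1.0) for r, q, d in zip(rewards, max_next_q, dones)]


def mse(predictions, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up batch of two transitions
rewards = [1.0, -100.0]
max_next_q = [10.0, 5.0]     # max over actions, as the target network would give
dones = [False, True]        # the second transition ended the episode
predicted_q = [10.0, -90.0]  # Q-network output for the actions actually taken

targets = td_targets(rewards, max_next_q, dones, gamma=0.5)
print(targets)                    # [6.0, -100.0]
print(mse(predicted_q, targets))  # 58.0
```

For the second transition the next-state value is ignored because the episode is over, so the target is just the reward.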

In [11]:
# Define the loss function
def compute_loss(experiences, gamma, q_network, target_q_network):
    states_batch, action_batch, reward_batch, next_states_batch, terminated_batch, truncated_batch = experiences
    
    # Convert states and next_states to TensorFlow tensors
    processed_states = tf.convert_to_tensor(states_batch, dtype=tf.float32)
    next_states_batch = tf.convert_to_tensor(next_states_batch, dtype=tf.float32)
    
    # Convert other experience components to NumPy arrays
    action_batch = np.array(action_batch)
    reward_batch = np.array(reward_batch)
    terminated_batch = np.array(terminated_batch)
    truncated_batch = np.array(truncated_batch)
    
    # Compute Q values for current states and actions using the Q-network
    q_values = q_network(processed_states)
    q_values = tf.reduce_sum(q_values * tf.one_hot(action_batch, q_values.shape[1]), axis=1)
    
    # Compute Q values for next states using the target Q-network
    next_q_values = target_q_network(next_states_batch)
    max_next_q_values = tf.reduce_max(next_q_values, axis=1)
    
    # Compute target Q values
    target_q_values = reward_batch + gamma * max_next_q_values * (1 - terminated_batch) * (1 - truncated_batch)
    
    # Compute loss
    loss = tf.reduce_mean(tf.square(q_values - target_q_values))
    
    return loss

In [12]:
# Update the weights of the Q-Network
def agent_learn(experiences, gamma, q_network, target_q_network):
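    # Experience objects are not iterable, so gather each field by attribute
    states_batch = np.array([e.state for e in experiences])
    action_batch = [e.action for e in experiences]
    reward_batch = [e.reward for e in experiences]
    next_states_batch = np.array([e.next_state for e in experiences])
    terminated_batch = [e.terminated for e in experiences]
    truncated_batch = [e.truncated for e in experiences]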
    
    # Convert states and next_states to TensorFlow tensors
    processed_states = tf.convert_to_tensor(states_batch, dtype=tf.float32)
    next_states_batch = tf.convert_to_tensor(next_states_batch, dtype=tf.float32)

    # Convert other experience components to NumPy arrays
    action_batch = np.array(action_batch)
    reward_batch = np.array(reward_batch)
    terminated_batch = np.array(terminated_batch)
    truncated_batch = np.array(truncated_batch)

    # Compute the loss and perform gradient descent
    with tf.GradientTape() as tape:
        loss = compute_loss((processed_states, action_batch, reward_batch, next_states_batch, terminated_batch, truncated_batch), gamma, q_network, target_q_network)
    gradients = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(gradients, q_network.trainable_variables))

    # Soft update the weights of the Target Q-Network
    tau = 0.001
    for t, e in zip(target_q_network.trainable_variables, q_network.trainable_variables):
        t.assign(tau * e + (1 - tau) * t)
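
The soft update at the end of `agent_learn` nudges the target network toward the Q-network by a fraction `tau` on every learning step, instead of copying the weights outright. A plain-Python sketch of the same rule, with made-up weights and a hypothetical `soft_update` helper:

```python
def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau of the way toward the online weight."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_weights, online_weights)]


target = [0.0, 0.0]
online = [1.0, -2.0]
for _ in range(1000):
    target = soft_update(target, online, tau=0.001)

# With fixed online weights, the remaining gap shrinks by (1 - tau) per update
print(target)
```

After 1,000 updates with `tau = 0.001` the target has covered roughly 63% of the distance to the online weights, which shows how slowly the targets move compared with a hard copy.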


**Loop-a-Loop**

Now, let's create a loop to allow the agent to take many consecutive actions during an episode.

Reinforcement learning tasks are generally framed as episodes, where the agent starts in an initial state, interacts with the environment by taking actions, receives rewards, and eventually reaches a terminal state (in this case, the lunar lander either landing or crashing).

The agent needs to learn a policy that maps states to actions in a way that maximizes cumulative rewards over time. This learning happens through repeated interactions with the environment, where the agent explores different actions and learns from their consequences.

To learn effectively, reinforcement learning algorithms typically need a significant amount of experience data, consisting of transitions between states, actions, rewards, and next states. By running multiple steps within an episode, the agent can collect this valuable experience data.

Essentially, the core idea here is to create a training loop that allows the DQN agent to interact with the Lunar Lander environment, collect experiences, learn from those experiences, and gradually improve its policy (the strategy for choosing actions) over time. 

In the training loop, I will introduce the Epsilon-Greedy Concept. This concept is a simple yet effective approach to balance exploration and exploitation in reinforcement learning. For the lunar lander environment, it means that with a probability of epsilon, the agent will take a random action (exploration) to discover potentially better strategies, and with a probability of (1-epsilon), the agent will choose the action with the highest predicted Q-value according to its current knowledge (exploitation). This balance ensures that the agent both learns about its environment and makes use of the knowledge it has gained.
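
The rule can be sketched in a few lines of plain Python. The Q-values are made up, the helper names are hypothetical, and the `floor` on epsilon is an assumption of this sketch (the training loop below decays epsilon without a lower bound).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decayed_epsilon(start, decay, episode, floor=0.01):
    """Multiplicative decay per episode; the floor value is an assumption of this sketch."""
    return max(floor, start * decay ** episode)


rng = random.Random(0)
q_values = [0.2, 0.1, 0.9, 0.4]
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))      # 2, always greedy
print(round(decayed_epsilon(1.0, 0.995, episode=200), 3))  # 0.367
```

With a decay rate of 0.995, the agent still explores on about a third of its steps after 200 episodes, so exploration fades gradually rather than stopping abruptly.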

In [13]:
# Training loop
start_time = time.time()
replay_buffer = deque(maxlen=MEMORY_SIZE)

num_episodes = 2000
max_timesteps = 1000
total_point_history = []
num_p_av = 100
epsilon = 1.0
epsilon_decay = 0.995  # Add a decay rate for epsilon
BATCH_SIZE = 32

# Set the target network weights equal to the Q-Network weights
target_q_network.set_weights(q_network.get_weights())

for episode in range(num_episodes):
    total_points = 0
    # Gym 0.26+ returns (observation, info) from reset(); keep only the observation
    state, _ = env.reset()

    for t in range(max_timesteps):
        # Choose action using ε-greedy exploration
        if np.random.rand() <= epsilon:
            action = env.action_space.sample()  # Explore- random action
        else:
            state_qn = np.expand_dims(state, axis=0)
            q_values = q_network.predict(state_qn, verbose=0)
            action = np.argmax(q_values[0])  # Exploit- choose action with max Q-value

        # Take action and observe the environment
        next_state, reward, terminated, truncated, info = env.step(action)

        # Store experience in replay buffer
        experience = Experience(state, action, reward, next_state, terminated, truncated)
        replay_buffer.append(experience)

        # Update the Q-Network
        if len(replay_buffer) > BATCH_SIZE:  # Wait until we have enough experiences
            if t % NUM_STEPS_FOR_UPDATE == 0:
                # Sample random mini-batch of experience tuples from replay buffer
                experiences = random.sample(replay_buffer, BATCH_SIZE)
                agent_learn(experiences, GAMMA, q_network, target_q_network)

        # Update the Target Network every 100 steps
        if t % 100 == 0:
            target_q_network.set_weights(q_network.get_weights())

        state = next_state
        total_points += reward

        if terminated or truncated:
            break  # End episode

    total_point_history.append(total_points)

    # Average over the most recent num_p_av episodes, recomputed every episode
    avg_points = np.mean(total_point_history[-num_p_av:])

    # Print average reward every 100 episodes
    if episode % 100 == 0:
        print(f'Episode {episode + 1} | Average Points: {avg_points:.2f}')

    # Decay epsilon for better exploitation
    epsilon *= epsilon_decay

    if avg_points > 200.0:
        print(f'\n\nEnvironment solved in {episode + 1} episodes!')
        break

total_time = time.time() - start_time
print(f'\nTotal Runtime: {total_time:.2f} s')

After the training loop, the only step missing is to plot the results to visualize how the agent improved over training.
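
The plotting step could look roughly like the sketch below. It assumes `total_point_history` holds one total reward per episode, as in the training loop. `moving_average` and `plot_history` are hypothetical helpers, and the printed example uses made-up numbers rather than real training results.

```python
def moving_average(values, window):
    """Average of the last `window` values at each position (shorter at the start)."""
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


def plot_history(point_history, window=100):
    """Plot per-episode totals and their moving average."""
    import matplotlib.pyplot as plt  # imported here so the helper above has no dependencies

    plt.plot(point_history, alpha=0.4, label="Episode total")
    plt.plot(moving_average(point_history, window), label=f"{window}-episode average")
    plt.xlabel("Episode")
    plt.ylabel("Total points")
    plt.legend()
    plt.show()


print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 1.5, 2.5, 3.5]
```

Once training has produced a history, calling `plot_history(total_point_history, window=num_p_av)` would draw the raw totals together with the same 100-episode average that the loop prints.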