### Colab notebook for Illini VEX Robotics Software Development Meeting - 1/25/2024
kidskoding (Anirudh Konidala)

### Goals for this meeting 
- [x] Understand Deep Q-Learning and how it works
- [x] Use a sample environment from Gymnasium (the maintained fork of OpenAI Gym), [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/),
   and implement Deep Q-Learning with Deep Q-Networks to train the agent to land smoothly on the moon's surface
   
### Next meeting: February 1st, 2024

### Recap from last semester!

**Reinforcement Learning** → A branch in Machine Learning (ML) where an agent learns to make decisions by interacting with an environment in order to maximize some cumulative reward.
- **Agent** → The decision maker that interacts with the environment
- **Environment** → The space or system the agent operates in, which responds to the agent’s actions
- **State** → A representation of the environment’s current situation
- **Action** → Choices the agent can make that affect the environment
- **Reward** → A numerical value received after each action, indicating the desirability of the outcome

Reinforcement Learning balances **exploration** (trying new actions to discover their effects) and **exploitation** (choosing actions that are known to yield high rewards).

The classic Atari game, Pong, is one example of a task that Reinforcement Learning can be applied to.
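A common way to balance the two is an epsilon-greedy rule. The snippet below is a minimal sketch in plain Python; `epsilon_greedy` and the example Q-values are illustrative names and numbers, not part of this notebook's training code.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (illustrative helper)."""
    # Explore: with probability epsilon, pick any action uniformly at random
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: otherwise pick the action with the highest estimated value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Made-up Q-values for four actions
example_q_values = [0.1, 0.5, -0.2, 0.3]
greedy_action = epsilon_greedy(example_q_values, epsilon=0.0)  # always exploits
random_action = epsilon_greedy(example_q_values, epsilon=1.0)  # always explores
```

With `epsilon=0.0` the helper always returns the index of the largest value, and with `epsilon=1.0` it always returns a uniformly random index.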

- The **agent** would be the paddle that is controlled by the AI
- The **environment** would be the Pong game screen
- The **state** would be the position of the ball and both paddles
- The **action** would involve moving the paddle up, down, or keeping it stationary
- The **reward** would be positive for scoring a point and negative for losing a point

**Action Space** → The set of all possible actions an agent can take in a given environment
   - **Discrete Action Space** → A finite set of actions (e.g., moving left or right)

**Observation Space** → The set of all possible states an agent can observe in a given environment

### Deep Q-Learning

**Deep Q-Learning** is an extension of Q-Learning that uses a deep neural network, called a **Deep Q-Network (DQN)**, to approximate the Q-value
function (action-value function)
- The **Q-value function** takes a state and an action as input and outputs the expected cumulative (discounted) reward, 
measuring how good that action is for the agent in that state
    - Classic Q-Learning stores these values in a table; Deep Q-Learning approximates them with a neural network, the **Deep Q-Network (DQN)**
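For reference, the function the network approximates can be written with the Bellman optimality equation. This is the standard textbook form; the symbols $s$ (state), $a$ (action), $r$ (reward), $s'$ (next state), and $\gamma$ (discount factor) are notation introduced here, not names from the notebook's code.

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]
$$

Under that notation, Deep Q-Learning fits a network $Q_\theta$ by pushing $Q_\theta(s, a)$ toward the target $y = r + \gamma \max_{a'} Q(s', a')$, with $y = r$ when the episode ends at $s'$.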

### Key Components
- **Q-network** → a neural network that takes the state as input and outputs Q-values for each possible action
- **Target Network** → a copy of the Q-network, updated only periodically, that provides the target Q-values used to train the Q-network
    - stabilizes the training process for the DQN because the targets do not shift after every weight update
- **Experience Replay** → An RL technique where the agent stores and reuses past experiences to improve learning
    - Implemented via a **Replay Buffer** → a data structure (typically a deque) that stores the agent's past experiences 
    (typically state, action, reward, next state, and a done flag), allowing it to randomly sample and reuse these experiences during training

In [2]:
import gymnasium as gym

**Important Parameters** → These hyperparameters control the behavior and performance of the DQN
- **BUFFER_SIZE** → The maximum size of the replay buffer, which stores past experiences for training the Q-network.
- **BATCH_SIZE** → The number of experiences sampled from the replay buffer to train the Q-network in each training step.
- **GAMMA** → The discount factor used in the Q-learning update rule, which determines the importance of future rewards.
- **LR** → The learning rate for the optimizer, which controls how much to adjust the Q-network's weights with respect to the loss gradient.
- **EPSILON** → The initial value of epsilon for the epsilon-greedy policy, which determines the probability of choosing a random action versus the action suggested by the Q-network.
- **EPSILON_MIN** → The minimum value of epsilon, ensuring that there is always some probability of choosing a random action.
- **EPSILON_DECAY** → The decay rate for epsilon, which reduces epsilon after each episode to decrease the probability of choosing random actions over time.
- **TARGET_UPDATE_FREQ** → The frequency (in episodes) at which the target Q-network is updated with the weights of the current Q-network.

In [3]:
BUFFER_SIZE = 100000
BATCH_SIZE = 64
GAMMA = 0.99
LR = 1e-3
EPSILON = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
TARGET_UPDATE_FREQ = 10
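To get a feel for how quickly exploration fades with these values, the decay can be worked out directly. This is a small sketch in plain Python; `EPSILON_START`, `epsilon_after`, and `episodes_to_min` are illustrative names, and it assumes epsilon is multiplied by `EPSILON_DECAY` once per episode, as described above.

```python
import math

# Values copied from the hyperparameter cell above
EPSILON_START = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995

def epsilon_after(episodes):
    """Epsilon after a number of episodes of multiplicative decay, floored at EPSILON_MIN."""
    return max(EPSILON_MIN, EPSILON_START * EPSILON_DECAY ** episodes)

# Smallest number of episodes n with EPSILON_START * EPSILON_DECAY**n <= EPSILON_MIN
episodes_to_min = math.ceil(
    math.log(EPSILON_MIN / EPSILON_START) / math.log(EPSILON_DECAY)
)

print(episodes_to_min)                # 919
print(round(epsilon_after(500), 3))   # 0.082
```

So epsilon only reaches its floor after roughly 919 episodes; after 500 episodes the agent would still pick a random action about 8% of the time.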

Create the environment for training the agent by using the Lunar Lander environment from Gymnasium!

In [5]:
env = gym.make("LunarLander-v3", render_mode=None)

env.action_space.seed(42)
input_dim = env.observation_space.shape[0]  # 8 observation values
output_dim = int(env.action_space.n)        # 4 discrete actions

Create the Q-Network implementation, which subclasses `nn.Module` from the PyTorch library

- The Q-Network consists of three fully connected layers (fc1, fc2, fc3) that map the input state to one Q-value per action
    - The **input layer** (fc1) takes the input state and maps it to a higher-dimensional space → helps the network learn initial features from the input data
    - The **hidden layer** (fc2) processes the features learned by the first layer, allowing the DQN to extract more complex features
    - The **output layer** (fc3) produces one Q-value for each action, so its size equals the number of possible actions
- The forward function defines the forward pass of the network, or **forward propagation**, 
where the input state is passed through the DQN layers to produce the Q-values for each action
    - This is how the DQN makes predictions from the input data

In [None]:
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

Initialize the two neural networks for a DQN algorithm: the **Q-network** and the **target network**

In [7]:
q_network = QNetwork(input_dim, output_dim)
target_network = QNetwork(input_dim, output_dim)
target_network.load_state_dict(q_network.state_dict())
target_network.eval()

QNetwork(
  (fc1): Linear(in_features=8, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=4, bias=True)
)
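The printed layer shapes are enough to work out how many trainable parameters the network has. The arithmetic below is a plain-Python sketch; `linear_param_count` is an illustrative helper, and it assumes every layer has a bias, as the `bias=True` entries above indicate.

```python
# Layer shapes read off the printed QNetwork above: (in_features, out_features)
layer_shapes = [(8, 64), (64, 64), (64, 4)]

def linear_param_count(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features

per_layer = [linear_param_count(i, o) for i, o in layer_shapes]
total_params = sum(per_layer)

print(per_layer)     # [576, 4160, 260]
print(total_params)  # 4996
```

At just under 5,000 parameters, this is a very small network by deep learning standards.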

Initialize the optimizer and loss function for training the Q-network in a DQN algorithm

- **optimizer** - Initializes the **Adam (Adaptive Moment Estimation) optimizer** with the Q-network's parameters to update the weights of the Q-network based 
on the loss calculated during training
    - Adam adapts the step size for each parameter using running estimates of the first and second moments of the gradients
- **loss function** - Initializes the **Mean Squared Error (MSE) loss function** to measure the difference between the predicted Q-values 
and the target Q-values; the gradients of this loss are then computed during backpropagation

In [8]:
import torch.optim as optim

optimizer = optim.Adam(q_network.parameters(), lr=LR)
loss_fn = nn.MSELoss()
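With its default settings, `nn.MSELoss()` averages the squared differences over all elements. The plain-Python sketch below reproduces that calculation on made-up numbers; `mse`, `predicted_q`, and `target_q` are illustrative names, not part of the notebook.

```python
def mse(predicted, target):
    """Mean of the squared differences between two equal-length lists."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)

# Made-up predicted and target Q-values for a batch of three experiences
predicted_q = [1.0, 2.0, 3.0]
target_q = [1.0, 4.0, 0.0]

batch_loss = mse(predicted_q, target_q)  # (0 + 4 + 9) / 3
print(batch_loss)
```

Because the errors are squared, the one prediction that is off by 3 contributes more to the loss than the one that is off by 2.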

Initialize the replay buffer, which stores past experiences for training the Q-network in a DQN algorithm

- Done via an implementation of the **deque (double ended queue)** data structure
    - The store function appends new experiences to the buffer deque data structure
    - The sample function takes a random sample of experiences from the buffer for training the Q-network
    - The size function returns the current size of the buffer

In [11]:
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, buffer_size, batch_size):
        self.buffer = deque(maxlen=buffer_size)
        self.batch_size = batch_size
    def store(self, experience):
        self.buffer.append(experience)
    def sample(self):
        return random.sample(self.buffer, self.batch_size)
    def size(self):
        return len(self.buffer)

replay_buffer = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE)
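Two behaviors of the underlying Python tools are worth seeing on a tiny example: a `deque` with `maxlen` silently drops its oldest entries once full, and `random.sample` raises `ValueError` when asked for more items than are available. This is a minimal sketch with placeholder integers standing in for experiences; `tiny_buffer` and `sample_failed` are illustrative names.

```python
import random
from collections import deque

# A tiny buffer with room for three experiences (the notebook uses BUFFER_SIZE = 100000)
tiny_buffer = deque(maxlen=3)
for step in range(5):
    # Integers stand in for (state, action, reward, next_state, done) tuples
    tiny_buffer.append(step)

print(list(tiny_buffer))  # [2, 3, 4]: the two oldest entries were evicted

# Asking for more samples than the buffer holds fails
sample_failed = False
try:
    random.sample(list(tiny_buffer), 4)
except ValueError:
    sample_failed = True

print(sample_failed)  # True
```

The second behavior is why training code typically checks that the buffer holds at least one full batch before sampling.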

Initialize the environment and print information about the observation and action spaces

1. **Reset** the environment to obtain the initial observation and info
2. Print the **observation space** and **action space** of the environment
3. Print a sample observation from the observation space

**Observation** → The current state of the environment, which includes the agent's position, velocity, angle, angular velocity, and leg contact with the ground

### In our Lunar Lander environment, each observation has a total of 8 values

- [0] x-coordinate
- [1] y-coordinate
- [2] x-velocity
- [3] y-velocity
- [4] angle
- [5] angular velocity
- [6] left leg touching ground
- [7] right leg touching ground

### and a total of 4 actions
- 0 - do nothing
- 1 - fire left orientation engine
- 2 - fire main engine
- 3 - fire right orientation engine
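When debugging, it can help to attach names to the raw numbers. The helper below is a small sketch in plain Python; the label strings and the `describe_observation` function are illustrative, chosen to follow the index lists above, and the example observation is made up.

```python
# Labels follow the index lists above; the label strings themselves are illustrative
OBSERVATION_LABELS = [
    "x", "y", "x_velocity", "y_velocity",
    "angle", "angular_velocity", "left_leg_contact", "right_leg_contact",
]
ACTION_LABELS = ["do nothing", "left engine", "main engine", "right engine"]

def describe_observation(observation):
    """Pair each of the 8 observation values with its label."""
    if len(observation) != len(OBSERVATION_LABELS):
        raise ValueError("expected 8 observation values")
    return dict(zip(OBSERVATION_LABELS, observation))

# Made-up observation: slightly left of center, high up, falling, legs not touching
example = describe_observation([-0.1, 1.4, 0.0, -0.5, 0.02, 0.0, 0.0, 0.0])
print(example["y"], example["left_leg_contact"])  # 1.4 0.0
print(ACTION_LABELS[2])                            # main engine
```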

In [12]:
observation, info = env.reset(seed=42)
print("env.observation_space", env.observation_space)
print("env.action_space", env.action_space)
print("env.observation_space.sample()", env.observation_space.sample())

env.observation_space Box([ -2.5        -2.5       -10.        -10.         -6.2831855 -10.
  -0.         -0.       ], [ 2.5        2.5       10.        10.         6.2831855 10.
  1.         1.       ], (8,), float32)
env.action_space Discrete(4)
env.observation_space.sample() [ 0.8571803  -2.0454006  -7.4847665  -8.4603815  -0.5257534  -2.0264833
  0.4582417   0.45993498]


Run the Lunar Lander environment for a specified number of episodes using the DQN algorithm!

In [None]:
num_episodes = 500
for episode in range(num_episodes):
    # Reset the environment and get the initial state
    state, _ = env.reset()
    state = torch.tensor(state, dtype=torch.float32)
    total_reward = 0

    while True:
        # Select an action using the epsilon-greedy approach
        # - With probability epsilon, select a random action
        # - Otherwise, select the action with the highest Q-value from the Q-network
        if random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = torch.argmax(q_network(state)).item()

        # Take the action and observe the next state and reward
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_state = torch.tensor(next_state, dtype=torch.float32)
        done = terminated or truncated

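        # Store the experience in the replay buffer; only `terminated` is stored as the done flag
        # so that time-limit truncations still bootstrap from the next state
        replay_buffer.store((state, action, reward, next_state, terminated))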

        # Update the state to the next state and total reward to accumulate the reward from the current experience
        state = next_state
        total_reward += reward

        # Train only once the replay buffer holds enough experiences to sample a full batch
        if replay_buffer.size() >= BATCH_SIZE:
            # Sample a random batch of experiences and unzip it into one tuple per field
            batch = replay_buffer.sample()
            states, actions, rewards, next_states, dones = zip(*batch)

            # Convert the experience into PyTorch tensors for training the Deep Q Network
            states = torch.stack(states)
            actions = torch.tensor(actions, dtype=torch.int64).unsqueeze(1)
            rewards = torch.tensor(rewards, dtype=torch.float32).unsqueeze(1)
            next_states = torch.stack(next_states)
            dones = torch.tensor(dones, dtype=torch.float32).unsqueeze(1)
            
            # Calculate the Q-values for the current state and action from the Q-network, and then
            # compute the target Q-values using the target network
            q_values = q_network(states).gather(1, actions)
            with torch.no_grad():
                max_next_q_values = target_network(next_states).max(1, keepdim=True)[0]
                target_q_values = rewards + GAMMA * max_next_q_values * (1 - dones)
            
            # Compute the loss between the predicted Q-values and target Q-values
            loss = loss_fn(q_values, target_q_values)
            
            # 1. Reset the gradients of all the parameters to zero
            # 2. Backpropagate the loss to compute the gradients
            # 3. Update the weights of the Q-network using the optimizer
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Exit the simulation when finished!
        if done:
            break
            
    # Update epsilon using the decay rate -> reduces the probability of choosing a random action over time, 
    # encouraging exploitation of learned actions as the training progresses
    EPSILON = max(EPSILON_MIN, EPSILON * EPSILON_DECAY)
    
    # Update the target network with the weights of the Q-network at a specified frequency
    if episode % TARGET_UPDATE_FREQ == 0:
        target_network.load_state_dict(q_network.state_dict())

    # Print the episode number, total reward, and epsilon value
    print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {EPSILON:.3f}")
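The target Q-value calculation in the loop can be checked by hand on a tiny batch. This is a plain-Python sketch of the same formula, `rewards + GAMMA * max_next_q_values * (1 - dones)`, using lists instead of tensors; `td_targets` is an illustrative helper and all the numbers are made up.

```python
GAMMA = 0.99  # same discount factor as the notebook

def td_targets(rewards, max_next_q_values, dones, gamma=GAMMA):
    """reward + gamma * max next Q-value, with the future term dropped when done is 1."""
    return [
        r + gamma * q_next * (1.0 - d)
        for r, q_next, d in zip(rewards, max_next_q_values, dones)
    ]

# Made-up batch of two experiences: one mid-episode step and one step that ends the episode
rewards = [1.0, -100.0]
max_next_q_values = [10.0, 5.0]
dones = [0.0, 1.0]

targets = td_targets(rewards, max_next_q_values, dones)
# First target:  1.0 + 0.99 * 10.0 = 10.9 (up to floating-point rounding)
# Second target: -100.0, because the episode ended and there is no future reward
print(targets)
```

The `(1 - done)` factor is what stops the network from adding imagined future reward to a state the episode never continues from.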

### TODO for next meeting
- [ ] Continue to gain a greater understanding of Deep Q-Learning, along with how and why 
it makes the agent's landing on the moon's surface much smoother
    - Specifically focus on 
        - [ ] Backpropagation
        - [ ] Loss function
- [ ] Understand the code and why it works
- [ ] Perhaps have a brief look at Rainbow DQN?
- [ ] Begin implementing our own environment, possibly a simple soccer environment in Unity?