# Reinforcement Learning
- Reinforcement learning is one of the three main paradigms of machine learning, alongside supervised and unsupervised learning. An agent interacts with an environment and, through trial and error, learns both how that environment behaves and how to maximize the rewards it receives. Reinforcement learning is an incredibly useful practice to understand, as it is often the only realistic way to train certain kinds of models, such as those used in self-driving cars and robotics.

- To learn more about implementing reinforcement learning, the Python library Gymnasium offers a playground for testing models by placing them in predefined environments and letting users experiment with their algorithms across a variety of use cases. This notebook contains my notes on the article by Arun Nanda linked in the References section, as well as notes on reinforcement learning as a whole, specifically when using Gymnasium to test the algorithms out.

# What is Gymnasium?
- As Arun Nanda puts it, "Gymnasium is an open-source Python library designed to support the development of RL algorithms." It provides environments for the algorithms to run in, ranging from simple ones like a racing game to complex ones that mimic real-life scenarios. It also provides a streamlined API that makes collaboration easier, along with the ability to create custom environments and share them for others to use through the same API. From beginners to long-time RL practitioners, Gymnasium has something for everyone. It covers the core needs of working with RL algorithms: telling the environment what action the agent took, keeping track of the environment's state and the rewards gleaned, and supporting both the training of a model and the evaluation of how well it performs.



# Setting Up Gymnasium
- Gymnasium requires a somewhat specific setup, as it is tuned for specific versions of the libraries used alongside it, such as NumPy and PyTorch. Working in a virtual environment or something similar (like this cloud-based notebook) is therefore strongly recommended. To install Gymnasium and its required dependencies, use the command:

In [4]:
!pip install gymnasium

Collecting gymnasium
  Downloading gymnasium-1.0.0-py3-none-any.whl.metadata (9.5 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl.metadata (558 bytes)
Downloading gymnasium-1.0.0-py3-none-any.whl (958 kB)
Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Installing collected packages: farama-notifications, gymnasium
Successfully installed farama-notifications-0.0.4 gymnasium-1.0.0


- To make sure the right packages are imported and ready for use in the notebook, run the import statements shown below.

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.distributions as distributions
import numpy as np
import gymnasium as gym

- To view all the environments available to you, loop through the keys of the environment registry, as shown here:

In [6]:
import gymnasium as gym
for i in gym.envs.registry.keys():
    print(i)

CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
phys2d/CartPole-v0
phys2d/CartPole-v1
phys2d/Pendulum-v0
LunarLander-v3
LunarLanderContinuous-v3
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v3
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v0
Taxi-v3
tabular/Blackjack-v0
tabular/CliffWalking-v0
Reacher-v2
Reacher-v4
Reacher-v5
Pusher-v2
Pusher-v4
Pusher-v5
InvertedPendulum-v2
InvertedPendulum-v4
InvertedPendulum-v5
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
InvertedDoublePendulum-v5
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
HalfCheetah-v5
Hopper-v2
Hopper-v3
Hopper-v4
Hopper-v5
Swimmer-v2
Swimmer-v3
Swimmer-v4
Swimmer-v5
Walker2d-v2
Walker2d-v3
Walker2d-v4
Walker2d-v5
Ant-v2
Ant-v3
Ant-v4
Ant-v5
Humanoid-v2
Humanoid-v3
Humanoid-v4
Humanoid-v5
HumanoidStandup-v2
HumanoidStandup-v4
HumanoidStandup-v5
GymV21Environment-v0
GymV26Environment-v0


- Quite a few environments are available as of writing: the listing above has just over 60 entries, counting the multiple versions of the same environment. Make sure the version you pick is the right one for your use case, and use it consistently throughout your application.

# Testing an Environment
- To get a simple feel for what Gymnasium can do, we'll go over some of its basic commands. These include checking the observation space, viewing an example observation of the environment, and checking the action space. But first, what is an environment, and what are the observation and action spaces?


- The environment is what the RL algorithm's agent uses to decide how to proceed: the agent interacts with the environment, sees how much reward that interaction earned, and updates its policy accordingly. Some keywords, defined clearly, are:
 - Environment: The world, region, or space in which the Agent acts over a series of timesteps. At each timestep, the Agent performs an action and receives a reward before the Environment decides what the next state will look like.
 - State: "A mathematical state of the current configuration of the Environment" (Nanda). For a self-driving car system, for example, the state could include the current velocity of the car and how far down the pedal is being pressed. These are only two small pieces of the overall state. A terminal state is one that does not lead to another state; that state is the end of the road.
 - Agent: The name for the algorithm that performs the actions. The Agent has the final say on how to act at each timestep after taking the policy into account.
 - Observation: A mathematical view of what the Agent sees when it looks at the Environment, for example, the readings of a sensor or an image of the current state encoded as vectors.
 - Action: The final decision the Agent makes on what to do at the next timestep. It influences the Environment and the reward gleaned.
 - Reward: What the Environment tells the Agent about the result of its Action. It can be good or bad depending on various factors in the Environment.
 - Return: The cumulative reward expected over future timesteps. It is usually discounted so that rewards further in the future count for less. The discount can be tuned.
 - Policy: The instructions that dictate how the Agent acts at each timestep, given all the information provided to it at that timestep. The policy is typically represented as a Probability Matrix that maps each state to an action. As Nanda puts it, "Given a finite set of m possible states and n possible actions, element $P_{mn}$ in the matrix denotes the probability of taking action $a_n$ in the state $s_m$." More on the math-heavy side of this later.
 - Episode: The series of timesteps that starts from the initial random state and ends once the Agent has reached a terminal state.
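
- To make the idea of a policy as a probability matrix a bit more concrete, here is a minimal sketch using NumPy. The matrix values, the number of states and actions, and the helper name `sample_action` are all made up for illustration; none of them come from Nanda's article or from Gymnasium.

```python
import numpy as np

# Hypothetical policy matrix for m = 3 states and n = 2 actions.
# Each row holds the action probabilities for one state, so each row sums to 1.
policy_matrix = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

def sample_action(policy_matrix, state, rng):
    """Pick an action index at random, weighted by the row for this state."""
    return int(rng.choice(policy_matrix.shape[1], p=policy_matrix[state]))

rng = np.random.default_rng(0)
action = sample_action(policy_matrix, state=2, rng=rng)
print("sampled action in state 2:", action)
```

In state 2 of this toy matrix, action 1 is picked about 80% of the time, which is all the matrix element for that state and action is saying.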

- To begin seeing what an environment is, we'll look at some data returned by the Gymnasium environment, using the CartPole environment for simplicity. Create a new CartPole-v1 environment using the gym import as such:

In [7]:
import gymnasium as gym
env = gym.make('CartPole-v1')

- Now that we have the environment, we can check the observation space as follows:

In [8]:
print("observation space: ", env.observation_space)

observation space:  Box([-4.8               -inf -0.41887903        -inf], [4.8               inf 0.41887903        inf], (4,), float32)


- What exactly are we looking at? The observation space is the space of all possible observations the environment can produce. It is also the format in which this data is stored and in which the Agent interacts with it. It is typically represented as an object of type Box, which describes the bounds of the observations using ndarrays. What we just printed shows the bounds of each dimension, of which CartPole has four. The first array holds the lower bounds and the second holds the upper bounds, with four entries each; entries in the same position go together. For example, -4.8 and 4.8 form the bounds for the first variable. The four variables available in CartPole are, in this order, Cart Position, Cart Velocity, Pole Angle, and Pole Angular Velocity.

- We can see an example of the observation provided to the Agent by using the following command:


In [9]:
observation, info = env.reset()
print("observation: ", observation)

observation:  [-0.01472102 -0.00383868 -0.02615577 -0.04561709]


- These four elements are the four variables discussed previously, filled in with the values that describe the environment in its current state.

- The action space, now, is the space of all actions the Agent can take and the format they can be relayed in. This format can be viewed with the following command:


In [10]:
print("action space: ", env.action_space)

action space:  Discrete(2)


- The environment reports "Discrete(2)" because there are only 2 actions the Agent can take at any given timestep: 0, which pushes the cart to the left, and 1, which pushes the cart to the right. When the algorithm first begins, it does not know which action does what, and it discovers this for itself through trial and error.

# Building a Reinforcement Learning Agent
- Now that we have covered the background knowledge, we can finally get down to implementing an Agent with an algorithm in an Environment. We will continue using the CartPole Environment for simplicity's sake. Begin by making sure the environment has been created:

In [11]:
env = gym.make('CartPole-v1')

- You can then use the "reset" command to reset the environment and provide it with a starting seed, which lets us reproduce the same initial state. The seed needs to be passed to the other libraries being used as well. This can be done as follows:

In [12]:
SEED = 1234

env.reset(seed=SEED);

In [13]:
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x796e05a01370>

## Training the Agent with a Policy
- The most important part of the RL Agent is the policy that controls the Action it takes for each State provided by the Environment. This is what fills in the Probability Matrix discussed previously. With an appropriate policy, the rewards the Agent gleans by the end of each Episode can be maximized and, in tasks with a goal to reach, the time taken to reach the terminal state can be minimized.

- There are many different ways to define the Policy the Agent uses. The simplest is to let the Agent make a random decision at each timestep without taking any factors into consideration. Not only is this slow, leading to a very long run time, but the rewards gleaned are typically much lower than they could have been. Using information gleaned from the Environment's current State to shape the policy is a much better way of going about it.

- Nanda mentions various methods of policy optimization, such as the Bellman Equations and Proximal Policy Optimization (PPO), but settles on Policy Gradients for the tutorial. Other methods that may be worth looking into include Evolutionary Strategies, which are modeled on natural evolution, and Trust Region Policy Optimization (TRPO), which works by limiting the amount of change between successive policies to a "safe" amount.



### Implementing a Simple Policy Gradient
- Policy Gradients are "a reinforcement learning technique that uses gradient descent to optimize a policy's parameters." The goal, as with many reinforcement learning techniques, is to maximize the long-term reward. Policy Gradients are good at handling continuous states and actions, incorporate domain-specific knowledge with ease, and are guaranteed to converge eventually to at least a locally optimal policy. However, there are some downsides, including being hard to use in discrete systems, not necessarily leading to globally optimal parameters, and being difficult to use in environments without policies. Why you'd be using the Policy Gradient algorithm without a Policy, I'm not quite sure.



- Despite the CartPole environment being discrete, we will use the Policy Gradient algorithm to train an Agent for it. To set the Agent up, we will create a neural network that implements the policy, define functions that calculate the returns, the loss, and the probability of each action, and then update the policy with all this information using backpropagation.


- The neural network that implements the policy will have input dimensions equal to the dimensions of the observation space of the environment, a single hidden layer (128 neurons in the main function later on), and output dimensions equal to the dimensions of the action space of the environment. The network therefore maps an observation to a score for each action, using hidden neurons whose weights are trained with each iteration until the mapping performs satisfactorily.

- We define the class "PolicyNetwork", which is initialized with the following parameters:
 - the number of dimensions in the input space (observation space)
 - the number of neurons in the hidden layer (in this case, 128)
 - the number of dimensions in the output space (action space)
 - the dropout, which is the fraction of hidden values that are randomly zeroed during training so that the network does not lean on any one neuron more than the rest



- We then define the "forward" function, which defines the flow of data through the neural network. In this case,
 - data x (an observation from the observation space) is passed through the first layer, which maps it to the neural network's only hidden layer.
 - That hidden layer data is then subject to dropout, where random values are set to 0.
 - Then, ReLU (Rectified Linear Unit) is applied, which introduces non-linearity by turning all negative values into 0 (this is what lets the network learn complex, non-linear patterns).
 - Finally, the data, having been transformed by the dropout and ReLU, is mapped from the hidden layer to the action space and returned.

In [14]:
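class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super().__init__()

        # Observation space -> hidden layer
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        # Hidden layer -> action space (one raw score, or logit, per action)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
        # Randomly zeroes a fraction `dropout` of the hidden values during training
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.layer1(x)
        x = self.dropout(x)
        x = F.relu(x)  # negative values become 0
        x = self.layer2(x)
        return x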

- Next, we need to define a function to calculate the returns at each timestep. The return at a timestep is the discounted cumulative sum of the rewards from that timestep to the end of the episode.
 - In this function, we'll pass in two parameters: the rewards produced by the environment and the discount factor we're currently using.
 - We'll start by defining an empty list for the returns, which we will be calculating, and a number to hold the current return, which starts at 0.
 - Then we loop through all the rewards in reverse order, starting with the most recent one. At each step, we apply the discount factor to the current return and then add the reward to it. The current return is then inserted at the beginning of the returns list, so that the list ends up in chronological order (since we're looping through the rewards in reverse).
 - Once the returns list is complete, we turn it into a PyTorch tensor for use with the neural network and normalize the values so that the mean is 0, the standard deviation is 1, and no numbers get too big. This also helps keep training smooth and stable.
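
- To check my understanding of the reverse loop, here is a tiny worked example in plain Python. It follows the steps above but deliberately leaves out the tensor conversion and the normalization, and the function name `discounted_returns` is just a stand-in of my own.

```python
def discounted_returns(rewards, discount_factor):
    """Return G_t = r_t + discount_factor * G_{t+1} for every timestep t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):      # walk from the last reward back to the first
        running = r + discount_factor * running
        returns.insert(0, running)   # prepend so the list stays in time order
    return returns

print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
```

The last timestep only sees its own reward (1.0), while the first timestep sees its own reward plus the discounted rewards that follow it (1 + 0.5 + 0.25 = 1.75).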


In [15]:
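def calculate_stepwise_returns(rewards, discount_factor):
    returns = []
    R = 0

    # Walk backwards from the last reward so each return can build on the next one
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)  # prepend to keep the list in time order
    returns = torch.tensor(returns)
    # Normalize to mean 0 and standard deviation 1 for more stable training
    normalized_returns = (returns - returns.mean()) / returns.std()
    return normalized_returns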

- Then, we need to implement a function that simulates the forward pass of the Agent. This is where the Agent uses the current policy to explore the environment until it reaches a terminal state, at which point all the collected data is compiled and used to retrain the model.
 - To begin, we define the forward_pass function with three parameters: the environment we're using, the policy we're using, and the discount factor of the test we're performing. (Line 1)
 - We then prepare to collect the relevant data by creating empty containers: a list to hold the log probabilities of each action, a list to hold the rewards the Agent collects, a flag that tells us whether the episode has ended, and the cumulative sum of all the rewards for this pass (the episode return). We also put the policy into training mode and reset the environment to its initial state, after which testing can begin. (Lines 2-8)
 - So long as our flag "done" has not been set, the program continues to run in a loop. Each pass through the loop is one timestep. (Line 10)
 - The observation starts with the data from the reset of the environment. It is converted into a PyTorch tensor for use with the neural network and "unsqueezed", which adds a leading batch dimension of size 1, so the four-element observation takes the shape (1, 4) and is treated as a batch containing a single observation. (Line 11)
 - We then get a prediction of the action the Agent will take by running the observation through the policy, which returns action "logits", or raw scores for each action that have not been normalized. These are then normalized by the softmax function, which ensures all the actions' probabilities add up to 1 while preserving their relative magnitudes. (Lines 12-13)
 - From here we take the actions and their probabilities and create a categorical probability distribution, which basically says "you have this chance to pick this item" based on the probabilities we fed it. We then sample from that distribution, which picks one of the actions at random according to those probabilities. The reason we don't always pick the most likely action is to encourage exploring new avenues every now and then, which facilitates learning. We then calculate the log probability of the sampled action, for use later when re-evaluating the policy. (Lines 14-16)
 - Then we finally apply our selected Action to the Environment and collect the data from interacting with it: a new observation, the latest reward, whether the environment reached a terminal state (terminated) or ended early due to a time limit (truncated), and any additional info the environment might output. (Line 18)
 - Finally, we check whether terminated or truncated is true, and if either one is, we set our flag "done". We add the log probability of the selected action to the list of log probabilities so far, add the reward to the list of all rewards, and increase our episode_return variable by the amount of the reward. Should done be true, the loop exits; otherwise the while loop runs again and the Agent interacts with the environment once more. (Lines 20-23)
 - Once the loop is done, we concatenate all the log_prob_actions into one tensor for easier use later on and use the function we previously defined to calculate the discounted returns for each step from the rewards list. We then return the episode's return, the stepwise returns at each timestep, and the log probabilities of all the actions taken. (Lines 24-28)
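
- The softmax, sample, and log probability steps can be seen in isolation in the sketch below. To keep it lightweight, I am using NumPy as a stand-in for `F.softmax` and `distributions.Categorical`, and the logits are made-up numbers rather than real network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up logits for the two CartPole actions (0 = left, 1 = right).
logits = np.array([2.0, 1.0])

# Softmax: exponentiate and normalize so the probabilities sum to 1.
# Subtracting the max first keeps the exponentials numerically stable.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

# Sample an action in proportion to its probability, then record its log probability.
action = int(rng.choice(len(probs), p=probs))
log_prob = float(np.log(probs[action]))

print("probabilities:", probs)
print("sampled action:", action, "log probability:", log_prob)
```

The larger logit ends up with the larger probability, but the smaller one can still be sampled, which is where the exploration comes from.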

In [16]:
def forward_pass(env, policy, discount_factor):
    log_prob_actions = []
    rewards = []
    done = False
    episode_return = 0

    policy.train()
    observation, info = env.reset()

    while not done:
        observation = torch.FloatTensor(observation).unsqueeze(0)
        action_pred = policy(observation)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = distributions.Categorical(action_prob)
        action = dist.sample()
        log_prob_action = dist.log_prob(action)

        observation, reward, terminated, truncated, info = env.step(action.item())

        done = terminated or truncated
        log_prob_actions.append(log_prob_action)
        rewards.append(reward)
        episode_return += reward

    log_prob_actions = torch.cat(log_prob_actions)
    stepwise_returns = calculate_stepwise_returns(rewards, discount_factor)

    return episode_return, stepwise_returns, log_prob_actions

- We then need to use the information gleaned from the latest episode to calculate the loss, which is, as Nanda puts it, "the quantity on which we apply gradient descent." It is essentially the return we expect over the episode: each timestep's return is multiplied by the log probability of the action taken at that timestep, and the products are summed. The sign is then flipped, because gradient descent finds a minimum, whereas in reinforcement learning we want to maximize the reward. The loss calculated here, which essentially measures how far off the model was, is then passed back to be used for updating the policy.
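
- Written out as a formula, the loss looks like the following. The symbols are my own notation rather than Nanda's: $G_t$ is the (normalized) stepwise return at timestep $t$, $\pi_\theta(a_t \mid s_t)$ is the probability the policy with parameters $\theta$ assigns to the action $a_t$ taken in state $s_t$, and $T$ is the number of timesteps in the episode.

$$
L(\theta) = -\sum_{t=0}^{T-1} G_t \, \log \pi_\theta(a_t \mid s_t)
$$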




In [17]:
def calculate_loss(stepwise_returns, log_prob_actions):
    loss = -(stepwise_returns * log_prob_actions).sum()
    return loss

- We use the loss function when we update the policy, which needs another function of its own. This one accepts the stepwise returns and log_prob_actions calculated in the forward pass, as well as an optimizer that acts on the neural network's parameters. The aim is for the loss to adjust the neural network based on how the Agent did in the latest pass. We detach the stepwise returns so that they are treated as fixed values and not as part of the graph during backpropagation, then calculate the loss with the function we previously declared. Next, we clear any gradients left over in the optimizer so that only data from this episode is used, compute the gradients of the loss with respect to the network's parameters, and update the network parameters according to the learning rate we've set. The learning rate will be set in the main function. Finally, the loss is converted from a PyTorch tensor into a Python scalar for easier logging and returned by the function. This update step is the heart of the Policy Gradient method.

In [18]:
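def update_policy(stepwise_returns, log_prob_actions, optimizer):
    # Treat the returns as constants so no gradient flows through them
    stepwise_returns = stepwise_returns.detach()
    loss = calculate_loss(stepwise_returns, log_prob_actions)

    optimizer.zero_grad()  # clear gradients left over from the previous episode
    loss.backward()        # compute gradients of the loss w.r.t. the network parameters
    optimizer.step()       # adjust the parameters using those gradients

    return loss.item()     # plain Python float for logging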

- Now we get to put it all together in the main function, using all the functions we've written thus far to train and evaluate our Policy using an Agent. First, we need to define the hyperparameters, which are the parameters we define and tune ourselves outside of the model but that still affect how the model runs. Our hyperparameters include:
 - The maximum number of epochs (that is, iterations or episodes) the Agent will work through before calling it quits.
 - The discount factor, which is passed to the forward pass for calculating returns.
 - The number of trials over which we average when checking how the Agent is doing. If the average return over these most recent trials reaches our reward threshold, the model has done what we wanted it to do and training ends early.
 - The reward threshold, or how good the average return should be before we consider the model to have reached a satisfactory level.
 - The print interval, or how many trials we wait before printing out the current results.
 - The input dimension, which is determined by the environment we are currently using.
 - The hidden dimension, which controls how many neurons are in the hidden layer of our neural network.
 - The output dimension, which is determined by the environment we are currently using.
 - The dropout, which is the fraction of hidden values that are randomly zeroed so that we do not rely too heavily on the same path for too long.
 - The learning rate, which controls how much influence the gradients have on the neural network's parameters and how quickly those parameters change. It should be just right, neither too low nor too high.
- Once we have our hyperparameters set and have initialized an empty list to hold our episode returns, we instantiate our policy using the dimension information we just defined and set up an optimizer using Adam. From there, we loop a number of times equal to our max epochs, or fewer if we reach our reward threshold sooner. The loop goes as follows:
  - The forward pass is conducted, using the environment, the policy, and the discount factor to produce the current episode's return, the stepwise returns, and the log probabilities of the actions. (Line 20)
  - The policy is updated using the stepwise returns, the log probabilities of the actions taken, and the Adam optimizer we defined in the main function. The loss that is returned is discarded by assigning it to the _ variable. (Line 21)
  - The current episode return is appended to the full list, and the mean of the last N_TRIALS entries is calculated using the syntax "[-N_TRIALS:]". A slice [x:y] takes the elements from index x up to, but not including, index y, and leaving out y means the slice runs to the end of the list. A negative index counts from the end of the list, so -N_TRIALS means N_TRIALS entries from the end. Put together, this is the last N_TRIALS entries of the episode returns, averaged. (Lines 23-24)
  - If the episode we're currently on is one that needs to be printed based on our print interval, we print out the current episode and the mean return. If the mean return is greater than or equal to our reward threshold, we print how many episodes it took and stop the loop early. (Lines 26-30)
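
- Here is a small standalone sketch of the "[-N_TRIALS:]" slice, using made-up returns. It also shows what happens early in training, when the list holds fewer than N_TRIALS entries: the slice simply returns everything there is.

```python
N_TRIALS = 3
episode_returns = [10.0, 20.0, 30.0, 40.0, 50.0]

# [-N_TRIALS:] keeps the last N_TRIALS items; with fewer items it keeps them all.
recent = episode_returns[-N_TRIALS:]
mean_recent = sum(recent) / len(recent)

print(recent)                    # [30.0, 40.0, 50.0]
print(mean_recent)               # 40.0
print([10.0, 20.0][-N_TRIALS:])  # [10.0, 20.0]
```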

In [19]:
def main():
    MAX_EPOCHS = 500
    DISCOUNT_FACTOR = 0.99
    N_TRIALS = 25
    REWARD_THRESHOLD = 475
    PRINT_INTERVAL = 10
    INPUT_DIM = env.observation_space.shape[0]
    HIDDEN_DIM = 128
    OUTPUT_DIM = env.action_space.n
    DROPOUT = 0.5

    episode_returns = []

    policy = PolicyNetwork(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)

    LEARNING_RATE = 0.01
    optimizer = optim.Adam(policy.parameters(), lr = LEARNING_RATE)

    for episode in range(1, MAX_EPOCHS+1):
        episode_return, stepwise_returns, log_prob_actions = forward_pass(env, policy, DISCOUNT_FACTOR)
        _ = update_policy(stepwise_returns, log_prob_actions, optimizer)

        episode_returns.append(episode_return)
        mean_episode_return = np.mean(episode_returns[-N_TRIALS:])

        if episode % PRINT_INTERVAL == 0:
            print(f'| Episode: {episode:3} | Mean Rewards: {mean_episode_return:5.1f} |')

        if mean_episode_return >= REWARD_THRESHOLD:
            print(f'Reached reward threshold in {episode} episodes')
            break

Run the program. If the training doesn't converge, run it again, or change the SEED values.

In [20]:
main()

| Episode:  10 | Mean Rewards:  24.4 |
| Episode:  20 | Mean Rewards:  18.1 |
| Episode:  30 | Mean Rewards:  14.4 |
| Episode:  40 | Mean Rewards:  14.3 |
| Episode:  50 | Mean Rewards:  23.8 |
| Episode:  60 | Mean Rewards:  44.1 |
| Episode:  70 | Mean Rewards:  71.1 |
| Episode:  80 | Mean Rewards: 113.9 |
| Episode:  90 | Mean Rewards: 129.1 |
| Episode: 100 | Mean Rewards: 150.7 |
| Episode: 110 | Mean Rewards: 212.0 |
| Episode: 120 | Mean Rewards: 249.0 |
| Episode: 130 | Mean Rewards: 334.4 |
| Episode: 140 | Mean Rewards: 377.7 |
| Episode: 150 | Mean Rewards: 332.3 |
| Episode: 160 | Mean Rewards: 181.5 |
| Episode: 170 | Mean Rewards:  84.9 |
| Episode: 180 | Mean Rewards:  56.8 |
| Episode: 190 | Mean Rewards:  50.5 |
| Episode: 200 | Mean Rewards:  47.8 |
| Episode: 210 | Mean Rewards:  47.8 |
| Episode: 220 | Mean Rewards:  54.0 |
| Episode: 230 | Mean Rewards:  73.9 |
| Episode: 240 | Mean Rewards: 108.8 |
| Episode: 250 | Mean Rewards: 137.7 |
| Episode: 260 | Mean Rew

# References
- This notebook was created with the help of the lovely article linked here:
https://www.datacamp.com/tutorial/reinforcement-learning-with-gymnasium.