# Reinforcement Learning

Reinforcement Learning (RL) is a subfield of artificial intelligence (AI) in which an agent learns by trial and error, somewhat like a student learning through practice. The agent tries to choose the actions that maximize its cumulative reward in a given scenario. RL is used in applications like robotics, gaming, and autonomous systems.

## Proximal Policy Optimization

One of the prominent algorithms is **Proximal Policy Optimization (PPO)**, released by OpenAI in 2017 and cited more than 10k times. Instead of making big changes all at once, PPO encourages gradual and more stable policy improvements, which is crucial for the learning process. PPO finds application in various real-world situations, and we'll explore its principles and significance in more detail.

## Terminology

Here is some quick terminology before we start:

- **Agent** - the primary entity that interacts with the environment
- **Environment** - the world (simulated or real) that the agent interacts with
- **Policy** ($\pi$) - defines the probability of an action given a state. The policy guides the agent in selecting actions that maximize its expected cumulative reward over time.
- **State / observation** ($s_t$) - a snapshot of the environment at timestep $t$
- **Action** ($a_t$) - the action that the agent performs on the environment at timestep $t$
- **Reward** ($r_t$) - the scalar feedback the agent receives after performing action $a_t$ in state $s_t$. The *return* is the discounted sum of rewards, introduced later as $G_t$.

Let's look at some very simple code to understand all these terms.


In [1]:
#create the environment
import gymnasium as gym

#environment is the simulation / real-world environment
env = gym.make('Pendulum-v1', render_mode='human')

Here is a simple picture of the pendulum.

<img src = "figures/pendulum.gif" height=200>

The task is to apply torque to the left or right so that the pendulum swings up to the 12 o'clock position and stays there.

In [2]:
#reset the environment and get the initial state
state, _ = env.reset()  #cos(theta), sin(theta), angular velocity
done = False

print(state)

[-0.39593688 -0.91827774 -0.27370557]


In [3]:
action = env.action_space.sample()

print(action)  #action is the torque applied to the pendulum; its sign gives the direction

[1.326761]


In [4]:
next_state, reward, done, truncated, info = env.step(action)
print(next_state)
print(reward)  #negative cost: more negative the farther the pendulum is from upright (with smaller penalties on angular velocity and torque)
print(done)

[-0.43069062 -0.9024996  -0.7633997 ]
-3.921277700930215
False


In [5]:
env.close() #env.close() won't close the window; just restart the kernel and it will close the window

  
## 1. Vanilla Policy Gradient Methods

Vanilla policy gradient methods maximize the following objective function:

$$\mathbb{E}\left[ \underbrace{\log \pi_\theta(a_t | s_t)}_{\text{log prob.}} \; \underbrace{\hat{A}_t}_{\text{advantage func.}} \right]$$

$\mathbb{E}$ stands for the expected value; in practice it is estimated by the empirical average over the collected timesteps.

### 1.1 Advantage function ($\hat{A}_t$)

The second factor of the objective, the advantage function $\hat{A}_t$, measures the *advantage* of taking a certain action compared to the average performance from that state. If $\hat{A}_t$ is positive, the action taken was better than average; if it is negative, the action was worse than average. If it is zero, the action was no better than average.

The equation is

$$\hat{A}_t = G_t - V(s_t)$$

#### 1.1.1 Discounted sum of rewards $(G_t)$

$G_t$ is the discounted sum of rewards from timestep $t$ until the end of the episode.

The formula is

   $$G_t = \gamma^0 r_{t} + \gamma^1 r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{T-t} r_T = \sum_{k=t}^{T} \gamma^{k-t} \cdot r_{k}$$

Note that $G_t$ depends on future rewards, so it can only be computed once the episode has finished: you collect the rewards first, then work backward from the last timestep using $G_t = r_t + \gamma G_{t+1}$.

Future rewards are included so that the action taken is also credited for the rewards it makes possible later.

The **Discount Factor** ($\gamma$) is a parameter between 0 and 1 that determines the importance of future rewards. A value of 0 means that only the immediate reward matters, while a value of 1 weighs all future rewards equally. The discount factor encodes the idea that the farther in the future a reward is, the less important it may be.
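To make the backward computation and the effect of $\gamma$ concrete, here is a minimal sketch in plain Python. The function name `discounted_returns` and the toy reward list are assumptions made for this illustration; the PPO class later in this notebook has its own version.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_T] for one finished episode."""
    returns = []
    running = 0.0
    # Walk backward through the episode so that G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns

rewards = [1.0, 1.0, 1.0]  # hypothetical rewards of a 3-step episode
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
print(discounted_returns(rewards, gamma=0.0))  # [1.0, 1.0, 1.0] -> only the immediate reward counts
print(discounted_returns(rewards, gamma=1.0))  # [3.0, 2.0, 1.0] -> all future rewards count equally
```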

#### 1.1.2 Value function $(V(s))$

As for the second part of the advantage function, $V(s_t)$ is the **state value function** (or simply the **value function**), which gives the average discounted sum of rewards the agent obtains when it starts in this state and acts until the end of the episode. It is commonly estimated using the **critic network** (sometimes called the value network).

#### Critic network

It is a neural network that takes in a state $s$ and outputs its estimated value $V(s)$.  The way to train this network is simply:

1. Perform rollouts and collect sequences of states, actions, and rewards
2. Once the episodes are finished, go back through each timestep and compute $G_t$
3. Minimize $\mathbb{E}\left[(G_t - V(s_t))^2\right]$

In short, the critic network learns to estimate $V(s)$ from many sampled returns, so its output acts like an **average** of the $G_t$ values observed from that state.
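The "average" intuition can be illustrated without a neural network. For a fixed state, the constant prediction that minimizes the squared error against the observed returns is their mean, so a lookup table of per-state means is a tabular stand-in for the critic. The sketch below assumes discrete, hashable states and made-up returns; `average_returns_per_state` is a hypothetical helper, not part of the code used later.

```python
from collections import defaultdict

def average_returns_per_state(states, returns):
    """Tabular stand-in for a critic: V(s) = mean of the returns observed from s."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, g in zip(states, returns):
        totals[state] += g
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}

# Hypothetical data: visited states and the G_t observed from each visit
states  = ["left", "left", "right", "left", "right"]
returns = [2.0, 4.0, 10.0, 6.0, 14.0]

V = average_returns_per_state(states, returns)
advantages = [g - V[s] for s, g in zip(states, returns)]  # A_t = G_t - V(s_t)

print(V)           # {'left': 4.0, 'right': 12.0}
print(advantages)  # [-2.0, 0.0, -2.0, 2.0, 2.0]
```

A return of 10 looks large in isolation, but it is below the average for the state "right", so its advantage is negative.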

### 1.2 Log probability ($\log \pi_\theta(a_t | s_t)$)

The first factor of the objective, $\log \pi_\theta(a_t | s_t)$, is the log probability of the action taken given the current state.  A common way to obtain it is through a deep neural network, commonly called the **actor network**.  It takes in states and outputs a probability distribution over actions.

#### Actor network

The way to train this **actor network** is simply:

1. Perform rollouts and collect sequences of states, actions, and rewards
2. Once the episodes are finished, go back through each timestep and compute $G_t$
3. Compute the advantage $\hat{A}_t = G_t - V(s_t)$ using the critic network
4. Maximize $\mathbb{E}\left[\log \pi_\theta(a_t | s_t) \hat{A}_t\right]$

### In summary

In summary, we maximize the log probability of each action, weighted by how much better or worse than average its return was.

$$\mathbb{E}\left[ \underbrace{\log \pi_\theta(a_t | s_t)}_{\text{log prob. of actions}} \; \underbrace{\hat{A}_t}_{\text{relative rewards}} \right]$$
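As a rough illustration of this objective, the sketch below evaluates it for a tiny rollout. It assumes a discrete softmax policy with two actions, which is a simplification chosen for readability (the Pendulum code later uses a Gaussian policy over continuous actions); `softmax`, `pg_objective`, and all numbers are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into action probabilities."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def pg_objective(logits_per_step, actions, advantages):
    """Empirical average of log pi(a_t | s_t) * A_t over the rollout."""
    terms = []
    for logits, action, advantage in zip(logits_per_step, actions, advantages):
        log_prob = math.log(softmax(logits)[action])
        terms.append(log_prob * advantage)
    return sum(terms) / len(terms)

# Hypothetical 2-step rollout: action 0 turned out better than average,
# action 1 turned out worse than average
actions    = [0, 1]
advantages = [1.0, -1.0]

before = pg_objective([[0.0, 0.0], [0.0, 0.0]], actions, advantages)
# Make the good action more likely (step 0) and the bad action less likely (step 1)
after = pg_objective([[1.0, 0.0], [0.0, -1.0]], actions, advantages)

print(before < after)  # True: the objective rewards exactly this kind of change
```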

## 2. Trust Region Methods

Vanilla policy gradient methods often suffer from overly large policy updates, which can cause them to fail to find a solution in the large solution space.

**Trust Region Policy Optimization (TRPO)** limits how far the new policy may move from the old one. In its penalized form, the objective function to maximize becomes

$$\mathbb{E}\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t - \beta \, \text{KL}(\pi_{\theta_{\text{old}}}( \cdot | s_t), \pi_{\theta}( \cdot | s_t))\right]$$

The first term uses the probability **ratio** between the new policy and the old policy (note: a ratio of probabilities, not of log probabilities) instead of the log probability. When the advantage is positive, a larger ratio increases the objective, so the policy is pushed toward actions that did better than average. However, overly large updates can cause a lot of problems, so the second term constrains the difference between the old policy and the new policy using the KL divergence.

Note: For those who are not familiar with the KL divergence, it is defined as $\text{KL}(P, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$, which measures how different two distributions $P$ and $Q$ are.  A higher value means a larger difference, and the value is zero when the two distributions are identical.
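The following sketch shows how the KL penalty discourages large policy changes. It assumes discrete action distributions given as plain lists of probabilities, and the values of `beta`, the advantage, and the candidate policies are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def penalized_objective(old_probs, new_probs, action, advantage, beta):
    """Probability ratio times advantage, minus the KL penalty."""
    ratio = new_probs[action] / old_probs[action]
    return ratio * advantage - beta * kl_divergence(old_probs, new_probs)

old        = [0.5, 0.5]
small_step = [0.6, 0.4]    # modest increase in the probability of action 0
large_step = [0.99, 0.01]  # drastic increase in the probability of action 0

print(kl_divergence(old, old))         # 0.0, identical distributions
print(kl_divergence(old, small_step))  # small positive value
print(kl_divergence(old, large_step))  # much larger value

# Action 0 had a positive advantage. Without the penalty the drastic update wins;
# with beta = 1 the modest update scores higher.
for beta in (0.0, 1.0):
    print(beta,
          penalized_objective(old, small_step, 0, 1.0, beta),
          penalized_objective(old, large_step, 0, 1.0, beta))
```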

## 3. Proximal Policy Methods

The penalized TRPO objective still requires choosing the right $\beta$, which varies from task to task.

**Proximal Policy Optimization (PPO)** modifies the objective function to maximize to

$$\mathbb{E}\left[
   \min\left(
   \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t ,
   \text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right) \right]$$

Although this looks complicated, it simply removes the incentive to move the probability ratio outside the interval $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is commonly set to 0.2.

Let's look at the effect closely via this picture:

<img src = "figures/clip.png" height="300">

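Here, the x-axis is the policy ratio, and the y-axis is the clipped objective function we just defined.
When $A > 0$, the objective stops increasing once the ratio exceeds $1 + \epsilon$, so there is no incentive to make the action even more likely.  When $A < 0$, the objective becomes flat once the ratio falls below $1 - \epsilon$, so there is no incentive to make the action even less likely.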
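The same behaviour can be checked numerically. This is a minimal scalar sketch of the clipped objective for a single timestep; the name `clipped_surrogate` and the sample ratios are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one timestep."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

for ratio in (0.5, 1.0, 1.5):
    positive = clipped_surrogate(ratio, advantage=1.0)
    negative = clipped_surrogate(ratio, advantage=-1.0)
    print(f"ratio={ratio:.1f} | A=+1 -> {positive:.2f} | A=-1 -> {negative:.2f}")

# ratio=0.5 | A=+1 -> 0.50 | A=-1 -> -0.80
# ratio=1.0 | A=+1 -> 1.00 | A=-1 -> -1.00
# ratio=1.5 | A=+1 -> 1.20 | A=-1 -> -1.50
```

With a positive advantage the value is capped at $1 + \epsilon$ for large ratios, and with a negative advantage it is held at $-(1 - \epsilon)$ for small ratios. In the other two corners the unclipped term is the smaller one, so moves that make the objective worse are still fully penalized.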

### 3.1 Actor loss function

Based on what we have learned, the **actor** loss function of PPO is simply the negative of the clipped objective

$$J(\theta) = -\min\left(
   \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t ,
   \text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right) $$

To encourage exploration, we add the **entropy** term

$$J(\theta) = -\min\left(
   \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t ,
   \text{clip}\left(\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}, 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right) - \lambda S[\pi_\theta](s_t)$$

Here, we want *high* entropy, so the entropy term enters with a minus sign because the loss is minimized.  $\lambda$ is simply a coefficient that controls the size of this entropy bonus.

#### Entropy

To understand how entropy helps exploration, consider a simple example:

Suppose you have three actions, A1, A2, and A3, with the following probabilities:

- Probability of selecting A1: $P(A1) = 0.9$
- Probability of selecting A2: $P(A2) = 0.05$
- Probability of selecting A3: $P(A3) = 0.05$

The entropy of this policy can be calculated as:

$$
\begin{align*}
S &= -\sum_{i=1}^{3} P(A_i) \ln(P(A_i)) \\
S &= -(0.9 \ln(0.9) + 0.05 \ln(0.05) + 0.05 \ln(0.05)) \\
S &\approx 0.394
\end{align*}
$$

Let's change the probabilities to 

- Probability of selecting A1: $P(A1) = 0.4$
- Probability of selecting A2: $P(A2) = 0.3$
- Probability of selecting A3: $P(A3) = 0.3$

The entropy $S$ will be approximately 1.089.

Thus higher entropy encourages the model to explore more actions, instead of just making one action very prominent.  However, it is important to note that we should balance exploration and exploitation by putting a coefficient in front of this entropy to control how much we want to explore.
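The two entropy values above can be reproduced with a few lines of plain Python. The helper name `entropy` is an assumption for this sketch; the uniform distribution is added only to show the maximum for three actions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats: S = -sum_i p_i * ln(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(round(entropy([0.9, 0.05, 0.05]), 3))  # 0.394 -> one dominant action, little exploration
print(round(entropy([0.4, 0.3, 0.3]), 3))    # 1.089 -> probabilities are more spread out
print(round(entropy([1/3, 1/3, 1/3]), 3))    # 1.099 -> uniform policy, the maximum for 3 actions
```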

### 3.2 Critic loss function

The critic loss function is simply the squared error between the observed return and the predicted value

$$J(\theta) = (G_t - V(s_t))^2$$

That's it!  Now let's look at the algorithm.

## Algorithms and code

1. Initialize the actor and critic networks
2. Collect data
   1. Initialize the environment and its state
   2. Let the agent interact with the environment
      1. Store the states, rewards, actions, next states, and log probabilities of the actions in lists
   3. Once you finish collecting these states, rewards, etc.
      1. Go backward in time and compute $G_t$
3. Calculate the two surrogate terms, using $\hat{A}_t = G_t - V(s_t)$
   1. $\displaystyle\frac{\pi}{\pi_\text{old}} \hat{A}_t$
   2. $\text{clip}(\displaystyle\frac{\pi}{\pi_\text{old}}, 1 - \epsilon, 1 + \epsilon) \hat{A}_t$
4. Calculate loss
   1. Actor loss = - minimum of steps 3.1 and 3.2 - entropy bonus
   2. Critic loss = $(G_t - V(s_t))^2$
5. Backpropagate and update both networks
### Backbone neural network for actor and critic networks

Note that the actor and critic networks can be feedforward, recurrent (RNN), or convolutional (CNN), depending on the form of the states.  Here we use a simple feedforward network, which is a good fit for a beginner lecture.

In [6]:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np

class FeedForwardNN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(FeedForwardNN, self).__init__()

        self.layer1 = nn.Linear(in_dim, 64)
        self.layer2 = nn.Linear(64, 64)
        self.layer3 = nn.Linear(64, out_dim)
        
    def forward(self, states):
        if isinstance(states, np.ndarray):
            states = torch.tensor(states, dtype = torch.float)
                    
        activation1 = F.relu(self.layer1(states))
        activation2 = F.relu(self.layer2(activation1))
        out         = self.layer3(activation2)
        
        return out

### PPO class

Here, we are going to define several functions:
- `collect_data` - collects rollout data into lists
- `fit` - calls `collect_data`, computes the advantages from `G_t`, and then learns using the PPO loss function
- `get_action` - asks the `actor network` for an action and its log probability given the state
- `compute_discounted_rewards` - computes `G_t`
- `predict` - asks the `critic network` for `V(s)` and the `actor network` for the log probabilities of the given actions and the entropy

In [7]:

from torch.distributions import MultivariateNormal
from torch.optim import Adam

class PPO:
    def __init__(self, env):
        self._init_params()
        
        #extract info from environment
        self.env = env
        self.states_dim = env.observation_space.shape[0]
        self.act_dim    = env.action_space.shape[0]
        
        ## STEP 1
        #input is state for both actor and critic networks
        #output is a value for critic networks, and action distribution for actor networks 
        self.actor  = FeedForwardNN(self.states_dim, self.act_dim) 
        self.critic = FeedForwardNN(self.states_dim, 1)
        
        ##this is for sampling actions when collecting data
        self.cov_var = torch.full(size = (self.act_dim, ), fill_value=0.5)
        self.cov_mat = torch.diag(self.cov_var)  #basically every action has a variance of 0.5
        
        self.actor_optim = Adam(self.actor.parameters(), lr=self.lr)
        self.critic_optim = Adam(self.critic.parameters(), lr=self.lr)
    
    def _init_params(self):
        torch.manual_seed(999)  #just for reproducibility
        self.timesteps_per_batch = 4800
        self.max_timesteps_per_episode = 1600
        self.gamma = 0.95
        self.n_updates_per_iteration = 5
        self.clip = 0.2
        self.lr = 0.005
        self.entropy_weight = 0.05 #higher means more exploration; we can set it very low for pendulum because it's a very simple problem
    
    ## STEP 2
    def collect_data(self):
        #rollout
        batch_states    = [] #shape: (number of timesteps per batch, states_dim)
        batch_acts      = [] #shape: (number of timesteps per batch, act_dim)
        batch_log_probs = [] #(number of timesteps per batch, )
        batch_rewards   = [] #(number of episodes, number of timesteps per episode)
        batch_discounted_rewards = [] #(number of timesteps per batch, )
        batch_lens      = [] #(number of episodes, )
        
        #Number of timesteps run so far this batch
        t = 0
        ep_rewards = []
        
        #batch means one batch of data we collect, which can span multiple episodes
        #one episode means you start the env, until you reach the terminal state
        
        while t < self.timesteps_per_batch:
            
            #Rewards this episode
            ep_rewards = []
            
            states = self.env.reset()[0]  ## STEP 2.1
            done   = False
            
            ## STEP 2.2
            for ep_t in range(self.max_timesteps_per_episode):
                t += 1
                
                #collect states
                batch_states.append(states)
                
                action, log_prob = self.get_action(states)    
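                #note: the truncation flag (4th return value) is ignored here, so an episode ends
                #only on termination or after max_timesteps_per_episode steps
                states, rewards, done, _, _ = self.env.step(action)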
                
                #collect reward, action, and log prob
                ep_rewards.append(rewards)                
                batch_acts.append(action)
                batch_log_probs.append(log_prob)
                
                if done:
                    break
                
            batch_lens.append(ep_t + 1)           
            batch_rewards.append(ep_rewards)
        
        #convert to tensor; note that converting the list first to np array then to tensor is much faster
        batch_states    = torch.tensor(np.array(batch_states), dtype=torch.float)
        batch_acts      = torch.tensor(np.array(batch_acts), dtype=torch.float)
        batch_log_probs = torch.tensor(np.array(batch_log_probs), dtype=torch.float)

        ## STEP 2.3
        #compute G_t
        batch_discounted_rewards = self.compute_discounted_rewards(batch_rewards)
        
        return batch_states, batch_acts, batch_log_probs, batch_discounted_rewards, batch_lens
                
    def fit(self, total_timesteps):
        t = 0 # Timesteps simulated until now
        i = 0
        actor_losses  = [] #for reporting
        critic_losses = []
        discounted_rewards = []
        
        while t < total_timesteps:
                        
            batch_states, batch_acts, batch_log_probs, batch_discounted_rewards, batch_lens = self.collect_data()
                        
            t += np.sum(batch_lens)
            i += 1
                    
            # Calculate V
            V, _ , _ = self.predict(batch_states, batch_acts)

            # Calculate advantage
            A_k = batch_discounted_rewards - V.detach()
            
            # For faster convergence
            A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)
            
            for _ in range(self.n_updates_per_iteration):
                V, curr_log_probs, entropy = self.predict(batch_states, batch_acts)
                ratios = torch.exp(curr_log_probs - batch_log_probs) #pi / pi_old = exp(log pi - log pi_old)
                
                # Calculate surrogate losses
                surr1 = ratios * A_k
                surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k
                actor_loss = (-torch.min(surr1, surr2)).mean()
                entropy_loss = entropy.mean()
                actor_loss = actor_loss - self.entropy_weight * entropy_loss
                critic_loss = nn.MSELoss()(batch_discounted_rewards, V)
                
                actor_losses.append(actor_loss.detach())
                critic_losses.append(critic_loss.detach())
                
                # Backprop
                self.actor_optim.zero_grad()
                actor_loss.backward()
                self.actor_optim.step()
                
                self.critic_optim.zero_grad()    
                critic_loss.backward()    
                self.critic_optim.step()
                
                #just for plotting
                discounted_rewards.append(batch_discounted_rewards.mean())
                
            self.print_summary(i, t, discounted_rewards, critic_losses, actor_losses)
                
    def get_action(self, states):
        mean = self.actor(states)
        dist = MultivariateNormal(mean, self.cov_mat)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        
        #detach from computational graph
        return action.detach().numpy(), log_prob.detach()
    
    def compute_discounted_rewards(self, batch_rewards):
        # batch_rewards: shape (number of episodes, number of timesteps per episode)
        batch_discounted_rewards = []  #shape: (num of timesteps in batch)
                        
        # Iterate through each episode backwards to maintain same order in batch_discounted_rewards
        for episode_reward in reversed(batch_rewards):
        
            discounted_reward = 0
            for reward in reversed(episode_reward):
                discounted_reward = reward + discounted_reward * self.gamma
                batch_discounted_rewards.insert(0, discounted_reward)
                
        batch_discounted_rewards = torch.tensor(batch_discounted_rewards, dtype=torch.float)
        
        return batch_discounted_rewards
    
    def predict(self, batch_states, batch_acts):
        # Query critic network for a value V for each state in batch_states.
        V = self.critic(batch_states).squeeze()
        
        mean = self.actor(batch_states)
        dist = MultivariateNormal(mean, self.cov_mat)
                
        log_probs = dist.log_prob(batch_acts)
        
        return V, log_probs, dist.entropy()
    
    def print_summary(self, i, t, discounted_rewards, critic_losses, actor_losses):
        avg_discounted_rewards  = np.mean([rewards.float().mean() for rewards in discounted_rewards])
        avg_actor_loss  = np.mean([losses.float().mean() for losses in actor_losses])
        avg_critic_loss = np.mean([losses.float().mean() for losses in critic_losses])
        
        if(i+1) % 10 == 0:
            print(f"#{i+1:3.0f} | Timesteps: {t:7.0f} |  Critic Loss: {avg_critic_loss:10.3f} | Actor Loss: {avg_actor_loss:10.6f} | Dis. Rewards: {avg_discounted_rewards:5.3f}")
        

### Training

In [8]:
#pip install gymnasium
#brew install swig
#pip install box2d-py

import gymnasium as gym
import pickle

env = gym.make("Pendulum-v1")

model = PPO(env)
model.fit(500000)

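import os

filename = 'model/pendulumv1'
os.makedirs(os.path.dirname(filename), exist_ok=True)  #make sure the output folder exists before saving
with open(f'{filename}.pkl', 'wb') as file:
    pickle.dump(model, file)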

# 10 | Timesteps:   43200 |  Critic Loss:   6536.713 | Actor Loss:  -0.049997 | Dis. Rewards: -114.061
# 20 | Timesteps:   91200 |  Critic Loss:   3647.361 | Actor Loss:  -0.052533 | Dis. Rewards: -102.112
# 30 | Timesteps:  139200 |  Critic Loss:   2615.570 | Actor Loss:  -0.053370 | Dis. Rewards: -93.388
# 40 | Timesteps:  187200 |  Critic Loss:   2072.689 | Actor Loss:  -0.054115 | Dis. Rewards: -83.478
# 50 | Timesteps:  235200 |  Critic Loss:   1822.857 | Actor Loss:  -0.053837 | Dis. Rewards: -73.711
# 60 | Timesteps:  283200 |  Critic Loss:   1557.916 | Actor Loss:  -0.054006 | Dis. Rewards: -63.065
# 70 | Timesteps:  331200 |  Critic Loss:   1378.605 | Actor Loss:  -0.054071 | Dis. Rewards: -54.818
# 80 | Timesteps:  379200 |  Critic Loss:   1219.411 | Actor Loss:  -0.054183 | Dis. Rewards: -48.356
# 90 | Timesteps:  427200 |  Critic Loss:   1092.476 | Actor Loss:  -0.054234 | Dis. Rewards: -43.259
#100 | Timesteps:  475200 |  Critic Loss:    991.444 | Actor Loss:  -0.054301 | 

### Testing

In [9]:
import gymnasium as gym
import pickle

filename = 'model/pendulumv1'

with open(f'{filename}.pkl', 'rb') as file:
    model = pickle.load(file)

env = gym.make('Pendulum-v1', render_mode='human')
num_episodes = 1

for episode in range(num_episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0
    
    i = 0
    while not done:
        env.render()

        action, log_probabilities = model.get_action(state)
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  #also stop when the environment's time limit is reached
        
        angle = np.arctan2(next_state[1], next_state[0])
        angle_threshold = 0.05
        if abs(angle) < angle_threshold:
            done = True
                       
        state = next_state
        
        i += 1
        #the more negative the rewards, the farther it is from the upright position.
        print(f"Iteration {i: 4.0f} | Angle: {angle:6.3f} | Reward: {reward:3.5f}")

env.close() #env.close() won't close the window; just restart the kernel and it will close the window

Iteration    1 | Angle:  0.351 | Reward: -0.14191
Iteration    2 | Angle:  0.374 | Reward: -0.15256
Iteration    3 | Angle:  0.396 | Reward: -0.16520
Iteration    4 | Angle:  0.417 | Reward: -0.17961
Iteration    5 | Angle:  0.438 | Reward: -0.19589
Iteration    6 | Angle:  0.461 | Reward: -0.21448
Iteration    7 | Angle:  0.485 | Reward: -0.23611
Iteration    8 | Angle:  0.511 | Reward: -0.26184
Iteration    9 | Angle:  0.541 | Reward: -0.29313
Iteration   10 | Angle:  0.575 | Reward: -0.33194
Iteration   11 | Angle:  0.614 | Reward: -0.38094
Iteration   12 | Angle:  0.660 | Reward: -0.44372
Iteration   13 | Angle:  0.715 | Reward: -0.52510
Iteration   14 | Angle:  0.778 | Reward: -0.63152
Iteration   15 | Angle:  0.853 | Reward: -0.77161
Iteration   16 | Angle:  0.941 | Reward: -0.95671
Iteration   17 | Angle:  1.045 | Reward: -1.20153
Iteration   18 | Angle:  1.168 | Reward: -1.52367
Iteration   19 | Angle:  1.310 | Reward: -1.97250
Iteration   20 | Angle:  1.483 | Reward: -2.52903


## Workshop

1. What is Reinforcement Learning?
2. Be familiar with what agents, environment, policy, states, actions, and rewards are.
3. What is $G_t$?  How do we obtain it?
4. How does the critic network relate to $G_t$?
5. In $G_t$, there is a discount factor $\gamma$. What is it?  What is its range?  What does 0 mean, and what does 1 mean?
6. What is $V(s)$?  How do we obtain it?
7. The advantage function is $\hat{A}_t = G_t - V(s_t)$.  But what does it really mean?
8. How do we obtain the log probability of actions given the current state?  How does it relate to the actor network? What is its size?
9.  In summary, what does $$\mathbb{E}\left[ \underbrace{\log \pi_\theta(a_t | s_t)}_{\text{log prob.}} \; \underbrace{\hat{A}_t}_{\text{advantage func.}} \right]$$ mean?
10. In Trust Region Methods, we introduce (1) the probability ratio and (2) the KL divergence.  What are they for?
11. What limitation of the Trust Region Methods remains?
12. Proximal Policy Methods add a clip function. How does that fix the limitation of the Trust Region Methods?
13. In PPO, there is an entropy term.  What is it?  What does a bigger $\lambda$ in front of the entropy mean?  How about a smaller one?