# **FRA 503: Deep Reinforcement Learning**
**HW 3 Cartpole Function Approximation**

Napat Aeimwiratchai 65340500020  
Phattarawat Kadrum 65340500074

**Learning Objectives:**

1. Understand how function approximation works and how to implement it.

2. Understand how policy-based RL works and how to implement it.

3. Understand how advanced RL algorithms balance exploration and exploitation.

4. Be able to differentiate RL algorithms based on stochastic or deterministic policies, as well as value-based, policy-based, or Actor-Critic approaches.

5. Gain insight into different reinforcement learning algorithms, including Linear Q-Learning, Deep Q-Network (DQN), the REINFORCE algorithm, and the Actor-Critic algorithm. Analyze their strengths and weaknesses.

# **Part 1 : Understanding the algorithm**

4 function approximation-based RL algorithms:
- Linear Q-Learning 
- Deep Q-Network (DQN)
- MC REINFORCE algorithm
- Actor-Critic method (xxxxx)

For each algorithm, state whether it follows a value-based, policy-based, or Actor-Critic approach.  
Specify the type of policy it learns (stochastic or deterministic), identify the type of observation space and action space (discrete or continuous), and explain how each advanced RL method balances exploration and exploitation.


## Linear Q-Learning 

- Approach Type: Value-based
- Policy Type: Deterministic (ε-greedy)
- Observation Space: Discrete or low dimensional continuous 
- Action Space: Discrete
- Explore vs Exploitation:
    - uses ε-greedy
    - with probability ε => exploration (random action)
    - with probability 1-ε => exploitation (greedy action)
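
The ε-greedy rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this homework: the function names `epsilon_greedy` and `decay_epsilon` are hypothetical, and the multiplicative per-call decay is an assumption (its default values mirror `epsilon_decay` and `final_epsilon` from the Linear Q-Learning defaults in Part 3).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the largest Q-value (first index wins on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.998, final_epsilon=0.001):
    """Assumed schedule: multiply by `decay`, never dropping below the floor."""
    return max(final_epsilon, epsilon * decay)
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it is purely random, which is why training typically starts near 1.0 and decays toward the floor.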


Linear Q-Learning (with TD)

**Q-Update Rule:**

$$ q_{t+1}(S_t, A_t) \leftarrow q_t(S_t, A_t) + \alpha_t \left[ R_{t+1} + \gamma \max_a q_t(S_{t+1}, a) - q_t(S_t, A_t) \right] $$

**Linear Function Approximation:**

$$ q(s, a) \approx \mathbf{w}^\top \phi(s, a) $$

**Weight Update:**

$$ \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + \alpha_t \, \delta_t \, \nabla_{\mathbf{w}} q(s, a) $$

Where:

$$ \delta_t = R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') - q(S_t, A_t) $$
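
As a rough sketch of the weight update, the snippet below applies one TD step with plain Python lists. It assumes a particular feature map that the text does not specify: one weight vector per action, so that $\phi(s, a)$ is the state vector placed in the slot of action $a$ and the gradient with respect to that action's weights is the state itself. The `done` flag (no bootstrapping at terminal states) and the function names are also assumptions.

```python
def q_value(weights, state, action):
    """q(s, a) = w_a . s, using one weight vector per action (assumed features)."""
    return sum(w * x for w, x in zip(weights[action], state))


def linear_q_update(weights, state, action, reward, next_state, done,
                    alpha=0.001, gamma=0.9):
    """Apply one TD Q-learning step in place and return the TD error delta."""
    # Bootstrap from the greedy next action unless the episode has ended.
    next_max = 0.0 if done else max(
        q_value(weights, next_state, a) for a in range(len(weights)))
    td_error = reward + gamma * next_max - q_value(weights, state, action)
    # For a linear q, the gradient w.r.t. w_a is the feature vector itself.
    for i, x in enumerate(state):
        weights[action][i] += alpha * td_error * x
    return td_error
```

Only the weights of the action that was taken change, because the gradient of $q(S_t, A_t)$ with respect to the other actions' weights is zero under this feature map.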

## Deep Q-Network (DQN)

- Approach Type: Value-based
- Policy Type: Deterministic (ε-greedy)
- Observation Space: High dimensional (continuous)
- Action Space: Discrete
- Explore vs Exploitation:
    - uses ε-greedy
    - Experience Replay: decorrelates samples
    - Target Networks: stabilize Q-value update


Deep Q-Network (DQN)

**Loss Function:**

$$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right] $$

**Target Q-Value:**

$$ y = r + \gamma \max_{a'} Q_{\theta^-}(s', a') $$

**Gradient Update:**

$$ \theta \leftarrow \theta - \alpha \nabla_\theta L(\theta) $$

Where:
- $\theta$: parameters of the Q-network  
- $\theta^-$: parameters of the target network (updated periodically)  
- $D$: experience replay buffer  
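
To make the target and loss concrete, here is a framework-free sketch. It assumes that `online_q` and `target_q` are callables returning a list of Q-values (one per action) for a state, and that each batch entry is a `(state, action, reward, next_state, done)` tuple; these names and the terminal-state handling are illustrative assumptions. A real implementation would use a neural network and automatic differentiation for the gradient step.

```python
def dqn_targets(batch, target_q, gamma=0.9):
    """y = r + gamma * max_a' Q_target(s', a'), with no bootstrap at terminals."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else max(target_q(next_state))
        targets.append(reward + gamma * bootstrap)
    return targets


def dqn_loss(batch, online_q, target_q, gamma=0.9):
    """Mean squared TD error over a batch sampled from the replay buffer."""
    targets = dqn_targets(batch, target_q, gamma)
    errors = [y - online_q(s)[a]
              for y, (s, a, _r, _s2, _d) in zip(targets, batch)]
    return sum(e * e for e in errors) / len(errors)
```

Note that the target is computed with the target network only, so it acts as a fixed regression label while the online network is updated.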


## Monte Carlo REINFORCE

- Approach Type: Policy-based
- Policy Type: Stochastic
- Observation Space: Discrete or Continuous 
- Action Space: Discrete or Continuous
- Explore vs Exploitation:
    - Exploration from stochastic policy (softmax or Gaussian)
    - No ε-greedy needed
    - Monte Carlo (updates at the end of each episode) => high variance
    - Does not bootstrap (no value function): learning is slow, but the return estimate is unbiased
    


REINFORCE (Monte Carlo Policy Gradient)

**Policy Gradient Update:**

$$ \theta_{t+1} \leftarrow \theta_t + \alpha_t \, G_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) $$

**Return from timestep $t$:**

$$ G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t + k + 1} $$
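
The return $G_t$ can be computed for every timestep in one backward pass over the episode, using $G_t = R_{t+1} + \gamma G_{t+1}$. The sketch below assumes `rewards[t]` stores $R_{t+1}$ and that `log_probs[t]` stores $\log \pi_\theta(a_t | s_t)$; both function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [G_0, ..., G_{T-1}] where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Surrogate loss whose negative gradient is the REINFORCE update direction."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` gives `[1.75, 1.5, 1.0]`.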

## Actor-Critic (A2C: Advantage Actor-Critic)

- Approach Type: Actor-Critic (Hybrid)
- Policy Type: Stochastic (Actor outputs probability distribution over actions)
- Observation Space: Discrete or Continuous 
- Action Space: Discrete or Continuous

- Explore vs Exploitation:
    - Stochastic policy: Actions are sampled from probability distributions 

- Critic (value function) as a baseline: Reduce variance of policy gradient updates (used to compute Advantage)
- Bootstrapping & TD learning
- A2C uses TD(0) or n-step return updates
- Update directly from gradients


A2C (Advantage Actor-Critic)

$$
L^{\text{A2C}}(\theta) = -\mathbb{E}_t \left[ \log \pi_\theta(a_t | s_t) \cdot \hat{A}_t \right]
$$

Where:

$$
\hat{A}_t = r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)
$$

or with n-step returns:

$$
\hat{A}_t = \left( \sum_{i=0}^{n-1} \gamma^i r_{t+i} \right) + \gamma^n V_{\phi}(s_{t+n}) - V_{\phi}(s_t)
$$

Critic value loss:

$$
L^{\text{VF}}(\phi) = \mathbb{E}_t \left[ \left( V_\phi(s_t) - R_t \right)^2 \right]
$$

Entropy bonus (to encourage exploration):

$$
L^{\text{ENT}}(\theta) = \mathbb{E}_t \left[ \mathcal{H} \left[ \pi_\theta(\cdot | s_t) \right] \right]
$$

Total A2C loss:

$$
L^{\text{total}} = L^{\text{A2C}} + c_v L^{\text{VF}} - c_e L^{\text{ENT}}
$$

Where $c_v$ and $c_e$ are the coefficients for the value loss and entropy bonus respectively, and $R_t$ is the return target that the critic regresses toward.
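
The pieces of the total loss can be illustrated for a single transition with a discrete action distribution. This is only a sketch under stated assumptions: the coefficient defaults `c_v=0.5` and `c_e=0.01` are placeholders rather than values from this homework, the `done` handling is added for terminal states, and in a real implementation the advantage is treated as a constant when differentiating the policy term.

```python
import math


def td_advantage(reward, value, next_value, done, gamma=0.9):
    """One-step advantage estimate: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else next_value
    return reward + gamma * bootstrap - value


def entropy(probs):
    """Shannon entropy of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def a2c_total_loss(log_prob, advantage, value, value_target, probs,
                   c_v=0.5, c_e=0.01):
    """Policy loss + c_v * value loss - c_e * entropy, for one sample."""
    policy_loss = -log_prob * advantage
    value_loss = (value - value_target) ** 2
    return policy_loss + c_v * value_loss - c_e * entropy(probs)
```

Subtracting the entropy term means that minimizing the total loss pushes the policy toward higher entropy, which is how the bonus encourages exploration.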


### **Comparison Table**
|Algorithm|Approach Type|Policy Type|Observation Space|Action Space|Exploration vs Exploitation|
|---|---|---|---|---|---|
|Linear Q|Value based|Deterministic (ε-greedy)|Discrete / low-dimensional continuous|Discrete|ε-greedy|
|DQN|Value based|Deterministic (ε-greedy)|Continuous|Discrete|ε-greedy + Replay Buffer + Target Network|
|MC REINFORCE|Policy based|Stochastic|Discrete/Continuous|Discrete/Continuous|Stochastic Policy|
|A2C|Actor Critic|Stochastic|Discrete/Continuous|Discrete/Continuous|Stochastic policy + Estimated Advantage + baseline|

# **Part 2 : Setting up `Cart-Pole` Agent**

1. RL Base Class
2. Replay Buffer Class
3. Algorithm folder
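
For orientation, a replay buffer can be as small as the sketch below. It is not the class from this repository: the class layout, the `Transition` tuple, and the method names are assumptions, and the default sizes simply mirror `buffer_size` and `batch_size` from the DQN defaults in Part 3.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayBuffer:
    """Fixed-size FIFO memory with uniform random sampling."""

    def __init__(self, buffer_size=10000, batch_size=1, seed=None):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Sampling should only start once `len(buffer)` is at least `batch_size`, otherwise `sample()` raises a `ValueError`.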

# **Part 3 : Training & Playing to stabilize `Cart-Pole` Agent.**

We design experiments by varying the hyperparameters of each algorithm and training a model for each setting to observe the results. Each experiment changes one parameter listed in **param** at a time, while keeping all other parameters at their **defaults** values.
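
This one-parameter-at-a-time design can be expressed as a small helper that expands `defaults` and `param` into a list of run configurations. The helper name `one_at_a_time`, the run labels, and the inclusion of an unmodified baseline run are assumptions made for illustration.

```python
def one_at_a_time(defaults, param):
    """Return (label, config) pairs: a baseline plus one run per changed value."""
    runs = [("default", dict(defaults))]
    for name, values in param.items():
        for value in values:
            config = dict(defaults)
            config[name] = value  # only this key differs from the baseline
            runs.append((f"{name}={value}", config))
    return runs
```

Because every run differs from the baseline in exactly one key, any change in the reward curve can be attributed to that parameter, at the cost of not exploring interactions between parameters.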

## Linear Q-learning

```py
defaults = {
    'num_of_action': 5,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'n_episodes': 5000,
    'initial_epsilon': 1.0,
    'epsilon_decay': 0.998,
    'final_epsilon': 0.001,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.005],
    "discount": [0.8],
    "action_range": [[-10, 10]],
}
```

<p align = "center">
    <img src="result/LQ_reward.png" alt="Alt text" width="800"/>
</p>

Increase training to 5000 episodes:

<p align = "center">
    <img src="result/LQ_reward_5000ep.png" alt="Alt text" width="800"/>
</p>

From the reward graph, we select the model trained by changing *action_range*, as it gives the highest overall reward across both 3000 and 5000 episodes.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/Linear_Q_longest_episode.mp4" type="video/mp4">
    </video>
    <p>after 3000 episodes</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/Linear_Q_longest_episode_5k.mp4" type="video/mp4">
    </video>
    <p>after 5000 episodes</p>
  </div>
</div>

<p align = "center">
    <img src="result/LQ.png" alt="Alt text" width="800"/>
</p>

## Deep Q-Network (DQN)

```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'tau': 0.005,
    'dropout': 0.1,
    'initial_epsilon': 1.00,
    'epsilon_decay': 0.998,
    'final_epsilon': 0.01,
    'discount': 0.9,
    'buffer_size': 10000,
    'batch_size': 1
}
```

```py
param = {
    "learning_rate": [0.005],
    "hidden_dim": [256],
    "action_range": [[-10, 10]],
    "tau": [0.001, 0.01],
    "dropout": [0.2],
    "buffer_size": [50000],
    "discount": [0.8]
}
```

<p align = "center">
    <img src="result/DQN_reward.png" alt="Alt text" width="800"/>
</p>

Since the reward graph makes it difficult to see the difference in performance between the models, we ran all of them to determine which one gives the longest episode length, as shown in the graph below.

<p align = "center">
    <img src="result/DQN.png" alt="Alt text" width="800"/>
</p>

We then select the model trained by changing *hidden_dim*, as it gives the longest episode (the most steps).

<p align="center">
  <video width="800" controls>
    <source src="result/DQN_longest_episode_hi256.mp4" type="video/mp4">
  </video>
</p>

## Monte Carlo REINFORCE


```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'n_observations': 4,
    'dropout': 0.1,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.0005],
    "hidden_dim": [256],
    "discount": [0.8],
    "dropout": [0.2],
    "action_range": [[-10, 10]],
}
```

<p align = "center">
    <img src="result/MC_reward.png" alt="Alt text" width="800"/>
</p>

From the reward graph, we select the model trained by changing *discount*.

<p align="center">
  <video width="800" controls>
    <source src="result/MC_longest_episode.mp4" type="video/mp4">
  </video>
</p>

<p align = "center">
    <img src="result/MC.png" alt="Alt text" width="800"/>
</p>

## Actor-Critic (A2C: Advantage Actor-Critic)


```py
defaults = {
    'num_of_action': 7,
    'action_range': [-20.0, 20.0],
    'learning_rate': 0.001,
    'hidden_dim': 128,
    'n_episodes': 3000,
    'n_observations': 4,
    # 'dropout': 0.1,
    'discount': 0.9
}
```

```py
param = {
    "learning_rate": [0.005],
    "hidden_dim": [256],
    "discount": [0.8],
    "action_range": [[-10, 10]],
}
```

<p align = "center">
    <img src="result/AC_reward.png" alt="Alt text" width="800"/>
</p>

<p align = "center">
    <img src="result/blackpink.png" alt="Alt text" width="800"/>
</p>

As seen in the graph, the *default* parameters and *action_range* model have similar rewards, so we selected both models to run further evaluations.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_default.mp4" type="video/mp4">
    </video>
    <p>default parameter</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_ar_10_10.mp4" type="video/mp4">
    </video>
    <p>changing action_range</p>
  </div>
</div>

<p align = "center">
    <img src="result/AC.png" alt="Alt text" width="800"/>
</p>

# **Part 4 : Evaluate `Cart-Pole` Agent performance.**

- Learning efficiency (how well the agent learns to receive higher rewards)
- Deployment performance (how well the agent performs on the stabilization problem)

Analyze and visualize the result to determine: 
1. Which algorithm performs best?
2. Why does it perform better than the others?

From Part 3, the *Actor-Critic (A2C: Advantage Actor-Critic)* gives the best performance.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_default.mp4" type="video/mp4">
    </video>
    <p>default parameter</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/AC_longest_episode_ar_10_10.mp4" type="video/mp4">
    </video>
    <p>changing action_range</p>
  </div>
</div>

In comparison, `Linear Q Learning` uses a linear approximation:

$$ q(s, a) \approx \mathbf{w}^\top \phi(s, a) $$

So it may not be able to model the non-linear state-action value function in this task, since the cart-pole dynamics are non-linear.

<p align = "center">
    <img src="result/LQ_reward_ep.png" alt="Alt text" width="800"/>
</p>

In contrast, A2C uses a neural network, so it can handle more complex tasks.

`MC_Reinforce` updates the policy with the gradient step

$$ \theta_{t+1} \leftarrow \theta_t + \alpha_t \, G_t \, \nabla_\theta \log \pi_\theta(a_t | s_t) $$

where the return $G_t$ is computed by Monte Carlo from the full episode:

$$ G_t = \sum_{k=0}^{T - t - 1} \gamma^k R_{t + k + 1} $$

so it may lead to high variance during training.

<p align = "center">
    <img src="result/MC_reward_ep.png" alt="Alt text" width="800"/>
</p>

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <img src="result/MC_lately.png" alt="Alt text" width="800"/>
    <p>MC reward during training</p>
  </div>

  <div style="width: 48%;">
    <img src="result/AC_lately.png" alt="Alt text" width="800"/>
    <p>AC reward during training</p>
  </div>
</div>

As can be seen, AC has less variance in the later episodes of training.
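
One way to back this visual comparison with numbers is to compute the mean and standard deviation of the episode rewards over the last part of training for each algorithm. The helper below is a hypothetical sketch: it assumes the per-episode rewards were logged as a plain list, and the window of 100 episodes is an arbitrary choice.

```python
from statistics import mean, pstdev


def late_training_stats(rewards, window=100):
    """Mean and population standard deviation of the last `window` rewards."""
    tail = rewards[-window:]
    return mean(tail), pstdev(tail)
```

A lower standard deviation at a similar or higher mean would support the claim that the critic baseline reduces the variance of the policy gradient updates.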

Moreover, we think that `DQN` and `MC_RL` get stuck in local optima, as can be seen in the videos, where the cart only moves slightly to the left or right before the episode terminates.

<div style="display: flex; justify-content: space-between; text-align: center;">
  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/DQN_longest_episode_hi256.mp4" type="video/mp4">
    </video>
    <p>DQN</p>
  </div>

  <div style="width: 48%;">
    <video width="100%" controls>
      <source src="result/MC_longest_episode.mp4" type="video/mp4">
    </video>
    <p>MC_RL</p>
  </div>
</div>