# **Minekaa RL Parkour**

Members:
- Napat Aeimwiratchai 65340500020
- Phattarawat kadrum 65340500074

Goal: Train an RL agent to play a Minekaa parkour map.

## What is Minekaa

Minekaa is a 3D Python game that replicates Minecraft, but simplified to include only the important character actions used for parkour gameplay. These include movement (forward, backward, left, right, jump) and camera control (look up, look down, rotate left, rotate right).

<p align = "center">
    <img src="asset/minekaa.png" alt="Alt text" width="800"/>
</p>

## Previous work

### RLMinecraftParkour

Source: https://github.com/LouisCaubet/RLMinecraftParkour

Uses PPO to train an agent to play Minecraft parkour.
- Levels:
    - Level 1: Straight line
    - Level 2: Narrower straight line with one-block jump
    
- Rewards:
    - +100 for reaching the diamond block (main task)
    - +10 for each (gold) block towards the goal (small reward for getting closer to the goal)
    - -100 and end of episode when touching the bedrock (penalty term)


### MineRl_parkour

Source: https://github.com/seantey/minerl_parkour

- Environment: MineRL + Microsoft Malmo (Minecraft interface) + OpenAI Gym
- Agent: Minecraft player bot
- States: The observable state is the pixels of the player's first-person view; the full state is the entire Minecraft world.
- Actions: First-person shooter controls (up, down, left, right, strafe, camera)
- Completed Level 1: Straight-line cliff-walking lava map
- Rewards:
    - +100 if the player reaches the diamond block goal
    - +10 for every block of distance closer to the goal
    - -100 for death / drowning in lava

## **Experiment**

<p align = "center">
    <img src="asset/Axis.png" alt="Alt text" width="400"/>
</p>

### Purpose

- Compare the performance of DQN and SAC (SAC-Discrete) on the parkour task.

- Add reward and action terms to improve model performance (more reward terms and actions than the reference work).

### **Action & Reward**

**Action** (10 actions): 
- Forward (1)
- Backward (1)
- Strafe left/right (2)
- Pitch (look down) 0/45/90 degrees (3)
- Yaw (rotate) left/right (2)
- Jump (1)

**Reward:**
- +100: when the agent reaches the goal
- -50: when the agent hits lava
- Mini reward: when the agent moves closer to the goal (Euclidean distance)
- Penalty: increases with the number of steps taken
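The reward terms listed above could be combined roughly as in the sketch below. This is not the project's actual reward code: the function name `compute_reward`, the `PROGRESS_SCALE` and `STEP_PENALTY_RATE` constants, and the linear form of the step penalty are all assumptions made for illustration. Only the +100 and -50 values come from the list above.

```python
import math

GOAL_REWARD = 100.0        # from the reward list
LAVA_PENALTY = -50.0       # from the reward list
PROGRESS_SCALE = 1.0       # assumed weight of the distance-based mini reward
STEP_PENALTY_RATE = 0.001  # assumed growth rate of the step penalty


def compute_reward(prev_pos, pos, goal_pos, step, reached_goal, hit_lava):
    """Return (reward, done) for one transition. Hypothetical helper."""
    if reached_goal:
        return GOAL_REWARD, True
    if hit_lava:
        return LAVA_PENALTY, True
    # Mini reward: positive when the Euclidean distance to the goal shrinks.
    progress = math.dist(prev_pos, goal_pos) - math.dist(pos, goal_pos)
    # Step penalty: assumed to grow linearly with the step index.
    time_penalty = STEP_PENALTY_RATE * step
    return PROGRESS_SCALE * progress - time_penalty, False
```

With these assumed constants, moving one block closer to the goal at step 0 yields a reward of 1.0, and the same move later in the episode yields slightly less.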

### **Parkour map**

**Level 1:** Straight map  
Used to test the code setup.

<p align = "center">
    <img src="asset/straightmap.png" alt="Alt text" width="200"/>
</p>

**Level 2:** Zigzag path

<p align = "center">
    <img src="asset/zigzagmap.png" alt="Alt text" width="200"/>
</p>

### **Algorithm**

#### DQN (Deep Q-Learning)

- Q-learning update rule:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right]
$$  
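As a minimal sketch of this update rule in the tabular case, the function below applies one Q-learning step to a dictionary-based Q-table. The name `q_learning_update`, the default `alpha` and `gamma`, and the `done` flag are illustrative assumptions, not values taken from this project.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_learning_update(Q, "s0", 0, 1.0, "s1", actions=range(2))
```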


- In this task the observation is an image (pixels), so the state space is far too large to store Q-values in a Q-table.
- DQN therefore uses a neural network to approximate the Q-function $Q(s, a; \theta)$.
- It stores the agent's experiences $(s, a, r, s')$ in a replay buffer and samples mini-batches at random to break the correlation between samples and stabilize training.
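The replay buffer described above can be sketched in a few lines of standard-library Python. The class name `ReplayBuffer`, its method names, and the extra `done` flag stored with each transition are assumptions for illustration; the project relies on its RL library's own buffer.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions. Hypothetical sketch."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```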


During training, DQN minimizes the squared TD error, where $\hat{Q}$ is the target network:

$$
\text{loss} = \left( r + \gamma \max_{a'} \hat{Q}(s', a') - Q(s, a) \right)^2
$$
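To make the loss concrete, the sketch below evaluates it on a small batch using plain Python lists in place of network outputs. The function name `dqn_loss` is hypothetical, and the `dones` mask (which drops the bootstrap term on terminal transitions) is a common addition that does not appear in the formula above.

```python
def dqn_loss(q_values, target_next_q_values, actions, rewards, dones, gamma=0.99):
    """Mean squared TD error over a batch.

    q_values[i]             : Q(s_i, .) from the online network
    target_next_q_values[i] : Q_hat(s'_i, .) from the target network
    """
    total = 0.0
    for q, q_next, a, r, done in zip(q_values, target_next_q_values, actions, rewards, dones):
        # TD target: r + gamma * max_a' Q_hat(s', a'), with no bootstrap at episode end.
        target = r + (0.0 if done else gamma * max(q_next))
        total += (target - q[a]) ** 2
    return total / len(actions)
```

In a real implementation the same computation would run on tensors, with gradients flowing only through the online network's $Q(s, a)$.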

#### SAC (Soft Actor-Critic Discrete)

We use a modified SAC that works with discrete actions, based on https://github.com/toshikwa/sac-discrete.pytorch

Soft Actor-Critic wants to find a policy that maximises the maximum entropy objective

$$\pi^* = \arg\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \tau_\pi} \left[ \gamma^t \left( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right) \right]$$

where the entropy of the policy is

$$\mathcal{H}(\pi(\cdot | s)) = - \mathbb{E}_{a \sim \pi(\cdot | s)} \left[ \log \pi(a | s) \right]$$

Entropy measures how random a random variable is; higher entropy means more exploration. SAC maximizes entropy alongside reward so that the agent explores a wide range of actions, increasing the chance of finding a better policy.
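For a discrete policy, the entropy can be computed directly from the action probabilities. The short sketch below (the helper name `entropy` is ours) shows that a uniform policy over the 10 actions has the maximum entropy $\log 10$, while a deterministic policy has zero entropy.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.1] * 10           # equal probability for each of the 10 actions
greedy = [1.0] + [0.0] * 9     # always picks the same action
print(entropy(uniform), entropy(greedy))
```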

Entropy-regularized reward objective:

$$J(\pi) = \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right]$$

SAC achieves state-of-the-art sample efficiency in multiple challenging continuous control domains. The paper [1] proposes SAC-Discrete (SAC-D), which adapts SAC from continuous to discrete actions:

- The policy $\pi_\phi \left( a_t \mid s_t \right)$ now outputs a probability for each action instead of a density, so the soft value is computed exactly as a weighted sum over actions rather than estimated from samples:

$V(s_t) := \mathbb{E}_{a_t \sim \pi} \left[ Q(s_t, a_t) - \alpha \log \left( \pi(a_t \mid s_t) \right) \right]$ becomes $V(s_t) := \pi(s_t)^T \left[ Q(s_t) - \alpha \log\left( \pi(s_t) \right) \right]$

- The temperature loss is likewise computed as an exact expectation over actions, which reduces the variance of the estimate:

$J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \left[ -\alpha \left( \log \pi_t(a_t \mid s_t) + \bar{H} \right) \right]$ becomes $J(\alpha) = \pi_t(s_t)^T \left[ -\alpha \left( \log(\pi_t(s_t)) + \bar{H} \right) \right]$

- The policy objective no longer needs the reparameterization trick, because the expectation over actions is computed directly:

$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D, \epsilon_t \sim \mathcal{N}} \left[ \alpha \log\left( \pi_{\phi}(f_{\phi}(\epsilon_t; s_t) \mid s_t) \right) - Q_{\theta}(s_t, f_{\phi}(\epsilon_t; s_t)) \right]$ becomes $J_{\pi}(\phi) = \mathbb{E}_{s_t \sim D} \left[ \pi_t(s_t)^T \left[ \alpha \log(\pi_{\phi}(s_t)) - Q_{\theta}(s_t) \right] \right]$
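The three discrete-action quantities above can be evaluated for a single state with plain Python, as in the sketch below. The function names are hypothetical and lists stand in for the network outputs $\pi(s_t)$ and $Q(s_t)$; a real implementation would use batched tensors and typically two critics.

```python
import math


def soft_value(pi, q, alpha):
    """V(s) = pi(s)^T [Q(s) - alpha * log pi(s)]"""
    return sum(p * (qa - alpha * math.log(p)) for p, qa in zip(pi, q) if p > 0.0)


def temperature_loss(pi, alpha, target_entropy):
    """J(alpha) = pi(s)^T [-alpha * (log pi(s) + H_bar)]"""
    return sum(p * (-alpha * (math.log(p) + target_entropy)) for p in pi if p > 0.0)


def policy_loss(pi, q, alpha):
    """J_pi = pi(s)^T [alpha * log pi(s) - Q(s)] for one state."""
    return sum(p * (alpha * math.log(p) - qa) for p, qa in zip(pi, q) if p > 0.0)


pi = [0.5, 0.5]   # action probabilities for a two-action example
q = [1.0, 3.0]    # Q-value estimates for the same two actions
print(soft_value(pi, q, alpha=0.2), policy_loss(pi, q, alpha=0.2))
```

For a single state and the same Q estimates, the policy loss is the negative of the soft value, so minimizing it pushes the policy toward actions with high Q-values while keeping its entropy high.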

### **Reinforcement learning component**

#### Environment

Custom map environment built with Ursina, with Stable-Baselines3 used for the algorithms.

#### State 

The observation is the pixel image from the agent's first-person view (FPV). The agent position is also tracked and is used to compute the reward.

#### Actions

```py
# 0 move forward
self.player.position += forward * speed
# 1 move backward
self.player.position -= forward * speed
# 2 left
self.player.position -= right * speed
# 3 right
self.player.position += right * speed
# 4 look down 45 degrees
self.player.rotation_x = 45
# 5 look down 90 degrees
self.player.rotation_x = 90
# 6 look forward (reset pitch)
self.player.rotation_x = 0
# 7 rotate left
self.current_angle_index = (self.current_angle_index - 1) % 4
self.player.rotation_y = self.rotation_angles[self.current_angle_index]
# 8 rotate right
self.current_angle_index = (self.current_angle_index + 1) % 4
self.player.rotation_y = self.rotation_angles[self.current_angle_index]
# 9 jump forward
self.player.velocity.y = self.jump_force
self.player.position += forward * 0.2
self.on_ground = False
```
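Actions 7 and 8 step through a fixed list of yaw angles using modular arithmetic, so rotating left from the first angle wraps around to the last. The sketch below isolates that logic; the contents of `ROTATION_ANGLES` are an assumption (four 90-degree steps, matching the `% 4` in the snippet above), and `rotate` is a hypothetical helper.

```python
ROTATION_ANGLES = [0, 90, 180, 270]  # assumed yaw angles, one per index


def rotate(current_index, direction):
    """Return (new_index, new_yaw) after one step; direction is -1 (left) or +1 (right)."""
    # Python's % always returns a non-negative result, so -1 wraps to the last index.
    new_index = (current_index + direction) % len(ROTATION_ANGLES)
    return new_index, ROTATION_ANGLES[new_index]


print(rotate(0, -1))  # rotating left from index 0 wraps to the last angle
print(rotate(3, +1))  # rotating right from the last index wraps to the first
```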

### **Result**

#### **First map (straight)**

**DQN**

<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <img src="asset/DQNstvis.png" alt="Zigzag Map" width="200" />
    <video width="500" controls>
        <source src="asset/DQN_straight.mp4" type="video/mp4">
    </video>
</div>

- The agent can reach the goal, but when it touches the goal, it chooses to move backward. This may be due to a random action that happened to yield the highest reward, which the agent then treated as the optimal policy.

If the video does not play, you can view it at asset/DQN_straight.mp4

**SAC**

<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <img src="asset/SACstvis.png" alt="Zigzag Map" width="200" />
    <video width="500" controls>
        <source src="asset/sac_evalst.mp4" type="video/mp4">
    </video>
</div>

- The agent can receive a positive reward when taking steps closer to the goal, but there seem to be many random actions.

If the video does not play, you can view it at asset/sac_evalst.mp4

#### **Second map (zigzag)**

**DQN**

<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <img src="asset/DQNzigvis.png" alt="Zigzag Map" width="200" />
    <video width="500" controls>
        <source src="asset/DQN_zigzag.mp4" type="video/mp4">
    </video>
</div>

- The agent is able to move about halfway to the goal, but afterward it repeatedly chooses the same action (looking down, action 4) until the episode is terminated at the maximum timestep. It's possible that the agent prefers to incur the penalty for repeating actions rather than risk falling into the lava.

If the video does not play, you can view it at asset/DQN_zigzag.mp4

**SAC**

<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <img src="asset/SACzigvis.png" alt="Zigzag Map" width="200" />
    <video width="500" controls>
        <source src="asset/sac_evalzig.mp4" type="video/mp4">
    </video>
</div>

- The agent can move a little, but it still seems to be taking random actions.

If the video does not play, you can view it at asset/sac_evalzig.mp4

### Graph


Effect of adding the step penalty term

<p align = "center">
    <img src="asset/compare_penalty.png" alt="Alt text" width="600"/>
</p>

Before adding the step penalty (a penalty that increases with the number of steps taken), the agent sometimes repeated actions or performed actions that did not increase the reward, such as rotating in place. After adding the penalty, the training reward improved, even with the same hyperparameters and a fixed random seed. The agent also tended to perform more meaningful actions, with fewer unnecessary behaviors. However, in some environments the agent still takes unimportant actions (repeatedly adjusting the camera view) until the episode ends at the maximum timestep. This might be because the accumulated step penalty reduces the total reward less than the penalty for falling into lava.
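The last point can be illustrated with a small calculation. The numbers below are purely hypothetical (a linear step penalty with rate 0.001 and a 200-step limit were chosen for illustration, not measured from this project); only the -50 lava penalty comes from the reward definition. Under these assumptions, idling until timeout costs less than falling into lava, so idling is the safer choice from the agent's point of view.

```python
LAVA_PENALTY = -50.0  # from the reward definition


def idle_return(max_steps, penalty_rate):
    """Total step penalty if the agent idles until timeout.

    Assumes the penalty at step t is penalty_rate * t, for t = 1..max_steps.
    """
    return -penalty_rate * max_steps * (max_steps + 1) / 2


idle = idle_return(max_steps=200, penalty_rate=0.001)  # hypothetical values
print(idle, LAVA_PENALTY, idle > LAVA_PENALTY)
```

A timeout penalty, or a larger step penalty rate, would be one way to remove this incentive.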

<div style="display: flex; justify-content: center; align-items: center; gap: 20px;">
    <img src="asset/SACDgraph.png" alt="Zigzag Map" width="500" />
    <img src="asset/SACDsmoothgraph.png" alt="Zigzag Map" width="500" />
</div>

Problem with SAC-D: the reward converges to 0. Possible causes:
- Entropy converges too fast
- Under-exploration
- The target network is updated too infrequently (`target_network_frequency`)


In conclusion, both algorithms are able to increase the reward during training, but they still require hyperparameter tuning and more training steps to achieve better results.

### Future Improvements

- Add a penalty for timeout (termination at the maximum timestep)
- Make the actions continuous and train with continuous SAC
- Tune hyperparameters

**Reference**  
[1] https://doi.org/10.48550/arXiv.1910.07207  
[2] https://github.com/LouisCaubet/RLMinecraftParkour  
[3] https://github.com/seantey/minerl_parkour