###### Università degli Studi di Milano, Data Science and Economics Master Degree

# VFA
## Value Function Approximation

### Alfio Ferrara

## Introduction
Reinforcement Learning methods are based on the idea of updating the value $V(s)$ of the states in which the agent can find itself or, alternatively, the value $Q(s, a)$, which represents the value assigned to taking an action $a$ when in state $s$.

In **tabular solutions**, the functions $V(s)$ and $Q(s, a)$ are explicitly represented as tables that associate a value with states and state-action pairs.
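To make the tabular idea concrete, here is a minimal sketch of a value table stored as a dictionary. The update rule is a TD(0)-style step chosen purely for illustration; the function name `td0_update`, the grid-cell states, and the values of `alpha` and `gamma` are assumptions, not part of the course code.

```python
from collections import defaultdict

# Tabular value function: one stored number per state (unseen states default to 0.0).
V = defaultdict(float)

def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Move V[s] one step toward the one-step target reward + gamma * V[s_next]."""
    target = reward + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]

# Hypothetical transition between two grid cells with reward 1.0.
new_value = td0_update(V, s=(0, 0), reward=1.0, s_next=(0, 1))
print(new_value)
print(V[(5, 5)])  # an unvisited state still holds its default value
```

Each state owns its own entry, so an update to one entry leaves every other entry untouched.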

However, storing $V(s)$ and $Q(s, a)$ explicitly is **not practical** when the number of states becomes very large, even if it is finite. Some examples:

- The possible configurations of the **Rubik’s Cube** are approximately $4.3 \times 10^{19}$.  
- The number of possible **positions in Chess** is around $10^{47}$.  
- In a **real-world map**, even if represented as a discrete grid, there can be **billions of different positions**.
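A rough back-of-the-envelope calculation shows why such tables are out of reach. The sketch below assumes one 8-byte float per table entry and takes $10^9$ as a lower bound for "billions" of map positions; both figures are assumptions made for illustration.

```python
BYTES_PER_VALUE = 8  # assumption: one float64 per table entry

# State counts quoted in the list above (the map figure is an assumed lower bound).
state_counts = {
    "Rubik's Cube": 4.3e19,
    "Chess": 1e47,
    "Grid map": 1e9,
}

def table_size_bytes(n_states, n_actions=1):
    """Memory needed to store one value per state (or per state-action pair)."""
    return n_states * n_actions * BYTES_PER_VALUE

for name, n in state_counts.items():
    print(f"{name}: {table_size_bytes(n):.2e} bytes for V(s)")
```

Under these assumptions the Rubik's Cube table alone needs hundreds of exabytes, and a $Q(s, a)$ table multiplies that by the number of actions.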

Moreover, there are several real-world situations where the number of states is simply **infinite**, for example because the state space is continuous. Examples:
- In **autonomous vehicle control**, the vehicle state is typically represented by continuous variables, such as **speed**, **position** as $(x, y)$ coordinates, **steering angle**, etc.
- If we need to **control the temperature** in a room, we typically describe the state as a combination of **temperature**, **humidity**, **number of people in the room**, etc.
- In a **real-world map**, instead of a large grid, we may want to represent the position of agents just through their coordinates **(longitude, latitude)**.

### A matter of generalization
The number of states is not the only limitation of tabular solutions. A second issue concerns the information about a state that the agent can exploit. Let's see a couple of examples:

![](./imgs/obs-space.png)



- **First example**: the state space is discrete. With respect to the goal of reaching the apple, however, the pig (the agent) is in an equivalent situation in $s_A$ and $s_B$. We therefore expect $V(s_A) \approx V(s_B)$, but a tabular representation treats the two states as distinct entries: when the agent reaches $s_B$, it cannot reuse what it has learned about $V(s_A)$ to estimate $V(s_B)$.
- **Second example**: the position of the space probe (the agent) is almost the same in $s_A$ and $s_B$. Thus, we would like to exploit the information available about $V(s_A)$ in order to choose the action to take when in $s_B$.

To overcome this issue, we need a mechanism for **generalizing the notion of state**, so that the agent acts similarly in states that are almost identical.
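One common way to obtain this kind of generalization is to replace the table with a parametric function of state features. The sketch below is only an illustration under stated assumptions: a one-dimensional state in $[0, 1]$, radial basis features with hypothetical centers and width, a linear model, and a made-up value target observed only in $s_A$.

```python
import numpy as np

# Hypothetical 1-D continuous state in [0, 1]; centers of the radial basis features.
CENTERS = np.linspace(0.0, 1.0, 5)
WIDTH = 0.25

def features(s):
    """Map a state to a feature vector; nearby states get nearby vectors."""
    return np.exp(-((s - CENTERS) ** 2) / (2 * WIDTH ** 2))

def v_hat(s, w):
    """Approximate value: a weighted sum of the features."""
    return float(features(s) @ w)

w = np.zeros(len(CENTERS))
s_a, s_b = 0.50, 0.52      # two almost identical states
target, alpha = 1.0, 0.1   # hypothetical value target, observed in s_a only

for _ in range(50):
    # Gradient step on the squared error (target - v_hat(s_a))^2.
    w += alpha * (target - v_hat(s_a, w)) * features(s_a)

print(v_hat(s_a, w), v_hat(s_b, w))
```

Although only $s_A$ is ever updated, the estimate for $s_B$ moves with it, because the two states share almost the same features. A table would have left $V(s_B)$ at its initial value.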

## Jousting Duel: A Tactical Medieval Jousting Game
As an example of such an environment, we introduce Jousting Duel. Jousting Duel is a reinforcement learning-driven medieval jousting game where **two knights charge toward each other in a high-speed duel**. The player controls one knight, making **strategic decisions** on **steering**, **speed**, and **lance positioning** to **maximize their chances of landing a successful hit** while avoiding their opponent’s attack.

The game features a continuous state space, where the knight’s position, speed, and lance angle are dynamically updated. However, the player can only take discrete actions, such as steering left or right and adjusting their lance angle. Success depends on precise timing and positioning, requiring the agent to learn optimal strategies through reinforcement learning.

The goal is to hit the opponent's weak spot while dodging their lance. The game rewards efficient jousting techniques and penalizes missed attacks or poor positioning. 

<img style="width: 50%;" src="./imgs/jousting-duel.png" />

### Tech notes
#### Opponent (aka Environment) parameters
- `max_distance = 10.0` : starting distance between opponents
- `min_distance = 0.0` : collision point
- `speed_agent = speed_opponent = 1.0` : fixed speed for agent and opponent
- `lance_angle_change = 0.1` : Change in lance angle per action
- `opponent_lance_angle = np.random.uniform(-1, 1)` : random lance angle per episode

#### State space
- $\delta_t$: **relative distance** of opponents at time $t$ in $[0, \textrm{max distance}]$
- $\sigma$: **relative speed** of the two opponents in $[-5, 5]$
- $\theta$: **agent lance angle** in $[-1, 1]$
- $\phi$: **opponent lance angle** in $[-1, 1]$
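The four quantities above can be packed into a single vector, which is the form a function approximator usually expects. The sketch below is an assumption about how one might do this; `make_state` and `normalize` are hypothetical helpers, not part of the `JoustingDuel-v0` environment.

```python
import numpy as np

MAX_DISTANCE = 10.0  # value of max_distance in the environment parameters

# Bounds for (distance, relative speed, agent lance angle, opponent lance angle).
LOW = np.array([0.0, -5.0, -1.0, -1.0])
HIGH = np.array([MAX_DISTANCE, 5.0, 1.0, 1.0])

def make_state(distance, rel_speed, theta, phi):
    """Pack the four quantities into one vector, clipped to their ranges."""
    raw = np.array([distance, rel_speed, theta, phi], dtype=float)
    return np.clip(raw, LOW, HIGH)

def normalize(state):
    """Rescale every component to [0, 1], a common step before approximation."""
    return (state - LOW) / (HIGH - LOW)

s = make_state(8.0, 2.0, 0.1, 0.6)
print(s, normalize(s))
```

Rescaling keeps components with different ranges, such as distance and lance angle, on a comparable scale.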

#### Action space
With $c$ as a constant steer rate:
- $a=0$: **steer left** $\rightarrow \theta = \theta - c$
- $a=1$: **stay centered** $\rightarrow \theta = \theta$
- $a=2$: **steer right** $\rightarrow \theta = \theta + c$
- $a=3$: **lance up** $\rightarrow \theta = \theta + c$
- $a=4$: **lance down** $\rightarrow \theta = \theta - c$
- $a=5$: **increase speed** $\rightarrow \sigma = \min(\sigma + 1, 5)$
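The action effects listed above can be sketched as a small transition function. This is not the environment's actual code: `apply_action` is a hypothetical helper, $c$ is assumed to equal `lance_angle_change = 0.1`, action 2 is assumed to be "steer right", and clipping $\theta$ to $[-1, 1]$ is an assumption based on the state-space bounds.

```python
STEER_RATE = 0.1  # assumption: c equals lance_angle_change from the parameters
MAX_SPEED = 5.0

# Change applied to theta by each action (2 -> steer right is an assumption).
THETA_DELTA = {0: -STEER_RATE, 1: 0.0, 2: +STEER_RATE, 3: +STEER_RATE, 4: -STEER_RATE}

def apply_action(theta, sigma, action):
    """Return the updated (theta, sigma) after one discrete action."""
    if action == 5:
        sigma = min(sigma + 1.0, MAX_SPEED)
    else:
        theta = theta + THETA_DELTA[action]
        theta = max(-1.0, min(1.0, theta))  # assumption: theta stays in [-1, 1]
    return theta, sigma

theta, sigma = 0.0, 2.0
for a in [3, 3, 4, 1, 0]:  # hypothetical action sequence
    theta, sigma = apply_action(theta, sigma, a)
print(round(theta, 2), sigma)
```

The state stays continuous while the agent only ever chooses among six discrete moves.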

#### Reward

In [4]:
from gymbase import environments
import gymnasium as gym

In [6]:
env = gym.make('JoustingDuel-v0', distance=10)
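state, info = env.reset()  # Gymnasium's reset returns (observation, info)

done = False
while not done:
    action = env.action_space.sample()  # Random action
    print(f"Action taken {action}")
    # step returns separate terminated and truncated flags
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated  # stop on either end condition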
    env.render()  # Print the game state
    print()

env.close()

Action taken 3

        Distance: 8.00, 
        Agent Lance Angle: 0.10, 
        Opponent Lance Angle: 0.60, 
        Distance to target: 0.50

Action taken 3

        Distance: 6.00, 
        Agent Lance Angle: 0.20, 
        Opponent Lance Angle: 0.60, 
        Distance to target: 0.40

Action taken 4

        Distance: 4.00, 
        Agent Lance Angle: 0.10, 
        Opponent Lance Angle: 0.60, 
        Distance to target: 0.50

Action taken 1

        Distance: 2.00, 
        Agent Lance Angle: 0.10, 
        Opponent Lance Angle: 0.60, 
        Distance to target: 0.50

Action taken 0

        Distance: 0.00, 
        Agent Lance Angle: 0.00, 
        Opponent Lance Angle: 0.60, 
        Distance to target: 0.60

