###### Università degli Studi di Milano, Data Science and Economics Master Degree

# Backward propagation by Dynamic Programming

### Alfio Ferrara

Backward propagation can be used to find the optimal policy by recursively exploring the agent's options at different time steps.

## Example: the Taxation Game
In this game, the player takes on the role of a politician with a **5-year mandate to govern their country**. The goal is to **maximize the total revenue** collected through taxation while balancing economic stability. Each year, the politician must decide between two levels of taxation:

- **High Taxation**: Generates more revenue in the short term but risks lowering economic growth for future years.
- **Moderate Taxation**: Generates less revenue in the short term but maintains economic growth for future years.

At the end of the 5 years, the game evaluates the player's total revenue and their ability to sustain the economy. 

## Modeling the game

- action space: `(0, 1)`, corresponding to _high taxation_ and _moderate taxation_
- observation space: `(economy, years)`, where `economy` is the state of the economy, either _high growth_ or _low growth_, and `years` is the number of years remaining until the end of the politician's mandate.
- `reward`: 

| Economic Growth | High Taxation | Moderate Taxation |
|------------------|--------------|-------------------|
| High             | 15           | 10                |
| Low              | 8            | 5                 |

- `transition` (deterministic)

| Economic Growth | High Taxation Impact | Moderate Taxation Impact |
|------------------|----------------------|--------------------------|
| High             | Decreases to Low    | Remains High             |
| Low              | Remains Low         | Remains Low              |



See the game implementation using gymnasium at [Taxation Game](./gymbase/environments.py).
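Independently of that implementation, the two tables above can be sketched as NumPy arrays indexed by `[state, action]`. This is only an illustration: the names `REWARDS`, `TRANSITIONS` and `step` are hypothetical and are not taken from `gymbase`. The index convention (state `0` = high growth, action `0` = high taxation) is an assumption that matches the action space listed above.

```python
import numpy as np

# Assumed index conventions:
# states:  0 = high growth, 1 = low growth
# actions: 0 = high taxation, 1 = moderate taxation
REWARDS = np.array([[15, 10],
                    [8, 5]])
TRANSITIONS = np.array([[1, 0],
                        [1, 1]])

def step(economy: int, years_left: int, action: int):
    """Hypothetical helper: play one year from (economy, years_left)."""
    reward = int(REWARDS[economy, action])
    next_economy = int(TRANSITIONS[economy, action])
    return (next_economy, years_left - 1), reward

# High taxation in a high-growth economy: reward 15, economy drops to low
print(step(0, 3, 0))  # ((1, 2), 15)
```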

In [2]:
import gymnasium as gym 
import gymbase.environments

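**Note** that the Environment encodes the economy as an integer:
- `0` means `High` growth
- `1` means `Low` growth

Actions follow the action space above: `0` is _high taxation_ and `1` is _moderate taxation_.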

In [3]:
years = 3
env = gym.make("TaxationGame-v0", years=years, transitions=None, rewards=None)
_, _ = env.reset()
print("Transitions")
print(env.unwrapped.transitions)
print("Rewards")
print(env.unwrapped.rewards)

Transitions
[[1 0]
 [1 1]]
Rewards
[[15 10]
 [ 8  5]]


In [3]:
(economy, years_left), _ = env.reset()

done, total_reward = False, 0

env.render()

# Fixed plan: moderate taxation for two years, then high taxation in the last year
i, actions = 0, [1, 1, 0]

while not done:
    #action = env.action_space.sample()
    action = actions[i]
    s_prime, reward, done, _, _ = env.step(action=action)
    total_reward += reward
    # The observation is an (economy, years) pair: unpack it before labelling
    economy, years_left = s_prime
    s_prime_label = 'High Economy' if economy == 0 else 'Low Economy'
    print(f"\tAction Taken: {'High Taxation' if action == 0 else 'Moderate Taxation'}")
    print(f"\tReward: {reward}, Next State: {s_prime_label}")
    print(f"\tTotal Reward: {total_reward}")
    env.render()
    i += 1
print(f"Total Reward: {total_reward}")

📈 Economic Growth: High, 📅 Years Remaining: 3
	Action Taken: Moderate Taxation
	Reward: 10, Next State: High Economy
	Total Reward: 10
📈 Economic Growth: High, 📅 Years Remaining: 2
	Action Taken: Moderate Taxation
	Reward: 10, Next State: High Economy
	Total Reward: 20
📈 Economic Growth: High, 📅 Years Remaining: 1
	Action Taken: High Taxation
	Reward: 15, Next State: Low Economy
	Total Reward: 35
📈 Economic Growth: Low, 📅 Years Remaining: 0
Total Reward: 35
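
With only three years and two actions there are just $2^3 = 8$ possible plans, so the result of the run above can be cross-checked by brute force. The sketch below does not use the Gymnasium environment: it hard-codes the default reward and transition tables, and `total_revenue` is a hypothetical helper introduced only for this check.

```python
from itertools import product
import numpy as np

# Default tables, indexed as [state, action]; state 0 = high growth, action 0 = high taxation
REWARDS = np.array([[15, 10], [8, 5]])
TRANSITIONS = np.array([[1, 0], [1, 1]])

def total_revenue(plan, economy=0):
    """Sum the rewards of a fixed sequence of actions, starting from `economy`."""
    total = 0
    for action in plan:
        total += int(REWARDS[economy, action])
        economy = int(TRANSITIONS[economy, action])
    return total

# Enumerate every 3-year plan and keep the one with the highest revenue
plans = list(product([0, 1], repeat=3))
best_plan = max(plans, key=total_revenue)
print(best_plan, total_revenue(best_plan))  # (1, 1, 0) 35
```

Enumeration grows as $2^{T}$ with the number of years $T$, which is why a recursive approach becomes preferable for longer mandates.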


## Backward induction

As an example of a recursive solution, we investigate **backward induction**. We find the **optimal policy** by starting from the end of the mandate, $t = 0$, where $t$ counts the years remaining, and working backwards to the earlier years.

In order to find the optimal policy $\pi^{*}(s, t)$, where $s$ represents the current **economy** and $t$ the number of years remaining when the decision has to be taken, we define the value of a state as $V(s, t)$.

First, we take into account the **base case**, that is time $t = 0$, when the mandate is over.
The last decision is taken at $t = 1$: since no future remains, the best choice there is the action with the highest immediate reward, which with the rewards above is always _high taxation_ (it is the last year of the mandate, so who cares about the future!).

Moreover, the value at $t = 0$ does not depend on the state, because at $t = 0$ no further action is taken and no reward is collected, which means that:

$$
V(s, 0) = 0\ \forall s
$$

Then, for each state (_high economy_, _low economy_) and action (_high taxation_, _moderate taxation_), with $s'$ denoting the state reached by taking action $a$ in state $s$, we compute recursively:

$$
Q(s, a, t) = r(s, a) + V(s', t - 1)
$$

Thus, as a first step, since $V(s, 0) = 0$, we get $Q(s, a, 1) = r(s, a)$.

**Note** that, in general, $V(s, t) = \max\limits_{a} Q(s, a, t)$ for $t \geq 1$.

The final optimal policy is then

$$
\pi^{*}(s, t) = \arg\max\limits_{a} Q(s, a, t)
$$

![](./imgs/backward-induction.png)
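
The recursion can also be read top-down, directly from the two equations for $Q$ and $V$. The sketch below is a memoized recursive variant using `functools.lru_cache`. It is an alternative illustration, not the tabular implementation used in this notebook, and the names `q_value`, `value` and `best_action` are hypothetical. It assumes the default reward and transition tables.

```python
from functools import lru_cache

# Default tables as nested tuples, indexed as [state][action]
REWARDS = ((15, 10), (8, 5))
TRANSITIONS = ((1, 0), (1, 1))

def q_value(s, a, t):
    """Q(s, a, t) = r(s, a) + V(s', t - 1), defined for t >= 1."""
    return REWARDS[s][a] + value(TRANSITIONS[s][a], t - 1)

@lru_cache(maxsize=None)
def value(s, t):
    """V(s, 0) = 0; otherwise V(s, t) = max over a of Q(s, a, t)."""
    if t == 0:
        return 0
    return max(q_value(s, a, t) for a in (0, 1))

def best_action(s, t):
    """Greedy action at (s, t); on ties the lower action index wins."""
    return max((0, 1), key=lambda a: q_value(s, a, t))

print(value(0, 3), best_action(0, 3))  # 35 1
```

Memoization ensures each pair $(s, t)$ is evaluated once, so the cost matches that of filling a table bottom-up.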


In [1]:
import numpy as np

In [4]:
def backward_induction(environment: gym.Env):
    """Return (pi, V, Q), each indexed by economy state and years remaining t.

    Note: the horizon is read from the notebook-level variable `years`.
    """
    # Transition and reward tables, indexed as [state, action]
    transitions = environment.unwrapped.transitions
    rewards = environment.unwrapped.rewards
    states = environment.observation_space[0]
    actions = environment.action_space

    # Column t = 0 is the base case V(s, 0) = 0; pi has no meaningful entry there
    V = np.zeros((states.n, years + 1))
    pi = np.zeros((states.n, years + 1))
    Q = np.zeros((states.n, actions.n, years + 1))

    for t in range(1, years + 1):
        for state in range(states.n):
            for a in range(actions.n):
                s_prime = transitions[state, a]
                r = rewards[state, a]
                Q[state, a, t] = r + V[s_prime, t - 1] # Recursive step
            # Update value and greedy action (np.argmax picks the first action on ties)
            V[state, t] = Q[state, :, t].max()
            pi[state, t] = np.argmax(Q[state, :, t])
    return pi, V, Q

In [5]:
policy, V, Q = backward_induction(env)
print("Policy")
print(policy)
print("Values")
print(V)

Policy
[[0. 0. 1. 1.]
 [0. 0. 0. 0.]]
Values
[[ 0. 15. 25. 35.]
 [ 0.  8. 16. 24.]]
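
In both tables the rows are the economy states (`High`, `Low`) and the columns are the years remaining $t = 0, \dots, 3$. The column $t = 0$ of the policy is only a placeholder, since no action is taken there. To turn the policy table into a concrete plan, one can follow it forward from the initial state. The sketch below copies the policy printed above into a hypothetical `POLICY` array and uses the default tables. The `rollout` helper is an assumption introduced for illustration.

```python
import numpy as np

REWARDS = np.array([[15, 10], [8, 5]])
TRANSITIONS = np.array([[1, 0], [1, 1]])
# Policy copied from the output above: rows = economy, columns = years remaining t
POLICY = np.array([[0, 0, 1, 1],
                   [0, 0, 0, 0]])

def rollout(economy, years_left):
    """Follow POLICY from (economy, years_left) until no years remain."""
    plan, total = [], 0
    for t in range(years_left, 0, -1):
        action = int(POLICY[economy, t])
        plan.append(action)
        total += int(REWARDS[economy, action])
        economy = int(TRANSITIONS[economy, action])
    return plan, total

print(rollout(0, 3))  # ([1, 1, 0], 35)
```

Starting from a high economy with three years left, the plan is moderate, moderate, high taxation, and its total of 35 equals `V[0, 3]`.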


In [9]:
Q[:,:,2]

array([[23., 25.],
       [16., 13.]])

**Note** that we can compute the optimal policy by backward induction because the horizon is finite and the transition and reward tables are known (and, here, deterministic), which means that we do not need to explore the environment.

Let's check with different rewards that make High Taxation less competitive when the economy is `Low`:

| Economic Growth | High Taxation | Moderate Taxation |
|------------------|--------------|-------------------|
| High             | 15           | 10                |
| Low              | 1            | 5                 |

Moreover, `Moderate Taxation` in a `Low` economy now turns the economy back to `High`:

| Economic Growth | High Taxation Impact | Moderate Taxation Impact |
|------------------|----------------------|--------------------------|
| High             | Decreases to Low    | Remains High             |
| Low              | Remains Low         | Increases to High              |

![](./imgs/back-even.png)


In [10]:
rewards = np.array([
    [15, 10],
    [1, 5]
])
transitions = np.array([
    [1, 0],
    [1, 0]
])

years = 3

env = gym.make("TaxationGame-v0", years=years, transitions=transitions, rewards=rewards)
policy, V, Q = backward_induction(environment=env)

In [11]:
print("Policy")
print(policy)
print("Values")
print(V)

Policy
[[0. 0. 1. 0.]
 [0. 1. 1. 1.]]
Values
[[ 0. 15. 25. 35.]
 [ 0.  5. 20. 30.]]
