# Policy Gradient

## Overview

 * **Goal**: find the parameters $\theta$ of the policy $\pi_\theta(s, a)$ that maximise the policy quality measure $J(\theta)$.
 * Example of a quality measure for evaluating the policy (_average reward per time-step_):
 
$
J_{avR}(\theta) = \sum_s \underbrace{d^{\pi_\theta}(s)}_\textrm{probability of being in state s} \sum_a \underbrace{\pi_\theta(s, a)}_\textrm{probability of choosing action a given state s} R^a_s
$

 * Ways to improve $J(\theta)$: hill climbing, gradient ascent
 * If $\pi_\theta(s, a)$ is differentiable, its gradient can be rewritten with the log-derivative trick: $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$
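
The log-derivative trick can be sanity-checked numerically. The sketch below assumes a softmax policy over three actions in a single state; the softmax parameterisation and the helper names `softmax` and `numerical_jacobian` are illustrative assumptions, not part of these notes.

```python
import numpy as np

def softmax(theta):
    """Softmax policy for a single state: pi[a] = exp(theta[a]) / sum_b exp(theta[b])."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def numerical_jacobian(f, theta, eps=1e-6):
    """Central finite-difference Jacobian; entry [a, k] is d f(theta)[a] / d theta[k]."""
    jac = np.zeros((f(theta).shape[0], theta.shape[0]))
    for k in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[k] = eps
        jac[:, k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return jac

theta = np.array([0.5, -1.0, 2.0])  # arbitrary illustrative parameters
pi = softmax(theta)

grad_pi = numerical_jacobian(softmax, theta)                           # gradient of pi
grad_log_pi = numerical_jacobian(lambda t: np.log(softmax(t)), theta)  # gradient of log pi

# Log-derivative trick: grad pi(a) = pi(a) * grad log pi(a) for every action a
trick_holds = np.allclose(grad_pi, pi[:, None] * grad_log_pi, atol=1e-6)
print(trick_holds)
```

Both sides are estimated independently by finite differences, so agreement here is a check of the identity itself rather than of a particular closed form.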

## Policy Gradient Theorem

 - Given a differentiable policy $\pi_\theta(s, a)$
 - Given an objective function $J$ measuring policy quality
 - Then the policy gradient is:
 
$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[Q^{\pi_\theta}(s, a) \nabla_\theta \log \pi_\theta(s, a)]
$
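
The theorem can be checked on a deliberately tiny case. The sketch below assumes a hypothetical one-step problem (one state, three actions, fixed rewards) with a softmax policy, so that $Q^{\pi_\theta}(s, a)$ is just the immediate reward; the rewards, parameters and names are made up for illustration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

# Hypothetical one-step problem: Q(s, a) equals the immediate reward of action a
rewards = np.array([1.0, 0.0, 3.0])
theta = np.array([0.2, -0.4, 0.1])

def J(theta):
    """Expected reward under the softmax policy."""
    return softmax(theta) @ rewards

pi = softmax(theta)
# For a softmax policy, grad log pi(a) = one_hot(a) - pi (row a of this matrix)
grad_log_pi = np.eye(3) - pi
# Policy gradient theorem: E_{a ~ pi}[ Q(s, a) * grad log pi(a) ], computed exactly
pg_theorem = (pi * rewards) @ grad_log_pi

# Central finite differences of J as an independent reference
eps = 1e-6
finite_diff = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(pg_theorem, finite_diff)
```

The expectation is summed exactly over the three actions here; in a real environment it would be estimated from sampled trajectories.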

## Monte-Carlo Policy Gradient (REINFORCE)

 - Update $\theta$ by **stochastic gradient ascent**
 - Uses the **Policy Gradient Theorem**
 - Uses the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$
 

    def REINFORCE():
        Initialise θ
        For each episode {(s_1, a_1, r_2), ..., (s_T-1, a_T-1, r_T)}
            For t = 1 to T-1
                θ <- θ + α ∇log π(s_t, a_t) v_t
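
The return used in each update can be computed in a single backward pass over the episode's rewards. This is a minimal sketch, assuming a discount factor `gamma` (the pseudocode above does not specify one); the helper name `discounted_returns` is hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """For each step i, return r_i + gamma * r_{i+1} + gamma^2 * r_{i+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```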

## Example with a Logistic Policy / Two-Action Space

References:
 - https://www.oreilly.com/library/view/reinforcement-learning/9781492072386/app01.html
 - https://www.janisklaise.com/post/rl-policy-gradients/

Let $\phi(s,a)$ be a feature vector built from the environment's variables. In the formulas and code below, the raw state $s$ is used directly as the feature vector.

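Logistic policy (probability of choosing action 0; action 1 gets the complementary probability):

$
\pi_\theta(0 \mid s) = \sigma(\theta^T s) = \frac {e^{\theta^T s}} {1 + e^{\theta^T s}}, \qquad \pi_\theta(1 \mid s) = 1 - \pi_\theta(0 \mid s)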
$

Gradient of the log of the logistic policy:

$
\nabla_\theta \log \pi_\theta(0 \mid s) = s - s \, \pi_\theta(0 \mid s) \\
\nabla_\theta \log \pi_\theta(1 \mid s) = - s \, \pi_\theta(0 \mid s)
$
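
These two gradients can be verified against finite differences before using them in an agent. The sketch below assumes a 4-dimensional state (the size of a CartPole observation) with made-up values; the function names `pi_0`, `log_pi` and `grad_log_pi` are illustrative.

```python
import numpy as np

def pi_0(theta, s):
    """Probability of action 0 under the logistic policy: sigmoid(theta^T s)."""
    return 1.0 / (1.0 + np.exp(-s @ theta))

def log_pi(theta, s, a):
    """Log-probability of action a (0 or 1) in state s."""
    p0 = pi_0(theta, s)
    return np.log(p0) if a == 0 else np.log(1.0 - p0)

def grad_log_pi(theta, s, a):
    """Closed-form gradient of log pi(a | s), as given by the two formulas above."""
    p0 = pi_0(theta, s)
    return s - s * p0 if a == 0 else -s * p0

# Arbitrary parameters and state, chosen only for the check
theta = np.array([0.3, -0.2, 0.5, 0.1])
s = np.array([0.02, -0.4, 0.03, 0.6])

eps = 1e-6
checks = []
for a in (0, 1):
    # Central finite differences along each coordinate of theta
    numeric = np.array([
        (log_pi(theta + eps * e, s, a) - log_pi(theta - eps * e, s, a)) / (2 * eps)
        for e in np.eye(4)
    ])
    checks.append(np.allclose(numeric, grad_log_pi(theta, s, a), atol=1e-6))
print(checks)
```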


In [None]:
import gymnasium as gym
import copy
import numpy as np
from tqdm import tqdm

In [None]:
def 𝜙(s, a):
    return np.concatenate((s, np.array([a])))

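class SoftMaxReinforcement():
    """REINFORCE agent with a logistic policy over two actions."""
    def __init__(self, α, 𝛾):
        # One weight per observation feature (relies on the global `env` created below)
        self.θ = np.random.random((env.observation_space.shape[0]))
        self.α = α  # learning rate
        self.𝛾 = 𝛾  # discount factor
    
    def update(self, episode):
        number_of_steps = len(episode)
        θ = self.θ
        # Discounted returns, computed backwards: G[i] = r_i + 𝛾 * G[i+1]
        G = [0] * (number_of_steps + 1)
        for i in reversed(range(number_of_steps)):
            G[i] = episode[i][2] + self.𝛾 * G[i+1]
        gradient = np.zeros_like(self.θ)  # returned as-is if the episode is empty
        for i in range(number_of_steps):
            state_i, action_i, reward_i = episode[i]
            # self.𝜋 reads self.θ, so every probability uses the pre-update parameters
            proba_i = self.𝜋(state_i)
            # ∇ log 𝜋(a|s): s - s·𝜋(0|s) for action 0, -s·𝜋(0|s) for action 1
            gradient = state_i - state_i * proba_i if action_i == 0 else - state_i * proba_i
            θ = θ + self.α * gradient * G[i]
        self.θ = θ
        # Gradient of the last step only, kept for logging
        return gradient
    
    def 𝜋(self, s):
        # Probability of choosing action 0
        return 1/(1 + np.exp(-s @ self.θ))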

    
def run(env, agent):
    episode = []
    state, info = env.reset()
    done = False
    
    while not done:
        𝜋_0 = agent.𝜋(state)
        probs = [𝜋_0, 1 - 𝜋_0]
        selection_action = np.random.choice([0, 1], p=probs)
        next_state, reward, terminated, truncated, info = env.step(selection_action)
        episode.append((state, selection_action, reward))
        state = next_state
        done = terminated or truncated  # also stop when the time limit is reached
    
    return episode

env = gym.make("CartPole-v1")
actions = env.action_space

episodes_length = []
gradients = []
nb_episodes = 400
agent = SoftMaxReinforcement(0.01, 0.99)

for i in tqdm(range(nb_episodes)):
    episode = run(env, agent)
    gradient = agent.update(episode)
    
    episodes_length.append(len(episode))
    gradients.append(gradient)

In [None]:
import matplotlib.pyplot as plt

def moving_average(x, w = 50):
    # The window must be shorter than the series (400 episodes here)
    return np.convolve(x, np.ones(w), 'valid') / w

plt.plot(moving_average(episodes_length))

In [None]:
from gymnasium.utils.save_video import save_video
env = gym.make("CartPole-v1", render_mode="rgb_array_list")
episode = run(env, agent)
save_video(
    env.render(),
    "videos",
    fps=env.metadata["render_fps"],
)
len(episode)