# Policy Gradients

- To train this network we need to define the target probabilities `y`.

- If an action is good, we should increase its probability; if it is bad, we should reduce it.

- How do we know whether an action is good or bad? Some actions have delayed effects: when points are won or lost, it is not clear which actions contributed to that outcome (the _credit assignment problem_).

- _Policy Gradients_ plays multiple episodes, then makes the actions taken in good episodes more likely and the actions taken in bad episodes less likely.

- **We play first, and then review what worked.**

- We create a function that plays a single step.

- For now we pretend that whatever action was sampled is the correct one, so the target is the action actually taken (0 = left, 1 = right).

- We compute the loss and its gradients (the gradients are stored and later scaled depending on how good or bad the action turned out to be).
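The last point, scaling stored gradients by how good each action turned out to be, can be sketched with plain NumPy. The arrays `step_grads` and `advantages` below are made-up illustration values, not outputs of this notebook's model; the sketch only assumes that each step has one gradient vector and one score.

```python
import numpy as np

# Hypothetical per-step gradients for one weight vector (2 steps, 3 weights)
step_grads = np.array([[0.2, -0.1, 0.4],
                       [-0.3, 0.5, 0.1]])
# Hypothetical scores: the first action turned out well, the second badly
advantages = np.array([1.0, -1.0])

# Scale each step's gradient by its score, then average over the steps
weighted_mean_grad = (advantages[:, np.newaxis] * step_grads).mean(axis=0)
print(weighted_mean_grad)
```

A positive score keeps the gradient as computed, so a gradient descent step makes that action more likely; a negative score flips its sign, so the same step makes the action less likely.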

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

import sys
import sklearn


!pip install gymnasium[classic_control]  #install gymnasium and virtual display
!pip install pyvirtualdisplay
import gymnasium as gym  # the package installed above; provides the 5-value step() API used below
import pyvirtualdisplay
import tensorflow as tf
from tensorflow import keras

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# To get smooth animations
import matplotlib.animation as animation
mpl.rc('animation', html='jshtml')

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "rl"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
    
    
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
display

Collecting pygame==2.1.0 (from gymnasium[classic_control])
  Downloading pygame-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: pygame
Successfully installed pygame-2.1.0
Collecting pyvirtualdisplay
  Downloading PyVirtualDisplay-3.0-py3-none-any.whl (15 kB)
Installing collected packages: pyvirtualdisplay
Successfully installed pyvirtualdisplay-3.0


<pyvirtualdisplay.display.Display at 0x7b6603214c10>

In [2]:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])  # probability of action 0 (left)
        # Sample the action: True (1, right) with probability 1 - left_proba
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Pretend the sampled action was correct: target is 1 for left, 0 for right
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, terminated, truncated, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, terminated or truncated, grads

- If `left_proba` is high, `action` will tend to be `False`.
- `False` becomes 0 when cast to float, so `y_target` is 1 - 0 = 1.
- The target is set to 1, pretending that the probability of going left should have been 100%.

- Next, a function built on `play_one_step()` plays multiple episodes and returns the rewards and gradients.
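The sampling and target logic can be checked in isolation, without TensorFlow. This is a minimal sketch: the helper `sample_action_and_target` and the seed are hypothetical names and values chosen for illustration, and the function only mirrors the two lines of `play_one_step()` that build `action` and `y_target`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for this sketch

def sample_action_and_target(left_proba, rng):
    """Return (action, y_target) for a given probability of going left."""
    action = rng.random() > left_proba  # True means right (1), False means left (0)
    y_target = 1.0 - float(action)      # 1.0 if left was taken, 0.0 if right
    return int(action), y_target

# With a high left probability, most samples are action 0 with target 1.0
samples = [sample_action_and_target(0.9, rng) for _ in range(1000)]
left_fraction = sum(1 for action, _ in samples if action == 0) / len(samples)
print(left_fraction)
```

The target always agrees with the action that was sampled, which is what "pretending the action was correct" means here.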

In [3]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs,_ = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

- The Policy Gradients algorithm uses the model to play several episodes (e.g., 10), then reviews the rewards, discounts them, and normalizes them.

- We build one function to discount the rewards and another to normalize the discounted rewards across many episodes.

In [4]:
def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):  # accumulate discounted rewards from the last step back to the first
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std  # z-score normalization
            for discounted_rewards in all_discounted_rewards]

- **For example**: suppose we take 3 actions and receive these rewards: 10 after the first, 0 after the second, and -50 after the third.

- With a discount factor of 80%, the 3rd action gets -50 (full credit for the last reward), the 2nd action gets only -40 (80% credit for the last reward), and the 1st action gets 80% of -40 (-32) plus full credit for the first reward (+10), for a discounted reward of -22:
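The same arithmetic can be written as a backward recursion. The symbols are introduced here only for illustration: $r_t$ is the reward after step $t$, $\gamma = 0.8$ is the discount factor, $G_t$ is the discounted reward credited to step $t$, and $T = 3$ is the last step.

```latex
\begin{aligned}
G_T &= r_T, \qquad G_t = r_t + \gamma\, G_{t+1} \\
G_3 &= -50 \\
G_2 &= 0 + 0.8 \times (-50) = -40 \\
G_1 &= 10 + 0.8 \times (-40) = -22
\end{aligned}
```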

In [5]:
print(discount_rewards([10, 0, -50], discount_rate=0.8))
print(discount_rewards([10, 20], discount_rate=0.8))

[-22 -40 -50]
[26 20]


- To normalize the discounted rewards across episodes, we compute the mean and standard deviation of all discounted rewards and apply z-score normalization:

In [6]:
discount_and_normalize_rewards([[10, 0, -50], [10, 20]], discount_rate=0.8)

[array([-0.28435071, -0.86597718, -1.18910299]),
 array([1.26665318, 1.0727777 ])]
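As a sanity check, the normalization can be reproduced by hand from the discounted rewards printed earlier. This sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which is what `flat_rewards.std()` computes; the variable names are chosen for illustration.

```python
import numpy as np

# Discounted rewards of both episodes, taken from the earlier printed output
flat = np.array([-22, -40, -50, 26, 20], dtype=float)

mean = flat.mean()  # average discounted reward over all steps of all episodes
std = flat.std()    # population standard deviation (ddof=0)
z_scores = (flat - mean) / std

# Split back into the two episodes (3 steps and 2 steps)
episode_1, episode_2 = z_scores[:3], z_scores[3:]
print(episode_1, episode_2)
```

After normalization the scores have zero mean and unit standard deviation, so roughly half of the actions are treated as better than average and half as worse.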

We define the simulation parameters:

In [7]:
n_iterations = 170
n_episodes_per_update = 10
n_max_steps = 200
discount_rate = 0.95
optimizer = keras.optimizers.Adam(learning_rate=0.01)  # `lr` is a deprecated alias
loss_fn = keras.losses.binary_crossentropy

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Dense(20, activation="elu", input_shape=[4]),
    keras.layers.Dense(10, activation="elu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [8]:
env = gym.make('CartPole-v1',render_mode="rgb_array")
env.reset(seed=42)


for iteration in range(n_iterations):
    all_rewards, all_grads = play_multiple_episodes(
        env, n_episodes_per_update, n_max_steps, model, loss_fn)
    total_rewards = sum(map(sum, all_rewards))  # map applies sum to each episode's reward list
    print("\rIteration: {}, mean rewards: {:.1f}".format(
        iteration, total_rewards / n_episodes_per_update), end="")
    all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                       discount_rate)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads = tf.reduce_mean(
            [final_reward * all_grads[episode_index][step][var_index]
             for episode_index, final_rewards in enumerate(all_final_rewards)
                 for step, final_reward in enumerate(final_rewards)], axis=0)
        all_mean_grads.append(mean_grads)
    optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

env.close()

Iteration: 169, mean rewards: 141.1

In [9]:
def render_policy_net(model, n_max_steps=200, seed=42): 
    frames = []
    env = gym.make('CartPole-v1',render_mode="rgb_array") 
    np.random.seed(seed)
    obs,_ = env.reset(seed=seed)
    for step in range(n_max_steps):
        frames.append(env.render())
        left_proba = model.predict(obs.reshape(1, -1), verbose=0)[0, 0]  # scalar probability of going left
        action = int(np.random.rand() > left_proba)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            break
    env.close()
    return frames

def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch

def plot_animation(frames, repeat=False, interval=40):
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    anim = animation.FuncAnimation(
        fig, update_scene, fargs=(frames, patch),
        frames=len(frames), repeat=repeat, interval=interval)
    plt.close()
    return anim

In [10]:
frames = render_policy_net(model)
plot_animation(frames)

