# REINFORCE Algorithm

### 1. Sample $\{\tau_i\}$ from $\pi_{\theta}(a_t|s_t)$ - Run M trajectories using the policy
### 2. Estimate the return: $$ R(\tau_i)  \approx \sum_{t=0}^{T}R(s_t^i, a_t^i)$$
### 3. Train the model: $$ \nabla_{\theta} J_{\theta} \approx \frac{1}{M} \sum_{i=1}^{M}  R(\tau_i)   \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t^i|s_t^i)$$
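
The three steps above can be sketched in NumPy for a tabular softmax policy. This is a minimal sketch under stated assumptions, not the notebook's implementation: the logit table `theta`, the helpers `softmax` and `reinforce_gradient`, and the toy trajectory are hypothetical names introduced for illustration. It relies on the fact that, for a softmax over logits, the gradient of $\log \pi_{\theta}(a|s)$ with respect to the logits of state $s$ is the one-hot vector of $a$ minus $\pi_{\theta}(\cdot|s)$.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad J for a tabular softmax policy.

    theta: array of shape (n_states, n_actions) with the logits
           (hypothetical parametrization, chosen for illustration).
    trajectories: list of (states, actions, rewards), one per episode.
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        # Step 2: undiscounted return R(tau_i) of the whole trajectory
        ret = float(np.sum(rewards))
        for s, a in zip(states, actions):
            probs = softmax(theta[s])
            one_hot = np.zeros_like(probs)
            one_hot[a] = 1.0
            # Step 3: R(tau_i) * grad log pi(a|s) w.r.t. the logits of s
            grad[s] += ret * (one_hot - probs)
    # Average over the M sampled trajectories
    return grad / len(trajectories)

# Toy data: one state, two actions, one single-step episode with reward 2
theta = np.zeros((1, 2))
toy_trajectories = [([0], [1], [2.0])]
grad = reinforce_gradient(theta, toy_trajectories)
print(grad)
```

With a uniform policy and a positive return, the estimate pushes up the logit of the sampled action and pushes down the other one, so a gradient ascent step makes that action more likely.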

# Implementation in Keras

### Assuming we run only one episode per iteration

The loss becomes (we put a minus sign in front so that it is a quantity to minimize):

$$ \huge J_{\theta} =  - R \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t)$$
$$ \huge J_{\theta} =  - \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t) R $$
$$ \huge J_{\theta} =  \sum_{t=0}^T \log \frac{1}{\pi_{\theta}(a_t|s_t)} R $$

Recall the cross-entropy:

$$ \huge H(y_{true}, y_{pred}) = \sum_{i} y_{true_i} \log (\frac{1}{y_{pred_i}}) $$

Example:

- Suppose there are 3 possible actions and the neural network predicted $y_{pred}$ = [0.2, 0.3, 0.5]
- The output was sampled and action 2 was chosen, i.e. the action with probability 0.3
- Then $y_{true}$ is [0, 1, 0]

$$ \huge H = 0 \log (\frac{1}{0.2}) + 1 \log (\frac{1}{0.3}) + 0 \log (\frac{1}{0.5}) $$
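
The arithmetic of this example can be checked numerically. The snippet below is only a sketch that evaluates the cross-entropy formula above with NumPy and the natural logarithm; the variable names are chosen here for illustration.

```python
import numpy as np

y_pred = np.array([0.2, 0.3, 0.5])  # policy output from the example
y_true = np.array([0.0, 1.0, 0.0])  # one-hot label of the sampled action

# Cross-entropy: sum_i y_true_i * log(1 / y_pred_i)
H = np.sum(y_true * np.log(1.0 / y_pred))
print(H)  # approximately 1.204, i.e. log(1 / 0.3)
```

Only the term of the sampled action survives, because the other entries of `y_true` are zero.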

- If we redefine $y_{true}$ as $y_{true} R$
- Then $y_{true}$ becomes [0, R, 0]

$$ \huge H = 0 \log (\frac{1}{0.2}) + R \log (\frac{1}{0.3}) + 0 \log (\frac{1}{0.5}) = R \log (\frac{1}{\pi_{\theta}(a_t|s_t)}) $$
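
The identity above can be verified with a few lines of NumPy. This is a hedged sketch: `cross_entropy` is a hypothetical helper that implements the formula as written in this notebook (it is not the Keras function), and the return value `R = 10.0` is made up for the check.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy as defined above: sum_i y_true_i * log(1 / y_pred_i)."""
    return np.sum(y_true * np.log(1.0 / y_pred), axis=-1)

R = 10.0                             # hypothetical episode return
y_pred = np.array([0.2, 0.3, 0.5])   # policy output from the example
y_true = np.array([0.0, 1.0, 0.0])   # one-hot label of the sampled action

pseudolabel = y_true * R             # [0, R, 0]
loss = cross_entropy(pseudolabel, y_pred)
expected = -R * np.log(y_pred[1])    # R * log(1 / pi(a_t | s_t))
print(loss, expected)
```

Both values agree, which is why scaling the one-hot label by the return turns the cross-entropy into the single-step REINFORCE loss.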

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
# conda install -c conda-forge tensorboardx

In [6]:
from REINFORCE_helper import BaseAgent
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, SGD
import keras.backend as K
import numpy as np

In [7]:
class ReinforceAgent(BaseAgent):
    # def __init__(self):
    def get_policy_model(self, lr=0.001, hidden_layer_neurons = 128, input_shape=[4], output_shape=2):
        ## Metric: the loss without the return as a multiplicative factor
        def loss_metric(y_true, y_pred):
            # sign() maps the return-scaled labels back to one-hot (assumes positive returns)
            y_true_norm = K.sign(y_true)
            return K.categorical_crossentropy(y_true_norm, y_pred)
        model = Sequential()
        model.add(Dense(hidden_layer_neurons, input_shape=input_shape, activation='relu'))
        model.add(Dense(output_shape, activation='softmax'))
        ## Why does categorical_crossentropy work here?
        ## With y_true scaled by the return it evaluates to R * log(1 / pi(a|s)), as derived above
        model.compile(Adam(lr), loss=['categorical_crossentropy'], metrics=[loss_metric])
        return model
    
    def get_action(self, eval=False):
        p = self.model.predict([self.observation.reshape(1, self.nS)])
        if eval is False:
            action = np.random.choice(self.nA, p=p[0]) #np.nan_to_num(p[0])
        else:
            action = np.argmax(p[0])
        action_one_hot = np.zeros(self.nA)
        action_one_hot[action] = 1
        return action, action_one_hot, p
    
    def get_entropy(self, preds, epsilon=1e-12):
        entropy = np.mean(-np.sum(np.log(preds+epsilon)*preds, axis=1)/np.log(self.nA))
        return entropy
    
    def get_discounted_rewards(self, r):
        """Take 1D float array of rewards and compute discounted reward """
        # In case r is a list
        r = np.array(r, dtype=float)
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(0, r.size)):
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r 
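
To see what `get_discounted_rewards` computes, here is a standalone sketch of the same backward recursion outside the agent class. The function name `discounted_rewards`, the explicit `gamma` argument and the toy rewards are assumptions made for this illustration; in the class, the discount comes from `self.gamma`.

```python
import numpy as np

def discounted_rewards(rewards, gamma):
    """Discounted reward-to-go: out[t] = r[t] + gamma * out[t + 1]."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already accumulated after it
    for t in reversed(range(rewards.size)):
        running = running * gamma + rewards[t]
        out[t] = running
    return out

result = discounted_rewards([1, 1, 1], gamma=0.5)
print(result)  # values: 1.75, 1.5, 1.0
```

The last step keeps its own reward, and each earlier step adds the discounted sum of everything that follows it.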

In [8]:
reinforce_agent = ReinforceAgent('CartPole-v1', n_experience_episodes=1)
reinforce_agent.logdir

Instructions for updating:
Colocations handled automatically by placer.


'logs/CartPole-v1/REINFORCE/1_1_0.999_0.001_1574895170'

In [9]:
reinforce_agent.model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 128)               640       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 258       
Total params: 898
Trainable params: 898
Non-trainable params: 0
_________________________________________________________________


In [10]:
reinforce_agent.reset_env()
action, action_one_hot, p = reinforce_agent.get_action()
print('Action:', action)
print('action_one_hot:', action_one_hot)
print('Policy prob dist:', p)

Action: 0
action_one_hot: [1. 0.]
Policy prob dist: [[0.50359005 0.49640995]]


In [11]:
reinforce_agent.reset_env()
obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs = reinforce_agent.get_experience_episodes()

In [12]:
print('Observations:')
print(obs)
print('Actions:')
print(actions)
print('Policy prob dist:')
print(preds)
print('Discounted Sum of Rewards:')
print(disc_sum_rews)
print('Return copied for each action:')
print(ep_returns)
print('Episode length:', ep_len)
print('Last observation:', last_obs)

Observations:
[[ 1.94014565e-04  1.86248154e-02 -2.69223726e-02  1.34183187e-02]
 [ 5.66510874e-04 -1.76100896e-01 -2.66540062e-02  2.97486816e-01]
 [-2.95550705e-03  1.93906708e-02 -2.07042699e-02 -3.48172724e-03]
 [-2.56769363e-03  2.14803338e-01 -2.07739044e-02 -3.02624526e-01]
 [ 1.72837313e-03  4.10215106e-01 -2.68263949e-02 -6.01786020e-01]
 [ 9.93267525e-03  2.15478475e-01 -3.88621153e-02 -3.17672189e-01]
 [ 1.42422447e-02  2.09309653e-02 -4.52155591e-02 -3.74938466e-02]
 [ 1.46608641e-02 -1.73514406e-01 -4.59654360e-02  2.40587266e-01]
 [ 1.11905759e-02  2.22329932e-02 -4.11536907e-02 -6.62327830e-02]
 [ 1.16352358e-02 -1.72275505e-01 -4.24783464e-02  2.13187161e-01]
 [ 8.18972569e-03  2.34272088e-02 -3.82146032e-02 -9.25869824e-02]
 [ 8.65826986e-03 -1.71126763e-01 -4.00663428e-02  1.87798554e-01]
 [ 5.23573460e-03  2.45448336e-02 -3.63103717e-02 -1.17249641e-01]
 [ 5.72663127e-03 -1.70038555e-01 -3.86553645e-02  1.63760149e-01]
 [ 2.32586018e-03 -3.64586435e-01 -3.53801616e-

In [13]:
reinforce_agent.get_entropy(preds)

0.9996788
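
A value this close to 1 suggests that the untrained policy is almost uniform over the actions. The sketch below reimplements the same formula as `get_entropy` as a free function, to show the two extremes. The name `normalized_entropy` and the example distributions are assumptions for illustration; the number of actions is read from the array shape instead of `self.nA`.

```python
import numpy as np

def normalized_entropy(preds, epsilon=1e-12):
    """Mean policy entropy divided by log(n_actions), so it lies in [0, 1]."""
    preds = np.asarray(preds, dtype=float)
    n_actions = preds.shape[1]
    # Entropy of each row, normalized by the maximum possible entropy
    per_step = -np.sum(np.log(preds + epsilon) * preds, axis=1) / np.log(n_actions)
    return np.mean(per_step)

uniform = normalized_entropy([[0.5, 0.5]])    # maximally uncertain policy
peaked = normalized_entropy([[0.99, 0.01]])   # almost deterministic policy
print(uniform, peaked)
```

Tracking this quantity during training gives a rough signal of how much the policy is still exploring: it starts near 1 and drops as the policy commits to actions.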

# REINFORCE Algorithm

In [24]:
from REINFORCE_helper import RunningVariance
from time import time

In [None]:
reinforce_agent = ReinforceAgent('CartPole-v1', n_experience_episodes=50, EPISODES=2000, epochs=20, lr=0.001)
running_variance = RunningVariance()
initial_time = time()


while reinforce_agent.episode < reinforce_agent.EPISODES:
    obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs = reinforce_agent.get_experience_episodes()
    for dr in ep_returns:
        running_variance.add(dr)
        
    pseudolabels = actions*ep_returns.reshape(-1, 1)

    history = reinforce_agent.model.fit(obs, pseudolabels, verbose=0, epochs=reinforce_agent.epochs, batch_size=128)
    
    reinforce_agent.log_data(reinforce_agent.episode, 
                      history.history['loss'][0], 
                      np.mean(ep_len), 
                      reinforce_agent.get_entropy(preds), 
                      running_variance.get_variance(), 
                      history.history['loss_metric'][0], 
                      time() - initial_time, ep_returns[-1])
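
The `pseudolabels` line in the loop relies on NumPy broadcasting: the one-hot actions of shape `(T, nA)` are multiplied by a column of returns of shape `(T, 1)`. The following sketch reproduces that step on made-up data (the arrays are hypothetical, not output of the agent) and also shows why taking the sign recovers the one-hot labels when returns are positive, which is the assumption behind `loss_metric`.

```python
import numpy as np

# Hypothetical batch: 3 steps, 2 actions, one-hot encoded
actions = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])
# Hypothetical episode return, copied to every step of the episode
ep_returns = np.array([3.0, 3.0, 3.0])

# Column vector (3, 1) broadcasts against (3, 2): each row is scaled by its return
pseudolabels = actions * ep_returns.reshape(-1, 1)
print(pseudolabels)

# With positive returns, the sign recovers the original one-hot labels
recovered = np.sign(pseudolabels)
print(recovered)
```

If returns could be zero or negative, the sign would no longer give back a one-hot vector, so this trick should be read as specific to environments with positive returns such as CartPole.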

Run from the command line: tensorboard --logdir logs/
Episode: 951