# REINFORCE Algorithm

### 1. Sample $\{\tau^i\}$ from $\pi_{\theta}(a_t|s_t)$: run M trajectories using the policy
### 2. Estimate the return: $$ R(\tau^i)  \approx \sum_{t=0}^{T}R(s_t^i, a_t^i)$$
### 3. Train the model with the policy-gradient estimate: $$ \nabla_{\theta} J_{\theta} \approx \frac{1}{M} \sum_{i=1}^{M}  R(\tau^i)   \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t^i|s_t^i)$$
$$\large \theta \leftarrow \theta + \alpha \nabla_{\theta} J_{\theta}$$
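
The three steps can be sketched in plain NumPy. This is only an illustration under stated assumptions: the policy is a hypothetical linear-softmax model with parameters `theta` of shape `(n_features, n_actions)`, and `reinforce_gradient` and the trajectory format are names invented for this example, not part of the notebook's helper code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad J for a linear-softmax policy (assumed model).

    theta: array of shape (n_features, n_actions)
    trajectories: list of (states, actions, rewards), one tuple per episode
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        ret = np.sum(rewards)                  # step 2: R(tau^i)
        score = np.zeros_like(theta)
        for s, a in zip(states, actions):
            p = softmax(s @ theta)             # pi_theta(.|s)
            one_hot = np.zeros(theta.shape[1])
            one_hot[a] = 1.0
            score += np.outer(s, one_hot - p)  # grad of log pi_theta(a|s)
        grad += ret * score
    return grad / len(trajectories)            # step 3: average over M episodes

# One episode with a single step: state [1, 0], action 0, reward 2
theta = np.zeros((2, 2))
episodes = [(np.array([[1.0, 0.0]]), [0], [2.0])]
grad = reinforce_gradient(theta, episodes)
alpha = 0.1
theta = theta + alpha * grad                   # gradient ascent update
```

After the update, the logit of the action that was taken (and rewarded) increases for that state, which is the behaviour the update rule is meant to produce.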

# Keras Implementation

### Assuming we run only one episode per iteration

The loss $L_{\theta}$ becomes (we put a minus sign in front so that minimizing it maximizes $J_{\theta}$):

$$ \huge L_{\theta} =  - R \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t)$$
$$ \huge L_{\theta} =  - \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t) R $$
$$ \huge L_{\theta} =  \sum_{t=0}^T \log \frac{1}{\pi_{\theta}(a_t|s_t)} R $$

Recall the cross-entropy:

$$ \huge H(y_{true}, y_{pred}) = \sum_{i} y_{true_i} \log \left(\frac{1}{y_{pred_i}}\right) $$

Example:

- Suppose there are 3 possible actions and the neural network predicted $y_{pred}$ = [0.2, 0.3, 0.5]
- The output was sampled and action 2 was chosen, that is, the action with probability 0.3
- Then $y_{true}$ is [0, 1, 0]

$$ \huge H = 0 \log \left(\frac{1}{0.2}\right) + 1 \log \left(\frac{1}{0.3}\right) + 0 \log \left(\frac{1}{0.5}\right) $$

- If we redefine $y_{true}$ as $y_{true} \leftarrow y_{true} R$
- Then $y_{true}$ becomes [0, R, 0]

$$ \huge H = 0 \log \left(\frac{1}{0.2}\right) + R \log \left(\frac{1}{0.3}\right) + 0 \log \left(\frac{1}{0.5}\right) = R \log \left(\frac{1}{\pi_{\theta}(a_t|s_t)}\right) $$
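
The identity can be checked numerically. The sketch below reuses the probabilities from the example; the return `R = 10.0` is an arbitrary value chosen for illustration, and `cross_entropy` is a small helper written for this check rather than the Keras function.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """H(y_true, y_pred) = sum_i y_true_i * log(1 / y_pred_i)."""
    return float(np.sum(y_true * np.log(1.0 / y_pred)))

y_pred = np.array([0.2, 0.3, 0.5])   # policy output from the example
y_true = np.array([0.0, 1.0, 0.0])   # action 2 was sampled
R = 10.0                             # hypothetical return, for illustration only

h_plain = cross_entropy(y_true, y_pred)        # log(1 / 0.3)
h_scaled = cross_entropy(y_true * R, y_pred)   # R * log(1 / 0.3)

print(h_plain, h_scaled)
```

Scaling the one-hot label by the return scales the cross-entropy by the same factor, which is why an ordinary categorical cross-entropy loss can be reused for REINFORCE.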

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# conda install -c conda-forge tensorboardx

In [3]:
from REINFORCE_helper import BaseAgent
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, SGD
import keras.backend as K
import numpy as np

Using TensorFlow backend.


In [4]:
class ReinforceAgent(BaseAgent):
    # def __init__(self):
    def get_policy_model(self, lr=0.001, hidden_layer_neurons=128, input_shape=[4], output_shape=2):
        # Metric: the loss without the return multiplying it.
        # K.sign maps the return-scaled labels back to one-hot (assumes positive returns).
        def loss_metric(y_true, y_pred):
            y_true_norm = K.sign(y_true)
            return K.categorical_crossentropy(y_true_norm, y_pred)
        model = Sequential()
        model.add(Dense(hidden_layer_neurons, input_shape=input_shape, activation='relu'))
        model.add(Dense(output_shape, activation='softmax'))

        # Categorical cross-entropy on return-scaled one-hot labels gives
        # the REINFORCE loss derived above: R * log(1 / pi(a_t|s_t)).
        model.compile(Adam(lr), loss='categorical_crossentropy', metrics=[loss_metric])
        return model
    
    def get_action(self, eval=False):
        # self.observation holds the latest observation
        p = self.model.predict(self.observation.reshape(1, self.nS))
        if not eval:
            # Training: sample an action from the policy distribution
            action = np.random.choice(self.nA, p=p[0])
        else:
            # Evaluation: act greedily
            action = np.argmax(p[0])
        action_one_hot = np.zeros(self.nA)
        action_one_hot[action] = 1
        return action, action_one_hot, p
    
    def get_entropy(self, preds, epsilon=1e-12):
        # Normalized entropy: 1 for a uniform policy, 0 for a deterministic one
        entropy = np.mean(-np.sum(np.log(preds+epsilon)*preds, axis=1)/np.log(self.nA))
        return entropy
    
    def get_discounted_rewards(self, r):
        """Take a 1D array (or list) of rewards and compute the discounted return at each step."""
        # Convert in case a list was passed
        r = np.array(r, dtype=float)
        discounted_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(0, r.size)):
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r
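
To see what `get_discounted_rewards` computes, here is a standalone sketch of the same backward recursion. The free function `discounted_rewards` and the value `gamma=0.5` are chosen only for this illustration; the agent itself uses `self.gamma`.

```python
import numpy as np

def discounted_rewards(r, gamma):
    """Discounted sum of future rewards at each step, computed backwards in time."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(r.size)):
        running = running * gamma + r[t]   # G_t = r_t + gamma * G_{t+1}
        out[t] = running
    return out

returns = discounted_rewards([1, 1, 1], gamma=0.5)
print(returns)
```

For three rewards of 1 and `gamma=0.5` the result is 1.75, 1.5 and 1.0: the last step only sees its own reward, and each earlier step adds half of the following value.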

In [5]:
reinforce_agent = ReinforceAgent('CartPole-v1', n_experience_episodes=1)
reinforce_agent.logdir

'logs/CartPole-v1/REINFORCE/1_1_0.999_0.001_1574901399'

In [6]:
reinforce_agent.model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 128)               640       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 258       
=================================================================
Total params: 898
Trainable params: 898
Non-trainable params: 0
_________________________________________________________________


In [7]:
reinforce_agent.reset_env()
action, action_one_hot, p = reinforce_agent.get_action()
print('Action:', action)
print('action_one_hot:', action_one_hot)
print('Policy prob dist:', p)

Action: 1
action_one_hot: [0. 1.]
Policy prob dist: [[0.50171095 0.49828902]]
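
The difference between the sampling branch and the greedy branch of `get_action` can be illustrated with the distribution printed above. This is a sketch using NumPy's `default_rng` with a fixed seed, which is an assumption of the example; the agent code uses `np.random.choice` directly.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([[0.50171095, 0.49828902]])   # distribution printed above

# Renormalize: values copied from printed output do not sum to exactly 1
probs = p[0] / p[0].sum()

# Training mode: sample actions in proportion to their probability
sampled = rng.choice(len(probs), size=10_000, p=probs)
freq = np.bincount(sampled, minlength=len(probs)) / sampled.size

# Evaluation mode: always take the most probable action
greedy = int(np.argmax(probs))

print('Empirical frequencies:', freq)
print('Greedy action:', greedy)
```

With an untrained network both actions are sampled about equally often, while the greedy choice always returns action 0 here because its probability is marginally higher.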


In [8]:
reinforce_agent.reset_env()
obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs = reinforce_agent.get_experience_episodes()

In [9]:
print('Observations:')
print(obs)
print('Actions:')
print(actions)
print('Policy prob dist:')
print(preds)
print('Discounted sum of rewards:')
print(disc_sum_rews)
print('Return copied for each action:')
print(ep_returns)
print('Episode length:', ep_len)
print('Last observation:', last_obs)

Observations:
[[-4.57454587e-02 -3.80521711e-02 -1.47490311e-02  3.41119399e-02]
 [-4.65065021e-02  1.57278147e-01 -1.40667923e-02 -2.63187727e-01]
 [-4.33609392e-02  3.52598031e-01 -1.93305469e-02 -5.60274068e-01]
 [-3.63089786e-02  5.47985880e-01 -3.05360282e-02 -8.58983913e-01]
 [-2.53492610e-02  3.53292890e-01 -4.77157065e-02 -5.76056769e-01]
 [-1.82834032e-02  5.49050086e-01 -5.92368418e-02 -8.83381449e-01]
 [-7.30240146e-03  3.54780441e-01 -7.69044708e-02 -6.09893550e-01]
 [-2.06792632e-04  5.50888392e-01 -8.91023418e-02 -9.25773657e-01]
 [ 1.08109752e-02  3.57075433e-01 -1.07617815e-01 -6.62368939e-01]
 [ 1.79524839e-02  1.63602338e-01 -1.20865194e-01 -4.05416821e-01]
 [ 2.12245306e-02 -2.96168657e-02 -1.28973530e-01 -1.53150320e-01]
 [ 2.06321933e-02  1.67093270e-01 -1.32036537e-01 -4.83578023e-01]
 [ 2.39740587e-02 -2.59422370e-02 -1.41708097e-01 -2.35251130e-01]
 [ 2.34552140e-02  1.70890006e-01 -1.46413120e-01 -5.69062867e-01]
 [ 2.68730141e-02  3.67729212e-01 -1.57794377e-

In [10]:
reinforce_agent.get_entropy(preds)

0.99761283
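
A value this close to 1 means the untrained policy is almost uniform. The sketch below mirrors the formula in `get_entropy` as a standalone function (the name `normalized_entropy` is introduced only for this example) and compares a uniform policy with a confident one.

```python
import numpy as np

def normalized_entropy(preds, epsilon=1e-12):
    """Mean policy entropy divided by log(n_actions), so the result lies in [0, 1]."""
    preds = np.asarray(preds, dtype=float)
    n_actions = preds.shape[1]
    h = -np.sum(preds * np.log(preds + epsilon), axis=1)
    return float(np.mean(h / np.log(n_actions)))

uniform = normalized_entropy([[0.5, 0.5]])       # maximally uncertain policy
confident = normalized_entropy([[0.99, 0.01]])   # nearly deterministic policy

print(uniform, confident)
```

Tracking this number during training is a convenient way to watch the policy move from exploration (near 1) towards a more deterministic behaviour (near 0).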

# REINFORCE Algorithm

In [11]:
from REINFORCE_helper import RunningVariance
from time import time

# Interesting experiments to try:
- n_experience_episodes=1, epochs=1, lr=0.001
- n_experience_episodes=5, epochs=1, lr=0.001
- n_experience_episodes=50, epochs=1, lr=0.001
- n_experience_episodes=50, epochs=20, lr=0.001
- n_experience_episodes=50, epochs=50, lr=0.001
- n_experience_episodes=50, epochs=50, lr=0.01

In [20]:
reinforce_agent = ReinforceAgent('CartPole-v1', n_experience_episodes=50, EPISODES=2000, epochs=50, lr=0.01)
running_variance = RunningVariance()
initial_time = time()


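while reinforce_agent.episode < reinforce_agent.EPISODES:
    # Simulate n_experience_episodes episodes with the current policy
    obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs = reinforce_agent.get_experience_episodes()
    
    for dr in ep_returns:
        running_variance.add(dr)
        
    # Pseudo-labels: one-hot actions scaled by the episode return.
    # Assumes `actions` holds one-hot rows and `ep_returns` has one return per step.
    pseudolabels = actions * np.reshape(ep_returns, (-1, 1))

    history = reinforce_agent.model.fit(obs, pseudolabels, verbose=0, epochs=reinforce_agent.epochs, batch_size=128)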
    
    reinforce_agent.log_data(reinforce_agent.episode, 
                      history.history['loss'][0], 
                      np.mean(ep_len), 
                      reinforce_agent.get_entropy(preds), 
                      running_variance.get_variance(), 
                      history.history['loss_metric'][0], 
                      time() - initial_time, np.mean(ep_returns[-1]))
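
The pseudo-label step in the loop multiplies each action by its return. A minimal sketch with made-up data, assuming the actions are stored as one-hot rows and there is one return per step (both are assumptions about the helper's output format):

```python
import numpy as np

# Hypothetical 3-step episode with 2 actions
actions = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 1.0]])
returns = np.array([3.0, 3.0, 3.0])   # episode return copied for each step

# Broadcast each return across its one-hot row
pseudolabels = actions * returns.reshape(-1, 1)

# With positive returns, the sign recovers the original one-hot actions,
# which is what loss_metric relies on
recovered = np.sign(pseudolabels)

print(pseudolabels)
```

Each row keeps a single non-zero entry, now equal to the return, so the cross-entropy against the policy output is weighted by how good the episode was.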

Run from the command line: tensorboard --logdir logs/
Episode: 51
Model on episode 52 improved from -inf to 194.35177089045808. Saved!
Episode: 259
Model on episode 260 improved from 194.35177089045808 to 360.5991091588095. Saved!
Episode: 519
Model on episode 520 improved from 360.5991091588095 to 393.62105513881454. Saved!
Episode: 935
Model on episode 936 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 987
Model on episode 988 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 1039
Model on episode 1040 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 1091
Model on episode 1092 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 1143
Model on episode 1144 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 1195
Model on episode 1196 improved from 393.62105513881454 to 393.62105513881454. Saved!
Episode: 1247
Model on episode 1248 improved from 393.62105513881454 to 393.621055