# REINFORCE Algorithm

### 1. Sample {$\tau^i$} from $\pi_{\theta}(a_t|s_t)$: run M trajectories using the policy
### 2. Estimate the return: $$ R(\tau^i) \approx \sum_{t=0}^{T} R(s_t^i, a_t^i)$$
### 3. Train the model: $$ \nabla_{\theta} J_{\theta} \approx \frac{1}{M} \sum_{i=1}^{M} R(\tau^i) \sum_{t=0}^T \nabla_{\theta} \log \pi_{\theta}(a_t^i|s_t^i)$$
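
The three steps can be sketched in plain NumPy. This is a minimal illustration, assuming a linear-softmax policy rather than the Keras network used later in this notebook; `reinforce_gradient`, `theta`, and the toy trajectory are hypothetical names introduced only for this example.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectories):
    """Monte Carlo estimate of grad J for pi(a|s) = softmax(theta @ s)[a].

    theta: (n_actions, n_features) weights (illustrative policy, an assumption).
    trajectories: list of M trajectories, each a list of (state, action, reward).
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)      # step 2: R(tau^i) as the sum of rewards
        score = np.zeros_like(theta)
        for s, a, _ in traj:
            p = softmax(theta @ s)
            one_hot = np.zeros(len(p))
            one_hot[a] = 1.0
            # gradient of log pi(a|s) w.r.t. theta for a linear-softmax policy
            score += np.outer(one_hot - p, s)
        grad += ret * score                   # step 3: weight the score by the return
    return grad / len(trajectories)           # average over the M trajectories

theta = np.zeros((2, 3))
toy_traj = [(np.array([1.0, 0.0, 0.0]), 0, 1.0),
            (np.array([0.0, 1.0, 0.0]), 1, 1.0)]
g = reinforce_gradient(theta, [toy_traj])
print(g)
```

A gradient-ascent update would then be `theta += lr * g`, with `lr` a step size chosen by the user.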

# Implementation in Keras

### Assuming we run only one episode per iteration

With M = 1, the loss becomes (we put a minus sign in front so that it is a quantity to minimize):

$$ \huge J_{\theta} =  - R \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t)$$
$$ \huge J_{\theta} =  - \sum_{t=0}^T \log \pi_{\theta}(a_t|s_t) R $$
$$ \huge J_{\theta} =  \sum_{t=0}^T \log \frac{1}{\pi_{\theta}(a_t|s_t)} R $$

Recall the cross-entropy:

$$ \huge H(y_{true}, y_{pred}) = \sum_{i} y_{true_i} \log (\frac{1}{y_{pred_i}}) $$

Example:

- Suppose there are 3 possible actions and the neural network predicted $y_{pred}$ = [0.2, 0.3, 0.5]
- The output was sampled and action 2 was chosen, that is, the action with probability 0.3
- $y_{true}$ is then [0, 1, 0]

$$ \huge H = 0 \log (\frac{1}{0.2}) + 1 \log (\frac{1}{0.3}) + 0 \log (\frac{1}{0.5}) $$

- If we redefine $y_{true}$ as $y_{true} \leftarrow y_{true} R$
- $y_{true}$ becomes [0, R, 0]

$$ \huge H = 0 \log (\frac{1}{0.2}) + R \log (\frac{1}{0.3}) + 0 \log (\frac{1}{0.5}) = R \log (\frac{1}{\pi_{\theta}(a_t|s_t)}) $$

This is one term of the loss $J_{\theta}$ above, so summing the cross-entropy with these scaled labels over the steps of the episode gives the REINFORCE loss.
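
The worked example can be checked numerically. The sketch below implements the cross-entropy formula directly in NumPy (it does not call Keras); the return value `R = 10.0` is a hypothetical number chosen only for illustration.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # H(y_true, y_pred) = sum_i y_true_i * log(1 / y_pred_i)
    return float(np.sum(y_true * np.log(1.0 / y_pred)))

y_pred = np.array([0.2, 0.3, 0.5])   # policy output from the example
y_true = np.array([0.0, 1.0, 0.0])   # one-hot of the sampled action
R = 10.0                             # hypothetical return, for illustration only

h_plain = cross_entropy(y_true, y_pred)        # log(1 / 0.3)
h_scaled = cross_entropy(y_true * R, y_pred)   # R * log(1 / 0.3)
print(h_plain, h_scaled)
```

Scaling the one-hot label by `R` scales the cross-entropy by the same factor, which is why a standard cross-entropy loss can stand in for the REINFORCE loss.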

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# conda install -c conda-forge tensorboardx

In [3]:
from REINFORCE_helper import BaseAgent, format_as_pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam, SGD
import keras.backend as K
import numpy as np

Using TensorFlow backend.


In [4]:
class ReinforceAgent(BaseAgent):
    # def __init__(self):
    def get_policy_model(self, lr=0.001, hidden_layer_neurons = 128, input_shape=[4], output_shape=2):
        ## Metric: the loss without the return multiplying it, i.e. the plain
        ## cross-entropy against the one-hot sampled action
        def loss_metric(y_true, y_pred):
            # abs(sign(.)) recovers the one-hot action even when the return is
            # negative (as in MountainCar); a return of exactly 0 zeroes the row
            y_true_norm = K.abs(K.sign(y_true))
            return K.categorical_crossentropy(y_true_norm, y_pred)
        model = Sequential()
        model.add(Dense(hidden_layer_neurons, input_shape=input_shape, activation='relu'))
        model.add(Dense(output_shape, activation='softmax'))
        ## Why does categorical_crossentropy work here? Because the labels are the
        ## one-hot action scaled by the return (see the derivation above)
        model.compile(Adam(lr), loss=['categorical_crossentropy'], metrics=[loss_metric])
        return model
    
    def get_action(self, eval=False):
        p = self.model.predict([self.observation.reshape(1, self.nS)])
        if not eval:
            # Training: sample an action from the policy distribution
            action = np.random.choice(self.nA, p=p[0]) #np.nan_to_num(p[0])
        else:
            # Evaluation: take the most probable action
            action = np.argmax(p[0])
        action_one_hot = np.zeros(self.nA)
        action_one_hot[action] = 1
        return action, action_one_hot, p
    
    def get_entropy(self, preds, epsilon=1e-12):
        # Mean policy entropy, normalized by log(nA) so that a uniform policy gives 1
        entropy = np.mean(-np.sum(np.log(preds+epsilon)*preds, axis=1)/np.log(self.nA))
        return entropy
    
    def get_discounted_rewards(self, r):
        """Take 1D float array of rewards and compute discounted reward """
        # In case r is a list
        r = np.array(r, dtype=float)
        discounted_r = np.zeros_like(r)
        running_add = 0
        # Accumulate backwards: G_t = r_t + gamma * G_{t+1}
        for t in reversed(range(0, r.size)):
            running_add = running_add * self.gamma + r[t]
            discounted_r[t] = running_add
        return discounted_r 
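
The backward recursion in `get_discounted_rewards` can be tried in isolation. The sketch below is a standalone copy that takes `gamma` as an argument instead of reading `self.gamma`; the function name `discounted_rewards` is introduced only for this example. The value `gamma=0.999` in the second call is an assumption, read off the `0.999` that appears in the log directory name of this run.

```python
import numpy as np

def discounted_rewards(r, gamma):
    """Discounted reward-to-go: G_t = r_t + gamma * G_{t+1}."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(r.size)):
        running = running * gamma + r[t]
        out[t] = running
    return out

# Three steps with reward -1 and gamma = 0.5 give -1.75, -1.5, -1.0
short = discounted_rewards([-1, -1, -1], gamma=0.5)
print(short)

# A 200-step episode of -1 rewards, with the assumed gamma = 0.999
long_episode = discounted_rewards([-1.0] * 200, gamma=0.999)
print(round(long_episode[0], 3))
```

Under that assumption, the first entry of the long episode is the discounted return of a MountainCar-v0 episode that runs to its 200-step limit without reaching the goal.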

In [5]:
# reinforce_agent = ReinforceAgent('CartPole-v1', n_experience_episodes=3)
reinforce_agent = ReinforceAgent('MountainCar-v0', n_experience_episodes=3)

reinforce_agent.logdir

'logs/MountainCar-v0/REINFORCE/3_1_0.999_0.001_1575298839'

In [6]:
reinforce_agent.model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 128)               384       
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 387       
=================================================================
Total params: 771
Trainable params: 771
Non-trainable params: 0
_________________________________________________________________

In [7]:
reinforce_agent.reset_env()
action, action_one_hot, p = reinforce_agent.get_action()
print('Action:', action)
print('action_one_hot:', action_one_hot)
print('Policy prob dist:', p)

Action: 0
action_one_hot: [1. 0. 0.]
Policy prob dist: [[0.31920475 0.34644496 0.3343503 ]]


In [8]:
reinforce_agent.reset_env()
obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs, time_step = reinforce_agent.get_experience_episodes(return_ts=True)

In [9]:
format_as_pandas(time_step, obs, preds, actions, rewards, disc_sum_rews, ep_returns, decimals = 3)

| | step | observation | policy_distribution | sampled_action | rewards | discounted_sum_rewards | episode_return |
|---|---|---|---|---|---|---|---|
| 0 | 1 | [-0.55, 0.0] | [0.319, 0.346, 0.334] | [0, 0, 1] | -1.0 | -181.351 | -181.351 |
| 1 | 2 | [-0.548, 0.001] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -180.531 | -181.351 |
| 2 | 3 | [-0.547, 0.001] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -179.711 | -181.351 |
| 3 | 4 | [-0.545, 0.001] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -178.890 | -181.351 |
| 4 | 5 | [-0.544, 0.001] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -178.068 | -181.351 |
| 5 | 6 | [-0.542, 0.001] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -177.245 | -181.351 |
| 6 | 7 | [-0.54, 0.002] | [0.319, 0.346, 0.334] | [0, 0, 1] | -1.0 | -176.422 | -181.351 |
| 7 | 8 | [-0.537, 0.003] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -175.597 | -181.351 |
| 8 | 9 | [-0.533, 0.003] | [0.319, 0.346, 0.334] | [0, 1, 0] | -1.0 | -174.772 | -181.351 |
| 9 | 10 | [-0.53, 0.003] | [0.319, 0.346, 0.334] | [1, 0, 0] | -1.0 | -173.946 | -181.351 |


In [10]:
reinforce_agent.get_entropy(preds)

0.99952227
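
A value this close to 1 means the untrained policy is almost uniform. The sketch below is a standalone version of the `get_entropy` computation, taking the number of actions from the array shape instead of `self.nA`; the name `normalized_entropy` is introduced only for this example.

```python
import numpy as np

def normalized_entropy(preds, epsilon=1e-12):
    """Mean entropy of each row of preds, divided by log(n_actions)."""
    preds = np.asarray(preds, dtype=float)
    n_actions = preds.shape[1]
    h = -np.sum(np.log(preds + epsilon) * preds, axis=1)
    return float(np.mean(h / np.log(n_actions)))

uniform = normalized_entropy([[1/3, 1/3, 1/3]])   # maximum uncertainty, close to 1
peaked = normalized_entropy([[1.0, 0.0, 0.0]])    # deterministic policy, close to 0
initial = normalized_entropy([[0.31920475, 0.34644496, 0.3343503]])  # distribution printed earlier
print(uniform, peaked, initial)
```

Tracking this number during training gives a rough view of how far the policy has moved from uniform toward deterministic.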

# REINFORCE Algorithm

In [11]:
from REINFORCE_helper import RunningVariance
from time import time
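
The implementation of `RunningVariance` lives in `REINFORCE_helper` and is not shown in this notebook. As a rough idea of what a class with `add` and `get_variance` methods could do, here is a sketch based on Welford's online algorithm. The class name `WelfordVariance` and the choice of population variance are assumptions, not the helper's actual code.

```python
class WelfordVariance:
    """Online mean and variance via Welford's algorithm (hypothetical stand-in)."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def get_variance(self):
        # Population variance; returns 0.0 before any sample is added (an assumption)
        return self.m2 / self.n if self.n > 0 else 0.0

rv = WelfordVariance()
for value in [1.0, 2.0, 3.0, 4.0]:
    rv.add(value)
variance = rv.get_variance()
print(variance)
```

An online update like this avoids storing every episode return, which matters when the variance is logged over thousands of episodes.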

# Interesting experiments:
- n_experience_episodes=1, epochs=1, lr=0.001
- n_experience_episodes=5, epochs=1, lr=0.001
- n_experience_episodes=50, epochs=1, lr=0.001
- n_experience_episodes=50, epochs=20, lr=0.001
- n_experience_episodes=50, epochs=50, lr=0.001
- n_experience_episodes=50, epochs=50, lr=0.01

In [13]:
# reinforce_agent = ReinforceAgent('Acrobot-v1', n_experience_episodes=1, EPISODES=2000, epochs=1, lr=0.001)
# reinforce_agent = ReinforceAgent('LunarLander-v2', n_experience_episodes=1, EPISODES=2000, epochs=1, lr=0.001)
reinforce_agent = ReinforceAgent('MountainCar-v0', n_experience_episodes=100, EPISODES=2000, epochs=1, lr=0.0001)

running_variance = RunningVariance()
initial_time = time()


while reinforce_agent.episode < reinforce_agent.EPISODES:
    obs, actions, preds, disc_sum_rews, rewards, ep_returns, ep_len, last_obs = reinforce_agent.get_experience_episodes()
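    for dr in ep_returns:
        running_variance.add(dr)
        
    # Scale each one-hot action by its episode return: a row [0, 1, 0] becomes [0, R, 0]
    pseudolabels = actions*ep_returns.reshape(-1, 1)

    # With these labels, categorical cross-entropy equals R * log(1/pi(a|s)) per step
    history = reinforce_agent.model.fit(obs, pseudolabels, verbose=0, epochs=reinforce_agent.epochs, batch_size=128)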
    
    reinforce_agent.log_data(reinforce_agent.episode, 
                      history.history['loss'][0], 
                      np.mean(ep_len), 
                      reinforce_agent.get_entropy(preds), 
                      running_variance.get_variance(), 
                      history.history['loss_metric'][0], 
                      time() - initial_time, np.mean(ep_returns[-1]))

Run from the command line: tensorboard --logdir logs/
Episode: 101
Model on episode 102 improved from -inf to -181.3511705213644. Saved!
Episode: 203
Model on episode 204 improved from -181.3511705213644 to -181.3511705213644. Saved!
Episode: 305
Model on episode 306 improved from -181.3511705213644 to -181.3511705213644. Saved!
Episode: 407
Model on episode 408 improved from -181.3511705213644 to -181.3511705213644. Saved!
Episode: 509
Model on episode 510 improved from -181.3511705213644 to -181.3511705213644. Saved!
Episode: 611
Model on episode 612 improved from -181.3511705213644 to -181.3511705213644. Saved!
Episode: 713
Model on episode 714 improved from -181.3511705213644 to -181.3511705213644. Saved!


KeyboardInterrupt: 
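
The `pseudolabels` line in the training loop relies on NumPy broadcasting: reshaping the per-step returns to a column lets each one-hot row be multiplied by its own return. A small sketch with made-up numbers (the third return, `-150.0`, is hypothetical):

```python
import numpy as np

# One-hot sampled actions, one row per time step
actions = np.array([[0.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 0.0, 0.0]])
# Return of the episode each step belongs to (last value is hypothetical)
ep_returns = np.array([-181.351, -181.351, -150.0])

# (3, 3) * (3, 1): each row is scaled by its own return
pseudolabels = actions * ep_returns.reshape(-1, 1)
print(pseudolabels)
```

Each row keeps a single non-zero entry, located at the sampled action and equal to the return, which is the `[0, R, 0]` form from the cross-entropy derivation.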