# Cart Pole

#### Description


+ **Observation space (Box(4,))**

+ Cart position: [-4.8, 4.8] (the episode ends once the cart leaves $\pm 2.4$)
+ Cart velocity: [$-\infty$, $\infty$]
+ Pole angle: [-0.418, 0.418] rad (about $\pm 24$ degrees)
+ Pole velocity at the tip: [$-\infty$, $\infty$]


+ **Action space (Discrete(2))**

+ Left: 0
+ Right: 1


The goal is to keep the pole upright by moving the cart left and right.


The reward is +1 for every time step. The episode ends if the pole angle exceeds $\pm 12$ degrees or if the cart moves beyond position $\pm 2.4$. CartPole-v1, the version used here, also stops an episode after 500 steps.
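The termination rule and the +1 reward per step can be sketched in plain Python. This is only an illustration of the two thresholds quoted above, not the environment's own code: the names `episode_done` and `total_reward` are hypothetical, and the sketch assumes the terminating step still earns its +1.

```python
import math

# Thresholds quoted in the text; the constant names are illustrative.
POSITION_LIMIT = 2.4        # cart position, in the environment's units
ANGLE_LIMIT_DEGREES = 12.0  # pole angle, in degrees

def episode_done(position, angle_rad):
    """Return True when the cart or the pole leaves the allowed range."""
    off_track = abs(position) > POSITION_LIMIT
    pole_fallen = abs(math.degrees(angle_rad)) > ANGLE_LIMIT_DEGREES
    return off_track or pole_fallen

def total_reward(trajectory):
    """Add +1 per step, stopping after the first terminating step (assumed rewarded)."""
    total = 0
    for position, angle_rad in trajectory:
        total += 1
        if episode_done(position, angle_rad):
            break
    return total

# Hypothetical trajectory of (position, angle) pairs: the third step leaves the track.
print(total_reward([(0.0, 0.0), (0.1, 0.0), (2.5, 0.0), (0.0, 0.0)]))  # 3
```

Because every step is worth +1, the episode reward equals the episode length, which is why the `reward` and `steps` columns in the test logs always match.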

**Now we will use Keras to build the agent**

In [1]:
# pip3 install keras-rl

In [3]:
# libraries

import gym
import numpy as np
from keras.layers import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import Adam

from rl.agents import SARSAAgent
from rl.policy import EpsGreedyQPolicy

import warnings
warnings.simplefilter('ignore')

### Agent

In this case we will build an agent based on the SARSA algorithm (state-action-reward-state-action). Its update rule is:

$$Q_{new}(s_{t}, a_{t})=(1-\alpha)\cdot Q(s_{t}, a_{t}) + \alpha\cdot[r_{t} + \gamma\cdot Q(s_{t+1}, a_{t+1})]$$


where:
+ $s_{t}$ is the state at time t
+ $a_{t}$ is the action at time t
+ $\alpha$ is the learning rate $(0<\alpha \leq{1})$
+ $Q(s_{t}, a_{t})$ is the old quality value
+ $[r_{t} + \gamma\cdot Q(s_{t+1}, a_{t+1})]$ is the learned value
+ $r_{t}$ is the reward received on moving from state $s_{t}$ to state $s_{t+1}$
+ $\gamma$ is the discount factor $(0\leq \gamma \leq 1)$. It weights rewards received earlier more heavily than rewards received later, and can be interpreted as the probability of succeeding (or surviving) at each time step
+ $Q(s_{t+1}, a_{t+1})$ is the estimate of the future value for the action $a_{t+1}$ that the policy actually chooses next. SARSA is on-policy, so it uses this value instead of the $\max_{a}Q(s_{t+1}, a)$ that appears in Q-learning
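A single application of the update rule can be worked through with a small table of Q-values. This is a minimal tabular sketch, not what keras-rl does internally (the notebook approximates Q with a neural network); the function name `sarsa_update`, the state labels, and the values of $\alpha$ and $\gamma$ are chosen only for illustration.

```python
from collections import defaultdict

def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.5):
    """Apply one SARSA update to the table `q` and return the new Q-value."""
    # Learned value: reward plus the discounted Q-value of the action
    # that will actually be taken next (no max over actions).
    learned = reward + gamma * q[(next_state, next_action)]
    # Blend the old estimate with the learned value using the learning rate.
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * learned
    return q[(state, action)]

# Hypothetical table: every entry starts at 0 except Q("s1", 1) = 2.
q = defaultdict(float)
q[("s1", 1)] = 2.0

new_value = sarsa_update(q, "s0", 0, reward=1.0, next_state="s1", next_action=1)
print(new_value)  # 1.0 = 0.5 * 0 + 0.5 * (1 + 0.5 * 2)
```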

In [5]:
class Agente(object):
    """SARSA agent for a gym environment, built on keras-rl."""

    def __init__(self, entorno):
        self.entorno = entorno
        # length of the observation vector (4 for CartPole)
        self.observaciones = self.entorno.observation_space.shape[0]
        # number of discrete actions (2 for CartPole)
        self.dim_accion = self.entorno.action_space.n
        # training budget; NUM_MAX_PASOS must be defined before instantiating
        self.pasos = NUM_MAX_PASOS
        self.politica = EpsGreedyQPolicy()
        self.modelo = None  # set by sarsa()

    def sarsa(self, entrenar=True, guardar=False, cargar=False):
        """Build the SARSA agent with Keras; optionally train it and save or load weights."""
        # Q-network: three hidden layers, one linear output per action
        red = Sequential()
        red.add(Flatten(input_shape=(1, self.observaciones)))
        red.add(Dense(24, activation='relu'))
        red.add(Dense(24, activation='relu'))
        red.add(Dense(24, activation='relu'))
        red.add(Dense(self.dim_accion, activation='linear'))

        modelo = SARSAAgent(model=red, policy=self.politica, nb_actions=self.dim_accion)
        modelo.compile('adam', metrics=['mse'])

        if entrenar:
            modelo.fit(self.entorno, nb_steps=self.pasos, visualize=False, verbose=1)

        if guardar:
            modelo.save_weights('sarsa_weights.h5f', overwrite=True)

        if cargar:
            modelo.load_weights('sarsa_weights.h5f')

        self.modelo = modelo
        return self

    def juega(self, ver, epis):
        """Run `epis` test episodes, rendering them when `ver` is True."""
        return self.modelo.test(self.entorno, nb_episodes=epis, visualize=ver)
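
The agent picks its actions with `EpsGreedyQPolicy`. The idea behind an epsilon-greedy policy can be sketched without Keras: with probability `eps` choose a random action, otherwise choose the action with the highest Q-value. The function `eps_greedy` below is a hypothetical stand-in written for this note, and the value `eps=0.1` is an assumption about the library default rather than something stated in this notebook.

```python
import random

def eps_greedy(q_values, eps=0.1, rng=random):
    """Pick an action index from a list of Q-values, exploring with probability eps."""
    if rng.random() < eps:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: the action with the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With eps=0 the choice is purely greedy: action 1 has the larger Q-value.
print(eps_greedy([0.2, 0.9], eps=0.0))  # 1
```

Because SARSA is on-policy, the same exploring policy supplies the $a_{t+1}$ used in the update, so exploration noise feeds directly into the learned Q-values.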

In [6]:
# constants

NUM_MAX_PASOS=50000

In [7]:
entorno=gym.make('CartPole-v1')

In [8]:
agente=Agente(entorno).sarsa()





Training for 50000 steps ...
Interval 1 (0 steps performed)


447 episodes - episode_reward: 22.103 [8.000, 225.000] - loss: 6.053 - mean_squared_error: 230.923 - mean_q: 19.352

Interval 2 (10000 steps performed)
94 episodes - episode_reward: 105.702 [8.000, 404.000] - loss: 8.464 - mean_squared_error: 1306.504 - mean_q: 48.442

Interval 3 (20000 steps performed)
82 episodes - episode_reward: 123.780 [9.000, 354.000] - loss: 5.335 - mean_squared_error: 1305.952 - mean_q: 48.908

Interval 4 (30000 steps performed)
50 episodes - episode_reward: 193.820 [10.000, 500.000] - loss: 5.211 - mean_squared_error: 2079.458 - mean_q: 61.689

Interval 5 (40000 steps performed)
done, took 64.859 seconds


In [9]:
stats=agente.juega(False, 100)
print('Mean reward over 100 episodes: {}'.format(np.mean(stats.history['episode_reward'])))

Testing for 100 episodes ...
Episode 1: reward: 111.000, steps: 111
Episode 2: reward: 19.000, steps: 19
Episode 3: reward: 20.000, steps: 20
Episode 4: reward: 20.000, steps: 20
Episode 5: reward: 15.000, steps: 15
Episode 6: reward: 19.000, steps: 19
Episode 7: reward: 112.000, steps: 112
Episode 8: reward: 18.000, steps: 18
Episode 9: reward: 17.000, steps: 17
Episode 10: reward: 116.000, steps: 116
Episode 11: reward: 18.000, steps: 18
Episode 12: reward: 21.000, steps: 21
Episode 13: reward: 17.000, steps: 17
Episode 14: reward: 20.000, steps: 20
Episode 15: reward: 111.000, steps: 111
Episode 16: reward: 109.000, steps: 109
Episode 17: reward: 115.000, steps: 115
Episode 18: reward: 18.000, steps: 18
Episode 19: reward: 16.000, steps: 16
Episode 20: reward: 16.000, steps: 16
Episode 21: reward: 116.000, steps: 116
Episode 22: reward: 115.000, steps: 115
Episode 23: reward: 20.000, steps: 20
Episode 24: reward: 20.000, steps: 20
Episode 25: reward: 18.000, steps: 18
Episode 26: re

In [10]:
agente.juega(True, 5)

Testing for 5 episodes ...
Episode 1: reward: 23.000, steps: 23
Episode 2: reward: 117.000, steps: 117
Episode 3: reward: 20.000, steps: 20
Episode 4: reward: 112.000, steps: 112
Episode 5: reward: 16.000, steps: 16


<keras.callbacks.History at 0x13bcfc6d0>
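
The five rewards printed by the last cell can be summarised by hand. The sketch below copies those values from the output; the cut-off of 50 used to split short and long episodes is an arbitrary choice made only for this illustration.

```python
from statistics import mean

# Rewards of the five rendered test episodes, copied from the output above.
rewards = [23.0, 117.0, 20.0, 112.0, 16.0]

# Arbitrary cut-off (an assumption) to separate short from long episodes.
short = [r for r in rewards if r < 50]
long_ = [r for r in rewards if r >= 50]

print(mean(rewards))           # 57.6
print(len(short), len(long_))  # 3 2
print(mean(long_))             # 114.5
```

The episodes fall into two groups, roughly 20 steps or roughly 115 steps, so the mean alone hides how uneven the trained agent's behaviour is.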