# Cart Pole

#### Description


+ **Observation space (Box(4,))**

+ Cart position: [-2.4, 2.4]
+ Cart velocity: [$-\infty$, $\infty$]
+ Pole angle: [-0.418, 0.418] rad (about $\pm 24$ degrees)
+ Pole velocity at the tip: [$-\infty$, $\infty$]


+ **Action space (Discrete(2))**

+ Left: 0
+ Right: 1


The goal is to keep the pole upright by moving the cart left and right.


The reward is +1 for every time step. The episode ends if the pole angle exceeds $\pm 12$ degrees or if the cart position goes beyond $\pm 2.4$.
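
The reward and termination rules above can be sketched in plain Python, independently of gym. This is only an illustration: the helper names `episode_over` and `episode_return` are hypothetical, and the sketch assumes the pole angle is given in radians and that the terminating step still earns its +1.

```python
import math

# Limits quoted in the description above.
X_LIMIT = 2.4            # cart position limit
ANGLE_LIMIT_DEG = 12.0   # pole angle limit, in degrees


def episode_over(cart_position, pole_angle_rad):
    """Return True when the state violates either termination condition."""
    too_far = abs(cart_position) > X_LIMIT
    too_tilted = abs(math.degrees(pole_angle_rad)) > ANGLE_LIMIT_DEG
    return too_far or too_tilted


def episode_return(states):
    """Add +1 per time step until the first terminating state (inclusive)."""
    total = 0
    for cart_position, pole_angle_rad in states:
        total += 1  # every step taken is rewarded with +1
        if episode_over(cart_position, pole_angle_rad):
            break
    return total
```

Under these assumptions, a longer episode always means a higher return, which is why the episode reward and the step count coincide in the logs further down.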

**Now we will use Keras to create the agent**

In [1]:
# pip3 install keras-rl

In [3]:
# libraries

import gym
import numpy as np
from keras.layers import Dense, Flatten
from keras.models import Sequential
from keras.optimizers import Adam

from rl.agents import SARSAAgent
from rl.policy import EpsGreedyQPolicy

import warnings
warnings.simplefilter('ignore')

### Agent

In this case we will create an agent based on the SARSA algorithm (state-action-reward-state-action). Its update rule is:

$$Q_{\text{new}}(s_{t}, a_{t})=(1-\alpha)\cdot Q(s_{t}, a_{t}) + \alpha\cdot[r_{t} + \gamma\cdot Q(s_{t+1}, a_{t+1})]$$


where:
+ $s_{t}$ is the state at time t
+ $a_{t}$ is the action at time t
+ $\alpha$ is the learning rate $(0<\alpha \leq{1})$
+ $Q(s_{t}, a_{t})$ is the old quality value
+ $[r_{t} + \gamma\cdot Q(s_{t+1}, a_{t+1})]$ is the learned value
+ $r_{t}$ is the reward received when moving from state $s_{t}$ to state $s_{t+1}$
+ $\gamma$ is the discount factor $(0\leq \gamma \leq 1)$. It weights rewards received sooner more heavily than those received later, and it can be interpreted as the probability of succeeding (or surviving) at each time step
+ $Q(s_{t+1}, a_{t+1})$ is the estimate of the future value for the next action $a_{t+1}$ actually chosen by the policy (Q-learning would use $\max_{a}Q(s_{t+1}, a)$ here instead)
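
As a minimal sketch, the update rule above can be written for a tabular Q stored in a dictionary. This is an illustration only: the agent trained below approximates Q with a neural network, and the function name `sarsa_update` and the default values of `alpha` and `gamma` are assumptions.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99):
    """Apply one SARSA update in place and return the new Q(state, action).

    q maps (state, action) pairs to values; missing pairs count as 0.0.
    """
    old_value = q.get((state, action), 0.0)
    # Learned value: reward plus the discounted value of the action
    # actually taken next (on-policy), not the maximum over actions.
    learned_value = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = (1 - alpha) * old_value + alpha * learned_value
    return q[(state, action)]


# Worked example: old value 1.0, reward 1.0, next value 2.0.
q_table = {("s0", 0): 1.0, ("s1", 1): 2.0}
new_value = sarsa_update(q_table, "s0", 0, 1.0, "s1", 1, alpha=0.5, gamma=0.9)
# learned value = 1.0 + 0.9 * 2.0 = 2.8, so new value = 0.5 * 1.0 + 0.5 * 2.8 = 1.9
```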

In [None]:
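class Agente(object):
    
    def __init__(self, entorno):
        # Unfinished placeholder: it only stores the environment and is not used below.
        self.entorno = entorno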

In [4]:
# The following cells refer to the environment as `env`.
env = gym.make('CartPole-v1')

In [8]:
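def agent(states, actions):
    # Fully connected network that maps an observation to one Q-value per action.
    model = Sequential()
    # keras-rl feeds observations with a leading window dimension of length 1.
    model.add(Flatten(input_shape = (1, states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    # Linear output: unbounded Q-value estimates, one per action.
    model.add(Dense(actions, activation='linear'))
    return model

model = agent(env.observation_space.shape[0], env.action_space.n)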






In [9]:
policy = EpsGreedyQPolicy()
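
An epsilon-greedy policy picks a random action with probability epsilon and the highest-valued action otherwise. The sketch below illustrates that idea with the standard library only; it is not keras-rl's implementation, and the function name `eps_greedy_action` and the default `eps=0.1` are assumptions.

```python
import random


def eps_greedy_action(q_values, eps=0.1, rng=random):
    """Pick an action index from a sequence of Q-values.

    With probability eps explore (uniform random action);
    otherwise exploit (index of the largest Q-value).
    """
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With eps=0.0 the choice is purely greedy.
greedy_choice = eps_greedy_action([0.2, 0.7], eps=0.0)
```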

In [10]:
sarsa = SARSAAgent(model = model, policy = policy, nb_actions = env.action_space.n)

In [11]:
sarsa.compile('adam', metrics = ['mse'])




In [12]:
sarsa.fit(env, nb_steps = 50000, visualize = False, verbose = 1)

Training for 50000 steps ...
Interval 1 (0 steps performed)


334 episodes - episode_reward: 29.847 [8.000, 326.000] - loss: 8.625 - mean_squared_error: 553.507 - mean_q: 29.571

Interval 2 (10000 steps performed)
58 episodes - episode_reward: 166.897 [12.000, 500.000] - loss: 9.356 - mean_squared_error: 1701.301 - mean_q: 58.859

Interval 3 (20000 steps performed)
70 episodes - episode_reward: 147.157 [8.000, 500.000] - loss: 5.849 - mean_squared_error: 1961.640 - mean_q: 60.397

Interval 4 (30000 steps performed)
48 episodes - episode_reward: 208.750 [11.000, 500.000] - loss: 5.152 - mean_squared_error: 1976.102 - mean_q: 60.508

Interval 5 (40000 steps performed)
done, took 63.674 seconds


<keras.callbacks.History at 0x14e75c490>

In [13]:
# Evaluate the trained agent and report the mean episode reward.
scores = sarsa.test(env, nb_episodes = 100, visualize= False)
print('Average score over 100 test games: {}'.format(np.mean(scores.history['episode_reward'])))

Testing for 100 episodes ...
Episode 1: reward: 265.000, steps: 265
Episode 2: reward: 292.000, steps: 292
Episode 3: reward: 298.000, steps: 298
Episode 4: reward: 260.000, steps: 260
Episode 5: reward: 291.000, steps: 291
Episode 6: reward: 283.000, steps: 283
Episode 7: reward: 272.000, steps: 272
Episode 8: reward: 262.000, steps: 262
Episode 9: reward: 295.000, steps: 295
Episode 10: reward: 269.000, steps: 269
Episode 11: reward: 273.000, steps: 273
Episode 12: reward: 290.000, steps: 290
Episode 13: reward: 301.000, steps: 301
Episode 14: reward: 261.000, steps: 261
Episode 15: reward: 285.000, steps: 285
Episode 16: reward: 289.000, steps: 289
Episode 17: reward: 276.000, steps: 276
Episode 18: reward: 278.000, steps: 278
Episode 19: reward: 295.000, steps: 295
Episode 20: reward: 286.000, steps: 286
Episode 21: reward: 280.000, steps: 280
Episode 22: reward: 287.000, steps: 287
Episode 23: reward: 295.000, steps: 295
Episode 24: reward: 290.000, steps: 290
Episode 25: reward: 

In [14]:
# sarsa.save_weights('sarsa_weights.h5f', overwrite=True)
# sarsa.load_weights('sarsa_weights.h5f')

In [16]:
_ = sarsa.test(env, nb_episodes = 2, visualize= True)

Testing for 2 episodes ...
Episode 1: reward: 279.000, steps: 279
Episode 2: reward: 289.000, steps: 289
