### RL sequence in gym/openai
---

Each environment has a `step` function.

`step` returns four values:

+ An _object_ `observation`: an environment-specific object representing your observation of the environment.
+ A _float_ `reward`: amount of reward achieved by the previous action.
+ A _boolean_ `done`: has the environment reached the end of a well-defined episode? If so, reset.
+ A _dict_ `info`: diagnostic data for debugging.
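To make the four return values concrete without depending on any particular environment, here is a minimal sketch of a toy environment that follows the same `reset`/`step` convention. `CountdownEnv` and its reward scheme are hypothetical and invented purely for illustration; it is not part of gym.

```python
class CountdownEnv:
    """Hypothetical toy environment (not part of gym): count down from `start` to 0."""

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self):
        # Begin a new episode and return the initial observation.
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 decrements the counter; any other action leaves it unchanged.
        if action == 1:
            self.state -= 1
        observation = self.state
        done = self.state == 0
        reward = 1.0 if done else 0.0      # reward only when the episode ends
        info = {"remaining": self.state}   # diagnostic data only
        return observation, reward, done, info


env = CountdownEnv(start=3)
observation = env.reset()
done = False
history = []
while not done:
    # Unpack the same four values that a gym environment's step returns.
    observation, reward, done, info = env.step(1)
    history.append((observation, reward, done))
print(history)
```

Once `done` comes back `True`, the loop stops; calling `reset` again would start a fresh episode.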

![](https://storage.googleapis.com/code.blinkanalytics.com/action_reward.png)

This observation-action-reward loop closely mirrors the state-action-reward-state-action sequence that gives the [SARSA](https://en.wikipedia.org/wiki/State-Action-Reward-State-Action) algorithm its name.

### This may be worth a brief detour to take a look at how SARSA works...

_Pseudo-implementation of SARSA-Max (better known as Q-learning), for reference_

In [None]:
# Pseudocode only: the helper functions below are placeholders, not a real API.
num_episodes = 1000  # arbitrary number of episodes to run

# We first initialize Q(s, a) to an arbitrary starting condition.
Q = init_Q()

for episode in range(num_episodes):
    s = init_s()  # start each episode from an initial state
    while not is_terminal(s):
        # Choose a from s using a policy derived from Q. Could be epsilon-greedy, softmax etc.
        a = choose_action(Q, s)
        # Execute a, then observe the reward r and the next state s_prime.
        r, s_prime = execute(a)
        update_Q(Q, s, a, r, s_prime)  # <- this is the interesting bit. See below.
        s = s_prime
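The pseudocode mentions epsilon-greedy and softmax as ways to derive a policy from Q. The following is a minimal sketch of both, assuming the Q-values for the current state are held in a plain list indexed by action. The function names, the `temperature` parameter and the example values are illustrative choices, not part of gym.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice; ties resolve to the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; a higher Q gets a higher probability."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(q_values)
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


q_values = [0.1, 0.5, 0.2]  # made-up Q(s, a) for three actions in one state
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # epsilon=0 is purely greedy
probs = softmax_probs(q_values)
sampled_action = random.choices(range(len(q_values)), weights=probs)[0]
```

Epsilon-greedy explores uniformly at random, while softmax explores in proportion to how promising each action currently looks; lowering `temperature` makes the softmax choice greedier.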

#### How do Q-values get updated?

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$

Where: 

$\alpha$ is a learning rate between 0 and 1.

$\gamma$ is a discount factor between 0 and 1. Future rewards are worth less than immediate rewards.

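$\max_{a'} Q(s', a')$ is the highest Q-value currently estimated for any action $a'$ available in the next state $s'$. It is an estimate of future return, not an immediate reward.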
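As a rough sketch of how this update behaves in practice, the code below applies it once by hand and then runs it on a tiny chain environment. Everything here is an assumption made for illustration: the 5-state chain, the `chain_step` helper, the table layout `Q[state][action]` and the hyperparameter values. For simplicity the agent acts uniformly at random; because the update bootstraps from the max over next actions, it can still learn the values of the greedy policy.

```python
import random


def q_update(Q, s, a, r, s_prime, alpha, gamma):
    """Apply one SARSA-Max (Q-learning) update to the table Q in place."""
    td_target = r + gamma * max(Q[s_prime])
    Q[s][a] += alpha * (td_target - Q[s][a])


# One update by hand: Q(0, 1) starts at 0, reward is 1, best next-state value is 2.
Q_demo = [[0.0, 0.0], [2.0, 0.5]]
q_update(Q_demo, s=0, a=1, r=1.0, s_prime=1, alpha=0.5, gamma=0.9)
single_update = Q_demo[0][1]  # 0 + 0.5 * (1 + 0.9 * 2 - 0), i.e. about 1.4


def chain_step(s, a, n_states=5):
    """Hypothetical chain: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    s_prime = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_prime == n_states - 1
    return s_prime, (1.0 if done else 0.0), done


rng = random.Random(0)
n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 0.5, 0.9

for episode in range(500):
    s, done = 0, False
    while not done:
        a = rng.randrange(n_actions)  # uniformly random behaviour policy
        s_prime, r, done = chain_step(s, a)
        q_update(Q, s, a, r, s_prime, alpha, gamma)
        s = s_prime

# Greedy policy for the non-terminal states, read off the learned table.
greedy_policy = [max(range(n_actions), key=lambda act: Q[s][act])
                 for s in range(n_states - 1)]
print(greedy_policy)
```

The terminal state's row is never updated and stays at zero, so the final step of each episode has a target equal to the reward alone. The learned values shrink by a factor of $\gamma$ for each extra step away from the goal, which is the discounting described above.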

### So that is SARSA-Max, and some basic RL theory. Back to gym and universe...

### gym demo
---

In [None]:
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()  # start a new episode
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()  # take a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

### universe demo
---

In [None]:
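import gym
import universe  # registers the universe environments with gym

env = gym.make('flashgames.DuskDrive-v0')
env.configure(remotes=1)  # automatically creates a local Docker container
observation_n = env.reset()

while True:
    # Hold down the up arrow in every remote environment.
    action_n = [[('KeyEvent', 'ArrowUp', True)] for ob in observation_n]  # your agent here
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()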