In [1]:
import gym

## Naive implementation

In [4]:
env = gym.make('CartPole-v0')
env.reset()

for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())

[2018-01-25 23:44:39,711] Making new env: CartPole-v0
[2018-01-25 23:44:39,991] You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.


## Understanding `env.step`

As the documentation notes, each environment runs in episodes, and `done=True` signals that the current episode has ended. The warning above appears because the naive loop keeps calling `env.step` after that point; we need to call `env.reset()` instead. To do that, we first need to understand what `env.step(action)` does and returns. It advances the environment by one timestep using the action given by `action` and returns a tuple of four values:
- observation: This is environment specific and represents our observation of the environment after taking the action specified in `env.step(action)`.
- reward: The reward we received upon performing the action.
- done: The flag discussed above. We need to monitor it and call `env.reset()` once `done=True`.
- info: Additional information for debugging.
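
To make this contract concrete, here is a minimal sketch that runs without gym. `ToyEnv` and `run_episode` are hypothetical names invented for illustration (they are not part of gym); `ToyEnv` only mimics the `reset()`/`step()` interface described above and ends every episode after a fixed number of steps.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a gym environment (not part of gym).

    The observation is simply the number of steps taken so far, and the
    episode ends after `max_steps` steps.
    """

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Start a new episode and return the initial observation.
        self.t = 0
        return self.t

    def step(self, action):
        # Stepping a finished episode is an error here; gym only warns.
        if self.t >= self.max_steps:
            raise RuntimeError("step() called after done=True; call reset() first")
        self.t += 1
        observation = self.t
        reward = 1.0
        done = self.t >= self.max_steps
        info = {}
        return observation, reward, done, info


def run_episode(env):
    """Run one episode with random actions; return the number of steps taken."""
    env.reset()
    steps = 0
    done = False
    while not done:
        action = random.choice([0, 1])
        observation, reward, done, info = env.step(action)
        steps += 1
    return steps


env = ToyEnv(max_steps=5)
lengths = [run_episode(env) for _ in range(3)]
print(lengths)
```

Because `run_episode` calls `reset()` at the start of every episode and stops as soon as `done` is true, `step()` is never called on a finished episode.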

In [6]:
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(1000):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)

        if done:
            # t is zero-based, so t + 1 steps have been taken in this episode.
            print('Episode #%d finished after %d timesteps' % (i_episode, t + 1))
            break

[2018-01-25 23:45:47,428] Making new env: CartPole-v0


[-0.02610098  0.02090901  0.01410614  0.03953635]
[-0.0256828   0.21582587  0.01489687 -0.24866279]
[-0.02136628  0.02049437  0.00992361  0.04868146]
[-0.0209564  -0.17476846  0.01089724  0.34447878]
[-0.02445176 -0.37004372  0.01778682  0.64057801]
[-0.03185264 -0.56540907  0.03059838  0.93880872]
[-0.04316082 -0.37071269  0.04937455  0.65589538]
[-0.05057507 -0.17631163  0.06249246  0.37915938]
[-0.05410131  0.01786976  0.07007565  0.10681669]
[-0.05374391  0.21192118  0.07221198 -0.16296067]
[-0.04950549  0.40593907  0.06895277 -0.43201717]
[-0.04138671  0.209912    0.06031243 -0.11841924]
[-0.03718847  0.40412028  0.05794404 -0.39148087]
[-0.02910606  0.20822591  0.05011442 -0.08110646]
[-0.02494154  0.40259499  0.04849229 -0.35756656]
[-0.01688964  0.5969952   0.04134096 -0.63457295]
[-0.00494974  0.79151686  0.0286495  -0.91395536]
[ 0.0108806   0.98623982  0.0103704  -1.19749813]
[ 0.0306054   1.18122603 -0.01357957 -1.48691287]
[ 0.05422992  0.98627214 -0.04331782 -1.19850128]


## Understanding agent actions
Every gym environment has `Space` objects that describe its valid actions and observations: `env.action_space` and `env.observation_space`.


In [7]:
env.action_space

Discrete(2)

In [8]:
env.observation_space

Box(4,)

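`Discrete(n)` describes a set of `n` non-negative integers, `{0, 1, ..., n-1}`. <br>
For example, `Discrete(3)` means that the action can take the values `{0, 1, 2}`; CartPole's `Discrete(2)` allows the actions `0` and `1`. <br>
`Box` represents an n-dimensional array of real values (here, n=4), each with a lower and an upper bound.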
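
As a rough illustration of what membership in each kind of space means, the sketch below uses plain Python stand-ins rather than gym's own classes. The helpers `in_discrete` and `in_box` are hypothetical names, and the bounds are made up for the example (they are not CartPole's actual limits). In gym itself, the corresponding operations are `space.contains(x)` for a membership check and `space.sample()` for drawing a random element.

```python
def in_discrete(n, x):
    """True if x is an integer in {0, ..., n-1}, as in a Discrete(n)-like space."""
    return isinstance(x, int) and 0 <= x < n


def in_box(low, high, x):
    """True if x has as many components as the bounds and each lies within them."""
    if not (len(x) == len(low) == len(high)):
        return False
    return all(lo <= value <= hi for lo, value, hi in zip(low, x, high))


# Discrete(3)-like space: valid values are 0, 1 and 2.
print(in_discrete(3, 2))
print(in_discrete(3, 3))

# 4-dimensional box with illustrative bounds of [-1, 1] per component.
low = [-1.0] * 4
high = [1.0] * 4
print(in_box(low, high, [0.0, 0.5, -0.5, 1.0]))
print(in_box(low, high, [0.0, 2.0, 0.0, 0.0]))
```

The first and third checks succeed, while the second and fourth fail because `3` lies outside `{0, 1, 2}` and `2.0` exceeds the upper bound.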