In [1]:
import gym

In [2]:
# gym.envs.registry.all()  # Shows all the environments

#### To set up a reinforcement learning problem in gym, call `gym.make()` with the name of the problem

In [3]:
env = gym.make("CartPole-v0")

#### To initialise the problem, call `env.reset()`
- Returns the agent's initial observation as a NumPy array once the environment is initialised. Its four elements and their bounds are:
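  - Cart Position = (-4.8:4.8); the episode ends once the cart leaves (-2.4:2.4)
  - Cart Velocity = (-inf:inf)
  - Pole Angle = (-0.418:0.418) in radians, roughly ±24°; the episode ends once the pole tilts beyond ±12°
  - Pole Velocity At Tip = (-inf:inf)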

In [4]:
observation = env.reset()  # Numpy Array
print(observation)

[ 0.03799467  0.00348804  0.03744493 -0.01553074]
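
As a small illustration, the four values printed above can be unpacked into named variables. The variable names are labels chosen for this sketch, not gym identifiers, and the pole angle is assumed to be in radians, which is how the loop later in this notebook treats it.

```python
from math import degrees

# Values copied from the observation printed above.
observation = [0.03799467, 0.00348804, 0.03744493, -0.01553074]

# Hypothetical variable names; gym itself only returns the bare array.
cart_position, cart_velocity, pole_angle, pole_tip_velocity = observation

# Assumption: the pole angle is reported in radians.
pole_angle_deg = degrees(pole_angle)
print(f"Pole angle: {pole_angle_deg:.2f} degrees")
```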


#### To query the nature of the observation, call `env.observation_space`
- Describes the type, shape and bounds of the observation array: here, a `Box` holding 4 `float32` values.

In [17]:
env.observation_space 

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)

#### To query the nature of the agent's actions, call `env.action_space`
- 'Discrete' means that the agent's actions are represented by a variable that takes discrete values, as opposed to a continuous one.
- 'Discrete(2)' means that the agent can take two discrete actions: '0' and '1'.
- In 'Discrete(n)', 'n' is the total number of discrete actions the agent can take.

In [6]:
env.action_space

Discrete(2)
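
To make `Discrete(2)` concrete, the sketch below enumerates the valid actions and attaches a label to each. The labels are an assumption based on the CartPole documentation (0 pushes the cart left, 1 pushes it right); the dictionary itself is hypothetical and not part of gym.

```python
# Assumption: labels follow the CartPole documentation.
ACTION_MEANINGS = {0: "push cart to the left", 1: "push cart to the right"}

n_actions = 2  # the n in Discrete(2)
valid_actions = list(range(n_actions))  # Discrete(n) covers 0, 1, ..., n - 1

for action in valid_actions:
    print(action, "->", ACTION_MEANINGS[action])
```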

### To take an action, call `env.step()` with the action as an argument.
- The **first** element returned by `env.step()` is the new environment state (the observation).
- The **second** element returned by `env.step()` is the reward: a judgement of the environment state and the action at that time step.
- In other reinforcement learning problems, the agent's action in a given environment state may be judged too, so the reward function usually has the form `rewards(observation, action)`. In `CartPole-v0`, only the environment state is judged, not the action.
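
A minimal sketch of a reward in the `rewards(observation, action)` form that judges only the state. This is not gym's implementation: the function name, the limits and the example observation are assumptions chosen for illustration.

```python
from math import radians

# Assumed limits for illustration; check the environment's documentation.
POSITION_LIMIT = 2.4        # cart position, in the environment's units
ANGLE_LIMIT = radians(12)   # pole angle, in radians


def state_only_reward(observation, action):
    """Hypothetical reward in the rewards(observation, action) form.

    The action argument is accepted but deliberately ignored, so only
    the environment state is judged.
    """
    cart_position, _, pole_angle, _ = observation
    upright = abs(pole_angle) <= ANGLE_LIMIT
    on_track = abs(cart_position) <= POSITION_LIMIT
    return 1.0 if (upright and on_track) else 0.0


# Example observation: [position, velocity, angle, tip velocity].
obs = [0.038, -0.192, 0.037, 0.289]
print(state_only_reward(obs, 0), state_only_reward(obs, 1))
```

Both calls return the same value because the action never enters the calculation.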

In [7]:
observation, reward, _, _ = env.step(0)  # _ marks return values that are unused here.
print(f'Observation: {observation}\nReward: {reward}')

Observation: [ 0.03806443 -0.19215035  0.03713431  0.28872738]
Reward: 1.0


In [8]:
env.step(0)

(array([ 0.03422142, -0.38778162,  0.04290886,  0.59288697]), 1.0, False, {})

In [9]:
env.step(1)

(array([ 0.02646579, -0.19328577,  0.0547666 ,  0.31402305]), 1.0, False, {})
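
The tuples above have four elements: the observation, the reward, a Boolean and a dictionary. In the gym release this notebook appears to use, the Boolean is a "done" flag that becomes `True` when the episode ends, and the dictionary carries extra diagnostic information. The sketch below shows how a loop typically uses that flag. `ToyEnv` is a hypothetical stand-in written for this example so that it runs without gym; it is not CartPole.

```python
class ToyEnv:
    """Hypothetical stand-in that mimics the 4-tuple returned by step().

    The "state" is just a step counter and the episode ends after
    max_steps steps.
    """

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0]  # initial observation

    def step(self, action):
        self.steps += 1
        observation = [float(self.steps)]
        done = self.steps >= self.max_steps
        return observation, 1.0, done, {}


env_sketch = ToyEnv(max_steps=5)
observation = env_sketch.reset()
done = False
total_reward = 0.0

# Keep stepping until the environment reports that the episode is over.
while not done:
    observation, reward, done, info = env_sketch.step(0)
    total_reward += reward

print(total_reward)
```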

In [10]:
# env.step(2)  # Invalid action. Beyond the limits of action space
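
One way to guard against an out-of-range action is to check it before stepping. The helper below is a hypothetical sketch for plain Python integers; gym spaces also expose a `contains()` method that serves the same purpose.

```python
def is_valid_action(action, n_actions):
    """Hypothetical helper: True if action lies in Discrete(n_actions)."""
    return isinstance(action, int) and 0 <= action < n_actions


for candidate in (0, 1, 2):
    print(candidate, is_valid_action(candidate, 2))
```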

### The reward is a judgement of the environment state and the action at any time step.
- The environment state may be judged as good or bad.
- The action in that environment state may also be judged as good or bad.
- The judgement depends on the goal.

### The learning goal (balancing the pole without crashing) is equivalent to maximizing the reward function over time.
- The maximum reward per time step is +1.
- The more +1's you get, the more often the pole stays upright.
- In the extreme case (maximization), you get +1 every time step and therefore the pole always stays upright.
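
"Maximizing the reward over time" can be read as maximizing the sum of the per-step rewards in an episode. The sketch below compares two hypothetical episodes; the function name and the episode lengths are assumptions made for illustration.

```python
def episode_return(rewards):
    """Total reward collected over one episode."""
    return sum(rewards)


# Hypothetical episodes: the pole falls after 9 steps vs. staying up for 200.
short_episode = [1.0] * 9
long_episode = [1.0] * 200

print(episode_return(short_episode), episode_return(long_episode))
```

The longer the pole stays upright, the more +1 rewards accumulate, so the larger total corresponds to the better balancing behaviour.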

### Other environments with different goals
- Walking
- Driving
- Playing a video game
- Making money in the stock market

### In any RL problem, the first step is to express the goal as the maximisation of some reward function.
- gym already provides such a reward function for each of its environments.

### To get a random valid action, sampled with equal probability, call `env.action_space.sample()`.

In [11]:
env.action_space.sample()

0
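
"Sampled with equal probability" means that, over many draws, each action should appear about equally often. The sketch below checks this with `random.randrange`, which is used here as a stand-in for `env.action_space.sample()` so that the example runs without gym.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

n_actions = 2
# random.randrange stands in for env.action_space.sample() here.
samples = [random.randrange(n_actions) for _ in range(10_000)]
counts = Counter(samples)
print(counts)
```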

### Taking multiple actions in a loop. Call `env.render()` after each action to redraw the environment state and visualise the dynamics in real time.
- Use `time.sleep()` to see the animation in slow-motion.

In [14]:
import time
from math import pi

observation = env.reset()

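for _ in range(30):  # 30 timesteps.
    # Pole angle observed before this step's action is applied.
    pole_angle_in_radians = observation[2]
    # 360º == 2*π radians.
    pole_angle_in_degrees = (pole_angle_in_radians * 360) / (2 * pi)
    # The done flag is ignored here, so stepping continues after the episode ends.
    observation, reward, _, _ = env.step(0)
    print(f'Pole angle: {pole_angle_in_degrees:.2f}, Reward: {reward}')
    env.render()  # Draw the current state.
    time.sleep(0.1)  # Slow the animation down.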

Pole angle: 0.60, Reward: 1.0
Pole angle: 0.61, Reward: 1.0
Pole angle: 0.96, Reward: 1.0
Pole angle: 1.64, Reward: 1.0
Pole angle: 2.67, Reward: 1.0
Pole angle: 4.04, Reward: 1.0
Pole angle: 5.77, Reward: 1.0
Pole angle: 7.85, Reward: 1.0
Pole angle: 10.30, Reward: 1.0
Pole angle: 13.13, Reward: 0.0
Pole angle: 16.36, Reward: 0.0
Pole angle: 19.99, Reward: 0.0
Pole angle: 24.03, Reward: 0.0
Pole angle: 28.51, Reward: 0.0
Pole angle: 33.42, Reward: 0.0
Pole angle: 38.78, Reward: 0.0
Pole angle: 44.59, Reward: 0.0
Pole angle: 50.86, Reward: 0.0
Pole angle: 57.58, Reward: 0.0
Pole angle: 64.74, Reward: 0.0
Pole angle: 72.33, Reward: 0.0
Pole angle: 80.34, Reward: 0.0
Pole angle: 88.75, Reward: 0.0
Pole angle: 97.53, Reward: 0.0
Pole angle: 106.65, Reward: 0.0
Pole angle: 116.08, Reward: 0.0
Pole angle: 125.77, Reward: 0.0
Pole angle: 135.67, Reward: 0.0
Pole angle: 145.72, Reward: 0.0
Pole angle: 155.85, Reward: 0.0
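
The reward in the output above drops to 0.0 part-way through. A plausible reading, assuming CartPole ends the episode once the pole tilts more than 12° from vertical, is that the loop kept stepping after the episode had finished. The sketch below checks that the first printed angle beyond 12° lines up with the first zero reward. The threshold is an assumption; the numbers are copied from the first 12 lines of the output.

```python
# Angles (degrees) and rewards copied from the first 12 lines of output above.
angles = [0.60, 0.61, 0.96, 1.64, 2.67, 4.04, 5.77, 7.85, 10.30, 13.13, 16.36, 19.99]
rewards = [1.0] * 9 + [0.0] * 3

ANGLE_LIMIT_DEG = 12.0  # assumed CartPole termination threshold

# Index of the first printed angle beyond the limit.
first_beyond = next(i for i, a in enumerate(angles) if abs(a) > ANGLE_LIMIT_DEG)
# Index of the first step that earned no reward.
first_zero_reward = rewards.index(0.0)

print(first_beyond, first_zero_reward)
```

Each printed angle is the one observed before that line's step, so the two indices coinciding is consistent with the step that pushed the pole past the limit still earning +1 and every later step earning nothing.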


### To get a visual representation of the problem after setup, call `env.render()`

In [15]:
env.render()

True

In [16]:
env.close()