In [20]:
import gym

In [21]:
# gym.envs.registry.all()  # Shows all the environments

#### To set up a reinforcement learning problem in gym, call `gym.make()` with the name of the problem

In [22]:
env = gym.make("CartPole-v0")

#### To initialise the problem, call `env.reset()`
- Returns the agent's initial observation as a NumPy array once the environment is initialised. The array holds four values:
  - Cart Position = (-2.4:2.4)
  - Cart Velocity = (-inf:inf)
  - Pole Angle = (-41.8:41.8)
  - Pole Velocity At Tip = (-inf:inf)

In [27]:
observation = env.reset()  # Numpy Array
print(observation)

[-0.0462211  -0.00171908  0.03678178 -0.04742402]
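To make the four components easier to read, the printed observation can be given named fields. This is a minimal sketch: `CartPoleObservation` is a hypothetical helper introduced here, not part of gym, and the values are copied by hand from the output above so the snippet runs without an environment.

```python
import math
from collections import namedtuple

# Hypothetical container; the field names follow the component list above.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_velocity_at_tip"],
)

# Values copied from the printed initial observation.
obs = CartPoleObservation(-0.0462211, -0.00171908, 0.03678178, -0.04742402)

# The loop later in this notebook treats observation[2] as radians,
# so convert it to degrees for readability.
pole_angle_degrees = math.degrees(obs.pole_angle)
print(f"Cart position: {obs.cart_position}")
print(f"Pole angle: {pole_angle_degrees:.2f} degrees")
```

With a live environment, the same idea would be `CartPoleObservation(*env.reset())`, assuming `env.reset()` returns the bare four-element array as it does in this notebook.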


#### To query the nature of the observation, read `env.observation_space`
- Describes the lower and upper bounds, the shape, and the data type of the observation array.

In [24]:
env.observation_space 

Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
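The large number in the output above is the largest finite value a `float32` can hold, so the summary reads roughly as "four `float32` numbers with no practical limit". A small check using only the standard library (the name `FLOAT32_MAX` is introduced here for illustration):

```python
import struct

# 0x7F7FFFFF is the bit pattern of the largest finite IEEE 754
# single-precision (float32) value; decode it as a big-endian float.
FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# Compare with the bound printed by env.observation_space.
printed_bound = 3.4028234663852886e+38
print(FLOAT32_MAX == printed_bound)
```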

#### To query the nature of the agent's actions, read `env.action_space`
- `Discrete` means that the agent's actions are represented by a variable that takes discrete values, as opposed to continuous ones.
- `Discrete(2)` means that the agent can take two discrete actions: `0` and `1`.
- In general, `Discrete(n)` means that the agent can take `n` discrete actions, numbered `0` to `n - 1`.

In [25]:
env.action_space

Discrete(2)
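The meaning of `Discrete(n)` can be sketched as a simple range check. The function `is_valid_action` below is a hypothetical helper written for this note, not gym's own implementation; it only illustrates which integers count as valid actions.

```python
def is_valid_action(action, n):
    """Return True if `action` is one of the n discrete actions 0 .. n-1."""
    return isinstance(action, int) and 0 <= action < n


# For Discrete(2), only 0 and 1 are valid; 2 lies outside the action space.
for candidate in (0, 1, 2):
    print(candidate, is_valid_action(candidate, n=2))
```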

### To take an action, call `env.step()` with the action as an argument.
- The **first** element returned by `env.step()` is the new environment state (the observation).
- The **second** element is the reward, a judgement of the environment state and the action at that time step.
- The **third** element is a boolean flag that tells whether the episode has ended, and the **fourth** is a dictionary of extra information (empty in the outputs below).
- In other reinforcement learning problems, we may also judge the agent's actions in a given environment state, so the reward function usually has the form `rewards(observation, action)`. In `CartPole-v0`, however, we only judge the environment states, not the actions.

In [31]:
observation, reward, _, _ = env.step(0)
print(f'Observation: {observation}\nReward: {reward}')

Observation: [-1.96700711 -5.65253414  3.42787539  7.8635857 ]
Reward: 0.0


In [10]:
env.step(0)

(array([ 0.01567138, -0.19499978, -0.0426572 ,  0.28009863]), 1.0, False, {})

In [11]:
env.step(1)

(array([ 0.01177138,  0.00070389, -0.03705523, -0.02572725]), 1.0, False, {})

In [13]:
# env.step(2)  # Invalid action. Beyond the limits of action space
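The cells above discard or leave unnamed the last two values returned by `env.step()`. The sketch below unpacks all four by name. To stay runnable without gym, it uses a tuple copied from the `env.step(0)` output shown earlier, with the NumPy array written as a plain list; the names `done` and `info` are chosen here for illustration.

```python
# Copied from the output of env.step(0) above (array written as a list).
step_result = ([0.01567138, -0.19499978, -0.0426572, 0.28009863], 1.0, False, {})

# Unpack every element by name instead of discarding the last two with `_`.
observation, reward, done, info = step_result

print(f"Observation: {observation}")
print(f"Reward: {reward}")
print(f"Episode finished: {done}")
print(f"Extra info: {info}")
```

With a live environment, the first line would simply be `step_result = env.step(0)`.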

### The reward is a judgement of the environment state and the action at any time step.
- The environment state may be judged as good or bad.
- The action taken in that environment state may also be judged as good or bad.
- The judgement depends on the goal.

### The learning goal (balancing the pole without crashing) is equivalent to maximising the reward collected over time.
- The maximum reward per time step is +1.
- The more +1s you collect, the longer the pole has stayed upright.
- In the extreme case (maximisation), you get +1 at every time step, so the pole always stays upright.
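"Maximising the reward over time" can be made concrete by summing the per-step rewards of an episode. The sketch below uses the reward sequence printed by the 30-step loop later in this notebook (ten rewards of 1.0 followed by twenty of 0.0); `episode_return` is a hypothetical helper name.

```python
def episode_return(rewards):
    """Total reward collected over an episode (plain, undiscounted sum)."""
    return sum(rewards)


# Reward sequence copied from the 30-step loop output in this notebook.
observed_rewards = [1.0] * 10 + [0.0] * 20

# Best case for the same number of steps: +1 at every time step.
best_case_rewards = [1.0] * 30

print(episode_return(observed_rewards))
print(episode_return(best_case_rewards))
```

The gap between the two totals is what a learning agent would try to close.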

### Other environments with different goals
- Walking
- Driving
- Playing a video game
- Making money in the stock market

### In any RL problem, the first step is to express the goal as the maximisation of some reward function.
- Gym already provides such a reward function for each of its environments.
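As an illustration of expressing a goal as a reward function, here is a minimal sketch in the `rewards(observation, action)` form. It is not gym's implementation: `upright_reward` is a hypothetical function, the cart limit is taken from the position range listed earlier, and the 12 degree pole limit is an assumption made for this example.

```python
import math

CART_POSITION_LIMIT = 2.4            # from the cart position range listed earlier
POLE_ANGLE_LIMIT = math.radians(12)  # assumed limit, chosen for illustration


def upright_reward(observation, action):
    """Return 1.0 while the cart and pole are within limits, else 0.0.

    The action is accepted for generality but ignored, because only the
    environment state is judged in this sketch.
    """
    cart_position, _, pole_angle, _ = observation
    within_limits = (
        abs(cart_position) <= CART_POSITION_LIMIT
        and abs(pole_angle) <= POLE_ANGLE_LIMIT
    )
    return 1.0 if within_limits else 0.0


# Two observations printed elsewhere in this notebook.
balanced = [-0.0462211, -0.00171908, 0.03678178, -0.04742402]
fallen = [-1.96700711, -5.65253414, 3.42787539, 7.8635857]
print(upright_reward(balanced, action=0), upright_reward(fallen, action=0))
```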

### To get a random valid action, sampled with equal probability, call `env.action_space.sample()`.

In [15]:
env.action_space.sample()

0
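"Sampled with equal probability" means that, over many draws, each action should appear about equally often. The sketch below checks this with the standard library as a stand-in for `env.action_space.sample()`; `sample_action` is a hypothetical helper and the seed is fixed only to make the run repeatable.

```python
import random
from collections import Counter


def sample_action(n, rng):
    """Draw one of the actions 0 .. n-1 uniformly at random."""
    return rng.randrange(n)


rng = random.Random(0)  # fixed seed so the counts are reproducible
counts = Counter(sample_action(2, rng) for _ in range(10_000))

# Both actions should show up in roughly half of the draws.
print(counts)
```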

### Taking multiple actions in a loop. Call `env.render()` after each action to redraw the environment and visualise the dynamics in real time.
- Use `time.sleep()` to slow the animation down.

In [46]:
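import time
from math import pi

observation = env.reset()

# The "done" flag (third return value) is ignored here, so the loop keeps
# stepping after the episode has ended; gym then reports a reward of 0.0.
for _ in range(30):  # 30 time steps
    pole_angle_in_radians = observation[2]  # angle before this step's action
    # 360° == 2π radians
    pole_angle_in_degrees = (pole_angle_in_radians * 360) / (2 * pi)
    observation, reward, _, _ = env.step(0)  # always push the cart to the left
    print(f'Pole angle: {pole_angle_in_degrees:.2f}, Reward: {reward}')
    env.render()
    time.sleep(0.1)  # pause so the animation is easy to follow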

Pole angle: -1.36, Reward: 1.0
Pole angle: -1.32, Reward: 1.0
Pole angle: -0.97, Reward: 1.0
Pole angle: -0.28, Reward: 1.0
Pole angle: 0.73, Reward: 1.0
Pole angle: 2.08, Reward: 1.0
Pole angle: 3.77, Reward: 1.0
Pole angle: 5.81, Reward: 1.0
Pole angle: 8.20, Reward: 1.0
Pole angle: 10.97, Reward: 1.0
Pole angle: 14.11, Reward: 0.0
Pole angle: 17.65, Reward: 0.0
Pole angle: 21.60, Reward: 0.0
Pole angle: 25.97, Reward: 0.0
Pole angle: 30.77, Reward: 0.0
Pole angle: 36.01, Reward: 0.0
Pole angle: 41.70, Reward: 0.0
Pole angle: 47.85, Reward: 0.0
Pole angle: 54.44, Reward: 0.0
Pole angle: 61.49, Reward: 0.0
Pole angle: 68.97, Reward: 0.0
Pole angle: 76.88, Reward: 0.0
Pole angle: 85.19, Reward: 0.0
Pole angle: 93.88, Reward: 0.0
Pole angle: 102.93, Reward: 0.0
Pole angle: 112.30, Reward: 0.0
Pole angle: 121.95, Reward: 0.0
Pole angle: 131.84, Reward: 0.0
Pole angle: 141.90, Reward: 0.0
Pole angle: 152.07, Reward: 0.0
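The output above keeps going after the pole has fallen because the loop never checks whether the episode has ended. A common variation is to stop as soon as the third value returned by `env.step()` is `True`. The sketch below shows that pattern against a tiny stand-in environment so it runs without gym: `StubEnv`, its fixed 0.05 rad tilt per step, and its 12 degree limit are all assumptions made for illustration, not CartPole's real dynamics.

```python
import math


class StubEnv:
    """Hypothetical stand-in that mimics the reset()/step() calls used above."""

    ANGLE_LIMIT = math.radians(12)  # assumed end-of-episode threshold

    def reset(self):
        self.angle = 0.0
        return [0.0, 0.0, self.angle, 0.0]

    def step(self, action):
        # Toy dynamics: each action tilts the pole by a fixed 0.05 rad.
        self.angle += 0.05 if action == 0 else -0.05
        done = abs(self.angle) > self.ANGLE_LIMIT
        return [0.0, 0.0, self.angle, 0.0], 1.0, done, {}


def run_episode(env, action, max_steps=30):
    """Repeat one action until the episode ends or max_steps is reached."""
    observation = env.reset()
    steps_taken = 0
    total_reward = 0.0
    for _ in range(max_steps):
        observation, reward, done, info = env.step(action)
        steps_taken += 1
        total_reward += reward
        if done:  # stop instead of stepping a finished episode
            break
    return steps_taken, total_reward


steps_taken, total_reward = run_episode(StubEnv(), action=0)
print(f"Stopped after {steps_taken} steps with total reward {total_reward}")
```

Passing the notebook's real `env` to `run_episode` should work the same way, assuming `env.step()` returns four values as it does in the cells above.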


### To get a visual representation of the problem after setup, call `env.render()`

In [5]:
env.render()

True

In [7]:
env.close()