# OpenAI Gym CartPole Environment

![](CartPole.png)

The OpenAI Gym website, at [https://gym.openai.com/](https://gym.openai.com/), documents widely used environments for testing reinforcement learning algorithms. This notebook walks through the Getting Started guide at [Getting Started with Gym](https://gym.openai.com/docs/).

OpenAI Gym also has a GitHub site at [https://github.com/openai/gym/](https://github.com/openai/gym/). More information about the CartPole problem can be found at [CartPole Wiki](https://github.com/openai/gym/wiki/CartPole-v0).

## OpenAI Gym Installation ##
OpenAI Gym is not installed as part of the Anaconda Python or standard Python distributions. You need to install it from your command prompt or terminal.

For Anaconda Python, remember to activate your Python (base) environment.
```text
conda activate base
```
Then use *conda* to install gym.
```text
conda install -c conda-forge gym
```

Use *pip* for the standard Python install.

```text
pip install gym
```
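
To confirm that the install worked before moving on, a small check like the following can help. This is only a sketch: the helper name `is_installed` is made up for illustration, and it relies solely on the standard library, so it runs whether or not gym is present.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Reports True once the conda or pip install above has succeeded.
print("gym installed:", is_installed("gym"))
```

If this prints `False`, the most likely cause is that the package was installed into a different Python environment than the one running the check.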


### Getting Started with Gym ###
Next, we will go through the [Getting Started with Gym](https://gym.openai.com/docs/) examples.

#### Getting Started Example 1 ####

You may be excited to try the first example and see the cart and pole move around on your screen. I know that I was! However, there are problems rendering graphics from within the Jupyter environment. To circumvent these problems you can place the code in a .py file and then run Python on it from the terminal or command prompt window.


```bash
python cartpole1.py
```

Below are the contents of the cartpole1.py file.
```python
import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(100):
    env.render()
    env.step(env.action_space.sample()) # take a random action
env.close()
```

#### Getting Started Example 2 ####

Below is the second example. I used the same process to create a .py file.
```python
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()
```

Upon execution, this renders the CartPole problem until the `done` flag is True and then starts over with a new episode. This is repeated for 20 episodes, each capped at 100 timesteps. In addition to rendering an animation of the cart and pole, it prints the state information at each step and then the total number of steps executed when the `done` flag is set to True, as shown below.

![](Terminal2.png)
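
The `done` flag in the loop above is set by the environment itself. As a rough sketch of what triggers it, the check below assumes the termination thresholds given in the CartPole documentation (cart position beyond ±2.4 or pole angle beyond ±12 degrees). The function name `is_out_of_bounds`, the constants, and the two sample observations are illustrative and are not part of the gym API. CartPole-v0 also ends an episode after 200 steps, which this sketch leaves out.

```python
import math

# Assumed thresholds, taken from the CartPole documentation
# rather than from anything printed in this notebook.
X_THRESHOLD = 2.4                    # cart position limit
THETA_THRESHOLD = math.radians(12)   # pole angle limit in radians


def is_out_of_bounds(observation):
    """Return True if the cart or the pole has left the allowed range."""
    cart_position, _cart_velocity, pole_angle, _pole_velocity = observation
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD


# Hypothetical observations: one near upright, one with the pole tipped too far.
near_upright = [0.03, 0.02, 0.02, 0.04]
tipped_over = [0.0, 0.0, 0.25, 0.0]

print(is_out_of_bounds(near_upright))
print(is_out_of_bounds(tipped_over))
```

With random actions the pole angle usually crosses its limit long before the cart reaches the edge of the track, which is why episodes end after only a handful of steps.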

#### Example to create and interrogate the environment ####
1. Import the gym environment
2. Instantiate the Cart Pole environment
3. Print information on the observation space
4. Print information about the action space

In [1]:
import gym
env = gym.make('CartPole-v0')
print('Observation Space:', env.observation_space)
print('Action Space:', env.action_space)

Observation Space: Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Action Space: Discrete(2)


#### Example to get the range of potential values in the observation space ####
1. Print the largest values in the observation space
2. Print the smallest values in the observation space

In [2]:
print('high:',env.observation_space.high)
print('low: ', env.observation_space.low)

high: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
low:  [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
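
These bounds can be reproduced by hand. The sketch below assumes, based on the CartPole documentation rather than anything shown in this notebook, that the position and angle limits are twice the episode termination thresholds (2.4 and 12 degrees) and that the two velocity entries are simply the largest finite float32 value, meaning the velocities are effectively unbounded.

```python
import math

# Assumed relationship: each bound is twice the termination threshold.
position_bound = 2 * 2.4                # cart position bound
angle_bound = 2 * math.radians(12)      # 24 degrees expressed in radians

# Largest finite single-precision float, built from its bit layout.
float32_max = (2 - 2**-23) * 2.0**127

print("position bound:", position_bound)
print("angle bound:   ", angle_bound)
print("float32 max:   ", float32_max)
```

Because the observation space is wider than the termination thresholds, an observation can still lie inside the space on the step at which the episode ends.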


#### Example to see the initial values in the space ####
Reset the environment to place the cart and pole into an initial state. The four observation values are set to small random numbers to start the problem. These values will be different each time you start the problem. The four values are:

1. Position of the cart
2. Velocity of the cart
3. Angle of the pole from vertical
4. Angular velocity of the pole

In [3]:
print("1.", env.reset())
print("2.", env.reset())
print("3.", env.reset())
print("4.", env.reset())
print("5.", env.reset())

1. [-0.03325264  0.02704961 -0.01587968 -0.03898127]
2. [-0.02355128 -0.00777282  0.01648262 -0.03413799]
3. [-0.04344961 -0.02658321  0.03927422 -0.04977842]
4. [0.0310916  0.00615533 0.04445901 0.03010518]
5. [-0.00555808 -0.02661021 -0.0419495   0.00753793]
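
The random start can be mimicked without gym. This sketch assumes, following the CartPole documentation, that each of the four values is drawn uniformly from [-0.05, 0.05], which is consistent with the five resets printed above. `sample_initial_state` is an illustrative name, not a gym function.

```python
import random


def sample_initial_state(rng=random):
    """Draw four small random values, one per observation component."""
    # Assumed range: uniform in [-0.05, 0.05] for every component.
    return [rng.uniform(-0.05, 0.05) for _ in range(4)]


for i in range(1, 6):
    print(f"{i}.", sample_initial_state())
```

Passing a seeded `random.Random` instance as `rng` makes the draws repeatable, which can be handy when comparing runs.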


In [5]:
import gym
env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

[0.03154479 0.02829044 0.02610899 0.04597598]
[ 0.0321106   0.22302847  0.02702851 -0.23835632]
[ 0.03657117  0.41775405  0.02226138 -0.52239263]
[ 0.04492625  0.61255574  0.01181353 -0.8079784 ]
[ 0.05717736  0.41727388 -0.00434604 -0.511603  ]
[ 0.06552284  0.6124568  -0.0145781  -0.8056523 ]
[ 0.07777198  0.8077755  -0.03069114 -1.1028851 ]
[ 0.09392749  1.0032874  -0.05274884 -1.4050367 ]
[ 0.11399323  0.8088586  -0.08084958 -1.1293    ]
[ 0.1301704   1.004941   -0.10343558 -1.4462066 ]
[ 0.15026923  1.201172   -0.13235971 -1.7693359 ]
[ 0.17429267  1.0077696  -0.16774642 -1.5205699 ]
[ 0.19444805  1.2044746  -0.19815783 -1.8605723 ]
Episode finished after 13 timesteps
[ 0.03946384  0.04349051 -0.02035143 -0.04043033]
[ 0.04033365  0.23889829 -0.02116003 -0.33946422]
[ 0.04511162  0.04408373 -0.02794932 -0.05352836]
[ 0.04599329 -0.15062656 -0.02901989  0.23020695]
[ 0.04298076  0.04489781 -0.02441575 -0.07148675]
[ 0.04387872 -0.14986575 -0.02584548  0.213394  ]
[ 0.0408814  -0.34

In [7]:
__credits__ = ["Carlos Luis"]

from os import path
from typing import Optional

import numpy as np

import gym
from gym import spaces
from gym.utils import seeding


class PendulumEnv(gym.Env):
    """
    ### Description

    The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.

    The diagram below specifies the coordinate system used for the implementation of the pendulum's
    dynamic equations.

    ![Pendulum Coordinate System](./diagrams/pendulum.png)

    - `x-y`: Cartesian coordinates of the pendulum's end in meters.
    - `theta`: angle in radians.
    - `tau`: torque in `N m`. Defined as positive _counter-clockwise_.

    ### Action Space

    The action is a `ndarray` with shape `(1,)` representing the torque applied to the free end of the pendulum.

    | Num | Action | Min  | Max |
    |-----|--------|------|-----|
    | 0   | Torque | -2.0 | 2.0 |


    ### Observation Space

    The observation is a `ndarray` with shape `(3,)` representing the x-y coordinates of the pendulum's free end and its angular velocity.

    | Num | Observation      | Min  | Max |
    |-----|------------------|------|-----|
    | 0   | x = cos(theta)   | -1.0 | 1.0 |
    | 1   | y = sin(theta)   | -1.0 | 1.0 |
    | 2   | Angular Velocity | -8.0 | 8.0 |

    ### Rewards

    The reward function is defined as:

    *r = -(theta<sup>2</sup> + 0.1 * theta_dt<sup>2</sup> + 0.001 * torque<sup>2</sup>)*

    where `theta` is the pendulum's angle normalized between *[-pi, pi]* (with 0 being in the upright position).
    Based on the above equation, the minimum reward that can be obtained is *-(pi<sup>2</sup> + 0.1 * 8<sup>2</sup> + 0.001 * 2<sup>2</sup>) = -16.2736044*, while the maximum reward is zero (pendulum is
    upright with zero velocity and no torque applied).

    ### Starting State

    The starting state is a random angle in *[-pi, pi]* and a random angular velocity in *[-1,1]*.

    ### Episode Termination

    The episode terminates at 200 time steps.

    ### Arguments

    - `g`: acceleration of gravity measured in *(m s<sup>-2</sup>)* used to calculate the pendulum dynamics. The default value is g = 10.0 .

    ```
    gym.make('Pendulum-v1', g=9.81)
    ```

    ### Version History

    * v1: Simplify the math equations, no difference in behavior.
    * v0: Initial version release (1.0.0)

    """

    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 30}

    def __init__(self, g=10.0):
        self.max_speed = 8
        self.max_torque = 2.0
        self.dt = 0.05
        self.g = g
        self.m = 1.0
        self.l = 1.0
        self.screen = None
        self.clock = None
        self.isopen = True

        self.screen_dim = 500

        high = np.array([1.0, 1.0, self.max_speed], dtype=np.float32)
        # This will throw a warning in tests/envs/test_envs in utils/env_checker.py as the space is not symmetric
        #   or normalised as max_torque == 2 by default. Ignoring the issue here as the default settings are too old
        #   to update to follow the openai gym api
        self.action_space = spaces.Box(
            low=-self.max_torque, high=self.max_torque, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)

    def step(self, u):
        th, thdot = self.state  # th := theta

        g = self.g
        m = self.m
        l = self.l
        dt = self.dt

        u = np.clip(u, -self.max_torque, self.max_torque)[0]
        self.last_u = u  # for rendering
        costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

        newthdot = thdot + (3 * g / (2 * l) * np.sin(th) + 3.0 / (m * l**2) * u) * dt
        newthdot = np.clip(newthdot, -self.max_speed, self.max_speed)
        newth = th + newthdot * dt

        self.state = np.array([newth, newthdot])
        return self._get_obs(), -costs, False, {}

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        return_info: bool = False,
        options: Optional[dict] = None
    ):
        super().reset(seed=seed)
        high = np.array([np.pi, 1])
        self.state = self.np_random.uniform(low=-high, high=high)
        self.last_u = None
        if not return_info:
            return self._get_obs()
        else:
            return self._get_obs(), {}

    def _get_obs(self):
        theta, thetadot = self.state
        return np.array([np.cos(theta), np.sin(theta), thetadot], dtype=np.float32)

    def render(self, mode="human"):
        import pygame
        from pygame import gfxdraw

        if self.screen is None:
            pygame.init()
            pygame.display.init()
            self.screen = pygame.display.set_mode((self.screen_dim, self.screen_dim))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        self.surf = pygame.Surface((self.screen_dim, self.screen_dim))
        self.surf.fill((255, 255, 255))

        bound = 2.2
        scale = self.screen_dim / (bound * 2)
        offset = self.screen_dim // 2

        rod_length = 1 * scale
        rod_width = 0.2 * scale
        l, r, t, b = 0, rod_length, rod_width / 2, -rod_width / 2
        coords = [(l, b), (l, t), (r, t), (r, b)]
        transformed_coords = []
        for c in coords:
            c = pygame.math.Vector2(c).rotate_rad(self.state[0] + np.pi / 2)
            c = (c[0] + offset, c[1] + offset)
            transformed_coords.append(c)
        gfxdraw.aapolygon(self.surf, transformed_coords, (204, 77, 77))
        gfxdraw.filled_polygon(self.surf, transformed_coords, (204, 77, 77))

        gfxdraw.aacircle(self.surf, offset, offset, int(rod_width / 2), (204, 77, 77))
        gfxdraw.filled_circle(
            self.surf, offset, offset, int(rod_width / 2), (204, 77, 77)
        )

        rod_end = (rod_length, 0)
        rod_end = pygame.math.Vector2(rod_end).rotate_rad(self.state[0] + np.pi / 2)
        rod_end = (int(rod_end[0] + offset), int(rod_end[1] + offset))
        gfxdraw.aacircle(
            self.surf, rod_end[0], rod_end[1], int(rod_width / 2), (204, 77, 77)
        )
        gfxdraw.filled_circle(
            self.surf, rod_end[0], rod_end[1], int(rod_width / 2), (204, 77, 77)
        )

        fname = path.join(path.dirname(__file__), "assets/clockwise.png")
        img = pygame.image.load(fname)
        if self.last_u is not None:
            scale_img = pygame.transform.smoothscale(
                img, (scale * np.abs(self.last_u) / 2, scale * np.abs(self.last_u) / 2)
            )
            is_flip = bool(self.last_u > 0)
            scale_img = pygame.transform.flip(scale_img, is_flip, True)
            self.surf.blit(
                scale_img,
                (
                    offset - scale_img.get_rect().centerx,
                    offset - scale_img.get_rect().centery,
                ),
            )

        # drawing axle
        gfxdraw.aacircle(self.surf, offset, offset, int(0.05 * scale), (0, 0, 0))
        gfxdraw.filled_circle(self.surf, offset, offset, int(0.05 * scale), (0, 0, 0))

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        if mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )
        else:
            return self.isopen

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False


def angle_normalize(x):
    return ((x + np.pi) % (2 * np.pi)) - np.pi
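

To see the reward formula from the docstring in isolation, the following sketch recomputes it with only the standard library. It mirrors the cost expression in `step` above but is a standalone illustration, not part of the gym source. `pendulum_reward` is a name introduced here.

```python
import math


def angle_normalize(x):
    """Wrap an angle into the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def pendulum_reward(theta, theta_dot, torque):
    """Negative cost: angle error, angular velocity, and torque are all penalized."""
    cost = angle_normalize(theta) ** 2 + 0.1 * theta_dot**2 + 0.001 * torque**2
    return -cost


# Best case: upright, motionless, no torque applied.
best = pendulum_reward(0.0, 0.0, 0.0)
# Worst case: hanging straight down at maximum speed with maximum torque.
worst = pendulum_reward(math.pi, 8.0, 2.0)

print("best reward: ", best)
print("worst reward:", worst)
```

The worst case agrees with the minimum reward quoted in the docstring, roughly -16.27, and the angle term dominates it.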