# Cart Pole
This environment is part of the Classic Control environments.
The unique dependencies for this set of environments can be installed via:
```bash
pip install gymnasium[classic-control]
```
The environment is stochastic in terms of its initial state, within a given range. Moreover, it is considered one of the easier environments to solve with a policy.

In [1]:
import os
import gymnasium as gym
from gymnasium.wrappers.monitoring.video_recorder import VideoRecorder

This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson in “[*Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems*](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6313077&isnumber=6313056)”.

```
A. G. Barto, R. S. Sutton and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," in IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-13, no. 5, pp. 834-846, Sept.-Oct. 1983, doi: 10.1109/TSMC.1983.6313077.
Abstract: It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neurolike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.
```

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The pendulum is placed upright on the cart and the goal is to balance the pole by applying forces in the left and right direction on the cart.
<p align="center">
    <img 
         src="https://images.ctfassets.net/xjan103pcp94/4hLHnMXJN2EwwAXq2yYx9v/41b16121290d6c46b6b85492a572a4cf/cartPoleRemade.png"
         width="60%" 
         height="60%" 
    />
</p>

The code below loads the cartpole environment.

In [2]:
env = gym.make("CartPole-v0", render_mode='rgb_array')


In [3]:
result_path = '.gym_results'
if not os.path.exists(result_path):
    os.makedirs(result_path)

In [4]:
before_training = "%s/before_training.mp4" % (result_path)
video = VideoRecorder(env, before_training)

Let's now focus on understanding the environment by looking at the action space.

In [5]:
env.action_space

Discrete(2)

The output `Discrete(2)` means that there are two actions.

Indeed, the action is an `ndarray` with shape `(1,)` which can take values `{0, 1}`, indicating the direction of the fixed force the cart is pushed with.

| Value | Action                 |
|-------|------------------------|
| 0     | Push cart to the left  |
| 1     | Push cart to the right |

Note: The change in velocity produced by the applied force is not fixed; it depends on the angle of the pole. The pole's center of gravity changes the amount of energy needed to move the cart underneath it.
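As a small illustration of the action table, the sketch below maps each action value to its description and to the sign of the horizontal push. `ACTION_NAMES` and `force_direction` are hypothetical helpers introduced here, not part of Gymnasium, and `random.choice` is only a stand-in for `env.action_space.sample()`.

```python
import random

# Hypothetical lookup table mirroring the action table above.
ACTION_NAMES = {0: "Push cart to the left", 1: "Push cart to the right"}


def force_direction(action: int) -> int:
    """Return -1 for a push to the left (action 0), +1 for a push to the right (action 1)."""
    if action not in ACTION_NAMES:
        raise ValueError(f"invalid action: {action!r}")
    return 1 if action == 1 else -1


# Stand-in for env.action_space.sample(): pick one of the two actions at random.
action = random.choice(list(ACTION_NAMES))
print(action, ACTION_NAMES[action], force_direction(action))
```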

In reinforcement learning, the agent produces an action output and this action is sent to an environment which then reacts.

The environment produces an observation which we can see below:

In [6]:
env.reset()

(array([-0.0317756 ,  0.02235167,  0.0329018 , -0.0068141 ], dtype=float32),
 {})

The observation is a vector of shape (4,), containing the cart's x position, cart x velocity, the pole angle in radians (1 radian = 57.295 degrees), and the angular velocity of the pole.

| Num |      Observation      |         Min         |        Max        |
|:---:|:---------------------:|:-------------------:|:-----------------:|
| 0   | Cart Position         | -4.8                | 4.8               |
| 1   | Cart Velocity         | -Inf                | Inf               |
| 2   | Pole Angle            | ~ -0.418 rad (-24°) | ~ 0.418 rad (24°) |
| 3   | Pole Angular Velocity | -Inf                | Inf               |
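To make the four entries easier to read, the observation can be unpacked into named fields. This is a minimal sketch: `CartPoleObservation` and `describe` are hypothetical names introduced for illustration, and the values are the ones returned by `env.reset()` above.

```python
import math
from typing import NamedTuple, Sequence


class CartPoleObservation(NamedTuple):
    cart_position: float
    cart_velocity: float
    pole_angle: float             # radians
    pole_angular_velocity: float


def describe(observation: Sequence[float]) -> CartPoleObservation:
    """Wrap a length-4 observation in a named tuple, following the table's index order."""
    return CartPoleObservation(*(float(v) for v in observation))


# Initial observation printed by env.reset() above.
obs = describe([-0.0317756, 0.02235167, 0.0329018, -0.0068141])
print(obs)
print(f"pole angle: {math.degrees(obs.pole_angle):.2f} degrees")
```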

The numbers shown above are the initial observation after starting a new episode (`env.reset()`). With each timestep (and action), the observation values will change, depending on the state of the cart and pole.

While the ranges above denote the possible values of each element of the observation space, they do not reflect the values the state can take in an unterminated episode. In particular:
- The cart x-position (index 0) can take values in (-4.8, 4.8), but the episode terminates if the cart leaves the (-2.4, 2.4) range.
- The pole angle (index 2) can be observed in (-0.418, 0.418) radians (±24°), but the episode terminates if the pole angle leaves the (-0.2095, 0.2095) range (±12°).
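The two termination bounds above can be written as a small check. This is a sketch under the thresholds stated in the text (±2.4 for the cart position and ±12° for the pole angle); `is_terminated` is a hypothetical helper, not a Gymnasium function.

```python
import math

X_THRESHOLD = 2.4                     # cart position limit from the text
THETA_THRESHOLD = math.radians(12)    # 12 degrees, about 0.2095 rad


def is_terminated(observation) -> bool:
    """Return True if the cart position or the pole angle is outside the allowed range."""
    x, _x_dot, theta, _theta_dot = observation
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD


# Initial observation from env.reset() above: well inside both ranges.
print(is_terminated([-0.0317756, 0.02235167, 0.0329018, -0.0068141]))
# A state whose pole angle (-0.25 rad) is beyond 12 degrees.
print(is_terminated([0.0, 0.0, -0.25, 0.0]))
```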

With the following code the environment takes 100 steps (100 cycles), always taking a random action and printing the results. Note that the loop does not reset the environment when an episode terminates.

In [7]:
steps = 100
for step in range(steps):
    env.render()
    video.capture_frame()
    observation, reward, terminated, _, info = env.step( env.action_space.sample() )
    print("step", step, observation, reward, terminated, info)
video.close()
env.close()

step 0 [-0.03132856  0.21698669  0.03276552 -0.28893724] 1.0 False {}
step 1 [-0.02698883  0.0214132   0.02698677  0.01389689] 1.0 False {}
step 2 [-0.02656057  0.21613795  0.02726471 -0.27015072] 1.0 False {}
step 3 [-0.02223781  0.02063774  0.02186169  0.03100542] 1.0 False {}
step 4 [-0.02182505  0.21543947  0.0224818  -0.2547005 ] 1.0 False {}
step 5 [-0.01751626  0.02000385  0.01738779  0.04498791] 1.0 False {}
step 6 [-0.01711619  0.21487221  0.01828755 -0.24215868] 1.0 False {}
step 7 [-0.01281874  0.40972823  0.01344438 -0.5290176 ] 1.0 False {}
step 8 [-0.00462418  0.6046585   0.00286403 -0.817434  ] 1.0 False {}
step 9 [ 0.00746899  0.7997411  -0.01348465 -1.1092148 ] 1.0 False {}
step 10 [ 0.02346381  0.9950377  -0.03566895 -1.4060973 ] 1.0 False {}
step 11 [ 0.04336457  1.1905837  -0.0637909  -1.7097143 ] 1.0 False {}
step 12 [ 0.06717625  0.99625015 -0.09798518 -1.4375486 ] 1.0 False {}
step 13 [ 0.08710124  0.80246294 -0.12673615 -1.1770236 ] 1.0 False {}
step 14 [ 0.1031


step 37 [ 0.25165862 -0.26057294 -0.8958929  -2.7687273 ] 0.0 True {}
step 38 [ 0.24644716 -0.07267504 -0.9512674  -3.174377  ] 0.0 True {}
step 39 [ 0.24499366 -0.25993866 -1.014755   -3.2506351 ] 0.0 True {}
step 40 [ 0.23979488 -0.07680964 -1.0797677  -3.6453354 ] 0.0 True {}
step 41 [ 0.23825869 -0.26660123 -1.1526744  -3.7703593 ] 0.0 True {}
step 42 [ 0.23292667 -0.45741564 -1.2280816  -3.9228132 ] 0.0 True {}
step 43 [ 0.22377835 -0.2832024  -1.3065379  -4.2875314 ] 0.0 True {}
step 44 [ 0.2181143  -0.47869223 -1.3922884  -4.4947343 ] 0.0 True {}
step 45 [ 0.20854045 -0.676675   -1.4821831  -4.7313313 ] 0.0 True {}
step 46 [ 0.19500697 -0.5138625  -1.5768098  -5.0457907 ] 0.0 True {}
step 47 [ 0.18472971 -0.7189066  -1.6777256  -5.3416348 ] 0.0 True {}
step 48 [ 0.17035158 -0.92809635 -1.7845583  -5.6674447 ] 0.0 True {}
step 49 [ 0.15178965 -0.77712065 -1.8979071  -5.9067116 ] 0.0 True {}
step 50 [ 0.13624723 -0.9945713  -2.0160415  -6.289925  ] 0.0 True {}
step 51 [ 0.11635581

                                                              

Moviepy - Done !
Moviepy - video ready .gym_results/before_training.mp4


Here, `env.action_space.sample()` returns either 0 (left) or 1 (right).
The printed output above shows the following things:
- **step** (how many times it has cycled through the environment). In each timestep, an agent chooses an action, and the environment returns an observation and a reward.
- **observation** of the environment [x cart position, x cart velocity, pole angle (rad), pole angular velocity]
- **reward** achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward. The reward is 1 for every step taken in CartPole, including the termination step. Stepping the environment again after termination without calling `env.reset()` returns a reward of 0, as in the later steps of the output above.
- **terminated** is a boolean. It indicates whether it's time to reset the environment. Most tasks are divided into well-defined episodes, and `terminated` being `True` indicates the episode has ended. In CartPole, this happens when the pole tips too far (more than 12 degrees, or 0.20944 radians) or when the cart position exceeds ±2.4, meaning the center of the cart reaches the edge of the display. The episode length limit (200 steps for `CartPole-v0`) is reported separately through the fourth return value, `truncated`, which the code above discards as `_`. The task is considered solved when the average return is greater than or equal to 195.0 over 100 consecutive trials; this is a benchmark criterion, not a termination condition.
- **info** which is diagnostic information useful for debugging. It is empty for this cartpole environment.
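The values described above are typically consumed in a loop that ends the episode as soon as the environment signals it. The sketch below shows that pattern for any environment following the five-value `step` interface used in this notebook. `CoinFlipEnv` is a hypothetical stand-in with no physics, used only so the example runs without Gymnasium, and `run_episode` is an illustrative helper.

```python
import random


class CoinFlipEnv:
    """Hypothetical stand-in environment (not CartPole) with the same reset/step interface."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        terminated = self.rng.random() < 0.2    # random "failure"
        truncated = self.t >= self.max_steps    # time limit reached
        return [0.0, 0.0, 0.0, 0.0], 1.0, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    observation, info = env.reset()
    total_reward = 0.0
    while True:
        action = policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        # Stop on failure (terminated) or on the time limit (truncated).
        if terminated or truncated:
            return total_reward


print(run_episode(CoinFlipEnv(), policy=lambda obs: 0))
```

With a real environment, `run_episode(env, policy)` would be called with the object returned by `gym.make` instead of the stand-in.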

In [8]:
from base64 import b64encode
def render_mp4(videopath: str) -> str:
  """
  Gets a string containing an HTML video tag with a base64-encoded
  version of the MP4 video at the specified path.
  """
  # Read the file in binary mode; the context manager closes it afterwards
  with open(videopath, 'rb') as f:
    mp4 = f.read()
  base64_encoded_mp4 = b64encode(mp4).decode()
  return f'<video width=400 controls><source src="data:video/mp4;' \
         f'base64,{base64_encoded_mp4}" type="video/mp4"></video>'

In [9]:
from IPython.display import HTML
html = render_mp4(before_training)
HTML(html)

Let's now try to apply a simple policy.

The cart will move to the left when the pole leans to the left, and to the right when the pole falls to the right.

In [10]:
def simple_policy(observation):
    angle = observation[2]
    return 0 if angle < 0 else 1
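As a quick check of this rule, the sketch below applies the same policy to two observations copied from the logged output of the run that follows (episode 0, steps 0 and 2). The variable names `positive_angle` and `negative_angle` are introduced here for illustration.

```python
def simple_policy(observation):
    # Index 2 of the observation is the pole angle in radians.
    angle = observation[2]
    return 0 if angle < 0 else 1


# Observations copied from the run output (episode 0, steps 0 and 2).
positive_angle = [0.01814866, 0.16416955, 0.01225041, -0.27931136]
negative_angle = [0.02861434, 0.5541425, -0.00469793, -0.8586815]

print(simple_policy(positive_angle))  # angle >= 0, so action 1 (push right)
print(simple_policy(negative_angle))  # angle < 0, so action 0 (push left)
```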

In [11]:
simple_policy_training = "%s/simple_policy_training.mp4" % (result_path)
video = VideoRecorder(env, simple_policy_training)
totals = []
episodes = 100
for episode in range(episodes):
    episode_rewards = 0
    observation, _ = env.reset()
    steps = 1000 # set 1000 max steps since we don't want to run forever
    for step in range(steps):
        env.render()
        video.capture_frame()
        action = simple_policy(observation)
        observation, reward, terminated, truncated, info = env.step(action)
        episode_rewards += reward
        print("episodes", episode, "step", step, observation, reward, terminated, info, episode_rewards)
        # stop on failure (terminated) or when the time limit is reached (truncated)
        if terminated or truncated:
            break
    totals.append(episode_rewards)
video.close()
env.close()

episodes 0 step 0 [ 0.01814866  0.16416955  0.01225041 -0.27931136] 1.0 False {} 1.0
episodes 0 step 1 [ 0.02143205  0.35911462  0.00666418 -0.5681055 ] 1.0 False {} 2.0
episodes 0 step 2 [ 0.02861434  0.5541425  -0.00469793 -0.8586815 ] 1.0 False {} 3.0
episodes 0 step 3 [ 0.03969719  0.3590848  -0.02187156 -0.56747943] 1.0 False {} 4.0
episodes 0 step 4 [ 0.04687889  0.16427639 -0.03322115 -0.2817664 ] 1.0 False {} 5.0
episodes 0 step 5 [ 0.05016442 -0.03035633 -0.03885648  0.00025635] 1.0 False {} 6.0
episodes 0 step 6 [ 0.04955729 -0.2249001  -0.03885135  0.28043082] 1.0 False {} 7.0
episodes 0 step 7 [ 0.04505929 -0.41944695 -0.03324273  0.5606114 ] 1.0 False {} 8.0
episodes 0 step 8 [ 0.03667035 -0.6140869  -0.0220305   0.84263855] 1.0 False {} 9.0
episodes 0 step 9 [ 0.02438861 -0.80890137 -0.00517773  1.1283128 ] 1.0 False {} 10.0
episodes 0 step 10 [ 0.00821058 -1.0039551   0.01738852  1.4193673 ] 1.0 False {} 11.0
episodes 0 step 11 [-0.01186852 -0.80905265  0.04577587  1.132

                                                                 

Moviepy - Done !
Moviepy - video ready .gym_results/simple_policy_training.mp4




In [12]:
html = render_mp4(simple_policy_training)
HTML(html)

In [13]:
import numpy as np
def handle_policy_metrics(policy_result):
    print("""
    Policy bulletin: \n
            mean: {} \n
            std: {} \n
            min: {} \n
            max: {} \n
            """.format(np.mean(policy_result),np.std(policy_result), np.min(policy_result), np.max(policy_result)))
handle_policy_metrics(totals)


    Policy bulletin: 

            mean: 42.57 

            std: 8.859181677784917 

            min: 25.0 

            max: 68.0 
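The mean return reported here can be compared with the solved requirement mentioned earlier (an average return of at least 195.0 over 100 consecutive trials). The sketch below is one possible reading of that criterion, checking only the most recent 100 episodes. `is_solved` is a hypothetical helper and the return lists are made-up values, not results from this notebook.

```python
from statistics import mean


def is_solved(episode_returns, threshold=195.0, window=100):
    """Return True if the mean of the last `window` returns reaches `threshold`."""
    if len(episode_returns) < window:
        return False  # not enough episodes to evaluate the criterion
    return mean(episode_returns[-window:]) >= threshold


# Hypothetical return lists for illustration only.
print(is_solved([42.0] * 100))   # average far below 195.0
print(is_solved([200.0] * 100))  # average above 195.0
```

With a mean of 42.57 over 100 episodes, the simple policy is far from meeting this requirement.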

            


## References
- [An Introduction to Reinforcement Learning with OpenAI Gym, RLlib, and Google Colab](https://www.anyscale.com/blog/an-introduction-to-reinforcement-learning-with-openai-gym-rllib-and-google) by Michael Galarnyk and Sven Mika
- [OpenAI Gym CartPole: how it works, Tutorial](https://andreaprovino.it/openai-gym-cartpole-tutorial-how-it-works/) by Andrea Provino