# Introduction to Reinforcement Learning with OpenAI Gym
## About Gym
`Gym` is an open source Python library for developing and comparing `reinforcement learning` algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. 
* The `Gym` documentation is available [here](https://www.gymlibrary.dev/).
* `Gym` also has a Discord server for development discussion, which you can join [here](https://discord.gg/nHg2JRN489).
* `Gym`'s source repository is [here](https://github.com/openai/gym).


## The Imports

In [13]:
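#Unquoted, zsh treats the square brackets as a glob pattern and reports
#`zsh:1: no matches found: gym[pong]`.
#Either disable that check first or quote the package spec:
#!setopt no_nomatch
#!pip install "gym[pong]"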


In [1]:
import gym
import math
import imageio.v2 as imageio
import os
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
from joblib import load, dump

# Working with `CartPole-v0` environment
* This environment is from the [classic control group](https://www.gymlibrary.dev/environments/classic_control/)
* Note that the reference material below describes `CartPole-v1`, not `CartPole-v0`. The two versions are very similar and differ only in a few details. For example, the reward threshold is 475 in `CartPole-v1` and 195 in `CartPole-v0`.

![Cartpole-v0](figs/Cartpole-v1.png)

* **Goal**: control the cart (i.e., the platform) so that the pole, which is attached to it at its bottom end, stays upright.
* **Trick**: The pole tends to fall to the right or left, so you need to balance it by moving the cart right or left at every step.

In [16]:
env = gym.make("CartPole-v0")



## State space (observable)
* The observation is an array of 4 floating-point numbers: [cart position, cart velocity, pole angle, pole angular velocity]
    1. the cart's position along the track (x-coordinate)
    2. the cart's velocity
    3. the pole's angle from the vertical, in radians (1 radian ≈ 57.296 degrees)
    4. the pole's angular velocity (rotation rate)
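
To make the four components concrete, here is a minimal sketch that unpacks an observation into named fields and converts the pole angle to degrees. The `describe_observation` helper and its field names are hypothetical, chosen only for illustration; the sample values are taken from one of the `reset()` printouts in this notebook.

```python
import math

def describe_observation(obs):
    """Unpack a CartPole observation into named fields (hypothetical helper)."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs
    return {
        "cart_position": cart_position,
        "cart_velocity": cart_velocity,
        "pole_angle_rad": pole_angle,
        "pole_angle_deg": math.degrees(pole_angle),  # radians -> degrees
        "pole_angular_velocity": pole_angular_velocity,
    }

# Sample observation copied from a reset() printout in this notebook
sample_obs = [0.04054569, -0.03567298, 0.0173007, 0.02051942]
fields = describe_observation(sample_obs)
for name, value in fields.items():
    print(f"{name} = {value:.4f}")
```

Naming the fields this way makes it easier to see that the pole starts out almost vertical: the initial angle is about one degree.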

In [18]:
obs,info = env.reset()
print('obs = {}'.format(obs))
#Example printout:
# obs = [-0.02007766 -0.00363281 -0.0034504  -0.02222458]

obs = [ 0.04054569 -0.03567298  0.0173007   0.02051942]


## The problem is to find the `best action` per step
* We need to convert these 4 observations into actions.
* But how do we learn to balance the system from the reward alone, without knowing what the 4 observed numbers mean?
* Here, the reward is 1, and it is given on every time step.
* The episode continues until the pole falls (it also ends if the cart moves too far from the center or the step limit is reached).
* To accumulate more reward, we need to move the platform so that the pole stays up for as long as possible.
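
As a point of comparison for any learned agent, here is a minimal sketch of a hand-written mapping from observation to action. It is not a learning algorithm and does not come from the Gym library: `lean_policy` is a hypothetical baseline that pushes the cart toward the side the pole is leaning to, assuming a positive pole angle means the pole leans to the right.

```python
def lean_policy(obs):
    """Return 0 (push left) or 1 (push right) from the pole angle alone.

    Hypothetical baseline: assumes obs[2] is the pole angle and that a
    positive angle means the pole leans to the right.
    """
    pole_angle = obs[2]
    return 1 if pole_angle > 0 else 0

# Observations copied from printouts in this notebook
leaning_right = [0.04054569, -0.03567298, 0.0173007, 0.02051942]
leaning_left = [-0.02007766, -0.00363281, -0.0034504, -0.02222458]

print(lean_policy(leaning_right))
print(lean_policy(leaning_left))

# Because every step yields a reward of 1, the return of an episode
# equals the number of steps the pole stayed up.
rewards = [1.0] * 25  # e.g. an episode that lasted 25 steps
print(sum(rewards))
```

A reinforcement learning agent has to discover a rule like this, or a better one, purely from the rewards it collects.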

In [19]:
print('env.action_space = {}'.format(env.action_space))
#Example printout:
# env.action_space = Discrete(2)
#only 2 actions: 0 or 1, where 0 means pushing the platform to the left, 1 means to the right.

env.action_space = Discrete(2)


In [20]:
print('env.observation_space = {}'.format(env.observation_space))
#Example printout:
# env.observation_space = Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
#The observation space is 4-dimensional, and each dimension is as follows:
#Num Observation             Min                   Max
#0   Cart Position           -4.8                  4.8
#1   Cart Velocity           -Inf                  Inf
#2   Pole Angle              ~ -0.418 rad (-24°)   ~ 0.418 rad (24°)
#3   Pole Angular Velocity   -Inf                  Inf
#Note: the episode terminates earlier, once the cart position leaves ±2.4 or the pole angle leaves ±12°.
#env.observation_space.low and env.observation_space.high hold the minimum and maximum values for each observation variable.
print('env.observation_space.high = {}'.format(env.observation_space.high))
print('env.observation_space.low = {}'.format(env.observation_space.low))
#Example printout:
#env.observation_space.high = [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
#env.observation_space.low = [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]

env.observation_space = Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
env.observation_space.high = [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
env.observation_space.low = [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
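
The printed bounds for cart position and pole angle are twice the limits at which an episode terminates. According to the Gym CartPole documentation, those limits are ±2.4 for the cart position and ±12° for the pole angle. The sketch below reproduces that relationship with plain Python; the constant names and the `is_terminal` helper are hypothetical and are not part of the Gym API.

```python
import math

X_THRESHOLD = 2.4                    # cart position limit that ends an episode
THETA_THRESHOLD = math.radians(12)   # pole angle limit that ends an episode

def is_terminal(obs):
    """Return True if the observation is outside the termination limits (hypothetical helper)."""
    cart_position, _, pole_angle, _ = obs
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD

# The observation-space bounds are twice the termination limits.
print(2 * X_THRESHOLD)       # compare with the first entry of observation_space.high
print(2 * THETA_THRESHOLD)   # compare with the third entry of observation_space.high

print(is_terminal([0.0, 0.0, 0.10, 0.0]))   # about 5.7 degrees, still balanced
print(is_terminal([0.0, 0.0, 0.25, 0.0]))   # about 14.3 degrees, pole has fallen
```

The wider bounds mean that the final observation of a failed episode still lies inside the declared observation space.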


## Apply a specific action at a step
* How about going left, i.e., action=0 from the `action_space`?
    - The result is a `new state` (observation), together with the reward and the `terminated` and `truncated` flags.

In [21]:
observation, reward, terminated, truncated, info = env.step(0)
print('observation = {}'.format(observation))
print('reward = {}'.format(reward))
print('terminated = {}'.format(terminated))
print('truncated = {}'.format(truncated))
print('info = {}'.format(info))
#Example printout:
#observation = [-0.02728556 -0.22667485 -0.01062018  0.3176722 ]
#reward = 1.0
#terminated = False
#truncated = False
#info = {}

observation = [ 0.03983223 -0.23103872  0.01771109  0.3186103 ]
reward = 1.0
terminated = False
truncated = False
info = {}


## Apply a random action at a step
* `sample()` returns a random sample from the space it is called on.
* Below, we sample from the `action_space`.
* `sample()` can also be used on the `observation_space`, although why would we want to do that here?
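
To illustrate what sampling from a two-action space amounts to, here is a minimal sketch that uses Python's standard `random` module as a stand-in. It is not Gym's implementation: `sample_discrete` is a hypothetical helper, and the seed is fixed only so that repeated runs give the same sequence.

```python
import random

def sample_discrete(n, rng):
    """Draw a uniform integer from {0, ..., n-1} (hypothetical stand-in for sampling Discrete(n))."""
    return rng.randrange(n)

rng = random.Random(0)  # fixed seed so the sequence is reproducible
samples = [sample_discrete(2, rng) for _ in range(1000)]

print(samples[:10])
# Fraction of "push right" actions; for a uniform sampler this is close to 0.5
print(sum(samples) / len(samples))
```

Because each action is equally likely, a random agent pushes left and right about equally often, regardless of what the pole is doing.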

In [24]:
action = env.action_space.sample()
print('action = {}'.format(action))

action = 1


In [25]:
#Let's apply another random action with sampling
action = env.action_space.sample()
print('action = {}'.format(action))

action = 0


In [26]:
#Let's apply another random action with sampling
action = env.action_space.sample()
print('action = {}'.format(action))

action = 1


## Here is a loop with a random `CartPole-v0` agent

In [None]:
#!/Users/ashis/venv-directory/venv-ml-p3.10/bin/python3.10
#Make this Python file executable and run it directly, without passing it to the Python interpreter,
#because the interpreter listed on the first line will be invoked. Good luck!
#$ chmod +x CartPole-v0-code3.py
#$ ./CartPole-v0-code3.py
import gym
from tqdm import tqdm


#The CartPole-v0 environment with a random agent
# Goal: control the cart (i.e., the platform) so that the pole attached to it at its bottom end stays upright.
# Trick: The pole tends to fall to the right or left, so you need to balance it by moving the cart right or left at every step.

env = gym.make("CartPole-v0",render_mode='human')

#Below, we have created the environment and now initialize a few variables.
total_reward = 0.0
total_steps = 0
observation, info = env.reset(seed=42)

while True:
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    total_steps += 1

    #Stop when the pole falls (terminated) or the time limit is reached (truncated)
    if terminated or truncated:
        break

env.close()
print('Episode ended in {} steps\nTotal reward accumulated = {}'.format(total_steps,total_reward))

#In a typical run, this random agent lasts only roughly 10 to 30 steps before the pole falls and the episode ends.
#Most Gym environments have a `reward boundary`: the average reward that the agent should achieve over 100 consecutive episodes for the environment to count as solved.
#For CartPole-v0, the boundary is 195. That means, on average, the agent must keep the pole up for 195 time steps or longer.
#So, our random agent's performance is extremely poor.
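
The reward boundary check described above can be sketched as a small helper. This is an illustration under stated assumptions and not part of the Gym API: `is_solved` is a hypothetical function, and the episode rewards below are made-up placeholder values rather than measured results.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if the mean reward over the last `window` episodes reaches `threshold`."""
    if len(episode_rewards) < window:
        return False  # not enough episodes to judge yet
    recent = episode_rewards[-window:]
    return sum(recent) / window >= threshold

# Placeholder data for illustration only (not measured results)
random_agent_rewards = [20.0] * 100   # an agent that lasts about 20 steps per episode
strong_agent_rewards = [200.0] * 100  # an agent that always reaches the 200-step limit

print(is_solved(random_agent_rewards))
print(is_solved(strong_agent_rewards))
print(is_solved([200.0] * 50))  # fewer than 100 episodes recorded
```

In practice, the list would be filled by running the episode loop repeatedly and appending each episode's `total_reward`.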

# Thanks for your attention