# Intro to Gym and Cartpole

This question serves as an introduction to the OpenAI Gym API.

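To begin, install Gym with the command `pip install gym`. The code below uses the five-value return of `env.step` (with separate `terminated` and `truncated` flags), which requires gym 0.26 or later.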

Gym is a set of classical RL environments and problems that were once used for RL research.

Today, these environments are mostly unused in research. However, the general API established in the library continues to live on.

In the next several problems, we will use Gym on Cartpole, a classical RL environment.


In [1]:
import numpy as np
import gym

# To make output consistent
np.random.seed(42)

# This makes the cartpole env
env = gym.make('CartPole-v1')

env.action_space.seed(42)
env.observation_space.seed(42)

obs, info = env.reset()
print(obs)

[-0.03250501  0.0084485  -0.02947107 -0.01222387]


In [2]:
# This shows that the action space is discrete and contains 2 possible actions.
print(env.action_space)

# For discrete action spaces, the action is given as an integer in the range [0, n - 1], where n is the number of actions.
# For the case of cartpole, this means the action is either 0 or 1:
# 0 pushes the cart to the left and 1 pushes the cart to the right.


Discrete(2)


In [3]:
# To actually move the env forward, we must tell it what action should be taken.
action = 0
next_obs, reward, terminated, truncated, info = env.step(action)

# We see that stepping forward with env.step() returns the next observation, i.e. the result
# of taking the given action in the environment. It also returns the reward, the terminated flag,
# which tells us if the episode has ended, and the truncated flag, which is set when a time limit cuts the episode short.
# Common reasons for the episode to end are succeeding or failing a

In [4]:
def sin_policy(obs):
    # Generally a policy will take some function of the observation and return an action.
    # For example:
    sin_obs = np.sin(obs[1])
    policy_action = None
    if sin_obs > 0:
        policy_action = 0
    else:
        policy_action = 1
    return policy_action
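
As a quick worked example, the sketch below applies the same rule to the two initial observations printed in this notebook. It uses `math.sin` in place of `np.sin` so that it runs with only the standard library, and the name `sin_policy_sketch` is hypothetical (it is not part of the original code).

```python
import math

def sin_policy_sketch(obs):
    # Same rule as sin_policy: obs[1] is the cart velocity.
    # Push left (0) when its sine is positive, otherwise push right (1).
    return 0 if math.sin(obs[1]) > 0 else 1

# Initial observation printed by In [1]: cart velocity is slightly positive.
first_obs = [-0.03250501, 0.0084485, -0.02947107, -0.01222387]
# Initial observation printed by In [5]: cart velocity is slightly negative.
second_obs = [-0.03653543, -0.00116871, 0.02932162, -0.00582582]

print(sin_policy_sketch(first_obs))   # 0
print(sin_policy_sketch(second_obs))  # 1
```

In effect, the policy always pushes against the current direction of cart motion.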

In [5]:
# Usually, we use the policy to write a for loop over the environment, moving it forward according to the policy.
obs, info = env.reset()
print(obs)
horizon = 500
terminated = False
for i in range(horizon):
    if terminated is False:
        action = sin_policy(obs)
        # action = random_policy(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
        print(next_obs)
        print(F"Made it to step: {i}")
        obs = next_obs

[-0.03653543 -0.00116871  0.02932162 -0.00582582]
[-0.03655881  0.19352072  0.0292051  -0.28911513]
Made it to step: 0
[-0.03268839 -0.00200525  0.0234228   0.01263385]
Made it to step: 1
[-0.0327285   0.19277309  0.02367548 -0.27256784]
Made it to step: 2
[-0.02887304 -0.00267855  0.01822412  0.02748738]
Made it to step: 3
[-0.02892661  0.19217739  0.01877387 -0.25939038]
Made it to step: 4
[-0.02508306 -0.00320748  0.01358606  0.03915446]
Made it to step: 5
[-0.02514721  0.19171704  0.01436915 -0.24921116]
Made it to step: 6
[-0.02131287 -0.00360714  0.00938492  0.04796924]
Made it to step: 7
[-0.02138501  0.191379    0.01034431 -0.24173795]
Made it to step: 8
[-0.01755743 -0.00388918  0.00550955  0.05418982]
Made it to step: 9
[-0.01763522  0.19115333  0.00659335 -0.23674972]
Made it to step: 10
[-0.01381215 -0.00406219  0.00185835  0.05800563]
Made it to step: 11
[-0.01389339  0.19103307  0.00301847 -0.23409039]
Made it to step: 12
[-0.01007273 -0.00413188 -0.00166334  0.05954313]


In [6]:
# The full code for cartpole is here:
# https://github.com/openai/gym/blob/master/gym/envs/classic_contro

# Please answer the following questions.


# 1. Implement the function random_policy

This policy takes the observation, ignores it, and returns a random action. It should sample the actions uniformly, with equal probability of any action.



In [7]:
import random
from random import randint

In [8]:
def random_policy(obs):
    # Takes an observation and returns a random action.
    policy_action = randint(0,1)
    return policy_action
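
A quick sanity check that such a policy samples both actions with roughly equal probability might look like the sketch below. The name `random_policy_sketch`, the `rng` parameter, and the seed value are assumptions made for illustration. Note that `np.random.seed` does not seed Python's `random` module, so a separate `random.Random` instance is used here to make the check repeatable.

```python
import random
from collections import Counter

def random_policy_sketch(obs, rng=random):
    # Ignore the observation and pick 0 or 1 with equal probability.
    return rng.randint(0, 1)

rng = random.Random(0)  # hypothetical fixed seed so the check is repeatable
n_samples = 10_000
counts = Counter(random_policy_sketch(None, rng) for _ in range(n_samples))

# Both counts should be close to n_samples / 2.
print(counts)
```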

# 2. Compare the performance of sin_policy and random_policy.

Do this by running both policies 100 times until termination. Then report the mean and variance of the number of steps achieved by each policy.
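
One possible way to structure this comparison is to count steps explicitly in a helper, stopping when either `terminated` or `truncated` is set. The sketch below is an illustration under stated assumptions: `ToyEnv` is a hypothetical stand-in that only mimics the `reset`/`step` return shapes used in this notebook, and `run_episode` and `evaluate` are illustrative names, not Gym functions.

```python
import statistics as stat

class ToyEnv:
    """Hypothetical stand-in for a Gym env: terminates after `lifetime` steps."""
    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.lifetime
        # Mirrors the five-value return of env.step:
        # obs, reward, terminated, truncated, info.
        return [0.0, 0.0, 0.0, 0.0], 1.0, terminated, False, {}

def run_episode(env, policy, horizon=500):
    """Return the number of steps taken before the episode ends or the horizon is hit."""
    obs, info = env.reset()
    steps = 0
    for _ in range(horizon):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        steps += 1
        if terminated or truncated:
            break
    return steps

def evaluate(env, policy, episodes=100, horizon=500):
    """Mean and sample variance of episode length over several episodes."""
    lengths = [run_episode(env, policy, horizon) for _ in range(episodes)]
    return stat.mean(lengths), stat.variance(lengths)

print(run_episode(ToyEnv(lifetime=7), lambda obs: 0))                  # 7
print(run_episode(ToyEnv(lifetime=1000), lambda obs: 0, horizon=500))  # 500
```

Counting inside the helper avoids off-by-one differences between policies and also records episodes that reach the horizon without terminating.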

In [9]:
import statistics as stat

In [10]:
# Sin_policy:

num_steps = []
for j in range(100):
    obs, info = env.reset()
    horizon = 500
    terminated = False
    for i in range(horizon):
        if terminated is False:
            action = sin_policy(obs)
            next_obs, reward, terminated, truncated, info = env.step(action)
            obs = next_obs
        else:
            # terminated was set on the previous iteration, so exactly i steps have been taken.
            num_steps.append(i)
            break
            
[stat.mean(num_steps), stat.variance(num_steps)]

[38.84, 232.13575757575757]
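
Note that `stat.variance` computes the sample variance (dividing by n - 1), while `stat.pvariance` divides by n. A minimal sketch of the difference, using made-up episode lengths that are not results from this notebook:

```python
import statistics as stat

lengths = [10, 20, 30, 40]  # hypothetical episode lengths, for illustration only

mean = stat.mean(lengths)
sample_var = stat.variance(lengths)       # sum of squared deviations / (n - 1)
population_var = stat.pvariance(lengths)  # sum of squared deviations / n

print(mean, sample_var, population_var)
```

With 100 episodes the two estimates differ by only about one percent, so either is reasonable as long as the same one is used for every policy.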

In [11]:
# Random policy:

num_steps = []
for j in range(100):
    obs, info = env.reset()
    horizon = 500
    terminated = False
    for i in range(horizon):
        if terminated is False:
            action = random_policy(obs)
            next_obs, reward, terminated, truncated, info = env.step(action)
            obs = next_obs
        else:
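            # terminated was set on the previous iteration, so exactly i steps have been taken
            # (the same count used for sin_policy above).
            num_steps.append(i)
            break

[stat.mean(num_steps), stat.variance(num_steps)]

[21.76, 119.1741414141414]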

# 3. Try to devise a hand-coded policy that beats sin_policy.

This policy should NOT use any machine learning. It should be a deterministic function of the input. Report the mean and variance of the number of steps your policy achieves across 100 episodes.

In [12]:
def my_policy(obs):
    x, v, theta, v_theta = obs
    sin_obs = np.sin(theta + .1*v_theta)
    policy_action = None
    if sin_obs > 0:
        policy_action = 1
    else:
        policy_action = 0
    return policy_action
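
One way to read this policy: for arguments in (-π, π), sin(x) has the same sign as x, so the rule amounts to pushing the cart toward the side the pole is predicted to lean, with `theta + 0.1 * v_theta` acting as a short look-ahead of the pole angle. The sketch below checks that equivalence on a grid of hypothetical angle and angular-velocity values. The names `my_policy_sketch` and `sign_policy` are illustrative only, and `math.sin` stands in for `np.sin`.

```python
import math

def my_policy_sketch(obs):
    # Same rule as my_policy, written with math.sin.
    x, v, theta, v_theta = obs
    return 1 if math.sin(theta + 0.1 * v_theta) > 0 else 0

def sign_policy(obs):
    # Equivalent rule without the sine, valid while |theta + 0.1 * v_theta| < pi.
    x, v, theta, v_theta = obs
    return 1 if theta + 0.1 * v_theta > 0 else 0

grid = [i / 10 for i in range(-20, 21)]  # hypothetical test values in [-2, 2]
mismatches = [
    (theta, v_theta)
    for theta in grid
    for v_theta in grid
    if my_policy_sketch([0.0, 0.0, theta, v_theta]) != sign_policy([0.0, 0.0, theta, v_theta])
]
print(len(mismatches))  # 0
```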

In [13]:
num_steps = []
horizon = 500
for j in range(100):
    obs, info = env.reset()
    terminated = False
    for i in range(horizon):
        if terminated is False:
            action = my_policy(obs)
            next_obs, reward, terminated, truncated, info = env.step(action)
            obs = next_obs
        else:
            # terminated was set on the previous iteration, so exactly i steps have been taken.
            num_steps.append(i)
            break
        if i+1 == horizon:
            num_steps.append(i+1)
            
            
[stat.mean(num_steps), stat.variance(num_steps)]

[500, 0]