# Introduction

The goal for this notebook is to tackle the Classic Control problems found on OpenAI's Gym. These problems include **Acrobot**, **CartPole**, **MountainCar**, and **Pendulum**. 

## CartPole-v0

The CartPole is a very simple and classic example of a reinforcement learning problem first described by Sutton, Barto, and Anderson [Barto83]. From the OpenAI Gym website, "A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center."

### Exploring the Problem

I want to get a better understanding of the CartPole problem: what are the possible actions, what are the states, and so on. I'll start with some exploration.

In [1]:
import gym
env = gym.make('CartPole-v1')
print("Action space:", env.action_space)
print("Observation space:", env.observation_space)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Action space: Discrete(2)
Observation space: Box(4,)


This gives some insight: there are two possible actions, 0 or 1, and the state consists of 4 values. Let's look at the bounds of those values to get a feel for the state space.

In [2]:
print("Observation space upper bound:", env.observation_space.high)
print("Observation space lower bound:", env.observation_space.low)

Observation space upper bound: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Observation space lower bound: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
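
For reference, the four observation components in Gym's CartPole are cart position, cart velocity, pole angle (in radians), and pole angular velocity. The sketch below labels each component and converts the angle bound to degrees. It hard-codes the upper bounds printed above as an assumption, rather than reading them from `env`, so that it runs on its own.

```python
import numpy as np

# Upper bounds copied from the printed output above (hard-coded for illustration).
high = np.array([4.8000002e+00, 3.4028235e+38, 4.1887903e-01, 3.4028235e+38],
                dtype=np.float32)
names = ["cart position", "cart velocity", "pole angle (rad)", "pole angular velocity"]

float32_max = np.finfo(np.float32).max
unbounded = [name for name, bound in zip(names, high) if bound == float32_max]

for name, bound in zip(names, high):
    note = "  (float32 max, effectively unbounded)" if name in unbounded else ""
    print("{:>22}: +/-{:.4g}{}".format(name, float(bound), note))

# Convert the pole-angle bound from radians to degrees.
angle_bound_degrees = float(np.degrees(high[2]))
print("Angle bound in degrees:", round(angle_bound_degrees, 1))
```

The two velocity components are the ones bounded only by the float32 maximum. The position and angle bounds are finite.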


I immediately see an issue here: two of the observation values are effectively unbounded, spanning a range of about 6.8e+38 (the full float32 range). This can cause problems when building value functions, because the number of possible state configurations is enormous. For now we will ignore it and see whether it really causes issues in practice. We'll start with Sarsa, an on-policy TD method.
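
For reference, the standard one-step Sarsa update, written with step size $\alpha$ and discount $\gamma$ (the `discount` argument in this notebook), is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Here $A_{t+1}$ is the action actually selected by the current ε-soft policy in $S_{t+1}$, which is what makes the method on-policy. Replacing $Q(S_{t+1}, A_{t+1})$ with $\max_a Q(S_{t+1}, a)$ would give Q-learning instead.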

In [32]:
from collections import defaultdict
import numpy as np

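def cartpole_sarsa(env, epsilon = 0.1, num_of_iter=10000, discount = 0.9):
    # Action values start at 0.0 (floats, so updates are not truncated); index i holds the value of action i.
    Q = defaultdict(lambda: np.zeros(2))
    
    for iter_i in range(1, num_of_iter):
        # Only the two velocity components (indices 1 and 3) are used as the state key.
        current_state = tuple(env.reset()[1::2])
        done = False
        if((iter_i-1) % 1000 == 0):
            print("Finished iteration {}".format(iter_i+1))
        # Sarsa chooses the first action before the loop, then the next action before each update.
        policy = e_soft_policy_from_Q_for_state(Q, current_state, epsilon)
        action = np.random.choice(len(policy), p=policy)
        while not done:
            next_state, reward, done, info = env.step(action)
            next_state = tuple(next_state[1::2])
            
            policy = e_soft_policy_from_Q_for_state(Q, next_state, epsilon)
            next_action = np.random.choice(len(policy), p=policy)
            
            # On-policy TD target: bootstrap from the action actually chosen; terminal states have value 0.
            target = reward + (0 if done else discount * Q[next_state][next_action])
            # Update only the action that was taken, with a step size that decays per episode.
            Q[current_state][action] += (1 / iter_i) * (target - Q[current_state][action])
            current_state, action = next_state, next_action
    return Q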

In [16]:
def e_soft_policy_from_Q_for_state(Q, state, epsilon):
    state_actions = Q[state]
    
    max_action = np.argmax(state_actions)
    probabilities = np.zeros(len(state_actions)) # float array; np.zeros_like would inherit Q's dtype and truncate the probabilities if Q holds integers
    
    for i in range(len(state_actions)):
        if i == max_action:
            probabilities[i] = 1 - epsilon + epsilon / len(state_actions)
        else:
            probabilities[i] = epsilon / len(state_actions)
    return probabilities
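
As a quick sanity check on the ε-soft rule above, here is a minimal self-contained sketch. `epsilon_soft_probabilities` is a hypothetical vectorized variant written for this example, not the notebook's function, but it assigns the same probabilities: `epsilon / n` to every action, plus `1 - epsilon` to the greedy one.

```python
import numpy as np

def epsilon_soft_probabilities(action_values, epsilon):
    """Hypothetical vectorized variant of the epsilon-soft rule."""
    action_values = np.asarray(action_values, dtype=float)
    n_actions = len(action_values)
    # Every action gets the exploration share epsilon / n ...
    probabilities = np.full(n_actions, epsilon / n_actions)
    # ... and the greedy action additionally gets 1 - epsilon.
    probabilities[np.argmax(action_values)] += 1.0 - epsilon
    return probabilities

# Action 0 has the higher value, so it receives most of the probability mass.
probs = epsilon_soft_probabilities([1.0, 0.0], epsilon=0.1)
print(probs)

# With tied values, np.argmax returns the first index, so action 0 is treated as greedy.
tie = epsilon_soft_probabilities([0.0, 0.0], epsilon=0.1)
print(tie)
```

With two actions and `epsilon = 0.1`, the greedy action gets probability of about 0.95 and the other about 0.05. Because `np.argmax` breaks ties by index, a state whose action values are all still zero favors action 0.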

In [17]:
def test_policy(env, policy, iterations=5):
    for _ in range(iterations):
        # Use the same state key as training: only the velocity components (indices 1 and 3).
        state = tuple(env.reset()[1::2])
        done = False
        while not done:
            env.render()
            print(policy[state])
            state, reward, done, info = env.step(np.random.choice([0, 1], p=policy[state]))
            state = tuple(state[1::2])

    env.close()

In [33]:
import gym
import numpy as np
env = gym.make('CartPole-v1')

epsilon = 0.1
number_of_iterations = 10000
discount = 0.9

Q = cartpole_sarsa(env, epsilon, number_of_iterations, discount)
policy = defaultdict(lambda: np.array([0.5, 0.5]))


WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Finished iteration 2
Finished iteration 1002
Finished iteration 2002
Finished iteration 3002
Finished iteration 4002
Finished iteration 5002
Finished iteration 6002
Finished iteration 7002
Finished iteration 8002
Finished iteration 9002
Finished iteration 10002
Finished iteration 11002
Finished iteration 12002
Finished iteration 13002
Finished iteration 14002
Finished iteration 15002
Finished iteration 16002
Finished iteration 17002
Finished iteration 18002
Finished iteration 19002
Finished iteration 20002
Finished iteration 21002
Finished iteration 22002
Finished iteration 23002
Finished iteration 24002
Finished iteration 25002
Finished iteration 26002
Finished iteration 27002
Finished iteration 28002
Finished iteration 29002
Finished iteration 30002
Finished iteration 31002
Finished iteration 32002
Finished iteration 33002
Finished iteration 34002
Finished iteration 35002
Finis

In [31]:
for state, av in Q.items():
    policy[state] = e_soft_policy_from_Q_for_state(Q, state, epsilon)

for k,v in policy.items():
    if v[1] != 0.05:
        print(v)

In [13]:
test_policy(env, policy)

[0.5 0.5]
[0.5 0.5]
[0.5 0.5]
[... the same line repeats for the remaining timesteps ...]
[0.5 0.5]
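
When every lookup prints the default `[0.5 0.5]`, one thing worth checking is whether the keys being looked up were ever stored. A `defaultdict` silently returns (and stores) its default for any unseen key. That includes a key of a different tuple length and a float-valued state that differs only slightly from a visited one. The sketch below uses a small hypothetical policy table, with made-up values rather than values from the run above, to show both cases.

```python
from collections import defaultdict
import numpy as np

# Hypothetical policy table: one entry learned for a 2-value state key.
policy = defaultdict(lambda: np.array([0.5, 0.5]))
policy[(0.01, -0.02)] = np.array([0.95, 0.05])

seen_key = (0.01, -0.02)
longer_key = (0.03, 0.01, 0.04, -0.02)   # different tuple length than the stored key
nearby_key = (0.01, -0.020001)           # almost, but not exactly, the stored floats

for key in (seen_key, longer_key, nearby_key):
    print(key, "->", policy[key])

# Each missed lookup also inserted a new default entry into the table.
print("Entries after the lookups:", len(policy))
```

Only the exact stored key returns the learned probabilities. The other two lookups fall back to the uniform default and add entries to the table.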
