Our first step is to import the libraries we need to run a Cart Pole game in an OpenAI Gym environment. Gym is a software library developed to simulate and test reinforcement learning (RL) algorithms. We will also import NumPy, a numerical computing library.

In [1]:
import gym
import numpy as np

Next, we’ll create the environment.

In [2]:
env = gym.make('CartPole-v1')

To run through episodes, let’s build a function that accepts the environment and a policy array as inputs. The function will play one episode and return its score, along with the observation of the game state recorded at every time step.

In [3]:
def play(env, policy):
    
    observation = env.reset()
    
    # create variables to track game status, score, and hold observations at each time step
    
    score = 0
    observations = []
    completed = False
    
    # play the game until the episode ends (capped at 3000 steps as a safeguard)
    
    for _ in range(3000):
        
        # record observations
        
        observations.append(observation.tolist())
        
        if completed:
            break
        
        # use the policy to decide on an action
        
        result = np.dot(policy, observation)
        
        if result > 0:
            action = 1
        else:
            action = 0
        
        # take a step using the action; env.step returns a snapshot of the
        # environment after the action is taken, the reward from that action,
        # whether the episode is completed, and diagnostic data for debugging
        # (this four-value return is the API of Gym versions before 0.26)
        
        observation, reward, completed, data = env.step(action)
        
        # record cumulative score
        
        score += reward
        
    # end the function by returning the cumulative score and full list of observations at each time step
    
    return score, observations
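
A note on versions: the play function above relies on the older Gym API, where env.reset() returns only the observation and env.step() returns four values. From Gym 0.26 onward (and in Gymnasium), reset returns an (observation, info) pair and step returns five values, with the completed flag split into terminated and truncated. Below is a minimal sketch of two helpers that tolerate both styles. The names reset_env and step_env are made up for illustration and are not part of Gym.

```python
def reset_env(env):
    """Return only the initial observation, whichever API the env uses."""
    result = env.reset()
    # Newer API: (observation, info dict). Older API: the observation itself.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_env(env, action):
    """Return (observation, reward, completed, data) for either API."""
    result = env.step(action)
    if len(result) == 5:
        # Newer API: the episode is over if it terminated or was truncated
        observation, reward, terminated, truncated, data = result
        return observation, reward, terminated or truncated, data
    # Older API: already in the four-value form used by play()
    return result
```

With these in place, the two calls inside play could be swapped for reset_env(env) and step_env(env, action) without touching the rest of the loop.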

Awesome! Now that our brave AI is able to play the game, let’s give it a policy to play with. In the absence of a clever strategy for devising a policy, we’ll start with four random weights, one per observation value, drawn uniformly from [-0.5, 0.5) so that they are centred around zero.

In [4]:
policy = np.random.rand(1,4) - 0.5
score, observations = play(env, policy)
print('Score:', score)

Score: 9.0
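
To see what a policy like this actually does, it can help to apply the decision rule from play to a hand-written observation, without running the game at all. The sketch below assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity), that a positive angle means the pole leans right, and that action 0 pushes the cart left while action 1 pushes it right. The helper choose_action and the example numbers are illustrative only.

```python
import numpy as np

def choose_action(policy, observation):
    """Same rule as in play(): action 1 if the weighted sum is positive, else 0."""
    result = np.dot(policy, observation)  # shape (1,) for a (1, 4) policy
    return 1 if result > 0 else 0

# Random weights, uniform in [-0.5, 0.5), shaped (1, 4) as in the cell above
random_policy = np.random.rand(1, 4) - 0.5

# A hand-picked policy that only looks at the pole angle (assumed index 2)
angle_policy = np.array([[0.0, 0.0, 1.0, 0.0]])

# Made-up observations: pole leaning right and pole leaning left
leaning_right = np.array([0.0, 0.0, 0.05, 0.0])
leaning_left = np.array([0.0, 0.0, -0.05, 0.0])

print(choose_action(angle_policy, leaning_right))   # 1
print(choose_action(angle_policy, leaning_left))    # 0
print(choose_action(random_policy, leaning_right))  # 0 or 1, depends on the draw
```

Each weight says how strongly one observation value pulls the decision towards action 1, so a random draw is just as likely to push the cart the wrong way as the right way, which is consistent with the low score above.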


After running the script, how did our agent perform? CartPole-v1 ends an episode after 500 steps and awards one point per step, so the maximum score is 500. In all likelihood, our agent scored far lower. A better strategy might be to generate lots of random policies and keep the one with the highest score. To do this, we use a variable that holds the score, observations, and policy of the best-performing game so far, and update it whenever a new policy beats it.

In [5]:
# create a tuple to hold the best score, observations, and policy

best = (0, [], [])

# generate 1000 random policies centred around 0 and keep the best performing one

for _ in range(1000):
    
    policy = np.random.rand(1,4) - 0.5
    
    score, observations = play(env, policy)
    
    if score > best[0]:
        best = (score, observations, policy)

print('Best score:', best[0])

Best score: 500.0


What is our best score now? Chances are, we have come up with a policy that is able to achieve the high score of 500. Our agent has beaten the game!

Where do we go from here? Well, that’s it for this post, but if we wanted to build a more robust system, we might consider some of the following approaches:

- Using an optimization algorithm to find the best policy instead of picking one at random (e.g. Deep Q-Learning, Proximal Policy Optimization, or Monte Carlo Tree Search)
- Testing the best policy that we obtained over many episodes to ensure that we didn’t just get lucky in a single episode
- Testing our policy on a version of Cart Pole with a higher top score than 500 to see how sustainable the policy is
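
The second suggestion, checking that the best policy was not just lucky, can be sketched as a small helper that replays a fixed policy several times and summarizes the scores. The name evaluate, its parameters, and the default of 100 episodes are assumptions made for illustration. The play_fn argument is expected to behave like the play function defined earlier, returning a score and a list of observations.

```python
import numpy as np

def evaluate(play_fn, env, policy, episodes=100):
    """Play `episodes` games with a fixed policy and summarize the scores."""
    scores = [play_fn(env, policy)[0] for _ in range(episodes)]
    return {
        'mean': float(np.mean(scores)),
        'min': float(np.min(scores)),
        'max': float(np.max(scores)),
    }

# Usage with the objects from this post would look like:
# summary = evaluate(play, env, best[2])
# print(summary)
```

If the minimum is far below 500, the policy probably owed its winning score to a favourable starting state rather than to good weights.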