# Brute Forcing the Cartpole problem 2

This is a continuation of the previous cartpole brute-force notebook. I made the claim that extending the episode length required to pass cartpole to 500 steps would make this brute-force method redundant. Upon testing this, I found that we could in fact still find solutions.

Read more here: https://medium.com/@twocolossi/brute-forcing-the-cartpole-problem-4d04c9c34b12

## Imports


In [47]:
import gym
from gym import envs
import numpy as np
import time

## Policy Generator

To brute force the problem, we discretise the state space. The 'Pole Angle' and 'Pole Velocity At Tip' observations are split into 3 and 4 buckets respectively, giving 12 possible states of the environment (we ignore the cart position and cart velocity observations). Cartpole only has 2 actions, so with 12 states we have 4096 (2^12) deterministic greedy policies (a policy that always picks the same action in a given state).

To create the policies, we convert the numbers 0 to 4095 to binary and reshape them to a 3 by 4 matrix. 

In [48]:
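def createPolicy(policy_id):
    # Bit k of policy_id is the action (0 = push left, 1 = push right) for discrete state k.
    binary = unpackbits(np.array([policy_id]), 12)
    # Rows index the pole angle bucket, columns the pole velocity bucket.
    return np.reshape(binary, (3,4))

#Credit for this function https://stackoverflow.com/a/51509307
def unpackbits(x, num_bits):
    # Returns the lowest num_bits bits of each element of x, least significant bit first.
    xshape = list(x.shape)
    x = x.reshape([-1,1])
    to_and = 2**np.arange(num_bits).reshape([1,num_bits])
    return (x & to_and).astype(bool).astype(int).reshape(xshape + [num_bits])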
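
As a quick sanity check on the encoding, here is a minimal pure-Python sketch that decodes a policy index the same way (least significant bit first, row-major into 3 rows of 4). The names `decode_policy`, `N_ANGLE_BUCKETS` and `N_VELOCITY_BUCKETS` are mine, made up for illustration, and it uses plain lists rather than NumPy so it runs on its own.

```python
N_ANGLE_BUCKETS, N_VELOCITY_BUCKETS = 3, 4          # hypothetical names for the 3 x 4 grid
N_STATES = N_ANGLE_BUCKETS * N_VELOCITY_BUCKETS     # 12 discrete states


def decode_policy(policy_id):
    """Return the 3 x 4 action table for a policy index, least significant bit first."""
    bits = [(policy_id >> k) & 1 for k in range(N_STATES)]
    return [
        bits[row * N_VELOCITY_BUCKETS:(row + 1) * N_VELOCITY_BUCKETS]
        for row in range(N_ANGLE_BUCKETS)
    ]


print(2 ** N_STATES)        # 4096 possible policies
print(decode_policy(236))   # [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]
```

Each row is a pole angle bucket and each column a pole velocity bucket, so policy 236 pushes right (1) whenever the velocity falls in the two highest buckets and the angle is in one of the first two buckets.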

## Create a class to discretise the observation space

In [49]:
class DiscreteBox(object):
    def __init__(self, low, high, shape):
        self.low, self.high, self.shape = low, high, shape

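    def Discretise(self, state):
        # Scale each value linearly from [low, high] onto bucket indices and round down.
        discreteState = [int(np.floor((state[i] - self.low[i])/(self.high[i]-self.low[i])*(self.shape[i]-1))) for i in range(len(state))]
        # Clamp out-of-range values into the first or last bucket.
        return tuple([np.min([self.shape[i]-1, np.max([discreteState[i], 0])]) for i in range(len(state))])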

## Create environment and discretiser

We set the bounds of the discrete space tighter than those of the environment's observation space. With so few buckets, we need one of them to sit in a stable region of the state space.


In [50]:
env = gym.make('CartPole-v1')

thetaHigh = 10 * 2 * np.pi / 360
high = np.array([thetaHigh, np.radians(15)])
observationSpace = DiscreteBox(-high, high, (3,4))
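
To see where the bucket edges actually fall with these bounds, here is a small standalone sketch of the same formula in plain Python (no NumPy or gym needed). `LOW`, `HIGH`, `SHAPE` and `discretise` are illustrative names, and the sample states are made-up values chosen to land in different buckets.

```python
import math

# Same bounds as above: +/-10 degrees of pole angle, +/-15 degrees per second of velocity.
LOW = (-math.radians(10), -math.radians(15))
HIGH = (math.radians(10), math.radians(15))
SHAPE = (3, 4)


def discretise(state):
    """Map (pole angle, pole velocity) to a pair of bucket indices."""
    buckets = []
    for value, low, high, n_buckets in zip(state, LOW, HIGH, SHAPE):
        raw = math.floor((value - low) / (high - low) * (n_buckets - 1))
        buckets.append(min(n_buckets - 1, max(raw, 0)))  # clamp to a valid index
    return tuple(buckets)


print(discretise((0.0, 0.0)))     # (1, 1)
print(discretise((-0.05, -0.2)))  # (0, 0)
print(discretise((0.05, 0.1)))    # (1, 2)
print(discretise((0.2, 0.3)))     # (2, 3)
```

Because the scaling uses `shape - 1`, the last bucket on each axis only catches values at or beyond the upper bound. For the angle that means the three buckets are roughly "leaning left", "leaning right" and "past +10 degrees".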

## Identify possible solutions

It would be computationally expensive to fully evaluate every policy, so we first screen all 4096 policies by running each one for a single episode and keeping only those that last more than 465 steps.

In [51]:
startTime = time.time()
resample = []
for i in range(4096):
    state = env.reset()
    policy = createPolicy(i)
    step = 0
    while True:
        step += 1
        state = observationSpace.Discretise(state[2:])
        action = policy[state]
        state, r, terminal, info = env.step(action)
        if terminal or step >= 600:
            if step > 465:
                resample.append(i)
            break

print(str(len(resample)) + ' potential solutions found in ' + "{:.1f}".format(time.time()-startTime) + ' seconds.')

60 potential solutions found in 13.7 seconds.
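
The screening loop and the evaluation loop share the same rollout logic, so it could be factored into a helper. The sketch below is one way to do that, not what the notebook actually runs: `run_episode` and `StubEnv` are hypothetical names, and `StubEnv` is only a stand-in that mimics the `reset()` / four-value `step()` interface used in this notebook so the example runs without gym installed.

```python
def run_episode(env, policy, discretise, max_steps=500):
    """Run one episode with a tabular policy and return the number of steps survived."""
    state = env.reset()
    for step in range(1, max_steps + 1):
        angle_bucket, velocity_bucket = discretise(state[2:])  # ignore cart position/velocity
        action = policy[angle_bucket][velocity_bucket]
        state, _reward, terminal, _info = env.step(action)
        if terminal:
            return step
    return max_steps


class StubEnv:
    """Hypothetical stand-in environment: the episode ends after `lifetime` steps."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        return [0.0, 0.0, 0.0, 0.0], 1.0, self.t >= self.lifetime, {}


always_left = [[0] * 4 for _ in range(3)]   # 3 x 4 table of zeros
centre_bucket = lambda obs: (1, 1)          # dummy discretiser for the stub

print(run_episode(StubEnv(30), always_left, centre_bucket))    # 30
print(run_episode(StubEnv(1000), always_left, centre_bucket))  # 500
```

With the real environment, the call would look something like `run_episode(env, createPolicy(i), observationSpace.Discretise)`. That assumes the same Gym version as the rest of this notebook, where `reset()` returns only the observation and `step()` returns four values.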


## Find solutions

We can now run each possible solution for 100 episodes. If a policy averages at least 495 steps per episode (out of a maximum of 500), we count it as a solution to the cartpole problem.

In [52]:
startTime, solutionCount = time.time(), 0
for i in resample:
    avg = 0
    for k in range(100):
        state = env.reset()
        policy = createPolicy(i)
        step = 0
        while True:
            step += 1
            state = observationSpace.Discretise(state[2:])
            action = policy[state]
            state, r, terminal, info = env.step(action)
            if terminal or step >= 500:
                avg += step
                break
    if avg/100 >= 495:
        print("Solution at Index: " + str(i) + " , score: " + str(avg/100))
        solutionCount += 1

print(str(solutionCount) + ' solutions found in ' + "{:.1f}".format(time.time()-startTime) + ' seconds.')

Solution at Index: 236 , score: 499.64
Solution at Index: 488 , score: 496.66
Solution at Index: 492 , score: 499.71
Solution at Index: 744 , score: 496.18
Solution at Index: 748 , score: 498.33
Solution at Index: 1000 , score: 498.65
Solution at Index: 1004 , score: 498.04
Solution at Index: 1260 , score: 499.48
Solution at Index: 1512 , score: 497.99
Solution at Index: 1516 , score: 499.19
Solution at Index: 1768 , score: 497.04
Solution at Index: 1772 , score: 500.0
Solution at Index: 2024 , score: 497.81
Solution at Index: 2028 , score: 499.91
Solution at Index: 2280 , score: 496.19
Solution at Index: 2284 , score: 499.17
Solution at Index: 2540 , score: 498.83
Solution at Index: 2796 , score: 498.52
Solution at Index: 3048 , score: 497.52
Solution at Index: 3052 , score: 499.31
Solution at Index: 3304 , score: 497.5
Solution at Index: 3308 , score: 499.3
Solution at Index: 3560 , score: 497.52
Solution at Index: 3564 , score: 499.19
Solution at Index: 3820 , score: 500.0
Solution 

28 solutions found at about 4.9 seconds per solution. That is pretty competitive for solve times. 

So, I was wrong in my claim. It did take a bit longer, but we found a lot of solutions. This doesn't change the conclusion, though: this is still a poor method, because increasing the search space much further than this would be unrealistic from a time perspective.
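
To put a rough number on that, here is a back-of-the-envelope sketch. It takes the 13.7 seconds measured above for screening 4096 policies and assumes, optimistically, that the time per policy stays the same as the number of discrete states grows. The function names and the larger state counts (16 and 24) are hypothetical examples, not configurations I have run.

```python
def policy_count(n_states, n_actions=2):
    """Number of deterministic policies for a given number of discrete states."""
    return n_actions ** n_states


SCREEN_SECONDS = 13.7                                    # measured screening time for 12 states
SECONDS_PER_POLICY = SCREEN_SECONDS / policy_count(12)   # assumed constant, which is optimistic


def screening_hours(n_states):
    """Rough screening time in hours if the per-policy cost stayed the same."""
    return policy_count(n_states) * SECONDS_PER_POLICY / 3600


for n_states in (12, 16, 24):
    print(n_states, policy_count(n_states), round(screening_hours(n_states), 2))
```

Doubling the number of states from 12 to 24 multiplies the policy count by 4096, so the 13.7 second screening pass alone would take roughly 15.6 hours under this assumption, and every extra state doubles it again.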