### OpenAI Gym
1. Gym is a toolkit for developing and comparing reinforcement learning algorithms. 
2. It makes no assumptions about the structure of your agent, and is compatible with any numerical computation library, such as TensorFlow or Theano.

In [1]:
import gym
import numpy as np

#### CartPole Env
1. CartPole is one of the most basic environments Gym provides.
2. The problem consists of balancing a pole, connected by a single joint, on top of a moving cart. The only actions are to apply a force of -1 or +1 to the cart, pushing it left or right.
3. In the CartPole environment, each state is described by four observations: the cart's position and velocity, and the pole's angle and angular velocity.
4. Using these observations, the agent needs to decide on one of two possible actions: move the cart left or right.
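
The decision rule in point 4 can be sketched without Gym as a linear policy: take the weighted sum of the four observations and push left (0) when it is negative, right (1) otherwise. The function name `choose_action` and the sample observation values are illustrative assumptions, not part of Gym.

```python
def choose_action(parameters, observation):
    """Return 0 (push left) if the weighted sum is negative, else 1 (push right)."""
    weighted_sum = sum(w * o for w, o in zip(parameters, observation))
    return 0 if weighted_sum < 0 else 1


# Hypothetical observation:
# [cart position, cart velocity, pole angle, pole angular velocity]
observation = [0.03, -0.01, 0.04, -0.02]

# Weights that only look at the third observation (pole angle)
print(choose_action([0.0, 0.0, 1.0, 0.0], observation))   # 1
print(choose_action([0.0, 0.0, -1.0, 0.0], observation))  # 0
```

With four weights and a sign test, the whole policy is a single hyperplane through the observation space, which is why such simple search strategies can be tried on it.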

In [21]:
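env = gym.make('CartPole-v0')

def run_episode(env, parameters, play=False):
    """Run one episode with a linear policy and return the total reward."""
    observation = env.reset()
    totalreward = 0
    # CartPole-v0 episodes are capped at 200 steps
    for _ in range(200):
        # Push left (0) if the weighted sum of observations is negative, else right (1)
        action = 0 if np.matmul(parameters, observation) < 0 else 1
        observation, reward, done, info = env.step(action)
        totalreward += reward
        if done:
            if play:
                env.close()
            break
        if play:
            env.render()
    return totalreward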

#### Strategy 1 : Random Search
1. One fairly straightforward strategy is to keep trying random weights, and pick the one that performs the best.
2. Since the CartPole environment is relatively simple, with only 4 observations, this basic method works surprisingly well.
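
As a rough, Gym-free sketch of the idea, the loop below samples weights uniformly from [-1, 1] and keeps the best-scoring sample. `toy_score` is a hypothetical stand-in for an episode's total reward (it is not the CartPole reward), and `random_search` and its parameters are names assumed for illustration.

```python
import random


def toy_score(parameters):
    """Hypothetical stand-in for an episode reward: higher is better, maximum is 0."""
    target = [0.5, -0.2, 0.8, 0.1]
    return -sum((p - t) ** 2 for p, t in zip(parameters, target))


def random_search(score, n_params=4, n_iters=1000, seed=0):
    """Sample weights uniformly from [-1, 1] and keep the best-scoring sample."""
    rng = random.Random(seed)
    best_parameters = [rng.uniform(-1, 1) for _ in range(n_params)]
    best_score = score(best_parameters)
    for _ in range(n_iters):
        # Each candidate is drawn independently of the current best
        parameters = [rng.uniform(-1, 1) for _ in range(n_params)]
        candidate_score = score(parameters)
        if candidate_score > best_score:
            best_parameters, best_score = parameters, candidate_score
    return best_parameters, best_score


best_parameters, best_score = random_search(toy_score)
print(best_parameters, best_score)
```

Because every candidate is independent of the previous ones, the search never gets stuck, but it also makes no use of what earlier samples revealed.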

In [34]:
# Inspect what env.step returns: (observation, reward, done, info)
print(env.step(0))
# Start a fresh episode; reset returns the initial observation
env.reset()

(array([-0.19724541, -2.29519346,  0.3464877 ,  3.70601063]), 0.0, True, {})


array([ 0.02799342, -0.00475672,  0.03632073, -0.01977023])

In [56]:
# Sample weights uniformly from [-1, 1) and keep the best-scoring set
best_parameters = np.random.rand(4) * 2 - 1
best_reward = run_episode(env, best_parameters, False)
iters = 0
for _ in range(1000):
    parameters = np.random.rand(4) * 2 - 1
    reward = run_episode(env, parameters, False)
    if reward > best_reward:
        best_parameters = parameters
        best_reward = reward
    iters += 1
    # 200 is the maximum reward for a CartPole-v0 episode, so stop early
    if best_reward == 200:
        break
print("Highest Reward : {} in {} iterations".format(best_reward, iters))

Highest Reward : 200.0 in 17 iterations


#### Strategy 2 : Hill Climbing
1. We start with randomly chosen initial weights. Every episode, we add some noise to the weights and keep the new weights if the agent improves.
2. The idea is to improve the weights gradually, rather than jump around hoping to find a combination that works. If noise_scaling is large compared to the current weights, this algorithm is essentially the same as random search.
3. If the weights are initialized badly, adding noise may have no effect on how well the agent performs, causing it to get stuck.
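
The failure mode in point 3 can be illustrated with a small Gym-free sketch. `plateau_score` is a hypothetical stand-in for an episode reward with a flat region, and `hill_climb` and its parameter names (`patience`, `seed`) are assumptions made for this example only.

```python
import random


def plateau_score(parameters):
    """Hypothetical reward with a flat region.

    Returns 0 whenever the first weight is below 0 (a plateau),
    otherwise rewards weights close to 0.5.
    """
    if parameters[0] < 0:
        return 0.0
    return 1.0 - abs(parameters[0] - 0.5)


def hill_climb(score, start, noise_scaling=0.1, patience=100, seed=0):
    """Perturb the best weights with uniform noise and keep strict improvements.

    Stops after `patience` consecutive perturbations without improvement.
    """
    rng = random.Random(seed)
    best_parameters = list(start)
    best_score = score(best_parameters)
    since_update = 0
    while since_update < patience:
        parameters = [p + rng.uniform(-1, 1) * noise_scaling for p in best_parameters]
        candidate_score = score(parameters)
        if candidate_score > best_score:
            best_parameters, best_score = parameters, candidate_score
            since_update = 0
        else:
            since_update += 1
    return best_parameters, best_score


# Good start: small steps climb toward the peak at 0.5.
print(hill_climb(plateau_score, start=[0.2]))
# Bad start: every neighbour within 0.1 still scores 0, so the search never moves.
print(hill_climb(plateau_score, start=[-0.9]))
```

From the bad start, no perturbation of size 0.1 can leave the plateau, so the weights are returned unchanged. A larger `noise_scaling` would escape it, at the cost of behaving more like random search.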

In [60]:
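noise_scaling = 0.1
best_parameters = np.random.rand(4) * 2 - 1
best_reward = 0
last_update = 0  # episodes since the last improvement
iters = 0
# Give up after 100 consecutive episodes without improvement
while last_update < 100:
    # Perturb the current best weights with uniform noise in [-noise_scaling, noise_scaling)
    parameters = best_parameters + (np.random.rand(4) * 2 - 1) * noise_scaling
    reward = run_episode(env, parameters, False)
    if reward > best_reward:
        best_reward = reward
        best_parameters = parameters
        last_update = 0
    else:
        last_update += 1

    iters += 1

    # 200 is the maximum reward for a CartPole-v0 episode, so stop early
    if best_reward == 200:
        break
print("Highest Reward : {} in {} iterations".format(best_reward, iters))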

Highest Reward : 200.0 in 38 iterations
