### Testing the Q_Learner class on CartPole
We test whether the Q_Learner class performs significantly better than an agent that plays without regard to the environment state.

Spoiler: it does. Given two hand-crafted features, the agent learns to exploit the information they carry.

CartPole info:

- https://gym.openai.com/envs/CartPole-v0/
- https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

In [1]:
import gym
import numpy as np
from Modules import Q_Learner

In [2]:
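# Override the abstract method features()
class My_Q_Learner(Q_Learner):
    def features(self, state, action):
        # state[2] is the pole angle; slicing keeps it as a 1-element array
        feat1 = state[2:3]
        # Map action 0/1 to -1/+1
        action_sign = 2*action - 1
        # Couple the pole angle with the chosen action
        feat2 = np.abs([state[2] + action_sign])
        # Constant bias term
        bias = np.ones(1)
        return np.concatenate((feat1, feat2, bias))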
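
To see what these features look like, the sketch below copies `features()` into a free function (for illustration only, so it runs without the `Modules` package) and evaluates it on a made-up state with a pole angle of 0.1 rad. The state values are hypothetical and not taken from a real episode.

```python
import numpy as np

def features(state, action):
    """Standalone copy of My_Q_Learner.features, for illustration only."""
    feat1 = state[2:3]                        # pole angle as a 1-element array
    action_sign = 2 * action - 1              # action 0 -> -1, action 1 -> +1
    feat2 = np.abs([state[2] + action_sign])  # angle coupled with the action
    bias = np.ones(1)                         # constant bias term
    return np.concatenate((feat1, feat2, bias))

# Hypothetical state:
# [cart position, cart velocity, pole angle, pole angular velocity]
state = np.array([0.0, 0.0, 0.1, 0.0])

feats_action_0 = features(state, 0)  # action 0: push cart to the left
feats_action_1 = features(state, 1)  # action 1: push cart to the right
print(feats_action_0)
print(feats_action_1)
```

Only the second feature depends on the action: for a positive angle it is larger for action 1 than for action 0, and the other way round for a negative angle.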

In [5]:
# Generate the environment
env = gym.make('CartPole-v0')

# Parameters
actions_arr = np.arange(env.action_space.n)
d = 3  # must match the length of the vector returned by features()
learning_rate = 0.01
epsilon = 0.9
discount_factor = 0.95

# Initialize the agent
agent = My_Q_Learner(actions_arr, d, learning_rate, 
                     epsilon, discount_factor)

# Train the agent on a number of matches (num_episodes)
# For each episode count the number of rounds the agent survived
num_episodes = 500
for i in range(num_episodes):
    state = env.reset()
    done = False
    rounds = 0
    while not done:
        action = agent.best_action(state, training=True)
        old_state = state
        state, reward, done, info = env.step(action)
        agent.update_parameters(old_state, state, action, reward)
        rounds += 1
    if i % 100 == 0:
        print("\n--> Game Over. Rounds: {}".format(rounds))
        print("Parameter vector:\n{}".format(agent.theta))
print("\n--> Game Over. Rounds: {}".format(rounds))
print("Parameter vector:\n{}".format(agent.theta))


--> Game Over. Rounds: 9
Parameter vector:
[-0.02782184  0.15324365 -0.9877967 ]

--> Game Over. Rounds: 52
Parameter vector:
[0.31937865 0.69114957 0.64831284]

--> Game Over. Rounds: 44
Parameter vector:
[0.30020762 0.68659513 0.66216502]

--> Game Over. Rounds: 46
Parameter vector:
[0.30535284 0.6859542  0.66047443]

--> Game Over. Rounds: 38
Parameter vector:
[0.275821   0.6961439  0.66280196]

--> Game Over. Rounds: 51
Parameter vector:
[0.29468444 0.6894124  0.66171868]
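
One way to read the final parameter vector is to plug it back into the features by hand. The sketch below does that in plain Python, assuming that `Q_Learner` estimates Q(s, a) as the dot product of `agent.theta` and `features(state, action)`. The class is not shown here, so this linear form is an assumption, and the helper names `q_value` and `greedy_action` are made up for this illustration.

```python
# Final parameter vector printed above
theta = [0.29468444, 0.6894124, 0.66171868]

def features(angle, action):
    """Same three features as My_Q_Learner.features, given only the pole angle."""
    action_sign = 2 * action - 1
    return [angle, abs(angle + action_sign), 1.0]

def q_value(angle, action):
    # ASSUMPTION: Q(s, a) is linear in the features, i.e. theta . features(s, a)
    return sum(t * f for t, f in zip(theta, features(angle, action)))

def greedy_action(angle):
    # Pick the action with the larger estimated Q-value
    return max((0, 1), key=lambda a: q_value(angle, a))

for angle in (0.1, -0.1):
    print(angle, [round(q_value(angle, a), 3) for a in (0, 1)], greedy_action(angle))
```

Under that assumption, the positive weight on the second feature is what drives the policy: the greedy choice is action 1 for a positive pole angle and action 0 for a negative one.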


In [6]:
# Watch the trained agent playing
import time
state = env.reset()
done = False
rounds = 0
while not done:
    env.render()
    time.sleep(0.1)
    action = agent.best_action(state, training=False)
    state, reward, done, info = env.step(action)
    rounds += 1
    #print(state, action)
print("\n--> Game Over. Rounds: {}".format(rounds))
env.close()


--> Game Over. Rounds: 58
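
The comparison agent mentioned at the top, the one that ignores the state, is not shown in this notebook. A minimal sketch of such a baseline could look like the following. The function name `random_baseline` and its parameters are hypothetical, and it assumes the same `reset()` / `step()` interface used in the cells above (where `step()` returns a 4-tuple).

```python
import random

def random_baseline(env, num_episodes=100, n_actions=2, seed=0):
    """Average number of rounds survived by an agent that ignores the state.

    Hypothetical helper: `env` is any object with the reset()/step() interface
    used above, where step() returns (state, reward, done, info).
    """
    rng = random.Random(seed)
    rounds_per_episode = []
    for _ in range(num_episodes):
        env.reset()
        done = False
        rounds = 0
        while not done:
            action = rng.randrange(n_actions)  # uniform random, state is never read
            _, _, done, _ = env.step(action)
            rounds += 1
        rounds_per_episode.append(rounds)
    return sum(rounds_per_episode) / len(rounds_per_episode)
```

Calling `random_baseline(env)` on the CartPole environment created above gives an average to set against the round counts of the trained agent. Averaging over many episodes matters here, because single-episode counts fluctuate considerably.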
