### Testing Q_Learner class on CartPole
We test whether the Q_Learner class performs significantly better than an agent that ignores the environment state.

Spoiler: it does. With two hand-crafted features, the agent learns to exploit the information they carry.

CartPole info:

https://gym.openai.com/envs/CartPole-v0/

https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

In [1]:
import gym
import numpy as np
from LearningAlgorithms.q_learner import Q_Learner

In [13]:
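# Override the abstract method features()
class My_Q_Learner(Q_Learner):
    def features(self, state, action):
        # state[2] is the pole angle; slicing keeps it as a 1-element array
        feat1 = state[2:3]
        # Map action 0 (push left) / 1 (push right) to -1 / +1
        action_sign = 2*action - 1
        # Couple the pole angle with the chosen push direction
        feat2 = np.abs([state[2] + action_sign])
        # Constant bias term
        bias = np.ones(1)
        return np.concatenate((feat1, feat2, bias))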
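
To see what these features carry, the sketch below mirrors `features()` as a standalone plain-Python function and evaluates it for both actions on one made-up state. The name `cartpole_features` and the sample state are illustrative assumptions, not part of the Q_Learner class.

```python
def cartpole_features(state, action):
    """Plain-Python mirror of My_Q_Learner.features(), for inspection only."""
    angle = state[2]               # pole angle (third entry of the CartPole state)
    action_sign = 2 * action - 1   # 0 -> -1 (push left), 1 -> +1 (push right)
    return [angle, abs(angle + action_sign), 1.0]

# Hypothetical state: pole tilted slightly towards positive angles
state = [0.0, 0.0, 0.05, 0.0]
left = cartpole_features(state, 0)
right = cartpole_features(state, 1)
print("push left :", left)
print("push right:", right)
```

Only the second feature changes with the action: it is roughly 0.95 for a left push and 1.05 for a right push here. If the learner scores actions with a linear function of the features (an assumption, since the class is not shown in this notebook), a positive weight on that feature makes the agent push towards the side the pole is leaning to.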

In [14]:
# Generate the environment
env = gym.make('CartPole-v0')

# Parameters
actions_arr = np.arange(env.action_space.n)
d = 3  # number of features; must match the length of the vector returned by features()
learning_rate = 0.01
epsilon = 0.9
discount_factor = 0.95

# Initialize the agent
agent = My_Q_Learner(actions_arr, d, learning_rate, 
                     epsilon, discount_factor)

# Train the agent on a number of matches (num_episodes)
# For each episode count the number of rounds the agent survived
num_episodes = 500
for i in range(num_episodes):
    state = env.reset()
    done = False
    rounds = 0
    while not done:
        action = agent.best_action(state, training=True)
        old_state = state
        state, reward, done, info = env.step(action)
        agent.update_parameters(old_state, state, action, reward)
        rounds += 1
    if i % 100 == 0:
        print("\n--> Game Over. Rounds: {}".format(rounds))
        print("Parameter vector:\n{}".format(agent.theta))
print("\n--> Game Over. Rounds: {}".format(rounds))
print("Parameter vector:\n{}".format(agent.theta))


--> Game Over. Rounds: 55
Parameter vector:
[0.47859214 0.63075309 0.61081921]

--> Game Over. Rounds: 68
Parameter vector:
[0.32977605 0.6754387  0.65956828]

--> Game Over. Rounds: 75
Parameter vector:
[0.29507237 0.69325869 0.65751402]

--> Game Over. Rounds: 27
Parameter vector:
[0.27911566 0.69605562 0.66151419]

--> Game Over. Rounds: 49
Parameter vector:
[0.3075713  0.68384956 0.66162654]

--> Game Over. Rounds: 33
Parameter vector:
[0.29510691 0.6906043  0.66028601]
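
The Q_Learner implementation is not shown in this notebook, so the following is only a guess at what a call like `update_parameters(old_state, state, action, reward)` could do: one semi-gradient Q-learning step with a linear value function. Every name in the sketch (`q_value`, `q_learning_step`, `toy_features`) is hypothetical, and it uses plain Python lists instead of NumPy so it runs without dependencies.

```python
def toy_features(state, action):
    """Same three features as in the notebook, written without NumPy."""
    angle = state[2]
    action_sign = 2 * action - 1
    return [angle, abs(angle + action_sign), 1.0]

def q_value(theta, feats):
    """Linear action value: dot product of parameters and features."""
    return sum(t * f for t, f in zip(theta, feats))

def q_learning_step(theta, features, old_state, new_state, action, reward,
                    actions, learning_rate, discount_factor):
    """One semi-gradient Q-learning update; returns the new parameter list."""
    old_feats = features(old_state, action)
    # Greedy value of the next state (no special handling of terminal states)
    best_next = max(q_value(theta, features(new_state, a)) for a in actions)
    td_error = reward + discount_factor * best_next - q_value(theta, old_feats)
    # For a linear model the gradient of Q w.r.t. theta is the feature vector
    return [t + learning_rate * td_error * f for t, f in zip(theta, old_feats)]

theta = [0.0, 0.0, 0.0]
new_theta = q_learning_step(
    theta, toy_features,
    old_state=[0.0, 0.0, 0.05, 0.0], new_state=[0.0, 0.0, 0.06, 0.0],
    action=1, reward=1.0, actions=[0, 1],
    learning_rate=0.01, discount_factor=0.95,
)
print(new_theta)
```

Starting from zero parameters, the temporal-difference error equals the reward, so each parameter moves by `learning_rate` times its feature value.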


In [15]:
# Watch the trained agent playing
import time
state = env.reset()
done = False
rounds = 0
while not done:
    env.render()
    time.sleep(0.1)
    action = agent.best_action(state, training=False)
    state, reward, done, info = env.step(action)
    rounds += 1
    #print(state, action)
print("\n--> Game Over. Rounds: {}".format(rounds))
env.close()


--> Game Over. Rounds: 45
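
A single episode says little about whether the agent beats a state-blind baseline. A minimal sketch of a fairer comparison is to average the episode length over many episodes for each policy. The helper `average_rounds`, the `random_policy` baseline, and the `FixedLengthEnv` stand-in are all hypothetical names introduced here; the stand-in only mimics the `reset()`/`step()` interface used above so the example runs without gym.

```python
import random

_rng = random.Random(0)  # fixed seed so the baseline is reproducible

def random_policy(state):
    """Baseline that ignores the state and picks left (0) or right (1) at random."""
    return _rng.choice([0, 1])

def average_rounds(env, policy, num_episodes):
    """Average episode length of `policy` on an env with the reset()/step() API used above."""
    total = 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, info = env.step(policy(state))
            total += 1
    return total / num_episodes

class FixedLengthEnv:
    """Toy stand-in for a gym environment: every episode lasts `length` steps."""
    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        return [0.0, 0.0, 0.0, 0.0], 1.0, self.t >= self.length, {}

print(average_rounds(FixedLengthEnv(7), random_policy, num_episodes=5))  # 7.0
```

With gym installed, the same helper could be called on a fresh `gym.make('CartPole-v0')` environment, once with `random_policy` and once with `lambda s: agent.best_action(s, training=False)`, to compare the two averages.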
