# Objective 

Build a DQN that learns to play CartPole.

* I'll need to build a NN whose weights can be updated by backpropagation. With one output per action, we wouldn't have a target for the action that wasn't taken, so I think I will have to go for the less efficient architecture that takes the action as an input.
    * Architecture:
        * Inputs: 4 features from CartPole observation + which action is taken
        * Outputs: the value for that state-action pair
        * Will start with one hidden layer with ReLU; as in the DQN paper, the output will be a purely linear layer
        * Loss function: $\frac{1}{2}(y - Q(s, a; w))^2$
        * Target: $y = r + \gamma \max_{a'}Q(s', a'; w)$
        * Will need to store training examples $(s, a, r, s')$ in a replay memory holding the $N$ most recent examples. Each iteration, examples are sampled from it at random for gradient descent; should try to use mini-batch gradient descent
        * Agent will move through the environment generating experience data using an $\varepsilon$-greedy policy
        * Will have to freeze a copy of the network for a set number of iterations; the frozen copy is used only to compute the target (the oracle substitute), while the live network's weights keep being updated
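
The replay memory described in the list above could be sketched as follows. This is a minimal illustration using only the standard library, not the implementation used later in this notebook. `BATCH_SIZE`, `store`, `sample_minibatch` and the extra `done` flag in each transition are assumptions introduced here for illustration.

```python
import random
from collections import deque

N = 200          # capacity, matching the N used in this notebook
BATCH_SIZE = 32  # hypothetical mini-batch size, not taken from the notebook

# deque(maxlen=N) drops the oldest item automatically once N items are stored
replay = deque(maxlen=N)

def store(transition):
    """Append one (s, a, r, s_next, done) tuple to the replay memory."""
    replay.append(transition)

def sample_minibatch(batch_size=BATCH_SIZE):
    """Sample uniformly without replacement; returns fewer items if memory is small."""
    k = min(batch_size, len(replay))
    return random.sample(list(replay), k)

# Usage with dummy transitions: the integer t stands in for a state
for t in range(250):
    store((t, t % 2, 1.0, t + 1, False))
batch = sample_minibatch()
```

Sampling at random rather than replaying transitions in order breaks up the correlation between consecutive steps of an episode.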

In [1]:
import numpy as np
import gym

In [2]:
env = gym.make('CartPole-v1')

In [103]:
D = [] # Dataset for Experience Replay
N = 200 # Number of most recent experiences stored in D
model = {} # Contains weights in the NN
h_n = 20 # Number of units in hidden layer
input_size = env.action_space.n + env.observation_space.shape[0]
alpha = 0.1 # Learning rate
gamma = 0.9 # discount rate
epsilon = 0.1 # exploration rate

In [95]:
# Randomly initialising weights of NN with Xavier initialisation 
model['W1'] = np.random.randn(h_n, input_size) / np.sqrt(input_size)
model['b1'] = np.zeros((h_n, 1))
model['W2'] = np.random.randn(1, h_n) / np.sqrt(h_n)
model['b2'] = np.zeros((1, 1))

In [96]:
def q_forward_pass(x):
    h = np.dot(model['W1'], x) + model['b1']
    h[h<0] = 0 # ReLU activation
    q = np.dot(model['W2'], h) + model['b2']
    return q, h

In [93]:
# creates input array ready for NN
def format_input(S, A):
    format_A = [1, 0] if A == 0 else [0, 1]
    return np.append(S, format_A).reshape(input_size, 1)

In [136]:
# backpropagation using a single example first
def q_backward_pass(example):
    S, A, R, SS = example
    _, max_q = get_max_q(SS)
    y = R + gamma * max_q
    x = format_input(S, A)
    q, h = q_forward_pass(x)
    
    # L = 1/2(y - q)^2
    dq = -(y-q)
    dW2 = np.dot(dq, h.T)
    db2 = dq
    dh = np.dot(model['W2'].T, dq)
    dh[h<=0] = 0
    dW1 = np.dot(dh, x.T)
    db1 = dh
    
    # either update here or return derivatives so I can then average them over a minibatch?
    # actually can't do that because then i'm not making use of vectorisation
    # the question is how to store all examples in one big array. For now I will do stochastic
    # gradient descent and use only one example to update weights and call it many times.
    model['W1'] = model['W1'] - alpha * dW1
    model['b1'] = model['b1'] - alpha * db1
    model['W2'] = model['W2'] - alpha * dW2
    model['b2'] = model['b2'] - alpha * db2
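
The comments in `q_backward_pass` ask how to store all examples in one big array. One possible answer, sketched here under assumptions, is to stack each formatted input as a column of `X` and each target as an entry of `y`, then average the gradient over the batch. The function name `minibatch_step` and the random data in the usage part are hypothetical. The targets `y` are treated as constants, as in the single-example version.

```python
import numpy as np

def minibatch_step(model, X, y, alpha):
    """One gradient step on the batch mean of 1/2 (y - q)^2.

    X: (input_size, m) array, one formatted state-action column per example.
    y: (1, m) array of targets, treated as constants.
    Returns the loss measured before the update.
    """
    m = X.shape[1]
    # Forward pass: same computation as q_forward_pass, on m columns at once
    z = model['W1'] @ X + model['b1']    # (h_n, m)
    h = np.maximum(z, 0)                 # ReLU
    q = model['W2'] @ h + model['b2']    # (1, m)

    # Backward pass, averaged over the batch
    dq = (q - y) / m                     # (1, m)
    dW2 = dq @ h.T                       # (1, h_n)
    db2 = dq.sum(axis=1, keepdims=True)  # (1, 1)
    dh = model['W2'].T @ dq              # (h_n, m)
    dh[z <= 0] = 0                       # ReLU gradient
    dW1 = dh @ X.T                       # (h_n, input_size)
    db1 = dh.sum(axis=1, keepdims=True)  # (h_n, 1)

    model['W1'] -= alpha * dW1
    model['b1'] -= alpha * db1
    model['W2'] -= alpha * dW2
    model['b2'] -= alpha * db2
    return float(0.5 * np.mean((y - q) ** 2))

# Usage on random data (illustrative only, not CartPole transitions)
rng = np.random.default_rng(0)
input_size, h_n, m = 6, 20, 8
model = {
    'W1': rng.standard_normal((h_n, input_size)) / np.sqrt(input_size),
    'b1': np.zeros((h_n, 1)),
    'W2': rng.standard_normal((1, h_n)) / np.sqrt(h_n),
    'b2': np.zeros((1, 1)),
}
X = rng.standard_normal((input_size, m))
y = rng.standard_normal((1, m))
losses = [minibatch_step(model, X, y, alpha=0.01) for _ in range(500)]
```

With `m = 1` this reduces to the same update as the single-example version above.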

In [135]:
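def get_max_q(state):
    # q_forward_pass returns (q, h); keep only the scalar Q-value for each action
    q0, _ = q_forward_pass(format_input(state, 0))
    q1, _ = q_forward_pass(format_input(state, 1))
    qs = [q0.item(), q1.item()]
    return int(np.argmax(qs)), max(qs)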

In [None]:
# Assumes the classic gym API: reset() returns the observation, step() returns 4 values
for episode in range(1):
    S = env.reset()
    while True:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            A = env.action_space.sample()
        else:
            A, _ = get_max_q(S)
        SS, R, done, _ = env.step(A)
        example = (S, A, R, SS)
        D = [example] + D[:N-1] # Adding new example to database and removing oldest if full

        # TODO perform steps of gradient descent

        S = SS # move on to the next state
        if done:
            break
env.close()
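
The frozen target network from the objective list is not yet wired into the loop. A rough sketch of how it could be done is given here, under assumptions. The names `C`, `td_target` and `maybe_sync` are hypothetical, and dropping the bootstrap term on terminal transitions is an addition that the notebook's `(S, A, R, SS)` examples do not currently support, since they carry no `done` flag.

```python
import copy
import numpy as np

C = 100  # hypothetical number of steps between target-network refreshes

def td_target(r, max_q_next, done, gamma=0.9):
    """y = r on terminal transitions, else r + gamma * max_a' Q_target(s', a')."""
    return r if done else r + gamma * max_q_next

def maybe_sync(step, model, target_model, sync_every=C):
    """Return a fresh frozen copy of `model` every `sync_every` steps."""
    if step % sync_every == 0:
        return copy.deepcopy(model)
    return target_model

# Usage: a one-matrix stand-in for the weight dictionary
model = {'W1': np.ones((2, 2))}
target_model = copy.deepcopy(model)
for step in range(1, 151):
    model['W1'] += 0.01  # stand-in for a gradient update
    target_model = maybe_sync(step, model, target_model)
```

After the loop, `target_model` still holds the weights from step 100 while `model` has moved on, which is the lag the frozen network is meant to provide.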