# Learning Cartpole with Policy Gradient Method (REINFORCE)

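⊙ Methods such as Q-learning, SARSA, and DQN optimize the parameters 𝜃 to learn state-action values, and then decide on an action by picking the one that maximizes that value.
<br> ⊙ Policy gradient does not go through a value function: 𝜃 directly parameterizes the action-selection probabilities, and those are what gets learned.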

<br>Idea (and code) from Karpathy's PG Pong
<br>Karpathy's PG Pong code : https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
<br>Karpathy's PG blog post : http://karpathy.github.io/2016/05/31/rl/

<img src = "policy_gradient.png" width=600 > 

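What is learned here is the set of parameters behind the action probabilities. <br>
This is a Monte Carlo method, so it has to see each episode through to the end: the whole episode is played first, and only then does learning happen.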
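
For reference, this Monte Carlo update is commonly written as the REINFORCE gradient estimate below. The symbols ($\pi_\theta$ for the policy, $G_t$ for the discounted return from step $t$, $T$ for the episode length) are standard textbook notation assumed here, not names from the notebook. The recursion for $G_t$ is meant to match what `discount_rewards` computes further down.

$$
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad
G_t = r_t + \gamma \, G_{t+1} = \sum_{k=t}^{T-1} \gamma^{\,k-t} r_k
$$

Because $G_t$ depends on every reward after step $t$, it cannot be computed until the episode is over, which is why the update waits for the end of the episode.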

## Imports

Easy, right?

In [1]:
import numpy as np
import pickle # only used to save and load the model
import gym

## Parameters

In [2]:
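H = 10                 # number of hidden units
learning_rate = 2e-3   # step size for the RMSProp update
gamma = 0.99           # discount factor for rewards
decay_rate = 0.99      # decay factor for the RMSProp running average
score_queue_size = 100 # number of recent episodes used for the mean score
resume = False         # load a previously saved model instead of starting fresh
D = 4                  # input dimension: a CartPole observation has 4 values
# The action space is binary: the action is either 0 or 1.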

## Agent definition

In [3]:
if resume:
    # Load the checkpoint written by the training loop (it saves to 'Cart.p').
    model = pickle.load(open('Cart.p', 'rb'))
else:
    model = {}
    model['W1'] = np.random.randn(H,D) / np.sqrt(D) # hidden-layer weights, scaled random initialization
    model['W2'] = np.random.randn(H) / np.sqrt(H)   # output-layer weights, scaled random initialization

grad_buffer = { k : np.zeros_like(v) for k,v in model.items() } # accumulated gradients
rmsprop_cache = { k : np.zeros_like(v) for k,v in model.items() } # running average of squared gradients for RMSProp


def sigmoid(x): 
    return 1.0 / (1.0 + np.exp(-x))


def discount_rewards(r):
    # Compute the discounted return for every step of the episode.
    # These returns are used later to judge how well the policy did at each step.
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r

            
def policy_forward(x): # forward pass of the network: compute the action probability
    h = np.dot(model['W1'], x)
    h = sigmoid(h)
    logp = np.dot(model['W2'], h) # pre-sigmoid score (logit) for action 1
    p = sigmoid(logp)
    return p, h  # probability of choosing action 1, and the hidden activations
    
    
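def policy_backward(eph, epdlogp, epx): # backward pass: ordinary backpropagation through the network
    global grad_buffer
    dW2 = np.dot(eph.T, epdlogp).ravel()  # gradient for W2
    dh = np.outer(epdlogp, model['W2'])   # gradient flowing into the hidden activations
    eph_dot = eph*(1-eph)                 # derivative of the sigmoid hidden units
    dW1 = dh * eph_dot
    dW1 = np.dot(dW1.T, epx)              # gradient for W1
    for k in model: grad_buffer[k] += {'W1':dW1, 'W2':dW2}[k] # accumulate into the buffer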
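
The training loop feeds `y - act_prob` into `policy_backward` as the gradient at the network output. The following is a minimal sketch, not part of the original notebook, that checks this expression numerically: for a sigmoid (Bernoulli) policy, the derivative of the log-probability of the chosen action with respect to the pre-sigmoid score is `y - p`. The helper names `log_prob` and `numeric_grad` are hypothetical and introduced only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(y, z):
    """Log-probability of action y (0 or 1) under a Bernoulli policy with logit z."""
    p = sigmoid(z)
    return np.log(p) if y == 1 else np.log(1.0 - p)

def numeric_grad(y, z, eps=1e-6):
    """Central finite-difference estimate of d log_prob / d z."""
    return (log_prob(y, z + eps) - log_prob(y, z - eps)) / (2 * eps)

# Compare the analytic expression y - p with the numerical estimate.
for y in (0, 1):
    for z in (-2.0, 0.3, 1.5):
        analytic = y - sigmoid(z)
        assert abs(analytic - numeric_grad(y, z)) < 1e-6
```

If the assertions pass, the "error" stored at each step is exactly the log-likelihood gradient that REINFORCE needs, so ordinary backpropagation can be reused unchanged.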
  

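* DQN's value estimates jump around as the visited states change abruptly, which is why it keeps a fixed target network. <br>
* Policy gradient looks at a whole episode and then tunes the policy a little at a time, so performance improves gradually (an advantage of policy gradient). However, it can get stuck in a local optimum, and the score still swings back and forth somewhat.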
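
To make the "look at a whole episode, then update" idea concrete, here is a small sketch of the discounted return on a toy episode. It mirrors the recursion in `discount_rewards`, but the function name `discounted_returns`, the 5-step episode, and `gamma=0.5` are assumptions chosen so the numbers are easy to follow; the notebook itself uses `gamma = 0.99`.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(rewards.size)):
        running = running * gamma + rewards[t]
        returns[t] = running
    return returns

# CartPole gives +1 for every step the pole stays up, so a 5-step episode looks like this.
toy_returns = discounted_returns([1, 1, 1, 1, 1], gamma=0.5)
print(toy_returns.tolist())  # [1.9375, 1.875, 1.75, 1.5, 1.0]
```

The earliest step gets the largest return because more surviving steps follow it, and none of these values can be computed before the episode has finished.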

## Process

In [4]:
env = gym.make('CartPole-v0') # create the CartPole environment
observation = env.reset() # reset it and get the first observation
# initialize
reward_sum, episode_num = 0,0 
xs,hs,dlogps,drs = [],[],[],[] # states, hidden activations, log-probability gradients, rewards
score_queue = [] # recent episode scores, used for the running mean

while True:
    
    ##### The per-step part starts here
    
    x = observation # environment observation
    
    act_prob, h = policy_forward(x) # acquire action probability
    
    if score_queue and np.mean(score_queue) > 180: # once the recent mean score is high enough, act greedily
        action = 1 if 0.5 < act_prob else 0
    else:
        action = 1 if np.random.uniform() < act_prob else 0 # otherwise sample from the policy, which provides exploration

    xs.append(x) # observation
    hs.append(h) # hidden activations
    y = action # the action that was taken, treated as the target label
    # Gradient of the log-probability of the chosen action with respect to the
    # pre-sigmoid output: y - act_prob. It is scaled by the return once the episode ends.
    dlogps.append(y - act_prob) 
    
    observation, reward, done, info = env.step(action) # next step
    reward_sum += reward # cumulative reward   
    drs.append(reward) # record the reward for this step
    
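    ##### The per-step part ends here. It repeats until the episode ends, storing the error term at every step.
    ##### This is the Monte Carlo approach: only a finished episode shows whether the policy did well or badly.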
    
    
    
    
    if done: # the episode has ended
        episode_num += 1 
        
        # keep only the most recent score_queue_size episode scores
        score_queue.append(reward_sum)
        if len(score_queue) > score_queue_size:
            score_queue.pop(0)
        
        print("episode : " + str(episode_num) + ", reward : " + str(reward_sum) + ", reward_mean : " + str(np.mean(score_queue)))
        
        if np.mean(score_queue) >= 200: # the mean has reached 200, the maximum score per episode in CartPole-v0
            print("CartPole solved!!!!!")
            break
        
        epx = np.vstack(xs) # stacked observations
        eph = np.vstack(hs) # stacked hidden activations
        epdlogp = np.vstack(dlogps) # stacked log-probability gradients
        epr = np.vstack(drs)        # stacked rewards
        xs,hs,dlogps,drs = [],[],[],[] # reset the episode buffers
        
        discounted_epr = discount_rewards(epr) # discounted return at each step of the episode
        
        
        # Variance reduction (see David Silver's lecture 7).
        # Subtracting the mean acts as a baseline, which turns the return into an advantage-like signal:
        # (return - mean) / std.
        # The normalization spreads the credit evenly over the steps and keeps learning from diverging,
        # while still distinguishing how well step 1 did from how well step 50 did.
        discounted_epr -= np.mean(discounted_epr) # standardization process - mean
        discounted_epr /= np.std(discounted_epr)  # standardization process / std
        
        epdlogp *= discounted_epr # encourage or discourage
        
        policy_backward(eph,epdlogp,epx) # calculate the gradient for learning model
        
        
        
        
        # parameter update
        for k,v in model.items():
            g = grad_buffer[k] 
            
            # standard RMSProp update (copied as-is)
            rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate)*g**2 # running average of squared gradients
            model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5) # gradient ascent step on the weights
            
            grad_buffer[k] = np.zeros_like(v) # reset the gradient buffer
        
        if episode_num % 1000 == 0: pickle.dump(model, open('Cart.p', 'wb')) # save the model every 1000 episodes
        
        reward_sum = 0 #clear off
        observation = env.reset() # restart environment
        
env.close()

[2017-05-15 08:53:12,510] Making new env: CartPole-v0
episode : 1, reward : 27.0, reward_mean : 27.0
episode : 2, reward : 11.0, reward_mean : 19.0
episode : 3, reward : 11.0, reward_mean : 16.3333333333
episode : 4, reward : 27.0, reward_mean : 19.0
episode : 5, reward : 13.0, reward_mean : 17.8
episode : 6, reward : 9.0, reward_mean : 16.3333333333
episode : 7, reward : 15.0, reward_mean : 16.1428571429
episode : 8, reward : 20.0, reward_mean : 16.625
episode : 9, reward : 18.0, reward_mean : 16.7777777778
episode : 10, reward : 14.0, reward_mean : 16.5
episode : 11, reward : 12.0, reward_mean : 16.0909090909
episode : 12, reward : 24.0, reward_mean : 16.75
episode : 13, reward : 16.0, reward_mean : 16.6923076923
episode : 14, reward : 16.0, reward_mean : 16.6428571429
episode : 15, reward : 21.0, reward_mean : 16.9333333333
episode : 16, reward : 20.0, reward_mean : 17.125
episode : 17, reward : 16.0, reward_mean : 17.0588235294
episode : 18, reward : 15.0, reward_mean : 16.9444444444
episode : 19, reward : 18.0, reward_mean : 17.0
epi
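
The standardization step in the training loop (subtract the mean, divide by the standard deviation) can be looked at in isolation. The sketch below is an illustration under assumptions: the function name `standardize`, the small `eps` guard against a zero standard deviation, and the example returns are all hypothetical and not taken from the run above.

```python
import numpy as np

def standardize(returns, eps=1e-8):
    """Shift returns to zero mean and scale them to roughly unit standard deviation."""
    returns = np.asarray(returns, dtype=float)
    # eps is an added safeguard (not in the notebook) against dividing by zero.
    return (returns - returns.mean()) / (returns.std() + eps)

# Hypothetical discounted returns for a 4-step episode, largest at the first step.
example_returns = np.array([3.94, 2.97, 1.99, 1.0])
advantages = standardize(example_returns)

# Early steps come out positive (reinforced), late steps negative (discouraged).
print(np.round(advantages, 2))
```

In CartPole, where every step earns +1, this means the actions taken shortly before the pole fell are pushed down while the earlier ones are pushed up, and the overall scale of the update stays similar from episode to episode.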

A more detailed explanation is available at http://karpathy.github.io/2016/05/31/rl/
<br>More material on the topic can be found at https://github.com/yukezhu/tensorflow-reinforce