# Policy Gradient Reinforcement Learning Algorithms

Policy gradient methods learn a parameterised policy pi(a|s,theta) directly, rather than deriving a policy from learned action-value functions. The parameters theta are learned by gradient ascent on a performance measure J(theta).

The policy gradient theorem separates the effect of the policy on actions and rewards from the effect of policy changes on the state distribution, which depends on the environment dynamics and is therefore unknown. It gives an expression for grad J(theta) that is proportional to a term that does not involve the derivative of the state distribution:

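grad J(theta) ∝ sigma_s mu(s) sigma_a q_pi(s,a) grad pi(a|s,theta)

where mu(s) is the on-policy state distribution under pi, q_pi(s,a) is the action-value function of pi, and sigma_s and sigma_a denote sums over all states and all actions.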


### Discrete action space policies:
<br>
Use parameterised numerical preferences h(s,a,theta) for each state-action pair and softmax to select actions:
<br>
pi(a|s,theta) = exp(h(s,a,theta)) / sigma_b exp(h(s,b,theta))
<br>
(Note that b is used to sum over all actions.)
<br><br>
The simplest case is when h is a linear function:
<br>
h(s,a,theta) = theta^T x(s,a)
<br>
where x(s,a) is a feature vector.
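
The softmax over linear preferences can be sketched in a few lines of NumPy. This is only an illustration: the function name `softmax_policy`, the 3-action feature matrix `x_s` (one row per action) and the values in `theta` are made up for the example and do not come from the notebook.

```python
import numpy as np

def softmax_policy(theta, x_s):
    """Return pi(.|s,theta) for one state; row a of x_s is the feature vector x(s,a)."""
    h = x_s @ theta        # h(s,a,theta) = theta^T x(s,a), one preference per action
    h = h - h.max()        # shift for numerical stability; the probabilities are unchanged
    e = np.exp(h)
    return e / e.sum()     # normalise over all actions b

# Hypothetical example: 3 actions, 2 features per state-action pair.
x_s = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
theta = np.array([0.5, -0.5])
probs = softmax_policy(theta, x_s)
print(probs, probs.sum())
```

With theta = 0 every preference is equal, so the policy starts out uniform over the actions.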

### REINFORCE: Frozen Lake with softmax linear action preferences
Gradient Ascent update: <br>
Multiply and divide by pi(a|St, theta) so that the sum over all actions becomes an expectation under pi, which can be sampled using the action At actually taken and its return Gt:

grad J(theta) ∝ Expectation_pi [ Gt grad pi(At|St, theta) / pi(At|St, theta) ]
<br><br>
gives the gradient ascent update:
<br>
theta_t+1 = theta_t + alpha Gt grad pi(At|St, theta) / pi(At|St, theta)
<br><br>
... but use the "eligibility vector" grad ln pi(At|St, theta) (from the identity grad ln x = grad x / x) and include gamma^t for discounting:
<br>
theta_t+1 = theta_t + alpha gamma^t Gt grad ln pi(At|St, theta)
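
For the softmax policy with linear preferences, the eligibility vector has the closed form grad ln pi(a|s,theta) = x(s,a) - sigma_b pi(b|s,theta) x(s,b). The sketch below checks that form against a finite-difference estimate and applies one update step. The helper names (`action_probs`, `grad_ln_pi`, `reinforce_step`), the feature matrix and the step size are assumptions chosen for illustration, not part of the notebook.

```python
import numpy as np

def action_probs(theta, x_s):
    # Softmax over linear preferences; row a of x_s is x(s,a).
    h = x_s @ theta
    e = np.exp(h - h.max())
    return e / e.sum()

def grad_ln_pi(theta, x_s, a):
    # Eligibility vector for softmax-linear preferences:
    # x(s,a) - sum_b pi(b|s,theta) x(s,b)
    return x_s[a] - action_probs(theta, x_s) @ x_s

def reinforce_step(theta, x_s, a, G, t, alpha=0.1, gamma=0.99):
    # theta_t+1 = theta_t + alpha gamma^t Gt grad ln pi(At|St, theta)
    return theta + alpha * gamma**t * G * grad_ln_pi(theta, x_s, a)

# Hypothetical state with 3 actions and 2 features per action.
x_s = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = np.array([0.2, -0.1])
a = 2

analytic = grad_ln_pi(theta, x_s, a)

# Central finite differences of ln pi(a|s,theta) as an independent check.
eps = 1e-6
numeric = np.array([
    (np.log(action_probs(theta + eps * e_i, x_s)[a])
     - np.log(action_probs(theta - eps * e_i, x_s)[a])) / (2 * eps)
    for e_i in np.eye(len(theta))
])

new_theta = reinforce_step(theta, x_s, a, G=1.0, t=0)
print(analytic, numeric)
```

A positive return moves theta in the direction that makes the sampled action more likely, and a return of zero leaves theta unchanged.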

In [0]:
import gym
import time, random, math
import numpy as np
import pandas as pd
from IPython import display
import matplotlib.pyplot as plt
from scipy.special import softmax
%matplotlib inline
env = gym.make('FrozenLake8x8-v0', is_slippery=False)

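n_states, n_actions = env.observation_space.n, env.action_space.n

# Assumption: the original left features() unimplemented, so x(s,a) is taken here to be a
# one-hot encoding of the (state, action) pair, giving theta one entry per pair.
theta = np.zeros(n_states * n_actions)
alpha, gamma = 0.1, 0.99  # assumed fixed step size; no epsilon is needed, the policy is stochastic
n_episodes = 2000
avg_wins = []

def features(state, action):
    x = np.zeros(n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x

def policy(state):
    # pi(.|s,theta): softmax over h(s,a,theta) = theta^T x(s,a)
    X = np.array([features(state, a) for a in range(n_actions)])
    return softmax(X @ theta), X

wins = 0
for n in range(n_episodes):
    state = env.reset()
    episode = []

    while True:
        # generate an episode by sampling actions from pi (argmax would make the policy greedy)
        probs, X = policy(state)
        action = np.random.choice(n_actions, p=probs)
        observation, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        state = observation
        if done:
            break

    # compute the return Gt for every step, working backwards from the end of the episode
    returns, G = [], 0.0
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # REINFORCE update: theta <- theta + alpha gamma^t Gt grad ln pi(At|St, theta)
    for t, ((s, a, _), G) in enumerate(zip(episode, returns)):
        probs, X = policy(s)
        grad_ln_pi = X[a] - probs @ X  # x(s,a) - sum_b pi(b|s) x(s,b)
        theta += alpha * gamma**t * G * grad_ln_pi

    # diagnostic: fraction of wins over each block of 100 episodes
    wins += reward
    if (n + 1) % 100 == 0:
        avg_wins.append(wins / 100)
        wins = 0

pd.Series(avg_wins).plot()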