## __3. Actor-Critic for Continuing Tasks__ <br><br>


  - Estimating the Policy Gradient <br><br>

  - Actor-Critic Algorithm


<br><br><br><br><br>

## $\cdot$ Estimating the Policy Gradient <br><br>


  -  Derive a sample-based estimate for the gradient of the average reward objective

<br><br>


We have an objective for policy optimization. <br>
We also have the policy gradient theorem, <br>
which gives us a simple expression for the gradient of that objective. <br><br>

In this video, <br>
we'll complete the puzzle by showing " how to estimate this gradient " <br>
using the experience of an agent interacting with the environment.

<br><br><br><br><br>










### Getting Stochastic Samples of the Gradient <br><br>


We want to derive a stochastic gradient ascent algorithm for our policy. <br>
We have our objective, and the Policy Gradient Theorem gives us its gradient. <br><br>

Now, we need to figure out " how to approximate this gradient ". <br><br>

In fact, what we will do is get a stochastic sample of the gradient.

<br><br>


<img src="https://drive.google.com/uc?id=1TZyvNfcE9zA7wDDIXMAA7fzajc7UWJTD" alt="3-01" width="500">

<br><br>


Recall <br>
this expression for the gradient of the average reward. <br><br>

Computing the sum over states is really impractical. <br>
But we can do the same thing we did <br>
when deriving our stochastic gradient descent rule for policy evaluation. <br><br>

We simply make updates from states we observe while following policy $\pi$. <br>
$S_0, A_0, R_1, S_1, A_1, \;\;\; \cdots \;\;\; , S_t, A_t, R_{t+1}, \;\;\; \cdots$ <br><br>

The gradient computed at state $S_t$ provides an approximation <br>
to the gradient of the average reward $\nabla r(\pi)$. <br>
$\nabla r(\pi) \quad \approx \quad \displaystyle \sum_a \nabla\pi(a|\color{brown}{S_t},\theta_t) q_{\pi}(\color{brown}{S_t},a)$ <br><br>

As we discussed before for stochastic gradient descent, <br>
we can adjust the weights with this approximation <br>
and still be guaranteed to reach a stationary point.

<br><br>


<img src="https://drive.google.com/uc?id=1WM-nqwiR8DSoL9QryddzE8zsxB9dmJTk" alt="3-02" width="500">

<br><br>


This is what the stochastic gradient ascent update looks like <br>
for the policy parameters. <br>
$\theta_{t+1} \quad \doteq \quad \theta_t + \alpha \displaystyle \sum_a \nabla \pi(a|S_t,\theta_t) q_{\pi}(S_t,a)$ <br><br>

We could stop here, <br>
but let's simplify this further.
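
<br><br>

As a rough illustration of this all-actions update, here is a minimal sketch for a single state. <br>
It assumes a tabular soft-max policy with one preference per action, which is an assumption made only for this example. <br>
The names `softmax`, `grad_pi`, `all_actions_update` and the numbers in `theta` and `q_hat` are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_pi(prefs, a):
    """Gradient of pi(a) with respect to every preference theta_k (soft-max)."""
    pi = softmax(prefs)
    return [pi[a] * ((1.0 if k == a else 0.0) - pi[k]) for k in range(len(prefs))]

def all_actions_update(prefs, q_values, alpha):
    """theta <- theta + alpha * sum_a grad pi(a) * q(a), for one state."""
    step = [0.0] * len(prefs)
    for a, q in enumerate(q_values):
        g = grad_pi(prefs, a)
        for k in range(len(prefs)):
            step[k] += g[k] * q
    return [p + alpha * s for p, s in zip(prefs, step)]

theta = [0.0, 0.0, 0.0]   # hypothetical preferences for the state S_t
q_hat = [1.0, 2.0, 0.0]   # hypothetical values standing in for q_pi(S_t, a)
new_theta = all_actions_update(theta, q_hat, alpha=0.1)
```

In this sketch the preference of the highest-valued action goes up and the preference of the lowest-valued action goes down. <br>
Note that the inner loop touches every action, which is the cost the rest of this section tries to avoid.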

<br><br><br>




### Unbiasedness of the Stochastic Samples <br><br>


Let's re-examine this from a perspective based on expectations. <br>
This will help us simplify the update and give you more insight into why the update makes sense. 

<br><br>


<img src="https://drive.google.com/uc?id=1ES4Dy0LOpXr2VYfhXyPoriTqMLgSmIR4" alt="3-03" width="500">

<br><br>


Notice that <br>
the sum over states weighted by $\mu$ <br>
can be re-written as an expectation under $\mu$. <br>
( using a sampled state $S$ instead of every state $s$ ) <br>
$\nabla r(\pi) = \displaystyle \sum_s \mu(s) \sum_a \nabla \pi(a|s,\theta) q_{\pi}(s,a)$ <br>
$\quad\quad\quad = \mathbb{E}_{\mu} \big[ \displaystyle \sum_a \nabla \pi(a|S,\theta) q_{\pi}(S,a) \big]$ <br><br>

Recall that <br>
$\mu$ is the stationary distribution for $\pi$ which reflects state visitation under $\pi$. <br>
>Q. What is a stationary distribution ? <br>
>A. https://en.wikipedia.org/wiki/Stationary_distribution

<br><br>


<img src="https://drive.google.com/uc?id=1G4kQrqnabhD7l4YW7G-C_74E05nSoMok" alt="3-04" width="500">

<br><br>


In fact, <br>
the states we observe while following $\pi$ are distributed according to $\mu$. <br>
$S_t \sim \mu$ <br><br>

By computing the gradient from a state $S_t$, <br>
we get an unbiased estimate of this expectation. <br><br>
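
To make the claim $S_t \sim \mu$ a little more concrete, here is a small sketch. <br>
It assumes a hypothetical two-state chain `P` that a fixed policy might induce (the numbers are made up for this example). <br>
It computes $\mu$ by repeatedly applying the transition matrix, then compares it with the fraction of time a simulated agent spends in each state.

```python
import random

# Hypothetical chain induced by following a fixed policy pi:
# P[s][s2] is the probability of moving from state s to state s2.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def stationary(P, iters=1000):
    """Approximate mu by applying mu <- mu P many times."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

def visit_frequencies(P, steps, seed=0):
    """Fraction of time steps spent in each state along one long trajectory."""
    rng = random.Random(seed)
    n = len(P)
    s, counts = 0, [0] * n
    for _ in range(steps):
        counts[s] += 1
        s = rng.choices(range(n), weights=P[s])[0]
    return [c / steps for c in counts]

mu = stationary(P)                         # close to [5/6, 1/6] for this chain
freq = visit_frequencies(P, steps=100_000)
```

In the long run `freq` should be close to `mu`, <br>
which is why gradients computed at visited states can stand in for the sum weighted by $\mu$. <br><br>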

Thinking about our stochastic gradient as an unbiased estimate <br>
suggests one other simplification ! 

<br><br>


<img src="https://drive.google.com/uc?id=1AlV0xxm3TKSZPl-odc4Hc71sla_DrwfT" alt="3-05" width="500">

<br><br>

Notice that <br>
inside the expectation we have a sum over all actions $\displaystyle \sum_a$. <br>
We want to make this term even simpler, and get rid of the sum over all actions ! <br><br>

If this was an expectation over actions, <br>
we could get a stochastic sample of this too, and avoid summing over all actions !


<br><br><br>




### Getting Stochastic Samples with One Action <br><br>

 
<img src="https://drive.google.com/uc?id=1IOtTvakDZYtS_4fpQM2YG1Qjt0X-4bAX" alt="3-06" width="500">

<br>

Here we're going to see <br>
how we can get an unbiased gradient estimate using only one action, <br>
which is the action taken by the agent. <br><br>

It would be nice if the sum over actions was weighted by $\pi$, and so was an expectation under $\pi$. <br><br>

That way we could sample it <br>
using the agent's action selection, which is distributed according to $\pi$. 

<br><br>


<img src="https://drive.google.com/uc?id=1HVHdnFDrYZhyQOJAJfkOizfu2Ok3xyG-" alt="3-07" width="500">

<br>

It turns out this is an easy problem to solve ! <br><br>

To get a weighted sum corresponding to an expectation, <br>
we can just multiply and divide by $\pi(a|s,\theta)$. <br><br>

Now we have an expectation over actions drawn from $\pi$ for this term (in the red-box)! <br>
$= \mathbb{E}_{\pi} \big[ \frac{\nabla \pi(A|S,\theta)}{\pi(A|S,\theta)} q_{\pi}(S,A) \big]$
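
<br><br>

The multiply-and-divide step can be checked numerically. <br>
The sketch below assumes a tabular soft-max policy in a single state, with made-up preferences `prefs` and values `q`. <br>
It compares the exact sum over actions with an average over actions sampled from $\pi$.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_pi(pi, a):
    """Gradient of pi(a) w.r.t. each preference, for a tabular soft-max policy."""
    return [pi[a] * ((1.0 if k == a else 0.0) - pi[k]) for k in range(len(pi))]

prefs = [0.5, -0.2, 0.1]   # hypothetical preferences for one state S
q = [1.0, 2.0, 0.0]        # hypothetical values standing in for q_pi(S, a)
pi = softmax(prefs)
n = len(prefs)

# Exact quantity: sum_a grad pi(a|S) * q(S, a)
exact = [sum(grad_pi(pi, a)[k] * q[a] for a in range(n)) for k in range(n)]

# Sampled quantity: average of grad pi(A|S) / pi(A|S) * q(S, A), with A ~ pi
rng = random.Random(0)
num_samples = 50_000
estimate = [0.0] * n
for _ in range(num_samples):
    a = rng.choices(range(n), weights=pi)[0]
    g = grad_pi(pi, a)
    for k in range(n):
        estimate[k] += g[k] / pi[a] * q[a] / num_samples
```

With enough samples, `estimate` should land close to `exact`, <br>
even though each individual sample only looks at one action.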

<br><br><br>




### Stochastic Gradient Ascent for Policy Parameters <br><br>


<img src="https://drive.google.com/uc?id=1A9QR69iUPTSi7wWabg_yxFlpCqRRA7eT" alt="3-08" width="500">

<br>

The new stochastic gradient ascent update now looks like this. 

<br><br>


<img src="https://drive.google.com/uc?id=1lMY0iDhqhzkQ2vIhq5fe-Zsg0hoxnUep" alt="3-09" width="500">

<br><br>


As an aside, <br>
it is common to rewrite this gradient as the gradient of the natural logarithm of $\pi$ ! <br><br>

This is based on a formula from calculus for the gradient of a logarithm. <br>
$\nabla \ln \big(f(x)\big) = \displaystyle \frac{\nabla f(x)}{f(x)}$ <br><br>

Using this rule, <br>
we get that the gradient of $\ln \pi$ equals the gradient of $\pi$ over $\pi$. <br>
$\nabla \ln \pi(a|s,\theta) = \displaystyle \frac{\nabla \pi(a|s,\theta)}{\pi(a|s,\theta)}$

<br><br>


So this update is equivalent to what we started with.
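
<br><br>

As a sanity check of this identity, the sketch below uses central finite differences. <br>
It assumes a tabular soft-max policy (an assumption for this example only), <br>
for which the gradient of $\ln \pi(a)$ is known to be a one-hot vector at $a$ minus the action probabilities. <br>
The helper `numeric_grad` and the numbers are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + eps] + x[k + 1:]
        down = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

prefs = [0.5, -0.2, 0.1]   # hypothetical preferences
a = 1                      # the action whose probability we differentiate
pi = softmax(prefs)

# Left-hand side: gradient of ln pi(a)
grad_ln = numeric_grad(lambda x: math.log(softmax(x)[a]), prefs)
# Right-hand side: gradient of pi(a), divided by pi(a)
grad_ratio = [g / pi[a] for g in numeric_grad(lambda x: softmax(x)[a], prefs)]
# Closed form for a tabular soft-max policy: onehot(a) - pi
closed_form = [(1.0 if k == a else 0.0) - pi[k] for k in range(len(prefs))]
```

All three vectors should agree up to finite-difference error. <br>
The closed form is one example of why the gradient of the logarithm can be easier to work with.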

<br><br><br>




### Why do we do this ? <br><br>


One reason is that <br>
it is actually simpler to compute the gradient of the logarithm of certain distributions ! <br><br>

The other, less important reason is that <br>
it lets us write this gradient more compactly.

<br><br>

>In the end, <br>
>it is just a mathematical trick, <br>
>so don't let it distract you from the underlying algorithm.

<br><br>


We now have something that looks like many of the learning rules used in this course. <br><br>

We adjust the parameter $\theta$ proportionally to a stochastic gradient of the objective. <br><br>

We use a step size parameter $\alpha$ to control the magnitude of the step in that direction. <br>
So $\alpha$ has the same role it always has.

<br><br>


We now have a nice clean update rule to learn the policy parameters !
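
<br><br>

Here is a minimal sketch of one step of this sampled update for a single state. <br>
It assumes a tabular soft-max policy and a made-up table `q_hat` standing in for $q_{\pi}$; <br>
the function name `sampled_policy_update` is hypothetical.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def sampled_policy_update(prefs, q_estimate, alpha, rng):
    """theta <- theta + alpha * grad ln pi(A) * q(A), with a single A ~ pi."""
    pi = softmax(prefs)
    action = rng.choices(range(len(prefs)), weights=pi)[0]
    # For a tabular soft-max policy: grad ln pi(A) = onehot(A) - pi
    grad_ln_pi = [(1.0 if k == action else 0.0) - pi[k] for k in range(len(prefs))]
    new_prefs = [p + alpha * g * q_estimate[action]
                 for p, g in zip(prefs, grad_ln_pi)]
    return action, new_prefs

theta = [0.0, 0.0, 0.0]    # hypothetical preferences for the state S_t
q_hat = [1.0, 2.0, 0.5]    # hypothetical positive value estimates
action, new_theta = sampled_policy_update(theta, q_hat, alpha=0.1,
                                          rng=random.Random(0))
```

Only the sampled action's value is used. <br>
Because the estimate for that action is positive here, its preference increases and the others decrease.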

<br><br><br>




### Computing the Update <br><br>


<img src="https://drive.google.com/uc?id=1FdJctjxM5xsEgxTk0tAIknrUSzfvbmJl" alt="3-10" width="500">

<br>

The last thing to talk about is <br>
how to actually compute the stochastic gradient for a given state and action. <br><br>

We just need two components, <br>
  - the gradient of the policy <br>
  $\nabla \ln \pi(A_t|S_t,\theta_t)$ <br>
  - an estimate of the differential values <br>
  $q_{\pi}(S_t,A_t)$ <br><br>

The first is easy: <br>
we know the policy and its parameterization, <br>
and so we can compute its gradient. <br><br>

The second, the action value, can be approximated in a variety of ways. <br>
For example, <br>
we could use a TD algorithm that learns differential action-values. <br><br>

In an upcoming video, <br>
we will go through one particular choice in detail, <br>
as well as how to compute the gradient for a specific policy parameterization.


<br><br><br><br><br>






### Summary <br><br>


  - We derived a policy gradient learning rule for the average reward setting

<br><br>

In the next video, <br>
we will see how to use this rule.


<br><br><br><br><br>






## $\cdot$ Actor-Critic Algorithm <br><br>


  - Describe the actor-critic algorithm <br>
  for control with function approximation <br>
  for continuing tasks

<br><br>

Do we have to choose between directly learning the policy parameters and learning a value function ? <br><br>

No ! <br>
Even within policy gradient methods, value-learning methods like TD still have an important role to play. <br><br>

In this setup, <br>
the parameterized policy plays the role of an actor, <br>
while the value function plays the role of a critic, <br>
evaluating the actions selected by the actor. <br><br>

These so-called actor-critic methods <br>
were some of the earliest TD-based methods introduced in Reinforcement Learning. 


<br><br><br><br><br>










### Approximating the Action-Value in the Policy Update <br><br>


<img src="https://drive.google.com/uc?id=1wf-2zrdn3R-McXIIOAr5OIUG1IOPgI2J" alt="3-11" width="500">

<br>

We finished off the last video with this expression for the policy gradient learning rule. <br>
$\theta_{t+1} \quad = \quad \theta_t + \alpha \nabla \ln \pi (A_t | S_t,\theta_t) q_{\pi}(S_t,A_t)$ <br><br>

But, we don't have access to $q_{\pi}$, <br>
so we'll have to approximate it ! 

<br><br>


<img src="https://drive.google.com/uc?id=1vdOloIMG8rTyXu74Bzq3jxtCTVfhOlS8" alt="3-12" width="500">

<br>

We can do the usual TD thing, <br>
the one-step bootstrap return. <br><br>

That is the differential reward plus the value of the next state. <br>
$R_{t+1} - \bar{R} + \hat{v}(S_{t+1},W)$ 

<br><br>


#### Critic part and Actor part of the actor-critic algorithm <br><br>


<img src="https://drive.google.com/uc?id=181G46cwlgCgimFvm8eUu7gLujWZgQB1E" alt="3-13" width="500">

<br><br>

  - Critic part of the actor-critic algorithm <br><br>

  As usual, <br>
  the parameterized function $\hat{v}(s,W)$ is a learned estimate of the value function. <br>
  In this case, <br>
  $\hat{v}(s,W)$ is the differential value function. <br><br>

  This is the critic part of the actor-critic algorithm. <br><br>

  The critic provides immediate feedback. <br><br>

  To train the critic, <br>
  we can use any state-value learning algorithm. <br>
  We will use the average reward version of semi-gradient TD(0).

<br><br>


  - Actor part of the actor-critic algorithm <br><br>

  The parameterized policy is the actor ! <br><br>

  It uses the policy gradient updates shown here.

<br><br><br>




### Subtracting the Current State's Value Estimate <br><br>


policy gradient update without baseline | policy gradient update with baseline
--- | ---
<img src="https://drive.google.com/uc?id=135Q9FWuGH7A9J9z_T_be5bsvF89CEYHv" alt="3-14" width="500"> | <img src="https://drive.google.com/uc?id=1yAj8Q8KsTe71dg1B5HBaIQ6yXprRpafy" alt="3-15" width="500">

<br>

We could use this form of the update, <br>
but there is one last thing we can do to improve the algorithm. <br><br>

We can subtract off what is called a baseline ! <br>
$\hat{v}(S_t,W)$ is the baseline in this case. <br><br>

Instead of using the one-step value estimate alone, <br>
we can subtract the value estimate for the state $S_t$ to get the update that looks like this. 

<br><br>


<img src="https://drive.google.com/uc?id=1uZhuIRG5k8PV6Y2MCuVxTXmWj9NHiXNv" alt="3-16" width="500">

<br>

Notice that <br>
this expression is equal to the TD error $\delta$ ! <br><br>
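
The arithmetic can be sketched in a few lines. <br>
The function name and the numbers below are hypothetical and only illustrate the formula.

```python
def differential_td_error(reward, avg_reward, v_next, v_current):
    """delta = R - R_bar + v_hat(S') - v_hat(S)."""
    target = reward - avg_reward + v_next   # one-step bootstrap return
    return target - v_current               # subtract the baseline v_hat(S)

# Made-up values: R = 1.0, R_bar = 0.4, v_hat(S') = 2.0, v_hat(S) = 2.5
delta = differential_td_error(reward=1.0, avg_reward=0.4,
                              v_next=2.0, v_current=2.5)
```

Here `delta` is slightly positive (about $0.1$), so the outcome was a little better than the critic expected. <br><br>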

The expected value of this update is the same as the previous one. <br>
Why is this ?

<br><br><br>




### Adding a Baseline <br><br>


<img src="https://drive.google.com/uc?id=1Vw5wAmegidHFSQZCDov1P_caYDgJNH_R" alt="3-17" width="500">

<br>

Let's take the expectation of the update conditioned on a particular state $S_t$ at time $t$. 

<br><br>


<img src="https://drive.google.com/uc?id=1IW5M5RT8b9AFEFqAR1htWgpcyPu3rLQy" alt="3-18" width="500">

<br>

Taking the expectation of the sum is the same as the sum of the expectations. <br><br>

We can use this <br>
to separate out the expectation of our original term <br>
from the expectation which involves the subtracted value function. <br><br>

It turns out the expectation of the second term is $0$. <br>
So we can add this baseline to the update without changing the expectation of the update. <br><br>

>You can verify this for yourself. <br>
>To start, write the expectation as a sum over actions, <br>
>and pull the $\hat{v}()$ term out of the sum. <br>
>( we leave this as an exercise )
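
<br><br>

A numerical check can complement the exercise. <br>
The sketch below assumes a tabular soft-max policy in one state, with made-up preferences and a made-up baseline value. <br>
It sums $\pi(a|s) \, \nabla \ln \pi(a|s,\theta) \, \hat{v}(s)$ over actions. <br>
The result should vanish because $\sum_a \nabla \pi(a|s,\theta) = \nabla \sum_a \pi(a|s,\theta) = \nabla 1 = 0$.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def expected_baseline_term(prefs, baseline):
    """E_pi[ grad ln pi(A|s) * baseline ], computed by summing over actions."""
    pi = softmax(prefs)
    n = len(prefs)
    total = [0.0] * n
    for a in range(n):
        for k in range(n):
            # Tabular soft-max: component k of grad ln pi(a) is onehot(a)[k] - pi[k]
            grad_ln_pi = (1.0 if k == a else 0.0) - pi[k]
            total[k] += pi[a] * grad_ln_pi * baseline
    return total

# Hypothetical preferences and a hypothetical baseline value v_hat(s) = 3.7
term = expected_baseline_term([0.5, -0.2, 0.1], baseline=3.7)
```

Every component of `term` is zero up to floating-point rounding, whatever baseline value is chosen, <br>
as long as the baseline does not depend on the action.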

<br><br><br>


#### So why do we add this baseline if the update is the same in expectation ? <br><br>


<img src="https://drive.google.com/uc?id=188vXMKmkRjNuGz5mjz6hXbTXxGtrz0Z8" alt="3-19" width="500">

<br>

Subtracting this baseline tends to reduce the variance of the update <br>
which results in faster learning !
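
<br><br>

To see the effect on a toy case, the sketch below computes the exact mean and variance of one component of the update, with and without a baseline. <br>
It is a simplification: it uses made-up exact action values `q` in place of the sampled TD target, <br>
and a two-action tabular soft-max policy in a single state. All names and numbers are hypothetical.

```python
import math

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def update_mean_and_variance(pi, q, baseline):
    """Exact mean and variance of grad ln pi(A)[0] * (q(A) - baseline), A ~ pi."""
    k = 0  # look at the component for the first preference only
    samples = [((1.0 if a == k else 0.0) - pi[k]) * (q[a] - baseline)
               for a in range(len(pi))]
    mean = sum(p * x for p, x in zip(pi, samples))
    var = sum(p * (x - mean) ** 2 for p, x in zip(pi, samples))
    return mean, var

pi = softmax([0.3, -0.3])                # hypothetical policy in one state
q = [10.0, 11.0]                         # hypothetical action values
v = sum(p * x for p, x in zip(pi, q))    # state value, used as the baseline

mean_plain, var_plain = update_mean_and_variance(pi, q, baseline=0.0)
mean_base, var_base = update_mean_and_variance(pi, q, baseline=v)
```

In this toy case the two means agree, while the variance with the baseline is far smaller. <br>
The gap is large here because both action values share a big common offset, which the baseline removes.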

<br><br><br>




### How the Actor and the Critic Interact <br><br>


<img src="https://drive.google.com/uc?id=1bd89OMN1nnXU42b4xKCWvLfa2VbyV-dI" alt="3-20" width="500">

<br>

This update makes sense intuitively. 

<br><br>


<img src="https://drive.google.com/uc?id=1uclsq8RaLJwMDFFwTiYMWRee65KMufTK" alt="3-21" width="500">

<br>

After we execute an action, <br>
we use the TD error to decide how good the action was compared to the average for that state. <br><br>

If the TD error is positive, <br>
then it means the selected action resulted in a higher value than expected. <br>
Taking that action more often should improve our policy.

<br><br>


<img src="https://drive.google.com/uc?id=1-36_IdQ98E7iKjq4baJ-mvUyhOo_FKV9" alt="3-22" width="500">

<br>

That is exactly what this update does. <br><br>

It changes the policy parameters $\theta$ <br>
to increase the probability of actions that were better than expected according to the critic. <br><br>

Correspondingly, <br>
if the critic is disappointed and the TD error is negative, <br>
then the probability of the action is decreased.

<br><br>


<img src="https://drive.google.com/uc?id=1Z8RsrAhmzEEGsLxTbRz77d52za2rDp4G" alt="3-23" width="500">

<br>

The actor and the critic learn at the same time, constantly interacting. <br><br>

The actor is continually changing the policy to exceed the critic's expectation, and <br>
the critic is constantly updating its value function to evaluate the actor's changing policy.

<br><br><br>




### Actor-Critic algorithm <br><br>


With the policy update in place, <br>
we're ready to go through the full algorithm for average reward actor-critic.

<br><br>


<img src="https://drive.google.com/uc?id=1ueU9Wa1l8XOl1cl26w-HjXL9YAhszFxZ" alt="3-24" width="500">

<br><br>


To start, <br>
we specify the policy parameterization and the value function parameterization. <br>
>Input : a differentiable policy parameterization $\pi(a|s,\theta)$ <br>
>Input : a differentiable state-value function parameterization $\hat{v}(s,W)$

For example, <br>
we might use Tile Coding to construct the approximate value function and a Soft-max policy parameterization. 

<br><br>


We will need to maintain an estimate of the average reward $\bar{R}$, just like we did in the differential SARSA algorithm. <br>
We initialize this to $0$ <br>
>Initialize $\bar{R} \in \mathbb{R}$ to $0$

<br><br>


We can initialize the weights and the policy parameters however we like. <br>
>Initialize state-value weights $W \in \mathbb{R}^d$ and policy parameter $\theta \in \mathbb{R}^{d'} \quad$ (e.g. to $0$)

We initialize the step size parameters for the value estimate, the policy, and the average reward <br>
and they could all be different. <br>
>Algorithm parameters : $\alpha^W > 0, \;\; \alpha^\theta > 0, \;\; \alpha^{\bar{R}} > 0$

<br><br>


We get the initial state from the environment, and then begin acting and learning. <br>
>Initialize $S \in \mathcal{S}$ <br>
>Loop forever (for each time step)

<br><br>


On each time step, <br><br>

we choose the action according to our policy <br>
and receive the next state and reward from the environment. <br>
>Loop forever (for each time step) <br>
>$\quad A \sim \pi(\; \cdot \; | S,\theta)$ <br>
>$\quad$ Take the action $A$, $\quad$ observe $S', \; R$

<br>

Using this information, <br>
we compute the differential TD error <br>
and update our running estimate of the average reward. <br>
>$\delta \quad \leftarrow \quad R - \bar{R} + \hat{v}(S',W) - \hat{v}(S,W)$ <br>
>$\bar{R} \quad \leftarrow \quad \bar{R} + \alpha^{\bar{R}} \delta$

<br>

We update the value function weights using the TD update. <br>
>$W \quad \leftarrow \quad W + \alpha^W \delta \nabla \hat{v}(S,W)$

<br>

Finally, <br>
we update the policy parameters using our policy gradient update ! <br>
>$\theta \quad \leftarrow \quad \theta + \alpha^\theta \delta \nabla \ln \pi(A \; |S,\theta)$

<br>

>$S \quad \leftarrow \quad S'$

<br><br>

That's it ! <br><br>

This algorithm is designed for continuing tasks. <br>
So we can run it indefinitely and continue to improve the policy forever.
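
<br><br>

The steps above can be sketched as a short program. <br>
This is a minimal sketch under several assumptions that are not from the lecture: <br>
a made-up two-state continuing task `env_step`, a tabular critic (one weight per state), <br>
a tabular soft-max actor (one preference per state-action pair), and arbitrary step sizes. <br>
It also stops after a fixed number of steps instead of looping forever.

```python
import math
import random

NUM_STATES, NUM_ACTIONS = 2, 2

def env_step(state, action, rng):
    """Hypothetical continuing task used only for this sketch.

    Action a moves to state a with probability 0.8, otherwise to the other state.
    Action 0 always pays 0.2; action 1 pays 1.0 in state 1 and 0.0 in state 0.
    """
    next_state = action if rng.random() < 0.8 else 1 - action
    if action == 0:
        reward = 0.2
    else:
        reward = 1.0 if state == 1 else 0.0
    return next_state, reward

def softmax(prefs):
    """Turn action preferences into probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic(num_steps=50_000, alpha_w=0.1, alpha_theta=0.05,
                 alpha_r=0.01, seed=0):
    rng = random.Random(seed)
    w = [0.0] * NUM_STATES                                    # critic: v_hat(s, W) = w[s]
    theta = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]  # actor preferences
    avg_reward = 0.0                                          # R_bar
    state = 0
    for _ in range(num_steps):
        pi = softmax(theta[state])
        action = rng.choices(range(NUM_ACTIONS), weights=pi)[0]
        next_state, reward = env_step(state, action, rng)

        # Differential TD error and average reward update
        delta = reward - avg_reward + w[next_state] - w[state]
        avg_reward += alpha_r * delta
        # Tabular critic: grad v_hat(S, W) is one-hot at S
        w[state] += alpha_w * delta
        # Tabular soft-max actor: grad ln pi(A|S) = onehot(A) - pi(.|S)
        for a in range(NUM_ACTIONS):
            indicator = 1.0 if a == action else 0.0
            theta[state][a] += alpha_theta * delta * (indicator - pi[a])
        state = next_state
    return theta, w, avg_reward

theta, w, avg_reward = actor_critic()
policy = [softmax(prefs) for prefs in theta]   # policy[s][a] = pi(a|s)
```

In this toy task, action 1 is the better long-run choice in both states, <br>
even though action 0 pays more immediately in state 0. <br>
The critic's value estimates are what let the actor discover this, so `policy` should end up favouring action 1 in both states.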

<br><br><br><br><br>






### Summary <br><br>


  - It is useful to learn a value function to estimate the gradient for the policy parameters <br><br>

  - The actor-critic algorithm implements this idea, <br>
  with a critic that learns a value function for the actor


<br><br>




<br><br><br><br><br>



