Implementation of the actor-critic algorithm.

Requirements:
- Python 2.7 or 3.5
- TensorFlow 1.10
- gym
- numpy
- tqdm (progress bar)
How it works:
- Using a neural-network policy as the actor
- Using a Q-network as the critic
- Using the policy gradient theorem to update the actor
- Using a variation of the Q-learning update to train the critic (see the sketch after this list)
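A minimal sketch of how these pieces could fit together in TensorFlow 1.x. The layer sizes, learning rate, and all variable names are illustrative assumptions, not the repository's actual code:

```python
import tensorflow as tf

state_dim, n_actions = 4, 2   # CartPole-v0 dimensions (assumed)

states  = tf.placeholder(tf.float32, [None, state_dim], name="states")
actions = tf.placeholder(tf.int32,   [None],            name="actions")
targets = tf.placeholder(tf.float32, [None],            name="targets")

# Actor: a neural-network policy over the discrete actions.
with tf.variable_scope("actor"):
    hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, n_actions)

# Critic: a Q-network producing one action-value per action.
with tf.variable_scope("critic"):
    hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
    q_values = tf.layers.dense(hidden, n_actions)

mask = tf.one_hot(actions, n_actions)
q_taken = tf.reduce_sum(mask * q_values, axis=1)

# Policy gradient theorem: ascend log pi(a|s) * Q(s,a); stop_gradient
# keeps the actor update from backpropagating into the critic.
log_prob = tf.reduce_sum(mask * tf.nn.log_softmax(logits), axis=1)
actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(q_taken))

# Critic: regress Q(s,a) toward targets computed as in the note below.
critic_loss = tf.reduce_mean(tf.square(q_taken - targets))

train_op = tf.train.AdamOptimizer(1e-3).minimize(actor_loss + critic_loss)
```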
The critic is trained with the update

Q(s, a) ← Q(s, a) + α [ r + γ Σ_a' π(a'|s') Q(s', a') − Q(s, a) ]

Note that this is the same as the Q-learning update except that, instead of the max of the next action-values, it uses their average under the current policy (the Expected SARSA update). The rationale is that this update converges to the action-values of the current policy, while the max (Q-learning) update converges to the action-values of the optimal policy. The policy gradient update for the actor needs the action-values of the current policy, which is why the averaged update is used.
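As a concrete illustration of that target, here is a minimal NumPy sketch. The function name, discount factor, and example numbers are hypothetical, not taken from this repository:

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value)

def critic_target(reward, next_action_probs, next_q_values, done):
    # Average the next action-values under the current policy instead of
    # taking their max, as Q-learning would.
    expected_q = np.dot(next_action_probs, next_q_values)
    return reward + gamma * expected_q * (1.0 - done)

# Policy [0.7, 0.3] and next Q-values [0.5, 1.5] give
# 1.0 + 0.99 * (0.7*0.5 + 0.3*1.5) = 1.792 for a non-terminal step.
print(critic_target(1.0, np.array([0.7, 0.3]), np.array([0.5, 1.5]), 0.0))
```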
To train a model on CartPole-v0:
$ bash run.sh
To view the training logs in TensorBoard:
$ tensorboard --logdir .