Implementation of the actor-critic algorithm.

Requirements:
- Python 2.7 or 3.5
- TensorFlow 1.10
- gym
- numpy
- tqdm (progress bar)
How it works:
- Using a neural-network policy as the actor
- Using a Q-network as the critic
- Using the policy gradient theorem to update the actor
- Using a variation of the Q-learning update to train the critic (see the sketch after this list)
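A minimal sketch of how these pieces could fit together in TensorFlow 1.x. The layer sizes, learning rate, and all variable names are illustrative assumptions, not the repository's actual code:

```python
import tensorflow as tf

state_dim, n_actions = 4, 2   # CartPole-v0 dimensions (assumed)

states  = tf.placeholder(tf.float32, [None, state_dim], name="states")
actions = tf.placeholder(tf.int32,   [None],            name="actions")
targets = tf.placeholder(tf.float32, [None],            name="targets")

# Actor: a neural-network policy over the discrete actions.
with tf.variable_scope("actor"):
    hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, n_actions)

# Critic: a Q-network producing one action-value per action.
with tf.variable_scope("critic"):
    hidden = tf.layers.dense(states, 32, activation=tf.nn.relu)
    q_values = tf.layers.dense(hidden, n_actions)

mask = tf.one_hot(actions, n_actions)
q_taken = tf.reduce_sum(mask * q_values, axis=1)

# Policy gradient theorem: ascend log pi(a|s) * Q(s,a); stop_gradient
# keeps the actor update from backpropagating into the critic.
log_prob = tf.reduce_sum(mask * tf.nn.log_softmax(logits), axis=1)
actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(q_taken))

# Critic: regress Q(s,a) toward targets computed as in the note below.
critic_loss = tf.reduce_mean(tf.square(q_taken - targets))

train_op = tf.train.AdamOptimizer(1e-3).minimize(actor_loss + critic_loss)
```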
The critic is trained with the update

Q(s, a) ← Q(s, a) + α [ r + γ Σ_a' π(a'|s') Q(s', a') − Q(s, a) ]

Note that this is the same as the Q-learning update except that, instead of the max of the next action-values, it uses their average under the current policy (the Expected SARSA update). The rationale is that this update converges to the action-values of the current policy, while the max (Q-learning) update converges to the action-values of the optimal policy. The policy gradient update for the actor needs the action-values of the current policy, which is why the averaged update is used.
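As a concrete illustration of that target, here is a minimal NumPy sketch. The function name, discount factor, and example numbers are hypothetical, not taken from this repository:

```python
import numpy as np

gamma = 0.99  # discount factor (assumed value)

def critic_target(reward, next_action_probs, next_q_values, done):
    # Average the next action-values under the current policy instead of
    # taking their max, as Q-learning would.
    expected_q = np.dot(next_action_probs, next_q_values)
    return reward + gamma * expected_q * (1.0 - done)

# Policy [0.7, 0.3] and next Q-values [0.5, 1.5] give
# 1.0 + 0.99 * (0.7*0.5 + 0.3*1.5) = 1.792 for a non-terminal step.
print(critic_target(1.0, np.array([0.7, 0.3]), np.array([0.5, 1.5]), 0.0))
```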
To train a model on CartPole-v0:
$ bash run.sh
To view the training logs in TensorBoard:
$ tensorboard --logdir .