
Reinforcement Learning

Deep Deterministic Policy Gradients

We present our implementation of the reinforcement learning algorithm Deep Deterministic Policy Gradient (DDPG), introduced by Lillicrap et al.

For more details, please refer to my blog post.


Usage Instructions

We recommend using Python 3. Install the necessary dependencies, such as OpenAI Gym and TensorFlow:

pip3 install gym
pip3 install tensorflow
pip3 install tqdm
pip3 install matplotlib

To train and run MuJoCo environments, get a one-month trial license from MuJoCo. If you are a student with a .edu email address, you can get a one-year MuJoCo license for free.

cd src
python3 train.py --env_id='<any OpenAI Gym continuous-control environment>' --model_dir=../trained_models/

To run our pretrained models:

cd src
python3 run.py --env_id=Pendulum-v0 --model_dir=../trained_models/Pendulum-v0/

Introduction

  • RL allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize their performance / rewards.
  • Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

Why DDPG?

  • Deep Q-Networks (DQN) showed great success in playing Atari games; however, they work only with discrete action spaces.

  • Discretizing a continuous action space with many dimensions suffers from the curse of dimensionality: the number of discrete actions grows exponentially with the number of action dimensions (see the sketch after this list).

  • DDPG solves this problem with an actor-critic method that works directly in continuous action spaces.
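
To make the dimensionality argument concrete, here is a minimal sketch (illustrative only, not part of this repository) of how fast a discretized action space grows:

# Illustrative only: if each action dimension is discretized into
# `bins` values, a DQN-style agent must choose among bins**dims
# discrete actions, which grows exponentially with dims.
bins = 10
for dims in (1, 3, 7):
    print(f"{dims} action dims x {bins} bins -> {bins ** dims:,} discrete actions")
# 1 action dims x 10 bins -> 10 discrete actions
# 3 action dims x 10 bins -> 1,000 discrete actions
# 7 action dims x 10 bins -> 10,000,000 discrete actions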

Overview

Algorithm

  1. Initialize the actor and critic network parameters randomly.

  2. Make copies of the actor and critic and store them as target_actor and target_critic.

  3. The actor predicts an action a_t based on the current state s_t, plus added exploration noise.

  4. Receive the reward r_t for the action a_t, along with the observation of the next state s_{t+1}.

  5. Store this tuple in the experience replay buffer as e = (s_t, a_t, r_t, s_{t+1}).

  6. Repeat steps 3–5 for N steps to build up experience.

  7. Pick a batch of experiences at random and compute the right-hand side of the Bellman equation, y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), using the target networks.

  8. Using the experience batch, compute the critic gradients from the critic loss L_c (the mean squared error between Q(s_t, a_t) and y_t) and the actor gradients from the deterministic policy gradient.

  9. Update the target_actor and target_critic parameters by the soft update rule theta' <- tau * theta + (1 - tau) * theta', with tau << 1 (see the sketch after this list).

  10. Repeat.
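
To make steps 5, 7, and 9 concrete, here is a minimal Python sketch of the replay buffer, the Bellman target computation, and the soft target update. It is an illustration under assumed interfaces (target_actor and target_critic stand for callables that map batches of states/actions to predictions), not the repository's train.py:

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Step 5: fixed-size buffer of (s_t, a_t, r_t, s_{t+1}, done) tuples."""
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Step 7: pick a batch of experiences uniformly at random.
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.asarray, zip(*batch))
        return s, a, r, s_next, done

def critic_targets(r, s_next, done, target_actor, target_critic, gamma=0.99):
    # Step 7: y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), where Q' and mu'
    # are the target networks; terminal transitions (done == 1) get no bootstrap.
    a_next = target_actor(s_next)
    q_next = target_critic(s_next, a_next)
    return r + gamma * (1.0 - done) * q_next

def soft_update(target_params, source_params, tau=0.001):
    # Step 9: theta' <- tau * theta + (1 - tau) * theta'
    return [tau * src + (1.0 - tau) * tgt
            for tgt, src in zip(target_params, source_params)]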

Hyper-Parameter Tuning

The classical trade-off in reinforcement learning is exploration vs. exploitation. This trade-off is adjusted by tuning the hyper-parameters of the algorithm.

The main hyper-parameters for DDPG are:

  1. Gamma: the discount factor, which determines how much importance is given to future rewards compared to immediate ones.

  2. Number of roll-out steps: the number of environment steps collected before the networks are trained for K training steps. These steps build up random-exploration experience in the replay buffer.

  3. Standard deviation of action noise: we use an Ornstein-Uhlenbeck process for the action noise, which suits systems with inertia. Changing the standard deviation changes the degree of exploration (see the sketch after this list).

  4. Number of training steps: after the exploration steps, we train the actor and critic parameters for K steps.
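
For reference, below is a minimal sketch of Ornstein-Uhlenbeck action noise (theta = 0.15 and sigma = 0.2 are the values suggested by Lillicrap et al.; dt is a common implementation choice; the class itself is an illustration, not this repository's implementation):

import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * dW.

    Unlike independent Gaussian noise, consecutive samples are correlated,
    which suits exploration in physical systems with inertia.
    """
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.01):
        self.mu = mu * np.ones(action_dim)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.reset()

    def reset(self):
        # Restart the process at its mean at the start of each episode.
        self.x = np.copy(self.mu)

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(len(self.x)))
        self.x = self.x + dx
        return self.x

A typical use is action = np.clip(actor(state) + noise.sample(), action_low, action_high), with noise.reset() called at every episode start. Increasing sigma widens exploration; decreasing it makes the policy more exploitative.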

Tuning the Hyper-Parameters for Desired Behaviors

As an example, in the Half-Cheetah environment the cheetah is rewarded for running in the forward direction. During training, we find that it first flips onto its back as a result of the initial random exploration and then keeps moving forward in this flipped state.

Also, each time it flips, at the start or mid-episode, its running speed drops, reducing the overall reward.

We tackle this by reducing the number of roll-out steps, the exploration steps that cause it to flip at the start. The change in the algorithm is reflected in the cheetah's behavior: instead of rushing forward, which would cause it to lose balance, the cheetah learns a stable gait with increased speed and receives the maximum overall reward.

Comparison of Different Parameter Settings Based on Reward per Episode

As the reward plots show, the setting with lower noise, fewer roll-out steps, and regularization receives the maximum reward and produces a plausible gait.

This video shows the effect of parameter tuning on the gait of Half-Cheetah.

