Actor Critic for Mountain Car #3

Closed
zencoding opened this issue Jun 1, 2017 · 2 comments

Comments

@zencoding

Hi,
Thanks a lot for the wonderful work; the code is well written and modular, and it helped me a lot to play with RL in general. I want to know if you have an opinion on using actor-critic methods on control problems such as Mountain Car. I added Mountain Car as an environment and ran A2C and A3C, and neither converged to a solution. I changed the discount factor to 0.9 and got both the actor loss and the critic loss to zero, but the episode reward stayed at -200. I then tried adding an entropy bonus to improve exploration, but that leads to exploding gradients. Looking deeper, it seems that actor-critic methods are not good at exploring the space when the return is constant.
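For reference, here is a minimal sketch of how an entropy bonus is typically added to the actor loss, with gradient-norm clipping as a guard against the exploding gradients mentioned above. This is not the repository's code; the placeholder names, layer sizes, and coefficients are assumptions.

```python
import tensorflow as tf

# Hypothetical placeholders; a sketch of entropy regularization, not the repository's code.
states = tf.placeholder(tf.float32, [None, 2])     # MountainCar state: position, velocity
actions = tf.placeholder(tf.int32, [None])         # actions that were taken
advantage = tf.placeholder(tf.float32, [None])     # estimated advantages
beta = 0.01                                        # entropy coefficient (assumed value)

hidden = tf.layers.dense(states, 20, activation=tf.nn.tanh)
logits = tf.layers.dense(hidden, 3)                # 3 discrete actions
log_probs = tf.nn.log_softmax(logits)
probs = tf.nn.softmax(logits)

# Entropy of the policy at each state; the bonus rewards keeping the policy stochastic.
entropy = -tf.reduce_sum(probs * log_probs, axis=1)
taken_log_prob = tf.reduce_sum(tf.one_hot(actions, 3) * log_probs, axis=1)
actor_loss = -tf.reduce_mean(taken_log_prob * advantage + beta * entropy)

# Clipping the gradient norm is a common guard against exploding gradients
# when an entropy bonus is added.
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(actor_loss)
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)
```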

I looked at other implementations that solve Mountain Car, and it is solved using either function approximation (tile coding) or DPG; no one had used actor-critic.


arnomoonens commented Jun 2, 2017

Hello,

Thank you for the feedback about my code! I'm glad that my repository can help other people.

I have experienced the same issue with the MountainCar-v0 environment. The problem is that we are applying on-policy methods (A2C and A3C) to an environment that rarely gives useful rewards (i.e., only at the end).
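For context, a small gym snippet illustrating the reward structure: every step returns -1 and episodes are capped at 200 steps, so the return stays at -200 until the goal is actually reached.

```python
import gym

env = gym.make("MountainCar-v0")
state = env.reset()
total_reward = 0
done = False
while not done:
    # Random actions: each step returns -1, so an unsolved episode ends at -200.
    state, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print(total_reward)  # almost always -200.0 with random actions
```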

I have only used Sarsa with function approximation (not DPG), and I believe this algorithm works quite well on the MountainCar-v0 environment because in this case it favors actions that haven't been tried yet in the current state. This is because the thetas are initialized uniformly at random. Whenever a reward (for this environment, -1) is received, it only changes the thetas for the previous state and action.
I haven't studied or implemented DPG yet. I am interested in how that algorithm is able to "solve" this environment.
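As a rough sketch of the mechanism described above (not the repository's code; the feature size and hyperparameters are assumptions), Sarsa with linear function approximation only updates the weights tied to the previous state-action pair, so a -1 reward only lowers the value of what was just tried:

```python
import numpy as np

n_features = 2048      # e.g. size of a tile-coded feature vector (assumed)
n_actions = 3
alpha, gamma = 0.1, 1.0

# One weight vector (theta) per action, initialized uniformly at random
theta = np.random.uniform(-0.05, 0.05, size=(n_actions, n_features))

def q(features, action):
    # Linear value estimate: dot product of theta with the active features
    return theta[action].dot(features)

def sarsa_update(features, action, reward, next_features, next_action, done):
    # Only theta[action] changes, and only for the features active in `features`
    target = reward if done else reward + gamma * q(next_features, next_action)
    td_error = target - q(features, action)
    theta[action] += alpha * td_error * features
```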

In contrast to Sarsa + function approximation, updates using A3C can influence all the parameters (in this case the neural network weights), and thus the output for every state (the input to the actor's neural network) can be affected. I ran an experiment, and the network always seems to output the same probabilities, as the feedback to the network is also always the same.
Thus, you can only get to the finish by luck. Once the agent has "discovered" the finish, the performance should improve. In fact, some people report having successfully learned this environment using A3C.

I hope my explanation is clear. If not, feel free to ask more questions.
I also don't fully understand it yet. Unfortunately, I don't have enough time right now to investigate the problem more thoroughly.


By the way, the weights of the networks in my A2C and A3C algorithms weren't initialized properly. The standard deviation was 1, which is too big and can lead to large differences between the initial action probabilities. For example, sometimes an action had a probability of only 0.5%. As I explained, the probabilities never change much, and thus such an action is rarely selected. I have now changed it (commit 6a0d879) to use tf.truncated_normal_initializer(mean=0.0, stddev=0.02) as the weight initializer.
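For illustration, a small-stddev initializer can be attached to a TF1 dense layer like this. This is only a sketch, not the repository's exact code; the layer sizes and names are assumptions.

```python
import tensorflow as tf

states = tf.placeholder(tf.float32, [None, 2])  # MountainCar has a 2-dimensional state

init = tf.truncated_normal_initializer(mean=0.0, stddev=0.02)
hidden = tf.layers.dense(states, 20, activation=tf.nn.tanh, kernel_initializer=init)
logits = tf.layers.dense(hidden, 3, kernel_initializer=init)
# With stddev=0.02 the initial logits stay close together, so the initial
# action probabilities are close to uniform instead of heavily skewed.
probs = tf.nn.softmax(logits)
```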


zencoding commented Jun 2, 2017

Thanks for your explanation, that helps. It seems that on-policy methods have worse exploration than off-policy methods, so in situations where the reward does not change as the state changes, it is better to use off-policy methods.

BTW, I tried various things to make A2C work, such as adding a reward for movement:

```python
for _ in range(self.config["repeat_n_actions"]):
    state, rew, done, _ = self.step_env(action)
    stateDelta = np.mean(np.square(state - old_state))
    # Good reward if the agent moved the car
    if stateDelta > 0.0001:
        rew = 0
    if done:
        # Don't continue if episode has already ended
        break
```
and experience replay and epsilon-greedy exploration:

```python
if np.random.rand() <= self.config["epsilon"]:
    action = np.random.randint(0, 3, size=1)[0]
else:
    action = self.choose_action(state)
```

but the network still won't converge to fewer than 200 steps. I don't know why, but I will investigate.

Thanks again for your help in understanding this.
