Actor Critic for Mountain Car #3
Hello,

Thank you for the feedback about my code! I'm glad that my repository can help other people. I have experienced the same issue with the MountainCar-v0 environment. The problem is that we are applying an on-policy method (A2C and A3C) to an environment that rarely gives useful rewards (i.e. only at the end). I have only used Sarsa with function approximation (not DPG), and I believe that algorithm works quite well on MountainCar-v0 because it favors actions that haven't been tried yet in the current state. In contrast to Sarsa with function approximation, updates in A3C can influence all of the parameters (in this case the neural network weights), and thus the output for every state (the input to the actor network) can change. I ran an experiment, and the network always seems to output the same probabilities, as the feedback to the network is also always the same. I hope my explanation is clear. Feel free to ask more questions otherwise.

By the way, the weights of the networks in my A2C and A3C implementations weren't initialized properly. The standard deviation was 1, which is too big and can lead to large differences in the probabilities with which actions are selected; sometimes an action had a probability of only 0.5%, for example. As I explained, the probabilities never change much, so such an action is rarely selected. I have now changed this (commit 6a0d879) by using a different weight initializer.
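To illustrate the initialization point: with a weight standard deviation of 1, the initial logits of the policy head are already large, so the softmax output starts out heavily skewed toward one action; with a much smaller standard deviation the initial policy is close to uniform. A minimal self-contained sketch (plain Python, not the repository's actual code; the feature count, seed, and constant input are arbitrary choices for illustration):

```python
import math
import random

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def init_logits(n_actions, n_features, std, rng):
    # one linear output unit per action; use a constant all-ones input
    # so the logit scale is determined purely by the weight std
    x = [1.0] * n_features
    logits = []
    for _ in range(n_actions):
        w = [rng.gauss(0.0, std) for _ in range(n_features)]
        logits.append(sum(wi * xi for wi, xi in zip(w, x)))
    return logits

rng = random.Random(0)
probs_big = softmax(init_logits(3, 64, 1.0, rng))    # std = 1   -> skewed policy
probs_small = softmax(init_logits(3, 64, 0.01, rng)) # std = 0.01 -> near-uniform policy
```

With 64 input features, a weight std of 1 gives logits with a standard deviation of 8, so the initial action distribution is essentially deterministic; a std of 0.01 keeps all initial probabilities near 1/3.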
Thanks for your explanation, that helps. It seems that on-policy methods have worse exploration than off-policy ones, so in situations where the rewards do not change as the state changes, it is better to use off-policy methods. BTW, I tried various things to make A2C work, such as adding a reward for movement.

Thanks again for your help in understanding this.
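One plausible version of the "reward for movement" idea is to add a bonus when the car gains speed. This is a hypothetical shaping function, not the exact modification tried above; `k` is an assumed scaling constant. Note that shaping that is not potential-based can change which policy is optimal:

```python
def shaped_reward(env_reward, state, next_state, k=10.0):
    """Add a hypothetical movement bonus to the environment reward.

    state / next_state are (position, velocity) pairs, matching the
    MountainCar-v0 observation; k is an assumed scaling constant.
    """
    _, v = state
    _, v_next = next_state
    # bonus proportional to the increase in absolute velocity
    return env_reward + k * (abs(v_next) - abs(v))
```

In use, this would wrap the reward returned by `env.step()` before it is stored for the A2C update.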
Hi,
Thanks a lot for the wonderful work; the code is well written and modular, and it helped me a lot to play with RL in general. I want to know if you have an opinion on using Actor-Critic methods on control problems such as Mountain Car. I added Mountain Car as an environment and ran A2C and A3C; neither ever converged to a solution. I changed the discount-factor hyperparameter to 0.9 and got both the actor loss and the critic loss to zero, but the episode reward never improved beyond -200. I then tried adding an entropy bonus to improve exploration, but that leads to exploding gradients. Looking deeper, it seems Actor-Critic methods are not good at exploring the state space when the return is constant.
I looked at other implementations that solve Mountain Car, and it is solved either with function approximation (tile coding) or with DPG; no one had used Actor-Critic.
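For readers trying the entropy-bonus approach mentioned above, a minimal sketch of how the bonus typically enters the actor loss (plain Python, not this repository's code; `beta` is an assumed coefficient, and in practice gradient clipping is usually also needed to contain the exploding gradients described above):

```python
import math

def entropy(probs, eps=1e-8):
    # H(pi) = -sum_a pi(a) log pi(a); eps guards against log(0)
    return -sum(p * math.log(p + eps) for p in probs)

def actor_loss(log_prob_action, advantage, probs, beta=0.01):
    """Policy-gradient loss with an entropy bonus (hypothetical sketch).

    log_prob_action: log pi(a_t | s_t) for the action taken
    advantage:       estimated advantage A(s_t, a_t)
    probs:           full action distribution pi(. | s_t)
    beta:            assumed entropy coefficient
    """
    # minimizing this maximizes expected return plus beta * entropy,
    # so higher-entropy (more exploratory) policies are preferred
    return -log_prob_action * advantage - beta * entropy(probs)
```

A larger `beta` pushes the policy toward uniform action probabilities, which helps exploration but, as noted above, can destabilize training if the gradients are not clipped.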