CartPole-OpenAI-GYM

Solution to the CartPole problem of OpenAI Gym with different approaches.

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
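For reference, here is a minimal sketch of interacting with the environment. This assumes the classic Gym API (where `reset` returns the observation and `step` returns a 4-tuple); newer Gym/Gymnasium versions differ slightly.

```python
import gym

env = gym.make('CartPole-v0')
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole tip velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

obs = env.reset()                                              # initial state
obs, reward, done, info = env.step(env.action_space.sample())  # one timestep
env.close()
```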

Approaches

1. Random actions

In this approach we choose a random action (left or right) regardless of the current state of the environment. Needless to say, this approach performs very poorly because it never takes the present state into consideration.
Because of its random nature this approach is also quite unpredictable: over 10 trial runs the maximum survival time was 118 timesteps and the average survival time was about 21 timesteps, which is pretty bad. A sketch of the random agent follows the GIF below.

(GIF: random agent rendering)
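A minimal sketch of this random agent (classic Gym API assumed; the 10 trials match the runs described above):

```python
import gym

env = gym.make('CartPole-v0')
scores = []
for trial in range(10):
    env.reset()
    score, done = 0, False
    while not done:
        action = env.action_space.sample()    # left or right, uniformly at random
        _, reward, done, _ = env.step(action)
        score += reward
    scores.append(score)
print('max:', max(scores), 'avg:', sum(scores) / len(scores))
env.close()
```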

2. Using a weight vector

In this approach we take a random weight vector of size 4, which is equal to the dimension of the environment's state. A dot product is taken between the weight vector and the state, and depending on the sign of the result we take an action, i.e. either left or right. This method outperforms the previous one even though it does not use any machine learning algorithm. The results of this approach are very impressive: with a proper number of games played it can last for more than 1000 timesteps.
Over 10 trial runs of this algorithm the maximum score achieved was 762, with an average score of about 315.
Note that these numbers can change from run to run, and even better results are possible with appropriate parameter tuning. A sketch follows the GIF below.
(GIF: brute-force weight-vector agent rendering)
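A sketch of the idea, assuming a simple random search over weight vectors (the number of games played, 100 here, is an illustrative choice):

```python
import gym
import numpy as np

env = gym.make('CartPole-v0')

def run_episode(env, weights):
    obs = env.reset()
    score, done = 0, False
    while not done:
        # dot product of weight vector and state decides the action
        action = 1 if np.dot(weights, obs) > 0 else 0
        obs, reward, done, _ = env.step(action)
        score += reward
    return score

best_weights, best_score = None, -1
for _ in range(100):                         # number of games played
    weights = np.random.uniform(-1, 1, 4)    # one weight per state dimension
    score = run_episode(env, weights)
    if score > best_score:
        best_weights, best_score = weights, score
print('best score:', best_score)
env.close()
```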

3. Using deep neural networks

In this approach we generate training data by taking random actions in the environment. If a run is successful, that is, the pole stays balanced on the cart for more than 100 timesteps, we add that run's observations and actions to our training set. The idea is that we can learn how to balance the pole from good training examples. We then fit a model to this training data and use it to predict the action for any new observation, as sketched after the GIF below.

(GIF: neural-network agent rendering)
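A sketch of this pipeline using Keras (the framework and network shape are assumptions here, not necessarily what the repository uses; the 100-timestep threshold comes from the description above):

```python
import gym
import numpy as np
from tensorflow import keras

env = gym.make('CartPole-v0')
SCORE_THRESHOLD = 100  # keep only runs where the pole survived > 100 timesteps

def collect_data(games=10000):
    # random runs rarely pass the threshold, so many games are needed
    X, y = [], []
    for _ in range(games):
        obs = env.reset()
        memory, score, done = [], 0, False
        while not done:
            action = env.action_space.sample()
            memory.append((obs, action))
            obs, reward, done, _ = env.step(action)
            score += reward
        if score > SCORE_THRESHOLD:           # successful run -> add to training set
            for state, action in memory:
                X.append(state)
                y.append(action)
    return np.array(X), np.array(y)

X, y = collect_data()

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(4,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(2, activation='softmax'),  # probability of left vs right
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X, y, epochs=5)

# at play time, pick the action the network predicts for the current observation
obs = env.reset()
action = int(np.argmax(model.predict(obs.reshape(1, -1), verbose=0)))
```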

4. Using deep Q-networks

This approach uses a technique in which the model is rewarded if it takes the correct action given the observation of a state, and penalized otherwise. Initially the model is not very good at choosing actions, but it slowly becomes better at predicting the right one. Exploration and exploitation are carried out simultaneously: exploration to find new, improved solutions, and exploitation to use the good solutions already found in the explored search space.

Comparison of how the model performs at the beginning and after a few epochs:
We can see that initially the model was not able to perform very well, but eventually it learned from its mistakes and performed very well (1199 timesteps is the upper time limit; after this the game is forcefully closed). An even higher average score can be achieved by training longer and increasing the time limit.

Plot of the score over various episodes:

The pole was balanced on the cart for more than 2000 timesteps, outperforming all of the approaches used above. A compact sketch of the training loop follows the GIF below.

(GIF: deep Q-network agent rendering)
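A compact sketch of the deep Q-network loop, with experience replay and epsilon-greedy exploration (the hyperparameters, network shape, and the use of Keras are illustrative assumptions, not the repository's exact settings):

```python
import random
from collections import deque

import gym
import numpy as np
from tensorflow import keras

env = gym.make('CartPole-v0')
GAMMA, EPS_DECAY, EPS_MIN = 0.95, 0.995, 0.01

model = keras.Sequential([
    keras.layers.Dense(24, activation='relu', input_shape=(4,)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(2, activation='linear'),   # one Q-value per action
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss='mse')

memory = deque(maxlen=2000)   # replay buffer of past transitions
epsilon = 1.0                 # exploration rate, decayed over time

for episode in range(500):
    state = env.reset().reshape(1, 4)
    done = False
    while not done:
        # exploration vs exploitation
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(model.predict(state, verbose=0)))
        next_state, reward, done, _ = env.step(action)
        next_state = next_state.reshape(1, 4)
        memory.append((state, action, reward, next_state, done))
        state = next_state

    # replay: learn from a random minibatch of past transitions
    if len(memory) >= 32:
        for s, a, r, s2, d in random.sample(memory, 32):
            target = r if d else r + GAMMA * np.max(model.predict(s2, verbose=0))
            q = model.predict(s, verbose=0)
            q[0][a] = target                  # move Q(s, a) toward the target
            model.fit(s, q, epochs=1, verbose=0)
    epsilon = max(EPS_MIN, epsilon * EPS_DECAY)
```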

References

Sentdex
Machine Learning with Phil
Medium blog

Links to other OpenAI-GYM Environments

Mountain Car
