## Project Description

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. A reward of +1 is given for every timestep that the pole remains upright. An episode ends when the pole tilts more than 12 degrees from vertical (the threshold used in Gym's CartPole implementation) or the cart moves more than 2.4 units from the center. `CartPole-v0` also caps each episode at 200 timesteps, so 200 is the maximum score.
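The termination rule described above can be written as a small check. This is only an illustrative sketch: the function name `episode_over` and its parameters are hypothetical, and the default angle limit of 12 degrees is the value used in Gym's CartPole source, so adjust it if your version differs.

```python
import math


def episode_over(cart_position, pole_angle_rad, max_angle_deg=12.0, max_position=2.4):
    """Return True when the cart or the pole has left the allowed region.

    Hypothetical helper for illustration; Gym performs this check internally.
    """
    too_far = abs(cart_position) > max_position
    too_tilted = abs(pole_angle_rad) > math.radians(max_angle_deg)
    return too_far or too_tilted


print(episode_over(0.0, math.radians(5)))   # pole nearly upright, cart centered
print(episode_over(2.5, 0.0))               # cart past the track limit
print(episode_over(0.0, math.radians(13)))  # pole tilted past the limit
```

Every timestep for which this check stays false earns a reward of +1, so an episode's score is simply its length in timesteps.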

In [1]:
# install modules
# !pip install gym stable_baselines3

## Import Modules

In [2]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy

## Environment Testing with Random Actions

In [3]:
env_name = 'CartPole-v0'
env = gym.make(env_name)

In [12]:
for episode in range(1, 11):
    score = 0
    state = env.reset()
    done = False
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()

Episode: 1 Score: 19.0
Episode: 2 Score: 17.0
Episode: 3 Score: 77.0
Episode: 4 Score: 36.0
Episode: 5 Score: 17.0
Episode: 6 Score: 22.0
Episode: 7 Score: 18.0
Episode: 8 Score: 25.0
Episode: 9 Score: 11.0
Episode: 10 Score: 15.0
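To put a number on the random baseline, the ten scores printed above can be averaged. The list below is copied by hand from that output; the variable names are only illustrative.

```python
from statistics import mean

# Scores copied from the ten random-action episodes printed above
random_scores = [19.0, 17.0, 77.0, 36.0, 17.0, 22.0, 18.0, 25.0, 11.0, 15.0]

baseline = mean(random_scores)
print("Mean random score:", baseline)
print("Best random score:", max(random_scores))
```

A trained agent should score well above this average.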


## Model Training

In [13]:
# create the environment inside the factory so the lambda does not depend on the outer `env` name
env = DummyVecEnv([lambda: gym.make(env_name)])
model = PPO('MlpPolicy', env, verbose=1)

Using cuda device


In [14]:
model.learn(total_timesteps=20000)

-----------------------------
| time/              |      |
|    fps             | 352  |
|    iterations      | 1    |
|    time_elapsed    | 5    |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 360         |
|    iterations           | 2           |
|    time_elapsed         | 11          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.009253612 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.686      |
|    explained_variance   | -0.00415    |
|    learning_rate        | 0.0003      |
|    loss                 | 5.15        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0178     |
|    value_loss           | 50.7        |
-----------------------------------------

<stable_baselines3.ppo.ppo.PPO at 0x27c497e9f70>
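The log shows `total_timesteps` growing by 2048 per iteration, so the requested 20000 steps do not divide evenly into rollouts. Assuming PPO keeps collecting whole rollouts until the requested total is reached, the iteration count can be estimated as follows; `rollout_size` is an illustrative name for the 2048 steps per iteration seen in the log.

```python
import math

total_timesteps = 20_000  # value passed to model.learn above
rollout_size = 2048       # steps collected per iteration, as shown in the log

# Training runs whole rollouts, so the final one overshoots the requested total
iterations = math.ceil(total_timesteps / rollout_size)
actual_timesteps = iterations * rollout_size

print("Iterations:", iterations)
print("Timesteps actually collected:", actual_timesteps)
```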

In [None]:
# save the model
model.save('ppo model')
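A saved model can be restored later with `PPO.load`. The sketch below wraps that in a hypothetical helper, `load_and_score`, which assumes the file written by the cell above exists and that `gym` and `stable_baselines3` are installed. The imports are deferred into the function so that defining it does not require either package.

```python
def load_and_score(path="ppo model", env_name="CartPole-v0", episodes=10):
    """Reload a saved PPO model and evaluate it (hypothetical helper)."""
    # Deferred imports: only needed when the function is actually called
    import gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.vec_env import DummyVecEnv

    env = DummyVecEnv([lambda: gym.make(env_name)])
    model = PPO.load(path, env=env)  # restores the saved weights and settings
    result = evaluate_policy(model, env, n_eval_episodes=episodes)
    env.close()
    return result  # (mean_reward, std_reward)
```

Calling `load_and_score()` in a fresh session should then give the same kind of `(mean, std)` tuple as `evaluate_policy` does on the in-memory model.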

## Model Testing

In [15]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)



(200.0, 0.0)
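The tuple returned by `evaluate_policy` is the mean and the standard deviation of the per-episode rewards. A result of `(200.0, 0.0)` over ten episodes means every episode scored 200, which the following stand-alone sketch reproduces with the standard library (population standard deviation, matching NumPy's default).

```python
from statistics import mean, pstdev

# Ten evaluation episodes, each reaching a score of 200
episode_returns = [200.0] * 10

summary = (mean(episode_returns), pstdev(episode_returns))
print(summary)
```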

In [16]:
env.close()

In [17]:
for episode in range(1, 11):
    score = 0
    obs = env.reset()
    done = False
    
    while not done:
        env.render()
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)
        score += reward
        
    print('Episode:', episode, 'Score:', score)
env.close()

Episode: 1 Score: [200.]
Episode: 2 Score: [200.]
Episode: 3 Score: [200.]
Episode: 4 Score: [200.]
Episode: 5 Score: [200.]
Episode: 6 Score: [200.]
Episode: 7 Score: [200.]
Episode: 8 Score: [200.]
Episode: 9 Score: [200.]
Episode: 10 Score: [200.]
