## Solution: Ray RLlib and PPO

PPO, or Proximal Policy Optimization, is a more powerful (and more complicated) algorithm than the DQN we've looked at.

But thanks to Ray's implementations, you can swap it in easily.

__We'll redo the earlier OpenAI Gym Cart-Pole problem, but swap in PPO as the algorithm.__

Note that we import `PPOConfig` from `ray.rllib.algorithms.ppo`.

By replacing "DQN" with "PPO", you can often get better results with very little change to the code.

>
> Interested in PPO details? Check out this writeup: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
> 
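
For a rough sense of what makes PPO "proximal", here is a minimal sketch of its clipped surrogate objective in plain Python. It is an illustration only, not RLlib's implementation: the function name `clipped_surrogate`, its arguments, and the `clip_eps=0.2` default are assumptions chosen for this example.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    updated policy and under the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: illustrative clip range, not taken from the notebook.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s)
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio within [1 - clip_eps, 1 + clip_eps]
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Identical policies give a ratio of 1, so the result is the mean advantage.
print(clipped_surrogate([-0.7, -0.7], [-0.7, -0.7], [1.0, 3.0]))  # prints 2.0
```

The clipping removes any incentive to push the new policy far from the old one in a single update, which is the main difference from a plain policy-gradient step.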

In [None]:
import ray

ray.init(num_cpus=8)

In [None]:
# PPOConfig is the builder-style configuration object for PPO.
# (The legacy ray.rllib.agents.ppo import is not needed here.)
from ray.rllib.algorithms.ppo import PPOConfig

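config = (
    PPOConfig()
    .environment("CartPole-v1")                    # Gym environment to train on
    .rollouts(num_rollout_workers=2)               # parallel workers collecting experience
    .framework("torch")                            # use the PyTorch backend
    .training(model={"fcnet_hiddens": [64, 64]})   # two hidden layers of 64 units each
    .evaluation(evaluation_num_workers=1)          # separate worker used for evaluation
)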

algo = config.build()

In [None]:
# Each call to train() runs one training iteration and returns a metrics dict
for n in range(3):
    output = algo.train()
    rmin = output['episode_reward_min']
    rmean = output['episode_reward_mean']
    rmax = output['episode_reward_max']
    print(f'Iteration {n}: min {rmin}, mean {rmean}, max {rmax}')
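
Three iterations may not be enough for the mean reward to climb very far. One option is to keep training until the mean reward reaches a target, as in the hedged sketch below. The helper `train_until` and its parameters are hypothetical names introduced for illustration, not part of RLlib. It assumes only that the step function returns a dict with an `episode_reward_mean` entry, as in the loop above. A stand-in replaces `algo.train` so the sketch runs without Ray.

```python
def train_until(train_step, target_mean, max_iters=50):
    """Call train_step() until the mean episode reward reaches target_mean.

    train_step: zero-argument callable returning a dict that contains
        'episode_reward_mean' (for example, algo.train).
    Returns (iterations_run, last_mean_reward).
    """
    mean = float("-inf")
    for n in range(1, max_iters + 1):
        mean = train_step()["episode_reward_mean"]
        print(f"Iteration {n}: mean reward {mean}")
        if mean >= target_mean:
            return n, mean
    # Target not reached within the iteration budget
    return max_iters, mean


# Stand-in for algo.train with made-up reward values (hypothetical data).
fake_rewards = iter([20.0, 60.0, 150.0, 210.0])
iters, final = train_until(
    lambda: {"episode_reward_mean": next(fake_rewards)},
    target_mean=150.0,
)
```

In the notebook you would pass `algo.train` as `train_step` and pick your own target and iteration budget. If your RLlib version reports the metric under a different key, adjust the lookup accordingly.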

In [None]:
algo.evaluate()

In [None]:
ray.shutdown()