# CartPole PPO Project

## TODOs
- [ ] **Dynamic Multi-Processing**: Implement dynamic multi-processing based on execution time.
- [ ] **Atomic Functions**: Ensure atomic functions are efficiently used in the code.
- [ ] **Checkpoints with Branching**: 
    - Implement checkpoints with branching logic.
    - Select the best model if it outperforms the current one.
- [ ] **Save Progression**: 
    - Visualize progression in the `README.md`.
    - Incorporate meaningful progress data and visual indicators.
- [ ] **TensorBoard Hyperparameter Tuning**:
    - Set up TensorBoard for hyperparameter tuning and performance tracking.
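
The "Checkpoints with Branching" item above could be sketched as follows. This is a minimal, framework-free illustration and not the project's implementation: `BestCheckpoint`, `update`, and `restore` are hypothetical names, the scores are made-up values, and plain dicts stand in for model weights.

```python
import copy


class BestCheckpoint:
    """Keep a snapshot of the best-scoring state seen so far (hypothetical helper)."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_state = None

    def update(self, state, score):
        """Store a deep copy of `state` if `score` beats the best so far.

        Returns True when the candidate replaced the stored checkpoint.
        """
        if score > self.best_score:
            self.best_score = score
            self.best_state = copy.deepcopy(state)
            return True
        return False

    def restore(self):
        """Return a copy of the best state so a new branch can start from it."""
        return copy.deepcopy(self.best_state)


# Plain dicts stand in for model weights; the scores are illustrative only.
tracker = BestCheckpoint()
tracker.update({"w": [0.1, 0.2]}, score=50.0)
tracker.update({"w": [0.3, 0.4]}, score=150.0)
tracker.update({"w": [0.9, 0.9]}, score=120.0)  # worse than 150.0, so ignored
branch_start = tracker.restore()
```

With PyTorch, `state` would typically be `network.state_dict()`, and the restored copy would be passed to `network.load_state_dict(...)` before training continues on the new branch.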

## Notations

- **Steps < Episodes < Epochs < Generations**
    - The hierarchy of execution, from the smallest unit to the largest:
        - Steps occur within an episode.
        - Multiple episodes form an epoch.
        - Generations are groups of epochs, or of models trained over time.
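
The nesting described above can be pictured as four loops. This is only a sketch: the sizes are hypothetical and fixed to make the structure visible, whereas real CartPole episodes vary in length.

```python
# Hypothetical sizes, chosen only to illustrate the nesting.
GENERATIONS = 2
EPOCHS_PER_GENERATION = 3
EPISODES_PER_EPOCH = 4
STEPS_PER_EPISODE = 5


def count_steps():
    """Walk the generation > epoch > episode > step nesting and count every step."""
    total_steps = 0
    for _generation in range(GENERATIONS):
        for _epoch in range(EPOCHS_PER_GENERATION):
            for _episode in range(EPISODES_PER_EPOCH):
                for _step in range(STEPS_PER_EPISODE):
                    total_steps += 1  # one environment step
    return total_steps


total = count_steps()  # 2 * 3 * 4 * 5 steps in total
```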

## Roadmap
- [ ] Fine-tune PPO hyperparameters for CartPole.
- [ ] Optimize environment reset and step procedures for speed.
- [ ] Visualize performance and learning curve progression.

## Project Goals

- Build an efficient **PPO** agent for **CartPole-v1** using vectorized environments.
- Achieve high performance using checkpoints and branching techniques.
- Track progress and optimize hyperparameters with **TensorBoard**.
- Follow best practices for multi-processing and atomic function handling.

## Future Improvements

- [ ] Experiment with various architectures for the PPO network.
- [ ] Try more challenging environments after mastering CartPole.
- [ ] Expand branching checkpoints for longer-running tasks.

## Notations

- Ordered from the smallest unit to the largest: steps < episodes < iterations < epochs < generations

In [425]:
# Auxiliary imports
import sys, os, time
import matplotlib.pyplot as plt
import numpy as np

# Gym imports
import gym
from gym.vector import SyncVectorEnv

# PyTorch imports
import torch
from torch import nn, optim

# Custom imports
sys.path.append(os.path.abspath('..')) # Add parent directory to path
import ppo_network, ppo_wrapper, importlib
importlib.reload(ppo_network) # Prevents caching issues with notebooks
from ppo_network import PPONetwork
importlib.reload(ppo_wrapper) # Prevents caching issues with notebooks
from ppo_wrapper import PPOWrapper


In [426]:
# CartPole environment
env_id = 'CartPole-v1'
max_episode_steps = 1024
num_envs = 32

env_kwargs = {
    'id': env_id,
    'max_episode_steps': max_episode_steps,
}

# Create vectorized environment
envs_vector = SyncVectorEnv([lambda: gym.make(**env_kwargs)] * num_envs)
states, infos = envs_vector.reset()

In [427]:
# Network
input_dims = 4
output_dims = 2
shared_hidden_dims = [512, 256, 128]
policy_hidden_dims = [64, 64]
value_hidden_dims = [64, 64]
activation = nn.ReLU

network_kwargs = {
    'input_dims': input_dims,
    'output_dims': output_dims,
    'shared_hidden_dims': shared_hidden_dims,
    'policy_hidden_dims': policy_hidden_dims,
    'value_hidden_dims': value_hidden_dims,
    'activation': activation,
}

network = PPONetwork(**network_kwargs)

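# Test forward pass: time 100 policy-driven steps of the vectorized environment
now = time.time()
for _ in range(100):
    states_tensor = torch.tensor(states, dtype=torch.float32)
    with torch.no_grad():  # Inference only, so skip gradient tracking
        policy, value = network(states_tensor)

    # Treat the policy output as logits and sample one action per environment
    actions_dist = torch.distributions.Categorical(logits=policy)
    actions = actions_dist.sample().numpy()

    states, rewards, dones, truncateds, infos = envs_vector.step(actions)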

print(f'Elapsed time: per vectorized env: {(time.time() - now)/num_envs:.2f} s')

Elapsed time: per vectorized env: 0.00 s
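
The training cell that follows sets `gamma = 0.99` and `lam = 0.90`. Assuming these are the discount factor and the GAE lambda (the internals of `PPOWrapper` are not shown here), a minimal pure-Python sketch of Generalized Advantage Estimation for a single environment could look like this. The function name `compute_gae` and its arguments are illustrative, not part of the project's API.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.90):
    """Generalized Advantage Estimation over one environment's rollout.

    rewards, values, and dones are equal-length lists; dones[t] is True when
    the episode terminated at step t. last_value is the value estimate of the
    state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk the rollout backwards so each step can reuse the next step's estimate.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets are the advantages added back onto the value estimates.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Tiny illustrative rollout: three steps, terminating on the last one.
advantages, returns = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    dones=[False, False, True],
    last_value=0.0,
)
```

A lower `lam` shortens the horizon over which TD errors are accumulated, which reduces variance at the cost of more bias. The sketch treats `dones` as true terminations; handling time-limit truncations separately is a design decision it leaves out.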


In [None]:
lr = 3e-4
final_lr = 1e-6
gamma = 0.99
lam = 0.90
clip_eps = 0.2
value_coef = 0.1
entropy_coef = 0.05

batch_size = 128
batch_epochs = 5
batch_shuffle = True

iterations = 1024

truncated_reward = 5

debug_prints = False

ppo_kwargs = {
    'num_envs': num_envs,
    'lr': lr,
    'final_lr': final_lr,
    'gamma': gamma,
    'lam': lam,
    'clip_eps': clip_eps,
    'value_coef': value_coef,
    'entropy_coef': entropy_coef,
    'batch_size': batch_size,
    'batch_epochs': batch_epochs,
    'batch_shuffle': batch_shuffle,
    'iterations': iterations,
    'truncated_reward': truncated_reward,
    'debug_prints': debug_prints,   
}

ppo_wrapper = PPOWrapper(envs_vector, network, **ppo_kwargs)

ppo_wrapper.train(generations=50)
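
How `PPOWrapper` uses `clip_eps`, `value_coef`, and `entropy_coef` is not shown in this notebook. Assuming it follows the standard PPO formulation, the per-sample clipped objective and the combined loss can be sketched in plain Python as below. The function names `ppo_clipped_objective` and `ppo_total_loss` are hypothetical, and the input numbers are illustrative.

```python
import math


def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective for a single (state, action) sample."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_total_loss(policy_objective, value_loss, entropy,
                   value_coef=0.1, entropy_coef=0.05):
    """Combine the three terms into one scalar to minimize.

    The objective is maximized (hence the minus sign), the value error is
    minimized, and the entropy bonus encourages exploration.
    """
    return -policy_objective + value_coef * value_loss - entropy_coef * entropy


# The probability ratio here is 1.5, which exceeds 1 + clip_eps and is clipped to 1.2.
gain = ppo_clipped_objective(new_logp=math.log(1.5), old_logp=0.0, advantage=2.0)
loss = ppo_total_loss(gain, value_loss=1.0, entropy=0.5)
```

In a real update these quantities would be averaged over a minibatch of `batch_size` samples and computed with tensors so that gradients can flow through `new_logp`.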


Generation 0 - Reward: 48.9375
Generation 1 - Reward: 157.71875
Generation 2 - Reward: 162.0625
Generation 3 - Reward: 160.28125
Generation 4 - Reward: 167.53125
Generation 5 - Reward: 169.875
Generation 6 - Reward: 155.71875
Generation 7 - Reward: 162.84375
Generation 8 - Reward: 185.125
Generation 9 - Reward: 191.03125
Generation 10 - Reward: 176.15625
Generation 11 - Reward: 205.25
Generation 12 - Reward: 444.3125
Generation 13 - Reward: 767.9375
Generation 14 - Reward: 908.5625
Generation 15 - Reward: 1029.0
Generation 16 - Reward: 1029.0
Generation 17 - Reward: 1029.0
Generation 18 - Reward: 1029.0
Generation 19 - Reward: 1029.0
Generation 20 - Reward: 1026.21875
Generation 21 - Reward: 539.84375
Generation 22 - Reward: 1029.0
Generation 23 - Reward: 1029.0
Generation 24 - Reward: 1029.0
Generation 25 - Reward: 1029.0
Generation 26 - Reward: 1029.0
Generation 27 - Reward: 1029.0
Generation 28 - Reward: 1029.0
Generation 29 - Reward: 1029.0
Generation 30 - Reward: 1029.0
Generation