In [None]:
import gym
import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import utils.rl_utils as rl_utils

from alg.ppo import PPO

In [None]:
env_name = "CartPole-v0"
env = gym.make(env_name)

**Observation space**: The observation is a 4-dimensional vector $s \in \mathbb{R}^4$ whose elements are:
- Cart position: $x \in [-4.8, 4.8]$
- Cart velocity: $\dot{x} \in (-\infty, \infty)$
- Pole angle: $\theta \in [-24^\circ, 24^\circ]$
- Pole angular velocity: $\dot{\theta} \in (-\infty, \infty)$

**Action space**: $a \in \{0, 1\}$, indicating the direction to apply force on the cart. 
- $0$: Apply force to the left.
- $1$: Apply force to the right.

**Initial state**: Each of the four state variables is sampled uniformly from $[-0.05, 0.05]$, so the episode starts with the cart near the center of the track and the pole almost upright.

**Termination of an episode**:
- Case 1: The pole's angle $\theta$ exceeds $12^\circ$ from vertical.
- Case 2: The cart's position $x$ exceeds $2.4$ units from the center. (The observation bounds listed above are twice these termination thresholds.)
- Case 3: (Truncation) The episode reaches its maximum length of 200 timesteps.

**Reward**: A reward of $+1$ is provided for every timestep the pole remains balanced, until one of the termination conditions is met.

<img src="./images/cart_pole.gif" alt="cart_pole_env" width="300"/>

In [None]:
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

actor_lr = 1e-3
critic_lr = 1e-2
num_episodes = 500
hidden_dim = 128
gamma = 0.98
lmbda = 0.95
epochs = 10
eps = 0.2

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

agent = PPO(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, lmbda, epochs, eps, gamma, device)

return_list = rl_utils.train_on_policy_agent(env, agent, num_episodes)

# Iteration 0: 100%|██████████| 50/50 [00:10<00:00,  4.81it/s, episode=50,
# return=183.200]
# Iteration 1: 100%|██████████| 50/50 [00:22<00:00,  2.24it/s, episode=100,
# return=191.400]
# Iteration 2: 100%|██████████| 50/50 [00:22<00:00,  2.24it/s, episode=150,
# return=199.900]
# Iteration 3: 100%|██████████| 50/50 [00:21<00:00,  2.33it/s, episode=200,
# return=200.000]
# Iteration 4: 100%|██████████| 50/50 [00:21<00:00,  2.29it/s, episode=250,
# return=200.000]
# Iteration 5: 100%|██████████| 50/50 [00:22<00:00,  2.22it/s, episode=300,
# return=200.000]
# Iteration 6: 100%|██████████| 50/50 [00:23<00:00,  2.14it/s, episode=350,
# return=200.000]
# Iteration 7: 100%|██████████| 50/50 [00:23<00:00,  2.16it/s, episode=400,
# return=200.000]
# Iteration 8: 100%|██████████| 50/50 [00:22<00:00,  2.23it/s, episode=450,
# return=200.000]
# Iteration 9: 100%|██████████| 50/50 [00:22<00:00,  2.25it/s, episode=500,
# return=200.000]
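
The `gamma` and `lmbda` values passed to `PPO` above are the kind of parameters usually used for generalized advantage estimation (GAE). The internals of `alg.ppo.PPO` are not shown here, so the following is only a minimal sketch under that assumption. The helper name `compute_gae` and its arguments are hypothetical, and it uses plain Python lists rather than tensors to keep the arithmetic visible.

```python
def compute_gae(rewards, values, next_values, dones, gamma=0.98, lmbda=0.95):
    """Hypothetical helper: GAE advantages for one trajectory.

    rewards[t], values[t] = V(s_t), next_values[t] = V(s_{t+1}),
    dones[t] = True if the episode terminated at step t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping past a terminal state.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    return advantages


# Two-step episode with a zero value estimate everywhere.
adv = compute_gae([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [False, True])
print(adv)
```

With a zero critic, the last advantage is just the final reward, and the first one adds the discounted term `gamma * lmbda` times the second.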

In [None]:
mv_return = rl_utils.moving_average(return_list, 9)
episodes_list = list(range(len(mv_return)))  # x-axis: one index per smoothed return
plt.plot(episodes_list, mv_return)
plt.xlabel('Episodes')
plt.ylabel('Returns')
plt.title('PPO on {}'.format(env_name))
plt.show()

![ppo_cartpole](./images/ppo_cartpole.png)
![ac_cartpole](./images/ac_cartpole.png)

In [None]:
def test_trained_policy(agent, env, num_episodes=5):
    for i in range(num_episodes):
        state = env.reset()
        done = False
        episode_return = 0
        while not done:
            env.render()  
            action = agent.take_action(state)  
            next_state, reward, done, _ = env.step(action)
            state = next_state
            episode_return += reward
        print(f"Episode {i + 1}, Return: {episode_return}")
    
    env.close()  # Close the environment once rendering is done

test_trained_policy(agent, env)

# Assignment: PPO Training on the Pendulum-v1 Environment

## Objective:
Your task is to implement and train a Proximal Policy Optimization (PPO) agent on the `Pendulum-v1` environment using PyTorch. The goal is to learn a policy that swings the pendulum up and keeps it balanced upright, using a continuous action space.

## Requirements:
1. **Environment**: Use the `Pendulum-v1` environment from `gym`, which has a **continuous action space**. 
   - Observation space: A 3-dimensional vector representing the cosine and sine of the pendulum angle and the angular velocity.
   - Action space: A single continuous value in the range $[-2, 2]$ representing the torque applied to the pendulum.

2. **Model Architecture**:
   - Implement a **policy network** (actor) that outputs the mean (and, optionally, the standard deviation) of a Gaussian distribution over the continuous action.
   - Implement a **value network** (critic) to estimate the state value.

3. **Training**:
   - Use the **PPO algorithm** to train your agent.
   - Use the clipped objective function for PPO.
   - Train for at least **500 episodes**.

4. **Testing**:
   - After training, **test the agent** on the environment for 5 episodes and render the environment during testing to visualize the results.

5. **Deliverables**:
   - Submit your Jupyter Notebook containing:
     - The PPO agent implementation.
     - Training loop.
     - Plots of the **total reward per episode** during training.
     - Testing the trained agent with rendered results.
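
The clipped objective named in requirement 3 can be illustrated with a small arithmetic sketch. The function name `clipped_surrogate_loss` and its arguments are hypothetical, and the code uses plain Python floats so it is not differentiable. In an actual PyTorch implementation the same steps would typically be written on tensors, for example with `torch.exp`, `torch.clamp`, and `torch.min`.

```python
import math


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Hypothetical helper: mean PPO clipped surrogate loss (to be minimized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while the objective is maximized.
    return -total / len(advantages)


# The ratio is 2 here, so with eps = 0.2 the objective is capped at 1.2 * advantage.
print(clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0]))
```

The clipping removes the incentive to push the new policy far from the old one within a single update: once the ratio leaves the interval in the direction favored by the advantage, the objective stops improving.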

## Extra Credit:
- Implement **entropy regularization** to encourage exploration during training.
- Compare the performance of the PPO agent with and without entropy regularization.
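
As a rough illustration of the entropy bonus, the sketch below computes the differential entropy of a one-dimensional Gaussian policy and subtracts a scaled copy of it from the policy loss. The names `gaussian_entropy`, `entropy_regularized_loss`, and the coefficient `entropy_coef=0.01` are assumptions chosen for the example, not values prescribed by this assignment.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def entropy_regularized_loss(policy_loss, entropy, entropy_coef=0.01):
    """Hypothetical helper: lower the loss when the policy is more random."""
    return policy_loss - entropy_coef * entropy


# A wider action distribution has higher entropy and therefore a lower loss.
for std in (0.5, 1.0, 2.0):
    h = gaussian_entropy(std)
    print(std, h, entropy_regularized_loss(1.0, h))
```

In PyTorch, `torch.distributions.Normal(mean, std).entropy()` returns the same quantity as a tensor, which keeps the bonus differentiable with respect to the policy parameters.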

## Submission Guidelines:
- Submit your completed Jupyter Notebook via the provided platform by the deadline.
- Ensure your code is well-documented, with comments explaining key steps in your implementation.
