# Actor-Critic evaluation

Plan:
- ~~Spike Actor-Critic~~
- ~~Quick test on a Gym env~~
- ~~Update requirements.txt~~
- ~~Move Actor-Critic out~~
- ~~Make training records for analysis~~
- Structure next code

## Imports & setup

### Essential tools

In [58]:
# Generic setup
from typing import Tuple, List, Callable
from collections import namedtuple

In [43]:
# Analysis
from IPython.display import clear_output

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from tqdm import tqdm

### Examine Gym environments

In [44]:
# %%capture
# from gym import envs
# print(envs.registry.all())

In [45]:
import gym

In [46]:
# Smoke test
# env = gym.make("CartPole-v1")
# # Check environment details
# CartPole-v0 is 200, 195.0
# CartPole-v1 is 500, 475.0
# env.spec.max_episode_steps, env.spec.reward_threshold, env.action_space, env.observation_space

# Remember to make reproducible gym environments
# env.seed(0)

### Import PyTorch

In [47]:
import torch
import torch.nn as nn
import torch.optim as optim

# Check for CUDA
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reproducible results
torch.manual_seed(0)

<torch._C.Generator at 0x127f8f330>

In [61]:
import stable_baselines3
from stable_baselines3.common.env_util import make_vec_env

### Import local modules

In [48]:
from actor_critic import ActorCritic

## Set up evaluation

https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

Doing A2C, the synchronous variant of A3C.
- backprop
- keep the hidden layer simple
- RMSprop as optimizer
- reward step size/evaluation step size 5

Hyperparameters:
(values from Deep Reinforcement Learning Hands-On, Maxim Lapan)
- num of envs used [20, 40, 60, 80, 100] <-- do 5 separate big experiments
- batch size [16, 32, 64]
- learning rate [0.001, 0.002, 0.003]
- entropy beta? [0.02, 0.03]
- hidden layers? (nah)
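
A minimal sketch of how the grid above could be enumerated, assuming one "big experiment" per number of environments and a full sweep over the remaining values. The names `build_grid` and `experiments` are hypothetical and not part of the notebook.

```python
from itertools import product

# Candidate values copied from the hyperparameter list above.
NUM_ENVS = [20, 40, 60, 80, 100]
BATCH_SIZES = [16, 32, 64]
LEARNING_RATES = [0.001, 0.002, 0.003]
ENTROPY_BETAS = [0.02, 0.03]


def build_grid(num_envs: int) -> list:
    """Return every config dict for one experiment (hypothetical helper)."""
    return [
        {
            "num_envs": num_envs,
            "batch_size": batch_size,
            "learning_rate": learning_rate,
            "entropy_beta": entropy_beta,
        }
        for batch_size, learning_rate, entropy_beta in product(
            BATCH_SIZES, LEARNING_RATES, ENTROPY_BETAS
        )
    ]


# One grid per big experiment, keyed by the number of environments.
experiments = {num_envs: build_grid(num_envs) for num_envs in NUM_ENVS}
```

Each experiment then holds 3 x 3 x 2 = 18 configurations, so the full sweep is 90 training runs.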

Analysis:
- Average reward
- No. of timesteps per episode (before terminating state)
- entropy loss?
- policy loss
- value loss
- (implied) Collect reward, timesteps elapsed
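
For reference, a sketch of how the three losses listed above are commonly defined in A2C. This is an assumption about the intended setup, not taken from the `ActorCritic` module: $R_t$ is the bootstrapped return, $V$ the critic, $\pi$ the policy, $N$ the number of collected steps, $\beta$ the entropy beta from the hyperparameters, and $c_v$ a value-loss weight that the notebook does not specify.

$$
\begin{aligned}
A_t &= R_t - V(s_t) \\
L_{\text{policy}} &= -\frac{1}{N} \sum_t \log \pi(a_t \mid s_t)\, A_t \\
L_{\text{value}} &= \frac{1}{N} \sum_t A_t^2 \\
H &= -\frac{1}{N} \sum_t \sum_a \pi(a \mid s_t) \log \pi(a \mid s_t) \\
L &= L_{\text{policy}} + c_v\, L_{\text{value}} - \beta\, H
\end{aligned}
$$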

Gym CartPole V0 & V1 description:


>A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.




In [49]:
Results = namedtuple("Results", "episode_index, timesteps_elapsed, total_reward_per_episode")

def plot_rewards_episodes(experiment_rewards: list):
    clear_output(True)
    plt.figure(figsize=(20,5))
    plt.subplot(131)
    plt.title("episodes vs reward")
    plt.plot(experiment_rewards)
    plt.show()
    
def plot_timesteps_episodes(timesteps_elapsed: list):
    clear_output(True)
    plt.figure(figsize=(20,5))
    plt.subplot(141)
    plt.title("episodes vs timesteps")
    plt.plot(timesteps_elapsed)
    plt.show()

In [53]:
# One episode/test run/logging rewards
def sample_one_episode(env: gym.Env, model: ActorCritic):
    state = env.reset()
    done = False

    total_reward = 0
    timestep_counter = 0

    while not done:
        state = torch.unsqueeze(torch.FloatTensor(state), 0).to(device)
        probability_dist, values = model(state)
        action_to_take = probability_dist.sample()
        next_state, reward, done, _ = env.step(action_to_take.cpu().detach().numpy()[0])
        state = next_state

        total_reward += reward
        timestep_counter += 1

    return total_reward, timestep_counter

In [60]:
def update_returns(next_value: torch.Tensor, rewards: List[torch.FloatTensor], masks: List[torch.FloatTensor], gamma: float) -> List[torch.Tensor]:
    calculated_returns = []
    # Calculate the accumulated returns 
    # from the number of "reward steps to update".
    # Reset R to the next_value first.
    R = next_value

    # Calculate discounted return & go backwards
    for step in reversed(range(len(rewards))):
        # TODO: masks?
        R = rewards[step] + gamma * R * masks[step]
        # Push return value R
        calculated_returns.insert(0, R)
    return calculated_returns
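
# A worked example of the same backward recursion on plain floats, to make the
# role of the masks concrete. It assumes each mask is `1 - done` for that step,
# which the notebook does not confirm; `discounted_returns` is a hypothetical
# stand-in for `update_returns` that avoids the torch dependency.

```python
from typing import List


def discounted_returns(
    next_value: float, rewards: List[float], masks: List[float], gamma: float
) -> List[float]:
    """Backward recursion R_t = r_t + gamma * R_{t+1} * mask_t on plain floats."""
    returns: List[float] = []
    R = next_value  # bootstrap from the critic's estimate of the state after the last step
    for step in reversed(range(len(rewards))):
        # A mask of 0 drops the bootstrapped tail, so returns do not leak
        # across an episode boundary (assumed meaning of the masks).
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns


# Three steps; the episode is assumed to end at the middle step (mask 0).
example = discounted_returns(
    next_value=10.0, rewards=[1.0, 1.0, 1.0], masks=[1.0, 0.0, 1.0], gamma=0.5
)
print(example)  # [1.5, 1.0, 6.0]
```

# Working backwards: the last step gives 1 + 0.5 * 10 = 6.0, the middle step is
# cut off by its mask and gives 1.0, and the first step gives 1 + 0.5 * 1.0 = 1.5.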


In [None]:
# From: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/multiprocessing_rl.ipynb
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env(env_id: str, rank: int, seed: int = 0) -> Callable:
    """
    Utility function for multiprocessed env.
    
    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    :return: (Callable)
    """
    def _init() -> gym.Env:
        env = gym.make(env_id)
        # Give each subprocess its own seed
        env.seed(seed + rank)
        return env
    set_random_seed(seed)
    return _init


# Eduard = MSI RX 2080, 6 cores Intel Core i5 9600KF
# Fei = 6 cores AMD Ryzen 5 3600, I'm not going to install ROCm
NUM_CPU = 6  # Number of processes to use

# Create the vectorized environment
envs = SubprocVecEnv([make_env("CartPole-v1", i) for i in range(NUM_CPU)])

# Parameters based on environment:
NUM_OBSERVATIONS: int = envs.observation_space.shape[0] # input
NUM_ACTIONS: int = envs.action_space.n # output

model = ActorCritic(num_inputs=NUM_OBSERVATIONS, num_outputs=NUM_ACTIONS, hidden_layer_config=(10, 10)).to(device)


In [None]:
# Three terminating states for CartPole-v1:
# env.spec.max_episode_steps 500
# env.spec.reward_threshold 475.0
# falls over

NUM_EPISODES = 10000 # Recommended in the PyTorch example for A2C on CartPole-v0
experiment_rewards = []
experiment_timesteps = []

# Single environment used only for the test episodes
cartpole = gym.make("CartPole-v1")

for episode_idx in tqdm(range(NUM_EPISODES)):

    log_probs = []
    values = []
    rewards = []
    masks = [] # Our thresholding tensors
    entropy = 0 # reset Entropy

    # TODO: roll out the reward steps on `envs` and update the model

    reward, timesteps_elapsed = sample_one_episode(cartpole, model)
    experiment_rewards.append(reward)
    experiment_timesteps.append(timesteps_elapsed)
    plot_rewards_episodes(experiment_rewards)
    # plot_timesteps_episodes(experiment_timesteps)
