1. Rainbow and Ape-X Experiments
    1. We release a set of hyperparameters for CartPole-v1, Classic Control, and Atari
    2. We release code for Rainbow that can train X steps in Y minutes on a Mac M2 Chip
    3. We also release a version of Ape-X as described in the original paper, and an Ape-X variant built on Rainbow
        1. Compare results of each 
        2. Compare Ape-X with different Rainbow components added or removed
    4. We compare the DQN variants as described in their papers to Rainbow, each individual component to Rainbow, and Rainbow with individual components removed
    5. We graph the convergence of Hyperopt for both TensorFlow and PyTorch using a score-versus-trials graph, and we compare against random hyperparameters
    6. We graph the exploration of the Hyperopt algorithm, showing the difference between consecutive trials to measure when the algorithm is “confident” in its parameters
    7. Compare search space sizes
        1. Large all hp.choice
        2. small/medium hp.choice
        3. a set using hp.uniform etc 
    8. Different methods
        1. tuning only 1 part of the system at a time and then tuning the next part, from a base set of params 
            1. DQN, then PER, then Double… 
    9. Different evaluation methods, for example a rolling average instead of the latest test score
    10. Compare Rainbow training speeds with different levels of numerical precision and datatypes
        1. Mixed precision using torch.amp 
        2. Lower matmul precision
            1. comparing medium, high, and highest 
            2. https://pytorch.org/docs/master/generated/torch.set_float32_matmul_precision.html?highlight=precision#torch.set_float32_matmul_precision
    11. Ape-X hyperparameter sweep and sensitivities
    12. Exploration methods for Rainbow Ape-X
        1. Just noisy nets (same for all actors)
        2. Noisy nets and varying epsilon 
        3. Adding a constant that changes variance of noisy nets for action selection
        4. AlphaStar Agents
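
The consecutive-trial difference in item 6 could be measured in several ways. A minimal sketch, assuming each trial is available as a plain dict of sampled hyperparameter values, is the fraction of hyperparameters that changed from one trial to the next. The function name `consecutive_trial_differences` and the example values are hypothetical and not part of the released code.

```python
def consecutive_trial_differences(trial_params):
    """Return, for each pair of consecutive trials, the fraction of shared
    hyperparameters whose sampled value changed."""
    differences = []
    for previous, current in zip(trial_params, trial_params[1:]):
        shared_keys = previous.keys() & current.keys()
        if not shared_keys:
            # Nothing to compare, so report no measurable change.
            differences.append(0.0)
            continue
        changed = sum(1 for key in shared_keys if previous[key] != current[key])
        differences.append(changed / len(shared_keys))
    return differences


# Hypothetical trial history: the search settles on one configuration.
example_trials = [
    {"learning_rate": 0.01, "n_step": 3, "per_alpha": 0.5},
    {"learning_rate": 0.001, "n_step": 5, "per_alpha": 0.5},
    {"learning_rate": 0.001, "n_step": 5, "per_alpha": 0.6},
    {"learning_rate": 0.001, "n_step": 5, "per_alpha": 0.6},
]
differences = consecutive_trial_differences(example_trials)
print(differences)
```

A sequence that trends toward zero would suggest the optimizer is resampling near the same configuration. For continuous dimensions, a normalized distance may be a better fit than exact equality.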

## Baseline Parameter Training Results

In [None]:
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing, FrameStack
import numpy as np
import sys
sys.path.append('../..')
from dqn.rainbow.rainbow_agent import RainbowAgent
from game_configs import AtariConfig, CartPoleConfig
from agent_configs import RainbowConfig
from utils import KLDivergence
import random
import torch

# env = ClipReward(AtariPreprocessing(gym.make("MsPacmanNoFrameskip-v4", render_mode="rgb_array"), terminal_on_life_loss=True), -1, 1) # as recommended by the original paper, should already include max pooling
# env = FrameStack(env, 4)
env = gym.make("CartPole-v1")


config_dict = {
  "dense_layers_widths": [128],
  "value_hidden_layers_widths": [128],
  "advantage_hidden_layers_widths": [128],
  "adam_epsilon": 1e-8,
  "learning_rate": 0.001,
  "training_steps": 10000,
  "per_epsilon": 1e-6,
  "per_alpha": 0.2,
  "per_beta": 0.6,
  "minibatch_size": 128,
  "transfer_interval": 100,
  "n_step": 3,
  "noisy_sigma": 0.5,
  "replay_interval": 1,
  "kernel_initializer": "orthogonal",
  "loss_function": KLDivergence(), # could do categorical cross entropy
  "clipnorm": 10.0,
}
game_config = CartPoleConfig()
config = RainbowConfig(config_dict, game_config)
agent = RainbowAgent(env, config, name="baseline")

for param in agent.model.parameters():
  print(param)
print("start")
agent.train()

# Hyperparameter optimization of Rainbow on CartPole-v1
Training steps: 10,000
Evaluation: 10 episodes from random starts.
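
The evaluation protocol above can be sketched as follows. This is an illustration under stated assumptions, not the agent's own evaluation code: "random starts" is interpreted here as a different reset seed per episode, the environment follows the Gymnasium `reset(seed=...)` / `step(action)` signatures, and `CountdownEnv`, `evaluate`, and `base_seed` are hypothetical names used only to keep the example self-contained.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in environment with Gymnasium-style reset/step."""

    def reset(self, seed=None):
        # A seeded start state: the episode lasts between 5 and 15 steps.
        self._steps_left = random.Random(seed).randint(5, 15)
        return self._steps_left, {}

    def step(self, action):
        self._steps_left -= 1
        terminated = self._steps_left <= 0
        # observation, reward, terminated, truncated, info
        return self._steps_left, 1.0, terminated, False, {}


def evaluate(env, policy, episodes=10, base_seed=0):
    """Run `episodes` episodes from differently seeded starts and
    return the mean undiscounted return plus the per-episode returns."""
    episode_returns = []
    for episode in range(episodes):
        observation, _ = env.reset(seed=base_seed + episode)
        done, total_reward = False, 0.0
        while not done:
            observation, reward, terminated, truncated, _ = env.step(policy(observation))
            total_reward += reward
            done = terminated or truncated
        episode_returns.append(total_reward)
    return sum(episode_returns) / len(episode_returns), episode_returns


mean_return, episode_returns = evaluate(CountdownEnv(), policy=lambda observation: 0)
print(mean_return, episode_returns)
```

With a real agent, `policy` would be the greedy action selection of the trained network and `env` would be a separate evaluation copy of `gym.make("CartPole-v1")`.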

## Hyperopt with PyTorch and TensorFlow using only hp.choice()

In [None]:
import pickle
from hyperopt import hp
from utils import CategoricalCrossentropy, KLDivergence, generate_layer_widths
import gymnasium as gym 

width_combinations = generate_layer_widths([32, 64, 128, 256, 512, 1024], 5)

search_space = {
        "kernel_initializer": hp.choice(
            "kernel_initializer",
            [
                "he_uniform",
                "he_normal",
                "glorot_uniform",
                "glorot_normal",
                "orthogonal",
            ],
        ),
        "learning_rate": hp.choice(
            "learning_rate", [10, 5, 2, 1, 0.1, 0.01, 0.001, 0.0001, 0.00001]
        ),
        "adam_epsilon": hp.choice(
            "adam_epsilon", [0.3125, 0.03125, 0.003125, 0.0003125]
        ),
        "loss_function": hp.choice(
            "loss_function", [CategoricalCrossentropy(), KLDivergence()]
        ),
        # NORMALIZATION?
        "transfer_interval": hp.choice(
            "transfer_interval", [10, 25, 50, 100, 200, 400, 800, 1600, 2000]
        ),
        "replay_interval": hp.choice("replay_interval", [1, 2, 3, 4, 5, 8, 10, 12]),
        "minibatch_size": hp.choice(
            "minibatch_size", [2**i for i in range(4, 8)]
        ),  ###########
        "replay_buffer_size": hp.choice(
            "replay_buffer_size",
            [2000, 3000, 5000, 7500, 10000],
        ),  #############
        "min_replay_buffer_size": hp.choice(
            "min_replay_buffer_size",
            [125, 250, 375, 500, 625, 750, 875, 1000, 1500, 2000],
        ),  # 125, 250, 375, 500, 625, 750, 875, 1000, 1500, 2000
        "n_step": hp.choice("n_step", [3, 4, 5, 8, 10]),
        "discount_factor": hp.choice("discount_factor", [0.9, 0.99, 0.995, 0.999]),
        "atom_size": hp.choice("atom_size", [51, 61, 71, 81]),  #
        "conv_layers": hp.choice("conv_layers", [[]]),
        "dense_layers_widths": hp.choice("dense_layers_widths", width_combinations),
        "advantage_hidden_layers_widths": hp.choice(
            "advantage_hidden_layers_widths", width_combinations
        ),  #
        "value_hidden_layers_widths": hp.choice(
            "value_hidden_layers_widths", width_combinations
        ),  #
        "training_steps": hp.choice("training_steps", [10000]),
        "per_epsilon": hp.choice("per_epsilon", [0.00001, 0.0001, 0.001, 0.01, 0.1]),
        "per_alpha": hp.choice("per_alpha", [0.05 * i for i in range(1, 21)]),
        "per_beta": hp.choice("per_beta", [0.05 * i for i in range(1, 21)]),
        "clipnorm": hp.choice("clipnorm", [None, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]),
    }

initial_best_config = [{}]


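# Use context managers so both files are flushed and closed before the script below reads them
with open("./search_spaces/torch_choice_search_search_space.pkl", "wb") as f:
    pickle.dump(search_space, f)
with open("./search_spaces/torch_choice_search_initial_best_config.pkl", "wb") as f:
    pickle.dump(initial_best_config, f)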

%run ../../dqn/rainbow/hyperparameter_optimization.py ./search_spaces/torch_choice_search_search_space.pkl ./search_spaces/torch_choice_search_initial_best_config.pkl ClassicControl-v1_search_choice .
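
Because every dimension in the search space above is an `hp.choice`, its size is the product of the option counts. The sketch below uses counts read off the lists in the cell above. The length of `width_combinations` is not shown here, so it is left as a hypothetical parameter `n_width_combinations`, and `search_space_size` is an illustrative helper, not part of the released code.

```python
import math

# Number of options per hp.choice dimension, counted from the search space above.
# The three layer-width dimensions are handled separately below.
OPTION_COUNTS = {
    "kernel_initializer": 5,
    "learning_rate": 9,
    "adam_epsilon": 4,
    "loss_function": 2,
    "transfer_interval": 9,
    "replay_interval": 8,
    "minibatch_size": 4,
    "replay_buffer_size": 5,
    "min_replay_buffer_size": 10,
    "n_step": 5,
    "discount_factor": 4,
    "atom_size": 4,
    "conv_layers": 1,
    "training_steps": 1,
    "per_epsilon": 5,
    "per_alpha": 20,
    "per_beta": 20,
    "clipnorm": 7,
}


def search_space_size(option_counts, n_width_combinations):
    """Total number of distinct configurations in an all-hp.choice space.

    `n_width_combinations` stands for len(width_combinations), which is
    shared by the dense, advantage, and value width dimensions.
    """
    return math.prod(option_counts.values()) * n_width_combinations**3


# Size of the space with the width dimensions held fixed.
base_size = search_space_size(OPTION_COUNTS, n_width_combinations=1)
print(f"{base_size:.3e}")
```

Even with the width dimensions held fixed, the space holds about 5.8 × 10^12 configurations, so any realistic number of trials samples only a tiny fraction of it.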

In [None]:
# Graph the trials over time
from utils import plot_trials
import pickle

%matplotlib inline

with open('./torch_choice_trials/CartPole-v1_trials.p', 'rb') as f:
    trials = pickle.load(f)
with open('./tensorflow_choice_trials/CartPole-v1_trials.p', 'rb') as f:
    tensorflow_trials = pickle.load(f)
# print(trials.trials)
plot_trials(trials, "CartPole-v1_torch_trials_over_time")
plot_trials(tensorflow_trials, "CartPole-v1_tensorflow_trials_over_time")


## Hyperopt with PyTorch and hp.uniform, hp.loguniform, etc.

In [None]:
import pickle
from hyperopt import hp
from hyperopt.pyll import scope
from utils import CategoricalCrossentropy, KLDivergence, generate_layer_widths
import gymnasium as gym 
import numpy as np

width_combinations = generate_layer_widths([32, 64, 128, 256, 512, 1024], 5)
# print(width_combinations)

search_space = {
        "kernel_initializer": hp.choice(
            "kernel_initializer",
            [
                "he_uniform",
                "he_normal",
                "glorot_uniform",
                "glorot_normal",
                "orthogonal",
            ],
        ),
        "learning_rate": hp.qloguniform('learning_rate', np.log(0.00001), np.log(1.0), 0.00001),
        "adam_epsilon": hp.qloguniform('adam_epsilon', np.log(0.0003125), np.log(0.3125), 0.0003125),
        "loss_function": hp.choice(
            "loss_function", [CategoricalCrossentropy(), KLDivergence()]
        ),
        # NORMALIZATION?
        "transfer_interval": hp.qloguniform('transfer_interval', np.log(10), np.log(2000), 10),
        "replay_interval": hp.uniformint('replay_interval', 1, 12),
        "minibatch_size": scope.int(hp.quniform('minibatch_size', 16, 128, 16)),  ###########
        "replay_buffer_size": scope.int(hp.quniform('replay_buffer_size', 2000, 10000, 1000)),  #############
        "min_replay_buffer_size": scope.int(hp.quniform('min_replay_buffer_size', 125, 2000, 125)), 
        "n_step": hp.uniformint('n_step', 2, 10),
        "discount_factor": hp.qloguniform('discount_factor', np.log(0.9), np.log(0.999), 0.001),
        "atom_size": scope.int(hp.quniform('atom_size', 51, 81, 10) + 1),  #
        "conv_layers": hp.choice("conv_layers", [[]]),
        "dense_layers_widths": hp.choice("dense_layers_widths", width_combinations),
        "advantage_hidden_layers_widths": hp.choice(
            "advantage_hidden_layers_widths", width_combinations
        ),  #
        "value_hidden_layers_widths": hp.choice(
            "value_hidden_layers_widths", width_combinations
        ),  #
        "training_steps": hp.choice("training_steps", [10000]),
        "per_epsilon": hp.qloguniform('per_epsilon', np.log(0.00001), np.log(0.1), 0.00001),
        "per_alpha": hp.quniform('per_alpha', 0.05, 1.0, 0.05),
        "per_beta": hp.quniform('per_beta', 0.05, 1.0, 0.05),
        "clipnorm": hp.qloguniform('clipnorm', np.log(0.1), np.log(100.0), 0.1)
    }

initial_best_config = [{}]


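# Use context managers so both files are flushed and closed before the script below reads them
with open("./search_spaces/torch_quantized_search_search_space.pkl", "wb") as f:
    pickle.dump(search_space, f)
with open("./search_spaces/torch_quantized_search_initial_best_config.pkl", "wb") as f:
    pickle.dump(initial_best_config, f)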

%run ../../dqn/rainbow/hyperparameter_optimization.py ./search_spaces/torch_quantized_search_search_space.pkl ./search_spaces/torch_quantized_search_initial_best_config.pkl ClassicControl-v1_quantized_search .
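
Hyperopt documents `hp.quniform(label, low, high, q)` as returning `round(uniform(low, high) / q) * q` and `hp.qloguniform(label, low, high, q)` as returning `round(exp(uniform(low, high)) / q) * q`. The sketch below emulates those two formulas with the standard library to show which values the quantized dimensions above can take. It is an emulation for illustration, not Hyperopt's own sampling code, and the helper names `quniform` and `qloguniform` are local to this example.

```python
import math
import random


def quniform(rng, low, high, q):
    """Emulate hp.quniform: a uniform draw rounded to a multiple of q."""
    return round(rng.uniform(low, high) / q) * q


def qloguniform(rng, low, high, q):
    """Emulate hp.qloguniform: low and high are bounds on the log of the value."""
    return round(math.exp(rng.uniform(low, high)) / q) * q


rng = random.Random(0)

# "atom_size" above: quniform(51, 81, 10) lands on multiples of 10, then 1 is added.
atom_sizes = sorted({int(quniform(rng, 51, 81, 10) + 1) for _ in range(1000)})
print(atom_sizes)

# "transfer_interval" above: log-uniform between 10 and 2000, in steps of 10.
transfer_intervals = [
    qloguniform(rng, math.log(10), math.log(2000), 10) for _ in range(1000)
]
print(min(transfer_intervals), max(transfer_intervals))
```

The `+ 1` in the `atom_size` entry is what recovers the same 51, 61, 71, 81 grid used in the `hp.choice` search space, since the quantized draw itself lands on 50, 60, 70, or 80. Note that the bounds passed to `hp.qloguniform` are logarithms, which is why the search space wraps them in `np.log`.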

In [None]:
# Graph the trials over time
from utils import plot_trials
import pickle

%matplotlib inline

with open('../../dqn/rainbow/CartPole-v1_trials.p', 'rb') as f:
    trials = pickle.load(f)
with open('./tensorflow_trials/CartPole-v1_trials.p', 'rb') as f:
    tensorflow_trials = pickle.load(f)
# print(trials.trials)
plot_trials(trials, "CartPole-v1_torch_search_trials_over_time")
plot_trials(tensorflow_trials, "CartPole-v1_tensorflow_trials_over_time")

In [None]:
# Speed of results
# Graph of hyperparameter convergence over time for torch and tensorflow compared with random search
# Hyperparameters importance (ape-x analysis file)
# Hyperparameter similarity of consecutive trials as a measure of confidence
# Search space size vs convergence speed
# Compare results to tuning only one system at a time
# Using rolling average as evaluation method
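
The rolling-average evaluation listed above can be sketched as follows. This is a minimal illustration, assuming the test scores are collected as a plain list in the order they were recorded. The function name `rolling_average_scores`, the window size, and the example scores are hypothetical.

```python
from collections import deque


def rolling_average_scores(scores, window=5):
    """Return the mean of the last `window` scores after each evaluation.

    Early entries average over however many scores exist so far.
    """
    recent = deque(maxlen=window)  # old scores fall out automatically
    averages = []
    for score in scores:
        recent.append(score)
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical test scores from successive evaluation checkpoints.
test_scores = [20, 50, 200, 30, 500]
smoothed = rolling_average_scores(test_scores, window=3)
print(smoothed)
```

Reporting the final rolling average instead of the final raw score makes a trial's objective less sensitive to a single lucky or unlucky evaluation, at the cost of reacting more slowly to late improvements.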


Rainbow results on Classic Control environments
Training steps: 
Evaluation: 
Hyperparameters: 

Rainbow results on (subset of) Atari games
Training steps: 
Evaluation: 
Hyperparameters: 