# Further Usage

The previous page showed how to run an algorithm directly by calling the runner.
To help users understand how XuanCe works internally,
and to make it easier to develop new algorithms or set up their own reinforcement learning tasks,
this section uses PPO training on a continuous-control task (Pendulum-v1 from Gym's Classic Control suite, as configured below) as an example
and walks through how to call the lower-level APIs to train a reinforcement learning model.

To get started, install XuanCe first.

(Note: `--quiet` is optional. It only suppresses pip's output, which is convenient in Google Colab, and is not required for installing XuanCe.)

In [ ]:
!pip install xuance --quiet
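
After installation, one quick way to confirm that the package can be found is to look it up without importing it. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper and not part of XuanCe.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located by the import system."""
    # find_spec returns None when no top-level module with this name exists.
    return importlib.util.find_spec(package_name) is not None

print("xuance installed:", is_installed("xuance"))
```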

## Create the config file

The config file must be a YAML file containing the arguments required by the PPO agent.
The example here is named "ppo_pendulum_config.yaml" and targets the Pendulum-v1 environment in Gym.
You can create this file by running the following code directly.

In [ ]:
import textwrap

yaml_content = textwrap.dedent("""
    dl_toolbox: "torch"  # The deep learning toolbox. Choices: "torch", "mindspore", "tensorlayer"
    project_name: "XuanCe_Benchmark"
    logger: "tensorboard"  # Choices: tensorboard, wandb.
    wandb_user_name: "your_user_name"  # The username of wandb when the logger is wandb.
    render: False # Whether to render the environment when testing.
    render_mode: 'rgb_array' # Choices: 'human', 'rgb_array'.
    fps: 50  # The frames per second for the rendering videos in log file.
    test_mode: False  # Whether to run in test mode.
    device: "cpu"  # The computing device.
    distributed_training: False  # Whether to use multi-GPU for distributed training.
    master_port: '12355'  # The master port for the current experiment when using distributed training.
    
    agent: "PPO_Clip"  # The agent name.
    env_name: "Classic Control"  # The environment name.
    env_id: "Pendulum-v1"  # The environment id.
    env_seed: 1  # Random seed for environment.
    vectorize: "DummyVecEnv"  # The vectorization method used to create n parallel environments. Choices: DummyVecEnv, SubprocVecEnv.
    learner: "PPOCLIP_Learner"  # The learner.
    policy: "Gaussian_AC"  # Choices: Gaussian_AC for continuous actions, Categorical_AC for discrete actions.
    representation: "Basic_MLP"  # The representation name.
    
    representation_hidden_size: [128,]  # The size of hidden layers for representation network.
    actor_hidden_size: [128,]  # The size of hidden layers for actor network.
    critic_hidden_size: [128,]  # The size of hidden layers for critic network.
    activation: "leaky_relu"  # The activation function for each hidden layer.
    activation_action: 'tanh'  # The activation function for the last layer of actor network.
    
    seed: 1  # The random seed.
    parallels: 10  # The number of environments to run in parallel.
    running_steps: 300000  # The total running steps for all environments.
    horizon_size: 256  # The horizon size for each environment. buffer_size = horizon_size * parallels.
    n_epochs: 8  # The number of training epochs.
    n_minibatch: 8  # The number of minibatches for each training epoch. batch_size = buffer_size // n_minibatch.
    learning_rate: 0.0004  # The learning rate.
    
    vf_coef: 0.25  # Coefficient factor for critic loss.
    ent_coef: 0.01  # Coefficient factor for entropy loss.
    target_kl: 0.25  # For PPO_KL learner.
    kl_coef: 1.0  # For PPO_KL learner.
    clip_range: 0.2  # The clip range for ratio in PPO_Clip learner.
    gamma: 0.98  # Discount factor.
    use_gae: True  # Use GAE trick.
    gae_lambda: 0.95  # The GAE lambda.
    use_advnorm: True  # Whether to use advantage normalization.
    
    use_grad_clip: True  # Whether to clip the gradient during training.
    clip_type: 1  # Gradient clip for Mindspore: 0: ms.ops.clip_by_value; 1: ms.nn.ClipByNorm()
    grad_clip_norm: 0.5  # The max norm of the gradient.
    use_actions_mask: False  # Whether to use action mask values.
    use_obsnorm: True  # Whether to use observation normalization.
    use_rewnorm: True  # Whether to use reward normalization.
    obsnorm_range: 5  # The range of observations when observation normalization is used.
    rewnorm_range: 5  # The range of rewards when reward normalization is used.
    
    test_steps: 10000  # The total steps for testing.
    eval_interval: 50000  # The evaluation interval when using the benchmark method.
    test_episode: 5  # The test episodes.
    log_dir: "./logs/ppo/"  # The main directory of log files.
    model_dir: "./models/ppo/"  # The main directory of model files.
""")

with open("ppo_pendulum_config.yaml", "w") as f:
    f.write(yaml_content)
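
The comments in the config define two derived quantities: `buffer_size = horizon_size * parallels` and `batch_size = buffer_size // n_minibatch`. As a quick sanity check, the sketch below evaluates them for the values used above. The helper name `derived_sizes` is hypothetical and not part of XuanCe; it only mirrors the arithmetic stated in the comments.

```python
def derived_sizes(horizon_size: int, parallels: int, n_minibatch: int) -> tuple:
    """Return (buffer_size, batch_size) following the formulas in the config comments."""
    buffer_size = horizon_size * parallels    # transitions collected per rollout
    batch_size = buffer_size // n_minibatch   # transitions per minibatch
    return buffer_size, batch_size

buffer_size, batch_size = derived_sizes(horizon_size=256, parallels=10, n_minibatch=8)
print(f"buffer_size={buffer_size}, batch_size={batch_size}")
# prints: buffer_size=2560, batch_size=320
```

If you change `parallels` or `horizon_size`, choosing values that make `buffer_size` divisible by `n_minibatch` avoids a leftover partial minibatch under this integer division.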

## Run an example

In [ ]:
import argparse
import numpy as np
from copy import deepcopy
from xuance.common import get_configs
from xuance.environment import make_envs
from xuance.torch.utils.operations import set_seed
from xuance.torch.agents import PPOCLIP_Agent

configs_dict = get_configs(file_dir="ppo_pendulum_config.yaml")
configs = argparse.Namespace(**configs_dict)

set_seed(configs.seed)
envs = make_envs(configs)
Agent = PPOCLIP_Agent(config=configs, envs=envs)

train_information = {"Deep learning toolbox": configs.dl_toolbox,
                     "Calculating device": configs.device,
                     "Algorithm": configs.agent,
                     "Environment": configs.env_name,
                     "Scenario": configs.env_id}
for k, v in train_information.items():
    print(f"{k}: {v}")

def env_fn():
    configs_test = deepcopy(configs)
    configs_test.parallels = configs_test.test_episode
    return make_envs(configs_test)

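# running_steps and eval_interval count steps across all parallel environments,
# so divide by the number of environments to get per-environment step counts.
train_steps = configs.running_steps // configs.parallels
eval_interval = configs.eval_interval // configs.parallels
test_episode = configs.test_episode
num_epoch = train_steps // eval_interval  # number of train/evaluate rounds

# Evaluate the untrained agent first to establish a baseline score.
test_scores = Agent.test(env_fn, test_episode)
Agent.save_model(model_name="best_model.pth")
best_scores_info = {"mean": np.mean(test_scores),
                    "std": np.std(test_scores),
                    "step": Agent.current_step}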
for i_epoch in range(num_epoch):
    print("Epoch: %d/%d:" % (i_epoch, num_epoch))
    Agent.train(eval_interval)
    test_scores = Agent.test(env_fn, test_episode)

    if np.mean(test_scores) > best_scores_info["mean"]:
        best_scores_info = {"mean": np.mean(test_scores),
                            "std": np.std(test_scores),
                            "step": Agent.current_step}
        # save best model
        Agent.save_model(model_name="best_model.pth")
# end benchmarking
print("Best Model Score: %.2f, std=%.2f" % (best_scores_info["mean"], best_scores_info["std"]))
Agent.finish()
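
The loop above splits training into evaluation rounds and keeps the checkpoint with the best mean test score. The sketch below reproduces only that bookkeeping, using the values from the config (`running_steps: 300000`, `eval_interval: 50000`, `parallels: 10`). The test scores are made-up placeholder numbers rather than real results, and `track_best` is a hypothetical helper, not part of XuanCe.

```python
import statistics

running_steps, eval_interval_cfg, parallels = 300000, 50000, 10

train_steps = running_steps // parallels        # steps per environment
eval_interval = eval_interval_cfg // parallels  # steps per environment between evaluations
num_epoch = train_steps // eval_interval        # number of train/evaluate rounds

def track_best(score_batches):
    """Return (index, mean, std) of the batch with the highest mean score.

    A later batch replaces the current best only if its mean is strictly
    higher, matching the `>` comparison in the training loop.
    """
    best = None
    for i, scores in enumerate(score_batches):
        mean = statistics.fmean(scores)
        std = statistics.pstdev(scores)  # population std, like np.std's default
        if best is None or mean > best[1]:
            best = (i, mean, std)
    return best

# Placeholder scores for three evaluations (illustrative only).
placeholder_scores = [[-1200.0, -1100.0], [-800.0, -900.0], [-850.0, -950.0]]
best_index, best_mean, best_std = track_best(placeholder_scores)
print(f"rounds={num_epoch}, best evaluation={best_index}, mean={best_mean:.2f}, std={best_std:.2f}")
```

Because the comparison is strict, a tie keeps the earlier checkpoint, so "best_model.pth" is only overwritten when an evaluation actually improves on the stored mean.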