## Gymnasium - An API standard for reinforcement learning with a diverse collection of reference environments

Documentation: https://gymnasium.farama.org/introduction/train_agent/

Gymnasium is a project that provides an API (application programming interface) for single-agent reinforcement learning environments, along with implementations of common environments: CartPole, Pendulum, MountainCar, MuJoCo, Atari, and more.



# Stable Baselines3 - Training, Saving and Loading

Github Repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

Examples with Colab code: [https://stable-baselines3.readthedocs.io/en/master/guide/examples.html](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html)


# 1. Open a Terminal and install the following packages:

sudo apt-get install build-essential python-dev-is-python3 swig python3-pygame git

# 2. Enable autoformatting and install box2d-py and stable-baselines3

In [1]:
# for autoformatting
# %load_ext jupyter_black

In [None]:
# Use pip to install box2d-py, stable-baselines3[extra] (version >= 2.0.0a4 is required),
# and gymnasium[other], which includes moviepy


## Import policy, RL agent, and create directories

In [None]:
# import libraries: gymnasium, numpy, stable_baselines3 and stable_baselines3.common.callbacks


In [None]:
# Create directories for models, videos and tb_logs
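
A minimal sketch of this cell, using only the standard library. The folder names are assumptions; the notebook only asks for directories for models, videos and TensorBoard logs.

```python
import os

# Assumed folder names, chosen for illustration only.
models_dir = os.path.join("models", "dqn_lunar")
video_dir = "videos"
log_dir = "tb_logs"

# exist_ok=True makes the cell safe to re-run.
for path in (models_dir, video_dir, log_dir):
    os.makedirs(path, exist_ok=True)
```
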


## Create the Gym env and instantiate the agent

For this example, we will use the Lunar Lander environment.

"Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine. "

Lunar Lander environment: [https://gymnasium.farama.org/environments/box2d/lunar_lander/](https://gymnasium.farama.org/environments/box2d/lunar_lander/)

![Lunar Lander](https://cdn-images-1.medium.com/max/960/1*f4VZPKOI0PYNWiwt0la0Rg.gif)


We choose the MlpPolicy because the Lunar Lander observation is a feature vector, not an image.

The type of action to use (discrete/continuous) is deduced automatically from the environment's action space.



In [None]:
# Create env for evaluation and record a video
#   Save video progress in directory: video_progress_dir
#   Record a video of the first episode only

# Create a DQN model for evaluation using the following parameters:
# - policy: "MlpPolicy"
# - environment: env
# - verbose: 1
# - exploration_final_eps: 0.1
# - target_update_interval: 250
# - tensorboard_log: log_dir 


We load a helper function to evaluate the agent:

In [None]:
# import a helper function: evaluate_policy to evaluate the policy


Let's evaluate the untrained agent. At this point it should behave like a random agent.

In [None]:
# Evaluate the agent before training to get the mean and standard deviation of its rewards

# Print the mean and standard deviation of the rewards; the video is saved in video_dir


## Train the agent and save it

1. Create an env
2. Record video
3. Create a DQN model
4. Save a checkpoint every 10,000 steps
5. Train for 100,000 (1e5) timesteps
6. Save the final result

Warning: this may take a while

In [None]:
# Trigger video creation every 50 episodes

# Create an env and Record video

# Create a DQN model

# Save every 10000 steps, save the result under "models/dqn_lunar" with prefix "dqn_lunar" using CheckpointCallback

# Start training the agent with timesteps of 100000, using checkpoint callback and setting tensorboard log name to "dqn_lunar"

# Save the final agent result optionally

# delete trained model to demonstrate loading

## Load the trained agent

In [None]:
# load the final checkpoint file dqn_lunar_final.zip


In [None]:
# Create evaluation env and create a video using RecordVideo wrapper


In [None]:
# Evaluate the trained agent

# print out the mean of rewards and std of rewards after training


In [None]:
# close the environment


# ==============================================

# Final Task: If time allows, increase the amount of training so that the mean_reward score reaches the 100+ mark. Can you tell how much training the model needs?