Hi everyone,

In this tutorial series, I want to show you how to use RLlib to train agents in everything from OpenAI Gym environments to external robotics simulators, parallelize your training setup efficiently, and customize the behavior of your agents so that you can extend the algorithms however you want.

But in this first tutorial, let's start from scratch and train a vanilla DQN agent on the CartPole-v1 environment from the OpenAI Gym library.

First of all, let's start with installing the ray library with RLlib components:

In [1]:
pip install -U "ray[rllib]"

Requirement already up-to-date: ray[rllib] in /home/anil/.local/lib/python3.8/site-packages (1.11.0)
Note: you may need to restart the kernel to use updated packages.


Now we can import our DQN agent and the gym library:

In [2]:
from ray.rllib.agents.dqn import DQNTrainer
import gym

In RLlib, each algorithm we will use is a subclass of the `Trainer` class, as in this DQN example. Now we are ready to select the environment and instantiate our agent. To instantiate a `Trainer` subclass in RLlib, we pass it a config dictionary. With this config dictionary, we define the environment we want to use and specify hyperparameters like the learning rate and the number of layers in the model's neural network.

For our first tutorial, let's select the "CartPole-v1" environment since it is easy to train on. Note that in the config below we don't specify the environment by its class; instead, we just give the name of the environment. The reason is that RLlib can resolve the default gym environments by name. But in future tutorials where we use our custom environments, we will pass the environment class itself.

By default, DQNTrainer uses a learning rate of 5e-4 and 1 hidden layer with 256 neurons. Let's change the learning rate to 1e-3 and use 2 hidden layers with 128 neurons each. We will use `torch` as the deep learning framework, so we specify it; otherwise RLlib will try to use `tensorflow`. And since we said vanilla DQN, let's turn off the double-Q update, dueling networks, and prioritized replay. In a later tutorial, we will explicitly design a dueling neural network architecture for DQN, but for now, let's continue like this:

In [3]:
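trainer = DQNTrainer(
    config={
        "env": "CartPole-v1",
        "lr": 1e-3,
        # Two hidden layers with 128 neurons each
        "hiddens": [128, 128],
        # Turn off the DQN extensions to get the vanilla algorithm
        "double_q": False,
        "dueling": False,
        "prioritized_replay": False,
        "framework": "torch"
    }
)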
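
To make "vanilla" a bit more concrete, here is a small sketch of how the bootstrap target differs with and without the double-Q update. This is plain Python written for illustration, not RLlib's implementation: the function names are hypothetical and the discount factor `gamma=0.99` is an assumption.

```python
def vanilla_dqn_target(reward, done, next_q_target, gamma=0.99):
    """Vanilla DQN: the target network both picks and evaluates the next action."""
    if done:
        # No bootstrapping past the end of an episode
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network picks the action, the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Q-values of the next state under the online and target networks (made-up numbers)
next_q_online = [1.0, 2.0]
next_q_target = [3.0, 0.5]

print(vanilla_dqn_target(1.0, False, next_q_target, gamma=0.5))                # 2.5
print(double_dqn_target(1.0, False, next_q_online, next_q_target, gamma=0.5))  # 1.25
```

With `"double_q": False`, we are asking for the first kind of target, where the maximum is taken directly over the target network's Q-values.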

Now we are ready to train! Let's train the agent for 100 iterations, where each iteration is 1000 steps. This 1000 comes from the `timesteps_per_iteration` config, which is 1000 by default. If you dig into the default configs, you will see that epsilon-greedy action selection is used for exploration and that the target Q network is updated every 500 steps.
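
If you are curious what epsilon-greedy exploration looks like, here is a minimal sketch in plain Python. It is not RLlib's code: the function names are hypothetical, and the schedule values (epsilon annealed linearly from 1.0 to 0.02 over 10,000 steps) are assumptions chosen for illustration, so check the default config for the actual settings.

```python
import random


def linear_epsilon(timestep, initial=1.0, final=0.02, anneal_steps=10_000):
    """Linearly anneal epsilon from `initial` to `final`, then hold it at `final`."""
    fraction = min(max(timestep / anneal_steps, 0.0), 1.0)
    return initial + fraction * (final - initial)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9]  # made-up Q-values for a two-action problem
for t in (0, 5_000, 20_000):
    eps = linear_epsilon(t)
    print(t, round(eps, 2), epsilon_greedy(q_values, eps))
```

With `epsilon=0.0` the function always returns the action with the highest Q-value, which is what acting fully greedily means.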

In [4]:
for _ in range(100):
    results = trainer.train()
    print(f"iter: {trainer.iteration}, mean score: {results['episode_reward_mean']}")

iter: 1, mean score: 18.884615384615383
iter: 2, mean score: 18.87
iter: 3, mean score: 20.59
iter: 4, mean score: 22.52
iter: 5, mean score: 28.27
iter: 6, mean score: 35.47
iter: 7, mean score: 43.25
iter: 8, mean score: 52.44
iter: 9, mean score: 59.91
iter: 10, mean score: 66.98
iter: 11, mean score: 76.04
iter: 12, mean score: 84.11
iter: 13, mean score: 92.71
iter: 14, mean score: 101.56
iter: 15, mean score: 108.44
iter: 16, mean score: 116.5
iter: 17, mean score: 125.86
iter: 18, mean score: 133.22
iter: 19, mean score: 137.84
iter: 20, mean score: 143.79
iter: 21, mean score: 147.88
iter: 22, mean score: 155.06
iter: 23, mean score: 163.21
iter: 24, mean score: 166.44
iter: 25, mean score: 167.37
iter: 26, mean score: 172.01
iter: 27, mean score: 179.31
iter: 28, mean score: 183.73
iter: 29, mean score: 189.13
iter: 30, mean score: 194.92
iter: 31, mean score: 201.68
iter: 32, mean score: 203.98
iter: 33, mean score: 201.82
iter: 34, mean score: 201.92
iter: 35, mean score: 20

Training completed! In the CartPole-v1 environment, the maximum score an agent can get is 500, and we managed to reach a mean score of 353. But we were taking exploratory actions during training, so now let's see what happens if we act fully greedily.

For that, we create the usual reinforcement learning loop and compute actions for observations with the trainer's `compute_single_action` method. Don't forget to specify `explore=False` so that we act fully greedily:

In [6]:
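env = gym.make("CartPole-v1")

# Play 10 evaluation episodes with the trained agent
for _ in range(10):
    obs = env.reset()
    done = False
    score = 0

    while not done:
        # explore=False disables exploration, so the agent acts greedily
        action = trainer.compute_single_action(obs, explore=False)

        obs, reward, done, _ = env.step(action)
        env.render()

        score += reward

    print(score)

# Close the render window once evaluation is done
env.close()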

500.0
500.0
500.0
500.0
500.0
500.0
500.0
500.0
500.0
500.0
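
If you want to play with the shape of this evaluation loop without a display or a trained agent, here is a sketch that runs on its own. Everything in it is hypothetical: `ToyBalanceEnv` is a made-up stand-in that only mimics the `reset()`/`step()` interface used above, and `center_policy` stands in for the greedy agent.

```python
class ToyBalanceEnv:
    """Hypothetical stand-in environment using the reset()/step() interface.

    The observation is an integer position. Action 1 moves right, action 0
    moves left. The episode ends when the position leaves [-2, 2] or after
    max_steps steps, and every step gives a reward of 1.0.
    """

    def __init__(self, max_steps=500):
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        self.position += 1 if action == 1 else -1
        self.steps += 1
        done = abs(self.position) > 2 or self.steps >= self.max_steps
        return self.position, 1.0, done, {}


def center_policy(obs):
    """Greedy stand-in for the trained agent: always move back toward 0."""
    return 0 if obs > 0 else 1


def run_episode(env, policy):
    """Run one episode and return the total reward."""
    obs = env.reset()
    done = False
    score = 0.0
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        score += reward
    return score


print(run_episode(ToyBalanceEnv(), center_policy))   # 500.0
print(run_episode(ToyBalanceEnv(), lambda obs: 1))   # 3.0
```

The second call uses a policy that always moves right, so the episode ends after three steps. Swapping `policy(obs)` for `trainer.compute_single_action(obs, explore=False)` gives back the loop from the cell above.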


As the output shows, we played 10 episodes and got a reward of 500 in each one! And this brings us to the end of the first episode of the series. In the next tutorial, we will create a custom environment with continuous actions and try to solve it with the DDPG algorithm.