# ChainerRL Quickstart Guide

This is a quickstart guide for users who just want to try ChainerRL for the first time.

If you have not yet installed `chainerrl`, run the command below to install it:
```
pip install chainerrl
```

If you have already installed `chainerrl`, let's begin!

First, you need to import the necessary modules. The module name of ChainerRL is `chainerrl`. Let's import `gym` and `numpy` as well, since they are used later.

In [None]:
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

ChainerRL can be applied to any problem that is modeled as an "environment". An environment must define its observation space and action space and have at least two methods: `reset` and `step`. OpenAI Gym provides various kinds of environments.

Let's try 'CartPole-v0', which is a classic control problem. You can see below that its observation space consists of four real numbers while its action space consists of two discrete actions.

Below are the important methods of environments.
- `env.reset` will reset the environment to the initial state and return the initial observation.
- `env.step` will execute a given action, move to the next state and return four values:
  - a next observation
  - a scalar reward
  - a boolean value that indicates whether the current state is terminal
  - additional information
- `env.render` will render the current state.

In [None]:
env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space', env.action_space)
obs = env.reset()
env.render()
print('initial observation:', obs)
action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)
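
To make the `reset`/`step` protocol concrete, here is a minimal sketch of a hand-written environment in plain Python. `CoinFlipEnv` is a hypothetical toy class invented for this illustration; it is not part of Gym or ChainerRL. For brevity it omits the `observation_space` and `action_space` attributes, so it only shows the method protocol.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment, not part of Gym or ChainerRL.

    The agent guesses the outcome of the next coin flip (action 0 or 1).
    An episode lasts a fixed number of steps.
    """

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        # Go back to the initial state and return the initial observation.
        self.t = 0
        return self.rng.randint(0, 1)

    def step(self, action):
        # Execute the action and return the four values listed above:
        # next observation, scalar reward, terminal flag, extra info.
        flip = self.rng.randint(0, 1)
        reward = 1.0 if action == flip else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return flip, reward, done, {}


# Run one episode with random actions.
toy_env = CoinFlipEnv(episode_length=5)
toy_obs = toy_env.reset()
toy_done = False
n_steps = 0
total_reward = 0.0
while not toy_done:
    toy_obs, toy_r, toy_done, toy_info = toy_env.step(random.randint(0, 1))
    total_reward += toy_r
    n_steps += 1
print('steps:', n_steps, 'total reward:', total_reward)
```

The episode loop at the bottom has the same shape as the loops used later in this guide for training and testing.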

Now you have defined your environment. Next, you need to define an agent, which will learn through interactions with the environment.

ChainerRL provides various agents, each of which implements a deep reinforcement learning algorithm.

To use DQN, you need to define a so-called Q-function that receives an observation and returns a value for each action the agent can take. In ChainerRL, you can define your Q-function as a `chainer.Link` as below. Note that you need to wrap the outputs in `chainerrl.action_value.DiscreteActionValue`.

In [None]:
class QFunction(chainer.Chain):

    def __init__(self, ndim_obs, n_actions, n_hidden_channels=50):
        super().__init__(
            l0=L.Linear(ndim_obs, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

ndim_obs = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(ndim_obs, n_actions)

# Uncomment if CUDA is available
# q_func.to_gpu(0)

You can also use ChainerRL's predefined Q-functions.

In [None]:
_q_func = chainerrl.q_functions.FCStateQFunctionWithDiscreteAction(
    ndim_obs, n_actions,
    n_hidden_layers=2, n_hidden_channels=100)

As in Chainer, `chainer.Optimizer` is used to update models.

In [None]:
# Use Adam to optimize q_func. eps=1e-2 is for stability.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

A Q-function model and its optimizer are used by a DQN agent. To create a DQN agent, you need to specify a few more parameters and configurations.

In [None]:
# Set the discount factor that discounts future rewards
gamma = 0.95

# For exploration, epsilon-greedy with epsilon=0.2 is used
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.2, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=1000, update_frequency=1,
    target_update_frequency=100, phi=phi)
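
Two of the components configured above, epsilon-greedy exploration and the replay buffer, can be sketched in plain Python to show the idea behind them. This is only an illustration under simplifying assumptions: `epsilon_greedy` and `SimpleReplayBuffer` are hypothetical names, the buffer is assumed to sample uniformly, and ChainerRL's own implementations differ in detail.

```python
import collections
import random


def epsilon_greedy(q_values, epsilon, random_action_func, rng=random):
    """With probability epsilon act randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        return random_action_func()
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


class SimpleReplayBuffer:
    """Hypothetical fixed-capacity buffer; oldest transitions are dropped."""

    def __init__(self, capacity):
        self.memory = collections.deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, is_terminal):
        self.memory.append((state, action, reward, next_state, is_terminal))

    def sample(self, n):
        # Uniformly sample n stored transitions without replacement.
        return random.sample(list(self.memory), n)

    def __len__(self):
        return len(self.memory)


q_values = [0.1, 0.7]
# epsilon=0.0 never explores, so the greedy action (index 1) is chosen.
greedy_action = epsilon_greedy(q_values, 0.0, lambda: random.randrange(2))
# epsilon=1.0 always explores, so the random action function is used.
explored_action = epsilon_greedy(q_values, 1.0, lambda: 0)

# With capacity 3, only the 3 most recent of 5 transitions are kept.
buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    buffer.append(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
```

With `epsilon=0.2` as in the cell above, roughly one action in five would be chosen at random during training.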

Now you have an agent and an environment. It's time to start reinforcement learning!

In training, use `agent.act_and_train` to select exploratory actions. `agent.stop_episode_and_train` must be called after finishing an episode. You can get training statistics of the agent via `agent.get_statistics`.

In [None]:
n_episodes = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while not done and t < 200:
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i,
              'R:', R,
              'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
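
Note that the `R` printed by this loop is the plain, undiscounted sum of rewards, while the agent's `gamma` weights future rewards less than immediate ones in its learning target. The small sketch below compares the two quantities for a short reward sequence; `discounted_return` is a hypothetical helper written for this illustration, not a ChainerRL function.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# CartPole gives a reward of 1 at every step; take three steps as an example.
rewards = [1.0, 1.0, 1.0]
undiscounted = sum(rewards)
discounted = discounted_return(rewards, gamma=0.95)
print('undiscounted:', undiscounted, 'discounted:', round(discounted, 4))
```

Here the discounted return is 1 + 0.95 + 0.95² = 2.8525, slightly less than the undiscounted sum of 3.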

You have now finished training the agent. How good is it? You can test it by using `agent.act` and `agent.stop_episode` instead. No exploration is performed during testing.

In [None]:
env.render(close=True)
for i in range(10):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        env.render()
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()

The only remaining task is to save the agent so that you can reuse it. Call `agent.save` to save it and `agent.load` to load it.

In [None]:
# Save an agent to the 'agent' directory
agent.save('agent')
# Load an agent from the 'agent' directory
# agent.load('agent')

Almost every time you apply RL to a problem, you need to train, test and save agents, so ChainerRL provides utility functions for these tasks.

In [None]:
# Train agent at env for 10000 steps.
# Evaluate it after every 1000 steps.
# For each evaluation 10 episodes are sampled.
# Save everything to the result dir.
chainerrl.experiments.train_agent_with_evaluation(
    agent, env, steps=10000, eval_n_runs=10,
    eval_frequency=1000, outdir='result')

That's all for the ChainerRL quickstart guide. To learn more about ChainerRL, please look into the `examples` directories. Thank you.