## Train an RL agent
This notebook covers the following topics:

 - defining a reward function,
 - defining a featurize function,
 - training a single RL agent.

In this notebook, a reinforcement learning agent is trained to control the current flowing through an inductor.
Using a simple case, it shows how the agent can learn and be applied to an electrical power grid simulated with the JEG package.

The use case is shown in the figure below.
This environment consists of a single-phase electrical power grid with one source and one load connected via a cable.

![](figures/RL_single_agent.png "")

First, we define the environment with the configuration shown in the figure. 
For more information on how to set up an environment, see `Env_Create_DEMO.ipynb`.

`RL` is selected as `control_type` for the source (`parameters["source"][1]["control_type"]`).
Initially, any key can be used as the `mode`. Here, we choose the name `my_ddpg`. 
This key is then used to link an agent to the source and its corresponding state and action ids.
Based on these indices, the state that will be provided to the agent as well as the actions the agent outputs are passed to the appropriate places with the help of a `MultiAgentGridController`.
For further details, please refer to the user guide -> `MultiAgentGridController`.



In [None]:
using Dare
using ReinforcementLearning


In [None]:
# calculate passive load for wanted setting / power rating
R_load, L_load, X, Z = Parallel_Load_Impedance(100e3, 1, 230)

# define grid using CM
CM = [0. 1.
    -1. 0.]

# Set parameters according to the graphic above
parameters = Dict{Any, Any}(
    "source" => Any[
                    Dict{Any, Any}("pwr" => 200e3, "control_type" => "RL", "mode" => "my_ddpg", "fltr" => "L"),
                    ],
    "load"   => Any[
                    Dict{Any, Any}("impedance" => "R", "R" => R_load, "v_limit"=>1e4, "i_limit"=>1e4),
                    ],
    "grid" => Dict{Any, Any}("phase" => 1)
)

To teach the agent to control the current in a certain way, it needs two pieces of information: the value the current should have (the reference value, provided via `featurize`) and how good the state reached by the chosen action is (provided via `reward`).

Therefore, the reference value has to be defined. 
Here we use a constant value to keep the example simple.
But since the `reference(t)` function takes the simulation time as its argument, more complex, time-dependent signals could be defined.

In [None]:
function reference(t)
    return 1
end
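
As a minimal sketch of such a time-dependent signal, the reference could be a sine wave. The function name `reference_sine` and its keyword arguments (`amplitude`, `frequency`) are assumptions made for illustration; the rest of this notebook keeps using the constant `reference(t)`.

```julia
# Hypothetical time-dependent reference: a 50 Hz sine with an amplitude of 1.
# The name `reference_sine` and its keyword arguments are illustrative only.
function reference_sine(t; amplitude = 1.0, frequency = 50.0)
    return amplitude * sin(2π * frequency * t)
end

reference_sine(0.0)    # value at t = 0 s
reference_sine(0.005)  # value at a quarter period of the 50 Hz signal
```

To use such a signal, the calls to `reference` in the `featurize` and `reward` functions would have to point to it instead.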

Afterwards the `featurize()` function, which gives the user the opportunity to modify a state before it gets passed to the agent, is defined.

It takes three arguments:
- `state` contains all the state values that correspond to the source controlled by the agent with key `name`
- `env` references the environment
- `name` contains the key of the agent

The signal generated by the `reference` function is then appended to the state for the agent `my_ddpg`. This helps the agent to learn, because later we define a reward that reaches its maximum when the measured current matches the reference value.
The reference value has to be normalized so that it fits the range of the normalized states.

Additionally, more signals could be added here to enhance the learning process.

As stated before, `state` already contains all state values of the source that the agent with key `name` should control.
However, the environment maintains many more states than that. Through `featurize` we could expose them to the agent, but we refrain from doing so here since we want to simulate a scenario where the source the agent controls is far away (e.g. 1 km) from the load it is supplying. 
In cases like this, the agent commonly has no knowledge of the load's states, since no communication or exchange of measurements between source and load is assumed.

In other examples, the electrical power grid consists of multiple sources and loads, and the other sources are controlled by other agents or classic controllers. In that case, every controller/agent typically knows the states of the source it controls, but not the states of sources controlled by another agent/controller.
(For more information see `MultiAgentGridController` and `inner_featurize` of the `env`.)

In [None]:
featurize_ddpg = function(state, env, name)
    if name == "my_ddpg"
        norm_ref = env.nc.parameters["source"][1]["i_limit"]
        state = vcat(state, reference(env.t)/norm_ref)
    end
    return state
end
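
The effect of appending the normalized reference can be tried out without the package. The following is only a sketch: `mock_env` is a stand-in that mimics the two fields accessed above, the values `i_limit = 10.0` and `t = 0.0` are made up, and `featurize_sketch` and `mock_reference` are hypothetical names. The sketch returns the state explicitly so that states for other names pass through unchanged.

```julia
# Stand-in for the environment: only the fields used below are mocked.
# The values (i_limit = 10.0, t = 0.0) are made up for illustration.
mock_env = (t = 0.0,
            nc = (parameters = Dict("source" => [Dict("i_limit" => 10.0)]),))

mock_reference(t) = 1.0   # constant reference, as in this notebook

function featurize_sketch(state, env, name)
    if name == "my_ddpg"
        # Normalize the reference by the current limit and append it
        norm_ref = env.nc.parameters["source"][1]["i_limit"]
        state = vcat(state, mock_reference(env.t) / norm_ref)
    end
    return state
end

featurize_sketch([0.2], mock_env, "my_ddpg")   # state extended by the normalized reference
featurize_sketch([0.2], mock_env, "other")     # state returned unchanged
```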

Before defining the environment, the `reward()` function has to be defined. It gives the agent feedback on how good the chosen action was.
First, the state to be controlled is taken from the current environment state values.
Since the states are normalized by the limits the electrical components can handle, a magnitude greater than `1` means that the state limit is exceeded, which typically leads to a system crash.
Therefore, it is first checked whether the magnitude of the measured state is greater than `1`. In that case a punishment is returned, which is chosen here to be `r = -1`.

In case the controlled state is within the valid state space, the reward is calculated based on the error between the desired reference value and the measured state value. 
If these values are the same, meaning the agent perfectly fulfills the control task, a reward of `r = 1` is returned to the agent (so that $r \in [-1, 1]$).
If the measured value differs from the reference, an error term is subtracted from the maximum reward. In this example it is the square root of the halved absolute error, referred to here as the root-mean-square error (RMSE): `r = 1 - RMSE`:

$r = 1 - \sqrt{\frac{|i_\mathrm{L,ref} - i_\mathrm{L1}|}{2}}$

To keep the reward in the desired range, the current difference is divided by 2. (E.g., in the worst case, if the reference value equals the corresponding current limit, $i_\mathrm{L,ref} = i_\mathrm{lim}$, and the measured current is the negative current limit, $i_\mathrm{L1} = -i_\mathrm{lim}$, more than 1 would be subtracted without this normalization.)

In [None]:
function reward_function(env, name = nothing)
    if name == "my_ddpg"
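        # Look up the normalized inductor current of source 1 in the state vector
        index_1 = findfirst(x -> x == "source1_i_L1", env.state_ids)
        state_to_control = env.state[index_1]

        if any(abs.(state_to_control) .> 1)
            # Limit exceeded: return the punishment
            return -1
        else
            # Reward is 1 minus the square root of the halved absolute error
            refs = reference(env.t)
            norm_ref = env.nc.parameters["source"][1]["i_limit"]
            r = 1 - ((abs.(refs / norm_ref - state_to_control) / 2) .^ 0.5)
            return r
        end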
    end
end
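
The shape of this reward can be checked in isolation. The sketch below is a standalone helper that works on values assumed to be normalized already; `reward_sketch` and its argument names are hypothetical and not part of the package. With these inputs, perfect tracking gives `1`, the largest error inside the limits gives `0`, and a violated limit gives `-1`.

```julia
# Standalone version of the reward shape, working on already normalized values.
# `reward_sketch` is an illustrative helper, not part of the package.
function reward_sketch(i_ref_norm, i_meas_norm)
    abs(i_meas_norm) > 1 && return -1.0          # limit exceeded: punishment
    return 1 - sqrt(abs(i_ref_norm - i_meas_norm) / 2)
end

reward_sketch(0.5, 0.5)    # perfect tracking
reward_sketch(1.0, -1.0)   # largest possible error inside the limits
reward_sketch(0.5, 1.2)    # measured current beyond its limit
```

Because of the square root, small errors are penalized relatively strongly, so the reward stays sensitive close to the reference.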

Then, the defined parameters and the featurize and reward functions are used to create an environment consisting of the electrical power grid. To keep this first learning example simple, the action given to the env is not delayed internally. 

In [None]:
env = SimEnv(
    CM = CM, 
    parameters = parameters, 
    t_end = 0.1, 
    featurize = featurize_ddpg, 
    reward_function = reward_function, 
    action_delay = 0)

In this example, a `Deep Deterministic Policy Gradient` agent (https://arxiv.org/abs/1509.02971, https://spinningup.openai.com/en/latest/algorithms/ddpg.html) is chosen, which can learn a control task on continuous state and action spaces.
It is configured using the `create_agent_ddpg()` function, which uses the information about the state and action ids that is derived from the parameter dict and stored in the `agent_dict` of the env:

`env.agent_dict[chosen_key]` (the chosen key is `my_ddpg` here):
- `"source_number"`: ID/number of the source the agent with this key controls
- `"mode"`: Name of the agent
- `"action_ids"`: List of strings with the action ids the agent controls, i.e. those belonging to the `"source_number"`
- `"state_ids"`: List of strings with the state ids assigned to the agent, i.e. those belonging to the `"source_number"`

This information is used in the `setup_agents()` method to configure the control-side of the experiment.

The agent is configured to receive as many inputs as the environment returns for its state (after `featurize`) and to return as many outputs as the env requests actions for the corresponding ids.

In [None]:
agent = create_agent_ddpg(na = length(env.agent_dict["my_ddpg"]["action_ids"]),
                          ns = length(state(env, "my_ddpg")),
                          use_gpu = false)
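
How the two sizes come about can be sketched with a mocked dictionary entry. This is an illustration under assumptions: `mock_agent_dict` is not the real `env.agent_dict`, the action id `"source1_u"` is a hypothetical placeholder, and the state id list is assumed to contain only the inductor current used in this notebook.

```julia
# Mocked entry of `env.agent_dict`; the action id below is a hypothetical placeholder.
mock_agent_dict = Dict(
    "my_ddpg" => Dict(
        "source_number" => 1,
        "mode"          => "my_ddpg",
        "action_ids"    => ["source1_u"],        # hypothetical id
        "state_ids"     => ["source1_i_L1"],     # assumed to be the only state
    ),
)

# Number of outputs: one per action id
na = length(mock_agent_dict["my_ddpg"]["action_ids"])
# Number of inputs: `featurize` appended one reference value to the source states
ns = length(mock_agent_dict["my_ddpg"]["state_ids"]) + 1
```

In the notebook itself, `ns` is taken from `state(env, "my_ddpg")`, so any change to `featurize` is picked up without manual bookkeeping.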

The `setup_agents()` function takes the control types defined in the parameter dict and hands the correct indices to the corresponding controllers/agents.
The function returns `controllers`, an instance of the `MultiAgentGridController`, which contains the different agents and classic controllers and maps their actions to the corresponding sources. 

Since only one RL agent is used in this example, it contains only the defined `my_ddpg` agent. 
The agent handed over to the `setup_agents()` function is internally extended by a name to a `named policy` (https://juliareinforcementlearning.org/docs/rlcore/#ReinforcementLearningCore.NamedPolicy).
Using this name, the `MultiAgentGridController` (compare https://juliareinforcementlearning.org/docs/rlzoo/#ReinforcementLearningZoo.MADDPGManager) makes it possible to call the different agents/controllers by name during training and application.

To use the previously defined agent, a dict linking the `chosen_key` (`my_ddpg`) to the defined RL agent is handed over to the `setup_agents` method: 

In [None]:
my_custom_agents = Dict("my_ddpg" => agent)

controllers = setup_agents(env, my_custom_agents)
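
The idea of routing states and actions by name and index can be illustrated without the package. The sketch below is not the `MultiAgentGridController` implementation: `mock_policies`, `state_idx`, `action_idx` and `route_actions` are hypothetical names, the policies are plain functions, and a second controller is added only to make the routing visible.

```julia
# Each "policy" is a plain function from a state vector to an action vector.
mock_policies = Dict(
    "my_ddpg" => s -> [0.5 * s[1]],    # stand-in for the RL agent
    "classic" => s -> [-s[1]],         # stand-in for a classic controller
)

# Hypothetical index bookkeeping: which entries of the global state and
# action vectors belong to which controller.
state_idx  = Dict("my_ddpg" => [1], "classic" => [2])
action_idx = Dict("my_ddpg" => [1], "classic" => [2])

function route_actions(policies, state_idx, action_idx, global_state)
    # One slot per action index over all controllers
    action = zeros(sum(length, values(action_idx)))
    for (name, policy) in policies
        local_state = global_state[state_idx[name]]        # states visible to this controller
        action[action_idx[name]] .= policy(local_state)    # write its actions to its slots
    end
    return action
end

route_actions(mock_policies, state_idx, action_idx, [0.4, 0.1])
```

Each controller only sees the slice of the state that belongs to its source, which mirrors the assumption that agents have no knowledge of states they do not control.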

The `controllers` object in this example consists only of the one RL agent (`my_ddpg`) and can be trained for 20 episodes using the `learn()` function:

In [None]:
learn(controllers, env, num_episodes = 20)

After the training, the `simulate()` function is used to run a test episode without action noise, and the state to be controlled ($i_\mathrm{L1}$) is plotted:

In [None]:

states_to_plot = ["source1_i_L1"]
hook = data_hook(collect_state_ids = states_to_plot)

simulate(controllers, env, hook=hook)

plot_hook_results(hook = hook,
                  states_to_plot  = states_to_plot)