# Deep Reinforcement Learning for Firm Power Provision by Wind Farms

## The Problem

Wind farms can provide abundant clean, "free" energy.

Traditional wind farm control aims to maximize the amount of available power.

It does this by adjusting the *yaw angle* and the *axial induction factor* of each turbine.

But the electricity grid doesn't need *lots* of power; it needs just the right *amount* of power at each point in time.

So traditional wind farm control is problematic because:

1. power from wind farms is often *curtailed* and left to go to waste, and
2. maximizing the power that the turbines extract from the wind above what is required can lead to unnecessary fatigue loading over time.

![Hornsrev Wind Farm](figs/hornsrev.jpeg)

## Reinforcement Learning is all the rage... but what is it?

An autonomous system learning what to do by **trial and error**:
> ... learning what to do — how to map situations to actions — so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. 

![Reinforcement Learning Feedback Loop](figs/rl.png)

where there is a **delayed reward** for actions taken in the present:
> ... actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards.

![Grid World Example](figs/basicgridworld.png)

So the agent fumbles through a **trajectory** of states, actions and rewards and learns as it goes:

$S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,\ldots$

### Key Ingredients
- Environment, $E$
- Agent(s), $A_1, \ldots, A_l$
- States, $s_t \in \mathcal{S}$
- Observations, $o_t \in \mathcal{O}$
- Actions, $a_t \in \mathcal{A}$
- Policy, $\pi(a_t | s_t)$, probability of selecting action $a_t$ from the current state $s_t$
- Reward, $r(s_t, a_t) \in \mathbb{R}$

At each time-step $t$, the agent tries to select an action $a_t \sim \pi(a_t | s_t)$ that maximizes the **discounted return** with a **discount rate** $\gamma \in [0, 1]$:

$$
\begin{aligned}
G_t &= \sum_{i=t}^T \gamma^{(i - t)} r(s_i, a_i)\\
    &= r(s_t, a_t) + \sum_{i=t+1}^T \gamma^{(i - t)} r(s_i, a_i)\\
    &= r(s_t, a_t) + \gamma G_{t+1}
\end{aligned}
$$
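
As a minimal sketch of the recursion $G_t = r(s_t, a_t) + \gamma G_{t+1}$, the returns of a finite episode can be computed by sweeping backwards over the rewards. The function names and the example rewards below are illustrative assumptions, not part of the wind farm code.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] for one finite episode, using G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    next_return = 0.0  # the return beyond the final time-step is zero
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


def discounted_return_direct(rewards, gamma, t=0):
    """Evaluate the defining sum directly, as a cross-check on the recursion."""
    return sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards for a three-step episode
returns = discounted_returns(rewards, gamma=0.5)
print(returns)
```

Both routes should agree for every $t$; the backward sweep simply avoids recomputing the tail of the sum.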

The **action-value function for policy $\pi$** describes the expected return after taking an action $a_t$ from state $s_t$ and thereafter following a policy $\pi$, and is thus a measure of how good it is to perform a given action in a given state:

$$
Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t | s_t=s, a_t=a\right]
$$

The action-value function satisfies the recursive relationship known as the **Bellman equation**:

$$
\begin{aligned}
Q^\pi(s, a) &= \mathbb{E}_{r_t, s_{t+1} \sim E, a_{t+1} \sim \pi} \left[G_t | s_t=s, a_t=a\right]\\
&= \mathbb{E}_{r_t, s_{t+1} \sim E, a_{t+1} \sim \pi} \left[r_t + \gamma G_{t+1} | s_t=s, a_t=a\right]\\
&= \sum_{s',r} \text{Pr}(s', r | s, a) \sum_{a'} \pi(a'|s') \left[r + \gamma \mathbb{E}_\pi \left[ G_{t+1} | s_{t+1}=s', a_{t+1}=a'\right]\right] \\
&= \mathbb{E}_{r_t, s_{t+1} \sim E} \left[r_{t} + \gamma \mathbb{E}_{a_{t+1} \sim \pi} \left[Q^\pi(s_{t+1}, a_{t+1})\right] | s_t=s, a_t=a\right]
\end{aligned}
$$

If the policy can be described as a deterministic function of the state, $\mu(s)$, then:

$$
Q^\mu(s, a) = \mathbb{E}_{r_t, s_{t+1} \sim E} \left[r_{t} + \gamma Q^\mu(s_{t+1}, \mu(s_{t+1}))\right]
$$
and the expectation depends only on the environment's influence on $r_t$ and $s_{t+1}$ (and not on the influence of a stochastic policy $\pi$ on $a_{t+1}$).
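
To make the deterministic Bellman equation concrete, here is a small sketch that repeatedly applies $Q^\mu(s, a) \leftarrow r + \gamma Q^\mu(s', \mu(s'))$ on a made-up two-state, two-action environment with deterministic transitions. The environment tables, the policy `MU` and the discount rate are all hypothetical values chosen so the fixed point is easy to check by hand.

```python
GAMMA = 0.5

# Hypothetical deterministic environment: (state, action) -> next state / reward.
NEXT_STATE = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

# Hypothetical deterministic policy mu(s): always choose action 1.
MU = {0: 1, 1: 1}


def evaluate_policy(sweeps=100):
    """Iterate the Bellman equation for Q^mu until it (numerically) stops changing."""
    q = {state_action: 0.0 for state_action in NEXT_STATE}
    for _ in range(sweeps):
        # Synchronous sweep: every entry is rebuilt from the previous table.
        q = {
            (s, a): REWARD[(s, a)]
            + GAMMA * q[(NEXT_STATE[(s, a)], MU[NEXT_STATE[(s, a)]])]
            for (s, a) in NEXT_STATE
        }
    return q


q_mu = evaluate_policy()
print(q_mu)
```

By hand, $Q^\mu(1, 1) = 2 + 0.5\,Q^\mu(1, 1)$ gives $4$, and the other entries follow from it, which is what the iteration converges to.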

### Policy-Gradient Methods
- Learning a parameterized policy, $\pi(a | s, \theta)$,
- ... by updating the parameters based on the gradient of a performance measure, $J(\theta)$:

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \hat{J}(\theta_t)
$$
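
The update rule itself is plain gradient ascent. The sketch below applies it to a toy scalar performance measure whose maximum is known, with the gradient estimate $\nabla_\theta \hat{J}$ replaced by a finite difference. In actual reinforcement learning the gradient would be estimated from sampled trajectories; the function `performance` and the step sizes here are assumptions for illustration only.

```python
def performance(theta):
    """Toy performance measure J(theta), maximized at theta = 2."""
    return -(theta - 2.0) ** 2


def estimated_gradient(theta, step=1e-4):
    """Central finite-difference estimate of dJ/dtheta (stand-in for a sampled estimate)."""
    return (performance(theta + step) - performance(theta - step)) / (2.0 * step)


def gradient_ascent(theta, alpha=0.1, iterations=100):
    """Apply theta <- theta + alpha * grad J(theta) repeatedly."""
    for _ in range(iterations):
        theta = theta + alpha * estimated_gradient(theta)
    return theta


theta_final = gradient_ascent(theta=0.0)
print(theta_final)
```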

### Actor-Critic Methods 
Learning a policy (the **actor**) *and* a value function (the **critic**) that assesses the actions the actor chooses.
- The **policy**, $\pi(a | s)$, is the actor.
- The **action-value function**, $Q^\pi(s, a)$, is the critic.

### An Algorithm for Deterministic Policies - Deterministic Policy Gradient (DPG)
- learns a deterministic parameterized actor function, $\mu(s | \theta^\mu)$ with a policy-gradient update:
$$
\theta^\mu_{t+1} = \theta^\mu_t + \alpha^\mu \nabla_{\theta^\mu} \hat{J}(\theta^\mu_t)
$$
where the **performance measure** is defined as the expected discounted return from the first time-step:
$$
J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \mu} \left[G_1\right]
$$
and the **gradient** with respect to the actor function's parameters is found by applying the chain rule of differentiation:
$$
\begin{aligned}
\nabla_{\theta^\mu} J &\approx \mathbb{E}_{s_t \sim E} \left [ \nabla_{\theta^\mu} Q^\mu(s, a | \theta^Q) |_{s=s_t,a=\mu(s_t|\theta^\mu)} \right]\\
&= \mathbb{E}_{s_t \sim E} \left [ \nabla_{a} Q^\mu(s, a | \theta^Q) |_{s=s_t,a=\mu(s_t|\theta^\mu)} \nabla_{\theta^\mu} \mu(s|\theta^\mu)|_{s=s_t} \right]
\end{aligned}
$$

- learns a parameterized critic function, $Q^\mu(s, a | \theta^Q)$ with an estimation error gradient update:

$$
\theta^Q_{t+1} = \theta^Q_{t} + \alpha^Q \nabla_{\theta^Q} Q^\mu(s_t, a_t | \theta^Q_t) \left( \underbrace{r_t + \gamma Q^\mu(s_{t+1}, \mu(s_{t+1} | \theta^\mu) | \theta^Q_t)}_\text{target value} - \underbrace{Q^\mu(s_t, a_t | \theta^Q_t)}_\text{current estimate} \right)
$$
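
A minimal sketch of one critic update, assuming a critic that is linear in hand-picked features so that $\nabla_{\theta^Q} Q^\mu$ is just the feature vector. The feature map, the proportional actor `actor(state, gain)` and the example transition are hypothetical; the target uses the actor's action at the next state.

```python
def features(state, action):
    """Hypothetical feature vector phi(s, a) for a linear critic."""
    return [1.0, state, action, state * action]


def q_value(theta, state, action):
    """Linear critic Q(s, a | theta) = theta . phi(s, a)."""
    return sum(w * x for w, x in zip(theta, features(state, action)))


def actor(state, gain):
    """Hypothetical deterministic actor mu(s) = gain * s."""
    return gain * state


def critic_update(theta, transition, actor_gain, alpha=0.1, gamma=0.9):
    """One estimation-error gradient step on the critic parameters."""
    state, action, reward, next_state = transition
    target = reward + gamma * q_value(theta, next_state, actor(next_state, actor_gain))
    td_error = target - q_value(theta, state, action)
    gradient = features(state, action)  # gradient of a linear critic w.r.t. theta
    new_theta = [w + alpha * td_error * g for w, g in zip(theta, gradient)]
    return new_theta, td_error


theta = [0.0, 0.0, 0.0, 0.0]
transition = (1.0, 0.5, 2.0, 0.0)  # (s_t, a_t, r_t, s_{t+1}), made-up numbers
theta, td_error = critic_update(theta, transition, actor_gain=1.0)
print(theta, td_error)
```

After the step, the critic's estimate for the visited state-action pair has moved towards the target value.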

### Adding Deep Learning into the Mix - Deep Deterministic Policy Gradient (DDPG)
Artificial neural networks (ANNs) can be used for nonlinear function approximation.

We learn the parameterized critic by minimizing the loss:

$$
L(\theta^Q) = \mathbb{E}\left[(\underbrace{y_t}_\text{target value} - \underbrace{Q^\mu(s_t, a_t | \theta^Q)}_\text{current estimate})^2\right]
$$

where $y_t = r(s_t, a_t) + \gamma Q^\mu(s_{t+1}, \mu(s_{t+1} | \theta^\mu) | \theta^Q)$.

This update algorithm is known as **Q-learning**.

We learn the parameterized policy by minimizing the loss:

$$
L(\theta^\mu) = \mathbb{E}\left[-Q^\mu(s_t, \mu(s_t | \theta^\mu) | \theta^Q)\right]
$$
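
The two losses can be evaluated on a minibatch of transitions as sketched below. To keep the example dependency-free, the "networks" are stand-ins: a two-weight linear critic and a proportional actor, both hypothetical. With real ANNs, the same quantities would be computed by an automatic differentiation library and minimized by gradient descent.

```python
GAMMA = 0.5


def critic(state, action, weights):
    """Stand-in critic Q(s, a | theta_Q) with two weights."""
    w_state, w_action = weights
    return w_state * state + w_action * action


def actor(state, gain):
    """Stand-in actor mu(s | theta_mu) = gain * s."""
    return gain * state


def critic_loss(batch, critic_weights, actor_gain):
    """Mean squared error between target values y_t and current estimates."""
    squared_errors = []
    for state, action, reward, next_state in batch:
        next_action = actor(next_state, actor_gain)
        target = reward + GAMMA * critic(next_state, next_action, critic_weights)
        estimate = critic(state, action, critic_weights)
        squared_errors.append((target - estimate) ** 2)
    return sum(squared_errors) / len(squared_errors)


def actor_loss(batch, critic_weights, actor_gain):
    """Negative mean critic value of the actions the actor would choose."""
    values = [critic(s, actor(s, actor_gain), critic_weights) for s, _, _, _ in batch]
    return -sum(values) / len(values)


# Made-up minibatch of (s_t, a_t, r_t, s_{t+1}) tuples.
batch = [(1.0, 0.0, 1.0, 2.0), (2.0, 1.0, 0.0, 0.0)]
example_critic_loss = critic_loss(batch, critic_weights=(1.0, 2.0), actor_gain=0.5)
example_actor_loss = actor_loss(batch, critic_weights=(1.0, 2.0), actor_gain=0.5)
print(example_critic_loss, example_actor_loss)
```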



Directly implementing Q-learning with ANNs has proven to be unstable in many environments since the learned critic network $Q^\mu(s, a | \theta^Q)$ being updated is also used to calculate the target value $y_t$, so the update is prone to divergence.

Enter the **DDPG** solution!
1. Create a copy of the actor and critic networks, $\mu'(s | \theta^{\mu'})$ and $Q'(s, a | \theta^{Q'})$, respectively, and use these copies to calculate the target values.

2. Apply "soft" updates to the weights of these **target networks**
$$
\begin{aligned}
\theta^{\mu'}_{t+1} &= \tau \theta^{\mu}_{t} + (1-\tau)\theta^{\mu'}_{t}\\
\theta^{Q'}_{t+1} &= \tau \theta^{Q}_{t} + (1-\tau)\theta^{Q'}_{t}
\end{aligned}
$$
where a small value of $\tau \ll 1$ means that the target weights are constrained to be updated very slowly, greatly improving the stability of learning at the cost of slower convergence.
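
The soft update is an element-wise blend of the two weight sets. A minimal sketch, assuming the weights are stored as flat lists of floats (a real implementation would loop over the network's parameter tensors instead):

```python
def soft_update(target_weights, online_weights, tau):
    """Blend online weights into the target weights: tau * online + (1 - tau) * target."""
    return [
        tau * w_online + (1.0 - tau) * w_target
        for w_online, w_target in zip(online_weights, target_weights)
    ]


online = [1.0, 2.0]   # hypothetical learned weights
target = [0.0, 0.0]   # hypothetical target-network weights

target = soft_update(target, online, tau=0.1)
print(target)  # the target has moved 10% of the way towards the online weights

# With the online weights held fixed, the gap shrinks by (1 - tau) per update.
for _ in range(200):
    target = soft_update(target, online, tau=0.1)
print(target)
```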

### To be greedy or learn more - Exploitation vs. Exploration
There are **greedy actions**...
> If you maintain estimates of the action values, $Q(s, a)$, then at any time step there is at least one action whose estimated value is greatest ... you are *exploiting* your current knowledge of the values of the actions.

$$
a_t = \arg\,\max\limits_a Q^\mu_t(s, a | \theta^Q)
$$

there are **nongreedy actions**...
> If instead you select one of the nongreedy actions ... you are *exploring*, because this enables you to improve your estimate of the nongreedy action’s value.

and then there are **$\epsilon$-greedy** actions...
> to behave greedily most of the time, but every once in a while, say with small probability $\epsilon$, instead select randomly from among all the actions with equal probability

$$
a_t = \begin{cases}
\text{uniform sample from } \mathcal{A} & \text{with probability } \epsilon \\
\arg\,\max\limits_a Q^\mu_t(s, a | \theta^Q) & \text{with probability } 1-\epsilon
\end{cases}
$$
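
For a finite set of actions, $\epsilon$-greedy selection takes only a few lines. The sketch below assumes the estimated action values for the current state are available as a list indexed by action; the example values are made up.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit


rng = random.Random(0)           # seeded generator so runs are repeatable
q_values = [0.1, 0.9, 0.3]       # hypothetical Q(s, a) for three actions
greedy_choice = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
mixed_choices = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(10)]
print(greedy_choice, mixed_choices)
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it ignores the value estimates entirely.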

### Too many cooks ... - Multiagent Reinforcement Learning

Multiple agents $\Rightarrow$ multiple state-spaces, observation-spaces, action-spaces, action-value functions, policies $\Rightarrow$ a *lot* of NNs to keep track of.

Enter **Parameter-Sharing**!
- Rather than maintaining an actor, critic, target actor and target critic network for every agent, we use one of each for all agents.
- Add an **agent indicator** variable to the observation space to distinguish different agents.
- Pad shorter observations with zeros to maintain a constant-sized observation space.
- Pad shorter actions with zeros to maintain a constant-sized action-space, and truncate when executed for each agent.
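
The padding and truncation steps above can be sketched as follows. The one-hot encoding of the agent indicator and the helper names are assumptions; the source only says that an indicator variable is added to the observation.

```python
def shared_observation(agent_index, observation, max_obs_len, n_agents):
    """Prefix a one-hot agent indicator and zero-pad to a constant length."""
    if len(observation) > max_obs_len:
        raise ValueError("observation is longer than the shared observation size")
    indicator = [1.0 if i == agent_index else 0.0 for i in range(n_agents)]
    padding = [0.0] * (max_obs_len - len(observation))
    return indicator + list(observation) + padding


def agent_action(shared_action, action_len):
    """Truncate the constant-sized shared action to this agent's own action size."""
    return list(shared_action[:action_len])


# Hypothetical example: agent 1 has a 2-element observation, the largest is 4.
obs = shared_observation(agent_index=1, observation=[0.5, 0.2], max_obs_len=4, n_agents=2)
act = agent_action(shared_action=[0.3, 0.0, 0.0], action_len=1)
print(obs, act)
```

Every agent then feeds the same network an input of length `n_agents + max_obs_len`.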

## Letting the Wind Farm find its own way in this world ... A Reinforcement-Learning Solution

### Objective
- Train one agent to adjust the yaw angle of each turbine to coarsely track the power reference without too much yaw travel or turbine loading.
- Train another agent to adjust the axial induction factor of each turbine to finely track the power reference.

### Constraints
- Yaw actuation is *much* slower than axial induction factor actuation
- Excessive yaw travel damages the bearings
- Excessive thrust on the turbines results in fatigue loading on the components

### Autoregressive Observation Space
When the yaw angle of a turbine changes, it takes a while for that effect to propagate downstream...

So the current reward depends on previous axial induction factors, yaw angles, wind field measurements and online status of each turbine $\Rightarrow$ **autoregressive observation space**.

We also included a preview of the time-varying power reference, the yaw travel for each wind turbine and the thrust force experienced by each turbine.
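
One way to build such an observation is a fixed-length history buffer that is flattened at every time-step. This is a sketch under assumptions: the class name, the zero-filled initial history and the layout (oldest frame first, extras appended last) are illustrative choices rather than the layout used in the project.

```python
from collections import deque


class ObservationHistory:
    """Keep the last n_lags measurement frames and flatten them into one observation."""

    def __init__(self, n_lags, frame_size):
        self.frame_size = frame_size
        # Start with zero-filled frames so the observation length is constant from step 0.
        self.frames = deque(
            ([0.0] * frame_size for _ in range(n_lags)), maxlen=n_lags
        )

    def push(self, frame):
        """Add the newest frame; the oldest one is dropped automatically."""
        if len(frame) != self.frame_size:
            raise ValueError("unexpected frame size")
        self.frames.append(list(frame))

    def observation(self, extras=()):
        """Flatten the history (oldest first) and append extras such as a reference preview."""
        flat = [value for frame in self.frames for value in frame]
        return flat + list(extras)


history = ObservationHistory(n_lags=2, frame_size=2)
history.push([1.0, 2.0])  # e.g. hypothetical (yaw angle, axial induction factor)
history.push([3.0, 4.0])
history.push([5.0, 6.0])
print(history.observation(extras=[9.0]))
```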

### Action-Spaces
Given the set of current and previous observations, $O_t$...

For each turbine $i\in\mathcal{T}$...

every $\Delta t_\gamma$ time-steps, choose discrete yaw angle changes $\gamma^i_t \in \{-1, 0, 1\}$...

and every $\Delta t_a$ time-steps, choose continuous axial induction factors $a^i_t \in [0, \frac{1}{3}]$
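
A rough sketch of how the two action-spaces and their different update periods might be handled in code. The helper names, the index-based encoding of the yaw change and the example periods are assumptions made for illustration.

```python
YAW_CHANGES = (-1, 0, 1)          # discrete yaw angle changes
MAX_AXIAL_INDUCTION = 1.0 / 3.0   # upper bound of the continuous action


def decode_yaw_action(index):
    """Map a discrete action index (0, 1, 2) to a yaw angle change."""
    return YAW_CHANGES[index]


def clip_axial_induction(value):
    """Keep a continuous axial induction factor inside [0, 1/3]."""
    return min(max(value, 0.0), MAX_AXIAL_INDUCTION)


def agents_due(step, yaw_period, axial_induction_period):
    """Report which agent acts at this time-step, given each agent's update period."""
    return {
        "yaw": step % yaw_period == 0,
        "axial_induction": step % axial_induction_period == 0,
    }


# Hypothetical periods: the yaw agent acts every 10 steps, the other every step.
schedule = [agents_due(step, yaw_period=10, axial_induction_period=1) for step in range(11)]
print(decode_yaw_action(0), clip_axial_induction(0.5), schedule[10])
```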

### Rewards
$$
\begin{aligned}
r_\gamma &= W^P_\gamma \exp(-\beta^P_\gamma (\Delta P)^2)
           - W^T_\gamma \exp(-\beta^T_\gamma (\max_i T_i)^2)
           - W^{\text{Tr}}_\gamma \exp(-\beta^{\text{Tr}}_\gamma (\max_i \text{Tr}_i)^2)\\
r_a &= W^P_a \exp(-\beta^P_a (\Delta P)^2)
\end{aligned}
$$
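
The two reward expressions transcribe directly into code. This sketch assumes $\Delta P$ is the power tracking error, $T_i$ the thrust on turbine $i$ and $\text{Tr}_i$ its yaw travel; the argument names and the example weights are hypothetical, and the formulas are reproduced exactly as written above.

```python
import math


def yaw_reward(power_error, thrusts, yaw_travels, w_p, w_t, w_tr, beta_p, beta_t, beta_tr):
    """Reward for the yaw agent: power tracking term minus thrust and yaw travel terms."""
    return (
        w_p * math.exp(-beta_p * power_error ** 2)
        - w_t * math.exp(-beta_t * max(thrusts) ** 2)
        - w_tr * math.exp(-beta_tr * max(yaw_travels) ** 2)
    )


def axial_induction_reward(power_error, w_p, beta_p):
    """Reward for the axial induction agent: power tracking term only."""
    return w_p * math.exp(-beta_p * power_error ** 2)


# Made-up numbers for a two-turbine farm.
r_yaw = yaw_reward(
    power_error=0.0, thrusts=[0.0, 0.0], yaw_travels=[0.0, 0.0],
    w_p=1.0, w_t=0.2, w_tr=0.3, beta_p=1.0, beta_t=1.0, beta_tr=1.0,
)
r_a = axial_induction_reward(power_error=0.0, w_p=1.0, beta_p=1.0)
print(r_yaw, r_a)
```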

## Challenges & Open Questions
- The control actions of the yaw angle and axial induction factor agents are **coupled** $\Rightarrow$ how can we tell which action led to the present reward?
- When a turbine goes offline, the dynamics of the wind field in the wind farm can change dramatically $\Rightarrow$ can the RL algorithm learn from these occurrences and adapt?
- The choice of weights of the reward functions will greatly influence the results $\Rightarrow$ how to tune these efficiently for a given wind farm?
- We know the power reference we will need to track for the next $24$ hours! $\Rightarrow$ can we use this to our advantage by learning the best yaw angles for some future time horizon?
- Parameter-sharing halves the learning flexibility for the policies and the action-value functions $\Rightarrow$ will it suffice for our purposes?

## Here goes nothing!



```python
# Install packages (commented out; uncomment when running in Google Colab).
# from google.colab import drive
# drive.mount('/content/gdrive')

# %cd /content/gdrive/My Drive/rl_wf

# ! git clone https://github.com/achenry/rl-fowf-control.git
# ! ls
# `! cd` does not persist between shell commands, so point git at the repository instead.
# ! git -C rl-fowf-control checkout aoife

# ! pip install floris==2.4
# ! python rl-fowf-control/floridyn/setup.py develop
# ! pip install torch
# ! pip install grpcio==1.43.0
# ! pip install gymnasium
# ! pip install scikit-learn
# ! pip install dm_tree
# ! pip install gputil
# ! pip install pettingzoo
print(1)
```

```
1
```
