# RL formulation

## Definitions, basic terms

### RL framework

The objective is to gather as much reward as possible during operation.

<img src="http://drive.google.com/uc?export=view&id=1VqFRFYo8Tv2YDG__6ntVliUNMb-RlEmU" width=75%>

### RL framework

Reinforcement learning learns from interaction. Unlike in supervised learning, no dataset (labeled or not) is available in advance: the agent generates the dataset (its experiences) during the learning process by interacting with the environment.
The agent can influence its environment through actions. The effect of an action on the environment is described by the states.
The agent receives feedback signals (rewards) from the environment, which indicate how good its actions were.
The goal of the agent is to gather as much reward as possible.

### Environment and agent

In RL we have two players, interacting with each other: the environment and the agent. 

Examples:
* Drone: agent - flying drone, environment - its surrounding
* Trading bots: agent - the trading algorithm, environment - stock market (the broker interface)
* AlphaGo: agent - the algorithm playing the game, environment - board and the opponent (!)

### Definition of state

The mathematical framework of reinforcement learning is based on the so-called Markov Decision Process (MDP), discussed in more detail later. One of the key parts of an MDP is the state. The state should describe the environment fully, meaning that no further information is required to determine the future of the environment.

Example 1: Consider the following physical system. We have a slope inclined at 45 degrees. There is a ball on the slope, heading toward the bottom. To describe the dynamical behavior of the system we need the **position** and the **velocity** of the ball (and the parameters of the ball, but these are assumed to be fixed over time).

### Definition of state

Example 1 (continued): Therefore the state of this environment (the physical system) is the position and the velocity of the ball. This is ensured by Newton's second law, because the equation contains no derivative of order higher than two (the acceleration):

$$F = m \cdot \frac{d^2 s}{d t^2}$$

Therefore, knowing the position and the velocity at one moment, the future positions and velocities (future states) can be determined by the laws of physics.
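The following minimal sketch illustrates why (position, velocity) is a sufficient state for Example 1. It assumes a frictionless slope and a fixed time step; the function names `step` and `simulate` are made up for this illustration.

```python
import math

G = 9.81                      # gravitational acceleration [m/s^2]
ANGLE = math.radians(45.0)    # slope angle from the example

def step(state, dt=0.01):
    """Advance the (position, velocity) state by one small time step.

    Assumption of this sketch: no friction, so the acceleration along
    the slope is constant, a = g * sin(angle).
    """
    position, velocity = state
    acceleration = G * math.sin(ANGLE)
    new_position = position + velocity * dt + 0.5 * acceleration * dt ** 2
    new_velocity = velocity + acceleration * dt
    return (new_position, new_velocity)

def simulate(state, n_steps, dt=0.01):
    """Apply `step` repeatedly; only the current state is ever needed."""
    for _ in range(n_steps):
        state = step(state, dt)
    return state

# Position and velocity measured along the slope, ball released from rest.
state_after_1s = simulate((0.0, 0.0), n_steps=100)
print(state_after_1s)
```

Note that `step` never looks at earlier states: the next state is computed from the current one alone, which is exactly the property a well-chosen state should have.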

### Definition of state

Example 2: In the case of the drone, the state is represented by the image (current frame) produced by the camera.

*Generally: a state is well chosen if it describes the environment on its own, so there is no need to know any further historical data. The future can be calculated from the state and the dynamics of the environment.*

What can we say about the state in the second example? Is it enough to fully describe the environment? Can you show examples where it is not enough?

### Definition of action

The agent can change the state of the environment by the actions. The actions can be represented by vectors. 

Examples:

* Drone: changing the thrust of the rotors
* Trading bots: sell, buy, wait, choose the stock
* AlphaGo: placing a stone on the board

### Definition of rewards

The environment signals rewards to the agent, indicating how well it is doing. To distinguish the reward accumulated over time from the reward received at a single step, the literature uses the terms return and immediate reward, respectively.

Because the reward is signaled by the environment, the generation of reward is also part of the model of the environment. Therefore we can use a probability distribution to describe the rewards: $P(r|s, a)$

### Definition of rewards

Summed return:

$$G = \sum_i{r_i}$$

Discounted return:

$$G = \sum_i{\gamma^i r_i}$$
where $0 \le \gamma < 1$ and the rewards are assumed to be bounded: $|r_i| \le R_{max}$, $\forall i$.

### Definition of rewards

Because of this bound, even for an infinite number of steps ($i\rightarrow \infty$):

$$|G| \le \sum_i{\gamma^i |r_i|} \le R_{max} \sum_i{\gamma^i} = \frac{R_{max}}{1-\gamma}$$
therefore $G$ remains finite.
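As a small numerical check of the bound $\frac{R_{max}}{1-\gamma}$, the sketch below computes the discounted return of a long reward sequence. The reward sequence and the values of `gamma` and `r_max` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward sequence."""
    g = 0.0
    # Iterate backwards: G_i = r_i + gamma * G_{i+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

gamma = 0.9
r_max = 1.0
rewards = [r_max] * 1000          # hypothetical worst case: maximal reward at every step
g = discounted_return(rewards, gamma)
bound = r_max / (1.0 - gamma)     # geometric-series bound

print(g, bound)
```

Even with the maximal reward at every one of the 1000 steps, the return stays at or below the bound.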

### Definition of policy

The agent's behavior is represented by the policy. The policy is a function that maps the states to actions (deterministic policy) or to action probabilities (stochastic). $\pi: S \rightarrow A$ or $\pi: S \rightarrow P(A)$.

### Definition of policy

Deterministic policy:

$$a = \pi(s)$$

Stochastic policy:

$$\pi(s, a) = P(a|s)$$
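A minimal sketch of how the two kinds of policy could be represented in code, assuming a tiny tabular problem. The state names, action names and probabilities are hypothetical.

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: exactly one action per state, a = pi(s).
deterministic_policy = {"s0": "right", "s1": "left"}

# Stochastic policy: a probability for every action in every state,
# pi(s, a) = P(a|s). Each row must sum to 1.
stochastic_policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.9, "right": 0.1},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state, rng):
    probs = stochastic_policy[state]
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]

rng = random.Random(0)
samples = [act_stochastic("s1", rng) for _ in range(1000)]
print(act_deterministic("s0"), samples.count("left") / len(samples))
```

The deterministic policy always returns the same action for a state, while repeated calls of the stochastic policy return actions with frequencies close to the stored probabilities.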

### Definition of policy

Stochastic policies can have advantages over deterministic ones.

<img src="http://drive.google.com/uc?export=view&id=1b-EDUk5cFVpqtOvZ0o4begzKCYwO0dMS" width=75%>

### Definition of policy

If the environment is partially observable then only some parts of the state can be seen. If the state is $s = [f_1, f_2, ..., f_n]$, where $f_i$ is a parameter of the state, the observation can be: $o = [f_1, f_2, ..., f_m]$, $m < n$.

The gridworld example: each cell of the grid is a state, but the agent sees only the walls directly at the borders of its cell. So in the cells with question marks, the agent sees a wall at the top and at the bottom. As a consequence the two states produce the same observation and cannot be distinguished. However, the first one requires going right, while the second one requires going left.

A deterministic policy is not enough: it has to choose the same action in both cells, so it works only when we are lucky with the initial state. A stochastic policy that picks left and right with equal probability in these cells gets out of either of them eventually.

### Definition of transition matrix

Idea: The transition matrix describes the probability of the state transitions of the environment. This can describe the dynamics of the environment. Formally:

$$T(s, s') = P(s'|s)$$

Because in each state it is sure that something will happen:

$$\sum_i{T(s, s_i)}=1.0, \forall s$$

### Definition of transition matrix

<img src="http://drive.google.com/uc?export=view&id=1Bjtd-5U4ixFJVh9yLDBlmXw8UYn3BV9i" width=75%>

### Definition of transition matrix

<img src="http://drive.google.com/uc?export=view&id=1xYKAW6u_phry9BJEPifw1_mGYdWgvDGS" width=75%>

### Definition of transition matrix

Examples:

* transition from $s_1$ -> $s_2$: $T(s_1, s_2) = 1.0$
* transition from $s_4$ -> $s_3$: $T(s_4, s_3) = 0.4$

### Definition of transition matrix

The transition matrix is a bit more complicated for reinforcement learning because the transition also depends on the actions:

$$T(s, a, s') = P(s'|s, a)$$

Because in each state it is certain that something will happen after an action is applied:

$$\sum_i{T(s, a, s_i)}=1.0, \forall s, a$$
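The action-dependent transition probabilities can be stored as a nested table. The sketch below checks the normalization condition and samples a next state. All states, actions and numbers are invented for illustration and do not correspond to the figures above.

```python
import random

STATES = ["s1", "s2", "s3"]
ACTIONS = ["stay", "move"]

# T[s][a][s'] = P(s' | s, a); hypothetical values.
T = {
    "s1": {"stay": {"s1": 1.0, "s2": 0.0, "s3": 0.0},
           "move": {"s1": 0.1, "s2": 0.9, "s3": 0.0}},
    "s2": {"stay": {"s1": 0.0, "s2": 1.0, "s3": 0.0},
           "move": {"s1": 0.0, "s2": 0.3, "s3": 0.7}},
    "s3": {"stay": {"s1": 0.0, "s2": 0.0, "s3": 1.0},
           "move": {"s1": 0.5, "s2": 0.0, "s3": 0.5}},
}

def is_valid(T):
    """Check that the probabilities sum to 1 for every (state, action) pair."""
    return all(abs(sum(T[s][a].values()) - 1.0) < 1e-12
               for s in STATES for a in ACTIONS)

def sample_next_state(state, action, rng):
    """Draw s' from the distribution P(. | state, action)."""
    dist = T[state][action]
    return rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]

rng = random.Random(0)
print(is_valid(T), sample_next_state("s2", "move", rng))
```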

### Trajectory, generated by policy $\pi$
<img src="http://drive.google.com/uc?export=view&id=1cjBsgkWXO-jlJRMjV7QDuqODs4-IRcCU" width=75%>

### Markov Decision Process (MDP)

The Markov Decision Process is defined by the following:

* S - a finite set of states
* A - a finite set of actions
* T - transition probability
* R - reward function, given as $R(s, a) = E\left[ r | s, a \right]$
* $\gamma$ - discounting factor (if the return is discounted)
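The five components can be collected in one small container. The sketch below also samples a trajectory by following a deterministic policy. The two-state example (`toy`), its numbers and the helper `rollout` are hypothetical, and the reward of each step is simply taken to be the expected reward $R(s, a)$.

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    states: list      # S
    actions: list     # A
    T: dict           # T[s][a] -> {s': P(s' | s, a)}
    R: dict           # R[s][a] -> expected immediate reward E[r | s, a]
    gamma: float      # discount factor

def rollout(mdp, policy, start, n_steps, rng):
    """Sample a trajectory [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    trajectory, state = [], start
    for _ in range(n_steps):
        action = policy[state]
        reward = mdp.R[state][action]   # simplification: reward = its expectation
        trajectory.append((state, action, reward))
        dist = mdp.T[state][action]
        state = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
    return trajectory

# Hypothetical example: battery level of a robot.
toy = MDP(
    states=["low", "high"],
    actions=["wait", "charge"],
    T={"low": {"wait": {"low": 1.0}, "charge": {"low": 0.2, "high": 0.8}},
       "high": {"wait": {"low": 0.5, "high": 0.5}, "charge": {"high": 1.0}}},
    R={"low": {"wait": 0.0, "charge": -1.0},
       "high": {"wait": 1.0, "charge": 0.0}},
    gamma=0.9,
)
policy = {"low": "charge", "high": "wait"}
trajectory = rollout(toy, policy, "low", n_steps=5, rng=random.Random(0))
print(trajectory)
```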

### Markov Decision Process (MDP)

Markov property:

$$P\left( X_t | X_{t-1} \right) = P\left( X_t| X_{t-1}, X_{t-2}, ..., X_0 \right)$$

In words: The future is independent of the past given the present.

Markov chain:

If we have a series of random variables $X_0, X_1, ..., X_t$ and the Markov property holds at every time step, the series is a discrete-time Markov chain. Think about examples!

### Markov Decision Process (MDP)

The trajectory shown earlier is a Markov chain. The Markov property:

$$P\left( s_t|s_{t-1}, a_{t-1} \right) = P\left( s_t | s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ... s_0, a_0 \right)$$

The first term is the transition probability, introduced earlier.

### Environment types
* fully observable, partially observable
* episodic, continuing
* stationary, non-stationary
* continuous action space, discrete action space
* stochastic environment, deterministic environment
* sparse reward, dense reward

**Fully-observable, partially-observable**

The environment is fully observable if what the agent observes describes the environment fully (the Markov property holds); otherwise it is partially observable. Therefore the main difference between the state and the observation is that the Markov property holds for states but does not necessarily hold for observations.

Example:

* fully - in a video game, if I know the inner state of the game engine (all of the positions, statuses etc.)
* partially - in a video game (like Dota 2), if I see only the screen and know nothing about the other parts of the map

**Episodic, continuing (tasks)**

An environment is episodic if there are terminal states (where the Markov chain ends and cannot continue) and they are reachable. In a continuing environment, a process, once started, lasts forever.

Example:
* episodic - board games, parking a car, maneuvering a helicopter from A to B
* continuing - trading bots, keeping the battery of a warehouse robot charged, recommendation engines

**Stationary, non-stationary**

An environment is stationary if its dynamics (behavior) do not change with time, i.e. the transition matrix and the reward function are constant in time. Otherwise it is non-stationary.

**Continuous, discrete action space**

Example:

* continuous - training a robot to walk, the action is the set of forces which should be applied to move legs and other parts, the forces can be any real number
* discrete - game of Pacman, the possible actions are discrete (left, right, up and down)

**Stochastic, deterministic**

If the dynamics of the environment show stochasticity, i.e. the transition matrix contains elements other than 0 and 1, the environment is stochastic. Otherwise it is deterministic. In other words, the environment is deterministic if starting from the same state and taking the same action always leads to the same result.

Examples:

* Deterministic - controlling a robot (assuming proper physical conditions)
* Stochastic - trading, drone flight when wind blows

**Sparse, dense reward**

Most of the time rewards are received rarely (e.g. only at the end of the episode); such rewards are called sparse. This makes the learning process more difficult. In an ideal case, the reward is dense, meaning it is received frequently.

# MCTS

In the previous part we discussed N-armed bandits, which are basically stateless decision problems. Now we discuss an important family of algorithms for problems with several states, where the model of the environment is perfectly known, i.e. a simulator is available. The simulator makes it possible to look ahead and simulate the consequences of decisions. Based on these "imagined" experiences the agent can make a real decision.

This is similar to how someone playing a board game thinks ahead, plays the game out in their head and then chooses the next move.

MCTS stands for: Monte Carlo Tree Search

MCTS has proved successful in games like Go, chess and shogi. Furthermore, it was used for Atari games (see later in the course) when the state of the game (the so-called ROM) is known.

MCTS is an umbrella name for algorithms that have the following structure (from the Sutton and Barto book):

<img src="http://drive.google.com/uc?export=view&id=1gOEkDBksko9kvp-JmlDfvUyp_iPK7pcn" width=75%>

**Selection:** Starting at the root node, a tree policy based on the action values attached to the
edges of the tree traverses the tree to select a leaf node.

**Expansion:** On some iterations (depending on details of the application), the tree is expanded
from the selected leaf node by adding one or more child nodes reached from the selected node via
unexplored actions.

**Simulation:** From the selected node, or from one of its newly-added child nodes (if any), simulation
of a complete episode is run with actions selected by the rollout policy. The result is a
Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout
policy.

**Backup:** The return generated by the simulated episode is backed up to update, or to initialize,
the action values attached to the edges of the tree traversed by the tree policy in this iteration
of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the
tree.


<img src="http://drive.google.com/uc?export=view&id=13xVD0_lTw77CXwriSq2iLmF_uOjFhNjC" width=75%>


[source](https://stanford.edu/~rezab/classes/cme323/S15/projects/montecarlo_search_tree_report.pdf)
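The four phases can be sketched on a toy problem as follows. Everything problem-specific here is an assumption made for illustration: an episode consists of three binary decisions, the return is a weighted sum of the chosen actions, the tree policy is epsilon-greedy, and the rollout policy is uniformly random. Because the toy environment is deterministic, values are stored in the nodes rather than on the edges.

```python
import random

DEPTH = 3                    # every episode consists of three binary decisions
ACTIONS = (0, 1)
WEIGHTS = (0.6, 0.2, 0.2)    # hypothetical reward weights of the three decisions

def is_terminal(state):
    return len(state) == DEPTH

def episode_return(state):
    """Return of a finished episode; `state` is the tuple of actions taken."""
    return sum(w * a for w, a in zip(WEIGHTS, state))

class Node:
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean of the returns backed up through this node

def tree_policy(node, rng, epsilon):
    """Epsilon-greedy choice among the children (an assumed tree policy)."""
    if rng.random() < epsilon:
        return rng.choice(list(node.children.values()))
    return max(node.children.values(), key=lambda child: child.value)

def mcts(n_iterations, seed=0, epsilon=0.2):
    rng = random.Random(seed)
    root = Node(state=())
    for _ in range(n_iterations):
        # Selection: descend while the node is fully expanded and not terminal.
        node, path = root, [root]
        while not is_terminal(node.state) and len(node.children) == len(ACTIONS):
            node = tree_policy(node, rng, epsilon)
            path.append(node)
        # Expansion: add one child for an action that has not been tried yet.
        if not is_terminal(node.state):
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            child = Node(node.state + (action,))
            node.children[action] = child
            node = child
            path.append(node)
        # Simulation: follow the random rollout policy until the episode ends.
        state = node.state
        while not is_terminal(state):
            state = state + (rng.choice(ACTIONS),)
        ret = episode_return(state)
        # Backup: update only the nodes inside the tree, not the rollout states.
        for visited in path:
            visited.visits += 1
            visited.value += (ret - visited.value) / visited.visits
    return root

root = mcts(n_iterations=500)
best_action = max(root.children, key=lambda a: root.children[a].value)
print(best_action, {a: round(c.value, 3) for a, c in root.children.items()})
```

After the search, the real decision is the root action with the highest estimated value; in this toy problem that is action 1, because it alone contributes 0.6 to the return.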


### UCT

We pay special attention to one concrete MCTS algorithm, the so-called UCT. UCT stands for upper confidence bounds for trees.
It was first proposed in [Levente Kocsis and Csaba Szepesvari, Bandit based Monte-Carlo Planning](http://ggp.stanford.edu/readings/uct.pdf).

The main result of this paper is that the UCT algorithm chooses the optimal action with probability 1 as the number of samples goes to infinity (consistency).

The algorithm maintains a score in each node:

$$Score(s') = U(s') + \beta \sqrt{\frac{\log n(s)}{n(s')}}$$

First term: exploitation term (the utility of the child node $s'$, i.e. the expected reward when moving down the tree from $s'$).

Second term: exploration term (it ensures that rarely visited nodes are visited more often in the future, so that good opportunities are not missed). Here $n(s)$ is the number of times the parent node $s$ was visited, $n(s')$ is the number of visits of its child $s'$, and $\beta$ controls the strength of the exploration.

Selection phase: the algorithm starts at the root node and moves down the tree by selecting the node with the best (highest) score.

Expansion phase: if a node with an unexplored child is encountered, the algorithm creates a new child.

Simulation: a random playout is performed to estimate the utility of the current node (from where the rollout starts).

Backup: an averaging backup is used to update the value estimates of all nodes on the path from the root to the new node (from where the rollout started). The visit count is updated first: $n(s) \leftarrow n(s) + 1$. Then the utility is updated with the return $R$ of the simulation: $U(s) \leftarrow U(s) + \frac{R - U(s)}{n(s)}$

Notes on the last formula: 

$$\overline{X} = \frac{1}{n}\sum_i{X_i}$$

We can calculate the new average on-the-fly, when a new sample (R) comes:

$$\overline{X} \leftarrow \frac{n\overline{X} + R}{n+1} = \frac{(n+1)\overline{X} + R - \overline{X}}{n+1} = \overline{X} + \frac{R-\overline{X}}{n+1}$$
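A minimal sketch of the UCT score and the averaging backup, assuming a simple `Node` class invented for this illustration. The visit count is incremented before the division, so the denominator equals the $n+1$ of the derivation above. Treating unvisited children as having an infinite score is an assumption of this sketch, not part of the formula.

```python
import math

class Node:
    def __init__(self):
        self.visits = 0       # n(s)
        self.utility = 0.0    # U(s), running average of the backed-up returns
        self.children = []

def uct_score(parent, child, beta=1.0):
    """Score(s') = U(s') + beta * sqrt(log n(s) / n(s'))."""
    if child.visits == 0:
        return math.inf       # assumption: unvisited children are tried first
    exploration = math.sqrt(math.log(parent.visits) / child.visits)
    return child.utility + beta * exploration

def select_child(parent, beta=1.0):
    """Selection step: pick the child with the highest score."""
    return max(parent.children, key=lambda c: uct_score(parent, c, beta))

def backup(path, ret):
    """Averaging backup of the return `ret` along the nodes in `path`."""
    for node in path:
        node.visits += 1
        node.utility += (ret - node.utility) / node.visits

root, a, b = Node(), Node(), Node()
root.children = [a, b]
backup([root, a], 1.0)
backup([root, b], 0.0)
backup([root, a], 0.0)

print(root.utility, a.utility, b.utility)
print(select_child(root, beta=1.0) is a, select_child(root, beta=3.0) is b)
```

With a small $\beta$ the better-looking child `a` is selected, while a large $\beta$ favors the less visited child `b`. The incremental update also reproduces the plain average: the root has seen the returns 1, 0 and 0, and its utility is 1/3.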

# Bellman-equation

## Motivation

We have learned that reinforcement learning is about learning a good (or optimal) sequence of actions in situations of sequential decision making.

The "simplest" way of solving this problem is to sample all actions.

As the number of actions grows, the complexity explodes.

The remainder of the class is about how to deal with such complexity:
- Efficient sampling (learning from previous data) (e.g. off-policy learning / causality)
- Efficiently finding "optimal" policies (e.g. Bellman equations)
- Efficiently summarizing what we have "learned" (e.g. deep reinforcement learning) and generalizing

### Efficiency of the Bellman Equation - dynamic programming 

Let us denote the number of states by $n_s$ and the number of actions by $n_a$.
The dynamic programming method is guaranteed to find the optimal policy in **polynomial time** (in $n_s$ and $n_a$).
The total number of possible deterministic policies is $n_a^{n_s}$. Therefore the dynamic programming approach is **exponentially faster than any direct search** in policy space.
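To get a feeling for how quickly $n_a^{n_s}$ grows, the sketch below enumerates all deterministic policies of a tiny problem and counts them for slightly larger ones. The state and action names are hypothetical.

```python
from itertools import product

def count_deterministic_policies(n_states, n_actions):
    """Number of deterministic policies: one of n_actions choices per state."""
    return n_actions ** n_states

# Brute-force enumeration is only feasible for tiny problems.
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
all_policies = [dict(zip(states, choice))
                for choice in product(actions, repeat=len(states))]

print(len(all_policies))                      # 8 = 2**3
print(count_deterministic_policies(10, 4))    # 1048576 = 4**10
print(count_deterministic_policies(100, 4))   # a 61-digit number
```

Already with 100 states and 4 actions, checking every policy one by one is out of the question, which is why a polynomial-time method matters.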


In practice, the dynamic programming approach can solve MDP problems (to be defined in what follows) with **millions of states**. This is important because if you manage to formalize a problem in a tabular setting, then it can be solved by dynamic programming and you can take advantage of its good convergence properties.

## State-value, action-value

**V - state-value function**

**Q - action-value function**


### State-value function (shortly: value-function)

The $\pi$ policy generates trajectories ($\tau$) until the end of the episode, starting from $s_0$.
$\tau = [s_0, a_0, r_0, s_1, a_1, r_1, ..., s_i, a_i, r_i, ..., s_T, a_T, r_T]$ 

$$V^\pi(s) = E_\tau \left[ G(\tau) | s_0 = s, \pi \right]$$

Where $G$ is the return. If the discounted return is used:

$$V^\pi(s) = E_\tau \left[ \sum_i{\gamma^i r_i} | s_0 = s, \pi, r_i \in \tau \right].$$

### Action-value function

The $\pi$ policy generates trajectories ($\tau$) until the end of the episode, starting from $s_0$.
$\tau = [s_0, a_0, r_0, s_1, a_1, r_1, ..., s_i, a_i, r_i, ..., s_T, a_T, r_T]$ 

$$Q^\pi(s, a) = E_\tau \left[ G(\tau) | s_0 = s, a_0 = a, \pi \right]$$

Where $G$ is the return. If the discounted return is used ($\gamma < 1$):

$$Q^\pi(s, a) = E_\tau \left[ \sum_i{\gamma^i r_i} | s_0 = s, a_0 = a, \pi, r_i \in \tau \right].$$

## Monte Carlo solution

The most naive approach is to sample trajectories and calculate the return of each of them. Averaging the returns of the $n$ trajectories that start from a state gives the value function for that state:

$$V(s) = \frac{\sum_i^n{G(\tau_i)}}{n}$$

However, we can use a single trajectory to sample returns for all the states encountered along it.
There are two strategies for doing that: first visit and every visit.

First visit means that if a state is encountered several times during a trajectory, the return is calculated only for the first visit.

Every visit means that if a state is encountered several times, the return is calculated for each visit.


<img src="http://drive.google.com/uc?export=view&id=1uboWLi-NoQ1GMUZtrF1DBsc4Rxqx1BTU" width=65%>

Here $s_i$ and $s_j$ are the same state, encountered after $i$ and $j$ steps ($i < j$), respectively.
The returns gathered from $s_i$ and from $s_j$ onward:

$$G_i = \sum_{k=i}^T{ r(s_k) }$$

$$G_j = \sum_{k=j}^T{ r(s_k) }$$

For first-visit MC, only $G_i$ is considered, while for every-visit MC both $G_i$ and $G_j$ are considered.
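The difference between the two strategies can be sketched in a few lines. The trajectory format (a list of `(state, reward)` pairs), the function names and the example numbers are assumptions of this illustration; returns are undiscounted, as in the formulas for $G_i$ and $G_j$.

```python
from collections import defaultdict

def returns_per_state(trajectory, first_visit):
    """Collect the returns G_k = sum_{t >= k} r_t for the states of one trajectory."""
    # Returns from every time step, computed backwards.
    returns = [0.0] * len(trajectory)
    g = 0.0
    for k in range(len(trajectory) - 1, -1, -1):
        g += trajectory[k][1]
        returns[k] = g
    collected = defaultdict(list)
    seen = set()
    for k, (state, _) in enumerate(trajectory):
        if first_visit and state in seen:
            continue              # first-visit MC ignores repeated visits
        seen.add(state)
        collected[state].append(returns[k])
    return dict(collected)

def mc_value_estimate(trajectories, first_visit=True):
    """Average the collected returns of each state over all trajectories."""
    samples = defaultdict(list)
    for trajectory in trajectories:
        for state, gs in returns_per_state(trajectory, first_visit).items():
            samples[state].extend(gs)
    return {state: sum(gs) / len(gs) for state, gs in samples.items()}

# Hypothetical trajectory in which state "A" is visited twice.
trajectory = [("A", 1.0), ("B", 0.0), ("A", 2.0), ("C", 3.0)]
print(mc_value_estimate([trajectory], first_visit=True))
print(mc_value_estimate([trajectory], first_visit=False))
```

For state "A" the first-visit estimate uses only the return 6 from the first occurrence, while the every-visit estimate averages the returns 6 and 5.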

### Can dynamic programming really help?

By "memorizing states" one can try to solve RL problems by dynamic programming.

But is this really feasible?

**Time Complexity**
In dynamic programming problems, the time complexity is the number of unique states/subproblems multiplied by the time taken per state.

In the example problem of the source cited below, for a given $n$ there are $n$ unique states/subproblems. For convenience, each state is assumed to be solved in constant time. Hence the time complexity is $O(n \cdot 1) = O(n)$.

**Space Complexity**
We use one array called cache to store the results of n states. Hence the size of the array is n. Therefore the space complexity is O(n).
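As an illustration of this counting argument, here is a minimal bottom-up dynamic programming sketch. The Fibonacci numbers are used as a stand-in problem, which is an assumption of this example; the text above does not name the problem.

```python
def fib(n):
    """n-th Fibonacci number via bottom-up dynamic programming."""
    cache = [0] * (n + 1)          # one entry per subproblem -> O(n) space
    if n > 0:
        cache[1] = 1
    for i in range(2, n + 1):      # each subproblem is solved once in O(1) -> O(n) time
        cache[i] = cache[i - 1] + cache[i - 2]
    return cache[n]

print(fib(10))   # 55
```

Each of the subproblems is solved exactly once and stored in `cache`, so both the running time and the memory grow linearly with `n`.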


Note that there are trade-offs between complexity in space and complexity in time.


For large problems (many states and many actions) the Bellman formulation is still not efficient.

[source](https://www.freecodecamp.org/news/demystifying-dynamic-programming-24fbdb831d3a/#:~:text=In%20Dynamic%20programming%20problems%2C%20Time,O(n%20*%201))