# Reinforcement Learning Classes

## Sequential Decision Making

We define the value of selecting an action as the expected reward we receive when taking that action.

$$
q^*(a) \doteq \mathbb{E}[R_t | A_t = a] \quad \forall a \in \{1, \dots, k\}
$$

The equation above defines $q^*(a)$ as the conditional expectation of the reward $R_t$ given that we selected action $a$. The goal is to **maximize** the **expected reward**. It is important to highlight that $q^*(a)$ is not known, so we have to estimate it.

$$
\operatorname*{argmax}_a \, q^*(a)
$$

One way to estimate this value is the *sample-average method*:

$$
Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t }{\text{number of times } a \text{ taken prior to } t }
$$

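or, for a single action $a$, writing $R_i$ for the reward received after the $i$-th selection of $a$ and $n$ for the number of times $a$ has been selected so far:

$$
Q_n = \frac{\sum^{n-1}_{i=1}{R_i}}{n-1} \implies Q_{n+1} = \frac{\sum^{n}_{i=1}{R_i}}{n}
$$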
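The sample-average estimate can be sketched in a few lines of Python. This is only an illustration: the function name `sample_average`, the example history, and the default of `0.0` for an action that has never been tried are assumptions, not part of the notes above.

```python
def sample_average(actions, rewards, action):
    """Estimate Q_t(action) as the mean reward observed when `action` was taken."""
    # Keep only the rewards from steps where the chosen action matches.
    matching = [r for a, r in zip(actions, rewards) if a == action]
    if not matching:
        return 0.0  # assumption: default estimate for an action never tried
    return sum(matching) / len(matching)


# Hypothetical history: the action taken and the reward received at each step.
actions = [0, 1, 0, 2, 1, 0]
rewards = [1.0, 0.0, 3.0, 5.0, 2.0, 2.0]

estimates = [sample_average(actions, rewards, a) for a in range(3)]
print(estimates)  # [2.0, 1.0, 5.0]
```

With these made-up numbers, action `2` has the largest estimate, so it would be the action favored by the current estimates.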

These estimates determine the **greedy action**: the action that currently has the largest estimated value. Selecting the greedy action means the agent is exploiting its current knowledge, trying to get the most reward it can right now. The estimate can also be computed with the incremental update rule.

$$
Q_{n+1} = Q_n + \alpha_n (R_n - Q_n)
$$

The error in the estimate is the difference between the new target $R_n$ and the old estimate $Q_n$. Taking a step towards the target creates a new estimate that reduces this error. The step size $\alpha_n$ can be a function of $n$ that produces a number from zero to one ($\alpha_n \in [0,1]$). Choosing $\alpha_n = \frac{1}{n}$ recovers the sample average.
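A minimal sketch of the incremental update, assuming a step size of $\frac{1}{n}$ and a made-up reward sequence. The names `incremental_update` and `step_size` are illustrative. With this step size the running estimate should match the plain average of the rewards seen so far.

```python
def incremental_update(estimate, reward, step_size):
    """Move the estimate toward the reward by a fraction `step_size` of the error."""
    error = reward - estimate
    return estimate + step_size * error


rewards = [1.0, 3.0, 2.0]  # hypothetical rewards for a single action

q = 0.0  # initial estimate; the first update overwrites it when step_size = 1/n
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, step_size=1.0 / n)

print(q)                            # running estimate after three updates
print(sum(rewards) / len(rewards))  # plain average, for comparison
```

The incremental form only needs the previous estimate and the count $n$, so the full reward history does not have to be stored.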

### Exploration and Exploitation

Here the difference between exploration and exploitation will be discussed.

- **Exploration:** Exploration improves knowledge for *long-term* benefit. Here you explore all the opportunities, aiming to find the best solution to your problem.
- **Exploitation:** Exploitation uses current knowledge for *short-term* benefit. Here the other opportunities are not explored, so the best solution may never be discovered.
- **The Dilemma:** How do we choose when to explore and when to exploit?
    - Use the *epsilon-greedy* method to balance exploration and exploitation: with probability $\epsilon$ choose a random action (explore), otherwise choose the greedy action (exploit).
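The epsilon-greedy choice can be sketched as follows. This is one simple way to write it, using only the standard library. The function name `epsilon_greedy`, the example estimates, and the value `epsilon=0.1` are assumptions made for illustration.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(estimates))
    # Exploit: pick the action with the largest current estimate
    # (ties go to the lowest index).
    return max(range(len(estimates)), key=lambda a: estimates[a])


estimates = [2.0, 1.0, 5.0]  # hypothetical value estimates Q(a)
rng = random.Random(0)       # seeded only to make the sketch repeatable

choices = [epsilon_greedy(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
greedy_share = choices.count(2) / len(choices)
print(greedy_share)  # fraction of steps on which the greedy action was chosen
```

With `epsilon=0.1` and three actions, the greedy action should be chosen on most steps, since a random exploratory pick can also land on it.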
 

## The Markov Decision Process

Given a state $s$ and an action $a$, $p$ gives the joint probability of the next state $s'$ and the reward $r$. Since $p$ is a probability distribution, it must be non-negative and its sum over all possible next states and rewards must equal one. Note that the next state and reward depend only on the current state and action. This is called the Markov property: the present state is sufficient, and remembering earlier states would not improve predictions about the future.

$$
p(s', r | s,a)
$$
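As a rough illustration, the dynamics of a small MDP can be stored as a lookup table and checked against the two requirements of a probability distribution. The two-state example, its state and action names, and all probabilities below are invented for this sketch.

```python
# Hypothetical two-state MDP: dynamics[(s, a)] maps (next_state, reward) -> probability.
dynamics = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "search"): {("low", 1.0): 0.75, ("high", 1.0): 0.25},
    ("high", "wait"): {("high", 0.0): 1.0},
    ("high", "search"): {("high", 2.0): 0.5, ("low", 2.0): 0.5},
}


def p(next_state, reward, state, action):
    """Return p(s', r | s, a); outcomes that are not listed have probability zero."""
    return dynamics[(state, action)].get((next_state, reward), 0.0)


def is_valid_distribution(outcomes, tol=1e-9):
    """Check non-negativity and that the probabilities sum to one."""
    probs = list(outcomes.values())
    return all(pr >= 0.0 for pr in probs) and abs(sum(probs) - 1.0) <= tol


all_valid = all(is_valid_distribution(o) for o in dynamics.values())
print(all_valid)  # True
```

The table is keyed only by the current state and action, which mirrors the Markov property: no earlier history is needed to look up the distribution over the next state and reward.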

### Value Functions

A state-value function gives the future reward an agent can expect to receive starting from a particular state. More precisely, it is the expected return from a given state when the agent follows policy $\pi$.

$$
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]
$$

An action value describes what happens when the agent first selects a particular action. More formally, the action value of taking action $a$ in state $s$ is the expected return if the agent selects action $a$ and then follows the policy $\pi$.

$$
q_\pi(s,a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]
$$
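The two definitions are linked: averaging the action values of a state over the policy's action probabilities gives the state value, $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s,a)$. This identity is not stated in the notes above but follows from the two definitions. The sketch below evaluates it for a single state with invented numbers; the name `state_value` is illustrative.

```python
def state_value(policy_probs, action_values):
    """Compute v_pi(s) = sum over a of pi(a | s) * q_pi(s, a) for one state."""
    return sum(policy_probs[a] * action_values[a] for a in policy_probs)


# Hypothetical numbers for a single state s.
policy_probs = {"left": 0.25, "right": 0.75}  # pi(a | s)
action_values = {"left": 2.0, "right": 4.0}   # q_pi(s, a)

v = state_value(policy_probs, action_values)
print(v)  # 3.5
```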