# The Reinforcement Learning framework

In a Reinforcement Learning framework we have:
 * an agent that learns (or has learned) to interact with the environment
 * the assumption that time evolves in discrete time steps
 * At the initial time step, the agent observes the environment's state.
 * After this observation, the agent selects an appropriate action in response.
 * Next, the environment presents a new situation (state) to the agent and, at the same time, gives the agent a reward. The reward indicates to the agent how appropriate its action was for the previous state.
 * The process then continues, repeating these steps.
 
In general, we cannot assume that the environment shows the agent everything it needs to know to make well-informed decisions, but assuming so simplifies the underlying mathematics.

So, reviewing the steps again:

* The agent receives the state $S_{0}$ at the initial time step.
* The agent chooses (and takes) an action $A_{0}$.
* The environment gives the agent a reward $R_{1}$ along with the next state $S_{1}$.
* The agent then selects (and takes) another action $A_{1}$ in response to the new state.
* The environment again gives the agent a reward, this time $R_{2}$, along with the next state $S_{2}$.
* The agent then selects (and takes) the next action, this time $A_{2}$.

![Reinforcement Learning Framework](images/learningFramework.png)

So, as the agent interacts with the environment, a sequence of states, actions and rewards is produced: the environment sends rewards and next states to the agent, and the agent in turn modifies the environment through the actions it selects.
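
The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch: `CoinFlipEnv`, `random_agent` and `run_interaction` are hypothetical names invented for this example, and the toy environment (guess the face of a coin) is an assumption chosen only to keep the code short.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a coin face (0 or 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Produce the initial state S_0.
        self.state = self.rng.randint(0, 1)
        return self.state

    def step(self, action):
        # Reward R_{t+1}: 1.0 if the action matched the current state, else 0.0.
        reward = 1.0 if action == self.state else 0.0
        # The environment then moves on to the next state S_{t+1}.
        self.state = self.rng.randint(0, 1)
        return self.state, reward


def random_agent(state, rng):
    # Placeholder policy: ignores the state and picks an action at random.
    return rng.randint(0, 1)


def run_interaction(num_steps=3, seed=0):
    env = CoinFlipEnv(seed)
    agent_rng = random.Random(seed + 1)
    trajectory = []
    state = env.reset()                            # S_0
    for _ in range(num_steps):
        action = random_agent(state, agent_rng)    # A_t
        next_state, reward = env.step(action)      # S_{t+1} and R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


if __name__ == "__main__":
    for t, (s, a, r) in enumerate(run_interaction()):
        print(f"t={t}: S={s}, A={a}, R_next={r}")
```

Each tuple in the returned trajectory holds $(S_t, A_t, R_{t+1})$, which mirrors the order of the steps listed above. A learning agent would replace `random_agent` with a policy that actually uses the state and the rewards it has seen.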

**Reward will always be the most relevant quantity to the agent.**
The goal of any RL agent is to maximize the expected cumulative reward, that is, the expected sum of the rewards it obtains as it interacts with the environment over all time steps. So it must find the best *strategy* for choosing actions, one under which the cumulative reward is likely to be high.

There are two kinds of reinforcement learning tasks.

**Episodic Tasks** are those tasks that have a well-defined ending point. In this case, a complete sequence of steps (interactions) from start to finish is known as an *episode*. About episodic tasks we can say that:

* they are tasks with well-defined end points
* the agent carries its "experience" from one episode over to the next one

**Continuing Tasks** are tasks that go on forever, without end. An example is an RL algorithm that buys and sells stocks in response to the financial market; this kind of agent is best modeled as an agent in a continuing task. In this case the agent "lives forever."

### To remember:

* A **task** is an instance of the reinforcement learning (RL) problem.
* **Continuing tasks** are tasks that continue forever, without end.
* **Episodic tasks** are tasks with a well-defined starting and ending point.
    * In this case, we refer to a complete sequence of interactions, from start to finish, as an **episode**
    * Episodic tasks come to an end whenever the agent reaches a **terminal state**

## The Reward Hypothesis

The Reward Hypothesis says:

   "All goals can be framed as the maximization of *expected* cumulative reward".
    
This allows us to formulate an agent's goal in terms of maximizing the expected cumulative reward.

For example, consider framing the task of a humanoid learning to walk as a reinforcement learning problem. In this context:

* the states could be:
    + Positions and velocities of the joints
    + Statistics about the ground
    + Foot sensor data
* the actions could be:
    + Forces applied to the joints
        
![Reward Hypothesis](./images/RewardHypotesisIllust.png)


The reward structure for this problem is surprisingly intuitive (from the DeepMind paper):

$$ r = \min(v_{x}, v_{max}) - 0.005(v_{y}^2 + v_{z}^2) - 0.05y^2 - 0.02\lVert u \rVert^2 + 0.02$$

In this formula:

* The first term encourages the humanoid to walk fast.
* The first, second and third terms together encourage the humanoid to walk forward.
* The fourth term encourages the humanoid to walk smoothly.
* The fifth term adds to the cumulative reward at every time step in which the humanoid has not fallen. The episode ends the moment the humanoid falls, so this term makes it worthwhile to keep walking for as long as possible without falling.
    
The specific function of each term of this equation:

* The first term is proportional to the robot's forward velocity (capped at $v_{max}$).
* The second term penalizes deviation from the forward direction.
* The third term penalizes deviation from the center of the track.
* The fourth term penalizes the torques.
* The fifth term is the constant reward for not falling.
    
![Humanoid Reward](images/HumanoidReward.png)
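
To make the role of each term concrete, the reward formula above can be written as a small Python function. This is a minimal sketch, not the code used in the paper: the function name `humanoid_reward`, the argument names, and the value of `v_max` passed in the example are assumptions made for illustration, and the torque vector `u` is represented as a plain list.

```python
def humanoid_reward(v_x, v_y, v_z, y, u, v_max):
    """Per-step reward, following the formula above term by term.

    v_x, v_y, v_z: velocity components (v_x is the forward velocity)
    y:             deviation from the center of the track
    u:             sequence of joint torques
    v_max:         cap on the rewarded forward velocity (assumed parameter)
    """
    forward_term = min(v_x, v_max)                      # first term
    direction_penalty = 0.005 * (v_y ** 2 + v_z ** 2)   # second term
    center_penalty = 0.05 * y ** 2                      # third term
    torque_penalty = 0.02 * sum(u_i ** 2 for u_i in u)  # fourth term: 0.02 * ||u||^2
    alive_bonus = 0.02                                  # fifth term
    return (forward_term - direction_penalty - center_penalty
            - torque_penalty + alive_bonus)


if __name__ == "__main__":
    # Walking straight down the center of the track with zero torques.
    print(humanoid_reward(v_x=2.0, v_y=0.0, v_z=0.0, y=0.0, u=[0.0, 0.0], v_max=1.0))
    # Drifting sideways, off center, with non-zero torques.
    print(humanoid_reward(v_x=0.5, v_y=1.0, v_z=1.0, y=1.0, u=[1.0, 1.0], v_max=1.0))
```

In the first call the forward velocity exceeds `v_max`, so going even faster would earn nothing extra. In the second call every penalty term is active and the reward drops accordingly.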

## Cumulative Reward

An RL agent should not focus only on individual time steps; instead, it should keep all time steps in mind.
Actions have short and long term consequences and the agent needs to gain some understanding of the complex effects its actions have on the environment.
Again: the goal of the agent is to maximize the expected *cumulative* reward.

**Definition:** The return at time step $t$ is:

$$ G_{t} = R_{t+1} + R_{t+2} + R_{t+3} + R_{t+4} + ...$$

The return is denoted $G_{t}$.

The agent seeks to maximize the expected return. It is *expected* because the agent cannot predict with complete certainty what the future reward is likely to be, so it has to rely on a prediction - an estimate.
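
For a finished episode, where the rewards are already known, the return can be computed directly from the definition. The sketch below assumes a finite list of rewards in which `rewards[k]` stores $R_{k+1}$; the function name `undiscounted_return` and the example values are made up for illustration.

```python
def undiscounted_return(rewards, t=0):
    """Return G_t = R_{t+1} + R_{t+2} + ... for a finite list of rewards.

    rewards[k] holds R_{k+1}, so the rewards received after time step t
    are rewards[t], rewards[t + 1], and so on.
    """
    return sum(rewards[t:])


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0, 3.0]   # R_1, R_2, R_3, R_4 (illustrative values)
    print(undiscounted_return(rewards, t=0))   # G_0
    print(undiscounted_return(rewards, t=2))   # G_2
```

With these values, $G_0 = 6$ and $G_2 = 5$: the return at a later time step only counts the rewards that are still to come. During learning the agent does not know these future rewards, which is why it works with the *expected* return.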

## Discounted Return

Here the idea is to maximize a different sum, one in which rewards that lie farther in the future are multiplied by smaller values.


$$G_{t} = R_{t+1} + (0.9)R_{t+2} + (0.81)R_{t+3} + (0.729)R_{t+4} + ...$$

The coefficients (1, 0.9, 0.81, 0.729, ...) are successive powers of the discount rate, which here is 0.9. $G_{t}$ is the *Discounted Return*.

By *discounted* we mean that the goal changes so that the agent values immediate rewards more than rewards received further in the future.

*How* do we choose the coefficients? The discount rate is normally a real number between 0 and 1 (inclusive).

$$ \text{Discount rate: } \gamma \in [0, 1]$$

The first reward is left undiscounted (it is multiplied by $\gamma^0 = 1$), the second is multiplied by $\gamma$, the third by $\gamma^2$, the fourth by $\gamma^3$, and so on:

$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + ...$$

$\gamma$ is set by us to refine the goal that we have for the agent.
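
The effect of $\gamma$ is easy to see numerically. The sketch below computes the discounted return $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ...$ for a finite list of rewards, working backwards through the list with the identity $G_{t} = R_{t+1} + \gamma G_{t+1}$. The function name `discounted_return` and the reward values are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Discounted return of a finite reward sequence.

    rewards[0] is R_{t+1} (weight 1), rewards[1] is R_{t+2} (weight gamma),
    rewards[2] is R_{t+3} (weight gamma ** 2), and so on.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Walk backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0]   # illustrative values
    for gamma in (0.0, 0.9, 1.0):
        print(gamma, discounted_return(rewards, gamma))
```

With four rewards of 1, $\gamma = 0$ gives a return of 1 (only the immediate reward counts), $\gamma = 0.9$ gives $1 + 0.9 + 0.81 + 0.729 = 3.439$, and $\gamma = 1$ gives 4, which is the undiscounted sum. Smaller values of $\gamma$ therefore make the agent care more about the near future.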