## Reinforcement Learning
* Reinforcement learning is characterized by an **agent** learning to interact with its **environment**.
* At each time step, the environment presents a situation to the agent called a **state**. The agent then has to choose an appropriate **action** in response.
* One time step later, the agent receives a **reward**. A reward can be negative or positive. This is the environment's way of telling the agent whether it has responded with an appropriate action or not. The agent also receives a new state.
* Now, the main goal of the agent is to maximize expected **cumulative reward**. This simply means the expected sum of rewards attained over all the time steps.
* The **reward hypothesis** states that every goal an agent might have can be framed as the **maximization of the expected cumulative reward**.
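
The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its `reset`/`step` methods, the reward values, and the `always_right` agent are all made up for this example and do not come from any particular library.

```python
class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        # Start every episode at the left end of the corridor.
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        # Assumed reward scheme: -1 per step, +10 for reaching the goal.
        reward = 10 if done else -1
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)            # agent responds to the state
        state, reward, done = env.step(action)   # environment answers one step later
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    # A deliberately simple agent: ignore the state and move right.
    return +1


print(run_episode(CorridorEnv(), always_right))  # 7  (three steps at -1, then +10)
```

A learning agent would replace `always_right` with something that improves its choices based on the rewards it has seen.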

#### So what mathematical framework do agents use to learn?
Enter Markov Decision Processes (MDPs).

What are they, you ask? Simple.
An MDP can be defined as:
* A (finite) set of states S
* A (finite) set of actions A
* A (finite) set of rewards R
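* The one-step dynamics of the environment, and
* A discount rate γ (gamma), where 0 ≤ γ ≤ 1. In practice, γ is usually chosen close to 1 so that the agent does not become too short-sighted.
  * If γ = 0, the agent cares only about the immediate reward.
  * If γ = 1, the return is not discounted.

But why use a discounted return at all? Discounting refines the goal you set for the agent: rewards that arrive sooner count for more than rewards that arrive later, and the smaller γ is, the more strongly the agent favors immediate rewards.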
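
To make the effect of γ concrete, here is a minimal sketch that computes the discounted return $G_t = R_{t+1} + γR_{t+2} + γ^2R_{t+3} + \dots$ for a short list of rewards. The function name `discounted_return` and the example rewards are assumptions made for illustration.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    g = 0.0
    # Work backwards so each step is just: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1, 1, 1]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}

print(discounted_return(rewards, 0.0))  # 1.0   only the immediate reward counts
print(discounted_return(rewards, 0.5))  # 1.75  later rewards count for less
print(discounted_return(rewards, 1.0))  # 3.0   plain sum, no discounting
```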

Now, you might wonder what the hell the one-step dynamics of the environment are for. Their purpose is to determine the next state and the reward the agent gets at every time step.
Formally, the dynamics are a conditional probability ($P(A \mid B)$ is the probability that event A happens given that event B has already occurred). They are defined as follows:

$$
p(s', r \mid s, a) \doteq \mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) \quad \text{for each possible } s', r, s, \text{ and } a.
$$
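
For a small finite MDP, the one-step dynamics can simply be written down as a lookup table. The sketch below assumes a made-up two-state "weather" MDP; the state names, actions, rewards, and probabilities are invented purely to show the shape of $p(s', r \mid s, a)$.

```python
import random

# Hypothetical dynamics: DYNAMICS[(s, a)] lists (next_state, reward, probability).
DYNAMICS = {
    ("sunny", "walk"): [("sunny", 2, 0.75), ("rainy", -1, 0.25)],
    ("sunny", "stay"): [("sunny", 0, 1.0)],
    ("rainy", "walk"): [("rainy", -2, 0.5), ("sunny", 1, 0.5)],
    ("rainy", "stay"): [("rainy", 0, 0.5), ("sunny", 0, 0.5)],
}


def p(next_state, reward, state, action):
    """Return p(s', r | s, a) from the table (0 if the outcome cannot happen)."""
    return sum(prob for s2, r, prob in DYNAMICS[(state, action)]
               if s2 == next_state and r == reward)


def sample_step(state, action, rng=random):
    """Let the environment pick the next state and reward according to p."""
    outcomes = DYNAMICS[(state, action)]
    weights = [prob for _, _, prob in outcomes]
    next_state, reward, _ = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


print(p("rainy", -1, "sunny", "walk"))  # 0.25
print(sample_step("sunny", "stay"))     # ('sunny', 0)
```

For every state-action pair, the probabilities over all `(next_state, reward)` outcomes must sum to 1.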


#### Solving MDPs
Now that we've learnt to frame a problem as an MDP, how does the agent decide which actions to take given its states?
This is where the **policy** comes in.  
The core problem is to come up with a policy that maps the set of states to the set of actions the agent will take.
A policy is simply a function $π$ that specifies the action **$π(s)$** the agent will choose when in state **$s$**.

Therefore, to solve an MDP, the agent must determine the best policy. The best policy is an optimal policy: one that tells the agent to select actions so that it obtains the highest possible expected cumulative reward.

There are two ways of defining a policy:
* Deterministic policy: a mapping $π: S \to A$ that assigns exactly one action to each state.
* Stochastic policy: this uses probability. For each state $s$ and action $a$, it gives the probability $π(a \mid s)$ that the agent chooses action $a$ while in state $s$:
    $$ π(a \mid s) = \Pr\bigl(A_t = a \mid S_t = s\bigr) $$
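
The difference between the two kinds of policy is easy to see in code. In this rough sketch, the states `s0` to `s2`, the actions, and all the probabilities are hypothetical values chosen only for illustration.

```python
import random

# Deterministic policy: a lookup table from state to a single action, π(s).
deterministic_policy = {"s0": "right", "s1": "right", "s2": "left"}

# Stochastic policy: for each state, a probability for every action, π(a | s).
stochastic_policy = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.25, "right": 0.75},
    "s2": {"left": 1.0, "right": 0.0},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Samples an action according to the probabilities π(a | s)."""
    probs = stochastic_policy[state]
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


print(act_deterministic("s0"))  # right
print(act_stochastic("s1"))     # 'right' about 75% of the time, 'left' otherwise
```

A deterministic policy is the special case of a stochastic one in which a single action has probability 1 in every state, as in `s2` above.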

