## Reinforcement Learning
* Reinforcement learning is characterized by an **agent** learning to interact with its **environment**.
* At each time step, the environment presents a situation to the agent called a **state**. The agent then has to choose an appropriate **action** in response.
* One time step later, the agent receives a **reward**. A reward can be negative or positive. This is the environment's way of telling the agent whether it has responded with an appropriate action or not. The agent also receives a new state.
* Now, the main goal of the agent is to maximize expected **cumulative reward**. This simply means the expected sum of rewards attained over all the time steps.
* The reward hypothesis states that all goals for any agent can be framed as the **maximization of the expected cumulative reward**.

At an arbitrary time step $t$, the agent-environment interaction is simply a sequence of states, actions and rewards:

$(S_0, A_0, R_1, S_1, A_1, \ldots, R_{t-1}, S_{t-1}, A_{t-1}, R_t, S_t, A_t)$

![](images/reinforcement-agent.png "The Agent-Environment Interaction")
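
The interaction described above can be sketched as a simple loop. This is only a minimal illustration: the `CoinFlipEnv` class, its `reset`/`step` methods, and the `run_episode` helper are hypothetical names made up for this example, not part of any particular library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with two states (0 and 1) and two actions (0 and 1)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        """Start a new episode and return the initial state S_0."""
        self.state = 0
        return self.state

    def step(self, action):
        """Return (next_state, reward) in response to the agent's action."""
        # Assumed reward rule: +1 if the action matches the current state, else -1.
        reward = 1 if action == self.state else -1
        # Assumed transition rule: the next state is picked uniformly at random.
        self.state = self.rng.choice([0, 1])
        return self.state, reward


def run_episode(env, choose_action, num_steps):
    """Collect (S_t, A_t, R_{t+1}) triples for a fixed number of time steps."""
    trajectory = []
    state = env.reset()
    for _ in range(num_steps):
        action = choose_action(state)            # agent picks A_t given S_t
        next_state, reward = env.step(action)    # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory


# Usage: an agent that always copies the state as its action.
trajectory = run_episode(CoinFlipEnv(seed=0), lambda s: s, num_steps=5)
cumulative_reward = sum(reward for _, _, reward in trajectory)
```

Here the agent's action always matches the state, so every reward is positive under the assumed reward rule; a different `choose_action` would collect a different cumulative reward.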




#### How do we implement Reinforcement Learning?
Enter Markov Decision Processes (MDPs). We define our RL problem in the form of an MDP. MDPs provide a framework to model a decision-making process for the agent.

But what are they, you ask? Simple: An MDP can be defined as follows:
* A (finite) set of states S
* A (finite) set of actions A (the actions available to the agent)
* A (finite) set of rewards R (the rewards the agent can get after transitioning from one state to another)
* The one-step dynamics of the environment, and
* A discount rate $\gamma$ (gamma), where $0 \le \gamma \le 1$.

Points to note:

The discount rate $\gamma$ controls how much weight the agent gives to future rewards relative to immediate ones.

But why use a discounted return in the first place?

The main aim of doing this is to **refine the goal** you have for the agent. 

If $\gamma = 0$, the agent will only care about the immediate reward.

If $\gamma = 1$, then the return is not discounted. 

In practice, $\gamma$ is often set close to 1 so that the agent still takes future rewards into account.

Should the present reward carry the same weight as future rewards? No. It actually makes more sense to value rewards that come sooner more highly, since they are more predictable. The closer in time the reward is to the agent, the juicier it is!

Therefore, the smaller the discount rate is, the more the agent favors immediate reward over future reward.

$ G_t = R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + \cdots $

For example, with $\gamma = 0.9$ this becomes:

$ G_t = R_{t + 1} + (0.9) R_{t + 2} + (0.81) R_{t + 3} + \cdots $
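
To make the formula concrete, here is a minimal Python sketch that computes the discounted return for a finite list of rewards. The helper name `discounted_return` and the sample reward list are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` is the finite list [R_{t+1}, R_{t+2}, ...].
    """
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rewards = [1, 1, 1]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}

only_immediate = discounted_return(rewards, gamma=0.0)  # just R_{t+1}
undiscounted = discounted_return(rewards, gamma=1.0)    # plain sum of rewards
discounted = discounted_return(rewards, gamma=0.9)      # later rewards weighted by 0.9 and 0.9**2
```

With $\gamma = 0$ only the first reward survives, and with $\gamma = 1$ the result is the plain sum, which matches the two extreme cases described above.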

Moving on swiftly.

Now, you must be wondering what _the one-step dynamics of the environment_ is all about. 

Well, its purpose is to describe how the environment decides the state and reward the agent gets at every time step.

When the environment responds to the agent at time step $t+1$, **it considers only the state and action at the previous time step
$(S_t, A_t)$**. 

This means that it doesn't care or look at the actions, rewards or states **that came prior to the last time step**. This dictates the one-step dynamics of the environment. 

It is therefore a conditional probability ($P(A \mid B)$ is the probability that event $A$ will happen given that event $B$ has already occurred). The one-step dynamics of the environment are defined as follows:

$$ P(s', r \mid s, a) = P\bigl(S_{t+1} = s', R_{t + 1} = r \mid S_t = s, A_t = a\bigr) \text{ for each possible } s, r, s' \text{ and } a $$
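
One way to picture the one-step dynamics is as a lookup table keyed by the current state and action. The sketch below is a made-up example: the states `"A"` and `"B"`, the actions `"stay"` and `"move"`, the probabilities, and the helper names are all assumptions chosen for illustration.

```python
import random

# Hypothetical dynamics table: maps (s, a) to {(s', r): P(s', r | s, a)}.
DYNAMICS = {
    ("A", "stay"): {("A", 0): 0.9, ("B", 1): 0.1},
    ("A", "move"): {("A", 0): 0.2, ("B", 1): 0.8},
    ("B", "stay"): {("B", 0): 1.0},
    ("B", "move"): {("A", -1): 0.5, ("B", 0): 0.5},
}


def check_dynamics(dynamics, tol=1e-9):
    """For every (s, a), the probabilities over (s', r) must sum to 1."""
    return all(abs(sum(dist.values()) - 1.0) < tol for dist in dynamics.values())


def sample_step(dynamics, state, action, rng):
    """Draw (s', r) using only the current state and action, never the earlier history."""
    dist = dynamics[(state, action)]
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


rng = random.Random(0)
next_state, reward = sample_step(DYNAMICS, "A", "move", rng)
```

Note that `sample_step` takes no history argument at all, which is exactly the property described above: the response depends only on $(S_t, A_t)$.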


#### Solving MDPs
Now that we've learnt to frame a problem as an MDP, how does the agent decide which actions to take given its state?
This is where the **policy** comes in.  
The core problem is to come up with a policy that maps the set of states to the actions the agent will take.
A policy is simply a function $\pi$ that specifies the action **$\pi(s)$** the agent will choose when in state **$s$**.

Therefore, in order to solve an MDP, the agent must determine the best policy. The best policy is an optimal policy: one that tells the agent to select actions so that it gets the highest possible expected cumulative reward.

There are two ways of defining a policy:
* Deterministic policy: this is a mapping $\pi: S \to A$
* Stochastic policy: this uses probability. For each state $s$ and action $a$, it assigns a probability $\pi(a \mid s)$ that the agent chooses action $a$ while in state $s$
    $$ \pi(a \mid s) = \Pr\bigl(A_t = a \mid S_t = s\bigr) $$
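
The two kinds of policy can be sketched in a few lines of Python. The state names, action names, probabilities, and helper functions below are hypothetical and only meant to show the difference in shape between the two definitions.

```python
import random

# Deterministic policy: a plain mapping from each state to one action.
deterministic_policy = {"A": "move", "B": "stay"}

# Stochastic policy: for each state, a probability pi(a | s) for every action.
stochastic_policy = {
    "A": {"stay": 0.3, "move": 0.7},
    "B": {"stay": 0.9, "move": 0.1},
}


def act_deterministic(policy, state):
    """Return pi(s): the same action every time for a given state."""
    return policy[state]


def act_stochastic(policy, state, rng):
    """Sample an action a with probability pi(a | s)."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=probs, k=1)[0]


rng = random.Random(0)
fixed_action = act_deterministic(deterministic_policy, "A")
sampled_action = act_stochastic(stochastic_policy, "A", rng)
```

A deterministic policy can be seen as the special case of a stochastic one where, in every state, a single action has probability 1.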
   

