# Reinforcement Learning: A "mathless" introduction to underlying concepts
This first blog post is intended to give you a rough outline of the key concepts of Reinforcement Learning. To achieve that, we will try to introduce them naturally by appealing to your intuition. While this post attempts to be perfectly understandable without any special mathematical knowledge, math, definitions and equations are an essential part of Reinforcement Learning and will be introduced in a second post.



## A motivation: Why bother with Reinforcement Learning?
Looking back at recent successes of reinforcement learning algorithms, several use cases for such techniques become quite obvious. The most well known is probably AlphaGo, a computer program developed by researchers at DeepMind, which was shown to surpass the abilities of the best professional human Go player in 2017. Additional success stories can be found in the realm of esports: OpenAI developed bots that beat professional teams at the highly competitive video game Dota 2 in 5v5 games in 2019.

## So what is Reinforcement Learning?
To understand the fundamentals of Reinforcement Learning, it's important to understand one key concept: the relationship between an agent and its environment. To grasp the idea of an agent, it seems pretty natural to put oneself in its place - a human being interacting with their surroundings, gathering information about themselves and those surroundings, and changing this state through actions.

Now we - as the agent - usually have quite a lot of different actions we could take. Imagine having to walk the dog: there are two options regarding clothing. We could either take a jacket or go without one. To decide which action to pick, it's pretty instinctive to observe the temperature outside - the environment, so to speak. But how do we know if our decision was good in that scenario? We receive feedback - a reward - from the environment. In our example, we might be freezing if we decided not to take the jacket.

And that example - at its core - is exactly what the relationship between agent and environment is about. More formally, an agent takes an action ***a*** in state ***s*** and collects a reward ***r*** in return, which tells it how good action ***a*** was in that specific state ***s***. The agent additionally transitions into the next state ***s'***, which is determined by a transition function ***P***.
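
To make the loop above a little more tangible, here is a minimal Python sketch of the dog-walk example. Everything in it is an assumption made for illustration: the function names, the two weather states, the reward numbers, and the purely random weather that stands in for the transition function ***P***.

```python
import random

def environment_step(state, action, rng):
    """Hypothetical environment: returns (reward r, next state s')."""
    if state == "cold" and action == "no_jacket":
        reward = -1.0   # freezing without a jacket
    elif state == "warm" and action == "jacket":
        reward = -0.5   # sweating in a jacket (made-up value)
    else:
        reward = 1.0    # dressed appropriately
    # Stand-in for the transition function P: tomorrow's weather is random.
    next_state = rng.choice(["cold", "warm"])
    return reward, next_state

def choose_action(state):
    """Hypothetical agent: observes state s and picks action a."""
    return "jacket" if state == "cold" else "no_jacket"

def run_walks(num_walks, seed=0):
    """Run the agent-environment loop and add up the collected rewards."""
    rng = random.Random(seed)
    state = "cold"
    total_reward = 0.0
    for _ in range(num_walks):
        action = choose_action(state)                          # a
        reward, state = environment_step(state, action, rng)   # r, s'
        total_reward += reward
    return total_reward
```

Because this hand-written agent always dresses for the weather, every walk earns the positive reward. A learning agent would have to discover that behaviour from the rewards alone.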


![AgentEnvironment.PNG](attachment:AgentEnvironment.PNG)

## Model-based vs model-free
Now that we have introduced the key concept of the agent-environment relationship, it's necessary to differentiate between two general cases: knowing the model and not knowing the model.

Imagine a world in which the weather is sunny exactly every other day and rainy on every remaining day. If our agent has to do a task outside but can freely decide on which day to do it, then knowing the model of its world, it will probably decide to go outside on a sunny day for a better outcome.
If it - on the other hand - does not know the model, it will probably go out on the first day to collect the reward for completing its task as soon as possible, and will only learn how its world behaves over the course of weeks, as part of the learning experience itself.


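Again, more formally, there is a distinction between model-free and model-based Reinforcement Learning. In model-based Reinforcement Learning the agent works with a model of its environment. If that model is fully known, the agent has all the information and can plan its actions perfectly. That's why such problems can be solved with "simpler" algorithms using Dynamic Programming.
Model-free Reinforcement Learning, on the other hand, gives the agent no such model: it has to learn from experience how its world behaves as part of solving the problem. Strictly speaking, model-free methods usually learn which actions pay off directly from the rewards, rather than building an explicit model.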
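
The weather world from above can be sketched in a few lines of Python to contrast the two cases. This is only an illustration under stated assumptions: "sunny every other day" is taken to mean "sunny on even days", the rewards of +1 and -1 are made up, and all function names are hypothetical. The learning agent is also given a big hint, namely that it sorts its experience into even and odd days.

```python
def weather_model(day):
    """Hypothetical model of the world: sunny on even days, rainy on odd days."""
    return "sunny" if day % 2 == 0 else "rainy"

def reward_for_going_out(day):
    """Feedback from the environment, only observed after acting."""
    return 1.0 if weather_model(day) == "sunny" else -1.0

def plan_with_model(first_day, horizon=7):
    """Model known: look ahead and pick the first sunny day."""
    for day in range(first_day, first_day + horizon):
        if weather_model(day) == "sunny":
            return day
    return first_day  # no sunny day within the horizon

def learn_from_experience(first_day, num_days=8):
    """Model unknown: go out every day and average the observed rewards,
    grouped by day parity (0 = even days, 1 = odd days)."""
    rewards = {0: [], 1: []}
    for day in range(first_day, first_day + num_days):
        rewards[day % 2].append(reward_for_going_out(day))
    return {parity: sum(r) / len(r) for parity, r in rewards.items()}
```

The planning agent never has to get wet: it consults the model and waits for a sunny day. The learning agent only sees rewards, and it takes several days of experience before its averages show that even days are the better choice.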


![modelbased-shortened.png](attachment:modelbased-shortened.png)

## Policies

A policy $\pi$ describes the strategy by which the agent chooses which action to take in which state. The optimal policy returns the best possible action for each state, meaning the one that leads to the biggest possible reward in the long run. Furthermore, we differentiate between a deterministic policy $\pi$(***s***), which always returns the same action for a given state ***s***, and a stochastic policy $\pi$(***a***|***s***), which gives the probability of choosing action ***a*** given that the agent is in state ***s***.
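
As a rough sketch of the difference, both kinds of policy can be written as simple Python lookups. The states, actions and probabilities below are invented for the jacket example and are not part of any library.

```python
import random

# Deterministic policy pi(s): exactly one action per state.
deterministic_policy = {"cold": "jacket", "warm": "no_jacket"}

# Stochastic policy pi(a|s): a probability for every action in every state.
# The numbers are made up for illustration.
stochastic_policy = {
    "cold": {"jacket": 0.9, "no_jacket": 0.1},
    "warm": {"jacket": 0.2, "no_jacket": 0.8},
}

def act_deterministic(state):
    """Always returns the same action for the same state."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """Samples an action according to the probabilities for this state."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling `act_deterministic("cold")` always yields `"jacket"`, while `act_stochastic("cold", random.Random())` yields `"jacket"` most of the time and `"no_jacket"` occasionally.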


## Value functions



Now, to evaluate an action ***a*** in a specific state ***s***, it doesn't seem too practical to simply try it out and see what the reward will be. Instead, it seems to be a good idea to try to predict what the future rewards will be if action ***a*** is taken in state ***s***.

To get a better understanding of the reasons behind this approach, let's take another look at a real-world example.

**State**: Having a set amount of money in your account.

**Action 1**: Withdraw that money now to buy yourself something.

**Action 2**: Leave the money in your account to earn more money through interest and buy yourself something more expensive in the future.

Both of these actions will give you a certain reward. The key difference is that if you only looked at the reward each action gives you right now, action 1 would give you a much higher reward, because you would buy yourself something right away, while action 2 would reward you with nothing. But action 2 will grant you a much higher reward in the future, and this reward may outweigh the reward of action 1. The main question when deciding between action 1 and 2 is whether waiting the additional time is worth the higher reward or not. And this is the major idea behind value functions.

Value functions estimate the so-called return, usually written as $G_t$, with ***t*** being the current time step. To do so, a value function looks at the future time steps ***t+1, t+2, ...*** and evaluates the rewards they will grant you. But you can't simply look at all future rewards and add them up. We need to discount future rewards, so that we prefer rewards we receive in the near future and do not wait indefinitely.
We do that by introducing a discount factor $\gamma$, a number between 0 and 1 that shrinks a reward more the further in the future it lies.

Back to our example: if we didn't have such a discount factor, we might wait indefinitely to buy a house, because its reward would outweigh everything else, instead of buying food, for example.
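
The effect of the discount factor $\gamma$ can be tried out with a few lines of Python. This is a minimal sketch: the function name and the reward numbers for "spend now" versus "wait" are invented for illustration, and the reward that lies k steps in the future is simply weighted by $\gamma$ to the power of k.

```python
def discounted_return(rewards, gamma):
    """Add up rewards, weighting the reward k steps ahead by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards per time step for the two actions from the example.
spend_now = [10, 0, 0, 0]   # Action 1: reward right away
wait = [0, 0, 0, 15]        # Action 2: bigger reward three steps later

for gamma in (0.9, 0.5):
    print(gamma, discounted_return(spend_now, gamma), discounted_return(wait, gamma))
```

With a patient discount factor such as 0.9, waiting comes out ahead; with an impatient one such as 0.5, spending right away wins. The discount factor therefore decides how much the future counts.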

Now we can use these value functions to update our policies step by step until they converge to the optimal policy. We will encounter this principle as value iteration in a later post.

![modelcombined.png](attachment:modelcombined.png)