# 1. Q-Learning

![q-learning algo](https://cdn-media-1.freecodecamp.org/images/k0IARc6DzE3NBl2ugpWkzwLkR9N4HRkpSpjw)

Last week, we learned the basics of policy gradient algorithms by implementing REINFORCE and A2C. Those algorithms optimize the policy directly, adjusting the action probability distribution to maximize the expected reward. This week, we are going to analyze the ***value-based*** approach, starting with the Q-Learning algorithm.

In short, Q-Learning has the following characteristics:
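- **Model-free**. It estimates the optimal policy without needing a model of the environment, that is, without knowing its transition probabilities or reward function in advance.
- **Off-policy**. It learns the value of the greedy policy from the experience it collects, regardless of the policy that was actually used to choose those actions.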

In contrast to the previous weeks, we will first go through the mathematical derivation of Q-Learning and then explain the algorithm itself.

## Mathematics

To derive the Q-Learning algorithm, let's take a step back and start from the basics of reinforcement learning as a whole.

### Rewards

As we learned last week, the reward is feedback from the environment that tells the agent how *good* its action was. The total reward from time step t onward can therefore be written as:
$$
R_t = r_{t+1}+r_{t+2}+...
$$

Even though this formula gives us the sum of the rewards, it would be unreasonable to build our model on it directly, because the sum can grow to infinity. One way to avoid this problem is the concept of the **discounted future reward**, which, as we learned in the last tutorial, uses a discount factor γ between 0 and 1 to give less weight to rewards that lie further in the future.
$$
R_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+...
$$

This formula can be written recursively:
$$
R_t = r_{t+1}+\gamma R_{t+1}
$$
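
As a quick sanity check of the recursion, here is a minimal sketch that computes the discounted return both ways. The reward values and the discount factor are made up for illustration, and both function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return R_t given the rewards [r_{t+1}, r_{t+2}, ...] and discount gamma."""
    ret = 0.0
    # Walk backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def discounted_return_direct(rewards, gamma):
    """Same quantity computed from the explicit sum of gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


rewards = [1.0, 0.0, 2.0]  # made-up rewards for illustration
gamma = 0.5
print(discounted_return(rewards, gamma))         # 1 + 0.5*0 + 0.25*2 = 1.5
print(discounted_return_direct(rewards, gamma))  # 1.5
```

The backward loop is the recursive form in action: the return of the last step is just its reward, and every earlier return reuses the one after it.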


### Policy

Policy is a function that gives the probability of taking an action **a** in state **s**. In the policy gradient algorithms, we optimized the policy so that it maximized the expected reward.

An important thing to note is that, since the policy is a probability distribution, the probabilities of all possible actions in a state must add up to 1.

$$
\sum_a \pi (s, a) = 1
$$
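
To make the constraint concrete, the sketch below builds the action probabilities of a single state with a softmax, which is one common way (assumed here, not prescribed by this tutorial) to turn arbitrary action preferences into a valid distribution. The preference values are hypothetical.

```python
import math
import random


def softmax(preferences):
    """Turn arbitrary action preferences into probabilities that sum to 1."""
    shift = max(preferences)  # subtract the max for numerical stability
    exps = [math.exp(p - shift) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical preferences for three actions in a single state s.
pi_s = softmax([2.0, 1.0, 0.1])
print(pi_s, sum(pi_s))  # the probabilities sum to 1 (up to rounding)

# Sample an action index according to pi(s, .).
action = random.choices(range(len(pi_s)), weights=pi_s, k=1)[0]
print(action)
```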

### New notations

We have already covered these concepts in the previous tutorial. The derivation that follows, however, requires two new notations: **immediate reward** and **transition probability**.

**Immediate reward** is the expected reward for going from state s to s' via action a.

$$
R_{ss'}^a = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']
$$

**Transition probability** is the probability of going from state s to s' via action a.

$$
P_{ss'}^a = \Pr[s_{t+1} = s' | s_t = s, a_t = a]
$$
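
One way to get a feeling for these two quantities is to estimate them from logged experience: the transition probability becomes a visit frequency and the immediate reward becomes an average. The sketch below assumes a small hand-written list of transitions; the state names, action names, and rewards are all hypothetical.

```python
from collections import defaultdict

# Hypothetical logged transitions (s, a, r, s_next) from some environment.
transitions = [
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 0.0, "s1"),
    ("s0", "right", 1.0, "s2"),
    ("s0", "right", 3.0, "s2"),
]

counts = defaultdict(int)         # number of visits of (s, a, s_next)
reward_sums = defaultdict(float)  # summed rewards observed on (s, a, s_next)
totals = defaultdict(int)         # number of visits of (s, a)

for s, a, r, s_next in transitions:
    counts[(s, a, s_next)] += 1
    reward_sums[(s, a, s_next)] += r
    totals[(s, a)] += 1

# P[(s, a, s')]: fraction of times that taking a in s led to s'.
P = {key: n / totals[key[:2]] for key, n in counts.items()}
# R[(s, a, s')]: average reward observed on that transition.
R = {key: reward_sums[key] / n for key, n in counts.items()}

print(P[("s0", "right", "s2")])  # 0.5
print(R[("s0", "right", "s2")])  # 2.0
```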

### Value functions

In the advantage actor-critic tutorial, we already ran into the **state value function** and the **Q-value function**, which play a more important role in Q-Learning models.

**The state value function** gives the expected total reward that can be received starting from the current state while following policy π:

$$
V^{\pi}(s) = E[R_t | s_t = s]
$$

**The Q-value function** gives the expected total reward that can be received after taking a specific action in the current state and following policy π afterwards:

$$
Q^{\pi}(s, a) = E[R_t | s_t = s, a_t = a]
$$

### Bellman equation

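Now, let's expand the Q-value function. We substitute the recursive form of the return, split the expectation into two terms, and then write both terms as sums over the possible next states s'.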
$$
Q^{\pi}(s, a) = E[R_t | s_t = s, a_t = a]
$$

$$
Q^{\pi}(s, a) = E[r_{t+1} + \gamma R_{t+1} | s_t = s, a_t = a]
$$

$$
Q^{\pi}(s, a) = E[r_{t+1} | s_t = s, a_t = a] + \gamma E[R_{t+1} | s_t = s, a_t = a]
$$

$$
Q^{\pi}(s, a) = \sum_{s'} P_{ss'}^aR_{ss'}^a + \gamma \sum_{s'} P_{ss'}^aV^{\pi}(s')
$$
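
The last line of the derivation can be evaluated directly when the transition probabilities, immediate rewards, and state values are known. The sketch below does this for a made-up environment with two states and two actions; the tables `P`, `R`, `V`, the discount factor, and the function name `q_from_model` are all assumptions for illustration.

```python
def q_from_model(P, R, V, gamma):
    """Compute Q[s][a] = sum over s' of P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    n_states = len(P)
    n_actions = len(P[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for s_next in range(n_states):
                p = P[s][a][s_next]
                Q[s][a] += p * (R[s][a][s_next] + gamma * V[s_next])
    return Q


# Made-up model: P[s][a][s'] are transition probabilities,
# R[s][a][s'] are immediate rewards, V[s'] are assumed state values.
P = [[[0.5, 0.5], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[[0.0, 1.0], [0.0, 2.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
V = [1.0, 2.0]

Q = q_from_model(P, R, V, gamma=0.5)
print(Q)  # [[1.25, 3.0], [0.5, 1.0]]
```

Folding both sums into one loop is equivalent to the two separate sums in the equation, since they run over the same next states with the same weights.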

If we assume that the environment is deterministic, that is, whenever the agent takes an action in a given state it always ends up in the same next state, the sums collapse to a single term and the previous equation simplifies to:

$$
Q^{\pi}(s, a) = R_{ss'}^a + \gamma V^{\pi}(s')
$$

Most commonly, this is written as:

$$
Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma V^{\pi}(s_{t+1})
$$

## Q-Learning

![greedy_algo](https://miro.medium.com/max/375/0*rQ7hXKOPSxcR271w.gif)

Q-Learning is based on the **greedy policy** principle: the agent always chooses the action with the highest Q-value, that is, the highest expected total reward. Mathematically, this gives a relation between the state value and Q-value functions:

$$
V(s_t) = \max_a Q(s_t, a)
$$

Plugging this into the previously derived equation, we get the Bellman equation for Q-Learning.

$$
Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
$$

In descriptive terms, this equation says that the value of an action in a certain state is the **immediate reward** we get from taking the action plus the discounted **maximum expected reward** obtainable from the next state.
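
To see the equation at work, the sketch below applies it repeatedly to every state-action pair of a tiny deterministic chain until the Q-values stop changing. Note that this is not Q-Learning itself: it assumes the transitions and rewards are known, which a model-free agent does not have. The environment, the treatment of the terminal state (its future value is taken as 0), and the discount factor are all assumptions.

```python
GAMMA = 0.5
TERMINAL = 2  # reaching state 2 ends the episode

# Hypothetical deterministic chain with states 0 and 1 (plus terminal state 2).
# next_state[s][a] and reward[s][a] for actions 0 = left, 1 = right.
next_state = [[0, 1], [0, 2]]
reward = [[0.0, 0.0], [0.0, 1.0]]

Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(50):  # far more sweeps than this tiny example needs
    for s in range(2):
        for a in range(2):
            s_next = next_state[s][a]
            # No future reward can be collected after the terminal state.
            future = 0.0 if s_next == TERMINAL else max(Q[s_next])
            Q[s][a] = reward[s][a] + GAMMA * future

print(Q)  # [[0.25, 0.5], [0.25, 1.0]]
```

In both states the action "right" ends up with the higher Q-value, so the greedy policy walks straight to the terminal state.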


## Implementation

So far, we have learned the main equation that guides the process of Q-Learning; but how does this translate to code?

1. **Initializing Q-value space.** To record the data, the model keeps an associated Q-table containing the Q-values of every state-action pair in the environment.
2. **Extracting action probabilities**. At each step of an episode, we need to retrieve the probabilities of all actions in the current state.
3. **Taking action**. The action is sampled according to this probability distribution.
4. **Recording data**. After taking the sampled action, we record the reward and the next state.
5. **Taking next action**. The next action is chosen according to the **greedy algorithm**, that is, the one with the highest Q-value in the next state.
6. **Calculating loss**. In our case, the "loss" is the difference between the target Q-value given by the Bellman equation and the current Q-value of the action we took.
7. **Updating Q-table**. After getting the loss, we update the table entry by adding this difference multiplied by the learning rate.
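
The steps above can be sketched as a small tabular agent. This is a minimal illustration under several assumptions rather than a reference implementation: the environment is a hypothetical four-state chain defined in `step`, the behaviour policy is epsilon-greedy (a common choice that stands in for the action probabilities of steps 2 and 3), and the hyperparameters are picked only for the example.

```python
import random

N_STATES, N_ACTIONS, GOAL = 4, 2, 3    # states 0..3, actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # illustrative hyperparameters


def step(state, action):
    """Hypothetical deterministic chain: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def choose_action(Q, state, rng):
    """Epsilon-greedy: explore with probability EPSILON, otherwise act greedily."""
    if rng.random() < EPSILON:
        return rng.randrange(N_ACTIONS)
    best = max(Q[state])
    # Break ties randomly so the untrained agent does not get stuck.
    return rng.choice([a for a in range(N_ACTIONS) if Q[state][a] == best])


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # 1. initialize the Q-table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(Q, state, rng)            # 2-3. pick an action
            next_state, reward, done = step(state, action)   # 4. record data
            # 5. greedy value of the next state (0 once the episode has ended)
            best_next = 0.0 if done else max(Q[next_state])
            # 6. "loss": Bellman target minus the current estimate
            td_error = reward + GAMMA * best_next - Q[state][action]
            Q[state][action] += ALPHA * td_error             # 7. update the Q-table
            state = next_state
    return Q


Q = train()
for s in range(GOAL):
    print(s, [round(q, 3) for q in Q[s]])
```

Because the target always uses the maximum over the next state's Q-values, the table converges towards the greedy policy's values even though the agent sometimes explores, which is the off-policy property mentioned at the start.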