## Introduction
Last week, we learned the basics, or fundamental building blocks, of Reinforcement Learning (RL). In this notebook, we will recap those basics (the concepts of reward and return, the state-value function, ...) and introduce another RL algorithm called Q-learning. Then, we will extend the implementation of Q-learning by using a neural network, and this is where the Deep Q Network (DQN) comes into play 🙌


##  1. Basics of Reinforcement Learning

### Reinforcement Learning Cycle
<img src=https://miro.medium.com/max/1400/0*DznLQWZFohKXv5iZ.png width="500">

The RL loop (also known as the RL cycle) is the iterative process that an RL agent follows to learn and improve its decision-making skills through interaction with its environment. The loop, shown in the figure above, consists of the following steps:
1. Observation: The agent observes the current state of the environment.
2. Action: Based on its current state, the agent selects an action to take.
3. Reward: The environment provides the agent with a reward based on the action it took.
4. Next State: The environment transitions to a new state based on the agent's action.

The RL loop is a powerful framework for building intelligent agents that can learn from their experiences and improve their decision-making skills over time. This framework provides a means (or rather, a formal definition) for the agent to learn a policy, or decision-making strategy, that maximizes the expected total reward over the long term.
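
As a rough illustration of the four steps listed above, here is a minimal sketch of the loop in Python. The `CorridorEnv` class, its `reset`/`step` methods, and the reward values are hypothetical and only serve this example; they are not the environment used later in the notebook.

```python
class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the agent cannot leave the corridor
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 100 if done else -1
        return self.position, reward, done


def run_episode(env, choose_action, max_steps=1000):
    state = env.reset()                              # 1. Observation
    total_reward = 0
    for _ in range(max_steps):
        action = choose_action(state)                # 2. Action
        next_state, reward, done = env.step(action)  # 3. Reward and 4. Next State
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward


def always_right(state):
    # A trivial decision rule: ignore the state and always move right
    return 1


print(run_episode(CorridorEnv(), always_right))  # 97 (three steps of -1, then +100)
```

A learning agent would replace `always_right` with a rule that it updates from the rewards it observes.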


### A Simple Example: Path Planning Problem
Let's have a look at a simple path planning problem that can be solved using Reinforcement Learning.

<img src="img/robot_03.png" width="400">

The goal of this problem is to find the shortest path for the robot to move from its starting cell to the end cell. Let's start by formalizing this problem as an RL framework.
- A state, $S_t$, is the cell the robot is currently in. Because there are 15 cells, there are 15 states in this RL framework.
- An action, $A_t$, is a move the robot can take: one of UP, DOWN, RIGHT, or LEFT.
- A reward, $R_t$, will be defined later.
- The next state in this problem is deterministic. This means that if the robot is in cell (0,0) and takes the action $A_t$ = RIGHT, it will end up in cell (1,0) with 100% certainty.

**What is a non-deterministic model then?**\
In a non-deterministic problem, taking a certain action can lead the robot to two or more different states, each with some probability. For example, if the robot is in cell (0,0) and takes the action $A_t$ = RIGHT, it might have a 50% probability of ending up in cell (1,0) and a 50% probability of staying in cell (0,0).
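
To make the contrast concrete, here is a small sketch of both kinds of transition. The function names and the `slip_probability` parameter are made up for this example, cells are assumed to be addressed as (x, y) with y increasing downwards, and bounds checking is left out for brevity.

```python
import random

# Assumed coordinate convention: (x, y), RIGHT increases x, DOWN increases y
MOVES = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def deterministic_step(state, action):
    """The same state and action always lead to the same next state."""
    x, y = state
    dx, dy = MOVES[action]
    return (x + dx, y + dy)


def stochastic_step(state, action, rng, slip_probability=0.5):
    """With probability slip_probability the robot stays put, otherwise it moves."""
    if rng.random() < slip_probability:
        return state
    return deterministic_step(state, action)


print(deterministic_step((0, 0), "RIGHT"))  # (1, 0)

rng = random.Random(0)
outcomes = [stochastic_step((0, 0), "RIGHT", rng) for _ in range(1000)]
print(outcomes.count((1, 0)) / len(outcomes))  # close to 0.5
```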

### Policy

A **policy** is a function that maps the current state of the environment to an action to be taken by the agent. It is sometimes loosely called the 'model', although in RL the term 'model' more often refers to a model of the environment. The policy determines the behavior of the agent and specifies the way in which the agent interacts with the environment.

Shown below is an example policy for the path planning problem. Let's call this policy $\pi$:

<img src="img/robot_02.png" width="400">
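
In code, a deterministic policy can be as simple as a lookup table from states to actions. The sketch below uses a hypothetical policy on an assumed 5x3 grid (go RIGHT until the last column, then go DOWN); it is not necessarily the policy drawn in the figure.

```python
# Assumed grid size: 5 columns x 3 rows, i.e. the 15 cells of the example
WIDTH, HEIGHT = 5, 3


def make_policy():
    """Build a hypothetical policy as a dict: state (x, y) -> action."""
    policy = {}
    for x in range(WIDTH):
        for y in range(HEIGHT):
            # Go RIGHT until the last column is reached, then go DOWN
            policy[(x, y)] = "RIGHT" if x < WIDTH - 1 else "DOWN"
    return policy


pi = make_policy()
print(pi[(0, 0)])  # RIGHT
print(pi[(4, 1)])  # DOWN
```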

### Reward

A **reward** is a signal that indicates how well an agent is doing with respect to a particular task or goal. It is a scalar value that is received by the agent after it performs an action in a given state of the environment. In the context of the robot path planning we introduced before, a reward is given when the robot takes a certain action (moving UP/DOWN/LEFT/RIGHT), and it indicates how favourable the move is. 

For this problem, we can introduce a simple reward function:
- $R_t = +100$; if the robot arrives at the end cell (being in cell (3,2) and moving RIGHT, or being in cell (4,1) and moving DOWN)
- $R_t = -1000$; if the robot exits the 'playing field' or the grid (being in cell )
- $R_t = -1$; for any other move the robot takes 
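
The three rules above could be written as a small Python function. This is only a sketch: the 5x3 grid size and the end cell at (4,2) are inferred from the examples in the text, and the (x, y) convention with y increasing downwards is an assumption.

```python
# Assumptions: 5 columns x 3 rows, end cell at (4, 2), y increases downwards
WIDTH, HEIGHT = 5, 3
END_CELL = (4, 2)
MOVES = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def reward(state, action):
    """Reward for taking `action` in `state`, following the three rules."""
    x, y = state
    dx, dy = MOVES[action]
    next_x, next_y = x + dx, y + dy
    if not (0 <= next_x < WIDTH and 0 <= next_y < HEIGHT):
        return -1000  # the robot left the grid
    if (next_x, next_y) == END_CELL:
        return 100    # the robot arrived at the end cell
    return -1         # any other move


print(reward((3, 2), "RIGHT"))  # 100
print(reward((0, 0), "UP"))     # -1000
print(reward((0, 0), "RIGHT"))  # -1
```

The small negative reward for ordinary moves is what pushes the robot towards short paths: every extra step costs one point.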



### Return

A reward on its own is not that meaningful, as it only captures single-step performance. This is why the idea of the 'return' is introduced. The return is the total amount of reward an agent receives over time when it follows a particular policy. Note that in the formula below, $R_t$ denotes the return and lowercase $r$ denotes the individual rewards.

$$
R_t = r_{t+1}+r_{t+2}+...
$$


### Discount Factor

Even though the **return** lets us sum up the rewards, it is impractical as a building block for our model: if the task never ends, the sum can grow without bound. One way of avoiding this problem is to introduce a **discount factor**, which captures the decreasing importance of future rewards.
$$
R_t = r_{t+1}+\gamma r_{t+2}+\gamma^2 r_{t+3}+...
$$

This formula can be written recursively as follows:
$$
R_t = r_{t+1}+\gamma R_{t+1}
$$

Note that the value of the **discount factor, $\gamma$**, ranges from 0 to 1.
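
Here is a short sketch that computes the discounted return in both ways: as a direct sum and with the recursive form. The reward sequence is an arbitrary illustration, not the answer to the exercise below, and the function names are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Direct sum: r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def discounted_return_recursive(rewards, gamma):
    """R_t = r_{t+1} + gamma * R_{t+1}, evaluated from the last reward backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [-1, -1, 100]  # illustrative sequence: two ordinary moves, then the goal
print(discounted_return(rewards, gamma=0.9))            # approximately 79.1
print(discounted_return_recursive(rewards, gamma=0.9))  # approximately 79.1
```

With `gamma=1` both functions reduce to the plain, undiscounted sum of the rewards.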

TO-DO: Try computing the return the robot receives when it starts in the starting state and follows the policy $\pi$ introduced earlier. Do you think this is an optimal policy?

### State-Value Function

The **state-value function** gives the **expected return** that can be received starting from the current state when following policy $\pi$:

$$
V^{\pi}(s) = E[R_t | s_t = s]
$$


### Action-Value Function

The **action-value function** gives the **expected return** that can be received after taking a specific action in the current state and following policy $\pi$ afterwards:

$$
Q^{\pi}(s, a) = E[R_t | s_t = s, a_t = a]
$$




**What's the difference between State-value function and Action-value function?**\
The difference is a bit subtle. If you look at the formulas closely, it is the dependency on the action that differs between them. Mathematically, the **action-value function** is the expected return when starting from a particular state $s$, taking a particular action $a$ (which need not be the one the policy would choose), and then following the current policy $\pi$. The **state-value function** is the expected return when starting from a particular state $s$ and **following the current policy $\pi$ right away**.
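
One way to make the link between the two functions concrete is to assume a deterministic policy that always picks the single action $\pi(s)$ in state $s$ (an assumption for this note, not something the definitions above require). In that case:

$$
V^{\pi}(s) = Q^{\pi}(s, \pi(s))
$$

In words: the state-value of $s$ equals the action-value of the action the policy would choose there. For any other action $a$, $Q^{\pi}(s, a)$ can be larger or smaller than $V^{\pi}(s)$.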


Now, let's expand the Q-value function.
$$
Q^{\pi}(s, a) = E[R_t | s_t = s, a_t = a]
$$

$$
Q^{\pi}(s, a) = E[r_{t+1} + \gamma R_{t+1} | s_t = s, a_t = a]
$$

$$
Q^{\pi}(s, a) = E[r_{t+1} | s_t = s, a_t = a] + \gamma E[R_{t+1} | s_t = s, a_t = a]
$$

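Since our path planning problem is deterministic, the expectation over the reward and the next state disappears. If the policy is also deterministic, so that the next action is $a_{t+1} = \pi(s_{t+1})$, the Q-value function simplifies to:
$$
Q^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma Q^{\pi}(s_{t+1}, a_{t+1})
$$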
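
For a deterministic environment and a deterministic policy, this recursion can be evaluated directly. The sketch below relies on several assumptions that are not stated in the text: a 5x3 grid with the end cell at (4,2), an episode that ends when the robot leaves the grid, and a hypothetical policy (RIGHT until the last column, then DOWN) that always reaches the end cell.

```python
# Assumptions: 5x3 grid, end cell at (4, 2), y increases downwards
WIDTH, HEIGHT = 5, 3
END_CELL = (4, 2)
MOVES = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    x, y = state
    dx, dy = MOVES[action]
    nxt = (x + dx, y + dy)
    if not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
        return state, -1000, True  # assumption: leaving the grid ends the episode
    if nxt == END_CELL:
        return nxt, 100, True
    return nxt, -1, False


def policy(state):
    # Hypothetical policy: RIGHT until the last column, then DOWN
    return "RIGHT" if state[0] < WIDTH - 1 else "DOWN"


def q_value(state, action, gamma):
    """Q(s, a) = r(s, a) + gamma * Q(s', pi(s')), with no future return after the end."""
    next_state, reward, done = step(state, action)
    if done:
        return reward
    return reward + gamma * q_value(next_state, policy(next_state), gamma)


print(q_value((3, 2), "RIGHT", gamma=0.9))  # 100
print(q_value((2, 2), "RIGHT", gamma=0.9))  # 89.0, i.e. -1 + 0.9 * 100
print(q_value((0, 0), "UP", gamma=0.9))     # -1000
```

The recursion terminates here only because the assumed policy always reaches the end cell; a policy that loops forever would need an iterative method instead.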


The **action-value function**, or **Q-value function**, is the building block of Q-learning, which we will look at in the next topic.

Looking back at our path planning problem: since we have 14 different states (excluding the END cell) and each state allows 4 different actions (UP, DOWN, LEFT, RIGHT), we will have 56 different Q-values, one for each state-action pair.

<img src="img/q_value_01.png" width="400">
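
A common way to store these values is a table with one entry per state-action pair. Here is a minimal sketch using a plain dictionary, again assuming a 5x3 grid with the END cell at (4,2); initialising every entry to zero is a choice made for this example.

```python
# Assumptions: 5x3 grid, END cell at (4, 2)
WIDTH, HEIGHT = 5, 3
END_CELL = (4, 2)
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]

# All cells except the END cell are states from which the robot can act
states = [(x, y) for y in range(HEIGHT) for x in range(WIDTH) if (x, y) != END_CELL]

# One Q-value per (state, action) pair, initialised to zero
q_table = {(state, action): 0.0 for state in states for action in ACTIONS}

print(len(states), len(q_table))  # 14 56
```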