# Introduction to Q-Learning
Q-learning is one of the best-known algorithms in reinforcement learning. We start with the principle behind it and then work through a small example.

## Q-learning principle
We use a simple example to introduce Q-learning. Suppose a house has 5 rooms, some of which are connected by doors. We want to be able to get out of this house.

![](https://ws2.sinaimg.cn/large/006tNc79ly1fn70q0n91lj30h40a8aaf.jpg)

We can simplify this into a graph. Each room is a node, and whenever two rooms are connected by a door we draw an edge between the two nodes. This gives the picture below.

![](https://ws4.sinaimg.cn/large/006tNc79ly1fn70r6c6koj30h60b2gm0.jpg)

To simulate the whole process, we place an agent in an arbitrary room and want it to get out of the house, which means reaching node 5. So that the agent knows node 5 is the target, we attach a reward to every edge: edges that lead directly into the target room get a reward of 100, and all other edges get 0. Note that room 5 also has an arrow pointing to itself with a reward of 100, so once the agent reaches room 5 it chooses to stay there. Such a state is called an absorbing goal. The result is shown below.

![](https://ws4.sinaimg.cn/large/006tNc79ly1fn71gf4idrj30c207u74i.jpg)

Imagine that the agent keeps learning: each time we put it in one of the rooms, it explores and, guided by the reward values, walks to room 5, that is, out of the house. For example, if the agent starts in room 2, we want it to explore until it reaches room 5.

### State and Action
There are two important concepts in Q-learning: state and action. We call each room a state, and a move of the agent from one room to another an action. In the graph above, each node is a state and each arrow is an action. If the agent is in state 4, it can go from there to state 0, state 3, or state 5. If it reaches state 3, it can then go to state 1, state 2, or state 4.

We can build a reward table from the states and actions. We use -1 to indicate that there is no edge between the corresponding nodes, and edges that do not lead to the goal have a reward of 0, as follows

![](https://ws2.sinaimg.cn/large/006tNc79ly1fn71o8jlinj307t055wek.jpg)
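The reward table can also be written down in code. The sketch below rebuilds it from the doors mentioned in the text; the door list, the name `GOAL`, and the helper `valid_actions` are assumptions introduced here for illustration, and the image above remains the reference.

```python
import numpy as np

# Doors between rooms, read off the description above
# (assumption: there are no other doors).
doors = [(0, 4), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5)]
GOAL = 5

R = np.full((6, 6), -1)  # -1: no edge between the two rooms
for a, b in doors:
    # A door works in both directions; moving into the goal room pays 100.
    R[a, b] = 100 if b == GOAL else 0
    R[b, a] = 100 if a == GOAL else 0
R[GOAL, GOAL] = 100  # self-loop that makes room 5 an absorbing goal


def valid_actions(state):
    """Rooms that can be reached directly from `state`."""
    return np.flatnonzero(R[state] >= 0).tolist()


print(R)
print(valid_actions(4))
```

Row `s` of `R` lists the reward for moving from room `s` to each other room, so the valid actions of a state are simply the columns whose entry is not -1.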

In the same way, we can let the agent keep learning about the environment by interacting with it, so that for each state it estimates the benefit of each possible action. These estimates form a matrix called the Q table, in which each row represents a state and each column represents an action. When the set of states is not known in advance, we can start the agent from random positions and let it explore the environment to discover as many states as possible. At first the agent knows nothing about the environment, so all values are initialized to 0, as follows

![](https://ws2.sinaimg.cn/large/006tNc79ly1fn71t3h3wnj306u053jrf.jpg)

The agent updates the values in the Q table as it learns, and in the end it makes decisions based on those values.


### Q-learning algorithm
With the reward table and the Q table in place, we need to know how the agent updates the Q table while learning, so that it can eventually make decisions based on it. This is where the Q-learning algorithm comes in.

The Q-learning algorithm is remarkably simple. Its update formula is

$$Q(s, a) = R(s, a) + \gamma \max_{\tilde{a}}\{ Q(\tilde{s}, \tilde{a}) \}$$

Here $s$ and $a$ are the current state and action, $\tilde{s}$ is the next state reached after taking action $a$ in state $s$, and $\tilde{a}$ ranges over all actions available in $\tilde{s}$. The parameter $\gamma$ is a constant with $0 \leq \gamma \leq 1$ that controls how strongly future rewards are discounted; it plays the role of a person's foresight about the future.
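To make the role of $\gamma$ more concrete, the formula can be unrolled along the path the agent follows when it always takes the best next action. This is only a sketch: it assumes a deterministic environment and a Q table that has stopped changing, and $s_1, s_2, \dots$ and $a_1, a_2, \dots$ are names introduced here for the successive states and best actions.

$$Q(s, a) = R(s, a) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$$

A reward collected $k$ steps in the future is weighted by $\gamma^k$. With $\gamma$ close to 0 the agent cares almost only about the immediate reward, while with $\gamma$ close to 1 distant rewards count almost as much as immediate ones.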

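Put simply, the agent learns on its own from experience: it keeps moving from one state to another to explore, and it keeps updating the Q table along the way until it reaches the target. The Q table is like the agent's brain: the more it is updated, the more capable the agent becomes. Each round of exploration is called an episode. In each episode the agent goes from an arbitrary initial state to the target state; when it reaches the target state, the current episode ends and the next one begins.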

The complete Q-learning algorithm is given below.
- step 1: choose the parameter $\gamma$ and the reward matrix R
- step 2: set Q := 0
- step 3: for each episode:
    - 3.1 randomly select an initial state s
    - 3.2 while the target state has not been reached, repeat the following steps
        - (1) select an action a from all possible actions in the current state s
        - (2) take action a to reach the next state $\tilde{s}$
        - (3) compute Q(s, a) using the update formula above
        - (4) set $s := \tilde{s}$
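The steps above can be sketched in a few lines of Python for the house example. This is a minimal illustration rather than a reference implementation: the reward matrix is reconstructed from the text, and the episode count, the random seed, and the helper names `train` and `greedy_path` are assumptions made here.

```python
import numpy as np

# Reward matrix of the house example, reconstructed from the text (assumption):
# -1 = no door, 0 = ordinary door, 100 = door leading into goal room 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GOAL = 5


def train(R, gamma=0.8, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros(R.shape)                                   # step 2
    for _ in range(episodes):                               # step 3
        s = int(rng.integers(R.shape[0]))                   # 3.1 random initial state
        while s != GOAL:                                    # 3.2 until the goal is reached
            a = int(rng.choice(np.flatnonzero(R[s] >= 0)))  # (1) random valid action
            s_next = a                                      # (2) action "go to room a" leads to room a
            # (3) update formula: immediate reward plus discounted best next value
            Q[s, a] = R[s, a] + gamma * Q[s_next, R[s_next] >= 0].max()
            s = s_next                                      # (4)
    return Q


def greedy_path(Q, start):
    """Follow the highest Q value from `start` until the goal (or a step limit)."""
    path = [start]
    while path[-1] != GOAL and len(path) <= Q.shape[0]:
        path.append(int(np.argmax(Q[path[-1]])))
    return path


Q = train(R)
print(np.round(Q, 1))
print(greedy_path(Q, 2))
```

Because the loop stops as soon as the goal is reached, the row of `Q` for room 5 is never updated in this sketch and stays at zero.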


### Single step demo
To better understand Q-learning, we walk through a single update step.

First we choose $\gamma = 0.8$, take state 1 as the initial state, and initialize Q to the zero matrix.

![](https://ws2.sinaimg.cn/large/006tNc79ly1fn71t3h3wnj306u053jrf.jpg)


![](https://ws2.sinaimg.cn/large/006tNc79ly1fn71o8jlinj307t055wek.jpg)

Because the agent is in state 1, we look at the second row of the R matrix. Negative entries mark illegal actions, so there are only two possible next states: state 3 or state 5. Suppose the random choice is to go to state 5.

What happens when we get to state 5? The sixth row of the R matrix shows three possible actions: go to state 1, 4, or 5. According to the update formula above, we have

$$Q(1, 5) = R(1, 5) + 0.8 \times \max\{Q(5, 1), Q(5, 4), Q(5, 5)\} = 100 + 0.8 \times \max\{0, 0, 0\} = 100$$

The Q matrix has now been updated as follows.

![](https://ws2.sinaimg.cn/large/006tNc79ly1fn8182u6xlj306y04mmx6.jpg)

The state now changes from 1 to 5. Because 5 is the target state, this episode is complete and we move on to the next one.

In the next episode, we again pick a random initial state and update the Q matrix. After many episodes the Q matrix converges, and the agent has then learned the optimal path from any state to the target state.
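As an illustration of such a later step, suppose the next episode starts in state 3 and the agent picks the action of going to state 1; this particular choice is an assumption made here for the example. State 1 offers the actions of going to state 3 or state 5, and $Q(1, 5) = 100$ from the first episode, so the same formula gives

$$Q(3, 1) = R(3, 1) + 0.8 \times \max\{Q(1, 3), Q(1, 5)\} = 0 + 0.8 \times \max\{0, 100\} = 80$$

The reward of 100 at the goal has propagated one step back, discounted by $\gamma$. Repeating this over many episodes is how values spread from the goal to every other state.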


The principle above centers on the update formula of Q-learning, which is also known as the Bellman equation. By applying it repeatedly, we keep updating the Q matrix until it converges.

Below we implement this process in code.

We define a simple maze, shown here:

![](https://ws1.sinaimg.cn/large/006tNc79ly1fn82ja4dkwj308d08d3yj.jpg)

The agent starts at random in state 0, state 1, or state 2, and we want it to reach state 3 to get the treasure. The feasible moves are marked with arrows in the figure above.


In [1]:
import numpy as np
import random

The following defines the reward matrix. It has 4 rows and 5 columns: each row corresponds to one state, from state 0 to state 3, and each column corresponds to one of five actions: up, down, left, right, and stay. A 0 in the reward matrix marks an infeasible move. Take the first row as an example: up and left are infeasible, so both are 0; going down leads into the trap, so the reward is -10; going right or staying gives a reward of -1, because the agent neither triggers the trap nor reaches the treasure but still wastes time.


In [2]:
reward = np.array([[0, -10, 0, -1, -1],
                   [0, 10, -1, 0, -1],
                   [-1, 0, 0, 10, -10],
                   [-1, 0, -10, 0, 10]])

Next, define a Q matrix initialized to zeros.


In [3]:
q_matrix = np.zeros((4, 5))

Then define a transition matrix, which records the state reached from each state after taking each action. Because the numbers of states and actions here are finite, we can simply store them all. Take the first row, which represents state 0: up and left are infeasible, so their entries are -1; going down leads to state 2, so the second entry is 2; going right leads to state 1, so the fourth entry is 1; staying keeps the agent in state 0, so the last entry is 0. The other rows are built the same way.


In [7]:
transition_matrix = np.array([[-1, 2, -1, 1, 0],
                              [-1, 3, 0, -1, 1],
                              [0, -1, -1, 3, 2],
                              [1, -1, 2, -1, 3]])
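Together, the reward matrix and the transition matrix fully describe this small environment. As a sketch of how they can be used, the hypothetical helper `step` below (its name, its return format, and the constant `GOAL_STATE` are assumptions introduced here) looks up what happens when an action is taken in a state.

```python
import numpy as np

reward = np.array([[0, -10, 0, -1, -1],
                   [0, 10, -1, 0, -1],
                   [-1, 0, 0, 10, -10],
                   [-1, 0, -10, 0, 10]])
transition_matrix = np.array([[-1, 2, -1, 1, 0],
                              [-1, 3, 0, -1, 1],
                              [0, -1, -1, 3, 2],
                              [1, -1, 2, -1, 3]])
GOAL_STATE = 3  # the treasure


def step(state, action):
    """Hypothetical helper: apply `action` in `state`.

    Returns (next_state, reward, done); raises ValueError for an infeasible move.
    """
    next_state = int(transition_matrix[state][action])
    if next_state == -1:
        raise ValueError('action {} is not feasible in state {}'.format(action, state))
    return next_state, int(reward[state][action]), next_state == GOAL_STATE


print(step(0, 3))  # move right from state 0
print(step(1, 1))  # move down from state 1
```

Checking the transition matrix rather than the reward matrix for feasibility avoids confusing an infeasible move with a feasible one that merely happens to have a reward of 0.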

Finally, define the valid actions for each state. For example, the valid actions in state 0 are down, right, and stay, which correspond to 1, 3, and 4.


In [8]:
valid_actions = np.array([[1, 3, 4],
                          [1, 2, 4],
                          [0, 3, 4],
                          [0, 2, 4]])
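The table of valid actions does not have to be typed by hand, since it is implied by the transition matrix: an action is valid exactly when its transition entry is not -1. The following sketch derives it that way; the name `derived_valid_actions` is introduced here for illustration.

```python
import numpy as np

transition_matrix = np.array([[-1, 2, -1, 1, 0],
                              [-1, 3, 0, -1, 1],
                              [0, -1, -1, 3, 2],
                              [1, -1, 2, -1, 3]])

# For each state, keep the indices of the actions whose target state is not -1.
# Every state here has exactly three valid actions, so the result is a 4 x 3 array.
derived_valid_actions = np.array([np.flatnonzero(row >= 0) for row in transition_matrix])

print(derived_valid_actions)
```

Deriving one table from the other keeps the two from drifting apart if the maze is changed later.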

In [9]:
# Define gamma in bellman equation
gamma = 0.8

Finally, let the agent interact with the environment and repeatedly apply the Bellman equation to update the Q matrix. We run 10 episodes.


In [10]:
for i in range(10):
    start_state = np.random.choice([0, 1, 2], size=1)[0] # random initial state
    current_state = start_state
    while current_state != 3: # loop until the goal state is reached
        action = random.choice(valid_actions[current_state]) # randomly select a valid action in the current state
        next_state = transition_matrix[current_state][action] # next state reached by the selected action
        future_rewards = []
        for action_nxt in valid_actions[next_state]:
            future_rewards.append(q_matrix[next_state][action_nxt]) # Q values of all valid actions in the next state
        q_state = reward[current_state][action] + gamma * max(future_rewards) # Bellman equation
        q_matrix[current_state][action] = q_state # update the Q matrix
        current_state = next_state # the next state becomes the current state

    print('episode: {}, q matrix: \n{}'.format(i, q_matrix))
    print()

episode: 0, q matrix: 
[[  0.   0.   0.  -1.  -1.]
 [  0.  10.  -1.   0.  -1.]
 [  0.   0.   0.   0.   0.]
 [  0.   0.   0.   0.   0.]]

episode: 1, q matrix: 
[[  0.   0.   0.  -1.  -1.]
 [  0.  10.  -1.   0.  -1.]
 [  0.   0.   0.  10.   0.]
 [  0.   0.   0.   0.   0.]]

episode: 2, q matrix: 
[[  0.   -2.    0.    7.    4.6]
 [  0.   10.    4.6   0.    7. ]
 [ -1.8   0.    0.   10.   -2. ]
 [  0.    0.    0.    0.    0. ]]

episode: 3, q matrix: 
[[  0.   -2.    0.    7.    4.6]
 [  0.   10.    4.6   0.    7. ]
 [  4.6   0.    0.   10.   -2. ]
 [  0.    0.    0.    0.    0. ]]

episode: 4, q matrix: 
[[  0.   -2.    0.    7.    4.6]
 [  0.   10.    4.6   0.    7. ]
 [  4.6   0.    0.   10.   -2. ]
 [  0.    0.    0.    0.    0. ]]

episode: 5, q matrix: 
[[  0.   -2.    0.    7.    4.6]
 [  0.   10.    4.6   0.    7. ]
 [  4.6   0.    0.   10.   -2. ]
 [  0.    0.    0.    0.    0. ]]

episode: 6, q matrix: 
[[  0.   -2.    0.    7.    4.6]
 [  0.   10.    4.6   0.    7. ]
 [  4.6  

As the output shows, after the first episode the agent has already learned that going down in state 1 earns a reward. After 10 episodes of continued learning, the agent knows that going right in state 0, going down in state 1, and going right in state 2 give the highest values, so from any state in this environment it knows how to reach the treasure as quickly as possible.
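The best action for each state can be read off the Q matrix programmatically. The sketch below copies the Q matrix from the printed output above, which no longer changes from episode 3 onward; the names `ACTION_NAMES` and `policy` are assumptions introduced here.

```python
import numpy as np

# Q matrix copied from the printed output above (episode 3 onward).
q_matrix = np.array([[0.0, -2.0, 0.0, 7.0, 4.6],
                     [0.0, 10.0, 4.6, 0.0, 7.0],
                     [4.6, 0.0, 0.0, 10.0, -2.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0]])
valid_actions = np.array([[1, 3, 4],
                          [1, 2, 4],
                          [0, 3, 4],
                          [0, 2, 4]])
ACTION_NAMES = ['up', 'down', 'left', 'right', 'stay']  # column order of the matrices

policy = {}
for state in range(3):  # state 3 is the goal, so no action is needed there
    candidates = valid_actions[state]
    # Compare only valid actions: infeasible entries stay at 0 and must be ignored.
    best = candidates[np.argmax(q_matrix[state][candidates])]
    policy[state] = ACTION_NAMES[best]

print(policy)
```

Restricting the comparison to valid actions matters because an untouched 0 for an infeasible move could otherwise beat a valid action whose value is negative.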

This example is only a simple demonstration of Q-learning, yet building the whole environment by hand is already tedious. Third-party libraries can build learning environments for us; the best known is OpenAI's gym module, which we introduce in the next chapter.
