# Q-Learner

**Overview**  
Q-learning is a reinforcement learning technique whose goal is to learn the optimal policy. It is model-free, meaning the agent does not need a model of the environment (its transitions and rewards) ahead of time; it learns from the transitions it experiences. Q-learning involves an agent and an environment. The environment consists of a set of states S and a set of actions A that are allowed in each state. The agent resides in a state and performs an action. The environment processes that action and returns the new state (which could be the same state) and the reward for that action. The goal of the agent is to find the policy that collects the largest total reward. Just like with the MDPs in the previous notebook, the agent estimates the maximum future reward.  

**Learning Rate**: The learning rate, or $\alpha$, is a value between 0 and 1. It determines how aggressively you update the table value. With a learning rate near 0 you barely update the table value. With a learning rate near 1 you almost replace the current value with the newly calculated value. Basically, you ignore most of what you had learned in favor of what you picked up this time. A typical value is around 0.1.
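
To make the effect of the learning rate concrete, here is a minimal sketch. The helper name `blend` and the numbers are made up for illustration; they are not from the notebook's solution file.

```python
def blend(old_value, target, alpha):
    """Move old_value a fraction alpha of the way toward target."""
    return old_value + alpha * (target - old_value)

# Illustrative numbers only: the table holds 2.0 and the new estimate is 10.0.
old_value, target = 2.0, 10.0
for alpha in (0.0, 0.1, 0.5, 1.0):
    # alpha = 0 keeps the old value, alpha = 1 replaces it with the target.
    print(alpha, blend(old_value, target, alpha))
```

With these numbers, a learning rate of 0.1 moves the stored value only a tenth of the way toward the new estimate, while 1.0 throws the old value away entirely.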

**Explore/Exploit**: Imagine a baby that is trying to learn how to accomplish a task. At first they are just a tornado of arms and legs flailing about. But, given enough time they figure out what they need to accomplish. The flailing about is the exploring part and the part where they know what to do is exploiting their knowledge.  
  
A major decision for the Q-learning agent is whether to explore the environment or exploit what it already knows. Initially, everything is exploration because the agent hasn't learned anything about the environment (in most implementations this is done by setting the initial values of all the actions within each state to random numbers). When the agent starts training you need to decide how often it takes a random action and how often it takes the action it currently believes is best. If you explore too much, the agent's behavior stays close to random and it never gets to use what it has learned; if you exploit too much, the agent may never find the optimal solution. This strategy is called *epsilon-greedy*, where epsilon is the fraction of the time the agent chooses to explore. A common choice for the explore rate is around 10%, though the best value depends on the problem. Some algorithms decay this value over time to take advantage of what the agent has learned during training.  
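
Epsilon-greedy action selection can be sketched in a few lines. This is one possible version, not the notebook's own implementation; the function names, the decay schedule, and its `rate` and `floor` parameters are assumptions chosen for illustration.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Return an action index for one row of Q values.

    With probability epsilon pick a random action (explore);
    otherwise pick the action with the largest Q value (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def decay(epsilon, rate=0.99, floor=0.01):
    """Hypothetical schedule: shrink epsilon each episode, never below floor."""
    return max(floor, epsilon * rate)

rng = random.Random(0)       # seeded so the run is repeatable
q_row = [0.4, 0.5, 0.6]      # illustrative Q values for one state
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(1000)]
greedy_share = picks.count(2) / len(picks)
print(f"greedy action chosen {greedy_share:.0%} of the time")
```

Note that the greedy action is picked somewhat more than 90% of the time, because a random exploratory pick can also land on the best action.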

**Discount Factor**: As stated in the MDP section, the discount factor is between 0 and 1. It determines how much credit you give to future rewards. You need to balance your immediate rewards against your future rewards. The higher the discount factor, the further into the future the agent looks when valuing this state/action pair. I will repeat the same math here as before to help this make sense. If your discount factor is 0.8 and after 4 steps you get a reward of 5, the present value of that reward is $0.8^4 * 5$ or ~2. If you change the discount factor to 0.9 that value becomes ~3.3. With 0.1 it turns into 0.0005.
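
The arithmetic can be checked with a small sketch that follows the formula $0.8^4 * 5$ from the paragraph above (a reward of 5 discounted over 4 steps). The function name `present_value` is an illustrative choice.

```python
def present_value(reward, gamma, steps):
    """Discount a reward that arrives `steps` steps in the future."""
    return (gamma ** steps) * reward

for gamma in (0.8, 0.9, 0.1):
    # Same reward and delay each time; only the discount factor changes.
    print(gamma, round(present_value(5, gamma, 4), 4))
```

The three results come out to roughly 2.05, 3.28, and 0.0005, which shows how quickly a small discount factor makes future rewards irrelevant.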
  
**Q Table**: In a standard Q-learning algorithm the agent holds a Q table that it uses to determine the ideal action for each state. This table is S x A in size. For each state we store the expected future reward (the Q value) for each action. Typically, this is done as a 2-dimensional array, but you can use other data structures. Also, in deep Q-learning the Q table is replaced by a neural network.  
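
As a rough sketch of the two layouts mentioned above, the table can be held as a list of rows or as a dictionary keyed by (state, action). The state and action labels, the small random initial values, and the helper `best_action` are assumptions for illustration.

```python
import random

states = ["S1", "S2", "S3", "S4"]   # hypothetical state labels
actions = ["A1", "A2", "A3"]        # hypothetical action labels

rng = random.Random(0)
# 2-D layout: one row per state, one column per action,
# initialized to small random numbers.
q_table = [[rng.uniform(0.0, 0.01) for _ in actions] for _ in states]

# Alternative layout: a dictionary keyed by (state, action).
q_dict = {
    (s, a): q_table[i][j]
    for i, s in enumerate(states)
    for j, a in enumerate(actions)
}

def best_action(table, state_index):
    """Return the column index of the largest Q value in one row."""
    row = table[state_index]
    return max(range(len(row)), key=row.__getitem__)

print(actions[best_action(q_table, 1)])  # current best action in S2
```

The dictionary form is convenient when states are not easily numbered; the 2-D form is more compact when they are.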

**Algorithm**  
$$Q'(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha * \big( r_t + \gamma * \max_a Q(s_{t+1},a) - Q(s_t,a_t) \big) $$  
  
$\alpha$: The learning rate. As described above, this determines how much you change your Q value.  
$Q(s_t,a_t)$: The old Q value from the table. It appears twice in the update: once as the starting point and once subtracted inside the parentheses.  
$r_t$: The reward you receive for taking action $a_t$ in state $s_t$.  
$\gamma$: The discount factor, the same as in the MDP section.  
$\max_a Q(s_{t+1},a)$: The largest Q value available from state $s_{t+1}$, which is the current estimate of the maximum future reward.
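
The update rule translates almost directly into code. The sketch below assumes the Q table is a list of rows indexed by integer state and action; the function name `q_update` and the tiny two-state table are made up for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new value."""
    old_value = q_table[state][action]
    best_next = max(q_table[next_state])      # max_a Q(s_{t+1}, a)
    target = reward + gamma * best_next       # r_t + gamma * max_a Q(...)
    q_table[state][action] = old_value + alpha * (target - old_value)
    return q_table[state][action]

# Illustrative table: 2 states, 2 actions.
q = [[0.0, 0.0],
     [1.0, 2.0]]
new_value = q_update(q, state=0, action=1, reward=1.0,
                     next_state=1, alpha=0.5, gamma=0.9)
print(new_value)
```

Only the entry for the state/action pair that was just taken changes; the row for the next state is read but not modified.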

---
**Question 1**  
A simple question to make sure you understand where all the pieces fit together.  
First, the hyperparameters.  
&nbsp;&nbsp;&nbsp;&nbsp;Your learning rate ($\alpha$) is 0.1.  
&nbsp;&nbsp;&nbsp;&nbsp;Your discount factor ($\gamma$) is 0.8.  
You are in state S1, take action A1, move to state S2, and receive a reward of 5.  
What is the new Q[S1,A1] value, assuming the following Q table?  

| |A1|A2|A3|  
|----------|----------|----------|---------|  
|S1|0.1|0.2|0.3|
|S2|0.4|0.5|0.6|  
|S3|0.7|0.8|0.9|  
|S4|0.11|0.12|0.13|  

In [2]:
from qlearning import QLQuestion1 #Import solution file
QLQuestion1(0.1) #Pass in your answer as a number

Correct
  The old value is 0.1
  The learning rate (alpha) is 0.1
  The reward is 5
  The discount factor is 0.8
  The max action from state S2 is A3 with a value of 0.6
  0.1 + 0.1 * (5 + 0.8 * 0.6 - 0.1)
  0.1 + 0.1 * (5 + 0.48 - 0.1)
  0.1 + 0.1 * (5.38)
  0.1 + 0.538
  0.638


One thing I wanted to point out is how to find $\max_a Q(s_{t+1},a)$ in the previous question, as it was hard for me to figure out until I did a few examples. In our problem we are looking at $\max_a Q(S2,a)$. When we look at state S2 we see three values: 0.4, 0.5, and 0.6. The largest, 0.6, is under action A3. Note that $\max_a$ returns the value 0.6, not the action A3. We don't care which column it comes from or which action would be taken. We just need the maximum expected value.
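
The difference between the maximum value and the action that produces it is easy to see in code. This sketch uses the S2 row and the hyperparameters from Question 1; the variable names are illustrative.

```python
# Row S2 of the Question 1 table.
q_s2 = {"A1": 0.4, "A2": 0.5, "A3": 0.6}

max_value = max(q_s2.values())          # what the update rule needs
best_action = max(q_s2, key=q_s2.get)   # which action it came from (not needed)

# Plug the value into the Question 1 update.
alpha, gamma, reward, old_q = 0.1, 0.8, 5, 0.1
new_q = old_q + alpha * (reward + gamma * max_value - old_q)
print(max_value, best_action, round(new_q, 3))
```

Only `max_value` enters the update. The agent is free to take a different action when it actually reaches S2.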

**Discrete/Continuous Environments**

A discrete environment is one with a finite, countable number of states. This can be anything from 1 up to a manageable amount. If you need a supercomputer to store your Q table, the environment is too large to treat as discrete in practice. A continuous environment is more like the real world. You wouldn't be able to list all the states in a simple action like walking across the room, let alone in a complex simulation.  

If you still want to use a Q-learner in a continuous state space you have to use something called *discretization*. This is where you group the continuous states into discrete ranges. If you have seen a histogram you have seen discretization. When you break up the data into bins you are essentially making all the data fit into a discrete number of bars.  

The simplest way to discretize a continuous space is to split the states into buckets. If your environment has 1 million states you could group them into buckets of 10,000 states each, leaving 100 discrete states for the Q table.  
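
Here is a minimal sketch of equal-width bucketing. It reads the example above as buckets of 10,000 states each; the function names and the 100-unit range in the second example are assumptions for illustration.

```python
def bucket_of_state(state_id, bucket_size=10_000):
    """Map an integer state id to the index of the bucket that contains it."""
    return state_id // bucket_size

def to_bucket(value, low, high, n_buckets):
    """Map a continuous value in [low, high] to a bucket index 0..n_buckets-1."""
    if not low <= value <= high:
        raise ValueError("value outside the expected range")
    width = (high - low) / n_buckets
    index = int((value - low) / width)
    return min(index, n_buckets - 1)   # value == high falls in the last bucket

# 1 million numbered states collapse to bucket indices 0..99.
print(bucket_of_state(0), bucket_of_state(999_999))

# A position along a hypothetical 100-unit range, split into 4 buckets.
print(to_bucket(37.5, 0.0, 100.0, 4))
```

The bucket index then takes the place of the state when indexing the Q table. The cost is that every state inside a bucket is treated as identical, which is where discretization starts to show its limits.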

If you have a more complex environment you could break them into buckets based on how similar their task is to complete. For example, in soccer, if you are on the left side of the field you need to go to the right to get to the goal. If you are on the right you need to go left to get to the goal. You could split that into 2 states. Now, granted, that is terrible and would never work but you get the idea.  

To get beyond discretization in continuous spaces and foreshadow future notebooks, you can use function approximation. Function approximation is something that is actually named correctly. You are asking for a function that will approximate the target function. In this world, Neural Networks are an example. We will see them when we get into deep learning.
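
To give a feel for the idea before neural networks appear, the sketch below swaps in something much simpler: a linear model over hand-picked features, trained with plain gradient steps toward a target value. Everything here (the feature choice, the function names, the learning rate, the target of 2.0) is a made-up illustration, not the approach the later notebooks use.

```python
def features(state):
    """Hypothetical feature vector for a 1-D continuous state."""
    return [1.0, state, state * state]

def predict(weights, state):
    """Approximate value: a weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features(state)))

def sgd_step(weights, state, target, lr=0.1):
    """Nudge the weights so predict(weights, state) moves toward target."""
    error = target - predict(weights, state)
    return [w + lr * error * x for w, x in zip(weights, features(state))]

weights = [0.0, 0.0, 0.0]
for _ in range(100):
    # Repeatedly fit one (state, target) pair; a real agent would see many.
    weights = sgd_step(weights, state=0.5, target=2.0)

print(round(predict(weights, 0.5), 3))
```

Instead of one table entry per state, the agent stores three weights and can produce a value for any state, including ones it has never visited.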

In [None]:
# Possible graph numbers 1 - 100 where there are 4 distinct groups to show buckets based on their similarity

**Continuous Environment**

Explain this section. Use discretization and then show its limitations

In [None]:
# Create a problem that shows discretization. Possibly have it generate numbers and show the bins