# Section 1: Reinforcement Learning Basics

## Decision Making

In psychology, decision-making is regarded as the cognitive process that results in the selection of a belief or a course of action among several alternatives. In economics, it is the act of deciding on economic matters given certain indicators. In neuroscience, studies show that several brain regions work together during the decision-making process, and this work has been extended to the study of addiction and other problems of self-control. In statistics, decisions are made on the basis of observations of a phenomenon that obeys probabilistic laws that are not completely known.

All of this is to say that decision-making runs through every part of our lives. We take in inputs, do some processing, and output a decision. In some situations we have a full working model of our environment; in others we have to rely on probabilities. By drawing on all of these areas of study, we can attempt to recreate how humans make decisions and transfer that to artificial intelligence.

In this course, I will work through one type of decision-making algorithm called Q-learning. Q-learning is a type of reinforcement learning in which an agent receives rewards for its actions and uses them to learn a policy for navigating an environment. In a finite Markov Decision Process (MDP), Q-learning converges to an optimal policy provided every state-action pair keeps being visited (infinite exploration) and the learning rate is decayed appropriately.
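
For orientation, the standard tabular Q-learning update can be written as below. The symbols $\alpha$ (learning rate), $r$ (reward received), and $s'$ (next state) are notation introduced here for illustration, and $\gamma$ is the discount factor defined in the MDP section of this notebook.

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The bracketed term is the gap between the current estimate $Q(s, a)$ and a one-step target built from the reward and the best estimated value of the next state.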

After Q-learning, I will go through some enhancements. The first is Double Q-learning, which was introduced by Hado van Hasselt in 2010. Double Q-learning keeps a second Q estimator: one estimator selects the next action and the other evaluates it. The second estimator is used because, in a noisy environment, taking the maximum over noisy action values tends to overestimate them, which slows learning.
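
The overestimation problem can be seen without any learning at all. The sketch below is a toy experiment, not van Hasselt's algorithm: it assumes ten actions whose true values are all 0 and adds Gaussian noise to each estimate. The function names and all parameter values are made up for illustration.

```python
import random


def mean_max_of_noisy_estimates(n_actions=10, n_trials=10_000, noise_std=1.0, seed=0):
    """Average of max(estimates) when every true action value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)  # select and evaluate with the same noisy estimates
    return total / n_trials


def mean_double_estimate(n_actions=10, n_trials=10_000, noise_std=1.0, seed=0):
    """Select the action with one set of estimates, evaluate it with an independent set."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        select = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        evaluate = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        best = max(range(n_actions), key=lambda a: select[a])
        total += evaluate[best]
    return total / n_trials


single = mean_max_of_noisy_estimates()
double = mean_double_estimate()
print(f"true best value: 0.0 | single estimator: {single:.2f} | double estimator: {double:.2f}")
```

The single-estimator average comes out well above the true value of 0, while the version that evaluates with an independent estimate stays close to 0.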

After Double Q-learning, I will cover the Deep Q-network (DQN), which was introduced by Google DeepMind in 2015. DQN replaces the internal Q table with a neural network that approximates the Q function. This allows a much larger (even continuous) state space.

Finally, I will cover Double DQN, which was introduced by Google DeepMind in 2016. It builds on DQN but uses a second network (DQN's target network) to evaluate the next action chosen by the main network.

Welcome to the ride!

## Markov Decision Process

A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making. It is a discrete-time (distinct points in time) stochastic (randomly determined) control process.

MDPs are made up of 4 parts:  
S: Finite set of states (Ex: s<sub>1</sub>, s<sub>2</sub> ... s<sub>N</sub>)  
A: Finite set of actions (Ex: North, South, East, West)  
P<sub>a</sub>(s,s'): Probability that action a in state s at time t will lead to state s' at time t + 1  
R<sub>a</sub>(s,s'): Immediate reward received after moving from state s to state s' by action a  
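
The four parts above map naturally onto a small data structure. The following is a minimal sketch of one possible layout; the class name `MDP`, its field names, and the two-state example are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: List[State]    # S
    actions: List[Action]  # A
    # transitions[(s, a)][s_next] is P_a(s, s_next)
    transitions: Dict[Tuple[State, Action], Dict[State, float]]
    # rewards[(s, a, s_next)] is R_a(s, s_next); missing entries count as 0
    rewards: Dict[Tuple[State, Action, State], float]

    def check_probabilities(self, tol: float = 1e-9) -> bool:
        """Every (state, action) pair must have next-state probabilities summing to 1."""
        return all(
            abs(sum(dist.values()) - 1.0) <= tol
            for dist in self.transitions.values()
        )


# Hypothetical two-state example
toy = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    transitions={
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "go"): {"s1": 0.2, "s2": 0.8},
        ("s2", "stay"): {"s2": 1.0},
        ("s2", "go"): {"s1": 0.8, "s2": 0.2},
    },
    rewards={("s1", "go", "s2"): 1.0},
)
print(toy.check_probabilities())
```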

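Some sources include $\gamma$ as a fifth part, but I prefer to list it separately because it is independent of the others.  
$\gamma$: The discount factor, between 0 (inclusive) and 1 (exclusive), which controls how much future rewards count relative to immediate ones  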
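
To make the role of $\gamma$ concrete, here is a small sketch that computes the discounted sum of a reward sequence, $r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$. The function name `discounted_return` and the sample rewards are illustrative choices.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t, accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], 0.0))  # only the immediate reward counts: 1.0
```

With $\gamma = 0$ the agent only cares about the immediate reward; as $\gamma$ approaches 1, later rewards count almost as much as early ones.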

An MDP is a collection of states, each with a set of available actions. Each action yields a reward (which can be 0). The solution to an MDP is a policy ($\pi$), a mapping from each state to the action to take there. The optimal policy picks the action at each state that maximizes the expected cumulative (discounted) reward.
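
Written out, and assuming the discounted infinite-horizon objective (one common choice), the optimal policy $\pi^*$ is the one that maximizes the expected discounted sum of rewards. The time-indexed symbols $s_t$ and $a_t$ are notation introduced here for the state and action at step $t$.

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \, R_{a_t}(s_t, s_{t+1}) \right], \qquad a_t = \pi(s_t)$$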

A real-world example would be an inventory control system. Your states would be the number of items you have in stock. Your actions would be the number of items to order. The discrete time steps would be the days of the month. The reward would be the profit.  
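
As a rough sketch of how the inventory example could be written down, the code below lists the possible outcomes of one day for a given stock level and order size. Every number in it (capacity, demand distribution, price, unit cost) is an assumption made up for illustration, as are the function names.

```python
MAX_STOCK = 3                            # assumed warehouse capacity
DEMAND_PROBS = {0: 0.3, 1: 0.5, 2: 0.2}  # assumed daily demand distribution
PRICE = 5.0                              # assumed sale price per item
UNIT_COST = 2.0                          # assumed purchase cost per item

states = list(range(MAX_STOCK + 1))      # items in stock: 0..MAX_STOCK


def actions(stock):
    """Order sizes that do not exceed the warehouse capacity."""
    return list(range(MAX_STOCK - stock + 1))


def outcomes(stock, order):
    """List of (probability, next_stock, profit) for one day."""
    available = stock + order
    result = []
    for demand, prob in DEMAND_PROBS.items():
        sold = min(available, demand)
        profit = PRICE * sold - UNIT_COST * order
        result.append((prob, available - sold, profit))
    return result


for prob, next_stock, profit in outcomes(stock=1, order=1):
    print(f"p={prob}: next stock {next_stock}, profit {profit}")
```

Each tuple corresponds to one entry of the transition probabilities and rewards: the probability of landing in `next_stock` and the profit earned on the way.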

A major drawback of MDPs is the "Curse of Dimensionality": the number of states grows exponentially with the number of state variables, so the more states and actions you have, the more computationally difficult the MDP is to solve.  
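
A quick back-of-the-envelope calculation shows how fast this grows. The sketch assumes an inventory with several products, each tracked at 10 possible stock levels; both numbers are made up for illustration.

```python
def num_states(levels_per_product, num_products):
    """Each product's stock level is independent, so the joint state count multiplies."""
    return levels_per_product ** num_products


for products in (1, 2, 5, 10):
    print(f"{products} product(s): {num_states(10, products):,} states")
```

Going from 1 product to 10 takes the state count from 10 to 10 billion, and a tabular method needs at least one entry per state.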
  
---  
For the first question of the notebook, I will give a quick example of a discrete MDP and ask you to put the definitions above into practice.

**Question 1**: Given the following deterministic MDP (if you select North, you will move North), what is the optimal policy (the path with the most points)?  
  
*Notes*:  
  * The number in the box is the reward  
  * Assume there is a negative reward for each time step, so you can't just sit on a single cell and collect rewards  
  * Once you hit the end you are done. (Absorbing state)  
  * S is the starting point  
  * F is the ending point  
  * Use N for North, E for East, S for South, and W for West  
  * Pass the directions as a single string. Ex: NESW will trace a loop back to where it started  
  
  

| | | |
|----------|----------|---------|
|S|1|1|
|1 |0|1|  
|-1|-1|0|  
|0 |0|F|


In [1]:
# Code up an MDP. If I get frisky maybe the inventory control system above
from basic import MDPQuestion1  # Import the solution file
MDPQuestion1('EESSS')  # Enter the string of directions

Correct
  SENESSS took too many steps
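
The contents of `basic.MDPQuestion1` are not shown in this notebook, so the following is only a sketch of how a path through the grid above could be scored by hand. The function `score_path`, the `step_penalty` value, and the choice to give the S and F cells a reward of 0 are all assumptions made for illustration.

```python
GRID = [
    [0, 1, 1],    # S is the top-left cell (assumed reward 0)
    [1, 0, 1],
    [-1, -1, 0],
    [0, 0, 0],    # F is the bottom-right cell (assumed reward 0)
]
START, FINISH = (0, 0), (3, 2)
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def score_path(path, step_penalty=1.0):
    """Return (total reward, reached_finish) for a string of N/E/S/W moves."""
    row, col = START
    total = 0.0
    for move in path:
        d_row, d_col = MOVES[move]
        row, col = row + d_row, col + d_col
        if not (0 <= row < len(GRID) and 0 <= col < len(GRID[0])):
            raise ValueError(f"move {move!r} leaves the grid")
        total += GRID[row][col] - step_penalty  # cell reward minus the time penalty
        if (row, col) == FINISH:
            return total, True  # absorbing state: the episode ends here
    return total, False


print(score_path("EESSS"))    # (-2.0, True)
print(score_path("SENESSS"))  # (-3.0, True)
```

With no time penalty the longer path collects more points (4 versus 3), which is why the per-step penalty matters: under the assumed penalty of 1 per step, the shorter path comes out ahead.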


## Policy Iteration

Talk about policy iteration and have a VERY WELL documented equation

In [None]:
# Code up a policy iteration problem

## Value Iteration

Talk about value iteration. Compare this with policy iteration

In [None]:
# Code up a value iteration problem that matches PI

## Policy Iteration versus Value Iteration

Write about the speed/iterations and the comparisons of each

In [None]:
# Code!!!

## Deterministic and Stochastic Movements

Talk about the differences and have a few practice problems. Maybe have stochastic and see if the student can guess where the action will take them

---
### Further Reading
**Double Q-Learning**  
van Hasselt, H. (2010). Double Q-learning. Advances in Neural Information Processing Systems, 23, 2613-2621. Retrieved from http://papers.nips.cc/paper/3964-double-q-learning.pdf

**DQN**  
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

**Double DQN**  
van Hasselt, H., Guez, A., & Silver, D. (2016, February). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
