# Model-Free vs. Model-Based Reinforcement Learning

A **Markov Decision Process** (MDP) is defined by a 4-tuple: states, actions, a reward function $R(s, a)$, and a transition function $T(s'|s,a)$.
In reinforcement learning the agent does not know how the world will change in response to its actions (the transition function $T(s'|s,a)$), nor does it know the reward function, which prevents it from planning a solution directly.
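
To make the 4-tuple concrete, here is a minimal sketch of a tabular MDP in Python. The two-state world, its state and action names, and all the numbers are made up for illustration; the `step` function is the only thing a learning agent would get to call.

```python
import random

# Hypothetical two-state MDP used only for illustration.
STATES = ["cool", "hot"]
ACTIONS = ["slow", "fast"]

# TRANSITIONS[(s, a)] maps each next state s' to its probability T(s' | s, a).
TRANSITIONS = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# REWARDS[(s, a)] is the reward R(s, a).
REWARDS = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}

def step(state, action, rng=random):
    """Sample s' ~ T(. | s, a) and return (next_state, reward)."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist.keys())
    probs = list(dist.values())
    next_state = rng.choices(next_states, weights=probs, k=1)[0]
    return next_state, REWARDS[(state, action)]
```

An agent that could read `TRANSITIONS` and `REWARDS` could plan; an RL agent only sees the samples that `step` returns.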

In model-based reinforcement learning an agent must learn a model of how the environment works; this can be done using supervised learning approaches. Once the agent has adequately modeled the environment, it can use a planning algorithm with its learned model to find a policy. Solutions that follow this framework are model-based RL algorithms.
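
As a rough sketch of the "learn a model" step, the simplest supervised estimate of a transition function is a table of observed frequencies. The hidden dynamics in `true_step`, the action names and the sample count are assumptions chosen for this example only.

```python
import random
from collections import Counter, defaultdict

def true_step(state, action, rng):
    """Hidden environment dynamics (hypothetical): the agent cannot read this."""
    if action == "stay":
        return state
    # "move" flips the state 80% of the time.
    return 1 - state if rng.random() < 0.8 else state

def learn_transition_model(n_samples=5000, seed=0):
    """Estimate T(s' | s, a) from counts of observed transitions."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_samples):
        s = rng.choice([0, 1])
        a = rng.choice(["stay", "move"])
        counts[(s, a)][true_step(s, a, rng)] += 1
    model = {}
    for key, c in counts.items():
        total = sum(c.values())
        model[key] = {s_next: n / total for s_next, n in c.items()}
    return model

model = learn_transition_model()
```

The resulting `model` dictionary can then be handed to a planning algorithm in place of the true transition function.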

Q-learning, actor-critic, and policy search methods do not construct a model of the environment; instead they learn a policy (or a value function from which a policy is derived) directly from experience to infer the best action. These are model-free reinforcement learning algorithms.

## Online vs. Target

## On-policy vs. off-policy

**Off-policy**: A different policy is used to act and evaluate the quality of actions. 

With off-policy learning the agent will find the optimal policy independently of the policy used during exploration, but in the case of Q-learning only if every state-action pair is visited enough times. Off-policy methods tend to be slower to converge than on-policy methods.

**On-policy**: The same policy is used to act and evaluate the quality of actions. 

In on-policy learning we have a 'practice policy' that we develop and use to learn and explore the environment. This policy matures and becomes our optimal policy. Policy gradients are an example of this, where we develop our policy by exploring the environment.
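
One way to see the difference is to compare the update targets of Q-learning (off-policy) and SARSA (on-policy). This is a sketch with a hypothetical two-action Q table; the function names and the discount `gamma=0.9` are assumptions for the example.

```python
def q_learning_target(q, reward, next_state, actions, gamma=0.9):
    """Off-policy target: evaluates the greedy action, whatever the agent does next."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)

def sarsa_target(q, reward, next_state, next_action, gamma=0.9):
    """On-policy target: evaluates the action the acting policy actually took."""
    return reward + gamma * q[(next_state, next_action)]

# Hypothetical Q values for a single next state "s1".
q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}

# Suppose the exploring agent picked "left" in s1.
off = q_learning_target(q, reward=0.0, next_state="s1", actions=["left", "right"])
on = sarsa_target(q, reward=0.0, next_state="s1", next_action="left")
```

Here `off` uses the value of "right" (the greedy action) even though the agent went "left", while `on` uses the value of the action that was really taken.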

# Value-Based 
### Q-Learning

In Q-learning we have two important functions:

- **Action-value function:** Denoted as Q. In Q-learning the action-value function gives the value of taking an action from a given state, for all actions and states in the world.
- **State-value function:** Denoted as V. This function gives the value of every state in the world.
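
For reference, the two functions are linked by the standard textbook relations. The symbol $\pi(a|s)$ (the probability that the policy picks action $a$ in state $s$) and the $*$ marking the optimal functions are notation introduced here, not taken from the notes above:

$$V^{\pi}(s) = \sum_{a} \pi(a|s)\, Q^{\pi}(s, a), \qquad V^{*}(s) = \max_{a} Q^{*}(s, a)$$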

Initialize the Q function randomly. Start in a random state, s.
We follow our Q function to decide how to move, but we act randomly a fraction of the time, epsilon. This technique is called **epsilon-greedy** and is very popular because it balances exploration and exploitation with a single stochastic parameter, epsilon.

This is an off-policy algorithm.
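
The steps above can be sketched as tabular Q-learning on a tiny chain world. The environment, the learning rate `alpha`, the discount `gamma` and the episode count are all assumptions picked for the example, not values from any particular paper.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal (hypothetical chain world)
ACTIONS = [-1, +1]    # move left or right

def env_step(state, action):
    """Deterministic chain: reward 1 on reaching the goal, 0 otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(q, state, epsilon, rng):
    """With probability epsilon act randomly, otherwise follow the Q function."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize the Q function randomly (small values).
    q = {(s, a): rng.uniform(0, 0.01) for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = rng.randrange(N_STATES - 1)   # start in a random non-goal state
        done = False
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = env_step(state, action)
            # Off-policy target: bootstrap from the greedy action in next_state.
            future = 0.0 if done else gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = next_state
    return q

q = q_learning()
greedy_policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

After training, `greedy_policy` should point right (toward the goal) in every non-goal state, even though the agent acted randomly 10% of the time while learning.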

### DQN

### Double DQN

### SARSA (state-action-reward-state-action)

### Off-policy SARSA



# Policy-Based

### REINFORCE 

We use a neural net, or another function approximator, to directly model the action probabilities. Each time the agent interacts with the environment it generates data (State, Action, Reward, State'), and we tweak the parameters $\theta$ of the neural network so that 'good' actions are more likely to be sampled in the future. We repeat this process until the policy network converges to the optimal policy, $\pi^*$. We are trying to maximize the expected reward at time t, $E[R_t]$. The gradient is given by $\nabla_{\theta} E[R_t] = E[\nabla_{\theta}\log P(a)R_t]$.

Using the above to approximate the gradient is called **REINFORCE**. It is closely related to the **actor-critic** framework, in which the policy network acts as the actor and an estimate of how good the actions are acts as the critic; in plain REINFORCE the sampled reward $R_t$ plays the critic's role instead of a learned value function. This is an on-policy method.
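
A minimal sketch of the REINFORCE update, using a softmax over two logits instead of a neural network so that it runs with the standard library only. The two-armed bandit, its reward means, the learning rate and the step count are assumptions for illustration.

```python
import math
import random

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(steps=5000, lr=0.1, seed=0):
    """REINFORCE on a hypothetical 2-armed bandit; theta holds one logit per action."""
    rng = random.Random(seed)
    mean_reward = [0.2, 1.0]      # assumed: action 1 pays more on average
    theta = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = rng.gauss(mean_reward[action], 0.1)
        # For a softmax policy, d/d theta_k log P(a) = 1[k == a] - P(k).
        for k in range(2):
            grad_log_p = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * grad_log_p * reward   # gradient ascent on E[R_t]
    return softmax(theta)

final_probs = reinforce_bandit()
```

Each update is a single-sample estimate of $E[\nabla_{\theta}\log P(a)R_t]$; over many steps the policy shifts its probability mass toward the better-paying action.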

Policy gradients are believed to apply to a wider range of problems. A Q function can be too complex to learn, and value-based methods can be slower to converge than policy gradients, although policy gradients can converge to a local optimum. DQN can't learn stochastic policies; policy gradients can. Policy gradients are also easily applied to continuous action spaces.

The biggest drawback of policy gradients is the high variance in estimating the gradient. We estimate the gradient from a series of accumulated samples (state, action, reward, next state), so it is a Monte Carlo method and can be very noisy.

#### Variance reduction in policy gradients

Use a baseline function, B.

$\nabla_{\theta}E[R_t]= E[\nabla_{\theta}\log P(a)(R_t-B)]$

A baseline helps reduce variance. Without one, an action followed by a small positive $R_t$, which indicates a bad return, would still have its probability increased; subtracting $B$ makes the update for such returns negative.

We can use the value function $V(s)$ as the baseline. The value function is the expected total future reward from state $s$. As a result, only better-than-average actions get positive updates. Hence the term $(R_t-B)$ is often called the advantage function, and the corresponding method is called advantage actor-critic.
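
A small numerical experiment can illustrate the effect. The sketch below draws single-sample gradient estimates for one logit of a uniform two-action softmax policy, with and without a baseline. The reward means (10 and 11), the noise level and the baseline value 10.5 are assumptions chosen so that both rewards are large and close together.

```python
import random
import statistics

def gradient_samples(baseline, n=20000, seed=0):
    """Single-sample estimates of d/d theta_1 E[R] for a uniform 2-action softmax policy."""
    rng = random.Random(seed)
    probs = [0.5, 0.5]
    mean_reward = [10.0, 11.0]    # assumed rewards: both large, action 1 slightly better
    samples = []
    for _ in range(n):
        action = rng.choices([0, 1], weights=probs, k=1)[0]
        reward = rng.gauss(mean_reward[action], 1.0)
        # Gradient of log P(action) with respect to the logit of action 1.
        grad_log_p = (1.0 if action == 1 else 0.0) - probs[1]
        samples.append(grad_log_p * (reward - baseline))
    return samples

no_baseline = gradient_samples(baseline=0.0)
with_baseline = gradient_samples(baseline=10.5)   # roughly V(s), the average return

var_plain = statistics.variance(no_baseline)
var_base = statistics.variance(with_baseline)
mean_plain = statistics.mean(no_baseline)
mean_base = statistics.mean(with_baseline)
```

Both sets of samples estimate the same gradient on average, but `var_base` comes out far smaller than `var_plain`, which is the point of subtracting a baseline.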

Other methods include: **TD(lambda)** and **Generalized Advantage Estimation (GAE)**. 


# Notes that need to be added where they make sense

Policy gradient methods are not necessarily the same as actor-critic, because actor-critic does not necessarily use a gradient and policy gradient methods do not necessarily have a critic component.
Often a sampled reward return (or a sampled reward return minus some baseline) is used to adjust policy parameters rather than an implemented critic.

Actor-critic methods tend to be *online*: they update their parameters after each step in the environment. However, this is not a rule, and you can find actor-critic methods that do not update at every step, for example DeepMind's A3C, where each worker accumulates gradients over a short rollout before applying an update.

https://github.com/awjuliani/Meta-RL

Actor-critic methods have an advantage over vanilla policy gradients: without some baseline, an action is never penalized for being worse than the alternatives, so a noisy gradient estimate can point in the wrong direction and weaken the agent.

Common critic functions:

**Temporal difference**


# Deep Hierarchical Reinforcement Learning

# Actor-Critic

### DPG

### DDPG



# Search 

### Monte Carlo Search

### Grid Search

### Random Search

### Linear Search/Combination


## Markov Decision Process

## Partially Observable Markov Decision Process

Partially Observable Markov Decision Processes (POMDP) are Markov Decision Processes that exist in an environment where we never have full information. They are more difficult to solve but essential for modeling problems in reality. 

The key to partially observable environments is to give the agent a *capacity for temporal integration of observations*. If the information available at a single moment is not enough to make a good decision, then enough varying information gathered over time might be.

There are several ways of accomplishing this temporal integration:

1. Stack the frames using an external frame buffer and feed the network the last four frames at a time, as in the original DQN paper. The downside is that holding frames in memory requires a large frame buffer that can slow down training. Additionally, we may need more than four frames to capture the event that matters most for the current decision.
2. Use a recurrent neural net (RNN) and its ability to learn temporal dependencies. This class of agents is called Deep Recurrent Q-Networks (DRQN). If we use recurrent layers we can't use random batches of individual experiences. Instead, mini-batches are built from randomly sampled traces of a set length; this way we retain random sampling without losing temporal trends.
    1. A trick for DRQN from a group at Carnegie is to only send the last half of the gradients for a given trace.
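
The frame-stacking option can be sketched with a fixed-length buffer. The `FrameStack` class and the string "frames" are hypothetical stand-ins; a real agent would store image arrays and concatenate them before feeding the network.

```python
from collections import deque

class FrameStack:
    """Keeps the last k observations and returns them as one stacked input."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Pad with copies of the first frame so the stack is always k long.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # The oldest frame is dropped automatically once maxlen is reached.
        self.frames.append(frame)
        return list(self.frames)

stack = FrameStack(k=4)
initial = stack.reset("f0")
for name in ["f1", "f2", "f3", "f4"]:
    stacked = stack.push(name)
```

Anything that happened more than `k` frames ago is invisible to the agent, which is the limitation the recurrent approach tries to address.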


### Hill Climbing

## Long Term Credit Assignment
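We deal with the credit assignment problem by using discount rates: a reward received $k$ steps in the future is weighted by $\gamma^k$, so rewards closer in time to an action count more toward its value.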
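
As a small worked example, the discounted return for every time step of an episode can be computed with one backward pass. The function name and the example rewards are assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward at the end of a three-step episode, discounted with gamma = 0.5.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
# returns == [0.25, 0.5, 1.0]
```

The earlier a step is, the less of the final reward it is credited with.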




## Exploration vs. Exploitation






# [Diversity Is All You Need](https://arxiv.org/pdf/1802.06070.pdf)
   
   
This paper follows a pattern common to many recent papers, where training comes in two parts. The first is a cheap, unsupervised exploratory period in which an agent with a random skill z explores an area without reward but with a discriminator network, while trying to maximize the entropy of its actions. The second part is a short, expensive, supervised training period.

Z, the skill, is a number from 1 to 20.
The skill and the discriminator are both learned. Both are neural networks.

### Soft-Actor Critic
There is an entropy term, weighted by a parameter, that is added to the objective so that the policy is rewarded for doing diverse things.
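
To illustrate what an entropy term measures, here is a sketch for a discrete action distribution. The helper names and the weight `alpha=0.2` are assumptions; this shows the general idea of an entropy bonus rather than the exact objective used in the paper.

```python
import math

def policy_entropy(probs):
    """Shannon entropy H = -sum_a p(a) log p(a), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_reward(reward, probs, alpha=0.2):
    """Reward plus an entropy bonus; alpha is an assumed weight on the bonus."""
    return reward + alpha * policy_entropy(probs)

uniform = policy_entropy([0.25, 0.25, 0.25, 0.25])
peaked = policy_entropy([0.97, 0.01, 0.01, 0.01])
bonus_uniform = entropy_regularized_reward(1.0, [0.25, 0.25, 0.25, 0.25])
bonus_peaked = entropy_regularized_reward(1.0, [0.97, 0.01, 0.01, 0.01])
```

For the same raw reward, the uniform (more diverse) policy receives the larger regularized reward.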
    

## Policy Search / Policy Iteration
Policy search attempts to find an optimal policy $\pi^*$, where a policy is a mapping from states to actions. If both states and actions are finite, there is a finite number of possible policies to search. In many methods we use our old policy to explore the environment and then greedily update it. I have seen policy search and policy iteration used without explicit disambiguation, which is why I don't have a clear difference here.

Here's another definition from Quora:

Policy iteration is a dynamic programming algorithm that uses a value function to model the expected return for each state-action pair. Many techniques in RL such as Q-learning, TD-learning, SARSA, QV-learning, etc. use value functions. 

In policy search, a parameterized policy is stored, but no value function is used or estimated. Policy search methods can rely on roll-outs of policies using trajectory-based sampling.

So the main difference between policy iteration techniques and policy search techniques is whether a value function is used.

## Value Iteration
Value iteration searches in the space of value functions; the optimal policy is obtained as a byproduct once we have found the optimal value function.
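
A minimal sketch of value iteration on a known tabular MDP. The two-state example, the discount `gamma=0.9` and the stopping tolerance are assumptions for illustration; the policy is only read off at the end, once the values have converged.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-8):
    """transitions[(s, a)] -> {s': prob}; rewards[(s, a)] -> float."""
    values = {s: 0.0 for s in states}

    def q_value(s, a):
        return rewards[(s, a)] + gamma * sum(
            p * values[s2] for s2, p in transitions[(s, a)].items()
        )

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # The policy is read off the converged value function.
    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return values, policy

# Hypothetical two-state example: "go" moves from state 0 to the rewarding state 1.
states = [0, 1]
actions = ["stay", "go"]
transitions = {
    (0, "stay"): {0: 1.0}, (0, "go"): {1: 1.0},
    (1, "stay"): {1: 1.0}, (1, "go"): {0: 1.0},
}
rewards = {(0, "stay"): 0.0, (0, "go"): 0.0, (1, "stay"): 1.0, (1, "go"): 0.0}
values, policy = value_iteration(states, actions, transitions, rewards)
```

In this example staying in state 1 earns 1 per step, so its value approaches 1 / (1 - 0.9) = 10, and the value of state 0 approaches 0.9 * 10 = 9.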