---
### <font color=blue>Reinforcement learning</font>
![ML](images/ml_16.png) 
![ML](images/ml_15.png) 
#### Markov Decision Problems
* Set of states **`S`** 
* Set of actions **`A`**
* Transition function **`T[s,a,s']`** 
* Reward function **`R[s,a]`**
* Find **`policy π(s)`** that will maximize reward over time
* If we have T and R, there are algorithms we can unleash that will find this optimal policy. Two of them are policy iteration and value iteration.


#### Defined by:
* A Markov decision problem is defined by S, A, T and R. 
* S is the set of potential states. 
* A is the set of potential actions. 
* T is the transition function: the probability that, given the state s we're in and the action a we take, we end up in state s'. 
* R is the reward function. 
* The goal of a reinforcement learning algorithm is to find a policy, π, that maps each state to the action we should take, such that it maximizes some future sum of the reward.

#### Two approaches to find policy π 
* Model-based  
 * Build models of T[s,a,s'] and R[s,a] to solve problems using value iteration or policy iteration. 
* Model-free  
 * Q-Learning
 
#### What to optimize ?
![ML](images/ml_17.png) 
$\gamma$ relates very strongly to interest rates.
If $\gamma$ were 0.95, a reward one step in the future is worth about 5% less than the same reward received right away.
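
As a small illustration of discounting, here is a minimal sketch that sums a list of future rewards. It assumes the reward `i` steps ahead is weighted by the discount factor raised to the power `i`; the function name `discounted_return` and the sample rewards are hypothetical.

```python
def discounted_return(rewards, discount):
    """Sum future rewards, weighting the reward i steps ahead by discount**i."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (discount ** i) * r
    return total


# Hypothetical example: a reward of 1.0 on each of three consecutive steps.
value = discounted_return([1.0, 1.0, 1.0], 0.95)
print(round(value, 4))
```

With a discount of 0.95, the second reward counts for 0.95 and the third for 0.9025 of its face value.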

---
### <font color=blue>Q-learning</font>
![ML](images/ml_18.png) 
We can think of Q as a table with two dimensions, $s$ and $a$. <br/>
$s$ is the state we're looking at. <br/>
$a$ is the action we might take. Q represents the value of taking action $a$ in state $s$.<br/>
Q has two components: the immediate reward that we get for taking action $a$ in state $s$, plus the discounted reward, which is the reward we get for future actions. Q represents the rewards we get for acting now and in the future.<br/><br/>
$\pi(s)$ represents the policy: the action we take when we are in state $s$. We take advantage of our Q table to figure that out.<br/>
Suppose we're in state $s$ and want to find out which action is best. All we need to do is look across all the potential actions and find the one for which $Q[s,a]$ is largest. We don't change $s$; we just step through each value of $a$, and the one with the largest value is the action we should take.<br/>
![ML](images/ml_19.png) 
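
The policy lookup described above can be sketched as follows. This assumes the Q table is stored as a list of rows, one row per state and one column per action; the name `best_action` and the sample values are hypothetical.

```python
def best_action(Q, s):
    """Return the index of the action with the largest Q-value in state s."""
    row = Q[s]
    # Step through every action for the fixed state s and keep the largest.
    return max(range(len(row)), key=lambda a: row[a])


# Hypothetical Q table: 2 states, 3 actions.
Q = [
    [0.1, 0.5, 0.2],  # state 0
    [0.7, 0.3, 0.0],  # state 1
]
print(best_action(Q, 0), best_action(Q, 1))
```

If two actions tie, this sketch returns the one with the lower index.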

#### How does it take that information to improve this Q table? 
![ML](images/ml_20.png) 
The update rule blends two parts: the old value of $Q[s,a]$ and an improved estimate built from the new experience.
* A low value of alpha means that the previous value of $Q[s,a]$ is more strongly preserved in the update. 
* A low value of gamma means that we value later rewards less, which is essentially a high discount rate. 
* A high value of gamma means that we value later rewards very significantly.
* $a'$ is the next action we will take. 
* $\arg\max_{a'}(Q[s', a'])$ means that we find the best action $a'$, the one that maximizes the value when we're in state $s'$.

#### Update Rule
The formula for computing $Q$ for any state-action pair $\langle s, a \rangle$, given an experience tuple $\langle s, a, s', r \rangle$, is:

$Q'[s, a] = (1 - \alpha) \cdot Q[s, a] + \alpha \cdot (r + \gamma \cdot Q[s', \arg\max_{a'}(Q[s', a'])])$

Here:

* $r = R[s, a]$ is the immediate reward for taking action $a$ in state $s$,
* $γ ∈ [0, 1]$ (gamma) is the discount factor used to progressively reduce the value of future rewards,
* $s'$ is the resulting next state,
* $\arg\max_{a'}(Q[s', a'])$ is the action that maximizes the Q-value among all possible actions $a'$ from $s'$, and,
* $α ∈ [0, 1]$ (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
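
The update rule can be written as a short function. This is a minimal sketch, assuming the Q table is a list of rows indexed as `Q[state][action]`; the function name `update_q` is hypothetical.

```python
def update_q(Q, s, a, s_prime, r, alpha, gamma):
    """Apply one Q-learning update for the experience tuple <s, a, s', r>."""
    # Q[s', argmax_a' Q[s', a']] is simply the largest value in row s'.
    best_next = max(Q[s_prime])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * best_next)
    return Q[s][a]


# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0], [1.0, 2.0]]
new_value = update_q(Q, s=0, a=1, s_prime=1, r=1.0, alpha=0.5, gamma=0.9)
print(new_value)
```

Here the old value 0.0 and the improved estimate `1.0 + 0.9 * 2.0` are blended equally because alpha is 0.5.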

#### Two Finer Points
![ML](images/ml_21.png) 
* With probability C, choose a random action.
* Otherwise, pick the action with the highest Q value.
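
The two bullets above can be sketched as a single action-selection function. This assumes a Q table indexed as `Q[state][action]`; the name `choose_action` and the optional `rng` argument are hypothetical and only there to make the sketch easy to seed.

```python
import random


def choose_action(Q, s, c, rng=random):
    """With probability c pick a random action, otherwise the best-known one."""
    n_actions = len(Q[s])
    if rng.random() < c:
        # Explore: any action, chosen uniformly at random.
        return rng.randrange(n_actions)
    # Exploit: the action with the highest Q value in state s.
    return max(range(n_actions), key=lambda a: Q[s][a])


Q = [[0.1, 0.9, 0.3]]
print(choose_action(Q, 0, c=0.0))  # c = 0 means the choice is never random
```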

#### The Trading Problem: Actions
![ML](images/ml_22.png) 
Three actions:
* Buy
* Sell
* Do nothing(Hold)

#### The Trading Problem: Rewards
Rewards should relate in some way to the returns of our strategy
* Short-term rewards in terms of daily returns  
* Long-term rewards that reflect the cumulative return of a trade cycle, from a buy to a sell or, for shorting, from a sell to a buy

For faster convergence:
* **`r=daily return`**: this is called an immediate reward, and it converges faster. If we reward a little bit on each day, the learner is able to learn much more quickly because it gets much more frequent rewards.
* **`r=0 until exit, then cumulative return`**: this is called a delayed reward. With it, we get no reward at all until the end of a trade cycle, from a buy to a sell. The learner has to infer from that final reward, all the way back, that each action in the sequence must have been taken in the right order to earn it.
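
To make the two reward schemes concrete, here is a minimal sketch. It assumes a single long trade that is entered at the first price in the list and exited at the last; the function names and the sample prices are hypothetical.

```python
def immediate_rewards(prices):
    """r = daily return on each day the position is held."""
    return [prices[i] / prices[i - 1] - 1.0 for i in range(1, len(prices))]


def delayed_rewards(prices):
    """r = 0 until exit, then the cumulative return of the whole trade."""
    n_days = len(prices) - 1
    if n_days < 1:
        return []
    cumulative = prices[-1] / prices[0] - 1.0
    return [0.0] * (n_days - 1) + [cumulative]


prices = [100.0, 110.0, 99.0]  # hypothetical prices: entry, next day, exit
print(immediate_rewards(prices))
print(delayed_rewards(prices))
```

The immediate scheme gives the learner a signal every day, while the delayed scheme gives a single signal at exit.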

#### The Trading Problem: State
[ ]**`Adjusted close`**: does not generalize across different price regimes, from when the stock was low to when it was high. If we're trying to learn a model for several stocks at once and they trade at very different prices, adjusted close doesn't help us generalize.<br/>
[ ]**`Simple Moving Average (SMA)`**<br/>
[v]**`Adjusted close / SMA`**: combining adjusted close and SMA into a ratio makes a good factor to use in the state.<br/>
[v]**`Bollinger Band value`**<br/>
[v]**`P/E ratio`**<br/>
[v]**`Holding stock`**<br/>
[v]**`Return since entry`**: the return since we entered the position. This might help us set exit points. For instance, maybe we've made 10% on the stock since we bought it and we should take our winnings while we can.<br/>
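
As one example of such a factor, the adjusted close / SMA ratio can be sketched in a few lines. The function name `price_to_sma`, the window length, and the sample prices are hypothetical; a real implementation would more likely use a dataframe library.

```python
def price_to_sma(prices, window):
    """Ratio of each price to the simple moving average of the last `window` prices."""
    ratios = []
    for i in range(window - 1, len(prices)):
        sma = sum(prices[i - window + 1 : i + 1]) / window
        ratios.append(prices[i] / sma)
    return ratios


# Hypothetical adjusted-close series with a 2-day window.
print(price_to_sma([10.0, 20.0, 30.0], window=2))
```

Because the result is a ratio, it stays on a comparable scale whether the stock trades at 10 or at 1000.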

#### Creating the state
![ML](images/ml_23.png) 
* State is an integer
* Discretize each factor which essentially means to convert the real number into an integer.
* Combine: combine all of those integers together into a single number.<br/>

Steps:<br/>
 * Assume we're using a discrete state space, meaning that our overall state is a single integer that represents all of our factors at once.
 * Consider we have 4 factors and each one is a real number.
 * Run each of these factors through its own discretizer to get an integer.
 * Here we've happened to select integers between 0 and 9, but we can use larger ranges, for instance 0 to 20 or even 0 to 100.
 * Stack them one after the other into our overall discretized state.
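
The stacking step can be sketched as follows. This assumes each discretized factor lies in the range 0 to `base - 1`; the function name `combine_state` and the sample digits are hypothetical.

```python
def combine_state(digits, base=10):
    """Stack discretized factors into one integer, most significant first."""
    state = 0
    for d in digits:
        if not 0 <= d < base:
            raise ValueError("each factor must be in the range 0..base-1")
        state = state * base + d
    return state


# Four hypothetical factors, each already discretized to 0..9.
print(combine_state([2, 5, 9, 0]))
```

With integers between 0 and 9, the result reads like the digits written side by side. For a larger range such as 0 to 20, pass `base=21` so that different factor combinations never map to the same state.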
 
#### Discretizing
![ML](images/ml_24.png) 
![ML](images/ml_25.png) 
We need a way to convert a real number into an integer across a limited scale. In other words, we might have hundreds of individual real values here between 0 and 25, and we want to convert each into an integer, say between 0 and 9.
* First, we determine ahead of time how many steps we're going to have. In other words, how many groups do we want to put the data into?
* Then we divide the total number of data elements by the number of steps to get the step size.
* Then we sort the data, and the thresholds end up being the values at every step-size-th position. In other words, if we had, say, 100 data elements and 10 steps, then our step size is 10. So the 10th data element is our first threshold, then the 20th, the 30th, and so on.
* The thresholds might end up looking something like the figure.

When we go to query with a new value that falls between two thresholds, say 7 and 8, the value would be discretized to 8.
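
The procedure above can be sketched with the standard library. The function names are hypothetical, and the bin numbering here starts at 0 for values at or below the first threshold, which is an assumption and may differ from the numbering used in the figure.

```python
import bisect


def build_thresholds(data, steps):
    """Sort the data and take every step_size-th element as a threshold."""
    ordered = sorted(data)
    step_size = len(ordered) // steps
    if step_size < 1:
        raise ValueError("need at least one data element per step")
    # The 10th, 20th, ... elements (1-based) sit at indices 9, 19, ...
    return [ordered[(i + 1) * step_size - 1] for i in range(steps)]


def discretize(value, thresholds):
    """Map a real value to an integer bin in 0..len(thresholds)-1."""
    idx = bisect.bisect_left(thresholds, value)
    # Values above the last threshold fall into the last bin.
    return min(idx, len(thresholds) - 1)


data = [float(x) for x in range(1, 101)]  # 100 hypothetical data elements
thresholds = build_thresholds(data, steps=10)
print(thresholds)
print(discretize(11.0, thresholds))
```

Because the thresholds come from sorted data, each bin ends up with roughly the same number of training points.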

#### Summary
**`Advantages`**<br/>
The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where not all states and/or transitions are fully defined.
As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a).
Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state ($\max_a Q(s, a)$) as well as the best policy in terms of the action that should be taken ($\arg\max_a Q(s, a)$).<br/><br/>
**`Issues`**<br/>
The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future - representing that properly requires look-ahead and careful weighting.
Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).
In the next lesson, we will discuss an algorithm that tries to address this second problem by simulating the effect of actions based on historical data.

### <font color=blue>Dyna</font>
One problem with Q-learning is that it takes many experience tuples to converge. This is expensive in terms of interacting with the world because we have to take a real step, in other words, execute a trade, in order to gather information. 
<br/><br/>
Dyna works by building models of T, the transition matrix, and R, the reward matrix. Then, after each real interaction with the world, we hallucinate many additional interactions, usually a few hundred, which are then used to update the Q table.
<br/><br/>
Dyna is intended to speed up learning or model convergence for Q-learning.

Q-learning is model-free, which means that it does not rely on T (transition matrix) or R (reward function). Q-learning knows neither of them.

Dyna ends up becoming a blend of model-free and model-based methods.

![ML](images/ml_26.png) 
* Initialize the Q table, and begin iterating 
* Observe S
* Execute action A, and then observe the new state, S', and reward, R
* Update the Q table with this experience tuple and repeat.<br/>
<br/>
<b>When we augment Q-learning with Dyna-Q, we add 3 new components:</b>
* Learn model: add some logic that enables us to learn models of T and R.
* Hallucinate experience: rather than interacting with the real world as we do in the Q-learning part, which is expensive, we hallucinate experiences and use them to update our Q table.
* Update Q: update the Q table with each hallucinated experience, and repeat many times, on the order of hundreds.

So we can leverage the experience we gain in Q-learning from an interaction with the real world, but then update our model in Dyna-Q more completely before we step out and interact with the real world again.

After we've iterated enough times in Dyna-Q, we return to Q-learning and resume our interaction with the real world.

The key thing is that for each experience with the real world, we have maybe 100 or 200 updates of our model in Dyna-Q.
![ML](images/ml_27.png) 
Then we find new values for T and R.
The point where we update the model includes the following:
* We want to update T, called T' here, which represents our transition matrix, and update our reward function, called R'.
* T' is the probability that if we are in state s and we take action a, we will end up in s'.
* R' is our expected reward if we are in state s and we take action a.

How do we hallucinate an experience?
1. Randomly select an s.
2. Randomly select an a.
3. Infer our new state s' by looking at T.
4. Infer our immediate reward r by looking at the R table.

Now we've got s, a, s', r or a complete experience tuple and we can update our Q-table using that.

Q table update is our final step.
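
The hallucination loop can be sketched as below. It assumes `T[s][a]` already holds a list of transition probabilities over next states and `R[s][a]` holds the expected reward. Sampling s' from those probabilities is one possible reading of "infer s' by looking at T" (picking the most likely s' would be another). The function name and the tiny model are hypothetical.

```python
import random


def hallucinate_and_update(Q, T, R, alpha, gamma, n_updates, rng=random):
    """Run n_updates simulated experiences through the Q-learning update."""
    n_states = len(Q)
    n_actions = len(Q[0])
    for _ in range(n_updates):
        s = rng.randrange(n_states)    # 1. randomly select an s
        a = rng.randrange(n_actions)   # 2. randomly select an a
        # 3. infer s' by sampling from the modelled probabilities (assumption)
        s_prime = rng.choices(range(n_states), weights=T[s][a])[0]
        r = R[s][a]                    # 4. infer r from the R table
        # Final step: the usual Q table update with the tuple <s, a, s', r>.
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_prime]))


# Hypothetical model: 2 states, 1 action, every transition leads to state 1.
T = [[[0.0, 1.0]], [[0.0, 1.0]]]
R = [[1.0], [0.0]]
Q = [[0.0], [0.0]]
hallucinate_and_update(Q, T, R, alpha=1.0, gamma=0.9, n_updates=100,
                       rng=random.Random(0))
print(Q)
```

No real interaction happens inside this loop; every tuple comes from the learned model.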

####  <font color=red>Learning T</font>
![ML](images/ml_28.png) 
T(s, a, s') represents the probability that if we are in state s, take action a, we will end up in state s'.

To learn the model of T, we just observe how these transitions occur. In other words, each time we have an experience with the real world we get back an s, a, s', and we count how many times each one happened.

We introduce a new table called $T_{count}$ or $T_c$.
1. Initialize all of our $T_{count}$ values to a very small number.
2. Begin executing Q-learning. Each time we interact with the real world, we observe s, a, and s'.
3. Increment that location in our $T_{count}$ matrix.<br/>
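
The counting steps can be sketched with nested lists. The initial value `1e-5` is only an assumed example of "a very small number", and the function names are hypothetical.

```python
def init_t_count(n_states, n_actions, initial=1e-5):
    """Create T_count[s][a][s'] with every entry set to a very small number."""
    return [[[initial] * n_states for _ in range(n_actions)]
            for _ in range(n_states)]


def record_transition(t_count, s, a, s_prime):
    """Increment the count for one observed real-world transition."""
    t_count[s][a][s_prime] += 1


t_count = init_t_count(n_states=2, n_actions=2)
record_transition(t_count, s=0, a=1, s_prime=1)
record_transition(t_count, s=0, a=1, s_prime=1)
print(t_count[0][1])
```

Starting from a small non-zero value rather than zero means the counts for any (s, a) pair never sum to exactly zero.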

####  <font color=red>Evaluating T</font>
![ML](images/ml_29.png) 
To evaluate $T[s, a, s']$, we divide $T_c[s, a, s']$ by the sum over $i$ of $T_c[s, a, i]$, where $i$ iterates over all the possible states. That sum is the total number of times that we were in state s and executed action a.
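
As a sketch of this evaluation, the count for one next state is divided by the total count for the state-action pair. It assumes a count table indexed as `t_count[s][a][s']` whose row for (s, a) has a non-zero sum; the function name is hypothetical.

```python
def transition_probability(t_count, s, a, s_prime):
    """Estimate T[s, a, s'] as the share of (s, a) visits that led to s'."""
    total = sum(t_count[s][a])  # times we were in s and executed a
    return t_count[s][a][s_prime] / total


# Hypothetical counts: from state 0 with action 0, we landed in
# state 0 three times and in state 1 once.
t_count = [[[3.0, 1.0]]]
print(transition_probability(t_count, 0, 0, 0))
```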

#### <font color=red>Learning R</font>
![ML](images/ml_30.png) 
<font color=blue>$R[s,a]$</font> is our model of the expected reward if we're in state s and execute action a.
<font color=blue>$r$</font> is the immediate reward we receive when we experience this in the real world; in other words, it's what we get in an experience tuple.

So we want to update this model every time we have a real experience.

The update is similar to the Q-table update equation:
<font color=blue>$\alpha$</font>: learning rate
<font color=blue>$r$</font>: the immediate reward, which is our new best estimate of what the value should be.

We're weighting, presumably, our old value more than our new value.
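
A minimal sketch of such an update is shown below. It assumes the model blends the old estimate and the new reward as `(1 - alpha) * R[s][a] + alpha * r`, mirroring the form of the Q-table update; the function name and the sample numbers are hypothetical.

```python
def update_reward_model(R, s, a, r, alpha):
    """Blend the old expected reward with the newly observed reward r."""
    R[s][a] = (1 - alpha) * R[s][a] + alpha * r
    return R[s][a]


R = [[0.0]]  # hypothetical model with one state and one action
print(update_reward_model(R, s=0, a=0, r=10.0, alpha=0.2))
```

With alpha at 0.2, the old value keeps 80% of the weight, so a single unusual reward moves the model only a little.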

#### <font color=red>Dyna-Q recap</font>
![ML](images/ml_31.png) 

#### <font color=red>Summary</font>
![ML](images/ml_32.png) 
The Dyna architecture consists of a combination of:
* direct reinforcement learning from real experience tuples gathered by acting in an environment,
* updating an internal model of the environment, and,
* using the model to simulate experiences.
