#<font size = 6pt>**Introduction**</font>

##<font size = 5pt> Basic Definitions/Terminology </font>

* ***Agent:*** The player/bot/actor we are training. Its fundamental goal is to maximize its *reward* over time, given by the *environment*, by selecting proper *actions*.
* ***Environment:*** The world that the agent is placed in.
* ***State:*** A sufficient representation of the world.
* ***Action:*** The methods/ways/options the agent has to interact with the environment.

##<font size = 5pt> Comparison to other AI </font>

* SL = Supervised Learning
* UL = Unsupervised Learning
* RL = Reinforcement Learning
* IL = Imitation Learning

![](https://drive.google.com/uc?export=view&id=1mvhwW08_7UpB9eMzyIQcVlGNNGfl4uEd)

![](https://drive.google.com/uc?export=view&id=16AreFfWwmAiNOpAGmSv39hqoReihV_Rt)
* AI planning assumes a model of how decisions impact the environment

* Supervised learning is provided correct labels

* Unsupervised learning is provided no labels

* Reinforcement learning is provided with censored labels
  * reward received = censored labels

* Imitation learning assumes input demonstrations of good policies
  * IL reduces RL to SL. IL + RL is a promising area

Optimization: Goal is to find an optimal way to make decisions that yield the best outcomes.

Learns from Experience: The agent will learn from experience (its own or external experience).

Generalization: Need some form of generalization because it may be impossible to explicitly define the entire state-space. If your agent is able to generalize when learning, it will be able to perform on a new configuration it has not seen before.

Delayed Consequences: Decisions made now can affect outcomes much later. This complicates the temporal credit assignment problem: determining which past decisions caused which later outcomes. The agent has to deal with both the immediate and the long-term ramifications of actions and states.

Exploration: The agent has to explore the state-space sufficiently in order to exploit it effectively. An RL agent only learns about things by trying them itself, so exploration and exploitation have to be balanced to obtain good results from your agent.

#<font size = 6pt>**Returns/Rewards**</font>

The basic goal of any agent is to maximize its *Discounted Sum of Rewards*; this is how an agent 'learns'.

Thus it is critical that the rewards truly indicate the goal.

An agent acts within a finite horizon, or some episodic sequence. This could be one round in a match of Pong, one attempt through a maze, etc.

Therefore, if the agent receives a reward at every time step $t$, its ***discounted sum of rewards*** would be defined as:

---

$ G_t $  $= r_t + \gamma \  r_{t+1} + \gamma^2 \ r_{t+2} + ... + \gamma^{T-t} \ r_{T}
\\ = r_t + \gamma \ (r_{t+1} + \gamma \ r_{t+2} + ... + \gamma^{T-t-1} \ r_{T})
\\ = r_{t} + \gamma G_{t+1}
\\ = \sum_{k = 0}^{T-t} \gamma^k \ r_{t+k}$

---

* $T$ denotes the time-step the episode terminates at and $\gamma$ is a *discount factor*
  * For *Continuing Tasks* or infinite episodes, $T = \infty$, the sum still converges as long as $\gamma < 1$ and the rewards are bounded

* $\gamma$ is a *Discount Factor* used to simplify the math and to express the relative weight of future versus immediate rewards; $\gamma \in [0, 1]$

  * $\gamma$ closer to 1 places more emphasis on future rewards (farsighted)
  * $\gamma$ closer to 0 places more emphasis on immediate rewards (myopic)
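
The recursion $G_t = r_t + \gamma G_{t+1}$ suggests computing the returns of a finished episode backwards, from the last step to the first. A minimal sketch, assuming the episode's rewards are already stored in a list (the function name `discounted_returns` and the sample rewards are made up for illustration):

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_T] for one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # nothing is collected after the episode ends
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

Working backwards reuses $G_{t+1}$ instead of re-summing the tail of the episode for every $t$.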



## Simple Example

![](https://drive.google.com/uc?export=view&id=1mvhwW08_7UpB9eMzyIQcVlGNNGfl4uEd)

* Reward: +1 in $s_1$, +10 in $s_7$, and $0$ everywhere else

* Calculate the discounted return from this sequence: $s_4, s_5, s_6, s_7$ using $\gamma = \frac{1}{2}$


$$0 + \left(\frac{1}{2} \cdot 0\right) + \left(\frac{1}{4} \cdot 0\right) + \left(\frac{1}{8} \cdot 10\right) = 1.25$$
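
The same arithmetic can be checked in a few lines of Python. This is only a sketch: the state names are written as strings and the dictionary `state_reward` is an illustrative encoding of the rewards listed above.

```python
gamma = 0.5
state_reward = {"s1": 1.0, "s7": 10.0}  # every other state pays 0
sequence = ["s4", "s5", "s6", "s7"]

# Sum gamma^k * r_k over the visited states, with k = 0 for the first state.
g = sum(gamma**k * state_reward.get(s, 0.0) for k, s in enumerate(sequence))
print(g)  # 1.25
```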



#<font size = 6pt>**Policies**</font>

***Policies*** determine the future of the agent, denoted $\pi$. They are essentially the 'strategy' of the agent.

They map states to actions; $\pi: S \rightarrow A$

* Deterministic Policy: $\pi(s) = a$

* Stochastic Policy: $\pi(a|s) = P(a_t=a | s_t = s)$ ; the probability of taking action $a$ from state $s$.

\

$s_0 \xrightarrow[]{\pi} a_0 \rightarrow s_1 \xrightarrow[]{\pi} a_1 \rightarrow s_2 \xrightarrow[]{\pi} a_2 $
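
To make the two kinds of policy concrete, here is a small sketch. The states `s0`/`s1`, the actions `left`/`right`, and the probabilities are hypothetical and chosen only for illustration. A deterministic policy is a plain lookup, while a stochastic policy stores a distribution per state and samples from it.

```python
import random

# Hypothetical two-state, two-action problem used only for illustration.
deterministic_policy = {"s0": "left", "s1": "right"}  # pi(s) = a
stochastic_policy = {  # pi(a | s)
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.5, "right": 0.5},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Samples an action according to pi(a | s)."""
    dist = stochastic_policy[state]
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))
print(act_stochastic("s0", rng))
```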

\

\

The ***Optimal Policy***, $\pi^*$, is the best policy, that is, the policy whose expected return, if followed, is greater than or equal to that of every other policy. At least one optimal policy always exists (there may be several).

---

$$\pi^* = \underset{\pi}{\arg\max} \ V^{\pi}(s) \quad \forall s \in \mathcal{S}$$


Policy $\pi^{\prime}$ is an optimal policy if, for every policy $\pi$: $\forall s \in \mathcal{S} : V^{\pi^{\prime}}(s) \geq V^{\pi}(s)$

---

#<font size = 6pt>**Value Functions**</font>

* Most important equations in RL

* Value Functions tell you how valuable or 'good' something is

## State-Value Function

The State-Value function, denoted $V(s)$ and characterized by the Bellman State-Value Equation, tells you how good it would be for your agent to be in state $s$; essentially, how valuable or 'good' state $s$ is.

The state-value function, or the expected return, when starting in state $s$ and following policy $\pi$ is defined as:

\

---
$ V^{\pi}(s) $  $= \mathbb{E}^{\pi}\left[G_t | s_t = s\right]
\\ = \mathbb{E}^{\pi}\left[r_t  + \gamma \ G_{t+1} | s_t = s\right]
\\ = \sum_a \pi(a | s) \sum_{s^{\prime}, \  r} p(s^{\prime}, \ r | s, a)\left[r + \gamma \ \mathbb{E}^{\pi}\left[G_{t+1} | s_{t+1} = s^{\prime}\right]\right]
\\ = \sum_a \pi(a | s) \sum_{s^{\prime}, \  r} p(s^{\prime}, \ r | s, a)\left[r + \gamma \ V^{\pi}(s^{\prime})\right]
$

Used as an iterative update, the last line computes the $k$-th estimate $V^{\pi}_{k}(s)$ from the previous estimate $V^{\pi}_{k-1}(s^{\prime})$.

---

\

SIMPLY (for a deterministic policy, $a = \pi(s)$):

---
$V^{\pi}(s) $ $ = \underbrace{R(s, \pi(s))}_{\text{Immediate} \\ \text{Reward}} + \underbrace{\gamma \sum_{s^{\prime}}P(s^{\prime}|s, \pi(s))V^{\pi}(s^{\prime})}_{\text{Discounted Sum of Future Rewards}}$

---
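
As a rough sketch, the Bellman state-value equation can be turned into a loop (iterative policy evaluation): start from $V = 0$ and repeatedly apply the right-hand side until the values stop changing. The 2-state MDP, its numbers, the uniform random policy, and the name `evaluate_policy` are all made up for illustration.

```python
# Hypothetical 2-state MDP: p[(s, a)] -> list of (probability, next_state, reward)
p = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}
states, actions = ["A", "B"], ["stay", "go"]
pi = {s: {"stay": 0.5, "go": 0.5} for s in states}  # uniform random policy
gamma = 0.9


def evaluate_policy(pi, p, gamma, tol=1e-10):
    """Sweep the Bellman expectation backup until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        new_v = {
            s: sum(
                pi[s][a] * prob * (r + gamma * v[s2])
                for a in actions
                for prob, s2, r in p[(s, a)]
            )
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v


v_pi = evaluate_policy(pi, p, gamma)
print(v_pi)
```

State `B` should come out more valuable than `A` here, since staying in `B` pays 2 on every step.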

\

<font size = 5pt>Optimality</font>

The ***Optimal State-Value Function***, denoted $V^*(s)$ $\forall s \in \mathcal{S}$, assigns to each state the largest expected return achievable by any policy.

\

---
$ V^{*}(s) $  $= \underset{a}{\max} \mathbb{E}\left[r_t + \gamma V^*(s_{t+1}) | s_t = s, a_t = a\right]
\\ = \underset{a}{\max} \sum_{s^{\prime}, \ r} p(s^{\prime}, \ r | s, a) \left[r + \gamma V^*(s^{\prime})\right]
$

---
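
Replacing the average over the policy with a max over actions gives value iteration. The sketch below applies the backup $V(s) \leftarrow \max_a \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)\left[r + \gamma V(s^{\prime})\right]$ to a small hypothetical MDP; the states, rewards, and function name are assumptions for illustration only.

```python
# Hypothetical 2-state MDP: p[(s, a)] -> list of (probability, next_state, reward)
p = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}
states, actions = ["A", "B"], ["stay", "go"]


def value_iteration(p, states, actions, gamma, tol=1e-10):
    """Sweep the Bellman optimality backup until the values stop changing."""
    v = {s: 0.0 for s in states}
    while True:
        new_v = {
            s: max(
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            for s in states
        }
        if max(abs(new_v[s] - v[s]) for s in states) < tol:
            return new_v
        v = new_v


v_star = value_iteration(p, states, actions, gamma=0.9)
print(v_star)
```

In this toy problem, staying in `B` forever pays 2 per step, so $V^*(B) = 2 / (1 - 0.9) = 20$.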

## State-Action Value Function

The State-Action-Value function, denoted $Q(s, a)$ and characterized by the Bellman State-Action-Value Equation, tells you how good it would be for your agent to choose action $a$ in state $s$; essentially, how valuable or 'good' action $a$ is from state $s$.

The state-action-value function, or the expected return, when starting in state $s$, choosing action $a$, and following policy $\pi$ thereafter is defined as:


\

---
$ Q^{\pi}(s, a) $  $= \mathbb{E}^{\pi}\left[G_t | s_t = s, a_t = a\right]
\\ = \mathbb{E}^{\pi}\left[r_t  + \gamma \ G_{t+1} | s_t = s, a_t = a\right]
\\ = \sum_{s^{\prime}, \  r} p(s^{\prime}, \ r | s, a)\left[r + \gamma \ \mathbb{E}^{\pi}\left[G_{t+1} | s_{t+1} = s^{\prime}\right]\right]
\\ = \sum_{s^{\prime}, \  r} p(s^{\prime}, \ r | s, a) \left[r + \gamma \ V^{\pi} (s^{\prime})\right]
\\ = \sum_{s^{\prime}, \  r} p(s^{\prime}, \ r | s, a)\left[r + \gamma \ \sum_{a^{\prime}} \pi(a^{\prime} | s^{\prime})  Q^{\pi}(s^{\prime}, a^{\prime})\right]
$

---

\

SIMPLY:

---
$Q(s, a) $ $ = \underbrace{R(s, a)}_{\text{Immediate} \\ \text{Reward}} + \underbrace{\gamma \sum_{s^{\prime}}P(s^{\prime}|s, a)V(s^{\prime})}_{\text{Discounted Sum of Future Rewards}}$

---

\

<font size = 5pt>Optimality</font>

The ***Optimal State-Action-Value Function***, denoted $Q^*(s, a)$ $\forall s \in \mathcal{S} \land \forall a \in \mathcal{A}$, assigns to each state-action pair the largest expected return achievable by any policy.

\

---
$ Q^{*}(s, a) $  $= \mathbb{E}\left[r_t + \gamma \ \underset{a^{\prime}}{\max} Q^*(s_{t+1}, a^{\prime}) | s_t = s, a_t = a\right]
\\ = \sum_{s^{\prime}, \ r} p(s^{\prime}, \ r | s, a) \left[r + \gamma \ \underset{a^{\prime}}{\max} Q^*(s^{\prime}, a^{\prime})\right]
$

---
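
Once $Q^*$ is known, an optimal policy can be read off by acting greedily: $\pi^*(s) = \arg\max_a Q^*(s, a)$. The sketch below iterates the backup $Q(s, a) \leftarrow \sum_{s^{\prime}, r} p(s^{\prime}, r | s, a)\left[r + \gamma \max_{a^{\prime}} Q(s^{\prime}, a^{\prime})\right]$ on a hypothetical 2-state MDP and then extracts the greedy policy. The MDP, the fixed number of sweeps, and the function name are assumptions for illustration.

```python
# Hypothetical 2-state MDP: p[(s, a)] -> list of (probability, next_state, reward)
p = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"): [(1.0, "A", 0.0)],
}
states, actions = ["A", "B"], ["stay", "go"]


def q_value_iteration(p, states, actions, gamma, sweeps=500):
    """Apply the Bellman optimality backup for Q a fixed number of times."""
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        q = {
            (s, a): sum(
                prob * (r + gamma * max(q[(s2, a2)] for a2 in actions))
                for prob, s2, r in p[(s, a)]
            )
            for s in states
            for a in actions
        }
    return q


q_star = q_value_iteration(p, states, actions, gamma=0.9)

# Greedy policy: in each state pick the action with the largest Q*(s, a).
greedy_policy = {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
print(greedy_policy)  # {'A': 'go', 'B': 'stay'}
```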

#<font size = 6pt>**Markov Decision Processes (MDPs)**</font>

## Finite

An MDP is a tuple: ($\mathcal{S}$, $\mathcal{A}$, $P$, $R$, $\gamma$)

  * $\mathcal{S}$ is a (finite) set of Markov states; $s \in \mathcal{S}$

  * $\mathcal{A}$ is a (finite) set of actions; $a \in \mathcal{A}$

  * $P$ is the transition probability for each state and action;  <font size = 2pt> Defined with differing tuples depending on the system </font>
    * $P(s_{t+1} = s^{\prime} | s_t = s, a_t = a)$
    * $P(s_{t+1} = s^{\prime}, r_t = r| s_t = s, a_t = a)$

  * $R$ is a reward function; <font size = 2pt> Defined with differing tuples depending on the system </font>
    * $R(s_t = s) = \mathbb{E}\left[r_t | s_t = s\right]$
    * $R(s_t = s, a_t = a) = \mathbb{E}\left[r_t|s_t = s, a_t = a\right]$
    * $R(s_t = s, a_t = a, s_{t+1} = s^{\prime}) = \mathbb{E}\left[r_t|s_t = s, a_t = a, s_{t+1} = s^{\prime}\right]$

  * $\gamma$ is a Discount Factor used to simplify the math and to express the relative weight of future versus immediate rewards; $\gamma \in [0, 1]$
    * $\gamma$ closer to 1 places more emphasis on future rewards (farsighted)
    * $\gamma$ closer to 0 places more emphasis on immediate rewards (myopic)
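
One possible way to hold the tuple ($\mathcal{S}$, $\mathcal{A}$, $P$, $R$, $\gamma$) in code is a small dataclass. This is only a sketch: the class name `MDP`, the field layout ($P$ stored as `P[(s, a)] -> {s_next: probability}` and $R$ as `R[(s, a)]`), and the toy instance are assumptions, not a standard API.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: tuple
    actions: tuple
    transition: dict  # P[(s, a)] -> {s_next: probability}
    reward: dict  # R[(s, a)] -> expected immediate reward
    gamma: float

    def validate(self):
        """Check gamma is in [0, 1] and each P(. | s, a) sums to 1."""
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")
        for s in self.states:
            for a in self.actions:
                total = sum(self.transition[(s, a)].values())
                if abs(total - 1.0) > 1e-9:
                    raise ValueError(f"P(.|{s},{a}) sums to {total}")


# Hypothetical toy instance.
toy = MDP(
    states=("A", "B"),
    actions=("go",),
    transition={("A", "go"): {"B": 1.0}, ("B", "go"): {"A": 0.5, "B": 0.5}},
    reward={("A", "go"): 1.0, ("B", "go"): 0.0},
    gamma=0.9,
)
toy.validate()
```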

##<font size = 5pt> Dynamics </font>

An MDP gives rise to a sequence or *trajectory* like this: $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...$

In a discrete or finite MDP, the sets of states, actions, and rewards ($\mathcal{S}$, $\mathcal{A}$, $\mathcal{R}$) all have a finite number of elements. In this case, the random variables $r_t$ and $s_t$ have well-defined discrete probability distributions that depend only on the preceding state and action. These distributions define the dynamics of an MDP:

\
<font size = 4pt>***Transition Probabilities***: </font>

<font size = 2pt>These are alternative formulations of the same dynamics; which one applies depends on the dynamics/model of your specific system.</font>
<font size = 2pt>These are sometimes denoted with a $T$</font>

\
* ***State$-$Reward Transition:*** Predicts the next state$-$reward pair; if we start in state $s$ and take action $a$, the probability that we end up in state $s^{\prime}$ and receive reward $r$ is:

$$P(s^{\prime}, r|s, a) = P(s_{t+1} = s^{\prime}, r_t = r| s_t = s, a_t = a)$$

\

* ***State Transition:*** Predicts next state; if we start in state $s$ and take action $a$, the probability that we end up in state $s^{\prime}$ is:

$$P(s^{\prime}| s, a) = P(s_{t+1} = s^{\prime}|s_t = s, a_t = a)$$


\
<font size = 4pt>***Expected Reward for Transitions***: </font>

<font size = 2pt>These are alternative formulations of the same expected reward; which one applies depends on the dynamics/model of your specific system.</font>


\

* ***State$-$Action$-$Next-State Triple:*** Predicts the immediate expected reward from state$-$action$-$next-state triples; if we start in state $s$, take action $a$, and transition into state $s^{\prime}$, the expected (mean) reward is:

$$R(s, a, s^{\prime}) = \mathbb{E}\left[r_t | s_t = s, a_t = a, s_{t+1} = s^{\prime}\right]$$

\
* ***State$-$Action Pair:*** Predicts immediate expected reward from state$-$action pairs; if we start in state $s$ and take action $a$, the expected (mean) reward is:

$$R(s, a) = \mathbb{E}\left[r_t | s_t = s, a_t = a\right]$$
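
The state transition and both expected-reward forms can all be derived from the joint state-reward transition $P(s^{\prime}, r | s, a)$: sum over $r$ to get $P(s^{\prime} | s, a)$, and average $r$ to get $R(s, a)$ or $R(s, a, s^{\prime})$. A minimal sketch for a single $(s, a)$ pair, with a made-up joint distribution and hypothetical function names:

```python
from collections import defaultdict

# Hypothetical joint dynamics for one (s, a) pair: {(s_next, r): probability}
joint = {("B", 1.0): 0.6, ("B", 0.0): 0.2, ("A", 0.0): 0.2}


def state_transition(joint):
    """P(s' | s, a): sum the joint over rewards."""
    out = defaultdict(float)
    for (s2, _r), prob in joint.items():
        out[s2] += prob
    return dict(out)


def expected_reward(joint):
    """R(s, a): average reward over all (s', r) outcomes."""
    return sum(prob * r for (_s2, r), prob in joint.items())


def expected_reward_given_next(joint, s_next):
    """R(s, a, s'): average reward among outcomes that land in s'."""
    p_next = state_transition(joint)[s_next]
    weighted = sum(prob * r for (s2, r), prob in joint.items() if s2 == s_next)
    return weighted / p_next


print(state_transition(joint))
print(expected_reward(joint))
print(expected_reward_given_next(joint, "B"))
```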

<font size = 5pt>Types of Environments</font>

***Deterministic:*** For each state and action there is an explicit new state; $P: S \times A \rightarrow S$

***Stochastic:*** For each state and action there is a probability distribution over the next states; $P: S \times A \rightarrow P(s^{\prime} | s, a) $




##<font size = 4pt> Basic MDP Intuition</font>

The ***Markov Decision Process*** is a control process that provides a mathematical framework for modeling decision making.

![](https://drive.google.com/uc?export=view&id=1jz9fqua-nLJoXTXH5V2DOoY4PpV-FJmv)

* The agent is trying to select actions $a_t$ that maximize the total expected future reward (the return $G_t$), not just the immediate reward $r_t$.

* At each time-step $t$:
  1. Agent takes an action $a_t$
  2. The environment transitions states due to the action, and gives the agent a reward $r_t$ along with an observation of the new environment state, $s_t$.
  3. The agent receives $s_t$ and $r_t$...

* ***History:*** A sequence of past Observations, Actions, and Rewards; $h_t = (a_0, o_0, r_0, ..., a_t, o_t, r_t)$

* ***Markov Property/Assumption:*** The state used by the agent is a sufficient statistic of the history, which means the future is independent of the past given the present.

    State $s_t$ is said to have the *Markov property* if and only if:

    $$p(s_{t+1}|s_t,a_t) = p(s_{t+1}|h_t,a_t)$$

* The *Markov Property/Assumption* can always be satisfied if you include all the history, $s_t = h_t$.
However, the most recent observation might be sufficient; $s_t = o_t$. It depends on the domain.

* Partial Observability: Need some finite sequence of the history for a sufficient statistic; $s_t = f(h_t)$
* Full Observability: The most recent observation is a sufficient statistic of the history; $s_t = o_t$
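
The interaction loop and the history $h_t$ described above can be sketched as follows. The environment `CoinFlipEnv`, its reward rule, and the `step` interface are hypothetical and exist only to show the loop; it is fully observable, so the policy only needs the latest observation ($s_t = o_t$).

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: reward 1 if the action names the current state."""

    def __init__(self, rng):
        self.rng = rng
        self.state = "heads"

    def step(self, action):
        reward = 1.0 if action == self.state else 0.0
        self.state = self.rng.choice(["heads", "tails"])  # stochastic transition
        return self.state, reward  # observation of the new state, and reward


def run_episode(env, policy, steps):
    """Run the agent-environment loop and record (a_t, o_t, r_t) tuples."""
    history = []
    obs = env.state
    for _ in range(steps):
        action = policy(obs)  # agent picks an action from its current state
        obs, reward = env.step(action)  # environment responds
        history.append((action, obs, reward))
    return history


history = run_episode(CoinFlipEnv(random.Random(0)), policy=lambda o: o, steps=5)
print(history)
```

Because the latest observation fully determines the rewarding action here, a policy that looks only at `obs` collects the reward on every step.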


##<font size = 3pt>Low Level Example</font>

Given: A mobile robot has the job of collecting empty soda cans in an office environment. It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin; it runs on a rechargeable battery.
\
* Two charge levels can be distinguished for its rechargeable battery;
$\mathcal{S} = \{high, low\}$

* Depending on the state of its rechargeable battery it can choose from 2 different sets of actions; <font size = 2pt>(No reason to charge in $high$)</font>$\mathcal{A}(high) = \{search, wait\};\mathcal{A}(low) = \{search, wait, recharge\}$
  * $Searching:$ Actively search for a can for a certain period of time (Uses battery)
  * $Waiting:$ Remain stationary and wait for someone to bring it a can (Doesn't use battery)
  * $Recharging:$ Head back to its home base to recharge its battery.

*  A period of $searching$ that begins with a $high$ energy level leaves the energy level $high$ with probability $\alpha$ and reduces it to $low$ with probability $1 - \alpha$

* A period of $searching$ that begins with a $low$ energy level leaves the energy level $low$ with probability $\beta$ and depletes the battery completely with probability $1 - \beta$

* $r = +1$ for each can collected

* $r = -3$ if the battery is fully depleted

* Let $r_{search}$ denote the expected number of cans the robot will collect (expected reward) while $searching$

* Let $r_{wait}$ denote the expected number of cans the robot will collect (expected reward) while $waiting$

* $r_{search} > r_{wait}$

* No cans can be collected during a run home for recharging

* No cans can be collected on a step in which the battery is depleted

\

The MDP of this system

---

Contingencies/Dynamics Table and Transition Graph:
![](https://drive.google.com/uc?export=view&id=1SKzz8h3N5sZt7_oGSxPVFchBrL-HABX0)

* Two Kinds of Nodes
  * **State Nodes** for each possible state <font size = 2pt>(a large
open circle labeled by the name of the state)</font>
  * **Action Nodes** for each state-action pair <font size = 2pt>(a small solid circle labeled by the name of the action and connected by a line to the
state node) </font>

* Starting in state $s$ and taking action $a$ moves you along the line from *state node* $s$ to *action node* $(s, a)$

* Then the *environment* responds with a transition to the next *state’s node* via one of the arrows leaving *action node* $(s, a)$

* Each arrow corresponds to a *triple* $(s, s^{\prime}, a)$, and we label the arrow with the *transition probability*, $p(s^{\prime} |s, a)$, and the *expected reward* for that transition, $r(s, a, s^{\prime})$ <font size = 2pt>(the transition probabilities labeling the arrows leaving an action node always sum to 1)</font>
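
The dynamics table can also be written down as a Python dictionary. This is a sketch under two assumptions: the numeric values for $\alpha$, $\beta$, $r_{search}$ and $r_{wait}$ are made up, and a robot whose battery is depleted is assumed to be rescued and to restart in $high$ (as in the usual textbook version of this example; the text above does not state the next state explicitly).

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Build p[(s, a)] -> list of (probability, next_state, reward)."""
    return {
        ("high", "search"): [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
        # Assumption: after depletion (reward -3) the robot restarts in "high".
        ("low", "search"): [(beta, "low", r_search), (1 - beta, "high", -3.0)],
        ("high", "wait"): [(1.0, "high", r_wait)],
        ("low", "wait"): [(1.0, "low", r_wait)],
        ("low", "recharge"): [(1.0, "high", 0.0)],
    }


def expected_reward(p, s, a):
    """R(s, a) = sum over outcomes of probability * reward."""
    return sum(prob * r for prob, _s2, r in p[(s, a)])


# Made-up parameter values, chosen so that r_search > r_wait.
p = recycling_robot(alpha=0.9, beta=0.6, r_search=2.0, r_wait=1.0)

# The probabilities on the arrows leaving each action node must sum to 1.
for (s, a), outcomes in p.items():
    total = sum(prob for prob, _s2, _r in outcomes)
    print(s, a, round(total, 12), round(expected_reward(p, s, a), 12))
```

With these particular numbers, searching on a low battery has an expected reward of $0.6 \cdot 2 - 0.4 \cdot 3 = 0$, which illustrates why recharging can be the better choice in $low$.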


#<font size = 6pt>**Monte Carlo**</font> <font size = 2pt> skip for now </font>

#<font size = 6pt>**Temporal-Difference**</font>

#<font size = 6pt>**Sarsa**</font> <font size = 2pt> skip for now </font>

#<font size = 6pt>**Q-Learning**</font>