# Introduction

provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Reinforcement-Learning">Reinforcement Learning</a></li>
    <li>2. <a href="#2.-Pillars-of-RL-vs-Other-ML-Paradigms">Pillars of RL vs Other ML Paradigms</a></li>
    <li>3. <a href="#3.-Introduction-to-Sequential-Decision-Processes">Introduction to Sequential Decision Processes</a></li>
    <li>4. <a href="#4.-Types-of-Sequential-Decision-Processes">Types of Sequential Decision Processes</a>
        <ul>
            <li>4.1. <a href="#4.1.-Bandits">Bandits</a></li>
            <li>4.2. <a href="#4.2.-MDPs-and-POMDPs">MDPs and POMDPs</a></li>
        </ul>
    </li>
    <li>5. <a href="#5.-The-Environment">The Environment</a></li>
    <li>6. <a href="#6.-RL-Algorithm-Components">RL Algorithm Components</a>
        <ul>
           <li>6.1. <a href="#6.1.-Model">Model</a></li>
           <li>6.2. <a href="#6.2.-Policy">Policy</a></li>
           <li>6.3. <a href="#6.3.-Value-Function">Value Function</a></li>
        </ul>
    </li>
    <li>7. <a href="#7.-Types-of-RL-Agents">Types of RL Agents</a></li>
    <li>8. <a href="#8.-Problems-in-Sequential-Decision-Processes">Problems in Sequential Decision Processes</a></li>
    <li>9. <a href="#9.-Resource">Resource</a></li>
</ul>
</div>

# 1. Reinforcement Learning

> __Reinforcement Learning__ : _Learn_ to make good sequences of decisions (under uncertainty)
<br>

There are many applications of RL:

* play games! 
* robotics
* healthcare

# 2. Pillars of RL vs Other ML Paradigms

Reinforcement Learning is defined by 4 characteristics:

1. __Optimization__
    - We want to find an optimal way to make good decisions.
2. __Delayed consequences__
    - A decision made now can affect outcomes much later, so it may not be clear at the time whether it was good or bad.
3. __Exploration__
    - The agent learns about its environment by acting in it.
    - Data is censored: the agent only observes the outcome of the actions it actually takes.
4. __Generalization__
    - Why not just hard-code a way to solve an RL problem? That is usually infeasible, so we need a method that generalizes to different problems.
    
<br>

_We know RL is a field within ML, but how does RL compare to other paradigms in ML? Each list below gives the characteristics that paradigm involves._

<br>

__AI Planning__ vs RL:

* Optimization
* Generalization
* Delayed consequences
* Is given a model of the world (the rules of the game, e.g. Go or chess)

__Supervised ML__ vs RL

* Optimization
* Generalization
* Learns from experience with correct labels

__Unsupervised ML__ vs RL

* Optimization
* Generalization
* Learns from experience with no labels


__Imitation Learning__ vs RL

* Optimization
* Generalization
* Delayed consequences
* Learns by imitating demonstrations of good policies provided by others
* A promising complement to RL, because learning from demonstrations avoids the exploration problem

# 3. Introduction to Sequential Decision Processes

![diagram.PNG](attachment:diagram.PNG) <br>
_Figure 1. The sequential decision process diagram._

An __agent__ interacts with the __world__ through __actions__, and the world responds with __observations__ and __rewards__. The goal is to maximize the total expected future reward.

Some problems:
* require balancing immediate and long term rewards
* require strategic behavior to achieve high rewards

The agent has a history $h_{t} = (a_{1}, o_{1}, r_{1},..., a_{t}, o_{t}, r_{t})$ of past actions, observations, and rewards. The __agent state__ is a function of that history: $s_{t} = f(h_{t})$.
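
The interaction loop and the history can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyWorld`, `choose_action`, `agent_state`, and `run_episode` are hypothetical names, and the world's rules are invented so the loop has something to run against.

```python
import random
from typing import List, Tuple

Step = Tuple[int, int, float]  # one history entry: (action, observation, reward)


class ToyWorld:
    """Hypothetical world: it echoes the action back as the observation."""

    def step(self, action: int) -> Tuple[int, float]:
        observation = action
        reward = 1.0 if action == 1 else 0.0
        return observation, reward


def agent_state(history: List[Step]) -> int:
    """s_t = f(h_t); here f keeps only the most recent observation."""
    return history[-1][1] if history else 0


def choose_action(state: int, rng: random.Random) -> int:
    """Placeholder policy: picks at random and ignores the state."""
    return rng.choice([0, 1])


def run_episode(num_steps: int, seed: int = 0) -> List[Step]:
    rng = random.Random(seed)
    world = ToyWorld()
    history: List[Step] = []
    for _ in range(num_steps):
        state = agent_state(history)           # summarize the history so far
        action = choose_action(state, rng)     # agent acts
        observation, reward = world.step(action)  # world responds
        history.append((action, observation, reward))
    return history


history = run_episode(num_steps=5)
total_reward = sum(reward for _, _, reward in history)
```

Any other choice of `f` (for example, the last two observations) would fit the same loop; only `agent_state` would change.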

The world can be __fully observable__ or __partially observable__ to the agent.

Within the field of RL, we commonly make the __Markov assumption__.

> __Markov assumption__ : the future is independent of the past given the present
 
 $$
 p(s_{t + 1}~|~s_{t}, a_{t}) =  p(s_{t + 1}~|~h_{t}, a_{t}) \hspace{1em} (Eq.~1)\\
 $$
 
This assumption is not restrictive: it can always be satisfied by taking the full history as the state, $s_{t} = h_{t}$.
In practice, we often assume that the most recent observation is a sufficient statistic of the history: $s_{t} = o_{t}$.
 
> __Markov Decision Process (MDP)__ : the world/environment is fully observable. Then, the observation can be the agent's state. 

> __Partially Observable Markov Decision Process (POMDP)__ : the world/environment is partially observable. Then the agent's state is built from the history, either all of it or a summary of it (where the history is generated from the hidden environment state).

# 4. Types of Sequential Decision Processes

## 4.1. Bandits

> __Bandits__ : actions have no influence on the next observations, and there are no delayed rewards. Example: choosing which advertisement to show a user.
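
As a rough illustration of the advertisement example, the sketch below simulates a two-armed bandit. `BernoulliBandit` and the click probabilities are hypothetical, chosen only to show that each reward depends on the arm pulled and not on anything that happened earlier.

```python
import random


class BernoulliBandit:
    """Hypothetical k-armed bandit: each arm pays 1 with a fixed probability."""

    def __init__(self, success_probs, seed=0):
        self.success_probs = list(success_probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        # The reward depends only on the chosen arm, never on earlier pulls.
        return 1.0 if self.rng.random() < self.success_probs[arm] else 0.0


bandit = BernoulliBandit([0.1, 0.8], seed=0)  # assumed click rates of two ads
counts = [0, 0]
totals = [0.0, 0.0]
for t in range(2000):
    arm = t % 2                 # alternate between the two ads
    reward = bandit.pull(arm)
    counts[arm] += 1
    totals[arm] += reward

# Sample-average estimate of each ad's expected reward.
estimates = [totals[a] / counts[a] for a in range(2)]
```

Because pulling an arm does not change what happens next, averaging the rewards per arm is enough to compare the ads.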

## 4.2. MDPs and POMDPs

These (as defined above in [`3. Introduction to Sequential Decision Processes`](#3.-Introduction-to-Sequential-Decision-Processes)) differ from bandits in the key respect: actions taken now affect future observations, so rewards can be delayed.

# 5. The Environment

> __Deterministic__ : given a history and an action, there is exactly one possible observation and reward

> __Stochastic__ : given a history and an action, there are many potential observations and rewards, each occurring with some probability
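
The difference can be made concrete with a small sketch. Both step functions and the `slip_prob` parameter are hypothetical, and a single integer state stands in for the history to keep the example short.

```python
import random


def deterministic_step(state, action):
    """The same (state, action) always gives the same observation and reward."""
    next_state = state + action
    reward = float(next_state)
    return next_state, reward


def stochastic_step(state, action, rng, slip_prob=0.2):
    """With probability slip_prob the action "slips" and has no effect."""
    if rng.random() < slip_prob:
        action = 0
    return deterministic_step(state, action)


rng = random.Random(0)
# Repeat the same (state, action) pair many times and collect distinct outcomes.
deterministic_outcomes = {deterministic_step(3, 1) for _ in range(100)}
stochastic_outcomes = {stochastic_step(3, 1, rng) for _ in range(100)}
```

Repeating the same input yields a single outcome in the deterministic case and more than one in the stochastic case.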

# 6. RL Algorithm Components

We employ an RL algorithm to solve an RL task. This algorithm usually has one or more of the following components:

* __Model__ : representation of the world (something to estimate changes in the world from an agent's action)
* __Policy__ : function that maps the agent's states to actions
* __Value function__ : function that gives the expected future rewards from being in a state and/or taking an action when following a certain policy

## 6.1. Model

$$
\begin{aligned}
p(s_{t + 1} = s'~|~ s_{t} = s, a_{t} = a) & \hspace{1em} (Eq.~2)\\
r(s_{t} = s, a_{t} = a) = \mathbb{E}[r_{t}~|~s_{t} = s, a_{t} = a] & \hspace{1em} (Eq.~3)
\end{aligned}
$$

A model can have two parts. The transition (dynamics) model in Eq. 2 gives the probability of landing in state $s'$ given the current state and action. The reward model in Eq. 3 gives the expected immediate reward given the current state and action.
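
One simple way to obtain such a model, sketched here as an assumption rather than as the method used in the lectures, is to count observed transitions in a small tabular problem. The function `estimate_model` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_model(transitions):
    """Estimate Eq. 2 and Eq. 3 from (s, a, r, s_next) samples by counting."""
    next_counts = defaultdict(Counter)
    reward_sums = defaultdict(float)
    visits = Counter()
    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    # p_hat[(s, a)][s_next] approximates p(s_{t+1} = s_next | s_t = s, a_t = a)
    p_hat = {
        sa: {s_next: n / visits[sa] for s_next, n in counter.items()}
        for sa, counter in next_counts.items()
    }
    # r_hat[(s, a)] approximates E[r_t | s_t = s, a_t = a]
    r_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return p_hat, r_hat


data = [
    ("s1", "right", 0.0, "s2"),
    ("s1", "right", 0.0, "s2"),
    ("s1", "right", 1.0, "s1"),
    ("s1", "right", 0.0, "s2"),
]
p_hat, r_hat = estimate_model(data)
```

With these four samples, `p_hat[("s1", "right")]` assigns 0.75 to `"s2"` and 0.25 to `"s1"`, and `r_hat[("s1", "right")]` is 0.25.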

## 6.2. Policy

The policy $\pi$ determines how the agent chooses actions. It maps states to actions: $\pi~~:~~S \rightarrow A$

The policy can be deterministic $\pi(s) = a$ or stochastic $\pi(a|s) = P(a_{t} = a~|~s_{t} = s)$.
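
Both kinds of policy can be written as plain functions. The states, actions, and probabilities below are hypothetical and serve only to show the two signatures side by side.

```python
import random

ACTIONS = ["left", "right"]


def deterministic_policy(state):
    """pi(s) = a: exactly one action per state."""
    return "right" if state < 3 else "left"


def stochastic_policy(state):
    """pi(a | s): a probability for every action in the given state."""
    p_right = 0.9 if state < 3 else 0.1
    return {"left": 1.0 - p_right, "right": p_right}


def sample_action(state, rng):
    """Draw one action from the stochastic policy's distribution."""
    probs = stochastic_policy(state)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]


rng = random.Random(0)
samples = [sample_action(0, rng) for _ in range(1000)]
```

A deterministic policy is the special case of a stochastic one that puts probability 1 on a single action.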

## 6.3. Value Function

> __value function $V^{\pi}$__ : expected discounted sum of future rewards under a particular policy $\pi$

$$
V^{\pi}(s_{t} = s) = \mathbb{E}[r_{t} + \gamma r_{t + 1} + \gamma^{2} r_{t + 2} + ...~|~ s_{t} = s] \hspace{1em} (Eq.~4)\\
$$

The value function answers the question: if we start in state $s$ and act according to policy $\pi$ from then on, what is the expected discounted sum of future rewards?

> __discount factor $\gamma$__ : $\gamma \in (0, 1)$ weighs immediate vs future rewards; the smaller $\gamma$ is, the more immediate rewards dominate
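
The sum inside Eq. 4 can be computed for a finite list of rewards as below. Averaging such returns over sampled trajectories gives a simple Monte Carlo estimate of the expectation; that estimator and the reward sequences are assumptions made for illustration, not something the notes above prescribe.

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(reward_sequences, gamma):
    """Average the discounted returns of trajectories that all start in the same state."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


rewards = [1.0, 1.0, 1.0]
short_sighted = discounted_return(rewards, gamma=0.1)  # close to the first reward alone
far_sighted = discounted_return(rewards, gamma=0.9)    # close to the plain sum

# Two hypothetical trajectories: an early reward and a late reward.
value_estimate = monte_carlo_value([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], gamma=0.5)
```

With $\gamma = 0.5$ the early reward counts fully (return 1.0) while the late reward is discounted twice (return 0.25), so the estimate is 0.625.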

# 7. Types of RL Agents

> __Model-based__ : an RL agent that has an explicit model of the world; it may or may not have an explicit policy and/or value function

> __Model-free__ : an RL agent that has no explicit model, but has an explicit value function and/or policy

![rl_agent_types.PNG](attachment:rl_agent_types.PNG) <br>
_Figure 2. RL Agent types._

# 8. Problems in Sequential Decision Processes

__Planning__
* given a model of how the world works, how can the agent work out (plan) a good sequence of decisions going forward?

__Exploration and/vs Exploitation__
* how do we balance trying new actions (exploration) against acting on prior experience (exploitation)?
* we shouldn't always follow our current policy; sometimes we want to explore
* conversely, we don't always want to be trying new things
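
One common way to strike this balance is epsilon-greedy action selection, a technique not named in the notes above and shown here only as an assumed example. The function `epsilon_greedy` and the reward estimates are hypothetical.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: pick any action
    # exploit: pick the action with the highest estimated reward
    return max(range(len(estimates)), key=lambda a: estimates[a])


rng = random.Random(0)
estimates = [0.2, 0.5, 0.1]  # assumed reward estimates for three actions
always_greedy = [epsilon_greedy(estimates, 0.0, rng) for _ in range(100)]
mostly_greedy = [epsilon_greedy(estimates, 0.3, rng) for _ in range(1000)]
```

With `epsilon=0.0` the agent never tries anything but action 1; with `epsilon=0.3` it still favors action 1 but occasionally samples the others, which lets it correct estimates that are wrong.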

__Evaluation and Control__
* we want to evaluate how good a policy is (by estimating/predicting expected rewards)
* we also want to optimize and find the best policy (control)
* the two are closely linked: evaluation is usually a building block of control, since improving a policy requires knowing how good it is
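
The two problems can be contrasted on a tiny example. The sketch below assumes a hypothetical 2-state deterministic MDP with a known model, evaluates a policy by repeatedly applying the one-step backup, and does control by brute force over all deterministic policies. None of the names or numbers come from the lectures.

```python
# Hypothetical 2-state MDP: MDP[state][action] = (next_state, reward).
MDP = {
    0: {"stay": (0, 0.0), "go": (1, 1.0)},
    1: {"stay": (1, 2.0), "go": (0, 0.0)},
}
GAMMA = 0.9


def evaluate(policy, sweeps=500):
    """Evaluation: estimate the discounted value of each state under a fixed policy."""
    v = {s: 0.0 for s in MDP}
    for _ in range(sweeps):
        # One backup per state: immediate reward plus discounted value of the next state.
        v = {
            s: MDP[s][policy[s]][1] + GAMMA * v[MDP[s][policy[s]][0]]
            for s in MDP
        }
    return v


def control():
    """Control: search for the policy with the highest value from state 0."""
    candidates = [
        {0: a0, 1: a1} for a0 in ("stay", "go") for a1 in ("stay", "go")
    ]
    return max(candidates, key=lambda pi: evaluate(pi)[0])


best_policy = control()
best_values = evaluate(best_policy)
```

Here the best policy moves from state 0 to state 1 and then stays there, giving values of about 19 and 20. Enumerating policies only works for toy problems; it is used here because it makes the reuse of `evaluate` inside `control` easy to see.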

# 9. Resource

If you missed the link right below the title, I'm providing the resource here again along with the course website.

- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)
- [Course Website](http://web.stanford.edu/class/cs234/index.html)

This is a series of 15 lectures provided by Stanford.
