Watch all videos and read the review sections.
It's a lot of material, but we need to get through it before we can do some coding exercises.

# 1.0 Introduction to RL
---

[Intro](https://www.youtube.com/watch?v=6jSFl5kxIBs&t=2s)

[The Setting](https://www.youtube.com/watch?v=nh8Gwdu19nc&t=21s)

[The Setting Revisited](https://www.youtube.com/watch?v=V6Q1uF8a6kA&t=8s)

[Episodic vs. Continuing Tasks](https://www.youtube.com/watch?v=E1I-BPanSM8&t=51s)

[Reward Hypothesis](https://www.youtube.com/watch?v=uAqNwgZ49JE&t=10s)

[Goals and Rewards: 1](https://www.youtube.com/watch?v=XPnj3Ya3EuM&t=1s)

[Goals and Rewards: 2](https://www.youtube.com/watch?v=pVIFc72VYH8&t=4s)

[Cumulative Rewards](https://www.youtube.com/watch?v=ysriH65lV9o&t=62s)

[Discounted Rewards](https://www.youtube.com/watch?v=opXGNPwwn7g&t=58s)

[MDPs: 1](https://www.youtube.com/watch?v=NBWbluSbxPg&t=1s)

[MDPs: 2](https://www.youtube.com/watch?v=CUTtQvxKkNw&t=3s)

[MDPs: 3](https://www.youtube.com/watch?v=UlXHFbla3QI&t=17s)

## Review
---

### The Setting, Revisited
- The reinforcement learning (RL) framework is characterized by an agent learning to interact with its environment.
- At each time step, the agent receives the environment's state (the environment presents a situation to the agent), and the agent must choose an appropriate action in response. One time step later, the agent receives a reward (the environment indicates whether the agent has responded appropriately to the state) and a new state.
- All agents have the goal to maximize expected cumulative reward, or the expected sum of rewards attained over all time steps.

### Episodic vs. Continuing Tasks
- A task is an instance of the reinforcement learning (RL) problem.
- Continuing tasks are tasks that continue forever, without end.
- Episodic tasks are tasks with a well-defined starting and ending point.
- In this case, we refer to a complete sequence of interaction, from start to finish, as an episode.
- Episodic tasks come to an end whenever the agent reaches a terminal state.

### The Reward Hypothesis
- Reward Hypothesis: All goals can be framed as the maximization of (expected) cumulative reward.

### Cumulative Reward
- The return at time step $t$ is:
$$G_t := R_{t+1} + R_{t+2} + R_{t+3} + \ldots$$

- The agent selects actions with the goal of maximizing expected (discounted) return. (Note: discounting is covered in the next concept.)

### Discounted Return
- The discounted return at time step $t$ is:
$$G_t := R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$$

- The discount rate $\gamma$ is something that you set, to refine the goal that you have for the agent.
    - It must satisfy $0 \leq \gamma \leq 1$.
    - If $\gamma=0$, the agent only cares about the most immediate reward.
    - If $\gamma=1$, the return is not discounted.
    - For larger values of $\gamma$, the agent cares more about the distant future. Smaller values of $\gamma$ result in more extreme discounting, where, in the most extreme case, the agent cares only about the most immediate reward.
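
As a rough illustration of the formula above, the following sketch computes the discounted return for a finite list of rewards. The function name `discounted_return` and the sample rewards are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards [R_{t+1}, R_{t+2}, ...] and discount rate gamma."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Work backwards so that each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, 0.0))  # 1.0  (only the immediate reward counts)
print(discounted_return(rewards, 0.5))  # 2.75 (1 + 0.5*2 + 0.25*3)
print(discounted_return(rewards, 1.0))  # 6.0  (undiscounted sum)
```

With $\gamma=1$ the function reduces to the undiscounted cumulative reward from the previous section.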

### MDPs and One-Step Dynamics
- The state space $\mathcal{S}$ is the set of all (nonterminal) states.
- In episodic tasks, we use $\mathcal{S}^+$ to refer to the set of all states, including terminal states.
- The action space $\mathcal{A}$ is the set of possible actions. (Alternatively, $\mathcal{A}(s)$ refers to the set of possible actions available in state $s \in \mathcal{S}$.)
- The one-step dynamics of the environment determine how the environment decides the state and reward at every time step. This is often called the transition function:
$$p(s',r|s,a) \doteq \mathbb{P}(S_{t+1}=s', R_{t+1}=r|S_{t} = s, A_{t}=a) \text{ for each possible } s', r, s, \text{and } a$$

- A (finite) Markov Decision Process (MDP) is defined by:
    - a (finite) set of states $\mathcal{S}$ (or $\mathcal{S}^+$, in the case of an episodic task)
    - a (finite) set of actions $\mathcal{A}$
    - a set of rewards $\mathcal{R}$
    - the one-step dynamics of the environment
    - the discount rate $\gamma \in [0,1]$
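
One possible way to write down the one-step dynamics of a small finite MDP is a table keyed by state-action pairs. The sketch below is only an illustration: the states `"A"`, `"B"`, `"END"`, the actions `"stay"` and `"go"`, and all probabilities and rewards are made up for the example.

```python
import random

# Hypothetical MDP: p(s', r | s, a) stored as {(s, a): [(prob, s', r), ...]}.
# "END" plays the role of a terminal state, so it has no outgoing entries.
dynamics = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"):   [(0.9, "END", 5.0), (0.1, "A", -1.0)],
}


def is_valid(dynamics, tol=1e-9):
    """Check that the outcome probabilities sum to 1 for every (s, a)."""
    return all(
        abs(sum(p for p, _, _ in outcomes) - 1.0) <= tol
        for outcomes in dynamics.values()
    )


def step(state, action, rng):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward


rng = random.Random(0)  # seeded only to make the example repeatable
print(is_valid(dynamics))
print(step("A", "go", rng))
```

The next state and reward depend only on the current state and action, which is what the one-step dynamics express.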





# 2.0 Monte Carlo Methods
---

[Review](https://www.youtube.com/watch?v=3H5x0lstvmo)

[Grid World Example](https://www.youtube.com/watch?v=Lwibg_IfmrA)

[Monte Carlo Methods](https://www.youtube.com/watch?v=titaMCRl224)

[MC Prediction Part: 1](https://www.youtube.com/watch?v=6ts9gdIS6vg)

[MC Prediction Part: 2](https://www.youtube.com/watch?v=jR49ZyKuJ98)

[MC Prediction Part: 3](https://www.youtube.com/watch?v=9LP6uXdmWxQ)

[Greedy Policy](https://www.youtube.com/watch?v=DH6c-aODMLU)

[Epsilon Greedy Policy](https://www.youtube.com/watch?v=PxJMtlR06MY)

[Incremental Mean](https://www.youtube.com/watch?v=h-8MB7V1LiE)

[Constant Alpha](https://www.youtube.com/watch?v=QFV1nI9Zpoo)

## Review
---

### Monte Carlo Methods
- Monte Carlo methods rely on sampling: even though the underlying problem involves a great degree of randomness, we can infer useful, trustworthy information just by collecting a lot of samples.
- The **equiprobable random policy** is the stochastic policy where, from each state, the agent randomly selects from the set of available actions, and each action is selected with equal probability.

### MC Prediction
- Algorithms that solve the prediction problem determine the value function $v_{\pi}$ (or $q_{\pi}$) corresponding to a policy $\pi$.
- When working with finite MDPs, we can estimate the action-value function $q_{\pi}$ corresponding to a policy $\pi$ in a table known as a Q-table. This table has one row for each state and one column for each action. The entry in the s-th row and a-th column contains the agent's estimate for expected return that is likely to follow, if the agent starts in state s, selects action a, and then henceforth follows the policy $\pi$.
- Each occurrence of the state-action pair $s,a$ ($s\in\mathcal{S}$, $a\in\mathcal{A}$) in an episode is called a visit to $s,a$.
- There are two types of MC prediction methods (for estimating $q_{\pi}$):
    - First-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following only first visits to $s,a$ (that is, it ignores returns that are associated with later visits).
    - Every-visit MC estimates $q_{\pi}(s,a)$ as the average of the returns following all visits to $s,a$.
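
The difference between the two methods can be sketched as follows. This is a minimal illustration, assuming each episode is given as a list of `(state, action, reward)` tuples, where `reward` is the reward received after taking `action` in `state`. The function name `mc_prediction_q` and the sample episode are hypothetical.

```python
from collections import defaultdict


def mc_prediction_q(episodes, gamma=1.0, first_visit=True):
    """Estimate Q(s, a) by averaging sampled returns over episodes."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][2] + gamma * g
            returns[t] = g
        seen = set()
        for t, (state, action, _) in enumerate(episode):
            if first_visit and (state, action) in seen:
                continue  # first-visit MC ignores later visits in the episode
            seen.add((state, action))
            returns_sum[(state, action)] += returns[t]
            visit_count[(state, action)] += 1
    return {sa: returns_sum[sa] / visit_count[sa] for sa in returns_sum}


# Hypothetical episode in which ("s1", "right") is visited twice.
episode = [("s1", "right", 1.0), ("s2", "left", 0.0), ("s1", "right", 2.0)]
print(mc_prediction_q([episode], first_visit=True))
# {('s1', 'right'): 3.0, ('s2', 'left'): 2.0}
print(mc_prediction_q([episode], first_visit=False))
# {('s1', 'right'): 2.5, ('s2', 'left'): 2.0}
```

In this example the returns following the two visits to `("s1", "right")` are 3.0 and 2.0, so first-visit MC keeps only 3.0 while every-visit MC averages both.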

### Greedy Policies
- A policy is greedy with respect to an action-value function estimate Q if for every state $s\in\mathcal{S}$, it is guaranteed to select an action $a\in\mathcal{A}(s)$ such that $a = \arg\max_{a\in\mathcal{A}(s)}Q(s,a)$. (It is common to refer to the selected action as the greedy action.)
- In the case of a finite MDP, the action-value function estimate is represented in a Q-table. Then, to get the greedy action(s), for each row in the table, we need only select the action (or actions) corresponding to the column(s) that maximize the row.
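
Reading the greedy action(s) off a Q-table can be sketched like this. The table layout (a dictionary mapping each state to a list of action values) and the numbers in it are assumptions made for the example.

```python
def greedy_actions(q_row, tol=1e-12):
    """Return the indices of all actions that maximize one row of the Q-table."""
    best = max(q_row)
    return [a for a, q in enumerate(q_row) if best - q <= tol]


# Hypothetical Q-table: one row per state, one column per action.
q_table = {
    "s0": [0.1, 0.5, 0.5],
    "s1": [1.0, -2.0, 0.3],
}
greedy_policy = {s: greedy_actions(row) for s, row in q_table.items()}
print(greedy_policy)  # {'s0': [1, 2], 's1': [0]}
```

State `"s0"` has two greedy actions because two columns tie for the row maximum.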

### Epsilon-Greedy Policies
- A policy is $\epsilon$-greedy with respect to an action-value function estimate Q if for every state $s\in\mathcal{S}$:
    - with probability $1-\epsilon$, the agent selects the greedy action, and
    - with probability $\epsilon$, the agent selects an action uniformly at random from the set of available (non-greedy AND greedy) actions.
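
Because the random choice also includes the greedy action, the greedy action ends up with probability $1-\epsilon+\epsilon/|\mathcal{A}(s)|$ and every other action with probability $\epsilon/|\mathcal{A}(s)|$. A minimal sketch, assuming a single greedy action per state (ties are broken by the lowest index) and hypothetical function names:

```python
import random


def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy for one Q-table row."""
    n = len(q_row)
    greedy = max(range(n), key=lambda a: q_row[a])  # lowest index wins ties
    probs = [epsilon / n] * n       # uniform share from the random choice
    probs[greedy] += 1.0 - epsilon  # extra mass on the greedy action
    return probs


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """Sample an action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_row, epsilon)
    return rng.choices(range(len(q_row)), weights=probs, k=1)[0]


q_row = [0.0, 1.0, 0.5, 0.2]  # hypothetical action values for one state
print(epsilon_greedy_probs(q_row, 0.2))   # roughly [0.05, 0.85, 0.05, 0.05]
print(epsilon_greedy_action(q_row, 0.0))  # 1, since epsilon = 0 is purely greedy
```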

### MC Control
- Algorithms designed to solve the control problem determine the optimal policy $\pi_*$ from interaction with the environment.
- The Monte Carlo control method uses alternating rounds of policy evaluation and improvement to recover the optimal policy.

### Exploration vs. Exploitation
- All reinforcement learning agents face the Exploration-Exploitation Dilemma, where they must find a way to balance the drive to behave optimally based on their current knowledge (exploitation) and the need to acquire knowledge to attain better judgment (exploration).
- In order for MC control to converge to the optimal policy, the Greedy in the Limit with Infinite Exploration (GLIE) conditions must be met:
    - every state-action pair $s,a$ (for all $s\in\mathcal{S}$ and $a\in\mathcal{A}(s)$) is visited infinitely many times, and
    - the policy converges to a policy that is greedy with respect to the action-value function estimate Q.

### Incremental Mean
- (In this concept, we amended the policy evaluation step to update the Q-table after every episode of interaction.)
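
The incremental mean lets the running average of the returns be updated one sample at a time, without storing every return. A minimal sketch of the standard update $Q \leftarrow Q + \frac{1}{N}(G - Q)$, where $N$ counts the returns seen so far; the function name and the sample returns are hypothetical.

```python
def incremental_mean_update(q, n, g):
    """Fold one new return g into the running mean q of n earlier returns."""
    n += 1
    q += (g - q) / n
    return q, n


q, n = 0.0, 0
for g in [2.0, 4.0, 9.0]:  # hypothetical returns observed for one (s, a) pair
    q, n = incremental_mean_update(q, n, g)
    print(n, q)
# After the three updates, q equals the plain average (2 + 4 + 9) / 3 = 5.0.
```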

### Constant-alpha
- (In this concept, we derived the algorithm for constant-$\alpha$ MC control, which uses a constant step-size parameter $\alpha$.)
- The step-size parameter $\alpha$ must satisfy $0 < \alpha \leq 1$. Higher values of $\alpha$ will result in faster learning, but values of $\alpha$ that are too high can prevent MC control from converging to $\pi_*$.
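
The constant-$\alpha$ update replaces the $1/N$ factor of the incremental mean with a fixed step size, giving $Q \leftarrow Q + \alpha (G - Q)$. The sketch below assumes a step size in $(0, 1]$; the function name and the sample values are hypothetical.

```python
def constant_alpha_update(q, g, alpha):
    """Move the estimate q a fraction alpha of the way toward the return g."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return q + alpha * (g - q)


q = 0.0
for _ in range(3):
    q = constant_alpha_update(q, 8.0, alpha=0.5)  # hypothetical return G = 8
    print(q)  # 4.0, then 6.0, then 7.0

print(constant_alpha_update(0.0, 8.0, alpha=1.0))  # 8.0: old estimate is discarded
```

With $\alpha = 1$ the estimate simply becomes the most recent return, which illustrates why step sizes that are too large can keep the estimates from settling.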

