# Usage

In reinforcement learning, the agent generates its own training data by interacting with the world. The agent must learn the consequences of its own actions through trial and error, rather than being told the correct action.

# 1. K-armed bandits

Let's start with a simple case of **`decision-making under uncertainty`**.

In the **`K-armed bandit problem`**:
- we have an **`agent`**
- who chooses between **`k actions`**
- and receives a **`reward`** based on the action it chooses.

<img src="resources/k_armed_bandit.png" alt="Drawing" style="width: 400px; margin-left: 4em"/>

## 1.1 The Action Values

### 1.1.1 The real Action Values q*(a)
We define the value of selecting an action as the expected reward we receive when taking that action.

**`q*(a) = E[ R | A = a ] = Sum[ p(r|a)*r ]`**, for each possible action 1 through k.

<img src="resources/q*_and_argmax(q*).png" style="width: 400px; margin-left: 4em"/>
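As a minimal sketch of the definition above, the expected reward can be computed directly when the reward distribution is known. The three-armed bandit and its probabilities in `reward_dist` are hypothetical values chosen for illustration, not taken from the figure.

```python
# Hypothetical reward distributions p(r | a) for a 3-armed bandit.
# Each action maps a possible reward to its probability.
reward_dist = {
    0: {0.0: 0.5, 1.0: 0.5},
    1: {0.0: 0.2, 2.0: 0.8},
    2: {1.0: 1.0},
}


def true_action_value(dist):
    """q*(a) = sum over r of p(r | a) * r."""
    return sum(p * r for r, p in dist.items())


q_star = {a: true_action_value(d) for a, d in reward_dist.items()}

# The optimal action is the argmax of q* over all actions.
best_action = max(q_star, key=q_star.get)
```

In practice the agent cannot run this computation, because `p(r|a)` is not known to it.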

### 1.1.2 Estimated Action values Q(a)

However, q*(a) is unknown in most problems, so we need to estimate it. One way is the sample average of the rewards observed for that action.

**`Q(a, t) = (sum of rewards received when a was taken before time t) / (number of times a was taken before time t)`**

<img src="resources/Q(a)_sample_average.png" style="width: 400px; margin-left: 4em"/>
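The sample-average estimate could be sketched as follows. The `history` list of `(action, reward)` pairs and the `default` value used for untried actions are assumptions made for this example.

```python
def sample_average(history, action, default=0.0):
    """Average the rewards observed on the steps where `action` was taken.

    `history` is a list of (action, reward) pairs from earlier time steps.
    `default` is returned for an action that has never been tried.
    """
    rewards = [r for a, r in history if a == action]
    if not rewards:
        return default
    return sum(rewards) / len(rewards)


# Hypothetical interaction history.
history = [(0, 1.0), (1, 0.0), (0, 3.0), (2, 5.0)]
q0 = sample_average(history, 0)
```

Note that the denominator counts only the steps on which the action was selected, not all time steps.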

### 1.1.3 Estimating Action Values Incrementally

**`Q(n + 1) = Sum(R) / n = Q(n) + [R(n)- Q(n)] / n`**

<img src="resources/Q(a)_incremental_update.png" style="width: 400px; margin-left: 4em"/>

<img src="resources/Q(a)_incremental_update_rule.png" style="width: 400px; margin-left: 4em"/>
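A minimal sketch of the incremental rule: instead of storing every reward, the agent keeps only the current estimate and a count. The function name `incremental_update` is chosen for this example.

```python
def incremental_update(q, n, reward):
    """Apply NewEstimate = OldEstimate + (Target - OldEstimate) / n.

    `q` is the current estimate and `n` the number of rewards seen so far.
    Returns the updated (estimate, count) pair.
    """
    n += 1
    q += (reward - q) / n
    return q, n


q, n = 0.0, 0
for reward in [1.0, 3.0, 5.0]:
    q, n = incremental_update(q, n, reward)
```

After the loop, `q` matches the plain average of the three rewards, which is what the sample-average formula would give.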

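### 1.1.4 A constant step size handles non-stationary problems

With a constant step size alpha in place of 1/n, the update becomes **`Q(n + 1) = Q(n) + alpha * [R(n) - Q(n)]`**. The influence of the initial estimate of Q goes to zero as more data arrives, and **`the most recent rewards contribute most`** to our current estimate, so the estimate can track a reward distribution that changes over time.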

<img src="resources/Q(a)_decaying_past_rewards.png" style="width: 400px; margin-left: 4em"/>
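The difference between the two step sizes can be seen in a small sketch. The reward sequence below is a hypothetical non-stationary arm whose payout jumps halfway through, and `alpha=0.1` is an arbitrary choice.

```python
def constant_step_update(q, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    return q + alpha * (reward - q)


def sample_average_update(q, n, reward):
    """Sample-average update, i.e. step size 1 / n."""
    n += 1
    return q + (reward - q) / n, n


# Hypothetical arm: pays 0 for 50 steps, then 10 for 50 steps.
rewards = [0.0] * 50 + [10.0] * 50

q_const = 0.0
q_avg, n = 0.0, 0
for r in rewards:
    q_const = constant_step_update(q_const, r, alpha=0.1)
    q_avg, n = sample_average_update(q_avg, n, r)
```

At the end, `q_const` sits close to the current payout of 10, while `q_avg` is still about 5 because it weights the old rewards equally.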

## 1.2 Action Selection

### 1.2.1 The Greedy Action

The greedy action is the action that currently has **`the largest estimated value`**.

### 1.2.2 Exploitation

**`Selecting the greedy action`** means the agent is exploiting its current knowledge. It is trying to **`get the most reward`** it can **`right now`**. We can compute the greedy action **`by taking the argmax`** of our estimated values.

### 1.2.3 Exploration

Alternatively, the agent may choose to explore by **`choosing a non-greedy action`**. The agent would **`sacrifice immediate reward`**, hoping to gain more information about the other actions for **`potential long term rewards`**.

### 1.2.4 Optimistic Initial Value

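- Encourages early exploration: every action looks promising until it has been tried.
- Cannot adapt to non-stationary problems: the drive to explore comes only from the initial values, so exploration stops in later steps.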
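The effect of optimistic initialization can be sketched with a purely greedy agent. The deterministic payouts in `arm_rewards` and the initial value `5.0` are hypothetical, and ties are broken in favor of the lowest index.

```python
def run_greedy(initial_q, rewards, steps):
    """Run a purely greedy agent with sample-average updates.

    `rewards[a]` is the (deterministic, hypothetical) payout of arm `a`.
    Returns the sequence of chosen arms.
    """
    k = len(rewards)
    q = [initial_q] * k
    n = [0] * k
    chosen = []
    for _ in range(steps):
        a = max(range(k), key=q.__getitem__)  # first index wins ties
        n[a] += 1
        q[a] += (rewards[a] - q[a]) / n[a]
        chosen.append(a)
    return chosen


arm_rewards = [0.2, 0.5, 0.8]
optimistic = run_greedy(5.0, arm_rewards, steps=6)
neutral = run_greedy(0.0, arm_rewards, steps=6)
```

With the optimistic start, each arm disappoints in turn, so the agent tries all three before settling on the best one. With a start of zero, the agent locks onto the first arm it tries. Once the optimistic values have been overwritten, no further exploration happens.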

### 1.2.5 Epsilon-Greedy

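- Works for non-stationary problems, because it keeps exploring with probability epsilon at every step.
- However, it cannot fully exploit what it has learned: even with accurate estimates, it still picks a random action with probability epsilon.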

<img src="resources/Action_selection_Epsilon_Greedy.png" style="width: 400px; margin-left: 4em"/>
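A minimal sketch of epsilon-greedy selection, assuming the estimates are held in a list `q` indexed by action. The `rng` parameter is an addition for this example so that a seeded generator can be passed in.

```python
import random


def epsilon_greedy(q, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))  # explore
    return max(range(len(q)), key=q.__getitem__)  # exploit (argmax)


estimates = [0.1, 0.9, 0.3]
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)
mixed_choice = epsilon_greedy(estimates, epsilon=0.1, rng=random.Random(0))
```

Setting `epsilon=0.0` recovers the purely greedy action from section 1.2.1.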


### 1.2.6 Upper Confidence Bound

- Initially, UCB explores more to systematically reduce uncertainty.

- UCB's exploration decreases over time, so it is also a poor fit for non-stationary problems.

<img src="resources/Action_selection_Upper_Confidence_Bound.png" style="width: 400px; margin-left: 4em"/>
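A sketch of UCB selection in its common form, where the score is the estimate plus a bonus `c * sqrt(ln t / N(a))`. The default `c=2.0` and the convention of picking untried actions first are assumptions of this example.

```python
import math


def ucb_action(q, counts, t, c=2.0):
    """Select the action with the highest upper confidence bound.

    `q[a]` is the current estimate, `counts[a]` the number of times
    action `a` was taken, and `t` the current time step (t >= 1).
    """
    # Untried actions are treated as maximally uncertain and picked first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        q[a] + c * math.sqrt(math.log(t) / counts[a])
        for a in range(len(q))
    ]
    return max(range(len(q)), key=scores.__getitem__)


# Action 1 has a lower estimate but was tried only once, so its bonus is large.
choice = ucb_action([0.5, 0.4], [100, 1], t=101)
```

As `counts[a]` grows, the bonus for that action shrinks, which is why exploration fades over time.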


---

# 2. Markov Decision Process (MDP)

The K-armed bandit problem leaves out many aspects of real-world problems. The agent is presented with the same situation each time, and the same action is always optimal.

In many problems, different situations call for different responses. The actions we choose now affect the amount of reward we can get in the future.

<img src="resources/MDP.png" style="width: 600px; margin-left: 4em"/>

This diagram summarizes the agent-environment interaction in the MDP framework. The interaction generates a trajectory of experience consisting of states, actions, and rewards. Actions influence immediate rewards as well as future states and, through those, future rewards.

## 2.1 Dynamics of MDP

Given a state s and an action a, the dynamics function p(s', r | s, a) gives the joint probability of the next state s' and the reward r.

<img src="resources/MDP_Dynamics.png" style="width: 600px; margin-left: 4em"/>

The following example shows the Dynamics of the Soda Can Robot problem.

States: Battery High, Battery Low

Actions: Wait, Search, Recharge


<img src="resources/MDP_Example.png" style="width: 600px; margin-left: 4em"/>
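One way to hold such dynamics in code is a table keyed by `(state, action)`. The sketch below uses the states and actions listed above, but every probability and reward in it is a placeholder invented for illustration and should not be read as the values in the figure.

```python
# dynamics[(s, a)] is a list of (next_state, reward, probability) outcomes.
# All numbers are hypothetical placeholders.
dynamics = {
    ("high", "search"): [("high", 2.0, 0.8), ("low", 2.0, 0.2)],
    ("high", "wait"): [("high", 1.0, 1.0)],
    ("low", "search"): [("low", 2.0, 0.6), ("high", -3.0, 0.4)],
    ("low", "wait"): [("low", 1.0, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
}


def is_valid_dynamics(dynamics, tol=1e-9):
    """For every (s, a), the probabilities over (s', r) must sum to 1."""
    return all(
        abs(sum(p for _, _, p in outcomes) - 1.0) < tol
        for outcomes in dynamics.values()
    )


def expected_reward(dynamics, state, action):
    """r(s, a) = sum over (s', r) of p(s', r | s, a) * r."""
    return sum(p * r for _, r, p in dynamics[(state, action)])
```

The validity check reflects the fact that p is a probability distribution over next-state and reward pairs for each state-action pair.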

## 2.2 Goal: Maximize the expected return

The return at time step t is the **`sum of rewards`** obtained **`after time step t`**.

**`G(t) = R(t+1) + R(t+2) + R(t+3) + …`**

**`E[G(t)] = E[ R(t+1) + R(t+2) + R(t+3) + … ]`**

- Episodic task: the agent-environment interaction breaks up into episodes. 
- Continuing task: the agent-environment interaction goes on indefinitely. 

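The **`discount factor`** gamma (γ), with 0 ≤ γ < 1, makes sure the **`sum of all future rewards is finite`** in continuing tasks, provided the rewards are bounded. The discounted return is **`G(t) = R(t+1) + γ R(t+2) + γ² R(t+3) + …`**

- A **`smaller gamma`** makes the agent **`short-sighted`**, since it cares more about immediate rewards.
- A **`greater gamma`** gives more weight to the **`long-term rewards`**.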

<img src="resources/Recursive_nature_of_expected_goal.png" style="width: 400px; margin-left: 4em"/>
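The recursive form `G(t) = R(t+1) + γ G(t+1)` suggests computing returns backward from the end of an episode. A minimal sketch, assuming a finite list of rewards where `rewards[i]` holds `R(i+1)`:

```python
def discounted_returns(rewards, gamma):
    """Compute G(t) for every t of a finite episode.

    `rewards[i]` is R(i+1), the reward received after step i.
    Uses the recursion G(t) = R(t+1) + gamma * G(t+1), with G = 0
    after the final step.
    """
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


short_sighted = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
undiscounted = discounted_returns([1.0, 2.0, 3.0], gamma=1.0)
```

With `gamma=0.5` the later rewards count for less, so the return from the first step is 1.75 rather than 3.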

## 2.3 Policy: A distribution over actions for each possible state.

- Deterministic policy: maps each state to a single action.
- Stochastic policy: assigns a probability to each action in each state.
- A valid policy depends only on the current state.

<img src="resources/Policy_notation.png" style="width: 600px; margin-left: 4em"/>
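Both kinds of policy can be sketched as plain dictionaries. The state and action names follow the robot example from section 2.1, while the probabilities are hypothetical.

```python
import random

# Deterministic policy: one action per state.
deterministic_policy = {"high": "search", "low": "recharge"}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "high": {"search": 0.7, "wait": 0.3},
    "low": {"search": 0.1, "wait": 0.4, "recharge": 0.5},
}


def sample_action(policy, state, rng=random):
    """Draw an action from pi(. | state); only the current state is used."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


action = sample_action(stochastic_policy, "low", rng=random.Random(0))
```

A deterministic policy is the special case in which one action in each state has probability 1.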

## 2.4 Value functions

Value functions are crucial in reinforcement learning: they allow an agent to **`query the quality of its current situation`** instead of waiting to observe the long-term outcome.

This helps for two reasons:

- First, the return is not immediately available.
- Second, the return may be random due to stochasticity in both the policy and the environment dynamics.

The value function **`summarizes all the possible futures`** by averaging over returns. 

Ultimately, we **`care most about learning a good policy`**. Value functions enable us to **`judge the quality of different policies`**.

### 2.4.1 State Value functions

The state value function is the expected return from a given state when the agent follows policy Pi.

<img src="resources/State_value_function.png" style="width: 400px; margin-left: 4em"/>

### 2.4.2 Action Value functions

The action value of a state-action pair (s, a) is the expected return if the agent selects action a in state s and then follows policy Pi.

<img src="resources/Action_value_function.png" style="width: 400px; margin-left: 4em"/>


## 2.5 Bellman equations of Value functions

They provide **`relationships`** between the **`values of a state or state-action pair`** and the **`possible next states or next state-action pairs`**.

### 2.5.1 State value Bellman equations

<img src="resources/State_value_Bellman_euqation.png" style="width: 600px; margin-left: 4em"/>

### 2.5.2 Action value Bellman equations

<img src="resources/Action_value_Bellman_equation.png" style="width: 600px; margin-left: 4em"/>

### 2.5.3 Why Bellman equations?

We can use the Bellman equations to **`compute value functions`**.

- You can use the Bellman equations to **`solve for a value function`** by writing **`a system of linear equations`**, one equation per state.
- We can **`only solve small MDPs`** directly with the Bellman equations, since the number of equations grows with the number of states.

<img src="resources/Bellman_equation_example.png" style="width: 600px; margin-left: 4em"/>
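For a fixed policy, the state value Bellman equations read `v = r + γ P v`, which rearranges to the linear system `(I - γ P) v = r`. The sketch below solves it for a hypothetical two-state MDP. The matrices `p_pi` and `r_pi` (transition probabilities and expected rewards under the policy) are made-up values, not the example in the figure, and the small Gauss-Jordan solver is included only to keep the example free of dependencies.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the system has a unique solution (true here when gamma < 1).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def evaluate_policy(p, r, gamma):
    """Solve (I - gamma * P) v = r for the state values v."""
    n = len(r)
    a = [
        [(1.0 if i == j else 0.0) - gamma * p[i][j] for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(a, r)


# Hypothetical two-state MDP under a fixed policy:
# state 0 always moves to state 1 with reward 1,
# state 1 always moves to state 0 with reward 0.
p_pi = [[0.0, 1.0], [1.0, 0.0]]
r_pi = [1.0, 0.0]
v = evaluate_policy(p_pi, r_pi, gamma=0.5)
```

Here the two equations are `v0 = 1 + 0.5 * v1` and `v1 = 0.5 * v0`, giving `v0 = 4/3` and `v1 = 2/3`. Direct solving of this kind becomes impractical as the number of states grows.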
