# Usage

In reinforcement learning, the agent generates its own training data by interacting with the world. The agent must learn the consequences of its own actions through trial and error, rather than being told the correct action.

# 0. K-armed bandits

Let's start with a simple case of **`decision-making under uncertainty`**.

In the **`K-armed bandit problem`**:
- we have an **`agent`**
- who chooses between **`k actions`**
- and receives a **`reward`** based on the action it chooses.

<img src="resources/k_armed_bandit.png" alt="Drawing" style="width: 400px; margin-left: 4em"/>

## 0.1 The Action Values

### 0.1.1 The real Action Values q*(a)
We define the value of selecting an action as the expected reward we receive when taking that action.

**`q*(a) = E[ R | A = a ] = Sum[ p(r|a)*r ]`**, for each possible action 1 through k.
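As a minimal sketch of the definition above, the expected reward of each action can be computed directly when the reward distribution is known. The `arms` table and the `action_value` helper are hypothetical, made up for illustration only.

```python
def action_value(reward_distribution):
    """Return q*(a) = Sum[ p(r|a) * r ] for a single action.

    reward_distribution maps each possible reward r to its probability p(r|a).
    """
    return sum(p * r for r, p in reward_distribution.items())


# Hypothetical 3-armed bandit: one {reward: probability} table per action.
arms = [
    {0.0: 0.5, 1.0: 0.5},
    {0.0: 0.25, 2.0: 0.75},
    {1.0: 1.0},
]

# True action values, one per action.
q_star = [action_value(dist) for dist in arms]

# The best action is the argmax of q*.
best_action = max(range(len(q_star)), key=lambda a: q_star[a])
```

In a real bandit problem the agent does not have access to these distributions, which is why the values have to be estimated.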

<img src="resources/q*_and_argmax(q*).png" style="width: 400px; margin-left: 4em"/>

### 0.1.2 Estimated Action values Q(a)

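However, q*(a) is unknown in most problems, so we need to estimate it. The sample-average estimate divides the total reward received from action a before time t by the number of times a was taken before time t.

**`Q(a, t) = Sum(R when a was taken before t) / (number of times a was taken before t)`**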
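The sample-average estimate can be sketched as follows. The `history` list of `(action, reward)` pairs and the `default` value used for an action that has never been tried are assumptions for illustration, not part of the original notes.

```python
def sample_average(history, action, default=0.0):
    """Estimate Q(a) as the mean reward observed for `action` so far.

    history: list of (action, reward) pairs from earlier time steps.
    default: value returned when `action` has never been taken (an assumption).
    """
    rewards = [r for a, r in history if a == action]
    if not rewards:
        return default
    return sum(rewards) / len(rewards)


# Hypothetical interaction log: (action, reward) for each past step.
history = [(0, 1.0), (1, 0.0), (0, 3.0), (2, 2.0)]

q_action_0 = sample_average(history, 0)  # mean of 1.0 and 3.0
q_action_3 = sample_average(history, 3)  # never taken, falls back to default
```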

<img src="resources/Q(a)_sample_average.png" style="width: 400px; margin-left: 4em"/>

### 0.1.3 Estimating Action Values Incrementally

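Instead of storing every reward, we can update the estimate using only the previous estimate Q(n) and the newest reward R(n):

**`Q(n + 1) = Sum(R(1..n)) / n = Q(n) + [R(n) - Q(n)] / n`**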
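A short sketch of the incremental rule, checking that it reproduces the plain average without keeping the reward list around. The function name `incremental_update` and the sample rewards are illustrative assumptions.

```python
def incremental_update(q, n, reward):
    """Return Q(n + 1) given Q(n), the step count n, and the n-th reward R(n)."""
    return q + (reward - q) / n


# Hypothetical rewards received for one action, in order.
rewards = [1.0, 3.0, 2.0, 6.0]

q = 0.0  # Q(1): the initial estimate, overwritten by the first update (n = 1)
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, n, r)

# For comparison: the batch average over the same rewards.
batch_average = sum(rewards) / len(rewards)
```

The update has the general form `new estimate = old estimate + step size * (target - old estimate)`, with a step size of `1 / n` here.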

<img src="resources/Q(a)_incremental_update.png" style="width: 400px; margin-left: 4em"/>

<img src="resources/Q(a)_incremental_update_rule.png" style="width: 400px; margin-left: 4em"/>

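### 0.1.4 A constant step size helps with non-stationary problems

With a constant step size, the influence of the initial value of Q goes to zero as more data arrives. **The most recent rewards contribute most to our current estimate**, while older rewards are weighted down exponentially, so the estimate can track rewards whose distribution changes over time.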
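To make the decay concrete, here is a minimal sketch assuming the usual constant step-size update `Q(n + 1) = Q(n) + alpha * [R(n) - Q(n)]`. The symbol `alpha`, the function names, and the sample numbers are assumptions introduced for illustration.

```python
def constant_step_update(q, reward, alpha):
    """Move the estimate a fixed fraction alpha toward the newest reward."""
    return q + alpha * (reward - q)


def reward_weights(n, alpha):
    """Weight of each reward R(1)..R(n) in the estimate after n updates.

    Reward i gets weight alpha * (1 - alpha) ** (n - i), so older rewards
    count exponentially less than recent ones.
    """
    return [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]


alpha = 0.5
q_initial = 10.0
rewards = [1.0, 2.0, 3.0]

q = q_initial
for r in rewards:
    q = constant_step_update(q, r, alpha)

weights = reward_weights(len(rewards), alpha)
initial_weight = (1 - alpha) ** len(rewards)  # shrinks toward 0 as n grows

# The same estimate, written as an explicit weighted sum.
q_as_weighted_sum = initial_weight * q_initial + sum(
    w * r for w, r in zip(weights, rewards)
)
```

The weights increase from the oldest reward to the newest, and together with `initial_weight` they sum to 1.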

<img src="resources/Q(a)_decaying_past_rewards.png" style="width: 400px; margin-left: 4em"/>

## 0.2 Action Selection

### 0.2.1 The Greedy Action

The greedy action is the action that currently has **`the largest estimated value`**.

### 0.2.2 Exploitation

**`Selecting the greedy action`** means the agent is exploiting its current knowledge. It is trying to **`get the most reward`** it can **`right now`**. We can compute the greedy action **`by taking the argmax`** of our estimated values.
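Taking the argmax can be sketched in a few lines. The helper name `greedy_action` and the example estimates are hypothetical; ties are broken here by picking the lowest index, which is only one possible choice.

```python
def greedy_action(q_estimates):
    """Return the index of the action with the largest estimated value.

    Ties go to the lowest index (an arbitrary choice for this sketch).
    """
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])


# Hypothetical current estimates Q(a) for a 3-armed bandit.
q_estimates = [0.2, 1.5, 0.7]
action = greedy_action(q_estimates)
```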

### 0.2.3 Exploration

Alternatively, the agent may choose to explore by **`choosing a non-greedy action`**. The agent **`sacrifices immediate reward`** in the hope of gaining more information about the other actions for **`potential long-term rewards`**.
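One simple way to mix exploration with exploitation is to explore with a small probability and act greedily otherwise. The sketch below is a variant of the common epsilon-greedy scheme, which these notes do not name; the parameter `epsilon` and the function name are assumptions. Unlike the textbook version, which samples from all actions when exploring, this variant samples only from the non-greedy actions to match the description above.

```python
import random


def explore_or_exploit(q_estimates, epsilon, rng):
    """With probability epsilon pick a random non-greedy action, else the greedy one."""
    actions = range(len(q_estimates))
    greedy = max(actions, key=lambda a: q_estimates[a])
    if rng.random() < epsilon:
        non_greedy = [a for a in actions if a != greedy]
        if non_greedy:  # with a single action there is nothing to explore
            return rng.choice(non_greedy)
    return greedy


rng = random.Random(0)  # seeded only to make the sketch repeatable
q_estimates = [0.2, 1.5, 0.7]

always_exploit = [explore_or_exploit(q_estimates, 0.0, rng) for _ in range(20)]
always_explore = [explore_or_exploit(q_estimates, 1.0, rng) for _ in range(20)]
```

With `epsilon = 0.0` the agent always exploits, and with `epsilon = 1.0` it never selects the greedy action.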