# Introduction to Reinforcement Learning
Reinforcement Learning can be thought of as a third paradigm of statistical learning, distinct from supervised learning (e.g., regression models) and unsupervised learning (e.g., clustering).

One of the main characteristics of Reinforcement Learning is that it involves a goal-directed agent interacting with an environment. The agent can observe the state of the environment and perform actions that change that state. The agent also receives rewards from the environment, which are used to evaluate its actions. The goal of the agent is to maximize the total reward it receives from the environment. Because the environment is uncertain, the agent must balance exploitation (leveraging previously successful experience) against exploration (trying out new strategies) in its decisions.

Whenever the agent takes an action, the environment is modified as a result. Therefore, the correct choice of action must take into account the indirect, delayed consequences of actions, not only their immediate reward. Problems of this kind are commonly formalized as a Markov Decision Process (MDP).

## Getting Started: the K-armed Bandit Problem
The K-armed bandit problem is a simple example of a Reinforcement Learning problem. The agent is repeatedly faced with a choice among K different actions. After each choice it receives a reward drawn from a fixed probability distribution that depends on the action it chose. The goal of the agent is to maximize the total reward it receives over time. The action chosen at time $t$ is denoted as $A_t$, and the reward received is denoted as $R_t$. The expected value of an action $a$ is denoted as $q_*(a)$:

$$q_*(a) = \mathrm{E}[R_t \mid A_t = a].$$

The value of an action is the expected reward given that the action is selected. This true value is not known to the agent and must be estimated from the rewards received. The estimated value of an action $a$ at time $t$ is denoted as $Q_t(a)$. The agent must balance exploration (trying out different actions to estimate their value) and exploitation (choosing the action with the highest estimated value). If the agent always chooses the action with the highest estimated value, we say that the agent is greedy. If the agent chooses a non-greedy action, we say that the agent is exploring.
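
As a minimal sketch of the setup above, the class below simulates a K-armed bandit. The Gaussian reward model (true values drawn from $N(0, 1)$, rewards drawn from $N(q_*(a), 1)$) is an assumption made for illustration, and the names `GaussianBandit` and `pull` are hypothetical, not taken from the script used later in this notebook.

```python
import random


class GaussianBandit:
    """K-armed bandit with stationary Gaussian rewards (assumed reward model)."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # True action values q_*(a); the agent never sees these directly.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Return a reward R_t drawn from N(q_*(action), 1)."""
        return self.rng.gauss(self.q_star[action], 1.0)


bandit = GaussianBandit(k=10, seed=0)
best = max(range(bandit.k), key=lambda a: bandit.q_star[a])

# Averaging many rewards from one arm approximates its true value q_*(a).
rewards = [bandit.pull(best) for _ in range(5000)]
empirical_mean = sum(rewards) / len(rewards)
print(f"q_*(best) = {bandit.q_star[best]:.3f}, empirical mean = {empirical_mean:.3f}")
```

The agent only ever sees the output of `pull`, which is why it has to estimate $q_*(a)$ from samples.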

## Action-value Methods
Action-value methods estimate the value of each action and use these estimates to select actions. The simplest action-value method is the sample-average method. The estimated value of an action $a$ at time $t$ is the average of the rewards received on the steps before $t$ at which $a$ was selected:

$$Q_t(a)=\frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$$

The greedy action is the action with the highest estimated value:
$$A_t=\operatorname*{arg\,max}_a Q_t(a)$$
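
The sample-average estimate and the greedy choice can be sketched directly from a log of past actions and rewards. The function names are hypothetical. The `default` value used for actions that have never been selected (where the ratio is undefined) and the tie-breaking by lowest index are assumptions, since the text does not specify either.

```python
def sample_average_estimates(actions, rewards, k, default=0.0):
    """Return Q_t(a) for every action a, given the history before step t."""
    totals = [0.0] * k
    counts = [0] * k
    for a, r in zip(actions, rewards):
        totals[a] += r
        counts[a] += 1
    # An action that was never selected has no sample average; use the default.
    return [totals[a] / counts[a] if counts[a] > 0 else default for a in range(k)]


def greedy_action(q):
    """Index of the largest estimate (ties go to the lowest index)."""
    return max(range(len(q)), key=lambda a: q[a])


actions = [0, 1, 1, 2, 0]
rewards = [1.0, 0.0, 2.0, 0.5, 3.0]
q = sample_average_estimates(actions, rewards, k=4)
print(q)                 # [2.0, 1.0, 0.5, 0.0]
print(greedy_action(q))  # 0
```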

A simple alternative to the greedy action is the $\epsilon$-greedy action. With probability $\epsilon$, the agent chooses an action uniformly at random. With probability $1-\epsilon$, the agent chooses the greedy action. The $\epsilon$-greedy action is a simple way to balance exploration and exploitation.
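
A minimal sketch of $\epsilon$-greedy selection is shown below. The function name `epsilon_greedy` is hypothetical, and breaking ties among greedy actions at random is an assumption.

```python
import random


def epsilon_greedy(q, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    best = max(q)
    # Assumption: ties among greedy actions are broken uniformly at random.
    return rng.choice([a for a, v in enumerate(q) if v == best])


rng = random.Random(0)
q = [0.2, 1.5, -0.3]  # illustrative estimates; action 1 is greedy
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(10000)]
greedy_fraction = picks.count(1) / len(picks)
print(f"fraction of greedy picks: {greedy_fraction:.3f}")
```

Because the random branch can also land on the greedy action, the greedy action is selected with probability $1-\epsilon+\epsilon/K$, which is about 0.933 for $\epsilon=0.1$ and $K=3$.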

In the code below we run a simulation of a 10-armed bandit that learns over 1000 decisions. This experiment is repeated 2000 times for multiple values of $\epsilon$.

%run ../Code/Sutton_Barto/chapter02/ten_armed_testbed.py

To implement action-value methods efficiently, we can use an incremental update rule, which computes the average reward without storing past rewards:
$$Q_{n+1}=Q_n + \frac{1}{n}\left[R_n - Q_n \right],$$

where $R_n$ is the $n$-th reward received for the action and $Q_n$ is the estimate of its value after the first $n-1$ rewards.
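
The rule follows from splitting the last reward out of the sample average, with $Q_{n+1}$ the average of the first $n$ rewards:

$$Q_{n+1}=\frac{1}{n}\sum_{i=1}^{n} R_i=\frac{1}{n}\left[R_n+(n-1)Q_n\right]=Q_n+\frac{1}{n}\left[R_n-Q_n\right].$$

The sketch below checks numerically that the incremental rule matches the batch average. The function name and the Gaussian test rewards are illustrative assumptions.

```python
import random


def incremental_mean(rewards):
    """Apply Q_{n+1} = Q_n + (1/n)(R_n - Q_n) over a sequence of rewards."""
    q = 0.0  # Q_1 is arbitrary: the first update uses step size 1/1 and overwrites it
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n
    return q


rng = random.Random(1)
rewards = [rng.gauss(0.5, 1.0) for _ in range(1000)]  # assumed test data

batch = sum(rewards) / len(rewards)
incremental = incremental_mean(rewards)
print(abs(batch - incremental) < 1e-9)  # True
```

The incremental form needs only the current estimate and a counter per action, so memory and per-step computation stay constant.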

### Nonstationary Reward Problems
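In the previous example, the reward distribution was stationary: the expected reward of each action did not change over time. In the nonstationary case, the expected reward of each action changes over time, so it makes sense to give more weight to recent rewards than to long-past ones. One way to do this is to replace the step size $1/n$ with a constant step size $\alpha \in (0, 1]$: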

\begin{align}
Q_{n+1} &= Q_n + \alpha \left[R_n - Q_n \right] \\
&= \alpha R_n + (1-\alpha) Q_n \\
&= \alpha R_n + (1-\alpha) \left[ \alpha R_{n-1} + (1-\alpha) Q_{n-1} \right] \\
&= \alpha R_n + (1-\alpha) \alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
&= \dots\\
&= (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha (1-\alpha)^{n-i} R_i
\end{align}
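
The last line of the derivation says that $Q_{n+1}$ is a weighted average of the initial estimate $Q_1$ and all past rewards, with the weight on $R_i$ decaying exponentially in its age $n-i$. The sketch below checks that the recursive update and the closed form agree and that the weights sum to 1. The function names and numbers are illustrative assumptions.

```python
def constant_step_update(q1, rewards, alpha):
    """Apply Q_{n+1} = Q_n + alpha * (R_n - Q_n) starting from Q_1 = q1."""
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q


def closed_form(q1, rewards, alpha):
    """Evaluate (1-alpha)^n Q_1 + sum_i alpha (1-alpha)^(n-i) R_i."""
    n = len(rewards)
    total = (1 - alpha) ** n * q1
    for i, r in enumerate(rewards, start=1):
        total += alpha * (1 - alpha) ** (n - i) * r
    return total


rewards = [1.0, 0.0, 2.0, 4.0, 3.0]  # illustrative rewards R_1..R_5
alpha, q1 = 0.1, 5.0

recursive = constant_step_update(q1, rewards, alpha)
direct = closed_form(q1, rewards, alpha)

n = len(rewards)
# Weight on Q_1 followed by the weights on R_1..R_n.
weights = [(1 - alpha) ** n] + [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]

print(abs(recursive - direct) < 1e-12)  # True
print(abs(sum(weights) - 1.0) < 1e-12)  # True
```

Unlike the sample average, the weight on the initial estimate $Q_1$ never vanishes completely for $\alpha < 1$; it shrinks as $(1-\alpha)^n$.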