# **Reinforcement Learning**

-When we think about the nature of learning, we think about interaction

-Reinforcement learning focuses on a computational approach to goal-directed learning from interaction

## **1.1 Reinforcement Learning**

-The learner is not told which actions to take; it must discover them through trial and error, and the reward for an action may be delayed
  - these two features, trial-and-error search and delayed reward, distinguish RL from other ML
  
-Formally, RL uses ideas from dynamical systems theory, specifically, "incompletely-known Markov decision processes"
  - a learning agent must be able to sense the state of its environment, take actions that affect it, and have goals relating to it

-RL contrasts with supervised learning in that SL seeks to generalize from a training set of labeled examples
  - RL, meanwhile, learns from its own *interaction* with the environment, often in uncharted territory where examples of correct behavior are unavailable

-Unsupervised learning is also distinct from RL: it tries to find structure in unlabeled data and has nothing to do with a reward signal

-Therefore, we consider RL to be a third paradigm of ML

-One of the challenges of RL is the trade-off between exploration and exploitation

  - to maximize reward, the agent can exploit behaviors it has already found to be rewarding
  - but to find those behaviors in the first place, it has to explore and possibly lose reward
  - The catch is that the task cannot be accomplished by either one exclusively
  - The exploration-exploitation dilemma remains unsolved
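
As a minimal sketch of how an agent might mix the two, the snippet below uses epsilon-greedy selection, one common heuristic rather than a solution to the dilemma. The names `epsilon_greedy`, `value_estimates`, and `epsilon` are illustrative assumptions and are not defined anywhere in these notes.

```python
import random


def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Return an action index: explore with probability epsilon, else exploit.

    value_estimates: list of the agent's current reward estimates, one per action
                     (hypothetical representation chosen for this sketch).
    epsilon:         probability of exploring, between 0 and 1.
    rng:             the random module or a random.Random instance.
    """
    if rng.random() < epsilon:
        # Explore: pick uniformly among all actions, possibly giving up reward.
        return rng.randrange(len(value_estimates))
    # Exploit: pick the action currently believed to be most rewarding.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])


# Usage: mostly greedy, with an exploratory action about 10% of the time.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
```

With `epsilon` set to 0 the agent only exploits and with 1 it only explores; neither extreme accomplishes the task on its own, which is the point made above.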

-RL is part of a swing of the pendulum in AI research back towards the discovery of simple general principles, as opposed to rule-based approaches

## **1.3 Elements of Reinforcement Learning**

-The 4 main elements of an RL system are: a *policy*, a *reward signal*, a *value function* and optionally, a *model* of the environment

  - a **policy** defines the agent's way of behaving. 
    - It's a mapping from perceived states of the environment to actions to be taken in those states (psychology calls these *associations* and regards them as the basis for learning)
    - may be as simple as a lookup table or as complicated as a search process

  - a **reward signal** defines the goal of the RL system. 
    - The reward is what the system is trying to maximize, and it is the primary basis for altering the policy
    - it indicates what is good in the immediate sense

  - a **value function** specifies what is good in the long-term
    - It specifies the total amount of reward the agent can expect to accumulate in the long term, starting from a given state

  - But rewards are primary, and values are predictions of rewards. **Without rewards there could be no values.**
    - However, when making and evaluating decisions, we use values, since acting to reach high-value states yields the most reward over time
    - Values are difficult to determine and must be estimated, while rewards are given directly by the environment
    - Value estimation is arguably the most important advance in RL research over the last 60 years

  - a **model** of the environment allows inferences to be made about how the environment will behave
    - e.g. given a state and an action, the model might predict next state and reward
    - models are used for planning and are not in all RL systems
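
To illustrate the difference between reward and value described above, here is a small sketch that totals the reward collected from each time step onward in a single episode. It assumes an undiscounted sum over one finite episode, and the function name `returns_from_each_step` and the example reward sequence are hypothetical.

```python
def returns_from_each_step(rewards):
    """Total reward collected from each time step to the end of the episode.

    rewards: list of immediate rewards, one per time step (undiscounted,
             finite episode; both are assumptions made for this sketch).
    """
    totals = []
    running = 0.0
    # Walk backwards so each total can reuse the total of the following step.
    for r in reversed(rewards):
        running += r
        totals.append(running)
    totals.reverse()
    return totals


# Immediate reward is zero for the first two steps, yet the long-term
# total from those steps is as high as from the final, rewarded step.
print(returns_from_each_step([0.0, 0.0, 1.0]))  # [1.0, 1.0, 1.0]
```

A value function is an estimate of totals like these, which is why a state with low immediate reward can still have a high value.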

## **1.4 Limitations and Scope**

-RL relies on the concept of "state", as input to the policy and value function
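  - essentially, the state conveys to the agent some sense of how the environment is at a particular time
  - We will focus mainly on estimating value functions (though this is not strictly necessary for solving RL problems; e.g. evolutionary methods search the space of policies directly)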

## **1.5 An Extended Example: Tic-Tac-Toe**

-Assuming we play against an imperfect player, how can we construct a player that learns the opponent's imperfections and maximizes its chance of winning?

  - classical optimization techniques (e.g. dynamic programming) require a complete specification of the opponent, including the probabilities with which the opponent makes each move

  - one option is to learn a model of the opponent's behavior from experience to estimate these probabilities, and then apply dynamic programming (many RL methods do this)

  - or, we can use a value function
    - We would set up a table of numbers, one per state of the tic-tac-toe board, each an estimate of the probability of winning from that state
    - this estimate is the state's value
    - set the initial values to 1 for states we have already won, 0 for states that are lost or drawn, and 0.5 for all other states
    - then play many games against the opponent, mostly selecting moves greedily (moving to the state with the greatest value) and occasionally selecting another move at random to explore
    - after each greedy move, we "back up" the value of the later state to the earlier one, so that the earlier state's value is nudged a fraction of the way (the step size α) toward the later state's value

$$ V(S_t) \leftarrow V(S_t) + \alpha \left[ V(S_{t+1}) - V(S_t) \right] $$

<p align="center">
  <img src="../data/tictactoe.png" />
</p>
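
The update rule above can be sketched in a few lines. This is a minimal illustration, not a full tic-tac-toe player: the helper names `make_value_table` and `back_up`, the string state labels, and the step size of 0.1 are all assumptions made for the example.

```python
from collections import defaultdict


def make_value_table(initial=0.5):
    """Value table in which any state not yet seen starts at `initial`."""
    return defaultdict(lambda: initial)


def back_up(values, state, next_state, alpha=0.1):
    """Apply V(S_t) <- V(S_t) + alpha * [V(S_{t+1}) - V(S_t)] in place."""
    values[state] += alpha * (values[next_state] - values[state])
    return values[state]


values = make_value_table()
values["won"] = 1.0  # hypothetical label for a board we have already won

# The state just before the win moves from 0.5 toward 1.0 by a fraction alpha:
# 0.5 + 0.1 * (1.0 - 0.5), i.e. approximately 0.55.
updated = back_up(values, "one_move_from_win", "won", alpha=0.1)
```

A real player would key the table by board position rather than by labels like these, and would apply `back_up` after each move during play.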

-Tic-tac-toe is just one example; RL applies much more broadly
  - with RL problems, we don't necessarily need an adversary
  - we don't necessarily need discrete time
  - we don't necessarily need a finite state space like in tic-tac-toe (with very large or infinite state spaces, we might rely on the generalization ability of SL methods)
  - it can work even when certain states are hidden
  - with tic-tac-toe we were able to look ahead and know the states that would result from our moves (this is a model of the game). Even this is not necessary for RL.
  - with respect to the opponent, however, our tic-tac-toe player actually has no model

## **1.6 Summary**

-RL is a computational approach to understanding and automating goal-directed learning and decision making
  - it distinguishes itself from other computational approaches by learning from direct interaction with the environment, without requiring direct supervision or a model of the environment
  - it's fundamentally based on Markov decision processes
  - the concepts of value and value function are key to most RL methods