# Reinforcement Learning (RL)

The machine learns by interacting with its environment: it tries actions, observes the results, and adjusts its behaviour. The focus is on performance, which requires balancing exploration of uncharted territory against exploitation of current knowledge.
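One common way to strike this balance is an epsilon-greedy rule, sketched below. This is an illustration rather than a method these notes prescribe: the function name `epsilon_greedy`, the `estimates` dictionary, and the `epsilon` parameter are all assumptions made for the example.

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """Pick an action from `estimates` (hypothetical dict: action -> estimated reward).

    With probability `epsilon` the agent explores (random action);
    otherwise it exploits (action with the highest current estimate).
    """
    if rng.random() < epsilon:
        return rng.choice(list(estimates))       # explore
    return max(estimates, key=estimates.get)     # exploit

# epsilon = 0 always exploits; epsilon = 1 always explores.
choice = epsilon_greedy({"left": 1.0, "right": 2.0}, epsilon=0.0)
```

A small `epsilon` (for example 0.1) keeps the agent mostly exploiting while still occasionally trying other actions.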

![Screen%20Shot%202018-08-06%20at%2011.23.51%20PM.png](attachment:Screen%20Shot%202018-08-06%20at%2011.23.51%20PM.png)

## Applications
- Gaming: AlphaGo, TD-Gammon, Dota, etc.
- Robotics: learning to walk
- Self-driving cars
- Trading

## Elements
- Agent: the learner and decision maker
- Environment
- Policy
- Reward signal
- Value function
- Optionally, a model of the environment

![Screen%20Shot%202018-08-07%20at%207.32.43%20PM.png](attachment:Screen%20Shot%202018-08-07%20at%207.32.43%20PM.png)

#### Environment
- Functions that transform an action taken in the current state into the next state and a reward
- We know the agent's function, but in general we do not know the environment's function. It is a black box where we only see the inputs and outputs.

#### Agents
- Functions that transform the new state and reward into the next action
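The two functions described above can be wired into a loop. The sketch below is a toy illustration, not a real RL library: the corridor environment, the random agent, and the names `environment_step`, `agent_act`, and `run_episode` are all hypothetical.

```python
import random

def environment_step(state, action):
    """Hypothetical environment: a corridor with positions 0..4.
    Maps (current state, action) to (next state, reward, done)."""
    next_state = max(0, min(4, state + action))
    done = next_state == 4                  # position 4 is the goal
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def agent_act(state, reward):
    """Hypothetical agent: maps (state, reward) to the next action.
    This one ignores its inputs and moves left (-1) or right (+1) at random."""
    return random.choice([-1, 1])

def run_episode(max_steps=1000, seed=0):
    random.seed(seed)
    state, reward, total_reward = 0, 0.0, 0.0
    for _ in range(max_steps):
        action = agent_act(state, reward)                       # agent's turn
        state, reward, done = environment_step(state, action)   # environment's turn
        total_reward += reward
        if done:
            break
    return state, total_reward
```

A learning agent would replace `agent_act` with something that uses the state and reward it receives to improve its choices over time.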

#### Policy (π)
- Similar to a factory that takes any environmental state as input and outputs the action
- Core of RL
- Defines the learning agent's way of behaving at a given time
- May be a simple function or lookup table, or may involve extensive computation such as a search process
- The strategy that the agent employs to determine the next action based on the current state
- It maps states to actions; a good policy picks the actions that promise the highest long-term reward.

#### Reward signal (R)
- Defines the goal of the RL problem
- On each time step, the environment sends a single number called reward
- Primary basis for altering the policy: if the reward is low, the policy may need to be changed

#### Value function (V)
- Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow and the rewards available in those states.
- Rewards are basically given directly by the environment, but values must be estimated and re-estimated from the sequences of observations an agent makes over its entire lifetime.
- Vπ(s) is defined as the expected long-term return of the current state under policy π.
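Written out, and assuming the usual discounted-return notation (the time index $t$, the reward symbols $R_{t+1}, R_{t+2}, \dots$ and the discount factor $\gamma$ from the Discount Factor section are notation added here for illustration), the definition reads:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s \right]$$

The first reward is undiscounted and each later reward picks up one more factor of $\gamma$; the expectation is taken over the trajectories that policy $\pi$ generates from state $s$.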

#### Model of the environment
- Mimics the behaviour of the environment
- Used for planning

#### Discount Factor
- The discount factor is multiplied with future rewards as discovered by the agent in order to dampen their effect on the agent’s choice of action.
- Makes future rewards worth less than immediate rewards
- Often expressed with the lower-case Greek letter gamma: γ. If γ is 0.8 and there is a reward of 10 points after 3 time steps, the present value of that reward is 0.8³ × 10 = 5.12. A discount factor of 1 would make future rewards worth just as much as immediate rewards.
- Always between 0 and 1 (inclusive)
- Set by us, not by the agent
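The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `discounted_return` and the convention that `rewards[k]` is the reward arriving `k` steps from now are assumptions chosen to match the example in the notes.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[k] arrives k steps from now."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example from the notes: a reward of 10 arriving after 3 time steps.
rewards = [0, 0, 0, 10]
present_value = discounted_return(rewards, gamma=0.8)   # 0.8**3 * 10, about 5.12
undiscounted = discounted_return(rewards, gamma=1.0)    # gamma = 1 leaves the reward at 10
```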

![Screen%20Shot%202018-08-12%20at%2010.29.21%20PM.png](attachment:Screen%20Shot%202018-08-12%20at%2010.29.21%20PM.png)

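#### Action (A)
- What the agent chooses to do at a time step; the environment responds with the next state and a reward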

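#### State (S)
- The situation of the environment that the agent observes at a time step and bases its next action on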

#### Q-Value
- The Q-value is similar to the value, except that it takes an extra parameter, the current action a.
- Qπ(s, a) is the expected long-term return from taking action a in the current state s and following policy π thereafter.
- Q maps state-action pairs to expected returns, whereas a policy maps states to actions.
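To make the difference between Q and a policy concrete, the sketch below stores Q as a lookup table and derives a policy from it by picking the highest-valued action in each state. The states, actions, and numbers are made up for illustration, and `greedy_action` is a hypothetical helper.

```python
# Q: (state, action) -> estimated long-term return (made-up numbers)
q_table = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.7,
    ("s1", "left"): 0.5,
    ("s1", "right"): 0.1,
}

def greedy_action(q, state, actions):
    """Return the action with the highest Q-value in `state`."""
    return max(actions, key=lambda a: q[(state, a)])

# Policy: state -> action, derived greedily from Q
policy = {s: greedy_action(q_table, s, ["left", "right"]) for s in ["s0", "s1"]}
```

Here `q_table` answers "how good is this action in this state?", while `policy` answers "which action should be taken in this state?".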

#### Trajectory
- A sequence of states and actions that influence those states

#### One-step dynamic
- The one-step dynamics of the environment determine how the environment decides the state and reward at every time step.
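In symbols, and assuming the common notation where $S_t$, $A_t$ and $R_t$ denote the state, action and reward at time step $t$ (notation added here, not used elsewhere in these notes), the one-step dynamics are the conditional probabilities:

$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$

That is, the probability of landing in state $s'$ with reward $r$ depends only on the current state $s$ and the action $a$ just taken.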

### Domain Selection
- Deciding which types of input and feedback your agent should pay attention to; this is a hard problem to solve.

## Limitations and Scope
- Different from evolutionary methods, which work well when the space of policies is small or a lot of time is available for search
- RL methods learn while interacting with the environment; evolutionary methods do not

## Tasks
### Episodic
- Interaction ends at some time step, for example when the outcome is a win or a loss
- E.g., the game ends, or the self-driving car crashes

### Continuing
- Interaction continues without a terminal time step
- E.g., a stock-trading AI

## Reward Hypothesis
*That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).*

# Markov Decision Process
- An MDP formalizes sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states).
- MDPs involve delayed reward and the need to trade off immediate and delayed reward.

![Screen%20Shot%202018-08-12%20at%2010.30.31%20PM.png](attachment:Screen%20Shot%202018-08-12%20at%2010.30.31%20PM.png)

#### Markov Property
A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process.
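For a discrete-time process with state $S_t$ at step $t$ (notation assumed here for illustration), the property can be written as:

$$\Pr(S_{t+1} = s' \mid S_t = s_t) = \Pr(S_{t+1} = s' \mid S_0 = s_0, S_1 = s_1, \dots, S_t = s_t)$$

Conditioning on the full history gives the same distribution over the next state as conditioning on the present state alone.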

## Policies
### Deterministic Policy
- Each state is mapped to a single corresponding action

![Screen%20Shot%202018-08-14%20at%209.50.09%20PM.png](attachment:Screen%20Shot%202018-08-14%20at%209.50.09%20PM.png)

### Stochastic Policy
- For each state, the action is drawn at random from a probability distribution over actions

![Screen%20Shot%202018-08-14%20at%209.50.50%20PM.png](attachment:Screen%20Shot%202018-08-14%20at%209.50.50%20PM.png)

![Screen%20Shot%202018-08-14%20at%209.51.21%20PM.png](attachment:Screen%20Shot%202018-08-14%20at%209.51.21%20PM.png)
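The two kinds of policy can be contrasted in a few lines of Python. This is a minimal sketch: the state names, action names, and probabilities are invented for the example, and `sample_action` is a hypothetical helper.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"low": "recharge", "high": "search"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low": {"recharge": 0.5, "wait": 0.4, "search": 0.1},
    "high": {"search": 0.9, "wait": 0.1},
}

def sample_action(policy, state, rng=random):
    """Draw one action for `state` according to the policy's probabilities."""
    actions = list(policy[state].keys())
    probs = list(policy[state].values())
    return rng.choices(actions, weights=probs, k=1)[0]

action_a = deterministic_policy["low"]               # always "recharge"
action_b = sample_action(stochastic_policy, "low")   # varies from call to call
```

A deterministic policy is the special case of a stochastic one in which a single action has probability 1 in every state.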