# Intro to RL with Python



### Methods of Reinforcement Learning

__Dynamic Programming__ Computes optimal choices exactly for a perfectly known environment (the transition and reward functions are given).
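A minimal sketch of one dynamic programming method, value iteration, on a made-up two-state environment. The arrays `P` and `R`, the discount `gamma`, and the toy numbers are assumptions for illustration, not something from these notes.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a, s, s2] = probability of s -> s2 under action a; R[s, a] = expected reward."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s2 of P[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical environment: action 0 = stay, action 1 = switch state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
# Only "stay while in state 1" pays a reward.
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])

V, policy = value_iteration(P, R)
```

Because the environment is fully known, no trial and error is needed: the loop just applies the Bellman optimality backup until the values stop changing.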

__TD Learning__ Start with a guess of expected rewards and use trial and error to update expectations. 
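A small sketch of the simplest TD update, TD(0), assuming a value table stored as a dict. The step size `alpha`, discount `gamma`, and the state names are illustrative choices.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Move the guess V[s] a small step toward the observed target r + gamma * V[s_next]."""
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

# Start with a guess of zero everywhere, then learn from two observed steps.
V = {"A": 0.0, "B": 0.0}
td0_update(V, "B", r=1.0, s_next=None, done=True)  # B -> terminal with reward 1
td0_update(V, "A", r=0.0, s_next="B")              # A -> B with reward 0
```

After the two updates the reward observed at `B` has started to propagate back to `A`, which is the "update expectations by trial and error" idea.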


### Markov Decision Processes
__Markov__ implies the next state depends only on the current state (not the past). We can make a problem Markov by bundling all the relevant information from the past, along with the knowledge available to the agent, into a single state vector. This state might encompass all information about the environment, in which case the environment is __fully observable__.

__MDPs__ are characterized by states, actions, and rewards.

<img src='https://miro.medium.com/max/1362/1*7cuAqjQ97x1H_sBIeAVVZg.png' width=500>

### Types of MDPs
__Bandits__ Actions taken have no effect on the environment. These are very useful in business for many things. __No delayed rewards__ -> Since the environment isn't affected by the agent's actions, there is no "long-term planning" involved; the agent simply takes the best action in each state. The environment might change on its own, but such changes are never a consequence of the agent's decisions.
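A rough sketch of a bandit agent. It uses epsilon-greedy exploration, which is one common choice and not something these notes prescribe; the arm means, noise level, and `epsilon` are made up.

```python
import numpy as np

def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Estimate each arm's mean reward while mostly pulling the best-looking arm."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    estimates = np.zeros(k)
    counts = np.zeros(k, dtype=int)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))        # explore: random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: best estimate so far
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = run_bandit([0.0, 1.0, 3.0])
```

Note there is no state transition anywhere in the loop: each pull is judged only by its immediate reward.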

__MDP/POMDP__ The agent may have full observability (MDP) or only partial observability (POMDP) of its environment.

__Deterministic/Stochastic__ The state change that results from the previous state and action may be a function (same every time) or it may be random.

*In the case of stochasticity, can an agent's actions change the probabilities of the state transition function? This raises the question of whether an agent's actions can change the reward function as well as the state transition function.*

*This is incorrect. In an MDP both the reward function and the state transition function are fixed. The agent's action is an input to them: in the stochastic case the action selects which distribution over next states applies, and in the deterministic case it selects which next state follows.*
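To make the deterministic vs. stochastic distinction concrete, here is a sketch with a hypothetical 2-state, 2-action environment. The probabilities in `P` are invented; the point is only that the table itself never changes, and the action picks which row is used.

```python
import numpy as np

# P[a, s] is the (fixed) distribution over next states after action a in state s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.1, 0.9], [0.7, 0.3]],   # action 1
])

def step_stochastic(state, action, rng):
    """Sample the next state; the action selects the distribution, it does not alter it."""
    return int(rng.choice(2, p=P[action, state]))

def step_deterministic(state, action):
    """Same next state every time: action 0 = stay, action 1 = switch."""
    return state if action == 0 else 1 - state

rng = np.random.default_rng(0)
samples = [step_stochastic(0, 1, rng) for _ in range(5000)]
```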

### RL Algorithm Components

__Model__ (optional) Does the agent model the dynamics of the environment? A model is an internal representation of the environment that predicts what the environment will do next.

__Policy__ The rule that maps the agent's state to an action (oftentimes a greedy rule). This is the agent's behavior function. It can be deterministic (one action per state) or stochastic (a probability distribution over actions given the state).
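A sketch of the two kinds of policy, assuming the agent already has a score for each action in the current state. The softmax form and the `temperature` parameter are one illustrative way to build a stochastic policy, not the only one.

```python
import numpy as np

def greedy_policy(action_scores):
    """Deterministic: always pick the highest-scoring action."""
    return int(np.argmax(action_scores))

def softmax_policy(action_scores, temperature=1.0):
    """Stochastic: return a probability for each action."""
    z = np.asarray(action_scores, dtype=float) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

def sample_action(action_scores, rng, temperature=1.0):
    probs = softmax_policy(action_scores, temperature)
    return int(rng.choice(len(probs), p=probs))

scores = [0.1, 0.5, 0.2]
probs = softmax_policy(scores)
action = sample_action(scores, np.random.default_rng(0))
```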

__Value Function__ Given the current state, the expected cumulative future reward. The state-value function V(s) scores the state itself; the action-value function Q(s, a) scores each available action in that state.
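A quick sketch of what "cumulative future reward" means for one observed sequence of rewards. The discount factor `gamma` is an assumption here (the notes don't mention discounting); with `gamma=1.0` it is a plain sum. The value function is the expectation of this quantity over many such sequences.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```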

Most state-of-the-art deep RL models use neural networks to approximate the value function. *I'm not sure whether these architectures are model-free approaches... I also don't know what an actor-critic algorithm is.*



### Types of RL Agents

__Value-Based__ The value function determines utility of each action and the policy picks the best one.

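__Policy-Based__ The agent explicitly represents the policy and improves it directly.

__Actor-Critic__ The agent is both policy-based and value-based: the actor is the policy, and the critic is a value function used to evaluate the actor's choices.

__Model-Based__ The agent builds a model of how the environment will respond to its actions.

__Model-Free__ The agent goes directly to the policy/value function. Experience -> learn behavior.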


### Off-policy vs On-Policy
__Off-policy__ The learner learns the value of the optimal policy independently of the actions the agent actually takes. There are two policies: the policy used to generate behavior (the agent's policy) and the estimation policy, which is evaluated and improved during learning.

__On-policy__ The learner learns the value of the policy the agent is actually following, including its exploration steps.

*From the Sutton book: "The on-policy approach in the preceding section is actually a compromise—it learns action values not for the optimal policy, but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and that becomes the optimal policy, and one that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case we say that learning is from data “off” the target policy, and the overall process is termed off-policy learning."*

# RL Algorithms

### Imitation Learning

Don't break the problem into many subcomponents. In the context of self-driving cars: train a network to predict what the current driver will do, then learn from the difference. Use a neural network to compress the decision process of an extremely complicated (or unknown) system.

#### Issues
- Distribution Mismatch - There will be situations that the network has never trained on: edge cases that occur very infrequently but still require a decision.
- Error-prone human behavior - Imitation can only be as good as the imitated.
- Markov Assumption - Need to encode foresight into decisions, rather than acting solely on the present.


### Q-Learning
Named after the Q-function, which estimates the cumulative expected reward for taking an action in a certain state.

<img src='https://www.cse.unsw.edu.au/~cs9417ml/RL1/images/qalg.gif'>
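A sketch of the tabular Q-learning update for a single observed transition. It assumes the Q-function is stored as a NumPy array indexed by `[state, action]`; `alpha` and `gamma` are illustrative values.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Off-policy update: bootstrap from the best action available in s_next."""
    best_next = 0.0 if done else np.max(Q[s_next])
    Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q[1] = [1.0, 2.0]                                  # pretend state 1 was already explored
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)    # target = 1 + 0.9 * 2
```

The `max` over next actions is what makes this off-policy: the target ignores which action the agent will really take next.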

### SARSA
An on-policy RL algorithm for TD learning. In Q-Learning, the update rule uses the next action that maximizes the Q-function; in SARSA, the next action is the one generated by the current policy.

<img src='https://www.cse.unsw.edu.au/~cs9417ml/RL1/images/salg.gif'>
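A matching sketch of the SARSA update, under the same assumptions (array-backed Q table, illustrative `alpha` and `gamma`). The only change from Q-learning is that the target uses `a_next`, the action the current policy actually chose.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """On-policy update: bootstrap from the action the policy actually takes next."""
    next_value = 0.0 if done else Q[s_next, a_next]
    Q[s, a] += alpha * (r + gamma * next_value - Q[s, a])
    return Q

Q = np.zeros((2, 2))
Q[1] = [1.0, 2.0]
# The policy happened to pick the worse action (0) in state 1, so the
# target is 1 + 0.9 * 1 rather than the 1 + 0.9 * 2 Q-learning would use.
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0)
```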

### Deep Q Learning
Uses a neural network to approximate the action-value function, with one output per available action. The agent chooses the action with the maximal network output. During training, maintain a replay memory buffer that stores past experiences, and a second copy of the Q-network that acts as the "target network". The outputs of the two networks are used together to learn the environment's action-value function.

Note that the target network supplies the bootstrapped targets for the TD-style update; holding it fixed between periodic syncs keeps those targets from shifting at every step. Also note that sampling from the experience buffer mitigates the effects of correlated minibatches.
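A framework-free sketch of the two supporting pieces described above: the replay buffer and the targets computed from the target network's outputs. The networks themselves are left out; `next_q_from_target_net` stands in for whatever the target network returns on the sampled next states. Class and function names are made up for illustration.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop off first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks up correlation between consecutive steps.
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards, dtype=float),
                np.array(next_states), np.array(dones, dtype=float))


def td_targets(rewards, next_q_from_target_net, dones, gamma=0.99):
    """r + gamma * max over a' of Q_target(s', a'), with no bootstrap on terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_from_target_net.max(axis=1)
```

In a full training loop, the online network would be regressed toward these targets on each sampled minibatch, and the target network's weights would be refreshed from the online network every so often.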


<img src='https://miro.medium.com/max/2261/1*nb61CxDTTAWR1EJnbCl1cA.png'>


### Deep Deterministic Policy Gradient
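Relies on an actor-critic architecture. The actor tunes the parameters $\theta$ of a deterministic policy function, and the critic learns an action-value function that is used to evaluate and improve the actor's choices.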


### Federated Learning
Learn a task from daily activity and delegate it to the edge. Learned behavior is modified at the edge, and new behaviors are shared with the center and other edges.

Consider recommendations for individual users on a mobile app. The cellular device may have its own ML model learning the best recommendation strategy, and new information can be backed up to the center. This compartmentalization at the edge also allows secure access to sensitive user data.

Build a model at the center that uses enterprise data (thus enterprise data is kept close at hand) that can be applied at the edge level with transfer learning.
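A sketch of how the center might combine what the edges have learned. It uses weighted averaging of model weights (the federated averaging idea), which is an assumption on my part; the notes don't say which aggregation rule is used. The weight vectors and sample counts are made up.

```python
import numpy as np

def federated_average(edge_weights, edge_sample_counts):
    """Average edge model weights, giving more say to edges that saw more data."""
    stacked = np.stack([np.asarray(w, dtype=float) for w in edge_weights])
    return np.average(stacked, axis=0, weights=edge_sample_counts)

# Two hypothetical devices; only weights leave the device, never raw user data.
edge_weights = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
edge_sample_counts = [1, 3]
global_weights = federated_average(edge_weights, edge_sample_counts)
```

The averaged weights would then be sent back out to the edges as the new starting point for local learning.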

# RL in Retail
[link](https://towardsdatascience.com/deep-learning-vs-deep-reinforcement-learning-algorithms-in-retail-industry-ii-9c17c83ecf2f)

RL is used to optimize assortment, stock levels, and regional prices. Other applications include supply chain maintenance to maximize efficiency, optimizing space utilization in warehouses to reduce transit times, dynamic pricing, and split delivery routing systems.

#### Dynamic Pricing with RL
The agent periodically updates prices based on the environment state. The model is pre-trained on historical sales data and previous human decisions. This technique is also used for training bidding agents in multiseller environments.

<img src='https://miro.medium.com/max/2046/1*sjm43hZu1chZp5LJiBZhJA.png' width=600>

The above image represents a single-seller system. The seller maintains an inventory that is replenished at a certain threshold (the arrival time of a replenishment follows an exponential distribution based on past replenishments). Customers are either impartial buyers or shoppers; impartial buyers don't care about availability or volume discounts.

Once a buyer/shopper is offered a price they like, they move into the queue to await their item from inventory. If they do not receive the item, they may balk after a waiting time that follows an exponential distribution.
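A toy simulation of the balking behavior, under strong simplifying assumptions that are mine and not the article's: every queued customer waits on an independent replenishment, and both the replenishment lead time and the customer's patience are exponential with made-up means.

```python
import numpy as np

def balk_fraction(mean_lead_time, mean_patience, n_customers=10000, seed=0):
    """Fraction of queued customers who give up before their item arrives."""
    rng = np.random.default_rng(seed)
    lead_times = rng.exponential(mean_lead_time, n_customers)  # time until restock
    patience = rng.exponential(mean_patience, n_customers)     # time until balking
    return float(np.mean(patience < lead_times))

frac = balk_fraction(mean_lead_time=1.0, mean_patience=3.0)
```

For two independent exponentials the balking probability is `mean_lead_time / (mean_lead_time + mean_patience)`, so the simulated fraction here should land near 0.25. A pricing agent could use a quantity like this as part of its environment state or reward.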

For multiseller systems, each captive has an associated utility function that they will attempt to maximize.