# Week 8 - Reinforcement Learning

<img align="right" style="padding-right:10px;" src="figures_wk8/reinforcement_learning.png" width=400><br>

**FTE Overview:**
* Reinforcement Learning (RL)
   - RL Objective
   - RL Algorithms
   - RL Example
* Q-Learning
   - What's this 'Q'?
   - Q-Learning Algorithm
* Demo: Q-Learning
   - Results

## Reinforcement Learning (RL)
**Reinforcement Learning (RL)** is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

RL differs from supervised learning in that it does not need labelled input/output pairs, and it does not need sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
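One common way to balance exploration and exploitation is an epsilon-greedy rule. The sketch below is a minimal illustration, not part of the course demo: the function name `epsilon_greedy`, the `estimates` list, and the `epsilon` values are all assumptions chosen for the example.

```python
import random

def epsilon_greedy(value_estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Exploration: try any action, regardless of the current estimates.
        return rng.randrange(len(value_estimates))
    # Exploitation: pick the action with the highest current estimate.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

estimates = [0.1, 0.7, 0.3]  # hypothetical reward estimates for three actions
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)  # epsilon=0 never explores
```

With `epsilon=0.0` the rule always exploits, and with `epsilon=1.0` it always explores. In practice, a value in between is often used and reduced over time.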

### RL Objective
<img align="right" style="padding-right:10px;" src="figures_wk8/rl_components.png" width=600><br>

Reinforcement Learning is one of the most beautiful branches in Artificial Intelligence. The objective of RL is to maximize the reward of an agent by taking a series of actions in response to a dynamic environment.

Reinforcement Learning is the science of making optimal decisions using experiences. Breaking it down, the process of Reinforcement Learning involves these simple steps: <br>

1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experiences and refining our strategy
6. Iterating until an optimal strategy is found
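The steps above can be sketched as a loop between an agent and an environment. This is only an illustration under stated assumptions: `Corridor` is a made-up toy environment (not the one used in the demo), and the learning step is reduced to recording experiences so that any learning rule could consume them later.

```python
class Corridor:
    """Hypothetical toy environment: cells 0..4, start at cell 0, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 0.0 if done else -1.0  # every step that misses the goal costs 1
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    state = env.reset()                               # 1. observe the environment
    total_reward = 0.0
    experiences = []
    for _ in range(max_steps):
        action = policy(state)                        # 2. decide how to act
        next_state, reward, done = env.step(action)   # 3. act, 4. receive reward or penalty
        experiences.append((state, action, reward, next_state))  # 5. keep the experience
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward, experiences

# 6. In a full agent this would be repeated over many episodes while the policy improves.
total, experiences = run_episode(Corridor(), policy=lambda state: 1)
```

Here the fixed "always move right" policy reaches the goal in four steps. A learning agent would start with a worse policy and refine it from the recorded experiences.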


### RL Algorithms
<img align="right" style="padding-right:10px;" src="figures_wk8/rl_learning_system.png" width=450><br>

There are 2 main types of RL algorithms. They are **model-based** and **model-free**.

A model-free algorithm estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment. A model-based algorithm, by contrast, uses the transition function (and the reward function) to estimate the optimal policy.
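To make the distinction concrete, the sketch below solves the same tiny problem both ways. Everything here is an assumption for illustration: the three-state `MODEL`, the discount `GAMMA`, and the function names. The model-based planner reads the transition and reward table directly, while the model-free learner only sees sampled transitions, with `MODEL` standing in for the real environment it interacts with.

```python
import random

GAMMA = 0.9
# Hypothetical deterministic model: (state, action) -> (next_state, reward).
MODEL = {
    (0, 0): (0, 0.0), (0, 1): (1, 0.0),
    (1, 0): (0, 0.0), (1, 1): (2, 1.0),
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2  # state 2 ends the episode

def plan_with_model(sweeps=50):
    """Model-based: use the known transition and reward functions directly."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a), (s2, r) in MODEL.items():
            future = 0.0 if s2 == TERMINAL else max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] = r + GAMMA * future
    return q

def learn_from_samples(steps=2000, alpha=0.5, seed=0):
    """Model-free: learn only from observed (s, a, r, s') samples."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)        # purely random exploration
        s2, r = MODEL[(s, a)]          # the learner sees only this one sample
        future = 0.0 if s2 == TERMINAL else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (r + GAMMA * future - q[(s, a)])
        s = 0 if s2 == TERMINAL else s2  # restart after reaching the terminal state
    return q

planned = plan_with_model()
learned = learn_from_samples()
```

On this small deterministic problem both approaches arrive at essentially the same action values. The difference is in what each one is allowed to look at while getting there.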

### RL Example
Consider the scenario of teaching a dog new tricks. The dog doesn't understand our language, so we can't tell them what to do. Instead, we follow a different strategy. We emulate a situation (or a cue), and the dog tries to respond in many different ways. If the dog's response is the desired one, we reward them with snacks. Now guess what: the next time the dog is exposed to the same situation, the dog executes a similar action with even more enthusiasm in expectation of more food. That's like learning "what to do" from positive experiences. Similarly, dogs will tend to learn what not to do when faced with negative experiences.

That's exactly how Reinforcement Learning works in a broader sense:

* Your dog is an "agent" that is exposed to the **environment**. The environment could be your house, with you in it.
* The situations they encounter are analogous to a **state**. An example of a state could be your dog standing while you use a specific word in a certain tone in your living room.
* Our agents react by performing an **action** to transition from one "state" to another "state." Your dog goes from standing to sitting, for example.
* After the transition, they may receive a **reward or penalty** in return. You give them a treat! Or a "No" as a penalty.
* The **policy** is the strategy of choosing an action given a state in expectation of better outcomes.

## Q-Learning
**Q-Learning** is a model-free reinforcement learning algorithm.

Q-Learning is a value-based learning algorithm. Value-based algorithms update the value function based on an equation (particularly the Bellman equation). The other type, policy-based algorithms, estimates the value function with a greedy policy obtained from the last policy improvement.

Q-learning is an off-policy learner. This means it learns the value of the optimal policy independently of the agent’s actions. An on-policy learner, on the other hand, learns the value of the policy being carried out by the agent, including the exploration steps, and it finds a policy that is optimal taking into account the exploration inherent in that policy.
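The difference shows up in the value each kind of learner bootstraps from. The sketch below compares the Q-learning target with the target used by SARSA, a common on-policy algorithm that these notes do not otherwise cover. The discount `GAMMA`, the state and action names, and the numbers in `q` are all hypothetical.

```python
GAMMA = 0.9  # assumed discount factor

def q_learning_target(q, reward, next_state, actions):
    """Off-policy: bootstrap from the best next action, whatever the agent does next."""
    return reward + GAMMA * max(q[(next_state, a)] for a in actions)

def sarsa_target(q, reward, next_state, next_action):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + GAMMA * q[(next_state, next_action)]

q = {("s1", "left"): 2.0, ("s1", "right"): 5.0}  # hypothetical current estimates

# Suppose exploration makes the agent choose "left" in s1 on its next step.
off_policy = q_learning_target(q, reward=1.0, next_state="s1", actions=("left", "right"))
on_policy = sarsa_target(q, reward=1.0, next_state="s1", next_action="left")
```

The off-policy target ignores the exploratory "left" move and uses the better "right" estimate, so it comes out higher (about 5.5 versus about 2.8 with these numbers).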

### What's this 'Q'?
The ‘Q’ in Q-Learning stands for **quality**. Quality here represents how useful a given action is in gaining some future reward.

**Q-learning Definition** <br>

* Q*(s,a) is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy.
* Q-Learning uses Temporal Differences (TD) to estimate the value of Q*(s,a). In temporal difference learning, an agent learns from an environment through episodes, with no prior knowledge of the environment.
* The agent maintains a table of Q[S, A], where S is the set of states and A is the set of actions.
* Q[s, a] represents its current estimate of Q*(s,a).
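As a small illustration of these definitions, the sketch below computes a cumulative discounted reward and builds a Q table. The table size, the stored estimate, and the reward sequence are assumptions made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, discounting each one by gamma per time step."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Q table for a hypothetical problem with 3 states and 2 actions.
# Rows are states, columns are actions, and every estimate starts at 0.
n_states, n_actions = 3, 2
q_table = [[0.0] * n_actions for _ in range(n_states)]

q_table[1][0] = 0.5  # current estimate of Q*(s=1, a=0)
best_action_in_state_1 = max(range(n_actions), key=lambda a: q_table[1][a])

example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

Reading a row of the table and taking the largest entry gives the action the agent currently believes is best in that state.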

### Q-Learning Algorithm
The Q-Learning Algorithm uses the Bellman equation and takes two inputs: state (s) and action (a).

<img align="center" style="padding-right:10px;" src="figures_wk8/q_learning_formula.png" width=700><br>

A bit complex, yes! However, the demo below should clear things up.
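Assuming the figure shows the standard tabular form, the Q-Learning update is commonly written as

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward received, and $s'$ is the next state. A minimal sketch of one such update follows. The function name `q_update`, its default parameter values, and the small example table are assumptions for illustration, not the demo's code.

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.95, done=False):
    """Apply one tabular Q-learning update in place and return the new estimate."""
    # A terminal transition has no future value to bootstrap from.
    max_future_q = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * max_future_q
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]

q_table = [[0.0, 0.0], [1.0, 2.0]]  # hypothetical table: 2 states, 2 actions
new_value = q_update(q_table, state=0, action=1, reward=-1.0, next_state=1,
                     alpha=0.5, gamma=0.9)
```

In this example the target is -1 + 0.9 * 2 = 0.8, so with a learning rate of 0.5 the estimate for state 0, action 1 moves halfway from 0 toward 0.8.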

## Demo: Q-Learning
For this demo, we will watch the following three-part video series:
* https://www.youtube.com/watch?v=yMk_XtIEzH8&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7 <br>
* https://www.youtube.com/watch?v=Gq1Azv_B4-4 <br>
* https://www.youtube.com/watch?v=CBTbifYx6a8 <br>

These videos are accompanied by the following written tutorials:
* [Q-Learning introduction and Q Table - Reinforcement Learning w/ Python Tutorial p.1](https://pythonprogramming.net/q-learning-reinforcement-learning-python-tutorial/) <br>
* [Q-Learning introduction and Q Table - Reinforcement Learning w/ Python Tutorial p.2](https://pythonprogramming.net/q-learning-algorithm-reinforcement-learning-python-tutorial/?completed=/q-learning-reinforcement-learning-python-tutorial/) <br>
* [Q-Learning introduction and Q Table - Reinforcement Learning w/ Python Tutorial p.3](https://pythonprogramming.net/q-learning-analysis-reinforcement-learning-python-tutorial/?completed=/q-learning-algorithm-reinforcement-learning-python-tutorial/) <br>

<div class="alert alert-block alert-danger">
<b>Important:</b> The code listed on the webpages contains a number of errors that prevent the demo from performing correctly. The code shown in the videos is correct!
</div>

### Results
Since your assignment for the week is to reproduce the demo, I will not be publishing the code associated with this demo. However, here is the result that you are going for!

**Goal:** Use Q-Learning to move the car from the bottom of the hill to the finish flag.

<img align="center" style="padding-right:10px;" src="figures_wk8/finished_car.png" width=400><br>

References: <br>
https://www.kdnuggets.com/2019/10/mathworks-reinforcement-learning.html <br>
https://en.wikipedia.org/wiki/Reinforcement_learning <br>
https://towardsdatascience.com/a-beginners-guide-to-q-learning-c3e2a30a653c <br>
https://pythonprogramming.net/q-learning-reinforcement-learning-python-tutorial/ <br>
https://pythonprogramming.net/q-learning-algorithm-reinforcement-learning-python-tutorial/?completed=/q-learning-reinforcement-learning-python-tutorial/ <br>
https://pythonprogramming.net/q-learning-analysis-reinforcement-learning-python-tutorial/?completed=/q-learning-algorithm-reinforcement-learning-python-tutorial/ <br>
https://www.youtube.com/watch?v=yMk_XtIEzH8&list=PLQVvvaa0QuDezJFIOU5wDdfy4e9vdnx-7 <br>
https://www.youtube.com/watch?v=Gq1Azv_B4-4 <br>
https://www.youtube.com/watch?v=CBTbifYx6a8 <br>