# Capstone Project - Reinforcement Learning
### Goal
- Learn Object Oriented Programming techniques
- Build your own Machine Learning model from scratch - Reinforcement Learning
- Teach it to pick up and deliver items

### What is Reinforcement Learning?
- Reinforcement learning lets the machine learn for itself which actions to take, based on the rewards its past actions earned.
![Reinforcement Learning](img/ReinforcementLearning.png)
- Basically, a Reinforcement Learning algorithm tries to predict which actions give rewards and which lead to punishment, and then prefers the rewarding ones.
- It is like training a dog. You and the dog do not speak the same language, but the dog learns how to act based on rewards (and punishment, which I do not advise or advocate).
- Hence, if a dog is rewarded for a certain action in a given situation, then the next time it is exposed to a similar situation it is likely to act the same way.
- Translate that to Reinforcement Learning.
    - The **agent** is the dog that is exposed to the **environment**.
    - Then the **agent** encounters a **state**.
    - The **agent** performs an **action** to transition to a **new state**.
    - Then after the transition the **agent** receives a **reward** or **penalty (punishment)**.
    - Over time this forms a **policy**: a strategy for choosing an **action** in a given **state**.
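
To make the loop above concrete, here is a minimal sketch of an agent interacting with an environment. The `Corridor` class, the `run_episode` helper and the reward values are hypothetical and only for illustration; they are not the field used later in this project.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.state = 0  # the agent starts at the left end

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (new_state, reward, done)."""
        new_state = self.state + (1 if action == 1 else -1)
        if new_state < 0 or new_state > 4:
            return self.state, -10, False  # penalty: tried to leave the corridor
        self.state = new_state
        if new_state == 4:
            return new_state, 20, True     # reward: reached the goal
        return new_state, -1, False        # small penalty for every ordinary move


def random_policy(state):
    """A policy maps a state to an action; this one ignores the state entirely."""
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=1000):
    """Let the agent act until the episode ends; return the total reward collected."""
    state, total = env.state, 0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment answers
        total += reward
        if done:
            break
    return total
```

A policy that always walks right, `run_episode(Corridor(), lambda state: 1)`, collects three move penalties and then the goal reward. A learning algorithm's job is to find such a policy from the rewards alone.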

### What algorithms are used for Reinforcement Learning?
- Some of the most common algorithms for Reinforcement Learning are:
    - **Q-Learning**: is a model-free reinforcement learning algorithm to learn a policy telling an agent what action to take under what circumstances.
    - **Temporal Difference**: refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function.
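    - **Deep Adversarial Network**: comes from adversarial machine learning, a technique which attempts to fool models through malicious input. It is sometimes combined with reinforcement learning rather than being a reinforcement learning algorithm in its own right.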
- We will focus on the **Q-learning** algorithm as it is easy to understand as well as powerful.

### How does the Q-learning algorithm work?
- As already noted, I just love this algorithm. It is “easy” to understand and powerful as you will see.
![Q-Learning algorithm](img/QLearning.png)
- The **Q-Learning** algorithm has a **Q-table** (a Matrix of dimension states × actions. Don’t worry if you do not know what a Matrix is; you will not need the mathematical aspects of it. It is just an indexed “container” with numbers).
- The **agent** (or **Q-Learning** algorithm) will be in a state.
- Then in each iteration the **agent** needs to take an **action**.
- After each **action** the **agent** updates the matching entry in the **Q-table** based on the **reward** it received.
- The **learning** can come from either **exploiting** or **exploring**.
- This translates into the following pseudo algorithm for **Q-Learning**.
- The **agent** is in a given **state** and needs to choose an **action**.
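
Before turning to the steps, it may help to see what the **Q-table** looks like in code. The sketch below uses a plain list of lists; the sizes and the `best_action` helper are hypothetical and chosen only for illustration.

```python
n_states, n_actions = 4, 3  # hypothetical sizes, only for illustration

# One row per state, one column per action, every entry starting at zero.
q_table = [[0.0] * n_actions for _ in range(n_states)]

# Pretend the agent has learned that action 1 is good in state 2.
q_table[2][1] = 5.0


def best_action(q_table, state):
    """Exploit: return the action with the highest value in this state's row."""
    row = q_table[state]
    return row.index(max(row))  # ties resolve to the lowest action number
```

Looking up `q_table[state][action]` gives the current estimate of how good that action is in that state, and `best_action` is what the agent uses when it exploits.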

#### Algorithm
- Initialise the **Q-table** to all zeros
- Iterate
    - Agent is in state **state**.
    - With probability **epsilon** choose to **explore**, else **exploit**.
        - If **explore**, then choose a *random* **action**.
        - If **exploit**, then choose the *best* **action** based on the current **Q-table**.
    - Take the **action**, observe the **reward** and the **new_state**, then update the **Q-table** entry for the previous **state** and **action**.
    - Q[**state, action**] = (1 - **alpha**) * Q[**state, action**] + **alpha** * (**reward + gamma** * max(Q[**new_state**]))
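
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used later: the corridor environment in `step`, its rewards, and the default parameter values are all hypothetical.

```python
import random

N_STATES, N_ACTIONS = 5, 2   # hypothetical corridor: action 0 = left, 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Toy environment (an assumption). Returns (new_state, reward, done)."""
    new_state = state + (1 if action == 1 else -1)
    if new_state < 0:
        return state, -10, False   # tried to walk off the corridor
    if new_state == GOAL:
        return new_state, 20, True
    return new_state, -1, False


def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialise the Q-table to all zeros.
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)        # explore: random action
            else:
                action = q[state].index(max(q[state]))   # exploit: best known action
            new_state, reward, done = step(state, action)
            # Blend the old estimate with the reward plus discounted future value.
            target = reward + gamma * max(q[new_state])
            q[state][action] = (1 - alpha) * q[state][action] + alpha * target
            state = new_state
    return q
```

After `q = train()`, taking the best action in every row of `q` except the goal row should send the agent right, towards the goal.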
    
#### Variables
As you can see, we have introduced the following variables.

- **epsilon**: the probability to take a random action, which is done to explore new territory.
- **alpha**: is the learning rate, that is, how strongly the new information in each iteration overrides the old value in the **Q-table**. It should be in the interval from 0 to 1.
- **gamma**: is the discount factor used to balance the immediate and future reward. This value is usually between 0.8 and 0.99.
- **reward**: is the feedback on the action and can be any number. Negative is penalty (or punishment) and positive is a reward.
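
To get a feeling for **alpha** and **gamma**, here is a single update worked through with made-up numbers, using the weighted-average form of the update. The helper `q_update` and all values are hypothetical.

```python
def q_update(old_q, reward, best_next_q, alpha, gamma):
    """One Q-table update: mix the old value with the newly observed target."""
    target = reward + gamma * best_next_q
    return (1 - alpha) * old_q + alpha * target


# Old value 2.0, a move costing -1, and a next state whose best value is 10.0.
# Target: -1 + 0.9 * 10.0 = 8.0
half = q_update(old_q=2.0, reward=-1, best_next_q=10.0, alpha=0.5, gamma=0.9)
none = q_update(old_q=2.0, reward=-1, best_next_q=10.0, alpha=0.0, gamma=0.9)
full = q_update(old_q=2.0, reward=-1, best_next_q=10.0, alpha=1.0, gamma=0.9)
```

With **alpha** at 0.5 the new value lands halfway between the old value 2.0 and the target 8.0, that is 5.0. With **alpha** at 0 nothing is learned, and with **alpha** at 1 the old value is replaced by the target entirely.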

### Description of task we want to solve
- To keep it simple, we create a field of size 10×10 positions. In that field there is an item that needs to be picked up and moved to a drop-off point.
![Field](img/Field-ReinforcementLearning.png)
- At each position there are 6 different actions that can be taken.
    - **Action 0**: Go South if on field.
    - **Action 1**: Go North if on field.
    - **Action 2**: Go East if on field.
    - **Action 3**: Go West if on field.
    - **Action 4**: Pick up item (it can try even if it is not there).
    - **Action 5**: Drop off item (it can try even if it does not have it).
- Based on these actions we will make a reward system.
    - If the **agent** tries to go off the **field**, punish with **-10** in **reward**.
    - If the **agent** makes a (legal) move, punish with **-1** in **reward**, as we do not want to encourage endless walking around.
    - If the **agent** tries to pick up the item, but it is not there or the **agent** already has it, punish with **-10** in **reward**.
    - If the **agent** picks up the item at the correct place, **reward** with **20**.
    - If the **agent** tries to drop off the item in the wrong place or does not have the item, punish with **-10** in **reward**.
    - If the **agent** drops off the item in the correct place, **reward** with **20**.
- That translates into the following code. I prefer to implement this code myself, as I think the standard libraries that provide similar frameworks hide some important details. One example, shown later: how do you map all of this into a state in the Q-table?