## Reinforcement Learning
* Create a simulation to learn
* Data isn't iid
## **Everything you care about can be reduced to a reward scalar**

---

### A new branch of DS 
* Supervised
* Unsupervised
* Reinforcement Learning

---

### Practical achievements in the field

* [Alpha Go](https://www.youtube.com/watch?v=l7ngy56GY6k)
* [DOTA](https://www.youtube.com/watch?v=tfb6aEUMC04)
* [Hide and Seek](https://www.youtube.com/watch?v=kopoLzvh5jY)
* [Grid world](https://www.youtube.com/watch?v=AMnW-OsOcl8)
* [Cartpole](https://youtu.be/XiigTGKZfks?t=100)

---

### Intuition - from Pavlov to Bellman

* Challenge - dog gets food
* Dog hears a bell
* Dog gets food
* After some time, the bell contains some information about the food
* **We should 'backfill' reward information back through time**

#### Bellman - dynamic programming
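
The 'backfill' idea can be sketched in a few lines. This is a minimal illustration rather than Bellman's full method: the three-step chain (`bell`, `wait`, `food`), the discount factor of 0.9 and the function name `backfill_values` are all assumptions made up for this example.

```python
# Hypothetical three-step chain: bell -> wait -> food (terminal).
GAMMA = 0.9  # assumed discount factor

# For each state: (next state, reward received on the transition).
chain = {
    "bell": ("wait", 0.0),
    "wait": ("food", 1.0),  # the food is the only reward
}

def backfill_values(chain, gamma, sweeps=10):
    """Repeatedly apply the backup V(s) = r + gamma * V(s')."""
    values = {state: 0.0 for state in chain}
    values["food"] = 0.0  # terminal state: nothing more to collect
    for _ in range(sweeps):
        for state, (next_state, reward) in chain.items():
            values[state] = reward + gamma * values[next_state]
    return values

values = backfill_values(chain, GAMMA)
print(values)
```

After a couple of sweeps the bell carries value of its own, even though hearing it is never rewarded directly.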

---

### Components in a RL problem

* Rewards - the challenge or competition; rewards can be positive or negative
* Policies - strategies
* Environment - the world the agent acts in
* Agents - the players
* States - the particular setup of the environment at some given time

![](SAR.jpg)

## $P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_t, S_{t-1}, \ldots, S_{t-n})$

---

### A Markov Decision Process - How is it different from a regular Markov Chain?

![](Markov_Decision_Process.png)

### Two extra concepts:
* action nodes - We have decision making power in this chain
* Reward 'lines' - positive and negative rewards
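
One way to picture the two extra concepts is as a lookup table keyed by state and action. The sketch below is a made-up two-state MDP (the state names, action names, probabilities and rewards are all hypothetical), not the one in the figure.

```python
import random

# Hypothetical two-state MDP: transitions[state][action] is a list of
# (probability, next_state, reward) outcomes.
transitions = {
    "cool": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(0.5, "cool", 2.0), (0.5, "hot", 2.0)],
    },
    "hot": {
        "slow": [(1.0, "cool", 1.0)],
        "fast": [(1.0, "hot", -10.0)],
    },
}

def step(state, action, rng):
    """Sample (next_state, reward) for taking `action` in `state`."""
    outcomes = transitions[state][action]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

rng = random.Random(0)
state = "cool"
total_reward = 0.0
for _ in range(5):
    action = rng.choice(["slow", "fast"])     # we choose the action...
    state, reward = step(state, action, rng)  # ...the chain samples the outcome
    total_reward += reward
```

A plain Markov chain would have only one entry per state; the action level and the reward attached to each outcome are what make this a decision process.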

---

### The maths

### The problem space

#### This can be summarised as: maximize rewards!!

More Formally:

An **RL** problem is defined as a tuple containing States, Actions, Transitions, Rewards, and Discounting

$RL=\{S,A,P,R,\gamma\}$

**S** is the set of all possible states - the State Space

$S=\{s_1,s_2,\ldots,s_n\}$

**A** is the set of all possible actions - the Action Space

$A=\{a_1,a_2,\ldots,a_m\}$

**P** is the probability distribution over the next state **s'** (and reward **r**) after taking action **a** in state **s**. The action is intentional, but the resulting state is sampled from a distribution

$P(s', r \mid s, a)$

**R** is the reward received for taking **a** in **s**

$R(s,a)$

**γ** is the Discount Factor applied to future rewards. It is usually less than 1, though it can equal 1 in tasks that are guaranteed to end.

$\gamma$

**G** - the total Return (or **Goal**) at time **t** is the cumulative discounted future reward, which depends on the actions taken.

$G_t=R_{t+1}+\gamma R_{t+2}+\ldots+\gamma^{p-1}R_{t+p}$
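
As a small worked example, the discounted sum can be computed by walking backwards through a list of rewards. The function name `discounted_return` and the sample rewards are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return R_1 + gamma * R_2 + gamma^2 * R_3 + ... for rewards in time order."""
    g = 0.0
    for reward in reversed(rewards):
        # Each step back in time discounts everything that comes after it.
        g = reward + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```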


---

### The solution space:
#### 3 main branches:
* Value based
    * Predicting value
    * Q-learning, DQN
* Policy based
    * Predicting policy
    * Policy Gradient, DPN
* Model based (Environment based)
    * Predicting what will happen
    * World models / MBMF
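
To make the value-based branch concrete, here is a minimal sketch of tabular Q-learning. The five-cell corridor environment, the hyperparameters and every name in the block are hypothetical and chosen only to keep the example small; it is not the approach used in the rest of this notebook.

```python
import random

N_STATES = 5             # corridor cells 0..4; cell 4 is the goal (terminal)
ACTIONS = (0, 1)         # 0 = left, 1 = right
ALPHA, GAMMA = 0.5, 0.9  # assumed learning rate and discount factor

def corridor_step(state, action):
    """Deterministic toy environment: reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore at random (Q-learning is off-policy)
            next_state, reward, done = corridor_step(state, action)
            # Bootstrap from the best next action unless the episode has ended.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q

q_table = q_learning()
# Greedy action for each non-terminal cell.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)
```

The learned table predicts value, and the policy falls out of it by picking the highest-valued action in each state.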

--- 

### Further reading

* Sutton and Barto - Reinforcement Learning: An Introduction
* Bellman - Dynamic Programming

---

# Let's do this!!!

### Implement an example using OpenAI's Gym
* A handy library for learning about RL - https://gym.openai.com/

`pip install gym`

In [1]:
import gym

---

### Let's work on the cartpole problem
#### First we make an environment in which the agent can be trained

In [2]:
env = gym.make('CartPole-v1')

#### Now we implement the agent-environment loop
* Start the process by resetting the environment
* Resetting also returns an initial observation

In [3]:
import time
env.reset()
env.render()
time.sleep(10)
env.close()

We can also get an observation by taking an action - in this case a `step` in a given direction, 0 for left and 1 for right
* `step` returns a tuple of items
* The first is the new observation
* We also get a reward value
* A boolean to tell us if we're done
* And an info value we don't use

In [4]:
# one call to step returns all four values (classic Gym API, before 0.26);
# calling step once per value would advance the simulation four times
observation, reward, done, _ = env.step(1)

In [5]:
# cart position (- == left of screen, + == right of screen),
# cart velocity,
# pole angle (- == leaning left, + == leaning right),
# pole angular velocity
observation

array([-0.0131883 ,  0.2120274 , -0.03381001, -0.34193451])

### These are things we care about!!
* Rewards  - `reward`
* (Policies - Strategies)
* Environment - `env`
* Agent - a `function` we write
* States - `observation` (the agents below use `obs[2]`, the pole angle)

We can already use the `done` boolean to work out if we can stop the loop

---

### Take one: Let's build an agent that takes random actions

In [6]:
def random_agent():
    env.reset()
    for i in range(1000):
        env.render()
        obs, reward, done, _ = env.step(env.action_space.sample()) # take a random action
        time.sleep(0.1)
        if done:
            print(f'We survived {i} steps')
            env.reset()
            break
    env.close()

---

### Take two: Build an agent that observes the environment and acts accordingly

For example:
* If the pole leans left, move left
* If the pole leans right, move right

In [7]:
def better_rl():
    obs = env.reset()
    
    for i in range(1000):
        if obs[2] < 0 :
            action = 0
        else:
            action = 1
        obs, reward, done, _ = env.step(action)
        time.sleep(0.1)
        env.render()
        
        if done:
            print(f'We survived {i} steps')
            env.close()
            break

In [8]:
better_rl()

We survived 39 steps


---

### Take three: Use some RL techniques

#### Specifically, we are going to build a policy based RL algorithm
* To build policy based RL, sample from a game simulation
* Run multiple simulations, infer from the simulations which sets of moves result in the highest reward
* This is on-policy - a less sample-efficient method than Q-learning (we need lots of samples)

---

### 3.1: Sample from n game simulations

In [9]:
import numpy as np

def sample_simulation_data(env): #training data
    number_of_games = 200
    last_moves = 25 #number of final moves of each game to discard
    observations = []
    actions = []

    for i in range(number_of_games):
        game_obs = []
        game_acts = []
        obs = env.reset()

        for j in range(1000):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            game_obs.append(obs)
            game_acts.append(action)

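            if done:
                #game_obs[k] is the state *after* action k, so it is paired
                #with action k+1; the last moves before failure are dropped
                observations += game_obs[:-(last_moves+1)]
                actions += game_acts[1:-last_moves]
                break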

    observations = np.array(observations)
    actions = np.array(actions)

    return observations, actions

### 3.2: Train an agent to learn the policy embedded in the simulation

In [13]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np

In [14]:
X, y = sample_simulation_data(env)

### 3.3: Set the agent loose in a live simulation
* It will act based on the best policy for a given state of the game

In [None]:
#default settings; the explicit max_features='auto' and min_impurity_split
#arguments are not accepted by recent scikit-learn versions
m = RandomForestClassifier(n_estimators=100)
m.fit(X,y)

In [None]:
def smart_rl(env, m):
    #setup the game
    obs = env.reset()

    for i in range(1000):
        #start to play the game
        #model, tell me what to do next please
        obs = obs.reshape(-1,4) #X data is the observation
        action = int(m.predict(obs)[0]) #y data is the action we should take

        #take the predicted step
        obs,reward,done,_ = env.step(action)
        #visualise my results
        env.render()
        #print(obs, reward)
        time.sleep(0.1)
        #find out if I died
        if done:
            print(f'iterations survived {i}')
            env.close()
            break


In [None]:
smart_rl(env,m)
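
A single game is a noisy way to compare agents, so it can help to average over several. The helper below is a sketch under the assumption that the environment follows the classic Gym API used above (`reset()` returns an observation, `step()` returns a 4-tuple); `evaluate` and `angle_policy` are names invented for this example.

```python
def evaluate(policy, env, episodes=20, max_steps=1000):
    """Average number of steps survived by `policy` over several episodes.

    Assumes the classic Gym API: `env.reset()` returns an observation and
    `env.step(action)` returns (observation, reward, done, info).
    """
    total_steps = 0
    for _ in range(episodes):
        obs = env.reset()
        for step in range(1, max_steps + 1):
            obs, reward, done, _ = env.step(policy(obs))
            if done:
                break
        total_steps += step
    return total_steps / episodes

def angle_policy(obs):
    """The hand-written rule from take two: push towards the side the pole leans."""
    return 0 if obs[2] < 0 else 1

# Usage (not run here; needs the `env` and model `m` from the cells above):
# evaluate(angle_policy, env)
# evaluate(lambda obs: int(m.predict(obs.reshape(1, -1))[0]), env)
```

Skipping `env.render()` and `time.sleep` here keeps the comparison fast; rendering is only needed when watching a single game.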