## Reinforcement Learning
### Understand the theory

* Practical achievements in the field
* Supervised / Unsupervised / Reinforcement
* Pavlov to Bellman
* Environment / State / Action / Reward
* Drawbacks - curse of dimensionality, credit assignment problem
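
One way to make the reward and credit assignment ideas above concrete is the discounted return, which spreads credit for later rewards back over earlier steps. This is a minimal sketch: the function name `discounted_returns` and the discount factor `gamma` are assumptions for illustration and are not used elsewhere in this notebook.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with a reward of 1 each, heavily discounted.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier steps get a larger return because they are credited with everything that followed them.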

---

### Implement it in practice using OpenAI's Gym
* A handy library for learning about RL - https://gym.openai.com/

`pip install gym`

In [1]:
import gym
import time
import numpy as np

---

### Let's work on the cartpole problem
#### First we make an environment in which the agent can be trained

In [2]:
env = gym.make('CartPole-v1')

#### Now we implement the agent-environment loop
* Start the process by resetting the environment
* And return an initial observation

In [3]:
initial_obs = env.reset()
initial_obs  # cart position, cart velocity, pole angle, pole angular velocity

array([ 0.02989791,  0.0465321 ,  0.01970152, -0.04242339])

We also get an observation back when we take an action, in this case a `step` in a given direction: 0 for left and 1 for right

In [4]:
obs, reward, done, _ = env.step(0) #0 = move left, 1 = move right.

We can already use the `done` boolean to work out if we can stop the loop

In [5]:
done

False

And we can `sample` the `action_space` to pick an action at random

In [6]:
random_step = env.action_space.sample()

And `render` the environment to see what our cart is doing

In [7]:
env.render()
time.sleep(5)
env.close()
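
The pieces above (`reset`, `step`, `done`, and a random action) combine into one agent-environment loop. The sketch below uses a hypothetical stand-in class, `ToyWalkEnv`, so that it runs without Gym installed. It only mimics the `reset()` and four-value `step()` interface shown above and is not part of the Gym library. `rng.choice([0, 1])` stands in for `env.action_space.sample()`.

```python
import random


class ToyWalkEnv:
    """Hypothetical stand-in for a Gym environment (not part of gym)."""

    def __init__(self, limit=3):
        self.limit = limit
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # 0 = move left, 1 = move right, mirroring the CartPole convention.
        self.position += 1 if action == 1 else -1
        done = abs(self.position) > self.limit
        return self.position, 1.0, done, {}


def run_random_episode(env, max_steps=1000, seed=None):
    """Play one episode with random actions; return (steps, total reward)."""
    rng = random.Random(seed)
    obs = env.reset()
    steps = 0
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice([0, 1])  # stands in for env.action_space.sample()
        obs, reward, done, _info = env.step(action)
        steps += 1
        total_reward += reward
        if done:
            break
    return steps, total_reward


steps, total_reward = run_random_episode(ToyWalkEnv(), seed=0)
print(f'steps survived: {steps}, total reward: {total_reward}')
```

Swapping `ToyWalkEnv()` for the `env` created earlier should work in the same way, assuming the same Gym version as the rest of this notebook.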

**OK, but we need to build an RL agent. What next?**

First, let's try to build the simplest RL agent:
* If the pole is left, move left
* If the pole is right, move right

In [8]:
def simple_rl(env):
    # reset the env to get the first observation
    obs = env.reset()
    # loop over:
    for i in range(1000):
        # measure: is the pole angled left or right? (obs[2] is the pole angle)
        # action: if left -> move left (0), otherwise -> move right (1)
        if obs[2] < 0:
            action = 0
        else:
            action = 1
        obs, reward, done, _ = env.step(action)
        time.sleep(0.1)
        env.render()
        if done:
            print(f'iterations survived: {i}')
            env.close()
            break

In [9]:
simple_rl(env)  # baseline: survives 36-45 iterations

iterations survived: 40


**I think we can do better than that. Let's build a model which learns to move better based on training data**

* First we need to generate some training data
* X = obs
* y = the action taken after each observation

In [18]:
def collect_training_data(env):
    # play 10000 random games
    number_of_games = 10000
    # number of final moves to drop from each game
    last_moves = 20
    observations = []
    actions = []
    
    for i in range(number_of_games):
        #in each game
        game_observations = []
        game_actions = []
        obs = env.reset()
        
        for j in range(1000):
            # take a series of random steps
            action = env.action_space.sample()
            #measure how that action changed the state
            obs, reward, done, _ = env.step(action)
            # store results
            game_observations.append(obs)
            game_actions.append(action)
            
            if done:  # if the agent dies
                # drop the last moves, which led to the failure and are rubbish data
                observations += game_observations[:-last_moves]
                # shift actions by one so each observation is paired with the action taken next
                actions += game_actions[1:-(last_moves-1)]
                break
                
    return np.array(observations), np.array(actions)
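
The two slices in `collect_training_data` are easy to misread, so here is a small worked example of how they line up. The five-step game and its string labels are made up for illustration. `obs0` is the observation returned after action `a0`, so the action chosen on seeing `obs0` is `a1`.

```python
# Hypothetical five-step game: obs_k is the state observed after action a_k.
game_observations = ["obs0", "obs1", "obs2", "obs3", "obs4"]
game_actions = ["a0", "a1", "a2", "a3", "a4"]
last_moves = 2  # smaller than the notebook's 20, to keep the example short

# Same slicing as in collect_training_data.
kept_obs = game_observations[:-last_moves]
kept_actions = game_actions[1:-(last_moves - 1)]

pairs = list(zip(kept_obs, kept_actions))
print(pairs)  # [('obs0', 'a1'), ('obs1', 'a2'), ('obs2', 'a3')]
```

Each kept observation is paired with the action taken next, and the moves just before failure are dropped. Note that this slicing assumes `last_moves` is at least 2: with `last_moves = 1` the action slice becomes `[1:0]`, which is empty.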

* Then a model which plays based on its predictions

In [19]:
def smart_rl(env, m):
    # reset the env to get the first observation
    obs = env.reset()
    # loop over:
    for i in range(700):
        # ask the model to predict the next move
        obs = obs.reshape(-1,4)
        action = int(m.predict(obs)[0])
        # take the model's idea of the right move
        obs, reward, done, _ = env.step(action)
        time.sleep(0.1)
        env.render()
        if done:
            print(f'iterations survived: {i}')
            env.close()
            break

#### Now let's run the code, and measure the improvement
* Setup the gym
* Collect training data
* Train a model
* And play
* And measure

In [20]:
X, y = collect_training_data(env)

In [21]:
from sklearn.ensemble import RandomForestClassifier
m = RandomForestClassifier()

In [22]:
m.fit(X,y)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [38]:
smart_rl(env,m)

iterations survived: 226


In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [40]:
lr.fit(X,y)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [41]:
smart_rl(env,lr)

iterations survived: 499
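
A single game is a noisy measurement, so averaging over several episodes may give a fairer comparison between agents. The sketch below is one possible way to do that. `average_survival` and `FixedLengthEnv` are hypothetical names introduced here. The stand-in environment only exists so that the example runs without Gym, and it follows the same `reset()` and four-value `step()` interface used in this notebook.

```python
from statistics import mean


def average_survival(env, choose_action, episodes=20, max_steps=500):
    """Average number of steps survived over several episodes."""
    survived = []
    for _episode in range(episodes):
        obs = env.reset()
        steps = 0
        for _step in range(max_steps):
            obs, reward, done, info = env.step(choose_action(obs))
            steps += 1
            if done:
                break
        survived.append(steps)
    return mean(survived)


class FixedLengthEnv:
    """Hypothetical stand-in env: every episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= self.length, {}


# Always move left; the stand-in env ignores the action anyway.
print(average_survival(FixedLengthEnv(7), lambda obs: 0, episodes=3))  # 7
```

With the models trained above, something like `average_survival(env, lambda obs: int(m.predict(obs.reshape(1, -1))[0]))` could be used, assuming the same Gym and scikit-learn versions as the rest of the notebook. Also note that `CartPole-v1` ends an episode after 500 steps, so a result of 499 suggests the agent reached that cap rather than dropping the pole.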
