# Recreation of Teacher-Student Framework Paper

This notebook attempts to recreate the results shown in the paper `Teacher-Student Framework: a Reinforcement Learning Approach` [1].

We use the OpenAI Gym MountainCar environment, as the paper does. The agents use the SARSA algorithm with linear function approximation and binary tile coding. We will work out what exactly that means as the implementation progresses; to start with, we just get on with it.

[1] Matthieu Zimmer, Paolo Viappiani, Paul Weng. Teacher-Student Framework: a Reinforcement Learning Approach. AAMAS Workshop Autonomous Robots and Multirobot Systems, May 2014, Paris, France. hal-01215273

---
### Create the environment and explore the space

This simple block of code creates the environment and renders 500 steps under a random policy. A random policy will not reach the goal, so we ignore the `done` flag. We then print the last observation and look at the action and observation spaces.

Action space:

    Size = 3
    Actions = push left (0), neutral (1), push right (2)

Observation space: 
    
    Size = 2
    Observations = x position, velocity

In [1]:
import gym
import tensorflow as tf
from tensorflow import keras
import numpy as np
from matplotlib import pyplot as plt



env = gym.make('MountainCar-v0')
env.reset()

for _ in range(500):
    env.render()
    observation, reward, done, info = env.step(env.action_space.sample()) # take a random action
    
print(f"Observation: {observation}")
env.reset()
print(f"Action space:\t{env.action_space}")
print(f"Observation space:\t{env.observation_space}")

env.close()

Observation: [-0.55384094 -0.00470186]
Action space:	Discrete(3)
Observation space:	Box([-1.2  -0.07], [0.6  0.07], (2,), float32)


### Setting up the SARSA algorithm

Here we set up the SARSA algorithm, defining the required parameters, setting action selection functions and update functions. This can then be used to train a model in future stages.
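
For reference, the `update` function in this section is meant to implement the standard SARSA update. The second line shows its usual form under a linear approximation $Q_\theta(s, a) = \theta^\top \phi(s, a)$; the symbols $\theta$ (weight vector) and $\phi$ (feature vector) are notation introduced here for illustration, not taken from the paper.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]
$$

$$
\delta = r + \gamma\, \theta^\top \phi(s', a') - \theta^\top \phi(s, a), \qquad \theta \leftarrow \theta + \alpha\, \delta\, \phi(s, a)
$$

Here $(s, a)$ is the current state and action, $r$ the reward, and $(s', a')$ the next state and the action actually selected in it, which is what makes the update on-policy.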

We use an $\epsilon$-greedy strategy: a random action is chosen with a small probability ($1 - \epsilon$), and the greedy action is taken otherwise. Note that this inverts the more common convention, in which $\epsilon$ itself is the exploration probability. No epsilon scheduling is currently planned for the training loop, but it could be added quite easily.
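
The scheduling mentioned above could look something like the following. This is only a minimal sketch and is not taken from the paper: the function names `scheduled_epsilon` and `epsilon_greedy`, the linear schedule, and its start and end values are all assumptions made for illustration. It keeps this notebook's convention that $\epsilon$ is the probability of taking the greedy action.

```python
import numpy as np

def scheduled_epsilon(episode, total_episodes, start=0.5, end=0.99):
    """Linearly raise the greedy probability from `start` to `end` (assumed values)."""
    fraction = min(episode / max(total_episodes - 1, 1), 1.0)
    return start + fraction * (end - start)

def epsilon_greedy(q_values, epsilon, rng):
    """Take the greedy action with probability epsilon, otherwise a uniformly random one."""
    if rng.uniform(0, 1) > epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # made-up Q values for the three actions
action = epsilon_greedy(q_values, scheduled_epsilon(0, 10000), rng)
```

Starting with a lower greedy probability and raising it over time lets the agent explore early and exploit later; whether that helps on MountainCar would need to be checked experimentally.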

There is still quite a bit to do on this step, but we ignore that for now. The main open problem is choosing how to represent the Q values. A tabular form would be simple to update and query, but the paper uses a linear approximation function, which I need to understand better.
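
As a starting point for that decision, here is one possible way a linear Q function over binary tile-coded features might look. It is a sketch under assumptions, not the paper's configuration: the number of tilings and tiles, the uniform offsets, the `weights` layout, and the helper names `active_tiles` and `sarsa_update` are all hypothetical. With binary features, the linear function $Q(s, a)$ reduces to the sum of the weights of the tiles that contain the state.

```python
import numpy as np

# Assumed layout: 8 tilings of 8x8 tiles over MountainCar's observation bounds.
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
N_TILINGS, N_TILES, N_ACTIONS = 8, 8, 3

# One weight per (action, tiling, tile row, tile column); the extra row and
# column absorb the per-tiling offset.
weights = np.zeros((N_ACTIONS, N_TILINGS, N_TILES + 1, N_TILES + 1))

def active_tiles(state):
    """Return, for each tiling, the (row, col) of the single tile containing `state`."""
    scaled = (np.asarray(state) - LOW) / (HIGH - LOW) * N_TILES  # in [0, N_TILES]
    offsets = np.arange(N_TILINGS)[:, None] / N_TILINGS          # shift each tiling
    idx = np.floor(scaled[None, :] + offsets).astype(int)
    return np.clip(idx, 0, N_TILES)

def get_q_val(state, action):
    """Q(s, a) is the sum of the weights of the active binary features."""
    idx = active_tiles(state)
    return weights[action, np.arange(N_TILINGS), idx[:, 0], idx[:, 1]].sum()

def get_best_action(state):
    """Greedy action under the current weights."""
    return int(np.argmax([get_q_val(state, a) for a in range(N_ACTIONS)]))

def sarsa_update(state, action, target, alpha=0.1):
    """Move Q(s, a) towards `target`, splitting the step across the tilings."""
    idx = active_tiles(state)
    error = target - get_q_val(state, action)
    weights[action, np.arange(N_TILINGS), idx[:, 0], idx[:, 1]] += alpha / N_TILINGS * error
```

Dividing the step size by the number of tilings keeps the total change in $Q(s, a)$ equal to `alpha * error`, so `alpha` can be read the same way as in the tabular case.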

In [None]:
epsilon = 0.9
total_episodes = 10000
max_steps = 500
alpha = 0.85
gamma = 0.95

def select_action(state):
    if np.random.uniform(0, 1) > epsilon:
        action = env.action_space.sample()
    else:
        action = get_best_action(state)

    return action 

def get_best_action(state):
    # TODO: Decide on how this function is going to work. It depends heavily on the function used for learning
    return 1

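def update(state, action, reward, state2, action2):
    # SARSA target uses the action actually selected in the next state
    q_val = get_q_val(state, action)
    future_q = get_q_val(state2, action2)
    new_q_val = q_val + alpha*(reward + gamma*future_q - q_val)
    # Pass the state-action pair so the right entry can be updated
    update_q_value(state, action, new_q_val)

def get_q_val(state, action):
    # Placeholder: depends on the Q value representation, still to be decided
    return 1

def update_q_value(state, action, new_value):
    # Placeholder: see above
    pass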

### Training loop

Here is where we actually put the meat on the bones of the system. Training *should* be easy, but we all know what happens when we write something like that down, don't we? So let's keep it super simple to start with.

In [None]:
for episode in range(total_episodes):

    # Reset to initial state
    state1 = env.reset()
    action1 = select_action(state1)

    for t in range(max_steps):
        env.render()

        # Take action
        state2, reward, done, info = env.step(action1)
        action2 = select_action(state2)

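        # MountainCar-v0 also sets done when its time limit is hit, so only
        # give the zero reward when the goal position was actually reached
        reached_goal = done and state2[0] >= env.unwrapped.goal_position
        reward = 0 if reached_goal else -1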

        # Update Q values
        update(state1, action1, reward, state2, action2)

        # Prep for next step

        state1, action1 = state2, action2

        if done: break