# Q-learning using tables
##### Authors: Eirik Fagtun Kjærnli and Fabian Dietrichson

#### Question for participants during the workshop

#### Questions for us
- How complex can each task be?
- Which visualizations do we need?
- How to make sure progress is upheld during the workshop?
- How to make sure everybody manages to follow?

Possible environments
- taxi-v2 (https://medium.com/@anirbans17/reinforcement-learning-for-taxi-v2-edd7c5b76869)

#### Links
Q-learning and introduction to the algorithm
- Inspo: https://towardsdatascience.com/practical-reinforcement-learning-02-getting-started-with-q-learning-582f63e4acd9

#### Introduction to Jupyter Notebooks and structure of notebook
This workshop is structured so that, in each task cell, you write your code between the "Write code below" and "Write code above" markers.

### Task:
Click on the cell below and run it to define the example method.  
Hotkey to run a cell: Ctrl + Enter

In [129]:
def multiply_input_by_2(a_variabel):

    "Write code below" 
    result = a_variabel * 2
    
    "Write code above" 
    
    return result

In [130]:
assert(multiply_input_by_2(10) == 20), "Your method did not multiply the input by 2"

### Import of necessary packages
The first step is to import the necessary packages. The gym package is the most central one: it contains the games that our reinforcement learning agents are going to solve.

#### Task:
Click on the cell below and run it, to import the necessary packages

In [120]:
import numpy as np
import gym

### Introduce the game we are going to solve here

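Taxi-v2 is a small grid-world game. A taxi moves around a 5x5 grid, picks up a passenger at one of four marked locations, and drops the passenger off at another. The agent can choose between 6 actions (move south, north, east, west, pick up, drop off), and the environment has 500 discrete states. Each step gives a reward of -1, a successful drop-off gives +20, and an illegal pick-up or drop-off gives -10.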
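To make the 500 states less abstract, here is a minimal sketch of how four small components can be packed into one state number. It assumes a 5x5 grid, 5 passenger locations (the four marked locations plus "in the taxi"), and 4 destinations, and it mirrors the layout used by Gym's Taxi environment as far as we know. The names `encode_state` and `decode_state` are our own and are not part of the notebook.

```python
# Assumption: 5 rows x 5 columns x 5 passenger locations x 4 destinations.
number_of_states = 5 * 5 * 5 * 4


def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into a single integer in [0, 500)."""
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_location
    state = state * 4 + destination
    return state


def decode_state(state):
    """Inverse of encode_state: recover the four components."""
    destination = state % 4
    state //= 4
    passenger_location = state % 5
    state //= 5
    taxi_col = state % 5
    taxi_row = state // 5
    return taxi_row, taxi_col, passenger_location, destination
```

Each state number therefore corresponds to exactly one combination of taxi position, passenger location, and destination, which is why a table with one row per state is enough.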

#### Task:
Create the environment variable containing all necessary methods to run the Taxi-v2 game

In [53]:
# Create the Taxi-v2 environment
env = gym.make("Taxi-v2")

### Create a Q-table
Q-learning is one of the most iconic reinforcement learning algorithms and can be used to solve a great variety of challenges. In its simplest form, it uses a table to store its values.

- Each row in the array is a state
- Each column is an action
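As a small illustration of the row/column layout, the sketch below builds a toy table for a hypothetical problem with 3 states and 2 actions (not the Taxi sizes) and reads and writes a single entry.

```python
import numpy as np

# Hypothetical toy problem: 3 states (rows) and 2 actions (columns).
toy_q_table = np.zeros([3, 2])

# Store an estimated value for taking action 1 in state 2.
toy_q_table[2, 1] = 0.5

# All action values for state 2 are one row of the table.
values_for_state_2 = toy_q_table[2, :]
```

Indexing with `[state, action]` gives one value, while `[state, :]` gives the values of every action in that state.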

#### Task:
Create and return a Q-table, where each cell is initialized to zero, with:
- The number of columns equal to number of actions in the environment (Use env.action_space.n)
- The number of rows equal to number of states in the environment (Use env.observation_space.n)

In [112]:
def create_q_table(env):
    
    "Write code below" 
    action_size = env.action_space.n
    observation_size = env.observation_space.n
    
    Q_table = np.zeros([observation_size, action_size])
    
    "Write code above" 
    
    return Q_table

In [93]:
# DO NOT EDIT THIS CELL
assert(np.count_nonzero(create_q_table(env)) == 0), "All values in Q-table should be zero"
assert(create_q_table(env).shape == (500,6)), "The dimensions are wrong"

### Select the best action
The next method will pick out the best action given the state

#### Task
Pick the action with the highest Q-value given state
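One detail worth knowing when the table starts at zero: `np.argmax` returns the first index when several actions share the highest value. The sketch below shows this, together with a hypothetical alternative (`argmax_random_tie`, our own name, not part of the workshop) that picks uniformly among the tied actions.

```python
import numpy as np

# A freshly initialized row: all 6 actions have the same value.
q_row_all_zero = np.zeros(6)

# np.argmax resolves ties by returning the lowest index.
first_best = int(np.argmax(q_row_all_zero))


def argmax_random_tie(q_row, rng):
    """Hypothetical variant: choose uniformly among all maximal actions."""
    best_actions = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best_actions))


rng = np.random.default_rng(0)
picks = {argmax_random_tie(q_row_all_zero, rng) for _ in range(200)}
```

For this workshop the plain `np.argmax` is sufficient, since exploration is handled separately through epsilon.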

In [62]:
def get_best_action(Q, state):
    
    "Write code below" 
    best_action =  np.argmax(Q[state,:]) 
    
    "Write code above"
    
    return best_action

In [96]:
# DO NOT EDIT THIS CELL
Q_temp = create_mock_q_table(env)
assert(get_best_action(Q_temp, 50) == 0), "The method did not pick the action with the highest Q_value"

### Select an action
Balancing exploration and exploitation is an effective way to ensure that the agent explores a sufficient part of the state space and avoids converging to a local optimum.

#### Task:
Compute the action to take in the current state, including exploration.  
If the exploration probability, i.e. epsilon, is higher than a random number, take a random action. Otherwise, take the best action according to the Q-table (use get_best_action).

Tip: 
- To pick a random action, use env.action_space.sample()
- To generate a random number with uniform probability, use np.random.uniform()
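To get a feel for what epsilon does, the sketch below simulates the rule on a single hypothetical row of Q-values, using only NumPy instead of the environment. The function name `epsilon_greedy` and the values in `q_row` are assumptions made for illustration.

```python
import numpy as np


def epsilon_greedy(q_row, epsilon, rng):
    """Return a random action with probability epsilon, else the greedy one."""
    if epsilon > rng.uniform():
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))


rng = np.random.default_rng(42)
q_row = np.array([0.0, 0.0, 1.0, 0.0])  # hypothetical values; action 2 is greedy

samples = 10_000
greedy_fraction = np.mean(
    [epsilon_greedy(q_row, 0.2, rng) == 2 for _ in range(samples)]
)
```

With epsilon = 0.2 and 4 actions, the greedy action should be chosen in roughly 1 - 0.2 + 0.2/4 = 85% of the cases, because a random draw can also land on the greedy action.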

In [68]:
def select_action(Q, state, epsilon):
    
    "Write code below"
    if epsilon > np.random.uniform():
        action = env.action_space.sample()
    else:
        action = get_best_action(Q, state)
    
    "Write code above"    
    
    return action

In [102]:
Q_temp = create_mock_q_table(env)
assert(select_action(Q_temp, 50, 0) == 0), \
"Method should always return the same value for this state, since epsilon is 0"
assert not (len(set([select_action(Q_temp, 50, 1) for _ in range(20)])) <= 2), \
"Method should not return identical values when a random action should be chosen"

### Gradually shift towards exploitation, and reduce exploration of state space
To gradually move from exploring the environment with random actions to exploiting what the agent has learned, we need to reduce the probability of choosing a random action. In other words, we need to reduce epsilon.

#### Task:
Create a method that reduces epsilon by multiplying it by a factor called epsilon decay
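Because epsilon is multiplied by the same factor every time, it shrinks geometrically. The sketch below shows how to estimate how many reductions are needed before exploration has almost stopped. The helper names and the threshold of 0.01 are assumptions chosen for illustration.

```python
import math


def epsilon_after(reductions, initial_epsilon=1.0, epsilon_decay=0.95):
    """Epsilon after applying the decay factor a given number of times."""
    return initial_epsilon * epsilon_decay ** reductions


def reductions_until_below(threshold, initial_epsilon=1.0, epsilon_decay=0.95):
    """Smallest number of reductions after which epsilon < threshold."""
    ratio = math.log(threshold / initial_epsilon) / math.log(epsilon_decay)
    return math.floor(ratio) + 1


steps_with_095 = reductions_until_below(0.01, epsilon_decay=0.95)
steps_with_099 = reductions_until_below(0.01, epsilon_decay=0.99)
```

A decay factor closer to 1 keeps the agent exploring for many more episodes, which matters when the total number of episodes is limited.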

In [131]:
def reduce_epsilon(epsilon):
    epsilon_decay = 0.95
    
    "Write code below" 
    epsilon *= epsilon_decay
    "Write code above"
    
    return epsilon

In [132]:
assert(reduce_epsilon(5) == 4.75), "Given an input of 5, the output should have been 4.75"

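### Update Q-table

Q-learning equation:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Here $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward, and $s'$ is the new state. If $s'$ is a final state, the update is simply $Q(s, a) \leftarrow r$.

#### Task
You are to implement the Q-learning update in the method below.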
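A single update step can be traced by hand. The sketch below uses the learning rate (0.8) and discount factor (0.95) from the simulation cell of this notebook, applied to a hypothetical toy table with 2 states and 2 actions. The table values and the transition are made up for illustration.

```python
import numpy as np

learning_rate = 0.8
discount_factor = 0.95

# Hypothetical toy table: 2 states, 2 actions.
Q = np.zeros([2, 2])
Q[0, 1] = 2.0    # current estimate for (state 0, action 1)
Q[1, 0] = 10.0   # best value available in the new state

state, action, reward, new_state = 0, 1, -1.0, 1

# Target: reward plus discounted best value of the new state (-1 + 0.95 * 10 = 8.5)
target = reward + discount_factor * np.max(Q[new_state, :])

# Difference between target and current estimate (8.5 - 2.0 = 6.5)
td_error = target - Q[state, action]

# Move the estimate a fraction of the way towards the target (2.0 + 0.8 * 6.5 = 7.2)
Q[state, action] += learning_rate * td_error
```

With a learning rate of 0.8, the estimate moves 80% of the way from its old value towards the target in one step.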

In [133]:

def update_q_table(Q, action, state, done, reward, new_state):
    """
    Apply one Q-learning update to Q[state, action] and return the table.
    Uses the global variables discount_factor and learning_rate.
    """
    if done:
        # Final state: there is no future reward to take into account
        Q[state, action] = reward
    else:
        # Temporal-difference error: (reward + discounted best next value) - current estimate
        error_in_estimate = reward + discount_factor * np.max(Q[new_state, :]) - Q[state, action]
        Q[state, action] = Q[state, action] + learning_rate * error_in_estimate

    return Q

In [65]:
# Methods we should create?
def visualize_q_table():
    # See if we find some good ways to visualize the Q-table
    pass

def store_data_for_visualization():
    # To plot reward development
    # Number of iterations
    pass

def reset_env_and_update_params(env, epsilon):
    epsilon_decay = 0.99
    epsilon *= epsilon_decay
    
    done = False
    state = env.reset()
    iterations = 0
    total_reward = 0
    
    return state, done, iterations, total_reward, epsilon

def render_performance(Q):
    state = env.reset()
    env.render()
    done = False

    while not done:
        # Get action
        action = np.argmax(Q[state,:]) 

        # Take action in environment
        new_state, _, done, _ = env.step(action)

        # Update current state
        state =  new_state

        # Render
        env.render()
        
def create_mock_q_table(env):
    Q = create_q_table(env)
    Q[50,0] = 1
    return Q

# Test environment

In [None]:
# Simulation
num_episodes = 1000
max_iterations = 40

# Q-learning parameters
discount_factor = 0.95
learning_rate = 0.8
epsilon = 1
done = False

# Initialize environment and create Q-table
Q = create_q_table(env)

# Run simulation

for episode in range(num_episodes):
    
    if (done and episode % 50 == 0): 
        print("Episode: {} | Iterations: {} | Total Reward: {}".format(episode, iterations, total_reward))
    state, done, iterations, total_reward, epsilon = reset_env_and_update_params(env, epsilon)
        
    while not done:
        # Get action
        action = select_action(Q, state, epsilon)
        
        # Take action in environment
        new_state, reward, done, _ = env.step(action)

        # Update Q-table
        Q = update_q_table(Q, action, state, done, reward, new_state)
        
        # Update current state
        state = new_state
        
        # End check
        iterations += 1
        total_reward += reward
        
        if (iterations >= max_iterations): done = True

In [None]:
render_performance(Q)
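
Besides rendering a single run, it can be useful to measure how well the greedy policy performs on average. The sketch below is one possible way to do that. It assumes the same `reset()` / `step()` interface used in this notebook and runs against a tiny stand-in environment (`TwoStepEnv`, a hypothetical class written for this example) so that it works without gym.

```python
import numpy as np


class TwoStepEnv:
    """Hypothetical stand-in environment with the notebook's reset/step shape."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves forward; reaching position 2 ends the episode.
        if action == 1:
            self.position += 1
        done = self.position == 2
        reward = 20 if done else -1
        return self.position, reward, done, {}


def average_greedy_reward(env, Q, num_episodes=10, max_iterations=40):
    """Run the greedy policy and return the mean total reward per episode."""
    totals = []
    for _ in range(num_episodes):
        state, done, total, iterations = env.reset(), False, 0, 0
        while not done and iterations < max_iterations:
            action = int(np.argmax(Q[state, :]))
            state, reward, done, _ = env.step(action)
            total += reward
            iterations += 1
        totals.append(total)
    return float(np.mean(totals))


# Hypothetical Q-table whose greedy policy always moves forward.
toy_Q = np.zeros([3, 2])
toy_Q[:, 1] = 1.0
good_policy_reward = average_greedy_reward(TwoStepEnv(), toy_Q)
```

The same function could be called with the Taxi environment and the trained Q-table, and the iteration cap keeps an untrained policy from looping indefinitely.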