# Tutorial on Q-learning (table lookup approach)

In its simplest implementation, Q-learning is a table of values with one row for every state and one column for every action possible in the environment. Within each cell of the table, we learn a value for how good it is to take a given action in a given state. In the case of the FrozenLake environment, we have 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of Q-values. We start by initializing the table uniformly (all zeros), and then, as we observe the rewards we obtain for various actions, we update the table accordingly.

Bellman equation: $ Q(s,a) = r + \gamma \max_{a'} Q(s',a') $
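
As a small worked example of how this equation is used, the sketch below applies one tabular update to a toy 16x4 table. The helper name `q_update`, the sample transition, and the pre-filled value are assumptions made for illustration; the learning rate and discount mirror the values set later in this notebook.

```python
import numpy as np

def q_update(Q, s, a, r, new_s, lr=0.8, gamma=0.95):
    """Move Q[s, a] toward the Bellman target r + gamma * max_a' Q[new_s, a']."""
    target = r + gamma * np.max(Q[new_s, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q

Q_demo = np.zeros((16, 4))
Q_demo[1, 2] = 1.0  # hypothetical value already learned for state 1, action 2

# One transition: from state 0, action 2 leads to state 1 with reward 0
q_update(Q_demo, s=0, a=2, r=0.0, new_s=1)
print(Q_demo[0, 2])  # roughly 0.8 * (0 + 0.95 * 1.0 - 0) = 0.76
```

Only the visited cell changes; every other entry stays at its previous value until its own state-action pair is tried.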

## Import tools

In [None]:
import gym
import numpy as np

def choose_Q_act(Q_table, state, policy, noise, noise_attenuation):
    """Apply `policy` to the Q-values of `state` after adding scaled exploration noise."""
    num_acts = len(Q_table[state,:])
    return policy(Q_table[state,:] + noise(1,num_acts)*noise_attenuation)
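
To illustrate how the noise scaling affects action choice, the sketch below calls the helper on a toy table with the `1/(epoch+1)` attenuation used in the training loop. The helper is repeated so the block runs on its own; the seeded generator, the `noise` wrapper, `Q_demo`, and `greedy_fraction` are assumptions made for this example.

```python
import numpy as np

def choose_Q_act(Q_table, state, policy, noise, noise_attenuation):
    # Local copy of the notebook's helper so this block is self-contained
    num_acts = len(Q_table[state,:])
    return policy(Q_table[state,:] + noise(1,num_acts)*noise_attenuation)

rng = np.random.default_rng(0)  # seeded generator (assumption, for repeatability)
noise = lambda *shape: rng.standard_normal(shape)  # same call shape as np.random.randn

Q_demo = np.zeros((16, 4))
Q_demo[0, 1] = 0.1  # hypothetical value: action 1 is the greedy choice in state 0

def greedy_fraction(epoch, trials=1000):
    """Fraction of noisy choices in state 0 that match the greedy action."""
    picks = [choose_Q_act(Q_demo, 0, np.argmax, noise, 1./(epoch+1))
             for _ in range(trials)]
    return np.mean(np.array(picks) == 1)

for epoch in (0, 10, 1000):
    print(epoch, greedy_fraction(epoch))
```

Early epochs add noise large enough to override small Q-value differences, so the agent explores; as the epoch index grows, the noise shrinks and the choice approaches a plain `np.argmax`.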

## Interact with Frozen Lake
TODO: make an interactive widget-based app with four buttons and a square 4x4 display

Features:
- Each step should leave a line trace, older traces being fainter than the new ones

In [None]:
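i_env = gym.make('FrozenLake-v0')
_ = i_env.reset()
while True:
    x = input('Enter to take a random step, q to quit: ')
    if x == 'q':
        break
    # Take a uniformly random action and show the resulting grid
    _, _, done, _ = i_env.step(i_env.action_space.sample())
    i_env.render()
    if done:
        # The episode ended (goal or hole), so start a new one
        _ = i_env.reset()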

## Visualize Q-learning
Based on the app above, create a visual, interactive Q-learner. The user can click through the simulated learning algorithm step by step and see how the initially empty Q-table gets updated. Each step should update the environment display and several indicators (e.g. whether the step was successful).

To visualize Q-learning, plot the current Q-table together with the update that results from the latest action and its outcome. With the names used in the training loop, that update is:

```python
lr*(r + gamma*np.max(Q[new_s,:]) - Q[s,a])
```
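
Until the widget exists, a text rendering can stand in for the plot. The sketch below prints the greedy action of each state on the 4x4 grid, assuming states are numbered row by row and using the L, D, R, U ordering of the `actions` list defined in this notebook. The function `greedy_grid` and the demo values are hypothetical.

```python
import numpy as np

ACTIONS = ['L', 'D', 'R', 'U']  # left, down, right, up

def greedy_grid(Q, side=4):
    """Return the greedy action per state as text rows; '.' marks untouched states."""
    rows = []
    for r in range(side):
        cells = []
        for c in range(side):
            q_row = Q[r * side + c]
            # Show a dot while the row is still all zeros (nothing learned yet)
            cells.append(ACTIONS[int(np.argmax(q_row))] if np.any(q_row != 0) else '.')
        rows.append(' '.join(cells))
    return rows

Q_demo = np.zeros((16, 4))
Q_demo[0, 2] = 0.5  # hypothetical: state 0 prefers R
Q_demo[1, 1] = 0.3  # hypothetical: state 1 prefers D
print('\n'.join(greedy_grid(Q_demo)))
```

Calling it on successive snapshots of the table shows how the learned policy spreads outward from states near the goal.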

## Initialize the environment

In [None]:
env = gym.make('FrozenLake-v0')
ospace = env.observation_space
aspace = env.action_space
actions = ['L', 'D', 'R', 'U'] # left, down, right, up

# Initialize table with all zeros
Q = np.zeros([ospace.n, aspace.n])

# Set learning parameters
lr = .8
gamma = .95
num_epochs = 1000
num_steps = 99

# Create empty lists to contain total rewards (returns) and steps per episode
data_steps = []
data_returns = []
data_acts = []
data_Q = [Q.copy()] # store a copy, since Q itself is updated in place

In [None]:
for i in range(num_epochs):
    
    # Begin epoch
    s = env.reset() # reset env and get initial state
    R = 0 # return (total reward) starts at 0
    d = False # episode has not terminated yet
    j = 0 # no learning steps committed
    
    # The Q-Table learning algorithm
    for j in range(num_steps):
        
        # Choose an action by greedily (with noise) picking from Q table (see choose_Q_act above)
        a = choose_Q_act(
            Q_table = Q,
            policy = np.argmax,
            state = s,
            noise = np.random.randn,
            noise_attenuation = 1./(i+1)
        )
        
        # Act and get new state and reward from environment
        new_s,r,d,_ = env.step(a)
        
        # Update Q-Table with new knowledge and transition to new state
        Q[s,a] = Q[s,a] + lr*(r + gamma*np.max(Q[new_s,:]) - Q[s,a])
        s = new_s
        
        # Accumulate return
        R += r
        
        # Store data
        data_acts.append(a)
        if d:
            data_steps.append(j)
            break
            
    data_returns.append(R)
    data_Q.append(Q.copy()) # snapshot; appending Q itself would alias one array
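
To read the collected returns, one option is to average them over blocks of episodes. In FrozenLake the only reward is 1 for reaching the goal, so the mean return of a block equals its success rate. The sketch below assumes a hypothetical helper `windowed_mean` and uses placeholder values, not measured results.

```python
import numpy as np

def windowed_mean(returns, window=100):
    """Average returns over consecutive blocks of `window` episodes (remainder dropped)."""
    returns = np.asarray(returns, dtype=float)
    n = len(returns) // window
    return returns[:n * window].reshape(n, window).mean(axis=1)

demo_returns = [0, 0, 0, 1, 0, 1, 1, 1]  # placeholder values, not training results
print(windowed_mean(demo_returns, window=4))  # [0.25 0.75]
```

After training, `windowed_mean(data_returns)` would give one success rate per 100 epochs, which makes any learning trend easier to see than the raw 0/1 sequence.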

In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import ipywidgets as widgets

# slide = widgets.IntSlider(min=0, max=len(data_Q)-1, description='epoch')

# im = plt.imshow(data_Q[0])

# def plot_Q(i, im, data):
#     im.set_data(data[i])
#     plt.gcf().canvas.draw_idle()

# widgets.interact(plot_Q, i=slide, im=widgets.fixed(im), data=widgets.fixed(data_Q))