### Multi-armed Bandits

At each timestep, choose one of $K$ actions and receive a numerical reward drawn from a stationary probability distribution that depends only on the selected action and is independent of previous actions.

In the K-armed Bandit problem, each action $a$ has an expected reward $q_*(a)$. This is called the value of that action.

$q_*(a) \doteq \mathbb E[R_t \mid A_t=a]$

If we knew the value of each action, we could solve the problem by always selecting the highest-valued action. In practice we only have an estimate of each action value at time $t$, written $Q_t(a)$.
The objective is to get $Q_t(a)$ as close to $q_*(a)$ as possible.
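
To make the setup concrete, here is a minimal sketch of a K-armed bandit. The class name `GaussianBandit`, the unit-variance Gaussian rewards, and the seed are all assumptions chosen for illustration; the problem definition above only requires that each arm has a stationary reward distribution.

```python
import numpy as np

class GaussianBandit:
    """Hypothetical K-armed bandit: arm a pays a reward drawn from N(q_star[a], 1)."""

    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a); fixed for the whole run (stationary)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action):
        # The reward depends only on the chosen action, not on earlier actions
        return self.rng.normal(self.q_star[action], 1.0)

bandit = GaussianBandit(k=5, seed=0)
# Averaging many pulls of one arm approaches that arm's value q_*(a)
samples = [bandit.pull(2) for _ in range(10_000)]
print(bandit.q_star[2], np.mean(samples))
```

The two printed numbers should be close to each other, which is the sense in which $q_*(a)$ is the expected reward of arm $a$.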

### Action-value Methods

One way to estimate $q_*(a)$ is the sample average $Q_t(a)$: the mean of the rewards received on the timesteps before $t$ at which $a$ was selected.

$Q_t(a) \doteq \frac{\sum^{t-1}_{i=1}R_i \cdot \mathbb 1_{A_i=a}}{\sum^{t-1}_{i=1}\mathbb 1_{A_i=a}}$

The simplest way to select an action (that is, the simplest policy) is to be 'greedy' and pick the action with the highest current estimate:

$A_t \doteq \underset{a}{\arg\max}\ Q_t(a)$
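
The sample-average estimate and the greedy choice can be sketched as follows. The function names and the default of 0 for actions that have never been tried are assumptions for illustration; the formula above leaves that case undefined.

```python
import numpy as np

def sample_average(actions, rewards, k):
    """Q_t(a): mean reward over the steps on which action a was taken."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    Q = np.zeros(k)  # assumption: untried actions default to 0
    for a in range(k):
        taken = actions == a          # indicator 1_{A_i = a}
        if taken.any():
            Q[a] = rewards[taken].mean()
    return Q

def greedy_action(Q):
    """A_t = argmax_a Q_t(a); np.argmax returns the first maximiser on ties."""
    return int(np.argmax(Q))

# A short history over three arms
actions = [0, 1, 1, 2, 0]
rewards = [1.0, 0.0, 2.0, 0.5, 3.0]
Q = sample_average(actions, rewards, k=3)
print(Q, greedy_action(Q))
```

Here arm 0 averages (1 + 3) / 2 = 2, arm 1 averages (0 + 2) / 2 = 1, and arm 2 averages 0.5, so the greedy action is arm 0.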


### Optimal Value Function

$V^{*}_{k}(s) \leftarrow \underset{a}{\max} \underset{s'}{\sum} P(s'|s,a)(R(s,a,s') + \gamma V^{*}_{k-1}(s'))$

Here $k$ is the number of timesteps remaining until the horizon.
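
As a worked example of this update, the sketch below applies it twice to a tiny MDP. The two-state MDP, the function name `bellman_backup`, and $\gamma = 0.5$ are made up for illustration; the dynamics are stored in the `P[state][action]` format of `(probability, next_state, reward, done)` tuples that the code later in this notebook reads from `env.P`.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P[s][a] lists (probability, next_state, reward, done)
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
}

def bellman_backup(V_prev, P, gamma):
    """One application of the update: computes V_k from V_{k-1}."""
    V = np.zeros(len(P))
    for s in P:
        # Max over actions of the expected one-step return
        V[s] = max(
            sum(p * (r + gamma * V_prev[s2]) for p, s2, r, _ in P[s][a])
            for a in P[s]
        )
    return V

V = np.zeros(2)            # V_0: nothing left to collect at the horizon
for k in range(1, 3):
    V = bellman_backup(V, P, gamma=0.5)
    print(k, V)
```

With one step remaining the values are $[1, 2]$ (take the best immediate reward). With two steps remaining they are $[2, 3]$: from state 0, moving to state 1 earns $1 + 0.5 \cdot 2 = 2$.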

## Policy Evaluation

$V^{\pi}_{k}(s) \leftarrow \underset{a}{\sum}\pi(a|s) \underset{s'}{\sum} P(s'|s,a)(R(s,a,s')+ \gamma V^{\pi}_{k-1}(s'))$

For a deterministic policy, the outer sum reduces to the single action $a = \pi(s)$.

## Policy Improvement

$\pi_{k+1}(s) \leftarrow \underset{a}{\arg\max}\ q^{\pi}(s,a)$

where $q^{\pi}(s,a) = \underset{s'}{\sum} P(s'|s,a)(R(s,a,s')+ \gamma V^{\pi}(s'))$

## Policy Iteration

1. Initialize the policy $\pi$.
2. Evaluate: for all $s$ in $\mathbb S$, update $V^{\pi}_{k+1}(s)$ from $V^{\pi}_{k}$ using policy evaluation, until $\max_s |V^{\pi}_{k+1}(s) - V^{\pi}_{k}(s)| <$ tolerance.
3. Improve: perform policy improvement using $V^{\pi}_{k}(s)$ to obtain $\pi_{k+1}$. While $\pi_{k+1}(s) \neq \pi_{k}(s)$ for some $s$, repeat from step 2.

## Value Iteration

1. Initialize $V(s)$ for all $s$ in $\mathbb S$.
2. Until $\max_s |V_{k+1}(s) - V_{k}(s)| <$ tolerance: for all $s$ in $\mathbb S$, update

$V_{k+1}(s) = \underset{a}{\max} \underset{s'}{\sum} P(s'|s,a)(R(s,a,s')+ \gamma V_{k}(s'))$
3. Perform policy improvement using the final $V_{k}(s)$ so that $\pi \approx \pi^{*}$.

In [331]:
import gym
import numpy as np

In [375]:
def converged(V,V_new, tol=10e-3):
    """
    Checks whether first two arguments have converged to within the third argument, tolerance.
    Arguments:
        V : Numpy float array
        V_new: Numpy float array
        tol: float
    Returns:
        bool True or False
    """
    return bool(np.all(np.abs(V - V_new) < tol))

In [245]:
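def evaluate_policy(pi, P, nS, nA, gamma=0.9, max_iter=1000, tol=10e-3):
    """
    Iteratively computes the state values V of a deterministic policy pi.
    P[state][action] is a list of (probability, next_state, reward, done) tuples.
    """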
    V = np.zeros(nS)
    V_new = V.copy()
    for i in range(max_iter):
        V = V_new.copy()
        V_new = np.zeros(nS)
        for state in range(nS):
            for probability, next_state, reward, done in P[state][pi[state]]: 
                V_new[state] += probability*(reward + gamma*V[next_state])
        if converged(V,V_new,tol):
            break
    return V_new

In [246]:
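def improve_policy(pi, P, V, nS, nA, gamma=0.9):
    """
    Returns the policy that is greedy with respect to V (one-step lookahead).
    The current policy pi is not used in the computation.
    """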
    pi_new = np.zeros(nS, dtype='int')
    for state in range(nS):
        B = np.zeros(nA)
        q = -np.inf  # best action value seen so far; any real value beats it
        for action in range(nA):
            for probability, next_state, reward, done in P[state][action]: 
                B[action] += probability*(reward + gamma*V[next_state])
            if B[action]>q:
                q = B[action]
                pi_new[state] = action
            elif B[action] == q:
                # Break ties at random (assignment; the original `==` was a no-op comparison)
                if np.random.random() < 0.5:
                    pi_new[state] = action
    return pi_new

In [313]:
def value_iteration(env, max_iter=1000, gamma=0.9, tol=10e-3):  
    #nS number of States in env
    nS = env.nS
    #nA number of actions in env
    nA = env.nA
    # P is the dynamics model: P[state][action] is a list of (probability, next_state, reward, done) tuples
    P = env.P
    
    V = np.zeros(nS,dtype=float)
    pi = np.zeros(nS,dtype=int)
    
    for i in range(max_iter):
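        V_new = np.zeros(nS, dtype=float)
        for state in range(nS):
            # One-step lookahead value of every action from this state
            q_values = np.zeros(nA)
            for action in range(nA):
                for probability, next_state, reward, done in P[state][action]:
                    q_values[action] += probability*(reward + gamma*V[next_state])
            # Max over actions; unlike comparing against 0, this also handles negative values
            V_new[state] = q_values.max()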
        if converged(V,V_new,tol):
            break
        V = V_new.copy()
    pi_new = improve_policy(pi,P,V_new,nS,nA,gamma=gamma)
    
    return V_new, pi_new

In [319]:
def policy_iteration(env, max_iter=20, gamma=0.9, tol=10e-3):
    #nS number of States in env
    nS = env.nS
    #nA number of actions in env
    nA = env.nA
    # P is the dynamics model: P[state][action] is a list of (probability, next_state, reward, done) tuples
    P = env.P
    
    V = np.zeros(nS,dtype=float)
    # Start from a uniformly random deterministic policy
    pi = np.random.randint(nA, size=nS)
    
    for i in range(max_iter):
        V_new = evaluate_policy(pi,P,nS,nA,gamma=gamma,tol=tol)
        pi_new = improve_policy(pi,P,V_new,nS,nA,gamma=gamma)
        
        if converged(V,V_new,tol):
            break
        
        V = V_new.copy()
        pi=pi_new.copy()
    
    return V_new, pi_new

In [341]:
env = gym.make("FrozenLake8x8-v0")

np.set_printoptions(precision=3)

state = env.reset()

In [369]:
V_PI, pi_PI = policy_iteration(env,300,0.99)
V_VI, pi_VI = value_iteration(env,20000,0.99)

## Inspection

In [386]:
#V_VI.reshape(8,8)

In [385]:
#V_PI.reshape(8,8)

In [384]:
actions = ['left','down','right','up']
policy = np.array([actions[int(i)] for i in pi_PI])
#policy.reshape(8,8)

In [345]:
#env.render()

## Test

In [380]:
def play(env, policy, max_episodes=10):
    # Run max_episodes episodes, printing and rendering every step
    for episode in range(max_episodes):
        done = False
        state = env.reset()
        while not done:
            state,reward,done,info = env.step(policy[state])
            print(state,reward,done,info)
            env.render()

In [382]:
#play(env,pi_VI)
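
Watching rendered episodes is one way to test a policy. A numeric summary can also help, so the sketch below estimates the fraction of episodes that end with a positive reward. The `success_rate` helper and the `CorridorEnv` stand-in are hypothetical and not part of gym. The stand-in only mimics the interface used by `play` above (`reset()` returning a state, `step()` returning a 4-tuple) so that the example runs without gym installed.

```python
class CorridorEnv:
    """Hypothetical stand-in for a gym env: states 0..3 in a row, goal at state 3."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right; any other action moves left (clipped at 0)
        if action == 1:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == 3
        return self.state, float(done), done, {}

def success_rate(env, policy, episodes=100, max_steps=200):
    """Fraction of episodes that end with a positive reward."""
    wins = 0
    for _ in range(episodes):
        state = env.reset()
        # Cap the episode length in case the policy never reaches a terminal state
        for _ in range(max_steps):
            state, reward, done, info = env.step(policy[state])
            if done:
                wins += reward > 0
                break
    return wins / episodes

print(success_rate(CorridorEnv(), policy=[1, 1, 1, 1], episodes=10))
```

Under the same interface assumption, a call such as `success_rate(env, pi_VI)` would apply the helper to the FrozenLake environment and policies computed above. Because that environment is stochastic, the estimate varies from run to run.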