### The Taxi Problem

There are four designated locations in the grid world, marked `R(ed)`, `G(reen)`, `Y(ellow)`, and `B(lue)`. The task is to pick up the passenger at one location and drop them off at another. The agent receives `+20` points for a successful drop-off and loses `1` point for every timestep it takes. Illegal pick-up and drop-off actions incur a `10` point penalty.

When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.
    
#### Observations
There are `500` discrete states: `25` taxi positions × `5` possible passenger locations (the four designated locations, plus the case when the passenger is in the taxi) × `4` destination locations.

State encoding: `(taxi_row, taxi_col, passenger_location, destination)`
###### Passenger locations
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
 - 4: in taxi
###### Destinations
 - 0: R(ed)
 - 1: G(reen)
 - 2: Y(ellow)
 - 3: B(lue)
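
As an illustration of how the four components above can map to one of the `500` state indices, here is a minimal sketch of a mixed-radix encoding. The `5 x 5` grid shape, the packing order, and the helper names `encode` and `decode` are assumptions made for this example; they follow the tuple order given above rather than any confirmed library internals.

```python
# Sketch only: grid shape and packing order are assumptions for illustration.
N_ROWS, N_COLS = 5, 5      # assumed layout of the 25 taxi positions
N_PASS_LOCS = 5            # R, G, Y, B, or in taxi
N_DESTS = 4                # R, G, Y, B


def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASS_LOCS + passenger_location
    index = index * N_DESTS + destination
    return index


def decode(index):
    """Invert encode(): recover the four components from the integer."""
    index, destination = divmod(index, N_DESTS)
    index, passenger_location = divmod(index, N_PASS_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_location, destination


n_states = N_ROWS * N_COLS * N_PASS_LOCS * N_DESTS  # 25 * 5 * 4

# Taxi at row 3, column 1; passenger in the taxi (4); destination R (0).
state = encode(3, 1, 4, 0)
print(n_states, state, decode(state))
```

Because every component has a fixed range, the mapping is one-to-one, so a tabular agent can use the integer directly as a dictionary key.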

#### Actions
There are 6 discrete deterministic actions:
 - 0: move south
 - 1: move north
 - 2: move east
 - 3: move west
 - 4: pickup passenger
 - 5: drop off passenger

#### Rewards
There is a default per-step reward of `-1`, except for delivering the passenger, which is `+20`, or executing `pickup` and `drop-off` actions illegally, which is `-10`.
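
To make the reward arithmetic concrete, the sketch below computes the undiscounted return of a whole episode from the three reward values stated above. The helper name `episode_return` and its parameters are hypothetical and exist only for this example.

```python
# Reward values as stated in the text above.
STEP_REWARD = -1
DROPOFF_REWARD = 20
ILLEGAL_REWARD = -10


def episode_return(n_steps, n_illegal=0, delivered=True):
    """Undiscounted return of an episode with n_steps actions.

    n_illegal counts illegal pick-up/drop-off actions; delivered says
    whether the final action was a successful drop-off.
    """
    n_terminal = 1 if delivered else 0
    n_plain = n_steps - n_illegal - n_terminal
    if n_plain < 0:
        raise ValueError("n_steps is too small for the given counts")
    return (n_plain * STEP_REWARD
            + n_illegal * ILLEGAL_REWARD
            + n_terminal * DROPOFF_REWARD)


# 12 actions, the last one a successful drop-off: 11 * (-1) + 20
print(episode_return(12))
# Same episode, but one of the 12 actions was an illegal pick-up
print(episode_return(12, n_illegal=1))
```

A single illegal action costs nine points more than an ordinary move, which is why a learned policy should avoid them almost entirely.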

#### Rendering
 - blue: passenger
 - magenta: destination
 - yellow: empty taxi
 - green: full taxi
 - letters (R, G, Y and B): possible pick-up and destination locations for the passenger

In [None]:
import gym
import sys
import math
import time
import numpy as np
from collections import defaultdict, deque
from IPython.display import clear_output

In [None]:
def interact(env, agent, num_episodes=20000, window=100):
    '''
    Monitor agent's performance.
    
    Params
    ======
    - env: instance of OpenAI Gym's Taxi-v3 environment
    - agent: instance of class Agent (defined below)
    - num_episodes: number of episodes of agent-environment interaction
    - window: number of episodes to consider when calculating average rewards

    Returns
    =======
    - avg_rewards: deque containing average rewards
    - best_avg_reward: largest value in the avg_rewards deque
    '''
    # initialize average rewards
    avg_rewards = deque(maxlen=num_episodes)
    # initialize best average reward
    best_avg_reward = -math.inf
    # initialize monitor for most recent rewards
    samp_rewards = deque(maxlen=window)
    # for each episode
    for i_episode in range(1, num_episodes+1):
        # begin the episode
        state = env.reset()
        # initialize the sampled reward
        samp_reward = 0
        while True:
            # agent selects an action
            action = agent.select_action(state)
            # agent performs the selected action
            next_state, reward, done, _ = env.step(action)
            # agent performs internal updates based on sampled experience
            agent.step(state, action, reward, next_state, done)
            # update the sampled reward
            samp_reward += reward
            # update the state (s <- s') to next time step
            state = next_state
            if done:
                # save final sampled reward
                samp_rewards.append(samp_reward)
                break
        if i_episode >= window:
            # get average reward over the last `window` episodes
            avg_reward = np.mean(samp_rewards)
            # append to deque
            avg_rewards.append(avg_reward)
            # update best average reward
            if avg_reward > best_avg_reward:
                best_avg_reward = avg_reward
        # monitor progress
        print("\rEpisode {}/{} || Best average reward {}".format(i_episode, num_episodes, best_avg_reward), end="")
        sys.stdout.flush()
        # check if task is solved (9.7 is the OpenAI Gym threshold this code was written against; Taxi-v3 may use a different value)
        if best_avg_reward >= 9.7:
            print('\nEnvironment solved in {} episodes.'.format(i_episode), end="")
            break
        if i_episode == num_episodes:
            print('\n')
        time.sleep(0.001) # 1ms
    return avg_rewards, best_avg_reward
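
The monitoring logic in `interact` relies on a bounded `deque` to keep only the most recent episode returns. The sketch below isolates that sliding-window average on a short list of made-up returns; the function name `rolling_best_average` and the sample numbers are illustrative assumptions, not part of the notebook.

```python
from collections import deque


def rolling_best_average(episode_returns, window=3):
    """Return the list of windowed averages and the best one seen."""
    recent = deque(maxlen=window)   # oldest return is dropped automatically
    averages = []
    best = float("-inf")
    for i_episode, ret in enumerate(episode_returns, start=1):
        recent.append(ret)
        # only start averaging once the window is full
        if i_episode >= window:
            avg = sum(recent) / len(recent)
            averages.append(avg)
            best = max(best, avg)
    return averages, best


# Made-up returns that improve over time.
averages, best = rolling_best_average([-200, -50, 4, 8, 7], window=3)
print(averages, best)
```

Waiting until the window is full avoids reporting an average dominated by one or two early, very poor episodes.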

In [None]:
class Agent:
    def __init__(self, nA=6):
        """ Initialize agent.

        Params
        ======
        - nA: number of actions available to the agent
        """
        self.nA = nA
        self.Q = defaultdict(lambda: np.zeros(self.nA))
        self.lr = 0.2
        self.gamma = 1.0
        self.epsilon = 0.0005

    def select_action(self, state):
        """ Given the state, select an action.

        Params
        ======
        - state: the current state of the environment

        Returns
        =======
        - action: an integer, compatible with the task's action space
        """
        policy = [self.epsilon / self.nA] * self.nA
        policy[np.argmax(self.Q[state])] += 1 - self.epsilon
        return np.random.choice(np.arange(self.nA), p=policy)

    def step(self, state, action, reward, next_state, done):
        """ Update the agent's knowledge, using the most recently sampled tuple.

        Params
        ======
        - state: the previous state of the environment
        - action: the agent's previous choice of action
        - reward: last reward received
        - next_state: the current state of the environment
        - done: whether the episode is complete (True or False)
        """
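        if done:
            # terminal transition: no bootstrapping, the target is the reward
            # actually received (an episode can also end without a delivery)
            self.Q[state][action] += self.lr * (reward - self.Q[state][action])
            return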
        
        # SARSA
        # self.Q[state][action] += self.lr * (reward + self.gamma * self.Q[next_state][self.select_action(next_state)] - self.Q[state][action])
        
        # Expected-SARSA
        # policy = [self.epsilon / self.nA] * self.nA
        # policy[np.argmax(self.Q[next_state])] += 1 - self.epsilon
        # self.Q[state][action] += self.lr * (reward + self.gamma * np.dot(policy, self.Q[next_state]) - self.Q[state][action])
        
        # Q Learning
        self.Q[state][action] += self.lr * (reward + self.gamma * np.max(self.Q[next_state]) - self.Q[state][action])
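
The three update rules in `Agent.step` differ only in how they estimate the value of the next state. As a rough illustration, the sketch below computes the three TD targets side by side for one made-up row of action values. The helper names `epsilon_greedy_probs` and `td_targets`, the sample Q-values, and the large `epsilon` are assumptions chosen to make the differences visible.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    probs = [epsilon / n] * n
    greedy = max(range(n), key=lambda a: q_values[a])
    probs[greedy] += 1.0 - epsilon
    return probs


def td_targets(reward, q_next, gamma, epsilon, sampled_next_action):
    """TD targets for SARSA, Expected SARSA and Q-learning."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    expected_value = sum(p * q for p, q in zip(probs, q_next))
    return {
        # SARSA: value of the action actually sampled in the next state
        "sarsa": reward + gamma * q_next[sampled_next_action],
        # Expected SARSA: policy-weighted average over next actions
        "expected_sarsa": reward + gamma * expected_value,
        # Q-learning: value of the greedy next action
        "q_learning": reward + gamma * max(q_next),
    }


# Made-up action values for the next state (6 actions, as in Taxi).
q_next = [1.0, 3.0, 2.0, 0.0, -1.0, -1.0]
targets = td_targets(reward=-1, q_next=q_next, gamma=1.0,
                     epsilon=0.6, sampled_next_action=2)
print(targets)
```

With the very small `epsilon` used by the agent above, the policy is almost greedy, so the Expected SARSA and Q-learning targets become nearly identical.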

In [None]:
env = gym.make('Taxi-v3')
agent = Agent()

In [None]:
avg_rewards, best_avg_reward = interact(env, agent)

In [None]:
state = env.reset()
clear_output(wait=True)
print(env.render(mode='ansi'))
while True:
    action = agent.select_action(state)
    state, _, done, _ = env.step(action)
    time.sleep(1)
    clear_output(wait=True)
    print(env.render(mode='ansi'))
    if done:
        break

In [None]:
env.close()

---

Next: [RL in Continuous Spaces](./RL%20Continuous%20Spaces.ipynb)