<a href="https://colab.research.google.com/github/farhanhubble/discover-drl/blob/master/Rediscovering_RL_Notebook_0_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement learning with Foolsball
- Reinforcement learning is learning to make decisions from experience.
- Games are a good testbed where agents can interact with an environment and explore it.
 

# About Foolsball
- 5x4 playground that provides a football/foosball-like environment.
- An agent or actor:
  - always spawned in the top-left corner
  - displayed as '⚽'
  - can move North, South, East or West.
  - can be controlled algorithmically
- A number of **static** opponents, each represented by 👕, that occupy certain locations on the field.
- A goalpost 🥅 that is fixed in the bottom right corner

## Primary goal
- We want the agent to learn to reach the goalpost 

## Secondary goals
- We may want the agent to learn to be efficient in some sense, for example, to take the shortest path to the goalpost. **More precisely, we want an algorithm to learn to control the agent and steer it towards the goalpost.**

In [None]:
agent = '⚽'
opponent = '👕'
goal = '🥅'

arena = [['⚽', ' ' , '👕', ' ' ],
         [' ' , ' ' , ' ' , '👕'],
         [' ' , '👕', ' ' , ' ' ],
         [' ' , ' ' , ' ' , '👕'],
         [' ' , '👕', ' ' , '🥅']]

# Implementing an environment for the game of Foolsball
- OpenAI Gym has many [text environments](https://github.com/openai/gym/tree/master/gym/envs/toy_text)
- Text environments are simple to render in a notebook and super-fast to experiment with.
- We want to build our own environment for two reasons:
  - It's a great exercise in understanding the finer details, like states, actions, rewards and returns.
  - Some of the experimentation we do requires looking under the hood of the environment, which is easier with our own implementation than with OpenAI Gym.
- OpenAI Gym has a simple `step()`, `reset()` API that we also implement, so porting our implementation over to Gym should be easy (and fun)!



# Understanding the first bits of terminology.
## State 
- In RL, state refers to information about the environment and the agent.
- An RL algorithm inspects the state to decide which action to take.
- Exactly what information gets captured in `state` depends on a few factors:
  - The complexity of the environment: 
    - The number of actors, 
    - the nature of the environment, for example text or images. 
  - The complexity of the algorithm
    - A simple algorithm may only need information about the agent and its immediate surroundings.
    - A more complex algorithm may need information about the whole environment.


## Setup
- In our case we want the algorithm to only know about the location of the agent on the field. 
- We could have included information about the opponents too which would perhaps aid in the decision making but we chose not to.  

- The state therefore is a tuple: (row, col), representing the location of the agent. 
- There are 20 possible values that `state` can take on:
  - `row` can range from 0 through 4
  - `col` can range from 0 through 3

## Implementation details
- The state is actually stored as a single integer that can take on values between 0 and 19.
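
One common way to pack `(row, col)` into a single integer is row-major order, which is also what `pretty_print_policy()` later in this notebook assumes (`state = row * n_cols + col`). The sketch below uses hypothetical standalone helpers, `to_state` and `to_indices`, for the 5x4 field; they only illustrate the idea and are not the class methods themselves.

```python
N_ROWS, N_COLS = 5, 4  # field size used in this notebook

def to_state(row, col, n_cols=N_COLS):
    """Row-major encoding: cells are numbered left to right, top to bottom."""
    return row * n_cols + col

def to_indices(state, n_cols=N_COLS):
    """Inverse of to_state: recover (row, col) from the integer state."""
    return divmod(state, n_cols)

print(to_state(0, 0))   # top-left corner, where the agent spawns
print(to_state(4, 3))   # bottom-right corner, where the goalpost sits
print(to_indices(19))   # back to (row, col)
```

With this encoding the two corners map to the smallest and largest state values, 0 and 19.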

## Actions
Agents can perform actions in an environment.

## Setup
- Our agent can perform one kind of action: navigate up, down, right or left.
- It has 4 actions: 'n', 'e', 'w', 's'.
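
A minimal sketch of how the four actions can translate into moves on the grid. The deltas match the ones used in the environment skeleton further down; the standalone `move` helper and its bounds check are assumptions made for illustration.

```python
# Row/column change for each action; north decreases the row index.
ACTION_DELTAS = {'n': (-1, 0), 'e': (0, 1), 'w': (0, -1), 's': (1, 0)}

def move(row, col, action, n_rows=5, n_cols=4):
    """Return the new (row, col), or the old one if the move leaves the field."""
    d_row, d_col = ACTION_DELTAS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < n_rows and 0 <= new_col < n_cols:
        return new_row, new_col
    return row, col  # illegal move: the agent stays where it is

print(move(0, 0, 'e'))  # a valid move to the east
print(move(0, 0, 'n'))  # would leave the field, so the position is unchanged
```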

# Learning from experience
Any RL set up can be modeled as shown below:

![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTMDmrmnl_dAyjCOErHPak2gLXmQTgQnVT8gQ&usqp=CAU)

- The agent performs an action in the environment
- The state of the environment and agent change as a result
- The agent receives a reward and the updated state from the environment
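
The loop described above can be sketched with a toy environment. `CorridorEnv` is a hypothetical 1-D stand-in (not Foolsball) with made-up rewards; only the `reset()`/`step()` shape mirrors the API we build in this notebook.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..4 in a row, goal at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply 'w' (left) or 'e' (right); return (next_state, reward, done)."""
        delta = {'w': -1, 'e': 1}[action]
        self.state = min(max(self.state + delta, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 5 if done else -1  # made-up reward values
        return self.state, reward, done

def run_episode(env, choose_action, max_steps=100):
    """Run the act -> (state, reward) loop until the episode ends."""
    state = env.reset()
    history = []
    for _ in range(max_steps):
        action = choose_action(state)                 # the agent acts
        next_state, reward, done = env.step(action)   # the environment responds
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history

random.seed(0)
history = run_episode(CorridorEnv(), lambda s: random.choice(['w', 'e']))
print(len(history), history[-1])
```

The `choose_action` callback is where an RL algorithm would eventually plug in; here it is just a random choice.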

## Rewards
- Reward is the signal that an agent receives after it performs an action.
- The reward structure has to be decided by us. 
- A major challenge in RL is that rewards are often sparse: the agent may need many actions before it receives an informative signal.

## Set up
- In our case the reward depends on the rules of the game and the goal.
  - If the agent runs into an opponent, the game ends and the reward is negative (this penalizes the agent).
  - If the agent makes it to the goalpost, the game ends and the reward is positive.
  - If the agent takes the ball out of the field, the reward is negative.
  - If the agent makes a valid move, what should the reward be?

## Implementation
- The default reward structure in our case is  `{'unmarked':-1, 'opponent':-5, 'outside':-1, 'goal':+5}`
- This can be changed at any time.
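
One way to read this dictionary is as a lookup keyed by the kind of transition. The sketch below is a standalone illustration: `reward_for` is a hypothetical helper, and treating a move that leaves the state unchanged as 'outside' is an assumption based on the rules listed above.

```python
DEFAULT_REWARDS = {'unmarked': -1, 'opponent': -5, 'outside': -1, 'goal': +5}

def reward_for(state, next_state, goal_state, opponent_states, rewards=DEFAULT_REWARDS):
    """Classify a transition and look up its reward."""
    if next_state == state:             # move rejected: the ball would leave the field
        return rewards['outside']
    if next_state == goal_state:
        return rewards['goal']
    if next_state in opponent_states:
        return rewards['opponent']
    return rewards['unmarked']          # safe, valid move

# States of the opponents and the goal in the arena above, in row-major numbering.
opponents, goal_state = [2, 7, 9, 15, 17], 19
print(reward_for(18, 19, goal_state, opponents))  # stepping onto the goal
print(reward_for(1, 2, goal_state, opponents))    # running into an opponent
```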

# Let's start!
---
# Step 1
The code below provides a skeleton for the **Foolsball** environment we want our agent to train in. Fill in the code marked with `#Todo` to create a working environment.

1. Go to the `__init__()` method and try to understand what it is doing
  1. Look at the `__deserialize__()` method and complete all todos.
2. Complete the `__to_state__()` and `__to_indices__()` methods.
3. Complete the `reset()` method.
4. Go to the `step()` method and understand its intended behavior.
  1. Complete `__get_next_state_on_action__()`
  2. Complete `__get_reward_for_transition__()`
  3. Complete the `step()` method.


5. Read through the `render()` function to understand how we display the environment in the different situations. 

6. Execute the cell below and make sure there are no errors.

In [None]:
import numpy as np

class Foolsball(object):

  def __to_state__(self,row,col):
    """Convert indices (row, col) to state (single integer)."""
    return #Todo

  def __to_indices__(self, state):
    """Convert state (single integer) to indices (row, col)."""
    row = #Todo
    col = #Todo
    return row,col

  def __deserialize__(self,map:list,agent:str,opponent:str, goal:str):
    """Convert a string representation of a map into a 2D numpy array.
    Param map: list of lists of strings representing the player, opponents and goal.
    Param agent: string representing the agent on the map 
    Param opponent: string representing every instance of an opponent player
    Param goal: string representing the location of the goal on the map
    """
    ## Capture dimensions and map.
    self.n_rows = #Todo
    self.n_cols = #Todo
    self.n_states = #Todo 
    self.map = #Todo: convert map to Numpy array

    ## Store string representations for printing the map, etc.
    self.agent_repr = #Todo
    self.opponent_repr  = #Todo
    self.goal_repr = #Todo

    ## Find initial state, the desired goal state and the state of the opponents. 
    self.init_state = None
    self.goal_state = None
    self.opponents_states = []

    for row in range(self.n_rows):
      for col in range(self.n_cols):

        if map[row][col] == agent:
          # Store the initial state outside the map.
          # This helps in quickly resetting the game to the initial state and
          # also simplifies printing the map independent of the agent's state. 
          self.init_state = #Todo
          self.map[row,col] = ' ' 
        
        elif map[row][col] == opponent:
          #Todo

        elif map[row][col] == goal:
          #Todo

    # The assertion message must be a string; print() returns None.
    assert self.init_state is not None, f"Map {map} does not specify an agent {agent} location"
    assert self.goal_state is not None, f"Map {map} does not specify a goal {goal} location"
    assert self.opponents_states, f"Map {map} does not specify any opponents {opponent} location"

    return self.init_state


  def __get_next_state_on_action__(self,state,action):
    """Return next state based on current state and action."""
    row, col = self.__to_indices__(state)
    action_to_index_delta = {'n':[-1,0], 'e':[0,+1], 'w':[0,-1], 's':[+1,0]}

    row_delta, col_delta = action_to_index_delta[action]
    new_row , new_col = row+row_delta, col+col_delta

    ## Return current state if next state is invalid
    if #Todo: add proper condition
      return state  

    ## Construct state from new row and col and return it.    
    return #Todo


  def __get_reward_for_transition__(self,state,next_state):
    """ Return the reward based on the transition from current state to next state. """
    ## Transition rejected due to illegal action (move)
    if next_state == state:
      reward = #Todo
    
    ## Goal!
    elif next_state == self.goal_state:
      reward = #Todo
    
    ## Ran into opponent. 
    elif next_state in self.opponents_states:
      reward = #Todo

    ## Made a safe and valid move.   
    else:
      reward = #Todo

    return reward


  def __is_terminal_state__(self, state):
    return (state == self.goal_state) or (state in self.opponents_states) 

  
  def __init__(self,map,agent,opponent,goal):
    """Spawn the world, create variables to track state and actions."""
    # We just need to track the location of the agent (the ball)
    # Everything else is static and so a potential algorithm doesn't 
    # have to look at it. The variable `done` flags terminal states.
    self.state = self.__deserialize__(map,agent,opponent,goal)
    self.done = False
    self.actions = ['n','e','w','s']

    # Set up the rewards
    self.default_rewards = {'unmarked':-1, 'opponent':-5, 'outside':-1, 'goal':+5}
    self.set_rewards(self.default_rewards)



  def reset(self):
    """Reset the environment to its initial state."""
    # There's really just two things we need to reset: the state, which should
    # be reset to the initial state, and the `done` flag which should be 
    # cleared to signal that we are not in a terminal state anymore, even if we 
    # were earlier. 
    self.state = #Todo: set to initial state 
    self.done  = #Clear the flag
    return self.state

  
  def set_rewards(self,rewards):
    """Replace the reward structure; every key of the default rewards is required."""
    if self.state != self.init_state:
      print('Warning: Setting reward while not in initial state! You may want to call reset() first.')
    for key in self.default_rewards:
      assert key in rewards, f'Key {key} missing from reward.'
    self.rewards = rewards

  
  def step(self,action):
    """Simulate state transition based on current state and action received."""
    assert not self.done, \
    f'You cannot call step() in a terminal state({self.state}). Check the "done" flag before calling step() to avoid this.'
    next_state = #Todo: Get next state for this (state, action) pair

    reward = #Todo: Get the reward for the state -> next_state transition.

    done = #Todo set the flag if we are in a terminal state

    self.state, self.done = next_state, done
    
    return next_state, reward, done


  def render(self):
    """Pretty-print the environment and agent."""
    ## Create a copy of the map and change data type to accommodate
    ## 3-character strings
    _map = np.array(self.map, dtype='<U3')

    ## Mark unoccupied positions with special symbol.
    ## And add extra spacing to align all columns.
    for row in range(_map.shape[0]):
      for col in range(_map.shape[1]):
        if _map[row,col] == ' ':
          _map[row,col] = ' + '
        
        elif _map[row,col] == self.opponent_repr: 
          _map[row,col] =  self.opponent_repr + ' '
        
        elif _map[row,col] == self.goal_repr:
          _map[row,col] = ' ' + self.goal_repr + ' '
      
    ## If current state overlaps with the goal state or one of the opponents'
    ## states, substitute a distinct marker.
    if self.state == self.goal_state:
      r,c = self.__to_indices__(self.state)
      _map[r,c] = ' 🏁 '
    elif self.state in self.opponents_states:
      r,c = self.__to_indices__(self.state)
      _map[r,c] = ' ❗ '
    else:
      r,c = self.__to_indices__(self.state)
      _map[r,c] = ' ' + self.agent_repr
    
    for row in range(_map.shape[0]):
      for col in range(_map.shape[1]):
        print(f' {_map[row,col]} ',end="")
      print('\n') 
    
    print()



# Step 2
Execute the two cells below and ensure that there are no runtime errors and that the rendering happens correctly. You should see output like this:

```
  ⚽   +   👕    +  

  +    +    +   👕  

  +   👕    +    +  

  +    +    +   👕  

  +   👕    +    🥅  

```

In [None]:
foolsball = Foolsball(arena, agent, opponent, goal)

In [None]:
foolsball.render()

# Step 3.
- Run the next cell to play with the environment and score a few goals. 
- If there are any errors you may want to go back and update the code for the `Foolsball` class. 
- Make sure to run the cell with `foolsball = Foolsball(arena, agent, opponent, goal)` if you update the class.

In [None]:
## Move: n,s,e,w
## Reset: r
## Exit: x
while True:
  try:
    act = input('>>')

    if act in foolsball.actions:
      print(foolsball.step(act))
      print()
      foolsball.render()
    elif act == 'r':
      print(foolsball.reset())
      print()
      foolsball.render()
    elif act == 'x':
      break
    else:
      print(f'Invalid input:{act}')
  except Exception as e:
    print(e)

# Step 4
Understand the concept of returns
- Complete the `get_return()` function.
- Calculate returns for a few sample paths by running the next few cells
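
As a quick illustration of what a return is, the sketch below sums hand-written reward lists. The lists are assumptions: they are the rewards a path would collect under the default structure if it made six (or eight) safe moves and then reached the goal. `undiscounted_return` is a hypothetical helper, separate from the `get_return()` exercise.

```python
def undiscounted_return(rewards):
    """The return of a path is the sum of the rewards collected along it."""
    return sum(rewards)

# Six safe moves (-1 each) followed by a goal (+5).
rewards_short = [-1] * 6 + [5]
# The same, but with a two-move detour before the goal.
rewards_detour = [-1] * 8 + [5]

print(undiscounted_return(rewards_short))   # prints -1
print(undiscounted_return(rewards_detour))  # prints -3
```

Because every safe move costs -1, the longer path ends up with the lower return.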

In [None]:
## Reward and return
path1 = ['e','s','e','s','s','s','e']
path2 = ['s','e','e','s','s','s','e']
path3 = ['s','s','s','e','e','s','e']
path4 = ['s','s','s','s','n','e','e','s','e']

In [None]:
def get_return(path):
  # Todo: Reset the game to its initial state
  # Todo: Render the starting state
  
  _return_ = 0
  for act in path: 
    # Todo: use the step() API to run the action.
    # Todo: accumulate reward in to return.
    # Todo: render the current state (purely for visual delight)
    
    if #Todo: condition to break out.
      break
    
  print(f'Return (accumulated reward): {_return_}')

In [None]:
get_return(path1)

In [None]:
get_return(path2)

In [None]:
get_return(path3)

In [None]:
get_return(path4)

# Step 5.
- Experiment with a different reward structure.
- Does it encourage the agent to take the shortest route?

In [None]:
## Different reward structure
foolsball.set_rewards({'unmarked':0, 'opponent':-5, 'outside':-1, 'goal':+5})

In [None]:
get_return(path1)

In [None]:
get_return(path4)

# Step 6
- Get introduced to discounted return as a means to set acceptable time horizons.
$$\text{Discounted Return} = R_{t_1} + \gamma R_{t_2} + \gamma^2 R_{t_3} + \dots + \gamma^{n-1} R_{t_n}$$
where $R_{t_k}$  is the reward after step `k` and $\gamma$ is called the discount factor. 
- Complete the code below to implement discounted returns.
- The discount factor $\gamma$ is a hyperparameter (why?) often set to 0.9 
😜
- Run the next few cells to see if discounting indeed has the effect we want (shorter paths)
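
The formula above can be sketched on a plain list of rewards. This is a standalone illustration with a hypothetical helper, `discounted_return`, and hand-written reward lists; it does not touch the environment.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards, weighting the k-th reward (k = 0, 1, ...) by gamma**k."""
    total = 0.0
    discount = 1.0
    for reward in rewards:
        total += discount * reward
        discount *= gamma  # becomes gamma, gamma**2, ... for later rewards
    return total

short = [-1] * 6 + [5]    # six safe moves, then a goal
detour = [-1] * 8 + [5]   # eight safe moves, then a goal
print(discounted_return(short), discounted_return(detour))
```

With `gamma=1` this reduces to the plain sum of rewards; with `gamma < 1` a reward counts for less the later it arrives.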

In [None]:
def get_discounted_return(path, gamma=0):
  foolsball.reset()
  foolsball.render()
  _return_ = 0
  discount_coeff = 1
  for act in path: 
    #Todo: execute one step
    #Todo: Update discounted reward
    #Todo: Update the discount multiplier pow(gamma,n-1)
    
    foolsball.render()
    if done:
      break
    
  print(f'Discounted return: {_return_}')

In [None]:
HYPER_PARAMS = {'gamma':0.9}

In [None]:
get_discounted_return(path1, HYPER_PARAMS['gamma'])

In [None]:
get_discounted_return(path4, HYPER_PARAMS['gamma'])

# Step 7
## Formalizing the problem:
- We want the agent to reach the goalpost AND attain the highest **discounted return**.
- This means making safe and efficient moves
  - Running into opponent means game over
  - Repeated 'outsides' means inefficiency
  - Long detours are also inefficient

## The Conundrum
- We already know how to compute the discounted return from a path.
- We can generate all possible paths and calculate their returns and pick a path with the highest return.

- Alas, there are too many paths: with 4 possible decisions at each step, the number of paths of length $n$ grows as $4^n$.


## The "Trick"
- Even though there are too many paths, the number of (state, action) pairs is small.
- We can calculate the return for each of the 80 (= 20 x 4) (state, action) pairs.
- To emphasize, we want to calculate the return for each (state, action) pair, not the reward.
  - Calculating the return means peeking into the future.
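
A back-of-the-envelope comparison of the two counts. The sketch counts action sequences, which is an upper bound on the number of paths since episodes that hit a terminal state end early; `n_action_sequences` is a hypothetical helper.

```python
N_STATES, N_ACTIONS = 20, 4

def n_action_sequences(length, n_actions=N_ACTIONS):
    """Number of distinct action sequences of a given length."""
    return n_actions ** length

n_pairs = N_STATES * N_ACTIONS  # size of a (state, action) table

for length in (7, 10, 20):
    print(length, n_action_sequences(length), n_pairs)
```

The table size stays fixed no matter how long the paths get, which is what makes the per-pair approach tractable.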


## Todo:
- As a precursor to calculating returns for every (state,action) pair let's try to calculate the reward for every (state,action) pair.

- Understand how the code in the next two cells creates a Pandas table to store the rewards for every (state, action) pair.

- We will cheat a little by using private methods of the `Foolsball` class
  - Use the `__get_next_state_on_action__()` and `__get_reward_for_transition__()` methods to complete the code in the third cell below
  - Run the fourth cell to view the rewards table. 
  - Notice that rewards for terminal states are kept undefined since no actions are allowed in those states.



In [None]:
import pandas as pd

In [None]:
REWARDS_TBL = pd.DataFrame.from_dict({s:{a:None for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')
REWARDS_TBL

In [None]:
for state in REWARDS_TBL.index:
  if not foolsball.__is_terminal_state__(state): #Only calculate rewards for non-terminal states
    for action in REWARDS_TBL.columns:
      next_state = #Todo get the next state for this state,action 
      REWARDS_TBL.loc[state, action] = #Todo get the reward for state -> new_state transition 

In [None]:
terminal_states = foolsball.opponents_states+[foolsball.goal_state]
print(terminal_states)
REWARDS_TBL

# Step 8
Create a returns table (no TODOs here)
- Run the next four cells and understand why we are setting the returns for terminal states to 0.
  - We leave the returns for all non-terminal states undefined.
  - Trying to fill up these entries will be the focus of the rest of the notebook.

- A function to create new instances of the returns table is also provided in the fifth cell below.

In [None]:
RETURNS_TBL = pd.DataFrame.from_dict({s:{a:None for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

In [None]:
RETURNS_TBL

In [None]:
RETURNS_TBL.loc[terminal_states]

In [None]:
RETURNS_TBL.loc[terminal_states] = 0
RETURNS_TBL

In [None]:
def make_returns_table(terminal_states):
  """Create an empty returns table."""
  table = pd.DataFrame.from_dict({s:{a:None for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')
  table.loc[terminal_states] = 0
  return table

# Step 9
## Try dynamic programming to fill up the returns table.
- Returns for a (state, action) are defined in terms of returns of the next state. 
  - $Return(state_t, action_t) = Reward(state_t, state_{t+1}) + \gamma \max_{a \in \{n, e, w, s\}} Return(state_{t+1}, a)$

  - This motivates the use of dynamic programming to fill up the returns table  
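
A single application of this recursive definition, written as a standalone sketch. It includes the discount factor, as the code in this step does; `backup` and the example numbers are hypothetical.

```python
GAMMA = 0.9

def backup(reward, next_state_returns, gamma=GAMMA):
    """Return of (state, action): immediate reward plus the discounted
    best return available from the next state."""
    return reward + gamma * max(next_state_returns.values())

# Hypothetical, already-known returns for the four actions in the next state.
next_state_returns = {'n': -2.0, 'e': 5.0, 'w': -1.0, 's': 0.5}
print(backup(-1, next_state_returns))

# If the next state is terminal, all of its returns are 0 and only the reward remains.
terminal_returns = {'n': 0, 'e': 0, 'w': 0, 's': 0}
print(backup(5, terminal_returns))
```

The catch is that `next_state_returns` must already be known, which is exactly what the recursive solutions below struggle with.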

## Todo:
- Read the code in the next cell and try to understand the first dynamic programming based solution. 
- Run the code in the next cell. It fails with a `RecursionError` (Python's guard against a stack overflow). Why?
- Pass `debug=True` to see what the problem is.

In [None]:
def fill_returns_table_v0(table,state,debug=False): 
  """ Recursively fill a returns table, one state at a time."""
  for action in table.columns:
    if table.loc[state][action] is None:
      next_state = foolsball.__get_next_state_on_action__(state, action)
      reward = foolsball.__get_reward_for_transition__(state, next_state)

      if debug:
        print(f'Trying to fill ({state},{action},{next_state})')
      
      fill_returns_table_v0(table, next_state, debug) # <= Earth shaking problem here!!! 😱😱😱
      table.loc[state][action]  = reward + HYPER_PARAMS['gamma'] * table.loc[next_state].max()
    
    else:
      if debug:
        print((state,action),f'already has a RETURN {table.loc[state][action]}')
  

In [None]:
table = make_returns_table(terminal_states)
fill_returns_table_v0(table,state=0)

## Contd..
- The code above crashed because of infinite recursion caused by (state, action) pairs whose next state is the same as the current state, for example when the agent tries to move off the field.
- We can fix this by catching this case and assigning a large negative return.
- Why is the large negative return necessary?

In [None]:
def fill_returns_table_v1(table,state,debug=False):
  """Recursively fill a returns table, skipping self-transitions."""
  for action in table.columns:
    # Use table.loc[row, col]: chained table.loc[row][col] may write to a copy.
    if pd.isna(table.loc[state, action]):
      next_state = foolsball.__get_next_state_on_action__(state, action)
      reward = foolsball.__get_reward_for_transition__(state, next_state)
      
      if debug:
        print(f'Trying to fill ({state},{action},{next_state})')
      
      if next_state == state:
        table.loc[state, action] = -np.inf # <= No self recursion
      else:
        fill_returns_table_v1(table,next_state,debug)
        table.loc[state, action]  = reward + HYPER_PARAMS['gamma'] * table.loc[next_state].max()
    else:
      if debug:
        print((state,action),f'already has a RETURN {table.loc[state, action]}')

In [None]:
table = make_returns_table(terminal_states)
fill_returns_table_v1(table, state=0, debug=False)

## Contd..
- The code above crashed because of infinite mutual recursion: filling state A needs the returns of a neighboring state B, and filling B in turn needs the returns of A.
- We can fix this by evading these cases, that is, by only using a neighbor's returns when they are already fully known.
- Let's see if we can get somewhere.
- Run the next few cells to find out.

In [None]:
def fill_returns_table_v2(table,state, debug=False):
  """Fill the entries of one state whose next state is already fully known."""
  for action in table.columns:
    # Use table.loc[row, col]: chained table.loc[row][col] may write to a copy.
    if pd.isna(table.loc[state, action]):
      next_state = foolsball.__get_next_state_on_action__(state, action)
      reward = foolsball.__get_reward_for_transition__(state, next_state)
      
      if debug:
        print(f'Trying to fill ({state},{action},{next_state})')
      
      if next_state == state:
        table.loc[state, action] = -np.inf # <= No self recursion
      
      elif not table.loc[next_state].isna().any(): # <= No recursion beyond immediate neighbor!
        table.loc[state, action]  = reward + HYPER_PARAMS['gamma'] * table.loc[next_state].max()
    
    else:
      if debug:
        print((state,action),f'already has a RETURN {table.loc[state, action]}')

In [None]:
table = make_returns_table(terminal_states)

In [None]:
fill_returns_table_v2(table,state=0)

In [None]:
table

In [None]:
fill_returns_table_v2(table,state=1)

In [None]:
table

In [None]:
fill_returns_table_v2(table,state=3)

In [None]:
table

In [None]:
for s in range(4,19):
  fill_returns_table_v2(table,state=s)

In [None]:
table

In [None]:
table.isna().sum()

In [None]:
for s in range(0,19):
  fill_returns_table_v2(table,state=s)

In [None]:
table

In [None]:
table.isna().sum()

In [None]:
for s in range(0,19):
  fill_returns_table_v2(table,state=s)

In [None]:
table

In [None]:
table.isna().sum()

# Get Some more Coffee
---

# Step 10
## Estimating returns through simulation
- and Monte Carlo sampling
- No more cheating by peeping into the environment (private APIs)

## Todo
- Run the code in the next two cells to collect and print a random episode.
  - The episode starts with the environment in the initial state 
  - The agent tries random actions
  - The episode terminates when the agent collides with an opponent or reaches the goalpost.


In [None]:
def collect_random_episode():
  state = foolsball.reset()
  done = False
  episode = []

  while not done:
    action = np.random.choice(foolsball.actions)
    next_state, reward, done = foolsball.step(action)
    episode.append([state, action, reward])
    state = next_state
  
  return episode

In [None]:
ep = collect_random_episode()
foolsball.render()
print(ep)

# Step 11
- Complete the function `discounted_return_from_episode()` that computes the discounted return for every state in an episode.
  - If an episode is:  $(s_1,a_1,r_1), (s_2,a_2,r_2), (s_3, a_3, r_3)$, **excluding the terminal state**:
  - The (discounted) return for $s_1$ is $r_1 + \gamma * r_2 + \gamma^2 * r_3$
  - The (discounted) return for $s_2$ is $r_2 + \gamma * r_3$
  - The (discounted) return for $s_3$ is $r_3$ 

- Run the next couple of cells to print discounted returns for entire episodes.
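
The three-step example above can be checked with a short standalone sketch. It uses a backward recursion, $G_i = r_i + \gamma G_{i+1}$, rather than the dot-product approach of the exercise cell below; `discounted_returns` is a hypothetical helper and the rewards are made up.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [G_1, ..., G_n] where G_i = r_i + gamma * G_{i+1} and G_{n+1} = 0."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns

# r_1 = -1, r_2 = -1, r_3 = 5 with gamma = 0.5
print(discounted_returns([-1, -1, 5], gamma=0.5))  # prints [-0.25, 1.5, 5.0]
```

Both approaches should produce the same numbers; working backwards simply reuses each later return instead of recomputing the sum.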


In [None]:
def discounted_return_from_episode(ep, gamma=0):
  states, actions, rewards = list(zip(*ep))
  rewards = np.asarray(rewards)
  discount_coeffs = np.asarray([np.power(gamma,p) for p in range(len(rewards))])
  
  l = len(rewards)
  discounted_returns = [np.dot(rewards["""#Todo:Fill appropriate range"""],discount_coeffs["""#Todo:Fill appropriate range"""]) for i in range(l)]

  return (states, actions, discounted_returns)


In [None]:
discounted_return_from_episode(ep, gamma=HYPER_PARAMS['gamma'])

# Step 12
## Estimate returns by simulating lots of episodes.
- The code below creates two tables:
  - ESTIMATED_RETURNS_TBL for accumulating the return for every (state,action) pair  
  - VISITS_COUNTS_TBL for storing the number of times a (state,action) pair appears across all episodes.

- It then generates episodes and uses them to update both tables.

Here's the idea:
- Create many random episodes
  - Examine each (state, action) pair in an episode.
  - Calculate and accumulate the return for this pair
    - Since we have the full episode, we can "see the future" and calculate the return.
    - The return for a (state, action) pair is just a (very bad) estimate of the "real" return, since we are looking at just one of the many paths that could possibly contain the (state, action) pair.
  - Record the visit count of the (state, action) pair.   

- At the end, we divide the accumulated returns by the visit counts to get an estimate of the returns.
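
The accumulate-then-average idea can be sketched with plain dictionaries instead of pandas tables. Everything here is illustrative: the episodes are hand-written (not generated by Foolsball), the helper names are hypothetical, and the average divides by the exact visit count and only reports visited pairs.

```python
from collections import defaultdict

def discounted_returns(rewards, gamma):
    """Discounted return from every step to the end of the episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]

def average_returns(episodes, gamma=0.9):
    """Average the return observed after each visit to a (state, action) pair."""
    totals = defaultdict(float)   # accumulated returns per pair
    visits = defaultdict(int)     # visit counts per pair
    for episode in episodes:
        rewards = [r for _, _, r in episode]
        returns = discounted_returns(rewards, gamma)
        for (state, action, _), ret in zip(episode, returns):
            totals[(state, action)] += ret
            visits[(state, action)] += 1
    return {pair: totals[pair] / visits[pair] for pair in totals}

# Two made-up episodes of (state, action, reward) triples.
episodes = [
    [('A', 'e', -1), ('B', 'e', -5)],
    [('A', 'e', -1), ('B', 's', -1), ('C', 'e', 5)],
]
estimates = average_returns(episodes, gamma=0.5)
for pair, value in sorted(estimates.items()):
    print(pair, value)
```

The pair `('A', 'e')` appears in both episodes with very different returns, which shows why a single episode is a poor estimate and why averaging over many helps.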


## Todo:
- Complete the code in the **next two cells** to implement what's known as Monte Carlo estimation.
- Run the cells to see how well the algorithm fares.
- Does the algorithm generate sensible-looking return estimates?

In [None]:
# Create empty returns table 
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')
VISITS_COUNTS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 100  #Try 100, 500, 2000

for i in range(n_episodes):
  episode_i = #Todo: Create a random episode.
  states, actions, discounted_returns = #Todo: Generate discounted returns for the episode

  for s,a,ret in zip(states, actions, discounted_returns):
    ESTIMATED_RETURNS_TBL.loc[s,a] += #Todo: Accumulate the return for this (state,action) pair
    VISITS_COUNTS_TBL.loc[s,a] +=  #Todo: update visit count
  

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1) ## Averaging returns. The +1 avoids dividing by zero but shrinks every average slightly towards 0.
estimated_returns

# Step 13:
## Intro to Policies
- The estimated returns table is hard to evaluate.
- To use the table to make decisions, we grab the action with the highest returns.
- We can extract the actions yielding the highest return in each state and call it a **policy**.
- This will be a greedy policy since we take the best action at each state
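
Extracting a greedy policy amounts to an argmax per state. The sketch below uses a nested dictionary with made-up return estimates instead of the pandas table; `greedy_policy` is a hypothetical helper, separate from the exercise function below.

```python
def greedy_policy(returns, terminal_states=()):
    """Pick, for every non-terminal state, the action with the highest estimated return."""
    policy = {}
    for state, action_returns in returns.items():
        if state in terminal_states:
            policy[state] = None  # no action is taken in a terminal state
        else:
            policy[state] = max(action_returns, key=action_returns.get)
    return policy

# Made-up estimates for three states; state 2 is treated as terminal.
returns = {
    0: {'n': -3.0, 'e': -1.2, 'w': -3.0, 's': -0.8},
    1: {'n': -2.5, 'e': -5.0, 'w': -1.9, 's': -0.4},
    2: {'n': 0.0, 'e': 0.0, 'w': 0.0, 's': 0.0},
}
print(greedy_policy(returns, terminal_states=[2]))  # prints {0: 's', 1: 's', 2: None}
```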

In [None]:
def greedy_policy_from_returns_tbl(table):
  policy = {s:None for s in table.index }

  for state in table.index:
    if state not in terminal_states:
      greedy_action_index = # Todo: get the index of the action with the highest return.
      greedy_action = table.columns[greedy_action_index]
      policy[state] = greedy_action

  return policy

In [None]:
policy0 = greedy_policy_from_returns_tbl(estimated_returns)

# Contd..
- Here's a function to superimpose a policy over the environment.
- Use the code in the next two cells to eyeball the policy we just generated

In [None]:
def pretty_print_policy(policy):
  direction_repr = {'n':' 🡑 ', 'e':' 🡒 ', 'w':' 🡐 ', 's':' 🡓 ', None:' ⬤ '}

  for row in range(foolsball.n_rows):
    for col in range(foolsball.n_cols):
      state = row * foolsball.n_cols + col
      print(direction_repr[policy[state]],end='')
    print()

In [None]:
pretty_print_policy(policy0)

# Step 14
## Exploiting the information in the returns table.
- We are improving our estimates of the returns with each successive episode. 
- But we are still generating random episodes throughout. 
- We should also exploit the information we accrue in returns table
- The implementation below is quite similar to `collect_random_episode` but here's the key difference:
  - In state s, the random policy returns a random action from ('n','s','e','w').
  - But from the returns table we know that one of the actions, say 'e', generates the best returns, so we can make a greedy choice and always return 'e'.

- Run the next two cells to see the difference. 

In [None]:
def collect_greedy_episode_from_returns_tbl(table, max_ep_len=20):
  state = foolsball.reset()
  done = False
  episode = []

  for _ in range(max_ep_len):
    if done:
      break
    
    greedy_action_index = table.loc[state].argmax()
    greedy_action = table.columns[greedy_action_index]
    next_state, reward, done = foolsball.step(greedy_action)
    episode.append([state, greedy_action, reward])
    state = next_state
  
  return episode

In [None]:
collect_greedy_episode_from_returns_tbl(estimated_returns)

# Step 15
## Todo 
- Implement the loop in the cell below to update the returns table. 
- The code will be exactly what we used earlier, except that it will use greedy episodes.

- Run the next few cells to evaluate the effectiveness.


In [None]:
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')
VISITS_COUNTS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 1000

for i in range(n_episodes):
  #Todo: Implement code block to update ESTIMATED_RETURNS_TBL and VISITS_COUNTS_TBL
  #Todo: Make sure you are using greedy episodes.

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1) ## Averaging returns. Avoid dividing by zeros.
estimated_returns

In [None]:
policy1 = greedy_policy_from_returns_tbl(estimated_returns)

In [None]:
pretty_print_policy(policy1)

# Step 16
## The Exploration-exploitation Dilemma

- We have tried pure exploration (with random episodes)
- We have also tried pure exploitation (with policy generated from the returns table)
- A good agent should try to balance both.


## Epsilon-greedy episodes
- An epsilon greedy episode blends the previous two approaches
- Precisely, when in state `s`:
  - The epsilon-greedy episode picks the action yielding the highest return with a high probability, $1-\epsilon+{\epsilon \over 4}$ when there are 4 actions.
  - It sometimes picks one of the other, suboptimal actions instead, each with a low probability of ${\epsilon \over 4}$.
  - The hyperparameter `epsilon` or $\epsilon$ decides these probabilities.

  - Example with epsilon = 0.2
    - state `s`
    - Actions = ('n','e','w','s')
    - Best action (yielding highest return) = 'w'
    - Sampling probabilities = $[1-\epsilon+{\epsilon \over 4},{\epsilon \over 4},{\epsilon \over 4},{\epsilon \over 4}] = [0.85,0.05,0.05,0.05]$
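
The sampling probabilities in this example can be reproduced with a small standalone sketch. It uses Python's `random.choices` and a plain dictionary of made-up returns; the helper names are hypothetical, and the exercise below uses NumPy instead.

```python
import random

def epsilon_greedy_probs(action_returns, epsilon):
    """Give every action epsilon/n, plus the remaining 1 - epsilon to the best one."""
    actions = list(action_returns)
    best = max(actions, key=action_returns.get)
    probs = {a: epsilon / len(actions) for a in actions}
    probs[best] += 1 - epsilon
    return probs

def sample_action(action_returns, epsilon, rng=random):
    """Draw one action according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(action_returns, epsilon)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

# Made-up returns in some state s; 'w' is the best action.
returns_in_s = {'n': -2.0, 'e': -1.5, 'w': 3.0, 's': -0.5}
print(epsilon_greedy_probs(returns_in_s, epsilon=0.2))
print(sample_action(returns_in_s, epsilon=0.2))
```

Setting `epsilon=0` makes the choice purely greedy, while `epsilon=1` makes all four actions equally likely.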


## Todo:
Finish the code below and look at how the output differs from the other two methods. 


In [None]:
def collect_epsilon_greedy_episode_from_returns_tbl(table, max_ep_len=20, epsilon=0.1):
  
  state = foolsball.reset()
  done = False
  episode = []

  for _ in range(max_ep_len):
    if done:
      break
    
    actions = table.columns
    action_probs = np.asarray([epsilon/len(actions)]*len(actions),dtype=float)  # np.float was removed in NumPy 1.24
    
    greedy_action_index = table.loc[state].argmax()
    action_probs[greedy_action_index] += 1-epsilon
    
    epsilon_greedy_action = #Todo: use np.random.choice to sample epsilon-greedily

    next_state, reward, done = foolsball.step(epsilon_greedy_action)
    episode.append([state, epsilon_greedy_action, reward])
    state = next_state

  return episode

In [None]:
collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns, epsilon=1)

# Step 17
## Epsilon-greedy updates.
## Todo:
- Run the next few cells to see the effect of using an epsilon greedy approach.

In [None]:
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')
VISITS_COUNTS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 1000

for i in range(n_episodes):
  estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1)
  
  episode_i = collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns)
  #print(episode_i)
  states, actions, discounted_returns = discounted_return_from_episode(episode_i, gamma=HYPER_PARAMS['gamma'])

  for s,a,ret in zip(states, actions, discounted_returns):
    ESTIMATED_RETURNS_TBL.loc[s,a] += ret
    VISITS_COUNTS_TBL.loc[s,a] += 1

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1) ## Averaging returns. Avoid dividing by zeros.
estimated_returns

In [None]:
policy2 = greedy_policy_from_returns_tbl(estimated_returns)
policy2

In [None]:
pretty_print_policy(policy2)

# Step 18
## Revisiting Exploration-Exploitation with Epsilon Decay

- What is the best way to balance exploitation with exploration?
  - In the beginning, pick absolutely random actions in every state.
  - Slowly reduce the randomness to a small value.
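
One common way to realise this schedule is multiplicative decay with a floor. The sketch below is only an illustration under that assumption: the generator name `annealed_epsilons` is hypothetical, and the default values mirror the hyperparameters used in this step.

```python
def annealed_epsilons(n_episodes, epsilon=1.0, min_epsilon=0.1, epsilon_decay=0.999):
    """Yield the exploration rate for each episode: decay geometrically, never below the floor."""
    for _ in range(n_episodes):
        epsilon = max(epsilon, min_epsilon)  # clamp to the minimum threshold
        yield epsilon
        epsilon *= epsilon_decay             # shrink for the next episode

schedule = list(annealed_epsilons(10_000))

# Index of the first episode that runs at the floor value.
first_floor_episode = next(i for i, eps in enumerate(schedule) if eps == 0.1)

print(schedule[0], schedule[1000], schedule[-1])
print(first_floor_episode)
```

With these values the agent spends roughly the first quarter of the 10,000 episodes moving from fully random to mostly greedy behaviour, and the remainder at the minimum exploration rate.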

## Todo:
- In the code below pick a value of `epsilon` that makes all actions equiprobable in `collect_epsilon_greedy_episode_from_returns_tbl()`.

- Fill in the code to anneal epsilon over episodes. The value of epsilon should not drop below the minimum threshold.

- Run the next few cells to evaluate this approach.

- Do the policies look any better?

In [None]:
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0.0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')  # floats, since returns are fractional
VISITS_COUNTS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 10000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.999

for i in range(n_episodes):
  estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1)
  
  epsilon = #Todo: Pick annealed value unless it is lower than the minimum threshold
  episode_i = collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns,epsilon=epsilon)
  epsilon *= epsilon_decay
  #print(episode_i)
  states, actions, discounted_returns = discounted_return_from_episode(episode_i, gamma=HYPER_PARAMS['gamma'])

  for s,a,ret in zip(states, actions, discounted_returns):
    ESTIMATED_RETURNS_TBL.loc[s,a] += ret
    VISITS_COUNTS_TBL.loc[s,a] += 1

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL.div(VISITS_COUNTS_TBL+1) ## Averaging returns. Avoid dividing by zeros.
print(estimated_returns)

policy3 = greedy_policy_from_returns_tbl(estimated_returns)
print(policy3)

pretty_print_policy(policy3)

# Step 19
## Constant Alpha

## The idea:
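- Dividing the accumulated returns by the visit count has a non-linear effect on the updates: the $n$-th return observed for a state-action pair moves the average by only about $1/n$ of its difference from the current estimate, so later episodes count for less and less. (Go back to the previous step and see for yourself.)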

- Don't divide at all!

- But we need to ensure that updates are small:

  - `ESTIMATED_RETURNS_TBL.loc[s,a]` and `ret` are both estimates of the same quantity.

  - Use the difference between the two estimates, scaled by a small step size `alpha`, to update `ESTIMATED_RETURNS_TBL.loc[s,a]`, much like a learning-rate-scaled update in deep learning.
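
In symbols, with $Q$ the current estimate and $G$ the newly observed return, the idea is $Q \leftarrow Q + \alpha\,(G - Q)$. The sketch below contrasts this with plain averaging on a single hypothetical state-action pair. The function names and the made-up sequence of returns are assumptions for illustration only, and the code does not touch the notebook's tables.

```python
def running_mean_update(q, n, ret):
    """Sample-average update: the effective step size 1/n shrinks with every visit."""
    n += 1
    q += (ret - q) / n
    return q, n

def constant_alpha_update(q, ret, alpha=0.01):
    """Constant-alpha update: always move a fixed fraction of the way towards the new return."""
    return q + alpha * (ret - q)

# Hypothetical returns: the pair looks worthless at first, then starts paying off.
returns = [0.0] * 250 + [1.0] * 250

q_mean, n = 0.0, 0
q_alpha = 0.0
for ret in returns:
    q_mean, n = running_mean_update(q_mean, n, ret)
    q_alpha = constant_alpha_update(q_alpha, ret, alpha=0.01)

print(q_mean)   # weighs all 500 returns equally
print(q_alpha)  # weighs recent returns more heavily
```

The constant step size makes the estimate follow recent returns more closely than the plain average does. That matters here because the episodes are generated by a policy that keeps changing as epsilon decays.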


## Todo:
- Complete the missing code in the next cell.
- Run the next few cells to get a policy and evaluate it.
- Does the policy help the agent attain its goal?


In [None]:
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0.0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')  # floats, since updates are fractional

n_episodes = 10000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.999

alpha = 0.01

for i in range(n_episodes):
  estimated_returns = ESTIMATED_RETURNS_TBL
  
  epsilon = max(epsilon,min_epsilon)
  episode_i = collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns,epsilon=epsilon)
  epsilon *= epsilon_decay
  states, actions, discounted_returns = discounted_return_from_episode(episode_i, gamma=HYPER_PARAMS['gamma'])

  for s,a,ret in zip(states, actions, discounted_returns):
    ESTIMATED_RETURNS_TBL.loc[s,a] += #Todo: Update RHS using hints from the instructions. Use alpha as the "learning rate"

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL
print(estimated_returns)

policy4 = greedy_policy_from_returns_tbl(estimated_returns)
print(policy4)

pretty_print_policy(policy4)