
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Multi-armed bandit

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you learn:<br>
 - One-state MDPs, also known as multi-armed bandits
 - The trade-off between exploration and exploitation
  
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* Sutton and Barto, *Reinforcement Learning: An Introduction*, chapter 2

### Problem statement
![Multi-armed bandit problem](https://files.training.databricks.com/images/rl/multiarm_bandit.png)
<br/><br/><br/><br/>
You arrive at a casino and decide to play a slot machine with 10 levers. Each time you pull a lever, you receive some money. Your objective is to maximize your cumulative reward by choosing which lever to pull at each step. This is known as the **multi-armed bandit** problem.


**How would you approach this problem? Discuss this with your neighbors.**<br/>

In this lab we try to answer this question.

### Case 1: ###

0. Set q(a) to 0 for all levers \\(a = 1, 2, 3, ... , 10\\). q(a) is the estimated expected reward for lever a. Since we do not know anything about the levers yet, every estimate starts at 0.
0. Take the first action by randomly picking one of the levers.
0. Observe the reward.
0. Update q for that lever to the average of the rewards observed from it so far.
0. Pick the lever with the highest q value. If there are ties, pick randomly among the tied levers.
0. Repeat steps 3-5 until step 1,000.
0. Repeat steps 1-6 for 1000 different multi-armed bandit problems.

In [5]:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1234)

# Parameters of the distributions (for simulation)
mu_q_star = 0
sigma_q_star = 1
lever_count = 10
sigma_reward = 1
timesteps = 1000
runs = 1000
epsilon = 0.1


def qstar():
  """This function generates random mean values for levers."""

  return np.random.normal(mu_q_star, sigma_q_star, lever_count)

def reward(q_star, action):
  """This function generates rewards."""
  
  return np.random.normal(q_star[action], sigma_reward, 1)

def plot_average_rewards(x, y):
  """This function plots the average reward per timestep, averaged over all runs."""
  
  fig = plt.figure()
  plt.plot(x, y)
  fig.suptitle(r'Average reward per timestep', fontsize=20)
  plt.xlabel('timestep', fontsize=18)
  plt.ylabel('Average reward', fontsize=16)
  plt.show()
  display()

In [6]:
# ANSWER
def greedy():
  """This function runs multiple simulations for a 10-armed bandit problem."""
  
  # Accumulate the reward at each timestep across all runs; dividing by `runs` at the end gives the average reward per timestep
  cumulative_timestep_reward = np.zeros(timesteps)

 
  
  for i in range(runs):
    print(f"This is iteration {i+1}.")
    # Use qstar function defined above to initialize the mean of reward distribution.
    q_star = qstar()
    # Initialize the q estimates and the pull counts to zero.
    q = np.zeros(lever_count)
    
    count = np.zeros(lever_count)
    
    # Randomly pick a lever
    action = np.random.randint(lever_count, size= 1)
    
    # Initialize cumulative rewards and rewards
    cumulative_rewards = np.zeros(lever_count)
    rewards = np.zeros(timesteps)
    
    # For 1000 time steps run this simulation 
    for step in range (timesteps):
      
      # Keep track of number of times a lever has been picked.
      count[action] += 1 
      
      # Observe the reward. Reward comes from a distribution defined earlier. 
      rewards[step] = reward(q_star, action)
      # Add the reward to the running total for this lever
      cumulative_rewards[action] = rewards[step]+cumulative_rewards[action]
      
      # Update the q value of this lever to its average observed reward
      q[action] = (cumulative_rewards[action])/count[action]
      # Act greedily, breaking ties at random
      highest_q = np.argwhere(q == np.amax(q))
      action = np.random.choice(highest_q.flatten(), 1)
  
    
    
    cumulative_timestep_reward += rewards
      
    
  
  return(cumulative_timestep_reward/runs)

In [7]:
average_reward = greedy()
x = np.linspace(1,timesteps, timesteps)
plot_average_rewards(x, average_reward)

### Case 2: ###

Now assume that you want to do some exploration along with exploitation. Implement the following procedure and compare the results. 

0. Set q(a) to 0 for all levers \\(a = 1, 2, 3, ... , 10\\). q(a) is the estimated expected reward for lever a. Since we do not know anything about the levers yet, every estimate starts at 0.
0. Take the first action by randomly picking one of the levers.
0. Observe the reward.
0. Update q for that lever to the average of the rewards observed from it so far.
0. With probability \\(1-\epsilon\\), where \\(\epsilon \in [0,1]\\), pick the lever with the highest q value (breaking ties at random). Otherwise, pick randomly among the remaining levers.
0. Repeat 3-5 until step 1,000.
0. Repeat 1-6 for 1000 different multi-armed bandit problems.
0. Repeat 1-7 for \\(\epsilon = 0, 0.01, 0.1, 0.2, 0.5, 0.75, 1\\).
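
The selection rule in step 5 can be sketched as a small helper. This is a minimal sketch, not the lab's answer: `select_action` and the use of `np.random.default_rng` are assumptions made for illustration. Note that it explores only among the levers that do not currently have the highest q, as the step describes; the textbook ε-greedy rule in Sutton and Barto instead explores uniformly over all levers, including the greedy one.

```python
import numpy as np

def select_action(q, epsilon, rng):
    """Pick a lever following step 5 (hypothetical helper, for illustration only)."""
    best = np.flatnonzero(q == q.max())    # all levers tied for the highest q
    others = np.flatnonzero(q != q.max())  # every remaining lever
    # Explore with probability epsilon, but only if a non-greedy lever exists.
    if others.size > 0 and rng.random() < epsilon:
        return int(rng.choice(others))
    # Otherwise exploit, breaking ties at random.
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q = np.array([0.2, 1.5, 1.5, -0.3])
greedy_pick = select_action(q, epsilon=0.0, rng=rng)   # one of the tied best levers
explore_pick = select_action(q, epsilon=1.0, rng=rng)  # one of the other levers
```

When every lever has the same q (for example, at the very start), there is nothing to explore, so the helper falls back to a random choice among all levers.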

In [9]:
# ANSWER
def epsilon_greedy(epsilon):
  """This function implements e-greedy algorithm.""" 
  
  # Accumulate the reward at each timestep across all runs
  cumulative_timestep_reward = np.zeros(timesteps)

  for i in range(runs):
    print(f"This is iteration {i+1}.")
    # Use qstar function defined above to initialize the mean of reward distribution. 
    q_star = qstar()
    # Initialize the q estimates to zero.
    q = np.zeros(lever_count)
    # Initialize count
    count = np.zeros(lever_count)
    # Randomly pick a lever
    action = np.random.randint(lever_count, size= 1)
    # Initialize the cumulative reward
    cumulative_rewards = np.zeros(lever_count)
    # Initialize the reward (for each run)
    rewards = np.zeros(timesteps)
    
    for step in range (timesteps):
      
      # Increase the count for the chosen lever
      count[action] += 1
      # Observe the reward
      rewards[step] = reward(q_star, action)
      # Calculate the cumulative reward
      cumulative_rewards[action] = rewards[step]+cumulative_rewards[action]
      # Update the q value
      q[action] = (cumulative_rewards[action])/count[action]
      # Act greedily by picking the lever with the highest q
      highest_q = np.argwhere(q == np.amax(q))
      
      # With probability epsilon, explore: pick randomly among the levers that do NOT have the highest q (skipped when all levers are tied)
      if (np.random.uniform() < epsilon and len(highest_q) != len(q)):
        not_highest_q = np.argwhere(q != np.amax(q))
        action = np.random.choice(not_highest_q.flatten(), 1)
       
        
      # Pick the lever with highest value of q with probability of 1-epsilon
      else:
        action = np.random.choice(highest_q.flatten(), 1)
  

    
    cumulative_timestep_reward += rewards
      
    
      
  
  return(cumulative_timestep_reward/runs)    

In [10]:
average_reward = epsilon_greedy(epsilon)
x = np.linspace(1,timesteps, timesteps)
plot_average_rewards(x, average_reward)
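
Step 8 of the procedure asks for a comparison across several values of ε, while the cell above runs only one. The following is a minimal, self-contained sketch of such a sweep. The helper `average_reward_for`, its default arguments, and the reduced `runs` and `timesteps` (chosen only to keep the example fast) are assumptions, not part of the lab.

```python
import numpy as np

def average_reward_for(epsilon, runs=50, timesteps=200, lever_count=10, seed=0):
    """Average reward per timestep for one epsilon (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    total = np.zeros(timesteps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, lever_count)  # true mean reward of each lever
        q = np.zeros(lever_count)                   # estimated values
        sums = np.zeros(lever_count)                # running reward total per lever
        count = np.zeros(lever_count)               # number of pulls per lever
        action = int(rng.integers(lever_count))     # first action is random
        for step in range(timesteps):
            r = rng.normal(q_star[action], 1.0)
            total[step] += r
            count[action] += 1
            sums[action] += r
            q[action] = sums[action] / count[action]
            best = np.flatnonzero(q == q.max())
            others = np.flatnonzero(q != q.max())
            # Explore among the non-greedy levers with probability epsilon.
            if others.size > 0 and rng.random() < epsilon:
                action = int(rng.choice(others))
            else:
                action = int(rng.choice(best))
    return total / runs

epsilons = [0, 0.01, 0.1, 0.2, 0.5, 0.75, 1]
results = {eps: average_reward_for(eps) for eps in epsilons}
for eps, curve in results.items():
    print(f"epsilon={eps}: mean reward over the last 50 steps = {curve[-50:].mean():.3f}")
```

Each entry of `results` is a curve of length `timesteps`, so it could be passed to `plot_average_rewards` to compare the values of ε visually. With so few runs the curves are noisy; raising `runs` and `timesteps` toward the lab's settings should smooth them at the cost of runtime.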

### Bonus: ###

In case 2 we accumulated the previous rewards of each lever to update its q value. Is there a more memory-efficient way to update q?<br/><br/><br/>

**Hint:** <br/>

\\(\begin{aligned} Q\_{n+1} &= \frac {\sum \_{i=1}^{i=n}R\_i} {n} \\\
&= \frac {1}{n}(R\_{n} + \sum \_{i=1}^{i=n-1}R\_i)\\\
&= \frac {1}{n}(R\_{n} + \frac {n-1}{n-1} \sum \_{i=1}^{i=n-1}R\_i)\\\
&= \frac {1}{n}(R\_{n} + (n-1)Q\_{n})\\\
&= Q\_{n} + \frac {1}{n}[R\_{n}-Q\_{n}]
\end{aligned} \\)
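
As a quick sanity check of the last line of the derivation, the sketch below applies the incremental update to a stream of simulated rewards and compares the result with the ordinary sample mean. The reward distribution and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = rng.normal(1.0, 1.0, size=1000)  # simulated rewards from a single lever

q = 0.0  # running estimate before any reward has been seen
for n, r in enumerate(rewards, start=1):
    q += (r - q) / n  # Q_{n+1} = Q_n + (1/n) [R_n - Q_n]

batch_mean = rewards.mean()
print(abs(q - batch_mean))  # should be zero up to floating-point error
```

Only the current estimate and the count need to be stored for each lever, so the running sum of rewards is no longer required.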

In [12]:
# ANSWER
def epsilon_greedy_memory_efficient(epsilon):
  """This function implements e-greedy with the incremental q update derived above."""
  
  cumulative_timestep_reward = np.zeros(timesteps)
  
  for i in range(runs):
    print(f"This is iteration {i+1}.")
    q_star = qstar()
    q = np.zeros(lever_count)
    count = np.zeros(lever_count)
    action = np.random.randint(lever_count, size= 1)
    rewards = np.zeros(timesteps)
    
    for step in range (timesteps):
      count[action] += 1
      rewards[step] = reward(q_star, action)
      q[action] = q[action] + 1.0/count[action] * (rewards[step] - q[action])
      highest_q = np.argwhere(q == np.amax(q))
      
      if (np.random.uniform() < epsilon and len(highest_q) != len(q)):
        not_highest_q = np.argwhere(q != np.amax(q))
        action = np.random.choice(not_highest_q.flatten(), 1)
      
      else:
        action = np.random.choice(highest_q.flatten(), 1)
  
    
   
    cumulative_timestep_reward += rewards
  
  return(cumulative_timestep_reward/runs)  

In [13]:
average_reward = epsilon_greedy_memory_efficient(epsilon)
x = np.linspace(1,timesteps, timesteps)
plot_average_rewards(x, average_reward)


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>