# CBS Week 9 Notebook: Reinforcement Learning

### Week 11

In [None]:
library(tidyverse)

### Epsilon Greedy Sampling

In [None]:
epsilon_greedy <- function(arms, N, epsilon){
    trials <- NULL
    # Start every arm at 1 win in 2 pulls so the win rate is defined on the first trial
    outcomes <- rep(1, length(arms))
    choices <- rep(2, length(arms))
    for(i in seq_len(N)){
        if(runif(1) < epsilon){
            # Explore: pick any arm uniformly at random
            choice <- sample(1:length(arms), 1)
        } else {
            # Exploit: pick the arm with the best win rate so far, breaking ties at random
            valid_arms <- which(outcomes/choices == max(outcomes/choices))
            choice <- ifelse(length(valid_arms)==1, valid_arms, sample(valid_arms, 1))
        }
        outcome <- rbinom(1, 1, arms[choice])
        choices[choice] <- choices[choice] + 1
        outcomes[choice] <- outcomes[choice] + outcome
        trials <- rbind(trials,
                        data.frame(trial=i, choice=choice, outcome=outcome, optimal=max(arms), epsilon=epsilon))
    }
    trials
}


### Win Stay, Lose Sample

In [None]:
wsls <- function(arms, N){
    outcomes <- rep(0, length(arms))
    choices <- rep(0, length(arms))
    
    choice <- sample(1:length(arms), 1)
    outcome <- rbinom(1, 1, arms[choice])
    
    choices[choice] <- choices[choice] + 1
    outcomes[choice] <- outcomes[choice] + outcome
    trials = data.frame(trial=1, choice=choice, outcome=outcome, optimal=max(arms))
    
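    # seq_len(N)[-1] is empty when N == 1, whereas 2:N would count down from 2 to 1
    for(i in seq_len(N)[-1]){
        if(outcome==1){
            # Win: stay with the same arm
            outcome <- rbinom(1, 1, arms[choice])
        } else {
            # Loss: sample an arm uniformly at random (this may pick the same arm again)
            choice <- sample(1:length(arms), 1)
            outcome <- rbinom(1, 1, arms[choice])
        }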
        choices[choice] <- choices[choice] + 1
        outcomes[choice] <- outcomes[choice] + outcome
        trials = rbind(trials,
                       data.frame(trial=i, choice=choice, outcome=outcome, optimal=max(arms)))
        
    }
    trials
}


### Thompson Sampling

In [None]:
thompson <- function(arms, N){
    trials = NULL
    outcomes <- rep(0, length(arms))
    choices <- rep(0, length(arms))
    
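    # Beta(1, 1) priors: every win probability is equally likely before any data
    alphas <- rep(1, length(arms))
    betas <- rep(1, length(arms))
    for(i in 1:N){
        # Draw one plausible win probability per arm from its posterior and pick the largest
        thetas <- rbeta(length(arms), alphas, betas)
        valid_arms <- which(thetas == max(thetas))
        choice <- ifelse(length(valid_arms)==1, valid_arms, sample(valid_arms, 1))
        outcome <- rbinom(1, 1, arms[choice])
        # Conjugate update: a win increments alpha, a loss increments beta
        if(outcome==1){
            alphas[choice] <- alphas[choice] + 1
        } else {
            betas[choice] <- betas[choice] + 1
        }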
        choices[choice] <- choices[choice] + 1
        outcomes[choice] <- outcomes[choice] + outcome
        trials = bind_rows(trials,
                       tibble(trial=i, choice=choice, outcome=outcome, optimal=max(arms)) %>%
                           bind_cols(as_tibble_row(setNames(alphas/(alphas+betas), paste0('B', 1:length(arms))))) %>%
                           bind_cols(as_tibble_row(setNames(arms, paste0('A', 1:length(arms)))))
                          )
    }
    trials
}


# Do these strategies work equally well across MABs?

In class, we discussed two factors that can make it hard to determine the best arm: the magnitude of the greatest possible reward and the relative reward between the best two options.

In this tutorial, we are going to see whether different sampling algorithms are better suited to MAB problems that vary along these two dimensions.
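To make the two dimensions concrete, here is a minimal sketch that summarises a MAB vector along both of them. The helper name `describe_mab` is hypothetical, and reading "relative reward" as the best arm minus the runner-up is an assumption; a ratio would be another reasonable choice.

```r
# Hypothetical helper (not part of the course code): summarise a MAB
# along the two dimensions discussed above.
describe_mab <- function(arms) {
  sorted <- sort(arms, decreasing = TRUE)
  c(best = sorted[1],              # magnitude of the best arm
    gap  = sorted[1] - sorted[2])  # ASSUMPTION: "relative reward" = best minus runner-up
}

# All arms are equal here, so the best arm pays 0.5 and the gap is 0
describe_mab(c(0.5, 0.5, 0.5))
```

A MAB with a small gap should be harder to solve than one with a large gap, because more pulls are needed to tell the top two arms apart.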

#### Exercise 1

To start, create some MABs that vary on these two dimensions. As a reminder, we define MAB problems as vectors of reward probabilities. For example:


In [None]:

mab_0 <- c(0.5, 0.5, 0.5)


Now for each MAB, we want to determine which strategy is the best. How do we want to measure this?

#### Exercise 2

Let's make a function that takes as input a simulation data.frame and outputs a numeric value.

In [None]:

score_simulation = function(df){
    NA # YOUR CODE HERE
}


Sampling is a random process, so we are going to repeat our simulations multiple times and compare the average scores. Here is some code that will run the simulations $N_{sims}$ times.

Pro Tip: If you want to ensure that your simulation will return the same results, you need to set the random seed using the `set.seed` function before you start your simulation code. Otherwise, your function will return different values each time it is run.
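As a small illustration of the tip above, the sketch below draws the same five coin flips twice. The seed value 42 and the variable names are arbitrary choices for this example.

```r
# Resetting the seed to the same value replays the same random draws
set.seed(42)
first_run <- rbinom(5, 1, 0.5)

set.seed(42)
second_run <- rbinom(5, 1, 0.5)

identical(first_run, second_run)  # TRUE
```

Setting the seed once at the top of a simulation cell is usually enough. Resetting it inside a loop would make every iteration produce the same draws.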

In [None]:

run_simulations = function(mab, horizon, N_sims, epsilon){
    scores = NULL
    for(i in 1:N_sims){
        scores <- bind_rows(scores,
                            data.frame(simulation = i,
                                       greedy = epsilon_greedy(mab, horizon, epsilon) %>% score_simulation(),
                                       wsls = wsls(mab, horizon) %>% score_simulation(),
                                       thompson = thompson(mab, horizon) %>% score_simulation()
                                      )
                           )    
    }
    scores
}


In [None]:
N_sims <- 10
horizon <- 100

sims = run_simulations(mab_0, horizon, N_sims, epsilon=0.1)

sims %>%
    gather(Sampler, Score, greedy:thompson) %>%
    ggplot(aes(Sampler, Score)) +
    stat_summary(fun=mean, geom='bar') +
    # mean_cl_boot computes bootstrapped confidence limits; it requires the Hmisc package
    stat_summary(fun.data=mean_cl_boot, geom='linerange') +
    theme_bw(base_size=18)

#### Exercise 3

Okay, run an experiment to test which sampling algorithm is best as the absolute magnitude of the best option changes. You need to simulate different MABs and compare their scores.

In [None]:
# YOUR CODE HERE

#### Exercise 4

Okay, now let's run an experiment to test which sampling algorithm is best as the relative magnitude of the best two options changes. Again, you need to simulate different MABs and compare their scores.

In [None]:
# YOUR CODE HERE

# Can we formalize the Bandits as Markov Decision Processes?

In class, Frank told us that all of the bandits can be formalized as Markov Decision Processes (MDPs), which makes MDPs a unifying framework. Once a bandit is formalized as an MDP, we can use the Bellman equation to uncover the optimal policy and value function.
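For reference, the update that the `bellman` function below iterates can be written as follows. The notation is an assumption chosen to match the code: $T(s' \mid s, a)$ stands for `transitions[s', a, s]`, $R(s', a, s)$ for `rewards[s', a, s]`, and $\gamma$ for `discount`.

$$
Q(s, a) = \sum_{s'} T(s' \mid s, a)\,\bigl[R(s', a, s) + \gamma V(s')\bigr],
\qquad
V(s) = \max_{a} Q(s, a)
$$

The optimal policy then picks, in each state $s$, the action $a$ with the largest $Q(s, a)$.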


In [None]:
bellman <- function(MDP, maxiters=10000, verbose=FALSE){
    # States and actions are assumed to be the integer indices 1..S and 1..A
    states <- MDP$states
    actions <- MDP$actions
    S <- length(states)
    A <- length(actions)
    transitions <- MDP$transitions
    rewards <- MDP$rewards
    discount <- MDP$discount

    # Value iteration: start from V = 0 and apply the Bellman update until V stops changing
    V <- rep(0, S)
    for(i in seq_len(maxiters)){
        if(verbose){message(i)}
        oldV <- V
        Q <- matrix(0, S, A)
        for(s in states){
            for(a in actions){
                # Expected reward plus discounted future value, summed over end states s'
                Q[s, a] <- sum(transitions[, a, s] * (rewards[, a, s] + discount*V))
            }
        }
        V <- apply(Q, 1, max, na.rm=TRUE)
        if(!any(abs(V-oldV) > 0.00001)){
            break
        }
        # Only reached on the final iteration if the values are still changing
        if(i == maxiters){
            message('WARNING: Values did not converge')
        }
    }
    return(list(policy=apply(Q, 1, which.max), value=V))
}


Let's do an example. 

- We have a contextual bandit with 3 states ($S_1$, $S_2$ and $S_3$) and 3 actions in each state.
- If you're in $S_1$, you always move to state $S_3$.
- If you're in $S_2$, you always move to state $S_4$.
- If you're in $S_3$, you always move to state $S_2$.
- If you're in $S_1$, the arm probabilities are 0.8, 0.4 and 0.1.
- If you're in $S_2$, the arm probabilities are 0.1, 0.4 and 0.8.
- If you're in $S_3$, the arm probabilities are 0.1, 0.8 and 0.1.
- $S_4$ is a goal state that says the game is over. We always need a goal state when formalizing a bandit.
- Let's say there is no discounting, i.e., $\gamma = 1$.


In [None]:
# Transitions is a 3-dimensional array
    # The first dimension is the end state s'
    # The second dimension is the action taken a
    # The third dimension is the start state s
# Let's initialize the transition array with 0s
transitions = array(0, c(4, 3, 4))

# If we're in S_1 we move to S_3
transitions[,1, 1] = c(0, 0, 1, 0)
transitions[,2, 1] = c(0, 0, 1, 0)
transitions[,3, 1] = c(0, 0, 1, 0)

# If we're in S_2 we move to S_4
transitions[,1, 2] = c(0, 0, 0, 1)
transitions[,2, 2] = c(0, 0, 0, 1)
transitions[,3, 2] = c(0, 0, 0, 1)

# If we're in S_3 we move to S_2
transitions[,1, 3] = c(0, 1, 0, 0)
transitions[,2, 3] = c(0, 1, 0, 0)
transitions[,3, 3] = c(0, 1, 0, 0)

# Rewards are also a 3-dimensional array
    # The first dimension is the end state s'
    # The second dimension is the action taken a
    # The third dimension is the start state s
# Each entry is an expected reward; a win pays 1, so it equals the arm's win probability
# Let's initialize the reward array with 0s
rewards = array(0, c(4, 3, 4))

# If you're in S_1, the arm probabilities are 0.8, 0.4 and 0.1
rewards[,1, 1] = 0.8
rewards[,2, 1] = 0.4
rewards[,3, 1] = 0.1

# If you're in S_2, the arm probabilities are 0.1, 0.4 and 0.8
rewards[,1, 2] = 0.1
rewards[,2, 2] = 0.4
rewards[,3, 2] = 0.8

# If you're in S_3, the arm probabilities are 0.1, 0.8 and 0.1
rewards[,1, 3] = 0.1
rewards[,2, 3] = 0.8
rewards[,3, 3] = 0.1

MDP = list(states= c(1, 2, 3, 4),
     actions = c(1, 2, 3),
     rewards = rewards,
     transitions = transitions,
     discount=1.0)

bellman(MDP)

So according to the Bellman equation, the optimal policy is to choose the first arm when in $S_1$, the third arm in $S_2$, and the second arm in $S_3$.

Further, if we start in $S_1$, we can expect to win 2.4 times using the optimal policy, i.e., 0.8 in the first state, 0.8 in the third state and 0.8 in the second state. If we start in $S_2$, we can expect to win 0.8 times. If we start in $S_3$, we can expect to win 1.6 times.

When interpreting, we ignore the goal state $S_4$ as there are no actions to take and no reward to win.
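These numbers can be double-checked by hand without the `bellman` function. The sketch below is a minimal backward-induction check that relies on this example's special structure (each state leads deterministically to one next state, with no discounting). It is not a general MDP solver, and the variable names are chosen for illustration only.

```r
# Arm probabilities per state (rows: S_1, S_2, S_3), copied from the example
arm_probs <- rbind(c(0.8, 0.4, 0.1),
                   c(0.1, 0.4, 0.8),
                   c(0.1, 0.8, 0.1))
next_state <- c(3, 4, 2)  # S_1 -> S_3, S_2 -> S_4, S_3 -> S_2

# The next state does not depend on the arm, so the best arm in each
# state is simply the one with the highest win probability
best_arm <- apply(arm_probs, 1, which.max)
best_reward <- apply(arm_probs, 1, max)

# Work backwards from the goal state S_4, whose value is 0
value <- rep(0, 4)
for (s in c(2, 3, 1)) {  # reverse order of the chain S_1 -> S_3 -> S_2 -> S_4
  value[s] <- best_reward[s] + value[next_state[s]]
}

best_arm  # arms 1, 3 and 2 for S_1, S_2 and S_3
value     # approximately 2.4, 0.8, 1.6 and 0
```

The values agree with the interpretation above up to floating-point rounding.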

#### Last Exercise

It's your turn.

Hyssop is bored of information theory and wants to run a contextual bandit task. They're keeping it simple. Participants must choose whether to explore the dungeons or the towers of a castle. Trials occur either at night or during the day. At night, the dungeons are 20% likely to have loot and the towers are 10% likely to have loot. During the day, the towers are 30% likely to have loot and the dungeons are 15% likely to have loot. Raiding during the day normally takes all day, but 10% of the time, two raids can happen in the same day. Raiding at night is fast, so only 10% of the time will a raid take all night. Participants get to storm 5 castles before the experiment ends.

They want to use a discount parameter of $\gamma=0.8$.

Hyssop wants to see what the optimal policy is for their experiment. Specifically, they want to know:

- Under the optimal policy, what should a participant do if it's nighttime on the 3rd trial?
- If a participant knows the optimal policy, can they score more if their first trial is a day trial or a night trial?


In [None]:
# YOUR CODE HERE
states = c()
actions = c()

S = length(states)
A = length(actions)

transitions = array(0, c(S, A, S))

reward = array(0, c(S, A, S))


MDP = list(states=states, 
           actions=actions,
           transitions=transitions,
           rewards=reward,
           discount=0.8)

bellman(MDP)