# DX 704 Week 3 Project

This week's project will give you practice with optimizing choices for bandit algorithms.
You will be given access to the bandit problem via a blackbox object, and you will investigate the bandit rewards to pick a suitable algorithm.

The full project description, a template notebook and supporting code are available on GitHub: [Project 3 Materials](https://github.com/bu-cds-dx704/dx704-project-03).


## Example Code

You may find it helpful to refer to these GitHub repositories of Jupyter notebooks for example code.

* https://github.com/bu-cds-omds/dx601-examples
* https://github.com/bu-cds-omds/dx602-examples
* https://github.com/bu-cds-omds/dx603-examples
* https://github.com/bu-cds-omds/dx704-examples

Any calculations demonstrated in code examples or videos may be found in these notebooks, and you are allowed to copy this example code in your homework answers.

## Part 1: Pick a Bandit Algorithm

Experiment with the multi-armed bandit interface using seed 0 to learn about the distribution of rewards and decide what kind of bandit algorithm will be appropriate.
A histogram will likely be helpful.
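
One way to carry out this investigation, sketched under stated assumptions, is to draw a fixed number of rewards from each arm and summarize them with `np.histogram`. The helper names `sample_rewards` and `summarize_arm`, the sample size, and the stand-in `ToyBandit` class are hypothetical and exist only so the sketch runs on its own; in the notebook you would pass a `BanditProblem(0)` instance instead.

```python
import numpy as np


class ToyBandit:
    """Hypothetical stand-in exposing the same two methods as BanditProblem."""

    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.means = [0.5, 1.0]

    def get_num_arms(self):
        return len(self.means)

    def get_reward(self, arm):
        # Exponential rewards, chosen purely for illustration.
        return float(self.rng.exponential(self.means[arm]))


def sample_rewards(bandit, pulls_per_arm):
    """Return an array of shape (num_arms, pulls_per_arm) of observed rewards."""
    num_arms = bandit.get_num_arms()
    return np.array(
        [[bandit.get_reward(arm) for _ in range(pulls_per_arm)] for arm in range(num_arms)]
    )


def summarize_arm(arm_rewards, bins=10):
    """Summarize one arm: range, mean, share of zero rewards, and histogram counts."""
    counts, edges = np.histogram(arm_rewards, bins=bins)
    return {
        "min": float(arm_rewards.min()),
        "mean": float(arm_rewards.mean()),
        "max": float(arm_rewards.max()),
        "frac_zero": float(np.mean(arm_rewards == 0)),
        "hist_counts": counts,
        "hist_edges": edges,
    }


samples = sample_rewards(ToyBandit(seed=0), pulls_per_arm=500)
summaries = [summarize_arm(arm_rewards) for arm_rewards in samples]
for arm, s in enumerate(summaries):
    print(arm, round(s["min"], 3), round(s["mean"], 3), round(s["max"], 3), s["frac_zero"])
```

Two things are worth reading off such a summary: whether rewards are binary or continuous, and whether they stay within [0, 1]. Note that every call to `get_reward` advances the bandit's internal random state, so it is probably safest to explore on a separate instance from the one used for the recorded run.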

In [1]:
# DO NOT CHANGE

import numpy as np

class BanditProblem(object):
    def __init__(self, seed):
        self.seed = seed
        self.rng = np.random.default_rng(seed)

        self.num_arms = 3
        self.ns = self.rng.integers(low=1, high=10, size=self.num_arms)
        self.ps = self.rng.uniform(low=0.2, high=0.4, size=self.num_arms)

    def get_num_arms(self):
        return self.num_arms

    def get_reward(self, arm):
        if arm < 0 or arm >= self.num_arms:
            raise ValueError("Invalid arm")

        x = self.rng.uniform()
        x *= self.rng.binomial(self.ns[arm], self.ps[arm])

        return x


In [2]:
bandit0 = BanditProblem(0)

In [3]:
bandit0.get_num_arms()

3

In [4]:
bandit0.get_reward(arm=0)

1.8255111545554434

In [23]:
# YOUR CHANGES HERE


Based on your investigation, pick an appropriate bandit algorithm to implement from the algorithms covered this week.
Write a file "algorithm-choice.txt" that states your choice and gives a few sentences justifying your choice and rejecting the alternatives.
Keep your explanation concise; overly verbose responses will be penalized.

## Part 2: Implement Bandit

Based on your decision, implement an appropriate bandit algorithm and pick 1000 actions using seed 2025002.

In [5]:
# YOUR CHANGES HERE

# Assumes BanditProblem from the cell above is already defined.
import pandas as pd
import numpy as np
import scipy.stats

# environment
bandit0 = BanditProblem(0)

K = bandit0.get_num_arms()
T_max = 1000

# per-arm pull and success counts for a Beta-Bernoulli posterior
trials = [0 for _ in range(K)]
successes = [0 for _ in range(K)]
samples_history = []

actions = []
rewards = []

# generator for the posterior draws (separate from the bandit's own generator)
rng = np.random.default_rng(2025002)

for i in range(T_max):
    # draw one sample per arm from its Beta posterior
    samples = [
        scipy.stats.beta.rvs(
            1 + successes[j],
            1 + trials[j] - successes[j],
            random_state=rng
        )
        for j in range(K)
    ]
    samples_history.append(samples)

    # play the arm with the largest posterior sample
    j = int(np.argmax(samples))
    r = bandit0.get_reward(j)

    # a "success" is any nonzero reward, so the reward magnitude is ignored here
    trials[j] += 1
    successes[j] += int(r > 0)

    actions.append(j)
    rewards.append(r)

samples_history = np.array(samples_history)

In [41]:
rewards

[1.8255111545554434,
 1.4589931219679968,
 2.8052172713633046,
 0.008215500510444285,
 0.06717115061092871,
 0.526966861807677,
 0.5414612202490917,
 0.0,
 0.2485665529991279,
 1.2943790231485002,
 1.9183877713094173,
 1.9616706775524602,
 1.3009185525356326,
 0.0,
 1.4429766803881634,
 0.6204837511179113,
 2.6684635030470005,
 0.7155903934181405,
 0.6437387821518843,
 0.33791122550713326,
 0.8902743520047923,
 0.0,
 2.4979324429601935,
 0.7181083289788565,
 0.05856803480519435,
 0.15027946689483906,
 0.7963242702872942,
 0.05202130106440961,
 0.0,
 0.5803323859868507,
 0.6719948779563594,
 0.9421131105064978,
 0.21099055914045906,
 0.9271545530678674,
 1.9091809873814745,
 0.8504572496981511,
 3.9803860209412965,
 0.9200902786181921,
 0.9948453909752379,
 0.7857857007138075,
 1.4689671435774587,
 0.0,
 2.187045351228928,
 0.0,
 3.454560360982303,
 0.0,
 2.9178864414688648,
 1.6447476550861408,
 0.6971187589179115,
 0.9235301597834695,
 0.5389344076221869,
 0.0,
 1.4640123913131216,
 0

In [28]:
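import math

# Seeds NumPy's legacy global RNG. BanditProblem draws from its own Generator,
# so this call does not affect the rewards returned below.
np.random.seed(2025002)

# Environment
bandit0 = BanditProblem(0)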

# Settings
K = bandit0.get_num_arms()
T_max = 1000

arm_counts = [0 for _ in range(K)]
arm_totals = [0.0 for _ in range(K)]
arm_averages = [None for _ in range(K)]
chosen_averages = []

# Track lists of actions and rewards
ucb1_actions = []
ucb1_rewards = []

# Function to sample the reward
def sample_reward(j):
    r = float(bandit0.get_reward(j))

    chosen_averages.append(arm_averages[j])

    # record the action and reward
    ucb1_actions.append(int(j))
    ucb1_rewards.append(r)

    # update stats only after recording
    arm_counts[j] += 1
    arm_totals[j] += r
    arm_averages[j] = arm_totals[j] / arm_counts[j]

    # also append to the actions/rewards lists created in the earlier cell
    actions.append(int(j))
    rewards.append(r)
    return r

# sample each arm once
for t in range(K):
    sample_reward(t)

arm_averages_history = []
ucb1_bounds_history = []

# sample using UCB1
for t in range(K, T_max):
    # calculate the UCB1 bound for each arm
    ucb1_bounds = [
        arm_averages[j] + math.sqrt(2 * math.log(t) / arm_counts[j]) 
        for j in range(K)
    ]

    arm_averages_history.append(arm_averages.copy())
    ucb1_bounds_history.append(ucb1_bounds)

    # pick and sample the arm with the highest bound
    j = int(np.argmax(ucb1_bounds))
    sample_reward(j)

arm_averages_history = np.array(arm_averages_history)
ucb1_bounds_history = np.array(ucb1_bounds_history)
chosen_averages = np.array(chosen_averages)

In [29]:
# tidy table with chosen action and observed reward each step
ucb1_runs = pd.DataFrame({"action": ucb1_actions, "reward": ucb1_rewards})

In [30]:
ucb1_runs.head()

|   | action | reward |
|---|--------|----------|
| 0 | 0 | 1.825511 |
| 1 | 1 | 0.729497 |
| 2 | 2 | 2.805217 |
| 3 | 2 | 0.008216 |
| 4 | 0 | 0.067171 |


Write a file "history.tsv" with columns action and reward in the order that the actions were taken.

In [9]:
# YOUR CHANGES HERE

# Output the history dataframe as history.tsv
ucb1_runs.to_csv("history.tsv", sep="\t", index=False)

Submit "history.tsv" in Gradescope.

## Part 3: Action Statistics

Based on the data from Part 2, estimate the expected reward for each arm and write a file "actions.tsv" with the columns action, min_reward, mean_reward, max_reward.
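
A minimal sketch of this aggregation, assuming the Part 2 history is available as a pandas DataFrame with `action` and `reward` columns. The `history` values below are made up for illustration, and the names `history`, `action_stats`, and `tsv_text` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the Part 2 history; in the notebook, use the recorded run.
history = pd.DataFrame(
    {
        "action": [0, 1, 2, 2, 0, 2],
        "reward": [1.5, 0.0, 2.0, 0.5, 0.5, 3.5],
    }
)

# One row per arm with the minimum, mean, and maximum observed reward.
action_stats = (
    history.groupby("action")["reward"]
    .agg(min_reward="min", mean_reward="mean", max_reward="max")
    .reset_index()
)

# With no path argument, to_csv returns the text instead of writing a file.
tsv_text = action_stats.to_csv(sep="\t", index=False)
print(tsv_text)
```

Passing a filename such as `"actions.tsv"` as the first argument of `to_csv` writes the file instead of returning a string. An arm that was never pulled would have no row in this table, so it may be worth checking that every arm appears.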

In [None]:
# YOUR CHANGES HERE

...

Submit "actions.tsv" in Gradescope.

## Part 4: Regret Estimates

Calculate the expected regret taking 1000 actions with the following strategies.

* uniform: Pick an arm uniformly at random.
* just-i: Always pick arm $i$. Do this for $i=0$ to $K-1$ where $K$ is the number of arms.
* actual: This should match your output in Part 2.
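
The three strategies above can be compared with a short calculation, sketched here under assumptions. Because `get_reward` multiplies a Uniform(0, 1) draw by an independent Binomial(n, p) draw, the expected reward of arm $i$ is $0.5 \cdot n_i \cdot p_i$, and the expected regret of a sequence of actions is the sum of the gaps between the best arm's mean and the mean of each arm played. The function name `expected_regrets` and the parameter values below are hypothetical; whether the means should come from the bandit's parameters or from the Part 3 estimates is left to the project description.

```python
import numpy as np


def expected_regrets(mean_rewards, actual_actions, horizon=1000):
    """Expected regret over `horizon` pulls for uniform, just-i, and a recorded run."""
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    # Expected regret of a single pull of each arm.
    gaps = mean_rewards.max() - mean_rewards

    regrets = {"uniform": horizon * gaps.mean()}
    for i, gap in enumerate(gaps):
        regrets[f"just-{i}"] = horizon * gap
    # For the recorded run, add up the gap of every arm actually played.
    regrets["actual"] = gaps[np.asarray(actual_actions, dtype=int)].sum()
    return {name: float(value) for name, value in regrets.items()}


# Hypothetical parameters, not the ones inside any particular BanditProblem.
ns = np.array([4, 8, 2])
ps = np.array([0.25, 0.25, 0.5])
mean_rewards = 0.5 * ns * ps

# Hypothetical recorded run: two exploratory pulls, then the best arm.
actual_actions = [0, 2] + [1] * 998
regrets = expected_regrets(mean_rewards, actual_actions)
for name, value in regrets.items():
    print(name, round(value, 3))
```

The resulting dictionary maps directly onto the strategy and regret columns, one row per key.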

In [None]:
# YOUR CHANGES HERE

...

Write your results to a file "strategies.tsv" with the columns strategy and regret.

In [None]:
# YOUR CHANGES HERE

...

Submit "strategies.tsv" in Gradescope.

## Part 5: Acknowledgments

Make a file "acknowledgments.txt" documenting any outside sources or help on this project.
If you discussed this assignment with anyone, please acknowledge them here.
If you used any libraries not mentioned in this module's content, please list them with a brief explanation of what you used them for.
If you used any generative AI tools, please add links to your transcripts below, and any other information that you feel is necessary to comply with the generative AI policy.
If no acknowledgments are appropriate, just write "none" in the file.


Submit "acknowledgments.txt" in Gradescope.

## Part 6: Code

Please submit a Jupyter notebook that can reproduce all your calculations and recreate the previously submitted files.

Submit "project.ipynb" in Gradescope.