In [4]:
import numpy as np
from tqdm import tqdm

def intuitive_policy(N, gamma, true_theta, alpha, beta):
    """
    Implements the intuitive policy for a two-armed bandit problem with discounted rewards.
    
    Args:
        N (int): Number of time steps
        gamma (float): Discount factor
        true_theta (np.ndarray): True probabilities for each arm
        alpha (list): Initial alpha parameters for Beta distribution
        beta (list): Initial beta parameters for Beta distribution
    
    Returns:
        float: Sum of discounted rewards
    """
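    rewards = np.zeros(N)
    # Copy the priors into float arrays so the caller's lists are never mutated
    # and the in-place posterior updates below are well defined.
    alpha = np.array(alpha, dtype=float)
    beta = np.array(beta, dtype=float)

    for t in range(N):
        # Greedy choice: pull the arm with the highest posterior mean.
        theta_estimate = alpha / (alpha + beta)
        chosen_arm = np.argmax(theta_estimate)

        # Bernoulli reward (1.0 or 0.0) drawn from the chosen arm.
        reward = float(np.random.rand() < true_theta[chosen_arm])
        rewards[t] = reward * (gamma ** t)

        # Conjugate Beta update for the chosen arm only.
        alpha[chosen_arm] += reward
        beta[chosen_arm] += 1 - reward

    return np.sum(rewards)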

num_trials = 200
N = 5000
gamma_list = [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
alpha_list = [[1, 1], [2, 1], [20, 1]]
beta_list = [[1, 1], [1, 1], [10, 1]]

for gamma in gamma_list:
    for alpha, beta in zip(alpha_list, beta_list):
        rewards = np.zeros(num_trials)
        regret_rate = np.zeros(num_trials)
        
        for i in tqdm(range(num_trials)):
            true_theta = np.random.rand(2)

            rewards[i] = intuitive_policy(N, gamma, true_theta, alpha, beta)
            max_value = np.max(true_theta) / (1 - gamma)
            regret_rate[i] = 1 - rewards[i] / max_value
            
        print(f"gamma: {gamma}, alpha: {alpha}, beta: {beta}, "
              f"avg_reward: {np.mean(rewards):.4f}, "
              f"avg_regret_rate: {np.mean(regret_rate):.4f}")

gamma: 0.1, alpha: [1, 1], beta: [1, 1], avg_reward: 0.5813, avg_regret_rate: 0.2141
gamma: 0.1, alpha: [2, 1], beta: [1, 1], avg_reward: 0.5480, avg_regret_rate: 0.2491
gamma: 0.1, alpha: [20, 1], beta: [10, 1], avg_reward: 0.5822, avg_regret_rate: 0.2060
gamma: 0.3, alpha: [1, 1], beta: [1, 1], avg_reward: 0.7785, avg_regret_rate: 0.1925
gamma: 0.3, alpha: [2, 1], beta: [1, 1], avg_reward: 0.7315, avg_regret_rate: 0.1923
gamma: 0.3, alpha: [20, 1], beta: [10, 1], avg_reward: 0.7297, avg_regret_rate: 0.2278
gamma: 0.5, alpha: [1, 1], beta: [1, 1], avg_reward: 1.0941, avg_regret_rate: 0.1806
gamma: 0.5, alpha: [2, 1], beta: [1, 1], avg_reward: 1.0408, avg_regret_rate: 0.2142
gamma: 0.5, alpha: [20, 1], beta: [10, 1], avg_reward: 0.9554, avg_regret_rate: 0.2816
gamma: 0.7, alpha: [1, 1], beta: [1, 1], avg_reward: 1.9450, avg_regret_rate: 0.0872
gamma: 0.7, alpha: [2, 1], beta: [1, 1], avg_reward: 1.7140, avg_regret_rate: 0.1933
gamma: 0.7, alpha: [20, 1], beta: [10, 1], avg_reward: 1.6554, avg_regret_rate: 0.2925
gamma: 0.9, alpha: [1, 1], beta: [1, 1], avg_reward: 6.0248, avg_regret_rate: 0.1107
gamma: 0.9, alpha: [2, 1], beta: [1, 1], avg_reward: 5.9431, avg_regret_rate: 0.1089
gamma: 0.9, alpha: [20, 1], beta: [10, 1], avg_reward: 5.0471, avg_regret_rate: 0.1986
gamma: 0.99, alpha: [1, 1], beta: [1, 1], avg_reward: 64.2234, avg_regret_rate: 0.0325
gamma: 0.99, alpha: [2, 1], beta: [1, 1], avg_reward: 64.7984, avg_regret_rate: 0.0525
gamma: 0.99, alpha: [20, 1], beta: [10, 1], avg_reward: 60.8028, avg_regret_rate: 0.1167

To evaluate the algorithm, we need a suitable metric. Raw regret is a natural choice, but it is not normalized, so its scale differs across settings (for example, a larger $\gamma$ leads to a larger regret). We therefore use the regret rate, which expresses regret as a fraction of the maximum expected reward.  
The regret rate is defined as follows:  

$\text{regret rate} = 1 - \frac{\text{Reward}}{\max_i \theta_i/(1-\gamma)}$

where $\text{Reward}$ is the realized sum of discounted rewards. The maximum expected reward is obtained by always pulling the arm with the highest true probability; in the discounted setting with an infinite horizon, it equals $\sum_{t=0}^{\infty} \gamma^t \max_i \theta_i = \frac{\max_i \theta_i}{1-\gamma}$.
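
The benchmark in the denominator can be written as a small helper. The sketch below is illustrative only: the names `discounted_benchmark` and `regret_rate` and the optional `horizon` argument are assumptions introduced here, not part of the notebook's code. With `horizon=None` it reproduces the infinite-horizon value used above; with an integer horizon $N$ it uses the truncated geometric sum $\max_i \theta_i \,(1-\gamma^N)/(1-\gamma)$.

```python
def discounted_benchmark(theta_max, gamma, horizon=None):
    """Expected discounted reward from always pulling the best arm.

    horizon=None -> infinite-horizon value theta_max / (1 - gamma)
    horizon=N    -> truncated value theta_max * (1 - gamma**N) / (1 - gamma)
    (hypothetical helper, introduced for illustration)
    """
    if horizon is None:
        return theta_max / (1.0 - gamma)
    return theta_max * (1.0 - gamma ** horizon) / (1.0 - gamma)


def regret_rate(reward, theta_max, gamma, horizon=None):
    """Regret as a fraction of the benchmark (hypothetical helper)."""
    return 1.0 - reward / discounted_benchmark(theta_max, gamma, horizon)


# Compare the two benchmarks for the best arm theta = 0.8 and gamma = 0.99.
print(discounted_benchmark(0.8, 0.99))       # infinite horizon
print(discounted_benchmark(0.8, 0.99, 100))  # truncated at 100 steps
```

For $N = 5000$ the factor $\gamma^N$ is negligible for every $\gamma$ in the grid, so the two benchmarks agree there. For a short run such as $N = 100$ with $\gamma = 0.99$, the factor $1 - \gamma^N \approx 0.63$ matters, and the infinite-horizon benchmark overstates what any policy can earn.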

The simulation results show that the intuitive policy performs reasonably well in most settings, with average regret rates between 0.03 and 0.30.  

| γ | Prior α = [arm 1, arm 2], β = [arm 1, arm 2] | Average Reward | Average Regret Rate |
|---|-------------|----------------|-------------------|
| 0.1 | [1,1], [1,1] | 0.5813 | 0.2141 |
| 0.1 | [2,1], [1,1] | 0.5480 | 0.2491 |
| 0.1 | [20,1], [10,1] | 0.5822 | 0.2060 |
| 0.3 | [1,1], [1,1] | 0.7785 | 0.1925 |
| 0.3 | [2,1], [1,1] | 0.7315 | 0.1923 |
| 0.3 | [20,1], [10,1] | 0.7297 | 0.2278 |
| 0.5 | [1,1], [1,1] | 1.0941 | 0.1806 |
| 0.5 | [2,1], [1,1] | 1.0408 | 0.2142 |
| 0.5 | [20,1], [10,1] | 0.9554 | 0.2816 |
| 0.7 | [1,1], [1,1] | 1.9450 | 0.0872 |
| 0.7 | [2,1], [1,1] | 1.7140 | 0.1933 |
| 0.7 | [20,1], [10,1] | 1.6554 | 0.2925 |
| 0.9 | [1,1], [1,1] | 6.0248 | 0.1107 |
| 0.9 | [2,1], [1,1] | 5.9431 | 0.1089 |
| 0.9 | [20,1], [10,1] | 5.0471 | 0.1986 |
| 0.99 | [1,1], [1,1] | 64.2234 | 0.0325 |
| 0.99 | [2,1], [1,1] | 64.7984 | 0.0525 |
| 0.99 | [20,1], [10,1] | 60.8028 | 0.1167 |

This can be attributed to several factors:

1. **Efficient Exploration**: Although the policy always pulls the arm with the highest posterior mean, Bayesian updating of the Beta distributions lowers the estimate of an arm that underperforms, which lets the policy switch to the other arm.

2. **Prior Knowledge Integration**: The Beta distribution parameters (α, β) allow incorporating prior knowledge about the arms, which helps guide initial exploration.

3. **Quick Convergence**: As more rewards are observed, the posterior of the arm being pulled concentrates around its true probability, so the policy usually settles on a good arm.
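
The Bayesian updating behind these points is the standard Beta-Bernoulli conjugate update: each success adds 1 to α and each failure adds 1 to β. The following minimal sketch (the helper names are assumptions made for illustration) shows how the posterior mean approaches the empirical success rate while the posterior variance shrinks as observations accumulate.

```python
def beta_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


# Start from the uniform prior Beta(1, 1) and observe a 70% success rate
# at three different sample sizes (illustrative numbers, not simulation output).
for successes, failures in [(7, 3), (70, 30), (700, 300)]:
    a, b = beta_update(1.0, 1.0, successes, failures)
    print(f"n={successes + failures}: mean={beta_mean(a, b):.4f}, "
          f"variance={beta_variance(a, b):.6f}")
```

Only the arm that is actually pulled receives updates, so an arm the policy never tries keeps its prior unchanged.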

Across the discount factors (γ) and prior parameters (α, β) tested, the regret rate is lowest for γ = 0.99 with the uniform prior (0.0325), and for every γ ≥ 0.3 it is highest under the most confident prior, α = [20, 1], β = [10, 1]. The policy is therefore fairly robust to these settings, but not uniformly so. As the counter-example below shows, there are specific scenarios where it is clearly suboptimal.


In [23]:
# Here's a counter-example with a strongly biased prior, but the second arm is actually better.
# We'll run fewer steps (N_short=100), so the policy doesn't have time to correct the prior bias.

test_runs = 1000
test_rewards = np.zeros(test_runs)
N_short = 100
gamma_close_to_1 = 0.99


counter_alpha = [200, 1]
counter_beta = [1, 1]
counter_true_theta = np.array([0.6, 0.8])  # second arm has higher probability

for i in range(test_runs):
    local_alpha = counter_alpha.copy()
    local_beta = counter_beta.copy()
    test_rewards[i] = intuitive_policy(N_short, gamma_close_to_1, counter_true_theta, local_alpha, local_beta)

avg_r = np.mean(test_rewards)
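# Infinite-horizon benchmark. With only N_short steps, the attainable maximum is smaller
# by a factor (1 - gamma_close_to_1 ** N_short), so the regret rate printed below
# overstates the regret relative to the finite-horizon optimum.
optimal_r = np.max(counter_true_theta) / (1 - gamma_close_to_1)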
print(f"Average reward: {avg_r:.2f}")
print(f"Maximum possible reward: {optimal_r:.2f}")
print(f"Regret rate: {1 - avg_r/optimal_r:.2f}")

Average reward: 37.93
Maximum possible reward: 80.00
Regret rate: 0.53


Given two arms with prior distributions:  
#### Arm 1  
Beta(200, 1), with prior mean $200/201 \approx 0.995$
#### Arm 2
Beta(1, 1), with prior mean exactly 0.5

A greedy strategy that always favors the arm with the higher posterior mean keeps selecting Arm 1. In the counter-example above, it does so even though Arm 2's true success probability (0.8) exceeds Arm 1's (0.6). This approach has a critical flaw: it ignores uncertainty. The prior for Arm 2 is highly uncertain, since Beta(1, 1) is the uniform distribution on [0, 1] and is therefore non-informative, whereas Beta(200, 1) is tightly concentrated near 1.
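
To see how strongly the Beta(200, 1) prior locks the greedy rule onto Arm 1, it helps to compute Arm 1's posterior mean in the worst case, where every pull fails. The sketch below uses hypothetical helper names and only the prior values from the counter-example; it is a back-of-the-envelope check, not part of the original simulation.

```python
def arm1_mean_after(failures, successes=0, alpha0=200.0, beta0=1.0):
    """Posterior mean of Arm 1 after the given outcomes (hypothetical helper)."""
    return (alpha0 + successes) / (alpha0 + beta0 + successes + failures)


def failures_until_switch(rival_mean=0.5, alpha0=200.0, beta0=1.0):
    """Smallest number of consecutive failures on Arm 1 that pushes its
    posterior mean strictly below the rival arm's mean."""
    failures = 0
    while arm1_mean_after(failures, 0, alpha0, beta0) >= rival_mean:
        failures += 1
    return failures


arm2_prior_mean = 1.0 / (1.0 + 1.0)            # Beta(1, 1)
worst_case = arm1_mean_after(failures=100)     # 100 failures in a row
print(f"Arm 1 mean after 100 straight failures: {worst_case:.4f}")
print(f"Arm 2 prior mean: {arm2_prior_mean:.4f}")
print(f"Consecutive failures needed to switch: {failures_until_switch()}")
```

Even in this worst case, Arm 1's posterior mean stays above Arm 2's prior mean within 100 steps, so the greedy rule cannot reach Arm 2 during the short run regardless of the observed rewards.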

By selecting Arm 2 more frequently, we can reduce this uncertainty and potentially uncover a true value for Arm 2 that exceeds that of Arm 1.

Focusing exclusively on Arm 1 due to its higher initial expected value neglects the possibility that Arm 2 could ultimately provide greater rewards once more data is collected. Failing to explore Arm 2 adequately risks missing out on higher returns that could arise if its true value is found to be higher than initially estimated.

When priors differ significantly in terms of uncertainty, a strategy that relies solely on expected values can lead to consistently selecting a suboptimal arm. It is crucial to strike a balance between exploiting known information and exploring uncertain but potentially more rewarding alternatives.
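
One common way to strike that balance is Thompson sampling, which draws a random value from each arm's posterior and pulls the arm with the largest draw, so that uncertain arms are occasionally tried. The sketch below is a swapped-in illustration, not the policy studied in this notebook: the function name `thompson_policy`, the seed, and the horizon of 2000 steps are all assumptions chosen for the example.

```python
import numpy as np


def thompson_policy(N, gamma, true_theta, alpha, beta, rng):
    """Thompson sampling for a Bernoulli bandit with discounted rewards.

    Returns the discounted reward sum and the number of pulls per arm.
    (Illustrative sketch; not the notebook's intuitive_policy.)
    """
    alpha = np.array(alpha, dtype=float)  # copies, so the caller's priors are untouched
    beta = np.array(beta, dtype=float)
    pulls = np.zeros(len(alpha), dtype=int)
    total = 0.0

    for t in range(N):
        # One posterior draw per arm; wide posteriors sometimes produce high draws.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))

        reward = float(rng.random() < true_theta[arm])
        total += reward * gamma ** t

        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        pulls[arm] += 1

    return total, pulls


rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
total, pulls = thompson_policy(2000, 0.99, np.array([0.6, 0.8]),
                               [200, 1], [1, 1], rng)
print(f"discounted reward: {total:.2f}, pulls per arm: {pulls}")
```

Because Arm 2's uniform posterior can produce draws above Arm 1's, this rule does try Arm 2 and can discover that it is better. Whether that improves the discounted reward depends on γ and the horizon, since early exploration is costly when rewards are discounted heavily.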