<div style="background-color: #eaf7fd; padding: 20px; border-radius: 10px; margin-bottom: 20px; box-shadow: 0px 4px 15px rgba(0, 0, 0, 0.1);">

### MAB Agent Problem Formulation

In the Multi-Armed Bandit (MAB) problem, the agent is confronted with a situation where there are multiple actions or "arms" to choose from, and each arm is associated with a specific reward. The overarching challenge is to maximize the cumulative reward over a series of trials or time steps.

#### Limited Information

Initially, the agent operates with limited information about the true reward distribution for each action, introducing an element of uncertainty.

#### Exploration and Exploitation Trade-off

The agent faces a critical trade-off between exploration and exploitation:

- **<span style="color: green; font-weight: bold;">Exploration:</span>**
  Trying different actions to learn their effectiveness and gather more information about their potential rewards.

- **<span style="color: red; font-weight: bold;">Exploitation:</span>**
  Choosing the action that appears to be most rewarding based on the current knowledge.

#### Balancing Strategy

The ultimate goal is to develop a strategy that effectively balances exploration and exploitation. This strategic balance is crucial for achieving the highest possible cumulative reward over time.

</div>

In [1]:
import pandas as pd
import numpy as np

# Function to simulate ε-Greedy action
def epsilon_greedy_action(Q_values, epsilon):
    if np.random.rand() < epsilon:
        # Exploration: Randomly choose an action
        return np.random.choice(len(Q_values))
    else:
        # Exploitation: Choose the action with the highest Q-value
        return np.argmax(Q_values)

# Function to simulate Upper-Confidence-Bound (UCB) action
def ucb_action(Q_values, c, counts, total_steps):
    # Calculate the Upper Confidence Bound for each action
    ucb_values = Q_values + c * np.sqrt(np.log(total_steps) / (counts + 1e-6))
    # Choose the action with the highest UCB value
    return np.argmax(ucb_values)

# Function to simulate the MAB problem
def run_multi_armed_bandit(ads_clicks, epsilon, c, total_steps):
    num_ads = len(ads_clicks.columns)
    
    # Initialize action values and counts
    Q_values = np.zeros(num_ads)
    action_counts = np.zeros(num_ads)
    
    total_rewards = 0
    
    for step in range(total_steps):
        # Choose action using ε-Greedy or UCB strategy
        if np.random.rand() < 0.5:
            action = epsilon_greedy_action(Q_values, epsilon)
        else:
            action = ucb_action(Q_values, c, action_counts, step + 1)
        
        # Get reward from the chosen action
        reward = ads_clicks.iloc[step, action]
        
        # Update action values and counts
        action_counts[action] += 1
        Q_values[action] += (reward - Q_values[action]) / action_counts[action]
        
        # Update total rewards
        total_rewards += reward
    
    return total_rewards

# Load the Ads_Clicks dataset from CSV
ads_clicks = pd.read_csv("Ads_Clicks.csv")

# Set parameters
epsilon_values = [0.01, 0.3]
c_value = 1.5
total_steps = 2000

# Function to run multiple simulations and compute average rewards
def run_multiple_simulations(ads_clicks, epsilon, c, total_steps, num_simulations):
    total_rewards_list = []
    for _ in range(num_simulations):
        total_rewards = run_multi_armed_bandit(ads_clicks, epsilon, c, total_steps)
        total_rewards_list.append(total_rewards)
    return np.mean(total_rewards_list)

# Set additional parameter
num_simulations = 10

# Run simulations for ε-Greedy with different ε values
for epsilon in epsilon_values:
    avg_total_rewards_epsilon = run_multiple_simulations(ads_clicks, epsilon, c_value, total_steps, num_simulations)
    print(f"Average Total Rewards for ε-Greedy (ε={epsilon}): {avg_total_rewards_epsilon}")

# Run simulation for UCB
avg_total_rewards_ucb = run_multiple_simulations(ads_clicks, epsilon_values[0], c_value, total_steps, num_simulations)
print(f"Average Total Rewards for UCB (c={c_value}): {avg_total_rewards_ucb}")

Average Total Rewards for ε-Greedy (ε=0.01): 390.2
Average Total Rewards for ε-Greedy (ε=0.3): 346.9
Average Total Rewards for UCB (c=1.5): 376.0


<div style="background-color: #f5f5f5; padding: 20px; border-radius: 10px; margin-bottom: 20px; box-shadow: 0px 2px 10px rgba(0, 0, 0, 0.1);">

### Comparison of Action Value Estimates to Optimal Action

- **<span style="color: #007BFF; font-weight: bold;">Estimation of Action Value:</span>**
  For both the ε-greedy and UCB approaches, the action value serves as an estimate of how good each action is based on the data collected during the experiment.

- **<span style="color: #007BFF; font-weight: bold;">Optimal Action:</span>**
  The optimal action is the one with the highest true expected reward.

</div>

<div style="background-color: #eaf7fd; padding: 15px; border-radius: 10px; margin-bottom: 20px; box-shadow: 0px 2px 10px rgba(0, 0, 0, 0.1);">

### ε-Greedy

- **<span style="color: #28a745; font-weight: bold;">When ε=0.01:</span>**
  - With a very low exploration rate (ε=0.01), the agent primarily exploits the action that seems to have the highest estimated reward.
  - The action value estimate tends to be closer to the optimal action.
  - Average Total Rewards: 390.2

- **<span style="color: #28a745; font-weight: bold;">When ε=0.3:</span>**
  - With a higher exploration rate (ε=0.3), the agent explores more, occasionally choosing suboptimal actions.
  - The action value estimate might deviate from the optimal action more frequently.
  - Average Total Rewards: 346.9

</div>

<div style="background-color: #eaf7fd; padding: 15px; border-radius: 10px; margin-bottom: 20px; box-shadow: 0px 2px 10px rgba(0, 0, 0, 0.1);">

### UCB (c=1.5)

- **<span style="color: #dc3545; font-weight: bold;">Incorporating Uncertainty:</span>**
  - UCB incorporates uncertainty in its action selection.
  - The action value estimate is influenced by both the estimated reward and the uncertainty term.

- **<span style="color: #dc3545; font-weight: bold;">With Confidence Parameter c=1.5:</span>**
  - With a higher confidence parameter (c=1.5), the agent tends to explore cautiously, giving more weight to actions that are less explored but have the potential for higher rewards.
  - The action value estimate could be more stable and closer to the optimal action.
  - Average Total Rewards: 376.0

</div>

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 10px; box-shadow: 0px 2px 10px rgba(0, 0, 0, 0.1);">

### Summary

- **<span style="color: #007BFF; font-weight: bold;">ε-Greedy Approach:</span>**
  Trades off exploration and exploitation, and the action value estimate can vary based on the exploration rate.

- **<span style="color: #007BFF; font-weight: bold;">UCB Approach:</span>**
  Considers uncertainty explicitly, and the action value estimate tends to be more influenced by both the estimated reward and uncertainty.
  Provides a potentially more balanced exploration-exploitation strategy.

- **<span style="color: #007BFF; font-weight: bold;">Performance Dependency:</span>**
  The actual performance depends on the specific problem and parameter values.

</div>