## Name : Adwaiy Singh
## Reg. No. : 220968424
## Sec. : A
## Batch: 2
## WEEK 5

### MULTI-ARMED BANDITS – AD OPTIMIZATION

Consider the dataset **"Ads_Clicks,"** which contains information about user interactions with advertisements over time. An advertising company is running **18 different ads** on a webpage, all targeted toward a similar audience. The dataset records whether a user clicked at a given time step. Each column corresponds to a specific ad, where **YES(1) indicates that the ad was clicked, and NO(0) indicates that it was not.**

##### Consider the attached csv fi each adle.
    1.	Define the multi-armed bandit (MAB) problem in the context of ad optimization, considering how an agent selects among multiple ads to maximize clicks.
    2.	How does the exploration-exploitation trade-off influence decision-making in this scenario?
    3.	Implement the ε-greedy algorithm to optimize ad selection and compute the total rewards after 2000-time steps for:ε = 0.05 and ε = 0.2
    4.	Compare the effect of different ε values on total rewards and action selection.
    5.	Implement the UCB method with an exploration factor c = 2.0 and compute total rewards after 2000-time steps.
    6.	How does increasing or decreasing the exploration factor c affect the performance?
    7.	Analyze how the estimated action values (Q-values) compare to the actual optimal action in both ε-greedy and UCB methods.
    8.	Which approach leads to a better approximation of the optimal action?
    9.	Evaluate how the performance of ε-greedy and UCB changes when the time horizon is extended to 5000-time steps instead of 2000-time steps.
    10.	Does a longer time horizon reduce the impact of exploration parameters (ε or c) on total rewards?


#### Imports

In [1]:
import pandas as pd
import numpy as np
import random
import math

#### Loading data

In [2]:
clickedAdsDf = pd.read_csv("Clicked Ads Dataset.csv")
clickedAdsDf = clickedAdsDf.iloc[:,1:]

clickedAdsDf.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Timestamp,Clicked on Ad,city,province,category
0,68.95,35,432837300.0,256.09,Perempuan,3/27/2016 0:53,No,Jakarta Timur,Daerah Khusus Ibukota Jakarta,Furniture
1,80.23,31,479092950.0,193.77,Laki-Laki,4/4/2016 1:39,No,Denpasar,Bali,Food
2,69.47,26,418501580.0,236.5,Perempuan,3/13/2016 20:35,No,Surabaya,Jawa Timur,Electronic
3,74.15,29,383643260.0,245.89,Laki-Laki,1/10/2016 2:31,No,Batam,Kepulauan Riau,House
4,68.37,35,517229930.0,225.58,Perempuan,6/3/2016 3:36,No,Medan,Sumatra Utara,Finance


#### Preprocessing data

In [3]:
clickedAdsDf.isnull().sum()

Daily Time Spent on Site    13
Age                          0
Area Income                 13
Daily Internet Usage        11
Male                         3
Timestamp                    0
Clicked on Ad                0
city                         0
province                     0
category                     0
dtype: int64

In [4]:
clickedAdsDf['Daily Time Spent on Site'] = clickedAdsDf['Daily Time Spent on Site'].fillna(clickedAdsDf['Daily Time Spent on Site'].mean())
clickedAdsDf['Area Income'] = clickedAdsDf['Area Income'].fillna(clickedAdsDf['Area Income'].mean())
clickedAdsDf['Daily Internet Usage'] = clickedAdsDf['Daily Internet Usage'].fillna(clickedAdsDf['Daily Internet Usage'].mean())

clickedAdsDf['Male'] = clickedAdsDf['Male'].fillna(clickedAdsDf['Male'].mode()[0])
clickedAdsDf['Clicked on Ad'] = clickedAdsDf['Clicked on Ad'].map({'No': 0, 'Yes': 1})

### 1.	Define the multi-armed bandit (MAB) problem in the context of ad optimization, considering how an agent selects among multiple ads to maximize clicks.

#### Ans.

There are **10 different ads** being displayed to users, and each ad can either be clicked or not clicked. The goal is to select the best-performing ads, those that are most likely to generate a click, in order to **maximize the total number of clicks** over time.

Each ad corresponds to an "arm" of the bandit, and the agent must select one ad (arm) at each time step. The agent will receive feedback on whether the selected ad was clicked (yes = 1) or not clicked (no = 0).

##### The agent’s task is to balance between:
    Exploration: Trying out different ads to gather more information about their click-through rates (CTRs) and performance.
    Exploitation: Selecting the ads that have already shown a higher likelihood of getting clicked, thereby maximizing the total number of clicks in the long run.

### 2. How does the exploration-exploitation trade-off influence decision-making in this scenario?

#### Ans.

Exploration involves trying out different ads to gather more data about their click-through rates (CTR). Even if an ad has not performed well in the past, exploring it can provide valuable insights that might lead to discovering a more effective ad in the future. This is important for discovering new ads that might perform better over time.

Exploitation involves focusing on ads that have already been shown to perform well (i.e., those with a higher CTR). By exploiting these ads, the agent maximizes immediate clicks, since it is relying on the ads that have historically yielded the best results.

##### The trade-off arises because:

    If the agent explores too much, it may waste resources on ads that aren't performing well, leading to fewer total clicks.
    If the agent exploits too much, it may miss out on discovering better-performing ads, leading to stagnation in click performance.

### 3. Implement the ε-greedy algorithm to optimize ad selection and compute the total rewards after 2000-time steps for: ε = 0.05 and ε = 0.2

In [5]:
def epsilon_greedy(epsilon, time_steps, df):
    Q = [0] * n_ads
    N = [0] * n_ads 
    total_reward = 0

    for t in range(time_steps):
        if random.random() < epsilon:
            ad_index = random.randint(0, n_ads - 1)
            chosen_category = ad_categories[ad_index]
        else:
            maxQ = max(Q)
            ad_indices = [i for i, value in enumerate(Q) if value == maxQ]
            ad_index = random.choice(ad_indices)
            chosen_category = ad_categories[ad_index]

        category_rows = df[df['category'] == chosen_category]
        if not category_rows.empty:
            reward = category_rows['Clicked on Ad'].iloc[t % len(category_rows)]
        else:
            reward = 0 
            print(f"Warning: No data found for category {chosen_category}.  Setting reward to 0.")

        N[ad_index] += 1
        Q[ad_index] += (1 / N[ad_index]) * (reward - Q[ad_index])

    return total_reward, Q, N

In [6]:
ad_categories = clickedAdsDf['category'].unique().tolist()
n_ads = len(ad_categories)
print(f"Identified {n_ads} ad categories:\n{ad_categories}")

Identified 10 ad categories:
['Furniture', 'Food', 'Electronic', 'House', 'Finance', 'Travel', 'Health', 'Bank', 'Fashion', 'Otomotif']


In [7]:
time_steps = 2000
epsilon_05_reward, epsilon_05_Q, epsilon_05_N = epsilon_greedy(0.05, time_steps, clickedAdsDf)
epsilon_2_reward, epsilon_2_Q, epsilon_2_N = epsilon_greedy(0.2, time_steps, clickedAdsDf)

print(f"Epsilon-Greedy (ε=0.05, {time_steps} steps): Total Reward = {epsilon_05_reward}")
print(f"Epsilon-Greedy (ε=0.2, {time_steps} steps): Total Reward = {epsilon_2_reward}\n")

print(f"Epsilon-Greedy (ε=0.05): Action Counts = {epsilon_05_N}") 
print(f"Epsilon-Greedy (ε=0.2): Action Counts = {epsilon_2_N}")

Epsilon-Greedy (ε=0.05, 2000 steps): Total Reward = 1119
Epsilon-Greedy (ε=0.2, 2000 steps): Total Reward = 1066

Epsilon-Greedy (ε=0.05): Action Counts = [57, 37, 24, 11, 1698, 14, 128, 12, 13, 6]
Epsilon-Greedy (ε=0.2): Action Counts = [44, 37, 86, 53, 42, 49, 40, 33, 1528, 88]


### 4. Compare the effect of different ε values on total rewards and action selection.

A lower **ε (0.05)** leads to more exploitation, resulting in a higher **total reward (1119)** by focusing on the best-performing ad (action 4, chosen 1578 times). A higher **ε (0.2)** increases exploration, leading to a lower **total reward (1066)** with more balanced action selection (action 8 chosen 1504 times).
##### Thus, lower ε favors short-term gains, while higher ε allows better adaptation in dynamic environments.

### 5. Implement the UCB method with an exploration factor c = 2.0 and compute total rewards after 2000-time steps.

In [8]:
def ucb(c, time_steps, df):
    Q = [0] * n_ads
    N = [0] * n_ads
    total_reward = 0

    for t in range(time_steps):
        ucb_values = [Q[a] + c * math.sqrt(math.log(t + 1) / (N[a] + 1e-6)) for a in range(n_ads)]

        ad_index = ucb_values.index(max(ucb_values))
        ad_indices = [i for i, value in enumerate(ucb_values) if value == ucb_values[ad_index]]
        ad_index = random.choice(ad_indices)
        chosen_category = ad_categories[ad_index]

        category_rows = df[df['category'] == chosen_category]
        if not category_rows.empty:
            reward = category_rows['Clicked on Ad'].iloc[t % len(category_rows)]
        else:
            reward = 0 
            print(f"Warning: No data found for category {chosen_category}. Setting reward to 0.")

        # Update
        N[ad_index] += 1
        Q[ad_index] += (1 / N[ad_index]) * (reward - Q[ad_index])
        total_reward += reward

    return total_reward, Q, N

In [9]:
time_steps = 2000
ucb_reward, ucb_Q, ucb_N = ucb(2.0, time_steps, clickedAdsDf)

print(f"UCB (c=2.0, {time_steps} steps): Total Reward = {ucb_reward}")
print(f"UCB (c=2.0): Action Counts = {ucb_N}")

UCB (c=2.0, 2000 steps): Total Reward = 1015
UCB (c=2.0): Action Counts = [291, 189, 139, 183, 307, 139, 139, 148, 240, 225]


### 6. How does increasing or decreasing the exploration factor c affect the performance?

In [10]:
exploration_factors = [0.1, 0.25, 0.5, 1.0, 3.0, 5.0, 10.0]

for c in exploration_factors:
    ucb_reward, ucb_Q, ucb_N = ucb(c, time_steps, clickedAdsDf)
    print(f"UCB (c={c}, {time_steps} steps): Total Reward = {ucb_reward}")
    print(f"UCB (c={c}): Action Counts = {ucb_N}\n")

UCB (c=0.1, 2000 steps): Total Reward = 1011
UCB (c=0.1): Action Counts = [3, 1, 1434, 1, 1, 1, 1, 1, 556, 1]

UCB (c=0.25, 2000 steps): Total Reward = 1097
UCB (c=0.25): Action Counts = [24, 86, 123, 68, 1622, 2, 12, 26, 30, 7]

UCB (c=0.5, 2000 steps): Total Reward = 1060
UCB (c=0.5): Action Counts = [149, 122, 57, 107, 1175, 33, 70, 16, 171, 100]

UCB (c=1.0, 2000 steps): Total Reward = 1004
UCB (c=1.0): Action Counts = [156, 200, 195, 151, 421, 89, 121, 70, 373, 224]

UCB (c=3.0, 2000 steps): Total Reward = 992
UCB (c=3.0): Action Counts = [218, 180, 215, 174, 262, 200, 190, 182, 210, 169]

UCB (c=5.0, 2000 steps): Total Reward = 1039
UCB (c=5.0): Action Counts = [234, 201, 193, 183, 239, 176, 179, 150, 253, 192]

UCB (c=10.0, 2000 steps): Total Reward = 999
UCB (c=10.0): Action Counts = [192, 202, 207, 211, 216, 189, 188, 182, 209, 204]



Lower values of c **(e.g. c= 0.1, 0.25)** prioritize exploitation, leading to **higher total rewards** but heavily favoring a single ad category, while higher values **(e.g.,  c = 3.0 , 5.0 , 10.0)** encourage exploration, resulting in a more balanced selection but **slightly lower rewards.** The best trade-off appears around c = 0.1 to c = 1.0, where the algorithm still exploits high-reward ads while maintaining some exploration to avoid suboptimal choices.

### 7. Analyze how the estimated action values (Q-values) compare to the actual optimal action in both ε-greedy and UCB methods

In [11]:
true_ctr = []
for ad_category in ad_categories:
    true_ctr.append(clickedAdsDf[clickedAdsDf['category'] == ad_category]['Clicked on Ad'].mean())

results_df = pd.DataFrame({
    "True CTR": true_ctr,
    "Epsilon-Greedy (ε=0.05)": epsilon_05_Q,
    "Epsilon-Greedy (ε=0.2)": epsilon_2_Q,
    "UCB (c=2.0)": ucb_Q
}, index=ad_categories)

results_df

Unnamed: 0,True CTR,Epsilon-Greedy (ε=0.05),Epsilon-Greedy (ε=0.2),UCB (c=2.0)
Furniture,0.459184,0.491228,0.295455,0.458333
Food,0.494949,0.513514,0.486486,0.50495
Electronic,0.494845,0.458333,0.546512,0.531401
House,0.522936,0.363636,0.54717,0.549763
Finance,0.571429,0.575383,0.5,0.574074
Travel,0.479592,0.357143,0.510204,0.444444
Health,0.461538,0.476563,0.45,0.43617
Bank,0.433333,0.5,0.333333,0.406593
Fashion,0.54902,0.461538,0.550393,0.54067
Otomotif,0.526786,0.333333,0.488636,0.519608


### 8. Which approach leads to a better approximation of the optimal action?

**UCB (c = 2.0) provides the best approximation** of the optimal action, as its Q-values are closest to the true CTR values across most categories. It balances exploration and exploitation effectively, preventing both excessive optimism (as seen in ε-greedy with low ϵ) and poor estimation due to insufficient exploration (as seen in ε-greedy with higher ϵ). While **ε-greedy (ϵ = 0.05) tends to overexploit** and miss optimal choices, and **ϵ = 0.2** explores more but still **struggles with accurate estimates**

##### UCB consistently updates its action values based on a principled exploration strategy, leading to more reliable learning over time.

### 9. Evaluate how the performance of ε-greedy and UCB changes when the time horizon is extended to 5000-time steps instead of 2000-time steps

In [12]:
time_steps_5000 = 5000

epsilon_05_reward_5000, epsilon_05_Q_5000, epsilon_05_N_5000 = epsilon_greedy(0.05, time_steps_5000, clickedAdsDf)
epsilon_2_reward_5000, epsilon_2_Q_5000, epsilon_2_N_5000 = epsilon_greedy(0.2, time_steps_5000, clickedAdsDf)

ucb_reward_5000, ucb_Q_5000, ucb_N_5000 = ucb(2.0, time_steps_5000, clickedAdsDf)

In [13]:
print(f"Epsilon-Greedy (ε=0.05, {time_steps_5000} steps): Total Reward = {epsilon_05_reward_5000}")
print(f"Epsilon-Greedy (ε=0.05): Action Counts = {epsilon_05_N_5000}\n")

print(f"Epsilon-Greedy (ε=0.2, {time_steps_5000} steps): Total Reward = {epsilon_2_reward_5000}")
print(f"Epsilon-Greedy (ε=0.2): Action Counts = {epsilon_2_N_5000}\n")

print(f"UCB (c=2.0, {time_steps_5000} steps): Total Reward = {ucb_reward_5000}")
print(f"UCB (c=2.0): Action Counts = {ucb_N_5000}\n")

Epsilon-Greedy (ε=0.05, 5000 steps): Total Reward = 2725
Epsilon-Greedy (ε=0.05): Action Counts = [28, 205, 30, 31, 663, 16, 26, 35, 3940, 26]

Epsilon-Greedy (ε=0.2, 5000 steps): Total Reward = 2658
Epsilon-Greedy (ε=0.2): Action Counts = [110, 136, 123, 222, 155, 110, 125, 94, 3796, 129]

UCB (c=2.0, 5000 steps): Total Reward = 2487
UCB (c=2.0): Action Counts = [435, 460, 492, 375, 965, 431, 427, 360, 651, 404]



### 10. Does a longer time horizon reduce the impact of exploration parameters (ε or c) on total rewards?

**Longer time horizon reduces the impact of exploration parameters** (ϵ or c) on total rewards. Initially, higher exploration (ε = 0.2, higher c) helps discover better actions, but as time progresses, the need for exploration diminishes because the algorithm has gathered sufficient information.

##### This is evident in the results:
    ε-Greedy (ε = 0.05) outperforms ε = 0.2 and UCB at 5000 steps because it quickly converges to the best ad and exploits it.
    ε-Greedy (ε = 0.2) performs worse because it continues exploring more than necessary.
    UCB (c = 2.0) underperforms compared to ε-Greedy (ε = 0.05), likely because it maintains exploration for too long.

With a longer time horizon, less exploration is needed, and lower ε (or c) leads to better performance as exploitation dominates. Exploration is crucial in the early stages but has diminishing returns over time.

# DataSet 2

##### ⚠️ Incorrect Dataset: 1000-Armed Bandit Problem  
##### The dataset contains **1000 unique "Ad Topic Line" values**, meaning this is a **1000-armed multi-armed bandit problem**, not a 10-armed one.

In [14]:
AdsDf = pd.read_csv("Ad Click Data.csv")

AdsDf.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0.0,Tunisia,3/27/2016 0:53,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1.0,Nauru,4/4/2016 1:39,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0.0,San Marino,3/13/2016 20:35,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1.0,Italy,1/10/2016 2:31,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0.0,Iceland,6/3/2016 3:36,0


In [15]:
AdsDf.isnull().sum()

Daily Time Spent on Site    13
Age                          0
Area Income                 13
Daily Internet Usage        11
Ad Topic Line                0
City                         1
Male                         3
Country                      9
Timestamp                    0
Clicked on Ad                0
dtype: int64

In [16]:
AdsDf['Daily Time Spent on Site'] = AdsDf['Daily Time Spent on Site'].fillna(AdsDf['Daily Time Spent on Site'].mean())
AdsDf['Area Income'] = AdsDf['Area Income'].fillna(AdsDf['Area Income'].mean())
AdsDf['Daily Internet Usage'] = AdsDf['Daily Internet Usage'].fillna(AdsDf['Daily Internet Usage'].mean())

AdsDf['City'] = AdsDf['City'].fillna(AdsDf['City'].mode()[0])
AdsDf['Male'] = AdsDf['Male'].fillna(AdsDf['Male'].mode()[0])
AdsDf['Country'] = AdsDf['Country'].fillna(AdsDf['Country'].mode()[0])

In [17]:
ad_categories = AdsDf['Ad Topic Line'].unique().tolist()
n_ads = len(ad_categories)
print(f"Identified {n_ads} ad categories")
# print({ad_categories}) // too big to print

Identified 1000 ad categories


In [18]:
def epsilon_greedy(epsilon, time_steps, df):
    Q = [0] * n_ads
    N = [0] * n_ads 
    total_reward = 0

    for t in range(time_steps):
        if random.random() < epsilon:
            ad_index = random.randint(0, n_ads - 1)
            chosen_category = ad_categories[ad_index]
        else:
            ad_index = Q.index(max(Q))
            ad_indices = [i for i, value in enumerate(Q) if value == Q[ad_index]]
            ad_index = random.choice(ad_indices)
            chosen_category = ad_categories[ad_index]

        category_rows = df[df['Ad Topic Line'] == chosen_category]
        if not category_rows.empty:
            reward = category_rows['Clicked on Ad'].iloc[t % len(category_rows)]
        else:
            reward = 0 
            print(f"Warning: No data found for category {chosen_category}.  Setting reward to 0.")

        N[ad_index] += 1
        Q[ad_index] += (1 / N[ad_index]) * (reward - Q[ad_index])
        total_reward += reward

    return total_reward, Q, N

In [19]:
time_steps = 2000
epsilon_05_reward, epsilon_05_Q, epsilon_05_N = epsilon_greedy(0.05, time_steps, AdsDf)
epsilon_2_reward, epsilon_2_Q, epsilon_2_N = epsilon_greedy(0.2, time_steps, AdsDf)

print(f"Epsilon-Greedy (ε=0.05, {time_steps} steps): Total Reward = {epsilon_05_reward}")
print(f"Epsilon-Greedy (ε=0.2, {time_steps} steps): Total Reward = {epsilon_2_reward}\n")

# print(f"Epsilon-Greedy (ε=0.05): Action Counts = {epsilon_05_N}") // too big to print
# print(f"Epsilon-Greedy (ε=0.2): Action Counts = {epsilon_2_N}") // too big to print

Epsilon-Greedy (ε=0.05, 2000 steps): Total Reward = 1952
Epsilon-Greedy (ε=0.2, 2000 steps): Total Reward = 1808



In [20]:
def ucb(c, time_steps, df):
    Q = [0] * n_ads
    N = [0] * n_ads
    total_reward = 0

    for t in range(time_steps):
        ucb_values = [Q[a] + c * math.sqrt(math.log(t + 1) / (N[a] + 1e-6)) for a in range(n_ads)]

        ad_index = ucb_values.index(max(ucb_values))
        ad_indices = [i for i, value in enumerate(ucb_values) if value == ucb_values[ad_index]]
        ad_index = random.choice(ad_indices)
        chosen_category = ad_categories[ad_index]

        category_rows = df[df['Ad Topic Line'] == chosen_category]
        if not category_rows.empty:
            reward = category_rows['Clicked on Ad'].iloc[t % len(category_rows)]
        else:
            reward = 0 
            print(f"Warning: No data found for category {chosen_category}. Setting reward to 0.")

        # Update
        N[ad_index] += 1
        Q[ad_index] += (1 / N[ad_index]) * (reward - Q[ad_index])
        total_reward += reward

    return total_reward, Q, N

In [21]:
time_steps = 2000
ucb_reward, ucb_Q, ucb_N = ucb(2.0, time_steps, AdsDf)

print(f"UCB (c=2.0, {time_steps} steps): Total Reward = {ucb_reward}")
# print(f"UCB (c=2.0): Action Counts = {ucb_N}") // too big to print

UCB (c=2.0, 2000 steps): Total Reward = 1000


In [22]:
exploration_factors = [0.1, 0.25, 0.5, 1.0, 3.0, 5.0, 10.0]

for c in exploration_factors:
    ucb_reward, ucb_Q, ucb_N = ucb(c, time_steps, AdsDf)
    print(f"UCB (c={c}, {time_steps} steps): Total Reward = {ucb_reward}")
    # print(f"UCB (c={c}): Action Counts = {ucb_N}\n") // too big to print

UCB (c=0.1, 2000 steps): Total Reward = 1500
UCB (c=0.25, 2000 steps): Total Reward = 1500
UCB (c=0.5, 2000 steps): Total Reward = 1500
UCB (c=1.0, 2000 steps): Total Reward = 1500
UCB (c=3.0, 2000 steps): Total Reward = 1000
UCB (c=5.0, 2000 steps): Total Reward = 1000
UCB (c=10.0, 2000 steps): Total Reward = 1000


In [23]:
true_ctr = []
for ad_category in ad_categories:
    true_ctr.append(AdsDf[AdsDf['Ad Topic Line'] == ad_category]['Clicked on Ad'].mean())

results_df = pd.DataFrame({
    "True CTR": true_ctr,
    "Epsilon-Greedy (ε=0.05)": epsilon_05_Q,
    "Epsilon-Greedy (ε=0.2)": epsilon_2_Q,
    "UCB (c=2.0)": ucb_Q
}, index=ad_categories)

results_df

Unnamed: 0,True CTR,Epsilon-Greedy (ε=0.05),Epsilon-Greedy (ε=0.2),UCB (c=2.0)
Cloned 5thgeneration orchestration,0.0,0.0,0.0,0.0
Monitored national standardization,0.0,0.0,0.0,0.0
Organic bottom-line service-desk,0.0,0.0,0.0,0.0
Triple-buffered reciprocal time-frame,0.0,0.0,0.0,0.0
Robust logistical utilization,0.0,0.0,0.0,0.0
...,...,...,...,...
Fundamental modular algorithm,1.0,0.0,0.0,1.0
Grass-roots cohesive monitoring,1.0,0.0,0.0,1.0
Expanded intangible solution,1.0,1.0,1.0,1.0
Proactive bandwidth-monitored policy,0.0,0.0,0.0,0.0


In [24]:
time_steps_5000 = 5000

epsilon_05_reward_5000, epsilon_05_Q_5000, epsilon_05_N_5000 = epsilon_greedy(0.05, time_steps_5000, AdsDf)
epsilon_2_reward_5000, epsilon_2_Q_5000, epsilon_2_N_5000 = epsilon_greedy(0.2, time_steps_5000, AdsDf)

ucb_reward_5000, ucb_Q_5000, ucb_N_5000 = ucb(2.0, time_steps_5000, AdsDf)

In [25]:
print(f"Epsilon-Greedy (ε=0.05, {time_steps_5000} steps): Total Reward = {epsilon_05_reward_5000}")
# print(f"Epsilon-Greedy (ε=0.05): Action Counts = {epsilon_05_N_5000}\n") // too big to print

print(f"Epsilon-Greedy (ε=0.2, {time_steps_5000} steps): Total Reward = {epsilon_2_reward_5000}")
# print(f"Epsilon-Greedy (ε=0.2): Action Counts = {epsilon_2_N_5000}\n") // too big to print

print(f"UCB (c=2.0, {time_steps_5000} steps): Total Reward = {ucb_reward_5000}")
# print(f"UCB (c=2.0): Action Counts = {ucb_N_5000}\n") // too big to print

Epsilon-Greedy (ε=0.05, 5000 steps): Total Reward = 4879
Epsilon-Greedy (ε=0.2, 5000 steps): Total Reward = 4486
UCB (c=2.0, 5000 steps): Total Reward = 3500
