In [None]:
import pandas as pd
import numpy as np

# Use pandas to read the file into a DataFrame
data = pd.read_csv('/content/Ads_Optimisation.csv')
print(data.head())

   Ad 1  Ad 2  Ad 3  Ad 4  Ad 5  Ad 6  Ad 7  Ad 8  Ad 9  Ad 10
0     1     0     0     0     1     0     0     0     1      0
1     0     0     0     0     0     0     0     0     1      0
2     0     0     0     0     0     0     0     0     0      0
3     0     1     0     0     0     0     0     1     0      0
4     0     0     0     0     0     0     0     0     0      0


###**QA. Write down the MAB agent problem formulation in your own words**

The Multi-Armed Bandit (MAB) problem is a framework in which an agent repeatedly selects one action from a fixed set of alternatives (arms).

Each arm has an unknown reward distribution, and the goal is to maximize the total reward accumulated over time. In this problem, the agent selects which of the 10 advertisements to display to a user at each time step, with the aim of maximizing the number of clicks. Each arm corresponds to an ad, and the observed reward is 1 if the user clicks on the displayed ad and 0 otherwise.

The agent uses the rewards observed so far to make better decisions in the future, balancing exploration of rarely tried arms against exploitation of the arm with the highest estimated value so far.
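
The formulation above can be written compactly as follows. The notation is an assumption introduced here for illustration and does not appear in the original: $K$ is the number of arms, $T$ the number of time steps, $A_t$ the ad shown at step $t$, $R_t$ the observed click, and $q(a)$ the unknown click probability of ad $a$.

$$
A_t \in \{1, \dots, K\}, \qquad R_t \in \{0, 1\}, \qquad q(a) = \mathbb{E}[R_t \mid A_t = a], \qquad \text{maximize } \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]
$$

In this notebook $K = 10$ (one arm per ad column), and the optimal action is the ad with the largest $q(a)$.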

In [None]:
# Define the MAB agent class
class MABAgent:
    '''Multi-armed bandit (MAB) agent that repeatedly chooses one of several actions to maximize reward.

    The agent is initialized with the number of actions (n_arms),
    the epsilon value for the epsilon-greedy algorithm (epsilon),
    and the c value for the UCB algorithm (c).
    The Q-value estimates (Q) and the number of times each action was selected (N)
    both start at zero.
    '''

    def __init__(self, n_arms, epsilon=None, c=None):
        self.n_arms = n_arms
        self.epsilon = epsilon      # exploration probability, or None to disable
        self.c = c                  # UCB exploration coefficient, or None to disable
        self.Q = np.zeros(n_arms)   # action-value estimates
        self.N = np.zeros(n_arms)   # number of times each action was selected
        self.total_rewards = 0



    def select_action(self):
        '''Select the next action.

        If epsilon is not None, a uniformly random action is chosen with probability epsilon.
        Otherwise, if c is not None, the action is chosen using the UCB algorithm.
        Otherwise, the action with the highest Q-value estimate is chosen
        (np.argmax breaks ties in favor of the lowest index).
        '''
        if self.epsilon is not None and np.random.rand() < self.epsilon:
            # Select a random action with probability epsilon
            action = np.random.randint(self.n_arms)
        elif self.c is not None:
            # Select an action using the UCB algorithm.
            # The small constant avoids division by zero for untried arms.
            t = np.sum(self.N) + 1
            ucb = self.Q + self.c * np.sqrt(np.log(t) / (self.N + 1e-6))
            action = np.argmax(ucb)
        else:
            # Select the action with the highest estimated value
            action = np.argmax(self.Q)
        return action
    

    def update(self, action, reward):
        '''Update the estimates after taking an action and observing its reward.

        Increments the selection count of the action, then moves its Q-value estimate
        toward the reward by 1/N[action], which keeps Q[action] equal to the mean
        of the rewards observed for that action.
        '''
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
        self.total_rewards += reward
        

    def run(self, data, n_steps=2000):
        '''Run the agent on a dataset.

        Iterates over the first n_steps rows of the dataset (2000 by default). At each step it
        selects an action using select_action, reads the reward for that action from the
        current row, and updates the Q-value estimate and selection count using update.
        '''
        for t in range(n_steps):
            action = self.select_action()
            reward = data.iloc[t, action]
            self.update(action, reward)
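
The update rule used by the agent, `Q += (reward - Q) / N`, is the incremental form of the sample mean. The sketch below checks this on a short, made-up reward sequence; the helper name `incremental_mean` and the reward values are hypothetical and exist only for illustration.

```python
import numpy as np

def incremental_mean(rewards):
    """Apply the agent's update rule to one arm's rewards, one at a time."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n  # same form as MABAgent.update
    return q

# Made-up click sequence for a single ad (illustrative only)
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
estimate = incremental_mean(rewards)

# The two values agree up to floating-point rounding
print(estimate, np.mean(rewards))
```

Storing only `Q` and `N` per arm therefore gives the same estimate as keeping the full reward history and averaging it.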

###**QB a. Compute the total rewards after 2000 time steps using the ε-greedy action for ε=0.01**

In [None]:
# Compute the total rewards using the epsilon-greedy algorithm with epsilon=0.01
agent1 = MABAgent(n_arms=10, epsilon=0.01)
agent1.run(data)
print(f"Total rewards using epsilon=0.01: {agent1.total_rewards}")

Total rewards using epsilon=0.01: 338


###**QB b. Compute the total rewards after 2000 time steps using the ε-greedy action for ε=0.3**

In [None]:
# Compute the total rewards using the epsilon-greedy algorithm with epsilon=0.3
agent2 = MABAgent(n_arms=10, epsilon=0.3)
agent2.run(data)
print(f"Total rewards using epsilon=0.3: {agent2.total_rewards}")

Total rewards using epsilon=0.3: 403


###**QC. Compute the total rewards after 2000 time steps using the Upper-Confidence-Bound action method for c=1.5**

In [None]:
# Compute the total rewards using the UCB algorithm with c=1.5
agent3 = MABAgent(n_arms=10, c=1.5)
agent3.run(data)
print(f"Total rewards using UCB with c=1.5: {agent3.total_rewards}")

Total rewards using UCB with c=1.5: 319
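
To illustrate how the UCB score in `select_action` trades off value and uncertainty, the sketch below recomputes the same expression, `Q + c * sqrt(log(t) / (N + 1e-6))`, for three arms with equal estimates but different selection counts. The function name `ucb_scores` and the input values are hypothetical and chosen only for illustration.

```python
import numpy as np

def ucb_scores(Q, N, c, eps=1e-6):
    """Recompute the UCB score used in MABAgent.select_action."""
    t = np.sum(N) + 1
    return Q + c * np.sqrt(np.log(t) / (N + eps))

# Three arms with the same estimated value but different selection counts
Q = np.array([0.2, 0.2, 0.2])
N = np.array([100.0, 10.0, 1.0])

scores = ucb_scores(Q, N, c=1.5)
print(scores)
print("Selected arm:", int(np.argmax(scores)))
```

With equal estimates, the least-selected arm receives the largest bonus and is chosen next, so the exploration here is driven by the selection counts instead of random draws.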


###**QD. For all approaches, explain how the action value estimated compares to the optimal action**

Based on the output of the code, the total rewards obtained by each approach are as follows:

**Epsilon-greedy with epsilon=0.01: 338**

**Epsilon-greedy with epsilon=0.3: 403**

**UCB with c=1.5: 319**

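The epsilon-greedy approach with epsilon=0.3 obtained the highest total reward, followed by the epsilon-greedy approach with epsilon=0.01 and then the UCB approach with c=1.5. This suggests that, in this run, the higher exploration rate (0.3) let the agent find better ads than either the lower exploration rate (0.01) or the UCB approach.

For every approach, Q[a] is the mean of the rewards the agent observed for ad a, so it is a reliable estimate of that ad's click rate only for ads the agent selected often. The optimal action is the ad with the highest click rate, and an agent behaves close to optimally only if its highest Q-value belongs to that ad. With epsilon=0.01 the agent rarely explores, so most of its estimates rest on very few samples and it can stay on a suboptimal ad for a long time.

The epsilon-greedy runs draw random numbers and no random seed is set, so their totals will vary from run to run. The ranking above comes from a single run and is not a general result.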
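
One way to make the comparison between estimated action values and the optimal action concrete is to tabulate each arm's Q estimate next to its empirical click rate. The sketch below does this on synthetic data, because the CSV is not available here: the click probabilities in `true_rates`, the four-ad setup, the seed, and the function `run_epsilon_greedy` are all assumptions made for illustration and are not the notebook's data or results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical click probabilities, standing in for Ads_Optimisation.csv
true_rates = np.array([0.05, 0.10, 0.25, 0.15])
n_steps = 2000
data = pd.DataFrame(
    rng.binomial(1, true_rates, size=(n_steps, len(true_rates))),
    columns=[f"Ad {i + 1}" for i in range(len(true_rates))],
)

def run_epsilon_greedy(data, epsilon, rng):
    """Compact epsilon-greedy loop using the same update rule as MABAgent."""
    rewards = data.to_numpy()
    n_arms = rewards.shape[1]
    Q = np.zeros(n_arms)
    N = np.zeros(n_arms)
    for t in range(len(rewards)):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))  # explore
        else:
            a = int(np.argmax(Q))          # exploit
        N[a] += 1
        Q[a] += (rewards[t, a] - Q[a]) / N[a]
    return Q, N

Q, N = run_epsilon_greedy(data, epsilon=0.3, rng=rng)
empirical = data.mean().to_numpy()  # click rate of each ad over all rows

comparison = pd.DataFrame(
    {"pulls": N.astype(int), "Q estimate": Q.round(3), "empirical rate": empirical.round(3)},
    index=data.columns,
)
print(comparison)
print("Optimal ad in hindsight:", data.columns[np.argmax(empirical)])
print("Agent's greedy choice:  ", data.columns[np.argmax(Q)])
```

Each Q estimate averages only the rows in which the agent actually showed that ad, so it can differ from the column mean, most visibly for ads with few pulls. For the notebook's own agents, a similar table could be built from `agent1.Q`, `agent2.Q`, or `agent3.Q` together with `data.iloc[:2000].mean()`.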