**Implementing D-MAB, as described in DaCosta et al. - 2008 - Adaptive operator selection with dynamic multi-arm**

>  (hybrid between UCB1 and Page-Hinkley (PH) test)

D-MAB maintains four indicators for each arm $i$:
1. the number $n_{i, t}$ of times the $i$-th arm has been played up to time $t$;
2. the average empirical reward $\widehat{p}_{i, t}$ at time $t$;
3. the average and maximum deviations $m_i$ and $M_i$ involved in the PH test, both initialized to $0$ and updated as detailed below.

At each time step $t$, D-MAB selects the arm $i$ that maximizes equation 1:

$$\widehat{p}_{i, t} + \sqrt{\frac{2 \log \sum_{k}n_{k, t}}{n_{i, t}}}$$

> Normally the total number of pulls equals the time step, $\sum_{k}n_{k, t} = t$, but since the algorithm resets the pull counts, we need to use the summation.

and receives some reward $r_t$, drawn from the reward distribution $p_{i, t}$.

> I think there is a typo in the eq. 1 on the paper. I replaced $j$ with $i$ in the lower indexes.
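
To make equation 1 concrete, here is a minimal sketch of the score computation. The function name `ucb1_scores` and the example numbers are hypothetical, and the sketch assumes every arm has been played at least once so the denominator is never zero.

```python
import numpy as np

def ucb1_scores(avg_reward, num_played):
    """Equation 1: empirical mean plus exploration bonus, for every arm."""
    avg_reward = np.asarray(avg_reward, dtype=float)
    num_played = np.asarray(num_played, dtype=float)
    total = num_played.sum()  # sum_k n_{k,t}, used instead of t because of resets
    return avg_reward + np.sqrt(2.0 * np.log(total) / num_played)

# Hypothetical state: arm 0 has the best average, arm 2 is barely explored.
avg_reward = [0.6, 0.5, 0.4]
num_played = [50, 40, 2]

scores = ucb1_scores(avg_reward, num_played)
picked = int(np.argmax(scores))
print(scores, picked)
```

In this example the rarely played arm gets the largest exploration bonus, so it is selected even though its empirical average is the lowest.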

The four indicators are updated accordingly:

- $\widehat{p}_{i, t} :=\frac{1}{n_{i, t} + 1}(n_{i, t}\widehat{p}_{i, t} + r_t)$
- $n_{i, t} := n_{i, t}+1$
- $m_i := m_i + (\widehat{p}_{i, t} - r_t + \delta)$
- $M_i := \max(M_i, m_i)$

If the PH test is triggered ($M_i - m_i > \lambda$), the bandit is restarted, i.e., all indicators of all arms are reset to zero (the authors argue that, empirically, resetting the values is more robust than decreasing them with some mechanism such as probability matching).

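> To avoid a division by zero when calculating UCB1, I will not use the counts exactly as in the original paper: the implementation below still resets them to 0, but offsets them by 1 inside the UCB1 formula (`log1p` in the numerator and `+1` in the denominator).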
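
As a quick worked example of the update rules (the numbers are hypothetical and chosen only for illustration, and the updates are assumed to be applied in the order listed, so $m_i$ uses the already updated $\widehat{p}_{i, t}$), take an arm with $n_{i, t} = 3$, $\widehat{p}_{i, t} = 0.5$, $m_i = M_i = 0$, $\delta = 0.1$, and a received reward $r_t = 1$:

$$\widehat{p}_{i, t} := \frac{1}{3 + 1}(3 \cdot 0.5 + 1) = 0.625$$

$$n_{i, t} := 3 + 1 = 4$$

$$m_i := 0 + (0.625 - 1 + 0.1) = -0.275$$

$$M_i := \max(0, -0.275) = 0$$

Here $M_i - m_i = 0.275$, so the PH test would trigger for any $\lambda < 0.275$ and stay silent otherwise.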

The PH test is a standard test for the change hypothesis. It works by monitoring the difference between $M_i$ and $m_i$; when the difference is greater than some user-specified threshold $\lambda$, the PH test is triggered, i.e., the change hypothesis is considered to hold.

Parameter $\lambda$ controls the trade-off between false alarms and unnoticed changes. Parameter $\delta$ enforces the robustness of the test when dealing with slowly varying environments.
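
The following sketch runs the PH statistic on a single stream of rewards to show how $\lambda$ affects detection. The function name `page_hinkley_change_point` and the synthetic stream are assumptions made for illustration; the update rules are the ones listed above, applied to one arm only.

```python
def page_hinkley_change_point(rewards, delta, lmbda):
    """Return the index at which M - m first exceeds lmbda, or None."""
    avg, n = 0.0, 0   # running average reward and number of samples
    m, M = 0.0, 0.0   # cumulative deviation and its running maximum
    for t, r in enumerate(rewards):
        avg = (n * avg + r) / (n + 1)
        n += 1
        m += avg - r + delta
        M = max(M, m)
        if M - m > lmbda:
            return t
    return None

# Hypothetical deterministic stream: reward 0 for 50 steps, then reward 1.
rewards = [0.0] * 50 + [1.0] * 50

cp_small = page_hinkley_change_point(rewards, delta=0.05, lmbda=2.0)
cp_large = page_hinkley_change_point(rewards, delta=0.05, lmbda=10.0)
cp_none = page_hinkley_change_point([0.0] * 100, delta=0.05, lmbda=2.0)
print(cp_small, cp_large, cp_none)
```

With these rules $m$ drops whenever a reward exceeds the running average by more than $\delta$, so the jump at step 50 is flagged a few steps later; a larger $\lambda$ delays the alarm, and the stationary stream never triggers it.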

We also need a scaling mechanism to control the Exploration _versus_ Exploitation balance. The authors propose two, of which I will focus on the first: Multiplicative Scaling (cUCB). It consists of multiplying all rewards by a fixed user-defined parameter $C_{M-\text{scale}}$.
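
A small sketch of why the scaling matters: multiplying every reward by a constant scales the average reward term of equation 1 but leaves the exploration bonus untouched. The helper `pick_arm` and the numbers are hypothetical, and the sketch assumes all arms have been played at least once.

```python
import numpy as np

def pick_arm(raw_avg_reward, num_played, c_scale):
    """Pick the arm maximizing equation 1 when rewards are multiplied by c_scale."""
    raw = np.asarray(raw_avg_reward, dtype=float)
    n = np.asarray(num_played, dtype=float)
    exploitation = c_scale * raw  # scaling every reward scales its average
    exploration = np.sqrt(2.0 * np.log(n.sum()) / n)  # unaffected by the scaling
    return int(np.argmax(exploitation + exploration))

# Hypothetical state: arm 0 is slightly better but much more explored.
raw_avg_reward = [0.6, 0.5]
num_played = [100, 10]

pick_small = pick_arm(raw_avg_reward, num_played, c_scale=1.0)
pick_large = pick_arm(raw_avg_reward, num_played, c_scale=10.0)
print(pick_small, pick_large)
```

With the small factor the exploration bonus dominates and the less explored arm 1 is picked; with the large factor the reward gap dominates and arm 0 is picked.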

This way, our D-MAB takes three parameters: $\lambda$, $\delta$, and $C_{M - \text{scale}}$. The paper includes a sensitivity analysis of these parameters, but I think they should be fine-tuned for each specific data set.

In [1]:
import numpy as np


class D_MAB:
    def __init__(self, num_bandits, verbose=False, *, delta, lmbda, scaling, pull_f, reward_f):
        self.num_bandits = num_bandits
        self.verbose     = verbose
        self.delta       = delta
        self.lmbda       = lmbda
        self.scaling     = scaling
        self.pull_f      = pull_f
        self.reward_f    = reward_f

        # History of choices and time instant t (just to track the behavior)
        self.history = {i:[] for i in range(self.num_bandits)}

        self._reset_indicators()

    def _reset_indicators(self):
        self.avg_reward    = np.zeros(self.num_bandits)
        self.num_played    = np.zeros(self.num_bandits)
        self.avg_deviation = np.zeros(self.num_bandits)
        self.max_deviation = np.zeros(self.num_bandits)

    def _calc_UCB1s(self):
        # log1p and +1 on denominator fixes some numeric problems in the original eq.
        scores = np.array([self.avg_reward[i] + np.sqrt(2*np.log1p(sum(self.num_played))/(self.num_played[i]+1))
            for i in range(self.num_bandits)])
        
        return np.nan_to_num(scores, nan=0)

    def _scale_reward(self, reward):
        return reward*self.scaling
    
    def playAndOptimize(self, *pull_args):
        # It will pick the bandit that maximizes eq.1. 
        UCB1s  = self._calc_UCB1s()

        # We need to know which arm we picked, what it returned, and how to calculate the reward given what the arm returned
        picked = np.nanargmax(np.nan_to_num(UCB1s, nan=-np.inf))
        pulled = self.pull_f(picked, *pull_args)
        reward = self.reward_f(pulled)
        
        self.history[picked].append(reward)

        if self.verbose:
            print(f"Avg. Rewards: {self.avg_reward}\nUCB1 scores : {UCB1s}\nPicked      : {picked}\nReward      : {reward}")

        # After choosing, it will implicitly update the parameters based on the return
        if np.isfinite(reward):
            self.avg_reward[picked]    = (self.num_played[picked]*self.avg_reward[picked] + self._scale_reward(reward))/(self.num_played[picked]+1)
            self.avg_deviation[picked] = self.avg_deviation[picked] + (self.avg_reward[picked] - self._scale_reward(reward) + self.delta)
            
        self.num_played[picked]    = self.num_played[picked] +1
        self.max_deviation[picked] = np.maximum(self.max_deviation[picked], self.avg_deviation[picked])

        if (self.max_deviation[picked] - self.avg_deviation[picked] > self.lmbda):
            self._reset_indicators()
            if self.verbose:
                print("Reset indicators ----------------------------------------")

        return picked, pulled, reward

Below I'll create a simple bandit configuration so we can do a sanity check of our `D_MAB` implementation.

In [2]:
# Sanity checks
import numpy as np

for bandits, descr, expec in [
    (np.array([1.0, 1.0,  1.0,  1.0]), 'All bandits with same probs', 'similar amount of pulls for each arm'),
    (np.array([-1.0, 0.2,  0.0,  1.0]), 'One bandit with higher prob', 'more pulls for first arm, less pulls for last'),
    (np.array([-0.2, -1.0,  0.0,  -1.0]), 'Two bandits with higher probs', '2nd and 4th have similar number of pulls, higher than 1st and 3rd'),
]:
    # Implementing simple bandits
    def pullBandit(bandit):

        #Get a random number based on a normal dist with mean 0 and var 1
        result = np.random.randn()
        
        # bandits: per-arm thresholds that define the true reward probabilities, which the
        # optimizer shouldn't have access to. Return a positive or negative reward accordingly.
        return 1 if result > bandits[bandit] else -1

    
    print("\n==============================================================")
    print(descr)

    print("------------- Uniformly Distributed Random pulls -------------")
    picks   = [0, 0, 0, 0]
    rewards = [0, 0, 0, 0]

    for _ in range(10000):
        index  = np.random.randint(len(bandits))
        reward = pullBandit(index)

        picks[index]   = picks[index]+1
        rewards[index] = rewards[index]+reward

    print("Probabilities for each arm: ", bandits, "(the smaller the better)")
    print("cum. reward for each arm  : ", rewards)
    print("pulls for each arm        : ", picks)

    print("------------------------ optimizing ------------------------")

    # We have the problem that we need to determine the delta and lambda values beforehand.
    # This needs domain knowledge (in the SR context, I think we need to know if the data is
    # homogeneous or if it changes a lot through time).
    optimizer = D_MAB(4, verbose=False, 
                      delta=0.25, lmbda=1, scaling=2,
                      pull_f=pullBandit, reward_f=lambda r:r)

    # Let's optimize
    for i in range(10000):
        optimizer.playAndOptimize()

    total_rewards = {k : sum(v) for (k, v) in optimizer.history.items()}
    total_played  = {k : len(v) for (k, v) in optimizer.history.items()}

    print("cum. reward for each arm: ", total_rewards)
    print("pulls for each arm      : ", total_played)
    print(f"(it was expected: {expec})")


All bandits with same probs
------------- Uniformly Distributed Random pulls -------------
Probabilities for each arm:  [1. 1. 1. 1.] (the smaller the better)
cum. reward for each arm  :  [-1707, -1628, -1716, -1671]
pulls for each arm        :  [2527, 2452, 2508, 2513]
------------------------ optimizing ------------------------
cum. reward for each arm:  {0: -1847, 1: -1734, 2: -1670, 3: -1559}
pulls for each arm      :  {0: 2837, 1: 2500, 2: 2402, 3: 2261}
(it was expected: similar amount of pulls for each arm)

One bandit with higher prob
------------- Uniformly Distributed Random pulls -------------
Probabilities for each arm:  [-1.   0.2  0.   1. ] (the smaller the better)
cum. reward for each arm  :  [1633, -483, 80, -1686]
pulls for each arm        :  [2435, 2547, 2518, 2500]
------------------------ optimizing ------------------------
cum. reward for each arm:  {0: 5127, 1: -135, 2: -21, 3: -383}
pulls for each arm      :  {0: 7425, 1: 983, 2: 1065, 3: 527}
(it was expected: 

Ok, so the D-MAB seems to work. Now let's add this MAB inside the mutation step to update the PARAMS option and dynamically control the mutation probabilities during evolution.

We can import the brush estimator and replace `_mutate` with a custom function. Ideally, to use this Python MAB optimizer, we need an object that keeps track of the variables, wraps the _pull_ action, and evaluates the reward based on the result.

> We'll need a workaround (a _gambiarra_) to know which mutation was used so we can correctly update `D_MAB`. All MAB logic is implemented in Python, and we choose the mutation in Python as well. To make sure a specific mutation is used, we force it by setting the other mutations' weights to zero. This way we know exactly what happened in the C++ code.
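
The weight trick can be sketched in isolation as follows. The helper `force_mutation` is hypothetical (it is not part of brush) and only builds the weight dictionary; it returns a copy so the estimator's original `mutation_options` stay untouched.

```python
def force_mutation(mutation_options, chosen):
    """Return a copy of the weights in which only `chosen` can be sampled."""
    if chosen not in mutation_options:
        raise KeyError(f"unknown mutation: {chosen}")
    return {name: (1.0 if name == chosen else 0.0) for name in mutation_options}

# Same mutation names used in this notebook, with uniform starting weights.
mutation_options = {"point": 0.25, "insert": 0.25, "delete": 0.25, "toggle_weight": 0.25}

forced = force_mutation(mutation_options, "delete")
print(forced)
```

Passing a dictionary like `forced` as the `mutation_options` parameter leaves a single mutation with non-zero weight, which is what lets the Python side know which arm was actually pulled.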

In [3]:
from brush import BrushRegressor
from deap import creator
import _brush
from deap_api import nsga2, DeapIndividual 

#prg.mutate is a convenient interface that uses the current search space to sample mutations

class BrushRegressorMod(BrushRegressor):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def _mutate(self, ind1):
        # Overriding the mutation so it is wrapped with D_MAB
        
        mutation, offspring, reward = self.D_MAB_.playAndOptimize(ind1)
        
        #print(mutation, ind1.prg.get_model(), offspring.prg.get_model(), reward)
        return offspring
    
    def fit(self, X, y):

        _brush.set_params(self.get_params())

        self.data_ = self._make_data(X,y)

        # Creating a wrapper for mutation to be able to control what is happening in the C++
        # code (this should be prettier in a future implementation)
        def _pull_mutation(mutation_idx, ind1):
            mutations = ['point', 'insert', 'delete', 'toggle_weight']
            params = self.get_params()

            for i, m in enumerate(mutations):
                params['mutation_options'][m] = 0 if i != mutation_idx else 1.0

            _brush.set_params(params)
        
            offspring = creator.Individual(ind1.prg.mutate())

            return offspring
        
        # Given the result of a pull (the mutated offspring), how do I evaluate it?
        # (here I am manually writing the multi-optimization problem nsga2 is
        # designed to solve)
        def _evaluate_reward(ind):
            if not ind.fitness.valid:
                ind.prg.fit(self.data_)
                fit = (
                    np.sum((self.data_.y- ind.prg.predict(self.data_))**2),
                    ind.prg.size()
                )
            
                ind.fitness.values = fit
            
            error, size = ind.fitness.values
            return -1.0*error + -1.0*size
            
        # We have 4 different mutations
        self.D_MAB_ = D_MAB(4, verbose=False, 
                            delta=0.05, lmbda=5, scaling=1e-5, # How to determine these values???
                            pull_f=_pull_mutation, reward_f=_evaluate_reward)

        if isinstance(self.functions, list):
            self.functions_ = {k:1.0 for k in self.functions}
        else:
            self.functions_ = self.functions

        self.search_space_ = _brush.SearchSpace(self.data_, self.functions_)
        self.toolbox_ = self._setup_toolbox(data=self.data_)

        archive, logbook = nsga2(self.toolbox_, self.max_gen, self.pop_size, 0.9, self.verbosity)

        self.archive_ = archive
        self.best_estimator_ = self.archive_[0].prg
        total_played  = {k : len(v) for (k, v) in self.D_MAB_.history.items()}

        print(total_played)
        print(self.D_MAB_.avg_reward)
        print('best model:',self.best_estimator_.get_model())
        return self


Finally, let's use this new mutation in an ES algorithm (because it is based only on mutation) and see if it improves the performance.

In [28]:
import pandas as pd

# I am getting tons of harmless warnings
import warnings
warnings.filterwarnings("ignore")

#df = pd.read_csv('../../docs/examples/datasets/d_enc.csv')
#X = df.drop(columns='label')
#y = df['label']

df = pd.read_csv('../../docs/examples/datasets/d_2x1_subtract_3x2.csv')
X = df.drop(columns='target')
y = df['target']

kwargs = {
    'pop_size'  : 100,
    'max_gen'   : 100,
    'verbosity' : 0,
    'max_depth' : 10,
    'max_size'  : 20,
    'mutation_options' : {"point":0.25, "insert": 0.25, "delete":  0.25, "toggle_weight": 0.25}
}

# 30 executions just to compare avg score
scores = []
for i in range(30):
    print(f"-------------------------------------- Run {i} --------------------------------------")
    est_mab = BrushRegressorMod(**kwargs)

    # use like you would a sklearn regressor
    est_mab.fit(X,y)
    y_pred = est_mab.predict(X)

    scores.append(est_mab.score(X,y))
print(f"Score (30 runs): {np.mean(scores)}")

# Single run with verbosity
kwargs['verbosity'] = 1
est_mab = BrushRegressorMod(**kwargs)

# use like you would a sklearn regressor
est_mab.fit(X,y)
y_pred = est_mab.predict(X)

print('score:', est_mab.score(X,y))

-------------------------------------- Run 0 --------------------------------------
{0: 2504, 1: 2440, 2: 2505, 3: 2451}
[-0.00036238 -0.00147657 -0.00033448 -0.00126673]
best model: 3.60*Sin(2.72*x1)
-------------------------------------- Run 1 --------------------------------------
{0: 2491, 1: 2480, 2: 2491, 3: 2438}
[-0.00034092 -0.00051414 -0.0003384  -0.00126214]
best model: -4.24*x2
-------------------------------------- Run 2 --------------------------------------
{0: 63, 1: 3366, 2: 3383, 3: 3088}
[-1.35675027e+14 -5.23720451e-04 -3.38402328e-04 -3.78375709e-03]
best model: 2.79*Tanh(36.11*x1)
-------------------------------------- Run 3 --------------------------------------
{0: 3316, 1: 33, 2: 3316, 3: 3235}
[-3.42976253e-04 -9.02122534e+06 -3.33060718e-04 -1.26440401e-03]
best model: -4.24*x2
-------------------------------------- Run 4 --------------------------------------
{0: 2491, 1: 2479, 2: 2491, 3: 2439}
[-0.00035025 -0.00055322 -0.00033663 -0.00125379]
best model: -

Comparing with the original implementation

In [29]:
# 30 executions just to compare avg score
scores = []
for _ in range(30):
    kwargs['verbosity'] = 0

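    # NOTE: this instantiates BrushRegressorMod again (hence the D_MAB printouts in the
    # output); benchmarking the original implementation would require BrushRegressor here.
    est_mab = BrushRegressorMod(**kwargs)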

    # use like you would a sklearn regressor
    est_mab.fit(X,y)
    y_pred = est_mab.predict(X)

    scores.append(est_mab.score(X,y))
print(f"Score (30 runs): {np.mean(scores)}")

# Single run with verbosity

kwargs['verbosity'] = 1
est = BrushRegressor(**kwargs)

# use like you would a sklearn regressor
est.fit(X,y)
y_pred = est.predict(X)

print('score:', est.score(X,y))

{0: 3366, 1: 16, 2: 3370, 3: 3148}
[-4.05962183e-04 -1.12526941e+11 -3.62192692e-04 -2.91524196e-03]
best model: -4.24*x2
{0: 2490, 1: 2479, 2: 2491, 3: 2440}
[-0.00036652 -0.00054139 -0.00033253 -0.00123083]
best model: 3.60*Sin(2.72*x1)
{0: 3305, 1: 63, 2: 3305, 3: 3227}
[-3.47628793e-04 -4.52236312e+00 -3.38324719e-04 -1.24529618e-03]
best model: 3.60*Sin(2.72*x1)
{0: 84, 1: 3286, 2: 3305, 3: 3225}
[-3.19134285e+08 -5.59272536e-04 -3.41379892e-04 -1.25437165e-03]
best model: 3.60*Sin(2.72*x1)
{0: 19, 1: 107, 2: 110, 3: 9664}
[-6.07635009e+01 -5.12827890e+17 -8.54437202e+09 -9.15424529e-04]
best model: 3.68*Sin(2.74*x1)
{0: 25, 1: 76, 2: 4974, 3: 4825}
[-2.77204121e+00 -9.65579745e-01 -3.84279937e-04 -1.32060016e-03]
best model: -4.24*x2
{0: 129, 1: 4868, 2: 4897, 3: 6}
[-3.16229731e-01 -5.37142383e-04 -3.59548331e-04 -3.51616980e+17]
best model: -4.24*x2
{0: 1537, 1: 2799, 2: 2812, 3: 2752}
[-0.03085949 -0.00054441 -0.00033255 -0.0012708 ]
best model: 2.79*Tanh(469.33*x1)
{0: 45, 1: