# XGBoost, Imbalanced Classification and Hyperopt 

This is a tutorial on setting up XGBoost for imbalanced classification and tuning its hyperparameters with `hyperopt`.  

There are three main sections:
1. Hyperopt/Bayesian Hyperparameter Tuning
2. Focal and Crossentropy losses
3. XGBoost Parameter Meanings

(references are dropped as-needed)

# Hyperopt  

The `hyperopt` package is associated with [Bergstra et al.](http://proceedings.mlr.press/v28/bergstra13.pdf). The authors argued that the performance of a given model depends both on the fundamental quality of the algorithm and on the details of its tuning (its *hyper-parameters*). 


For the uninitiated, suppose we have some dataset `X, y` and we train a model on it:

In [None]:
import numpy as np
from sklearn.linear_model import ElasticNet

X = np.expand_dims(np.arange(0, 100, 1), -1)
y = 2*X + 1

lr = ElasticNet(alpha=0.5)

lr.fit(X, y)

The hyper-parameters are "meta" parameters which control the training process:

In [None]:
lr.get_params() # returns hyper-parameters

whereas the parameters are what the model learns from the data (in this case, the coefficients and intercept):

In [None]:
lr.coef_, lr.intercept_ # returns parameters

The authors of [Bergstra et al.](http://proceedings.mlr.press/v28/bergstra13.pdf) proposed an optimization algorithm that works on the underlying expression graph describing how a performance metric is computed from the hyper-parameters.

The idea, in a **very** summarized form, is that the algorithm takes as inputs the null prior and an experimental history $H$ of loss-function values, and returns suggestions for which configurations to try next. Random sampling from the prior is itself treated as a valid search strategy, and the approach was shown to significantly improve model performance on vision-related tasks.
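To make the suggest-then-evaluate loop concrete, here is a minimal sketch in plain Python. It is not the TPE algorithm from the paper: the `suggest` function below ignores the history and simply samples from the prior, and the names `toy_loss`, `suggest` and `run_search` are made up for illustration.

```python
import random

def toy_loss(config):
    # hypothetical loss, minimized at x = 0.3
    return (config["x"] - 0.3) ** 2

def suggest(history, rng):
    # a real algorithm such as TPE would use `history` to bias its proposal;
    # this stand-in just samples from the prior, uniform on [0, 1]
    return {"x": rng.uniform(0.0, 1.0)}

def run_search(n_trials, seed=0):
    rng = random.Random(seed)
    history = []  # list of (config, loss) pairs, one per trial
    for _ in range(n_trials):
        config = suggest(history, rng)
        history.append((config, toy_loss(config)))
    best = min(history, key=lambda pair: pair[1])
    return best, history

best, history = run_search(50)
print(best)
```

Swapping in a smarter `suggest` that actually reads `history` is the part that `hyperopt` takes care of.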

The accompanying package which implements much of these ideas is [hyperopt](https://hyperopt.github.io/hyperopt/). 

The basic steps to set this up are as follows:

In [None]:
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

# STEP 1: define a search space
SPACE = {
    'param1': hp.uniform('param1', 0, 1),
    'param2': hp.choice('param2', ['option1', 'option2']),
    # and so on
}

# STEP 2: define an objective function
def objective(params):

    # do some computation and evaluation here

    loss = 0.0  # placeholder: replace with the real loss computed from params
    return {'loss': loss, 'status': STATUS_OK}

# STEP 3: evaluate the search space

best_hyperparams = fmin(
    fn=objective,
    max_evals=5, 
    space=SPACE,
    algo=tpe.suggest,
    trials=Trials()
)

The above needs some heavy explanation, so let's break this down:

### Search Space
This is a dictionary of parameters that serve as the inputs to the optimization. Each parameter is sampled at random from a defined distribution over some domain (random in the statistical sense, not "arbitrary"). `hyperopt` provides [multiple methods](https://hyperopt.github.io/hyperopt/getting-started/search_spaces/) for generating these values; the ones I use most are:

`hp.choice(label, options)`  
Gives a random choice from a list of options  

`hp.uniform(label, low, high)`  
Returns a float drawn uniformly between `low` and `high`


The `label` parameter is the key used to retrieve the sampled value from the output  

### Objective Function
This takes a single argument (in this case a dictionary of sampled values), does some computation and returns a loss. When it returns a dictionary, that dictionary must contain at least two entries: `loss` and `status`.

### Algorithm
This is the novelty proposed in the paper. In the above I use the Tree of Parzen Estimators (TPE, `tpe.suggest`); random search (`rand.suggest`) and adaptive TPE (`atpe.suggest`) are also available.

### Trials
Finally, this object stores the parameters and the result of every run, along with the run counter.  


In [None]:
# a dummy example showing how parameters are chosen to minimize loss

from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

SPACE = {
    'param1': hp.choice('param1', ['a', 'b']),
    'param2': hp.uniform('param2', 0, 1)
}

# some "computation" which changes our loss
def objective(params):
    loss = 0.0  # start from zero; starting from infinity would make every trial's loss infinite

    print('params: ', params) # to show choices

    if params['param1']  == 'a':
        loss += 3
    elif params['param1'] == 'b':
        loss += 1

    if params['param2'] < 0.5:
        loss += 1
    if params['param2'] >= 0.5:
        loss += 2

    return {'loss': loss, 'status': STATUS_OK}

trials = Trials()
fmin(
    fn=objective,
    space=SPACE,
    max_evals=5,
    algo=tpe.suggest,
    trials=trials
)

# Imbalanced Learning 

[Wang et al.](https://arxiv.org/pdf/1908.01672.pdf) proposed a modification to the [original XGBoost](https://cran.microsoft.com/snapshot/2017-12-11/web/packages/xgboost/vignettes/xgboost.pdf) algorithm that changes the loss function. Concretely, two losses were proposed:

Weighted Crossentropy:  
$$
L_w = -\sum_{i=1}^m \left(\alpha y_i \log(\hat{y_i}) +  (1-y_i)\log(1-\hat{y_i})\right)
$$

If $\alpha$ is greater than 1, extra loss is counted on classifying a 1 as a 0. If $\alpha$ is less than 1, the loss function puts relatively more weight on whether data points with label 0 are correctly identified.
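As a quick sanity check on the role of $\alpha$, here is a small sketch that evaluates $L_w$ directly in plain Python. The function name and the two single-point examples are my own, chosen only to show which kind of error $\alpha$ scales.

```python
import math

def weighted_cross_entropy(y_true, y_prob, alpha):
    """L_w = -sum(alpha * y * log(p) + (1 - y) * log(1 - p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total -= alpha * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# one missed positive (y=1 predicted at 0.1) and one false alarm (y=0 predicted at 0.9)
missed_positive = ([1], [0.1])
false_alarm = ([0], [0.9])

for alpha in (0.5, 1.0, 2.0):
    fn_loss = weighted_cross_entropy(*missed_positive, alpha)
    fp_loss = weighted_cross_entropy(*false_alarm, alpha)
    print(f"alpha={alpha}: missed positive={fn_loss:.3f}, false alarm={fp_loss:.3f}")
```

The loss on the missed positive scales linearly with `alpha`, while the loss on the false alarm does not change.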

and

Focal Loss:  
$$
L_f = -\sum_{i=1}^m \left( y_i (1-\hat{y_i})^\gamma \log(\hat{y_i}) + (1-y_i)\hat{y_i}^\gamma \log(1-\hat{y_i}) \right)
$$

If $\gamma=0$, the above becomes regular cross-entropy. The paper goes into more detail on the first- and second-order derivatives (which must be supplied by hand, since XGBoost does not perform automatic differentiation) and on how they are integrated into the algorithm. 
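The $\gamma=0$ claim is easy to verify numerically. The sketch below evaluates both losses in plain Python on a handful of made-up probabilities; the function names are mine, not the paper's.

```python
import math

def cross_entropy(y_true, y_prob):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))

def focal_loss(y_true, y_prob, gamma):
    # each term is down-weighted by how confident the prediction already is
    return -sum(y * (1 - p) ** gamma * math.log(p)
                + (1 - y) * p ** gamma * math.log(1 - p)
                for y, p in zip(y_true, y_prob))

y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.1, 0.7]  # made-up predicted probabilities

print(focal_loss(y_true, y_prob, gamma=0.0), cross_entropy(y_true, y_prob))
print(focal_loss(y_true, y_prob, gamma=2.0))
```

With $\gamma > 0$ the well-classified points (here the 0.9 and 0.1 predictions) contribute far less than the badly classified ones, which is the point of the focal loss.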

The main focus of both of the above losses is in weighting misclassification of the minority class more heavily than misclassification of the majority class. 

*Calling a 0 a 1 is penalized less than calling a 1 a 0 if we have substantially more 0's than 1's*  

The authors of the paper implemented these loss functions in [imbalance-xgboost](https://github.com/jhwjhw0123/Imbalance-XGBoost). We will not be using the entire library (since it masks too many moving parts behind high-level interfaces for my liking), but we will borrow their implementation of the Weighted Crossentropy and Focal losses:



In [5]:
# credit to https://github.com/jhwjhw0123/Imbalance-XGBoost
import numpy as np

class Weight_Binary_Cross_Entropy:
    '''
    The class of binary cross entropy loss, allows the users to change the weight parameter
    '''

    def __init__(self, imbalance_alpha):
        '''
        :param imbalance_alpha: the imbalance alpha value for the minority class (labelled '1')
        '''
        self.imbalance_alpha = imbalance_alpha

    def weighted_binary_cross_entropy(self, pred, dtrain):
        # assign the value of imbalanced alpha
        imbalance_alpha = self.imbalance_alpha
        # retrieve data from dtrain matrix
        label = dtrain.get_label()
        # compute the prediction with sigmoid
        sigmoid_pred = 1.0 / (1.0 + np.exp(-pred))
        # gradient
        grad = -(imbalance_alpha ** label) * (label - sigmoid_pred)
        hess = (imbalance_alpha ** label) * sigmoid_pred * (1.0 - sigmoid_pred)

        return grad, hess
      
      
class Focal_Binary_Loss:
    '''
    The class of focal loss, allows the users to change the gamma parameter
    '''

    def __init__(self, gamma_indct):
        '''
        :param gamma_indct: The parameter to specify the gamma indicator
        '''
        self.gamma_indct = gamma_indct

    def robust_pow(self, num_base, num_pow):
        # numpy does not permit raising negative numbers to a fractional power,
        # so take the power of the absolute value and restore the sign

        return np.sign(num_base) * (np.abs(num_base)) ** (num_pow)

    def focal_binary_object(self, pred, dtrain):
        gamma_indct = self.gamma_indct
        # retrieve data from dtrain matrix
        label = dtrain.get_label()
        # compute the prediction with sigmoid
        sigmoid_pred = 1.0 / (1.0 + np.exp(-pred))
        # gradient
        # complex gradient with different parts
        g1 = sigmoid_pred * (1 - sigmoid_pred)
        g2 = label + ((-1) ** label) * sigmoid_pred
        g3 = sigmoid_pred + label - 1
        g4 = 1 - label - ((-1) ** label) * sigmoid_pred
        g5 = label + ((-1) ** label) * sigmoid_pred
        # combine the gradient
        grad = gamma_indct * g3 * self.robust_pow(g2, gamma_indct) * np.log(g4 + 1e-9) + \
               ((-1) ** label) * self.robust_pow(g5, (gamma_indct + 1))
        # combine the gradient parts to get hessian components
        hess_1 = self.robust_pow(g2, gamma_indct) + \
                 gamma_indct * ((-1) ** label) * g3 * self.robust_pow(g2, (gamma_indct - 1))
        hess_2 = ((-1) ** label) * g3 * self.robust_pow(g2, gamma_indct) / g4
        # get the final 2nd order derivative
        hess = ((hess_1 * np.log(g4 + 1e-9) - hess_2) * gamma_indct +
                (gamma_indct + 1) * self.robust_pow(g5, gamma_indct)) * g1

        return grad, hess
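Since these derivatives are written by hand, it is worth checking them. Below is a small finite-difference check of the weighted cross-entropy gradient and hessian with respect to the raw margin $z$ (where $\hat{y} = \sigma(z)$). It is a standalone sketch in plain Python: the closed forms are copied from `weighted_binary_cross_entropy`, and the helper names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_ce(z, y, alpha):
    # per-sample weighted cross-entropy as a function of the raw margin z
    p = sigmoid(z)
    return -(alpha * y * math.log(p) + (1 - y) * math.log(1 - p))

def analytic_grad_hess(z, y, alpha):
    # same closed forms as in weighted_binary_cross_entropy
    p = sigmoid(z)
    grad = -(alpha ** y) * (y - p)
    hess = (alpha ** y) * p * (1.0 - p)
    return grad, hess

def numeric_grad_hess(z, y, alpha, eps=1e-4):
    # central differences for the first and second derivative
    f = lambda t: weighted_ce(t, y, alpha)
    grad = (f(z + eps) - f(z - eps)) / (2 * eps)
    hess = (f(z + eps) - 2 * f(z) + f(z - eps)) / eps ** 2
    return grad, hess

for y in (0, 1):
    print(y, analytic_grad_hess(0.7, y, 2.0), numeric_grad_hess(0.7, y, 2.0))
```

The same approach should work for the focal loss, although I have not reproduced that check here.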



# Application to XGBoost

Finally, we will combine Bayesian tuning via `hyperopt` with the imbalanced losses and apply them to a toy dataset. Before this, however, we need some clarity on the `xgb` parameters:

`booster`  
This determines which booster to use; it can be `gbtree`, `gblinear` or `dart`. `dart` drops trees in order to counter over-fitting, and actually inherits from `gbtree`. The tree boosters combine a large number of regression trees with a small learning rate. 

`eta`  
This is the learning rate (step-size shrinkage): the contribution of each newly added tree is scaled by `eta`. It reduces the influence of each individual tree and leaves room for future trees to improve the model.

`gamma`   
This is the minimum loss reduction required to further partition a leaf node (a larger `gamma` makes the algorithm more conservative, i.e. pushes it towards underfitting). `min_split_loss` is an alias for `gamma`.

`max_depth`   
This controls how deep a single decision tree is allowed to grow in any update step. `0` means no limit.  

`subsample`  
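This is the fraction of training rows randomly sampled before growing each tree. Like bagging in random forests, it helps prevent over-fitting while also speeding up computation. (Column subsampling is controlled separately by the `colsample_by*` parameters.) 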

`lambda` and `alpha`  
L2 and L1 regularization terms on the weights, respectively. 

`tree_method`   
The ones you would most commonly use on large data are `approx` and `hist` (or `gpu_hist` if you have a GPU; XGBoost 2.0 and later instead combine `tree_method='hist'` with `device='cuda'`).

`scale_pos_weight`   
This serves the same purpose as the imbalanced losses: it upscales the penalty for misclassifying the minority (positive) class. It is typically set to the number of negative instances divided by the number of positive instances. 

Another keyword argument that we will use is `feval`, which allows specification of a custom evaluation metric (in our case a customized `f1` score). Recent XGBoost releases deprecate `feval` in favor of `custom_metric`.  

Finally, we use `obj` to pass in our custom objective functions
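The usual starting value for `scale_pos_weight` (negatives over positives) takes one line to compute from the labels. Here is a small sketch; the helper name and the 95:5 label list are made up for illustration.

```python
def negative_to_positive_ratio(labels):
    """Return (#negatives / #positives) for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

labels = [0] * 95 + [1] * 5  # made-up 95:5 imbalance
params = {"scale_pos_weight": negative_to_positive_ratio(labels)}
print(params)  # {'scale_pos_weight': 19.0}
```

In practice I would treat this ratio as a starting point and let the tuner search around it.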

Okay let's put this all together:



In [None]:
# some toy data 
import xgboost as xgb
from sklearn.datasets import make_moons
X_train, y_train = make_moons(n_samples=200, shuffle=True, noise=0.5, random_state=10)
X_test, y_test = make_moons(n_samples=200, shuffle=True, noise=0.5, random_state=11)  # different seed so the test set is not a copy of the training set

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

In [None]:
# custom f1 evaluation metric
import numpy as np
from sklearn.metrics import f1_score

# f1 evaluating score for XGBoost
def f1_eval(y_pred, dtrain):
    y_true = dtrain.get_label()

    # binarize at the 0.20 cut-off without modifying y_pred in place
    y_bin = (y_pred > 0.20).astype(int)

    err = 1 - f1_score(y_true, y_bin)
    return 'f1_err', err
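The 0.20 cut-off in `f1_eval` matters more than it might seem: on imbalanced data the decision threshold can move F1 substantially. The sketch below shows this in plain Python with made-up labels and scores; `f1_at_threshold` is a hypothetical helper, not part of scikit-learn.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    # binarize the scores, then compute F1 from the confusion counts
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# six negatives, two positives, with deliberately low scores on one positive
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.30, 0.10, 0.05, 0.25, 0.60]

for threshold in (0.2, 0.5):
    print(threshold, round(f1_at_threshold(y_true, y_prob, threshold), 3))
```

On this toy data the lower threshold recovers the second positive at the cost of one false alarm, and F1 goes from about 0.667 to 0.8.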

In [3]:
import numpy as np
from sklearn.metrics import f1_score
from sklearn.datasets import make_moons
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope

# Step I: Define the search space
# here is where we use hyperopt's choice to choose between Weighted Cross Entropy and the Focal loss function
#    as a parameter of the optimization!

wbce = Weight_Binary_Cross_Entropy(imbalance_alpha=0.5)
weighted_ce_obj = wbce.weighted_binary_cross_entropy
wf = Focal_Binary_Loss(0.5)
weighted_focal_obj = wf.focal_binary_object

SPACE = {
    'n_jobs': 0, 
    'objective': 'binary:hinge',
    'subsample': hp.uniform('subsample', 0.5, 1),
    'min_child_weight': hp.uniform('min_child_weight', 1, 10),
    'eta': hp.uniform('eta', 0, 1),
    'max_depth': scope.int(hp.quniform('max_depth', 1, 12, 1)),
    'min_split_loss': hp.uniform('min_split_loss', 0, 0.2),
    'obj': hp.choice('obj', (weighted_ce_obj, weighted_focal_obj)), # hyperopt will sample one of these objectives
    'num_parallel_tree': hp.choice('num_parallel_tree', list(range(1, 10))),  # label now matches the key; plain Python ints
    'lambda': hp.uniform('lambda', 0, 1),
    'alpha': hp.uniform('alpha', 0, 1),
    'booster': hp.choice('booster', ['gbtree', 'gblinear', 'dart']),
    'tree_method': hp.choice('tree_method', ('approx', 'hist')), 
}

# Step II: Define the objective function
def objective(space):

    # this is a "hack" since I want to pass obj in as
    #   a member of the search space
    #   but treat it ALONE as a keyword argument
    #   may increase computation time ever-so-slightly 
    params = {}
    for k, v in space.items():
        if k != 'obj' :
            params[k] = v

    obj = space['obj']

    # train the classifier
    booster = xgb.train(
        params,
        dtrain,
        obj=obj,
        feval=f1_eval # we also pass in a custom F1 evaluation metric here
    )

    y_pred = booster.predict(dtest)
    y_pred[y_pred < 0.5] = 0
    y_pred[y_pred >= 0.5] = 1

    # evaluate and return
    # note we want to maximize F1 and hence MINIMIZE NEGATIVE F1
    return {'loss': -f1_score(y_test, y_pred), 'status': STATUS_OK}  # sklearn convention: (y_true, y_pred)


# Step III: Optimize! 
trials = Trials()
best_hyperparams = fmin(
  space=SPACE, 
  fn=objective,
  algo=tpe.suggest,
  max_evals=100, # this would be 100, 500 or something higher when actually optimizing
  trials=trials
)


In [None]:
# we can get the best_parameters:
print(best_hyperparams)

We can also plot how the F1 score varied with any of our hyperparameters:  

In [None]:
import plotly.graph_objects as go

trials.trials[0]

fig = go.Figure(data=go.Scatter(
    x=[t['misc']['vals']['max_depth'][0] for t in trials.trials],  # 'vals' holds the sampled values; 'idxs' holds trial ids
    y=[-t['result']['loss'] for t in trials.trials],
    mode='markers'
))

fig.update_layout(
    xaxis=dict(title='max_depth'),
    yaxis=dict(title='f1_score'),
    autosize=False,
    width=800,
    height=800,
    template='plotly_dark',
    title='F1 Score against Max-Depth of XGBoost Trees'
)

fig.show()

# END BLOG

In [None]:
import abc
from copy import deepcopy

import numpy as np
import xgboost as xgb
from sklearn import metrics

class BaseEval(metaclass=abc.ABCMeta):
    '''Base class for creating eval_metrics for XGBoost supporting binary classification threshold
    
    metric() must be overridden
    
    '''
    def __init__(self, thresh:float=0.5):
        self.thresh = thresh

    def __call__(self, predt:np.ndarray, dtest:xgb.DMatrix):
        y_thresh = deepcopy(predt)
        y_thresh[y_thresh > self.thresh] = 1
        y_thresh[y_thresh <= self.thresh] = 0 

        # report under the subclass name rather than a hard-coded 'recall'
        return type(self).__name__, self.metric(dtest.get_label(), y_thresh) 

    @abc.abstractmethod
    def metric(self, y_true, y_pred):
        '''Define evaluation function here'''
        pass


class RecallEval(BaseEval):
    def metric(self, y_true, y_pred):
       
        return metrics.recall_score(y_true, y_pred) 


class PrecisionEval(BaseEval):
    def metric(self, y_true, y_pred):
       
        return metrics.precision_score(y_true, y_pred) 


class F2Eval(BaseEval):
    def metric(self, y_true, y_pred):
       
        return metrics.fbeta_score(y_true, y_pred, beta=2.0) 

In [None]:
from sklearn import metrics
from copy import deepcopy

# class RecallEval:
#     def __init__(self, thresh:float=0.5):
#         self.thresh = thresh

#     def __call__(self, predt:np.ndarray, dtest:xgb.DMatrix):
#         y_thresh = deepcopy(predt)
#         y_thresh[y_thresh > self.thresh] = 1
#         y_thresh[y_thresh <= self.thresh] = 0 

#         return 'recall', metrics.recall_score(dtest.get_label(), y_thresh)

class PrecisionEval:
    def __init__(self, thresh:float=0.5):
        self.thresh = thresh

    def __call__(self, predt:np.ndarray, dtest:xgb.DMatrix):
        y_thresh = deepcopy(predt)
        y_thresh[y_thresh > self.thresh] = 1
        y_thresh[y_thresh <= self.thresh] = 0 

        return 'precision', metrics.precision_score(dtest.get_label(), y_thresh)


recall_eval = RecallEval(0.6)

In [None]:
X, y = make_moons()
dtrain = xgb.DMatrix(X, y)
dvalid = xgb.DMatrix(X, y)

In [None]:
booster = xgb.train(
    {
        'objective':'binary:logistic',
        'tree_method': 'hist',
        'verbosity': 1,
        },
    obj=weighted_ce_obj,
    dtrain=dtrain,
    num_boost_round=3,
    feval=recall_eval,
    evals=[(dtrain, 'train'), (dvalid, 'eval')],
    verbose_eval=True
)

In [None]:
booster.predict(dtrain)

In [None]:
!pip install catboost

From https://coderzcolumn.com/tutorials/machine-learning/catboost-an-in-depth-guide-python  


As a part of this section, we'll explain how we can use the custom loss function with catboost. We can create a class that can be used as a custom loss function but it should have one method named calc_ders_range(). This method takes as an input list of predictions, actual target labels, and weights. It then returns a list of tuples where the first value in the tuple is the first derivative of the loss function and the second value is the second derivative of a loss function. The list of tuple must have the same length as the list of predictions and target labels passed to it.

We can then pass this class to the loss_function parameter of estimators. Below we have created a simple mean squared error loss function and explained usage of it with a simple example in the next cell.

In [4]:
from catboost import CatBoostClassifier


class FbetaMetric:
    
    @staticmethod
    def get_fbeta(y_true, y_pred):
        y_pred[y_pred < 0.5] = 0
        y_pred[y_pred >= 0.5] = 1
        y_true = y_true.astype(int)

        return metrics.fbeta_score(y_true, y_pred, beta=2)

    def is_max_optimal(self):
        return True # greater is better

    def evaluate(self, approxes, target, weight):            
        assert len(approxes) == 1
        assert len(target) == len(approxes[0])
        y_true = np.array(target).astype(int)
        approx = approxes[0]
        score = self.get_fbeta(y_true, approx)
        return score, 1

    def get_final_error(self, error, weight):
        return error




class Focal_Binary_Loss_Catboost(Focal_Binary_Loss):
    def calc_ders_range(self, approxes, targets, weights):
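        # focal_binary_object only reads the labels, so wrap the targets in a
        # DMatrix with placeholder features of matching length (not X_test)
        labels = np.asarray(targets, dtype=float)
        dlabels = xgb.DMatrix(np.zeros((len(labels), 1)), label=labels)
        grad, hess = self.focal_binary_object(np.asarray(approxes), dlabels)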
        grad = np.expand_dims(grad, -1)
        hess = np.expand_dims(hess, -1)

        return np.hstack((grad, hess))

try:
    loss_fn = Focal_Binary_Loss_Catboost(2)
    model = CatBoostClassifier(
        loss_function=loss_fn, 
        eval_metric='F:beta=2',
        verbose=0
    )
    model.fit(X_train, y_train)

except Exception as e:
    print(e)

class Weighted_Binary_Cross_Entropy(object):
    def __init__(self, imbalance_alpha):
        self.imbalance_alpha = imbalance_alpha

    def calc_ders_range(self, approxes, targets, weights):
        imbalance_alpha = self.imbalance_alpha
        # retrieve data from dtrain matrix
        # compute the prediction with sigmoid
        sigmoid_pred = 1.0 / (1.0 + np.exp(-approxes))
        # gradient
        grad = -(imbalance_alpha ** targets) * (targets- sigmoid_pred)
        hess = (imbalance_alpha ** targets) * sigmoid_pred * (1.0 - sigmoid_pred)

        grad = np.expand_dims(grad, -1)
        hess = np.expand_dims(hess, -1)

        return np.hstack((grad, hess)) 


In [None]:
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = targets[index] - p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result

In [None]:
y_pred = model.predict(X_test)

In [None]:
metrics.fbeta_score(y_test, y_pred, beta=2)

In [None]:
grad = np.expand_dims(grad, -1)
hess = np.expand_dims(hess, -1)

In [None]:
result = np.hstack((grad, hess))

# AdaCost

In [None]:
from sklearn.ensemble._weight_boosting import BaseWeightBoosting
import numpy as np
import pandas as pd
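def inner1d(a, b):
    # stand-in for numpy.core.umath_tests.inner1d, which recent NumPy no longer ships:
    # row-wise inner product of two 2-D arrays
    return np.einsum('ij,ij->i', a, b)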
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.utils.validation import has_fit_parameter, check_is_fitted

class AdaCost(BaseWeightBoosting, ClassifierMixin):
    """An AdaCost classifier.
    
    AdaCost [1], a variant of AdaBoost, is a misclassification cost-sensitive 
    boosting method. It uses the cost of misclassifications to update the 
    training distribution on successive boosting rounds. The purpose is to 
    reduce the cumulative misclassification cost more than AdaBoost.
    
    This class implements the algorithm known as Adacost which is a 
    modification of the algorithm AdaBoost-SAMME [2].
    
    
    Parameters
    ----------
    
    base_estimator : object, optional (default=DecisionTreeClassifier)
        The base estimator from which the boosted ensemble is built.
        Support for sample weighting is required, as well as proper `classes_`
        and `n_classes_` attributes.
    
    n_estimators : integer, optional (default=50)
        The maximum number of estimators at which boosting is terminated.
        In case of perfect fit, the learning procedure is stopped early.

    max_depth : integer, optional (default=1)
        The maximum depth of the decision trees.
    
    learning_rate : float, optional (default=1.)
        Learning rate shrinks the contribution of each classifier by
        ``learning_rate``. There is a trade-off between ``learning_rate`` and
        ``n_estimators``.
    
    cost_matrix : numpy.ndarray, optional (default=None)
        A matrix representing the cost of misclassification. The rows 
        represent the predicted class and columns represent the actual class.
        If None, all misclassifications have equal cost which results in the
        implementation of AdaBoostClassifier in sklearn.
        
    algorithm : {'SAMME', 'SAMME.R'}, optional (default='SAMME.R')
        If 'SAMME.R' then use the SAMME.R real boosting algorithm.
        ``base_estimator`` must support calculation of class probabilities.
        If 'SAMME' then use the SAMME discrete boosting algorithm.
        The SAMME.R algorithm typically converges faster than SAMME,
        achieving a lower test error with fewer boosting iterations.
    
    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
    
    Attributes
    ----------
    estimators_ : list of classifiers
        The collection of fitted sub-estimators.
    
    classes_ : array of shape = [n_classes]
        The classes labels.
    
    n_classes_ : int
        The number of classes.
    
    estimator_weights_ : array of floats
        Weights for each estimator in the boosted ensemble.
    
    estimator_errors_ : array of floats
        Classification error for each estimator in the boosted
        ensemble.
    
    feature_importances_ : array of shape = [n_features]
        The feature importances if supported by the ``base_estimator``.

    See also
    --------
    AdaBoostClassifier, GradientBoostingClassifier, DecisionTreeClassifier
    
    References
    ----------
    .. [1] W. Fan, S. Stolfo, J. Zhang, P. Chan, " AdaCost: Misclassification 
           Cost-sensitive Boosting", 1999
    .. [2] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost", 2009.
    """
    
    def __init__(self,
                 base_estimator=None,
                 n_estimators=50,
                 max_depth = 1,
                 learning_rate=1.,
                 algorithm='SAMME.R',
                 cost_matrix=None,
                 random_state=None):

        super(AdaCost, self).__init__(
            base_estimator=base_estimator,
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            random_state=random_state)

        self.max_depth = max_depth
        self.algorithm = algorithm
        self.cost_matrix = cost_matrix
        
        if self.cost_matrix is not None:
            self.cost_table = self.cost_table_calc(self.cost_matrix)
    
    def cost_table_calc(self,cost_matrix):
        """Creates a table of values from the cost matrix.

        Parameters
        ----------
        cost_matrix : array-like of shape = [n_classes, n_classes]

        Returns
        -------
        df : dataframe of shape = [n_classes * n_classes, 3]      
                      
        """
        table = np.empty((0,3))

        for (x,y), value in np.ndenumerate(cost_matrix):
            table = np.vstack((table,np.array([x+1,y+1,value])))        
        
        return pd.DataFrame(table,columns = ['row','column','cost'])    
        
    
    def fit(self, X, y, sample_weight=None):
        """Build a boosted classifier from the training set (X, y).
        
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. DOK and LIL are converted to CSR.
        
        y : array-like of shape = [n_samples]
            The target values (class labels).
        
        sample_weight : array-like of shape = [n_samples], optional
            Sample weights. If None, the sample weights are initialized to
            ``1 / n_samples``.
        
        Returns
        -------
        self : object
            Returns self.
        """
        if self.cost_matrix is None:
            n_classes = len(np.unique(y))
            self.cost_table = self.cost_table_calc(np.ones([n_classes,n_classes]))
            
        if self.algorithm not in ('SAMME', 'SAMME.R'):
            raise ValueError("algorithm %s is not supported" % self.algorithm)

        # Fit
        return super(AdaCost, self).fit(X, y, sample_weight)
        
        
    def _validate_estimator(self):
        """Check the estimator and set the base_estimator_ attribute."""
        super(AdaCost, self)._validate_estimator(
            default=DecisionTreeClassifier(max_depth = self.max_depth))

        #  SAMME-R requires predict_proba-enabled base estimators
        if self.algorithm == 'SAMME.R':
            if not hasattr(self.base_estimator_, 'predict_proba'):
                raise TypeError(
                    "AdaBoostClassifier with algorithm='SAMME.R' requires "
                    "that the weak learner supports the calculation of class "
                    "probabilities with a predict_proba method.\n"
                    "Please change the base estimator or set "
                    "algorithm='SAMME' instead.")
        if not has_fit_parameter(self.base_estimator_, "sample_weight"):
            raise ValueError("%s doesn't support sample_weight."
                             % self.base_estimator_.__class__.__name__)        


    def _boost(self, iboost, X, y, sample_weight, random_state):
        """Implement a single boost.
        
        Perform a single boost according to the real multi-class SAMME.R
        algorithm or to the discrete SAMME algorithm and return the updated
        sample weights.

        Parameters
        ----------
        iboost : int
            The index of the current boost iteration.

        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. DOK and LIL are converted to CSR.

        y : array-like of shape = [n_samples]
            The target values (class labels).

        sample_weight : array-like of shape = [n_samples]
            The current sample weights.

        random_state : numpy.RandomState
            The current random number generator

        Returns
        -------
        sample_weight : array-like of shape = [n_samples] or None
            The reweighted sample weights.
            If None then boosting has terminated early.

        estimator_weight : float
            The weight for the current boost.
            If None then boosting has terminated early.

        estimator_error : float
            The classification error for the current boost.
            If None then boosting has terminated early.
        """
        if self.algorithm == 'SAMME.R':
            return self._boost_real(iboost, X, y, sample_weight, random_state)

        else:  # elif self.algorithm == "SAMME":
            return self._boost_discrete(iboost, X, y, sample_weight,
                                        random_state)
     
    def _boost_real(self, iboost, X, y, sample_weight, random_state):
        """Implement a single boost using the SAMME.R real algorithm."""
        estimator = self._make_estimator(random_state=random_state)

        estimator.fit(X, y, sample_weight=sample_weight)

        y_predict_proba = estimator.predict_proba(X)

        if iboost == 0:
            self.classes_ = getattr(estimator, 'classes_', None)
            self.n_classes_ = len(self.classes_)

        y_predict = self.classes_.take(np.argmax(y_predict_proba, axis=1),
                                       axis=0)

        # Instances incorrectly classified
        incorrect = y_predict != y
        cost = self.misclassification_cost(y,y_predict)

        # Error fraction
        estimator_error = np.mean(
            np.average(incorrect, weights=sample_weight, axis=0))

        # Stop if classification is perfect
        if estimator_error <= 0:
            return sample_weight, 1., 0.

        # Construct y coding as described in Zhu et al [2]:
        #
        #    y_k = 1 if c == k else -1 / (K - 1)
        #
        # where K == n_classes_ and c, k in [0, K) are indices along the second
        # axis of the y coding with c being the index corresponding to the true
        # class label.
        n_classes = self.n_classes_
        classes = self.classes_
        y_codes = np.array([-1. / (n_classes - 1), 1.])
        y_coding = y_codes.take(classes == y[:, np.newaxis])

        # Displace zero probabilities so the log is defined.
        # Also fix negative elements which may occur with
        # negative sample weights.
        proba = y_predict_proba  # alias for readability
        proba[proba < np.finfo(proba.dtype).eps] = np.finfo(proba.dtype).eps

        # Boost weight using multi-class AdaBoost SAMME.R alg
        estimator_weight = (-1. * self.learning_rate
                                * (((n_classes - 1.) / n_classes) *
                                   inner1d(y_coding, np.log(y_predict_proba))))

        # Only boost the weights if it will fit again
        if not iboost == self.n_estimators - 1:
            # Only boost positive weights
            sample_weight *= np.exp(estimator_weight * cost *
                                    ((sample_weight > 0) |
                                     (estimator_weight < 0)))

        return sample_weight, 1., estimator_error
                                        
    def _boost_discrete(self, iboost, X, y, sample_weight, random_state):
        """Implement a single boost using the SAMME discrete algorithm."""
        estimator = self._make_estimator(random_state=random_state)

        estimator.fit(X, y, sample_weight=sample_weight)

        y_predict = estimator.predict(X)

        if iboost == 0:
            self.classes_ = getattr(estimator, 'classes_', None)
            self.n_classes_ = len(self.classes_)

        # Instances incorrectly classified
        incorrect = y_predict != y
        cost = self.misclassification_cost(y,y_predict)

        # Error fraction
        estimator_error = np.mean(
            np.average(incorrect, weights=sample_weight, axis=0))

        # Stop if classification is perfect
        if estimator_error <= 0:
            return sample_weight, 1., 0.

        n_classes = self.n_classes_

        # Stop if the error is at least as bad as random guessing
        if estimator_error >= 1. - (1. / n_classes):
            self.estimators_.pop(-1)
            if len(self.estimators_) == 0:
                raise ValueError('BaseClassifier in AdaBoostClassifier '
                                 'ensemble is worse than random, ensemble '
                                 'can not be fit.')
            return None, None, None

        # Boost weight using multi-class AdaBoost SAMME alg
        estimator_weight = self.learning_rate * (
            np.log((1. - estimator_error) / estimator_error) +
            np.log(n_classes - 1.))

        # Only boost the weights if it will fit again
        if not iboost == self.n_estimators - 1:
            # Only boost positive weights
            sample_weight *= np.exp(estimator_weight * incorrect * cost *
                                    ((sample_weight > 0) |
                                     (estimator_weight < 0)))

        return sample_weight, estimator_weight, estimator_error 

    def predict(self, X):
        """Predict classes for X.
        The predicted class of an input sample is computed as the weighted mean
        prediction of the classifiers in the ensemble.
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. DOK and LIL are converted to CSR.
        Returns
        -------
        y : array of shape = [n_samples]
            The predicted classes.
        """
        pred = self.decision_function(X)

        if self.n_classes_ == 2:
            return self.classes_.take(pred > 0, axis=0)

        return self.classes_.take(np.argmax(pred, axis=1), axis=0)
        
    def decision_function(self, X):
        """Compute the decision function of ``X``.
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. DOK and LIL are converted to CSR.
        Returns
        -------
        score : array, shape = [n_samples, k]
            The decision function of the input samples. The order of
            outputs is the same as that of the `classes_` attribute.
            Binary classification is a special case with ``k == 1``,
            otherwise ``k==n_classes``. For binary classification,
            values closer to -1 or 1 mean more like the first or second
            class in ``classes_``, respectively.
        """
        check_is_fitted(self, "n_classes_")
        #X = self._validate_X_predict(X)
        X = self._check_X(X)

        n_classes = self.n_classes_
        classes = self.classes_[:, np.newaxis]
        pred = None

        if self.algorithm == 'SAMME.R':
            # The weights are all 1. for SAMME.R
            pred = sum(_samme_proba(estimator, n_classes, X)
                       for estimator in self.estimators_)
        else:   # self.algorithm == "SAMME"
            pred = sum((estimator.predict(X) == classes).T * w
                       for estimator, w in zip(self.estimators_,
                                               self.estimator_weights_))

        pred /= self.estimator_weights_.sum()
        if n_classes == 2:
            pred[:, 0] *= -1
            return pred.sum(axis=1)
        return pred
    
    def predict_proba(self, X):
        """Predict class probabilities for X.
        The predicted class probabilities of an input sample is computed as
        the weighted mean predicted class probabilities of the classifiers
        in the ensemble.
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. DOK and LIL are converted to CSR.
        Returns
        -------
        p : array of shape = [n_samples, n_classes]
            The class probabilities of the input samples. The order of
            outputs is the same as that of the `classes_` attribute.
        """
        check_is_fitted(self, "n_classes_")

        n_classes = self.n_classes_
        #X = self._validate_X_predict(X)
        X = self._check_X(X)

        if self.algorithm == 'SAMME.R':
            # The weights are all 1. for SAMME.R
            proba = sum(_samme_proba(estimator, n_classes, X)
                        for estimator in self.estimators_)
        else:   # self.algorithm == "SAMME"
            proba = sum(estimator.predict_proba(X) * w
                        for estimator, w in zip(self.estimators_,
                                                self.estimator_weights_))

        proba /= self.estimator_weights_.sum()
        proba = np.exp((1. / (n_classes - 1)) * proba)
        normalizer = proba.sum(axis=1)[:, np.newaxis]
        normalizer[normalizer == 0.0] = 1.0
        proba /= normalizer

        return proba
    

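    def misclassification_cost(self, y_true, y_pred):
        """Look up the misclassification cost of each prediction.

        Each (predicted, true) pair is matched against ``self.cost_table``
        on its ``row`` (predicted class) and ``column`` (true class) keys.

        Parameters
        ----------
        y_true : array-like of shape = [n_samples]
                 True class values.

        y_pred : array-like of shape = [n_samples]
                 Predicted class values.

        Returns
        -------
        cost : array of shape = [n_samples]
               Cost of each sample's prediction.
        """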
        df = pd.DataFrame({'row':y_pred,'column':y_true})
        df = df.merge(self.cost_table,how = 'left', on = ['row','column'])
        
        return df['cost'].values
    
def _samme_proba(estimator, n_classes, X):
    """Calculate algorithm 4, step 2, equation c) of Zhu et al [1].

    References
    ----------
    .. [1] J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class AdaBoost", 2009.
    """
    proba = estimator.predict_proba(X)

    # Displace zero probabilities so the log is defined.
    # Also fix negative elements which may occur with
    # negative sample weights.
    proba[proba < np.finfo(proba.dtype).eps] = np.finfo(proba.dtype).eps
    log_proba = np.log(proba)

    return (n_classes - 1) * (log_proba - (1. / n_classes)
                              * log_proba.sum(axis=1)[:, np.newaxis])
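
The discrete boosting step is easier to follow on a toy batch. The sketch below re-implements only the arithmetic of `_boost_discrete` in standalone NumPy (it does not call the class); the helper name `samme_cost_update` and the toy numbers are made up for illustration.

```python
import numpy as np

def samme_cost_update(sample_weight, incorrect, cost, n_classes, learning_rate=1.0):
    """One cost-scaled SAMME weight update (illustrative helper, not part of AdaCost)."""
    # Weighted error rate of the weak learner
    err = np.average(incorrect, weights=sample_weight)
    # SAMME estimator weight: log-odds of being right plus the multi-class correction
    alpha = learning_rate * (np.log((1.0 - err) / err) + np.log(n_classes - 1.0))
    # Only misclassified samples are boosted, each scaled by its own cost
    new_weight = sample_weight * np.exp(alpha * incorrect * cost)
    return new_weight, alpha, err

w = np.full(4, 0.25)                                 # uniform starting weights
incorrect = np.array([False, True, False, False])    # one mistake out of four
cost = np.array([0.0, 2.0, 0.0, 0.0])                # that mistake has cost 2
new_w, alpha, err = samme_cost_update(w, incorrect, cost, n_classes=2)
print(err, alpha, new_w)
```

With an error of 0.25 the estimator weight is `log(3)`, so a plain SAMME update would multiply the misclassified sample's weight by 3. With a cost of 2 the factor becomes `3**2 = 9`, which is how costly mistakes get more attention in the next round.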

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, confusion_matrix
from sklearn.datasets import make_moons
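X_train, y_train = make_moons(n_samples=200, shuffle=True, noise=0.5, random_state=10)
# use a different seed for the test set; reusing random_state=10 would reproduce the training points exactly
X_test, y_test = make_moons(n_samples=200, shuffle=True, noise=0.5, random_state=11)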

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

In [None]:
cost_matrix = np.array([[0.0, 1.0],
                        [2.0, 0.0]])

def cost_calc(y_p, y, print_result=False):
    """Average misclassification cost; confusion-matrix rows follow y_p, columns follow y."""
    # confusion_matrix is imported from sklearn.metrics in the cell above
    con_mat = confusion_matrix(y_p, y)
    cost_mat = np.multiply(con_mat, cost_matrix)
    cost = np.sum(cost_mat) / len(y)
    if print_result:
        print("Confusion Matrix\n", con_mat)
        print("Costs\n", cost_mat)
        print("Average Cost = ", cost)
    else:
        return cost
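
To make the cost calculation concrete, here is a small worked example in plain NumPy. It builds the confusion matrix by hand with rows indexed by the predicted class and columns by the true class, which is the layout `confusion_matrix(y_pred, y_true)` produces when predictions are passed first. The five labels are made up.

```python
import numpy as np

cost_matrix = np.array([[0.0, 1.0],
                        [2.0, 0.0]])

y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 0])

# Rows index the predicted class, columns the true class
con_mat = np.zeros((2, 2))
np.add.at(con_mat, (y_pred, y_true), 1)

# Element-wise product weights each cell by its cost, then average per sample
avg_cost = np.sum(con_mat * cost_matrix) / len(y_true)
print(con_mat)
print(avg_cost)
```

Under this layout, predicting 0 for a true 1 costs 1.0 and predicting 1 for a true 0 costs 2.0. The two missed positives and the single false alarm contribute 2.0 each, giving an average cost of 0.8 over five samples.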

In [None]:
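# make_scorer calls score_func(y_true, y_pred), whereas cost_calc takes predictions first
score = make_scorer(lambda y_true, y_pred: cost_calc(y_pred, y_true), greater_is_better=False)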
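
`make_scorer` calls the wrapped function as `score_func(y_true, y_pred)` and, with `greater_is_better=False`, flips the sign of the result so that "larger is better" still holds for model selection. A small self-contained check, using a hypothetical `error_rate` metric and a `DummyClassifier` rather than the models in this notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

def error_rate(y_true, y_pred):
    # Fraction of wrong predictions (lower is better)
    return float(np.mean(y_true != y_pred))

scorer = make_scorer(error_rate, greater_is_better=False)

X = np.zeros((4, 1))
y = np.array([0, 0, 0, 1])
clf = DummyClassifier(strategy="most_frequent").fit(X, y)  # always predicts 0

scorer_value = scorer(clf, X, y)
print(scorer_value)  # the error rate is 0.25, so the scorer reports -0.25
```

The argument order matters for any asymmetric metric, such as a cost matrix where false positives and false negatives are priced differently.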

In [None]:
y_test

In [None]:
from sklearn import metrics
cost_calc(y_test, y_test, True)

In [None]:
ada = AdaBoostClassifier(random_state=100)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
cost_calc(y_pred, y_test, True)

In [None]:
adac = AdaCost(algorithm='SAMME.R', cost_matrix=cost_matrix, random_state=100)
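
The `misclassification_cost` method merges predictions against `self.cost_table` on the keys `row` and `column`. How that table is built is not shown in this part of the notebook, so the sketch below is only an assumption about its layout: a long-format DataFrame with one line per (predicted, true) pair, derived from the square `cost_matrix`.

```python
import numpy as np
import pandas as pd

cost_matrix = np.array([[0.0, 1.0],
                        [2.0, 0.0]])

# Assumed layout: 'row' is the predicted class, 'column' the true class
rows, cols = np.indices(cost_matrix.shape)
cost_table = pd.DataFrame({
    "row": rows.ravel(),
    "column": cols.ravel(),
    "cost": cost_matrix.ravel(),
})

y_true = np.array([0, 1, 1])
y_pred = np.array([1, 1, 0])

# Same lookup as misclassification_cost: a left merge keeps the sample order
pairs = pd.DataFrame({"row": y_pred, "column": y_true})
per_sample_cost = pairs.merge(cost_table, how="left", on=["row", "column"])["cost"].values
print(per_sample_cost)
```

Each sample ends up with the cost of its own (predicted, true) cell: 2.0 for the false positive, 0.0 for the correct prediction and 1.0 for the false negative.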

# Hyperopt Demo 

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import f1_score
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope

# Step I: Define the search space
# Here we use hyperopt's `hp.choice` to choose between the weighted cross-entropy and focal loss functions
#    as a parameter of the optimization!

wbce = Weight_Binary_Cross_Entropy(imbalance_alpha=0.5)
weighted_ce_obj = wbce.weighted_binary_cross_entropy
wf = Focal_Binary_Loss(0.5)
weighted_focal_obj = wf.focal_binary_object

SPACE = {
    'n_jobs': 0, 
    'objective': 'binary:hinge',
    'subsample': hp.uniform('subsample', 0.5, 1),
    'min_child_weight': hp.uniform('min_child_weight', 1, 10),
    'eta': hp.uniform('eta', 0, 1),
    'max_depth': scope.int(hp.quniform('max_depth', 1, 12, 1)),
    'min_split_loss': hp.uniform('min_split_loss', 0, 0.2),
    'obj': hp.choice('obj', (weighted_ce_obj, weighted_focal_obj)), # hyperopt will sample one of these objectives
    'num_parallel_tree': hp.choice('num_parallel_tree', np.arange(1, 10, 1)), # label now matches the parameter name
    'lambda': hp.uniform('lambda', 0, 1),
    'alpha': hp.uniform('alpha', 0, 1),
    'booster': hp.choice('booster', ['gbtree', 'gblinear', 'dart']),
    'tree_method': hp.choice('tree_method', ('approx', 'hist')), 
}

# Step II: Define the objective function
def objective(space):

    # 'obj' lives in the search space so hyperopt can sample it,
    #   but xgb.train expects it as a separate keyword argument,
    #   so split it out from the booster parameters
    params = {k: v for k, v in space.items() if k != 'obj'}
    obj = space['obj']

    # train the classifier
    booster = xgb.train(
        params,
        dtrain,
        obj=obj,
        feval=f1_eval # we also pass in a custom F1 evaluation metric here
    )

    y_pred = booster.predict(dtest)
    y_pred[y_pred < 0.5] = 0
    y_pred[y_pred >= 0.5] = 1

    # evaluate and return
    # note we want to maximize F1 and hence MINIMIZE NEGATIVE F1
    # (scikit-learn metrics take the true labels first)
    return {'loss': -f1_score(y_test, y_pred), 'status': STATUS_OK}


# Step III: Optimize! 
trials = Trials()
best_hyperparams = fmin(
  space=SPACE, 
  fn=objective,
  algo=tpe.suggest,
  max_evals=100, # use 100, 500 or something higher when actually optimizing
  trials=trials
)
```
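
The three steps (search space, objective, optimizer) follow a simple protocol: sample a configuration, call the objective, record the returned `loss`, and keep the best. The sketch below shows that loop with plain random search as a stand-in. It is not TPE and does not use hyperopt; `toy_objective` and `random_search` are hypothetical names, and the quadratic loss is invented so the example runs without XGBoost.

```python
import random

def toy_objective(params):
    # Hypothetical stand-in for the XGBoost objective: best value at eta = 0.3
    return {"loss": (params["eta"] - 0.3) ** 2, "status": "ok"}

def random_search(objective, sample_params, max_evals, seed=0):
    """Evaluate max_evals random configurations and return the best one."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_evals):
        params = sample_params(rng)      # draw one configuration from the space
        result = objective(params)       # evaluate it
        history.append((result["loss"], params))
    best_loss, best_params = min(history, key=lambda item: item[0])
    return best_params, best_loss, history

best_params, best_loss, history = random_search(
    toy_objective,
    lambda rng: {"eta": rng.uniform(0.0, 1.0)},
    max_evals=200,
)
print(best_params, best_loss)
```

TPE differs in how the next configuration is chosen: instead of sampling blindly, it uses the recorded history to favor regions of the space that have produced low losses so far.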

In [2]:
# some toy data 
import numpy as np
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from hyperopt.pyll import scope
from sklearn import metrics
from sklearn.datasets import make_moons

X_train, y_train = make_moons(n_samples=20000, shuffle=True, noise=0.5, random_state=10)
X_test, y_test = make_moons(n_samples=10000, shuffle=True, noise=0.5, random_state=10)
xgb.set_config(verbosity=0)

dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)

In [3]:
import dash
from dash import dcc, html
import plotly
from time import sleep
from sklearn.metrics import f1_score
from dash.dependencies import Input, Output
from plotly import graph_objects as go

acc_fig, f2_fig, recall_fig = None, None, None
accs, f2s, recalls = [], [], []
from jupyter_dash import JupyterDash

app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("Evolution of Hyperparameter Tuning"),

    html.Div(id='live-output'),
    html.Div([
        html.Div([
            dcc.Graph(id = 'alpha', figure = {}),
        ], style={'width': '33%','display': 'inline-block'}),
        html.Div([
            dcc.Graph(id = 'eta', figure = {})
        ], style={'width': '33%', 'display': 'inline-block'}),
        html.Div([
            dcc.Graph(id = 'max_depth', figure = {}),
        ], style={'width': '33%', 'display': 'inline-block'}),
    ], className="row"),
    html.Div([
        dcc.Graph(id='F1', figure={})
    ]),
    # relationships between each hyper-parameter and the loss
    html.Div([
        html.Div([
            dcc.Graph(id = 'alpha_loss', figure = {}),
        ], style={'width': '33%','display': 'inline-block'}),
        html.Div([
            dcc.Graph(id = 'eta_loss', figure = {})
        ], style={'width': '33%', 'display': 'inline-block'}),
        html.Div([
            dcc.Graph(id = 'max_depth_loss', figure = {}),
        ], style={'width': '33%', 'display': 'inline-block'}),
    ], className="row"),
  
    dcc.Interval(
        id='interval-component',
        interval=100,
        n_intervals=0
    ),

])
SPACE = {
    'n_jobs': 0, 
    'objective': 'binary:logistic',
    'subsample': hp.uniform('subsample', 0.5, 1),
    'min_child_weight': hp.uniform('min_child_weight', 1, 10),
    'eta': hp.uniform('eta', 0, 1),
    'max_depth': scope.int(hp.quniform('max_depth', 1, 12, 1)),
    'min_split_loss': hp.uniform('min_split_loss', 0, 0.2),
    'num_parallel_tree': hp.choice('num_parallel_tree', np.arange(1, 10, 1)), # label now matches the parameter name
    'lambda': hp.uniform('lambda', 0, 1),
    'alpha': hp.uniform('alpha', 0, 1),
    'booster': hp.choice('booster', ['gbtree', 'gblinear', 'dart']),
    'tree_method': hp.choice('tree_method', ('approx', 'hist')), 
}


def objective(space):

    # SPACE holds only booster parameters here (no custom 'obj'),
    #   so a plain copy can be passed straight to xgb.train
    params = dict(space)


    # train the classifier
    booster = xgb.train(
        params,
        dtrain,
    )

    y_pred = booster.predict(dtest)
    y_pred[y_pred < 0.5] = 0
    y_pred[y_pred >= 0.5] = 1

    # evaluate and return
    # note we want to maximize F1 and hence MINIMIZE NEGATIVE F1
    # the extra keys are stored with each trial for later inspection
    y_true = dtest.get_label()
    return {
        'loss': -f1_score(y_test, y_pred),
        'status': STATUS_OK,
        'accuracy': metrics.accuracy_score(y_true, y_pred),
        'recall': metrics.recall_score(y_true, y_pred),
        'f2': metrics.fbeta_score(y_true, y_pred, beta=2),
    }

trials = Trials()
@app.callback(
    [
        Output('live-output', 'children'),
        Output('alpha', 'figure'),
        Output('eta', 'figure'),
        Output('max_depth', 'figure'),
        Output('F1', 'figure'),
        Output('alpha_loss', 'figure'),
        Output('eta_loss', 'figure'),
        Output('max_depth_loss', 'figure'),
      
    ],
    [Input('interval-component', 'n_intervals')]
)
def update_graph(n):
   
    fig1, fig2, fig3 = None, None, None
    fig1_loss, fig2_loss, fig3_loss = None, None, None
    fig_loss = None

    losses = []
    try:

        n_trials = len(trials)
        alphas = [i['misc']['vals']['alpha'][0] for i in trials.trials]
        lambdas = [i['misc']['vals']['lambda'][0] for i in trials.trials]
        max_depths = [i['misc']['vals']['max_depth'][0] for i in trials.trials]
        etas = [i['misc']['vals']['eta'][0] for i in trials.trials]
 


        # one x position per trial (np.linspace(0, n_trials) always yields 50 points, which would not line up with y)
        trial_idx = np.arange(n_trials)
        fig1 = go.Figure(go.Scatter(x=trial_idx, y=alphas, mode='lines+markers'), layout=go.Layout(title='Alpha Evolution', xaxis=dict(title='n_trials'), yaxis=dict(title='alpha')))
        fig2 = go.Figure(go.Scatter(x=trial_idx, y=etas, mode='lines+markers'), layout=go.Layout(title='ETA Evolution', xaxis=dict(title='n_trials'), yaxis=dict(title='Learning Rate')))
        fig3 = go.Figure(go.Scatter(x=trial_idx, y=max_depths, mode='lines+markers'), layout=go.Layout(title='Depth Evolution', xaxis=dict(title='n_trials'), yaxis=dict(title='Max Depth')))

        losses = trials.losses()

        # the loss reported to hyperopt is the negative F1 score (not F2)
        fig_loss = go.Figure(go.Scatter(
            x=np.arange(n_trials),
            y=losses,
            mode='lines+markers',
        ), layout=go.Layout(xaxis=dict(title='n_trials'), yaxis=dict(title='Loss (negative F1)')))

        fig1_loss = go.Figure(go.Scatter(x=alphas, y=losses, mode='markers'), layout=go.Layout(title='Loss vs Alpha', xaxis=dict(title='alpha'), yaxis=dict(title='Loss (negative F1)')))
        fig2_loss = go.Figure(go.Scatter(x=etas, y=losses, mode='markers'), layout=go.Layout(title='Loss vs Learning Rate', xaxis=dict(title='eta'), yaxis=dict(title='Loss (negative F1)')))
        fig3_loss = go.Figure(go.Scatter(x=max_depths, y=losses, mode='markers'), layout=go.Layout(title='Loss vs Depth', xaxis=dict(title='max_depth'), yaxis=dict(title='Loss (negative F1)')))
    except Exception as e:
        print(e)

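    # guard against the first ticks, before any trial has been recorded
    current_loss = losses[-1] if losses else None
    return html.Span(f'Current Loss: {current_loss}'), fig1, fig2, fig3, fig_loss, fig1_loss, fig2_loss, fig3_loss 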

app.run_server()

best_hyperparams = fmin(
    space=SPACE, 
    algo=tpe.suggest,
    fn=objective,
    max_evals=2000, # number of trials to run
    trials=trials
)


Dash app running on http://127.0.0.1:8050/
 14%|█▎        | 270/2000 [02:44<20:21,  1.42trial/s, best loss: -0.8277306733167084] 
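
The dashboard callback reads each trial's sampled values with expressions like `i['misc']['vals']['alpha'][0]`. The helper below does the same extraction outside of Dash. It is a sketch run on hand-written records that imitate that shape rather than on a real `Trials` object; `trial_series` and `mock_trials` are hypothetical names, and the empty list in the third record reflects my understanding that hyperopt stores an empty list for a parameter that was not sampled in a given trial.

```python
def trial_series(trial_docs, name):
    """Collect one hyper-parameter's sampled values, skipping trials where it was not sampled."""
    values = []
    for doc in trial_docs:
        sampled = doc["misc"]["vals"].get(name, [])
        if sampled:
            values.append(sampled[0])
    return values

# Hand-written records imitating the structure read in update_graph
mock_trials = [
    {"misc": {"vals": {"alpha": [0.12], "eta": [0.40]}}, "result": {"loss": -0.80}},
    {"misc": {"vals": {"alpha": [0.57], "eta": [0.05]}}, "result": {"loss": -0.83}},
    {"misc": {"vals": {"alpha": [], "eta": [0.91]}}, "result": {"loss": -0.79}},
]

alphas = trial_series(mock_trials, "alpha")
etas = trial_series(mock_trials, "eta")

# The best trial is the one with the lowest loss (here, the highest F1)
best = min(mock_trials, key=lambda doc: doc["result"]["loss"])
print(alphas, etas, best["misc"]["vals"]["eta"][0])
```

With a real run, the same function can be applied to `trials.trials`. Note that `fmin` reports `hp.choice` parameters by index rather than by value; `hyperopt.space_eval(SPACE, best_hyperparams)` maps the result back onto the search space.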