<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Imbalanced_Dataset_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Classification predictive modeling involves predicting a class label for an example. On some problems, a crisp class label is not required, and instead a probability of class membership is preferred. The probability summarizes the likelihood (or uncertainty) of an example belonging to each class label. Probabilities are more nuanced and can be interpreted by a human operator or a system in decision making.

For binary classification, it is common to use only the probability of an example belonging to class 1 to represent the predicted probabilities, since the probability of class 0 is its complement.

This focus on predicted probabilities may mean that the crisp class labels predicted by a model are ignored. It may also mean that a model that predicts probabilities appears to have terrible performance when evaluated on its crisp class labels, for example with accuracy or a similar score. This is because, although the predicted probabilities may show skill, they must be interpreted with a threshold before being converted into crisp class labels.
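
To make the role of the threshold concrete, here is a minimal sketch of converting probabilities into crisp labels. The probability values and the two thresholds (0.5 and 0.25) are made up for illustration and are not taken from the dataset used in this notebook.

```python
# minimal sketch: converting predicted probabilities into crisp class labels
from numpy import array

# hypothetical predicted probabilities of class 1 (illustrative values only)
probs = array([0.05, 0.30, 0.45, 0.70])
# apply the common default threshold of 0.5
labels_default = (probs >= 0.5).astype(int)
# apply a lower threshold (assumed value) that favors the positive class
labels_low = (probs >= 0.25).astype(int)
print('threshold=0.50: %s' % labels_default.tolist())
print('threshold=0.25: %s' % labels_low.tolist())
```

The same probabilities produce different labels under each threshold, which is why label-based scores such as accuracy can hide the skill present in the probabilities.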

The focus on predicted probabilities may also require that the probabilities predicted by some nonlinear models be calibrated before being used or evaluated. Some models learn calibrated probabilities as part of the training process (e.g. logistic regression), but many do not and require calibration (e.g. support vector machines, decision trees, and neural networks). We will cover the topic of calibrating predicted probabilities in Chapter 22. A given probability metric is typically calculated for each example, then averaged across all examples in the dataset being evaluated. Two popular metrics for evaluating predicted probabilities are:
* Log Loss
* Brier Score
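
The "calculate per example, then average" pattern can be sketched for log loss as follows. This is only an illustration under stated assumptions: the labels and probabilities are made-up values, and the per-example formula is the standard binary log loss, -(y*log(p) + (1-y)*log(1-p)), compared against scikit-learn's `log_loss`.

```python
# minimal sketch: log loss calculated per example, then averaged
from numpy import array
from numpy import log
from numpy import mean
from sklearn.metrics import log_loss

# hypothetical labels and predicted probabilities of class 1 (illustrative values only)
y_true = array([0, 0, 1, 1])
p_class1 = array([0.1, 0.4, 0.8, 0.3])
# loss for each example: -(y*log(p) + (1-y)*log(1-p))
per_example = -(y_true * log(p_class1) + (1 - y_true) * log(1 - p_class1))
# average across all examples
manual = mean(per_example)
# compare with the scikit-learn implementation
library = log_loss(y_true, p_class1)
print('Manual=%.4f, scikit-learn=%.4f' % (manual, library))
```

Confident but wrong predictions (such as 0.3 for a class 1 example) contribute the largest per-example losses.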

In [None]:
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import brier_score_loss

In [None]:
# create an imbalanced dataset

# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0,
  random_state=1)
# summarize dataset
classes = unique(y)
total = len(y)
for c in classes:
  n_examples = len(y[y==c])
  percent = n_examples / total * 100
  print('> Class=%d : %d/%d (%.1f%%)' % (c, n_examples, total, percent))

# **Log Loss**

Running the example below reports the log loss for each naive strategy. As expected, predicting certainty for each class label is punished with large log loss scores, and being certain of the minority class in all cases results in a much larger score. Predicting the distribution of examples in the dataset as the baseline results in a better score than either of the other naive strategies. This baseline represents the no skill classifier; log loss scores below it indicate a model with some skill.

In [None]:
# log loss for naive probability predictions.

# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# no skill prediction 0
probabilities = [[1, 0] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('P(class0=1): Log Loss=%.3f' % (avg_logloss))
# no skill prediction 1
probabilities = [[0, 1] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('P(class1=1): Log Loss=%.3f' % (avg_logloss))
# baseline probabilities
probabilities = [[0.99, 0.01] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('Baseline: Log Loss=%.3f' % (avg_logloss))
# perfect probabilities
avg_logloss = log_loss(testy, testy)
print('Perfect: Log Loss=%.3f' % (avg_logloss))

# **Brier Score**

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values. The score summarizes the magnitude of the error in the predicted probabilities and is designed for binary classification problems. It is focused on evaluating the probabilities for the positive class. Nevertheless, it can be adapted for problems with multiple classes. It is also an appropriate probabilistic metric for imbalanced classification problems.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.
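
As a minimal sketch of the definition above, the Brier score can be calculated by hand as the mean squared error between the predicted probabilities for class 1 and the 0/1 labels, then checked against scikit-learn's `brier_score_loss`. The labels and probabilities here are made-up values for illustration only.

```python
# minimal sketch: brier score as the mean squared error of predicted probabilities
from numpy import array
from numpy import mean
from sklearn.metrics import brier_score_loss

# hypothetical labels and predicted probabilities of class 1 (illustrative values only)
y_true = array([0, 0, 1, 1])
p_class1 = array([0.1, 0.4, 0.8, 0.3])
# squared error for each example, averaged across all examples
manual = mean((p_class1 - y_true) ** 2)
# compare with the scikit-learn implementation
library = brier_score_loss(y_true, p_class1)
print('Manual=%.4f, scikit-learn=%.4f' % (manual, library))
```

Because each squared error lies between 0.0 and 1.0, the average does too.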

Running the example below reports the scores for the naive models and the baseline no skill classifier. As we might expect, predicting 0.0 for all examples results in a low score, because the mean squared error between all 0.0 predictions and the mostly class 0 labels in the test set is small. Conversely, the error between 1.0 predictions and mostly class 0 labels results in a much larger score. Importantly, the default no skill classifier results in a lower score than predicting all 0.0 values. Again, this represents the baseline score, below which models demonstrate skill.

In [None]:
# brier score for naive probability predictions.

# no skill prediction 0
probabilities = [0.0 for _ in range(len(testy))]
avg_brier = brier_score_loss(testy, probabilities)
print('P(class1=0): Brier Score=%.4f' % (avg_brier))
# no skill prediction 1
probabilities = [1.0 for _ in range(len(testy))]
avg_brier = brier_score_loss(testy, probabilities)
print('P(class1=1): Brier Score=%.4f' % (avg_brier))
# baseline probabilities
probabilities = [0.01 for _ in range(len(testy))]
avg_brier = brier_score_loss(testy, probabilities)
print('Baseline: Brier Score=%.4f' % (avg_brier))
# perfect probabilities
avg_brier = brier_score_loss(testy, testy)
print('Perfect: Brier Score=%.4f' % (avg_brier))

A common practice is to transform the score using a reference score, such as the no skill classifier. This is called a Brier Skill Score, or BSS, and is calculated as follows:

$$\text{BrierSkillScore} = 1 - \frac{\text{BrierScore}}{\text{BrierScore}_{ref}}$$

If the reference score itself is evaluated, the result is a BSS of 0.0, which represents a no skill prediction. Values below this are negative and represent worse than no skill. Values above 0.0 represent skillful predictions, with a perfect prediction having a value of 1.0.

In [None]:
# brier skill score for naive probability predictions.

# calculate the brier skill score
def brier_skill_score(y, yhat, brier_ref):
  # calculate the brier score
  bs = brier_score_loss(y, yhat)
  # calculate skill score
  return 1.0 - (bs / brier_ref)

# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# calculate reference
probabilities = [0.01 for _ in range(len(testy))]
brier_ref = brier_score_loss(testy, probabilities)
print('Reference: Brier Score=%.4f' % (brier_ref))
# no skill prediction 0
probabilities = [0.0 for _ in range(len(testy))]
bss = brier_skill_score(testy, probabilities, brier_ref) 
print('P(class1=0): BSS=%.4f' % (bss))
# no skill prediction 1
probabilities = [1.0 for _ in range(len(testy))]
bss = brier_skill_score(testy, probabilities, brier_ref) 
print('P(class1=1): BSS=%.4f' % (bss))
# baseline probabilities
probabilities = [0.01 for _ in range(len(testy))]
bss = brier_skill_score(testy, probabilities, brier_ref) 
print('Baseline: BSS=%.4f' % (bss))
# perfect probabilities
bss = brier_skill_score(testy, testy, brier_ref) 
print('Perfect: BSS=%.4f' % (bss))
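
The naive strategies above only establish the baseline. As a hedged sketch of how a fitted model could be scored against that baseline, the example below fits a logistic regression and reports its log loss, Brier score, and BSS. The choice of `LogisticRegression` with default settings is an assumption made for illustration, not a method prescribed by this notebook, and with so few minority class examples the model may or may not beat the baseline.

```python
# sketch: scoring a fitted model against the no skill baseline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# generate the same 2 class dataset used in the examples above
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# reference brier score from the no skill baseline (predict 0.01 for class 1)
brier_ref = brier_score_loss(testy, [0.01 for _ in range(len(testy))])
# fit the model (assumption: logistic regression with default settings)
model = LogisticRegression()
model.fit(trainX, trainy)
# keep only the predicted probabilities for class 1
yhat = model.predict_proba(testX)[:, 1]
# evaluate the predicted probabilities
model_logloss = log_loss(testy, yhat)
model_brier = brier_score_loss(testy, yhat)
model_bss = 1.0 - (model_brier / brier_ref)
print('Model: Log Loss=%.3f' % (model_logloss))
print('Model: Brier Score=%.4f' % (model_brier))
print('Model: BSS=%.4f' % (model_bss))
```

A BSS above 0.0 would indicate that the model improves on the no skill baseline for this split; a negative value would indicate it does worse.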
