In this lesson, you will discover how to develop and evaluate naive classification strategies for machine learning.

Classification predictive modeling problems involve predicting a class label given an input to the model.

Given a classification model, how do you know if the model has skill or not?

This question comes up on every classification predictive modeling project. The answer is to compare the results of a given classifier model to a baseline or naive classifier model.

Consider a simple two-class classification problem where the number of observations is not equal for each class (i.e. it is imbalanced), with 25 examples for class-0 and 75 examples for class-1. This problem can be used to compare different naive classifier models.

For example, consider a model that randomly predicts class-0 or class-1 with equal probability. How would it perform?

We can calculate the expected performance using a simple probability model.
P(yhat = y) = P(yhat = 0) * P(y = 0) + P(yhat = 1) * P(y = 1)
We can plug in the occurrence of each class (0.25 and 0.75) and the predicted probability for each class (0.5 and 0.5) and estimate the performance of the model.
P(yhat = y) = 0.5 * 0.25 + 0.5 * 0.75
P(yhat = y) = 0.5
It turns out that this classifier is pretty poor: it is correct only half the time, no matter how the two classes are distributed.
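The calculation above can be sketched in a few lines of plain Python. The helper name expected_accuracy is hypothetical (it is not part of any library), and the sketch assumes, as the formula does, that each prediction is made independently of the true label.

```python
# sketch: expected accuracy of a naive classifier from the probability model above
# expected_accuracy is a hypothetical helper name used for illustration only

def expected_accuracy(class_priors, predicted_probs):
    # sum over classes of P(yhat = k) * P(y = k)
    return sum(p_hat * p for p_hat, p in zip(predicted_probs, class_priors))

# occurrence of each class in the dataset (class-0, class-1)
priors = [0.25, 0.75]

# uniform random guess: predict each class with probability 0.5
uniform = expected_accuracy(priors, [0.5, 0.5])
print('Uniform guess: %.3f' % uniform)
```

Passing a different list of predicted probabilities to the same helper gives the expected accuracy of other naive strategies.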

Now, what if we consider predicting the majority class (class-1) every time? Again, we can plug in the predicted probabilities (0.0 and 1.0) and estimate the performance of the model.
P(yhat = y) = 0.0 * 0.25 + 1.0 * 0.75
P(yhat = y) = 0.75
It turns out that this simple change results in a better naive classification model (0.75 versus 0.5), and is perhaps the best naive classifier to use when classes are imbalanced and performance is measured with accuracy.
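The two calculations can also be checked empirically. The following is a minimal sketch using only the Python standard library; the seed, the number of repeats, and the accuracy helper are assumptions made for illustration.

```python
# sketch: simulate the uniform random guess and the majority class strategy
import random

random.seed(1)  # fixed seed so the run is repeatable

# 25 examples of class-0 and 75 examples of class-1, as in the text
y = [0] * 25 + [1] * 75

def accuracy(y_true, y_pred):
    # fraction of predictions that match the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# majority class strategy: always predict class-1
majority_acc = accuracy(y, [1] * len(y))

# uniform random strategy, averaged over many repeats to reduce noise
n_repeats = 1000
uniform_acc = sum(
    accuracy(y, [random.randint(0, 1) for _ in y]) for _ in range(n_repeats)
) / n_repeats

print('Majority: %.3f' % majority_acc)
print('Uniform (mean of %d runs): %.3f' % (n_repeats, uniform_acc))
```

The majority class result is exact, while the uniform result is an average over random runs and should land close to the calculated value rather than exactly on it.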

The scikit-learn machine learning library provides the DummyClassifier class, which implements the majority class naive classification algorithm when its strategy argument is set to 'most_frequent'. You can use it as a baseline on your next classification predictive modeling project.

The complete example is listed below.


# example of the majority class naive classifier in scikit-learn
from numpy import asarray
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score



# define dataset
X = asarray([0 for _ in range(100)])
class0 = [0 for _ in range(25)]
class1 = [1 for _ in range(75)]
y = asarray(class0 + class1)



# reshape data for sklearn
X = X.reshape((len(X), 1))

# define model
model = DummyClassifier(strategy='most_frequent')

# fit model
model.fit(X, y)

# make predictions
yhat = model.predict(X)

# calculate accuracy
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.3f' % accuracy)


Accuracy: 0.750


Running the example prepares the dataset, then defines and fits the DummyClassifier on the dataset using the majority class strategy.
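As a complement to the majority class example, the uniform random guess can be evaluated with the same class by setting strategy='uniform'. This is a sketch rather than part of the original example: the loop over 200 seeds and the averaging are assumptions added because a single run of 100 random guesses is noisy.

```python
# sketch: estimate the accuracy of a uniform random-guess baseline
from numpy import asarray, mean
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# same 25/75 dataset as above; the input values are placeholders
X = asarray([0 for _ in range(100)]).reshape((100, 1))
y = asarray([0] * 25 + [1] * 75)

# repeat with different seeds and average the accuracy scores
scores = []
for seed in range(200):
    model = DummyClassifier(strategy='uniform', random_state=seed)
    model.fit(X, y)
    scores.append(accuracy_score(y, model.predict(X)))

mean_accuracy = mean(scores)
print('Mean accuracy: %.3f' % mean_accuracy)
```

The mean should fall near the 0.5 calculated earlier, although the exact value depends on the seeds used.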


Your task in this lesson is to run the example and report the result, confirming whether the model performs as we expected from our calculation.

As a bonus, calculate the expected accuracy, P(yhat = y), of a naive classifier model that randomly chooses a class label from the training dataset each time a prediction is made.