<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Imbalanced_Datasets_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **The Accuracy Paradox**

**Evaluate a majority class classifier on a 1:100 imbalanced dataset**

In [None]:
# imports for creating, summarizing, and plotting the imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where

In [None]:
from numpy import mean
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import confusion_matrix
# classification metrics: accuracy, precision, recall, and F1
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

**DummyClassifier** makes predictions that ignore the input features.

This classifier serves as a simple baseline to compare against other more complex classifiers.

The specific behavior of the baseline is selected with the strategy parameter.

**“most_frequent”:** the predict method always returns the most frequent class label in the observed y argument passed to fit.

The **predict_proba** method returns the matching one-hot encoded vector.
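
As a minimal sketch of the behavior described above, the snippet below fits a **DummyClassifier** on a small toy dataset. The arrays `X_toy` and `y_toy` are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (hypothetical): the feature values are arbitrary because the model ignores them
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # class 0 is the most frequent label

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)

# Inputs the model has never seen; the prediction does not depend on them
X_new = np.array([[100], [-5], [3]])

# Every prediction is the majority label seen during fit
toy_pred = dummy.predict(X_new)
print(toy_pred)

# predict_proba puts all probability mass on the majority class (one-hot rows)
toy_proba = dummy.predict_proba(X_new)
print(toy_proba)
```

Because the features play no role, any score this model achieves reflects the class distribution alone, which is what makes it a useful baseline.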


**Create a synthetic dataset**

**Weights Parameter**<br>
The proportions of samples assigned to each class. If None, then classes are balanced. <br>
Note that if len(weights) == n_classes - 1, then the last class weight is automatically inferred. <br>

More than n_samples samples may be returned if the sum of weights exceeds 1. Note that the actual class proportions will not exactly match weights when flip_y isn’t 0.
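
The effect of the weights and flip_y parameters can be checked directly. The sketch below is illustrative: the variable names (`X_clean`, `y_noisy`, and so on) and the noise level `flip_y=0.05` are assumptions chosen for the demonstration, not values used elsewhere in this notebook.

```python
from collections import Counter
from sklearn.datasets import make_classification

# One weight for two classes: the second class weight is inferred as 1 - 0.99
X_clean, y_clean = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                       n_clusters_per_class=1, weights=[0.99],
                                       flip_y=0, random_state=4)
counts_clean = Counter(y_clean)
print('flip_y=0:   ', counts_clean)

# With label noise (flip_y > 0), some labels are reassigned at random,
# so the observed class proportions drift away from the requested weights
X_noisy, y_noisy = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                       n_clusters_per_class=1, weights=[0.99],
                                       flip_y=0.05, random_state=4)
counts_noisy = Counter(y_noisy)
print('flip_y=0.05:', counts_noisy)
```

Setting flip_y=0, as this notebook does, keeps the class ratio as close as possible to the requested one.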

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=4)

In [None]:
# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label
for label in counter:
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
# add the legend once, after all classes have been plotted
pyplot.legend()
pyplot.show()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

**Define the evaluation procedure**

In [None]:
# evaluate a model using repeated stratified k-fold cross-validation
def evaluate_model(X, y, model):
  # define the evaluation procedure
  cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
  # evaluate the model on the data passed in (not on global variables)
  scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
  # return scores from each fold and each repeat
  return scores

**Create and evaluate the baseline model**

In [None]:
# define model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model with cross-validation on the training split
scores = evaluate_model(X_train, y_train, model)
# summarize performance
print('Mean Accuracy: %.2f%%' % (mean(scores) * 100))

In [None]:
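# fit on the training split and predict the held-out test split
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# rows are true classes, columns are predicted classes
confusion = confusion_matrix(y_test, y_pred)
print('Confusion Matrix\n')
print(confusion)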

**Assignment:** <br>
What do you think of this model?

Consider an imbalanced dataset with a 1:100 class ratio: for each example of the minority class (class 1), there are 100 examples of the majority class (class 0). In problems of this type, the majority class usually represents the normal case and the minority class the abnormal one, such as a fault, a diagnosis, or a fraud. Good performance on the minority class is preferred over good overall performance across both classes.

Many machine learning models are designed around the assumption of a balanced class distribution and often learn simple rules (explicit or otherwise), such as always predicting the majority class. Such a model achieves an accuracy of about 99 percent, yet in practice it performs no better than an unskilled majority class classifier.
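
The 99 percent figure follows from the class ratio alone. The arithmetic can be sketched as below; the counts `n_minority` and `n_majority` are illustrative values for a 1:100 ratio, not counts taken from the dataset above.

```python
# Accuracy of a classifier that always predicts the majority class.
# Illustrative counts for a 1:100 class ratio (assumed, not read from the data).
n_minority = 1
n_majority = 100

# Every majority example is classified correctly, every minority example is missed
majority_accuracy = n_majority / (n_majority + n_minority)
print('Accuracy: %.2f%%' % (majority_accuracy * 100))

# Fraction of minority examples that are detected: none of them
minority_recall = 0 / n_minority
print('Minority recall: %.2f%%' % (minority_recall * 100))
```

The accuracy is high only because the majority class dominates the count of correct predictions.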

A beginner who sees a sophisticated model achieve 99 percent accuracy on an imbalanced dataset of this type may believe their work is done when, in fact, they have been misled. This situation is so common that it has a name: the accuracy paradox.
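
One way to see past the paradox is to report metrics that focus on the minority class. The sketch below is a hedged illustration on made-up labels (`y_true` with 990 normal and 10 abnormal examples is an assumption, not this notebook's data); it uses the precision, recall, and F1 functions from sklearn.metrics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 990 normal examples (class 0) and 10 abnormal ones (class 1)
y_true = np.array([0] * 990 + [1] * 10)
X_any = np.zeros((len(y_true), 1))  # placeholder features, ignored by the dummy model

baseline = DummyClassifier(strategy='most_frequent').fit(X_any, y_true)
y_hat = baseline.predict(X_any)

acc = accuracy_score(y_true, y_hat)
# zero_division=0 sets the score to 0 when the model makes no positive predictions
prec = precision_score(y_true, y_hat, zero_division=0)
rec = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat, zero_division=0)

print('Accuracy:  %.2f' % acc)
print('Precision: %.2f' % prec)
print('Recall:    %.2f' % rec)
print('F1:        %.2f' % f1)
```

Accuracy looks excellent, while precision, recall, and F1 for class 1 are all zero because the model never predicts the minority class.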

In [None]:
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=4)

In [None]:
# relabel every even-indexed example as class 0,
# which removes roughly half of the minority examples
y[::2] = 0


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

In [None]:
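# fit on the new training split and predict the new test split
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# rows are true classes, columns are predicted classes
confusion = confusion_matrix(y_test, y_pred)
print('Confusion Matrix\n')
print(confusion)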

In [None]:
# define model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model with cross-validation on the training split
scores = evaluate_model(X_train, y_train, model)
# summarize performance
print('Mean Accuracy: %.2f%%' % (mean(scores) * 100))