<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/ImbalancedData_MCC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

In [None]:
from numpy import unique
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_blobs

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [None]:
from numpy import mean
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import matthews_corrcoef, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

**Precision**<br>
A quick calculation shows that precision is now 33%, recall is 25%, and the F1-score is 29%. Oy vey! Our classifier is terrible at classifying cats. Let’s go on and look at the overall accuracy.
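
The quick calculation can be reproduced with a minimal sketch. The four counts below are an assumption: they are inferred from the percentages quoted in this notebook (cats as the positive class, 4 cats and 20 dogs), not taken from a stated confusion matrix.

```python
# Assumed confusion-matrix counts with "cat" as the positive class.
# Inferred from the quoted percentages, not given explicitly in the text.
tp, fp, fn, tn = 1, 2, 3, 18

precision = tp / (tp + fp)   # of all photos predicted "cat", how many are cats
recall = tp / (tp + fn)      # of all real cats, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.0%} recall={recall:.0%} F1={f1:.0%}")
```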

**Accuracy**<br>
Accuracy is the proportion of samples that are correctly classified.<br>
In total there are TP+FP+FN+TN=24 samples (20 dogs and 4 cats), and TP+TN=19 of them are correctly classified. The accuracy is thus a formidable 79%. But this is quite misleading: although 90% of dogs are correctly classified, only 25% of cats are. If you average 90% and 25% you get a per-class average accuracy of 57.5%, which is much lower than the classifier’s overall accuracy of 79%. And the reason? There are many more dog samples than cat samples in our dataset. The classes are imbalanced.
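
To make the gap concrete, here is a small sketch that computes both numbers from the per-class counts implied above (18 of 20 dogs and 1 of 4 cats correct). The variable names are illustrative. For arrays of labels, scikit-learn's `balanced_accuracy_score` computes the same per-class average.

```python
# Per-class counts implied by the text: 90% of 20 dogs, 25% of 4 cats.
dogs_total, dogs_correct = 20, 18
cats_total, cats_correct = 4, 1

# Overall accuracy pools all samples, so the large class dominates.
accuracy = (dogs_correct + cats_correct) / (dogs_total + cats_total)

# Averaging the per-class accuracies weights both classes equally.
per_class_mean = (dogs_correct / dogs_total + cats_correct / cats_total) / 2

print(f"accuracy={accuracy:.1%} per-class mean={per_class_mean:.1%}")
```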

To see how the class imbalance affects the accuracy, imagine that now instead of 4 cat photos, we had 100 sets of these 4 photos for a total of 400 photos. Since we use the same classifier, 100 out of 400 of the photos will be correctly classified, and 300 will be misclassified.
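
A rough sketch of this thought experiment, assuming the dog photos stay exactly as before (20 dogs, 18 correct): the overall accuracy collapses, while the per-class average does not move because the classifier itself is unchanged.

```python
# Assumption: dogs are unchanged (20 photos, 18 correct); cats are scaled x100.
dogs_total, dogs_correct = 20, 18
cats_total, cats_correct = 400, 100

accuracy = (dogs_correct + cats_correct) / (dogs_total + cats_total)
per_class_mean = (dogs_correct / dogs_total + cats_correct / cats_total) / 2

print(f"accuracy={accuracy:.1%} per-class mean={per_class_mean:.1%}")
```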

In [None]:
from IPython.display import Image  # needed here; the later import cell has not run yet
Image("SoManyCats.png")

Ouch! A correlation of 0.17 signifies that the predicted class and the true class are only weakly correlated. And we know exactly why: our classifier is bad at classifying cats.

**Matthews Correlation Coefficient**<br>
For binary classification, there is another (and arguably more elegant) solution: treat the true class and the predicted class as two (binary) variables, and compute their correlation coefficient (in a similar way to computing correlation coefficient between any two variables). The higher the correlation between true and predicted values, the better the prediction. This is the phi-coefficient (φ), rechristened Matthews Correlation Coefficient (MCC) when applied to classifiers. Computing the MCC is not rocket science:

In [None]:
from IPython.display import Image
Image("images/ConfusionMatrixMCC.png" , width=640)
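
As a sketch, the standard MCC formula can be evaluated directly from the four confusion-matrix counts. The counts are the same assumed cat-versus-dog values inferred from the percentages earlier in this notebook, and `mcc_from_counts` is a hypothetical helper name, not a library function.

```python
from math import sqrt

def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

# Assumed counts with "cat" as the positive class.
mcc = mcc_from_counts(tp=1, fp=2, fn=3, tn=18)
print(f"MCC = {mcc:.2f}")
```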

The MCC value is always between -1 and 1: 1 means perfect prediction, -1 means every prediction is wrong, and 0 means that the classifier is no better than a random flip of a fair coin. MCC is also perfectly symmetric, so no class is more important than the other; if you switch the positive and negative classes, you’ll still get the same value.
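
The symmetry claim is easy to check with scikit-learn. The label arrays below are an illustrative reconstruction of the cat-versus-dog example (1 = cat, 0 = dog), built from the assumed counts TP=1, FN=3, FP=2, TN=18. Swapping which class counts as positive changes the F1-score a lot, but leaves MCC untouched.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Illustrative labels: 4 cats (1) followed by 20 dogs (0).
y_true = [1] * 4 + [0] * 20
# 1 cat found, 3 cats missed, 2 dogs mistaken for cats, 18 dogs correct.
y_pred = [1] * 1 + [0] * 3 + [1] * 2 + [0] * 18

# Swap the roles of the classes: dogs become the positive class.
y_true_flipped = [1 - label for label in y_true]
y_pred_flipped = [1 - label for label in y_pred]

mcc_cat_positive = matthews_corrcoef(y_true, y_pred)
mcc_dog_positive = matthews_corrcoef(y_true_flipped, y_pred_flipped)
f1_cat_positive = f1_score(y_true, y_pred)
f1_dog_positive = f1_score(y_true_flipped, y_pred_flipped)

print(f"MCC: {mcc_cat_positive:.2f} vs {mcc_dog_positive:.2f}")
print(f"F1:  {f1_cat_positive:.2f} vs {f1_dog_positive:.2f}")
```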

MCC takes into account all four values in the confusion matrix, and a high value (close to 1) means that both classes are predicted well, even if one class is disproportionately under- (or over-) represented.

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=4)


In [None]:

# summarize class distribution
counter = Counter(y)
print(counter)

In [None]:
# scatter plot of examples by class label
for label, _ in counter.items():
  row_ix = where(y == label)[0]
  pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
# draw the legend once, after all classes are plotted
pyplot.legend()
pyplot.show()


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

In [None]:
# evaluate a model using repeated k-fold cross-validation
def evaluate_model(X, y, model):
  # define the evaluation procedure
  cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
  # evaluate the model on the data passed in (not on global variables)
  scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
  # return scores from each fold and each repeat
  return scores

In [None]:
# define model
model = LogisticRegression(solver='lbfgs')
scores = evaluate_model(X_train, y_train, model)
# summarize performance
print('Mean Accuracy: %.2f%%' % (mean(scores) * 100))
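
A high mean accuracy says little on data this imbalanced. As a hedged illustration (this baseline is an addition, not part of the original walkthrough), a `DummyClassifier` that always predicts the majority class scores a very high accuracy on the same kind of split, while its MCC is 0.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

# Same dataset and split settings as the cells above.
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99],
                           flip_y=0, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_mcc = matthews_corrcoef(y_test, baseline_pred)
print(f"baseline accuracy={baseline_accuracy:.3f} MCC={baseline_mcc:.3f}")
```

Any model worth keeping should beat this baseline on MCC, not just match it on accuracy.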

In [None]:
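model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# ROC analysis needs scores, not hard labels: use the predicted probability of class 1
pos_probs = model.predict_proba(X_test)[:, 1]
confusion = confusion_matrix(y_test, y_pred)
print('Confusion Matrix\n')
print(confusion)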

In [None]:
# calculate roc auc for the logistic regression model
roc_auc = roc_auc_score(y_test, pos_probs)
print('Logistic ROC AUC %.3f' % roc_auc)

In [None]:

# plot no skill roc curve
pyplot.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
# calculate roc curve for model
fpr, tpr, _ = roc_curve(y_test, pos_probs)
# plot model roc curve
pyplot.plot(fpr, tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

In [None]:
matthews_corrcoef(y_test,y_pred)

In [None]:
f1_score(y_test, y_pred, average='macro')

In [None]:
f1_score(y_test, y_pred, average='micro')

In [None]:
f1_score(y_test, y_pred, average='weighted')

In [None]:
target_names = ['class 0', 'class 1']

In [None]:
print(classification_report(y_test, y_pred, target_names=target_names))