<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/ImbalancedData_MCC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient (MCC) was introduced by Brian Matthews in 1975.<br>

MCC is a statistic used for model evaluation. It measures how well the predicted classes agree with the actual classes.

In [None]:
# Clone the entire repo.
!git clone -s https://github.com/cagBRT/Data.git cloned-repo
%cd cloned-repo

In [None]:
from numpy import unique
from numpy import where
import matplotlib.pyplot as plt
from matplotlib import pyplot
from sklearn.datasets import make_blobs
from IPython.display import Image

In [None]:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import balanced_accuracy_score

In [None]:
from numpy import mean
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import matthews_corrcoef, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

Besides ROC curves, four metrics are commonly used to measure model performance: accuracy, precision, recall, and F1 score.<br>
Each of the four can sometimes be misleading, <br>
which is why we will discuss MCC.

Let's look at a small example

In [None]:
Image("images/DogsCats1.png")

**The dataset: 24 photos, 20 dogs and 4 cats**<br>
Precision is the proportion of true positives out of all detected positives, or simply TP/(TP+FP).<br>

In this case, dog photos are the positive class, and 18 out of 21 photos that were classified as dogs actually contain dogs.<br>

**The precision is 18/21 = 86%.** <br>
Recall is the proportion of actual positives that are correctly detected, TP/(TP+FN). From the above matrix it is easy to see that there are 20 actual positives (dogs), and 18 of them are successfully detected.<br>
**Recall is 18/(18+2), or 90%.**<br>

**The F1 score, the harmonic mean of precision and recall, is 88%.**
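
The arithmetic above can be checked directly from the confusion-matrix counts. This is a minimal sketch that assumes the counts implied by the text (TP = 18, FP = 3, FN = 2); the variable names are illustrative only.

```python
# Counts with "dog" as the positive class, as implied by the text above.
tp, fp, fn = 18, 3, 2

precision = tp / (tp + fp)   # 18 / 21
recall = tp / (tp + fn)      # 18 / 20
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.0%}  recall={recall:.0%}  F1={f1:.0%}")
# precision=86%  recall=90%  F1=88%
```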

Precision and recall consider one class, the positive class, to be the class we are interested in. <br>

They use only three of the values in the confusion matrix: TP, FP, and FN. The fourth value, TN, is not used in these metrics.<br>


You can put any value in the TN cell (0, 100, infinity) and the precision, recall and F1-score will not change.
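
To make the point concrete, the sketch below holds TP, FP and FN fixed at the dog-example values and varies only TN. The helper `prf_and_accuracy` is a hypothetical name introduced for illustration; accuracy is included only as a contrast, since it does depend on TN.

```python
def prf_and_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hold TP, FP and FN at the values from the dog example and vary only TN.
for tn in (0, 1, 100, 10_000):
    p, r, f1, acc = prf_and_accuracy(tp=18, fp=3, fn=2, tn=tn)
    print(f"TN={tn:>6}: precision={p:.3f} recall={r:.3f} F1={f1:.3f} accuracy={acc:.3f}")
```

Only the accuracy column moves as TN grows; the other three stay the same.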



In [None]:
Image("images/DogsCats2.png")

Now, flip the confusion matrix. Let’s consider “cat” to be the positive class, i.e., the one we are interested in.<br>
The dataset is still 20 dogs, 4 cats<br>

**Precision is 33%**,<br>
**Recall 25%**, <br>
**F1-score is 29%**<br>

There are 24 datapoints, and 19 were correctly classified, an overall accuracy of 79%.<br>
Per-class accuracy (recall) for dogs: 18/20 = 90%<br>
Per-class accuracy (recall) for cats: 1/4 = 25%



**The metrics change because the classes are imbalanced. In an imbalanced dataset, the choice of positive class can greatly change the metrics.**
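
The effect of switching the positive class can be reproduced with scikit-learn's `pos_label` argument. This sketch rebuilds label vectors from the counts implied by the example (18 dogs found, 2 dogs missed, 3 cats labeled as dogs, 1 cat found); the 1 = dog, 0 = cat encoding is an assumption made here for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = dog, 0 = cat. Order: 18 dogs found, 2 dogs missed,
# 3 cats mislabeled as dogs, 1 cat found.
y_true = [1] * 18 + [1] * 2 + [0] * 3 + [0] * 1
y_pred = [1] * 18 + [0] * 2 + [1] * 3 + [0] * 1

for name, pos in (("dog", 1), ("cat", 0)):
    p = precision_score(y_true, y_pred, pos_label=pos)
    r = recall_score(y_true, y_pred, pos_label=pos)
    f = f1_score(y_true, y_pred, pos_label=pos)
    print(f"positive={name}: precision={p:.0%} recall={r:.0%} F1={f:.0%}")
```

The same predictions give roughly 86% / 90% / 88% with dogs as the positive class and 33% / 25% / 29% with cats.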

**Matthews Correlation Coefficient**
*For binary classification, there is another solution:* <BR>
- treat the true class and the predicted class as two (binary) variables,
- compute their correlation coefficient. <br>

**The higher the correlation between true and predicted values, the better the prediction.** <br>

This is the phi-coefficient (φ), rechristened Matthews Correlation Coefficient (MCC) when applied to classifiers.
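
As a quick check of this equivalence, the sketch below computes the ordinary Pearson correlation between the true and predicted 0/1 labels and compares it with scikit-learn's `matthews_corrcoef`. The labels reuse the dog/cat counts from earlier, with an assumed encoding of 1 = dog and 0 = cat.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Dog/cat example as two binary variables (1 = dog, 0 = cat).
y_true = np.array([1] * 20 + [0] * 4)
y_pred = np.array([1] * 18 + [0] * 2 + [1] * 3 + [0] * 1)

# Pearson correlation between the two label vectors ...
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
# ... and the MCC reported by scikit-learn.
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Pearson r = {pearson_r:.3f}, MCC = {mcc:.3f}")
```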

In [None]:
Image("images/MCCEQUATION.png" , width=640)

**MCC value is always between -1 and 1, with 0 meaning that the classifier is no better than a random flip of a fair coin**. <br>

**MCC is also perfectly symmetric**, so no class is more important than the other; if you switch the positive and negative, you’ll still get the same value.

**MCC takes into account all four values in the confusion matrix**, and a high value (close to 1) means that both classes are predicted well, even if one class is disproportionately under- (or over-) represented.
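
These properties can be verified by hand. The sketch assumes the standard formula

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

and applies it to the dog/cat counts with each class in turn as the positive one. The helper name `mcc_from_counts` and the choice to return 0 when the denominator vanishes are assumptions of this example.

```python
from math import sqrt

def mcc_from_counts(tp, fp, fn, tn):
    """MCC from the four confusion-matrix counts (returns 0.0 if undefined)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: report 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# "dog" as the positive class ...
mcc_dog = mcc_from_counts(tp=18, fp=3, fn=2, tn=1)
# ... and "cat" as the positive class: TP/TN and FP/FN swap roles.
mcc_cat = mcc_from_counts(tp=1, fp=2, fn=3, tn=18)

print(f"MCC (dog positive) = {mcc_dog:.3f}")
print(f"MCC (cat positive) = {mcc_cat:.3f}")
```

Both calls give the same value (about 0.17), far below the 88% F1 score obtained with dogs as the positive class.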

# A larger dataset example

**Create an imbalanced dataset**

In [None]:
# define dataset
X1, y1 = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99,0.01], flip_y=0, random_state=4)

y2=y1.copy()

In [None]:
# Flip the labels (0 <-> 1) so that the majority class becomes class 1
y2 = 1 - y1

In [None]:
y1

In [None]:
y2

In [None]:
# summarize class distribution
counter = Counter(y1)
print(counter)
print(counter.items())
print(counter.keys())

In [None]:
# summarize class distribution
counter = Counter(y2)
print(counter)
print(counter.items())
print(counter.keys())

In [None]:
# scatter plot of examples by class label
for label,  _ in counter.items():
  row_ix = where(y1 == label)[0]
  pyplot.scatter(X1[row_ix, 0], X1[row_ix, 1], label=str(label))
  pyplot.legend()
pyplot.show()

In [None]:
# scatter plot of examples by class label
for label,  _ in counter.items():
  row_ix = where(y2 == label)[0]
  pyplot.scatter(X1[row_ix, 0], X1[row_ix, 1], label=str(label))
  pyplot.legend()
pyplot.show()

**Split the data into train and test sets**

In [None]:
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y1, test_size=0.20, random_state=42)

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X1, y2, test_size=0.20, random_state=42)

**Define the model**

In [None]:
# evaluate a model using repeated k-fold cross-validation
def evaluate_model(X, y, model, X_train, y_train):
  # X and y are accepted but unused; cross-validation runs on the training split only
  # define the evaluation procedure
  cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
  # evaluate the model: one accuracy score per fold and per repeat (30 in total)
  scores = cross_val_score(model, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
  return scores

In [None]:
# define model
model1 = LogisticRegression(solver='lbfgs')
scores1 = evaluate_model(X1_test, y1_test, model1, X1_train, y1_train)

In [None]:
# define model
model2 = LogisticRegression(solver='lbfgs')
scores2 = evaluate_model(X2_test, y2_test, model2, X2_train, y2_train)
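
The cross-validation above scores accuracy only. One way to score MCC in the same loop is to wrap `matthews_corrcoef` with `make_scorer`. This is a sketch on a small stand-in dataset: `X_demo`, `y_demo` and the sizes are assumptions chosen so the example runs quickly, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Small stand-in dataset (hypothetical sizes, 95:5 class ratio).
X_demo, y_demo = make_classification(n_samples=2000, n_features=2, n_redundant=0,
                                     n_clusters_per_class=1, weights=[0.95, 0.05],
                                     flip_y=0, random_state=4)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
mcc_scorer = make_scorer(matthews_corrcoef)

# Same model and folds, two different scoring functions.
acc_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, scoring="accuracy", cv=cv)
mcc_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, scoring=mcc_scorer, cv=cv)

print(f"mean accuracy: {acc_scores.mean():.3f}")
print(f"mean MCC:      {mcc_scores.mean():.3f}")
```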

**Calculate the metrics for the model**

In [None]:
model1.fit(X1_train, y1_train)
y_pred1=model1.predict(X1_test)

In [None]:
# summarize performance
print('Mean Accuracy: %.2f%%' % (mean(scores1) * 100))
print('Balanced Accuracy: %.2f%%' % (balanced_accuracy_score(y1_test, y_pred1) * 100))

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y1_test, y_pred1)
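
For context when reading these numbers, it helps to know what a trivial baseline scores. The sketch below uses a hypothetical 99:1 label vector (not the notebook's split) and a "classifier" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# Hypothetical 99:1 test set: 990 negatives, 10 positives.
y_true = np.array([0] * 990 + [1] * 10)
# A "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

print(f"accuracy:          {accuracy_score(y_true, y_pred):.2%}")
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.2%}")
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.2f}")
```

Accuracy is 99% even though no positive example is found, while balanced accuracy drops to 50% and MCC to 0.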

In [None]:
model2.fit(X2_train, y2_train)
y_pred2=model2.predict(X2_test)
# summarize performance
print('Mean Accuracy: %.2f%%' % (mean(scores2) * 100))
print('Balanced Accuracy: %.2f%%' % (balanced_accuracy_score(y2_test, y_pred2) * 100))

In [None]:
ConfusionMatrixDisplay.from_predictions(y2_test, y_pred2)

In [None]:
pos_prob=model1.predict_proba(X1_test)
pos_probs1=pos_prob[:,1]

In [None]:
# calculate roc auc
roc_auc1 = roc_auc_score(y1_test, pos_probs1)
print('ROC AUC %.3f' % roc_auc1)

In [None]:
pos_prob=model2.predict_proba(X2_test)
pos_probs2=pos_prob[:,1]

In [None]:
# calculate roc auc
roc_auc2 = roc_auc_score(y2_test, pos_probs2)
print('ROC AUC %.3f' % roc_auc2)

In [None]:
# plot no skill roc curve
pyplot.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
# calculate roc curve for model
fpr, tpr, _ = roc_curve(y1_test, pos_probs1)
# plot model roc curve
pyplot.plot(fpr, tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

In [None]:
# plot no skill roc curve
pyplot.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
# calculate roc curve for model
fpr, tpr, _ = roc_curve(y2_test, pos_probs2)
# plot model roc curve
pyplot.plot(fpr, tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

In [None]:
target_names = ['class 0', 'class 1']

In [None]:
print(classification_report(y1_test, y_pred1, target_names=target_names))

In [None]:
from imblearn.metrics import classification_report_imbalanced
# Text summary of the precision, recall, specificity, geometric mean, and index balanced accuracy
print(classification_report_imbalanced(y2_test, y_pred2, target_names=target_names))

In [None]:
from sklearn.metrics import mean_absolute_error
from imblearn.metrics import macro_averaged_mean_absolute_error

y_true_balanced = [1, 1, 2, 2]
y_true_imbalanced = [1, 2, 2, 2]
y_pred = [1, 2, 1, 2]
print("MAE (balanced labels):", mean_absolute_error(y_true_balanced, y_pred))
print("Macro-averaged MAE (balanced labels):", macro_averaged_mean_absolute_error(y_true_balanced, y_pred))
print("MAE (imbalanced labels):", mean_absolute_error(y_true_imbalanced, y_pred))
print("Macro-averaged MAE (imbalanced labels):", macro_averaged_mean_absolute_error(y_true_imbalanced, y_pred))
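
The macro-averaged error differs from the plain MAE because it averages the per-class errors, so each class counts equally regardless of its size. The sketch below is a plain NumPy illustration of that idea, not imbalanced-learn's implementation; `macro_mae` is a hypothetical helper.

```python
import numpy as np

def macro_mae(y_true, y_pred):
    """Mean of the per-class mean absolute errors (classes taken from y_true)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(np.abs(y_true[y_true == c] - y_pred[y_true == c]))
                 for c in np.unique(y_true)]
    return float(np.mean(per_class))

y_pred = [1, 2, 1, 2]
print(macro_mae([1, 1, 2, 2], y_pred))  # balanced labels: both classes have error 0.5
print(macro_mae([1, 2, 2, 2], y_pred))  # imbalanced labels: class errors 0 and 1/3
```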

In [None]:
print("MCC 1:",matthews_corrcoef(y1_test,y_pred1))
print("MCC 2:",matthews_corrcoef(y2_test,y_pred2))



---






In [None]:
RANDOM_STATE = 42


In [None]:

X, y = make_classification(
    n_classes=3,
    class_sep=2,
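    # Only two weights are given for three classes: scikit-learn infers the third as
    # 1 - sum(weights) = 0, so class 2 receives essentially no samples.
    weights=[0.1, 0.9],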
    n_informative=10,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=4,
    n_samples=5000,
    random_state=RANDOM_STATE,
)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=RANDOM_STATE
)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE

In [None]:
from imblearn.pipeline import make_pipeline

model = make_pipeline(
    StandardScaler(),
    SMOTE(random_state=RANDOM_STATE),
    LogisticRegression(max_iter=10_000, random_state=RANDOM_STATE),
)

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


The geometric mean is the square root of the product of sensitivity and specificity. Because both rates enter the product, a classifier that neglects either class scores poorly, which makes the metric suitable for imbalanced datasets.
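
For the binary case the definition can be written out from confusion-matrix counts. This is a sketch with hypothetical counts; `binary_gmean` is an illustrative helper, not part of imbalanced-learn.

```python
from math import sqrt

def binary_gmean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)

# Hypothetical counts: the minority (positive) class is mostly missed.
print(f"{binary_gmean(tp=2, fp=5, fn=8, tn=985):.3f}")
# A classifier that never predicts the positive class scores 0,
# even though its accuracy on these counts is 99%.
print(f"{binary_gmean(tp=0, fp=0, fn=10, tn=990):.3f}")
```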

In [None]:
from imblearn.metrics import geometric_mean_score
print(f"The geometric mean is {geometric_mean_score(y_test, y_pred):.3f}")

The index balanced accuracy (IBA) can be applied to any metric to adapt it for imbalanced learning problems.
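
As a rough illustration for the binary case, the sketch below assumes the usual definition IBA_α(M) = (1 + α · (TPR − TNR)) · M, with M taken as the squared geometric mean. The counts and the helper `binary_iba` are hypothetical, and the multi-class averaging that imbalanced-learn performs is not attempted here.

```python
def binary_iba(tp, fp, fn, tn, alpha=0.1):
    """IBA of the squared geometric mean for a binary confusion matrix."""
    tpr = tp / (tp + fn)        # sensitivity
    tnr = tn / (tn + fp)        # specificity
    dominance = tpr - tnr       # > 0 favours the positive class, < 0 the negative
    squared_gmean = tpr * tnr   # (sqrt(tpr * tnr)) ** 2
    return (1 + alpha * dominance) * squared_gmean

# Hypothetical counts with a weak minority (positive) class.
for alpha in (0.0, 0.1, 0.5):
    print(f"alpha={alpha}: IBA={binary_iba(tp=6, fp=10, fn=4, tn=980, alpha=alpha):.3f}")
```

With these counts the dominance term is negative, so a larger alpha lowers the score further.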

In [None]:
from imblearn.metrics import make_index_balanced_accuracy

alpha = 0.1
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)


In [None]:
alpha = 0.5
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)