---
title: Evaluation for Skewed Dataset
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lvert #1 \right\rvert' 
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\boldsymbol{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
    '\precision': '\text{precision}'
    '\recall': '\text{recall}'
    '\specificity': '\text{specificity}'
    '\TP': '\text{TP}'
    '\FP': '\text{FP}'
    '\FN': '\text{FN}'
    '\TN': '\text{TN}'
    '\cost': '\text{cost}'    
---

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
import logging

import numpy as np
import pandas as pd
import weka.core.jvm as jvm
import weka.plot.classifiers as plcls
from weka.classifiers import Classifier, Evaluation
from weka.core.classes import Random
from weka.core.converters import Loader

%matplotlib widget
jvm.start(logging_level=logging.ERROR)

## Class imbalance problem

In this notebook, we will analyze a skewed dataset for detecting micro-calcifications in mammograms. The goal is to build a classifier that identifies whether a bright spot in a mammogram is a micro-calcification (an early sign of breast cancer).

::::{figure} https://upload.wikimedia.org/wikipedia/commons/e/e7/Mammo_breast_cancer_wArrows.jpg
:name: fig:micro-calcification
:alt: Mammo breast cancer
:align: left
Micro-calcification
::::

The dataset can be downloaded from
[OpenML](https://www.openml.org/d/310) in [ARFF format](https://www.openml.org/data/download/52214/phpn1jVwe). The following loads the data using `python-weka-wrapper`.

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url("https://www.openml.org/data/download/52214/phpn1jVwe")
data.class_is_last()
print(data.summary(data))

There are 7 attributes and over 11 thousand instances. To understand the dataset, refer to Section 4 of the original paper [(Woods et al. 1993)](https://doi.org/10.1142/9789812797834_0011):

::::{note}

A set of 24 mammograms were segmented to locate small bright spots, which are the candidates for the classification of malignant clusters of micro-calcifications. The classification problem is based on the following attributes of the image segments:

- Area (number of pixels)
- Average grey level
- Gradient strength (of perimeter pixels)
- Root mean square noise (fluctuation of the pixel values)
- Root mean square noise of local background
- Contrast (average grey level minus average of a 2-pixel wide border)
- (Low order moment-based) Shape descriptor

::::

To compute the 10-fold cross-validation accuracy for `J48`:

In [None]:
clf = Classifier(classname="weka.classifiers.trees.J48")
evl = Evaluation(data)
evl.crossvalidate_model(clf, data, 10, Random(1))

print(f"Accuracy: {evl.percent_correct:.3g}%")

You should see that the accuracy is close to 100%. To show the confusion matrix:

In [None]:
confusion_matrix = pd.DataFrame(
    evl.confusion_matrix,
    dtype=int,
    columns=[f'predicted class "{v}"' for v in data.class_attribute.values],
    index=[f'class "{v}"' for v in data.class_attribute.values],
)
confusion_matrix

Each row of the confusion matrix corresponds to a class value (1: malignant, -1: benign), and each column corresponds to a predicted class. Each entry is a count of instances belonging to a specific class and having a particular predicted class.
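
As a minimal sketch of how such a table is tallied, the following counts (actual, predicted) pairs for a handful of made-up labels. The lists `actual` and `predicted` and the class order in `classes` are assumptions for illustration; they are not taken from the mammography data.

```python
from collections import Counter

# Hypothetical labels for six instances (not taken from the mammography data).
actual = [1, -1, -1, 1, -1, -1]
predicted = [1, -1, 1, -1, -1, -1]
classes = [-1, 1]  # assumed order of the class values

# Count each (actual, predicted) pair.
pair_counts = Counter(zip(actual, predicted))

# Rows follow the actual class; columns follow the predicted class.
toy_confusion = [[pair_counts[(a, p)] for p in classes] for a in classes]
print(toy_confusion)
```

The entries of `toy_confusion` add up to the number of instances, since every instance falls into exactly one cell.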

::::{exercise}
:label: ex:1
Assign to `percent_of_malignant_detected` the percentage of instances of class 1 predicted as class 1:
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
print(f"Percentage of malignant detected: {percent_of_malignant_detected:.3g}%")

In [None]:
# tests

## Different Performance Metrics

For a skewed dataset, one can achieve very high accuracy even with `ZeroR`, i.e., by always predicting the majority class regardless of the values of the input features.
We must use other performance metrics to train and evaluate a classification algorithm properly.
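
To see why accuracy alone can mislead, here is a small sketch with made-up numbers: 990 benign and 10 malignant labels, and a rule that always predicts the most frequent class. The counts are assumptions chosen for illustration, and the rule only mimics the idea of `ZeroR` rather than calling Weka.

```python
# Hypothetical skewed labels: 990 benign (-1) and 10 malignant (1).
labels = [-1] * 990 + [1] * 10

# A ZeroR-style rule: always predict the most frequent class.
majority_class = max(set(labels), key=labels.count)
predictions = [majority_class] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
detected = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
print(f"Accuracy: {accuracy:.1%}, malignant instances detected: {detected}")
```

The accuracy is high only because the benign class dominates; the rule never flags a malignant instance.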

::::{prf:definition}

These metrics can be computed from the numbers of true/false positives/negatives:

$$
\begin{align}
\precision &= \frac{\TP}{\TP + \FP}\\ 
\recall &= \frac{\TP}{\TP + \FN}\\ 
\specificity &= \frac{\TN}{\TN + \FP}
\end{align}
$$

::::
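
As a worked example of the definitions above, the following sketch evaluates the three metrics, together with accuracy for comparison, from hypothetical counts. The values of `tp`, `fp`, `fn`, and `tn` are made up and are not results from this notebook.

```python
# Hypothetical counts for illustration only (not results from the notebook).
tp, fp, fn, tn = 30, 10, 20, 940

toy_precision = tp / (tp + fp)    # fraction of positive predictions that are correct
toy_recall = tp / (tp + fn)       # fraction of actual positives that are found
toy_specificity = tn / (tn + fp)  # fraction of actual negatives that are found
toy_accuracy = (tp + tn) / (tp + fp + fn + tn)

print(
    f"precision={toy_precision:.3g}, recall={toy_recall:.3g}, "
    f"specificity={toy_specificity:.3g}, accuracy={toy_accuracy:.3g}"
)
```

With these counts, accuracy and specificity are both high even though two out of every five positives are missed.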

To show the above metrics:

In [None]:
pos_class = 1  # specify the positive class value
performance = {
    "precision": evl.precision(pos_class),
    "recall": evl.recall(pos_class),
    "specificity": evl.true_negative_rate(pos_class),
}
performance

Although specificity is close to 100%, precision and recall are below 80% and 60% respectively:

- If a bright spot is classified as malignant, the chance it is malignant is less than 80%.
- Out of all malignant bright spots, less than 60% are identified as malignant.

Close to 100% of the benign bright spots are identified as benign

- mainly because most bright spots are benign, and
- not because the classifier can distinguish malignant bright spots from benign ones.

In [None]:
TP = evl.num_true_positives(pos_class)
FN = evl.num_false_negatives(pos_class)
FP = evl.num_false_positives(pos_class)
TN = evl.num_true_negatives(pos_class)

assert np.isclose(performance["precision"], TP / (TP + FP))
assert np.isclose(performance["recall"], TP / (TP + FN))
assert np.isclose(performance["specificity"], TN / (TN + FP))

TFPN = pd.DataFrame(
    [[TP, FN], [FP, TN]],
    dtype=int,
    columns=["predicted +ve", "predicted -ve"],
    index=["+ve", "-ve"],
)
TFPN

The above table is not the same as a confusion matrix since a confusion matrix

- does not specify a positive class, and
- can have more than two rows/columns in multi-class classification problems.
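
To illustrate the difference, the sketch below collapses a hypothetical 3-class confusion matrix into true/false positive/negative counts once a positive class is chosen (a one-vs-rest view). The matrix `cm`, the names in `class_names`, and the helper `binary_counts` are assumptions made for this example.

```python
# Hypothetical 3-class confusion matrix: rows are actual, columns are predicted.
class_names = ["a", "b", "c"]
cm = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 3, 7],
]


def binary_counts(cm, pos):
    """Collapse a confusion matrix into (TP, FN, FP, TN) for positive class index pos."""
    total = sum(sum(row) for row in cm)
    tp = cm[pos][pos]
    fn = sum(cm[pos]) - tp                 # actual positive, predicted otherwise
    fp = sum(row[pos] for row in cm) - tp  # predicted positive, actually otherwise
    tn = total - tp - fn - fp
    return tp, fn, fp, tn


for i, name in enumerate(class_names):
    print(name, binary_counts(cm, i))
```

Each choice of positive class gives a different 2-by-2 table, which is why the table of true/false positives/negatives is only defined once the positive class is fixed.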

::::{exercise}
:label: ex:2
Modify `performance` to include the negative predictive value. You can add the value using
```python
performance['NPV'] = ...
```
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
print(f"negative predictive value (NPV): {performance['NPV']:.3g}")

In [None]:
# tests

The $F_{\beta}$-score is another measure that captures the performance in terms of both precision and recall:

::::{prf:definition}

$F_{\beta}$-score is defined as

$$
\begin{align}
F_{\beta} &:= \left( \frac{\precision^{-1} + \beta^2 \cdot \recall^{-1}}{\beta^2 + 1}\right)^{-1}\\
&= \frac{(\beta^2+1)\cdot \precision\cdot \recall }{\beta^2\precision + \recall}.
\end{align}
$$ (F_beta)

$F$-score is the special case when $\beta=1$,

$$
\begin{align}
F := F_1 &= \left( \frac{\precision^{-1} + \recall^{-1}}{2}\right)^{-1} \\
&= \frac{2\cdot \precision\cdot \recall }{\precision + \recall},
\end{align}
$$ (F)

which is the harmonic mean of precision and recall. 

::::
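
As a quick numerical check of the definition, the following sketch evaluates both forms of the $F_{\beta}$-score for hypothetical precision and recall values and confirms that they agree. The function names `f_beta` and `f_beta_weighted_harmonic` and the values of `p` and `r` are assumptions for illustration; neither function is part of the Weka API.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta from the closed form; undefined if both inputs are zero."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def f_beta_weighted_harmonic(precision, recall, beta=1.0):
    """The same score written as a weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return ((1 / precision + b2 / recall) / (b2 + 1)) ** -1


p, r = 0.75, 0.6  # hypothetical precision and recall
for beta in (0.5, 1.0):
    print(beta, f_beta(p, r, beta), f_beta_weighted_harmonic(p, r, beta))
```

In this example, the score always lies between recall and precision, and choosing $\beta$ below 1 moves it toward precision.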

$F$-score is useful in training a classifier to maximize both precision and recall.

In [None]:
performance["F"] = evl.f_measure(pos_class)
print(f"F-score: {performance['F']:.3g}")

::::{exercise}
:label: ex:3
For the $F_{\beta}$ score, we can increase $\beta$ to put more weight on recall. Modify `performance` to include the $F_2$ score. You can set the value using

```python
performance['F_2'] = ...
```
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
print(f"F_2 score: {performance['F_2']:.3g}")

In [None]:
# tests

::::{exercise}
:label: ex:4
Using `ZeroR` as the classifier, assign to `ZeroR_performance` a dictionary of precision, recall, and specificity. You can create the dictionary as follows:
```python
ZeroR_performance = {
    'precision': ___,
    'recall': ___,
    'specificity': ___
}
```

Use 10-fold cross-validation with a random seed of `1`. If the value is not a number, you may enter it as `np.nan`.

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
ZeroR_performance

In [None]:
# hidden tests

::::{exercise}
:label: ex:5
Is ZeroR a good baseline classifier?

:::{hint}
:class: dropdown

Is the accuracy misleading? Can a random decision maker do better than `ZeroR`?

:::

::::

YOUR ANSWER HERE

## Operating Curves for Probabilistic Classifier

For a probabilistic classifier that returns probabilities of different classes, we can obtain a trade-off between precision and recall by changing a threshold $\gamma$ for positive prediction, i.e., predict positive if and only if the probability estimate for positive class is larger than $\gamma$.
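
The effect of the threshold can be sketched with a few made-up probability estimates. The lists `probs` and `truth` and the helper `precision_recall_at` are assumptions for illustration and do not come from the classifier trained above.

```python
# Hypothetical probability estimates for the positive class, with true labels.
probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
truth = [1, 1, -1, 1, -1, -1]


def precision_recall_at(gamma):
    """Predict positive if and only if the probability estimate exceeds gamma."""
    predicted_pos = [p > gamma for p in probs]
    tp = sum(pp and y == 1 for pp, y in zip(predicted_pos, truth))
    fp = sum(pp and y != 1 for pp, y in zip(predicted_pos, truth))
    fn = sum((not pp) and y == 1 for pp, y in zip(predicted_pos, truth))
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn)
    return precision, recall


for gamma in (0.9, 0.5, 0.2):
    print(gamma, precision_recall_at(gamma))
```

For these values, lowering $\gamma$ raises recall while precision drops, which is the trade-off traced out by a precision-recall curve.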

To plot the [precision-recall curve](https://waikato.github.io/weka-wiki/roc_curves/) and print the area under the curve, we can use the following tool:

```python
import weka.plot.classifiers as plcls
```

In [None]:
plcls.plot_prc(evl, class_index=[1])
performance["PRC"] = evl.area_under_prc(pos_class)
print(f"area under precision-recall curve (PRC): {performance['PRC']:.3g}")

::::{exercise}
:label: ex:6
One can operate the classifier at any point on the curve by an appropriate choice of $\gamma$. Is it a good idea to operate at the initial part of the curve where the slope is strictly positive?
::::

YOUR ANSWER HERE

::::{exercise}
:label: ex:7
The above curve shows that the classifier can achieve 100% recall but not 100% precision. If you can choose any classifier, is it always possible to achieve 100% recall for any given dataset? How about 100% precision?
::::

YOUR ANSWER HERE

We can also plot the ROC (receiver operating characteristic) curve to show the trade-off between recall (true positive rate) and false positive rate:

In [None]:
plcls.plot_roc(evl, class_index=[1])
performance["AUC"] = evl.area_under_roc(pos_class)
print(f"area under ROC curve (AUC): {performance['AUC']:.3g}")
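
To make the construction of an ROC curve concrete, the following sketch traces the curve for a few made-up scores and estimates the area under it with the trapezoidal rule. The `scores` and `labels` are hypothetical, the scores are assumed to be distinct (no ties), and the calculation is only an illustration that may differ in detail from how Weka computes the area.

```python
# Hypothetical scores (probability estimates for the positive class) and labels.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, -1, 1, -1, -1]

n_pos = sum(y == 1 for y in labels)
n_neg = len(labels) - n_pos

# Lower the threshold one instance at a time, from the highest score down.
roc_points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
tp = fp = 0
for _, y in sorted(zip(scores, labels), reverse=True):
    if y == 1:
        tp += 1
    else:
        fp += 1
    roc_points.append((fp / n_neg, tp / n_pos))

# Trapezoidal rule for the area under the piecewise-linear curve.
auc = sum(
    (x1 - x0) * (y0 + y1) / 2
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:])
)
print(roc_points)
print(f"AUC: {auc:.3g}")
```

With distinct scores, this area equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score.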

::::{exercise}
:label: ex:8
The above curve shows that the classifier can achieve 0% false positive rate. If you can choose the classifier, is it always possible to achieve 0% false positive rate for any given dataset?
::::

YOUR ANSWER HERE