---
title: Training with Skewed Dataset
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lvert #1 \right\rvert' 
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\boldsymbol{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
    '\precision': '\text{precision}'
    '\recall': '\text{recall}'
    '\specificity': '\text{specificity}'
    '\TP': '\text{TP}'
    '\FP': '\text{FP}'
    '\FN': '\text{FN}'
    '\TN': '\text{TN}'
    '\cost': '\text{cost}'    
---

In [None]:
import logging
import os
import pprint

import numpy as np
import weka.core.jvm as jvm
import weka.core.packages as packages
from weka.classifiers import (
    Classifier,
    Evaluation,
    FilteredClassifier,
    SingleClassifierEnhancer,
)
from weka.core.classes import Random, complete_classname
from weka.core.converters import Loader
from weka.filters import Filter

if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

## Setup

In this notebook, we will train classifiers properly on the skewed dataset for detecting microcalcifications in mammograms.

In particular, we will use the meta classifier `ThresholdSelector` and the filter `SMOTE` ([Synthetic Minority Over-sampling Technique](https://doi.org/10.1613/jair.953)). Both must be installed as [additional packages in WEKA](https://weka.sourceforge.io/packageMetaData/). To do so, we have imported `packages`:

```python
import weka.core.packages as packages
```

`packages` must also be enabled for the Java virtual machine:

In [None]:
jvm.start(packages=True, logging_level=logging.ERROR)

The following prints the information of the packages we will install:

In [None]:
pkgs = ["thresholdSelector", "SMOTE"]

for item in packages.all_packages():
    if item.name in pkgs:
        pprint.pp(item.metadata)

:::: {attention}

Some packages, such as the latest version of `thresholdSelector`, cannot be downloaded from the default package URL. As a workaround, we used [Ant](https://waikato.github.io/weka-wiki/ant/) to build all the packages from [source](https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8) and saved the zip files at `/data/pkgs/`.

:::{seealso}
:class: dropdown

To build Weka packages into zip files:

1. Obtain the source (2GB), which has been cloned to the shared folder `/data/weka` on JupyterHub. If you would like to clone the source to your computer:
   ```bash
   git clone https://git.cms.waikato.ac.nz/weka/weka.git
   ```

2. Build the zip files:
   ```bash
   cd weka/trunk/weka
   mkdir -p "$HOME/data/pkgs"
   ```
   - For an internal package such as [thresholdSelector](https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8/packages/internal/thresholdSelector):
     ```bash
     ant -Ddist="$HOME/data/pkgs" -DpackageName=thresholdSelector make_internal
     ```
   - For an external package such as [SMOTE](https://git.cms.waikato.ac.nz/weka/weka/-/tree/stable-3-8/packages/external/SMOTE):
     ```bash
     ant -Ddist="$HOME/data/pkgs" -DpackageName=SMOTE make_external
     ```
   - To build all packages:
     ```bash
     ant -Ddist="$HOME/data/pkgs" -DpackageCommand=make apply_all
     ```
   - To get help on other commands:
     ```bash
     ant -projecthelp
     ```

3. Access the built packages:
   ```bash
   cd "$HOME/data/pkgs"
   ```

:::

::::

You may install the packages using the Weka package manager. To install them in `python-weka-wrapper3`, run the following code:

In [None]:
for pkg in pkgs:
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(f"/data/pkgs/{pkg}.zip")
    else:
        print(f"Skipping {pkg}, already installed. ")
print("Done.")

::::{note}

The first time you run the above cell, you should see

```
Installing thresholdSelector...
Installing SMOTE...
Done.
```

The next time you run the cell, you should see

```
Skipping thresholdSelector, already installed. 
Skipping SMOTE, already installed. 
Done.
```

because the packages have already been installed.

::::

By default, packages are installed under your home directory `~/wekafiles/packages/`:

In [None]:
!ls ~/wekafiles/packages

````{caution}

For the packages to take effect, you must restart the kernel (`Kernel` -> `Restart`). Note that running `jvm.stop()` followed by `jvm.start(packages=True)` will not work because [`javabridge` currently does not support restarting a virtual machine](https://stackoverflow.com/questions/51795945/after-stopping-jvm-unable-to-start-it-again).

````

After restarting the kernel, check that the packages have been successfully installed using `complete_classname`, which was imported by

```python
from weka.core.classes import complete_classname
```

In [None]:
print(complete_classname("ThresholdSelector"))
print(complete_classname("SMOTE"))
print(packages.installed_packages())

We will use the same mammography dataset from [OpenML](https://www.openml.org/d/310), with J48 as the base classifier. The following loads the dataset into the notebook:

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url("https://www.openml.org/data/download/52214/phpn1jVwe")
data.class_is_last()
pos_class = 1
clf = Classifier(classname="weka.classifiers.trees.J48")

## Threshold Selector

The meta classifier `ThresholdSelector` uses the threshold-moving technique to optimize a performance measure you specify, which can be the precision, recall, $F$-score, etc.[^threshold]

[^threshold]: See an explanation of the threshold-moving technique [here](https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/).
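
As a rough illustration of threshold moving, independent of Weka, the following sketch sweeps candidate thresholds over made-up positive-class probabilities and keeps the one with the highest $F$-score. The helper names `f_score_at` and `best_threshold` and the toy data are hypothetical; `ThresholdSelector` itself may choose the threshold differently, e.g., based on an internal evaluation of the training data.

```python
import numpy as np


def f_score_at(threshold, probs, labels):
    """F-score when predicting positive iff prob >= threshold."""
    pred = probs >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(probs, labels):
    """Return the candidate threshold with the highest F-score."""
    candidates = np.unique(probs)  # each observed probability is a candidate
    scores = [f_score_at(t, probs, labels) for t in candidates]
    return candidates[int(np.argmax(scores))], max(scores)


# Toy data (made up): positive-class probabilities and true labels.
probs = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80])
labels = np.array([0, 0, 0, 1, 0, 1, 0, 1])

threshold, score = best_threshold(probs, labels)
print(f"best threshold: {threshold:.2f}, F-score: {score:.3g}")
```

On this toy data, lowering the threshold from the usual 0.5 to 0.3 raises the $F$-score from 0.4 to 0.75.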

The following shows how to maximize recall:

In [None]:
tsc = SingleClassifierEnhancer(classname="weka.classifiers.meta.ThresholdSelector")
tsc.options = ["-M", "RECALL"]
tsc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(tsc, data, 10, Random(1))

print(f"maximum recall: {evl.recall(pos_class):.3g}")

The maximum recall is 100%, as expected: an extreme enough threshold classifies every instance as positive, which leaves no false negatives.

::::{exercise}
:label: ex:1
Using J48 as the base classifier and 10-fold cross-validation, obtain the highest precision and $F$-score. Assign the values to `max_precision` and `max_f`, respectively. 

If you use `python-weka-wrapper3`, be careful that resetting `tsc.options` may also reset the base classifier to the default one, which is not J48. To ensure that you are using J48, set the base classifier again after the options:
```python
tsc.options=['-M', ___]
tsc.classifier = clf
```
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
max_precision, max_f

In [None]:
# hidden tests

In [None]:
%%ai chatgpt -f text
For multi-class classification, how should the threshold moving scheme work?

## Cost-sensitive Classifier

Besides optimizing precision or recall, we can build a classifier that minimizes a cost defined by specific weights on the numbers of correctly/incorrectly classified instances of the different classes:

::::{prf:definition} cost/benefit analysis

The total cost can be defined as

$$
\begin{align}
\TP\cdot \cost_\TP
+ \FN\cdot \cost_\FN
+ \FP\cdot \cost_\FP 
+ \TN\cdot \cost_\TN
\end{align}
$$ (cost)

given a cost matrix

$$
\begin{align}
\begin{bmatrix} \cost_\TP & \cost_\FN\\ \cost_\FP & \cost_\TN\end{bmatrix}.
\end{align}
$$ (cost-matrix)

::::
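
As a small numerical illustration of the total cost defined above, the following sketch multiplies confusion counts by the corresponding costs entrywise and sums the results. The counts and costs are made up for illustration only.

```python
import numpy as np

# Hypothetical confusion counts, laid out like the cost matrix:
# rows are actual classes (positive, negative),
# columns are predicted classes (positive, negative).
counts = np.array([[30, 10],    # TP, FN
                   [20, 940]])  # FP, TN

# Hypothetical costs: a miss (FN) costs 5, a false alarm (FP) costs 1.
cost = np.array([[0, 5],   # cost_TP, cost_FN
                 [1, 0]])  # cost_FP, cost_TN

# Total cost = sum of entrywise products of counts and costs.
total_cost = np.sum(counts * cost)
print(total_cost)  # 10 * 5 + 20 * 1 = 70
```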

Weka provides a convenient interface for cost/benefit analysis:

- In the explorer interface, train J48 on the mammography dataset with 10-fold cross-validation.
- Right-click on the result in the result list.
- Choose Cost/Benefit analysis and 1 as the positive class value.
- Specify the cost matrix.
- Click `Minimize Cost/Benefit` to minimize the cost.
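
Conceptually, minimizing the cost amounts to choosing the decision threshold with the lowest total cost. The sketch below does this on made-up scores; the function `total_cost`, the data, and the cost values are hypothetical, and the actual procedure behind Weka's `Minimize Cost/Benefit` button may differ.

```python
import numpy as np


def total_cost(threshold, probs, labels, cost):
    """Total cost when predicting positive iff prob >= threshold.

    cost is laid out as [[cost_TP, cost_FN], [cost_FP, cost_TN]].
    """
    pred = probs >= threshold
    pos = labels == 1
    counts = np.array(
        [[np.sum(pred & pos), np.sum(~pred & pos)],    # TP, FN
         [np.sum(pred & ~pos), np.sum(~pred & ~pos)]]  # FP, TN
    )
    return int(np.sum(counts * cost))


# Toy data (made up): positive-class probabilities and true labels.
probs = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80])
labels = np.array([0, 0, 0, 1, 0, 1, 0, 1])
cost = np.array([[0, 5], [1, 0]])  # a miss costs 5, a false alarm costs 1

# Include a threshold above every score, i.e., predicting all negative.
candidates = np.append(np.unique(probs), np.inf)
costs = [total_cost(t, probs, labels, cost) for t in candidates]
best = candidates[int(np.argmin(costs))]
print(f"threshold {best:.2f} gives the minimum cost {min(costs)}")
```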

::::{exercise}
:label: ex:2
Assign to `cost_matrix` the cost matrix that achieves the maximum precision. You can define the cost matrix as follows:
```python
cost_matrix = np.array([[__, __],
                        [__, __]])
```

:::{hint}
:class: dropdown
Pay attention to the row and column labels of the confusion matrix. They change after you specify $1$ as the positive class value.
:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
cost_matrix

The following test cell demonstrates how to train a meta classifier to minimize the cost defined using the cost matrix you provided.

In [None]:
# tests
csc = SingleClassifierEnhancer(
    classname="weka.classifiers.meta.CostSensitiveClassifier",
    options=[
        "-cost-matrix",
        "["
        + " ; ".join(
            " ".join(str(entry) for entry in cost_matrix[:, i]) for i in range(2)
        )
        + "]",
        "-S",
        "1",
    ],
)
csc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(csc, data, 10, Random(1))

precision = evl.precision(pos_class)
print(f"maximum precision: {precision:.3g}")
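
Note that the test cell above serializes `cost_matrix` column by column, so the string passed to `-cost-matrix` is the transpose of the NumPy array. The following sketch reproduces only the string construction on a hypothetical matrix `example`, which is not a suggested answer to the exercise:

```python
import numpy as np

# Hypothetical matrix, used only to show the string format.
example = np.array([[0, 5],
                    [1, 0]])

matrix_string = (
    "["
    + " ; ".join(
        " ".join(str(entry) for entry in example[:, i]) for i in range(2)
    )
    + "]"
)
print(matrix_string)  # [0 1 ; 5 0]
```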

In [None]:
# hidden tests

In [None]:
%%ai chatgpt -f text
For the cost-benefit analysis, is there an implementation that optimizes a more
general cost function, which may be non-linear with respect to the counts of
TP, FP, TN, and FN?

## SMOTE

Synthetic Minority Over-sampling TEchnique (SMOTE) [(Chawla et al., 2002)](https://doi.org/10.1613/jair.953) is a filter that up-samples the minority class. Instead of duplicating existing instances, it creates new samples as convex combinations of existing ones.[^SMOTE]

[^SMOTE]: See an explanation of SMOTE [here](http://rikunert.com/SMOTE_explained).
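
The following is a minimal NumPy sketch of the interpolation idea only, not of the Weka filter: each synthetic point is a convex combination of a randomly chosen minority instance and one of its nearest minority neighbors. The function name `smote_like`, its parameters, and the toy data are hypothetical, and Weka's `SMOTE` filter has its own options and details that this sketch ignores.

```python
import numpy as np


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority points.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(len(minority), size=n_new)      # base point indices
    nbr = neighbors[base, rng.integers(k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))  # interpolation weights in [0, 1)
    return minority[base] + gap * (minority[nbr] - minority[base])


# Toy minority class (made up): the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
print(synthetic)
```

Because each synthetic point lies on a segment joining two neighboring minority instances, the new samples stay inside the region spanned by the minority class rather than repeating existing instances.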

::::{exercise}
:label: ex:3
Using `FilteredClassifier` with J48 as the classifier and SMOTE as the filter, tweak the settings of SMOTE to obtain an $F$-score that is as high as possible and larger than the maximum achieved by `ThresholdSelector`. Assign your choice of filter options to `smote.options`. E.g., you can change the percentage of SMOTE instances to 150% as follows:
```python
smote.options = ['-P', '150']
```
::::

In [None]:
smote = Filter(classname="weka.filters.supervised.instance.SMOTE")
print("Default smote.options:", smote.options)
# YOUR CODE HERE
raise NotImplementedError
print("Your smote.options:", smote.options)

In [None]:
# tests
fc = FilteredClassifier()
fc.filter = smote
fc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))

f_score = evl.f_measure(pos_class)
print(f"F-score by SMOTE: {f_score:.3g}")

In [None]:
# hidden tests

In [None]:
%%ai chatgpt -f text
In SMOTE, since the synthetic data are generated based on existing data with 
randomness independent of the data, which should be regarded as noise, how can
it be better than upsampling? In other words, no new relevant information is 
generated because no new data is collected.