# Training with Skewed Dataset

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
import pprint

import numpy as np
import weka.core.jvm as jvm
import weka.core.packages as packages
from weka.classifiers import (
    Classifier,
    Evaluation,
    FilteredClassifier,
    SingleClassifierEnhancer,
)
from weka.core.classes import Random, complete_classname
from weka.core.converters import Loader
from weka.filters import Filter

## Setup

In this notebook, we will train classifiers properly on the skewed dataset for detecting microcalcifications in mammograms.

In particular, we will use the meta classifier `ThresholdSelector` and the filter `SMOTE` ([Synthetic Minority Over-sampling Technique](https://doi.org/10.1613/jair.953)). They need to be installed as [additional packages in WEKA](https://weka.sourceforge.io/packageMetaData/). To do so, we have imported `packages`:

```python
import weka.core.packages as packages
```

`packages` must also be enabled for the Java virtual machine:

In [None]:
jvm.start(packages=True)

The following prints the information of the packages we will install:

In [None]:
pkgs = ["thresholdSelector", "SMOTE"]
for item in packages.all_packages():
    if item.name in pkgs:
        pprint.pp(item.metadata)

You may install the packages directly using the Weka package manager instead of downloading the zip files. To install them in `python-weka-wrapper`, run the following code:

In [None]:
for pkg in pkgs:
    if not packages.is_installed(pkg):
        print(f"Installing {pkg}...")
        packages.install_package(pkg)
    else:
        print(f"Skipping {pkg}, already installed. ")
else:
    print("Done.")

The first time you run the above cell, you should see:

```
Installing thresholdSelector...
Installing SMOTE...
Done.
```

The next time you run the cell, you should see:

```
Skipping thresholdSelector, already installed. 
Skipping SMOTE, already installed. 
Done.
```

because the packages have already been installed.

By default, packages are installed under your home directory `~/wekafiles/packages/`:

In [None]:
!ls ~/wekafiles/packages

---

**Note**

For the packages to take effect, you must restart the kernel (`Kernel` -> `Restart`). Note that running `jvm.stop()` followed by `jvm.start(packages=True)` will not work because [`javabridge` currently does not support restarting a virtual machine](https://stackoverflow.com/questions/51795945/after-stopping-jvm-unable-to-start-it-again).

---

After restarting the kernel, check that the packages have been successfully installed using `complete_classname` imported by

```python
from weka.core.classes import complete_classname
```

In [None]:
print(complete_classname("ThresholdSelector"))
print(complete_classname("SMOTE"))
print(packages.installed_packages())

We will use the same mammography dataset from
[OpenML](https://www.openml.org/d/310) and J48 as the base classifier. The following loads the dataset into the notebook:

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url("https://www.openml.org/data/download/52214/phpn1jVwe")
data.class_is_last()
pos_class = 1
clf = Classifier(classname="weka.classifiers.trees.J48")

## Threshold Selector

The meta classifier `ThresholdSelector` uses the threshold-moving technique to optimize a performance measure you specify, such as the precision, recall, or $F$-score. See an explanation of the threshold-moving technique [here](https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/).
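
To make the idea concrete, here is a minimal sketch of threshold moving in plain NumPy. It is not how `ThresholdSelector` is implemented: the toy labels and scores, and the helper name `precision_recall_f`, are hypothetical and only illustrate how the decision threshold trades precision against recall.

```python
import numpy as np


def precision_recall_f(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then score the result."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f


# Hypothetical toy data: 2 positives among 8 instances (a skewed dataset).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.5, 0.9])

for t in (0.0, 0.5, 0.7):
    p, r, f = precision_recall_f(y_true, scores, t)
    print(f"threshold={t:.1f}: precision={p:.2f}, recall={r:.2f}, F={f:.2f}")
```

In this toy example, lowering the threshold raises recall at the expense of precision, and the best $F$-score is obtained at an intermediate threshold.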

The following shows how to maximize recall:

In [None]:
tsc = SingleClassifierEnhancer(classname="weka.classifiers.meta.ThresholdSelector")
tsc.options = ["-M", "RECALL"]
tsc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(tsc, data, 10, Random(1))

print(f"maximum recall: {evl.recall(pos_class):.3g}")

The maximum recall is 100%, as expected: a threshold that classifies every instance as positive leaves no false negatives.

**Exercise** Using J48 as the base classifier and 10-fold cross-validation, obtain the highest precision and F-score. Assign the values to `max_precision` and `max_f` respectively. 

If you use `python-weka-wrapper`, be careful that resetting `tsc.options` may also reset the base classifier to the default one, which is not J48. To ensure that you are using J48, set the base classifier again after setting the options:
```Python
tsc.options=['-M', ___]
tsc.classifier = clf
```

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
max_precision, max_f

In [None]:
# hidden tests

## Cost-sensitive Classifier

---

**Definition**

We can build a classifier to minimize a certain cost defined as

$$
\begin{align}
\text{cost}_\text{TP} \text{TP}
+ \text{cost}_\text{FN} \text{FN}
+ \text{cost}_\text{FP} \text{FP} 
+ \text{cost}_\text{TN} \text{TN},
\end{align}
$$ (cost)

given a cost matrix

$$
\begin{align}
\begin{bmatrix} \text{cost}_\text{TP} & \text{cost}_\text{FN}\\ \text{cost}_\text{FP} & \text{cost}_\text{TN}\end{bmatrix}.
\end{align}
$$ (cost-matrix)

---
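
As a worked example of the definition above, the following sketch evaluates the total cost for one confusion matrix. The counts and the costs are hypothetical numbers chosen only for illustration; they are not results from the mammography dataset.

```python
import numpy as np

# Hypothetical confusion counts, laid out like the cost matrix above:
# rows are actual classes (positive, negative), columns are predictions.
confusion = np.array([[30, 10],    # TP, FN
                      [20, 940]])  # FP, TN

# Hypothetical costs: a miss (FN) costs 5, a false alarm (FP) costs 1,
# and correct decisions cost nothing.
costs = np.array([[0, 5],
                  [1, 0]])

# Total cost = sum over the four cells of (cost of the cell) x (count in the cell).
total_cost = np.sum(costs * confusion)
print(total_cost)
```

With these numbers, the total is $5 \times 10 + 1 \times 20 = 70$. A cost-sensitive classifier trades false negatives against false positives so that this sum is as small as possible.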

Weka provides a convenient interface for cost/benefit analysis:

- In the explorer interface, train J48 on the mammography dataset with 10-fold cross-validation.
- Right click on the result in the result list.
- Choose Cost/Benefit analysis and 1 as the positive class value.
- Specify the cost matrix.
- Click `Minimize Cost/Benefit` to minimize the cost.

**Exercise** Assign to `cost_matrix` the cost matrix that achieves the maximum precision. You can define the cost matrix as follows:
```python
cost_matrix = np.array([[__, __],
                        [__, __]])
```

---

**Hint**

Pay attention to the row and column labels of the confusion matrix. They change after you specify $1$ as the positive class value.

---

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
cost_matrix

The following test cell demonstrates how to train a meta classifier to minimize the cost defined using the cost matrix you provided.

In [None]:
# tests
csc = SingleClassifierEnhancer(
    classname="weka.classifiers.meta.CostSensitiveClassifier",
    options=[
        "-cost-matrix",
        "["
        + " ; ".join(
            " ".join(str(entry) for entry in cost_matrix[:, i]) for i in range(2)
        )
        + "]",
        "-S",
        "1",
    ],
)
csc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(csc, data, 10, Random(1))

precision = evl.precision(pos_class)
print(f"maximum precision: {precision:.3g}")

In [None]:
# hidden tests

## SMOTE

Synthetic Minority Over-sampling TEchnique (SMOTE) is a filter that up-samples the minority class. Instead of creating duplicates of the same instance, it creates new samples as convex combinations of existing ones. See a more detailed explanation of SMOTE [here](http://rikunert.com/SMOTE_explained).
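
The convex-combination idea can be sketched in a few lines of NumPy. This is only an illustration of the principle, not the Weka filter: the function name `smote_sketch`, its parameters `n_new`, `k`, and `seed`, and the toy minority samples are all hypothetical.

```python
import numpy as np


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        # Pick a minority sample at random.
        i = rng.integers(len(minority))
        # Distances from sample i to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Indices of the k nearest neighbors; position 0 is the sample itself
        # (assuming there are no duplicate points), so it is skipped.
        neighbors = np.argsort(dist)[1 : k + 1]
        j = rng.choice(neighbors)
        # Convex combination: a random point on the segment between i and j.
        gap = rng.random()
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors, so the new samples fill in the region occupied by the minority class rather than repeating existing instances.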

**Exercise** Using the `FilteredClassifier` with J48 as the classifier and SMOTE as the filter, try to tweak the settings of SMOTE to obtain the highest possible $F$-score, which should be larger than the maximum achieved by `ThresholdSelector`. Assign your choice of filter options to `smote.options`. E.g., you can change the percentage of SMOTE instances to 150% as follows:
```Python
smote.options = ['-P', '150']
```

In [None]:
smote = Filter(classname="weka.filters.supervised.instance.SMOTE")
print("Default smote.options:", smote.options)
# YOUR CODE HERE
raise NotImplementedError()
print("Your smote.options:", smote.options)

In [None]:
# tests
fc = FilteredClassifier()
fc.filter = smote
fc.classifier = clf

evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))

f_score = evl.f_measure(pos_class)
print(f"F-score by SMOTE: {f_score:.3g}")

In [None]:
# hidden tests