---
title: Different Classifiers with Weka
---

In [None]:
import os
import logging
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython import display

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %reload_ext jupyter_ai

In this notebook, you will use Weka to compare different classifiers trained using different algorithms and parameters.

## Noise Curve

Complete the tutorial exercises in [\[Witten11\] Ex 17.2.6 to 17.2.11][Ex17.2] using the dataset `glass.arff` described at the beginning of [\[Witten11\] 17.2][17.2].

[Ex17.2]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=603
[17.2]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=601

The video below demonstrates how to use the Explorer interface to train a 3-nearest-neighbor (3-NN) classifier with 50% of the training data corrupted by noise.

::::{card}
:header: [open in new tab](https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=f5c8bff9-11bf-4977-97dc-acbb00b7b129&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all)
:::{iframe} https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=f5c8bff9-11bf-4977-97dc-acbb00b7b129&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all
:::
::::

Weka also provides a convenient interface, called the [Experimenter][Experimenter], for comparing the performance of different classification algorithms on different datasets. This is demonstrated in the video below.

::::{card}
:header: [open in new tab](https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=ff11b9a7-9844-4b5e-9e30-acbb00c5374a&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all)
:::{iframe} https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=ff11b9a7-9844-4b5e-9e30-acbb00c5374a&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all
:::
::::

[Experimenter]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=540

A more flexible approach is to use the `python-weka-wrapper3` package. First, run the following cell to start the Java virtual machine (JVM) and load the `glass.arff` dataset:

In [None]:
import weka.core.jvm as jvm
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.filters import Filter

jvm.start(logging_level=logging.ERROR)
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
    + "glass.arff"
)
data.class_is_last()

We can then create a filtered classifier using the following classes, which were already imported above:

```python
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.filters import Filter
```

In [None]:
add_noise = Filter(classname="weka.filters.unsupervised.attribute.AddNoise")
IBk = Classifier(classname="weka.classifiers.lazy.IBk")
fc = FilteredClassifier()
fc.filter = add_noise
fc.classifier = IBk

To compute the 10-fold cross-validated accuracy of 3-NN classification with 50% noise:

In [None]:
add_noise.options = ["-P", str(50), "-S", str(0)]  # -P: percentage noise, -S: random seed
IBk.options = ["-K", str(3)]  # number of nearest neighbors
evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))
evl.percent_correct
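
To make the effect of the `-P` option more concrete, the sketch below mimics class-label noise in plain NumPy: a given percentage of the labels is replaced by a different class chosen at random. The function name `flip_labels` and the toy labels are assumptions made for illustration only; the sketch conveys the idea and makes no claim to reproduce the sampling procedure of Weka's `AddNoise` filter.

```python
import numpy as np


def flip_labels(labels, percent, seed=0):
    """Return a copy of `labels` with `percent`% of the entries changed to another class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    noisy = labels.copy()
    n_flip = int(round(len(labels) * percent / 100))
    # Pick distinct positions to corrupt.
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in idx:
        # Replace the label with a different class chosen uniformly at random.
        others = classes[classes != labels[i]]
        noisy[i] = rng.choice(others)
    return noisy


# Toy labels (hypothetical, not the glass dataset).
labels = np.array(["a", "b", "c"] * 10)
noisy = flip_labels(labels, percent=50)
print((noisy != labels).mean())  # fraction of corrupted labels
```
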

::::{exercise} Ex 17.2.6
:label: ex:17.2.6

To answer **Ex 17.2.6**, use any of the above methods and complete the `pandas` `DataFrame` in the following cell by filling in the accuracies (as floating point numbers) for different percentages of noise and numbers of nearest neighbors. You can assign each column of accuracies as follows:
```python
noise_df['k=1'] = [___, ___, ...]  # for 1-NN
noise_df['k=3'] = [___, ___, ...]  # for 3-NN
noise_df['k=5'] = [___, ___, ...]  # for 5-NN
```

To obtain the typical performance, consider repeating the experiment 10 times with different seeds and computing the average accuracies.

::::
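
One possible way to organize the repetition over seeds is sketched below. The helper `mean_over_seeds` and the stand-in `dummy_evaluate` are hypothetical names introduced here; in the notebook, `dummy_evaluate` would be replaced by a function that wraps the cross-validation call shown earlier and returns the percentage of correct predictions. The numbers it returns are made up and are not real accuracies.

```python
import numpy as np


def mean_over_seeds(evaluate, seeds, **settings):
    """Average the score returned by evaluate(seed=..., **settings) over all seeds."""
    return float(np.mean([evaluate(seed=seed, **settings) for seed in seeds]))


def dummy_evaluate(noise, k, seed):
    # Hypothetical stand-in that returns a made-up, deterministic score.
    # A real version would set the filter and classifier options and
    # run the cross-validation with the given seed.
    return 50.0 + k - noise / 4 + (seed % 2)


# One column of averaged scores, one entry per noise level.
column = [
    mean_over_seeds(dummy_evaluate, range(10), noise=noise, k=3)
    for noise in range(0, 101, 50)
]
print(column)
```

Passing the evaluation function as an argument keeps the averaging logic independent of how each accuracy is obtained, so the same helper could be reused with the Explorer, Experimenter, or wrapper-based results.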

In [None]:
noise_df = pd.DataFrame(columns=["k=1", "k=3", "k=5"], dtype=float)
noise_df.insert(0, "Percentage Noise", np.arange(0, 101, 10))

# YOUR CODE HERE
raise NotImplementedError

display.display(noise_df.round(2))

plt.figure(num=1, figsize=(8, 5), clear=True)
for k in ["1", "3", "5"]:
    plt.plot(
        noise_df["Percentage Noise"], noise_df["k=" + k], label="k=" + k, marker="o"
    )
plt.legend()
plt.xlabel("Percentage Noise")
plt.ylabel("Accuracies")
plt.title(r"Training IB$k$ on noisy data for different $k$")
plt.show()

In [None]:
# hidden tests

In [None]:
%%ai
Explain in a paragraph how the noise curve can show whether a learning algorithm 
is prone to overfitting.

::::{exercise} Ex 17.2.7
:label: ex:17.2.7
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.2.8
:label: ex:17.2.8
::::

YOUR ANSWER HERE

In [None]:
%%ai
Explain in a paragraph whether it is possible to overfit even when the training
data has no noise, where noise is defined as the irregularity irrelevant to the
general pattern.

## Training Curve

::::{exercise} Ex 17.2.9
:label: ex:17.2.9


Complete the `pandas` `DataFrame` in the following cell by filling in the accuracies (as floating point numbers) for different percentages of the dataset for training and different classifiers. You can assign each column of accuracies as follows:
```python
train_df['IBk'] = [___, ___, ...]
train_df['J48'] = [___, ___, ...]
```
To obtain the typical performance, consider repeating the experiment 10 times with different seeds and computing the average accuracies.

::::

In [None]:
train_df = pd.DataFrame(columns=["IBk", "J48"], dtype=float)
train_df.insert(0, "Percentage of Training Set", np.arange(10, 101, 10))

# YOUR CODE HERE
raise NotImplementedError

display.display(train_df.round(2))

plt.figure(num=3, figsize=(8, 5), clear=True)
for clf in ["IBk", "J48"]:
    plt.plot(
        train_df["Percentage of Training Set"], train_df[clf], label=clf, marker="o"
    )
plt.legend()
plt.xlabel("Percentage of Training Set")
plt.ylabel("Accuracies")
plt.title(r"Training IB$k$ and J48 with different amounts of data")
plt.show()

In [None]:
# hidden tests

In [None]:
%%ai
Explain in a paragraph how the training curve can show whether a learning
algorithm is prone to underfitting.

::::{exercise} Ex 17.2.10
:label: ex:17.2.10
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.2.11
:label: ex:17.2.11
::::

YOUR ANSWER HERE

::::{exercise}
:label: ex:7
Will the accuracy be higher if we resample without replacement instead of with replacement? Why?
::::

YOUR ANSWER HERE

In [None]:
%%ai
Explain in a paragraph whether it is always possible to find the best fit
for a given training dataset.

## Classification Boundaries

Complete the tutorial exercises in [\[Witten11\] Ex 17.3.1 to 17.3.6][Witten-17-3] using the boundary visualizer for different classifiers on the `iris.2D.arff` (NOT `iris.arff`) dataset.

For OneR, note that each boundary is decided based on two conditions in Appendix A of [Holte93](https://mlpack.org/papers/ds.pdf):
- (3a) the optimal class in the interval below the boundary should have at least `minBucketSize` instances, and
- (3b) the optimal class of the smallest attribute value just above the boundary should be different from the optimal class just below the boundary.

::::{figure} images/OneR_boundary.dio.svg
:name: fig:OneR
:align: left

OneR decision boundary
::::
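
The two conditions can be illustrated with a simplified greedy scan over the sorted attribute values. This is only a sketch under stated assumptions: the function name `one_r_intervals`, the placement of each boundary midway between two adjacent values, and the use of majority-class instance counts as the "size" are choices made here for illustration, and the code is not claimed to match Weka's OneR implementation.

```python
from collections import Counter
from itertools import groupby


def one_r_intervals(values, labels, min_bucket_size):
    """Greedy left-to-right partition of a numeric attribute; returns the boundaries."""
    # Group the class labels by distinct attribute value, in increasing order.
    pairs = sorted(zip(values, labels))
    groups = [
        (v, Counter(c for _, c in grp))
        for v, grp in groupby(pairs, key=lambda p: p[0])
    ]

    def optimal(counts):
        # Classes that attain the largest count (ties are all kept).
        top = max(counts.values())
        return {c for c, n in counts.items() if n == top}

    boundaries = []
    current = Counter()  # class counts of the interval being grown
    for i, (v, counts) in enumerate(groups[:-1]):
        current += counts
        next_v, next_counts = groups[i + 1]
        # (3a) the optimal class has enough instances in the interval.
        big_enough = max(current.values()) >= min_bucket_size
        # (3b) the next value's optimal class differs from the interval's.
        class_changes = optimal(current).isdisjoint(optimal(next_counts))
        if big_enough and class_changes:
            boundaries.append((v + next_v) / 2)  # assumed midpoint boundary
            current = Counter()
    return boundaries


# Toy data (hypothetical): one "b" sits among the "a" values.
values = [1, 2, 3, 4, 5, 6, 7]
labels = ["a", "a", "b", "a", "b", "b", "b"]
print(one_r_intervals(values, labels, min_bucket_size=1))
print(one_r_intervals(values, labels, min_bucket_size=3))
```

Comparing the two printed lists shows how a larger `min_bucket_size` merges small intervals caused by isolated points into their neighbors.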

[Witten-17-3]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=606

In [None]:
%%ai
Explain the following two constraints, which every interval must satisfy when
the values of an attribute are partitioned into intervals:
(a) there is at least one class that is "optimal" for more than SMALL of the
    values in the interval. This constraint does not apply to the rightmost
    interval. 
(b) If $V[I]$ is the smallest value for attribute $A$ in the training set that 
    is larger than the values in interval $I$ then there is no class $C$ that
    is optimal both for $V[I]$ and for interval $I$.

::::{exercise} Ex 17.3.1
:label: ex:17.3.1
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.3.2
:label: ex:17.3.2

- Change "20" to "60" in the statement of the exercise. 
- Look for the smallest critical value of the `minBucketSize` where the decision boundaries begin to change as `minBucketSize` increases. The critical value should be larger than 20.
- If you want to use the Explorer interface to speed up your search, as mentioned in Ex 17.3.2, you will need to set the Test options to use the training set and observe the relevant numbers in the confusion matrix.
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.3.3
:label: ex:17.3.3
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.3.4
:label: ex:17.3.4
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.3.5
:label: ex:17.3.5

This is tricky. Follow the hint provided by the exercise and click the relevant data point in Explorer's Visualize panel.

![](images/IBk_boundary.dio.svg)
::::

YOUR ANSWER HERE

::::{exercise} Ex 17.3.6
:label: ex:17.3.6

![](images/IBk_boundary_5_10.dio.svg)
::::

YOUR ANSWER HERE

In [None]:
%%ai
How does Weka's BoundaryVisualizer plot the decision boundaries, especially
when there are more than two input features?