# Different Classifiers with Weka

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import weka.core.jvm as jvm
from IPython import display
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.core.converters import Loader
from weka.filters import Filter

%matplotlib inline

In this notebook, you will use Weka to compare different classifiers trained using different algorithms or different hyper-parameters.

## Noise and Training Curves

Complete the tutorial exercises in [\[Witten11\] Ex 17.2.6 to 17.2.11][Ex17.2]. Use the dataset `glass.arff` described at the beginning of [\[Witten11\] 17.2][17.2].

[Ex17.2]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=603
[17.2]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=601

**Ex 17.2.6**

The following video demonstrates how to use the Explorer interface to train a 3-Nearest-Neighbor (3-NN) classifier with 50% of the training data corrupted by noise.

In [None]:
display.IFrame(
    src="https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=f5c8bff9-11bf-4977-97dc-acbb00b7b129&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all",
    height=450,
    width=800,
)

As demonstrated in the following video, Weka provides a convenient interface, called the [Experimenter][Experimenter], to compare the performances of different classification algorithms on different datasets.

[Experimenter]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=540

In [None]:
display.IFrame(
    src="https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=ff11b9a7-9844-4b5e-9e30-acbb00c5374a&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all",
    height=450,
    width=800,
)

A more flexible way is to use the `python-weka-wrapper`. To start the Java virtual machine and load the `glass.arff` dataset:

In [None]:
jvm.start()
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_url(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/"
    + "glass.arff"
)
data.class_is_last()

We can then create a filtered classifier with the following tools:

```python
from weka.classifiers import Classifier, Evaluation, FilteredClassifier
from weka.core.classes import Random
from weka.filters import Filter
```

In [None]:
add_noise = Filter(classname="weka.filters.unsupervised.attribute.AddNoise")
IBk = Classifier(classname="weka.classifiers.lazy.IBk")
fc = FilteredClassifier()
fc.filter = add_noise
fc.classifier = IBk

To compute the 10-fold cross-validated accuracy of 3-NN classification with 50% noise:

In [None]:
add_noise.options = ["-P", str(50), "-S", str(0)]  # -P: percentage of noise, -S: random seed
IBk.options = ["-K", str(3)]  # number of nearest neighbors
evl = Evaluation(data)
evl.crossvalidate_model(fc, data, 10, Random(1))
evl.percent_correct
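
To make "percentage noise" concrete, here is a minimal NumPy sketch of class-label noise, in which a chosen percentage of instances have their label replaced by a different class. It is an illustration under our own assumptions, not Weka's `AddNoise` implementation; the function name `add_label_noise` and the toy labels are hypothetical.

```python
import numpy as np


def add_label_noise(labels, percent, seed=0):
    """Return a copy of labels with `percent`% of entries changed to another class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    noisy = labels.copy()
    # Number of instances to corrupt, chosen without repetition.
    n_flip = int(round(len(labels) * percent / 100))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        # Pick a replacement label that differs from the original one.
        others = classes[classes != labels[i]]
        noisy[i] = rng.choice(others)
    return noisy


labels = np.array([0, 1, 2] * 10)  # toy labels, not the glass data
noisy = add_label_noise(labels, percent=50, seed=0)
print((noisy != labels).mean())  # fraction of labels that were changed
```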

Using any of the methods suggested above, complete the `pandas` `DataFrame` in the following cell by filling in the accuracies (as floating point numbers) for different percentages of noise and different numbers of nearest neighbors. You can assign each column of accuracies as follows:
```Python
noise_df['k=1'] = [___, ___, ...]  # for 1-NN
noise_df['k=3'] = [___, ___, ...]  # for 3-NN
noise_df['k=5'] = [___, ___, ...]  # for 5-NN
```

To obtain the typical performance, consider repeating the experiments 10 times with different seeds and computing the average accuracies.
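
One possible way to organize the repeated runs is sketched below. The helper `average_accuracies` and its argument `evaluate` are hypothetical names: `evaluate(percent, k, seed)` is assumed to return one accuracy, for example by wrapping the cross-validation calls shown earlier. The `placeholder_evaluate` function only returns a made-up number so that the sketch runs without Weka.

```python
import numpy as np
import pandas as pd


def average_accuracies(evaluate, percentages, ks, seeds):
    """Average evaluate(percent, k, seed) over seeds for every (percent, k) pair."""
    table = pd.DataFrame({"Percentage Noise": list(percentages)})
    for k in ks:
        table[f"k={k}"] = [
            float(np.mean([evaluate(p, k, s) for s in seeds])) for p in percentages
        ]
    return table


def placeholder_evaluate(percent, k, seed):
    # Placeholder only: a made-up value, NOT a real accuracy.
    return 100.0 - percent


table = average_accuracies(
    placeholder_evaluate, np.arange(0, 101, 10), [1, 3, 5], range(10)
)
print(table.shape)
```

Each column of the resulting table can then be copied into `noise_df` in the same way as the manual assignments above.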

In [None]:
noise_df = pd.DataFrame(columns=["k=1", "k=3", "k=5"], dtype=float)
noise_df.insert(0, "Percentage Noise", np.arange(0, 101, 10))

# YOUR CODE HERE
raise NotImplementedError()

display.display(noise_df.round(2))

plt.figure(figsize=(8, 5))
for k in ["1", "3", "5"]:
    plt.plot(
        noise_df["Percentage Noise"], noise_df["k=" + k], label="k=" + k, marker="o"
    )
plt.legend()
plt.xlabel("Percentage Noise")
plt.ylabel("Accuracies")
plt.title(r"Training IB$k$ on noisy data for different $k$")
plt.show()

In [None]:
# hidden tests

**Ex 17.2.7**

YOUR ANSWER HERE

**Ex 17.2.8**

YOUR ANSWER HERE

**Ex 17.2.9**

Complete the `pandas` `DataFrame` in the following cell by filling in the accuracies (as floating point numbers) for different percentages of the training set and different classifiers. You can assign each column of accuracies as follows:
```Python
train_df['IBk'] = [___, ___, ...]
train_df['J48'] = [___, ___, ...]
```
To obtain the typical performance, consider repeating the experiments 10 times with different seeds and computing the average accuracies.
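
As a rough illustration of what "a percentage of the training set" means, the sketch below draws instance indices by percentage, either with or without replacement. It is a NumPy illustration only, not Weka's own resampling code; the name `subsample_indices` and the size of 200 instances are hypothetical.

```python
import numpy as np


def subsample_indices(n_instances, percent, replace, seed=0):
    """Draw round(n_instances * percent / 100) instance indices at random."""
    rng = np.random.default_rng(seed)
    n_sample = int(round(n_instances * percent / 100))
    return rng.choice(n_instances, size=n_sample, replace=replace)


with_repl = subsample_indices(200, 50, replace=True)
without_repl = subsample_indices(200, 50, replace=False)
print(len(with_repl), len(without_repl))  # both samples have the same size
```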

In [None]:
train_df = pd.DataFrame(columns=["IBk", "J48"], dtype=float)
train_df.insert(0, "Percentage of Training Set", np.arange(10, 101, 10))

# YOUR CODE HERE
raise NotImplementedError()

display.display(train_df.round(2))

plt.figure(figsize=(8, 5))
for clf in ["IBk", "J48"]:
    plt.plot(
        train_df["Percentage of Training Set"], train_df[clf], label=clf, marker="o"
    )
plt.legend()
plt.xlabel("Percentage of Training Set")
plt.ylabel("Accuracies")
plt.title(r"Training IB$k$ and J48 with different amounts of data")
plt.show()

In [None]:
# hidden tests

**Ex 17.2.10**

YOUR ANSWER HERE

**Ex 17.2.11**

YOUR ANSWER HERE

**Exercise** Will the accuracy be higher if we resample without replacement instead of with replacement? Why?

YOUR ANSWER HERE

## Classification Boundaries

Complete the tutorial exercises in [\[Witten11\] Ex 17.3.1 to 17.3.6][Witten-17-3] using the boundary visualizer for different classifiers on the `iris.2D.arff` (NOT `iris.arff`) dataset.

For OneR, note that a boundary is placed only when two conditions in Appendix A of [Holte93](https://mlpack.org/papers/ds.pdf) both hold:
- (3a) the optimal class in the bucket has at least `minBucketSize` instances, and
- (3b) the optimal class of the smallest value larger than the boundary is a different class value.
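
The two conditions can be sketched in plain Python as follows. This is a simplified reading of (3a) and (3b), not Weka's `OneR` code: the function name `one_r_boundaries`, the midpoint placement of boundaries, the handling of ties, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def one_r_boundaries(values, labels, min_bucket_size):
    """Return boundaries on one numeric attribute under conditions (3a) and (3b)."""
    # Class counts for each distinct attribute value.
    per_value = {}
    for v, c in zip(values, labels):
        per_value.setdefault(v, Counter())[c] += 1
    distinct = sorted(per_value)

    buckets = []  # (optimal class, upper boundary or None for the last bucket)
    counts = Counter()
    for i, v in enumerate(distinct):
        counts += per_value[v]
        optimal, size = counts.most_common(1)[0]
        if i + 1 == len(distinct):
            buckets.append((optimal, None))
            break
        next_v = distinct[i + 1]
        next_optimal = per_value[next_v].most_common(1)[0][0]
        # (3a): enough instances of the optimal class in the current bucket.
        # (3b): the next larger value has a different optimal class.
        if size >= min_bucket_size and next_optimal != optimal:
            buckets.append((optimal, (v + next_v) / 2))
            counts = Counter()

    # Adjacent buckets predicting the same class need no boundary between them.
    return [
        upper
        for (cls, upper), (next_cls, _) in zip(buckets, buckets[1:])
        if cls != next_cls
    ]


values = [1, 2, 3, 4]          # toy attribute values
labels = ["a", "b", "a", "b"]  # toy class labels
print(one_r_boundaries(values, labels, min_bucket_size=1))
print(one_r_boundaries(values, labels, min_bucket_size=2))
```

Comparing the two printed lists shows how a larger `min_bucket_size` changes the boundaries in this toy example.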

![](images/OneR_boundary.dio.svg)

[Witten-17-3]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=606

**Ex 17.3.1**

The decision regions are stripes with boundaries perpendicular to an axis of the feature space. That axis corresponds to the feature chosen by `OneR` for prediction.

**Ex 17.3.2**  
- Change "20" to "60" in the statement of the exercise. 
- Look for the smallest critical value of the `minBucketSize` where the decision boundaries begin to change as `minBucketSize` increases. The critical value should be larger than 20.
- If you want to use the Explorer interface to speed up your search as stated in Ex 17.3.2, you will need to set the Test options to "Use training set" and observe the relevant numbers in the confusion matrix.

YOUR ANSWER HERE

**Ex 17.3.3**

YOUR ANSWER HERE

**Ex 17.3.4**

YOUR ANSWER HERE

**Ex 17.3.5**  

This is a bit tricky. Follow the hint provided by the exercise and click the relevant data point in the Explorer's Visualize panel.

![](images/IBk_boundary.dio.svg)

YOUR ANSWER HERE

**Ex 17.3.6**

![](images/IBk_boundary_5_10.dio.svg)

YOUR ANSWER HERE