# Man vs Machine

**CS5483 Data Warehousing and Data Mining**
___

<a title="Paramount Pictures, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Terminator_(franchise_logo).png"><img width="512" alt="Terminator (franchise logo)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Terminator_%28franchise_logo%29.png/512px-Terminator_%28franchise_logo%29.png"></a>

In this notebook, you will compete with your classmates and the machine by
1. hand-building a decision tree using Weka `UserClassifier`, and
2. using `python-weka-wrapper` to build the J48 (C4.5) decision tree as a comparison.

Let's find out who is more intelligent!

In [1]:
%reset -f
from IPython import display

## Interactive Decision Tree Construction

Follow the instructions in [\[Witten11\] Exercise 17.2.12][Witten11] to

1. install the package `UserClassifier`,
2. hand-build a decision tree using [`segment-challenge.arff`][segment-challenge] as the training set, and 
3. test the performance using [`segment-test.arff`][segment-test] as the test set.

[Witten11]: https://ebookcentral.proquest.com/lib/cityuhk/reader.action?docID=634862&ppg=604
[segment-challenge]: https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/segment-challenge.arff
[segment-test]: https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/segment-test.arff

See the following video for a quick illustration. If the Weka GUI takes more than 10 seconds to load on Windows 10, launch Weka directly by double-clicking `weka.jar` in your installation folder.

In [2]:
display.IFrame(
    src=
    "https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=24055b2e-b310-4bcf-b2a5-acb80075209e&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all",
    height=450,
    width=800)

**Exercise** Why should you NOT use the training set `segment-challenge.arff` to test your decision tree?

YOUR ANSWER HERE

**Exercise** Why should you NOT use the class attribute to split the decision tree?

YOUR ANSWER HERE

**Exercise** Include the model and result summary sections from the result buffer of your best hand-built decision tree. Your answer should look like:

```
=== Classifier model (full training set) ===

Split on ...

=== Summary ===

Correctly Classified Instances ...

```

Try your best to beat your classmates and the machine.
- Build at least two decision trees and pick the best one. 
- Share your result on the [discussion page](https://canvas.cityu.edu.hk/courses/39808/discussion_topics/306325) and check if your classmates have a better decision tree.

YOUR ANSWER HERE

**Exercise** For the best decision tree you hand-built, is the empirical accuracy on the test set an unbiased estimate of the true accuracy? Explain the direction of the bias, if any.

YOUR ANSWER HERE

## Python Weka Wrapper

To see if your hand-built classifier can beat the machine, use J48 (C4.5) to build a decision tree. Instead of using the Weka Explorer Interface, you will run Weka directly from the notebook using [`python-weka-wrapper`][pww].

[pww]: http://fracpete.github.io/python-weka-wrapper3/

Because Weka is written in Java, we need to start the Java virtual machine (JVM) first.

In [None]:
import weka.core.jvm as jvm
jvm.start()

### Load datasets

To load the dataset, create an `ArffLoader` as follows:

In [None]:
from weka.core.converters import Loader
loader = Loader(classname="weka.core.converters.ArffLoader")

The loader's `load_url` method loads data from the web, for example from the [Weka GitHub repository][weka]:

[weka]: https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/

In [None]:
weka_data_path = 'https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/'
trainset = loader.load_url(
    weka_data_path +
    'segment-challenge.arff')  # use load_file to load from file instead

For classification, we have to specify the class attribute. For instance, the method `class_is_last` mutates `trainset` to have the last attribute as the class attribute:

In [None]:
trainset.class_is_last()
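
An ARFF file declares its attributes in a header, and in many Weka datasets the class attribute is declared last, which is why `class_is_last` is convenient here. The following sketch parses a tiny made-up ARFF header (not the actual segment data) in plain Python, without Weka, to show which attribute would be treated as the class. The header text and the helper `attribute_names` are hypothetical and only for illustration.

```python
# A made-up ARFF header for illustration; it is NOT the segment dataset.
arff_header = '''
@relation toy
@attribute hue-mean numeric
@attribute intensity-mean numeric
@attribute class {sky,grass,path}
@data
'''


def attribute_names(header):
    """Return attribute names in declaration order.

    Simplified: assumes unquoted names without spaces.
    """
    names = []
    for line in header.splitlines():
        parts = line.split()
        if parts and parts[0].lower() == '@attribute':
            names.append(parts[1])
    return names


names = attribute_names(arff_header)
class_index = len(names) - 1  # "class is last" means the final declared attribute
print(f'Class attribute: {names[class_index]} (index {class_index})')
```

The other attributes are then the inputs the decision tree may split on.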

**Exercise** 
1. Assign `testset` to the data loaded from `weka_data_path + 'segment-test.arff'`. 
2. Specify the last attribute of `testset` as the target.

In [None]:
from weka.core.dataset import Instances
# YOUR CODE HERE
raise NotImplementedError()
print(Instances.summary(testset))

In [None]:
# tests
assert testset.relationname == 'segment'
assert testset.num_instances == 810
assert testset.num_attributes == 20

### Train a decision tree using J48

To train a decision tree using J48, we create the classifier and then call its `build_classifier` method on the training set.

In [None]:
from weka.classifiers import Classifier

J48 = Classifier(classname="weka.classifiers.trees.J48")
J48.build_classifier(trainset)
J48

`J48.graph` is a [DOT graph][dot] in text form. The function `pydotplus.graph_from_dot_data` parses it into a graph object that can then be rendered, for example as an SVG:

[dot]: https://en.wikipedia.org/wiki/DOT_(graph_description_language)

In [None]:
import pydotplus
from IPython.display import SVG

J48tree = pydotplus.graph_from_dot_data(J48.graph)
J48tree.set_size('"10,10!"')
SVG(J48tree.create_svg()) # use write_svg to write to file instead
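
A DOT graph is plain text that lists nodes and edges. The string below is a hand-written illustration in the general style of a tree export. It is not actual J48 output, and the node names, attribute, threshold, and labels are made up. The sketch extracts the edges with a regular expression to show the structure that a renderer draws.

```python
import re

# Hand-written DOT text for illustration only (NOT real J48 output).
dot_text = '''
digraph ExampleTree {
N0 [label="region-centroid-row"]
N0->N1 [label="<= 155"]
N1 [label="sky" shape=box]
N0->N2 [label="> 155"]
N2 [label="grass" shape=box]
}
'''

# An edge line has the form: parent->child [label="condition" ...]
edge_pattern = re.compile(r'^(\w+)\s*->\s*(\w+)\s*\[label="([^"]*)"',
                          re.MULTILINE)
edges = edge_pattern.findall(dot_text)

for parent, child, condition in edges:
    print(f'{parent} -> {child} when {condition}')
```

Each edge label is the test outcome that leads from a split node to its child, and the box-shaped nodes are the leaves.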

### Evaluation

Finally, to evaluate the decision tree on the test set:

In [None]:
from weka.classifiers import Evaluation

J48test = Evaluation(testset)
J48test.test_model(J48, testset)
test_accuracy = J48test.percent_correct
print(f'Test accuracy: {test_accuracy:.4g}%')
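
The accuracy printed above is the percentage of instances whose predicted class matches the actual class. The following plain-Python sketch spells out that calculation on made-up labels (not taken from the segment data). The function `percent_correct` here is a hypothetical helper, not the Weka implementation.

```python
def percent_correct(actual, predicted):
    """Percentage of positions where the predicted label equals the actual one."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    if not actual:
        raise ValueError('need at least one instance')
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Made-up labels for illustration: 4 of the 5 predictions are correct.
actual = ['sky', 'grass', 'path', 'window', 'sky']
predicted = ['sky', 'grass', 'window', 'window', 'sky']

accuracy = percent_correct(actual, predicted)
print(f'Accuracy: {accuracy:.4g}%')
```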

**Exercise** Assign `train_accuracy` to the accuracy of the classifier on the training set.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
print(f'Training accuracy: {train_accuracy:.4g}%')

In [None]:
# hidden tests

**Exercise** By comparing the training accuracy and the test accuracy, can we tell whether a classifier has overfitted the data?

YOUR ANSWER HERE

To stop the Java virtual machine:

In [None]:
jvm.stop()