# Machine vs Machine

**CS5483 Data Warehousing and Data Mining**
___

<a title="Freddycastillo9871, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Terminator_2.png"><img width="512" alt="Terminator 2" src="https://upload.wikimedia.org/wikipedia/commons/3/38/Terminator_2.png"></a>

In [5]:
%reset -f
from IPython import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# produce vector inline graphics
%config InlineBackend.figure_format = 'svg'
from weka.core import dataset
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation, SingleClassifierEnhancer
from sklearn import ensemble, tree
from weka.core.classes import Random
from weka.core.classes import complete_classname
from scipy.io import arff
import urllib.request
import io

In this notebook, we will try to build the best machine to classify the image segmentation dataset:
- `segment-challenge.arff` for training, and
- `segment-test.arff` for testing.

## Individual Classification Algorithms

Weka provides a KnowledgeFlow interface to flow data through a learning algorithm. The following is a demo for training and testing a J48 decision tree.

In [6]:
display.IFrame(src="https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=3f33d95a-c5c2-4893-925b-acd10064acec&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all", height=450, width=805)

Similar to the other interfaces, the evaluation results can be saved to files as demonstrated below. You may also load an existing template using a button in the toolbar on the top right-hand corner.

In [7]:
display.IFrame(src="https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=e60913e2-128b-4e40-962d-acd1006b8347&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all", height=450, width=805)

Unlike the other interfaces, the KnowledgeFlow interface can train a classifier incrementally as more data becomes available. For more details, refer to the manual [here](https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf) and the [video tutorial](https://www.futurelearn.com/info/courses/more-data-mining-with-weka/0/steps/29106) by Witten.

Unlike the Explorer interface, the KnowledgeFlow interface also lets us flow data through multiple classification algorithms simultaneously, as illustrated below:

In [8]:
display.IFrame(src="https://cityuhk-lms.ap.panopto.com/Panopto/Pages/Embed.aspx?id=0d3c11b4-7809-45c7-a208-acd1006ea2f1&autoplay=false&offerviewer=true&showtitle=true&showbrand=false&start=0&interactivity=all",  height=450, width=805)

**Exercise** Follow the video above to add other classifiers: `IBk`, `ZeroR`, `OneR`, `PART`, and `JRIP`. Record their *fractional* accuracies in the dictionary `performance` as follows:

```Python
performance = {'J48': 0.961728,
               'IBk': ___,
               'ZeroR': ___,
               'OneR': ___,
               'PART': ___,
               'JRIP': ___}
```

Use the default parameters.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
performance

In [None]:
# tests

## Ensemble Methods

Next, we use the ensemble methods implemented in scikit-learn and Weka to train armies of classifiers.

### Data Preparation

The following loads the data for scikit-learn.

In [None]:
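def load_url(url):
    """Download an ARFF file and return (features, class labels)."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode('utf-8')
    df = pd.DataFrame(arff.loadarff(io.StringIO(text))[0])
    # loadarff returns nominal values as bytes, so decode them into plain strings
    return df.drop(columns='class'), df['class'].str.decode('utf-8')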

weka_data_path = 'https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/'
X_train, Y_train = load_url(weka_data_path + 'segment-challenge.arff')
X_test, Y_test = load_url(weka_data_path + 'segment-test.arff')
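
`scipy.io.arff.loadarff` returns nominal values, such as the class labels, as `bytes`. The small sketch below shows the difference between casting such a value with `str` and decoding it, which is worth keeping in mind when comparing the labels with ordinary strings. The label `b'sky'` is only an example value chosen for illustration.

```python
label = b'sky'  # example of a nominal value as returned by loadarff

as_repr = str(label)             # keeps the bytes prefix and the quotes
as_text = label.decode('utf-8')  # plain text

print(as_repr)
print(as_text)
```

Either form works for training as long as the training and test labels are converted in the same way.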

The following loads the data for `python-weka-wrapper`.

In [None]:
jvm.start()
loader = Loader(classname="weka.core.converters.ArffLoader")
# use load_file to load from a local file instead
trainset = loader.load_url(weka_data_path + 'segment-challenge.arff')
trainset.class_is_last()

testset = loader.load_url(weka_data_path + 'segment-test.arff')
testset.class_is_last()

### Bagging

A simple ensemble method is Bagging (Bootstrap Aggregation), which trains different base classifiers by applying one classification algorithm to different bootstrapped datasets.

For instance, the following uses the [`sklearn.ensemble.BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) to train 10 decision trees with maximum depth 5:

In [None]:
from sklearn import ensemble

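# Pass the base estimator positionally: its keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
BAG = ensemble.BaggingClassifier(
    tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0)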

BAG.fit(X_train, Y_train)
print(f'Accuracy: {BAG.score(X_test, Y_test):.4g}')
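
To make the bootstrap-and-vote idea behind Bagging concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements `BaggingClassifier`: the toy 1-D data, the 1-nearest-neighbour base learner, and the helper names `bootstrap`, `nearest_neighbour` and `bagging_predict` are all assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def nearest_neighbour(sample):
    """Return a 1-NN classifier trained on (x, label) pairs (toy base learner)."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging_predict(models, x):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy 1-D training set: class 'a' near 0, class 'b' near 20.
train = [(x, 'a') for x in range(10)] + [(x, 'b') for x in range(20, 30)]
# Each base classifier sees its own bootstrapped copy of the training set.
samples = [bootstrap(train, rng) for _ in range(11)]
models = [nearest_neighbour(s) for s in samples]
print(bagging_predict(models, 2), bagging_predict(models, 27))
```

Because each bootstrapped dataset is drawn independently, the base classifiers can be trained in any order.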

The ensemble method can be parallelized for both training and classification by setting the additional parameter `n_jobs` (the number of jobs to run in parallel) to a number other than 1. To see the effect, execute the following cell and answer `y` to the prompt.

In [None]:
if input('execute? [y/N] ') == 'y':
    for n_jobs in [1, 2, 4, 8]:
        BAG.set_params(n_estimators=1000, verbose=1, n_jobs = n_jobs)
        BAG.fit(X_train, Y_train)

Note that the speedup diminishes as the number of jobs increases, because parallelizing the training incurs both memory and communication overheads.

Next, we would like to see the effect of changing the depth and number of estimators as follows.

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

def bagging(n_estimators, max_depth):
    # base estimator passed positionally (keyword renamed in newer scikit-learn)
    BAG = ensemble.BaggingClassifier(
        tree.DecisionTreeClassifier(max_depth=max_depth),
        n_estimators=n_estimators,
        random_state=0)
    BAG.fit(X_train, Y_train)
    return BAG.score(X_test, Y_test)

We will run `bagging` for the different values of `max_depth` and `n_estimators` stored in `max_depth_list` and `n_estimators_list` respectively. The following parallelizes the execution using `joblib`.

In [None]:
from joblib import Parallel, delayed

if input('execute? [y/N] ') == 'y':
    results = Parallel(n_jobs=4, verbose=1)(
        delayed(bagging)(n_estimators, max_depth)
        for max_depth in max_depth_list for n_estimators in n_estimators_list)

To present the result nicely in a table and a plot:

In [None]:
if input('execute? [y/N] ') == 'y':
    BAG_df = pd.DataFrame(
        columns=[f'max_depth={max_depth}' for max_depth in max_depth_list],
        dtype=float)
    BAG_df.insert(0, 'n_estimators', n_estimators_list)
    BAG_df.loc[:, lambda df: ~df.columns.isin(['n_estimators'])] = np.reshape(
        results, (len(n_estimators_list), len(max_depth_list)), order='F')
    display.display(BAG_df)
    
    plt.figure(figsize=(8, 5))
    for col in BAG_df.columns[1:]:
        plt.plot(BAG_df['n_estimators'], BAG_df[col], label=col, marker='o')
    plt.legend()
    plt.xlabel('n_estimators')
    plt.ylabel('Accuracies')
    plt.title(r'Bagging Decision Trees')
    plt.show()
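
The `order='F'` reshape in the cell above relies on the order in which the `joblib` call produced the results: `max_depth` is the outer loop, so all results for one depth are consecutive and fill one column. The following pure-Python sketch reproduces that layout on a smaller grid. `fake_score` is a hypothetical stand-in for `bagging` that simply returns its arguments so the placement is visible.

```python
max_depth_list = [1, 2, 3]
n_estimators_list = [10, 20]

def fake_score(n_estimators, max_depth):
    """Hypothetical stand-in for bagging(): returns its arguments."""
    return (n_estimators, max_depth)

# Same loop order as the joblib call: max_depth outer, n_estimators inner.
results = [fake_score(n, d) for d in max_depth_list for n in n_estimators_list]

# Column-major fill: the results for one max_depth form one column.
n_rows = len(n_estimators_list)
table = [[results[col * n_rows + row] for col in range(len(max_depth_list))]
         for row in range(n_rows)]

for row in table:
    print(row)
```

Each row then corresponds to one value of `n_estimators` and each column to one value of `max_depth`, matching the layout of `BAG_df`.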

We can also use `GridSearchCV` to tune the parameters for the best model. To tune `n_estimators` by searching for the best value from `n_estimators_list` that maximizes the cross-validated accuracy:

In [None]:
from sklearn.model_selection import GridSearchCV

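param_grid = [{'n_estimators': n_estimators_list}]

# The base estimator is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in newer scikit-learn releases.
grid_search = GridSearchCV(
    ensemble.BaggingClassifier(
        tree.DecisionTreeClassifier(max_depth=10),
        random_state=0),
    param_grid, verbose=1, n_jobs=4)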

if input('execute? [y/N] ') == 'y':
    grid_search.fit(X_train, Y_train)
    print(f'Accuracy: {grid_search.score(X_test, Y_test):.4g}')
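
Conceptually, a grid search with cross-validation scores every candidate value on held-out folds of the training data and keeps the value with the best mean score. The sketch below illustrates this in pure Python. It is not the `GridSearchCV` implementation: the toy 1-D data, the k-nearest-neighbour classifier, the interleaved fold assignment, and the choice of 5 folds are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, n_folds=5):
    """Mean validation accuracy of k-NN over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        valid = data[fold::n_folds]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in valid)
        scores.append(correct / len(valid))
    return mean(scores)

# Toy 1-D data with two deliberately mislabeled points as noise.
data = [(x, 'a' if x < 15 else 'b') for x in range(30)]
data[3] = (3, 'b')
data[22] = (22, 'a')

param_grid = [1, 3, 5]  # candidate values of k
cv_scores = {k: cv_accuracy(data, k) for k in param_grid}
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

Note that the test set plays no role in the selection: only the training data is split into folds.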

**Exercise** What happens to the accuracy as `n_estimators` and `max_depth` increase?

YOUR ANSWER HERE

### Random Forest

Another ensemble method, called random forest, is similar to Bagging decision trees. However, to further diversify the base classifiers, it randomly selects or combines features before building each tree. The following trains a random forest of 10 decision trees with maximum depth 5.

In [None]:
RF = ensemble.RandomForestClassifier(max_depth=5, 
                                     n_estimators=10, 
                                     random_state=0)
RF.fit(X_train, Y_train)
print(f'Accuracy: {RF.score(X_test, Y_test):.4g}')
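
The feature randomization that distinguishes a random forest from Bagging can be sketched as follows. This is only an illustration under stated assumptions: the feature names are hypothetical, and the subset size of about the square root of the number of features is one common choice. Implementations may also differ in when the subset is redrawn; scikit-learn, as far as we know, redraws it at every split rather than once per tree.

```python
import math
import random

def random_feature_subset(features, rng):
    """Pick about sqrt(len(features)) features at random, without replacement."""
    k = max(1, int(math.sqrt(len(features))))
    return rng.sample(features, k)

rng = random.Random(0)
features = [f'f{i}' for i in range(16)]  # hypothetical feature names
for tree_id in range(3):
    # Each draw restricts the candidate features considered for a tree (or split).
    print(tree_id, random_feature_subset(features, rng))
```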

As with Bagging, we can parallelize the training and classification by setting the `n_jobs` parameter. In the above setting, however, the overhead outweighs the benefit, so it is better not to parallelize the training.

**Exercise** Complete the pandas `DataFrame` `RF_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`. Use `random_state = 0`.

Please keep your code inside the body of the condition `if input('execute? [y/N] ') == 'y':` so that the classifiers are not re-trained when your notebook is auto-graded. Re-training will likely take too long, and the server will forcibly stop the execution.

In [None]:
if input('execute? [y/N] ') == 'y':
    max_depth_list = [1, 2, 3, 5, 10, 20]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]
    RF_df = pd.DataFrame(columns=[f'max_depth={max_depth}' for max_depth in max_depth_list], dtype=float)
    RF_df.insert(0, 'n_estimators', n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(RF_df.round(4))

    plt.figure(figsize=(8, 5))
    for col in RF_df.columns[1:]:
        plt.plot(RF_df['n_estimators'],
                 RF_df[col],
                 label=col,
                 marker='o')
    plt.legend()
    plt.xlabel('n_estimators')
    plt.ylabel('Accuracies')
    plt.title(r'Random forest of different sizes and depths')
    plt.show()

To train a random forest of 10 decision trees with maximum depth 5 using `python-weka-wrapper`:

In [None]:
RF = Classifier(classname="weka.classifiers.trees.RandomForest")
RF.options = ['-I', '10',
              '-depth', '5',
              '-S', '1']
RF.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(RF, testset)
print(f'Accuracy: {evl.percent_correct/100:.4g}')

**Exercise** Repeat the previous exercise but with Weka instead. Use a random seed of 1.

In [None]:
if input('execute? [y/N] ') == 'y':
    max_depth_list = [1, 2, 3, 5, 10, 20]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]
    RF_df = pd.DataFrame(columns=[f'max_depth={max_depth}' for max_depth in max_depth_list], dtype=float)
    RF_df.insert(0, 'n_estimators', n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(RF_df.round(4))

    plt.figure(figsize=(8, 5))
    for col in RF_df.columns[1:]:
        plt.plot(RF_df['n_estimators'],
                 RF_df[col],
                 label=col,
                 marker='o')
    plt.legend()
    plt.xlabel('n_estimators')
    plt.ylabel('Accuracies')
    plt.title(r'Random forest of different sizes and depths')
    plt.show()

**Exercise** What are the best choices of `n_estimators` and `max_depth`?

YOUR ANSWER HERE

### AdaBoost

To train AdaBoost with 10 decision trees of maximum depth 5:

In [None]:
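# Pass the base estimator positionally: its keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
ADB = ensemble.AdaBoostClassifier(
    tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0)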
ADB.fit(X_train, Y_train)
print(f'Accuracy: {ADB.score(X_test, Y_test):.4g}')
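
To see what happens between two boosting rounds, the following sketch performs one reweighting step of the textbook AdaBoost.M1 algorithm. It is a simplified illustration, not the scikit-learn or Weka implementation, which may differ in details. The function name `adaboost_m1_round` and the toy weights are assumptions, and the sketch assumes the weighted error lies strictly between 0 and 0.5.

```python
import math

def adaboost_m1_round(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights: current instance weights (summing to 1)
    correct: booleans, True where the current base classifier is right
    Returns (new_weights, vote), where vote is the classifier's weight
    in the final weighted vote.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    beta = error / (1 - error)  # below 1 when error < 0.5
    # Shrink the weights of correctly classified instances, then renormalize.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], math.log(1 / beta)

weights = [0.2] * 5
correct = [True, True, False, True, True]
new_weights, vote = adaboost_m1_round(weights, correct)
print([round(w, 3) for w in new_weights], round(vote, 3))
```

After the update, the misclassified instance carries half of the total weight, so the next base classifier is pushed to get it right.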

**Exercise** Note that, unlike random forest, AdaBoost cannot be parallelized. Why?

YOUR ANSWER HERE

**Exercise** Complete the pandas `DataFrame` `ADB_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`. Use `random_state = 0`.

In [None]:
if input('execute? [y/N] ') == 'y':
    max_depth_list = [1, 2, 3, 5, 10]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
    ADB_df = pd.DataFrame(columns=[f'max_depth={max_depth}' for max_depth in max_depth_list], dtype=float)
    ADB_df.insert(0, 'n_estimators', n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(ADB_df.round(4))

    plt.figure(figsize=(8, 5))
    for col in ADB_df.columns[1:]:
        plt.plot(ADB_df['n_estimators'],
                 ADB_df[col],
                 label=col,
                 marker='o')
    plt.legend()
    plt.xlabel('n_estimators')
    plt.ylabel('Accuracies')
    plt.title(r'AdaBoost of different sizes and depths')
    plt.show()

To train AdaBoost with 10 decision trees of maximum depth 5 using `python-weka-wrapper`:

In [None]:
REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ['-L', '5']
ADB = SingleClassifierEnhancer(classname="weka.classifiers.meta.AdaBoostM1")
ADB.options=['-I', '10',
             '-S', '1']
ADB.classifier = REPTree
ADB.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(ADB, testset)
print(f'Accuracy: {evl.percent_correct/100:.4g}')

**Exercise** Repeat the previous exercise but with Weka instead. Use a random seed of 1.

In [None]:
if input('execute? [y/N] ') == 'y':
    max_depth_list = [1, 2, 3, 5, 10]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
    ADB_df = pd.DataFrame(columns=[f'max_depth={max_depth}' for max_depth in max_depth_list], dtype=float)
    ADB_df.insert(0, 'n_estimators', n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError()
    display.display(ADB_df.round(4))

    plt.figure(figsize=(8, 5))
    for col in ADB_df.columns[1:]:
        plt.plot(ADB_df['n_estimators'],
                 ADB_df[col],
                 label=col,
                 marker='o')
    plt.legend()
    plt.xlabel('n_estimators')
    plt.ylabel('Accuracies')
    plt.title(r'AdaBoost of different sizes and depths')
    plt.show()

**Exercise** Which ensemble method is better, AdaBoost or random forest? Why?

YOUR ANSWER HERE

## Optional Challenge

Train your own classifier to achieve the highest possible accuracy. You may:
- choose different classification algorithms or ensemble methods such as Bagging, Stacking, Voting, XGBoost, etc.
- tune the hyper-parameters manually or automatically using `GridSearchCV` in `scikit-learn` or `CVParameterSelection` in Weka.

Post your model and results on [Canvas](https://canvas.cityu.edu.hk/courses/39808/discussion_topics/306324) to compete with others. If you want to include your code in this notebook, make sure you put it inside the body of the condition `if input('execute? [y/N] ') == 'y':` to avoid re-training your classifier.

The following is an example using XGBoost with its default parameters.
```Python
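import xgboost
from sklearn.preprocessing import LabelEncoder
# Recent XGBoost releases expect integer class labels, so encode the strings first
le = LabelEncoder().fit(Y_train)
XGB = xgboost.XGBClassifier()
XGB.fit(X_train, le.transform(Y_train))
print(f'Accuracy: {XGB.score(X_test, le.transform(Y_test)):.4g}')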
```