# Machine vs Machine

**CS5483 Data Warehousing and Data Mining**
___

<a title="Freddycastillo9871, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Terminator_2.png"><img width="512" alt="Terminator 2" src="https://upload.wikimedia.org/wikipedia/commons/3/38/Terminator_2.png"></a>

```{important}
The deadline for this tutorial is extended by one week, i.e., to the lecture one week after the midterm.
```

In [None]:
import io
import os
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import weka.core.jvm as jvm
from IPython import display
from joblib import Memory, Parallel, delayed, dump, load
from scipy.io import arff
from sklearn import ensemble, tree
from sklearn.model_selection import GridSearchCV
from weka.classifiers import Classifier, Evaluation, SingleClassifierEnhancer
from weka.core.converters import Loader

%matplotlib widget
jvm.start()

In this notebook, we will try to build the best machine to classify the image segmentation dataset:
- `segment-challenge.arff` for training, and
- `segment-test.arff` for testing.

## Individual Classification Algorithms

Weka provides a KnowledgeFlow interface to flow data through a learning algorithm. Unlike the other interfaces (Explorer and Experimenter), the KnowledgeFlow interface can train a classifier incrementally as more data become available.

To have an overview of the interface, take a look at the [video tutorial](https://www.futurelearn.com/info/courses/more-data-mining-with-weka/0/steps/29106) by Witten.

```{seealso}
For more details, refer to the manual [here](https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf).
```

**To open a layout:**
- Run Weka.
- Click `KnowledgeFlow` button under Applications.
    ```{tip}
    An untitled new layout should have been opened. You can also create a new layout using `File`->`New Layout`.
    ```
- To load a layout, click `File->Open...`. Try opening [segment-RF.kf](segment-RF.kf) and [segment-Adaboost.kf](segment-Adaboost.kf).
    ```{tip}
    To save the current layout to a new file, click the menu item `File->Save/Save As...`.
    ```

**To run a layout:**

With [segment-RF.kf](segment-RF.kf) or [segment-Adaboost.kf](segment-Adaboost.kf) opened:
- Click the play button to start the training/testing.
- If the `Status` panel at the bottom shows a successful run, 
  - right-click any `TextViewer` and select `show results` to show the collected result;
  - right-click any `GraphViewer` and select `show plots` to show the plots.

```{tip}
- You may save the text results to files as in the Explorer interface.
- Unlike the Explorer interface, we can flow data through multiple classification algorithms simultaneously.
```

The following demonstrates how to use the interface to create a layout that trains and tests a J48 decision tree.

**To set up the training data:**
- Click `ArffLoader` (under the `DataSources` folder) from the `Design` panel on the left, then click anywhere on the layout panel to add it.
- Similarly, add a `ClassAssigner` and a `TrainingSetMaker` (under `Evaluation` folder) to the layout.
- Right-click the `ArffLoader` and select `dataSet` (under `Connections:`). Click `ClassAssigner` in the layout to connect the data to it.
- Similarly, connect the data from the `ClassAssigner` to `TrainingSetMaker`.
- Right-click the `ArffLoader` and click `Browse...` to select the training data `segment-challenge.arff`.
```{tip}
- On the JupyterHub server, you may find the data under `/data/`.
- By default, `ClassAssigner` selects the last attribute as the target, but you can change the target attribute by double-clicking `ClassAssigner`.
```

**To set up the test data:**
- Add another `ArffLoader` and `ClassAssigner` to the layout and connect the data from the former to the latter.
- Add a `TestSetMaker` (instead of a `TrainingSetMaker` under the `Evaluation` folder) and connect the data from the new `ClassAssigner` to it.
- Configure the new `ArffLoader` to load `segment-test.arff` (instead of `segment-challenge.arff`).

```{tip}
- Instead of adding the same block/connection multiple times, you can select, copy, and paste existing blocks.
- You can rename a block to differentiate it from other blocks in the layout.
```

**To set up the classifier:**

- Add a `J48` (under `Classifiers/trees` folder) and a `ClassifierPerformanceEvaluator` (under `Evaluation`) to the layout.
- Connect the `trainingSet` from `TrainingSetMaker` to `J48`.
- Connect the `testSet` from `TestSetMaker` to `J48`.
- Connect the `batchClassifier` from `J48` to `ClassifierPerformanceEvaluator`.
```{tip}
- You can configure the classifier by double-clicking `J48`.
- You can configure the evaluation metrics by double-clicking `ClassifierPerformanceEvaluator`.
```

**To display the training result:**

- Add two `TextViewer`s and a `GraphViewer` (under `Visualization` folder) to the layout.
- Connect the `text` and `graph` from `J48` to the first `TextViewer` and the `GraphViewer`, respectively.
- Connect the `text` from the `ClassifierPerformanceEvaluator` to the second `TextViewer`.

```{tip}
- Alternatively, you can use a single `TextViewer` to receive the `text` outputs of multiple blocks.
```

**Exercise** Using the KnowledgeFlow interface, add other classifiers: `IBk`, `ZeroR`, `OneR`, `PART`, and `JRIP`. Record their *fractional* accuracies in the dictionary `performance` as follows:

```Python
performance = {'J48': 0.961728,
               'IBk': ___,
               'ZeroR': ___,
               'OneR': ___,
               'PART': ___,
               'JRIP': ___}
```

Use the default parameters.

In [None]:
# your python code here
# end of python code

performance

In [None]:
# hidden tests

## Ensemble Methods

Unlike training an individual classifier, ensemble methods
- train an army of base classifiers and
- combine their decisions into a final decision.
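
As a minimal sketch of the second step, the decisions of several base classifiers can be combined by a majority vote. The function `majority_vote` and the predictions below are hypothetical and only illustrate the idea; actual implementations may instead average the predicted class probabilities.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the class labels predicted by several base classifiers.

    predictions: one list of labels per base classifier, all of equal length.
    Returns, for each sample, the label predicted most often.
    """
    combined = []
    for votes in zip(*predictions):  # votes for one sample across classifiers
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions of three base classifiers on four samples.
votes = [
    ["sky", "grass", "path", "sky"],
    ["sky", "grass", "grass", "window"],
    ["cement", "grass", "path", "sky"],
]
print(majority_vote(votes))  # ['sky', 'grass', 'path', 'sky']
```

Each base classifier is wrong on at least one sample here, yet the combined decision follows the majority on every sample.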

We will use the ensemble methods implemented in scikit-learn and Weka. To load the data for scikit-learn:

In [None]:
def load_url(url):
    ftpstream = urllib.request.urlopen(url)
    df = pd.DataFrame(arff.loadarff(io.StringIO(ftpstream.read().decode("utf-8")))[0])
    return df.loc[:, lambda df: ~df.columns.isin(["class"])], df["class"].astype(str)


weka_data_path = (
    "https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/"
)
X_train, Y_train = load_url(weka_data_path + "segment-challenge.arff")
X_test, Y_test = load_url(weka_data_path + "segment-test.arff")

To load the data for `python-weka-wrapper`:

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
trainset = loader.load_url(
    weka_data_path + "segment-challenge.arff"
)  # use load_file to load from file instead
trainset.class_is_last()

testset = loader.load_url(weka_data_path + "segment-test.arff")
testset.class_is_last()

### Bagging

Bagging (Bootstrap Aggregation) is a simple ensemble method that trains different base classifiers by *applying a classification algorithm to different bootstrapped datasets.*

For instance, the following uses the [`sklearn.ensemble.BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) to train 10 decision trees with a maximum depth of 5:

```python
from sklearn import ensemble
```

In [None]:
BAG = ensemble.BaggingClassifier(
    base_estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
)

BAG.fit(X_train, Y_train)
print(f"Accuracy: {BAG.score(X_test, Y_test):.4g}")
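
To see what a bootstrapped dataset looks like, the following sketch draws as many samples as the original dataset has, with replacement, using only the standard library. The dataset size `n` and the seed are assumptions for illustration, and the sketch is not how `BaggingClassifier` is implemented internally.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000  # hypothetical training-set size
indices = list(range(n))

# A bootstrapped dataset draws n samples *with replacement* from the original,
# so some samples appear several times and others not at all.
bootstrap = rng.choices(indices, k=n)

unique_fraction = len(set(bootstrap)) / n
print(f"Distinct samples in the bootstrap: {unique_fraction:.1%}")
```

For large `n`, the expected fraction of distinct samples approaches $1 - 1/e \approx 63.2\%$, so each base classifier sees a noticeably different dataset.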

The ensemble method can be parallelized for both training and classification by setting the additional parameter `n_jobs`, the number of jobs to run in parallel. Different jobs run on different CPU cores or threads. To see the effect, execute the following cell and answer `y` to the prompt or just press Enter. Your output may look like the following:

```
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    5.0s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.1s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    3.1s finished
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   2 out of   4 | elapsed:    1.8s remaining:    1.8s
[Parallel(n_jobs=4)]: Done   4 out of   4 | elapsed:    1.8s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    1.1s remaining:    3.4s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    1.2s finished
```

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    for n_jobs in [1, 2, 4, 8]:
        BAG.set_params(n_estimators=1000, verbose=1, n_jobs=n_jobs)
        BAG.fit(X_train, Y_train)

````{note}

Observe that the speedup may not be proportional to the number of jobs because of the memory and communication overheads required to parallelize the training. Concurrent access to the same memory can lead to unexpected behavior, so it must be prevented by duplicating or locking the data, leading to overheads. For more details, see the documentation [here](https://joblib.readthedocs.io/en/latest/parallel.html).

````
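
As a rough illustration of the overhead, the speedup and parallel efficiency can be computed from the elapsed times in the sample log shown earlier. The timings are copied from that log; your own numbers will differ.

```python
# Elapsed times (in seconds) copied from the sample log; yours will differ.
elapsed = {1: 5.0, 2: 3.1, 4: 1.8, 8: 1.2}

baseline = elapsed[1]
for n_jobs, seconds in elapsed.items():
    speedup = baseline / seconds  # how many times faster than a single job
    efficiency = speedup / n_jobs  # 1.0 would be a perfectly linear speedup
    print(f"n_jobs={n_jobs}: speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```

With these sample timings, 8 jobs give a speedup of about 4.17 rather than 8, i.e., an efficiency of about 0.52.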

Next, we would like to see the effect of changing the depth and number of estimators. The following are the lists of possible values to explore:

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

We will define a function `bagging(n_estimators, max_depth)` that returns the accuracy of Bagging `n_estimators` decision trees of maximum depth `max_depth`. To avoid re-training/evaluating a classifier, we additionally cache the result using `joblib.Memory`:

```python
from joblib import Memory
import os
```

In [None]:
# cache to private folder
os.makedirs("private", exist_ok=True)
memory = Memory(location="private", verbose=0)


@memory.cache
def bagging(n_estimators, max_depth):
    BAG = ensemble.BaggingClassifier(
        base_estimator=tree.DecisionTreeClassifier(max_depth=max_depth),
        n_estimators=n_estimators,
        random_state=0,
    )
    BAG.fit(X_train, Y_train)
    return BAG.score(X_test, Y_test)
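
The effect of caching can be illustrated with the standard library's `functools.lru_cache`, used here as a swapped-in, in-memory analogue. Unlike `joblib.Memory`, it does not store results on disk, so the cache is lost when the kernel restarts. The function `slow_square` is a hypothetical stand-in for an expensive training run.

```python
from functools import lru_cache

calls = []  # records every time the function body actually runs


@lru_cache(maxsize=None)
def slow_square(x):
    """Hypothetical stand-in for an expensive training run."""
    calls.append(x)
    return x * x


first = slow_square(12)  # computed
second = slow_square(12)  # served from the cache; the body does not run again
print(first, second, len(calls))  # 144 144 1
```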

To run `bagging` in parallel for different choices of parameters, we can use the following tools from `joblib`:

```python
from joblib import Parallel, delayed
```

The following will divide the work into 4 jobs to run in parallel:

```
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  47 out of  54 | elapsed:    2.3s remaining:    0.3s
[Parallel(n_jobs=4)]: Done  54 out of  54 | elapsed:    3.0s finished
```

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    BAG_results = Parallel(n_jobs=4, verbose=1)(
        delayed(bagging)(n_estimators, max_depth)
        for max_depth in max_depth_list
        for n_estimators in n_estimators_list
    )

````{note}

- The argument passed to `Parallel(n_jobs=4, verbose=1)(...)` is a generator expression that generates the calls to `bagging` one by one. For more details, see the notebook [here](https://www.cs.cityu.edu.hk/~ccha23/cs1302book/Lecture6/Generator.html#id1) and the documentation [here](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- `delayed(bagging)(n_estimators, max_depth)` delays the call `bagging(n_estimators, max_depth)` until `Parallel(n_jobs=4, verbose=1)(...)` assigns it to a job/core/thread.

````
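
The idea of delaying a call can be sketched in plain Python. The helpers `toy_delayed` and `toy_parallel` below are hypothetical simplifications written for this illustration: they record each call as a tuple and then execute the recorded calls sequentially, without any actual parallelism.

```python
def toy_delayed(func):
    """Toy stand-in for joblib.delayed: record the call instead of making it."""

    def record(*args, **kwargs):
        return func, args, kwargs

    return record


def toy_parallel(tasks):
    """Toy stand-in for joblib.Parallel: run the recorded calls one by one."""
    return [func(*args, **kwargs) for func, args, kwargs in tasks]


def add(a, b):
    return a + b


# The generator expression yields recorded calls lazily; nothing is computed
# until toy_parallel iterates over it.
tasks = (toy_delayed(add)(a, b) for a in [1, 2] for b in [10, 20])
print(toy_parallel(tasks))  # [11, 21, 12, 22]
```

Because the calls are only recorded, the scheduler is free to decide when and where each one is executed.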

To present the result nicely in a `DataFrame`:

In [None]:
def tabulate(results):
    df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    df.insert(0, "n_estimators", n_estimators_list)
    df.loc[:, lambda df: ~df.columns.isin(["n_estimators"])] = np.reshape(
        results, (len(n_estimators_list), len(max_depth_list)), order="F"
    )
    return df


BAG_df = tabulate(BAG_results)
display.display(BAG_df)
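
The use of `order="F"` in `tabulate` may deserve a closer look: the results arrive with `max_depth` in the outer loop, while the table has one row per value of `n_estimators`. The following pure-Python sketch, with a small hypothetical grid, reproduces the same column-by-column filling.

```python
# Hypothetical grid: 2 depths and 3 ensemble sizes.
depths = [1, 2]
sizes = [10, 20, 30]

# Results arrive in the order of the nested loops: depth outer, size inner.
results = [f"d{d}-n{n}" for d in depths for n in sizes]

# Column-major filling: entry (row j, column i) is results[i * len(sizes) + j],
# which mirrors np.reshape(results, (len(sizes), len(depths)), order="F").
table = [
    [results[i * len(sizes) + j] for i in range(len(depths))]
    for j in range(len(sizes))
]
for row in table:
    print(row)
```

Each printed row corresponds to one ensemble size and each column to one depth, matching the layout of the `DataFrame`.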

If you run the parallel cell again, `bagging` is called with the same combinations of arguments, but the cached results are returned without re-training the classifiers. You can clear the cache with the following:

```python
bagging.clear()
```

It is helpful to save the `DataFrame` to a particular file. This can be done using [`joblib.dump`](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html):

```python
from joblib import dump
```

In [None]:
if input("(over-)write file? [Y/n] ").lower() != "n":
    dump(BAG_df, "BAG_df.gz")

Unlike the cache, the saved file can be loaded anywhere, even outside this notebook:

```python
from joblib import load
```

In [None]:
BAG_df = load("BAG_df.gz")

To remove the file:

```python
os.remove("BAG_df.gz")
```

To plot the `DataFrame`:

In [None]:
def plot(df):
    for col in df.columns[1:]:
        plt.plot(df["n_estimators"], df[col], label=col, marker="o")
    plt.legend()
    plt.xlabel("n_estimators")
    plt.ylabel("Accuracies")


plt.figure()
plot(BAG_df)
plt.title(r"Bagging decision trees")
plt.show()

**Exercise** What happens to the accuracy as `n_estimators` and `max_depth` increase?

YOUR ANSWER HERE

To apply the ensemble method using `python-weka-wrapper` instead of scikit-learn:

In [None]:
REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]
BAG_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging")
BAG_weka.options = ["-I", "10", "-S", "1"]
BAG_weka.classifier = REPTree
BAG_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(BAG_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

The base classifiers for [`Bagging`](https://weka.sourceforge.io/doc.dev/weka/classifiers/meta/Bagging.html) are trained using [`REPTree`](https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/REPTree.html), which is a fast decision tree induction algorithm that is neither C4.5 nor CART.

**Exercise** Complete the pandas `DataFrame` `BAG_weka_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`.

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    BAG_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    BAG_weka_df.insert(0, "n_estimators", n_estimators_list)
    # your python code here
    # end of python code
    

    display.display(BAG_weka_df.round(4))

    plt.figure()
    plot(BAG_weka_df)
    plt.title(r"Bagging decision trees")
    plt.show()

    dump(BAG_weka_df, "BAG_weka_df.gz")

````{important}

To avoid re-training, the last line above saves your result to a file called `BAG_weka_df.gz`. Otherwise, the server cannot auto-grade your submission as it aborts execution that takes excessive time or memory. Ensure that your code is indented correctly, so it is part of the body of the conditional in the solution cell:
```Python
...
if input('execute? [Y/n] ').lower() != 'n':
    ...
```

````

In [None]:
# hidden tests

### Random Forest

Another ensemble method, called [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), is similar to Bagging decision trees. However, to further diversify the base classifiers, it randomly selects or combines features when building each tree. The following trains a random forest of 10 decision trees with a maximum depth of 5.

In [None]:
RF = ensemble.RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
RF.fit(X_train, Y_train)
print(f"Accuracy: {RF.score(X_test, Y_test):.4g}")
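
The random selection of features can be sketched as follows. The feature names are hypothetical, and considering about $\sqrt{d}$ of the $d$ features at each split is a common choice assumed here for illustration; the exact rule depends on the implementation and its parameters.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical feature names; the real dataset has its own attributes.
features = [f"feature_{i}" for i in range(16)]

# A common choice is to consider about sqrt(d) randomly chosen features
# whenever a tree node is split.
n_candidates = max(1, round(math.sqrt(len(features))))

for split in range(3):
    candidates = rng.sample(features, n_candidates)  # drawn without replacement
    print(f"split {split}: {candidates}")
```

Since different splits consider different candidate features, trees trained on similar data can still end up quite different.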

Like Bagging, we can also parallelize the training and classification by setting the `n_jobs` parameter.

**Exercise** Complete the pandas `DataFrame` `RF_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`, and with `random_state = 0`.

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_df.insert(0, "n_estimators", n_estimators_list)
    # your python code here
    # end of python code
    
    display.display(RF_df.round(4))

    plt.figure()
    plot(RF_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_df, "RF_df.gz")

In [None]:
# hidden tests

To train a random forest of 10 decision trees with a maximum depth of 5 using `python-weka-wrapper`:

In [None]:
RF_weka = Classifier(classname="weka.classifiers.trees.RandomForest")
RF_weka.options = ["-I", "10", "-depth", "5", "-S", "1"]
RF_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(RF_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

**Exercise** Repeat the previous exercise but with [Weka](https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/RandomForest.html) instead. Use a random seed of 1.

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_weka_df.insert(0, "n_estimators", n_estimators_list)
    # your python code here
    # end of python code
    
    display.display(RF_weka_df.round(4))

    plt.figure()
    plot(RF_weka_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_weka_df, "RF_weka_df.gz")

In [None]:
# hidden tests

**Exercise** What are the values of `n_estimators` and `max_depth` that maximize the above accuracy on the test set?

YOUR ANSWER HERE

For `sklearn`, we can tune the parameters of a classification algorithm using `GridSearchCV` imported as follows:

```python
from sklearn.model_selection import GridSearchCV
```

For instance, to tune `n_estimators` and `max_depth` by searching `n_estimators_list` and `max_depth_list` for the combination that maximizes the cross-validated accuracy on the training set:

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10, 20]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]
    param_grid = {"n_estimators": n_estimators_list, "max_depth": max_depth_list}

    grid_search = GridSearchCV(
        ensemble.RandomForestClassifier(random_state=0), param_grid, verbose=1, n_jobs=4
    )

    grid_search.fit(X_train, Y_train)

    print(f"Accuracy: {grid_search.score(X_test, Y_test):.4g}")
    print(f"Best parameters: {grid_search.best_params_}")
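
To get a feel for the cost of the search, the following sketch enumerates the candidates of the same grid with `itertools.product`. The number of folds `n_folds = 5` is an assumption that matches the default of recent scikit-learn versions; a final refit on the whole training set may be added on top.

```python
from itertools import product

param_grid = {
    "n_estimators": [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300],
    "max_depth": [1, 2, 3, 5, 10, 20],
}

# Every combination of parameter values is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed number of cross-validation folds
n_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

With this grid, there are 66 candidates and hence 330 fits, which is why the search benefits from `n_jobs`.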

**Exercise** Observe that the best parameter values that maximize the cross-validated accuracy need not maximize the accuracy on the test set. Explain whether it is better to use the maximum accuracy on the test set as the performance estimate of random forest with parameter tuning.

YOUR ANSWER HERE

### AdaBoost

Using [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), we can boost the performance by adding base classifiers one by one, each trained to correct the errors made by the previously trained classifiers. To train AdaBoost with 10 decision trees of a maximum depth of 5:

In [None]:
ADB = ensemble.AdaBoostClassifier(
    base_estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
)
ADB.fit(X_train, Y_train)
print(f"Accuracy: {ADB.score(X_test, Y_test):.4g}")

````{note}

The original AdaBoost only works for binary classification. `AdaBoostClassifier` actually implements a multi-class extension of AdaBoost called AdaBoost-SAMME.

````
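
One boosting round of the discrete SAMME update can be sketched as follows. The number of classes and the outcome of the base classifier are assumptions made up for illustration, and the details of an actual implementation may differ.

```python
import math

n_classes = 7  # assumed number of classes, for illustration only
# Hypothetical outcome of one base classifier on five equally weighted samples:
# True means the sample was misclassified.
misclassified = [False, False, True, False, False]
weights = [1 / len(misclassified)] * len(misclassified)

# Weighted error of the base classifier.
error = sum(w for w, wrong in zip(weights, misclassified) if wrong)

# SAMME vote weight: the extra log(K - 1) term keeps it positive as long as
# the classifier beats random guessing among K classes.
alpha = math.log((1 - error) / error) + math.log(n_classes - 1)

# Increase the weights of the misclassified samples, then renormalize.
weights = [
    w * math.exp(alpha) if wrong else w for w, wrong in zip(weights, misclassified)
]
total = sum(weights)
weights = [w / total for w in weights]

print(f"error={error:.2f}, alpha={alpha:.3f}")
print([round(w, 3) for w in weights])
```

After the update, the misclassified sample carries most of the total weight, so the next base classifier is trained to focus on it.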

**Exercise** Unlike Bagging and random forest, the training of the base classifiers for AdaBoost cannot be parallelized. Why?

YOUR ANSWER HERE

**Exercise** Complete the pandas `DataFrame` `ADB_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`. Use `random_state = 0`.

In [None]:
max_depth_list = [1, 2, 3, 5, 10]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    ADB_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_df.insert(0, "n_estimators", n_estimators_list)
    # your python code here
    # end of python code
    
    display.display(ADB_df.round(4))

    plt.figure()
    plot(ADB_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_df, "ADB_df.gz")

In [None]:
# hidden tests

To train [AdaBoost](https://weka.sourceforge.io/doc.dev/weka/classifiers/meta/AdaBoostM1.html) with 10 decision trees of maximum depth 5 using `python-weka-wrapper`:

In [None]:
REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]
ADB_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.AdaBoostM1")
ADB_weka.options = ["-I", "10", "-S", "1"]
ADB_weka.classifier = REPTree
ADB_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(ADB_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

Weka uses the multi-class extension called AdaBoost.M1, which is different from AdaBoost-SAMME.

**Exercise** Repeat the previous exercise but with Weka instead. Use a random seed of 1.

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
    ADB_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_weka_df.insert(0, "n_estimators", n_estimators_list)
    # your python code here
    # end of python code
    
    display.display(ADB_weka_df.round(4))

    plt.figure()
    plot(ADB_weka_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_weka_df, "ADB_weka_df.gz")

In [None]:
# hidden tests

**Exercise** Which ensemble method is better, AdaBoost or random forest? Why?

YOUR ANSWER HERE

## Challenge

Train your own classifier to achieve the highest possible accuracy. You may:
- choose different classification algorithms or ensemble methods such as Bagging, Stacking, Voting, and XGBoost.
- tune the hyper-parameters manually or automatically using `GridSearchCV` in scikit-learn or `CVParameterSelection` in Weka.

Post your model and results on [Canvas](https://canvas.cityu.edu.hk/courses/51777/discussion_topics/440013) to compete with others. To include your code in this notebook, make sure you avoid excessive time or memory by putting your code in the body of the conditional `if input('execute? [Y/n] ').lower() != 'n':`.

The following is an example using XGBoost with its default parameters.

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    !pip install xgboost
    import xgboost
    codes, uniques = pd.concat([Y_train, Y_test]).factorize()
    Y_train_codes, Y_test_codes = codes[:len(Y_train)], codes[len(Y_train):]
    XGB = xgboost.XGBClassifier(n_jobs=1, use_label_encoder=False)
    XGB.fit(X_train, Y_train_codes)
    print(f"Accuracy: {XGB.score(X_test, Y_test_codes):.4g}")