---
title: Machine vs Machine
---

::::{figure} https://upload.wikimedia.org/wikipedia/commons/3/38/Terminator_2.png
:alt: Terminator 2
::::

```{attention}
The deadline for this tutorial is extended by one week, i.e., to the lecture one week after the midterm.
```

In [None]:
import io
import logging
import os
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import weka.core.jvm as jvm
from IPython import display
from joblib import Memory, Parallel, delayed, dump, load
from scipy.io import arff
from sklearn import ensemble, tree
from sklearn.model_selection import GridSearchCV
from weka.classifiers import Classifier, Evaluation, SingleClassifierEnhancer
from weka.core.converters import Loader

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

Do NOT modify the following configuration as it is required to cache/load your results.

In [None]:
jvm.start(logging_level=logging.ERROR)
# cache to private folder
os.makedirs("private", exist_ok=True)
memory = Memory(location="private", verbose=0)

# To tabulate results of ensemble methods
def tabulate(results):
    df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    df.insert(0, "n_estimators", n_estimators_list)
    df.loc[:, lambda df: ~df.columns.isin(["n_estimators"])] = np.reshape(
        results, (len(n_estimators_list), len(max_depth_list)), order="F"
    )
    return df

# To plot the dataframe
def plot(df):
    for col in df.columns[1:]:
        plt.plot(df["n_estimators"], df[col], label=col, marker="o")
    plt.legend()
    plt.xlabel("n_estimators")
    plt.ylabel("Accuracies")

# Load file
def load_file(filename):
    if os.path.exists(filename):
        os.replace(filename, "private/" + filename)
    return load("private/" + filename)

In this notebook, we will strive to build the best machine to classify the image segmentation datasets:

- `segment-challenge.arff` for training, and
- `segment-test.arff` for testing.

## Knowledge Flow Interface

Weka provides a KnowledgeFlow interface to flow data through a learning algorithm. Unlike the other interfaces (Explorer and Experimenter), the KnowledgeFlow interface can train a classifier incrementally as more data become available. For an overview of the interface, take a look at the [video tutorial](https://www.futurelearn.com/info/courses/more-data-mining-with-weka/0/steps/29106) by Witten. For more details, refer to the manual [here](https://www.cs.waikato.ac.nz/ml/weka/Witten_et_al_2016_appendix.pdf).

::::{seealso} If you are interested in data stream mining, ...
:class: dropdown

- Consider exploring [MOA (Massive Online Analysis)](https://moa.cms.waikato.ac.nz/), a framework for data stream mining that is based on Weka. It is designed for massive data streams and provides a comprehensive suite of machine learning algorithms.
    - Start a `Desktop` from `File->New Launcher` and run `moa` in the terminal.
- Try the python package [CapyMOA](https://capymoa.org/), which allows you to leverage MOA's capabilities within a Python environment.

::::

### Open a layout

1. Run Weka.
2. Click `KnowledgeFlow` button under Applications.
3. An untitled new layout should open. You can also create a new layout using `File`->`New Layout`.
4. To load a layout, click `File->Open...`. Try opening [segment-RF.kf](segment-RF.kf) and [segment-Adaboost.kf](segment-Adaboost.kf) in `Tutorial6` folder.
5. To save the current layout to a new file, click the menu item `File->Save/Save As...`.

### Run a layout

With [segment-RF.kf](segment-RF.kf) or [segment-Adaboost.kf](segment-Adaboost.kf) opened:
1. Click the play button to start the training/testing. Unlike the Explorer interface, we can flow data through multiple classification algorithms simultaneously.
2. If the `Status` panel at the bottom shows a successful run, 
    - right-click any `TextViewer` block and select `show results` to show the collected result;[^save]
    - right-click any `GraphViewer` block (e.g., in `segment-RF.kf`) and select `show plots` to show the plots.

[^save]: You may save the text results to files as in the Explorer interface.

The following demonstrates how to use the interface to create a layout that trains and tests a J48 decision tree.

### Load the data

To load the training data:

1. Click `ArffLoader` (under the `DataSources` folder) from the `Design` panel on the left, then click anywhere on the layout panel to add it.[^data]
2. Similarly, add a `ClassAssigner` and a `TrainingSetMaker` (under `Evaluation` folder) to the layout.
3. Right-click the `ArffLoader` and select `dataSet` (under `Connections:`). Click `ClassAssigner` in the layout to connect the data to it.[^class]
4. Similarly, connect the data from the `ClassAssigner` to `TrainingSetMaker`.
5. Right-click the `ArffLoader` and click `Browse...` to select the training data `segment-challenge.arff`.

[^data]: On the JupyterHub server, you may find the data under `/data/`.
[^class]: By default, `ClassAssigner` selects the last attribute as the target, but you can change the target attribute by double-clicking `ClassAssigner`.

To load the test data:

1. Add another `ArffLoader` and `ClassAssigner` to the layout and connect the data from the former to the latter. Alternatively, instead of adding the same block/connection multiple times, you can select, copy, and paste existing blocks.[^rename]
2. Add a `TestSetMaker` (instead of a `TrainingSetMaker` under the `Evaluation` folder) and connect the data from the new `ClassAssigner` to it.
3. Configure the new `ArffLoader` to load `segment-test.arff` (instead of `segment-challenge.arff`).

[^rename]: You can rename a block to differentiate it from other blocks in the layout.

### Set up the classifier

1. Add a `J48` (under `Classifiers/trees` folder)[^config1] and a `ClassifierPerformanceEvaluator`[^config2] (under `Evaluation`) to the layout.
2. Connect the `trainingSet` from `TrainingSetMaker` to `J48`.
3. Connect the `testSet` from `TestSetMaker` to `J48`.
4. Connect the `batchClassifier` from `J48` to `ClassifierPerformanceEvaluator`.

[^config1]: You can configure the classifier by double-clicking `J48`.
[^config2]: You can configure the evaluation metrics by double-clicking `ClassifierPerformanceEvaluator`.

### Display the result

1. Add two `TextViewer`s and a `GraphViewer` (under `Visualization` folder) to the layout.[^single]
2. Connect the `text` and `graph` from `J48` to the first `TextViewer` and the `GraphViewer`, respectively.
3. Connect the `text` from the `ClassifierPerformanceEvaluator` to the second `TextViewer`.

[^single]: You can also reuse the same `TextViewer` instead to receive multiple `text`s from different blocks.

::::{exercise}
:label: ex:1
Using the KnowledgeFlow interface, add other classifiers: `IBk`, `ZeroR`, `OneR`, `PART`, and `JRIP`. Record their *fractional* accuracies in the dictionary `performance` as follows:

```python
performance = {'J48': 0.961728,
               'IBk': ___,
               'ZeroR': ___,
               'OneR': ___,
               'PART': ___,
               'JRIP': ___}
```

Use the default parameters.

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
performance

In [None]:
# hidden tests

## Ensemble Methods

Unlike training an individual classifier, ensemble methods
- train an army of base classifiers and
- combine their decisions into a final decision.

We will use the ensemble methods implemented in scikit-learn and Weka. To load the data for scikit-learn:

In [None]:
def load_url(url):
    ftpstream = urllib.request.urlopen(url)
    df = pd.DataFrame(arff.loadarff(io.StringIO(ftpstream.read().decode("utf-8")))[0])
    return df.loc[:, lambda df: ~df.columns.isin(["class"])], df["class"].astype(str)


weka_data_path = (
    "https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/"
)
X_train, Y_train = load_url(weka_data_path + "segment-challenge.arff")
X_test, Y_test = load_url(weka_data_path + "segment-test.arff")
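
As a side note on the ARFF parsing step above, `scipy.io.arff.loadarff` returns nominal attributes as byte strings. The following is a minimal sketch on a made-up two-row ARFF snippet (the relation name, attribute names, and values are illustrative assumptions, not the segment data) showing one way to decode the class column into ordinary strings:

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical toy ARFF content, for illustration only
toy_arff = """@relation toy
@attribute width numeric
@attribute height numeric
@attribute class {yes,no}
@data
1.0,2.0,yes
3.5,4.5,no
"""

data, meta = arff.loadarff(io.StringIO(toy_arff))
toy_df = pd.DataFrame(data)

# Numeric attributes become floats; nominal values arrive as bytes,
# so decode them explicitly to get plain strings
toy_X = toy_df.drop(columns="class")
toy_Y = toy_df["class"].str.decode("utf-8")
print(toy_X)
print(toy_Y.tolist())
```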

To load the data for `python-weka-wrapper`:

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
trainset = loader.load_url(
    weka_data_path + "segment-challenge.arff"
)  # use load_file to load from file instead
trainset.class_is_last()

testset = loader.load_url(weka_data_path + "segment-test.arff")
testset.class_is_last()

### Bagging

Bagging (Bootstrap Aggregation) is a simple ensemble method that trains different base classifiers by *applying a classification algorithm to different bootstrapped datasets.*

For instance, the following uses the [`sklearn.ensemble.BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) to train 10 decision trees with a maximum depth of 5:

```python
from sklearn import ensemble
```

In [None]:
BAG = ensemble.BaggingClassifier(
    estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
)

BAG.fit(X_train, Y_train)
print(f"Accuracy: {BAG.score(X_test, Y_test):.4g}")
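
To make the idea of bootstrapped datasets concrete, here is a minimal hand-rolled sketch of Bagging on synthetic data. The helper names `bootstrap_indices` and `bagging_predict` are hypothetical, and the sketch combines the trees by a hard majority vote, which is a simplification and not necessarily how `BaggingClassifier` aggregates its base classifiers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random WITH replacement."""
    return rng.integers(0, n_samples, size=n_samples)


def bagging_predict(X_tr, y_tr, X_te, n_estimators=10, max_depth=5, seed=0):
    """Train one tree per bootstrapped dataset and combine by majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_estimators):
        idx = bootstrap_indices(len(X_tr), rng)
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X_tr[idx], y_tr[idx])
        votes.append(clf.predict(X_te))
    votes = np.stack(votes)  # shape: (n_estimators, n_test)
    # For each test sample (column), pick the most frequent predicted label;
    # this relies on the labels being non-negative integers
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic binary data standing in for a real dataset
X_toy, y_toy = make_classification(n_samples=200, random_state=0)
y_pred = bagging_predict(X_toy[:150], y_toy[:150], X_toy[150:])
print(f"Accuracy: {np.mean(y_pred == y_toy[150:]):.4g}")
```

Because sampling is done with replacement, each bootstrapped dataset typically repeats some training rows and omits others, which is what makes the trees differ.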

The ensemble method can be parallelized for both training and classification by setting the additional parameter `n_jobs`, the number of jobs to run in parallel. Different jobs run on different CPU cores or threads. See the documentation [here](https://joblib.readthedocs.io/en/latest/parallel.html) for the implementation details.

In [None]:
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    for n_jobs in [1, 2, 4, 8, -1]:
        BAG.set_params(n_estimators=1000, verbose=1, n_jobs=n_jobs)
        BAG.fit(X_train, Y_train)

::::{seealso} Why may the speedup not be proportional to the number of jobs?
:class: dropdown

There are memory and communication overheads required to parallelize the training. Concurrent access to the same memory can lead to unexpected behavior, so it must be prevented by duplicating or locking the data, leading to overheads.

::::

You can check the number of CPU cores available using the following command:

In [None]:
_nproc = !nproc
_nproc
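
The core count can also be queried from Python without a shell command. The following is a small sketch; the two numbers may differ, for instance, when the process is restricted to a subset of the CPUs:

```python
import os

import joblib

# Logical CPUs reported by the operating system (may be None on some platforms)
n_logical = os.cpu_count()
# CPUs that joblib considers usable by the current process
n_usable = joblib.cpu_count()
print(f"logical: {n_logical}, usable by joblib: {n_usable}")
```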

In [None]:
%%ai chatgpt -f text
I have {_nproc} CPU cores available. Explain the best choice of parameters to
parallelize BaggingClassifier of sklearn.

Next, we would like to see the effect of changing the depth and number of estimators. The following are the lists of possible values to explore:

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

We will define a function `bagging(n_estimators, max_depth)` that returns the accuracy of Bagging `n_estimators` decision trees of maximum depth `max_depth`. To avoid re-training/evaluating a classifier, we additionally cache the result using `joblib.Memory`, which has been initialized in the first code cell as follows:

```python
from joblib import Memory
import os

# cache to private folder
os.makedirs("private", exist_ok=True)
memory = Memory(location="private", verbose=0)
```
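
To see what the cache does, consider the following minimal sketch. It uses a temporary directory instead of `private`, and the names `demo_memory`, `slow_square`, and `calls` are hypothetical. The list `calls` records each time the function body actually runs:

```python
import tempfile

from joblib import Memory

calls = []  # records every time the function body is executed

with tempfile.TemporaryDirectory() as cache_dir:
    demo_memory = Memory(location=cache_dir, verbose=0)

    @demo_memory.cache
    def slow_square(x):
        calls.append(x)
        return x * x

    first = slow_square(3)   # computed, then stored on disk
    second = slow_square(3)  # same argument: loaded from the cache
    third = slow_square(4)   # new argument: computed again

print(first, second, third, calls)
```

The second call returns the stored value without executing the body, so `calls` only records the arguments `3` and `4` once each.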

In [None]:
@memory.cache
def bagging(n_estimators, max_depth):
    BAG = ensemble.BaggingClassifier(
        estimator=tree.DecisionTreeClassifier(max_depth=max_depth),
        n_estimators=n_estimators,
        random_state=0,
    )
    BAG.fit(X_train, Y_train)
    return BAG.score(X_test, Y_test)

To run `bagging` in parallel for different choices of parameters, we can use the following tools from `joblib`:

```python
from joblib import Parallel, delayed
```

The following distributes the calls to `bagging` over 4 jobs that run in parallel:

In [None]:
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    BAG_results = Parallel(n_jobs=4, verbose=1)(
        delayed(bagging)(n_estimators, max_depth)
        for max_depth in max_depth_list
        for n_estimators in n_estimators_list
    )
_code = In[-1].rsplit('\n', maxsplit=1)[0]

::::{tip}

- The argument passed to `Parallel(n_jobs=4, verbose=1)(...)` is a generator expression that generates the calls to `bagging` one by one. For more details, see the documentation [here](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html).
- The reason for writing `delayed(bagging)(n_estimators, max_depth)` is to delay the call `bagging(n_estimators, max_depth)` until `Parallel(n_jobs=4, verbose=1)(...)` assigns it to a job/core/thread.

::::
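
The two points in the tip can be checked on a toy task. This is a minimal sketch that uses `math.sqrt` in place of `bagging`:

```python
import math

from joblib import Parallel, delayed

# delayed(f)(*args) does not call f; it only packages the function
# together with its positional and keyword arguments
task = delayed(math.sqrt)(16)
print(task)

# Parallel consumes the generator of such tasks and runs them in 2 jobs;
# the results come back in the order in which the tasks were generated
roots = Parallel(n_jobs=2)(delayed(math.sqrt)(x) for x in [1, 4, 9, 16])
print(roots)
```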

In [None]:
%%ai chatgpt -f text
Explain how generator and decorator are used in the code:
--
{_code}

To present the result nicely in a `DataFrame`:

In [None]:
BAG_df = tabulate(BAG_results)
display.display(BAG_df)
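
The helper `tabulate` relies on reshaping the flat list of results in column-major order. The sketch below uses short stand-in lists (`depths` and `n_ests` are hypothetical names) and strings instead of accuracies to show why `order="F"` matches the loop nesting used to generate the results:

```python
import numpy as np

depths = [1, 2]        # stands in for max_depth_list
n_ests = [10, 20, 30]  # stands in for n_estimators_list

# Same loop nesting as the Parallel call: max_depth outer, n_estimators inner
flat = [f"d={d},n={n}" for d in depths for n in n_ests]

# Column-major ("F") filling puts each max_depth into its own column
# and each n_estimators into its own row
grid = np.reshape(flat, (len(n_ests), len(depths)), order="F")
print(grid)
```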

If `bagging` is called again with the same combinations of arguments, e.g., by rerunning the cells above, the cached results are returned without re-training the classifiers. You can clear the cache with the following:

```python
bagging.clear()
```

It is helpful to save the `DataFrame` to a particular file. This can be done using [`joblib.dump`](https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html):

```python
from joblib import dump
```

In [None]:
if input("(over-)write file? [Y/n] ").lower() != "n":
    dump(BAG_df, "BAG_df.gz")

Unlike the cached results, the dumped file can be loaded anywhere, including outside this notebook:

```python
from joblib import load
```

In [None]:
BAG_df = load("BAG_df.gz")

To remove the file:

```python
os.remove("BAG_df.gz")
```
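
The dump/load round trip can be tried safely on a throwaway file. The following sketch writes a small made-up `DataFrame` (`toy_df` is a hypothetical name) to a temporary directory and reads it back:

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load

toy_df = pd.DataFrame({"n_estimators": [1, 2], "max_depth=1": [0.5, 0.75]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_df.gz")
    dump(toy_df, path)     # the .gz extension selects gzip compression
    restored = load(path)  # works from any Python session with joblib

print(restored.equals(toy_df))
```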

To plot the `DataFrame`:

In [None]:
plt.figure(num=1, clear=True)
plot(BAG_df)
plt.title(r"Bagging decision trees")
plt.show()

::::{exercise}
:label: ex:2
What happens to the accuracy as `n_estimators` and `max_depth` increase?
::::

YOUR ANSWER HERE

To apply the ensemble method using `python-weka-wrapper` instead of scikit-learn:

In [None]:
REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]
BAG_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.Bagging")
BAG_weka.options = ["-I", "10", "-S", "1"]
BAG_weka.classifier = REPTree
BAG_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(BAG_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

The base classifiers for [`Bagging`](https://weka.sourceforge.io/doc.dev/weka/classifiers/meta/Bagging.html) are trained using [`REPTree`](https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/REPTree.html), which is a fast decision tree induction algorithm that is neither C4.5 nor CART.

::::{exercise}
:label: ex:3
Complete the pandas `DataFrame` `BAG_weka_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`.
::::

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    BAG_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    BAG_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError

    display.display(BAG_weka_df.round(4))

    plt.figure()
    plot(BAG_weka_df)
    plt.title(r"Bagging decision trees")
    plt.show()

    dump(BAG_weka_df, "BAG_weka_df.gz")

::::{caution}

To avoid re-training, the last line above saves your result to a file called `BAG_weka_df.gz`. Otherwise, the server cannot auto-grade your submission as it aborts execution that takes excessive time or memory. Ensure that your code is indented correctly, so it is part of the body of the conditional in the solution cell:
```python
...
if input('execute? [Y/n] ').lower() != 'n':
    ...
```

::::

In [None]:
# hidden tests

### Random Forest

Another ensemble method, called [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), is similar to Bagging decision trees. However, it randomly selects or combines features to further diversify the base classifiers before building each tree. The following trains a random forest of 10 decision trees with a maximum depth of 5.

In [None]:
RF = ensemble.RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
RF.fit(X_train, Y_train)
print(f"Accuracy: {RF.score(X_test, Y_test):.4g}")

Like Bagging, we can also parallelize the training and classification by setting the `n_jobs` parameter.
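
In scikit-learn, the random selection of features described above is controlled by the parameter `max_features` of `RandomForestClassifier`. The following sketch on synthetic data (the names `X_toy`, `y_toy`, `rf_subset`, and `rf_all` are hypothetical) inspects how many features each tree considers at a split under two settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, standing in for a real dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=16, random_state=0)

# max_features="sqrt": each split considers a random subset of the features
rf_subset = RandomForestClassifier(
    n_estimators=10, max_depth=5, max_features="sqrt", random_state=0
).fit(X_toy, y_toy)

# max_features=None: each split considers all features, so the trees are
# diversified by bootstrapping alone
rf_all = RandomForestClassifier(
    n_estimators=10, max_depth=5, max_features=None, random_state=0
).fit(X_toy, y_toy)

print(rf_subset.estimators_[0].max_features_)
print(rf_all.estimators_[0].max_features_)
```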

::::{exercise}
:label: ex:4
Complete the pandas `DataFrame` `RF_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`, and with `random_state = 0`.
::::

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError
    display.display(RF_df.round(4))

    plt.figure()
    plot(RF_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_df, "RF_df.gz")

In [None]:
# hidden tests

To train a random forest of 10 decision trees with a maximum depth of 5 using `python-weka-wrapper`:

In [None]:
RF_weka = Classifier(classname="weka.classifiers.trees.RandomForest")
RF_weka.options = ["-I", "10", "-depth", "5", "-S", "1"]
RF_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(RF_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

::::{exercise}
:label: ex:5
Repeat the previous exercise but with [Weka](https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/RandomForest.html) instead. Use a random seed of 1.
::::

In [None]:
max_depth_list = [1, 2, 3, 5, 10, 20]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]

if input("execute? [Y/n] ").lower() != "n":
    RF_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    RF_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError
    display.display(RF_weka_df.round(4))

    plt.figure()
    plot(RF_weka_df)
    plt.title(r"Random forest")
    plt.show()

    dump(RF_weka_df, "RF_weka_df.gz")

In [None]:
# hidden tests

::::{exercise}
:label: ex:6
What are the values of `n_estimators` and `max_depth` that maximize the above accuracy on the test set?
::::

YOUR ANSWER HERE

For `sklearn`, we can tune the parameters of a classification algorithm using `GridSearchCV` imported as follows:

```python
from sklearn.model_selection import GridSearchCV
```

For instance, to tune `n_estimators` and `max_depth` by searching `n_estimators_list` and `max_depth_list` for the combination that maximizes the cross-validated accuracy on the training set:

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10, 20]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100, 200, 300]
    param_grid = {"n_estimators": n_estimators_list, "max_depth": max_depth_list}

    grid_search = GridSearchCV(
        ensemble.RandomForestClassifier(random_state=0), param_grid, verbose=1, n_jobs=4
    )

    grid_search.fit(X_train, Y_train)

    print(f"Accuracy: {grid_search.score(X_test, Y_test):.4g}")
    print(f"Best parameters: {grid_search.best_params_}")
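
The fitted search object exposes more than the test score. The following quick-running sketch uses the iris data and a deliberately tiny grid as stand-ins (both are assumptions chosen for speed, not part of this tutorial's setup) to show where the cross-validated accuracy and the per-combination results can be found:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0
)

toy_grid = {"n_estimators": [5, 20], "max_depth": [1, 3]}
toy_search = GridSearchCV(RandomForestClassifier(random_state=0), toy_grid, cv=5)
toy_search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy of the best combination,
# computed on the training data only
print(f"Best parameters: {toy_search.best_params_}")
print(f"Cross-validated accuracy: {toy_search.best_score_:.4g}")
# score() evaluates the refitted best model on data not used in the search
print(f"Held-out accuracy: {toy_search.score(X_te, y_te):.4g}")
# cv_results_ holds one entry per parameter combination
print(len(toy_search.cv_results_["params"]))
```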

::::{exercise}
:label: ex:7
Observe that the best parameter values that maximize the cross-validated accuracy need not maximize the accuracy on the test set. Explain whether it is better to use the maximum accuracy on the test set as the performance estimate of random forest with parameter tuning.
::::

YOUR ANSWER HERE

### AdaBoost

Using [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), we can boost the performance by adding base classifiers one by one, each trained to correct the errors made by the previously trained classifiers. To train AdaBoost with 10 decision trees of a maximum depth of 5:

In [None]:
ADB = ensemble.AdaBoostClassifier(
    estimator=tree.DecisionTreeClassifier(max_depth=5),
    n_estimators=10,
    random_state=0,
)
ADB.fit(X_train, Y_train)
print(f"Accuracy: {ADB.score(X_test, Y_test):.4g}")

::::{note}

The original AdaBoost only works for binary classification. `AdaBoostClassifier` implements the multi-class extension of AdaBoost called AdaBoost-SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function). See Algorithm 2 of [(Zhu et al., 2009)](https://dx.doi.org/10.4310/SII.2009.v2.n3.a8).

::::
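
A key difference in SAMME is the extra term $\log(K-1)$ in the weight given to each base classifier, where $K$ is the number of classes. The following sketch computes that weight from the weighted error, ignoring any learning rate; the function name `samme_alpha` is hypothetical and the numbers are purely illustrative:

```python
import math


def samme_alpha(err, n_classes):
    """Weight of a base classifier in SAMME, given its weighted error rate
    err (0 < err < 1) and the number of classes n_classes (K >= 2)."""
    return math.log((1 - err) / err) + math.log(n_classes - 1)


# With K = 2 the extra term vanishes, recovering the binary AdaBoost weight
print(samme_alpha(0.3, 2), math.log(0.7 / 0.3))

# With K = 7, a classifier with 50% error still gets a positive weight,
# because random guessing would have an error of 6/7
print(samme_alpha(0.5, 7))
```

The weight is positive whenever the error is below $1 - 1/K$, i.e., whenever the base classifier is better than random guessing among $K$ classes.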

In [None]:
%%ai chatgpt -f text
Explain how AdaBoost-SAMME extends AdaBoost to multi-class classification.

::::{exercise}
:label: ex:8
Unlike Bagging and random forest, the training of the base classifiers for AdaBoost cannot be parallelized. Why?
::::

YOUR ANSWER HERE

::::{exercise}
:label: ex:9
Complete the pandas `DataFrame` `ADB_df` in the following cell by filling in the accuracies (as floating point numbers) for different `n_estimators` and `max_depth`. Use `random_state = 0`.
::::

In [None]:
max_depth_list = [1, 2, 3, 5, 10]
n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]

if input("execute? [Y/n] ").lower() != "n":
    ADB_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError
    display.display(ADB_df.round(4))

    plt.figure()
    plot(ADB_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_df, "ADB_df.gz")

In [None]:
# hidden tests

To train [AdaBoost](https://weka.sourceforge.io/doc.dev/weka/classifiers/meta/AdaBoostM1.html) with 10 decision trees of maximum depth 5 using `python-weka-wrapper`:

In [None]:
REPTree = Classifier(classname="weka.classifiers.trees.REPTree")
REPTree.options = ["-L", "5"]
ADB_weka = SingleClassifierEnhancer(classname="weka.classifiers.meta.AdaBoostM1")
ADB_weka.options = ["-I", "10", "-S", "1"]
ADB_weka.classifier = REPTree
ADB_weka.build_classifier(trainset)
evl = Evaluation(testset)
evl.test_model(ADB_weka, testset)
print(f"Accuracy: {evl.percent_correct/100:.4g}")

Weka uses the multi-class extension called AdaBoostM1, which is different from AdaBoost-SAMME.

::::{exercise}
:label: ex:10
Repeat the previous exercise but with Weka instead. Use a random seed of 1.
::::

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    max_depth_list = [1, 2, 3, 5, 10]
    n_estimators_list = [1, 2, 3, 5, 10, 20, 30, 50, 100]
    ADB_weka_df = pd.DataFrame(
        columns=[f"max_depth={max_depth}" for max_depth in max_depth_list], dtype=float
    )
    ADB_weka_df.insert(0, "n_estimators", n_estimators_list)
    # YOUR CODE HERE
    raise NotImplementedError
    display.display(ADB_weka_df.round(4))

    plt.figure()
    plot(ADB_weka_df)
    plt.title(r"Adaboost decision trees")
    plt.show()

    dump(ADB_weka_df, "ADB_weka_df.gz")

In [None]:
# hidden tests

::::{exercise}
:label: ex:11
Which ensemble method is better, AdaBoost or random forest? Why?
::::

YOUR ANSWER HERE

### Other methods

::::{exercise} optional[^optional]
:label: ex:12

Train your own classifier to achieve the highest possible accuracy. You may:
- choose different classification algorithms or ensemble methods such as Bagging, Stacking, Voting, and XGBoost.
- tune the hyper-parameters manually or automatically using `GridSearchCV` in scikit-learn or `CVParameterSelection` in Weka.

Post your model and results on [Canvas](https://canvas.cityu.edu.hk/courses/62414/discussion_topics/539586) to compete with others. To include your code in this notebook, make sure you avoid excessive time or memory by putting your code in the body of the conditional `if input('execute? [Y/n] ').lower() != 'n':`.

::::

[^optional]: This exercise is optional and does not count towards the grade.

The following is an example using XGBoost with its default parameters.

In [None]:
if input("execute? [Y/n] ").lower() != "n":
    # NOTE: Restart the kernel after installing xgboost
    %pip install xgboost
    import xgboost
    codes, uniques = pd.concat([Y_train, Y_test]).factorize()
    Y_train_codes, Y_test_codes = codes[:len(Y_train)], codes[len(Y_train):]
    XGB = xgboost.XGBClassifier(n_jobs=1)
    XGB.fit(X_train, Y_train_codes)
    print(f"Accuracy: {XGB.score(X_test, Y_test_codes):.4g}")