---
title: Evaluation with scikit-learn
---

In [None]:
import os
from functools import lru_cache

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython import display
from ipywidgets import IntSlider, interact
from sklearn import datasets, tree
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    cross_val_score,
    train_test_split,
)

%matplotlib widget
if not os.getenv(
    "NBGRADER_EXECUTION"
):
    %load_ext jupyter_ai
    %ai update chatgpt dive:chat
    # %ai update chatgpt dive-azure:gpt4o

In this notebook, you will learn to use the popular Python package scikit-learn to evaluate a classifier. Compared with the Weka GUI, programming gives greater flexibility in controlling the learning process.

::::{tip} How to Use LLM for Programming

You can instruct the LLM to:

- Write programs to achieve specific goals.
- Explain error messages and suggest fixes.

The reply quality depends on prompt clarity and provided information:

- For Jupyternaut, learn [how to include additional information](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#asking-about-something-in-your-notebook) in your prompt.
- For GitHub Copilot, include multiple files using [chat variables](https://docs.github.com/en/copilot/using-github-copilot/asking-github-copilot-questions-in-your-ide#chat-variables).

::::

## Data Preparation

### About the dataset

We will use a popular dataset called the [*iris dataset*](https://en.wikipedia.org/wiki/Iris_flower_data_set). It contains measurements of iris flowers from the three species shown below.

::::{grid} 1 1 3 3

:::{figure} https://upload.wikimedia.org/wikipedia/commons/1/11/Iris_setosa_2.jpg
:name: fig:iris-setosa
:alt: Iris setosa
:width: 300
:align: left


Iris setosa
:::

:::{figure} https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg
:name: fig:iris-versicolor
:alt: Iris versicolor
:width: 300
:align: left

Iris versicolor
:::

:::{figure} https://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg
:name: fig:iris-virginica
:alt: Iris virginica
:width: 300
:align: left

Iris virginica
:::

::::

The three iris species differ in the lengths and widths of their *petals* and *sepals*.

::::{figure} https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg
:name: fig:petal-sepal
:alt: petal and sepal
:align: left
:width: 300

Petal and sepal of a flower
::::

A standard data mining task is to train a model that can classify the species (*target*) automatically based on the lengths and widths of the petals and sepals (*input features*).

### Load dataset from scikit-learn

**How to load the iris dataset?**

To load the iris dataset, we can simply import the [`sklearn.datasets` package](https://scikit-learn.org/stable/datasets/index.html).

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
type(iris)  # object type

`sklearn` stores the dataset as a [`Bunch` object](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html), which is essentially [a bunch of properties](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) put together.
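
As a small illustration of what "a bunch of properties" means, the sketch below checks that a `Bunch` behaves like a dictionary whose keys can also be read as attributes. The variable names `bunch` and `same_object` are only for illustration.

```python
from sklearn import datasets

bunch = datasets.load_iris()

# A Bunch is a dictionary subclass, so its keys can be listed like a dict's ...
print(sorted(bunch.keys()))

# ... and each key can be read either as an attribute or as a dictionary item.
same_object = bunch.data is bunch["data"]
print(same_object)
```

Attribute access is usually more convenient in a notebook, while dictionary access is handy when the property name is stored in a variable.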

**How to learn more about a library?**

Detailed documentation can be found at [https://scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). You can also learn about the library in the notebook:

In [None]:
# Display the docstring
datasets.load_iris?

::::{tip}

In a notebook, we can get the docstring (document string) of a function conveniently using the symbol ? such as:

```python
datasets.load_iris?
```

or `?datasets.load_iris`, which is the same as running `help(datasets.load_iris)`. You can also display the source code using double question mark `??` such as `datasets.load_iris??`.

We can also use the contextual help by placing the cursor over a function name and

- click the menu item `Help`->`Show Contextual Help` or
- press the short-cut key <kbd>Shift + Tab</kbd>.

::::

**How to learn more about the dataset?**

The property `DESCR` (description) is a string that contains some background information about the dataset:

In [None]:
print(iris.DESCR)

All the properties of an object can be listed using the built-in function `dir` (directory):

In [None]:
dir(iris)

**How to show the data?**

The properties `data` and `target` contain the data values.

In [None]:
type(iris.data), type(iris.target)

The data are stored as `numpy` arrays, a powerful data type optimized for performance. It provides useful properties and methods to describe and process the data:

In [None]:
iris.data.shape, iris.data.ndim, iris.data.dtype

::::{note} What does the iris dataset look like?

`iris.data` is a 150-by-4 2-dimensional array of 64-bit floating-point numbers.
- 150 corresponds to the number of instances, while
- 4 corresponds to the number of input attributes.

::::

To show the input feature names:

In [None]:
iris.feature_names

To show the means and standard deviations of the input features:

In [None]:
iris.data.mean(axis=0), iris.data.std(axis=0)
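
The `axis` argument decides which dimension is collapsed. The toy array below (made up for illustration, not taken from the iris data) shows the difference between `axis=0` and `axis=1`. It also shows that `numpy` computes the population standard deviation by default, whereas `ddof=1` gives the sample standard deviation, which is the default convention in `pandas`.

```python
import numpy as np

toy = np.array([[1.0, 10.0],
                [3.0, 30.0]])  # 2 instances (rows) by 2 features (columns)

col_means = toy.mean(axis=0)  # collapse the rows: one mean per feature
row_means = toy.mean(axis=1)  # collapse the columns: one mean per instance
pop_std = toy.std(axis=0)  # divides by n (ddof=0, the numpy default)
sample_std = toy.std(axis=0, ddof=1)  # divides by n - 1

print(col_means, row_means)
print(pop_std, sample_std)
```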

All the public properties/methods of a `numpy` array are printed below using a [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions):

In [None]:
import numpy as np  # import numpy and rename it as np

print([attr for attr in dir(np.ndarray) if attr[0] != "_"])
# private attributes starting with underscore are not printed

**What is the target feature?**

The target variable of the iris dataset is the flower type, whose names are stored by the following property:

In [None]:
iris.target_names

`iris.target` is an array of integer indices from `{0, 1, 2}` for the three classes.

In [None]:
iris.target

::::{exercise}
:label: ex:data-info

Complete the following cell with a tuple of the following properties for the target (instead of input features) of the iris dataset:
- shape, 
- number of dimensions, and 
- the data types of the values.

:::{tip}

Your solution should look like the following, which assigns a tuple of values to a tuple of names:
```python
shape, ndim, dtype = iris.___.___, iris.___.___, iris.___.___
```
:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
shape, ndim, dtype

In [None]:
# tests
assert (
    isinstance(shape, tuple) and isinstance(ndim, int) and isinstance(dtype, np.dtype)
)

In [None]:
# hidden tests

::::{exercise}
:label: ex:min-max

Fill in the following cell with a tuple of
- the list of minimum values of the input features, and
- the list of maximum values of the input features.

:::{tip}

Your answer may look like:
```python
feature_min, feature_max = iris.___.___(axis=0), iris.___.___(axis=0)
```
:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
feature_min, feature_max

In [None]:
# tests
assert feature_min.shape == (4,) == feature_max.shape

In [None]:
# hidden tests

### Create pandas DataFrame

The package [`pandas`](https://pandas.pydata.org/docs/user_guide/index.html) provides additional tools to display and process a dataset. 
First, we translate the `Bunch` object into a `pandas` [`DataFrame` object](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame).

In [None]:
import pandas as pd

# write the input features first
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# append the target values to the last column
iris_df["target"] = iris.target

iris_df  # to display the DataFrame

In a Jupyter notebook, a `DataFrame` object is conveniently displayed as an HTML table. We can control how much information to show by setting the [display options](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).

We can also display the statistics of different numerical attributes using the methods `describe` and `boxplot`.

In [None]:
iris_df.describe()

In [None]:
plt.figure(1, clear=True)
iris_df.boxplot()
plt.show()

**How to handle nominal class attribute?**

Note that the boxplot also covers the target attribute, but it should not. (Why?) Let's take a look at the current datatypes of the different attributes.

In [None]:
print(iris_df.dtypes)

The target is regarded as a numeric attribute with type integer `int64`. Instead, the target should be categorical with only three possible values, one for each iris species.

To fix this, we can use the `astype` method to convert the data type automatically.[^categorical] 

[^categorical]: See the details about the categorical data type [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#series-creation.section).

In [None]:
iris_df.target = iris_df.target.astype("category")
iris_df.dtypes

In [None]:
plt.figure(num=2, clear=True)
iris_df.select_dtypes(exclude=['category']).boxplot()
plt.show()

We can also rename the target categories `{0, 1, 2}` to the more meaningful names of the iris species in `iris.target_names`.[^renaming-categories]

[^renaming-categories]: See the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#renaming-categories).

In [None]:
iris_df.target = iris_df.target.cat.rename_categories(iris.target_names)
iris_df  # check that the target values are now setosa, versicolor, or virginica.
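
To see what the conversion does internally, here is a minimal sketch on a made-up `Series` (the name `codes` and the labels are illustrative). A categorical column stores a small table of categories together with integer codes pointing into it, and `rename_categories` replaces the categories by position while leaving the codes unchanged.

```python
import pandas as pd

codes = pd.Series([0, 1, 2, 1]).astype("category")
print(codes.cat.categories.tolist())  # the distinct values: 0, 1, 2

named = codes.cat.rename_categories(["a", "b", "c"])  # 0 -> a, 1 -> b, 2 -> c
print(named.tolist())
print(named.cat.codes.tolist())  # the underlying integer codes are unchanged
```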

::::{exercise}
:label: ex:counts

For nominal attributes, a more meaningful statistic than the mean is the counts of different possible values. To count the number of instances for each flower class, assign `target_counts` to the output of the `value_counts` method of an appropriate column of `iris_df`.

:::{tip}

Your solution should look like:
```python
target_counts = iris_df.target.___()
```

:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
target_counts

In [None]:
# tests
assert target_counts.shape == (3,)

In [None]:
# hidden tests

**How to select rows and columns?**

The following uses [`ipywidgets`](https://ipywidgets.readthedocs.io/en/latest/) to demonstrate different ways of selecting (slicing) the rows of a `DataFrame`:[^decorator]

[^decorator]: `@interact(...)` is a decorator that adds interactive widgets to the following function `def select_rows`. See [decorator in pydoc](https://docs.python.org/3/glossary.html#term-decorator).

In [None]:
from ipywidgets import interact


@interact(
    command=[
        "iris_df.head()",
        "iris_df[0:4]",
        "iris_df.iloc[0:4]",
        "iris_df.loc[0:4]",
        "iris_df.loc[iris_df.index.isin(range(0,4))]",
        "iris_df.loc[lambda df: df.target=='setosa']",
        "iris_df.tail()",
        "iris_df[-1:]",
    ]
)
def select_rows(command):
    output = eval(command)
    display.display(output)
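
One subtlety in the options above is worth spelling out: position-based slicing excludes the end point, whereas label-based slicing with `loc` includes it. The sketch below uses a made-up `DataFrame` named `toy_df` with the default integer index.

```python
import pandas as pd

toy_df = pd.DataFrame({"x": range(10)})  # default index 0, 1, ..., 9

by_position = toy_df.iloc[0:4]  # positions 0, 1, 2, 3 (end excluded)
by_label = toy_df.loc[0:4]  # labels 0, 1, 2, 3, 4 (end included)
plain_slice = toy_df[0:4]  # slicing with [] selects rows by position

print(len(by_position), len(by_label), len(plain_slice))
```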

The following demonstrates different ways to slice columns:

In [None]:
@interact(
    command=[
        "iris_df.target",
        'iris_df["target"]',
        'iris_df[["target"]]',
        "iris_df[iris_df.columns[:-1]]",
        "iris_df.loc[:,iris_df.columns[0]:iris_df.columns[-1]]",
        'iris_df.loc[:,~iris_df.columns.isin(["target"])]',
        "iris_df.iloc[:,:-1]",
    ]
)
def select_columns(command):
    output = eval(command)
    display.display(output)

For instance, to compute the mean values of the input features for iris setosa:

In [None]:
iris_df

In [None]:
iris_df.iloc[lambda df: (df["target"] == "setosa").to_numpy(), :-1].mean()

We can also use the method `groupby` to obtain the mean values by flower types:

In [None]:
iris_df.groupby("target", observed=False).mean()

::::{exercise}
:label: ex:drop

Using the method `drop`, create a new `DataFrame` `iris2d_df` by dropping the first two features `sepal length (cm)` and `sepal width (cm)` from `iris_df`.
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
iris2d_df

In [None]:
# tests
assert set(iris2d_df.columns) == {"petal length (cm)", "petal width (cm)", "target"}

In [None]:
# hidden tests

### Alternative methods of loading a dataset

The following code loads the iris dataset from an [ARFF file](https://waikato.github.io/weka-wiki/formats_and_processing/arff/) instead.

In [None]:
import io
import urllib.request

from scipy.io import arff

ftpstream = urllib.request.urlopen(
    "https://raw.githubusercontent.com/Waikato/weka-3.8/master/wekadocs/data/iris.arff"
)
iris_arff = arff.loadarff(io.StringIO(ftpstream.read().decode("utf-8")))
iris_df2 = pd.DataFrame(iris_arff[0])
iris_df2["class"] = iris_df2["class"].astype("category")
iris_df2
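
The same loader also works on an in-memory file, which is handy for experimenting without a network connection. The sketch below parses a tiny made-up ARFF snippet (not the real iris file). It assumes that nominal values may be returned as byte strings, so it decodes them, if needed, before converting the column to a categorical one.

```python
import io

import pandas as pd
from scipy.io import arff

toy_arff = """@relation toy
@attribute length numeric
@attribute class {a,b}
@data
1.5,a
2.5,b
"""

data, meta = arff.loadarff(io.StringIO(toy_arff))
toy_df = pd.DataFrame(data)

# Decode byte strings (if any) to ordinary strings before making categories.
if isinstance(toy_df["class"].iloc[0], bytes):
    toy_df["class"] = toy_df["class"].str.decode("utf-8")
toy_df["class"] = toy_df["class"].astype("category")

print(meta.names())
print(toy_df)
```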

Pandas also provides a method to read the iris dataset directly from a CSV file stored locally or on the internet, such as in the [UCI repository](https://archive.ics.uci.edu/ml/datasets/iris).

In [None]:
iris_df3 = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    sep=",",
    dtype={"target": "category"},
    header=None,
    names=iris.feature_names + ["target"],
)
iris_df3

The additional arguments `dtype`, `header`, and `names` allow us to specify the attribute datatypes and names.
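
To see the role of each argument without downloading anything, the sketch below reads two made-up rows from an in-memory string. The column names `a`, `b`, and `label` are placeholders.

```python
import io

import pandas as pd

raw = "5.1,3.5,x\n4.9,3.0,y\n"  # no header row, like the UCI file

toy_df = pd.read_csv(
    io.StringIO(raw),
    header=None,  # the first line is data, not column names
    names=["a", "b", "label"],  # so we supply the column names ourselves
    dtype={"label": "category"},  # and read the last column as categorical
)
print(toy_df.dtypes)
```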

## Training and Testing

To obtain unbiased performance estimates of a learning algorithm, the fundamental principle is *to use separate datasets for training and testing*. If there is only one dataset, we can split it into *training sets* and *test sets* by *random sampling*. In the following subsections, we will illustrate some methods of splitting the datasets for training and testing.

### Stratified holdout method

The (stratified) holdout method randomly samples data for training or testing without replacement. It is implemented by the `train_test_split` function from the `sklearn.model_selection` package.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    iris_df[iris.feature_names],  # We also separated the input features
    iris_df.target,  # and target as X and Y for the training and test sets.
    test_size=0.2,  # fraction for test set
    random_state=1,  # random seed
    stratify=iris_df.target,  # preserve the class proportions in both sets
)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

The fraction of holdout test data is:

In [None]:
len(Y_test) / (len(Y_test) + len(Y_train))

The class proportion of the iris dataset can be plotted as follows:

In [None]:
plt.figure(num=3, clear=True)
iris_df.target.value_counts().plot(kind="bar", ylabel="counts")
plt.show()
_code = In[-1].rsplit("\n", maxsplit=1)[0]  # store the code above (without this line) for chatting with LLM

::::{tip}

The last line of code captures the cell content using the [`In` variable](https://ipython.org/ipython-doc/3/interactive/reference.html#input-caching-system) and assigns it to `_code`, which is then [interpolated in the subsequent prompt to the LLM](https://jupyter-ai.readthedocs.io/en/latest/users/index.html#interpolating-in-prompts).

::::

In [None]:
%%ai chatgpt -f text
The labels on the x-axis are too long and get clipped. How can I fix this?
--
{_code}

We can check the effect of stratification on the class proportions:

In [None]:
@interact(stratify=False, data=["Y_train", "Y_test"], seed=(0, 10))
def plot_class_proportions(stratify, data, seed=0):
    Y_train, Y_test = train_test_split(
        iris_df.target,
        test_size=0.2,
        random_state=seed,
        stratify=iris_df.target if stratify else None,
    )
    plt.figure(num=4, clear=True)
    eval(data).value_counts().sort_index().plot(kind="bar", ylabel="counts")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
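
The same comparison can be made numerically. The sketch below (a self-contained check using its own variable names) counts the classes in the test set with and without stratification. With stratification, each of the three equally frequent classes should contribute exactly 10 of the 30 test instances.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

y = datasets.load_iris().target  # 50 instances for each of the 3 classes

for stratify in (None, y):
    _, y_test = train_test_split(y, test_size=0.2, random_state=0, stratify=stratify)
    label = "stratified" if stratify is not None else "not stratified"
    print(label, np.bincount(y_test, minlength=3))

# counts from the last (stratified) split
stratified_counts = np.bincount(y_test, minlength=3)
```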

We first apply a learning algorithm, say the decision tree induction algorithm, to train a classifier using only the training set.

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier(random_state=0)  # the training is also randomized
clf.fit(X_train, Y_train)  # fit the model to the training set

We can use the `predict` method of the trained classifier to predict the flower type from input features.

In [None]:
Y_pred = clf.predict(X_test)
Y_pred

The following code returns the accuracy of the classifier, namely, the fraction of correct predictions on the test set.

In [None]:
accuracy_holdout = (Y_pred == Y_test).mean()
accuracy_holdout

The `score` method performs essentially the same computation. The following code uses an f-string to format the accuracy to 3 decimal places.

In [None]:
print(f"Accuracy: {clf.score(X_test, Y_test):0.3f}")
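
For reference, `sklearn.metrics` offers the same computation as a standalone function, together with a confusion matrix that shows where the errors occur. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 2, 2])  # one class-1 instance is predicted as class 2

manual = (y_true == y_pred).mean()  # fraction of correct predictions
by_sklearn = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: predicted classes

print(manual, by_sklearn)
print(cm)
```

The off-diagonal entries of the confusion matrix count the misclassified instances, so the accuracy equals the sum of the diagonal divided by the total count.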

To see the input features of the misclassified test instances:

In [None]:
X_test[Y_pred != Y_test]

::::{exercise}
:label: ex:accuracy

Assign `accuracy_holdout_training_set` to the accuracy of the predictions on the training set. Check that the accuracy is overly optimistic.
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
accuracy_holdout_training_set

In [None]:
# hidden tests

::::{exercise}
:label: ex:classify

Complete the following function which applies random subsampling to reduce the variance of the accuracy estimate. In particular, the function `subsampling_score` should return the average of `N` accuracies of $20\%$ stratified holdout with random seed set from `0` up to `N-1`, where `N` is the integer input argument of the function.
::::

In [None]:
from functools import lru_cache

import numpy as np


@lru_cache(None)  # cache the return value to avoid repeated computation
def holdout_score(seed):
    clf = tree.DecisionTreeClassifier(random_state=seed)
    X_train, X_test, Y_train, Y_test = train_test_split(
        iris_df[iris.feature_names], iris_df.target, test_size=0.2, random_state=seed
    )
    # YOUR CODE HERE
    raise NotImplementedError


@lru_cache(None)
def subsampling_score(N):
    return sum(holdout_score(i) for i in range(N)) / N

In [None]:
# tests
assert np.isclose(subsampling_score(50), 0.9466666666666663)

In [None]:
# hidden tests

After completing `holdout_score`, the following code should plot the mean accuracies returned by `subsampling_score` for different `N`. Check that the variation becomes smaller as `N` increases.

In [None]:
plt.figure(num=5, clear=True)
plt.stem(range(1, 50), [subsampling_score(i) for i in range(1, 50)])  # x-axis starts at N = 1
plt.xlabel(r"$N$")
plt.ylabel(r"Mean accuracy")
plt.show()
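
The `lru_cache` decorator used in the exercise remembers the return value for each argument, so evaluating `subsampling_score` for many values of `N` does not retrain a classifier for a seed that has already been scored. A minimal sketch with a made-up function `slow_square`:

```python
from functools import lru_cache

calls = []  # records every time the function body actually runs


@lru_cache(maxsize=None)  # unbounded cache, same as lru_cache(None)
def slow_square(x):
    calls.append(x)
    return x * x


results = [slow_square(3), slow_square(3), slow_square(4)]
print(results, calls)  # the body runs only once per distinct argument
print(slow_square.cache_info())
```

Caching is safe here only because the function is deterministic for a given argument, which is why the seed is passed in explicitly.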

### Stratified cross-validation

A popular method of evaluating a classification algorithm is to randomly partition the data into $k$ *folds*, which are nearly equal-sized blocks of instances. The score is the average of the accuracies obtained by using each fold to test a classifier trained using the remaining folds.

The module `sklearn.model_selection` provides two functions `cross_val_predict` and `cross_val_score` for this purpose.

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)

For instance, the following returns the misclassified instances by 5-fold cross-validation.

In [None]:
iris_df["prediction"] = pd.Categorical(
    cross_val_predict(clf, iris_df[iris.feature_names], iris_df.target, cv=cv)
)
iris_df.loc[lambda df: df["target"] != df["prediction"]]

In [None]:
clf = tree.DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, iris_df[iris.feature_names], iris_df.target, cv=5)
print("Accuracies: ", ", ".join(f"{acc:.4f}" for acc in scores))
print(f"Mean accuracy: {scores.mean():.4f}")
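
To make explicit what `cross_val_score` does, the sketch below repeats the computation with an explicit loop over the folds of a `StratifiedKFold` object. It is self-contained and uses its own variable names; `clone` creates a fresh, unfitted copy of the classifier for each fold. Since the splitter and the classifier are both seeded, the two results are expected to agree.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
model = tree.DecisionTreeClassifier(random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])  # train on 4 folds
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))  # test on 1 fold

library_scores = cross_val_score(model, X, y, cv=folds)
print(np.round(manual_scores, 4), np.round(library_scores, 4))
```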

::::{exercise}
:label: ex:cross_val
Assign `accuracy_cv` to the mean accuracy obtained from the cross-validation result above.
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
accuracy_cv

In [None]:
# hidden tests

::::{seealso} Any other methods to split data?

Another method is the bootstrap method, which is similar to the holdout method but samples the training set with replacement. An early version of scikit-learn provided an [implementation](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.cross_validation.Bootstrap.html). Unlike the original bootstrap method, that implementation also samples the test set with replacement instead of taking it directly from the out-of-bag instances not used for training.

::::
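
A minimal sketch of the original bootstrap split with `numpy`, under the assumption that the training set is drawn with replacement and the test set consists of the out-of-bag instances. The names `n`, `train_idx`, and `test_idx` are illustrative, and this is not the scikit-learn implementation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150  # number of instances, e.g. the size of the iris dataset

train_idx = rng.integers(0, n, size=n)  # sample n indices with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)  # out-of-bag: never drawn for training

# On average about 1 - (1 - 1/n)^n, roughly 63%, of the instances are drawn at least once.
print(len(np.unique(train_idx)), len(test_idx))
```

Because the training indices contain duplicates, the training set has the same size as the original dataset while still leaving some instances unseen for testing.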