# Different Classifiers with scikit-learn

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# produce vector inline graphics
from IPython.display import Code, display
from ipywidgets import interact
from sklearn import datasets, neighbors, preprocessing, tree
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# from wittgenstein import RIPPER
from util import plot_decision_regions

%matplotlib inline

## Normalization of Attributes

For this notebook, we consider the binary classification problem on the [breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)):

In [None]:
# load the dataset from sklearn
dataset = datasets.load_breast_cancer()

# create a DataFrame to help further analysis
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
df["target"] = dataset.target
df["target"] = df["target"].astype("category")
# label the categories 0 and 1 with the class names
df["target"] = df["target"].cat.rename_categories(list(dataset.target_names))
df  # display an overview of the data

The goal is to train a classifier to diagnose whether a breast mass is malignant or benign. The target class distribution is shown as follows:

In [None]:
plt.figure()
display(df.target.value_counts())
df.target.value_counts().plot(kind="bar", title="counts of different classes")
plt.show()

The input features are characteristics of cell images obtained by [fine needle aspiration (FNA)](https://en.wikipedia.org/wiki/Fine-needle_aspiration):

In [None]:
%%html
<a href="https://www.researchgate.net/figure/FNA-biopsy-samples-of-benign-left-and-malignant-center-and-right-breast-tumor-cells_fig1_261959799"><img src="https://www.researchgate.net/profile/Carsten_Eickhoff/publication/261959799/figure/fig1/AS:296624837414914@1447732280878/FNA-biopsy-samples-of-benign-left-and-malignant-center-and-right-breast-tumor-cells.png" alt="FNA biopsy samples of benign (left) and malignant (center and right) breast tumor cells."/></a>
<p>FNA biopsy samples of benign (left) and malignant (center and right) breast tumor cells.</p>
<p>Eickhoff, Carsten. (2014). Crowd-powered experts: helping surgeons interpret breast cancer images. ACM International Conference Proceeding Series. 53-56. 10.1145/2594776.2594788.</p>

The following function displays the statistics of the features grouped by the class values:

In [None]:
def show_feature_statistics(df):
    df.groupby("target").boxplot(rot=90, layout=(1, 2), figsize=(12, 5), fontsize=7)


show_feature_statistics(df)
plt.show()

From the above plots, it can be observed that the attributes `mean area` and `worst area` have much larger ranges than the other features.
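
The observation above can also be checked numerically. The following is a minimal sketch that reloads the dataset so that it runs on its own; the names `data`, `features` and `feature_ranges` are introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

# reload the data so that this cell is self-contained
data = datasets.load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)

# range (max - min) of each feature, largest first
feature_ranges = (features.max() - features.min()).sort_values(ascending=False)
print(feature_ranges.head())
```

The features at the top of the printed list are the ones that dominate the scale of the box plots.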

**Exercise** Is it true that a feature with a larger range is a better feature? Why?

YOUR ANSWER HERE

### Min-max Normalization

We can normalize a numeric feature $Z$ to the unit interval as follows:

$$
\begin{align}
Z' := \frac{Z - a}{b - a}
\end{align}
$$ (min-max)

where $a$ and $b$ are respectively the minimum and maximum possible values of $Z$.  

$a$ and $b$ may be unknown in practice as the distribution of $Z$ is unknown. We perform the normalization on the samples: The min-max normalization of the sequence (in $i$) of $z_i$ is the sequence of

$$
\begin{align}
z'_i := \frac{z_i - \min_j z_j}{\max_j z_j - \min_j z_j},
\end{align}
$$ (min-max-sample)

where $\min_j z_j$ and $\max_j z_j$ are respectively the minimum and maximum sample values. It follows that $0\leq z'_i \leq 1$, and each bound is attained for some index $i$.
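
As a quick numerical illustration of the sample formula, consider the following sketch on a short made-up sequence. The values in `z` are arbitrary and not taken from the dataset.

```python
import numpy as np

z = np.array([2.0, 5.0, 11.0, 8.0])  # toy sample values (made up for illustration)

# sample version of min-max normalization
z_normalized = (z - z.min()) / (z.max() - z.min())
print(z_normalized)
```

The smallest sample is mapped to 0 and the largest to 1. Note that the formula is undefined when all samples are equal, since the denominator is then zero.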

An implementation is as follows:

In [None]:
df.select_dtypes(include="number").min()

In [None]:
df[[c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]].min().index

In [None]:
def minmax_normalize(df, suffix=" (min-max normalized)"):
    """Returns a new DataFrame with numerical attributes of the input DataFrame
    min-max normalized.

    Parameters
    ----------
    df: DataFrame
        Input to be min-max normalized. May contain both numeric
        and categorical attributes.
    suffix: string
        Suffix to append to the names of normalized attributes.

    Returns
    -------
    DataFrame:
        A copy of df with its numeric attributes replaced by their min-max
        normalization. The normalized features are renamed with the suffix
        appended to the end of their original names.
    """
    df = df.copy()  # avoid overwriting the original dataframe
    min_values = df.select_dtypes(include="number").min()  # Skip categorical features
    max_values = df[min_values.index].max()

    # min-max normalize
    df[min_values.index] = (df[min_values.index] - min_values) / (
        max_values - min_values
    )

    # rename normalized features
    df.rename(columns={c: c + suffix for c in min_values.index}, inplace=True)

    return df

It is a good idea to rename the normalized features to differentiate them from the original features. The following plots the statistics of the normalized features.

In [None]:
df_minmax_normalized = minmax_normalize(df)
assert df_minmax_normalized.target.to_numpy().base is df.target.to_numpy().base

show_feature_statistics(df_minmax_normalized)
plt.show()

After normalization, we can see how instances of different classes differ in input features other than `mean area` and `worst area`. In particular, both `mean concavity` and `worst concavity` are substantially higher for malignant examples than for benign examples. Such details are hard to see in the plots before normalization.
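
The difference in concavity can also be summarized numerically. The sketch below is self-contained and recomputes the min-max normalization for the two concavity features only; the names `features`, `labels`, `normalized` and `class_medians` are introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)
labels = pd.Series(data.target_names[data.target], name="target")

# min-max normalize the two concavity features
cols = ["mean concavity", "worst concavity"]
normalized = (features[cols] - features[cols].min()) / (
    features[cols].max() - features[cols].min()
)

# median of each normalized feature within each class
class_medians = normalized.groupby(labels).median()
print(class_medians.round(3))
```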

### Standard Normalization

Min-max normalization is not appropriate for features with unbounded support, where $b-a=\infty$ in {eq}`min-max`. In that case, the normalization factor $\max_j z_j - \min_j z_j$ in {eq}`min-max-sample` for i.i.d. samples grows without bound as the number of samples goes to infinity.
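
The growth of the normalization factor can be observed in a small simulation. The following is a sketch under the assumption that the samples are standard Gaussian, which is only one example of a distribution with unbounded support.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)  # i.i.d. samples with unbounded support

# sample range max - min over the first n samples
sizes = [10, 100, 1_000, 10_000, 100_000]
ranges = [samples[:n].max() - samples[:n].min() for n in sizes]
for n, r in zip(sizes, ranges):
    print(f"n = {n:>6}: range = {r:.2f}")
```

Because each prefix contains the previous one, the printed ranges never decrease. For Gaussian samples the growth is slow, but it does not stop.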

Let us inspect the distribution of each feature using [`displot`](https://seaborn.pydata.org/generated/seaborn.displot.html) provided by the package [`seaborn`](https://seaborn.pydata.org), which was imported with

```python
import seaborn as sns
```

In [None]:
@interact(
    feature=dataset.feature_names, kernel_density_estimation=True, group_by_class=False
)
def plot_distribution(feature, kernel_density_estimation, group_by_class):
    sns.displot(
        data=df,
        x=feature,
        col="target" if group_by_class else None,
        kde=kernel_density_estimation,
    )

Play with the above widgets to check if the features appear to have unbounded support.

For a feature $Z$ with unbounded support, one may use the $z$-score/standard normalization instead:

$$
\begin{align}
Z' := \frac{Z - E[Z]}{\sqrt{\operatorname{Var}(Z)}}.
\end{align}
$$ (standard)

Since the distribution of $Z$ is unknown, we normalize the sequence of i.i.d. samples $z_i$ using its sample mean $\mu$ and standard deviation $\sigma$ to the sequence of

$$
\begin{align}
z'_i := \frac{z_i - \mu}{\sigma}. 
\end{align}
$$ (standard-sample)
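
As a numerical illustration of the sample formula, consider the following sketch on a short made-up sequence. One assumption is made explicit here: the standard deviation is computed with `ddof=1`, which matches the default of the pandas `std` method. NumPy's `std` and `sklearn`'s `StandardScaler` use `ddof=0` instead, so their results differ slightly for small samples.

```python
import numpy as np

z = np.array([2.0, 5.0, 11.0, 8.0])  # toy sample values (made up for illustration)

mu = z.mean()
sigma = z.std(ddof=1)  # ddof=1 matches the default of pandas' std method
z_standardized = (z - mu) / sigma

# the normalized sequence has zero sample mean and unit sample standard deviation
print(z_standardized.mean(), z_standardized.std(ddof=1))
```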

**Exercise** Complete the function `standard_normalize` as follows:

- Return a new copy of the input `DataFrame` `df` but with all its numeric attributes standard normalized. 
- You may use the methods `mean` and `std`.
- Rename the normalized features by appending `suffix` to their names. 

In [None]:
def standard_normalize(df, suffix=" (standard normalized)"):
    """Returns a DataFrame with numerical attributes of the input DataFrame
    standard normalized.

    Parameters
    ----------
    df: DataFrame
        Input to be standard normalized. May contain both numeric
        and categorical attributes.
    suffix: string
        Suffix to append to the names of normalized attributes.

    Returns
    -------
    DataFrame:
        A new copy of df that retains the categorical attributes but with the
        numeric attributes replaced by their standard normalization.
        The normalized features are renamed with the suffix appended to the end
        of their original names.
    """
    # YOUR CODE HERE
    raise NotImplementedError()


df_standard_normalized = standard_normalize(df)
show_feature_statistics(df_standard_normalized)

In [None]:
# tests
assert np.isclose(
    df_standard_normalized.select_dtypes(include="number").mean(), 0
).all()
assert np.isclose(df_standard_normalized.select_dtypes(include="number").std(), 1).all()

In [None]:
# hidden tests

## Nearest Neighbor Classification

To create a $k$-nearest-neighbor ($k$-NN) classifier, we can use `sklearn.neighbors.KNeighborsClassifier`. The following fits a $1$-NN classifier to the entire dataset and returns its training accuracy.

In [None]:
X, Y = df[dataset.feature_names], df.target
kNN1 = neighbors.KNeighborsClassifier(n_neighbors=1)
kNN1.fit(X, Y)

print("Training accuracy: {:0.3g}".format(kNN1.score(X, Y)))

**Exercise** Why is the training accuracy for $1$-NN $100\%$? Explain according to how 1-NN works.

YOUR ANSWER HERE

To avoid overly optimistic performance estimates, the following uses 10-fold cross-validation to compute the accuracies of 1-NN trained on datasets with and without normalization.

In [None]:
cv = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

dfs = {"None": df, "Min-max": df_minmax_normalized}

acc = pd.DataFrame(columns=dfs.keys())
for norm in dfs:
    acc[norm] = cross_val_score(
        kNN1,
        dfs[norm].loc[:, lambda df: ~df.columns.isin(["target"])],
        # not [dataset.feature_names] since normalized features are renamed
        dfs[norm]["target"],
        cv=cv,
    )

acc.agg(["mean", "std"]).round(3)

The accuracies appear to show that normalization improves the performance of 1-NN. More precisely, the improvement of $\sim 5\%$ in accuracy appears statistically significant because it is at least twice the standard deviation of $\sim 2\%$.

---

**Important**


The proper way to compare performance should take statistical significance into account, for example with a [paired t-test](https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f). There is not much we can do to improve the statistical significance other than collecting more data. Repeating the cross-validation with different random seeds does not help, as that only smooths out the randomness in splitting, not in sampling.

---
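
A paired t-test on the per-fold accuracies can be sketched as follows, assuming `scipy` is installed (it is a dependency of `sklearn`). The sketch is self-contained and uses its own names (`X_demo`, `y_demo`, `cv_demo`, `plain`, `scaled`) so that it does not overwrite the variables of this notebook. The scaler is combined with the classifier using `make_pipeline`, so that it is fitted on the training folds only.

```python
from scipy import stats
from sklearn import datasets, neighbors, preprocessing
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)
cv_demo = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

plain = neighbors.KNeighborsClassifier(n_neighbors=1)
scaled = make_pipeline(
    preprocessing.StandardScaler(), neighbors.KNeighborsClassifier(n_neighbors=1)
)

# the same folds are used for both classifiers, so the accuracies are paired
acc_plain = cross_val_score(plain, X_demo, y_demo, cv=cv_demo)
acc_scaled = cross_val_score(scaled, X_demo, y_demo, cv=cv_demo)

# paired t-test on the fold-by-fold differences in accuracy
result = stats.ttest_rel(acc_scaled, acc_plain)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```

The p-value should be read with caution: the training sets of different folds overlap, so the fold accuracies are not independent and the assumptions of the t-test hold only approximately.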

### Data Leak

The accuracies computed for the normalizations above suffer from a subtle issue that renders them overly optimistic:

---

**Important**

Since the normalization factors for cross-validation were calculated from the entire dataset, the test data for each cross-validation fold may not be independent of the remaining normalized data used for training the classifier. This subtle data leak may cause the performance estimate to be overly optimistic.

---

This issue can be resolved by computing the normalization factors from the training set instead of the entire dataset. To do so, we will create a pipeline using:

```python
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
```

- Like the filtered classifier in Weka, `sklearn.pipeline` provides the function `make_pipeline` to combine a filter with a classifier.
- `sklearn.preprocessing` provides different filters for preprocessing features, e.g., `StandardScaler` and `MinMaxScaler` for standard and min-max normalization, respectively.

Creating a pipeline is useful especially for cross validation, where the normalization factors need to be recomputed for each fold.

In [None]:
kNN1_standard_normalized = make_pipeline(preprocessing.StandardScaler(), kNN1)
acc["Standard"] = cross_val_score(kNN1_standard_normalized, X, Y, cv=cv)
acc["Standard"].agg(["mean", "std"]).round(3)
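
To see that a pipeline computes the normalization factors from the training data only, one can inspect the fitted scaler. The following is a self-contained sketch using a single train/test split instead of cross-validation; the names `X_demo`, `y_demo`, `pipe` and `scaler` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, neighbors, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

pipe = make_pipeline(
    preprocessing.StandardScaler(), neighbors.KNeighborsClassifier(n_neighbors=1)
)
pipe.fit(X_train, y_train)  # the scaler is fitted on the training split only

scaler = pipe.named_steps["standardscaler"]  # step name generated by make_pipeline
# the stored means agree with the training split ...
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
# ... but not with the entire dataset
print(np.allclose(scaler.mean_, X_demo.mean(axis=0)))
print("Test accuracy: {:0.3g}".format(pipe.score(X_test, y_test)))
```

When `pipe.score` is called, the test data is transformed with the factors stored during `fit`, so no information from the test split enters the normalization.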

**Exercise** Similar to the above cell, correct the accuracies in `acc['Min-max']` to use `preprocessing.MinMaxScaler` as part of a pipeline for the 1-NN classifier.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
acc["Min-max"].agg(["mean", "std"]).round(5)

In [None]:
# hidden tests

### Decision Regions

Since `sklearn` does not provide any function to plot the decision regions of a classifier, we provide the function `plot_decision_regions` in a module `util` defined in `util.py` of the current directory:

```python
from util import plot_decision_regions
```

In [None]:
plot_decision_regions?

In [None]:
Code(filename="util.py")

The following plots the decision region for a selected pair of input features.

In [None]:
if input("Execute? [y/N]").lower() == "y":

    @interact(
        normalization=["None", "Min-max", "Standard"],
        feature1=dataset.feature_names,
        feature2=dataset.feature_names,
        k=widgets.IntSlider(1, 1, 5, continuous_update=False),
        resolution=widgets.IntSlider(1, 1, 4, continuous_update=False),
    )
    def decision_regions_kNN(
        normalization,
        feature1=dataset.feature_names[0],
        feature2=dataset.feature_names[1],
        k=1,
        resolution=1,
    ):
        scaler = {
            "Min-max": preprocessing.MinMaxScaler,
            "Standard": preprocessing.StandardScaler,
        }
        kNN = neighbors.KNeighborsClassifier(n_neighbors=k)
        if normalization != "None":
            kNN = make_pipeline(scaler[normalization](), kNN)
        kNN.fit(df[[feature1, feature2]].to_numpy(), df.target.to_numpy())
        plt.figure(figsize=(10, 10))
        ax = plot_decision_regions(
            df[[feature1, feature2]], df.target, kNN, N=resolution * 100
        )
        ax.set_title("Decision region for {}-NN".format(k))
        ax.set_xlabel(feature1)
        ax.set_ylabel(feature2)

Interact with the widgets to: 

- Learn the effect on the decision regions/boundaries with different normalizations and choices of $k$.
- Learn to draw the decision boundaries for $1$-NN with min-max normalization.

To avoid repeated computations, `plot_decision_regions` is a decorated function with its return values memoized/cached. (For details, see a [notebook](https://www.cs.cityu.edu.hk/~ccha23/cs1302book/Lecture6/Decorator.html) and a [tutorial](https://realpython.com/primer-on-python-decorators/) on decorators.)

To clear the cached plots, run:

In [None]:
plot_decision_regions.clear()
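
The idea of a memoizing decorator with a `clear` method can be sketched as follows. This is a hypothetical minimal version, not the code in `util.py`: the names `memoize` and `slow_square` are made up, and the sketch assumes hashable arguments, which a function taking `DataFrame` arguments would have to handle differently.

```python
import functools


def memoize(func):
    """Cache the return values of func, keyed by its (hashable) arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)  # computed only on a cache miss
        return cache[key]

    wrapper.clear = cache.clear  # expose a way to empty the cache
    return wrapper


calls = []  # records every time the body of slow_square actually runs


@memoize
def slow_square(x):
    calls.append(x)
    return x * x


slow_square(3)
slow_square(3)  # served from the cache, so the body is not run again
slow_square.clear()
slow_square(3)  # recomputed after the cache is cleared
print(len(calls))
```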

**Exercise** Complete the following code to plot the decision regions for decision trees. Afterwards, explain whether the decision regions change for different normalizations.

In [None]:
if input("Execute? [y/N]").lower() == "y":

    @interact(
        normalization=["None", "Min-max", "Standard"],
        feature1=dataset.feature_names,
        feature2=dataset.feature_names,
        resolution=widgets.IntSlider(1, 1, 4, continuous_update=False),
    )
    def decision_regions_DT(
        normalization,
        feature1=dataset.feature_names[0],
        feature2=dataset.feature_names[1],
        resolution=1,
    ):
        scaler = {
            "Min-max": preprocessing.MinMaxScaler,
            "Standard": preprocessing.StandardScaler,
        }
        # YOUR CODE HERE
        raise NotImplementedError()
        plt.figure(figsize=(10, 10))
        ax = plot_decision_regions(
            df[[feature1, feature2]], df.target, DT, N=resolution * 100
        )
        ax.set_title("Decision region for Decision Tree")
        ax.set_xlabel(feature1)
        ax.set_ylabel(feature2)

YOUR ANSWER HERE