# Different Classifiers with scikit-learn

**CS5483 Data Warehousing and Data Mining**
___

In [1]:
%reset -f
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, tree, preprocessing, neighbors
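# produce vector inline graphics
from IPython.display import display
# set_matplotlib_formats is deprecated in IPython.display; import it from matplotlib_inline
from matplotlib_inline.backend_inline import set_matplotlib_formats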
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
set_matplotlib_formats('svg')
from wittgenstein import RIPPER
from ipywidgets import interact
import ipywidgets as widgets
import seaborn as sns

## Normalization of Attributes

For this notebook, we consider the binary classification problem on the [breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)):

In [None]:
# load the dataset from sklearn
dataset = datasets.load_breast_cancer()

# create a DataFrame to help further analysis
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
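# store the target as a categorical attribute labeled by the class names
# (assigning to .cat.categories is no longer supported in recent pandas)
df['target'] = pd.Categorical.from_codes(dataset.target,
                                         categories=dataset.target_names)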
df  # display an overview of the data

The goal is to train a classifier to diagnose whether a breast mass is malignant or benign. The target class distribution is shown as follows:

In [None]:
display(df.target.value_counts())
df.target.value_counts().plot(kind='bar', title='counts of different classes')

The input features are characteristics of cell images obtained by [fine-needle aspiration (FNA)](https://en.wikipedia.org/wiki/Fine-needle_aspiration):

In [2]:
%%html
<a href="https://www.researchgate.net/figure/FNA-biopsy-samples-of-benign-left-and-malignant-center-and-right-breast-tumor-cells_fig1_261959799"><img src="https://www.researchgate.net/profile/Carsten_Eickhoff/publication/261959799/figure/fig1/AS:296624837414914@1447732280878/FNA-biopsy-samples-of-benign-left-and-malignant-center-and-right-breast-tumor-cells.png" alt="FNA biopsy samples of benign (left) and malignant (center and right) breast tumor cells."/></a>
<p>FNA biopsy samples of benign (left) and malignant (center and right) breast tumor cells.</p>
<p>Eickhoff, Carsten. (2014). Crowd-powered experts: helping surgeons interpret breast cancer images. ACM International Conference Proceeding Series. 53-56. 10.1145/2594776.2594788.</p>

The following function displays the statistics of the features grouped by the class values:

In [None]:
def show_feature_statistics(df):
    df.groupby('target').boxplot(rot=90,
                                 layout=(1, 3),
                                 figsize=(12, 5),
                                 fontsize=7)


show_feature_statistics(df)

From the above plots, it can be observed that the attributes `mean area` and `worst area` have much larger ranges than the other features.
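The observation can also be checked numerically. The following is a minimal sketch that reloads the dataset so that it runs on its own; the names `features` and `feature_ranges` are introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)

# range of each feature (max - min), sorted from the largest to the smallest
feature_ranges = (features.max() - features.min()).sort_values(ascending=False)
print(feature_ranges.head(5))
```

The two area features should appear at the top of the list, well ahead of the rest.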

**Exercise** Is it true that a feature with a larger range is a better feature? Why?

YOUR ANSWER HERE

### Min-max Normalization

We can normalize a numeric feature to the unit interval as follows:

\begin{align}
X_i\in [X_i^{\min}, X_i^{\max}] \mapsto \frac{X_i - X_i^{\min}}{X_i^{\max} - X_i^{\min}} \in [0,1].
\end{align}

To perform the normalization on i.i.d. samples $x_{ij}$ indexed by $j$, we use the mapping

\begin{align}
x_{ij} \mapsto \frac{x_{ij} - \min_j x_{ij}}{\max_j x_{ij} - \min_j x_{ij}}.
\end{align}

An implementation is as follows:

In [None]:
def minmax_normalize(df, suffix=' (min-max normalized)'):
    ''' Returns a DataFrame with numerical attributes of the input DataFrame 
    min-max normalized.
    
    Parameters
    ----------
    df (DataFrame): Input to be min-max normalized. May contain both numeric 
        and categorical attributes. 
    suffix (string): Suffix to append to the names of normalized attributes.
    
    Returns 
    -------
    A new DataFrame which retains the categorical attributes but with the 
    numeric attributes replaced by their min-max normalization. 
    The normalized features are renamed with the suffix appended to the end of 
    their original names.
    '''
    df_minmax_normalized = df.copy()
    # numeric_only=True skips the categorical feature target explicitly
    min_values = df.min(numeric_only=True)
    max_values = df.max(numeric_only=True)

    # min-max normalize
    df_minmax_normalized[min_values.index] = (
        (df[min_values.index] - min_values) / (max_values - min_values))
    
    # rename normalized features
    df_minmax_normalized.rename(
        columns={c: c + suffix
                 for c in min_values.index},
        inplace=True)

    return df_minmax_normalized
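
As a sanity check of the formula, the following sketch applies the mapping column by column to a small made-up array and compares the result with `sklearn.preprocessing.MinMaxScaler`. The toy values and the names `x`, `x_manual` and `x_sklearn` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data: 4 samples (rows) of 2 features (columns) with very different ranges
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [5.0, 900.0]])

# column-wise min-max normalization written out by hand
x_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# the same transformation using scikit-learn
x_sklearn = MinMaxScaler().fit_transform(x)

print(x_manual)
print(np.allclose(x_manual, x_sklearn))
```

Both columns end up in the unit interval even though their original ranges differ by two orders of magnitude.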

It is a good idea to rename the normalized features to differentiate them from the original features. The following plots the statistics of the normalized features.

In [None]:
df_minmax_normalized = minmax_normalize(df)
assert df_minmax_normalized.target.to_numpy().base is df.target.to_numpy().base
show_feature_statistics(df_minmax_normalized)

After normalization, we can see how instances of different classes differ in input features other than `mean area` and `worst area`. In particular, both `mean concavity` and `worst concavity` are substantially higher for malignant examples than for benign examples. Such details are hard to see in the plots before normalization.
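
The difference in concavity between the two classes can also be read off from the class-wise means. The following is a minimal sketch that reloads the dataset so that it runs on its own; the names `features`, `labels` and `class_means` are introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)
# map the numeric class codes to the class names 'malignant' and 'benign'
labels = pd.Series(data.target_names[data.target], name='target')

# mean of the two concavity features within each class
class_means = features[['mean concavity', 'worst concavity']].groupby(labels).mean()
print(class_means)
```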

### Standard Normalization

For features with unbounded support and a distribution close to the normal distribution, one may use the $z$-score/standard normalization instead of min-max normalization:

\begin{align}
X_i \mapsto \frac{X_i - \mu_{X_i}}{\sigma_{X_i}},
\end{align}

where $\mu_{X_i}$ and $\sigma_{X_i}$ are the expectation and standard deviation of the feature $X_i$ respectively. The i.i.d. samples can be transformed using the sample mean and sample standard deviation instead. It is okay to use a biased estimate of the standard deviation and variance.
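
To illustrate the remark on biased estimates, the following sketch standardizes a small made-up sample with both the biased estimate (dividing by $n$, `ddof=0`) and the unbiased estimate (dividing by $n-1$, `ddof=1`). The toy values and variable names are assumptions made for illustration. Note that `numpy` defaults to `ddof=0` while the `std` method of `pandas` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy sample of a single feature
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mu = x.mean(axis=0)
sigma_biased = x.std(axis=0, ddof=0)    # divides by n
sigma_unbiased = x.std(axis=0, ddof=1)  # divides by n - 1

z_biased = (x - mu) / sigma_biased
z_unbiased = (x - mu) / sigma_unbiased

# StandardScaler uses the biased estimate (ddof=0)
z_sklearn = StandardScaler().fit_transform(x)

print(sigma_biased, sigma_unbiased)
print(np.allclose(z_biased, z_sklearn))
```

Either choice gives a zero-mean feature; the two results differ only by a constant scaling factor that approaches 1 as the sample size grows.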

To inspect the distribution of each feature, we can use the function [`displot`](https://seaborn.pydata.org/generated/seaborn.displot.html) provided by the package [`seaborn`](https://seaborn.pydata.org).

In [None]:
import seaborn as sns

@interact(feature=dataset.feature_names, kernel_density_estimation=True, group_by_class=False)
def plot_distribution(feature, kernel_density_estimation, group_by_class):
    sns.displot(data=df, x=feature, col='target' if group_by_class else None, kde=kernel_density_estimation, height=3.5)

Play with the above widgets to check if the distributions of a feature look like the bell-shaped normal distribution.

**Exercise** Complete the function `standard_normalize` as follows:

- Assign to `df_standard_normalized` a new `DataFrame` that is the same as `df` but with all its numeric attributes standard normalized. 
- You may use the methods `mean` and `std`.
- Rename the normalized features by appending `suffix` to their names. 

In [None]:
def standard_normalize(df, suffix=' (standard normalized)'):
    ''' Returns a DataFrame with numerical attributes of the input DataFrame 
    standard normalized.
    
    Parameters
    ----------
    df (DataFrame): Input to be standard normalized. May contain both numeric 
        and categorical attributes. 
    suffix (string): Suffix to append to the names of normalized attributes.
    
    Returns 
    -------
    A new DataFrame which retains the categorical attributes but with the 
    numeric attributes replaced by their standard normalization. 
    The normalized features are renamed with the suffix appended to the end of 
    their original names.
    '''
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_standard_normalized

df_standard_normalized = standard_normalize(df)
show_feature_statistics(df_standard_normalized)

In [None]:
# tests
# numeric_only=True skips the categorical feature target
assert np.isclose(df_standard_normalized.mean(numeric_only=True), 0).all()

## Nearest Neighbor Classification

To create a $k$-nearest-neighbor ($k$-NN) classifier, we can use `sklearn.neighbors.KNeighborsClassifier`. The following fits a $1$-NN classifier to the entire dataset and returns its training accuracy.

In [None]:
from sklearn import neighbors

X, Y = df[dataset.feature_names], df.target
kNN1 = neighbors.KNeighborsClassifier(n_neighbors=1)
kNN1.fit(X, Y)

print('Training accuracy: {:0.3g}'.format(kNN1.score(X, Y)))

**Exercise** Why is the training accuracy for $1$-NN $100\%$? Explain according to how 1-NN works.

YOUR ANSWER HERE

To avoid overly-optimistic performance estimates, the following uses 10-fold cross-validation to compute the accuracies of 1-NN trained on datasets with and without normalization.

In [None]:
cv = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

dfs = {
    'None': df,
    'Min-max': df_minmax_normalized,
    'Standard': df_standard_normalized
}

acc = pd.DataFrame(columns=dfs.keys())
for norm in dfs:
    acc[norm] = cross_val_score(
        kNN1,
        dfs[norm].loc[:, lambda df: ~df.columns.isin(['target'])],
        # not [dataset.feature_names] since normalized features are renamed
        dfs[norm]['target'],
        cv=cv)

acc.agg(['mean', 'std']).round(5)

### Data Leak

The accuracies computed previously appear to show that normalization improves the performance of 1-NN. However, there are two subtle issues:

1. The standard deviations are quite large compared to the difference in performance. The difference in accuracies may be due to the random sampling in obtaining the data and splitting the data.
2. The normalization factors are calculated from the entire dataset. Hence, in cross-validation, the normalized data for training a classifier depends on the data for testing the classifier. This can lead to an overly-optimistic estimate.

There is not much we can do to resolve the first issue, other than collecting more data. Repeating the cross-validation with different random seeds does not help, as that only smooths out the randomness in splitting the data, not the randomness in sampling it.

The second issue, however, can be easily resolved by computing the normalization factors from the training set instead of the entire dataset:
- Like the filtered classifier in Weka, `sklearn.pipeline` provides the function `make_pipeline` to combine a filter with a classifier.
- `sklearn.preprocessing` provides different filters for normalizing features.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing

kNN1_standard_normalized = make_pipeline(preprocessing.StandardScaler(), kNN1)
acc['Standard'] = cross_val_score(kNN1_standard_normalized, X, Y, cv=cv)
acc['Standard'].agg(['mean', 'std']).round(5)
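
To make explicit what the pipeline does within each fold, the following sketch writes the cross-validation loop out by hand: the scaler is fitted on the training fold only and then applied to both the training and the test fold. It reloads the dataset so that it runs on its own, and the names `fold_acc` and `pipe_acc` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets, neighbors, preprocessing
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

fold_acc = []
for train_idx, test_idx in cv.split(X, y):
    # normalization factors come from the training fold only
    scaler = preprocessing.StandardScaler().fit(X[train_idx])
    clf = neighbors.KNeighborsClassifier(n_neighbors=1)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    # the test fold is transformed with the factors learned from the training fold
    fold_acc.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
fold_acc = np.array(fold_acc)

# the pipelined classifier should give the same fold accuracies
pipe = make_pipeline(preprocessing.StandardScaler(),
                     neighbors.KNeighborsClassifier(n_neighbors=1))
pipe_acc = cross_val_score(pipe, X, y, cv=cv)
print(np.allclose(fold_acc, pipe_acc))
```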

The variance still appears to be too large to conclude that the performance has improved. Indeed, the accuracy for the pipelined classifier is even higher than the earlier overly-optimistic accuracy. As in the experimenter interface in Weka, a proper way to compare the performance of different classifiers is to use the [paired t-test](https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f).
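
A paired t-test on the fold accuracies can be sketched with `scipy.stats.ttest_rel`, assuming `scipy` is installed. The sketch reloads the dataset so that it runs on its own, and the names `acc_raw` and `acc_std` are introduced here for illustration only. Since the folds of a cross-validation share training data, the independence assumption of the test holds only approximately, so the p-value should be read with caution.

```python
from scipy import stats
from sklearn import datasets, neighbors, preprocessing
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_breast_cancer(return_X_y=True)
# the same splits are used for both classifiers so that the accuracies are paired
cv = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

knn = neighbors.KNeighborsClassifier(n_neighbors=1)
acc_raw = cross_val_score(knn, X, y, cv=cv)
acc_std = cross_val_score(
    make_pipeline(preprocessing.StandardScaler(), knn), X, y, cv=cv)

# paired t-test on the fold-by-fold differences in accuracy
t_stat, p_value = stats.ttest_rel(acc_std, acc_raw)
print(f't = {t_stat:.3f}, p = {p_value:.3f}')
```

A small p-value would suggest that the difference in mean accuracy is unlikely to be due to the random splitting alone.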

**Exercise** Similar to the above cell, correct the accuracies in `acc['Min-max']` to use `preprocessing.MinMaxScaler` as part of a pipeline for the 1-NN classifier.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
acc['Min-max'].agg(['mean', 'std']).round(5)

In [None]:
# tests

### Decision Regions

Since `sklearn` does not provide any function to plot the decision regions of a classifier, we provide the function `plot_decision_regions` in a new module `util` defined in `util.py` of the current directory.

In [None]:
from util import plot_decision_regions
?plot_decision_regions

The following plots the decision region for a selected pair of input features.

In [None]:
@interact(normalization=['None', 'Min-max', 'Standard'],
          feature1=dataset.feature_names,
          feature2=dataset.feature_names,
          k=widgets.IntSlider(1, 1, 5, continuous_update=False),
          resolution=widgets.IntSlider(1, 1, 4, continuous_update=False))
def decision_regions_kNN(normalization,
                         feature1=dataset.feature_names[0],
                         feature2=dataset.feature_names[1],
                         k=1,
                         resolution=1):
    scaler = {
        'Min-max': preprocessing.MinMaxScaler,
        'Standard': preprocessing.StandardScaler
    }
    kNN = neighbors.KNeighborsClassifier(n_neighbors=k)
    if normalization != 'None':
        kNN = make_pipeline(scaler[normalization](), kNN)
    kNN.fit(df[[feature1, feature2]], df.target)
    plt.figure(figsize=(10, 10))
    ax = plot_decision_regions(df[[feature1, feature2]],
                               df.target,
                               kNN,
                               N=resolution * 100)
    ax.set_title('Decision region for {}-NN'.format(k))
    ax.set_xlabel(feature1)
    ax.set_ylabel(feature2)

Interact with the widgets to: 

- Learn the effect on the decision regions/boundaries with different normalizations and choices of $k$.
- Learn to draw the decision boundaries for $1$-NN with min-max normalization.

To avoid repeated computations, `plot_decision_regions` is [a decorated function](https://realpython.com/primer-on-python-decorators/) with its return values memoized/cached. To clear the cached plots, run:

In [None]:
plot_decision_regions.clear()
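
For readers unfamiliar with memoization, the following is a minimal sketch of a caching decorator that exposes a `clear` method. It is not the implementation in `util.py`: the decorator `memoize` and the function `slow_square` are hypothetical, and the sketch assumes hashable arguments, so how `util.py` keys its cache for `DataFrame` inputs is not shown here.

```python
import functools


def memoize(func):
    """Cache the return values of func, keyed by its (hashable) arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)  # computed only on a cache miss
        return cache[key]

    wrapper.clear = cache.clear  # expose a method to empty the cache
    return wrapper


calls = []  # records every time the function body actually runs


@memoize
def slow_square(n):
    calls.append(n)
    return n * n


slow_square(3)
slow_square(3)       # served from the cache, the body does not run again
slow_square.clear()  # empty the cache
slow_square(3)       # the body runs again
print(calls)
```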

**Exercise** Complete the following code to plot the decision regions for decision trees. Afterwards, explain whether the decision regions change for different normalizations.

In [None]:
@interact(normalization=['None', 'Min-max', 'Standard'],
          feature1=dataset.feature_names,
          feature2=dataset.feature_names,
          resolution=widgets.IntSlider(1, 1, 4, continuous_update=False))
def decision_regions_DT(normalization,
                        feature1=dataset.feature_names[0],
                        feature2=dataset.feature_names[1],
                        resolution=1):
    scaler = {
        'Min-max': preprocessing.MinMaxScaler,
        'Standard': preprocessing.StandardScaler
    }
    # YOUR CODE HERE
    raise NotImplementedError()
    plt.figure(figsize=(10, 10))
    ax = plot_decision_regions(df[[feature1, feature2]],
                               df.target,
                               DT,
                               N=resolution * 100)
    ax.set_title('Decision region for Decision Tree')
    ax.set_xlabel(feature1)
    ax.set_ylabel(feature2) 

YOUR ANSWER HERE

In [None]:
plot_decision_regions.clear()

## Rule-based Classification

Although RIPPER is not available in `sklearn`, there is an implementation in [`wittgenstein`](https://pypi.org/project/wittgenstein/).

In [None]:
from wittgenstein import RIPPER

ripper = RIPPER(random_state=0)
pos_class = dataset.target_names[0]
ripper.fit(df, class_feat='target', pos_class=pos_class)
ripper.out_model()

The above is the rule for detecting the positive class. To evaluate the performance:

In [None]:
scores = cross_val_score(ripper, df[dataset.feature_names], df.target == pos_class)
print(f'Accuracy: {scores.mean():.3g}')

**Challenge** The `predict` method of `ripper` seems to fail, as the following code does not show a correct decision region. As a challenge, try to identify the issue from the [GitHub repository](https://github.com/imoscovitz/wittgenstein).

In [None]:
@interact(normalization=['None', 'Min-max', 'Standard'],
          feature1=dataset.feature_names,
          feature2=dataset.feature_names,
          resolution=widgets.IntSlider(1, 1, 4, continuous_update=False))
def decision_regions_RIPPER(normalization,
                            feature1=dataset.feature_names[0],
                            feature2=dataset.feature_names[1],
                            resolution=1):
    scaler = {
        'Min-max': preprocessing.MinMaxScaler,
        'Standard': preprocessing.StandardScaler
    }
    ripper = RIPPER(random_state=0)
    if normalization != 'None':
        ripper = make_pipeline(scaler[normalization](), ripper)
    X = df[[feature1, feature2]].to_numpy()
    Y = df.target == pos_class
    ripper.fit(X, Y)
    ripper.out_model()
    plt.figure(figsize=(10, 10))
    ax = plot_decision_regions(X,
                               Y,
                               ripper,
                               N=resolution * 100)
    ax.set_title('Decision region for RIPPER')
    ax.set_xlabel(feature1)
    ax.set_ylabel(feature2) 
    
plot_decision_regions.clear()