# Multi-Class Classification

To this point, we have focused on two-class classification.  This is the most common style, but there are scenarios in which two-class classification is not sufficient.  In this notebook, we will try out a variety of classification algorithms and see how they perform against a multi-class dataset.  We'll also see what changes, if any, we need to make to get them working.

## Prep Work

Before importing any classifier libraries, let's load our helper libraries and import the dataset for review.

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib as mpl

mpl.rcParams['figure.dpi'] = 100

Now let's prep the data.  Because we'll do it the same way for each algorithm, we only need to do this once.

In [None]:
bean_data = "../data/Dry_Bean_Dataset.csv"
df = pd.read_csv(bean_data, header=0)
df

This is a clean dataset, as all of our input features are numeric.  The class label is a string rather than a number, but that's okay: most of the classifiers we'll use accept string labels as they are.

## Pre-Processing

The only pre-processing we'll need to do for most algorithms is to create our `X` and `y` sets, split, and go.  Certain algorithms, such as XGBoost, require numeric labels, but we'll handle this as we go along.

In [None]:
y = df['Class']
X = df.drop(['Class'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

X_train
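
When the classes in a multi-class dataset are unevenly represented, a plain random split can leave a small class under-represented in the test set. Here is a minimal sketch of a stratified split, assuming toy data in place of the bean dataset (`X_toy`, `y_toy`, and the 70/30 class mix are made up for illustration). The only real change from the cell above is the `stratify` argument.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the bean labels: 70 rows of one class, 30 of another.
y_toy = pd.Series(["DERMASON"] * 70 + ["BOMBAY"] * 30, name="Class")
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1740, stratify=y_toy
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

On the real data, `df['Class'].value_counts()` is a quick way to see how uneven the classes are before deciding whether this is worth doing.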

## Classification Showdown

Now we're going to try out all of the algorithms we've used so far in a proper showdown.  Because this is a more complex problem than some of the ones we've looked at, we'll end up seeing a larger variance in scores between algorithms.  Note that any conclusions we draw apply to *this* dataset and are not necessarily an indictment of any algorithm in particular.

### CART

The first algorithm we'll look at is CART.

In [None]:
from sklearn import tree
clf_cart = tree.DecisionTreeClassifier(random_state=106)
clf_cart = clf_cart.fit(X_train, y_train)
predicted_cart = clf_cart.predict(X_test)
accuracy_score(y_test, predicted_cart)

In [None]:
cm_cart = confusion_matrix(y_test, predicted_cart)
ConfusionMatrixDisplay(cm_cart).plot()

In [None]:
print(classification_report(y_test, predicted_cart))

CART does a great job with accuracy and is a solid answer across the board.
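
A confusion matrix built from a bare array shows class indices rather than names, which makes it harder to spot which beans get confused with which. One way to keep the names attached is to wrap the matrix in a DataFrame. This is a small sketch with hypothetical labels (`y_true_toy` and `y_pred_toy` stand in for `y_test` and `predicted_cart`).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test and predicted_cart.
y_true_toy = ["BOMBAY", "SIRA", "SIRA", "DERMASON", "DERMASON", "SIRA"]
y_pred_toy = ["BOMBAY", "SIRA", "DERMASON", "DERMASON", "SIRA", "SIRA"]

labels = sorted(set(y_true_toy))
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(cm_toy, index=labels, columns=labels)
print(cm_df)
```

The same list of names can be passed to the plot as `ConfusionMatrixDisplay(cm_toy, display_labels=labels).plot()` to label the axes.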

### Random Forest

Our next algorithm is random forest.

In [None]:
from sklearn import ensemble
clf_rf = ensemble.RandomForestClassifier(random_state=106)
clf_rf = clf_rf.fit(X_train, y_train)
predicted_rf = clf_rf.predict(X_test)
accuracy_score(y_test, predicted_rf)

In [None]:
cm_rf = confusion_matrix(y_test, predicted_rf)
ConfusionMatrixDisplay(cm_rf).plot()

In [None]:
print(classification_report(y_test, predicted_rf))

Random forest, as we'd expect, does even better than CART.

### XGBoost

Our next algorithm is XGBoost.  XGBoost **does** require numerical labels, so we'll need to use encoded labels.  Previously, we've used the `LabelEncoder` to build this, but here we'll rig something simple.

In [None]:
def rickety_encoder(grain):
    if grain == "BARBUNYA":
        return 0
    elif grain == "BOMBAY":
        return 1
    elif grain == "CALI":
        return 2
    elif grain == "DERMASON":
        return 3
    elif grain == "HOROZ":
        return 4
    elif grain == "SEKER":
        return 5
    elif grain == "SIRA":
        return 6

y_train_enc = list(map(rickety_encoder, y_train))
y_test_enc = list(map(rickety_encoder, y_test))

Now that we've encoded our training and test labels, let's perform our tests.

In [None]:
import xgboost as xgb
# 'mlogloss' is the multi-class log loss metric; 'logloss' is the two-class version.
clf_xgb = xgb.XGBClassifier(max_depth=9, n_estimators=300, use_label_encoder=False, eval_metric='mlogloss', random_state=106)
clf_xgb = clf_xgb.fit(X_train, y_train_enc)
predicted_xgb = clf_xgb.predict(X_test)
accuracy_score(y_test_enc, predicted_xgb)

Accuracy is slightly better than random forest, but in order to see the same confusion matrix, we'll need to build another rickety conversion function to get back to where we were.

In [None]:
def rickety_decoder(grain):
    if grain == 0:
        return "BARBUNYA"
    elif grain == 1:
        return "BOMBAY"
    elif grain == 2:
        return "CALI"
    elif grain == 3:
        return "DERMASON"
    elif grain == 4:
        return "HOROZ"
    elif grain == 5:
        return "SEKER"
    elif grain == 6:
        return "SIRA"

predicted_xgb_dec = list(map(rickety_decoder, predicted_xgb))
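
For comparison, the hand-rolled encoder and decoder above can be expressed with scikit-learn's `LabelEncoder`, which we mentioned earlier. This is a sketch on a toy list of labels (`labels_toy` is made up; it stands in for `y_train`). `LabelEncoder` sorts the class names, so it happens to produce the same 0 through 6 mapping as the rickety version.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for y_train; the real column uses the same seven names.
labels_toy = ["SEKER", "BARBUNYA", "BOMBAY", "CALI", "HOROZ", "SIRA", "DERMASON", "SIRA"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels_toy)    # strings -> integers 0..6
decoded = encoder.inverse_transform(encoded)   # integers -> original strings

print(list(encoder.classes_))
print(list(encoded))
```

In the notebook's setting, the usual pattern would be to call `fit_transform` on the training labels, `transform` on the test labels, and `inverse_transform` on the predictions.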


Now let's see how it performs.

In [None]:
cm_xgb = confusion_matrix(y_test, predicted_xgb_dec)
ConfusionMatrixDisplay(cm_xgb).plot()

In [None]:
print(classification_report(y_test, predicted_xgb_dec))

The results are extremely close to our random forest results, but tend to be slightly better.  It looks like Sira is the toughest class to predict.

### K-Nearest Neighbors

Our next algorithm is k-Nearest Neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_knn = clf_knn.fit(X_train, y_train)
predicted_knn = clf_knn.predict(X_test)
accuracy_score(y_test, predicted_knn)

This is a pretty big drop-off from the tree-based algorithms, and the results are fairly stable for several values of k.

In [None]:
cm_knn = confusion_matrix(y_test, predicted_knn)
ConfusionMatrixDisplay(cm_knn).plot()

In [None]:
print(classification_report(y_test, predicted_knn))

We can see that there's a big drop-off in almost every class except for Bombay.

#### PCA

My conjecture is that we're suffering from the curse of dimensionality here. What happens if we apply Principal Component Analysis to this dataset?

The first step is to rescale our data so that all inputs are normalized to the same scale.

In [None]:
from sklearn.preprocessing import StandardScaler

X_pca = df.drop(['Class'], axis=1)
X_pca = StandardScaler().fit_transform(X_pca)
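
Rescaling matters for distance-based methods such as kNN all by itself, because a feature measured in large units can drown out the others. The sketch below illustrates this on made-up data (`informative`, `noise`, and the scale of 10,000 are assumptions chosen for the demonstration, not properties of the bean data). It also uses a pipeline, which is one way to fit the scaler on the training split only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400
y_toy = rng.integers(0, 2, size=n)
# One informative feature on a small scale, one pure-noise feature on a huge scale.
informative = y_toy * 4.0 + rng.normal(size=n)
noise = rng.normal(scale=10_000.0, size=n)
X_toy = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0)

knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# The pipeline fits the scaler on the training split, then reuses it on the test split.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

acc_raw = knn_raw.score(X_te, y_te)
acc_scaled = knn_scaled.score(X_te, y_te)
print(f"raw: {acc_raw:.3f}, scaled: {acc_scaled:.3f}")
```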

Then we perform the PCA operation.

In [None]:
from sklearn.decomposition import PCA
pca_beans = PCA(n_components=3)
pc_beans = pca_beans.fit_transform(X_pca)

The `pc_beans` array contains our three principal components.

In [None]:
pc_beans
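
A natural follow-up question is how much of the original variation those components keep. A fitted `PCA` object exposes this as `explained_variance_ratio_`. Here is a sketch on synthetic data (the `factors` and `mixing` arrays are invented so the block runs without the notebook's data); on the notebook's data the equivalent would be `pca_beans.explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: 6 columns that are noisy mixtures of 2 hidden factors.
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_toy = factors @ mixing + 0.1 * rng.normal(size=(500, 6))

X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=3).fit(X_scaled)

# Fraction of total variance captured by each component, largest first.
ratios = pca_toy.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the first few ratios sum to nearly 1, very little information is lost by dropping the remaining dimensions.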

Now we can use `pc_beans` as the X for training.

In [None]:
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(pc_beans, y, test_size=0.30, random_state=1740)

X_train_pca

Now let's run the analysis again. Note that the number of neighbors has gone **way** up. I'd normally be concerned about this many neighbors, but our training set has over 9,000 rows, so we're just drawing large neighborhoods. In practical terms, `n_neighbors=23` scores close to the best result and is a much more reasonable number, but then we wouldn't have 101 Dalmatians.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier(n_neighbors=101)
clf_knn = clf_knn.fit(X_train_pca, y_train_pca)
predicted_knn = clf_knn.predict(X_test_pca)
accuracy_score(y_test_pca, predicted_knn)

Still not great, but considerably better than before!

In [None]:
cm_knn = confusion_matrix(y_test_pca, predicted_knn)
ConfusionMatrixDisplay(cm_knn).plot()

In [None]:
print(classification_report(y_test_pca, predicted_knn))
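
The choice of `n_neighbors` above can be explored a bit more systematically with cross-validation on the training data. The following is a rough sketch, assuming synthetic three-class data in place of `X_train_pca` and `y_train_pca`; the list of candidate values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for X_train_pca / y_train_pca.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

candidate_ks = [1, 5, 11, 23, 51, 101]
mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this value of k.
    mean_scores[k] = cross_val_score(knn, X_toy, y_toy, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

Because the search only touches the training data, the test set stays untouched for the final accuracy check.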

### Logistic Regression

Our next algorithm is logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(max_iter=10000)
clf_lr = clf_lr.fit(X_train, y_train)
predicted_lr = clf_lr.predict(X_test)
accuracy_score(y_test, predicted_lr)

Accuracy is back on track.  Here, accuracy is worse than random forest or XGBoost but better than CART.

In [None]:
cm_lr = confusion_matrix(y_test, predicted_lr)
ConfusionMatrixDisplay(cm_lr).plot()

In [None]:
print(classification_report(y_test, predicted_lr))

Logistic regression does a rather good job of it.  It doesn't do terribly on any of the classes, though like everything else, it has difficulties with Sira.
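
One thing that does change in the multi-class setting is the shape of the probability output: instead of two columns, `predict_proba` returns one column per class. Here is a small sketch on synthetic three-class data (`X_toy`, `y_toy`, and `clf_toy` are illustrative names, not part of the notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

clf_toy = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
proba = clf_toy.predict_proba(X_toy[:5])

# One column per class; each row sums to 1.
print(proba.shape)
# Picking the most probable class per row should agree with predict().
most_likely = clf_toy.classes_[proba.argmax(axis=1)]
print(most_likely, clf_toy.predict(X_toy[:5]))
```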

### Naive Bayes

Our next algorithm is multinomial Naive Bayes.  `CategoricalNB` is for cases where we have categorical input data, but all of our features are numeric, so we need a different variant of Naive Bayes.  Here we'll use `MultinomialNB`.  It was designed with count-like features in mind, but it accepts any non-negative numeric input, which is the case for our raw features.  `GaussianNB` is the other natural candidate for continuous features.

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf_nb = MultinomialNB(alpha=1)
clf_nb = clf_nb.fit(X_train, y_train)
predicted_nb = clf_nb.predict(X_test)
accuracy_score(y_test, predicted_nb)

This is a little better than kNN, but way below the trees or logistic regression.

In [None]:
cm_nb = confusion_matrix(y_test, predicted_nb)
ConfusionMatrixDisplay(cm_nb).plot()

In [None]:
print(classification_report(y_test, predicted_nb))

Bombay still appears to be easily separable, but for the rest of them, values are worse across the board. And PCA isn't a good call here because multinomial Naive Bayes requires all feature values to be non-negative, and our principal components include negative values.
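
The non-negativity restriction is easy to see in a small experiment. The sketch below uses made-up, zero-centered data (`X_toy`, similar in spirit to standardized or PCA-transformed features) and shows two possible ways around the problem. Neither is something the notebook itself does, so treat them as options to try rather than recommendations.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Toy features centered on zero, like standardized or PCA-transformed data.
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

try:
    MultinomialNB().fit(X_toy, y_toy)
    multinomial_failed = False
except ValueError as err:
    multinomial_failed = True
    print("MultinomialNB rejected the data:", err)

# Option 1: rescale into [0, 1] so every value is non-negative.
X_nonneg = MinMaxScaler().fit_transform(X_toy)
nb_multi = MultinomialNB().fit(X_nonneg, y_toy)

# Option 2: GaussianNB models each feature as a per-class normal distribution.
nb_gauss = GaussianNB().fit(X_toy, y_toy)
gauss_acc = nb_gauss.score(X_toy, y_toy)
print(f"GaussianNB training accuracy: {gauss_acc:.3f}")
```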

### Passive-Aggressive

Our next algorithm is passive-aggressive classification.

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

clf_pa = PassiveAggressiveClassifier(C=1.0, max_iter=3000, random_state=1740)
clf_pa = clf_pa.fit(X_train, y_train)
predicted_pa = clf_pa.predict(X_test)
accuracy_score(y_test, predicted_pa)


Wow, what a drop-off!  Passive-Aggressive classifiers often give you great outcomes, but in this case, it's a dog.

In [None]:
cm_pa = confusion_matrix(y_test, predicted_pa)
ConfusionMatrixDisplay(cm_pa).plot()

This visualization helps us see what's going wrong:  it looks like the passive-aggressive classifier has become so focused on Dermason that its Dermason predictions completely overwhelm Horoz and Seker.

In [None]:
print(classification_report(y_test, predicted_pa))

We can see the same pattern here as we've seen all along, where Bombay is easy and Sira is tough.  Barbunya typically isn't that tough, but passive-aggressive is struggling in our scenario.
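
The link between the confusion matrix and the per-class numbers in the classification report is just row and column arithmetic. Here is a sketch with a hypothetical three-class matrix; the counts are invented to mimic one class swallowing the others and are not taken from the notebook's output.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class).
class_names = ["DERMASON", "HOROZ", "SEKER"]
cm_toy = np.array([
    [90,  5,  5],
    [60, 40,  0],
    [70,  0, 30],
])

# Recall: share of each true class that was found (diagonal over row sums).
recall = cm_toy.diagonal() / cm_toy.sum(axis=1)
# Precision: share of each predicted class that was right (diagonal over column sums).
precision = cm_toy.diagonal() / cm_toy.sum(axis=0)

for name, r, p in zip(class_names, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

In this made-up case, the over-predicted class has high recall but poor precision, while the classes it absorbs show the opposite pattern.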

But what if we try against the PCA dataset?

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier

clf_pa = PassiveAggressiveClassifier(C=1.0, max_iter=3000, random_state=1740)
clf_pa = clf_pa.fit(X_train_pca, y_train_pca)
predicted_pa = clf_pa.predict(X_test_pca)
accuracy_score(y_test_pca, predicted_pa)

In [None]:
cm_pa = confusion_matrix(y_test_pca, predicted_pa)
ConfusionMatrixDisplay(cm_pa).plot()

In [None]:
print(classification_report(y_test_pca, predicted_pa))

Better, but still not good: we have zero predictions for Bombay and none of our classes meet CART's not-so-lofty F1 scores.

The moral of the story is, don't limit yourself to just one classifier!
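
In that spirit, here is one way a showdown like this could be condensed into a single loop. This is a sketch on synthetic four-class data, so the scores say nothing about the bean dataset; the `classifiers` dictionary and the choice of macro-averaged F1 as a second metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the bean dataset.
X_toy, y_toy = make_classification(
    n_samples=800, n_features=8, n_informative=5, n_redundant=1,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=1740)

classifiers = {
    "CART": DecisionTreeClassifier(random_state=106),
    "Random forest": RandomForestClassifier(random_state=106),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic regression": LogisticRegression(max_iter=10000),
}

results = {}
for name, clf in classifiers.items():
    predicted = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, predicted),
        "macro_f1": f1_score(y_te, predicted, average="macro"),
    }

# Print the leaderboard, best accuracy first.
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name:20s} accuracy={scores['accuracy']:.3f} macro_f1={scores['macro_f1']:.3f}")
```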