# <span style="font-family: Arial, sans-serif; color:#97f788">Ordinal Classification</span>

<span style="font-family: Arial, sans-serif; color:navyblue">Author: <a href="https://github.com/deburky" title="GitHub link">https://github.com/deburky</a></span>

This notebook tracks some experiments with the simple ordinal classification approach (SOCA) proposed by Eibe Frank and Mark Hall.

I owe the discovery of this approach to @mosh98. You can find their implementation [here](https://github.com/mosh98/Ordinal_Classifier).
The corresponding Medium post is [here](https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c).

### Simple Ordinal Classification Approach

<table>
<tr>
<td><img src="https://media.springernature.com/w316/springer-static/cover-hires/book/978-3-540-44795-5?as=webp" width="700" height="400" alt="Book cover image"></td>
<td>

For each threshold k we recode the target as 1 if target > k and 0 otherwise, and train one binary classifier per threshold. The resulting classifiers are then combined as an ensemble. The first and last classifiers give the probabilities of the first and last class directly (the first one as a complement), and the classes in between are modeled from the probabilities of two adjacent classifiers.

For example, with 5 classes we model the probability of class 1 as:
- $1 - P(target > 1) = P(target \le 1)$

For the last class:
- $P(target > 4) = P(target = 5)$

The classes in between (class $k+1$ for $k = 1, 2, 3$) can be modeled either as:
- $P(target > k) - P(target > k+1)$ (approach in the scikit-learn implementation by @mosh98)
- $P(target > k) \times (1 - P(target > k+1))$ (original approach, p. 148)

Both approaches work well, but the original approach showed better results on our dataset.

**Reference:**
> Eibe Frank and Mark Hall. A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001. Lecture Notes in Computer Science, vol 2167. Springer, Berlin, Heidelberg. [Link](https://link.springer.com/chapter/10.1007/3-540-44795-4_13)

</td>
</tr>
</table>
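
The two ways of combining the binary classifier outputs can be sketched with NumPy. This is a minimal illustration, not the code from the paper or from @mosh98's repository: the helper name `combine_cumulative_probs` and the example probabilities are made up, and the input is assumed to hold $P(target > k)$ for $k = 1, \dots, 4$ in its columns.

```python
import numpy as np


def combine_cumulative_probs(p_greater: np.ndarray, method: str = "difference") -> np.ndarray:
    """Turn P(target > k), k = 1..K-1, into K class scores.

    Hypothetical helper for illustration only.
    p_greater has shape (n_samples, K - 1); column j holds P(target > j + 1).
    """
    n_samples, n_thresholds = p_greater.shape
    probs = np.empty((n_samples, n_thresholds + 1))

    # First class: 1 - P(target > 1)
    probs[:, 0] = 1.0 - p_greater[:, 0]
    # Last class: P(target > K - 1)
    probs[:, -1] = p_greater[:, -1]

    # Classes in between
    for j in range(1, n_thresholds):
        if method == "difference":
            probs[:, j] = p_greater[:, j - 1] - p_greater[:, j]
        elif method == "product":
            probs[:, j] = p_greater[:, j - 1] * (1.0 - p_greater[:, j])
        else:
            raise ValueError(f"Unknown method: {method}")
    return probs


# Made-up outputs of the four binary classifiers for one review (5 classes)
p_greater = np.array([[0.9, 0.7, 0.4, 0.1]])

probs_diff = combine_cumulative_probs(p_greater, method="difference")
probs_prod = combine_cumulative_probs(p_greater, method="product")

print(probs_diff)
print(probs_prod)
```

With the difference rule the row sums to 1 by construction. The product rule generally does not return a normalized distribution, so its values are best compared through their ranking (for example via `argmax`) or after renormalizing.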

### Practical Example

The dataset is [Amazon Fine Food Reviews](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews) with OpenAI embeddings ([repo](https://github.com/openai/openai-cookbook)). We use the embeddings as features (a 1536-dimensional vector per review) and the score (1-5) as the target variable.

To test the approach we use:

- Logistic Regression
- SVM
- One-layer Neural Network

To illustrate how models fit the embedding space, we visualize Fisher information on a reduced embedding space using t-SNE/PCA.

The idea of ordinal classification is similar to the proportional odds logistic regression model described in Applied Logistic Regression (3rd ed.) by David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant.

Ordinal classification is a more generic approach, since we can use probabilistic classifiers other than logistic regression.
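
For comparison, the proportional odds model shares one linear predictor across all thresholds and only shifts the cutpoints. The sketch below shows how class probabilities follow from the cumulative logits. The coefficients `beta`, the cutpoints `theta`, the inputs, and the function name are invented for illustration and are not fitted to the review data.

```python
import numpy as np


def proportional_odds_probs(X: np.ndarray, beta: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Class probabilities of a cumulative logit (proportional odds) model.

    P(y <= k | x) = sigmoid(theta_k - x @ beta), with increasing cutpoints theta.
    """
    eta = X @ beta  # shared linear predictor, shape (n_samples,)
    # Cumulative probabilities for every cutpoint, shape (n_samples, K - 1)
    cum = 1.0 / (1.0 + np.exp(-(theta[None, :] - eta[:, None])))
    # Pad with P(y <= lowest - 1) = 0 and P(y <= highest) = 1, then difference
    cum = np.hstack([np.zeros((len(X), 1)), cum, np.ones((len(X), 1))])
    return np.diff(cum, axis=1)


# Assumed toy values: two observations, two features, five ordered classes
X = np.array([[0.5, -1.0], [2.0, 0.3]])
beta = np.array([1.0, 0.5])
theta = np.array([-1.0, 0.0, 1.0, 2.0])

probs_po = proportional_odds_probs(X, beta, theta)
print(probs_po.round(3))
```

Because the cutpoints are increasing, the cumulative probabilities are monotone and every row is a valid distribution. With independently trained binary classifiers this monotonicity is not guaranteed, which is the price paid for being able to plug in any probabilistic classifier.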

### Implementation

There are several ways to perform ordinal classification, most based on logistic regression methodologies.

The scikit-learn implementation of an ordinal classifier comes from this [repo](https://github.com/mosh98/Ordinal_Classifier/tree/master), which implements the simple ordinal classification approach.

Additionally, the `spacecutter` library by Ethan Rosenthal is tested for PyTorch models.

You can read more about the PyTorch approach [here](https://www.ethanrosenthal.com/2018/12/06/spacecutter-ordinal-regression/) as well as find the [repo](https://github.com/EthanRosenthal/spacecutter).
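
To make the scikit-learn side concrete, here is a rough sketch of how such a wrapper can be built around any estimator that exposes `predict_proba`. It is not the code from the linked repository: the class name `SimpleOrdinalSketch` and the synthetic data are hypothetical, and only the difference rule is shown.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression


class SimpleOrdinalSketch:
    """Hypothetical minimal wrapper: one binary model per threshold 'y > k'."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        # One binary classifier for every class except the largest one
        self.models_ = [
            clone(self.base_estimator).fit(X, (y > k).astype(int))
            for k in self.classes_[:-1]
        ]
        return self

    def predict_proba(self, X):
        # Column j holds P(y > classes_[j])
        p_greater = np.column_stack(
            [m.predict_proba(X)[:, 1] for m in self.models_]
        )
        # Difference rule, with P(y > below lowest) = 1 and P(y > highest) = 0
        upper = np.hstack([np.ones((len(X), 1)), p_greater])
        lower = np.hstack([p_greater, np.zeros((len(X), 1))])
        return upper - lower

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Synthetic ordinal data: bin a noisy latent score into four ordered classes
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
latent = X_demo @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.5, size=300)
y_demo = np.digitize(latent, bins=[-1.0, 0.0, 1.0])  # labels 0..3

sketch = SimpleOrdinalSketch(LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)
proba_demo = sketch.predict_proba(X_demo)
preds_demo = sketch.predict(X_demo)
print(proba_demo[:3].round(3))
```

`clone` returns a fresh, unfitted copy of the base estimator for every threshold, so the same wrapper works with logistic regression, an SVM with `probability=True`, or any other probabilistic classifier.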

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from rich import print as rprint

datafile_path = (
    "data/fine_food_reviews_with_embeddings_1k.parquet"
)

df = pd.read_parquet(datafile_path)
df["embedding"] = df.embedding.apply(literal_eval).apply(
    np.array # type: ignore
)  # type: ignore # convert string to array

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values),
    df.Score,
    test_size=0.3,
    random_state=42,
)

# train a logistic regression
clf_open = LogisticRegression(
    fit_intercept=True,
    solver="newton-cg",
    penalty=None,
    random_state=42,
    multi_class="multinomial",
)

clf_open.fit(X_train, y_train)
preds = clf_open.predict(X_test)
probas = clf_open.predict_proba(X_test)

report = classification_report(
    y_test, preds, output_dict=False
)
rprint(report)

In [None]:
import pandas as pd
from sklearn.metrics import roc_curve, auc
from matplotlib import pyplot as plt
from typing import List, Union

%config InlineBackend.figure_format = 'retina'


# Visualizer of model performance
def plot_multiclass_roc_auc(
    y_score: np.ndarray,
    y_true_untransformed: pd.Series,
    class_list: List[int],
    classifier_name: str,
) -> None:
    """
    ROC AUC plotting for a multiclass problem.
    It plots the ROC curve and computes the AUC for each class,
    as well as the micro-average AUC.

    """
    n_classes = len(class_list)
    y_true = pd.concat(
        [
            (y_true_untransformed == class_list[i])
            for i in range(n_classes)
        ],
        axis=1,
    ).values

    # use the Set2 qualitative colormap from matplotlib
    colors = plt.get_cmap("Set2").colors

    fpr, tpr, roc_auc = dict(), dict(), dict()

    # Compute ROC curve and ROC AUC for each class
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(
            y_true[:, i],
            y_score[:, i],
            drop_intermediate=False,
        )
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Compute micro-average ROC curve and ROC AUC
    fpr["micro"], tpr["micro"], _ = roc_curve(
        y_true.ravel(), y_score.ravel()
    )
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

    print(
        f"{classifier_name} - Micro-average Gini: {roc_auc['micro'] * 2 - 1:.2f}"
    )

    # Plot ROC curves
    plt.figure(figsize=(5, 5), dpi=100)
    plt.plot(
        fpr["micro"],
        tpr["micro"],
        label=f'Micro-Avg={roc_auc["micro"] * 2 - 1:.2f}',
        color="deeppink",
        linestyle=":",
        linewidth=4,
    )

    for i in range(n_classes):
        plt.plot(
            fpr[i],
            tpr[i],
            lw=2,
            label=f"Class {class_list[i]}={roc_auc[i] * 2 - 1:.2f}",
            color=colors[i],
        )

    plt.plot([0, 1], [0, 1], "k--", lw=2)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("1 - TNR")
    plt.ylabel("TPR")
    plt.title("ROC curves")
    plt.legend(loc="lower right")
    plt.show()

In [None]:
plot_multiclass_roc_auc(
    probas, y_test, [1, 2, 3, 4, 5], "Logistic Regression"
)

### Spacecutter

In [None]:
import numpy as np
from torch import nn
from spacecutter.models import OrdinalLogisticModel
from skorch import NeuralNet
from spacecutter.callbacks import AscensionCallback
from spacecutter.losses import CumulativeLinkLoss

df_ord = df.copy()

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    list(df_ord.embedding.values),
    df_ord.Score - 1,
    test_size=0.3,
    random_state=42,
)

X_train = np.array(X_train, dtype=np.float32).copy()
y_train = np.array(y_train).reshape(-1, 1).copy()

# Determine the number of features and classes
num_features = X_train.shape[1]
num_classes = len(np.unique(y_train))

print(f"Number of unique classes: {num_classes}")

# Feedforward Neural Network
predictor = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.5),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.BatchNorm1d(128),
    nn.Dropout(0.5),
    nn.Linear(128, 1)
)

# Pass the module class; skorch builds it from the module__* arguments
ordinal_regression = NeuralNet(
    module=OrdinalLogisticModel,
    module__predictor=predictor,
    module__num_classes=num_classes,
    criterion=CumulativeLinkLoss,
    lr=0.2,
    max_epochs=100,
    train_split=None,
    callbacks=[
        ('ascension', AscensionCallback()),
    ],
)

ordinal_regression.fit(X_train, y_train)

In [None]:
from rich import print as rprint

X_test_arr = np.array(X_test, dtype=np.float32)

# predict_proba returns an array of shape (n_samples, num_classes)
probas = ordinal_regression.predict_proba(X_test_arr)

# The predicted class is the one with the highest probability
preds_ord = probas.argmax(axis=-1)

plot_multiclass_roc_auc(
    probas, y_test, [0, 1, 2, 3, 4], "Spacecutter ordinal regression"
)

report = classification_report(y_test.values, preds_ord)
rprint(report)

### Ordinal Classification

[Source](https://github.com/mosh98/Ordinal_Classifier/blob/master/Ordinal_Classifier.py)

In [None]:
import pandas as pd
import numpy as np
from ast import literal_eval

from sklearn import svm

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from ordinal_classifier import OrdinalClassifier

from rich.console import Console

console = Console()

datafile_path = (
    "data/fine_food_reviews_with_embeddings_1k.parquet"
)

df = pd.read_parquet(datafile_path)
df["embedding"] = df.embedding.apply(literal_eval).apply(
    np.array
)  # type: ignore

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    list(df.embedding.values),
    df.Score - 1,
    test_size=0.3,
    random_state=42,
)

n_classes = len(np.unique(y_train))
X_train_ = np.array(X_train, dtype=np.float16)
X_test_ = np.array(X_test, dtype=np.float16)
y_train_ = np.array(y_train).ravel()
y_test_ = np.array(y_test).ravel()

clf_base = svm.SVC(
    probability=True,
    random_state=42,
)

clf = OrdinalClassifier(clf_base)

clf.fit(X_train_, y_train_)
preds_ord_clf = clf.predict(X_test_)
probas_ord_clf = clf.predict_proba(X_test_)

report = classification_report(
    y_test, preds_ord_clf, output_dict=False
)
console.print(report)

### Comparison (Softmax vs Ordinal)

In [None]:
import numpy as np

def ordinal_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Share of predictions that hit the true class or a directly adjacent class."""
    total_count = len(y_true)

    accurate_count = sum(
        1
        for true_label, pred_label in zip(y_true, y_pred)
        if pred_label in [true_label, true_label - 1, true_label + 1]
    )
    return accurate_count / total_count


# clf_open was trained on scores 1-5, while y_test here holds scores 0-4,
# so shift the logistic regression predictions to the same scale
preds_lr = clf_open.predict(X_test) - 1
preds_svm = clf.predict(X_test_)

# Logistic Regression Model
accuracy_log_reg = ordinal_accuracy(y_test, preds_lr)
print(f"Ordinal Accuracy for Logistic Regression: {accuracy_log_reg:.2%}")

# Ordinal SVM Model
accuracy_ordinal_svm = ordinal_accuracy(y_test, preds_svm)
print(f"Ordinal Accuracy for Ordinal SVM: {accuracy_ordinal_svm:.2%}")

### Fisher information

The Fisher information shown here is based on the gradients of the log-likelihood. In the code these gradients are taken numerically over a grid in the reduced embedding space rather than with respect to the model parameters, so the result can be viewed as an observed, data-driven estimate of the Fisher information.

In the code, the Fisher information is computed as the sum of the squares of the gradients of the log-likelihood:

```python
for i in range(log_likelihoods.shape[1]):
    zz = log_likelihoods[:, i].reshape(xx.shape)
    gx, gy = np.gradient(zz)
    fim[:, :, i] = gx**2 + gy**2
```

Here, `gx` and `gy` are the gradients of the log-likelihoods along the two dimensions of the t-SNE space. The Fisher information is then estimated as `gx**2 + gy**2`.

For the 3D plot, the Fisher information is calculated in the same way, as the sum of the squares of the gradients in the 3D space.
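
A self-contained version of the 2D computation is sketched below. The grid, the toy three-class logistic regression, and all data are invented so that the loop has something to run on. They are assumptions for illustration and not the internals of `FisherInformationVisualizer`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy 2D data standing in for the reduced (t-SNE/PCA) embedding space
X_2d, y_2d = make_blobs(
    n_samples=200,
    centers=[[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]],
    cluster_std=1.5,
    random_state=42,
)
model = LogisticRegression(max_iter=1000).fit(X_2d, y_2d)

# Regular grid covering the 2D space
xs = np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 50)
ys = np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 40)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Log-probability of each class at every grid point, shape (n_points, n_classes)
log_likelihoods = model.predict_log_proba(grid)

# One Fisher information map per class
fim = np.zeros((*xx.shape, log_likelihoods.shape[1]))
for i in range(log_likelihoods.shape[1]):
    zz = log_likelihoods[:, i].reshape(xx.shape)
    # Finite-difference gradients along the two grid axes (rows first, then columns)
    gx, gy = np.gradient(zz)
    fim[:, :, i] = gx**2 + gy**2

print(fim.shape)
```

`np.gradient` is called without spacing arguments here, so the derivatives are per grid step rather than per unit of the embedding coordinates. Pass the grid spacings if values in coordinate units are needed.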

In [None]:
from fisher_information_visualizer import FisherInformationVisualizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# clf_base = LogisticRegression(max_iter=1000, solver='newton-cg', C=10.0, random_state=42)
clf_base = SVC(probability=True, random_state=42)
# clf_base = MLPClassifier(hidden_layer_sizes=(100, 10, ), max_iter=1000, random_state=42, learning_rate_init=1e-2)

visualizer = FisherInformationVisualizer(clf_base, X_train, y_train.ravel(), X_test, y_test, method='t-SNE')
# visualizer = FisherInformationVisualizer(clf_base, X_train, y_train, X_test, y_test, method='PCA')
visualizer.plot_2d_fisher_information(dimensionality=100, contour_levels=100)
visualizer.plot_3d_fisher_information(dimensionality=100)