# Online Passive-Aggressive Algorithms

The Passive-Aggressive family of classifiers is our first example of an **online** algorithm, meaning that training is intended to happen one record at a time instead of in batches.  This works especially well when you learn the true outcome of each prediction quickly and want the model to keep learning continuously.
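As a minimal sketch of what one-record-at-a-time training can look like, scikit-learn exposes `partial_fit` for incremental updates.  The stream below is synthetic, and the names `X_stream`, `y_stream`, and `online_clf` are placeholders rather than part of the campus recruitment example.  Each record is scored first and learned from afterward.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)

# Synthetic stream (an assumption for illustration): label is 1 when the features sum above zero.
X_stream = rng.normal(size=(200, 3))
y_stream = (X_stream.sum(axis=1) > 0).astype(int)

online_clf = PassiveAggressiveClassifier(C=1.0, random_state=0)
classes = np.array([0, 1])  # partial_fit must know the full label set up front

n_correct = 0
for i, (x_i, y_i) in enumerate(zip(X_stream, y_stream)):
    x_i = x_i.reshape(1, -1)  # a single record is still a 2-D array of shape (1, n_features)
    if i > 0:
        # Predict before learning, so each record acts as unseen test data.
        n_correct += int(online_clf.predict(x_i)[0] == y_i)
    online_clf.partial_fit(x_i, [y_i], classes=classes)

print(f"Running accuracy: {n_correct / (len(y_stream) - 1):.2f}")
```

The rest of this notebook uses the ordinary `fit` method on a fixed training set, which is convenient for comparing against the other algorithms.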

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

clf = PassiveAggressiveClassifier(loss="squared_hinge", C=1.0, max_iter=1000, random_state=0, tol=1e-4)

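This classifier takes a new parameter:  `tol` (tolerance).  This acts as the stopping criterion: training stops once `loss > (previous_loss - tol)`, that is, once an epoch no longer improves the loss by at least `tol`, or when `max_iter` epochs have run, whichever comes first.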
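One way to see the stopping criterion at work is to check the fitted `n_iter_` attribute, which records how many epochs were actually run.  The sketch below uses made-up data (`X_demo`, `y_demo`) and a separate `demo_clf` so it does not interfere with the classifier defined above.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration), not the campus recruitment dataset.
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

demo_clf = PassiveAggressiveClassifier(
    loss="squared_hinge", C=1.0, max_iter=1000, random_state=0, tol=1e-4
)
demo_clf.fit(X_demo, y_demo)

# n_iter_ is the number of passes over the data made before training stopped.
print(f"Epochs used: {demo_clf.n_iter_} of at most {demo_clf.max_iter}")
```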

Now let's prep the data.  Because we'll do it the same way for each, we only need to do this once.  I'll also remove the bits where we analyze the data, as we've seen it enough times already.

In [None]:
campus_data = "../data/CampusRecruitment.csv"
df = pd.read_csv(campus_data, header=0)
y = df['status']
X = df.drop(['sl_no', 'status', 'salary'], axis=1)

## Pre-Processing

For this dataset, we want to use the columns leading up to `status` to determine if different college graduates were placed at a job.  Because the salary is determined by the placement status, we can't use it to predict if a new graduate will be placed, so we'll have to drop that column.  Note that if we were interested in doing a regression analysis, we could try to predict the salary given placement, but we're keeping it classy and sticking to classification algorithms only. Also, we'll drop the `sl_no` column because it's a unique student number.

Unlike the heart attack dataset, this dataset includes non-numeric features.

In [None]:
df

Before we can feed this data into the Passive-Aggressive classifier (or pretty much any other classification algorithm), we need to convert any text data into numeric data.  There are a few common techniques for encoding.  The technique we will use for our dataset is called one-hot encoding.  What it does is "pivot" the categorical data, so that each distinct categorical value gets its own feature.  For example, `gender` has two values, M and F.  One-hot encoding will create two new features, one for `gender=M` and one for `gender=F`.  We need to do this for each of the non-numeric features.

In [None]:
enc = preprocessing.OneHotEncoder()

# Get only categorical features
X_object = X.select_dtypes('object')

# Fit the input features to our encoder
enc.fit(X_object)

# Perform the transformation on our dataset
codes = enc.transform(X_object).toarray()
# The encoder was fit on a DataFrame, so it already knows the input column names
feature_names = enc.get_feature_names_out()

# Combine the new one-hot encoded features with our existing floats so that everything is numeric.
X_onehot = pd.DataFrame(codes, columns=feature_names, index=X.index).astype(int)
X = pd.concat([X.select_dtypes(exclude='object'), X_onehot], axis=1)
X.shape

The `shape` here shows that we have the same number of rows as before (215), but the number of columns went from 12 to 21.  We started with 5 floating-point features and 7 categorical ones.  One-hot encoding replaced the 7 categorical columns with 16 indicator columns: five of our categorical variables had 2 values each and two had 3 values each (5 × 2 + 2 × 3 = 16), for 5 + 16 = 21 columns in total.
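The column count can be predicted before encoding.  The helper below, `expected_onehot_width`, is a hypothetical name introduced here, and the `toy` frame is made up.  It simply adds the number of numeric columns to the number of distinct values across the categorical columns.

```python
import pandas as pd

def expected_onehot_width(frame, categorical_cols):
    """Number of columns left after one-hot encoding the named categorical columns."""
    n_numeric = frame.shape[1] - len(categorical_cols)
    n_indicator = int(frame[categorical_cols].nunique().sum())
    return n_numeric + n_indicator

# Hypothetical frame: 1 numeric column, one 2-valued and one 3-valued categorical column.
toy = pd.DataFrame({
    "score": [61.0, 72.5, 58.3],
    "gender": ["M", "F", "M"],
    "stream": ["Arts", "Science", "Commerce"],
})
toy_width = expected_onehot_width(toy, ["gender", "stream"])
print(toy_width)  # 1 + 2 + 3 = 6

# The same arithmetic with the counts quoted above for our dataset.
campus_width = 5 + (5 * 2 + 2 * 3)
print(campus_width)  # 21
```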

In [None]:
X

By contrast, I'm going to perform a simple label encoding of the `status`.

In [None]:
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
y
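It helps to know which integer each label received, since that determines which class counts as "positive" in the confusion matrix later.  A small sketch, assuming the two `status` values are the strings `Placed` and `Not Placed`: `LabelEncoder` sorts the labels, so the mapping can be read off `classes_`.

```python
from sklearn import preprocessing

# Assumed label strings; check df['status'].unique() against the real data.
toy_status = ["Placed", "Not Placed", "Placed"]

toy_le = preprocessing.LabelEncoder()
toy_y = toy_le.fit_transform(toy_status)

# classes_[i] is the original label that was encoded as integer i.
mapping = {label: int(code) for code, label in enumerate(toy_le.classes_)}
print(mapping)
print(toy_y)

# inverse_transform recovers the original strings from the integers.
print(toy_le.inverse_transform(toy_y))
```

Under that assumption, `Not Placed` becomes 0 and `Placed` becomes 1, so a "true positive" is a correctly predicted placement.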

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1740)

In [None]:
X_train

Now let's train the passive-aggressive model.

In [None]:
clf = clf.fit(X_train, y_train)

## How'd We Do?

Let's first use the `accuracy_score` function in sklearn to see just how well we did.

In [None]:
predicted = clf.predict(X_test)
accuracy_score(y_test, predicted)

The headline accuracy is right in line with some other algorithms, including logistic regression.  Let's see how it does in the confusion matrix and classification report.

In [None]:
confusion_matrix(y_test, predicted)

In [None]:
print(classification_report(y_test, predicted))

Let's compare this to the rest of the group:

|Case|kNN|XGBoost|Logistic Regression|Naive Bayes|Online PA|
|----|---|-------|-------------------|-----------|---------|
|True Negative|13|18|17|15|14|
|False Positive|8|3|4|6|7|
|False Negative|4|5|6|5|3|
|True Positive|40|39|38|39|41|

Our passive-aggressive classifier turned out to be the best at predicting placements, with 41 true positives and only 3 false negatives, beating even XGBoost and Naive Bayes at the task.  It did suffer a little bit with respect to non-placements: with 14 true negatives, it barely beat kNN and did worse than the other three algorithms.
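The same comparison can be expressed as rates.  The sketch below takes the counts straight from the table and treats a placement as the positive class, so recall measures how well each model finds placements and specificity measures how well it finds non-placements.

```python
# Confusion-matrix counts copied from the comparison table: (TN, FP, FN, TP).
results = {
    "kNN": (13, 8, 4, 40),
    "XGBoost": (18, 3, 5, 39),
    "Logistic Regression": (17, 4, 6, 38),
    "Naive Bayes": (15, 6, 5, 39),
    "Online PA": (14, 7, 3, 41),
}

summary = {}
for name, (tn, fp, fn, tp) in results.items():
    summary[name] = {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "recall": tp / (tp + fn),        # share of actual placements we caught
        "specificity": tn / (tn + fp),   # share of actual non-placements we caught
    }

for name, scores in summary.items():
    print(
        f"{name:<20} accuracy={scores['accuracy']:.3f} "
        f"recall={scores['recall']:.3f} specificity={scores['specificity']:.3f}"
    )
```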

Note that online passive-aggressive classifiers are more dependent on the ordering of input data than most other algorithms.  Each record updates the model as soon as it is seen, so a small change in ordering could lead to a substantial difference in accuracy.  The benefits are that they can be very accurate (especially as information changes over time) and that you do not need massive amounts of data for retraining.
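A rough way to see the ordering effect is to make a single pass over the same records in two different orders and compare the learned coefficients.  This is only a sketch on synthetic, deliberately noisy data.  `single_pass` is a hypothetical helper, and `shuffle=False` is set so that scikit-learn keeps the rows in the order given.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration) with label noise, so updates keep happening.
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

def single_pass(order):
    """Train with one pass over the rows in exactly the given order."""
    model = PassiveAggressiveClassifier(C=1.0, shuffle=False, random_state=0)
    model.partial_fit(X_demo[order], y_demo[order], classes=np.array([0, 1]))
    return model

forward = single_pass(np.arange(100))
backward = single_pass(np.arange(100)[::-1])

# Same records, different order: compare the resulting weight vectors.
coef_gap = np.abs(forward.coef_ - backward.coef_).max()
print(f"Largest coefficient difference between the two orderings: {coef_gap:.4f}")
```

When fitting with `fit` rather than `partial_fit`, the `random_state` passed to the classifier controls the shuffling between epochs, which is one reason to fix it when comparing runs.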