## ***Intro***

When working with highly unbalanced datasets, it can be challenging to develop a classification model that effectively distinguishes between classes. Resampling techniques can help balance the dataset, but if the data is **intrinsically unbalanced**, performance on the test set or in real-world scenarios might not improve and could even decrease. Assigning a high penalty to misclassified minority-class examples can produce a model that predicts the minority class at the **slightest suspicion**, resulting in high recall but poor precision and accuracy.

In such cases, better results may be achieved by training an anomaly detection model instead of a traditional classification model. By treating the minority class as anomalies and allowing the model to focus on understanding what defines a "normal" data point (the majority class), we can improve its ability to distinguish between classes.

Treating the minority class as anomalies is a well-established idea in the machine learning and data mining community. It relies on the observation that, in a highly unbalanced dataset, minority-class examples are rare compared to the majority class. Anomaly detection algorithms such as the **One-Class Support Vector Machine** can be used to identify and separate these minority-class instances.

In this notebook, we address a binary classification problem (**predicting loan defaults**) using this approach with One-Class SVM.
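
Before turning to the loan data, the idea can be illustrated with a minimal sketch on synthetic points. The data, the `nu` value, and all variable names below are assumptions chosen for illustration, not the notebook's actual setup: the model is fitted on "normal" points only and then asked to judge unseen normal points and far-away anomalies.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the loan dataset):
# "normal" points cluster near the origin, "anomalies" sit far away.
normal_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
normal_test = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
anomalies = rng.normal(loc=6.0, scale=1.0, size=(20, 2))

# Fit on normal points only; no labels are passed.
model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
model.fit(normal_train)

# predict returns +1 for inliers and -1 for outliers.
pred_normal = model.predict(normal_test)
pred_anomalies = model.predict(anomalies)

print("normal points flagged as anomalies:", int((pred_normal == -1).sum()))
print("anomalies flagged as anomalies:", int((pred_anomalies == -1).sum()))
```

On real data the two groups are rarely this well separated, which is why the rest of the notebook measures precision and recall rather than assuming a clean split.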

## ***Imports***

In [1]:
# For data manipulation and linear algebra
import pandas as pd
import numpy as np

# Utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.pipeline import make_pipeline

# The One-Class SVM model
from sklearn.svm import OneClassSVM

## ***Functions***

In [2]:
def convert_to_tensor(*args, target=False) -> tuple:
    """Converts one or more DataFrames to NumPy arrays ("tensors").

    Args:
        *args: One or more DataFrames (or Series) to convert.
        target: If True, the arrays keep the original dtypes of the input (used for labels).
            If False (default), the arrays are cast to float.

    Returns:
        A tuple of np.ndarray, one for each input.
    """
    # Build a real tuple (not a generator) so the return annotation holds
    if target:
        return tuple(np.array(df) for df in args)
    return tuple(np.array(df).astype(float) for df in args)

## ***Data preparation***

**We will use a base table of 100 features selected from a set of features generated with Deep Feature Synthesis on the Home Credit - Credit Risk Model Stability competition dataset totaling over 1,500,000 examples.**

In [3]:
dataset = pd.read_parquet("/kaggle/input/deep-feature-synthesis-home-credit-stability/base_100features.parquet")
dataset

case_id,WEEK_NUM,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),...,MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision),target
0,0,4,259,2828,5,221,2093,6,2,2,...,2,2,0,0.0,0.0,0.0,1,3,3,0
1,0,5,259,2828,5,221,2093,6,2,0,...,2,2,0,0.0,0.0,0.0,1,3,3,0
2,0,5,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,1.0,1,3,4,0
3,0,3,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,2.0,1,3,3,0
4,0,4,259,2828,5,221,2093,8,2,2,...,2,2,0,0.0,0.0,1.0,1,3,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703450,91,1,143,356,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,2.0,10,0,0,0
2703451,91,2,403,2825,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,3.0,10,0,0,0
2703452,91,1,273,3202,5,221,2093,1,1,0,...,4,3,0,2.0,0.0,1.0,10,0,0,0
2703453,91,2,148,1959,5,221,2093,0,1,0,...,4,0,0,2.0,1.0,3.0,10,0,0,0


In [4]:
# Isolate the positive examples
class_one = dataset[dataset["target"] == 1].drop(columns="WEEK_NUM")

In [5]:
# The positive class is heavily under-represented (about 3% of the entire dataset)
class_one

case_id,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),MODE(person_1.sex_738L),...,MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision),target
4,4,259,2828,5,221,2093,8,2,2,0,...,2,2,0,0.0,0.0,1.0,1,3,4,1
101,4,259,2828,5,221,2093,1,2,2,1,...,2,2,0,0.0,0.0,1.0,1,3,3,1
118,4,259,2828,5,221,2093,1,2,2,1,...,2,2,0,0.0,0.0,0.0,1,3,3,1
129,4,259,2828,5,221,2093,8,2,2,1,...,2,2,0,0.0,0.0,0.0,1,3,4,1
148,4,259,2828,5,221,2093,1,2,2,1,...,2,2,0,0.0,0.0,0.0,1,3,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2702884,1,515,2177,5,221,2093,6,0,0,0,...,4,4,0,0.0,0.0,1.0,10,0,4,1
2702904,1,297,1304,5,221,2093,8,1,0,1,...,4,0,0,1.0,1.0,2.0,10,0,4,1
2703005,1,232,233,5,221,2093,6,0,0,0,...,4,0,0,3.0,3.0,2.0,10,0,5,1
2703400,1,798,1536,5,221,2093,8,1,0,0,...,4,0,0,8.0,4.0,2.0,10,0,0,1
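
The class share mentioned in the comment above can be checked directly from the target column. The snippet below is a small sketch on a made-up table (the `toy` DataFrame and its values are assumptions for illustration); with the real data, the same two calls would be applied to the `target` column before the positives are removed.

```python
import pandas as pd

# Toy stand-in for the real table (values are made up for illustration).
toy = pd.DataFrame({
    "feature": range(10),
    "target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy["target"].value_counts(normalize=True)
print(class_share)

# For a 0/1 target, the mean is the share of positives.
positive_share = toy["target"].mean()
print(f"positive share: {positive_share:.1%}")
```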


In [6]:
# Remove the positive examples from the training data
dataset = dataset.drop(index=class_one.index)
dataset

case_id,WEEK_NUM,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),...,MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision),target
0,0,4,259,2828,5,221,2093,6,2,2,...,2,2,0,0.0,0.0,0.0,1,3,3,0
1,0,5,259,2828,5,221,2093,6,2,0,...,2,2,0,0.0,0.0,0.0,1,3,3,0
2,0,5,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,1.0,1,3,4,0
3,0,3,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,2.0,1,3,3,0
5,0,3,259,2828,5,221,2093,1,2,0,...,2,2,0,0.0,0.0,0.0,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703450,91,1,143,356,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,2.0,10,0,0,0
2703451,91,2,403,2825,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,3.0,10,0,0,0
2703452,91,1,273,3202,5,221,2093,1,1,0,...,4,3,0,2.0,0.0,1.0,10,0,0,0
2703453,91,2,148,1959,5,221,2093,0,1,0,...,4,0,0,2.0,1.0,3.0,10,0,0,0


In [7]:
# Drop WEEK_NUM and target to form the training entries X
week_num = dataset["WEEK_NUM"]
target = dataset["target"]
dataset.drop(columns=["WEEK_NUM", "target"], inplace=True)
dataset

case_id,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),MODE(person_1.sex_738L),...,MODE(static_cb_0.education_1103M),MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision)
0,4,259,2828,5,221,2093,6,2,2,0,...,3,2,2,0,0.0,0.0,0.0,1,3,3
1,5,259,2828,5,221,2093,6,2,0,1,...,3,2,2,0,0.0,0.0,0.0,1,3,3
2,5,259,2828,5,221,2093,8,2,0,0,...,3,2,2,0,0.0,0.0,1.0,1,3,4
3,3,259,2828,5,221,2093,8,2,0,0,...,3,2,2,0,0.0,0.0,2.0,1,3,3
5,3,259,2828,5,221,2093,1,2,0,0,...,3,2,2,0,0.0,0.0,0.0,1,3,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703450,1,143,356,5,221,2093,0,1,0,0,...,4,4,3,0,0.0,0.0,2.0,10,0,0
2703451,2,403,2825,5,221,2093,0,1,0,0,...,4,4,3,0,0.0,0.0,3.0,10,0,0
2703452,1,273,3202,5,221,2093,1,1,0,1,...,4,4,3,0,2.0,0.0,1.0,10,0,0
2703453,2,148,1959,5,221,2093,0,1,0,0,...,1,4,0,0,2.0,1.0,3.0,10,0,0


In [8]:
# Split the dataset into training and test sets (randomly shuffling the data)
X_train, X_test, Y_train, Y_test = train_test_split(dataset, target, test_size=0.05, shuffle=True, random_state=123)

In [9]:
# Build the test set: all positive samples (anomalies) plus the held-out class 0 samples (normal)
Y_test = pd.concat([class_one["target"], Y_test])
X_test = pd.concat([class_one.drop(columns="target"), X_test])
X_test

case_id,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),MODE(person_1.sex_738L),...,MODE(static_cb_0.education_1103M),MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision)
4,4,259,2828,5,221,2093,8,2,2,0,...,3,2,2,0,0.0,0.0,1.0,1,3,4
101,4,259,2828,5,221,2093,1,2,2,1,...,3,2,2,0,0.0,0.0,1.0,1,3,3
118,4,259,2828,5,221,2093,1,2,2,1,...,3,2,2,0,0.0,0.0,0.0,1,3,3
129,4,259,2828,5,221,2093,8,2,2,1,...,3,2,2,0,0.0,0.0,0.0,1,3,4
148,4,259,2828,5,221,2093,1,2,2,1,...,3,2,2,0,0.0,0.0,0.0,1,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603324,3,259,2828,5,221,2093,1,2,0,1,...,3,2,2,0,0.0,0.0,0.0,1,3,6
2656157,1,74,1061,5,221,2093,8,1,0,0,...,2,4,0,1,0.0,0.0,2.0,1,3,0
736282,4,259,2828,5,221,2093,1,2,2,0,...,4,4,3,0,0.0,0.0,1.0,7,2,0
2582828,2,716,2638,5,221,2093,0,1,0,0,...,1,4,0,0,0.0,0.0,1.0,6,1,1
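
The split used in the cells above (train on negatives only, test on held-out negatives plus every positive) can be sketched end to end on a toy table. The DataFrame, column names, and `test_size` below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in table (made-up values): 95 negatives, 5 positives.
toy = pd.DataFrame({
    "f1": range(100),
    "f2": [i % 7 for i in range(100)],
    "target": [0] * 95 + [1] * 5,
})

positives = toy[toy["target"] == 1]
negatives = toy[toy["target"] == 0]

# Split only the negatives; the model never sees a positive during training.
neg_train, neg_test = train_test_split(
    negatives, test_size=0.2, shuffle=True, random_state=123
)

X_train_toy = neg_train.drop(columns="target")

# The test set is the held-out negatives plus all positives.
test = pd.concat([positives, neg_test])
X_test_toy = test.drop(columns="target")
y_test_toy = test["target"]

print(X_train_toy.shape, X_test_toy.shape)
print(f"positive share in test set: {y_test_toy.mean():.1%}")
```

One consequence of this design is that the test set contains a much larger share of positives than the full dataset, since every positive lands in it. Precision and accuracy measured on such a test set should be read with that in mind.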


In [10]:
# Keep the true test labels in a DataFrame before converting to tensors
true_label_test = pd.DataFrame(Y_test)
true_label_test

case_id,target
4,1
101,1
118,1
129,1
148,1
...,...
603324,0
2656157,0
736282,0
2582828,0


In [11]:
# The training targets are not needed, since One-Class SVM is an unsupervised learning algorithm.
# It learns a boundary in feature space that encloses as many of the normal data points as possible
# while leaving a margin.
del Y_train

In [12]:
# Convert Xs and Y_test to tensor (np.ndarray)
X_train, X_test = convert_to_tensor(X_train, X_test)
Y_test, = convert_to_tensor(Y_test, target=True)

In [13]:
X_train.shape, X_test.shape

((1404731, 100), (121928, 100))

In [14]:
Y_test.dtype, X_train.dtype

(dtype('int64'), dtype('float64'))

## ***StandardScaler for normalization***

**This estimator acts as a preprocessing layer that shifts and scales each feature so that it has mean 0 and standard deviation 1. Calling .fit computes the per-feature mean and variance of the training data; calling .transform then applies input = (input - mean) / sqrt(var).**
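
As a quick sanity check of that formula, the sketch below compares `StandardScaler` with a manual computation on a tiny made-up array (the array `X` is an assumption for illustration). The manual version uses the population variance (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix: 3 samples, 2 features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = StandardScaler().fit(X)
scaled = scaler.transform(X)

# Manual version of the same transform: (x - mean) / sqrt(var),
# with the per-feature population variance (ddof=0).
manual = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0))

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```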

In [15]:
normalizer = StandardScaler()
normalizer.fit(X_train)
normalizer.transform(X_test[:1])

array([[ 1.83919035, -0.45531369,  0.73120892,  0.35371528,  0.24161303,
         0.19736789,  1.41493822,  1.19684817,  2.67904387, -0.77066656,
        -1.37365293, -1.81153817, -2.19172863, -1.38418382, -1.02127155,
        -0.92244544, -1.00178133, -2.03920019, -1.18238739, -1.16884014,
        -1.07304114, -0.03762435, -0.68216029, -0.16105809, -1.00000952,
        -0.71128993, -0.35064861, -0.34644742, -0.1732166 , -2.43614515,
        -0.28790524, -1.04679099, -0.71241014, -0.11407644, -0.02355975,
        -0.88121803, -0.38685975,  0.50266479, -0.47442117, -1.32746473,
         0.09995368, -1.54086436, -0.87985997,  0.09864465, -0.86249792,
         0.88686887,  3.66426434, -1.05027747, -0.8185344 , -1.35444393,
        -0.79202635, -1.00735234, -0.14029436, -0.21648698, -0.88926868,
        -0.77907876, -0.57453974,  0.37827476,  0.42659078, -0.62267149,
        -1.0285634 , -0.90969173, -0.64772668, -0.52818976, -0.40037697,
        -0.6788392 , -0.66391732, -0.51645261, -0.3

## ***Make pipeline***

In [16]:
pipe = make_pipeline(StandardScaler(),
                     OneClassSVM(kernel='rbf', nu=0.575, gamma='scale', verbose=True, max_iter=-1))

In [17]:
# One-Class SVM training is slow on large datasets, so we fit on a smaller subset of the training set
# Training with more samples may result in better performance
pipe.fit(X=X_train[:100000])

[LibSVM]........................
*
optimization finished, #iter = 24661
obj = 230768875.913652, rho = 10379.968367
nSV = 57513, nBSV = 57486


In [19]:
# Perform classification on test samples
y_pred = pipe.predict(X=X_test)
y_pred

array([-1, -1, -1, ..., -1,  1, -1])

In [20]:
# One-Class SVM returns -1 for outliers (anomalies) and 1 for inliers.
y_pred = np.where(y_pred == 1, 0, 1)
y_pred

array([1, 1, 1, ..., 1, 0, 1])

In [21]:
y_pred.sum()

77157

In [22]:
# Recall = tp / (tp + fn), where tp = true positives (class 1 predicted as 1) and fn = false negatives (class 1 predicted as 0)
recall_score(Y_test, y_pred)

0.717402175271909

In [23]:
# The precision score calculated as tp/(tp+fp) where tp == True Positive and fp == False Positive
precision_score(Y_test, y_pred)

0.4462459660173413

In [24]:
# This is the accuracy score of the model
accuracy_score(Y_test, y_pred)

0.5383423003739912
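
The three scores above all derive from the same confusion matrix. The sketch below recomputes them by hand on a small made-up label vector (the arrays `y_true` and `y_hat` are assumptions for illustration) and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = positive (anomaly), 0 = negative (normal).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)        # share of actual positives that were caught
precision = tp / (tp + fp)     # share of predicted positives that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(recall, recall_score(y_true, y_hat))
print(precision, precision_score(y_true, y_hat))
print(accuracy, accuracy_score(y_true, y_hat))
```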

In [25]:
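# The area under the ROC curve
# Note: y_pred holds hard 0/1 labels rather than continuous scores, so this value equals the balanced accuracy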
roc_auc_score(Y_test, y_pred)

0.5697541890507298
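
ROC AUC is usually computed from continuous scores rather than hard labels. The sketch below, on synthetic data (the data and all names are assumptions for illustration), shows one way to get such a score from a One-Class SVM: the negated `decision_function`, which grows as a point looks more anomalous. It also shows that AUC computed from hard 0/1 labels matches the balanced accuracy.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: normal points near the origin, anomalies shifted.
normal_train = rng.normal(0.0, 1.0, size=(500, 2))
normal_eval = rng.normal(0.0, 1.0, size=(200, 2))
anomalies_eval = rng.normal(2.5, 1.0, size=(50, 2))

X_eval = np.vstack([normal_eval, anomalies_eval])
y_eval = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])

model = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(normal_train)

# decision_function is positive for inliers and negative for outliers,
# so negate it to obtain a score that increases with "anomalousness".
anomaly_score = -model.decision_function(X_eval)
hard_labels = np.where(model.predict(X_eval) == 1, 0, 1)

auc_from_scores = roc_auc_score(y_eval, anomaly_score)
auc_from_labels = roc_auc_score(y_eval, hard_labels)

print("AUC from continuous scores:", auc_from_scores)
print("AUC from hard labels:", auc_from_labels)
print("balanced accuracy:", balanced_accuracy_score(y_eval, hard_labels))
```

In the notebook's setting, the analogous score would presumably come from `-pipe.decision_function(X_test)`, since a scikit-learn pipeline exposes the `decision_function` of its final estimator.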

In [26]:
true_label_test["predicted class"] = y_pred
true_label_test

case_id,target,predicted class
4,1,1
101,1,1
118,1,1
129,1,1
148,1,1
...,...,...
603324,0,1
2656157,0,1
736282,0,1
2582828,0,0


## ***Conclusion***

**Feel free to tune the model's parameters; pay special attention to "nu", which has a significant impact on precision and accuracy. You could also train with a different set of features. Thanks for reading.**
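
As a starting point for that tuning, here is a minimal sketch of a `nu` sweep on synthetic data (the data, the candidate values, and all names are assumptions for illustration, not tuned recommendations). In scikit-learn, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so larger values flag more points as anomalies. The training log above is consistent with this: 57,513 support vectors out of 100,000 samples with `nu=0.575`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: train on normal points only.
normal_train = rng.normal(0.0, 1.0, size=(400, 2))
X_eval = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),   # normal points (label 0)
    rng.normal(3.0, 1.0, size=(40, 2)),    # anomalies (label 1)
])
y_eval = np.concatenate([np.zeros(200, dtype=int), np.ones(40, dtype=int)])

results = {}
for nu in (0.05, 0.2, 0.5):
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel="rbf", nu=nu, gamma="scale"),
    )
    model.fit(normal_train)

    # Map +1 (inlier) to 0 and -1 (outlier) to 1, as in the notebook.
    y_hat = np.where(model.predict(X_eval) == 1, 0, 1)

    results[nu] = {
        "precision": precision_score(y_eval, y_hat, zero_division=0),
        "recall": recall_score(y_eval, y_hat),
        "flagged": y_hat.mean(),
    }

for nu, scores in results.items():
    print(nu, scores)
```

Comparing precision and recall across the candidate values makes the trade-off visible: raising `nu` tends to catch more anomalies at the cost of more false alarms.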