ghostqriver/AROSS

AROSS: Area-based Representative points OverSampling with Shifting in Imbalance Learning

Area-based Representative Points Oversampling with Shifting (AROSS) is an algorithm that targets the class imbalance problem. It balances a dataset by generating synthetic minority-class instances in the safe and half-safe areas populated around representative points. This approach captures disjoint subsets of the minority class effectively and avoids introducing class overlap into the dataset.

Cite AROSS

If you wish to refer to our work, please use the following BibTeX citation:

Coming soon.

Installation

AROSS was developed with Python 3.9 and the following dependencies:

  • scikit-learn (1.1.2)
  • pandas (1.4.2)
  • numpy (1.21.5)
  • pyclustering (0.10.1.2)
  • kneed (0.8.1)
  • scipy (1.8.1)

Basic usage

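from AROSS import AROSS
from utils.utils import read_data
from utils.visualize import show_oversampled

# Load the features and class labels from the sample dataset
X, y = read_data('Datasets/sampledata_new_3.csv')

# Oversample the minority class using 5 clusters and Ward linkage
ar = AROSS(n_cluster=5, linkage='ward')
X_oversampled, y_oversampled = ar.fit_sample(X, y)

# Plot the original data alongside the oversampled data
show_oversampled(X, y, X_oversampled, y_oversampled)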

Output figure:

About AROSS

AROSS proceeds in four steps:

  1. Clustering the input features using agglomerative clustering [1]
     • When n_cluster is not given, the algorithm determines it automatically using BIC [2]
     • When linkage is not given, the algorithm determines it automatically using CPCC [3]

  2. Extracting the representative points from the clustering results [4]

  3. Populating and classifying the areas surrounding the representative points

  4. Generating synthetic instances using the Gaussian Generator
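
As an illustration of the automatic linkage selection mentioned in the clustering step, the sketch below picks the linkage with the highest cophenetic correlation coefficient (CPCC) using SciPy. It is not taken from the AROSS source: the helper name choose_linkage_by_cpcc, the candidate list, and the demo data are assumptions made for this example, and the actual implementation may select the linkage differently.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist


def choose_linkage_by_cpcc(X, candidates=("ward", "complete", "average", "single")):
    """Hypothetical helper: return the candidate linkage with the highest CPCC."""
    distances = pdist(X)  # condensed pairwise Euclidean distances
    scores = {}
    for method in candidates:
        Z = linkage(X, method=method)     # hierarchical clustering for this linkage
        cpcc, _ = cophenet(Z, distances)  # correlation of tree vs. original distances
        scores[method] = float(cpcc)
    best = max(scores, key=scores.get)
    return best, scores


# Demo on two synthetic Gaussian blobs (illustrative data only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
best_linkage, cpcc_scores = choose_linkage_by_cpcc(X_demo)
print(best_linkage, cpcc_scores)
```

A CPCC closer to 1 means the dendrogram preserves the original pairwise distances more faithfully, which is why the highest score is taken here.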

AROSS - shifting

Shifting is an additional operation performed after the representative points are extracted (step 2): when the given alpha is not 0, the representative points (reps) are shifted toward the centroid of their cluster. The greater the alpha, the further the reps are shifted toward the centroid.
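
A minimal sketch of the shifting idea follows. It assumes the shift is a linear interpolation between each rep and the cluster centroid, in the style of the shrinking step in CURE [4], with alpha in the range 0 to 1. The exact formula used by AROSS is not stated here, and the function name shift_representatives is hypothetical.

```python
import numpy as np


def shift_representatives(reps, cluster_points, alpha):
    """Hypothetical helper: move each rep a fraction alpha of the way to the centroid."""
    reps = np.asarray(reps, dtype=float)
    centroid = np.asarray(cluster_points, dtype=float).mean(axis=0)
    # alpha = 0 leaves the reps unchanged; alpha = 1 collapses them onto the centroid
    return reps + alpha * (centroid - reps)


# Illustrative cluster: the four corners of a square, with centroid (1, 1)
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
reps = np.array([[0.0, 0.0], [2.0, 2.0]])
shifted = shift_representatives(reps, cluster, alpha=0.5)
print(shifted)
```

Under this assumption, alpha = 0.5 moves each rep halfway to the centroid, so the two reps above end up at (0.5, 0.5) and (1.5, 1.5).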

References

[1] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[2] Gideon Schwarz. “Estimating the dimension of a model”. In: The Annals of Statistics (1978), pp. 461–464.

[3] James S Farris. “On the cophenetic correlation coefficient”. In: Systematic Zoology 18.3 (1969), pp. 279–285.

[4] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. “CURE: An efficient clustering algorithm for large databases”. In: ACM SIGMOD Record 27.2 (1998), pp. 73–84.
