Area-based Representative Points Oversampling with Shifting (AROSS) is an algorithm that targets the class imbalance problem. It balances a dataset by generating synthetic instances of the minority class in the safe and half-safe areas populated around representative points. This makes it effective at capturing the disjoint subsets of the minority class while avoiding the introduction of class overlap into the dataset.
If you wish to cite our work, please use the following BibTeX citation:
To be added soon.
The AROSS algorithm was developed under Python 3.9 with the following dependencies:
- scikit-learn (1.1.2)
- pandas (1.4.2)
- numpy (1.21.5)
- pyclustering (0.10.1.2)
- kneed (0.8.1)
- scipy (1.8.1)
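To compare a local environment against the versions listed above, a small check along the following lines may help. This is only a sketch: the helper name `check_versions` is hypothetical and not part of AROSS, and it assumes the packages are registered under the distribution names shown in the list. Other versions may well work; the list above only records what AROSS was developed with.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README (other versions are untested, not necessarily broken).
PINNED = {
    "scikit-learn": "1.1.2",
    "pandas": "1.4.2",
    "numpy": "1.21.5",
    "pyclustering": "0.10.1.2",
    "kneed": "0.8.1",
    "scipy": "1.8.1",
}


def check_versions(pinned=PINNED):
    """Return {package: (expected, installed or None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```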
from AROSS import AROSS
from utils.utils import read_data
from utils.visualize import show_oversampled
X, y = read_data('Datasets/sampledata_new_3.csv')
ar = AROSS(n_cluster=5, linkage='ward')
X_oversampled, y_oversampled = ar.fit_sample(X, y)
show_oversampled(X, y, X_oversampled, y_oversampled)
Output figure:
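As a numeric complement to the figure, the label counts before and after oversampling can be compared. The sketch below uses toy label lists in place of `y` and `y_oversampled`; the helpers `class_counts` and `imbalance_ratio` are hypothetical and not part of AROSS. How balanced the real output is depends on AROSS itself.

```python
from collections import Counter


def class_counts(y):
    """Count how many instances carry each class label."""
    return dict(Counter(list(y)))


def imbalance_ratio(y):
    """Majority-class size divided by minority-class size (1.0 means balanced)."""
    counts = class_counts(y).values()
    return max(counts) / min(counts)


# Toy labels standing in for y and y_oversampled (illustrative values only).
y_before = [0] * 90 + [1] * 10
y_after = [0] * 90 + [1] * 90

print(class_counts(y_before), imbalance_ratio(y_before))
print(class_counts(y_after), imbalance_ratio(y_after))
```

With real data, the same helpers could be called on the arrays returned by `read_data` and `fit_sample`.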
AROSS can be divided into four steps:
- Clustering the input features using agglomerative clustering [1]
  - When n_cluster is not given, the algorithm determines it automatically using BIC [2]
  - When linkage is not given, the algorithm determines it automatically using CPCC [3]
- Extracting the representative points from the clustering results [4]
- Populating and classifying the areas surrounding the representative points
- Generating synthetic instances using the Gaussian Generator
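To illustrate the automatic linkage choice in the first step, the sketch below scores several linkage methods by their cophenetic correlation coefficient (CPCC) using SciPy and keeps the highest-scoring one. It is an assumption about how such a selection could work, not the AROSS implementation: the function name `pick_linkage_by_cpcc`, the candidate list, and the toy data are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist


def pick_linkage_by_cpcc(X, candidates=("ward", "complete", "average", "single")):
    """Return (best_method, scores): the linkage with the highest CPCC."""
    distances = pdist(X)  # condensed pairwise Euclidean distances
    scores = {}
    for method in candidates:
        Z = linkage(X, method=method)  # hierarchical merge tree for this linkage
        cpcc, _ = cophenet(Z, distances)  # correlation of tree distances vs. original distances
        scores[method] = float(cpcc)
    best = max(scores, key=scores.get)
    return best, scores


rng = np.random.default_rng(0)
# Two well-separated toy blobs (illustrative data, not from the AROSS datasets).
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
best, scores = pick_linkage_by_cpcc(X)
print(best, scores)
```

A CPCC close to 1 means the dendrogram preserves the original pairwise distances well, which is why the highest score is taken as the preferred linkage here.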
AROSS - shifting
Shifting is an additional operation performed after the representative points (reps) are extracted (step 2): when the given alpha is not 0, the reps are shifted toward the centroid of their cluster. The greater the alpha, the further the reps are shifted toward the centroid.
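The following sketch shows one way representative points could be picked and then shifted. It assumes the CURE-style scheme from [4]: scattered points are chosen greedily, then each one is moved a fraction alpha of the way toward the cluster centroid, with alpha in [0, 1]. The function names and the toy cluster are hypothetical, and the exact formula used by AROSS may differ.

```python
import numpy as np


def farthest_point_reps(points, n_reps):
    """CURE-style selection: greedily pick well-scattered points of one cluster."""
    centroid = points.mean(axis=0)
    # Start with the point farthest from the centroid.
    first = int(np.argmax(np.linalg.norm(points - centroid, axis=1)))
    chosen = [first]
    # Distance from every point to its nearest already-chosen rep.
    min_dist = np.linalg.norm(points - points[first], axis=1)
    while len(chosen) < min(n_reps, len(points)):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]


def shift_reps(reps, centroid, alpha):
    """Move each rep a fraction alpha of the way toward the centroid (assumed rule)."""
    return reps + alpha * (centroid - reps)


# Toy cluster: four corners of a square plus its center (illustrative only).
cluster = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 2.0]])
centroid = cluster.mean(axis=0)
reps = farthest_point_reps(cluster, n_reps=4)
print(shift_reps(reps, centroid, alpha=0.0))  # alpha = 0 leaves the reps unchanged
print(shift_reps(reps, centroid, alpha=0.5))  # alpha = 0.5 moves them halfway in
```

Under this assumed rule, alpha = 1 would collapse every rep onto the centroid, so smaller values keep more of the cluster's shape.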
[1] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[2] Gideon Schwarz. “Estimating the dimension of a model”. In: The Annals of Statistics (1978), pp. 461–464.
[3] James S. Farris. “On the cophenetic correlation coefficient”. In: Systematic Zoology 18.3 (1969), pp. 279–285.
[4] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. “CURE: An efficient clustering algorithm for large databases”. In: ACM SIGMOD Record 27.2 (1998), pp. 73–84.



