# SAMPLING REPORTING

Data noise and class imbalance are two of the most common data quality issues, and both can degrade the performance of most classification algorithms. Real-world datasets often contain both at once.

Several techniques such as sampling methods and cost-sensitive learning have been developed over the years to handle class imbalance issues. While some sampling methods are capable of filtering the data noise, the impact of the noise varies depending on the learning algorithm.
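To make the cost-sensitive idea concrete, the sketch below weights each class inversely to its frequency, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this illustration; they are not part of the tool.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (hypothetical): 8 majority samples and 2 minority samples.
labels = ['neg'] * 8 + ['pos'] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives the larger weight, so misclassifying it costs more during training. Sampling methods aim for a similar effect by changing the data rather than the loss.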

Here, we offer a framework in the form of a Python tool for selecting the top performers from a wide range of sampling techniques and learning algorithms when mining noisy and imbalanced datasets for classification problems. The framework depends on imbalanced-learn, scikit-learn and numpy. The tool is intended for quick and easy experimentation and report generation on classification problems.

In [1]:
from sampling_report.reporting.sampling_imblearn import sampling_report

In [2]:
help(sampling_report)

Help on function sampling_report in module sampling_report.reporting.sampling_imblearn:

sampling_report(data: pandas.core.frame.DataFrame, target: str, model, random_state=None, class_weight=None, exclude=None, include=None)
    :param data: Dataset including features and target.
                 Notes: Dataset in the form of Pandas DataFrame.
                 Dataset is expected to be preprocessed with no missing values.
                 sampling_report can automatically handle categorical encoding.
    :param target: str
                   The class variable must be one of the columns in the data
    :param model: Name of the model. Can be one or combination of:
                  ['rf', 'tree', 'lg', 'gb', 'knn', 'nn', 'svm', 'linear_svm', 'nu_svm', 'nb', 'gnb', 'bnb', 'cnb']
                  Notes: String or list is expected for single model. List is expected for multiple models.
    :param random_state: int or RandomState instance, default=None
                         Controls t

### Load Data

The breast cancer dataset available at the UC Irvine Machine Learning Repository [1] [2] is widely used by researchers to demonstrate methods of dealing with noisy and imbalanced data.

The following packages are required to load the data.

In [3]:
import os
import pandas as pd

In [4]:
def load_data(file_name, col_name):
    file_path = os.path.join(os.getcwd(), 'dataset', file_name)
    data = pd.read_csv(file_path, names=col_name, header=None)
    return data

In [5]:
col_name = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes',
            'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
file_name = 'breast-cancer.data'
data = load_data(file_name, col_name)

In [6]:
data.shape

(286, 10)

In [7]:
data.head()

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
0,no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no
1,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
3,no-recurrence-events,60-69,ge40,15-19,0-2,no,2,right,left_up,no
4,no-recurrence-events,40-49,premeno,0-4,0-2,no,2,right,right_low,no


In [8]:
data.isnull().any()

Class          False
age            False
menopause      False
tumor-size     False
inv-nodes      False
node-caps      False
deg-malig      False
breast         False
breast-quad    False
irradiat       False
dtype: bool

In [9]:
data['Class'].value_counts()

no-recurrence-events    201
recurrence-events        85
Name: Class, dtype: int64

### Target variable is imbalanced

In [10]:
round((data['Class'].value_counts() / data['Class'].value_counts().sum())*100, 2)

no-recurrence-events    70.28
recurrence-events       29.72
Name: Class, dtype: float64
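The same counts can also be summarized as an imbalance ratio (majority count divided by minority count). The sketch below reuses the counts printed above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
class_counts = {'no-recurrence-events': 201, 'recurrence-events': 85}

total = sum(class_counts.values())
# Share of each class as a percentage of all samples.
percentages = {cls: round(100 * n / total, 2) for cls, n in class_counts.items()}
# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(percentages)
print(round(imbalance_ratio, 2))
```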

### Investigate possible attribute noise

In [11]:
def get_attribute_noise(df, var, missing_val):
    df = df.copy()
    return df[df[var].str.strip().isin([missing_val])]

In [12]:
get_attribute_noise(df=data, var='node-caps', missing_val='?')

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
145,no-recurrence-events,40-49,premeno,25-29,0-2,?,2,left,right_low,yes
163,no-recurrence-events,60-69,ge40,25-29,3-5,?,1,right,left_up,yes
164,no-recurrence-events,60-69,ge40,25-29,3-5,?,1,right,left_low,yes
183,no-recurrence-events,50-59,ge40,30-34,9-11,?,3,left,left_up,yes
184,no-recurrence-events,50-59,ge40,30-34,9-11,?,3,left,left_low,yes
233,recurrence-events,70-79,ge40,15-19,9-11,?,1,left,left_low,yes
263,recurrence-events,50-59,lt40,20-24,0-2,?,1,left,left_up,no
264,recurrence-events,50-59,lt40,20-24,0-2,?,1,left,left_low,no


In [13]:
get_attribute_noise(df=data, var='breast-quad', missing_val='?')

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
206,recurrence-events,50-59,ge40,30-34,0-2,no,3,left,?,no


### Investigate possible class noise

In [14]:
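def get_class_noise(df, class_name):
    df = df.copy()
    # Drop exact duplicates (same features and same class).
    data_dedup_1 = df[~df.duplicated()]
    # Among the remaining rows, find those sharing identical feature values.
    data_2 = data_dedup_1.drop([class_name], axis=1)
    data_dedup_2 = data_2[data_2.duplicated(keep=False)]
    # Select by index label so the lookup also works with a non-default index.
    df_final = df.loc[data_dedup_2.index]
    return df_final.sort_values(by=data_dedup_2.columns.tolist())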

In [15]:
get_class_noise(df=data, class_name='Class')

Unnamed: 0,Class,age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat
90,no-recurrence-events,30-39,premeno,0-4,0-2,no,2,right,central,no
205,recurrence-events,30-39,premeno,0-4,0-2,no,2,right,central,no
35,no-recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
281,recurrence-events,30-39,premeno,30-34,0-2,no,2,left,left_up,no
2,no-recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
210,recurrence-events,40-49,premeno,20-24,0-2,no,2,left,left_low,no
46,no-recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_low,no
236,recurrence-events,40-49,premeno,25-29,0-2,no,2,right,left_low,no
108,no-recurrence-events,40-49,premeno,30-34,0-2,no,3,right,right_up,no
212,recurrence-events,40-49,premeno,30-34,0-2,no,3,right,right_up,no


In [16]:
def replace_cat_missing(df, var, val_to_replace=['?', ' ?'], suffix='_missing'):
    df = df.copy()
    df[var] = df[var].replace(val_to_replace, var+suffix)
    return df

In [17]:
VAR_WITH_MISSING_VALUES = ['node-caps', 'breast-quad']

In [18]:
for var in VAR_WITH_MISSING_VALUES:
    data = replace_cat_missing(df=data, var=var, val_to_replace='?', suffix='_missing')

### Apply sampling_report

The following call generates a sampling report for a random forest classifier without class weights.

In [19]:
sampling_report(data=data, target='Class', model='rf', random_state=0)


EditedNearestNeighbours
Class distribution after sampling: {'no-recurrence-events': 91, 'recurrence-events': 4}
mean_f1: 0.875 | mean_acc: 0.97  | model_name: RandomForestClassifier | sampling_strategy: all

Class distribution after sampling: {'no-recurrence-events': 91, 'recurrence-events': 85}
mean_f1: 0.826 | mean_acc: 0.829  | model_name: RandomForestClassifier | sampling_strategy: not minority

Class distribution after sampling: {'no-recurrence-events': 91, 'recurrence-events': 85}
mean_f1: 0.826 | mean_acc: 0.829  | model_name: RandomForestClassifier | sampling_strategy: majority

Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 4}
mean_f1: 0.898 | mean_acc: 0.99  | model_name: RandomForestClassifier | sampling_strategy: not majority


RepeatedEditedNearestNeighbours
Class distribution after sampling: {'no-recurrence-events': 89}
mean_f1: 1.0 | mean_acc: 1.0  | model_name: RandomForestClassifier | sampling_strategy: all

Class distribution af

mean_f1: 0.598 | mean_acc: 0.6  | model_name: RandomForestClassifier | sampling_strategy: majority

Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 85}
mean_f1: 0.647 | mean_acc: 0.738  | model_name: RandomForestClassifier | sampling_strategy: not majority


ADASYN
Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 201}
mean_f1: 0.777 | mean_acc: 0.779  | model_name: RandomForestClassifier | sampling_strategy: all

Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 201}
mean_f1: 0.777 | mean_acc: 0.779  | model_name: RandomForestClassifier | sampling_strategy: not majority

Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 201}
mean_f1: 0.777 | mean_acc: 0.779  | model_name: RandomForestClassifier | sampling_strategy: minority

Class distribution after sampling: {'no-recurrence-events': 201, 'recurrence-events': 85}
mean_f1: 0.669 | mean_acc: 0
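When reading these scores, it can help to compare each accuracy with the share of the largest class in the sampled data, because a sampler that removes most minority samples makes a trivial majority-class predictor look strong. The sketch below is only that arithmetic, using distributions copied from the EditedNearestNeighbours output above; `majority_baseline_accuracy` is a hypothetical helper. Whether the reported scores are computed on the resampled data is not stated here, so treat the comparison as a sanity check rather than a finding.

```python
def majority_baseline_accuracy(distribution):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(distribution.values()) / sum(distribution.values())


# Class distributions copied from the EditedNearestNeighbours output above.
after_sampling = {
    'all': {'no-recurrence-events': 91, 'recurrence-events': 4},
    'not minority': {'no-recurrence-events': 91, 'recurrence-events': 85},
    'not majority': {'no-recurrence-events': 201, 'recurrence-events': 4},
}

baselines = {
    name: round(majority_baseline_accuracy(dist), 3)
    for name, dist in after_sampling.items()
}
print(baselines)
```

A reported accuracy close to its baseline suggests the score reflects the class distribution more than the model, and mean_f1 is then the more informative column.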

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

[2] M. Zwitter, M. Soklic, University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia