# Balanced Bagging
## This file explains how to use balanced bagging for classification on a highly imbalanced dataset.
## The basic idea of balanced bagging is to draw multiple balanced subsets from the dataset, train one classifier on each subset, and combine their predictions by majority vote.
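The idea can be sketched in a few lines of plain Python. This is only an illustration, not how `BalancedBaggingClassifier` is implemented internally: the helper names (`balanced_subsets`, `fit_centroids`, `predict_centroids`, `majority_vote`), the nearest-centroid base learner, and the synthetic data are all assumptions made for this example.

```python
import random
from collections import Counter


def balanced_subsets(X, y, n_subsets, seed=0):
    """Yield n_subsets samples in which every class is down-sampled
    (without replacement) to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    for _ in range(n_subsets):
        chosen = []
        for idx in by_class.values():
            chosen.extend(rng.sample(idx, n_min))
        yield [X[i] for i in chosen], [y[i] for i in chosen]


def fit_centroids(X, y):
    """Toy base learner (an assumption): store the mean point of each class."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroids(model, row):
    """Predict the class whose centroid is closest to row."""
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(model[label], row)),
    )


def majority_vote(models, row):
    """Combine the base learners by taking the most frequent prediction."""
    votes = Counter(predict_centroids(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Synthetic imbalanced data: 200 majority points near (0, 0), 10 minority points near (3, 3)
data_rng = random.Random(1)
X_demo = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
X_demo += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(10)]
y_demo = [0] * 200 + [1] * 10

subsets = list(balanced_subsets(X_demo, y_demo, n_subsets=5))
models = [fit_centroids(Xs, ys) for Xs, ys in subsets]

print(majority_vote(models, [3.0, 3.0]), majority_vote(models, [0.0, 0.0]))
```

An odd number of base learners is used so that a two-class vote cannot tie. Each subset here contains 10 samples per class, so no single learner is dominated by the majority class.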


In [8]:

from sklearn.utils import shuffle
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedBaggingClassifier 

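# Balanced bagging ensemble with the library's default settings
bbc = BalancedBaggingClassifier(random_state=42)

# Load the features and the labels
X = np.genfromtxt("train.csv", delimiter=',')
Y = np.genfromtxt("train_labels.csv")

n = X.shape[0]
train_size = n * 70 // 100  # 70% of the samples are used for training

# Shuffle features and labels together before splitting
X, Y = shuffle(X, Y, random_state=27)

# Split into a training set and a held-out test set
X_train = X[:train_size]
Y_train = Y[:train_size]
X_test = X[train_size:]
Y_test = Y[train_size:]

# Fit on the training set and predict on the held-out test set
bbc.fit(X_train, Y_train)
Y__ = bbc.predict(X_test)

# Per-class precision, recall, F-score and support
res = precision_recall_fscore_support(Y_test, Y__)

# Predict labels for the unlabeled test file and save them
X_final = np.genfromtxt("test.csv", delimiter=',')
Y_final = bbc.predict(X_final)
np.savetxt("test_pred_labels_2.csv", Y_final, delimiter=",")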


# The confusion matrix:

In [10]:
print(confusion_matrix(Y_test, Y__))

[[27591  2343]
 [   19    47]]


# The precision, recall, F-score, and support of the two classes:

In [11]:
print(res)

(array([0.99931184, 0.01966527]), array([0.9217278 , 0.71212121]), array([0.95895315, 0.03827362]), array([29934,    66], dtype=int64))


# Precision:
### Class1: 0.99931184
### Class2: 0.01966527
# Recall:
### Class1: 0.9217278
### Class2: 0.71212121
# F-score:
### Class1: 0.95895315
### Class2: 0.03827362
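These values can be re-derived from the confusion matrix printed earlier, which helps to see where the low minority precision comes from. The sketch below uses the counts from that output; the helper name `per_class_metrics` is introduced only for this example, and rows are assumed to be true classes and columns predicted classes, as in `sklearn.metrics.confusion_matrix`.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[27591, 2343],
      [19, 47]]


def per_class_metrics(cm, k):
    """Return (precision, recall, F-score) for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, but another class
    fn = sum(cm[k]) - tp                             # true k, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.8f} recall={r:.8f} fscore={f:.8f}")
```

For the minority class, 2343 false positives against only 47 true positives explain the low precision, while 47 hits out of 66 true minority samples give the high recall.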


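### The precision for the minority class is low (about 0.02): of the 2390 test samples predicted as minority, only 47 are correct. However, the recall is high (about 0.71): 47 of the 66 true minority samples are detected. The second algorithm that we implemented achieved a higher precision.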