<a href="https://colab.research.google.com/github/ajaysaikiran2208/datasharing/blob/master/Class_Balancing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Class Balancing in Machine Learning**

If you have spent some time in machine learning and data science, you have almost certainly come across an imbalanced class distribution. This is a scenario where the number of observations belonging to one class is significantly lower than the number belonging to the other classes.

This problem is especially common in anomaly-detection settings such as electricity pilferage, fraudulent bank transactions, and the identification of rare diseases. In these situations, a predictive model built with conventional machine learning algorithms can be biased toward the majority class and perform poorly on the minority class.

This happens because machine learning algorithms are usually designed to maximise overall accuracy by reducing the total error. By default, they do not take the class distribution into account, so a model can score well simply by predicting the majority class most of the time.
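To make this concrete, here is a minimal sketch using made-up labels (950 majority and 50 minority observations, chosen only for illustration). It assumes a hypothetical "model" that always predicts the majority class and compares its accuracy with its recall on the minority class:

```python
# Hypothetical labels: 950 majority (0) and 50 minority (1) observations.
y_true = [0] * 950 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true 1s that were predicted as 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

The accuracy looks high even though not a single minority observation is detected, which is why accuracy alone can be misleading on imbalanced data.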

In [None]:
from sklearn.datasets import make_classification
nb_samples = 1000
weights = (0.95, 0.05)
x, y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, weights=weights, random_state=1000)

print(x[y==0].shape)
print(x[y==1].shape)

So, as expected, the first class is dominant. To balance the classes in a dataset like this, we can use two resampling techniques:



*Data Level approach: Resampling Techniques*

1. Resampling with replacement

2. SMOTE resampling

Now let’s go through both these class balancing techniques to see how we can balance the classes before using any machine learning algorithm.

## Resampling with Replacement:


In resampling with replacement, we repeatedly draw samples from the minority class until it contains as many samples as the majority class. Because each draw is made with replacement, the same data point can be selected any number of times. As a result, the minority-class part of the resulting dataset contains only points drawn from the 54 original minority samples in our example. Here is how we can apply resampling with replacement in Python:

In [2]:
# Resampling with Replacement
import numpy as np
from sklearn.utils import resample
x_resampled = resample(x[y==1], n_samples=x[y==0].shape[0], random_state=1000)

x_ = np.concatenate((x[y==0], x_resampled))
y_ = np.concatenate((y[y==0], np.ones(shape=(x[y==0].shape[0],), dtype=np.int32)))

print(x_[y_==0].shape)
print(x_[y_==1].shape)

(946, 2)
(946, 2)
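The point about the limited set of values can be checked directly. The following is a minimal NumPy-only sketch, not the `sklearn.utils.resample` call used above: it assumes a stand-in minority class of 54 random points (`x_minor` is a hypothetical name and the data is made up), draws 946 rows with replacement, and counts the distinct rows that result:

```python
import numpy as np

rng = np.random.default_rng(1000)

# Stand-in for the minority class: 54 two-dimensional points (made-up data).
x_minor = rng.normal(size=(54, 2))

# Draw 946 row indices with replacement, mirroring the majority-class size above.
idx = rng.integers(0, x_minor.shape[0], size=946)
x_minor_resampled = x_minor[idx]

# Count how many distinct rows the oversampled array actually contains.
n_unique = np.unique(x_minor_resampled, axis=0).shape[0]

print(x_minor_resampled.shape)  # (946, 2)
print(n_unique <= 54)           # True
```

However many rows are drawn, the number of distinct points can never exceed the size of the original minority class.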


## SMOTE Resampling:


SMOTE, the Synthetic Minority Over-sampling Technique, is one of the most widely used approaches for dealing with class imbalance. Rather than duplicating existing points, it generates new synthetic samples for the minority class by interpolating between a minority sample and its nearest minority-class neighbours. To apply SMOTE for class balancing, we can use the imbalanced-learn library, which provides many algorithms for this kind of problem. Here’s how to implement SMOTE resampling for class balancing using Python:

In [5]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=1000)
x_, y_ = smote.fit_resample(x, y)

print(x_[y_==0].shape)
print(x_[y_==1].shape)



(946, 2)
(946, 2)
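To give an intuition for what happens inside SMOTE, here is a simplified NumPy sketch of the interpolation idea. It is an illustration under stated assumptions, not imbalanced-learn's implementation: the function name `smote_like_samples`, its parameters, and the random stand-in data are all hypothetical.

```python
import numpy as np

def smote_like_samples(x_minor, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority points.
    diffs = x_minor[:, None, :] - x_minor[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Indices of each point's k nearest neighbours; column 0 is skipped because,
    # for distinct points, it is the point itself (distance 0).
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]

    synthetic = np.empty((n_new, x_minor.shape[1]))
    for i in range(n_new):
        base = rng.integers(0, x_minor.shape[0])  # random minority point
        nb = rng.choice(neighbours[base])         # one of its k neighbours
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = x_minor[base] + gap * (x_minor[nb] - x_minor[base])
    return synthetic

# Made-up minority class of 54 points; 892 new points would bring it up to 946.
x_minor = np.random.default_rng(1000).normal(size=(54, 2))
x_new = smote_like_samples(x_minor, n_new=892)
print(x_new.shape)  # (892, 2)
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay within the region the minority class already occupies instead of being exact copies.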




## Summary


Both resampling with replacement and SMOTE are useful techniques for dealing with class imbalance in machine learning. Resampling with replacement increases the number of minority samples, but it adds no new information, because every value is taken from the existing set. SMOTE, in contrast, reaches the same class balance by generating new synthetic samples based on each minority point's nearest neighbours.