#### Handling Imbalanced Datasets - Undersampling & Oversampling

###### Imbalanced dataset
In a classification problem (binary, in this notebook), if one class has far more samples than the other, the dataset is called imbalanced.
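
One way to put a number on "far more samples" is the ratio of majority to minority counts. The helper below is a minimal sketch; `imbalance_ratio` and the toy labels are illustrative names and data, not part of this notebook's dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority_count / minority_count for a sequence of class labels."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


# Toy labels (hypothetical): 95 negatives and 5 positives
labels = [0] * 95 + [1] * 5
ratio = imbalance_ratio(labels)
print(ratio)
```

A ratio close to 1 means the classes are balanced; the larger the ratio, the stronger the imbalance.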

###### Undersampling
One technique is undersampling: reducing the number of samples from the majority (dominating) class. It is less useful when the dataset is small, because it throws data away.
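
The simplest form of this idea is random undersampling, sketched below with NumPy. This is an illustration only: `random_undersample` and the toy arrays are hypothetical, and it is not the NearMiss method used later in this notebook.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    kept = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement: no row is kept twice
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))
    return X[kept], y[kept]


# Toy data (hypothetical): 8 samples of class 0, 2 samples of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_small, y_small = random_undersample(X_toy, y_toy)
print(X_small.shape, np.bincount(y_small))
```

Six of the eight majority rows are discarded here, which is why the approach hurts when there is little data to begin with.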

###### Oversampling
When the dataset is too small to undersample, we can use oversampling instead: increasing the number of samples from the minority class.
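
The most basic version is random oversampling, which duplicates minority rows until the classes match. The sketch below uses NumPy; `random_oversample` and the toy arrays are hypothetical names chosen for illustration.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows so every class reaches the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample with replacement: the same minority row may be copied several times
        extra = rng.choice(idx, size=n_target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    all_idx = np.concatenate(parts)
    return X[all_idx], y[all_idx]


# Toy data (hypothetical): 8 samples of class 0, 2 samples of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_big, y_big = random_oversample(X_toy, y_toy)
print(X_big.shape, np.bincount(y_big))
```

No original row is lost, but the added rows are exact copies, so no new information is created.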

In [1]:
# First importing the necessary libraries

import pandas as pd
import numpy as np
from imblearn.under_sampling import NearMiss

In [11]:
# Loading the dataset

df = pd.read_csv('creditcard.csv')
df.head(2)

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0


The target column is Class: Class = 0 marks a normal transaction and Class = 1 marks a fraudulent one.

In [12]:
# Check whether the dataset is imbalanced
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64
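
To make the size of the gap concrete, the counts printed above can be turned into a share and a ratio. This is plain arithmetic on those two numbers; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 284315, 1: 492}

total = sum(counts.values())
fraud_share = counts[1] / total
normal_to_fraud = counts[0] / counts[1]

print(f"total rows: {total}")
print(f"fraud share: {fraud_share:.4%}")
print(f"normal transactions per fraud: {normal_to_fraud:.1f}")
```

Fraud makes up well under 1% of the rows, so the dataset is heavily imbalanced.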

In [23]:
# The dataset is clearly imbalanced.
# Next we will use the imblearn library to undersample it,
# but first we need to separate the independent and dependent variables.

X = df[df.columns[:-1]]   # every column except the last: the features
Y = df[df.columns[-1]]    # the last column: Class, the target

print('shape of independent variables:{} '.format(X.shape))
print('shape of dependent variables:{} '.format(Y.shape))

shape of independent variables:(284807, 30) 
shape of dependent variables:(284807,) 


In [44]:
# Now implement the undersampling
# (fit_resample replaces the older fit_sample alias, which recent
# imbalanced-learn releases no longer provide)

near_miss = NearMiss()
x_under, y_under = near_miss.fit_resample(X, Y)


print('New shape of independent variables:{} '.format(x_under.shape))
print('New shape of dependent variables:{} '.format(y_under.shape))

New shape of independent variables:(984, 30) 
New shape of dependent variables:(984,) 


So undersampling cut the majority class down to the size of the minority class: 492 samples each, 984 in total.
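
NearMiss does not drop majority rows at random. In its default variant (version 1, three neighbors), it keeps the majority samples whose average distance to their closest minority samples is smallest. The NumPy code below is a rough sketch of that idea on toy data, not the library's implementation; `near_miss_1` and the toy arrays are hypothetical, and the full pairwise distance matrix it builds would be too large for the credit card dataset.

```python
import numpy as np


def near_miss_1(X, y, minority_label=1, n_neighbors=3):
    """Keep the majority rows closest, on average, to their nearest minority rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y != minority_label)
    min_idx = np.flatnonzero(y == minority_label)

    # Euclidean distance from every majority row to every minority row
    diff = X[maj_idx][:, None, :] - X[min_idx][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Average distance to the k nearest minority rows
    k = min(n_neighbors, len(min_idx))
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)

    # Keep as many majority rows as there are minority rows
    keep_maj = maj_idx[np.argsort(score, kind="stable")[:len(min_idx)]]
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]


# Toy data (hypothetical): one feature, six majority rows, two minority rows
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_nm, y_nm = near_miss_1(X_toy, y_toy)
print(X_nm.ravel(), y_nm)
```

On this toy data the majority rows at 1 and 2 survive because they sit closest to the minority rows at 3 and 4, while the distant rows are discarded.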

Now we will try oversampling with SMOTETomek, which combines SMOTE oversampling with Tomek-link cleaning.

In [42]:
from imblearn.combine import SMOTETomek
smkt = SMOTETomek(random_state=42)

In [43]:
x_upper, y_upper = smkt.fit_resample(X, Y)

In [45]:
print('New shape of oversampled independent variables:{} '.format(x_upper.shape))
print('New shape of oversampled dependent variables:{} '.format(y_upper.shape))

New shape of oversampled independent variables:(567562, 30) 
New shape of oversampled dependent variables:(567562,) 
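
The SMOTE half of SMOTETomek does not copy minority rows. It creates synthetic ones by picking a minority sample, picking one of its nearest minority neighbors, and placing a new point somewhere on the line between them. The NumPy code below is a simplified sketch of that interpolation step only; `smote_like_samples` and the toy points are hypothetical, and the Tomek-link cleaning step is not shown.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if k >= n:
        raise ValueError("k must be smaller than the number of minority samples")

    # Pairwise distances between minority samples; ignore self-distances
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)           # starting sample for each new point
    neigh = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                    # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[neigh] - X_min[base])


# Toy minority points (hypothetical): the corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min_toy, n_new=5)
print(synthetic)
```

Every synthetic point lies between two existing minority points, so the new rows stay inside the region the minority class already occupies.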


We can also choose the ratio between the two classes with the sampling_strategy parameter.

In [46]:
from imblearn.over_sampling import RandomOverSampler

In [67]:
# sampling_strategy=0.1: after resampling, minority count / majority count = 0.1
randover = RandomOverSampler(sampling_strategy=0.1)

In [68]:
x_rand, y_rand = randover.fit_resample(X, Y)
print('New shape of oversampled independent variables:{} '.format(x_rand.shape))
print('New shape of oversampled dependent variables:{} '.format(y_rand.shape))

New shape of oversampled independent variables:(312746, 30) 
New shape of oversampled dependent variables:(312746,) 
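
The row count in this output can be checked by hand. With a float sampling_strategy, the oversampler raises the minority class to that fraction of the majority class. The arithmetic below assumes the target count is truncated to an integer, which reproduces the 312746 rows shown above; the variable names are illustrative.

```python
# Class counts from the original dataset
n_majority = 284315
n_minority = 492
sampling_strategy = 0.1

# Assumption: the minority target is the majority count times the ratio, truncated
n_minority_target = int(n_majority * sampling_strategy)
n_added = n_minority_target - n_minority
total_after = n_majority + n_minority_target

print(n_minority_target, n_added, total_after)
```

The majority class is left untouched; only the minority class grows, from 492 rows to one tenth of the majority count.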
