<a href="https://colab.research.google.com/github/expeditive/machine-learning/blob/main/handling_imbalanced_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What Is Imbalanced Data in ML?
In machine learning, imbalanced data refers to a dataset where one class (category) has many more samples than the other(s).

📌 Example:
Imagine you are building a model to detect fraud in transactions:

✅ 98,000 transactions are normal

❌ 2,000 transactions are fraudulent

This is a highly imbalanced dataset (98% vs 2%).

## Why Is This a Problem?
If a model always predicts "normal", it is 98% accurate, yet it misses every single fraudulent transaction. Accuracy alone hides that failure, which makes such a model useless in real-world use.
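To make the accuracy trap concrete, here is a minimal sketch using the hypothetical counts from the example above (98,000 normal, 2,000 fraudulent). The array names are illustrative and not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical labels matching the example: 0 = normal, 1 = fraud
y_true = np.concatenate([np.zeros(98_000, dtype=int), np.ones(2_000, dtype=int)])

# A "model" that always predicts normal
y_pred = np.zeros_like(y_true)

# Share of all predictions that are correct
accuracy = (y_pred == y_true).mean()

# Recall on the fraud class: share of actual frauds that were caught
fraud_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2%}")
print(f"fraud recall: {fraud_recall:.2%}")
```

The accuracy looks excellent while the recall on the fraud class is zero, which is why accuracy on its own is a misleading metric here.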

So, we need special techniques to make the model focus on the minority class (fraudulent transactions).

Handling It (In Short):
Common ways to address this issue include:

Resampling the data (balance the number of samples in each class)

Choosing better metrics (like F1-score instead of just accuracy)

Using models that are aware of class imbalance

Adding cost for mistakes (penalize wrong predictions more for the minority class)
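As one illustration of "adding cost for mistakes", a common heuristic weights each class inversely to its frequency: `weight = n_samples / (n_classes * class_count)`. The sketch below computes this with NumPy for the hypothetical 98,000 / 2,000 split. It is an illustrative assumption, not a step this notebook performs, and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical labels: 0 = normal, 1 = fraud
y = np.concatenate([np.zeros(98_000, dtype=int), np.ones(2_000, dtype=int)])

# Number of samples per class, indexed by label
class_counts = np.bincount(y)
n_samples, n_classes = y.size, class_counts.size

# Inverse-frequency weights: rarer classes get larger weights
class_weights = n_samples / (n_classes * class_counts)

for label, weight in enumerate(class_weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these weights, a mistake on a fraud sample counts roughly 49 times as much as a mistake on a normal one, so a model trained with them can no longer ignore the minority class cheaply.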

In [1]:
#importing dependencies
import numpy as np
import pandas as pd


In [2]:
#loading dataset to pandas dataframe
credit_card_data = pd.read_csv('/content/credit_card_fraud_data.csv')

In [3]:
#first 5 rows
credit_card_data.head()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Fraud
0,0.524822,0.664592,-1.184648,-1.399142,-0.627364,1.256357,-0.979395,1.518224,-0.303218,-0.821357,0
1,2.961115,-0.354136,-1.58841,0.423144,-0.882289,4.062069,0.36339,2.888723,-3.024036,0.344617,0
2,3.695444,1.727707,-0.480724,-0.289114,-1.370615,1.155969,-0.192582,1.575073,-2.527986,0.996194,0
3,2.280354,0.749077,-2.748689,0.582322,-2.551469,1.377335,-0.861782,2.092249,-2.261708,0.264793,0
4,1.843133,-0.501772,-0.1159,0.620451,-0.728438,2.251169,-1.893136,4.843484,-1.851503,-3.294511,0


In [4]:
#last 5 rows
credit_card_data.tail()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Fraud
9995,1.373621,2.58375,-4.630677,1.218793,-3.702057,0.846284,-0.783723,-1.633186,-2.133364,4.241298,0
9996,1.69348,2.623242,0.135164,-0.864211,-1.473444,0.497463,-1.006448,1.592835,-1.668315,-0.184937,0
9997,-0.172976,1.565003,1.494715,-2.220424,-0.236798,0.032328,-1.369348,3.473531,0.115208,-3.967552,0
9998,-0.799408,1.000463,-2.415457,0.322903,-2.216429,-0.854094,-0.457101,1.565183,0.442054,-1.879892,0
9999,3.327595,-0.844413,-0.442694,-2.59858,0.871482,2.793985,0.17118,2.57945,-1.058797,-0.671981,0


In [5]:
#distribution of the two classes
credit_card_data['Fraud'].value_counts()

Fraud,count
0,9451
1,549


This is a highly imbalanced dataset: about 94.5% of the transactions are legit and about 5.5% are fraudulent.

0 --> legit transaction

1 --> fraud transaction
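The imbalance can also be read off as proportions with `value_counts(normalize=True)`. The sketch below rebuilds only the label column from the counts shown above (9,451 legit, 549 fraud) so that it runs on its own; `fraud_labels` is a stand-in name, and in the notebook the same call would go on `credit_card_data['Fraud']`.

```python
import pandas as pd

# Stand-in for credit_card_data['Fraud'], rebuilt from the counts above
fraud_labels = pd.Series([0] * 9451 + [1] * 549, name="Fraud")

# Fraction of rows in each class instead of raw counts
class_share = fraud_labels.value_counts(normalize=True)
print(class_share)

# Majority-to-minority ratio
imbalance_ratio = 9451 / 549
print(f"about {imbalance_ratio:.1f} legit transactions per fraud transaction")
```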

In [7]:
# separating the legit and fraud transactions

legit = credit_card_data[credit_card_data.Fraud == 0]
fraud = credit_card_data[credit_card_data.Fraud == 1] #Fraud is the column name here

In [8]:
print(legit.shape)
print(fraud.shape)

(9451, 11)
(549, 11)


# RESAMPLING

Build a dataset with an equal number of legit and fraud transactions (random undersampling of the majority class).

The dataset contains 549 fraud transactions.

So we randomly draw 549 of the 9,451 legit transactions to match.


In [9]:
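# draw as many legit rows as there are fraud rows (549); pass random_state=<int> for a reproducible draw
legit_sample = legit.sample(n=len(fraud))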

In [10]:
print(legit_sample.shape)

(549, 11)


Now concatenate the two dataframes: the 549 sampled legit transactions and the 549 fraud transactions.

In [11]:
new_dataset = pd.concat([legit_sample, fraud], axis = 0)

In [12]:
new_dataset.head()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Fraud
2173,0.56309,1.808848,-0.341135,-0.601031,-1.660959,1.392873,-0.860799,2.342512,-1.822271,-0.996072,0
9555,1.170113,1.383558,-1.33205,-0.50995,-1.251501,0.705432,0.805961,0.916001,-0.829965,0.196013,0
4627,0.017338,0.243643,-1.528091,2.055272,-1.079044,1.020551,0.436876,2.526198,-0.190486,-2.128208,0
2604,2.897468,-1.242508,-1.677704,0.148518,0.533624,4.873279,-0.413579,2.060035,-2.075332,0.912194,0
4683,-0.560156,1.511983,-0.661847,-1.074181,-1.11102,0.367613,-1.3836,2.010533,0.045676,-2.056454,0


In [13]:
new_dataset.tail()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Feature_7,Feature_8,Feature_9,Feature_10,Fraud
9943,-0.540494,2.076818,0.212806,0.507272,-0.960502,-1.838487,0.327038,-1.705098,0.175635,1.263653,1
9955,-0.02135,1.737238,1.125728,2.221049,-0.061018,-0.723666,-0.894662,0.894187,0.410399,-1.40249,1
9966,0.580763,1.87236,2.047092,0.257603,-1.300864,-3.062886,-0.042721,0.01533,-0.871703,-0.359518,1
9974,0.059708,-1.347466,-0.474845,-0.449878,-0.565842,-0.531376,0.063353,1.09252,-0.258166,-1.169121,1
9980,1.158443,1.847824,-2.195389,0.090428,-0.99197,-1.132606,0.024826,1.101668,1.602771,-1.444249,1
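As the head and tail outputs show, `pd.concat` stacks the two frames, so every legit row comes before every fraud row. If a later step is sensitive to row order, shuffling is a common precaution. A minimal sketch on a small stand-in frame follows; the `demo` frame and `random_state=42` are assumptions for illustration only.

```python
import pandas as pd

# Small stand-in for the concatenated frame: legit rows first, fraud rows last
demo = pd.DataFrame({
    "Feature_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "Fraud": [0, 0, 0, 1, 1, 1],
})

# sample(frac=1) returns every row in random order;
# reset_index(drop=True) replaces the old index with 0..n-1
shuffled = demo.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled)
```

Leave out `reset_index(drop=True)` if the original row labels (such as 2173 or 9943 above) should be kept for tracing rows back to the source data.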


In [15]:
new_dataset['Fraud'].value_counts()

Fraud,count
0,549
1,549


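Now the data is balanced, with 549 transactions in each class, and can be used to train the model. Note that undersampling discards the other 8,902 legit transactions.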
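The undersampling steps above can be wrapped in a small reusable helper. This is a sketch rather than the notebook's own code: the function name `undersample_majority`, its parameters, and the toy data are hypothetical.

```python
import pandas as pd


def undersample_majority(df, label_col, random_state=None):
    """Randomly drop rows so every class has as many rows as the smallest one."""
    # Size of the smallest class, e.g. 549 fraud rows in the notebook's data
    minority_size = int(df[label_col].value_counts().min())
    # Draw that many rows from each class
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)


# Toy frame: 8 legit rows (0) and 2 fraud rows (1)
toy = pd.DataFrame({"Feature_1": range(10), "Fraud": [0] * 8 + [1] * 2})
balanced = undersample_majority(toy, "Fraud", random_state=0)
print(balanced["Fraud"].value_counts())
```

On the notebook's data the equivalent call would be `undersample_majority(credit_card_data, 'Fraud')`; fixing `random_state` makes the drawn sample repeatable across runs.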