### This notebook shows how to use SMOTE to balance an imbalanced dataset

SMOTE (Synthetic Minority Over-sampling Technique) balances an imbalanced dataset by creating new synthetic samples for the minority class. It reduces the risk of overfitting that comes with simple random oversampling: instead of duplicating existing instances, SMOTE works in feature space and generates each new data point by interpolating between a minority-class instance and one of its nearest minority-class neighbors.
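
The interpolation step can be illustrated with a few lines of NumPy. This is a simplified sketch, not the implementation used by `imblearn`: the function name `smote_sketch`, its parameters, and the brute-force neighbor search are assumptions made for illustration only.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration; assumes k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, len(X_min), size=n_new)  # randomly chosen minority samples
    pick = rng.integers(0, k, size=n_new)           # one of the k neighbors for each
    nbr = neighbors[base, pick]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)

    # Each new point lies on the segment between a sample and its neighbor
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# Minority samples drawn from the same region used later in this notebook
X_min = np.random.default_rng(1).random((100, 2)) + 2
synthetic = smote_sketch(X_min, n_new=10)
print(synthetic.shape)
```

Because every synthetic point sits on a line segment between two real minority samples, the new points stay inside the region the minority class already occupies rather than repeating existing rows.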

In [31]:
# First-time installation (the imblearn module is published on PyPI as imbalanced-learn)
#!pip install imbalanced-learn

In [28]:
# Import all necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import os

In [4]:
"""
This code generates synthetic data for a binary classification problem with two classes: 
majority (labeled as 0) and minority (labeled as 1). 
It creates a dataset of random samples in a 2-dimensional space. 

"""
    
# Set the number of samples for each class
n_samples_majority = 1000
n_samples_minority = 100

# Generate the majority class samples
df_majority = pd.DataFrame(np.random.rand(n_samples_majority, 2))
df_majority["label"] = 0

# Generate the minority class samples
df_minority = pd.DataFrame(np.random.rand(n_samples_minority, 2) + 2)
df_minority["label"] = 1

# Combine the majority and minority class samples
df = pd.concat([df_majority, df_minority])

# Print the class distribution
print(df["label"].value_counts())

0    1000
1     100
Name: label, dtype: int64


In [6]:
df

           0         1  label
0   0.635612  0.042486      0
1   0.547591  0.550873      0
2   0.749612  0.940050      0
3   0.050989  0.753128      0
4   0.584046  0.526648      0
..       ...       ...    ...
95  2.841486  2.880615      1
96  2.759495  2.056199      1
97  2.529830  2.737648      1
98  2.716458  2.049005      1


In [22]:
# Splitting the DataFrame into features (X) and target (y)
X = df.drop("label", axis=1)  # Drop the "label" column to get the features
y = df["label"]  # Select the "label" column as the target

# Print a random sample of rows from X and y for verification
# (the two samples are drawn independently, so their row labels differ)
print("Features (X):", X.sample(5))
print("\nTarget (y):", y.sample(5))


Features (X):             0         1
822  0.221628  0.922134
423  0.732389  0.921703
497  0.111181  0.365762
202  0.955953  0.521996
449  0.296470  0.622233

Target (y): 797    0
498    0
331    0
55     1
992    0
Name: label, dtype: int64


In [23]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a SMOTE object
smote = SMOTE(random_state=42)

# Resample the train set using SMOTE
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)


In [30]:
# Print the count for each class
class_counts = pd.Series(y_train_smote).value_counts()
print("Class Counts:")
print(class_counts)

# Create a DataFrame for the balanced dataset
balanced_df = pd.DataFrame(X_train_smote, columns=X_train.columns)
balanced_df["label"] = y_train_smote

# Print a random sample of rows from the balanced DataFrame
print("\nBalanced DataFrame:")
print(balanced_df.sample(5))

# Save the balanced DataFrame to a CSV file (optional)
balanced_df.to_csv("balanced_dataset.csv", index=False)

Class Counts:
0    796
1    796
Name: label, dtype: int64

Balanced DataFrame:
             0         1  label
572   0.054158  0.134275      0
1309  2.372451  2.418068      1
141   0.485411  0.211238      0
511   0.401186  0.969354      0
999   2.862182  2.944962      1
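
The class counts printed above can be cross-checked with some simple arithmetic. The sketch below assumes the 20% test split holds out 220 of the 1,100 rows and takes the majority count (796) from the printed output; the variable names are illustrative.

```python
# Dataset sizes defined earlier in this notebook
n_total = 1000 + 100

# Assumption: test_size=0.2 holds out 20% of the rows
n_test = n_total * 20 // 100
n_train = n_total - n_test

# Majority count in the training set, read from the "Class Counts" output
majority_train = 796
minority_train = n_train - majority_train

# SMOTE adds synthetic minority rows until both classes are the same size
n_synthetic = majority_train - minority_train
balanced_total = 2 * majority_train

print("Training rows before SMOTE:", n_train)
print("Minority rows before SMOTE:", minority_train)
print("Synthetic rows added by SMOTE:", n_synthetic)
print("Training rows after SMOTE:", balanced_total)
```

Under these assumptions, most of the minority rows in the balanced training set are synthetic, which is worth keeping in mind when inspecting `balanced_df`.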
