
***Balancing data*** is a crucial step in machine learning when working with imbalanced datasets, where one class significantly outnumbers the other(s). A model trained on such data tends to favor the majority class, so balancing techniques can improve its performance on the minority class. Here, I'll provide code examples for some common data balancing techniques:



# ***Random Oversampling:***
Randomly sample from the minority class, with replacement, until it matches the number of samples in the majority class.

In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

In [None]:
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)


In [None]:
# Count the number of samples in each class
class_counts = np.bincount(y)
print("Class counts:", class_counts)

Class counts: [947  53]


In [None]:
# Identify the minority and majority class labels from the counts
minority_class = np.argmin(class_counts)
majority_class = np.argmax(class_counts)


In [None]:
# Upsample the minority class to match the majority class
minority_samples = X[y == minority_class]
oversampled_minority = resample(minority_samples, replace=True, n_samples=class_counts[majority_class], random_state=42)

In [None]:
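# Stack the original majority samples on top of the oversampled minority samples
X_balanced = np.vstack([X[y == majority_class], oversampled_minority])
# Build matching labels: majority labels first, then one minority label per oversampled row
y_balanced = np.hstack([y[y == majority_class], np.full(class_counts[majority_class], minority_class)])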

# Verify the class balance
print("Balanced class counts:", np.bincount(y_balanced))

Balanced class counts: [947 947]
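
The cells above handle the two-class case. When there are more than two classes, the same idea can be sketched as a small helper that upsamples every class to the size of the largest one. Note that `random_oversample` is a hypothetical helper written for this notebook, not a library function, and the three-class weights are arbitrary values chosen for illustration. Unlike the cells above, this sketch keeps every original sample and only draws the shortfall with replacement.

```python
import numpy as np
from sklearn.datasets import make_classification


def random_oversample(X, y, random_state=None):
    """Hypothetical helper: upsample every class to the size of the largest one."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Keep every original sample and draw the shortfall with replacement
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        selected.append(np.concatenate([cls_idx, extra]))
    # Shuffle so the classes are not grouped together in the output
    selected = rng.permutation(np.concatenate(selected))
    return X[selected], y[selected]


# Three-class example (the weights are an arbitrary choice for illustration)
X3, y3 = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.8, 0.15, 0.05], random_state=42,
)
X3_balanced, y3_balanced = random_oversample(X3, y3, random_state=42)
print("Balanced class counts:", np.bincount(y3_balanced))
```

Working with row indices rather than the rows themselves keeps `X` and `y` aligned without building the label vector by hand.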


# ***Random Undersampling:***
Randomly remove samples from the majority class to match the number of samples in the minority class.



In [None]:
import numpy as np
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# Perform random undersampling on the majority class
undersampler = RandomUnderSampler(sampling_strategy='majority', random_state=42)
X_balanced, y_balanced = undersampler.fit_resample(X, y)

# Verify the class balance
print("Balanced class counts:", np.bincount(y_balanced))

Balanced class counts: [53 53]
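
If imbalanced-learn is not installed, the same idea can be sketched with NumPy alone: keep all minority samples and a random subset of the majority class, drawn without replacement. This is a minimal sketch rather than a reimplementation of `RandomUnderSampler`; the variable names are my own, and the specific rows it keeps will generally differ from the ones imbalanced-learn selects.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same imbalanced dataset as above
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(42)
class_counts = np.bincount(y)
minority_class = np.argmin(class_counts)
majority_class = np.argmax(class_counts)

minority_idx = np.flatnonzero(y == minority_class)
majority_idx = np.flatnonzero(y == majority_class)
# Keep a random subset of the majority class, drawn without replacement
kept_majority_idx = rng.choice(majority_idx, size=len(minority_idx), replace=False)

# Sort the kept indices so the surviving rows stay in their original order
keep = np.sort(np.concatenate([kept_majority_idx, minority_idx]))
X_under, y_under = X[keep], y[keep]
print("Balanced class counts:", np.bincount(y_under))
```

Because undersampling discards most of the majority class, it is usually worth checking how much data remains before training on the result.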


# ***Synthetic Minority Over-sampling Technique (SMOTE):***
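Generate synthetic samples for the minority class by interpolating between an existing minority sample and one of its nearest minority-class neighbors, instead of duplicating existing samples.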



In [None]:
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# Perform SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X, y)

# Verify the class balance
print("Balanced class counts:", np.bincount(y_balanced))

Balanced class counts: [947 947]
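
To make the idea behind SMOTE more concrete, here is a rough sketch of its core interpolation step using NumPy and scikit-learn's `NearestNeighbors`. It is a simplified illustration, not imbalanced-learn's actual implementation: the variable names are my own, and `k = 5` is chosen to mirror the library's default `k_neighbors=5`. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority neighbors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

# Same imbalanced dataset as above
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

rng = np.random.default_rng(42)
k = 5  # number of neighbors to interpolate towards (assumed value for this sketch)
class_counts = np.bincount(y)
minority_class = np.argmin(class_counts)
X_min = X[y == minority_class]
n_synthetic = int(class_counts.max() - class_counts.min())

# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
neighbor_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

# Pick a random base sample and one of its k neighbors for every synthetic point
base = rng.integers(0, len(X_min), size=n_synthetic)
chosen = neighbor_idx[base, rng.integers(0, k, size=n_synthetic)]
# Place the new point at a random fraction of the way from base to neighbor
gap = rng.random((n_synthetic, 1))
X_synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])

X_smote = np.vstack([X, X_synthetic])
y_smote = np.hstack([y, np.full(n_synthetic, minority_class)])
print("Balanced class counts:", np.bincount(y_smote))
```

Since every synthetic point lies between two real minority samples, the new samples stay inside the region the minority class already occupies.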


# ***Class-weighted Loss:***
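Instead of resampling the data, you can leave the dataset unchanged and reweight the loss. Many estimators in libraries such as scikit-learn accept a `class_weight` parameter that penalizes misclassifications of the minority class more heavily. Setting `class_weight='balanced'` weights each class inversely proportional to its frequency in the training data.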


In [None]:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

# Create a logistic regression model with class weights
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X, y)
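
To see which weights `class_weight='balanced'` corresponds to, they can be computed directly with scikit-learn's `compute_class_weight`. The sketch below also checks them against the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`; the variable names are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils.class_weight import compute_class_weight

# Same imbalanced dataset as above
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# The documented formula: n_samples / (n_classes * count of each class)
manual = len(y) / (len(classes) * np.bincount(y))

print("Balanced class weights:", dict(zip(classes.tolist(), weights.tolist())))
print("Matches the formula:", np.allclose(weights, manual))
```

The minority class receives the larger weight, so each of its misclassified samples contributes more to the loss than a misclassified majority sample.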