# Class Imbalance

![1_zB0xRorLnqPzjMY_DQ3HAA.png](attachment:1_zB0xRorLnqPzjMY_DQ3HAA.png)

[Source: Google for Developers](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data)

## Resampling

Resampling creates a transformed version of the training dataset in which the class distribution differs from the original, typically so that the classes are more balanced.

It is a simple and often effective strategy for imbalanced classification problems.

[Source: MachineLearningMastery](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/)

![1_H6XodlitlGDl9YdbwaZLMw.png](attachment:1_H6XodlitlGDl9YdbwaZLMw.png)

### Oversampling

**Random Oversampling:** Randomly duplicate examples in the minority class.

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset.
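
To make the mechanism concrete, here is a minimal sketch of random oversampling using only the Python standard library. The function name `random_oversample` and the toy data are illustrative assumptions, not part of any library; in practice a library class such as `RandomOverSampler` from `imblearn.over_sampling` covers the same idea.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest one.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the examples that belong to this class
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Draw with replacement, so the same example may be duplicated several times
        for i in rng.choices(idx, k=target - count):
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 majority examples (label 0) and 2 minority examples (label 1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
print(Counter(y_over))  # Counter({0: 8, 1: 8})
```

Every added row is an exact copy of an existing minority example, which is why this approach balances the classes without giving the model any new information.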

### Undersampling

**Random Undersampling:** Randomly delete examples in the majority class.

Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.
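
The same idea in reverse can be sketched as follows. This is a minimal standard-library illustration; `random_undersample` and the toy data are assumed names for this example only. A library class such as `RandomUnderSampler` from `imblearn.under_sampling` serves the same purpose.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement: each kept example appears exactly once
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original ordering of the kept examples
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 8 majority examples (label 0) and 2 minority examples (label 1)
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_under, y_under = random_undersample(X, y)
print(Counter(y_under))  # Counter({0: 2, 1: 2})
```

Here six of the eight majority examples are discarded, which shows the main cost of undersampling: potentially useful majority-class data is thrown away.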

# SMOTE

The challenge of working with imbalanced datasets is that many machine learning algorithms tend to neglect the minority class and therefore perform poorly on it, even though performance on the minority class is usually what matters most.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

[Source: MachineLearningMastery](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
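
The synthesis step can be sketched in a few lines. In the usual description of SMOTE, a new point is placed on the line segment between a minority example and one of its k nearest minority neighbours: `new = base + lam * (neighbour - base)` with `lam` drawn uniformly from [0, 1). The code below is a simplified illustration under that assumption, not the `imblearn` implementation; `smote_sketch`, `k`, and the toy points are hypothetical names chosen for this example.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration; requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        # Move from base towards the neighbour by a fraction lam
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four points in two dimensions
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Unlike random oversampling, the generated points are not copies: each one lies somewhere between two existing minority examples, so the minority region is filled in rather than repeated.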

![1_CeOd_Wbn7O6kpjSTKTIUog.png](attachment:1_CeOd_Wbn7O6kpjSTKTIUog.png)

# Breakout: SMOTE... with ChatGPT!

In [4]:
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Generate a synthetic imbalanced dataset (class 0 is the 10% minority, class 1 the 90% majority)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the data into training and testing sets; stratify so both keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Train a classifier on the resampled data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
print("Classification Report:\n", classification_report(y_test, y_pred))

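ImportError: cannot import name '_OneToOneFeatureMixin' from 'sklearn.base' (/Users/emilykenney/opt/anaconda3/lib/python3.8/site-packages/sklearn/base.py)

This error usually means the installed versions of imbalanced-learn and scikit-learn are incompatible. Installing matching versions of both packages and restarting the kernel typically resolves it.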

Try applying SMOTE to the fraud detection dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection/data

![image.png](attachment:image.png)