<a href="https://colab.research.google.com/github/cagBRT/Data/blob/main/Data_Sampling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Sampling:**<br>
## **Oversampling and Undersampling**

Machine learning techniques can give misleadingly optimistic performance on classification datasets when there is an imbalanced class distribution. <br>

Many machine learning algorithms are designed to operate on classification data with an equal number of observations for each class. <br>

When the dataset is imbalanced, algorithms can learn that the minority class examples are not important and can be ignored for the sake of overall performance.<br>

For example, decision trees, k-nearest neighbors, and neural networks can all learn that the minority class is less important than the majority class, so they pay more attention to, and perform better on, the majority class.

Imbalanced datasets are those with a severe skew in the class distribution, such as a ratio of 1:100 or 1:1000 between examples in the minority class and examples in the majority class.
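To make the risk concrete, here is a minimal sketch of a "model" that always predicts the majority class on a 1:100 dataset. The label counts are made up for illustration. Its accuracy looks excellent even though it never finds a single minority example.

```python
# Hypothetical 1:100 dataset: 1000 majority (0) and 10 minority (1) labels.
y_true = [0] * 1000 + [1] * 10
# A "model" that ignores the minority class entirely.
y_pred = [0] * len(y_true)

# Overall accuracy: fraction of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of true minority examples that were found.
minority_total = sum(t == 1 for t in y_true)
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_found / minority_total

print('Accuracy: %.3f' % accuracy)
print('Minority recall: %.3f' % minority_recall)
```

This is why accuracy alone is a poor guide on skewed data; a metric that focuses on the minority class tells a very different story.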

- Random sampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.<br>
- Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.<br>
- Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.<br>
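The two naive techniques in the list above can be sketched by hand with only the standard library. This is an illustration of the idea, not how imbalanced-learn implements it; the function names `random_oversample` and `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        # draw the missing rows with replacement, so duplicates are expected
        extra = rng.choices(idx, k=target - n)
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # draw without replacement, so the other rows of this class are discarded
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: one dummy feature, 990 majority and 10 minority examples.
X = [[i] for i in range(1000)]
y = [0] * 990 + [1] * 10

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print('Original:    ', Counter(y))
print('Oversampled: ', Counter(y_over))
print('Undersampled:', Counter(y_under))
```

Oversampling keeps every original row and adds copies, which is where the overfitting risk comes from. Undersampling throws away most of the majority rows, which is where the information loss comes from.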

The most popular solution to an imbalanced classification problem is to change the composition of the training dataset.
<br>

Techniques designed to change the class distribution in the training dataset are generally referred to as sampling methods as we are sampling an existing data sample.<br>

Once the class distribution of the training data is balanced, many standard ML algorithms can be trained on it directly.

**Sampling is only performed on the training dataset**.<br>

There are many data sampling methods to use for balancing class distribution in the training set.<br>

There is no single best data sampling method; the best method is the one that works best on your data. <br>

Sampling methods behave differently depending on the ML algorithm and the training dataset.

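Sampling is not performed on the holdout test or validation dataset, because those sets should keep the original class distribution that the model will face in practice.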

# **Use Random oversampling to balance the class distribution**

Oversampling and undersampling can be used for two-class (binary) classification problems and for multiclass classification problems with one or more majority or minority classes.

**Import libraries**

In [None]:
# example of random oversampling to balance the class distribution
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

**Create a dataset**

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0) 

**Use oversampling on the minority class**<br>
This method randomly duplicates examples from the minority class in the training dataset.

In [None]:

# summarize class distribution
print("Before resample:",Counter(y))
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
# summarize class distribution
print("After resample:",Counter(y_over))

# **Using a pipeline to perform Random Oversampling**

**Import libraries**

In [None]:
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler

**Define the dataset**

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

**Define the Pipeline**

In [None]:
# define pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())] 
pipeline = Pipeline(steps=steps)

**Evaluate the pipeline**

In [None]:
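# evaluate pipeline with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# note: for single-label problems, micro-averaged F1 is equal to accuracy
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
# average the scores over all folds and repeats
score = mean(scores)
print('F-measure: %.3f' % score)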



---



# **Undersampling**

Random undersampling deletes randomly selected examples from the majority class in the training dataset.<br>

This technique reduces the number of examples in the majority class in the working version of the training dataset. 

Undersampling can be repeated multiple times to achieve the desired class distribution, for example, an equal number of examples for each class.

**Import libraries**

In [None]:
# example of random undersampling to balance the class distribution
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from numpy import mean
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

**Create the dataset**

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0) 

**Use undersampling to reduce the size of the majority class in the training dataset**

In [None]:
# summarize class distribution
print("Before undersampling:",Counter(y))
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)
# summarize class distribution
print("After undersampling:",Counter(y_under))

**Create a dataset**

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

**Create the pipeline**

In [None]:
# define pipeline
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())] 
pipeline = Pipeline(steps=steps)

**Undersampling using a pipeline**

In [None]:
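# evaluate pipeline with repeated stratified 10-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# note: for single-label problems, micro-averaged F1 is equal to accuracy
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
# average the scores over all folds and repeats
score = mean(scores)
print('F-measure: %.3f' % score)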