**Classification Dataset**
A classification dataset is one whose records can each be assigned to one of two or more categories (classes).
Example #1: Survey data (yes/no answers)

**Classification Problem**
A classification problem is one where we try to predict which category (class) a data point belongs to, using a classification dataset.

**Imbalanced Dataset**
An imbalanced dataset is one where the data is heavily skewed towards one class.
e.g. Survey data (yes/no answers) with 900 yes and 100 no.

The problem with this imbalance is that a model trained on it tends to be biased towards the majority class (here, the 'yes' answers).
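
To make the bias concrete, here is a minimal sketch using the 900/100 survey split above. The `always_yes` function is a hypothetical stand-in for a model that has learned only the majority class; it is an illustration, not part of the original notes.

```python
# Survey labels from the example above: 900 "yes" and 100 "no".
labels = ["yes"] * 900 + ["no"] * 100

def always_yes(_row_index):
    """Hypothetical degenerate model: ignores its input and predicts the majority class."""
    return "yes"

predictions = [always_yes(i) for i in range(len(labels))]

# Overall accuracy looks good only because most answers are "yes".
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class: the share of real "no" answers the model found.
no_total = labels.count("no")
no_found = sum(p == "no" and y == "no" for p, y in zip(predictions, labels))
no_recall = no_found / no_total

print("accuracy:", accuracy)          # 0.9
print("recall for 'no':", no_recall)  # 0.0
```

A 90% accuracy here says nothing about the minority class, which is why accuracy alone can hide the bias.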

The solution is to fix this imbalance. There are two techniques available:
1) **Up Sampling**
    Up Sampling means increasing the number of data points in the minority class to make the dataset more balanced.
    Assume a sample dataset has 900 'No Fraud' samples and 100 'Fraud' samples. This is an imbalanced dataset.
    After Up Sampling is applied, you would have 900 'No Fraud' and 900 'Fraud' samples, making it a balanced dataset.

    **Disadvantage of Up Sampling**
    - Up Sampling duplicates existing minority data points rather than creating new ones, so it does NOT add VARIANCE between data points.
    - SMOTE oversampling addresses this by generating synthetic minority points that are interpolated between existing ones.


2) **Down Sampling**
    Down Sampling means decreasing the number of data points in the majority class to make the dataset more balanced.
    Assume a sample dataset has 900 'No Fraud' samples and 100 'Fraud' samples. This is an imbalanced dataset.
    After Down Sampling is applied, you would have 100 'No Fraud' and 100 'Fraud' samples, making it a balanced dataset.

    **Disadvantage of Down Sampling**
    Down Sampling discards data points (here, 800 of the 900 'No Fraud' samples), so information the model could have learned from is lost.
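
The SMOTE idea mentioned under Up Sampling can be sketched in a few lines. This is a simplified, stdlib-only illustration and not a reference implementation: the function name `smote_like`, the choice of `k=5`, and the toy data are all assumptions made for the example. In practice a library implementation, such as `SMOTE` from imbalanced-learn, would normally be used.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=123):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (SMOTE-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the line segment between base and neighbour.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class (assumed data): 100 two-feature points centred near (2, 2).
data_rng = random.Random(0)
minority = [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(100)]

# Add 800 synthetic points so the minority class reaches 900 samples.
new_points = smote_like(minority, n_new=800)
balanced_minority = minority + new_points

print("minority size after oversampling:", len(balanced_minority))
print("synthetic points that are not copies:", len(set(new_points) - set(minority)))
```

Unlike plain up-sampling, the new points lie between existing minority points instead of on top of them, which is where the extra variance comes from.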


In [None]:
import numpy as np
import pandas as pd

# Set a random seed (any number works; here 123) so the same random numbers are produced each time the code is run.
np.random.seed(123)

# Draw one random float from [0, 1) to show the seeded generator in action.
print(np.random.rand())

# ################### CREATE AN IMBALANCED DATASET ####################
# Build two DataFrames: class_Zeroes (900 rows, majority) and class_Ones (100 rows, minority).
total_samples = 1000
ratio = 0.9
n_zeroes = int(total_samples * ratio)  # 900 samples
n_ones = total_samples - n_zeroes      # 100 samples

print("class Zeroes=", n_zeroes, "class Ones=", n_ones)

# Create the imbalanced dataset
# np.random.normal() draws from a normal (Gaussian) distribution:
#   loc   = mean of the distribution
#   scale = standard deviation of the distribution
#   size  = number of values to draw
# Each DataFrame has 3 columns: feature1, feature2 and target
# class_Zeroes: features have mean=1, stddev=1.0; target = 0 repeated 900 times
# class_Ones:   features have mean=2, stddev=1.0; target = 1 repeated 100 times
class_Zeroes = pd.DataFrame({'feature1': np.random.normal(loc=1, scale=1.0, size=n_zeroes),
                             'feature2': np.random.normal(loc=1, scale=1.0, size=n_zeroes),
                             'target': [0] * n_zeroes})

class_Ones = pd.DataFrame({'feature1': np.random.normal(loc=2, scale=1.0, size=n_ones),
                           'feature2': np.random.normal(loc=2, scale=1.0, size=n_ones),
                           'target': [1] * n_ones})



print("class_Zeroes shape=", class_Zeroes.shape)
print("class_Ones shape=", class_Ones.shape)

# Combine the two dataframes (class_Zeroes and class_Ones) to create the imbalanced dataset
# ignore_index=True resets the index of the new dataframe; without it, the index values from the original dataframes are retained and you can end up with duplicate index values.
imbalanced_data = pd.concat([class_Zeroes, class_Ones], ignore_index=True)

print("imbalanced_data shape=", imbalanced_data.shape) #1000 rows, 3 columns

# Print the counts of each class in the target column
print("count of 0's and 1's in the target column:")
print(imbalanced_data['target'].value_counts())


# ################### UP-SAMPLING ####################
print(" ******************* UP-SAMPLING ****************** ")
# Create two dataframes, one for each classification = df_majority (for class 0) and df_minority (for class 1)
# Put all values from imbalanced_data where target column = 0 into df_majority
# Put all values from imbalanced_data where target column = 1 into df_minority
df_majority = imbalanced_data[imbalanced_data.target == 0]
df_minority = imbalanced_data[imbalanced_data.target == 1]
df_minority.shape #100 rows, 3 columns
df_majority.shape #900 rows, 3 columns

# Resample the minority class with replacement to match the number of samples in the majority class
# df_majority has 900 rows, df_minority has 100 rows, we will UP-SAMPLE df_minority to have 900 rows
from sklearn.utils import resample
df_minority_up_sampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class, total length of majority class
                                 random_state=123) # reproducible results

print("df_minority_up_sampled shape=", df_minority_up_sampled.shape) #900 rows, 3 columns, previous size was 100 rows, 3 columns

# Combine the majority class with the upsampled minority class
# df_majority has 900 rows, df_minority_up_sampled has 900 rows, total 1800 rows
upsampled_data = pd.concat([df_majority, df_minority_up_sampled], ignore_index=True)
print("upsampled_data shape=", upsampled_data.shape) #1800 rows, 3 columns

# Print the counts of each class in the target column after upsampling
print("Final upsampled data")
print(upsampled_data['target'].value_counts()) #900 of each class (0's and 1's)

# ################### DOWN-SAMPLING ####################
print(" ******************* DOWN-SAMPLING ****************** ")
# Create two dataframes, one for each classification = df_majority (for class 0) and df_minority (for class 1)
# Put all values from imbalanced_data where target column = 0 into df_majority
# Put all values from imbalanced_data where target column = 1 into df_minority
df_majority = imbalanced_data[imbalanced_data.target == 0]
df_minority = imbalanced_data[imbalanced_data.target == 1]
df_minority.shape #100 rows, 3 columns
df_majority.shape #900 rows, 3 columns

# Resample the majority class without replacement to match the number of samples in the minority class
# df_majority has 900 rows, df_minority has 100 rows, we will DOWN-SAMPLE df_majority to have 100 rows
from sklearn.utils import resample
df_majority_down_sampled = resample(df_majority, 
                                 replace=False,     # sample without replacement: keep 100 distinct rows out of the 900 majority rows
                                 n_samples=len(df_minority),    # to match minority class, total length of minority class
                                 random_state=123) # reproducible results

print("df_majority_down_sampled shape=", df_majority_down_sampled.shape) #100 rows, 3 columns, previous size was 900 rows, 3 columns

# Combine the down-sampled majority class with the minority class
# df_majority_down_sampled has 100 rows, df_minority has 100 rows, total 200 rows
downsampled_data = pd.concat([df_majority_down_sampled, df_minority], ignore_index=True)
print("downsampled_data shape=", downsampled_data.shape) #200 rows, 3 columns

# Print the counts of each class in the target column after downsampling
print("Final downsampled data")
print(downsampled_data['target'].value_counts()) #100 of each class (0's and 1's)



0.6964691855978616
class Zeroes= 900 class Ones= 100
class_Zeroes shape= (900, 3)
class_Ones shape= (100, 3)
imbalanced_data shape= (1000, 3)
count of 0's and 1's in the target column:
target
0    900
1    100
Name: count, dtype: int64
 ******************* UP-SAMPLING ****************** 
df_minority_up_sampled shape= (900, 3)
upsampled_data shape= (1800, 3)
Final upsampled data
target
0    900
1    900
Name: count, dtype: int64
 ******************* DOWN-SAMPLING ****************** 
df_majority_down_sampled shape= (100, 3)
downsampled_data shape= (200, 3)
Final downsampled data
target
0    100
1    100
Name: count, dtype: int64
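
The up-sampling output above reaches 900 minority rows, but because the rows are drawn with replacement from the existing 100, no new distinct rows appear. The sketch below shows this with the standard library only; `random.choices` is used as a stand-in for `sklearn.utils.resample` (an assumption for illustration, not the notebook's code), and rows are represented by hypothetical row ids.

```python
import random

rng = random.Random(123)

# 100 distinct minority rows, identified here by row id only.
minority_rows = list(range(100))

# Sample with replacement up to the majority size (900), as in up-sampling.
up_sampled = rng.choices(minority_rows, k=900)

print("rows after up-sampling:", len(up_sampled))
print("distinct rows:", len(set(up_sampled)))  # can never exceed 100
```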
