1. UpSampling
2. DownSampling

In [1]:
import pandas as pd
import numpy as np

In NumPy, np.random.seed(123) sets the random number generator's seed to a fixed value — in this case, 123.
What does that mean?

NumPy's random functions (like np.random.rand(), np.random.randint(), etc.) generate pseudo-random numbers. These numbers aren't truly random: they're produced by a deterministic algorithm, so starting from the same seed gives the same sequence of numbers every time.

Why use a seed?

    ✅ Reproducibility: Ensures that your results can be exactly replicated — very important for debugging, testing, and sharing your work.

    🔄 Controlled randomness: You still get random-like results, but they're repeatable when needed.
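
A minimal sketch of the reproducibility claim above. The variable names are illustrative and not part of the notebook; the idea is simply to reseed the global generator and compare two draws.

```python
import numpy as np

# Seed the global generator, draw three numbers, then reseed and draw again.
np.random.seed(123)
first_draw = np.random.rand(3)

np.random.seed(123)
second_draw = np.random.rand(3)

# Same seed, same sequence.
print(np.array_equal(first_draw, second_draw))  # True

# Without reseeding, the generator simply continues its sequence,
# so the next draw is a different set of numbers.
next_draw = np.random.rand(3)
print(first_draw)
print(next_draw)
```

Note that the seed fixes the whole sequence, not a single value: every later call keeps consuming numbers from that same sequence.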

In [2]:
np.random.seed(123)

n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)  # 900 data points of class 0
n_class_1 = n_samples - n_class_0  # 100 data points of class 1

In [3]:
n_class_0, n_class_1

(900, 100)

In [5]:
class_0 = pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0
})

class_1 = pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})

What is np.random.normal(loc=..., scale=..., size=...)?

This function generates random numbers from a normal (Gaussian) distribution — the bell curve.
Parameters:

    loc → The mean of the distribution (center of the bell curve).

    scale → The standard deviation (spread or width of the bell).

    size → How many numbers you want to generate.

Mean (loc) = 0 → Most values will be centered around 0.

Standard Deviation (scale) = 1 → About 68% of values will lie within ±1 of the mean.
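
A quick empirical check of what loc and scale do, as a sketch: the sample size of 100,000 and the variable names are assumptions chosen for illustration. For a normal distribution, roughly 68% of values fall within one standard deviation of the mean.

```python
import numpy as np

np.random.seed(123)

# Draw a large sample so the estimates are close to the true parameters.
samples = np.random.normal(loc=0, scale=1, size=100_000)

sample_mean = samples.mean()                       # should be close to loc = 0
sample_std = samples.std()                         # should be close to scale = 1
share_within_one_sd = np.mean(np.abs(samples) <= 1)  # should be close to 0.68

print(round(sample_mean, 2), round(sample_std, 2), round(share_within_one_sd, 2))
```

Changing loc shifts the whole bell curve; changing scale stretches or narrows it. That is why class 1 in this notebook (loc=2) sits to the right of class 0 (loc=0) on both features.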

🏷️ 2. What is 'target': [0]*n_class_0?

Multiplying a list by an integer repeats its contents that many times, so [0] multiplied by 10 gives a list of ten 0s:

[0]*10 = [0,0,0,0,0,0,0,0,0,0]

[2]*4=[2,2,2,2]
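
The same idea as runnable code. The NumPy alternative at the end is not used in this notebook; it is shown only as an equivalent way to build a constant label column.

```python
import numpy as np

n = 4

# List repetition: the single-element list [0] repeated n times.
labels = [0] * n
print(labels, len(labels))  # [0, 0, 0, 0] 4

# An equivalent NumPy array of integer zeros (an alternative, not the notebook's approach).
labels_np = np.zeros(n, dtype=int)
print(labels_np.tolist() == labels)  # True
```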

In [23]:
# class_0.hist()

In [14]:
class_1

Unnamed: 0,feature_1,feature_2,target
0,-0.643425,2.571923,1
1,1.551009,1.782767,1
2,1.641093,2.054318,1
3,2.133194,2.155998,1
4,1.355758,2.467810,1
...,...,...,...
95,2.677156,1.092048,1
96,2.963404,0.181955,1
97,1.621476,1.877267,1
98,3.429559,3.794486,1


In [22]:
# class_1.hist()

In [16]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [17]:
df

Unnamed: 0,feature_1,feature_2,target
0,-1.774224,0.285744,0
1,-1.201377,0.333279,0
2,1.096257,0.531807,0
3,0.861037,-0.354766,0
4,-1.520367,-1.120815,0
...,...,...,...
995,2.677156,1.092048,1
996,2.963404,0.181955,1
997,1.621476,1.877267,1
998,3.429559,3.794486,1


In [18]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [21]:
# df.hist()

#### Up Sampling

Upsampling increases the number of minority-class rows (the 100 rows with target = 1) until it matches the majority class (900 rows).

In [28]:
df_minority=df[df['target']==1]  # minority is where target =1 because these are 100
df_majority=df[df['target']==0] # majority is where target = 0 because these are 900

In [26]:
df_minority.head()

Unnamed: 0,feature_1,feature_2,target
900,-0.643425,2.571923,1
901,1.551009,1.782767,1
902,1.641093,2.054318,1
903,2.133194,2.155998,1
904,1.355758,2.46781,1


In [27]:
df_majority.head()

Unnamed: 0,feature_1,feature_2,target
0,-1.774224,0.285744,0
1,-1.201377,0.333279,0
2,1.096257,0.531807,0
3,0.861037,-0.354766,0
4,-1.520367,-1.120815,0


In [30]:
## Perform upsampling
from sklearn.utils import resample

In [32]:
# replace=True samples with replacement, so minority rows can repeat until there are len(df_majority) of them
df_minority_upsample = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

### 🔁 What Does `replace=True` Mean in Resampling?

When using `resample(df_minority, replace=True)`, you're performing **sampling with replacement**.

---

#### ✅ What is Sampling With Replacement?

- **With replacement** means that **after a sample (row) is selected, it is put back into the pool**, so it **can be selected again**.
- This allows **duplicate rows** to appear in the resampled dataset.
- It's commonly used when **upsampling** a minority class — especially when you need more samples than you originally have.

---

#### ❌ Sampling Without Replacement (`replace=False`)

- Once a sample is selected, it is **not returned to the pool**.
- Each sampled row is **unique**, so the number of samples you can draw is limited to the number of original rows.

---

#### 🎯 Why Use `replace=True`?

When upsampling a minority class:

- You might only have, for example, 100 rows in `df_minority`.
- If you want to increase that to 500 rows, you **must sample with replacement**, otherwise you can’t get more than 100 unique samples.
- Sampling with replacement allows you to **repeat some rows**, effectively boosting the minority class size.

---

#### 🧠 Example:

Original `df_minority`:

| index | feature_1 | feature_2 | target |
|-------|-----------|-----------|--------|
| 0     | 0.5       | 1.2       | 1      |
| 1     | 0.7       | 0.9       | 1      |
| 2     | 0.3       | 1.0       | 1      |

Resampled with replacement (`n_samples=5`):

| index | feature_1 | feature_2 | target | note       |
|-------|-----------|-----------|--------|------------|
| 1     | 0.7       | 0.9       | 1      |            |
| 0     | 0.5       | 1.2       | 1      |            |
| 1     | 0.7       | 0.9       | 1      | ← repeated |
| 2     | 0.3       | 1.0       | 1      |            |
| 0     | 0.5       | 1.2       | 1      | ← repeated |

---

#### ✅ Summary:

- `replace=True` → allows **duplicates** → used for **upsampling**.
- Useful when you want to **balance class distributions** in a dataset.

You're allowing the function to duplicate rows from df_minority in the resampled output.
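
The difference between the two modes can be seen on a three-row toy frame like the one in the example table. This is a sketch: `toy_minority` and the other variable names are made up for illustration, and which rows get repeated depends on the `random_state`.

```python
import pandas as pd
from sklearn.utils import resample

# Toy minority class with three distinct rows (values from the example table).
toy_minority = pd.DataFrame({
    "feature_1": [0.5, 0.7, 0.3],
    "feature_2": [1.2, 0.9, 1.0],
    "target": [1, 1, 1],
})

# With replacement: 5 rows drawn from 3, so at least one row must repeat.
with_replacement = resample(toy_minority, replace=True, n_samples=5, random_state=0)
print(with_replacement.index.tolist())      # original index labels, with repeats
print(with_replacement.duplicated().any())  # True

# Without replacement: asking for more rows than exist raises a ValueError.
try:
    resample(toy_minority, replace=False, n_samples=5, random_state=0)
    oversample_failed = False
except ValueError as err:
    oversample_failed = True
    print("ValueError:", err)
```

The resampled frame keeps the original index labels, which is why repeated rows show up as repeated index values.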

What is random_state?

In the context of functions like resample(), random_state is a seed for the random number generator. It controls the randomness of the operation, allowing for reproducibility.

🧠 Key Points About random_state

    Deterministic Randomness:

        When you use a random operation (like sampling, splitting, or shuffling), it is generally random and will produce different results every time you run it.

        However, by setting random_state to a fixed value, you can make sure that the random process always produces the same result.

    Why is it useful?:

        Reproducibility: By fixing the seed (random_state), you ensure that your results can be replicated by anyone running the same code.

        In machine learning, this is useful when you want to compare models or results consistently across different runs.

        It ensures that even with random processes, the outputs are predictable for testing, debugging, or reporting.

    How to Use It:

        random_state can be set to any integer value (e.g., 42, 0, 100, etc.). This will "lock" the randomness.

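        If you leave random_state=None (the default), scikit-learn draws from NumPy's global random state, so the result changes from run to run unless that global state has itself been seeded (for example with np.random.seed()).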

        Using the same number for random_state ensures you get the same result every time.
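
A small sketch of these points, using a hypothetical ten-row frame named `toy`: two calls with the same `random_state` return identical samples, while a different value will generally pick different rows.

```python
import pandas as pd
from sklearn.utils import resample

toy = pd.DataFrame({"value": range(10)})

# Same random_state twice, then a different one.
run_a = resample(toy, replace=True, n_samples=10, random_state=42)
run_b = resample(toy, replace=True, n_samples=10, random_state=42)
run_c = resample(toy, replace=True, n_samples=10, random_state=7)

print(run_a.equals(run_b))       # True: identical rows in identical order
print(run_a["value"].tolist())
print(run_c["value"].tolist())   # compare with the line above
```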

In [33]:
df_minority_upsample.shape

(900, 3)

In [34]:
df_minority_upsample.value_counts()

feature_1  feature_2  target
2.672100   4.062063   1         19
2.931332   0.238687   1         18
2.615975   2.827540   1         18
1.297848   1.971292   1         16
3.429559   3.794486   1         16
                                ..
1.185665   1.072206   1          4
2.200704   2.874327   1          4
2.353186   3.057770   1          4
2.029244   1.073458   1          4
3.308688   1.309217   1          3
Name: count, Length: 100, dtype: int64

In [35]:
df_minority_upsample['target'].value_counts()


target
1    900
Name: count, dtype: int64

In [36]:
df_upsampled= pd.concat([df_majority,df_minority_upsample])

In [37]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

In [38]:
df_upsampled.shape

(1800, 3)
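
The upsampling steps above can be collected into one function. This is only a sketch under stated assumptions: `upsample_minority` is a hypothetical helper (not a scikit-learn function), it assumes a binary target, and the final shuffle and index reset are additions that the notebook does not perform.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df, target_col, minority_label, random_state=42):
    """Hypothetical helper: resample the minority class with replacement
    until it has as many rows as the rest of the data (binary target assumed)."""
    minority = df[df[target_col] == minority_label]
    majority = df[df[target_col] != minority_label]

    minority_up = resample(
        minority,
        replace=True,                # allow repeated rows
        n_samples=len(majority),     # match the majority class size
        random_state=random_state,
    )

    balanced = pd.concat([majority, minority_up])
    # Optional extras: shuffle the rows and rebuild the index, because the
    # upsampled rows carry repeated index labels from the original frame.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Usage on a small synthetic frame with the same 900/100 split as the notebook.
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
balanced = upsample_minority(demo, "target", minority_label=1)
print(balanced["target"].value_counts().to_dict())
print(balanced.shape)  # (1800, 2)
```

Resetting the index matters if later code selects rows by label: without it, a repeated label returns several rows at once.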

#### Down Sampling

In [39]:
class_0 = pd.DataFrame({
    'feature_1':np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2':np.random.normal(loc=0,scale=1,size=n_class_0),
    'target':[0]*n_class_0
})

class_1 = pd.DataFrame({
    'feature_1':np.random.normal(loc=2,scale=1,size=n_class_1),
    'feature_2':np.random.normal(loc=2,scale=1,size=n_class_1),
    'target':[1]*n_class_1
})

In [40]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [41]:
df

Unnamed: 0,feature_1,feature_2,target
0,1.517860,-0.941745,0
1,0.824163,0.691638,0
2,-1.797046,-0.234782,0
3,0.098292,0.135919,0
4,0.388795,2.684533,0
...,...,...,...
995,1.385447,2.249912,1
996,3.568762,2.027591,1
997,2.540002,3.365407,1
998,2.662896,1.580869,1


In [43]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [44]:
df_minority.shape, df_majority.shape 

((100, 3), (900, 3))

In [49]:
## Perform downsampling
df_majority_downsample = resample(df_majority,
                                  replace=False,
                                  n_samples=len(df_minority),
                                  random_state=42) # sample without replacement

In [50]:
df_majority_downsample

Unnamed: 0,feature_1,feature_2,target
70,-1.644016,-1.319344,0
827,0.300620,0.195662,0
231,-0.479301,1.028600,0
588,1.124394,0.360059,0
39,-0.792136,-0.742869,0
...,...,...,...
398,0.029044,1.393804,0
76,0.222636,0.259494,0
196,0.049524,-0.470017,0
631,-1.135788,-0.472097,0


In [51]:
df_downsampled= pd.concat([df_minority,df_majority_downsample])

In [52]:
df_downsampled

Unnamed: 0,feature_1,feature_2,target
900,1.216660,1.386137,1
901,3.688449,2.842091,1
902,2.573096,1.602177,1
903,2.366302,0.286096,1
904,-0.316440,1.636890,1
...,...,...,...
398,0.029044,1.393804,0
76,0.222636,0.259494,0
196,0.049524,-0.470017,0
631,-1.135788,-0.472097,0


In [55]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64
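
A short sketch of what downsampling without replacement implies, on freshly generated data with a single feature (the frame and variable names here are illustrative, not the notebook's): no majority row is picked twice, and the rows that are not picked are simply left out of the balanced set.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

np.random.seed(123)
majority = pd.DataFrame({"feature_1": np.random.normal(0, 1, 900), "target": 0})
minority = pd.DataFrame({"feature_1": np.random.normal(2, 1, 100), "target": 1})

majority_down = resample(
    majority,
    replace=False,               # each row can be picked at most once
    n_samples=len(minority),     # shrink to the minority class size
    random_state=42,
)

print(majority_down.index.is_unique)        # True: no row was picked twice
print(len(majority) - len(majority_down))   # 800 majority rows are not used

balanced_down = pd.concat([minority, majority_down])
print(balanced_down["target"].value_counts().to_dict())
```

This is the trade-off compared with upsampling: the classes end up balanced without any duplicated rows, but most of the majority data is discarded.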