## Creating an Imbalanced Dataset

In [4]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(123)

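`np.random.seed(...)` forces NumPy's random number generator to behave predictably. The cell above uses 123 as the seed; the examples below use 0, and any fixed integer works the same way.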

Think of the random generator like a shuffle machine.

With a seed, you are giving it a fixed starting position.

Without a seed, it starts in a random position every time.

So `np.random.seed(0)` ensures:

➡ Every time you run the code
➡ You get THE EXACT SAME random numbers
➡ In the exact same order

WITH seed:
np.random.seed(0)
print(np.random.rand(3))


Output (always):

[0.5488135  0.71518937 0.60276338]


Run it 1000 times → same result.

WITHOUT seed:
print(np.random.rand(3))


Output changes on EVERY run, for example:

[0.12 0.44 0.88]
next run → [0.77 0.21 0.95]
next run → [0.34 0.56 0.10]


Every execution = different numbers.
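The behavior described above can be checked directly. This is a minimal sketch, assuming only NumPy: reseeding with the same value replays the same sequence, while drawing again without reseeding simply continues the sequence and gives new numbers.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(3)

# Drawing again without reseeding continues the sequence, so the values differ.
second = np.random.rand(3)

# Reseeding with the same value puts the generator back at the same starting point.
np.random.seed(0)
replay = np.random.rand(3)

print(np.array_equal(first, replay))   # True
print(np.array_equal(first, second))   # False
```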

🧠 WHY do we even need a seed?

Because ML and data science require reproducibility.
If you generate random data or split a dataset, and then:

* someone else runs your code,
* or you run it again a month later,
* or you debug your model,

…you want the EXACT same behavior.

Otherwise:

* your training accuracy changes
* your dataset split changes
* your results fluctuate
* you can't debug properly
* no one else can reproduce your project

It becomes chaos.
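The "dataset split changes" problem can be illustrated with a small sketch. The helper `shuffled_split` below is hypothetical (it is not part of this notebook or of any library); it assumes a plain shuffle of row positions is an acceptable way to split.

```python
import numpy as np

def shuffled_split(n_rows, test_fraction, seed):
    """Hypothetical helper: return (train_idx, test_idx) from a seeded shuffle."""
    np.random.seed(seed)                    # fix the starting position
    order = np.random.permutation(n_rows)   # shuffled row positions 0..n_rows-1
    n_test = int(n_rows * test_fraction)
    return order[n_test:], order[:n_test]

# Same seed on both calls, so both calls produce the same split.
train_a, test_a = shuffled_split(10, 0.2, seed=123)
train_b, test_b = shuffled_split(10, 0.2, seed=123)

print(np.array_equal(test_a, test_b))    # True
print(len(train_a), len(test_a))         # 8 2
```

With a different seed (or no seed at all) the two calls would generally put different rows in the test set, which is exactly what makes results hard to compare between runs.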

In [1]:
n_samples = 1000
ratio = 0.9
n_class0 = int(n_samples * ratio)
n_class1 = n_samples - n_class0


n_class0, n_class1

(900, 100)

In [7]:
zeroDf = pd.DataFrame({
    'feature1' : np.random.normal(loc=1, scale=1, size=n_class0),
    'feature2' : np.random.normal(loc=1, scale=1, size=n_class0),
    'target' : [0]*n_class0 
})

oneDf = pd.DataFrame({
    'feature1': np.random.normal(loc=1, scale=1, size=n_class1),
    'feature2' : np.random.normal(loc=1, scale=1, size=n_class1),
    'target' : [1]*n_class1
})


df = pd.concat([zeroDf, oneDf], ignore_index=True)

df

| | feature1 | feature2 | target |
|---|---|---|---|
| 0 | 0.186139 | 1.789128 | 0 |
| 1 | 2.079940 | 1.987302 | 0 |
| 2 | 2.128683 | 0.537318 | 0 |
| 3 | 0.316058 | 1.612378 | 0 |
| 4 | 0.864886 | 0.812350 | 0 |
| ... | ... | ... | ... |
| 995 | 1.513051 | 0.930543 | 1 |
| 996 | 2.664769 | 0.968859 | 1 |
| 997 | 2.440069 | -0.816291 | 1 |
| 998 | 2.485371 | 0.353901 | 1 |



### **What `ignore_index=True` actually does**

When you concatenate two DataFrames, Pandas normally **keeps their original row indexes**.

Example:

* `zeroDf` might have indexes `0,1,2,...,899`
* `oneDf` might also have indexes `0,1,2,...,99`

If you do:

```python
df = pd.concat([zeroDf, oneDf])
```

Your final DataFrame will have **duplicate indexes**, because both original DataFrames start at 0.

Looks like this:

```
index | feature1 | feature2 | target
0     | ...      | ...      | 0
1     | ...      | ...      | 0
...
899   | ...      | ...      | 0
0     | ...      | ...      | 1   ← duplicate index!
1     | ...      | ...      | 1   ← duplicate index!
...
```

This is messy and error-prone for most use cases.

### **So what does `ignore_index=True` do?**

It **throws away the original indexes**
and creates a **fresh, clean, continuous index**:

```
index | feature1 | feature2 | target
0     | ...      | ...      | 0
1     | ...      | ...      | 0
...
999   | ...      | ...      | 1
```

That's it. Nothing complicated.

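### **If you don't use it?**

You end up with **duplicate index values**, which can cause trouble in:

* row selection (`df.loc[0]` returns two rows instead of one)
* merging or joining on the index
* preparing data for ML models, where index-aligned steps can mismatch rows
* debugging

So unless the original index carries meaning, it's safer to use `ignore_index=True` when stacking rows.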
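The difference is easy to see on a tiny example. This sketch uses two stand-in frames (`a` and `b` are hypothetical, not the notebook's `zeroDf` and `oneDf`) and only standard pandas calls.

```python
import pandas as pd

# Two tiny stand-in frames, each with the default index 0, 1.
a = pd.DataFrame({'feature1': [0.1, 0.2], 'target': [0, 0]})
b = pd.DataFrame({'feature1': [0.3, 0.4], 'target': [1, 1]})

kept = pd.concat([a, b])                       # original indexes are kept
fresh = pd.concat([a, b], ignore_index=True)   # new index 0..3

print(kept.index.tolist())    # [0, 1, 0, 1]
print(kept.index.is_unique)   # False
print(len(kept.loc[[0]]))     # 2: label 0 matches one row from each frame
print(fresh.index.tolist())   # [0, 1, 2, 3]

# A frame that already has duplicate labels can be repaired with reset_index.
repaired = kept.reset_index(drop=True)
print(repaired.index.equals(fresh.index))   # True
```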


In [8]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64
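The same counts can also be read as proportions, which makes the degree of imbalance explicit. A minimal sketch on a stand-in `target` Series with the same 900/100 split (the variable names here are illustrative):

```python
import pandas as pd

# Stand-in target column with the same 900/100 split as above.
target = pd.Series([0] * 900 + [1] * 100, name='target')

# normalize=True returns each class's share of the rows instead of its count.
proportions = target.value_counts(normalize=True)
print(proportions.loc[0], proportions.loc[1])   # 0.9 0.1

# Imbalance ratio: majority count divided by minority count.
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)   # 9.0
```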

## Upsampling

In [9]:
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [12]:
from sklearn.utils import resample

In [13]:
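df_minority_upsampled = resample(
    df_minority,
    replace=True,                 # sample with replacement, so rows can repeat
    n_samples=len(df_majority),   # match the majority class size (900)
    random_state=42               # fixed seed for a reproducible sample
)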

df_minority_upsampled['target'].value_counts()

target
1    900
Name: count, dtype: int64

In [20]:
upsampled_df = pd.concat([df_majority, df_minority_upsampled], ignore_index=True)

In [21]:
upsampled_df['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
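Upsampling with replacement creates no new information: the 900 minority rows are copies drawn from the original 100. The sketch below shows this on stand-in data, using pandas' own `DataFrame.sample` as an alternative to `sklearn.utils.resample` (a swapped-in technique, not the call used in this notebook).

```python
import numpy as np
import pandas as pd

np.random.seed(123)
# Stand-in minority class: 100 rows, like df_minority above.
minority = pd.DataFrame({
    'feature1': np.random.normal(loc=1, scale=1, size=100),
    'target': [1] * 100,
})

# Draw 900 rows with replacement; original index labels are kept on the copies.
upsampled = minority.sample(n=900, replace=True, random_state=42)

print(len(upsampled))                     # 900
print(upsampled.index.nunique() <= 100)   # True: only original rows are reused
print(upsampled.duplicated().any())       # True: repeated rows are exact copies
```

Note also that `pd.concat` places all majority rows before all minority rows; if row order matters downstream, something like `upsampled_df.sample(frac=1, random_state=42)` can shuffle the result.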

## Downsampling

In [17]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [18]:
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

In [23]:
from sklearn.utils import resample

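df_majority_downsampled = resample(
    df_majority,
    replace=False,                # sample without replacement, so no row repeats
    n_samples=len(df_minority),   # shrink to the minority class size (100)
    random_state=42               # fixed seed for a reproducible sample
)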

df_majority_downsampled['target'].value_counts()

target
0    100
Name: count, dtype: int64

In [24]:
downsampled_df = pd.concat([df_minority, df_majority_downsampled], ignore_index=True)
downsampled_df['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64
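The split, resample, and concatenate steps above can also be written as a single grouped operation. This is a sketch of an alternative, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available); `df_demo` is stand-in data, not the notebook's `df`.

```python
import pandas as pd

# Stand-in frame with the same 900/100 class split; feature values are placeholders.
df_demo = pd.DataFrame({
    'feature1': range(1000),
    'target': [0] * 900 + [1] * 100,
})

# Size of the smallest class; every class is cut down to this many rows.
n_min = int(df_demo['target'].value_counts().min())

# Sample n_min rows from each class (without replacement by default).
balanced = (
    df_demo.groupby('target')
    .sample(n=n_min, random_state=42)
    .reset_index(drop=True)
)

print(balanced['target'].value_counts().sort_index().to_dict())   # {0: 100, 1: 100}
```

Because the minority class already has exactly `n_min` rows, all of its rows are kept and only the majority class loses data.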