## Handling Imbalanced Dataset

1. Up Sampling
2. Down Sampling

**Handling Imbalanced Dataset**  
Imbalanced datasets occur when one class (category) has far fewer samples than another, which can bias machine learning models toward the larger class.

**Up Sampling:**  
Up sampling increases the number of samples in the minority class by duplicating or generating new samples until both classes are balanced.  
- **Use:** Prevents models from ignoring the minority class, improving detection and prediction for rare events.

**Down Sampling:**  
Down sampling reduces the number of samples in the majority class by randomly removing samples until both classes are balanced.  
- **Use:** Prevents models from being biased toward the majority class, but may lose information.

**In Data Analysis and Evaluation:**  
- Both techniques help the model learn from every class instead of defaulting to the majority class.
- They can improve metrics such as recall and F1-score for the minority class.
- Essential for big data and real-world problems where class imbalance is common (e.g., fraud detection, medical diagnosis).
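
To make the recall and F1-score mentioned above concrete, here is a minimal sketch that computes them by hand with NumPy. The label arrays are made up for illustration and are not taken from the notebook.

```python
import numpy as np

# Hypothetical labels: 1 marks the minority (rare) class.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # minority rows correctly flagged
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # majority rows wrongly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # minority rows missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = float(np.mean(y_true == y_pred))

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy is 0.80 while recall and F1 for the minority class are only 0.50, which is why accuracy alone is a weak guide on imbalanced data.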

In [1]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

**What is it doing?**
- Imports necessary libraries (`numpy` and `pandas`) for data manipulation.
- Sets a random seed so results are reproducible.
- Defines the total number of samples (`n_samples = 1000`).
- Sets the ratio for the majority class (`class_0_ratio = 0.9`), meaning 90% of samples will be in class 0.
- Calculates the number of samples for each class:  
  - `n_class_0` is 900 (majority class).
  - `n_class_1` is 100 (minority class).

**Why is it doing this before upsampling?**
- This step creates an imbalanced dataset, which is common in real-world problems.
- The imbalance is necessary to demonstrate why upsampling is needed.

**Why do we need to choose upsampled data?**
- In imbalanced datasets, models tend to ignore the minority class, leading to poor predictions for rare events.
- Upsampling the minority class later will help balance the dataset, so models learn from both classes equally and make fair, accurate predictions.

**Summary:**  
This code sets up an imbalanced dataset to show the need for upsampling, which is essential for building unbiased machine learning models.

A code-by-code explanation of the cell above:

---

### 1. `import numpy as np`  
Imports the NumPy library, which is used for numerical operations and creating random numbers.

### 2. `import pandas as pd`  
Imports the pandas library, which is used for data manipulation and creating DataFrames.

---

### 3. `np.random.seed(123)`  
Sets the random seed to 123.  
**Why:** Ensures that random numbers generated are the same every time you run the code (reproducibility).

---

### 4. Define the class sizes


In [None]:
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

- **n_samples = 1000:** Total number of samples in your dataset.
- **class_0_ratio = 0.9:** 90% of samples will be in class 0 (majority class).
- **n_class_0 = int(n_samples * class_0_ratio):** Calculates the number of samples in class 0 (900).
- **n_class_1 = n_samples - n_class_0:** Calculates the number of samples in class 1 (100).

**Why:**  
This sets up an imbalanced dataset, where class 0 is much larger than class 1. This is common in real-world problems (e.g., fraud detection).

---

### 5. `n_class_0, n_class_1`  
Displays the number of samples in each class:  
- Output: `(900, 100)`

**Why:**  
Confirms the split and shows the imbalance.

---

**Summary:**  
You are preparing to create a dataset with two classes, one much larger than the other, to demonstrate how to handle imbalanced data in analysis and machine learning.

In [2]:
n_class_0,n_class_1

(900, 100)

<!-- What it does: Sets up the environment and calculates how many samples will be in each class (90% class 0, 10% class 1).
Why useful: Simulates a real-world imbalanced dataset for demonstration. -->

In [3]:
## CREATE MY DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

**What is it doing?**
- Creates two separate DataFrames:
  - `class_0` for the majority class (target = 0) with 900 samples, features drawn from a normal distribution centered at 0.
  - `class_1` for the minority class (target = 1) with 100 samples, features drawn from a normal distribution centered at 2.
- Each DataFrame includes two features and a target label.

**Why is it doing this before upsampling?**
- This step sets up an imbalanced dataset, which is common in real-world scenarios.
- The imbalance is necessary to demonstrate why upsampling is needed to balance the classes.

**Why do we need to choose upsampled data?**
- Imbalanced datasets cause models to ignore the minority class, resulting in poor predictions for rare events.
- Upsampling the minority class later will help balance the dataset, so models learn from both classes equally and make fair, accurate predictions.

**Summary:**  
This code creates an imbalanced dataset to show the need for upsampling, which is essential for building unbiased and effective machine learning models.

In the code above, **features** are the input variables used for prediction, specifically `'feature_1'` and `'feature_2'`. These are generated from normal distributions with different means for each class. The **target** is the output variable that the model tries to predict, represented by the `'target'` column. It indicates the class label: `0` for the majority class and `1` for the minority class. Features help the model learn patterns, while the target guides classification.
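
As a rough illustration of the feature/target distinction, the sketch below splits a small DataFrame into inputs and labels. The names `toy`, `X`, and `y` and the seed are assumptions made for this example; the notebook itself does not perform this split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
toy = pd.DataFrame({
    "feature_1": rng.normal(size=5),
    "feature_2": rng.normal(size=5),
    "target": [0, 0, 0, 1, 1],
})

X = toy[["feature_1", "feature_2"]]  # inputs the model learns patterns from
y = toy["target"]                    # label the model tries to predict

print(X.shape, y.shape)
```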

What it does: Creates two groups with different feature distributions and target labels.
Why useful: Mimics how features may differ between classes in real data.

In [None]:
##MAIN CODE

df=pd.concat([class_0,class_1]).reset_index(drop=True)

A step-by-step explanation of how the dataset is built:

---

### 1. Create the majority class DataFrame (`class_0`)


In [None]:
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})



- **What it does:**  
  Generates `n_class_0` samples for the majority class (`target=0`).  
  - `feature_1` and `feature_2` are drawn from a normal distribution centered at 0.
  - All target labels are 0.

- **Why useful:**  
  Simulates the majority class in an imbalanced dataset.

---

### 2. Create the minority class DataFrame (`class_1`)


In [None]:
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})



- **What it does:**  
  Generates `n_class_1` samples for the minority class (`target=1`).  
  - `feature_1` and `feature_2` are drawn from a normal distribution centered at 2.
  - All target labels are 1.

- **Why useful:**  
  Simulates the minority class, with features different from the majority class.

---

### 3. Combine both classes into one DataFrame


In [None]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)



- **What it does:**  
  Merges the two DataFrames (`class_0` and `class_1`) into a single DataFrame `df`.
  - `reset_index(drop=True)` ensures the index is continuous and unique.

- **Why useful:**  
  Prepares the full dataset for analysis, showing the class imbalance and feature differences.

---

**Summary:**  
You have created a synthetic dataset with two classes, each with different feature distributions and sizes, mimicking real-world class imbalance for further analysis or machine learning.

**What is it doing?**
- Combines the two DataFrames `class_0` (majority class) and `class_1` (minority class) into a single DataFrame `df`.
- `reset_index(drop=True)` resets the row index so it is continuous and unique.

**Why is it doing this before upsampling?**
- This step merges both classes to create the full dataset, showing the class imbalance before any upsampling.
- It prepares the data for further processing, such as upsampling the minority class.

**Why do we need to choose upsampled data?**
- After upsampling, you will combine the upsampled minority class with the majority class to create a balanced dataset.
- This ensures both classes have equal representation, which is important for fair and accurate machine learning.

**Summary:**  
This step creates the initial imbalanced dataset, which is then used for upsampling to balance the classes and improve model performance.

What it does: Merges both classes into one DataFrame.
Why useful: Prepares the dataset for analysis and modeling.

In [5]:
df.tail()

Unnamed: 0,feature_1,feature_2,target
995,1.376371,2.845701,1
996,2.23981,0.880077,1
997,1.13176,1.640703,1
998,2.902006,0.390305,1
999,2.69749,2.01357,1


In [6]:
df.head()

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.25275,0
4,-0.5786,-0.292004,0


In [7]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

What it does: Counts how many samples are in each class.
Why useful: Confirms the dataset is imbalanced.

In [8]:
## upsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

**What is it doing?**
- Splits the main DataFrame `df` into two separate DataFrames:
  - `df_minority` contains all rows where the target is 1 (minority class).
  - `df_majority` contains all rows where the target is 0 (majority class).

**Why split the classes before upsampling?**
- To identify and separate the minority and majority classes before upsampling.
- This separation is necessary so you can specifically upsample the minority class to balance the dataset.

**Why do we need to choose upsampled data?**
- In imbalanced datasets, the minority class has fewer samples, causing models to ignore it.
- By upsampling the minority class, you ensure both classes have equal representation, which helps the model learn patterns from both classes and improves fairness and accuracy.

**Summary:**  
This step prepares the data for upsampling by separating the classes, which is essential for balancing the dataset and building unbiased machine learning models.

What it does:
Splits the dataset into minority (target=1) and majority (target=0) classes.
Why valuable:
Allows you to separately analyze and process each class, which is essential for balancing imbalanced datasets.

In [9]:
from sklearn.utils import resample
df_minority_upsampled=resample(df_minority,replace=True, #Sample With replacement
         n_samples=len(df_majority),
         random_state=42
        )



**What is it doing?**
- Uses scikit-learn’s `resample` function to draw rows at random, with replacement, from the minority class (`df_minority`) until it has as many rows as the majority class (`df_majority`). Existing minority rows are duplicated; no new data points are created.
- `replace=True` allows the same sample to be picked multiple times (sampling with replacement).
- `n_samples=len(df_majority)` sets the number of upsampled minority samples to match the majority class.
- `random_state=42` ensures reproducibility.
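
The effect of these parameters can be seen on a tiny example. This is a minimal sketch on a hypothetical three-row DataFrame (`minority` and its values are invented here), not the notebook's data.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical minority class with only three rows.
minority = pd.DataFrame({"feature_1": [2.1, 1.8, 2.5], "target": [1, 1, 1]})

# Draw 8 rows with replacement; the original row labels are kept.
upsampled = resample(minority, replace=True, n_samples=8, random_state=42)

print(len(upsampled))             # 8 rows drawn from only 3 originals
print(upsampled.index.is_unique)  # False: 8 draws from 3 rows must repeat
```

Because the output keeps the source row labels, repeated labels in the index are a quick way to spot which rows were duplicated.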

**Why is it doing this on upsampled data?**
- To balance the dataset so both classes have equal representation.
- Prevents machine learning models from ignoring the minority class due to imbalance.

**Why do we need to choose upsampled data?**
- In imbalanced datasets, models often perform poorly on the minority class.
- Upsampling helps the model learn patterns from both classes equally, improving fairness and accuracy, especially for rare events.

**Summary:**  
This step creates a balanced dataset by increasing the number of minority samples, which is essential for building reliable and unbiased machine learning models.

What it does:
Uses scikit-learn’s resample to randomly duplicate samples from the minority class until it matches the majority class size.
Why valuable:
Balances the dataset, so machine learning models can learn equally from both classes, improving detection of rare events and reducing bias.

In [10]:
df_minority_upsampled.shape

(900, 3)

`df_minority_upsampled.shape` returns the dimensions (number of rows and columns) of the upsampled minority class DataFrame.

**What is it doing?**  
- It shows how many samples (rows) and features (columns) are present in `df_minority_upsampled` after upsampling.

**Why is it important?**  
- After upsampling, the minority class should have the same number of samples as the majority class.  
- Checking the shape confirms that upsampling worked and the class sizes are now balanced.

**Why do we use upsampled data?**  
- In imbalanced datasets, models can ignore the minority class and perform poorly on it.
- Upsampling increases the number of minority samples, helping models learn patterns from both classes equally.
- This leads to fairer, more accurate predictions, especially for rare events.

**Summary:**  
You check the shape to verify balance, and use upsampled data to improve model performance on minority classes.

What it does:
Shows the shape (rows, columns) of the upsampled minority class.
Why valuable:
Confirms that the minority class now has the same number of samples as the majority class, ensuring balance for further analysis.

In [11]:
df_minority_upsampled.head()

Unnamed: 0,feature_1,feature_2,target
951,1.125854,1.843917,1
992,2.19657,1.397425,1
914,1.93217,2.998053,1
971,2.272825,3.034197,1
960,2.870056,1.550485,1


`df_minority_upsampled.head()` displays the first five rows of the upsampled minority class DataFrame.

**What is it doing?**
- Shows a quick preview of the upsampled data, including feature values and target labels.
- Helps you visually check that upsampling worked and the data looks correct.

**Why on upsampled data?**
- After upsampling, you want to confirm that the minority class has enough samples and the data structure is as expected.
- It ensures that the new samples (created by duplicating existing ones) are present.

**Why choose upsampled data?**
- In imbalanced datasets, models often ignore the minority class.
- Upsampling increases the number of minority samples, so models learn from both classes equally.
- This improves prediction accuracy for rare events and prevents bias toward the majority class.

**How does the output occur?**
- The output shows five rows from `df_minority_upsampled`, each with feature values and a target label of 1 (minority class).
- These rows may include duplicates, since upsampling uses sampling with replacement.

**Summary:**  
You use `.head()` to quickly inspect the upsampled data, ensuring balance and readiness for fair model training.

In [12]:
df_upsampled=pd.concat([df_majority,df_minority_upsampled])

`df_upsampled = pd.concat([df_majority, df_minority_upsampled])`

**What is it doing?**
- Combines the majority class (`df_majority`) and the upsampled minority class (`df_minority_upsampled`) into a single DataFrame called `df_upsampled`.
- This creates a new, balanced dataset where both classes have an equal number of samples.

**Why is it doing this on upsampled data?**
- After upsampling, the minority class has the same number of samples as the majority class.
- Merging them ensures the final dataset is balanced and ready for analysis or machine learning.

**Why do we need to choose upsampled data?**
- In imbalanced datasets, models tend to ignore the minority class, leading to poor predictions for rare events.
- Using upsampled data helps the model learn patterns from both classes equally, improving fairness and accuracy.

**Summary:**  
This step prepares a balanced dataset by combining the majority class with the upsampled minority class, which is essential for building reliable and unbiased machine learning models.

In [13]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

`df_upsampled['target'].value_counts()`

**What is it doing?**
- Counts how many samples belong to each class (0 and 1) in the upsampled, balanced dataset.
- Displays the number of rows for each target value after combining the majority and upsampled minority classes.

**Why is it doing this on upsampled data?**
- After upsampling, you want to confirm that both classes have equal representation.
- This check ensures the dataset is now balanced and ready for fair analysis or machine learning.

**Why do we need to choose upsampled data?**
- In imbalanced datasets, models often ignore the minority class, leading to poor predictions for rare events.
- Upsampling increases the number of minority samples, so models learn from both classes equally.
- This improves fairness, accuracy, and reliability of predictions.

**Summary:**  
This step verifies that your final dataset is balanced, which is essential for building unbiased and effective machine learning models.
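
The same check can also be expressed as proportions. The sketch below assumes a stand-in Series named `targets` with the 900/900 counts shown above, rather than the notebook's `df_upsampled`.

```python
import pandas as pd

# Stand-in for df_upsampled['target'] with the counts shown above.
targets = pd.Series([0] * 900 + [1] * 900, name="target")

counts = targets.value_counts()                # absolute counts per class
shares = targets.value_counts(normalize=True)  # fraction of rows per class

# The dataset is balanced when every class has the same count.
is_balanced = counts.nunique() == 1

print(counts.to_dict())
print(shares.to_dict())
print(is_balanced)
```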

## Down Sampling

In [14]:
import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(123)

# Create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# Check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


A line-by-line explanation of the minority-class DataFrame:



In [None]:
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})



### Explanation:

- **`pd.DataFrame({...})`**  
  Creates a new pandas DataFrame, which is a table-like data structure for storing and analyzing data.

- **`'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1)`**  
  Generates `n_class_1` random numbers for the column `'feature_1'` using a normal (Gaussian) distribution:
  - `loc=2`: The mean (center) of the distribution is 2.
  - `scale=1`: The standard deviation (spread) is 1.
  - `size=n_class_1`: Number of samples generated equals the number of minority class samples.

- **`'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1)`**  
  Same as above, but for the column `'feature_2'`.

- **`'target': [1] * n_class_1`**  
  Creates a list of `n_class_1` elements, all set to 1. This marks all rows as belonging to the minority class (class 1).

### What does it mean and why is it useful?

- **Purpose:**  
  This code creates a synthetic dataset for the minority class, with two features (`feature_1` and `feature_2`) and a target label (`target`).
- **How:**  
  It uses random numbers to simulate real-world data, ensuring the minority class has distinct feature values (centered at 2).
- **Value for analysis:**  
  - Helps you test and demonstrate techniques for handling imbalanced datasets.
  - Allows you to compare how models perform on different classes.
  - Useful for building fair and robust machine learning models.

**Summary:**  
This code builds a DataFrame for the minority class with random feature values and a target label, helping you simulate and analyze imbalanced data scenarios.

In `np.random.normal(loc=2, scale=1, size=n_class_1)`:

- **`loc`** is the mean (center) of the normal distribution.
- **`scale`** is the standard deviation (spread) of the distribution.

**How do you decide `loc` and `scale`?**

- **`loc` (mean):**  
  Choose a value that represents the average or typical value for your feature in the class.  
  Example: If you want the minority class to have higher feature values than the majority, set `loc=2` for minority and `loc=0` for majority.

- **`scale` (standard deviation):**  
  Choose a value that represents how much variation you expect in your data.  
  Example: `scale=1` means about 68% of values fall within 1 unit of the mean, and about 95% within 2 units.

**Summary:**  
Set `loc` and `scale` based on the characteristics you want your synthetic data to have. For real data, use the actual mean and standard deviation from your dataset. For simulations, pick values that help you demonstrate differences between classes.
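
A quick way to see what `loc` and `scale` control is to draw a large sample and measure it. This sketch uses NumPy's `default_rng` generator with an arbitrary seed, which is an assumption of this example; the notebook uses `np.random.seed` instead.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed for this sketch
samples = rng.normal(loc=2, scale=1, size=100_000)

sample_mean = samples.mean()  # should land close to loc
sample_std = samples.std()    # should land close to scale

# Fraction of draws within one standard deviation of the mean.
within_one_sd = np.mean(np.abs(samples - 2) <= 1)

print(round(sample_mean, 2), round(sample_std, 2), round(within_one_sd, 2))
```

With this many draws, the measured mean and standard deviation should sit very near 2 and 1, and roughly 68% of values should fall between 1 and 3.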

**Recap of the upsampling workflow**

What it does:
- Splits the data into majority and minority classes.
- Uses `resample` to duplicate minority samples until both classes are equal.
- Combines them into a new balanced DataFrame.

Why useful:
- Balances the dataset, which helps machine learning models learn both classes equally.
- Reduces bias toward the majority class.

**Why is this useful for data analysts?**
- **Improves model accuracy:** Balanced data helps models learn both classes, reducing bias.
- **Essential for big data:** Many real-world datasets are imbalanced; these techniques automate balancing.
- **Prevents misleading results:** Imbalanced data can make models look good on accuracy but fail on minority class detection.
- **Easy to implement:** These steps can be applied to any dataset with class imbalance.

**Summary:**  
Upsampling and downsampling are key preprocessing steps for classification tasks with imbalanced data. They help ensure fair, reliable, and interpretable machine learning results.

In [20]:
## downsampling
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [None]:
from sklearn.utils import resample
df_majority_downsampled=resample(df_majority,replace=False, #Sample without replacement
         n_samples=len(df_minority),
         random_state=42
        )

This line of code addresses a common problem in machine learning called **class imbalance**.

Here’s a breakdown of what the code is doing and why.

***

### What This Code Does ⚖️

In short, this code is performing **downsampling**.

It takes the dataset containing the **majority class** (the one with far more samples, `df_majority`) and randomly selects a smaller number of samples from it. The goal is to create a new, smaller majority DataFrame (`df_majority_downsampled`) that has the exact same number of samples as the **minority class** (`df_minority`).

Imagine you have:
* `df_majority`: 1000 examples of "Class A".
* `df_minority`: 100 examples of "Class B".

This code will randomly pick **100** examples from `df_majority` and discard the other 900. You can then combine this new, smaller majority set with your original minority set to create a perfectly balanced dataset of 100 "Class A" and 100 "Class B" samples.

***

### Why We Need to Downsample

Machine learning models can become **biased** when trained on an imbalanced dataset. If a model sees that 90% of the data is "Class A," it can achieve 90% accuracy by simply guessing "Class A" every single time. It fails to learn the actual patterns that distinguish Class A from Class B, making it useless for predicting the rare, minority class.

**Downsampling is a strategy to fix this bias.** By creating a balanced dataset, you force the model to learn the underlying features of both classes equally, leading to a more intelligent and useful model.
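
The 90% figure can be checked directly. The sketch below builds labels with the notebook's 900/100 split and scores a hypothetical baseline that always predicts class 0; no real model is involved.

```python
import numpy as np

y_true = np.array([0] * 900 + [1] * 100)  # same 900/100 split as the notebook
y_pred = np.zeros_like(y_true)            # baseline: always predict the majority class

accuracy = float(np.mean(y_pred == y_true))
# Recall on class 1: share of true minority rows that were predicted as 1.
minority_recall = float(np.mean(y_pred[y_true == 1] == 1))

print(accuracy)         # 0.9
print(minority_recall)  # 0.0
```

The baseline scores 90% accuracy while detecting none of the minority samples, which is the failure mode resampling is meant to counter.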

***

### Breaking Down the Parameters

Let's look at each part of the `resample` function call:

* `df_majority`: This is the **input DataFrame**. It's the large dataset containing only the samples from the majority class that you want to shrink.

* `replace=False`: This means **"sample without replacement."** When a data point is selected from `df_majority`, it is removed from the pool and cannot be chosen again. This ensures all the samples in your new `df_majority_downsampled` are unique.

* `n_samples=len(df_minority)`: This is the most important part for balancing. It tells the function **how many samples to select**. By setting it to the length (number of rows) of the minority class DataFrame, you guarantee that the output DataFrame will be the same size as the minority class, creating a perfect balance.

* `random_state=42`: This ensures **reproducibility**. The number `42` is just a seed for the random number generator. By setting this, you ensure that every time you run this code, the exact same "random" sample from `df_majority` will be chosen. This is crucial for getting consistent results when you re-run your experiments or share your code with others.
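
The behaviour described for `replace=False` and `random_state` can be sketched on a small, hypothetical DataFrame (`majority` below is invented for this example). The last part relies on scikit-learn raising a `ValueError` when more rows are requested than exist without replacement.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical majority class with ten rows.
majority = pd.DataFrame({"feature_1": range(10), "target": [0] * 10})

a = resample(majority, replace=False, n_samples=4, random_state=42)
b = resample(majority, replace=False, n_samples=4, random_state=42)

print(a.index.is_unique)        # True: no row is drawn twice
print(a.index.equals(b.index))  # True: same seed, same rows

# Asking for more rows than exist is not possible without replacement.
raised = False
try:
    resample(majority, replace=False, n_samples=11, random_state=42)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```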

In [27]:
from sklearn.utils import resample
df_majority_downsampled = resample(df_majority, replace=False, 
		n_samples=len(df_minority), 
		random_state=42)

df_majority_downsampled.shape

(100, 3)

This code first performs **downsampling** to balance a dataset and then uses the `.shape` attribute to **verify** the result.

-----

### 1. Downsampling the Majority Class

The first line of code tackles the problem of an **imbalanced dataset**, where one class (the majority class) has far more samples than another (the minority class).

```python
df_majority_downsampled = resample(df_majority, 
                                   replace=False, 
                                   n_samples=len(df_minority), 
                                   random_state=42)
```

  * **Why it's done**: Machine learning models trained on imbalanced data become biased. They can achieve high accuracy by simply always predicting the majority class, without learning the actual patterns to identify the rare minority class. Downsampling creates a balanced dataset to train a more effective model.

  * **How it's done**:

      * `df_majority`: The input DataFrame containing only the samples from the larger class.
      * `replace=False`: This means "sample without replacement." It ensures that each data point from the majority class is selected at most once.
      * `n_samples=len(df_minority)`: This sets the size of the new, downsampled DataFrame to be equal to the number of samples in the minority class. This is the key step to creating a 1:1 balance.
      * `random_state=42`: This makes the random sampling reproducible. Anyone running the code will get the exact same "random" sample.

-----

### 2. Verifying the Shape ✅

The second line is a simple but important verification step.

```python
df_majority_downsampled.shape
```

  * **What it does**: The `.shape` attribute in pandas returns a tuple representing the dimensions of the DataFrame in the format `(number_of_rows, number_of_columns)`.

  * **Why it's done here**: This is a sanity check to confirm that the downsampling worked as expected. The number of rows in `df_majority_downsampled` should now be equal to the number of rows in `df_minority`. For example, if `df_minority` had 200 samples, the output of `.shape` should be `(200, number_of_columns)`. In this notebook, `df_minority` has 100 rows and the DataFrame has 3 columns, so the output is `(100, 3)`.

In [30]:
df_downsampled = pd.concat([df_majority_downsampled, df_minority]).reset_index(drop=True)

This code combines your two separate DataFrames—the downsampled majority class and the original minority class—into a single, balanced DataFrame with a clean index.

It works in two main steps.

-----

### 1. Combining the DataFrames (`pd.concat`)

The `pd.concat()` function takes a list of DataFrames and stacks them together.

```python
# Before this step, you have two separate DataFrames:
# df_majority_downsampled: Contains, for example, 200 samples of the majority class.
# df_minority:             Contains 200 samples of the minority class.

pd.concat([df_majority_downsampled, df_minority])
```

This first part stacks `df_minority` directly below `df_majority_downsampled`, creating a new, single DataFrame with 400 rows. This new DataFrame is now perfectly **balanced**, with an equal number of samples from each class.

### 2. Cleaning the Index (`.reset_index`)

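When you concatenate DataFrames, the new DataFrame keeps the original index labels from the pieces it was built from. This results in a messy index: the labels are out of order, and they repeat whenever the pieces share index labels. Here both pieces are slices of the same `df`, so the labels are unique but no longer sequential.

  * **Before**: The index might look like `[10, 52, 120, ..., 3, 47, 99]` where the numbers are out of order and have gaps.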

The `.reset_index(drop=True)` method fixes this.

  * It **resets** the index to the default clean sequence (0, 1, 2, 3, ...).
  * The `drop=True` part **discards** the old, messy index instead of adding it as a new column to your DataFrame.

The final result, `df_downsampled`, is a single, perfectly balanced DataFrame ready for training a machine learning model.
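
The two steps can be sketched on a tiny example. The DataFrames `majority_part` and `minority_part` and their index labels are hypothetical stand-ins for `df_majority_downsampled` and `df_minority`.

```python
import pandas as pd

# Hypothetical pieces that keep the row labels of the frame they were cut from.
majority_part = pd.DataFrame({"target": [0, 0]}, index=[520, 13])
minority_part = pd.DataFrame({"target": [1, 1]}, index=[900, 901])

stacked = pd.concat([majority_part, minority_part])
print(stacked.index.tolist())  # [520, 13, 900, 901]

clean = stacked.reset_index(drop=True)
print(clean.index.tolist())    # [0, 1, 2, 3]
print(clean.columns.tolist())  # ['target'] (the old index is not kept as a column)
```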

In [32]:
df_downsampled.target.value_counts()

target
0    100
1    100
Name: count, dtype: int64

To recap, the class imbalance was handled here with **downsampling**. Originally, the majority class had 900 samples and the minority class had 100. `resample` randomly selected a subset of the majority class (`df_majority_downsampled`) matching the size of the minority class. Both classes were then combined into `df_downsampled`, giving a balanced dataset of 100 samples per class. This reduces model bias toward the majority class, at the cost of discarding 800 majority samples.