
# STEP 3: Handle Missing Data – REMOVE (Complete Pandas Guide)

This notebook covers the **practical, interview-relevant ways** to remove missing (null) data
from a Pandas DataFrame.

Focus: **when to remove rows, when to remove columns, and how to do it safely**.


In [None]:

import pandas as pd
import numpy as np


## 1. Sample Dataset with Missing Values

In [None]:

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None]
})
df


## 2. Remove Rows with ANY Missing Value

In [None]:

# Defaults are axis=0 and how='any': drop every row containing at least one null
df_drop_any = df.dropna()
df_drop_any
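On this sample every row has at least one gap, so the default `dropna()` leaves nothing behind. The cell below is a small sketch that rebuilds the sample data so it runs on its own and compares row counts before and after; the variable names are illustrative only.

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell is self-contained
df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None],
})

rows_before = len(df)
rows_after = len(df.dropna())
# Count rows that contain at least one null in any column
rows_with_gap = int(df.isna().any(axis=1).sum())

print(f"before: {rows_before}, after: {rows_after}, rows with a gap: {rows_with_gap}")
# before: 6, after: 0, rows with a gap: 6
```

Checking the surviving row count like this before committing to `dropna()` is a cheap guard against silently emptying a dataset.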


## 3. Remove Rows with ALL Missing Values

In [None]:

# how='all': drop a row only when every column in it is null
df_drop_all = df.dropna(how='all')
df_drop_all
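In the sample data `order_id` is always present, so no row is entirely null and `how='all'` returns the frame unchanged. To see the option take effect, here is a minimal sketch on a hypothetical two-column frame (`demo` is not part of the notebook's data) that contains one fully empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely null, row 2 is only partly null
demo = pd.DataFrame({
    "order_id": [1001, np.nan, 1003],
    "amount": [2500.0, np.nan, np.nan],
})

kept = demo.dropna(how="all")

# Row 1 is dropped; row 2 still has an order_id, so it stays
print(kept.index.tolist())  # [0, 2]
```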


## 4. Remove Rows Based on Critical Columns

In [None]:

# Only the listed columns are checked: a row is dropped if order_id or amount is null
df_drop_subset = df.dropna(subset=['order_id', 'amount'])
df_drop_subset


## 5. Remove Columns with ANY Missing Value

In [None]:

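# axis=1 targets columns: drop every column containing at least one null
# (on this sample only order_id survives)
df_drop_cols_any = df.dropna(axis=1)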
df_drop_cols_any


## 6. Remove Columns with ALL Missing Values

In [None]:

# Drop only columns that are entirely null (none in this sample, so nothing changes)
df_drop_cols_all = df.dropna(axis=1, how='all')
df_drop_cols_all


## 7. Remove Columns Based on Threshold

In [None]:

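# Keep columns having at least 60% non-null values.
# thresh is a count of non-null values and is documented as an int, so round the fraction up.
threshold = int(np.ceil(0.6 * len(df)))
df_drop_thresh = df.dropna(axis=1, thresh=threshold)
df_drop_thresh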
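`thresh` is a minimum count of non-null values, not a fraction, so the 60% rule has to be converted into a count first. The sketch below rebuilds the sample data and shows the per-column counts next to the result; `min_non_null` and `kept_columns` are illustrative names, and rounding up with `math.ceil` is one reasonable choice rather than the only one.

```python
import math
import pandas as pd

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None],
})

# 60% of 6 rows is 3.6, rounded up to 4 required non-null values
min_non_null = math.ceil(0.6 * len(df))

# Non-null count per column: order_id 6, customer 4, amount 4, city 4, discount 2
non_null_per_column = df.notna().sum()
kept_columns = non_null_per_column[non_null_per_column >= min_non_null].index.tolist()

result = df.dropna(axis=1, thresh=min_non_null)
print(kept_columns)  # ['order_id', 'customer', 'amount', 'city']
```

Only `discount` falls below the cutoff here, with 2 non-null values out of 6.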


## 8. Remove Rows Based on Threshold

In [None]:

# Keep rows having at least 3 non-null values
df_drop_row_thresh = df.dropna(thresh=3)
df_drop_row_thresh
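To see which rows a row-level threshold would remove before applying it, count the non-null values per row. A short sketch on the rebuilt sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None],
})

# Non-null values in each row (axis=1 counts across columns)
non_null_per_row = df.notna().sum(axis=1)
print(non_null_per_row.tolist())  # [4, 4, 3, 4, 3, 2]

# Rows with fewer than 3 non-null values are dropped
kept = df.dropna(thresh=3)
print(kept.index.tolist())  # [0, 1, 2, 3, 4]
```

Only order 1006 (index 5), which has just `order_id` and `amount`, falls below the threshold.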


## 9. Remove Rows with Missing Values in Specific Column

In [None]:

# Keep only the rows where amount is present
df_amount_present = df[df['amount'].notna()]
df_amount_present


## 10. Remove Rows Using Boolean Indexing

In [None]:

df_boolean = df[~df['city'].isnull()]
df_boolean
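The mask-based filters in sections 9 and 10 and `dropna(subset=...)` select the same rows for a single column. Masks become more useful when null checks are combined with each other or with other conditions. A brief sketch on the rebuilt sample data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None],
})

# Three spellings of "drop rows where city is null"
mask_result = df[df["city"].notna()]
tilde_result = df[~df["city"].isnull()]
dropna_result = df.dropna(subset=["city"])
print(mask_result.equals(tilde_result), mask_result.equals(dropna_result))  # True True

# Masks compose with & and |: keep rows where both city and amount are present
both_present = df[df["city"].notna() & df["amount"].notna()]
print(both_present.index.tolist())  # [0, 2]
```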


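## 11. In-Place Removal (Mutates the DataFrame)

In [None]:

# inplace=True modifies df_copy directly; working on a copy keeps the original df intact.
# Note: inplace is a convenience and is not guaranteed to save memory.
df_copy = df.copy()
df_copy.dropna(subset=['amount'], inplace=True)
df_copy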
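A common pitfall with `inplace=True` is assigning its return value, which is `None`. The sketch below uses a small hypothetical frame (`orders` is not part of the notebook's data) to show the pitfall next to the reassignment style, which also supports method chaining.

```python
import pandas as pd

# Hypothetical frame for illustration
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, None, 30.0]})

# In-place call: the frame is modified, and the call itself returns None
in_place = orders.copy()
returned = in_place.dropna(subset=["amount"], inplace=True)
print(returned)  # None

# Reassignment style: returns a new frame, so further calls can be chained
cleaned = orders.dropna(subset=["amount"]).reset_index(drop=True)
print(cleaned["order_id"].tolist())  # [1, 3]
```

Either style produces the same rows; the reassignment form simply avoids the `None` surprise.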



## ✅ Best Practices & Interview Notes
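- Remove rows only when the missing fields make the record **unusable** (for example, an order with no `amount`)
- Remove columns when their **information value is low** (mostly null, or irrelevant to the analysis)
- Always check the missing percentage per column before removing anything
- Never blindly drop time-series rows: the gaps break a regular time index and can distort rolling or lagged calculations
- Prefer threshold-based removal (`thresh`) over all-or-nothing rules on large or wide datasets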
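The "check the missing percentage first" step can be sketched as follows on the rebuilt sample data. `missing_pct` is an illustrative name, and what cutoff counts as "too sparse" is a judgment call for the dataset at hand.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005, 1006],
    "customer": ["Alice", "Bob", None, "David", "Eve", None],
    "amount": [2500, None, 1800, 2200, None, 3000],
    "city": ["Mumbai", "Delhi", "Mumbai", None, "Pune", None],
    "discount": [None, 10, None, 5, None, None],
})

# isna() gives booleans; their mean is the fraction of nulls per column
missing_pct = df.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Here `discount` is about two-thirds missing while `order_id` has no gaps, which points toward dropping the `discount` column rather than the rows that lack it.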



## ✔ Summary
- `dropna()` is the core removal method
- `axis`, `how`, `subset`, and `thresh` control behavior
- Removal decisions should follow business logic
