
# STEP 6: Handle Duplicate Data (Complete Pandas Guide)

This notebook covers **practical, real-world ways** to detect, analyze,
and remove duplicate data in a Pandas DataFrame.

Focus: **exact duplicates, key-based duplicates, latest-record logic, and best practices**.


In [None]:

import pandas as pd
import numpy as np


## 1. Sample Dataset with Duplicates

In [None]:

df = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003, 1004, 1004, 1004],
    "customer": ["Alice", "Bob", "Bob", "Charlie", "David", "David", "David"],
    "amount": [2500, 1800, 1800, 2200, 3000, 3000, 3000],
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-02",
                   "2024-01-03", "2024-01-04", "2024-01-04", "2024-01-04"]
})
df


## 2. Detect Duplicate Rows

In [None]:

# True marks a row that repeats an earlier row; first occurrences are False
df.duplicated()


## 3. Count Duplicate Rows

In [None]:

df.duplicated().sum()


## 4. View Duplicate Rows

In [None]:

df[df.duplicated()]
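
The default mask above hides the first occurrence of each repeated row. As a minimal sketch of how the `keep` argument changes what counts as a duplicate, the cell below compares the three options on a reduced copy of the sample data. The names `demo` and `counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Reduced copy of the sample data (illustrative name, so `df` is not overwritten)
demo = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003, 1004, 1004, 1004],
    "amount": [2500, 1800, 1800, 2200, 3000, 3000, 3000],
})

# keep='first' (default): every repeat after the first occurrence is marked
first_mask = demo.duplicated(keep="first")

# keep='last': every repeat before the last occurrence is marked
last_mask = demo.duplicated(keep="last")

# keep=False: every row that has at least one copy is marked
all_mask = demo.duplicated(keep=False)

counts = {
    "first": int(first_mask.sum()),
    "last": int(last_mask.sum()),
    "all": int(all_mask.sum()),
}
print(counts)
```

Use `keep=False` when you want to inspect every copy side by side, and `keep='first'` or `keep='last'` when you want the mask of rows that would be dropped.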


## 5. Detect Duplicates Based on Specific Columns

In [None]:

df.duplicated(subset=['order_id'])


## 6. View Duplicates Based on Business Key

In [None]:

# keep=False marks every row that shares an order_id, including the first occurrence
df[df.duplicated(subset=['order_id'], keep=False)]
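
Rows that share a business key are not always exact copies. The sketch below, on hypothetical data (`orders` is not the notebook's `df`), counts rows per key and then separates harmless exact copies from keys whose rows disagree. It assumes `order_id` alone defines uniqueness.

```python
import pandas as pd

# Hypothetical data: order 1002 repeats with different amounts (a conflict),
# order 1004 repeats as an exact copy.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003, 1004, 1004],
    "amount": [2500, 1800, 1900, 2200, 3000, 3000],
})

# Rows per key, restricted to keys that occur more than once
key_counts = orders.groupby("order_id").size()
repeated_keys = key_counts[key_counts > 1]

# Drop exact copies first; any key that still repeats has conflicting values
deduped = orders.drop_duplicates()
conflicts = deduped[deduped.duplicated(subset=["order_id"], keep=False)]

print(repeated_keys)
print(conflicts)
```

Conflicting rows usually need a business decision (which amount is correct?) rather than a mechanical `drop_duplicates()`.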


## 7. Remove Exact Duplicate Rows

In [None]:

df_no_exact_dup = df.drop_duplicates()
df_no_exact_dup


## 8. Remove Duplicates Based on Key (Keep First)

In [None]:

df_keep_first = df.drop_duplicates(subset=['order_id'], keep='first')
df_keep_first


## 9. Remove Duplicates Based on Key (Keep Last)

In [None]:

df_keep_last = df.drop_duplicates(subset=['order_id'], keep='last')
df_keep_last


## 10. Latest Record Wins (Sort + Drop)

In [None]:

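df_sorted = df.copy()
# Parse dates so sorting is chronological rather than lexicographic
df_sorted['order_date'] = pd.to_datetime(df_sorted['order_date'])
# A stable sort keeps tied dates in their original order, so the result is deterministic
df_sorted = df_sorted.sort_values('order_date', kind='stable')
# After sorting, the last row per order_id is the most recent one
df_latest = df_sorted.drop_duplicates(subset=['order_id'], keep='last')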
df_latest
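
In the sample data every copy of an order has the same date, so the effect of sorting is hard to see. The sketch below uses hypothetical data with an assumed `updated_at` column, where the newest row for order 1002 appears first in the frame. It shows sort-then-drop next to an alternative based on `groupby(...).idxmax()`. Both are illustrations, not the only way to implement latest-record logic.

```python
import pandas as pd

# Hypothetical data: the newest row for order 1002 comes BEFORE the older one
events = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [2500, 1800, 1750, 2200],
    "updated_at": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-02", "2024-01-03"]
    ),
})

# Approach A: stable sort by timestamp, then keep the last row per key
latest_sorted = (
    events.sort_values("updated_at", kind="stable")
    .drop_duplicates(subset=["order_id"], keep="last")
    .sort_values("order_id")
)

# Approach B: look up the index label of the newest row in each group
latest_idx = events.groupby("order_id")["updated_at"].idxmax()
latest_by_group = events.loc[latest_idx].sort_values("order_id")

# Without sorting, keep='last' keeps the row in the last position, not the newest
unsorted_last = events.drop_duplicates(subset=["order_id"], keep="last")

print(latest_sorted)
```

Here `unsorted_last` keeps the older 1750 row for order 1002, which is why the sort step matters.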


## 11. Remove Duplicates Using Boolean Indexing

In [None]:

df_boolean = df[~df.duplicated(subset=['order_id'])]
df_boolean


## 12. Inplace Duplicate Removal

In [None]:

df_inplace = df.copy()
# inplace=True modifies df_inplace directly and returns None, so do not assign the result
df_inplace.drop_duplicates(subset=['order_id'], inplace=True)
df_inplace



## ✅ Best Practices & Interview Notes
- Always detect before removing duplicates
- Prefer business keys (`order_id`, `user_id`) over full-row match
- Use `keep='last'` for latest-record logic, but only after sorting by the date or timestamp column
- Sort before removal: `drop_duplicates()` keeps rows by position, not by date
- Never remove duplicates blindly in audit data
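
One way to follow the "detect first" and "audit data" points above is to flag duplicates rather than delete them. The sketch below is a minimal illustration on hypothetical data. The `audit` frame and the `is_duplicate` column name are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical audit table with one repeated order_id
audit = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [2500, 1800, 1800, 2200],
})

# Step 1: detect and count before changing anything
n_dupes = int(audit.duplicated(subset=["order_id"]).sum())

# Step 2: flag repeats in a new column instead of dropping rows
flagged = audit.assign(
    is_duplicate=audit.duplicated(subset=["order_id"], keep="first")
)

# Step 3: downstream analysis can filter on the flag; the full history is preserved
clean_view = flagged[~flagged["is_duplicate"]]

print(n_dupes, len(flagged), len(clean_view))
```

The original rows stay available in `flagged`, so the removal decision can be reviewed or reversed later.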



## ✔ Summary
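- `duplicated()` returns a boolean mask that marks duplicate rows
- `drop_duplicates()` returns a DataFrame with those rows removed
- `subset` chooses the columns to compare; `keep` chooses which occurrence survives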
- Business rules define what a duplicate means
