
# Removing Duplicates in Pandas  
In data analysis, duplicate records can distort insights and lead to inaccurate or misleading conclusions.  
This notebook explains **how duplicates arise**, **how to detect them**, and **how to remove them** using Pandas.

---

## Why Removing Duplicates Matters

Imagine you are analyzing customer data for a retail store:  
A customer, *Shopaholic Sally*, visits three times and uses a different credit card on each visit but gives the **same zip code** (`32803`) every time.

If duplicates are not removed:  
- Sally appears as **three different customers**.  
- Demographic analysis for zip code `32803` becomes inaccurate.  
- Marketing and reporting decisions can be skewed.

Duplicate rows can slip in due to:  
- Data entry errors  
- Multiple system integrations  
- Customers using different identifiers  
- Tracking transactions instead of individuals  

Therefore, **detecting and removing duplicates is critical** to maintaining dataset accuracy.

---

## Step 1: Import required libraries
We load NumPy and Pandas, along with `Series` and `DataFrame` constructors.


In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame


## Step 2: Create a DataFrame containing duplicates

Below, we construct a DataFrame with intentional duplicate rows.  
Each row has three columns. Rows 0-1 and rows 2-3 are identical pairs, and rows 4-6 are three copies of the same row.


In [None]:
DF_obj = DataFrame({'column 1': [1,1,2,2,3,3,3],
                    'column 2':['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
DF_obj


## Step 3: Detecting Duplicate Rows Using `.duplicated()`

`DataFrame.duplicated()` compares each row with the rows above it and returns a boolean Series that is **True** wherever the row is a duplicate of an earlier row.

How it works:
- The *first occurrence* of a row is marked `False`
- Any *subsequent repeated row* is marked `True`

This helps identify which rows should be removed.


In [None]:
DF_obj.duplicated()
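
As a small sketch of how these flags can be put to use, the cell below rebuilds the same data under the illustrative name `df` (an assumption, chosen so the cell runs on its own), counts the duplicates, and pulls out the flagged rows. It also shows the `keep` parameter of `duplicated()`, where `keep=False` flags every copy of a repeated row, including the first.

```python
import pandas as pd

# Same data as DF_obj above, rebuilt so this cell is self-contained.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

flags = df.duplicated()            # default keep='first': first occurrence is False
n_duplicates = int(flags.sum())    # True counts as 1, so the sum is the duplicate count
duplicate_rows = df[flags]         # boolean mask selects only the flagged rows

# keep=False marks every member of a repeated group, first occurrence included.
all_copies = df.duplicated(keep=False)

print(flags.tolist())   # [False, True, False, True, False, True, True]
print(n_duplicates)     # 4
```

Counting the flags before dropping anything is a cheap sanity check on how many rows a deduplication step will remove.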


## Step 4: Removing All Duplicate Rows with `.drop_duplicates()`

If you want to remove every repeated row and keep only the first occurrence, use:

```python
DF_obj.drop_duplicates()
```

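This returns a new DataFrame that contains only the first occurrence of each row. `DF_obj` itself is unchanged unless you reassign the result or pass `inplace=True`.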


In [None]:
DF_obj.drop_duplicates()
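
The default behaviour keeps the first occurrence, but `drop_duplicates()` accepts a `keep` argument that changes which copy survives. The sketch below compares the options on the same data, rebuilt under the illustrative name `df`. It assumes pandas 1.0 or later for the `ignore_index` option.

```python
import pandas as pd

# Same data as DF_obj above, rebuilt so this cell is self-contained.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})

first_kept = df.drop_duplicates()              # keeps index 0, 2, 4
last_kept = df.drop_duplicates(keep='last')    # keeps index 1, 3, 6
none_kept = df.drop_duplicates(keep=False)     # drops every row that has a copy

# ignore_index=True renumbers the surviving rows 0..n-1.
renumbered = df.drop_duplicates(ignore_index=True)

print(first_kept.index.tolist())   # [0, 2, 4]
print(last_kept.index.tolist())    # [1, 3, 6]
print(len(none_kept))              # 0, because every row in this frame is repeated
print(len(df))                     # 7, the original frame is untouched
```

Because every call returns a new DataFrame, the variants can be compared side by side without modifying `df`.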


## Step 5: Modify DataFrame to Demonstrate Column-Based Deduplication

To illustrate how to remove duplicates **based on a specific column**,  
we rebuild the DataFrame with one change: the value of `column 3` at index 5 becomes `D` instead of `C`.

Row 5 now holds a unique value in `column 3`, so it is **not** a duplicate when deduplicating on that column. Row 6 still repeats the `C` from row 4.


In [None]:
DF_obj = DataFrame({'column 1': [1,1,2,2,3,3,3],
                    'column 2':['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
DF_obj
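
One way to see the effect of this change before dropping anything is to pass `subset` to `duplicated()`, which restricts the comparison to the listed columns. The sketch below uses the illustrative name `df` for the modified data and contrasts two choices of column.

```python
import pandas as pd

# Modified data: column 3 holds 'D' at index 5.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})

# Compare rows on column 3 only: row 5 ('D') is unique, row 6 repeats row 4's 'C'.
by_col3 = df.duplicated(subset=['column 3'])

# Compare rows on column 1 only: row 5 still repeats the 3 from row 4.
by_col1 = df.duplicated(subset=['column 1'])

print(by_col3.tolist())   # [False, True, False, True, False, False, True]
print(by_col1.tolist())   # [False, True, False, True, False, True, True]
```

The same row can be a duplicate under one choice of columns and unique under another, which is why the `subset` argument matters.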


## Step 6: Removing Duplicates Based on a Specific Column

Sometimes you only want to remove rows where a **specific column value** is duplicated.

Use:

```python
DF_obj.drop_duplicates(['column 3'])
```

This:
- Keeps the first occurrence of each value in `column 3`
- Drops all later rows with the same value  


In [None]:
DF_obj.drop_duplicates(['column 3'])
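
The same call can be written with the explicit `subset` keyword, which also accepts several columns at once. The sketch below, again using the illustrative name `df` for the modified data, shows which index labels survive in each case.

```python
import pandas as pd

# Modified data from Step 5: column 3 holds 'D' at index 5.
df = pd.DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                   'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                   'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})

# Equivalent to df.drop_duplicates(['column 3']), with the keyword spelled out.
by_col3 = df.drop_duplicates(subset=['column 3'])

# Keep the last occurrence of each column 3 value instead of the first.
by_col3_last = df.drop_duplicates(subset=['column 3'], keep='last')

# Rows count as duplicates only if they match on both listed columns.
by_col1_and_2 = df.drop_duplicates(subset=['column 1', 'column 2'])

print(by_col3.index.tolist())         # [0, 2, 4, 5]
print(by_col3_last.index.tolist())    # [1, 3, 5, 6]
print(by_col1_and_2.index.tolist())   # [0, 2, 4]
```

Note that deduplicating on a subset keeps the whole surviving row, so the values in the other columns come from whichever occurrence `keep` selects.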