🟦 1. Import Libraries

In [1]:
import pandas as pd

🟦 2. Create Sample Dataset

In [2]:
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 80000, 85000, 90000, 60000, 60000]
}

df = pd.DataFrame(data)
df

   ID     Name     City  Salary
0   1    Alice  Toronto   70000
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
3   2      Bob       NY   80000
4   4    David  Chicago   85000
5   3  Charlie       LA   90000
6   5      Eva  Toronto   60000
7   5      Eva  Toronto   60000


🟦 3. Detect Duplicate Rows

3.1 Find duplicate rows

In [5]:
df.duplicated()

0    False
1    False
2    False
3     True
4    False
5     True
6    False
7     True
dtype: bool

3.2 Show only duplicate rows

In [6]:
df[df.duplicated()]

   ID     Name     City  Salary
3   2      Bob       NY   80000
5   3  Charlie       LA   90000
7   5      Eva  Toronto   60000


3.3 Count number of duplicates

In [7]:
df.duplicated().sum()

3
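The total says how many rows are repeats, but not which rows repeat or how often. As a minimal sketch (the names `counts` and `repeated` are illustrative and not part of the original notebook), grouping on every column gives a per-row occurrence count:

```python
import pandas as pd

# Same sample data as in section 2.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 80000, 85000, 90000, 60000, 60000],
}
df = pd.DataFrame(data)

# Group on all columns so identical rows fall into the same group,
# then record the size of each group.
counts = df.groupby(list(df.columns)).size().reset_index(name="count")

# Distinct rows that occur more than once in the original frame.
repeated = counts[counts["count"] > 1]
print(repeated)
```

Each row of `repeated` is one distinct record together with the number of times it appears, which is often easier to review than a boolean mask.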

🟦 4. Keeping First, Last or Marking All Duplicates

4.1 Keep first occurrence (default)

In [8]:
df[df.duplicated(keep="first")]

   ID     Name     City  Salary
3   2      Bob       NY   80000
5   3  Charlie       LA   90000
7   5      Eva  Toronto   60000


4.2 Keep last occurrence

In [9]:
df[df.duplicated(keep="last")]

   ID     Name     City  Salary
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
6   5      Eva  Toronto   60000


4.3 Mark all duplicates

In [11]:
df[df.duplicated(keep=False)]

   ID     Name     City  Salary
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
3   2      Bob       NY   80000
5   3  Charlie       LA   90000
6   5      Eva  Toronto   60000
7   5      Eva  Toronto   60000
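The three `keep` options are easier to compare side by side. The sketch below, which assumes the sample data from section 2 and uses an illustrative frame called `flags`, puts the three masks in one table so it is clear which rows each option marks as duplicates:

```python
import pandas as pd

# Same sample data as in section 2.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 80000, 85000, 90000, 60000, 60000],
}
df = pd.DataFrame(data)

# One boolean column per keep option; True means "flagged as a duplicate".
flags = pd.DataFrame({
    "keep_first": df.duplicated(keep="first"),  # flags every repeat after the first
    "keep_last": df.duplicated(keep="last"),    # flags every repeat before the last
    "keep_false": df.duplicated(keep=False),    # flags every member of a repeated group
})
print(flags)
```

Note that `keep` names the occurrence that is left unflagged, so `keep="first"` returns the later repeats, not the first occurrences.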


🟦 5. Remove Duplicates

5.1 Remove duplicate rows (keep first)

In [12]:
df_no_duplicates = df.drop_duplicates()
df_no_duplicates

   ID     Name     City  Salary
0   1    Alice  Toronto   70000
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
4   4    David  Chicago   85000
6   5      Eva  Toronto   60000


5.2 Keep last occurrence

In [13]:
df_last = df.drop_duplicates(keep="last")
df_last

   ID     Name     City  Salary
0   1    Alice  Toronto   70000
3   2      Bob       NY   80000
4   4    David  Chicago   85000
5   3  Charlie       LA   90000
7   5      Eva  Toronto   60000


5.3 Remove all duplicated rows (keep none)

In [14]:
df_all_removed = df.drop_duplicates(keep=False)
df_all_removed

   ID   Name     City  Salary
0   1  Alice  Toronto   70000
4   4  David  Chicago   85000
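`drop_duplicates()` and `duplicated()` describe the same rows from two sides. As a quick sanity check (a sketch on the section 2 sample data; `deduped` and `masked` are illustrative names), dropping duplicates should match filtering with the negated mask:

```python
import pandas as pd

# Same sample data as in section 2.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 80000, 85000, 90000, 60000, 60000],
}
df = pd.DataFrame(data)

# Remove duplicates directly.
deduped = df.drop_duplicates()

# Remove duplicates by keeping only rows that are NOT flagged.
masked = df[~df.duplicated()]

# Both routes should give the same frame, and the row counts should add up.
print(deduped.equals(masked))
print(len(df) - df.duplicated().sum() == len(deduped))
```

The same relationship holds for `keep="last"` and `keep=False`, provided the same `keep` value is passed to both methods.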


🟦 6. Duplicates Based on Subset of Columns

Sometimes rows aren't fully identical but still count as duplicates when compared on certain columns only.

In [15]:
df_subset = df.drop_duplicates(subset=["Name", "City"])
df_subset

   ID     Name     City  Salary
0   1    Alice  Toronto   70000
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
4   4    David  Chicago   85000
6   5      Eva  Toronto   60000


🟦 7. Detect Duplicates by Subset

In [16]:
df[df.duplicated(subset=["Name", "City"], keep=False)]

   ID     Name     City  Salary
1   2      Bob       NY   80000
2   3  Charlie       LA   90000
3   2      Bob       NY   80000
5   3  Charlie       LA   90000
6   5      Eva  Toronto   60000
7   5      Eva  Toronto   60000
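With the section 2 data, the subset check and the full-row check flag the same rows, so the difference is easy to miss. The sketch below borrows the modified salaries from section 8, where repeated names carry different salaries, to find rows that match on `Name` and `City` but are not fully identical. The variable names (`key_dupes`, `full_dupes`, `conflicts`) are illustrative:

```python
import pandas as pd

# Salaries taken from the section 8 dataset, where repeats differ in Salary.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 85000, 85000, 100000, 62000, 60000],
}
df = pd.DataFrame(data)

# Rows that share Name and City with at least one other row.
key_dupes = df.duplicated(subset=["Name", "City"], keep=False)

# Rows that are identical to another row in every column.
full_dupes = df.duplicated(keep=False)

# Same key, different content: candidates for manual review or a tie-break rule.
conflicts = df[key_dupes & ~full_dupes]
print(conflicts)
```

Here no row is an exact copy of another, so a plain `drop_duplicates()` would remove nothing, while the subset check surfaces all six conflicting rows.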


🟦 8. Real-World Logic Example

Keep the row with the highest salary for each Name.

In [18]:
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 85000, 85000, 100000, 62000, 60000]
}

df = pd.DataFrame(data)

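# Sort so that the highest salary for each Name comes first
df_sorted = df.sort_values(by="Salary", ascending=False)
# Keep only the first (highest-paid) row per Name
df_unique_salary = df_sorted.drop_duplicates(subset=["Name"], keep="first")
df_unique_salary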


   ID     Name     City  Salary
5   3  Charlie       LA  100000
3   2      Bob       NY   85000
4   4    David  Chicago   85000
0   1    Alice  Toronto   70000
6   5      Eva  Toronto   62000
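Sorting and then dropping duplicates is one way to keep the best record per group. An alternative sketch, not used in the original notebook, picks the row label of the maximum salary in each group with `groupby(...).idxmax()` and leaves the original row order untouched until selection. The names `best_idx` and `df_best` are illustrative:

```python
import pandas as pd

# Same data as in the section 8 example.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 85000, 85000, 100000, 62000, 60000],
}
df = pd.DataFrame(data)

# For each Name, find the index label of the row with the highest Salary.
best_idx = df.groupby("Name")["Salary"].idxmax()

# Select those rows from the original frame.
df_best = df.loc[best_idx]
print(df_best)
```

Both approaches select the same rows here. The result of this version is ordered by `Name` rather than by `Salary`, and when two rows in a group tie on salary, `idxmax` returns the first of them.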


🟦 9. Reset Index after Cleaning

In [19]:
df_unique_salary = df_unique_salary.reset_index(drop=True)
df_unique_salary

   ID     Name     City  Salary
0   3  Charlie       LA  100000
1   2      Bob       NY   85000
2   4    David  Chicago   85000
3   1    Alice  Toronto   70000
4   5      Eva  Toronto   62000
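The reset can also be folded into the deduplication step. The sketch below assumes pandas 1.0 or later, where `drop_duplicates()` accepts `ignore_index=True`; the name `df_unique` is illustrative:

```python
import pandas as pd

# Same data as in the section 8 example.
data = {
    "ID": [1, 2, 3, 2, 4, 3, 5, 5],
    "Name": ["Alice", "Bob", "Charlie", "Bob", "David", "Charlie", "Eva", "Eva"],
    "City": ["Toronto", "NY", "LA", "NY", "Chicago", "LA", "Toronto", "Toronto"],
    "Salary": [70000, 80000, 90000, 85000, 85000, 100000, 62000, 60000],
}
df = pd.DataFrame(data)

# Sort, deduplicate, and renumber the index in a single chained expression.
# ignore_index=True relabels the result 0..n-1, like reset_index(drop=True).
df_unique = (
    df.sort_values(by="Salary", ascending=False)
      .drop_duplicates(subset=["Name"], keep="first", ignore_index=True)
)
print(df_unique)
```

This avoids the separate `reset_index` call, at the cost of discarding the original row labels immediately, so it fits best when those labels are no longer needed.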


# 🟦 Summary: Removing Duplicates

In this subsection, you learned how to:

- Detect duplicate rows using `duplicated()`
- View and count duplicates
- Remove duplicates using `drop_duplicates()`
- Keep first, last, or remove all duplicates
- Detect and remove duplicates based on specific columns
- Sort before removing duplicates to keep the best record in each group
- Reset the index after cleaning data
