## Example: Deduplicating a Customer List

This example shows how to identify and remove duplicate entries with pandas, using a small dataset of customers who signed up for a service.

In [2]:
import pandas as pd

# Sample data: Customer signups
data = {
    'SignupID': [101, 102, 103, 104, 105],
    'Name': ['Ahmed Ali', 'Fatima Hassan', 'Omar Ibrahim', 'Ahmed Ali', 'Fatima Hassan'],
    'Email': ['ahmed@example.com', 'fatima@example.com', 'omar@example.com', 'ahmed@example.com', 'fatima@different-email.com'],
    'Plan': ['Basic', 'Premium', 'Basic', 'Basic', 'Standard']
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original Customer List:")
display(df)

Original Customer List:


|   | SignupID | Name | Email | Plan |
|---|---|---|---|---|
| 0 | 101 | Ahmed Ali | ahmed@example.com | Basic |
| 1 | 102 | Fatima Hassan | fatima@example.com | Premium |
| 2 | 103 | Omar Ibrahim | omar@example.com | Basic |
| 3 | 104 | Ahmed Ali | ahmed@example.com | Basic |
| 4 | 105 | Fatima Hassan | fatima@different-email.com | Standard |


### Identifying Duplicates

Notice that **Ahmed Ali** appears twice with the same email (`ahmed@example.com`), while **Fatima Hassan** appears twice but with different emails (`fatima@example.com` vs `fatima@different-email.com`).

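Let's identify duplicates based on the **Email** column first, then on the **Name** column for comparison. Passing `keep=False` marks every row in a duplicate group, not just the repeats.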

In [4]:
# Find duplicates in the 'Email' column, keep=False to mark all duplicates
duplicates = df.duplicated(subset=['Email'], keep=False)
print("Duplicate Entries (by Email):")
display(df[duplicates])

# Find duplicates in the 'Name' column
duplicates = df.duplicated(subset=['Name'], keep=False)
print("Duplicate Entries (by Name):")
display(df[duplicates])

Duplicate Entries (by Email):


|   | SignupID | Name | Email | Plan |
|---|---|---|---|---|
| 0 | 101 | Ahmed Ali | ahmed@example.com | Basic |
| 3 | 104 | Ahmed Ali | ahmed@example.com | Basic |


Duplicate Entries (by Name):


|   | SignupID | Name | Email | Plan |
|---|---|---|---|---|
| 0 | 101 | Ahmed Ali | ahmed@example.com | Basic |
| 1 | 102 | Fatima Hassan | fatima@example.com | Premium |
| 3 | 104 | Ahmed Ali | ahmed@example.com | Basic |
| 4 | 105 | Fatima Hassan | fatima@different-email.com | Standard |
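
The `keep` argument controls which rows in a duplicate group get flagged. As a minimal sketch, the snippet below rebuilds the same sample data so it runs on its own and compares the three options side by side. The variable names (`flag_first`, `flag_last`, `flag_all`, `comparison`) are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'SignupID': [101, 102, 103, 104, 105],
    'Name': ['Ahmed Ali', 'Fatima Hassan', 'Omar Ibrahim', 'Ahmed Ali', 'Fatima Hassan'],
    'Email': ['ahmed@example.com', 'fatima@example.com', 'omar@example.com',
              'ahmed@example.com', 'fatima@different-email.com'],
    'Plan': ['Basic', 'Premium', 'Basic', 'Basic', 'Standard'],
})

# keep='first' (the default): flag every repeat except the first occurrence
flag_first = df.duplicated(subset=['Email'], keep='first')

# keep='last': flag every repeat except the last occurrence
flag_last = df.duplicated(subset=['Email'], keep='last')

# keep=False: flag every row that shares its email with another row
flag_all = df.duplicated(subset=['Email'], keep=False)

# Put the three boolean masks next to the email they refer to
comparison = pd.DataFrame({
    'Email': df['Email'],
    'keep_first': flag_first,
    'keep_last': flag_last,
    'keep_False': flag_all,
})
print(comparison)
```

With `keep='first'` only row 3 is flagged, with `keep='last'` only row 0, and with `keep=False` both. That is why `keep=False` suits inspection, while the other two suit deciding which row to drop.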


### Removing Duplicates

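We can remove duplicates with `drop_duplicates()`. By default, it compares all columns, keeps the **first** occurrence, and drops the rest. Passing `subset=['Email']` compares rows by email address only, so each email appears once in the result. Because the two **Fatima Hassan** rows have different emails, both are kept: deduplicating by email does not merge customers who merely share a name.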

In [5]:
# Remove duplicates, keeping the first occurrence
df_cleaned = df.drop_duplicates(subset=['Email'])

print("Cleaned Customer List:")
display(df_cleaned)

Cleaned Customer List:


|   | SignupID | Name | Email | Plan |
|---|---|---|---|---|
| 0 | 101 | Ahmed Ali | ahmed@example.com | Basic |
| 1 | 102 | Fatima Hassan | fatima@example.com | Premium |
| 2 | 103 | Omar Ibrahim | omar@example.com | Basic |
| 4 | 105 | Fatima Hassan | fatima@different-email.com | Standard |
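
The result depends on both `subset` and `keep`. The sketch below, again on a rebuilt copy of the sample data, shows two variations: keeping the last row per email, and deduplicating by name. It assumes that rows are ordered by signup time, so "last" means "most recent"; the original data does not state this. The names `latest_by_email` and `unique_by_name` are illustrative.

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    'SignupID': [101, 102, 103, 104, 105],
    'Name': ['Ahmed Ali', 'Fatima Hassan', 'Omar Ibrahim', 'Ahmed Ali', 'Fatima Hassan'],
    'Email': ['ahmed@example.com', 'fatima@example.com', 'omar@example.com',
              'ahmed@example.com', 'fatima@different-email.com'],
    'Plan': ['Basic', 'Premium', 'Basic', 'Basic', 'Standard'],
})

# Keep the last row for each email (assumed here to be the most recent signup)
latest_by_email = df.drop_duplicates(subset=['Email'], keep='last')

# Renumber the index 0..n-1 so it has no gaps after rows were dropped
latest_by_email = latest_by_email.reset_index(drop=True)

# Deduplicate by name instead; this also collapses the two Fatima Hassan rows
unique_by_name = df.drop_duplicates(subset=['Name'])

print(latest_by_email['SignupID'].tolist())
print(unique_by_name['SignupID'].tolist())
```

Deduplicating by name removes more rows, but it risks merging distinct people who share a name, so the choice of `subset` should reflect what identifies a customer in the data at hand.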
