Let's imagine you have a dataset that contains information about customers of an e-commerce store. The dataset includes the following columns:

- customer_id: Unique identifier for each customer
- age: Age of the customer
- gender: Gender of the customer
- email: Email address of the customer
- purchase_amount: The amount spent by the customer on purchases

In [1]:
import pandas as pd
import numpy as np

In [10]:
# Create the DataFrame with duplicate records
data = {
    'customer_id': [1, 2, 3, 4, 5, 2, 3],
    'age': [25, 20, 30, 35, 18, 30, 30],
    'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M'],
    'email': ['example1@example.com', 'example2@example.com', 'example3@example.com', 'example4@example.com', 'example6@example.com', 'example2@example.com', 'example3@example.com'],
    'purchase_amount': [100.0, 50.0, 250.0, 200.0, 100.0, 50.0, 150.0]
}

df = pd.DataFrame(data)
df

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25,M,example1@example.com,100.0
1,2,20,F,example2@example.com,50.0
2,3,30,M,example3@example.com,250.0
3,4,35,M,example4@example.com,200.0
4,5,18,F,example6@example.com,100.0
5,2,30,F,example2@example.com,50.0
6,3,30,M,example3@example.com,150.0


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customer_id      7 non-null      int64  
 1   age              7 non-null      int64  
 2   gender           7 non-null      object 
 3   email            7 non-null      object 
 4   purchase_amount  7 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 412.0+ bytes


## Drop exact duplicate rows


In [15]:
df_deduped = df.drop_duplicates()
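
Calling `drop_duplicates()` with no arguments removes a row only when every column matches an earlier row. In this sample data the repeated customers each differ in one column, so no row should be dropped at this step. A minimal sketch to confirm that, rebuilding the same DataFrame so the snippet runs on its own (the name `n_exact` is illustrative):

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 2, 3],
    'age': [25, 20, 30, 35, 18, 30, 30],
    'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M'],
    'email': ['example1@example.com', 'example2@example.com',
              'example3@example.com', 'example4@example.com',
              'example6@example.com', 'example2@example.com',
              'example3@example.com'],
    'purchase_amount': [100.0, 50.0, 250.0, 200.0, 100.0, 50.0, 150.0],
})

# duplicated() flags a row only when every column matches an earlier row
n_exact = int(df.duplicated().sum())
df_deduped = df.drop_duplicates()

print(f"exact duplicates found: {n_exact}")
print(f"rows before: {len(df)}, rows after: {len(df_deduped)}")
```

Comparing the row counts before and after is a quick way to see whether a deduplication step had any effect.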

### Drop duplicates based on specific columns

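The duplicate records are:

Customer ID 2: This customer appears twice (rows 1 and 5). The two rows differ only in age (20 vs. 30).
Customer ID 3: This customer appears twice (rows 2 and 6). The two rows differ only in purchase_amount (250.0 vs. 150.0).

Because these rows are not identical in every column, they are only detected when duplicates are matched on customer_id and email.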
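
Before dropping anything, it can help to look at the rows that share a key. The sketch below is one possible way to list them, using `duplicated()` with `keep=False` so that every member of a duplicate group is flagged, not just the later ones. The names `key_cols`, `dupes`, and `counts` are illustrative, and the DataFrame is rebuilt so the snippet runs on its own:

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 2, 3],
    'age': [25, 20, 30, 35, 18, 30, 30],
    'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M'],
    'email': ['example1@example.com', 'example2@example.com',
              'example3@example.com', 'example4@example.com',
              'example6@example.com', 'example2@example.com',
              'example3@example.com'],
    'purchase_amount': [100.0, 50.0, 250.0, 200.0, 100.0, 50.0, 150.0],
})

key_cols = ['customer_id', 'email']

# keep=False marks every row of a duplicate group, including the first one
mask = df.duplicated(subset=key_cols, keep=False)

# Stable sort so rows of the same customer sit together in original order
dupes = df[mask].sort_values('customer_id', kind='mergesort')

# Number of rows per repeated customer
counts = dupes.groupby('customer_id').size()

print(dupes)
print(counts)
```

Seeing the conflicting rows side by side makes it clearer which column differs and which copy is worth keeping.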

In [12]:
df_subset_deduped = df.drop_duplicates(subset=['customer_id', 'email'])
df_subset_deduped

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25,M,example1@example.com,100.0
1,2,20,F,example2@example.com,50.0
2,3,30,M,example3@example.com,250.0
3,4,35,M,example4@example.com,200.0
4,5,18,F,example6@example.com,100.0


### Partial duplicates: keep the first or last occurrence

In [30]:
# Keep the first row for each email address; later rows with the same email are dropped
df_keep_first = df.drop_duplicates(subset='email', keep='first')
df_keep_first

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25,M,example1@example.com,100.0
1,2,20,F,example2@example.com,50.0
2,3,30,M,example3@example.com,250.0
3,4,35,M,example4@example.com,200.0
4,5,18,F,example6@example.com,100.0


In [31]:
# Keep the last row for each email address; earlier rows with the same email are dropped
df_keep_last = df.drop_duplicates(subset='email', keep='last')
df_keep_last

Unnamed: 0,customer_id,age,gender,email,purchase_amount
0,1,25,M,example1@example.com,100.0
3,4,35,M,example4@example.com,200.0
4,5,18,F,example6@example.com,100.0
5,2,30,F,example2@example.com,50.0
6,3,30,M,example3@example.com,150.0
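
To make the effect of the `keep` parameter easier to compare, the following sketch records which row labels survive for each setting. It also tries `keep=False`, which is not used above and drops every row whose email is repeated. The dictionary name `kept_index` is illustrative, and the DataFrame is rebuilt so the snippet runs on its own:

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 2, 3],
    'age': [25, 20, 30, 35, 18, 30, 30],
    'gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M'],
    'email': ['example1@example.com', 'example2@example.com',
              'example3@example.com', 'example4@example.com',
              'example6@example.com', 'example2@example.com',
              'example3@example.com'],
    'purchase_amount': [100.0, 50.0, 250.0, 200.0, 100.0, 50.0, 150.0],
})

# Map each keep option to the index labels of the rows it retains
kept_index = {}
for keep in ('first', 'last', False):
    kept = df.drop_duplicates(subset='email', keep=keep)
    kept_index[keep] = kept.index.tolist()

for keep, labels in kept_index.items():
    print(f"keep={keep!r}: rows {labels}")
```

With `keep='first'` the original rows 1 and 2 survive, with `keep='last'` the later rows 5 and 6 survive instead, and with `keep=False` neither copy is kept. Which option is appropriate depends on whether the earlier or the later record is considered more reliable.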
