# [Handling Duplicate Data in Pandas](#)

Duplicate data is a common issue in data analysis and processing. Understanding what duplicates are, why they occur, and how to handle them is crucial for maintaining data integrity and ensuring accurate analyses.


<img src="../images/duplicates.png" width="800">

Duplicates in a dataset are records or rows that are identical or nearly identical to other records in the same dataset. They can be:

- **Exact duplicates**: Rows where all values across all columns are identical.
- **Partial duplicates**: Rows where some, but not all, columns have identical values.


For example, consider this DataFrame:


In [1]:
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Mike', 'Jane'],
    'Age': [28, 32, 28, 45, 32],
    'City': ['New York', 'Boston', 'New York', 'Chicago', 'Boston']
})

df

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,New York
3,Mike,45,Chicago
4,Jane,32,Boston


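In this dataset, rows 0 and 2 are exact duplicates of each other, and so are rows 1 and 4: every column matches, and only the index label differs. A partial duplicate would be a pair of rows that share, say, Name and Age but differ in City.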
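
One way to separate the two kinds of duplicate is to compare `.duplicated()` over all columns with `.duplicated(subset=...)`. The sketch below uses a small hypothetical DataFrame (`people` is made up for illustration and is not the tutorial's data); both methods are covered in detail in the sections that follow.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 match on every column,
# rows 1 and 3 match on Name only.
people = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Jane'],
    'Age': [28, 32, 28, 33],
    'City': ['New York', 'Boston', 'New York', 'Denver'],
})

# Rows that have an identical twin across all columns
exact = people.duplicated(keep=False)

# Rows that share a Name with another row but are not exact duplicates
partial = people.duplicated(subset=['Name'], keep=False) & ~exact

print(people[exact])
print(people[partial])
```

Which columns define a "partial" duplicate is a judgment call that depends on the data; `Name` is only an assumption here.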


Duplicates can occur for various reasons:

1. **Data entry errors**: Manual data entry can lead to accidental duplication.
2. **System glitches**: Automated systems might sometimes record the same information multiple times.
3. **Data merging**: When combining data from multiple sources, duplicates can be introduced.
4. **Repeated measurements**: In scientific or experimental data, the same measurement might be taken multiple times.
5. **Intentional redundancy**: Some systems deliberately create duplicates for backup or verification purposes.
6. **Data processing errors**: Mistakes in ETL (Extract, Transform, Load) processes can create duplicates.
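
As a rough illustration of the data-merging case, the sketch below stacks two hypothetical exports (`source_a` and `source_b` are invented for this example) that both contain the same customer record, and counts the duplicates that result.

```python
import pandas as pd

# Two hypothetical exports that both contain customer 2
source_a = pd.DataFrame({'customer_id': [1, 2], 'city': ['Boston', 'Chicago']})
source_b = pd.DataFrame({'customer_id': [2, 3], 'city': ['Chicago', 'Denver']})

# Stacking the sources repeats the shared record
combined = pd.concat([source_a, source_b], ignore_index=True)

# duplicated() flags the second copy of customer 2
n_duplicates = int(combined.duplicated().sum())

print(combined)
print(f"Duplicate rows after combining: {n_duplicates}")
```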


Properly handling duplicates is crucial for several reasons:

1. **Data integrity**: Duplicates can skew your analysis, leading to incorrect conclusions or inflated statistics.

2. **Storage efficiency**: Removing unnecessary duplicates can reduce data storage requirements.

3. **Processing speed**: Fewer duplicates often mean faster data processing and analysis.

4. **Accurate reporting**: Duplicates can lead to overestimation in reports and visualizations.

5. **Machine learning model performance**: Duplicates can bias machine learning models and affect their performance.

6. **Decision making**: In business contexts, duplicates can lead to poor decision-making based on inaccurate data.

7. **Data quality**: Handling duplicates is a key aspect of ensuring overall data quality.


Let's look at how duplicates can affect a simple analysis:

In [3]:
# Mean age with duplicates
df['Age'].mean()

33.0

In [4]:
# Mean age after removing duplicates
df.drop_duplicates()['Age'].mean()

35.0

Here the repeated rows for John and Jane pull the mean age down from 35.0 to 33.0, which shows how duplicates can distort even simple statistical measures.


In the following sections, we'll explore various techniques for detecting, counting, and handling duplicates in Pandas, ensuring that your data analysis is based on clean, accurate data.

## <a id='toc1_'></a>[Detecting Duplicate Data](#toc0_)

Detecting duplicates is the first step in handling duplicate data. Pandas provides several methods to identify duplicates in your DataFrame.


### <a id='toc1_1_'></a>[Using the `.duplicated()` method](#toc0_)


The `.duplicated()` method is the primary tool for detecting duplicates in Pandas. It returns a boolean Series where `True` indicates that the row is a duplicate.


In [5]:
# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 5, 5],
    'B': ['a', 'b', 'b', 'c', 'd', 'e', 'e']
})
df

Unnamed: 0,A,B
0,1,a
1,2,b
2,2,b
3,3,c
4,4,d
5,5,e
6,5,e


In [6]:
# Detect duplicates
df.duplicated()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
dtype: bool

By default, `.duplicated()` considers a row as a duplicate if it's identical to a previous row. The first occurrence is not marked as a duplicate.
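
Because the result is a boolean Series, it can also be used to count duplicates. A minimal sketch, rebuilding the same sample data so the snippet runs on its own (the variable names `mask`, `n_repeats`, and `has_duplicates` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 5, 5],
    'B': ['a', 'b', 'b', 'c', 'd', 'e', 'e'],
})

mask = df.duplicated()

# True counts as 1, so the sum is the number of repeated rows
n_repeats = int(mask.sum())

# Quick yes/no check before doing any further work
has_duplicates = bool(mask.any())

print(n_repeats, has_duplicates)
```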


### <a id='toc1_2_'></a>[Identifying duplicate rows](#toc0_)


To identify which rows are duplicates, we can combine `.duplicated()` with boolean indexing:


In [7]:
# Show duplicate rows
df[df.duplicated()]

Unnamed: 0,A,B
2,2,b
6,5,e


In [8]:
# Show duplicate rows, including first occurrences
df[df.duplicated(keep=False)]

Unnamed: 0,A,B
1,2,b
2,2,b
5,5,e
6,5,e


The `keep` parameter in `.duplicated()` has three options:
- `'first'` (default): Mark duplicates as `True` except for the first occurrence.
- `'last'`: Mark duplicates as `True` except for the last occurrence.
- `False`: Mark all duplicates as `True`.
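
To see the three options side by side, the sketch below applies each one to a hypothetical single-column frame in which the value `2` appears three times (`s` and `comparison` are names chosen for this example).

```python
import pandas as pd

# Hypothetical frame: the value 2 appears three times
s = pd.DataFrame({'A': [1, 2, 2, 2]})

# One column per keep option, aligned on the original index
comparison = pd.DataFrame({
    'A': s['A'],
    "keep='first'": s.duplicated(keep='first'),
    "keep='last'": s.duplicated(keep='last'),
    'keep=False': s.duplicated(keep=False),
})

print(comparison)
```

With `keep='first'` only rows 2 and 3 are flagged, with `keep='last'` only rows 1 and 2, and with `keep=False` all three rows containing `2` are flagged.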


### <a id='toc1_3_'></a>[Identifying duplicate values in specific columns](#toc0_)


You can also check for duplicates based on specific columns:


In [9]:
# Create a DataFrame with partial duplicates
df2 = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Mike', 'Jane'],
    'Age': [28, 32, 28, 45, 33],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'Boston']
})

In [10]:
# Check duplicates based on 'Name' column
df2.duplicated(subset=['Name'])

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [11]:
# Check duplicates based on 'Name' and 'City' columns
df2.duplicated(subset=['Name', 'City'])

0    False
1    False
2    False
3    False
4     True
dtype: bool

You can combine this with boolean indexing to view the duplicate rows:


In [12]:
# Show rows with duplicate names
df2[df2.duplicated(subset=['Name'], keep=False)]

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
4,Jane,33,Boston


To get a summary of duplicate counts for a specific column:


In [13]:
# Count occurrences of each name
df2['Name'].value_counts()

Name
John    2
Jane    2
Mike    1
Name: count, dtype: int64

In [14]:
# Show names that appear more than once
name_counts = df2['Name'].value_counts()
name_counts[name_counts > 1]

Name
John    2
Jane    2
Name: count, dtype: int64

These methods provide flexible ways to detect and identify duplicates in your DataFrame, whether you're looking at entire rows or specific columns. By understanding the nature and extent of duplicates in your data, you can make informed decisions about how to handle them in subsequent data cleaning and analysis steps.
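
The same counting idea extends to combinations of columns. One possible approach, sketched here with `groupby(...).size()` on a copy of the `df2` data (`pair_counts` and `repeated_pairs` are illustrative names):

```python
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Mike', 'Jane'],
    'Age': [28, 32, 28, 45, 33],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'Boston'],
})

# Number of rows for each (Name, City) combination
pair_counts = df2.groupby(['Name', 'City']).size()

# Keep only combinations that occur more than once
repeated_pairs = pair_counts[pair_counts > 1]

print(repeated_pairs)
```

Only the (Jane, Boston) combination appears twice, which matches the single `True` returned by `df2.duplicated(subset=['Name', 'City'])` earlier.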

## <a id='toc2_'></a>[Removing Duplicate Data](#toc0_)

Once you've identified duplicates in your dataset, the next step is often to remove them. Pandas provides efficient methods for removing duplicates, with options to customize the process based on your specific needs.


### <a id='toc2_1_'></a>[Using the `.drop_duplicates()` method](#toc0_)


The primary method for removing duplicates in Pandas is `.drop_duplicates()`. This method returns a new DataFrame with duplicates removed.


By default, `.drop_duplicates()` considers all columns when identifying duplicates. This means that rows are considered duplicates only if all their values are identical.


In [18]:
# DataFrame with duplicates across all columns
df_all = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['a', 'b', 'b', 'c'],
    'C': [10, 20, 20, 30]
})
df_all

Unnamed: 0,A,B,C
0,1,a,10
1,2,b,20
2,2,b,20
3,3,c,30


In [19]:
df_all.drop_duplicates()

Unnamed: 0,A,B,C
0,1,a,10
1,2,b,20
3,3,c,30


### <a id='toc2_2_'></a>[Removing duplicates based on specific columns](#toc0_)


You can specify which columns to consider when identifying duplicates using the `subset` parameter:


In [32]:
# DataFrame with partial duplicates
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Mike', 'Jane'],
    'Age': [28, 32, 28, 45, 33],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'Boston']
})
df

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston


In [33]:
# Remove duplicates based on 'Name' column
df.drop_duplicates(subset=['Name'])

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
3,Mike,45,Chicago


In [34]:
# Remove duplicates based on 'Name' and 'City' columns
df.drop_duplicates(subset=['Name', 'City'])

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago


### <a id='toc2_3_'></a>[Keeping first vs. last occurrence](#toc0_)


The `keep` parameter in `.drop_duplicates()` allows you to specify which occurrence of a duplicate to keep:

- `'first'` (default): Keep the first occurrence of a duplicate.
- `'last'`: Keep the last occurrence of a duplicate.
- `False`: Drop all duplicates, including the first occurrence.
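
The difference between the options is easiest to see on data that contains fully identical rows. The sketch below uses a hypothetical frame, `demo`, in which rows 1 and 2 are identical, and prints which index labels survive each call.

```python
import pandas as pd

# Hypothetical frame in which rows 1 and 2 are identical
demo = pd.DataFrame({'A': [1, 2, 2, 3], 'B': ['a', 'b', 'b', 'c']})

kept_first = demo.drop_duplicates(keep='first')  # drops row 2
kept_last = demo.drop_duplicates(keep='last')    # drops row 1
kept_none = demo.drop_duplicates(keep=False)     # drops rows 1 and 2

print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```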


In [36]:
# Keep first occurrence (default behavior)
df.drop_duplicates(keep='first')

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston


In [37]:
# Keep last occurrence
df.drop_duplicates(keep='last')


Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston


In [38]:
# Drop every row that has a duplicate (no occurrence is kept)
df.drop_duplicates(keep=False)

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston


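Because no two rows in this DataFrame match on every column, all three calls above return all five rows. The `keep` parameter has a visible effect once you combine it with `subset`: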


In [39]:
# Keep last occurrence of duplicates based on 'Name'
df.drop_duplicates(subset=['Name'], keep='last')

Unnamed: 0,Name,Age,City
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston


Remember that `.drop_duplicates()` returns a new DataFrame. If you want to modify the original DataFrame, use the `inplace=True` parameter:


In [40]:
# Modify the original DataFrame
df.drop_duplicates(inplace=True)

In [41]:
df

Unnamed: 0,Name,Age,City
0,John,28,New York
1,Jane,32,Boston
2,John,28,Chicago
3,Mike,45,Chicago
4,Jane,33,Boston
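
An alternative to `inplace=True` is to assign the result to a new name, which leaves the original data available for comparison. The sketch below also passes `ignore_index=True` so that the surviving rows are relabeled from 0; `raw` and `deduped` are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical frame in which rows 1 and 2 are identical
raw = pd.DataFrame({'A': [1, 2, 2, 3], 'B': ['a', 'b', 'b', 'c']})

# Assign the result instead of mutating raw;
# ignore_index=True relabels the remaining rows 0..n-1
deduped = raw.drop_duplicates(ignore_index=True)

print(deduped)
print(f"Rows before: {len(raw)}, rows after: {len(deduped)}")
```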


When removing duplicates, it's important to consider:

1. Which columns are relevant for identifying duplicates in your specific use case.
2. Whether you need to keep the first, last, or no occurrences of duplicates.
3. The potential impact on your analysis of removing certain duplicates.


By using these methods, you can effectively clean your data of unwanted duplicates, ensuring that your subsequent analyses are based on unique, relevant data points.

## <a id='toc3_'></a>[Advanced Duplicate Handling](#toc0_)

As data complexity increases, you may encounter situations where simple duplicate removal isn't sufficient. This section covers more advanced techniques for handling duplicates in various scenarios.


### <a id='toc3_1_'></a>[Partial duplicates](#toc0_)


Partial duplicates occur when some, but not all, columns match between rows. Handling these often requires a more nuanced approach.


In [42]:
# Create a DataFrame with partial duplicates
df = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'John Smith', 'Mike Johnson', 'Jane Doe'],
    'Age': [28, 32, 28, 45, 33],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'Boston']
})
df

Unnamed: 0,Name,Age,City
0,John Smith,28,New York
1,Jane Doe,32,Boston
2,John Smith,28,Chicago
3,Mike Johnson,45,Chicago
4,Jane Doe,33,Boston


In [44]:
# Identify partial duplicates based on 'Name'
partial_dupes = df[df.duplicated(subset=['Name'], keep=False)]

print("Partial duplicates:")
partial_dupes

Partial duplicates:


Unnamed: 0,Name,Age,City
0,John Smith,28,New York
1,Jane Doe,32,Boston
2,John Smith,28,Chicago
4,Jane Doe,33,Boston


In [45]:
# Aggregate partial duplicates
aggregated = df.groupby('Name').agg({
    'Age': 'first',
    'City': lambda x: ', '.join(sorted(set(x)))  # sort so the order is deterministic
}).reset_index()

print("\nAggregated data:")
aggregated


Aggregated data:


Unnamed: 0,Name,Age,City
0,Jane Doe,32,Boston
1,John Smith,28,"Chicago, New York"
2,Mike Johnson,45,Chicago


In this example, we identify partial duplicates based on the 'Name' column and then aggregate the data to combine information from duplicate entries.
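
Aggregation is not the only option. Another possible approach is to keep a single "best" row per name. The sketch below assumes, purely for illustration, that the row with the highest `Age` is the most recent record; it rebuilds the same data under the hypothetical name `people` so it runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John Smith', 'Jane Doe', 'John Smith', 'Mike Johnson', 'Jane Doe'],
    'Age': [28, 32, 28, 45, 33],
    'City': ['New York', 'Boston', 'Chicago', 'Chicago', 'Boston'],
})

# Assumption: a higher Age means a more recent record.
# The stable sort keeps ties in their original order, so for equal ages
# the later row wins when keep='last' is applied.
latest = (
    people.sort_values('Age', kind='stable')
          .drop_duplicates(subset=['Name'], keep='last')
          .sort_index()
)

print(latest)
```

Unlike the aggregation above, this keeps one original row per name intact rather than combining values from several rows.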


### <a id='toc3_2_'></a>[Fuzzy matching for near-duplicates](#toc0_)


Near-duplicates are entries that are very similar but not exactly identical, often due to typos or slight variations. Fuzzy matching can help identify these.


In [46]:
%pip install thefuzz

Note: you may need to restart the kernel to use updated packages.


In [47]:
from thefuzz import fuzz, process



In [48]:
# Create a DataFrame with near-duplicates
df = pd.DataFrame({
    'Name': ['John Smith', 'Jon Smith', 'Jane Doe', 'Jane Do', 'Mike Johnson']
})
df

Unnamed: 0,Name
0,John Smith
1,Jon Smith
2,Jane Doe
3,Jane Do
4,Mike Johnson


In [49]:
# Function to find fuzzy duplicates
def find_fuzzy_duplicates(name, names_list, cutoff=80):
    matches = process.extract(name, names_list, limit=2, scorer=fuzz.token_sort_ratio)
    return [match for match in matches if match[1] >= cutoff and match[0] != name]

# Apply fuzzy matching
df['Fuzzy_Matches'] = df['Name'].apply(lambda x: find_fuzzy_duplicates(x, df['Name']))
df

Unnamed: 0,Name,Fuzzy_Matches
0,John Smith,"[(Jon Smith, 95, 1)]"
1,Jon Smith,"[(John Smith, 95, 0)]"
2,Jane Doe,"[(Jane Do, 93, 3)]"
3,Jane Do,"[(Jane Doe, 93, 2)]"
4,Mike Johnson,[]


In [51]:
# Identify groups of fuzzy duplicates
fuzzy_groups = {}
for idx, row in df.iterrows():
    if row['Fuzzy_Matches']:
        name = row['Name']
        match = row['Fuzzy_Matches'][0][0]
        if name not in fuzzy_groups and match not in fuzzy_groups:
            fuzzy_groups[name] = [name, match]
        elif name in fuzzy_groups and match not in fuzzy_groups[name]:
            # Extend an existing group without repeating a member
            fuzzy_groups[name].append(match)
        elif match in fuzzy_groups and name not in fuzzy_groups[match]:
            fuzzy_groups[match].append(name)


In [52]:
print("\nFuzzy duplicate groups:")
for group in fuzzy_groups.values():
    print(group)


Fuzzy duplicate groups:
['John Smith', 'Jon Smith']
['Jane Doe', 'Jane Do']


This example uses the `thefuzz` library (the successor to `fuzzywuzzy`) to identify near-duplicates based on name similarity.
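
If installing a third-party package is not an option, a rough equivalent can be sketched with `difflib.SequenceMatcher` from the standard library. This is a swapped-in technique, not what the cells above use: its ratio is on a 0 to 1 scale and is computed differently from `fuzz.token_sort_ratio`, and the `similarity` helper and `CUTOFF` value are hypothetical choices for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ['John Smith', 'Jon Smith', 'Jane Doe', 'Jane Do', 'Mike Johnson']


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical threshold; tune it on your own data
CUTOFF = 0.8

# Compare every unordered pair of names once
near_duplicate_pairs = [
    (a, b) for a, b in combinations(names, 2) if similarity(a, b) >= CUTOFF
]

print(near_duplicate_pairs)
```

Comparing every pair scales quadratically with the number of names, so this sketch is only practical for small lists.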


### <a id='toc3_3_'></a>[Handling duplicates in time series data](#toc0_)


Time series data often requires special consideration when handling duplicates, especially when timestamps are involved.


In [53]:
# Create a time series DataFrame with duplicates
df = pd.DataFrame({
    'Timestamp': pd.date_range(start='2023-01-01', periods=5, freq='D').tolist() + 
                 [pd.Timestamp('2023-01-02'), pd.Timestamp('2023-01-04')],
    'Value': [1, 2, 3, 4, 5, 2.5, 4.5]
})
df

Unnamed: 0,Timestamp,Value
0,2023-01-01,1.0
1,2023-01-02,2.0
2,2023-01-03,3.0
3,2023-01-04,4.0
4,2023-01-05,5.0
5,2023-01-02,2.5
6,2023-01-04,4.5


In [54]:
# Sort by timestamp with a stable sort, so rows sharing a timestamp keep their
# original order, then keep the last occurrence of each timestamp
df_cleaned = df.sort_values('Timestamp', kind='stable').drop_duplicates('Timestamp', keep='last')
df_cleaned

Unnamed: 0,Timestamp,Value
0,2023-01-01,1.0
5,2023-01-02,2.5
2,2023-01-03,3.0
6,2023-01-04,4.5
4,2023-01-05,5.0


In [55]:
# Alternatively, aggregate duplicates
df_aggregated = df.groupby('Timestamp').agg({
    'Value': ['mean', 'min', 'max']
}).reset_index()

df_aggregated

Unnamed: 0_level_0,Timestamp,Value,Value,Value
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max
0,2023-01-01,1.0,1.0,1.0
1,2023-01-02,2.25,2.0,2.5
2,2023-01-03,3.0,3.0,3.0
3,2023-01-04,4.25,4.0,4.5
4,2023-01-05,5.0,5.0,5.0


In this example, we first remove duplicates by keeping the last occurrence for each timestamp. Then, we demonstrate how to aggregate duplicate timestamps by calculating statistics like mean, min, and max.
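
Time series are often stored with the timestamps as the index rather than as a column. In that layout, one possible approach is `Index.duplicated()`, sketched below on the same values (the Series name `ts` and the result name `ts_last` are illustrative).

```python
import pandas as pd

# Same values as above, with the timestamps as the index
ts = pd.Series(
    [1, 2, 3, 4, 5, 2.5, 4.5],
    index=pd.to_datetime([
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
        '2023-01-05', '2023-01-02', '2023-01-04',
    ]),
    name='Value',
)

# Index.duplicated flags repeated labels; ~ selects the rows that are not flagged.
# keep='last' leaves the most recently appended value for each timestamp.
ts_last = ts[~ts.index.duplicated(keep='last')].sort_index()

print(ts_last)
```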


These advanced techniques for handling duplicates allow you to deal with more complex scenarios in data cleaning and preparation. When working with partial duplicates, near-duplicates, or time series data, it's important to consider the specific requirements of your analysis and choose the appropriate method for handling duplicates.