Link to Medium blog post: https://towardsdatascience.com/finding-and-removing-duplicate-rows-in-pandas-dataframe-c6117668631f

# Finding and removing duplicate rows in Pandas DataFrame

In [3]:
# We will use a subset from the Titanic dataset available on Kaggle.

import pandas as pd
def load_data(): 
    df_all = pd.read_csv('train.csv')
    # Take a subset
    return df_all.loc[:300, ['Survived', 'Pclass', 'Sex', 'Cabin', 'Embarked']].dropna()
df = load_data()

In [4]:
# Print out the first 5 rows of the dataframe
print(df.head())

    Survived  Pclass     Sex Cabin Embarked
1          1       1  female   C85        C
3          1       1  female  C123        S
6          0       1    male   E46        S
10         1       3  female    G6        S
11         1       1  female  C103        S


## 1. Finding duplicate rows

In [5]:
# To find duplicates in a specific column, call the duplicated() method on that column.

df.Cabin.duplicated()

1      False
3      False
6      False
10     False
11     False
       ...  
291    False
292    False
297    False
298    False
299     True
Name: Cabin, Length: 62, dtype: bool

To take a look at the duplication in the DataFrame as a whole, just call the duplicated() method on the DataFrame. It outputs True if an entire row is identical to a previous row.

In [6]:
df.duplicated()

1      False
3      False
6      False
10     False
11     False
       ...  
291    False
292    False
297    False
298    False
299    False
Length: 62, dtype: bool
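To make the whole-row rule easier to see, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the Titanic subset). A row is flagged only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'Pclass': [1, 3, 1, 3],
    'Cabin': ['D26', 'G6', 'D26', 'G6'],
    'Survived': [0, 0, 0, 1],
})

# Row 2 repeats row 0 in every column, so it is flagged.
# Row 3 matches row 1 in 'Pclass' and 'Cabin' but differs in 'Survived',
# so it is not a whole-row duplicate.
mask = toy.duplicated()
print(mask.tolist())  # [False, False, True, False]
```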

To consider certain columns for identifying duplicates, we can pass a list of columns to the argument subset:

In [7]:
df.duplicated(subset=['Survived', 'Pclass', 'Sex'])

1      False
3       True
6      False
10     False
11      True
       ...  
291     True
292     True
297     True
298     True
299     True
Length: 62, dtype: bool
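The effect of subset is easiest to see side by side. The sketch below uses a small hypothetical frame (not the Titanic data) in which no row is a full duplicate, yet two rows share the same values in the chosen columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1],
    'Pclass': [1, 1, 1, 3],
    'Cabin': ['C85', 'C123', 'E46', 'G6'],
})

# Comparing all columns: every row is unique because 'Cabin' differs
whole_row = toy.duplicated()

# Comparing only 'Survived' and 'Pclass': row 1 repeats row 0
on_subset = toy.duplicated(subset=['Survived', 'Pclass'])

print(whole_row.tolist())  # [False, False, False, False]
print(on_subset.tolist())  # [False, True, False, False]
```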

## 2. Counting duplicates and non-duplicates

The result of duplicated() is a boolean Series, so we can sum it to count the duplicates. Behind the scenes, each True is treated as 1 and each False as 0 before they are added up.

In [8]:
# Count duplicates in a column
df.Cabin.duplicated().sum()

8
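As a small sketch of why summing works, the example below converts the boolean flags to integers explicitly and compares that with calling sum() directly. The `cabins` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
cabins = pd.Series(['C85', 'G6', 'C85', 'G6', 'C85'])

# The first 'C85' and the first 'G6' are kept; the three repeats are flagged
flags = cabins.duplicated()

# Explicit conversion shows the 1/0 values that sum() adds up
as_ints = flags.astype(int)
n_duplicates = int(flags.sum())

print(as_ints.tolist())  # [0, 0, 1, 1, 1]
print(n_duplicates)      # 3
```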

Just like before, we can count the duplicates across the whole DataFrame or on certain columns.

In [10]:
# Count duplicates in a DataFrame
df.duplicated().sum()

2

In [11]:
# Count duplicates on certain columns
df.duplicated(subset=['Survived', 'Pclass', 'Sex']).sum()

52

If you want to count the number of non-duplicates (the number of False values), you can invert the Series with the negation operator (~) and then call sum():

In [12]:
# Count the number of non-duplicates
(~df.duplicated()).sum()

60
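A quick sanity check, sketched on hypothetical data: the duplicate and non-duplicate counts should add up to the number of rows, and the non-duplicate count should match the number of rows that drop_duplicates() keeps with its default settings.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'male'],
    'Embarked': ['S', 'C', 'S', 'S'],
})

dup = toy.duplicated()          # rows 2 and 3 repeat row 0
n_dup = int(dup.sum())          # number of True values
n_unique = int((~dup).sum())    # number of False values

print(n_dup, n_unique)                        # 2 2
print(n_dup + n_unique == len(toy))           # True
print(n_unique == len(toy.drop_duplicates())) # True
```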

## 3. Extracting duplicate rows with loc

Pandas duplicated() returns a boolean Series. However, it is not practical to see a list of True and False when we need to perform some data analysis.

We can use the Pandas loc data selector to extract those duplicate rows:

In [13]:
# Extract duplicate rows
df.loc[df.duplicated(), :]

     Survived  Pclass     Sex Cabin Embarked
124         0       1    male   D26        S
251         0       3  female    G6        S
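The selection above is ordinary boolean indexing. As a minimal sketch on a hypothetical frame (the index labels are made up to resemble the subset), the mask returned by duplicated() can be passed to loc or used directly in square brackets with the same result.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index, used only for illustration
toy = pd.DataFrame(
    {'Sex': ['male', 'female', 'male'], 'Cabin': ['D26', 'G6', 'D26']},
    index=[102, 205, 124],
)

mask = toy.duplicated()     # boolean Series sharing the frame's index
dupes = toy.loc[mask, :]    # rows where the mask is True, all columns
same = toy[mask]            # shorthand boolean indexing

print(dupes.index.tolist())  # [124]
print(dupes.equals(same))    # True
```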


## 4. Determining which duplicates to mark with keep

There is an argument keep in Pandas duplicated() to determine which duplicates to mark. keep defaults to 'first', which means the first occurrence gets kept, and all others get identified as duplicates.

We can change it to 'last' to keep the last occurrence and mark all others as duplicates.

In [14]:
df.loc[df.duplicated(keep='first'), :]

     Survived  Pclass     Sex Cabin Embarked
124         0       1    male   D26        S
251         0       3  female    G6        S


In [15]:
df.loc[df.duplicated(keep='last'), :]

     Survived  Pclass     Sex Cabin Embarked
102         0       1    male   D26        S
205         0       3  female    G6        S


There is a third option: keep=False. It marks every occurrence of a duplicated row as True, including the first, which lets us see all duplicate rows together.

In [16]:
df.loc[df.duplicated(keep=False), :]

     Survived  Pclass     Sex Cabin Embarked
102         0       1    male   D26        S
124         0       1    male   D26        S
205         0       3  female    G6        S
251         0       3  female    G6        S
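The three keep settings can be compared directly on one small input. This is a sketch with a hypothetical Series in which 'D26' appears three times and the other values once.

```python
import pandas as pd

# Hypothetical Series, used only for illustration
cabins = pd.Series(['D26', 'G6', 'D26', 'E46', 'D26'])

# keep='first': the first 'D26' is kept, later ones are marked
first = cabins.duplicated(keep='first').tolist()

# keep='last': the last 'D26' is kept, earlier ones are marked
last = cabins.duplicated(keep='last').tolist()

# keep=False: every 'D26' is marked, none is treated as the original
every = cabins.duplicated(keep=False).tolist()

print(first)  # [False, False, True, False, True]
print(last)   # [True, False, True, False, False]
print(every)  # [True, False, True, False, True]
```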


## 5. Dropping duplicate rows

In [17]:
# We can use Pandas built-in method drop_duplicates() to drop duplicate rows.

df.drop_duplicates()

    Survived  Pclass     Sex        Cabin Embarked
1          1       1  female          C85        C
3          1       1  female         C123        S
6          0       1    male          E46        S
10         1       3  female           G6        S
11         1       1  female         C103        S
21         1       2    male          D56        S
23         1       1    male           A6        S
27         0       1    male  C23 C25 C27        S
31         1       1  female          B78        C
52         1       1  female          D33        C


In [18]:
# We can set the argument inplace=True to remove duplicates from the original DataFrame.

df.drop_duplicates(inplace=True)
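The difference between the two calling styles is sketched below on a hypothetical frame. Without inplace, drop_duplicates() returns a new DataFrame and leaves the original untouched; with inplace=True, the original is modified and the call is documented to return None, so its result should not be assigned back to a variable.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({'Cabin': ['D26', 'G6', 'D26']})

# Default: a new, de-duplicated frame is returned; 'toy' keeps all 3 rows
deduped = toy.drop_duplicates()
rows_before = len(toy)

# In place: 'toy' itself loses the repeated row
toy.drop_duplicates(inplace=True)
rows_after = len(toy)

print(len(deduped))             # 2
print(rows_before, rows_after)  # 3 2
print(toy.index.tolist())       # [0, 1]  (original index labels are preserved)
```

Reassigning (`df = df.drop_duplicates()`) achieves the same end result as inplace=True and is often easier to chain with other methods.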

The argument keep can be set for drop_duplicates() as well to determine which duplicates to keep. It defaults to 'first' to keep the first occurrence and drop all other duplicates.

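Similarly, we can set keep to 'last' to keep the last occurrence and drop other duplicates. Note that In [18] has already removed the full-row duplicates from df in place, so on this df the keep variants below return the same rows.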

In [19]:
# Use keep='last' to keep the last occurrence 
df.drop_duplicates(keep='last')

    Survived  Pclass     Sex        Cabin Embarked
1          1       1  female          C85        C
3          1       1  female         C123        S
6          0       1    male          E46        S
10         1       3  female           G6        S
11         1       1  female         C103        S
21         1       2    male          D56        S
23         1       1    male           A6        S
27         0       1    male  C23 C25 C27        S
31         1       1  female          B78        C
52         1       1  female          D33        C


In [20]:
# Use keep=False to drop every row that has a duplicate, including the first occurrence
df.drop_duplicates(keep=False)

    Survived  Pclass     Sex        Cabin Embarked
1          1       1  female          C85        C
3          1       1  female         C123        S
6          0       1    male          E46        S
10         1       3  female           G6        S
11         1       1  female         C103        S
21         1       2    male          D56        S
23         1       1    male           A6        S
27         0       1    male  C23 C25 C27        S
31         1       1  female          B78        C
52         1       1  female          D33        C


In [21]:
# Considering certain columns for dropping duplicates
df.drop_duplicates(subset=['Survived', 'Pclass', 'Sex'])

     Survived  Pclass     Sex  Cabin Embarked
1           1       1  female    C85        C
6           0       1    male    E46        S
10          1       3  female     G6        S
21          1       2    male    D56        S
23          1       1    male     A6        S
66          1       2  female    F33        S
75          0       3    male  F G73        S
148         0       2    male     F2        S
177         0       1  female    C49        C
205         0       3  female     G6        S
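The subset and keep arguments can also be combined in drop_duplicates(). The sketch below, on a hypothetical frame, shows how the choice of keep changes which row survives for each combination of the subset columns.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'Survived': [1, 1, 0, 1],
    'Pclass': [1, 1, 1, 1],
    'Cabin': ['C85', 'C123', 'E46', 'C103'],
})

cols = ['Survived', 'Pclass']

# Rows 0, 1 and 3 share (Survived=1, Pclass=1); row 2 is the only (0, 1)
keep_first = toy.drop_duplicates(subset=cols)               # keeps rows 0 and 2
keep_last = toy.drop_duplicates(subset=cols, keep='last')   # keeps rows 2 and 3

print(keep_first['Cabin'].tolist())  # ['C85', 'E46']
print(keep_last['Cabin'].tolist())   # ['E46', 'C103']
```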
