# Identify, calculate and remove duplicates using Python

## Identify duplicates
Using data.info() we can extract information about all columns in our data frame, including the number of rows.

pd.unique() (or the Series method data['column'].unique()) returns the unique values of a single column as an array. It does not work on a complete data frame.

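data.duplicated() returns a boolean series, marking duplicated rows as True. By default the first occurrence of a row is marked False and only its later repeats are marked True.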

We can also detect duplicated index labels using data.index.duplicated(), which returns a boolean array.

In [19]:
import pandas as pd
# Rows 0, 4 and 8 are identical on purpose, so we have duplicates to find.
temperature_data = {"station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
                    "date": ['2022-07-13', '2022-07-13', '2022-07-13', '2022-07-13',
                             '2022-07-13', '2022-07-14', '2022-07-14', '2022-07-14',
                             '2022-07-13', '2022-07-15', '2022-07-15', '2022-07-15'],
                    "temperature": [36.5, 37.9, 34.3, 40.2, 36.5, 39.8, 35.2, 41.7, 36.5, 37.2, 30.9, 38.6]}
data = pd.DataFrame(temperature_data)

In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   station_id   12 non-null     int64  
 1   date         12 non-null     object 
 2   temperature  12 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 416.0+ bytes


In [21]:
# We can check unique values per column using pd.unique()
pd.unique(data['temperature'])

array([36.5, 37.9, 34.3, 40.2, 39.8, 35.2, 41.7, 37.2, 30.9, 38.6])

In [22]:
data.duplicated()

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11    False
dtype: bool

In [23]:
data.index.duplicated()

array([False, False, False, False, False, False, False, False, False,
       False, False, False])
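The default output above hides the first copy of each duplicated row. As a minimal sketch, assuming the same sample data, the keep and subset parameters of duplicated() can be used to show every copy of a repeated row, or to compare only selected columns. The variable names all_copies and same_station_and_date are illustrative only.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet runs on its own.
data = pd.DataFrame({
    "station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "date": ["2022-07-13", "2022-07-13", "2022-07-13", "2022-07-13",
             "2022-07-13", "2022-07-14", "2022-07-14", "2022-07-14",
             "2022-07-13", "2022-07-15", "2022-07-15", "2022-07-15"],
    "temperature": [36.5, 37.9, 34.3, 40.2, 36.5, 39.8, 35.2, 41.7,
                    36.5, 37.2, 30.9, 38.6],
})

# keep=False marks every member of a duplicate group, including the first one.
all_copies = data[data.duplicated(keep=False)]

# subset restricts the comparison to the listed columns.
same_station_and_date = data.duplicated(subset=["station_id", "date"])

print(all_copies)
print(same_station_and_date.sum())
```

Filtering with keep=False is handy for inspecting duplicates before deciding which copy to keep.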

## Calculate duplicates
Using data.duplicated().sum() we can count the total number of duplicated rows in the data frame. The first occurrence of each row is not counted, only its repeats.

In [24]:
data.duplicated().sum()

2
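The sum works because True counts as 1 and False as 0. If we also want to know which rows repeat and how often, one possible approach (a sketch, not the only way) is to group by all columns and look at the group sizes. The names n_repeats, group_sizes and repeated are illustrative.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet runs on its own.
data = pd.DataFrame({
    "station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "date": ["2022-07-13", "2022-07-13", "2022-07-13", "2022-07-13",
             "2022-07-13", "2022-07-14", "2022-07-14", "2022-07-14",
             "2022-07-13", "2022-07-15", "2022-07-15", "2022-07-15"],
    "temperature": [36.5, 37.9, 34.3, 40.2, 36.5, 39.8, 35.2, 41.7,
                    36.5, 37.2, 30.9, 38.6],
})

# Number of repeated rows (first occurrences are not counted).
n_repeats = int(data.duplicated().sum())

# Number of occurrences of each distinct row.
group_sizes = data.groupby(list(data.columns)).size()

# Keep only the rows that occur more than once.
repeated = group_sizes[group_sizes > 1]

print(n_repeats)
print(repeated)
```

Here one distinct row occurs three times, which matches the two repeats reported by duplicated().sum().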

## Remove duplicates
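We can remove duplicated rows using data.drop_duplicates(), which returns a new data frame and leaves the original unchanged. An index has the same method:

data.index.drop_duplicates()

Note that this returns a deduplicated index. It does not remove any rows from the data frame.

By default, drop_duplicates() keeps the first occurrence of each duplicated row. If we want to keep the last one, we should pass the parameter keep='last'.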

In [25]:
data = data.drop_duplicates(keep='last')
data.head()

   station_id        date  temperature
1           2  2022-07-13         37.9
2           3  2022-07-13         34.3
3           4  2022-07-13         40.2
5           2  2022-07-14         39.8
6           3  2022-07-14         35.2


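Note that rows 0 and 4 are now missing: with keep='last', only the final copy of the duplicated row (index 8) is kept. The remaining rows keep their original index labels, so the index now has gaps.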
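Two follow-up steps often come up after removing duplicates: renumbering the index, and dropping rows whose index label is repeated. The sketch below shows one way to do each. The small data frame named indexed and its labels "a" and "b" are hypothetical and exist only to illustrate duplicated index labels.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet runs on its own.
data = pd.DataFrame({
    "station_id": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    "date": ["2022-07-13", "2022-07-13", "2022-07-13", "2022-07-13",
             "2022-07-13", "2022-07-14", "2022-07-14", "2022-07-14",
             "2022-07-13", "2022-07-15", "2022-07-15", "2022-07-15"],
    "temperature": [36.5, 37.9, 34.3, 40.2, 36.5, 39.8, 35.2, 41.7,
                    36.5, 37.2, 30.9, 38.6],
})

# Dropping duplicates leaves gaps in the index (labels 0 and 4 are gone).
deduped = data.drop_duplicates(keep="last")

# reset_index(drop=True) renumbers the rows from 0 without keeping the old labels.
renumbered = deduped.reset_index(drop=True)

# Hypothetical frame with a repeated index label.
indexed = pd.DataFrame({"temperature": [36.5, 37.9, 36.5]},
                       index=["a", "b", "a"])

# Index.drop_duplicates() returns only the unique labels, not a data frame.
unique_labels = indexed.index.drop_duplicates()

# To drop the rows themselves, filter with the negated duplicated() mask.
first_per_label = indexed[~indexed.index.duplicated(keep="first")]

print(renumbered.head())
print(first_per_label)
```

Filtering with the negated mask keeps the first row for each label. Passing keep="last" to duplicated() would keep the last one instead.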