**Duplicate Values**
> Duplicate rows can bias an analysis. Repeated entries inflate counts and totals and pull averages toward the repeated values, suggesting patterns or trends that are not really there. The problem is sharper in machine learning: a model treats duplicated rows as independent observations, gives them extra weight, and may overfit, so it generalizes worse to new data. Removing duplicates leaves one row per unique instance, which makes the results more reliable and valid.
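
To make the inflation effect concrete, here is a minimal sketch on a hypothetical toy frame (`toy`, its columns, and its values are made up for illustration and are not the notebook's dataset). It compares the mean and a frequency count with and without the repeated rows.

```python
import pandas as pd

# Hypothetical toy data: the (Blue, 50) row was entered three times.
toy = pd.DataFrame({
    "color": ["Blue", "Blue", "Blue", "Red", "Green"],
    "score": [50, 50, 50, 20, 35],
})

# The mean is pulled toward the repeated value while duplicates are present.
mean_with_dupes = toy["score"].mean()
mean_without_dupes = toy.drop_duplicates()["score"].mean()

# Frequencies are inflated as well: "Blue" is counted once per repeated row.
freq_with_dupes = toy["color"].value_counts()["Blue"]
freq_without_dupes = toy.drop_duplicates()["color"].value_counts()["Blue"]

print(mean_with_dupes, mean_without_dupes)
print(freq_with_dupes, freq_without_dupes)
```

Whether the repeated rows are errors or genuine repeated measurements depends on how the data was collected, so this comparison is a diagnostic rather than a rule.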

**Detecting Duplicate Observations**
> `df.info()` &emsp; `df.duplicated()` &emsp; `df.duplicated().sum()` &emsp; `df[df.duplicated()]`
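
As a rough illustration of what each detection call returns, the sketch below runs them on a hypothetical four-row frame (`demo` and its values are assumptions, not the notebook's data).

```python
import pandas as pd

# Hypothetical frame: row 2 repeats row 0 exactly.
demo = pd.DataFrame({
    "D1": ["Blue", "Red", "Blue", "Blue"],
    "D2": [50, 17, 50, 27],
})

mask = demo.duplicated()   # boolean Series, True where a row repeats an earlier row
n_dupes = mask.sum()       # True counts as 1, so the sum is the number of flagged rows
dupe_rows = demo[mask]     # boolean indexing returns the flagged rows themselves

print(mask.tolist())
print(n_dupes)
print(dupe_rows)
```

Note that row 3 shares `D1` with row 0 but not `D2`, so it is not flagged: by default every column must match.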


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

In [1]:
import pandas as pd

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Detect Duplicate Values </p>

In [2]:
data = pd.read_excel('data/duplicated_values.xlsx')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   D1      21 non-null     object
 1   D2      21 non-null     int64 
 2   D3      21 non-null     int64 
 3   D4      21 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 804.0+ bytes


In [3]:
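# data.duplicated() returns a boolean Series (True for each row that repeats an earlier row)
# data.duplicated()
data.duplicated().sum()  # number of rows flagged as duplicates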

4

In [4]:
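# Rows that repeat an earlier row (default keep='first': the first occurrence is not flagged)
data[data.duplicated()]

# Flag every occurrence except the last one
# data[data.duplicated(keep='last')]

# Flag every occurrence of a duplicated row
# data[data.duplicated(keep=False)]

# Compare only the listed columns when looking for duplicates
# data[data.duplicated(subset='D2')]
# data[data.duplicated(subset=['D2', 'D3'])]
# data[data.duplicated(subset=['D2', 'D4'])]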

,D1,D2,D3,D4
3,Blue,50,69,37
6,Purple,56,21,66
14,Blue,50,69,37
15,Blue,50,69,37
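
The `keep` and `subset` options change which rows get flagged. The sketch below shows the difference on a hypothetical six-row frame (`demo` and its values are assumptions chosen for illustration, not the notebook's dataset).

```python
import pandas as pd

# Hypothetical frame: rows 0/4 and rows 1/3 are exact duplicates of each other.
demo = pd.DataFrame({
    "D1": ["Purple", "Blue", "Green", "Blue", "Purple", "Blue"],
    "D2": [56, 50, 40, 50, 56, 50],
    "D3": [21, 69, 23, 69, 21, 10],
})

# keep='first' (default): later repeats are flagged, the first occurrence is not.
first = demo[demo.duplicated()].index.tolist()

# keep='last': earlier repeats are flagged, the last occurrence is not.
last = demo[demo.duplicated(keep="last")].index.tolist()

# keep=False: every member of a duplicated group is flagged.
every = demo[demo.duplicated(keep=False)].index.tolist()

# subset: only the listed columns are compared, so more rows may match.
by_d2 = demo[demo.duplicated(subset="D2")].index.tolist()

print(first, last, every, by_d2)
```

Row 5 is flagged only in the `subset="D2"` case, because it matches earlier rows on `D2` but differs on `D3`.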


In [5]:
# Index labels of every occurrence except the last one in each duplicated group
indexes = data[data.duplicated(keep='last')].index
print(indexes)

Index([0, 1, 3, 14], dtype='int64')


In [6]:
display(data[data.duplicated(keep=False)].value_counts().to_frame())

D1,D2,D3,D4,count
Blue,50,69,37,4
Purple,56,21,66,2
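
Group sizes can also be counted with `groupby(...).size()`, which may be handy when only groups above some size are of interest. This is an alternative sketch, not the notebook's method; `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: (Blue, 50) appears three times, (Purple, 56) twice.
demo = pd.DataFrame({
    "D1": ["Purple", "Blue", "Green", "Blue", "Purple", "Blue"],
    "D2": [56, 50, 40, 50, 56, 50],
})

# Group on all columns so each distinct row forms one group, then count its members.
counts = demo.groupby(list(demo.columns)).size()

# Keep only rows that occur more than once, largest groups first.
repeated = counts[counts > 1].sort_values(ascending=False)

print(repeated)
```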


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Remove Duplicate Values  </p>

In [7]:
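df = data.copy()  # work on a copy so the original data stays intact

print(data.duplicated().sum())  # number of duplicate rows before removal

# Uncomment one option below; with all of them commented out, df is unchanged.

# Keep the first occurrence of each duplicated row (default)
# df.drop_duplicates(inplace=True)

# Keep the last occurrence
# df.drop_duplicates(keep='last', inplace=True)

# Drop every occurrence of a duplicated row
# df.drop_duplicates(keep=False, inplace=True)

# Compare only the listed columns
# df.drop_duplicates(subset='D2', inplace=True)
# df.drop_duplicates(subset=['D2', 'D3'], inplace=True)
# df.drop_duplicates(subset=['D2', 'D4'], inplace=True)

# Drop by index label; with the keep='last' labels from In [5], the last occurrences remain
# df.drop(index=indexes, inplace=True)

print(df.shape)
display(df)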

4
(21, 4)


,D1,D2,D3,D4
0,Purple,56,21,66
1,Blue,50,69,37
2,Green,40,23,40
3,Blue,50,69,37
4,Red,17,62,43
5,Yellow,44,59,73
6,Purple,56,21,66
7,Purple,53,44,34
8,Blue,27,22,79
9,Blue,63,10,16
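
The removal step can also be written without `inplace=True` and checked afterwards. The sketch below assumes a hypothetical six-row frame (`demo`, not the notebook's dataset) and also checks that dropping by index labels collected with `keep='last'` gives the same rows as `drop_duplicates(keep='last')`.

```python
import pandas as pd

# Hypothetical frame: rows 0/4 and rows 1/3 are exact duplicates of each other.
demo = pd.DataFrame({
    "D1": ["Purple", "Blue", "Green", "Blue", "Purple", "Blue"],
    "D2": [56, 50, 40, 50, 56, 50],
    "D3": [21, 69, 23, 69, 21, 10],
})

# Returns a new frame; reset_index(drop=True) renumbers the surviving rows from 0.
deduped = demo.drop_duplicates().reset_index(drop=True)

# Verify that no duplicates remain.
no_dupes_left = not deduped.duplicated().any()

# Dropping the labels flagged with keep='last' leaves the last occurrence of each group.
labels = demo[demo.duplicated(keep="last")].index
by_labels = demo.drop(index=labels)
same_as_keep_last = by_labels.equals(demo.drop_duplicates(keep="last"))

print(deduped.shape, no_dupes_left, same_as_keep_last)
```

Assigning the result to a new name keeps the original frame available for comparison, which is the same reason the notebook works on `data.copy()`.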


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>