<a href="https://colab.research.google.com/github/axel-sirota/manage-data-pandas/blob/main/module4/ManageDataPandas_Mod4Demo1_FindDuplicates.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finding duplicates



## Prep

In the last series of demos we worked with `NaN`s; now we turn to another troublesome issue in datasets: duplicated data. For this we will use a slightly modified version of the drinks dataset in which rows have been randomly duplicated.

In [None]:
%%writefile get_data.sh
if [ ! -f drinks_duplicated.csv ]; then
  wget -O drinks_duplicated.csv https://raw.githubusercontent.com/axel-sirota/manage-data-pandas/main/data/drinks_duplicated.csv
fi

Writing get_data.sh


In [None]:
!bash get_data.sh

--2023-04-24 20:01:24--  https://raw.githubusercontent.com/axel-sirota/normalise-data-pandas/main/data/drinks_duplicated.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8292 (8.1K) [text/plain]
Saving to: ‘drinks_duplicated.csv’


2023-04-24 20:01:25 (64.8 MB/s) - ‘drinks_duplicated.csv’ saved [8292/8292]



In [None]:
import numpy as np
import pandas as pd

drinks_duplicated = pd.read_csv('drinks_duplicated.csv')
drinks_duplicated

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Malta,149,100,120,6.6,EU
1,Slovakia,196,293,116,11.4,EU
2,Brunei,31,2,1,0.6,AS
3,Cameroon,147,1,4,5.8,AF
4,Bahamas,122,176,51,6.3,
...,...,...,...,...,...,...
324,Denmark,224,81,278,10.4,EU
325,Mexico,238,68,5,5.5,
326,Bolivia,167,41,8,3.8,SA
327,Brazil,245,145,16,7.2,SA


## Finding duplicates

The number of rows already looks odd for a dataset that should have one row per country, which suggests there is duplicated data. A good way to confirm this is to index by a column that should be unique and count how often each label appears. Let's index by country, which should be unique in this dataset.

In [None]:
drinks_duplicated.set_index('country', inplace=True)
drinks_duplicated

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Malta,149,100,120,6.6,EU
Slovakia,196,293,116,11.4,EU
Brunei,31,2,1,0.6,AS
Cameroon,147,1,4,5.8,AF
Bahamas,122,176,51,6.3,
...,...,...,...,...,...
Denmark,224,81,278,10.4,EU
Mexico,238,68,5,5.5,
Bolivia,167,41,8,3.8,SA
Brazil,245,145,16,7.2,SA


In [None]:
drinks_duplicated.index.value_counts()

Liberia        5
El Salvador    5
Indonesia      5
Vietnam        5
Vanuatu        5
              ..
Kenya          1
San Marino     1
Greece         1
Barbados       1
Malawi         1
Name: country, Length: 193, dtype: int64
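
The same check can be tried on a small frame where the answer is easy to verify by eye. The sketch below uses a hypothetical `toy` DataFrame (not the drinks dataset) and adds `Index.is_unique` as a quick yes/no test before counting labels with `value_counts`.

```python
import pandas as pd

# Hypothetical toy data for illustration only; this is not the drinks dataset.
toy = pd.DataFrame({
    "country": ["Malta", "Brazil", "Malta", "Kenya", "Brazil", "Malta"],
    "beer_servings": [149, 245, 149, 58, 245, 149],
})

indexed = toy.set_index("country")

# False means at least one index label appears more than once.
key_is_unique = indexed.index.is_unique

# Count rows per label, then keep only the labels that repeat.
counts = indexed.index.value_counts()
repeated = counts[counts > 1]

print(key_is_unique)
print(repeated)
```

Filtering the counts with `counts > 1` lists only the offending labels, which is easier to read than the full `value_counts` output when most labels are unique.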

We can indeed see that some countries have multiple rows. What can we do? Let's reset the index so that every row has a unique label again, and then look at the `duplicated` method.

In [None]:
drinks_duplicated.reset_index(inplace=True)

In [None]:
drinks_duplicated

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Malta,149,100,120,6.6,EU
1,Slovakia,196,293,116,11.4,EU
2,Brunei,31,2,1,0.6,AS
3,Cameroon,147,1,4,5.8,AF
4,Bahamas,122,176,51,6.3,
...,...,...,...,...,...,...
324,Denmark,224,81,278,10.4,EU
325,Mexico,238,68,5,5.5,
326,Bolivia,167,41,8,3.8,SA
327,Brazil,245,145,16,7.2,SA


In [None]:
drinks_duplicated.duplicated()   # The first occurrence is marked False; later repeats are marked True

0      False
1      False
2      False
3      False
4      False
       ...  
324    False
325    False
326    False
327     True
328    False
Length: 329, dtype: bool

Now we can use this boolean Series as a mask to filter the rows!

In [None]:
drinks_duplicated[~drinks_duplicated.duplicated()]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Malta,149,100,120,6.6,EU
1,Slovakia,196,293,116,11.4,EU
2,Brunei,31,2,1,0.6,AS
3,Cameroon,147,1,4,5.8,AF
4,Bahamas,122,176,51,6.3,
...,...,...,...,...,...,...
319,Morocco,12,6,10,0.5,AF
324,Denmark,224,81,278,10.4,EU
325,Mexico,238,68,5,5.5,
326,Bolivia,167,41,8,3.8,SA


Why does this make sense? Because `duplicated` marks a row as True (i.e. duplicated) when an identical row has already appeared earlier, the rows we want to keep are the ones marked False. So we negate the mask with `~` and that's it!
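
To make the negation concrete, here is a minimal sketch on a hypothetical five-row `toy` frame (an assumption for illustration, not the drinks data). It shows the mask that `duplicated` returns with its default `keep="first"` and the rows that survive after applying `~`.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 4 repeat rows 0 and 1 exactly.
toy = pd.DataFrame({
    "country": ["Malta", "Brazil", "Malta", "Kenya", "Brazil"],
    "beer_servings": [149, 245, 149, 58, 245],
})

# True marks a row that is identical to one seen earlier in the frame.
mask = toy.duplicated()

# Negating the mask keeps the first occurrence of every distinct row.
deduped = toy[~mask]

print(mask)
print(deduped)
```

Note that the surviving rows keep their original index labels, which is why the filtered output in the demo skips some row numbers.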

Notice something very important: if we set `keep=False`, every member of a duplicated group is marked True, including the first occurrence. After negating, we are left only with the rows that had NO duplicates at all, which is sometimes useful in lab work.

In [None]:
drinks_duplicated[~drinks_duplicated.duplicated(keep=False)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Slovakia,196,293,116,11.4,EU
3,Cameroon,147,1,4,5.8,AF
4,Bahamas,122,176,51,6.3,
10,Seychelles,157,25,51,4.1,AF
11,Russian Federation,247,326,73,11.5,AS
...,...,...,...,...,...,...
319,Morocco,12,6,10,0.5,AF
324,Denmark,224,81,278,10.4,EU
325,Mexico,238,68,5,5.5,
326,Bolivia,167,41,8,3.8,SA
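
The effect of the `keep` parameter is easiest to see side by side. The following sketch again uses a hypothetical `toy` frame (not the drinks data) and compares the three accepted values: `"first"`, `"last"` and `False`.

```python
import pandas as pd

# Hypothetical toy data: Malta and Brazil appear twice, Kenya once.
toy = pd.DataFrame({
    "country": ["Malta", "Brazil", "Malta", "Kenya", "Brazil"],
    "beer_servings": [149, 245, 149, 58, 245],
})

first = toy.duplicated(keep="first")  # later repeats are flagged
last = toy.duplicated(keep="last")    # earlier repeats are flagged
every = toy.duplicated(keep=False)    # all members of a repeated group are flagged

# Rows that were never repeated anywhere in the frame.
never_repeated = toy[~every]

print(pd.DataFrame({"first": first, "last": last, "keep_false": every}))
print(never_repeated)
```

With `keep=False` only the Kenya row survives the negated filter, because it is the only row in this toy frame that has no identical twin.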
