# Handling duplicates


In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://github.com/alx2202/DataAnalysis/raw/main/Day13/emps.csv'
emps = pd.read_csv(url, delimiter=';', encoding='utf-8', index_col='employee_id', parse_dates=['hire_date'])
emps

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
102,Lex,De Haan,Administration Vice President,17000,2003-01-13,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
104,Bruce,Ernst,Programmer,6000,2001-05-21,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
...,...,...,...,...,...,...,...,...,...,...
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
205,Shelley,Higgins,Accounting Manager,12000,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


In [36]:
emps.dtypes

first_name                 object
last_name                  object
job_title                  object
salary                      int64
hire_date          datetime64[ns]
department_name            object
address                    object
postal_code                object
city                       object
country                    object
dtype: object

[`.count()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html?highlight=count#pandas.DataFrame.count) counts the non-missing values. We can use it on a DataFrame, where it returns one count per column, or on a Series, where it returns a single number.

In [3]:
emps.count()  # usage on a DataFrame

first_name         107
last_name          107
job_title          107
salary             107
hire_date          107
department_name    106
address            106
postal_code        105
city               106
country            106
dtype: int64

In [5]:
emps.city.count()  # usage on a Series

106
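
To make the relationship between `.count()` and missing values explicit, here is a minimal sketch on a made-up Series (the `cities` data is hypothetical and not taken from `emps`). It assumes only that `.count()` skips missing values while `len()` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two real values and two missing ones.
cities = pd.Series(['Seattle', np.nan, 'Oxford', None])

total = len(cities)             # all entries, missing or not
non_missing = cities.count()    # only the non-missing entries
missing = cities.isna().sum()   # only the missing entries

print(total, non_missing, missing)
```

The three numbers always satisfy `total == non_missing + missing`, which explains why `department_name` shows 106 while `first_name` shows 107 in the output above.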

[`.value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html?highlight=value_counts#pandas.DataFrame.value_counts) counts how many times each distinct value occurs. We can use it on a Series or a DataFrame. When applied to a DataFrame, each row is treated as one value, so the result shows how many times each row occurs: duplicated rows have a count greater than one.

In [7]:
emps.city.value_counts()

South San Francisco    45
Oxford                 34
Seattle                18
Southlake               5
Toronto                 2
London                  1
Munich                  1
Name: city, dtype: int64

In [9]:
emps.value_counts()  # in this particular case this operation is not very informative.

first_name  last_name   job_title                 salary  hire_date   department_name  address                                   postal_code  city                 country                 
Adam        Fripp       Stock Manager             8200    2007-04-10  Shipping         2011 Interiors Blvd                       99236        South San Francisco  United States of America    1
Karen       Partners    Sales Manager             13500   2007-01-05  Sales            Magdalen Centre, The Oxford Science Park  OX9 9ZB      Oxford               United Kingdom              1
Payam       Kaufling    Stock Manager             7900    2005-05-01  Shipping         2011 Interiors Blvd                       99236        South San Francisco  United States of America    1
Patrick     Sully       Sales Representative      9500    2006-03-04  Sales            Magdalen Centre, The Oxford Science Park  OX9 9ZB      Oxford               United Kingdom              1
Pat         Fay         Marketing Repres
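
Because every row of `emps` is unique, the counts above are all 1. The sketch below uses a small made-up DataFrame (`people` is hypothetical) that does contain a repeated row, to show how `.value_counts()` can point at duplicates.

```python
import pandas as pd

# Hypothetical toy data: the first and third rows are identical.
people = pd.DataFrame({
    'city': ['Seattle', 'Oxford', 'Seattle', 'Toronto'],
    'job_title': ['Programmer', 'Sales Manager', 'Programmer', 'Programmer'],
})

row_counts = people.value_counts()      # one entry per distinct row
repeated = row_counts[row_counts > 1]   # rows that occur more than once
print(repeated)
```

Only the `('Seattle', 'Programmer')` combination survives the filter, with a count of 2.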

So far our DataFrames have had a single-level index; for the employees, the index was `employee_id`. When we call `.value_counts()` on a multi-column slice of the original DataFrame, the result has a multi-level index with one level per selected column.

In [11]:
emps[['city', 'department_name', 'job_title']].value_counts()

city                 department_name   job_title                      
Oxford               Sales             Sales Representative               29
South San Francisco  Shipping          Stock Clerk                        20
                                       Shipping Clerk                     20
Seattle              Finance           Accountant                          5
                     Purchasing        Purchasing Clerk                    5
Oxford               Sales             Sales Manager                       5
Southlake            IT                Programmer                          5
South San Francisco  Shipping          Stock Manager                       5
Seattle              Executive         Administration Vice President       2
Toronto              Marketing         Marketing Manager                   1
Seattle              Purchasing        Purchasing Manager                  1
London               Human Resources   Human Resources Representative      1
Seatt
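
A multi-level index can be looked up with a tuple, or flattened back into ordinary columns. The following is a minimal sketch on made-up data (`staff` and the column name `n` are hypothetical choices for illustration).

```python
import pandas as pd

# Hypothetical toy data with one repeated (city, department_name) pair.
staff = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Oxford'],
    'department_name': ['IT', 'IT', 'Sales'],
})

counts = staff.value_counts()

print(counts.index.nlevels)            # number of index levels
print(counts.loc[('Seattle', 'IT')])   # look up one combination with a tuple

# Turn the index levels back into regular columns; the counts go to column 'n'.
flat = counts.reset_index(name='n')
print(flat)
```

Flattening with `.reset_index()` is often convenient when the counts need to be filtered or merged like any other DataFrame.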

We can sort a DataFrame or a Series either by its index or by its values. There are two methods for that:
- [`.sort_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html?highlight=sort_index) - sorts by the index
- [`.sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) - sorts by the values

We can pick the sorting algorithm with the `kind` argument ('quicksort', 'mergesort', 'heapsort', 'stable'). You can check [Wikipedia](https://en.wikipedia.org/wiki/Sorting_algorithm) for more information on sorting algorithms.
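
The choice of algorithm matters mostly when there are ties. The sketch below uses a made-up `scores` DataFrame (hypothetical) and passes `kind='stable'`, so rows with equal values keep their original relative order.

```python
import pandas as pd

# Hypothetical toy data: two rows share score 2 and two share score 1.
scores = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'score': [2, 1, 2, 1],
})

# A stable sort keeps tied rows in their original relative order.
stable_sorted = scores.sort_values('score', kind='stable')
print(stable_sorted['name'].tolist())  # ['Bob', 'Dee', 'Ann', 'Cid']
```

Bob stays ahead of Dee, and Ann ahead of Cid, because that was their order before sorting.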

In [13]:
emps_slice = emps[['city', 'department_name', 'job_title']].value_counts()
emps_slice.sort_values()

city                 department_name   job_title                      
Toronto              Marketing         Marketing Manager                   1
Seattle              Accounting        Public Accountant                   1
                     Administration    Administration Assistant            1
                     Executive         President                           1
Munich               Public Relations  Public Relations Representative     1
Seattle              Finance           Finance Manager                     1
London               Human Resources   Human Resources Representative      1
Seattle              Purchasing        Purchasing Manager                  1
                     Accounting        Accounting Manager                  1
Toronto              Marketing         Marketing Representative            1
Seattle              Executive         Administration Vice President       2
South San Francisco  Shipping          Stock Manager                       5
South

In [16]:
emps_slice.sort_values(ascending=False)

city                 department_name   job_title                      
Oxford               Sales             Sales Representative               29
South San Francisco  Shipping          Shipping Clerk                     20
                                       Stock Clerk                        20
Seattle              Finance           Accountant                          5
                     Purchasing        Purchasing Clerk                    5
Oxford               Sales             Sales Manager                       5
Southlake            IT                Programmer                          5
South San Francisco  Shipping          Stock Manager                       5
Seattle              Executive         Administration Vice President       2
                                       President                           1
                     Accounting        Accounting Manager                  1
                                       Public Accountant                   1
     

In [18]:
emps_slice

city                 department_name   job_title                      
Oxford               Sales             Sales Representative               29
South San Francisco  Shipping          Stock Clerk                        20
                                       Shipping Clerk                     20
Seattle              Finance           Accountant                          5
                     Purchasing        Purchasing Clerk                    5
Oxford               Sales             Sales Manager                       5
Southlake            IT                Programmer                          5
South San Francisco  Shipping          Stock Manager                       5
Seattle              Executive         Administration Vice President       2
Toronto              Marketing         Marketing Manager                   1
Seattle              Purchasing        Purchasing Manager                  1
London               Human Resources   Human Resources Representative      1
Seatt

In [19]:
emps_slice.sort_index()

city                 department_name   job_title                      
London               Human Resources   Human Resources Representative      1
Munich               Public Relations  Public Relations Representative     1
Oxford               Sales             Sales Manager                       5
                                       Sales Representative               29
Seattle              Accounting        Accounting Manager                  1
                                       Public Accountant                   1
                     Administration    Administration Assistant            1
                     Executive         Administration Vice President       2
                                       President                           1
                     Finance           Accountant                          5
                                       Finance Manager                     1
                     Purchasing        Purchasing Clerk                    5
     

*Duplicates* are rows that have the same values in all corresponding columns.

Most often we don't want duplicates in our data. To get rid of them we can use [`.drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html?highlight=drop_duplicates#pandas.DataFrame.drop_duplicates).

In [21]:
emps_selection = emps[['city', 'department_name', 'job_title']]
emps_selection

employee_id,city,department_name,job_title
100,Seattle,Executive,President
101,Seattle,Executive,Administration Vice President
102,Seattle,Executive,Administration Vice President
103,Southlake,IT,Programmer
104,Southlake,IT,Programmer
...,...,...,...
202,Toronto,Marketing,Marketing Representative
203,London,Human Resources,Human Resources Representative
204,Munich,Public Relations,Public Relations Representative
205,Seattle,Accounting,Accounting Manager


In [23]:
emps_selection.drop_duplicates()  # now I have unique rows only!

employee_id,city,department_name,job_title
100,Seattle,Executive,President
101,Seattle,Executive,Administration Vice President
103,Southlake,IT,Programmer
108,Seattle,Finance,Finance Manager
109,Seattle,Finance,Accountant
114,Seattle,Purchasing,Purchasing Manager
115,Seattle,Purchasing,Purchasing Clerk
120,South San Francisco,Shipping,Stock Manager
125,South San Francisco,Shipping,Stock Clerk
145,Oxford,Sales,Sales Manager


By default, `.drop_duplicates()` compares whole rows and removes the repeated ones. To consider only certain columns when deciding whether rows are duplicates, we can use the `subset` argument.

In [25]:
emps.drop_duplicates(subset='city')

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
120,Matthew,Weiss,Stock Manager,8000,2006-07-18,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany


In [28]:
emps.drop_duplicates(subset=['department_name', 'job_title'])

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
101,Neena,Kochhar,Administration Vice President,17000,1999-09-21,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
108,Nancy,Greenberg,Finance Manager,12000,2004-08-17,Finance,2004 Charade Rd,98199,Seattle,United States of America
109,Daniel,Faviet,Accountant,9000,2004-08-16,Finance,2004 Charade Rd,98199,Seattle,United States of America
114,Den,Raphaely,Purchasing Manager,11000,2004-12-07,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
115,Alexander,Khoo,Purchasing Clerk,3100,2005-05-18,Purchasing,2004 Charade Rd,98199,Seattle,United States of America
120,Matthew,Weiss,Stock Manager,8000,2006-07-18,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
125,Julia,Nayer,Stock Clerk,3200,2007-07-16,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom


In [30]:
# The employee with the highest salary in each city.
emps.sort_values('salary', ascending=False).drop_duplicates(subset='city')

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
121,Adam,Fripp,Stock Manager,8200,2007-04-10,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
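
The sort-then-drop pattern works because `.drop_duplicates()` keeps the first row it meets for each value, and after a descending sort that first row is the one with the highest salary. Here is a minimal sketch on made-up data (`pay` is hypothetical); it also shows that a missing city is treated as a value of its own, which is why employee 178 appears in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Dee has no city.
pay = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee', 'Eve'],
    'city': ['Seattle', 'Oxford', 'Seattle', np.nan, 'Oxford'],
    'salary': [5000, 7000, 9000, 4000, 6000],
})

# Highest salaries first, then keep the first row seen for each city.
top_per_city = (
    pay.sort_values('salary', ascending=False)
       .drop_duplicates(subset='city')
)
print(top_per_city['name'].tolist())  # ['Cid', 'Bob', 'Dee']

# Optionally leave out rows whose city is missing.
top_known_city = top_per_city.dropna(subset=['city'])
print(top_known_city['name'].tolist())  # ['Cid', 'Bob']
```

If two employees in one city share the top salary, only one of them is kept, so this pattern returns one row per city rather than all top earners.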


When dropping duplicates, we can decide which rows are kept by using the `keep` named argument. This argument accepts the following values:
- `'first'` - the first occurrence of each duplicated value is kept (the default)
- `'last'` - the last occurrence of each duplicated value is kept
- `False` - all duplicated rows are removed

In [31]:
emps.drop_duplicates(subset='city')

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
120,Matthew,Weiss,Stock Manager,8000,2006-07-18,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany


In [32]:
emps.drop_duplicates(subset='city', keep='first')  # no change, because 'first' is the default value of the keep argument

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
100,Steven,King,President,24000,1997-06-17,Executive,2004 Charade Rd,98199,Seattle,United States of America
103,Alexander,Hunold,Programmer,9000,2000-01-03,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
120,Matthew,Weiss,Stock Manager,8000,2006-07-18,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
145,John,Russell,Sales Manager,14000,2006-10-01,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
201,Michael,Hartstein,Marketing Manager,13000,2006-02-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany


In [33]:
emps.drop_duplicates(subset='city', keep='last')

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
107,Diana,Lorentz,Programmer,4200,2009-02-07,IT,2014 Jabberwocky Rd,26192,Southlake,United States of America
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
179,Charles,Johnson,Sales Representative,6200,2011-01-04,Sales,"Magdalen Centre, The Oxford Science Park",OX9 9ZB,Oxford,United Kingdom
199,Douglas,Grant,Shipping Clerk,2600,2010-01-13,Shipping,2011 Interiors Blvd,99236,South San Francisco,United States of America
202,Pat,Fay,Marketing Representative,6000,2007-08-17,Marketing,147 Spadina Ave,M5V 2L7,Toronto,Canada
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925,Munich,Germany
206,William,Gietz,Public Accountant,8300,2004-06-07,Accounting,2004 Charade Rd,98199,Seattle,United States of America


In [35]:
# With keep=False, only the employees who are the only one in their city remain.
emps.drop_duplicates(subset='city', keep=False)

employee_id,first_name,last_name,job_title,salary,hire_date,department_name,address,postal_code,city,country
178,Kimberely,Grant,Sales Representative,7000,2009-05-24,,,,,
203,Susan,Mavris,Human Resources Representative,6500,2004-06-07,Human Resources,8204 Arthur St,,London,United Kingdom
204,Hermann,Baer,Public Relations Representative,10000,2004-06-07,Public Relations,Schwanthalerstr. 7031,80925.0,Munich,Germany
