In [1]:
import pandas as pd

In [2]:
# read a dataset of movie reviewers (modifying the default parameter values for read_table)
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols, index_col='user_id')

In [3]:
users.head()

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [4]:
users.shape

(943, 4)

###### Scenario - We want to identify duplicate zip/post codes

For more info on .duplicated visit http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html

In [5]:
users.zip_code.duplicated().head()

user_id
1    False
2    False
3    False
4    False
5    False
Name: zip_code, dtype: bool

When we run this we get back a Series of Trues and Falses. A value is True if an identical entry appeared earlier in the Series, i.e. above it. In our dataset, user 29 comes back as True, which means they have the same zip code as someone earlier in the dataset.

The output does not tell us which earlier entry matched, only that this one duplicates something before it.
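
As a minimal sketch of this behaviour, assuming a small made-up Series (the `zips` values and ids below are hypothetical, not taken from the movie users file), we can look at the flags and then find the earlier match ourselves:

```python
import pandas as pd

# Hypothetical zip codes indexed by made-up user ids
zips = pd.Series(['85711', '94043', '85711', '32067', '94043'],
                 index=[1, 2, 3, 4, 5], name='zip_code')

# True marks a value that already appeared earlier in the Series
flags = zips.duplicated()
# flags.tolist() -> [False, False, True, False, True]

# duplicated() does not say which earlier row matched, so we look it up:
# take the flagged value at id 3 and find the first id that holds it
first_match = zips[zips == zips.loc[3]].index[0]
# first_match -> 1
```

The lookup is just an ordinary equality filter; nothing in duplicated() itself records the matching row.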

In [6]:
users.zip_code.duplicated().sum()

148

Because these are booleans, we can add them up to get the number of duplicates.

The Trues are converted to ones and the Falses to zeros. Pandas adds them all up and tells us that there are 148 duplicate zip codes.
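
A short sketch of the same arithmetic on a hypothetical boolean Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical result of a duplicated() call
flags = pd.Series([False, True, False, True, True])

n_true = flags.sum()   # True counts as 1, False as 0, so this is 3
share = flags.mean()   # fraction of True values, here 3 / 5
```

The mean is a handy companion to the sum when we want the proportion of duplicates rather than the count.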

###### Duplication in the dataframe as a whole rather than just a series

In [7]:
users.duplicated()

user_id
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
       ...  
914    False
915    False
916    False
917    False
918    False
919    False
920    False
921    False
922    False
923    False
924    False
925    False
926    False
927    False
928    False
929    False
930    False
931    False
932    False
933    False
934    False
935    False
936    False
937    False
938    False
939    False
940    False
941    False
942    False
943    False
Length: 943, dtype: bool

This outputs True if an entire row is identical to a previous row, i.e. one above it.

In [8]:
users.duplicated().sum()

7

This shows us that there are 7 rows in the dataframe which are duplicates, i.e. they have the same data as a previous row in the dataframe.
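
To make the difference between column-level and row-level duplicates concrete, here is a sketch on a hypothetical four-row dataframe (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical reviewers: id 3 matches id 1 in every column,
# id 4 shares only the age and zip code
toy = pd.DataFrame(
    {'age': [24, 53, 24, 24],
     'gender': ['M', 'F', 'M', 'F'],
     'zip_code': ['85711', '94043', '85711', '85711']},
    index=pd.Index([1, 2, 3, 4], name='user_id'))

row_dupes = toy.duplicated().sum()           # whole-row matches only -> 1
zip_dupes = toy.zip_code.duplicated().sum()  # matches in one column  -> 2
```

A row only counts at the dataframe level when every column matches, which is why the whole-row count is smaller than the single-column count.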

In [9]:
users.loc[users.duplicated(), :]

user_id,age,gender,occupation,zip_code
496,21,F,student,55414
572,51,M,educator,20003
621,17,M,student,60402
684,28,M,student,55414
733,44,F,other,60630
805,27,F,other,20009
890,32,M,student,97301


We can pass a Series of booleans to loc and it will show every row for which that Series is True. This allows us to see the seven rows that were identified by duplicated().
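
A minimal sketch of boolean selection with loc, again on a hypothetical dataframe (`toy` is made up), including the inverted mask for the rows that were not flagged:

```python
import pandas as pd

# Hypothetical data: id 3 repeats id 1
toy = pd.DataFrame(
    {'age': [24, 53, 24], 'zip_code': ['85711', '94043', '85711']},
    index=pd.Index([1, 2, 3], name='user_id'))

mask = toy.duplicated()        # boolean Series aligned with toy's index
dupes = toy.loc[mask, :]       # rows where the mask is True
non_dupes = toy.loc[~mask, :]  # ~ flips the mask, giving the remaining rows
```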

The keep parameter of duplicated()

keep='first' - This is the default setting for the keep parameter of duplicated(). If we add it to our code nothing changes, because this is what was run in the first place.

The logic for first - Mark duplicates as True except for the first occurrence, i.e. the first occurrence is kept and all later ones are identified as duplicates.

In [10]:
users.loc[users.duplicated(keep='last'), :]

user_id,age,gender,occupation,zip_code
67,17,M,student,60402
85,51,M,educator,20003
198,21,F,student,55414
350,32,M,student,97301
428,28,M,student,55414
437,27,F,other,20009
460,44,F,other,60630


When we change the keep option from first to last, we again see seven rows counted as duplicates, but they are the earlier occurrences: the last occurrence of each row is now the one being kept.
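
The two settings can be compared side by side on a hypothetical dataframe in which one row appears three times (`toy` and its ids are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: the 'student' row appears at ids 10, 30 and 40
toy = pd.DataFrame(
    {'age': [17, 51, 17, 17],
     'occupation': ['student', 'educator', 'student', 'student']},
    index=pd.Index([10, 20, 30, 40], name='user_id'))

first = toy.duplicated(keep='first')  # spares the first occurrence (id 10)
last = toy.duplicated(keep='last')    # spares the last occurrence (id 40)
# first.tolist() -> [False, False, True, True]
# last.tolist()  -> [True, False, True, False]
```

Note that the middle occurrence (id 30) is flagged under both settings, so the two results only contain different rows when every duplicated row appears exactly twice.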

In [11]:
users.loc[users.duplicated(keep=False), :]

user_id,age,gender,occupation,zip_code
67,17,M,student,60402
85,51,M,educator,20003
198,21,F,student,55414
350,32,M,student,97301
428,28,M,student,55414
437,27,F,other,20009
460,44,F,other,60630
496,21,F,student,55414
572,51,M,educator,20003
621,17,M,student,60402
684,28,M,student,55414
733,44,F,other,60630
805,27,F,other,20009
890,32,M,student,97301


This has the effect of marking all duplicates as True, so they ALL show up when we use the loc method.

We can now actually see which rows are duplicates of each other, e.g. rows 67 and 621 are identical. The caveat is that user_id is the index, which duplicated() does not compare, so these rows have different user_ids and may not actually be the same user.
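
One way to pair up the matching rows, sketched here on a hypothetical dataframe (`toy` and the grouping step are an illustration, not part of the original notebook), is to group the flagged rows by their column values and collect the ids in each group:

```python
import pandas as pd

# Hypothetical data: ids 1 and 4 match, ids 2 and 5 match
toy = pd.DataFrame(
    {'age': [17, 51, 30, 17, 51],
     'occupation': ['student', 'educator', 'writer', 'student', 'educator']},
    index=pd.Index([1, 2, 3, 4, 5], name='user_id'))

# keep=False flags every member of a duplicate group
all_dupes = toy.loc[toy.duplicated(keep=False), :]

# Move user_id into a column, then list the ids that share identical values
matches = (all_dupes.reset_index()
           .groupby(list(toy.columns))['user_id']
           .apply(list))
```

Each entry of `matches` is the list of user_ids whose rows are identical, which saves scanning the output by eye.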

###### Scenario - We want to drop the duplicates from the dataframe

In [12]:
users.drop_duplicates().shape

(936, 4)

This drops the duplicates, and the shape above shows that we have lost those seven rows (943 - 7 = 936).

N.B. drop_duplicates accepts inplace=True if you want to modify the dataframe directly but, by default, it returns a new dataframe and leaves the original unchanged.

If we changed the keep behaviour here from first (the default) to last, it would still drop seven rows, just the other seven.

If we changed it to keep=False, it would drop all fourteen duplicates and we would lose 14 rows.
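
The three keep settings can be checked on a hypothetical four-row dataframe (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical data: id 3 repeats id 1
toy = pd.DataFrame(
    {'age': [24, 53, 24, 33],
     'zip_code': ['85711', '94043', '85711', '15213']},
    index=pd.Index([1, 2, 3, 4], name='user_id'))

kept_first = toy.drop_duplicates()             # drops id 3
kept_last = toy.drop_duplicates(keep='last')   # drops id 1
kept_none = toy.drop_duplicates(keep=False)    # drops both id 1 and id 3

# Without inplace=True the original dataframe still has all four rows
original_rows = len(toy)
```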

###### Bonus Tip - What if you only wanted to consider certain columns when identifying duplicates?

In [13]:
users.duplicated(subset=['age', 'zip_code']).sum()

16

If we tell duplicated to consider only age and zip_code, rather than every column, it flags 16 rows as duplicates.
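
A sketch of how subset changes the result, using a hypothetical dataframe (`toy` is an assumption) where two rows agree on age and zip_code but differ in gender:

```python
import pandas as pd

# Hypothetical data: ids 1 and 2 differ only in gender
toy = pd.DataFrame(
    {'age': [24, 24, 30],
     'gender': ['M', 'F', 'M'],
     'zip_code': ['85711', '85711', '85711']},
    index=pd.Index([1, 2, 3], name='user_id'))

full_dupes = toy.duplicated().sum()                              # 0
subset_dupes = toy.duplicated(subset=['age', 'zip_code']).sum()  # 1

# drop_duplicates takes the same subset argument
deduped = toy.drop_duplicates(subset=['age', 'zip_code'])        # keeps ids 1 and 3
```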

In [14]:
users.drop_duplicates(subset=['age', 'zip_code']) # Can also be used with this parameter

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,05201
9,29,M,student,01002
10,53,M,lawyer,90703
