# Remove Duplicates
In this short tutorial I show how to remove duplicates from a dataframe using the `drop_duplicates()` method provided by the `pandas` library.
Duplicate removal is a data preprocessing technique. Data preprocessing also includes:
* handling missing values
* standardization
* normalization
* formatting
* binning.

## Data import
First, I import the Python `pandas` library and then read the CSV file with the `read_csv()` function.

In [1]:
import pandas as pd

df = pd.read_csv('cupcake_duplicates.csv')
df.head()

Unnamed: 0,Mese,Cupcake
0,2004-01,5
1,2004-01,5
2,2004-01,5
3,2004-02,5
4,2004-03,4
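
If the CSV file is not at hand, the same steps can be tried on a small in-memory sample. The sketch below is only an illustration: `csv_text` and `sample` are hypothetical names, and the rows merely mimic the first records shown above rather than the real content of `cupcake_duplicates.csv`.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the first rows shown above;
# it is NOT the real content of cupcake_duplicates.csv.
csv_text = """Mese,Cupcake
2004-01,5
2004-01,5
2004-01,5
2004-02,5
2004-03,4
"""

# read_csv() accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```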


Now I check the number of records contained in the dataframe. I use the `shape` attribute, which returns a tuple with the number of rows and the number of columns.

In [3]:
df.shape

(210, 2)
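
Since `shape` is a plain tuple, it can be unpacked directly. A minimal sketch on a hypothetical toy dataframe (`toy` is not the tutorial's dataset):

```python
import pandas as pd

# Hypothetical toy frame, not the tutorial's dataset.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-02", "2004-03"],
    "Cupcake": [5, 5, 4],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)       # 3 2
print(len(toy))             # len() gives the number of rows only: 3
```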

## Check for the presence of duplicates
To check whether a record is duplicated, I can use the `duplicated()` method. By default it returns `True` for every row that repeats an earlier row and `False` otherwise, so the first occurrence of each record is marked `False`.

In [5]:
df.duplicated()

0      False
1       True
2       True
3      False
4      False
       ...  
205    False
206    False
207    False
208    False
209    False
Length: 210, dtype: bool
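
Which occurrences are flagged depends on the `keep` parameter of `duplicated()`. The following sketch compares its three values on a hypothetical four-row dataframe (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: the first three rows are identical.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-01", "2004-01", "2004-02"],
    "Cupcake": [5, 5, 5, 5],
})

first = toy.duplicated()                 # keep="first" is the default
last = toy.duplicated(keep="last")       # the last occurrence is not flagged
all_marked = toy.duplicated(keep=False)  # every occurrence is flagged

print(first.tolist())       # [False, True, True, False]
print(last.tolist())        # [True, True, False, False]
print(all_marked.tolist())  # [True, True, True, False]
```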

I can also apply `duplicated()` to a subset of the columns of the dataframe. In this case, I use the `subset` parameter, which takes the list of columns to compare.

In [19]:
df.duplicated(subset=['Mese'])

0      False
1       True
2       True
3      False
4      False
       ...  
205    False
206    False
207    False
208    False
209    False
Length: 210, dtype: bool
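
The two calls can give different results when rows share a month but differ in another column. A small sketch, assuming a hypothetical dataframe `toy` built only for this comparison:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 share the month but not the value.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-01", "2004-02"],
    "Cupcake": [5, 6, 5],
})

full = toy.duplicated()                     # compares all columns
by_month = toy.duplicated(subset=["Mese"])  # compares only "Mese"

print(full.tolist())      # [False, False, False]
print(by_month.tolist())  # [False, True, False]
```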

Now I can calculate the number of duplicates by summing the `True` values.

In [6]:
df.duplicated().sum()

6
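
Besides counting, the boolean mask can be used to inspect the duplicated rows themselves. The sketch below uses a hypothetical dataframe `toy`; passing `keep=False` is one possible way to see every copy, including the first occurrence.

```python
import pandas as pd

# Hypothetical toy frame with two groups of repeated rows.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-01", "2004-01", "2004-02", "2004-03", "2004-03"],
    "Cupcake": [5, 5, 5, 5, 4, 4],
})

# Rows that repeat an earlier row (first occurrences are not counted).
n_extra = toy.duplicated().sum()
print(n_extra)  # 3

# Every row that belongs to a group of duplicates.
mask = toy.duplicated(keep=False)
dupes = toy[mask]
print(dupes)
```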

## Drop duplicates
Now I can drop duplicates through the `drop_duplicates()` method. I can use different strategies:
* drop duplicates on the basis of all the columns
* drop duplicates on the basis of some columns

In both strategies, I can decide which copy of the duplicated rows to keep, if any. This is done through the `keep` parameter of `drop_duplicates()`: `'first'` (the default) keeps the first occurrence, `'last'` keeps the last one, and `False` drops every copy.

In [9]:
df1 = df.drop_duplicates()

In [10]:
df1.shape

(204, 2)

In [11]:
df1.head()

Unnamed: 0,Mese,Cupcake
0,2004-01,5
3,2004-02,5
4,2004-03,4
5,2004-04,6
6,2004-05,5


Drop the first occurrence as well: with `keep=False`, every row that has a duplicate is removed.

In [14]:
df2 = df.drop_duplicates(keep=False)
df2.head()

Unnamed: 0,Mese,Cupcake
3,2004-02,5
4,2004-03,4
5,2004-04,6
6,2004-05,5
7,2004-06,6


In [15]:
df2.shape

(201, 2)
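
The effect of `keep` is easier to see on a tiny example. The sketch below uses a hypothetical dataframe `toy` and also shows that `keep=False` gives the same result as filtering with the mask returned by `duplicated(keep=False)`.

```python
import pandas as pd

# Hypothetical toy frame: the first three rows are identical.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-01", "2004-01", "2004-02"],
    "Cupcake": [5, 5, 5, 5],
})

keep_first = toy.drop_duplicates()             # default: keep="first"
keep_last = toy.drop_duplicates(keep="last")
keep_none = toy.drop_duplicates(keep=False)

print(keep_first.index.tolist())  # [0, 3]
print(keep_last.index.tolist())   # [2, 3]
print(keep_none.index.tolist())   # [3]

# Same result as keep=False, written with a boolean mask.
via_mask = toy[~toy.duplicated(keep=False)]
print(via_mask.equals(keep_none))  # True
```

In every case `toy` itself is left unchanged, because `drop_duplicates()` returns a new dataframe by default.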

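Drop duplicates on the basis of a subset of columns. Here only the `Cupcake` column is compared, so only the first row with each distinct value is kept.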

In [17]:
df3 = df.drop_duplicates(subset=["Cupcake"])
df3.shape

(78, 2)

In [18]:
df3.head()

Unnamed: 0,Mese,Cupcake
0,2004-01,5
4,2004-03,4
5,2004-04,6
11,2004-10,10
12,2004-11,7
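
The `subset` and `keep` parameters can be combined, and `ignore_index=True` (available in pandas 1.0 and later) renumbers the rows of the result. A minimal sketch on a hypothetical dataframe `toy`:

```python
import pandas as pd

# Hypothetical toy frame: the value 5 appears in two different months.
toy = pd.DataFrame({
    "Mese": ["2004-01", "2004-02", "2004-03", "2004-04"],
    "Cupcake": [5, 5, 4, 6],
})

# Keep the first row for each distinct "Cupcake" value.
first_per_value = toy.drop_duplicates(subset=["Cupcake"])
print(first_per_value["Mese"].tolist())  # ['2004-01', '2004-03', '2004-04']

# Keep the last row instead and renumber the index from 0.
last_per_value = toy.drop_duplicates(
    subset=["Cupcake"], keep="last", ignore_index=True
)
print(last_per_value["Mese"].tolist())   # ['2004-02', '2004-03', '2004-04']
print(last_per_value.index.tolist())     # [0, 1, 2]
```

Which row survives matters here: the months attached to the dropped rows are lost, so the choice of `keep` should follow from what the remaining columns are needed for.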
