# Duplicate elements in DataFrames

## Setup

In [1]:
import pandas as pd

## Creation

Create an example DataFrame from a dictionary of dictionaries (outer keys become columns, inner keys become the index):

In [2]:
data = {
    "Monarch": {
        "Spain": "Felipe VI",
        "Colombia": None,
        "France": None,
        "Canada": "Elizabeth II",
        "Italy": None,
        "Germany": None,
        "Austria": None,
        "Norway": "Harald V",
    },
    "Language": {
        "Spain": "Spanish",
        "Colombia": "Spanish",
        "France": "French",
        "Canada": "French",
        "Italy": "Italian",
        "Germany": "German",
        "Austria": "German",
        "Norway": "Norwegian",
    },
    "Currency": {
        "Spain": "EUR",
        "Colombia": "COP",
        "France": "EUR",
        "Canada": "CAD",
        "Italy": "EUR",
        "Germany": "EUR",
        "Austria": "EUR",
        "Norway": "NOK",
    },
}

In [3]:
# These dtype conversions are not the focus here; they store each column as pandas' nullable string type:
df = pd.DataFrame(data)
df["Monarch"] = df["Monarch"].astype("string")
df["Language"] = df["Language"].astype("string")
df["Currency"] = df["Currency"].astype("string")
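
As a hedged aside, the three conversions above can likely be collapsed into one call, since `DataFrame.astype` also accepts a single dtype that is applied to every column. The sketch below uses a hypothetical two-row frame named `small` (not part of the notebook) to show that `None` becomes a proper missing value after the conversion.

```python
import pandas as pd

# Hypothetical miniature of the example data, for illustration only.
small = pd.DataFrame(
    {"Monarch": ["Felipe VI", None], "Currency": ["EUR", "COP"]},
    index=["Spain", "Colombia"],
)

# One call converts every column to the nullable string dtype.
converted = small.astype("string")

# The None entry is now reported as missing.
print(converted.dtypes)
print(converted["Monarch"].isna())
```

This only works as a shortcut when every column should end up with the same dtype, which is the case in this notebook.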

## Demo 1: Dropping duplicates

In [4]:
df

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
France,,French,EUR
Canada,Elizabeth II,French,CAD
Italy,,Italian,EUR
Germany,,German,EUR
Austria,,German,EUR
Norway,Harald V,Norwegian,NOK


Drop duplicate rows:

In [5]:
df.drop_duplicates()

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
France,,French,EUR
Canada,Elizabeth II,French,CAD
Italy,,Italian,EUR
Germany,,German,EUR
Norway,Harald V,Norwegian,NOK


Drop duplicate rows, based on a specific column:

In [6]:
df.drop_duplicates("Currency")

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
Canada,Elizabeth II,French,CAD
Norway,Harald V,Norwegian,NOK


In [7]:
df.drop_duplicates("Language")

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
France,,French,EUR
Italy,,Italian,EUR
Germany,,German,EUR
Norway,Harald V,Norwegian,NOK


Drop duplicate rows, based on specific columns:

In [8]:
df.drop_duplicates(["Language", "Currency"])

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
France,,French,EUR
Canada,Elizabeth II,French,CAD
Italy,,Italian,EUR
Germany,,German,EUR
Norway,Harald V,Norwegian,NOK


Drop duplicate rows, based on specific columns, and keep the last occurrence instead of the first one:

In [9]:
df.drop_duplicates(["Language", "Currency"], keep="last")

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
France,,French,EUR
Canada,Elizabeth II,French,CAD
Italy,,Italian,EUR
Austria,,German,EUR
Norway,Harald V,Norwegian,NOK
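
To inspect which rows `drop_duplicates()` would remove before actually removing them, `DataFrame.duplicated()` returns a boolean mask and accepts the same `subset` and `keep` arguments. The following is a minimal sketch on a reduced copy of the demo data; the names `demo`, `mask`, `deduplicated`, and `no_repeats` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Reduced copy of the demo data (Language and Currency columns only).
demo = pd.DataFrame(
    {
        "Language": ["Spanish", "Spanish", "French", "French",
                     "Italian", "German", "German", "Norwegian"],
        "Currency": ["EUR", "COP", "EUR", "CAD", "EUR", "EUR", "EUR", "NOK"],
    },
    index=["Spain", "Colombia", "France", "Canada",
           "Italy", "Germany", "Austria", "Norway"],
)

# True marks a row that repeats an earlier row.
mask = demo.duplicated()
print(mask)

# Filtering with the inverted mask matches drop_duplicates().
deduplicated = demo[~mask]

# keep=False drops every member of a duplicated group, keeping none of them.
no_repeats = demo.drop_duplicates(keep=False)
print(no_repeats)
```

With `keep=False`, both Germany and Austria disappear, which can be useful when repeated rows are considered unreliable rather than redundant.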


## Exercise 1

In [10]:
students = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob", "Bob", "Clara", "Clara", "David", "David"],
        "Test": [1, 2, 1, 2, 1, 2, 1, 2],
        "Score": [10, 10, 9, 8, 10, 2, None, 7],
        "Result": ["Pass", "Pass", "Pass", "Pass", "Pass", "Fail", None, "Pass"],
    }
)

In [11]:
students

Unnamed: 0,Name,Test,Score,Result
0,Alice,1,10.0,Pass
1,Alice,2,10.0,Pass
2,Bob,1,9.0,Pass
3,Bob,2,8.0,Pass
4,Clara,1,10.0,Pass
5,Clara,2,2.0,Fail
6,David,1,,
7,David,2,7.0,Pass


Drop duplicate rows in the `students` DataFrame:

In [12]:
students.drop_duplicates()

Unnamed: 0,Name,Test,Score,Result
0,Alice,1,10.0,Pass
1,Alice,2,10.0,Pass
2,Bob,1,9.0,Pass
3,Bob,2,8.0,Pass
4,Clara,1,10.0,Pass
5,Clara,2,2.0,Fail
6,David,1,,
7,David,2,7.0,Pass


Drop duplicate rows in the `students` DataFrame, based on the "Name" column:

In [13]:
students.drop_duplicates("Name")

Unnamed: 0,Name,Test,Score,Result
0,Alice,1,10.0,Pass
2,Bob,1,9.0,Pass
4,Clara,1,10.0,Pass
6,David,1,,


Drop duplicate rows in the `students` DataFrame, based on the "Name" column, keeping the last occurrence:

In [14]:
students.drop_duplicates("Name", keep="last")

Unnamed: 0,Name,Test,Score,Result
1,Alice,2,10.0,Pass
3,Bob,2,8.0,Pass
5,Clara,2,2.0,Fail
7,David,2,7.0,Pass


Drop duplicate rows in the `students` DataFrame, based on the "Name" and "Result" columns:

In [15]:
students.drop_duplicates(["Name", "Result"])

Unnamed: 0,Name,Test,Score,Result
0,Alice,1,10.0,Pass
2,Bob,1,9.0,Pass
4,Clara,1,10.0,Pass
5,Clara,2,2.0,Fail
6,David,1,,
7,David,2,7.0,Pass
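
Because `drop_duplicates()` keeps the first (or last) row of each group, the row order decides which one survives. One possible way to use this, sketched below under the assumption that the best score per student is wanted, is to sort before dropping. The variable `best` is hypothetical and not part of the exercise.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob", "Bob", "Clara", "Clara", "David", "David"],
        "Test": [1, 2, 1, 2, 1, 2, 1, 2],
        "Score": [10, 10, 9, 8, 10, 2, None, 7],
        "Result": ["Pass", "Pass", "Pass", "Pass", "Pass", "Fail", None, "Pass"],
    }
)

best = (
    students.sort_values("Score", ascending=False)  # highest scores first, missing scores last
    .drop_duplicates("Name")                        # keep the first row seen for each student
    .sort_index()                                   # restore the original row order
)
print(best)
```

For tied scores (Alice has two tests with a score of 10), which of the tied rows is kept depends on the sort, so this sketch should not be relied on to pick a specific test in that case.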


## Demo 2: Getting unique values

In [16]:
df

Unnamed: 0,Monarch,Language,Currency
Spain,Felipe VI,Spanish,EUR
Colombia,,Spanish,COP
France,,French,EUR
Canada,Elizabeth II,French,CAD
Italy,,Italian,EUR
Germany,,German,EUR
Austria,,German,EUR
Norway,Harald V,Norwegian,NOK


Get unique values in a column:

In [17]:
df["Currency"].unique()

<StringArray>
['EUR', 'COP', 'CAD', 'NOK']
Length: 4, dtype: string

In [18]:
df["Language"].unique()

<StringArray>
['Spanish', 'French', 'Italian', 'German', 'Norwegian']
Length: 5, dtype: string

In [19]:
df["Monarch"].unique()

<StringArray>
['Felipe VI', <NA>, 'Elizabeth II', 'Harald V']
Length: 4, dtype: string

<div class="alert alert-info">

<b>Note:</b> Unique values are <b>not sorted</b>.

</div>
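
The note above can be illustrated with a short sketch: `.unique()` returns values in order of first appearance, so an explicit sort is needed for alphabetical order. The Series `currency` below is a standalone copy of the "Currency" column, and the other names are illustrative.

```python
import pandas as pd

# Standalone copy of the "Currency" column from the demo.
currency = pd.Series(
    ["EUR", "COP", "EUR", "CAD", "EUR", "EUR", "EUR", "NOK"], dtype="string"
)

# Values come back in the order in which they first appear.
in_order = list(currency.unique())
print(in_order)

# Drop missing values (none here) and sort explicitly for alphabetical order.
alphabetical = sorted(currency.dropna().unique())
print(alphabetical)
```

Calling `.dropna()` first is a precaution: a column with missing values, such as "Monarch", would otherwise put `<NA>` among the values to be sorted.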

Count the unique values in a column with `len()` (a missing value counts as one value):

In [20]:
len(df["Currency"].unique())

4

In [21]:
len(df["Language"].unique())

5

In [22]:
len(df["Monarch"].unique())

4

Get the number of unique values in a column with `.nunique()`:

In [23]:
df["Currency"].nunique()

4

In [24]:
df["Language"].nunique()

5

In [25]:
df["Monarch"].nunique()

3

In [26]:
df["Monarch"].nunique(dropna=False)

4

<div class="alert alert-warning">

<b>Beware:</b> By default, <code>.nunique()</code> ignores missing values, whereas <code>.unique()</code> includes them. Pass <code>dropna=False</code> to <code>.nunique()</code> to count them too.

</div>
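
The difference can be checked side by side. This is a minimal sketch on a standalone copy of the "Monarch" column; the variable names are illustrative.

```python
import pandas as pd

# Standalone copy of the "Monarch" column from the demo.
monarch = pd.Series(
    ["Felipe VI", None, None, "Elizabeth II", None, None, None, "Harald V"],
    dtype="string",
)

with_missing = len(monarch.unique())              # the missing value counts once
without_missing = monarch.nunique()               # default dropna=True skips it
same_as_unique = monarch.nunique(dropna=False)    # matches len(unique())
after_dropna = len(monarch.dropna().unique())     # matches the default nunique()

print(with_missing, without_missing, same_as_unique, after_dropna)
```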

## Exercise 2

In [27]:
students

Unnamed: 0,Name,Test,Score,Result
0,Alice,1,10.0,Pass
1,Alice,2,10.0,Pass
2,Bob,1,9.0,Pass
3,Bob,2,8.0,Pass
4,Clara,1,10.0,Pass
5,Clara,2,2.0,Fail
6,David,1,,
7,David,2,7.0,Pass


Get unique values in the "Name" column of the `students` DataFrame:

In [30]:
students.Name.unique()

array(['Alice', 'Bob', 'Clara', 'David'], dtype=object)

Get the number of unique values in the "Name" column of the `students` DataFrame:

In [32]:
len(students.Name.unique())

4

Get unique values in the "Score" column of the `students` DataFrame:

In [33]:
students.Score.unique()

array([10.,  9.,  8.,  2., nan,  7.])

Count the number of unique values in the "Score" column of the `students` DataFrame (missing values are excluded by default):

In [36]:
students.Score.nunique()

5

Get the number of unique values in the "Name" column of the `students` DataFrame, this time with `.nunique()`:

In [37]:
students.Name.nunique()

4
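
As a possible shortcut for the per-column calls above, `DataFrame.nunique()` returns the count for every column at once. The sketch below rebuilds the `students` DataFrame so it runs on its own; `counts` and `counts_with_missing` are illustrative names.

```python
import pandas as pd

students = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob", "Bob", "Clara", "Clara", "David", "David"],
        "Test": [1, 2, 1, 2, 1, 2, 1, 2],
        "Score": [10, 10, 9, 8, 10, 2, None, 7],
        "Result": ["Pass", "Pass", "Pass", "Pass", "Pass", "Fail", None, "Pass"],
    }
)

# Number of distinct values per column, missing values excluded.
counts = students.nunique()
print(counts)

# Same, but a missing value counts as one extra distinct value.
counts_with_missing = students.nunique(dropna=False)
print(counts_with_missing)
```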