# Duplicate elements in DataFrames

## Setup

In [None]:
import pandas as pd

## Creation

Create an example DataFrame from a dictionary of dictionaries (the outer keys become the columns, the inner keys become the index):

In [None]:
data = {
    "Monarch": {
        "Spain": "Felipe VI",
        "Colombia": None,
        "France": None,
        "Canada": "Elizabeth II",
        "Italy": None,
        "Germany": None,
        "Austria": None,
        "Norway": "Harald V",
    },
    "Language": {
        "Spain": "Spanish",
        "Colombia": "Spanish",
        "France": "French",
        "Canada": "French",
        "Italy": "Italian",
        "Germany": "German",
        "Austria": "German",
        "Norway": "Norwegian",
    },
    "Currency": {
        "Spain": "EUR",
        "Colombia": "COP",
        "France": "EUR",
        "Canada": "CAD",
        "Italy": "EUR",
        "Germany": "EUR",
        "Austria": "EUR",
        "Norway": "NOK",
    },
}

In [None]:
# Build the DataFrame and convert each column to the "string" dtype
# (the details of these steps are not important for now):
df = pd.DataFrame(data)
df["Monarch"] = df["Monarch"].astype("string")
df["Language"] = df["Language"].astype("string")
df["Currency"] = df["Currency"].astype("string")

## Demo 1: Dropping duplicates

In [None]:
df

Drop duplicate rows:

In [None]:
df.drop_duplicates()
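As a minimal sketch of what this call does, the example below uses a small hypothetical frame named `demo` (not the notebook's `df`). It illustrates that `drop_duplicates()` returns a new DataFrame and leaves the original untouched, and that the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical data for illustration only: rows 0 and 1 are identical.
demo = pd.DataFrame(
    {
        "Language": ["German", "German", "French"],
        "Currency": ["EUR", "EUR", "EUR"],
    }
)

# With no arguments, a row is a duplicate only if ALL of its columns match.
deduped = demo.drop_duplicates()

print(len(demo))            # the original frame still has every row
print(len(deduped))         # the result has one row fewer
print(list(deduped.index))  # index labels of the surviving rows
```

To keep the result, assign it to a variable as done here with `deduped`.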

Drop duplicate rows, based on a specific column:

In [None]:
df.drop_duplicates("Currency")

In [None]:
df.drop_duplicates("Language")

Drop duplicate rows, based on specific columns:

In [None]:
df.drop_duplicates(["Language", "Currency"])

Drop duplicate rows, based on specific columns, and keep the last occurrence instead of the first one:

In [None]:
df.drop_duplicates(["Language", "Currency"], keep="last")
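The effect of `keep` is easier to see side by side. The sketch below uses a reduced, hypothetical frame `demo` with the same column names as above, and compares `keep="first"` (the default), `keep="last"`, and `keep=False`, which drops every row that has a duplicate.

```python
import pandas as pd

# Hypothetical subset of the example data; Germany and Austria share
# the same (Language, Currency) pair.
demo = pd.DataFrame(
    {
        "Language": ["Spanish", "Spanish", "French", "German", "German"],
        "Currency": ["EUR", "COP", "EUR", "EUR", "EUR"],
    },
    index=["Spain", "Colombia", "France", "Germany", "Austria"],
)

cols = ["Language", "Currency"]

first = demo.drop_duplicates(cols)               # keep="first" is the default
last = demo.drop_duplicates(cols, keep="last")   # keep the last occurrence
neither = demo.drop_duplicates(cols, keep=False) # drop all duplicated rows

print(list(first.index))
print(list(last.index))
print(list(neither.index))
```

Only the Germany/Austria pair is affected: `first` keeps Germany, `last` keeps Austria, and `neither` keeps neither of them.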

## Exercise 1

In [None]:
students = pd.DataFrame(
    {
        "Name": ["Alice", "Alice", "Bob", "Bob", "Clara", "Clara", "David", "David"],
        "Test": [1, 2, 1, 2, 1, 2, 1, 2],
        "Score": [10, 10, 9, 8, 10, 2, None, 7],
        "Result": ["Pass", "Pass", "Pass", "Pass", "Pass", "Fail", None, "Pass"],
    }
)

In [None]:
students

Drop duplicate rows in the `students` DataFrame:

Drop duplicate rows in the `students` DataFrame, based on the "Name" column:

Drop duplicate rows in the `students` DataFrame, based on the "Name" column, keeping the last occurrence:

Drop duplicate rows in the `students` DataFrame, based on the "Name" and "Result" columns:

## Demo 2: Getting unique values

In [None]:
df

Get unique values in a column:

In [None]:
df["Currency"].unique()

In [None]:
df["Language"].unique()

In [None]:
df["Monarch"].unique()

<div class="alert alert-info">

<b>Note:</b> Unique values are <b>not sorted</b>; they are returned in order of first appearance.

</div>
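If sorted values are needed, one possible approach is to pass the result to Python's built-in `sorted()`. The sketch below uses a hypothetical Series named `currencies` without missing values; with missing values present, they would likely need to be dropped first (for example with `.dropna()`) before sorting.

```python
import pandas as pd

# Hypothetical Series for illustration, with no missing values.
currencies = pd.Series(["EUR", "COP", "EUR", "CAD", "NOK"])

# .unique() preserves the order of first appearance.
in_order = list(currencies.unique())

# sorted() returns a new, alphabetically ordered list.
sorted_values = sorted(currencies.unique())

print(in_order)
print(sorted_values)
```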

Count the number of unique values in a column (missing values included) by taking the length of `.unique()`:

In [None]:
len(df["Currency"].unique())

In [None]:
len(df["Language"].unique())

In [None]:
len(df["Monarch"].unique())

Get the number of unique values in a column directly with `.nunique()`:

In [None]:
df["Currency"].nunique()

In [None]:
df["Language"].nunique()

In [None]:
df["Monarch"].nunique()

In [None]:
df["Monarch"].nunique(dropna=False)

<div class="alert alert-warning">

<b>Beware:</b> By default, the <code>.nunique()</code> method does not count missing values, whereas the <code>.unique()</code> method includes them in its result!

</div>
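The difference can be checked on a small example. This sketch assumes a hypothetical Series `monarchs` that mirrors the "Monarch" column above (three names and three missing values) and compares the three ways of counting.

```python
import pandas as pd

# Hypothetical Series mirroring the "Monarch" column: 3 names, 3 missing values.
monarchs = pd.Series(
    ["Felipe VI", None, None, "Elizabeth II", None, "Harald V"],
    dtype="string",
)

n_default = monarchs.nunique()              # missing values are ignored
n_with_na = monarchs.nunique(dropna=False)  # missing values count as one value
n_from_len = len(monarchs.unique())         # .unique() includes the missing value

print(n_default, n_with_na, n_from_len)
```

With `dropna=False`, `.nunique()` agrees with `len(... .unique())`; with the default, it is one lower whenever the column contains missing values.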

## Exercise 2

In [None]:
students

Get unique values in the "Name" column of the `students` DataFrame:

Get the number of unique values in the "Name" column of the `students` DataFrame:

Get unique values in the "Score" column of the `students` DataFrame:

Count the number of unique values in the "Score" column of the `students` DataFrame:

Get the number of unique values in the "Score" column of the `students` DataFrame: