# Missing values

## Setup

In [1]:
import pandas as pd

## Creation

Create an example DataFrame from a dictionary of dictionaries (the outer keys become the columns, the inner keys become the row labels):

In [2]:
data = {
    "Capital": {
        "Spain": "Madrid",
        "Belgium": "Brussels",
        "France": "Paris",
        "Italy": "Roma",
        "Germany": "Berlin",
        "Portugal": "Lisbon",
        "Norway": "Oslo",
        "Greece": "Athens",
    },
    "Population": {
        "Spain": 46733038,
        "Belgium": 11449656,
        "France": 67076000,
        "Italy": 60390560,
        "Germany": 83122889,
        "Portugal": 10295909,
        "Norway": 5391369,
        "Greece": 10718565,
    },
    "Monarch": {
        "Spain": "Felipe VI",
        "Belgium": "Philippe",
        "Norway": "Harald V",
    },
    "Area": {
        "Spain": 505990,
        "Belgium": 30688,
        "France": 640679,
        "Italy": 301340,
        "Germany": None,
        "Portugal": 92212,
        "Norway": 385207,
        "Greece": 131957,
    },
    "Currency": {
        "Spain": "EUR",
        "Belgium": "EUR",
        "France": "EUR",
        "Italy": "EUR",
        "Germany": "EUR",
        "Portugal": None,
        "Norway": "NOK",
        "Greece": "EUR",
    },
    "Formation": {
        "Spain": "1715-06-09",
        "Belgium": "1830-10-04",
        "France": "1792-09-22",
        "Italy": None,
        "Germany": None,
        "Portugal": None,
        "Norway": None,
        "Greece": None,
    },
}

In [3]:
# Preparation steps (the details are not important for now):
df = pd.DataFrame(data)
df["Density"] = df["Population"] / df["Area"]
df["Capital"] = df["Capital"].astype("string")
df["Monarch"] = df["Monarch"].astype("string")
df["Area"] = df["Area"].astype("Int64")
df["Currency"] = df["Currency"].astype("category")
df["Formation"] = df["Formation"].astype("datetime64[ns]")
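Countries that are absent from an inner dictionary (for example, "France" under "Monarch") end up as missing values in the resulting DataFrame. A minimal sketch of that behaviour, using a reduced two-country version of the data (the names `nested` and `small` are only illustrative):

```python
import pandas as pd

# Reduced version of the data: "France" has no entry under "Monarch".
nested = {
    "Capital": {"Spain": "Madrid", "France": "Paris"},
    "Monarch": {"Spain": "Felipe VI"},
}

small = pd.DataFrame(nested)

# Boolean mask: True where a value is missing.
missing_mask = small.isna()
print(missing_mask)
```

The row labels are the union of the inner keys, so "France" still gets a row, with a missing value in the "Monarch" column.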

Apple stock data, taken from the [`matplotlib` sample datasets](https://github.com/matplotlib/sample_data/blob/master/aapl.csv):

In [4]:
# Preparation steps (the details are not important for now):
apple = pd.read_csv("AAPL.csv")
apple["Date"] = apple["Date"].astype("datetime64[ns]")
apple = apple.set_index("Date")
apple = apple.sort_index()
apple.at["1984-09-07", "Open"] = None
apple.loc["1984-09-10":"1984-09-11", "High"] = None
apple.loc["1984-09-10":"1984-09-12", "Low"] = None

## DataFrames

In [5]:
df

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990.0,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688.0,EUR,1830-10-04,373.098801
France,Paris,67076000,,640679.0,EUR,1792-09-22,104.695175
Italy,Roma,60390560,,301340.0,EUR,NaT,200.406717
Germany,Berlin,83122889,,,EUR,NaT,
Portugal,Lisbon,10295909,,92212.0,,NaT,111.654763
Norway,Oslo,5391369,Harald V,385207.0,NOK,NaT,13.996031
Greece,Athens,10718565,,131957.0,EUR,NaT,81.227711


In [6]:
df.dtypes

Capital               string
Population             int64
Monarch               string
Area                   Int64
Currency            category
Formation     datetime64[ns]
Density              float64
dtype: object

In [7]:
apple.head()

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-07,,26.87,26.25,26.5,2981600,3.02
1984-09-10,26.5,,,26.37,2346400,3.01
1984-09-11,26.62,,,26.87,5444000,3.07
1984-09-12,26.87,27.0,,26.12,4773600,2.98
1984-09-13,27.5,27.62,27.5,27.5,7429600,3.14


In [8]:
apple.count()

Open         6080
High         6079
Low          6078
Close        6081
Volume       6081
Adj Close    6081
dtype: int64

## Demo 1: Drop rows with missing values

In [9]:
df

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990.0,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688.0,EUR,1830-10-04,373.098801
France,Paris,67076000,,640679.0,EUR,1792-09-22,104.695175
Italy,Roma,60390560,,301340.0,EUR,NaT,200.406717
Germany,Berlin,83122889,,,EUR,NaT,
Portugal,Lisbon,10295909,,92212.0,,NaT,111.654763
Norway,Oslo,5391369,Harald V,385207.0,NOK,NaT,13.996031
Greece,Athens,10718565,,131957.0,EUR,NaT,81.227711


Count non-missing values:

In [10]:
df.count()

Capital       8
Population    8
Monarch       3
Area          7
Currency      7
Formation     3
Density       7
dtype: int64
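`count()` reports the non-missing values per column. The complementary view, the number of missing values per column, can be obtained by combining `isna()` with `sum()`. A small sketch on a made-up three-row frame (the frame `small` is an assumption for illustration, not the notebook's `df`):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "Area": [505990.0, np.nan, 92212.0],
        "Monarch": ["Felipe VI", None, None],
    },
    index=["Spain", "Germany", "Portugal"],
)

non_missing = small.count()     # non-missing values per column
missing = small.isna().sum()    # missing values per column

print(non_missing)
print(missing)
```

For every column, the two counts add up to the number of rows.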

Drop every row that contains at least one missing value:

In [11]:
df.dropna()

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688,EUR,1830-10-04,373.098801


Drop rows that have a missing value in any of the specified columns:

In [12]:
df.dropna(subset=["Area", "Population"])

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688,EUR,1830-10-04,373.098801
France,Paris,67076000,,640679,EUR,1792-09-22,104.695175
Italy,Roma,60390560,,301340,EUR,NaT,200.406717
Portugal,Lisbon,10295909,,92212,,NaT,111.654763
Norway,Oslo,5391369,Harald V,385207,NOK,NaT,13.996031
Greece,Athens,10718565,,131957,EUR,NaT,81.227711
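Besides `subset`, `dropna` also accepts the `how` and `thresh` parameters, which control how many missing values a row must have before it is dropped. The sketch below uses a made-up four-row frame (the labels `r1` to `r4` and the values are illustrative only):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    {
        "Open": [np.nan, 26.5, np.nan, 27.5],
        "High": [26.87, np.nan, np.nan, 27.62],
        "Low": [26.25, np.nan, np.nan, 27.5],
    },
    index=["r1", "r2", "r3", "r4"],
)

# Default (how="any"): drop a row if any value is missing.
any_dropped = small.dropna()

# how="all": drop a row only if all of its values are missing.
all_dropped = small.dropna(how="all")

# thresh=2: keep only rows with at least two non-missing values.
thresh_dropped = small.dropna(thresh=2)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(thresh_dropped.index.tolist())
```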


## Exercise 1

In [13]:
apple.head()

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-07,,26.87,26.25,26.5,2981600,3.02
1984-09-10,26.5,,,26.37,2346400,3.01
1984-09-11,26.62,,,26.87,5444000,3.07
1984-09-12,26.87,27.0,,26.12,4773600,2.98
1984-09-13,27.5,27.62,27.5,27.5,7429600,3.14


Count non-missing values:

In [14]:
apple.count()

Open         6080
High         6079
Low          6078
Close        6081
Volume       6081
Adj Close    6081
dtype: int64

Drop rows with missing values:

In [15]:
apple.dropna()

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-13,27.50,27.62,27.50,27.50,7429600,3.14
1984-09-14,27.62,28.50,27.62,27.87,8826400,3.18
1984-09-17,28.62,29.00,28.62,28.62,6886400,3.27
1984-09-18,28.62,28.87,27.62,27.62,3495200,3.15
1984-09-19,27.62,27.87,27.00,27.00,3816000,3.08
...,...,...,...,...,...,...
2008-10-08,85.91,96.33,85.68,89.79,78847900,89.79
2008-10-09,93.35,95.80,86.60,88.74,57763700,88.74
2008-10-10,85.70,100.00,85.00,96.80,79260700,96.80
2008-10-13,104.55,110.53,101.02,110.26,54967000,110.26


Drop rows that have a missing value in the "Open" or the "High" column:

In [16]:
apple.dropna(subset=["Open", "High"])

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-12,26.87,27.00,,26.12,4773600,2.98
1984-09-13,27.50,27.62,27.50,27.50,7429600,3.14
1984-09-14,27.62,28.50,27.62,27.87,8826400,3.18
1984-09-17,28.62,29.00,28.62,28.62,6886400,3.27
1984-09-18,28.62,28.87,27.62,27.62,3495200,3.15
...,...,...,...,...,...,...
2008-10-08,85.91,96.33,85.68,89.79,78847900,89.79
2008-10-09,93.35,95.80,86.60,88.74,57763700,88.74
2008-10-10,85.70,100.00,85.00,96.80,79260700,96.80
2008-10-13,104.55,110.53,101.02,110.26,54967000,110.26


Drop rows with missing values in the "Open" column:

In [19]:
apple.dropna(subset=["Open"])

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-10,26.50,,,26.37,2346400,3.01
1984-09-11,26.62,,,26.87,5444000,3.07
1984-09-12,26.87,27.00,,26.12,4773600,2.98
1984-09-13,27.50,27.62,27.50,27.50,7429600,3.14
1984-09-14,27.62,28.50,27.62,27.87,8826400,3.18
...,...,...,...,...,...,...
2008-10-08,85.91,96.33,85.68,89.79,78847900,89.79
2008-10-09,93.35,95.80,86.60,88.74,57763700,88.74
2008-10-10,85.70,100.00,85.00,96.80,79260700,96.80
2008-10-13,104.55,110.53,101.02,110.26,54967000,110.26
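Note that `dropna` returns a new DataFrame and leaves the original untouched, which is why `apple` still contains its missing values after the calls above. A minimal sketch with a made-up frame (`prices` and `cleaned` are illustrative names):

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {
        "Open": [np.nan, 26.5, 26.62],
        "Close": [26.5, 26.37, 26.87],
    }
)

# The result is a new object; "prices" itself is not modified.
cleaned = prices.dropna(subset=["Open"])

print(len(prices), len(cleaned))
```

To keep the result, assign it to a name (possibly the original one).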


## Demo 2: Replace missing values with a constant

In [20]:
df

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990.0,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688.0,EUR,1830-10-04,373.098801
France,Paris,67076000,,640679.0,EUR,1792-09-22,104.695175
Italy,Roma,60390560,,301340.0,EUR,NaT,200.406717
Germany,Berlin,83122889,,,EUR,NaT,
Portugal,Lisbon,10295909,,92212.0,,NaT,111.654763
Norway,Oslo,5391369,Harald V,385207.0,NOK,NaT,13.996031
Greece,Athens,10718565,,131957.0,EUR,NaT,81.227711


Replace missing values in a column with a constant:

In [21]:
df["Monarch"].fillna("")

Spain       Felipe VI
Belgium      Philippe
France               
Italy                
Germany              
Portugal             
Norway       Harald V
Greece               
Name: Monarch, dtype: string

In [22]:
df["Monarch"] = df["Monarch"].fillna("")

In [23]:
df

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990.0,EUR,1715-06-09,92.359608
Belgium,Brussels,11449656,Philippe,30688.0,EUR,1830-10-04,373.098801
France,Paris,67076000,,640679.0,EUR,1792-09-22,104.695175
Italy,Roma,60390560,,301340.0,EUR,NaT,200.406717
Germany,Berlin,83122889,,,EUR,NaT,
Portugal,Lisbon,10295909,,92212.0,,NaT,111.654763
Norway,Oslo,5391369,Harald V,385207.0,NOK,NaT,13.996031
Greece,Athens,10718565,,131957.0,EUR,NaT,81.227711


Replace missing values in several columns with constants (one constant per column):

In [24]:
value = {
    "Area": 0,
    "Formation": "---",
}

In [25]:
df.fillna(value=value)

,Capital,Population,Monarch,Area,Currency,Formation,Density
Spain,Madrid,46733038,Felipe VI,505990,EUR,1715-06-09 00:00:00,92.359608
Belgium,Brussels,11449656,Philippe,30688,EUR,1830-10-04 00:00:00,373.098801
France,Paris,67076000,,640679,EUR,1792-09-22 00:00:00,104.695175
Italy,Roma,60390560,,301340,EUR,---,200.406717
Germany,Berlin,83122889,,0,EUR,---,
Portugal,Lisbon,10295909,,92212,,---,111.654763
Norway,Oslo,5391369,Harald V,385207,NOK,---,13.996031
Greece,Athens,10718565,,131957,EUR,---,81.227711
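As the output above suggests, a dictionary passed to `fillna` only affects the columns it names; missing values in the other columns ("Currency" and "Density" here) stay missing. A small sketch of the same behaviour on a made-up frame (the frame and the fill constants are hypothetical placeholders):

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame(
    {
        "Open": [np.nan, 26.5],
        "High": [26.87, np.nan],
        "Low": [26.25, np.nan],
    },
    index=["r1", "r2"],
)

# One constant per column; "Low" is deliberately not listed.
fill_values = {"Open": 0.0, "High": -1.0}
filled = prices.fillna(value=fill_values)

print(filled)
```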


<div class="alert alert-info">

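<b>Note:</b> The fill value should be consistent with the data type of the column. Above, filling the datetime column "Formation" with the string "---" turned it into a generic object column, which is why its dates are now displayed with a time component.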

</div>
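The effect of the fill value on the data type can be checked directly. The sketch below assumes a reasonably recent pandas version and uses two made-up Series; the fill date `1900-01-01` is an arbitrary placeholder, not a recommendation:

```python
import pandas as pd

formation = pd.Series(pd.to_datetime(["1715-06-09", None, "1830-10-04"]))
area = pd.Series([505990, None, 30688], dtype="Int64")

# A Timestamp is consistent with a datetime column: the dtype is kept.
same_type = formation.fillna(pd.Timestamp("1900-01-01"))

# A string is not: the result no longer has a datetime dtype.
mixed_type = formation.fillna("---")

# An integer is consistent with the nullable Int64 column.
area_filled = area.fillna(0)

print(same_type.dtype, mixed_type.dtype, area_filled.dtype)
```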

## Exercise 2

In [26]:
apple.head()

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-07,,26.87,26.25,26.5,2981600,3.02
1984-09-10,26.5,,,26.37,2346400,3.01
1984-09-11,26.62,,,26.87,5444000,3.07
1984-09-12,26.87,27.0,,26.12,4773600,2.98
1984-09-13,27.5,27.62,27.5,27.5,7429600,3.14


Replace missing values in the "Low" column with zero:

In [29]:
apple["Low"] = apple["Low"].fillna(0)

In [30]:
apple

Date,Open,High,Low,Close,Volume,Adj Close
1984-09-07,,26.87,26.25,26.50,2981600,3.02
1984-09-10,26.50,,0.00,26.37,2346400,3.01
1984-09-11,26.62,,0.00,26.87,5444000,3.07
1984-09-12,26.87,27.00,0.00,26.12,4773600,2.98
1984-09-13,27.50,27.62,27.50,27.50,7429600,3.14
...,...,...,...,...,...,...
2008-10-08,85.91,96.33,85.68,89.79,78847900,89.79
2008-10-09,93.35,95.80,86.60,88.74,57763700,88.74
2008-10-10,85.70,100.00,85.00,96.80,79260700,96.80
2008-10-13,104.55,110.53,101.02,110.26,54967000,110.26
