# Chapter 05 Missing Data
Pandas for Everyone. See the author's [GitHub page](https://github.com/chendaniely/pandas_for_everyone).

In [2]:
import pandas as pd
import numpy as np
from numpy import nan as NaN # the NaN alias was removed in NumPy 2.0; nan works in every version

## What Is a Missing Value?
NumPy has a special value, NaN ("not a number"), which pandas uses to represent a missing value. It has unusual properties: it is not equal to anything, including True, False, and even itself.

NaN is not None.

In [3]:
NaN == True

False

In [4]:
NaN == False

False

In [5]:
NaN == NaN

False

In [6]:
NaN == None

False

In [7]:
NaN is NaN

True

In [8]:
NaN is float # identity check against the type object float, so False; NaN itself is still a float value

False
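
The comparisons above follow from NaN being an ordinary floating-point value that, by definition, compares unequal to everything. As a minimal sketch (the variable names are only for illustration), here are checks that do and do not detect a single NaN value:

```python
import math
import numpy as np

x = np.nan

# NaN is a float instance, even though `x is float` is False
is_float = isinstance(x, float)

# Equality never detects NaN, but NaN is the only float that differs from itself
equal_to_itself = (x == x)
differs_from_itself = (x != x)

# math.isnan and np.isnan are explicit checks for a single float
detected = math.isnan(x) and bool(np.isnan(x))

print(is_float, equal_to_itself, differs_from_itself, detected)
# True False True True
```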

### Test for Missing Values
Use the pd.isnull() function, or the isnull() method of a Series or DataFrame.

In [9]:
pd.isnull(NaN)

True

In [10]:
s = pd.Series([1, 2, NaN])
s

0    1.0
1    2.0
2    NaN
dtype: float64

In [11]:
s == NaN # The wrong way, because NaN != NaN

0    False
1    False
2    False
dtype: bool

In [12]:
s.isnull() # The correct way

0    False
1    False
2     True
dtype: bool
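
Although NaN is not None, pandas treats both as missing. The sketch below uses a small made-up Series (not part of the ebola data) to show this, along with notnull(), the complement of isnull():

```python
import numpy as np
import pandas as pd

# A made-up Series mixing a number, None, and NaN
mixed = pd.Series([1, None, np.nan])

missing = mixed.isnull()    # None is converted to NaN in a numeric Series
present = mixed.notnull()   # element-wise complement of isnull()

print(missing.tolist())     # [False, True, True]
print(present.tolist())     # [True, False, False]
print(pd.isnull(None))      # True
```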

## Working With Missing Data
Let's find and count missing data first.

In [13]:
ebola = pd.read_csv('data/country_timeseries.csv')
ebola.head()

,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


### Count Values
The count() method returns the number of non-NaN values in each column.

In [14]:
ebola.count() # count how many non-NaN values per column

Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64

### Count Missing Values
Count the number of missing values (NaN) in each column.

In [17]:
ebola.isnull().sum() # True counts as 1 and False as 0, so the sum is each column's number of NaN

Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64

#### Here is another approach using .shape

In [18]:
ebola.shape # the first element is the total number of rows, or the total number of elements per column

(122, 18)

In [20]:
ebola.shape[0] - ebola.count() # total rows minus non-NaN count; the scalar is applied to every element of the Series

Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64
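
The same idea extends to counting missing values per row or across the whole table. Because the CSV file is not included here, the sketch below uses a small hypothetical DataFrame (the column names `Cases_A` and `Cases_B` are made up) so the counts can be checked by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ebola data, small enough to verify by eye
df = pd.DataFrame({
    'Day': [289, 288, 287],
    'Cases_A': [2776.0, np.nan, 2769.0],
    'Cases_B': [np.nan, np.nan, 8166.0],
})

per_column = df.isnull().sum()          # Day 0, Cases_A 1, Cases_B 2
per_row = df.isnull().sum(axis=1)       # 1, 2, 0
total = int(df.isnull().sum().sum())    # 3

print(per_column.tolist(), per_row.tolist(), total)
```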

## Cleaning Missing Data

To deal with missing data, we can:

1. Replace missing data;
2. Drop (ignore) missing data.

### Replace

Use the fillna() method. It returns a new object of the same type as the original: a Series for a Series, a DataFrame for a DataFrame.

In [22]:
s

0    1.0
1    2.0
2    NaN
dtype: float64

In [23]:
s.fillna(0)

0    1.0
1    2.0
2    0.0
dtype: float64

In [24]:
ebola.fillna(0).head()

,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,0.0,10030.0,0.0,0.0,0.0,0.0,0.0,1786.0,0.0,2977.0,0.0,0.0,0.0,0.0,0.0
1,1/4/2015,288,2775.0,0.0,9780.0,0.0,0.0,0.0,0.0,0.0,1781.0,0.0,2943.0,0.0,0.0,0.0,0.0,0.0
2,1/3/2015,287,2769.0,8166.0,9722.0,0.0,0.0,0.0,0.0,0.0,1767.0,3496.0,2915.0,0.0,0.0,0.0,0.0,0.0
3,1/2/2015,286,0.0,8157.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3496.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12/31/2014,284,2730.0,8115.0,9633.0,0.0,0.0,0.0,0.0,0.0,1739.0,3471.0,2827.0,0.0,0.0,0.0,0.0,0.0
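
Filling every column with the same constant is not always appropriate. fillna() also accepts a dict that maps column names to fill values. A minimal sketch, assuming a hypothetical two-column DataFrame (the names `cases` and `deaths` are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cases': [10.0, np.nan, 30.0],
    'deaths': [np.nan, 2.0, 4.0],
})

# Each named column gets its own fill value: 0 for cases,
# the column mean (3.0, computed from the non-NaN values) for deaths
filled = df.fillna({'cases': 0, 'deaths': df['deaths'].mean()})

print(filled['cases'].tolist())    # [10.0, 0.0, 30.0]
print(filled['deaths'].tolist())   # [3.0, 2.0, 4.0]
```

Note that fillna() returns a new DataFrame here; `df` itself still contains its NaN values.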


### Fill Forward

We can use built-in methods to fill forward or backward. When we fill forward, the last known value is carried into each missing value that follows it. If a series begins with missing values, they remain missing, because there is no earlier value to carry forward.

In [31]:
s = pd.Series([NaN, NaN, 5, 6, NaN])
s

0    NaN
1    NaN
2    5.0
3    6.0
4    NaN
dtype: float64

In [32]:
s.ffill() # same result as s.fillna(method='ffill'), which is deprecated in recent pandas versions

0    NaN
1    NaN
2    5.0
3    6.0
4    6.0
dtype: float64

### Fill Backward

When we fill backward, the next known value is used to replace each missing value that precedes it. It is the reverse of the forward fill. If a series ends with missing values, they remain missing, because there is no later value to carry back.

In [33]:
s.bfill() # same result as s.fillna(method='bfill'), which is deprecated in recent pandas versions

0    5.0
1    5.0
2    5.0
3    6.0
4    NaN
dtype: float64
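
Since a forward fill leaves leading gaps and a backward fill leaves trailing gaps, one option is to chain the two. The sketch below also shows the `limit` parameter, which caps how many consecutive missing values are filled. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, np.nan, 5, 6, np.nan])

# Forward fill first (fills index 4), then backward fill (fills indexes 0 and 1)
both = gaps.ffill().bfill()
print(both.tolist())               # [5.0, 5.0, 5.0, 6.0, 6.0]

# With limit=1, only one consecutive NaN before each known value is filled,
# so index 1 becomes 5.0 while indexes 0 and 4 stay NaN
limited = gaps.bfill(limit=1)
print(limited.isnull().tolist())   # [True, False, False, False, True]
```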

### Interpolate

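Use interpolation to calculate the missing values from their neighbors. By default, interpolate() is linear and treats the values as equally spaced. Leading missing values stay missing, while trailing ones are filled with the last known value.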

In [38]:
s = pd.Series([NaN, 1, 2, 3, NaN, 4, 5, NaN, NaN, 6, NaN, 8, NaN])
s

0     NaN
1     1.0
2     2.0
3     3.0
4     NaN
5     4.0
6     5.0
7     NaN
8     NaN
9     6.0
10    NaN
11    8.0
12    NaN
dtype: float64

In [39]:
s.interpolate()

0          NaN
1     1.000000
2     2.000000
3     3.000000
4     3.500000
5     4.000000
6     5.000000
7     5.333333
8     5.666667
9     6.000000
10    7.000000
11    8.000000
12    8.000000
dtype: float64
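
To see where values such as 5.333333 come from, the linear interpolation can be reproduced by hand. This is only a sketch for checking the arithmetic, using np.interp as a stand-in; it is not a description of how pandas computes the result internally:

```python
import numpy as np
import pandas as pd

gappy = pd.Series([np.nan, 1, 2, 3, np.nan, 4, 5, np.nan, np.nan, 6, np.nan, 8, np.nan])
known = gappy.dropna()

# Linear interpolation between the known (index, value) points.
# np.interp holds the end values constant outside the known range,
# so unlike the pandas default it also fills index 0.
manual = np.interp(gappy.index.to_numpy(), known.index.to_numpy(), known.to_numpy())

# Index 7 lies one third of the way from (6, 5.0) to (9, 6.0)
by_hand = 5.0 + (6.0 - 5.0) * (7 - 6) / (9 - 6)

pandas_result = gappy.interpolate()
print(round(manual[7], 6), round(by_hand, 6), round(pandas_result[7], 6))
# 5.333333 5.333333 5.333333
```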

### Drop Missing Values

In [40]:
s.dropna()

1     1.0
2     2.0
3     3.0
5     4.0
6     5.0
9     6.0
11    8.0
dtype: float64
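
On a DataFrame, dropna() removes whole rows by default, and a few parameters control how strict it is. A minimal sketch on a hypothetical DataFrame (the column names `a`, `b`, `c` are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [np.nan, np.nan, 6.0, 8.0],
    'c': [9.0, np.nan, 11.0, 12.0],
})

any_missing = df.dropna()              # drop rows with any NaN: keeps row 2
all_missing = df.dropna(how='all')     # drop rows that are entirely NaN: keeps rows 0, 2, 3
enough_data = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values: rows 0, 2, 3
by_column = df.dropna(subset=['a'])    # only consider column a: keeps rows 0, 2

print(any_missing.index.tolist(), all_missing.index.tolist(),
      enough_data.index.tolist(), by_column.index.tolist())
```

As with the Series example, the surviving rows keep their original index labels.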

### Calculations With Missing Data

In [41]:
s + s.dropna()

0      NaN
1      2.0
2      4.0
3      6.0
4      NaN
5      8.0
6     10.0
7      NaN
8      NaN
9     12.0
10     NaN
11    16.0
12     NaN
dtype: float64

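Note that the two series are matched by index label, so any label that is missing from either operand, or holds NaN in either, yields NaN in the result.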
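
Reductions and arithmetic treat NaN differently, which is easy to overlook. The sketch below, using two small made-up Series, shows that sum() skips NaN by default, while element-wise addition propagates it unless a `fill_value` is given:

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, 3.0])
right = pd.Series([10.0, 20.0], index=[0, 1])   # has no label 2

total = left.sum()                  # 4.0, NaN is skipped by default
strict = left.sum(skipna=False)     # NaN, because skipping is turned off

plain = left + right                # label 1 holds NaN in left, label 2 is absent from right
print(plain.isnull().tolist())      # [False, True, True]

# fill_value substitutes 0 when a value is missing in exactly one operand
filled = left.add(right, fill_value=0)
print(filled.tolist())              # [11.0, 20.0, 3.0]
```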