In [1]:
import pandas as pd
import numpy as np

# **Handling Missing Data**
Pandas has convenient methods to check for, remove, and fill missing data.
It also ignores missing data by default when performing common operations such as `mean` and `sum`.

To signal missing data, pandas uses NumPy's NaN value (`np.nan`); NumPy is the numerical library pandas depends on. To set a value to NaN explicitly, we therefore need to import NumPy, conventionally as `np`.
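
As a small illustration of the default behaviour described above, the sketch below sums and averages a hypothetical Series (`s` is not part of the original notebook) that contains one NaN. It assumes the `skipna` keyword of `sum`, which defaults to `True`; passing `skipna=False` lets the NaN propagate instead.

```python
import numpy as np
import pandas as pd

# Hypothetical Series used only for this illustration
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                     # NaN is skipped: 1.0 + 3.0
average = s.mean()                  # mean of the two valid values only
strict_total = s.sum(skipna=False)  # NaN propagates into the result

print(total, average, strict_total)
```

With `skipna=False`, a single missing value is enough to make the whole result NaN, which can be useful when missing data should not go unnoticed.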

In [2]:
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})
df

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
1,2.0,,9,
2,,6.0,10,
3,,,11,15.0


### Detecting missing data
The method `.isna` (or its alias `.isnull`, the only name available in older versions of pandas) returns a boolean mask that is True where the data is missing. To get the opposite mask, use `.notna` (or `.notnull`).

In [3]:
df.isna()

Unnamed: 0,a,b,c,d
0,False,False,False,False
1,False,True,False,True
2,True,False,False,True
3,True,True,False,False


In [4]:
df.notna()

Unnamed: 0,a,b,c,d
0,True,True,True,True
1,True,False,True,False
2,False,True,True,False
3,False,False,True,True
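
Because the mask is boolean, it can be summed to count missing values. The sketch below rebuilds the same `df` so that it runs on its own; the names `missing_per_column` and `total_missing` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = df.isna().sum()

# Summing again gives the number of missing values in the whole DataFrame
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```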


### Removing missing data
The method `dropna` removes missing data in one of the following ways:
 - how='any': (default) removes every row that has at least one missing value in any column
 - how='all': removes a row only if all of its columns contain missing data

The method returns an edited copy of the data.

In [5]:
df.dropna()
# same as:  df.dropna(how='any')

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
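
To check that `dropna` returns a copy and leaves the original untouched, the result can be stored under a new name and both shapes compared. This is a minimal sketch that rebuilds the same `df`; the name `cleaned` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})

# dropna returns a new DataFrame; df itself is not modified
cleaned = df.dropna()

print(df.shape)       # the original still has all four rows
print(cleaned.shape)  # only the row without missing data remains
```

To keep the cleaned data, assign the result to a variable (possibly `df` itself) rather than calling `dropna` on its own.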


In [6]:
df.dropna(how='all')

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
1,2.0,,9,
2,,6.0,10,
3,,,11,15.0
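
In the example above nothing is removed, because column `c` has no missing data and so no row is entirely empty. The sketch below uses a hypothetical DataFrame, `df_empty_row` (not part of the original notebook), that does contain a fully missing row, to show `how='all'` taking effect.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing in every column, row 2 only in 'b'
df_empty_row = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [4, np.nan, np.nan],
})

# Only the row where all columns are missing is dropped
result = df_empty_row.dropna(how='all')

print(result)
```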


If you want to remove columns instead of rows, set the `axis` argument accordingly: `axis='columns'`

In [7]:
df.dropna(axis='columns')

Unnamed: 0,c
0,8
1,9
2,10
3,11


The optional `subset` argument restricts the search for missing data to the given columns.

In [8]:
df.dropna(how='any', subset=['c', 'd'])

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
3,,,11,15.0
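
The same selection can be written with the masks from the detection section, which may help clarify what `subset` does. The sketch below rebuilds the same `df` and compares both approaches; `mask`, `via_mask` and `via_dropna` are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})

# Keep a row only if both 'c' and 'd' are present in it
mask = df[['c', 'd']].notna().all(axis='columns')
via_mask = df[mask]

# Equivalent call using dropna with subset
via_dropna = df.dropna(how='any', subset=['c', 'd'])

print(via_mask.equals(via_dropna))
```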


### Replacing missing data
`fillna` can be used as a convenient way to replace missing data in a DataFrame.

The method returns an edited copy of the data.

In [9]:
df.fillna(-999)

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
1,2.0,-999.0,9,-999.0
2,-999.0,6.0,10,-999.0
3,-999.0,-999.0,11,15.0
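
A single fill value is not always appropriate for every column. As a sketch, `fillna` also accepts a dictionary mapping column names to fill values; columns not listed in the dictionary are left as they are. The fill values below are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})

# Fill 'a' with 0 and 'b' with -1; 'd' is not listed, so its NaN values stay
filled = df.fillna({'a': 0, 'b': -1})

print(filled)
```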


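Instead of a fixed value, missing data can be filled from neighboring valid observations. The available methods are:
- `pad` or `ffill`: propagate the last valid observation forward to the next valid one
- `backfill` or `bfill`: use the next valid observation to fill the gap

Recent pandas versions (2.1 onward) deprecate the `method` argument of `fillna` in favor of the dedicated `.ffill()` and `.bfill()` methods, so the former is used here.

In [10]:
df.ffill()
# older equivalent: df.fillna(method='ffill')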

Unnamed: 0,a,b,c,d
0,1.0,4.0,8,12.0
1,2.0,4.0,9,12.0
2,2.0,6.0,10,12.0
3,2.0,6.0,11,15.0
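
For comparison, the backward fill can be sketched in the same way. This example assumes the `DataFrame.bfill` method, the dedicated counterpart of the `bfill` option listed above, and rebuilds the same `df`; `back_filled` is a name introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same data as the df defined earlier in this section
df = pd.DataFrame({
    'a': [1, 2, np.nan, np.nan],
    'b': [4, np.nan, 6, np.nan],
    'c': [8, 9, 10, 11],
    'd': [12, np.nan, np.nan, 15],
})

# Each gap is filled with the next valid observation in the same column
back_filled = df.bfill()

print(back_filled)
```

Gaps at the end of a column have no later valid observation to copy from, so the trailing NaN values in `a` and `b` remain missing.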


### ***EXERCISE 8.1***
Show that the main pandas operations ignore missing data: create a copy of the `s1` Series provided below with all NaN values removed, and name the amended copy `s2`.
Then compare the `.mean()` results of the two Series.

***HINT***: the same methods used for the DataFrames above also work on Series

In [11]:
s1 = pd.Series([1,2,np.nan,4])
# insert solution here

### ***EXERCISE 8.2***
Get the sum total of all the values in the `df` provided, including only the rows where either 'quality1' or 'quality2' is not missing.

In [12]:
df = pd.DataFrame({
    'quality1': [100, 92, 30, np.nan, np.nan, 15],
    'value': [7, 4, 8, 1, 9, 2],
    'quality2': [89, 88, np.nan, np.nan, 1, 100],
})
# insert solution here