# Handling Missing Data

## Missing data conventions

There are generally two strategies for marking missing data: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.

In the masking approach, the mask might be a separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value. In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be more global, e.g. using NaN.

Both approaches involve trade-offs. A separate mask requires allocating an additional Boolean array, which adds storage and computation overhead. A sentinel value reduces the range of valid values that can be represented and may require extra logic in CPU and GPU arithmetic. Pandas chose to use sentinels for missing data, building on two existing Python null values: the special floating-point NaN value and the Python None object.
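To make the two strategies concrete, here is a minimal NumPy sketch. The arrays `raw` and `missing` are made-up illustration data, and `np.ma.masked_array` stands in for the masking approach in general; it is not how Pandas stores missing values.

```python
import numpy as np

# Hypothetical readings; the second entry is considered missing.
raw = np.array([1.0, 0.0, 3.0, 4.0])
missing = np.array([False, True, False, False])

# Strategy 1: a separate Boolean mask (True marks a missing entry).
masked = np.ma.masked_array(raw, mask=missing)
mask_total = masked.sum()  # masked entries are skipped

# Strategy 2: a sentinel value (here NaN) stored in the data itself.
with_sentinel = np.where(missing, np.nan, raw)
sentinel_total = np.nansum(with_sentinel)  # NaN-aware sum skips the sentinel

print(mask_total, sentinel_total)
```

The mask costs one extra Boolean per element, while the sentinel version needs no extra array but only works because the float type has a spare value (NaN) to give up.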

## None: Pythonic missing data

Since None is a Python object, it can't be used in arbitrary NumPy/Pandas arrays, but only in arrays with dtype 'object':

In [1]:
import numpy as np
import pandas as pd

In [2]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This means that any operations on this array will be done at the Python level, with much more overhead:

In [5]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1e6, dtype=dtype).sum()
    print()

dtype = object
53.1 ms ± 58.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.96 ms ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)



Also any aggregations on an array with a None value will generally give an error:

In [6]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

## NaN: Missing numerical data

NaN is different: it is a special floating-point value recognised by all systems that use the standard IEEE floating-point representation:

In [7]:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')

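Because the array has a native floating-point dtype, it supports fast operations, but the results are not always useful. You can think of NaN as a virus that infects everything it touches: regardless of the operation, arithmetic involving NaN produces another NaN: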

In [8]:
1 + np.nan

nan

In [9]:
0 + np.nan

nan

In [10]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide NaN-aware versions of these aggregations that ignore the missing values:

In [11]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

(8.0, 1.0, 4.0)
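One consequence of the IEEE rules worth seeing in code is that NaN never compares equal to anything, including itself, so an equality test cannot locate missing entries. The sketch below uses a hypothetical array `vals` and relies on `np.isnan` instead.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

# An equality test against NaN matches nothing, not even the NaN entry.
eq_mask = vals == np.nan

# np.isnan is the reliable way to flag NaN entries.
nan_mask = np.isnan(vals)

# Keep only the entries that are not NaN.
clean = vals[~nan_mask]
print(eq_mask, nan_mask, clean)
```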

## NaN and None in Pandas

Pandas is built to handle both almost interchangeably:

In [12]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types without a sentinel value, Pandas typecasts when NA values are present:

In [14]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [15]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64
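As an aside that goes beyond the sentinel approach described here: recent Pandas versions also offer nullable extension dtypes such as `"Int64"` (capital I), which keep integers as integers by tracking missing entries separately. The following is a small sketch under the assumption that such a version is installed.

```python
import pandas as pd

# "Int64" is the nullable integer dtype; None is stored as pd.NA.
x = pd.Series([None, 1], dtype="Int64")

print(x.dtype)              # the dtype is not upcast to float64
print(x.isnull().tolist())  # the first entry is flagged as missing
total = x.sum()             # missing entries are skipped by default
```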

## Operating on Null Values

Pandas provides several methods for detecting, removing, and replacing null values:

- isnull(): Generate a Boolean mask indicating missing values
- notnull(): Opposite of isnull()
- dropna(): Return a filtered version of the data
- fillna(): Return a copy of the data with missing values filled or imputed

### Detecting null values

In [16]:
data = pd.Series([1, np.nan, 'hello', None])

In [17]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [18]:
data[data.notnull()]

0        1
2    hello
dtype: object

The isnull() and notnull() methods work the same way on DataFrames, where they return a Boolean DataFrame of the same shape.
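A short sketch of the DataFrame case follows. The frame `frame` and its column names are hypothetical illustration data. Summing the Boolean mask gives a per-column count of missing values, since True counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with NaN and None entries.
frame = pd.DataFrame({"a": [1, np.nan, 3], "b": ["x", None, None]})

null_mask = frame.isnull()          # Boolean DataFrame, same shape as frame
nulls_per_column = null_mask.sum()  # number of missing values in each column
print(nulls_per_column)
```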

### Dropping null values

In [21]:
data.dropna()

0        1
2    hello
dtype: object

In [23]:
data.fillna('filled null value')

0                    1
1    filled null value
2                hello
3    filled null value
dtype: object

In [24]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. For that reason, dropna() offers a number of options. By default, it drops any row that contains a null value:

In [25]:
df.dropna()

     0    1  2
1  2.0  3.0  5


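Alternatively, we can drop along the other axis; axis='columns' drops every column that contains a null value:

In [26]: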
df.dropna(axis='columns')

   2
0  2
1  5
2  6


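This also drops some good data, so we might instead want to drop only rows or columns in which all values, or a majority of values, are NA. We can specify this using the **how** or **thresh** parameters. The default is **how='any'**, which drops any row or column containing a null value. We can also specify **how='all'**, which drops only rows or columns whose values are all null: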

In [27]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [28]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


**thresh** sets the minimum number of non-null values a row or column must have in order to be kept:

In [30]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [29]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN
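To see what thresh does, it can help to reproduce it by hand. The sketch below rebuilds the same df as above, counts the non-null values in each row, and keeps the rows that reach the threshold. The names `non_null_per_row` and `manual` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Count the non-null entries in each row.
non_null_per_row = df.notnull().sum(axis=1)

# Keep only rows with at least 3 non-null values, as thresh=3 does.
manual = df[non_null_per_row >= 3]
print(manual.equals(df.dropna(thresh=3)))
```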


### Filling null values

In [31]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [32]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

In [33]:
# forward-fill: propagate the previous valid value forward
data.ffill()  # fillna(method='ffill') is deprecated since pandas 2.1

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

In [36]:
# backward-fill: propagate the next valid value backward
data.bfill()  # fillna(method='bfill') is deprecated since pandas 2.1

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

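For DataFrames the options are similar, but we can also specify the axis along which the fill takes place. Note that if no previous value is available during a forward fill, the NA value remains:

In [37]: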
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [38]:
df.ffill(axis=1)  # fill along each row; fillna(method='ffill', axis=1) is deprecated since pandas 2.1

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0
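Besides a scalar or a directional fill, fillna() also accepts a Series, in which case each column is filled with the value matching its label. A common use, sketched here as one possible imputation choice rather than a recommendation, is filling each column with its own mean. The names `col_means` and `filled` are introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Column means; NaN entries are skipped by default.
col_means = df.mean()

# Passing a Series fills each column with the value for its label.
filled = df.fillna(col_means)
print(filled)
```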
