In [62]:
import numpy as np
import pandas as pd


What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a Salary field with an empty value, the number 0, or an invalid value (a string, for example) can all be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [63]:
falsy_values = (0, False, None, '', [], {})

Python considers all of the values above **falsy**, so ```any``` finds no truthy element among them:

In [64]:
any(falsy_values)

False
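
To see why the result is `False`, each value can be converted with `bool` on its own. This is a minimal sketch; the name `truthiness` is only used here for illustration:

```python
falsy_values = (0, False, None, '', [], {})

# Convert each value to its boolean equivalent, one at a time
truthiness = [bool(value) for value in falsy_values]
print(truthiness)

# any() returns False only because every single element is falsy
print(any(falsy_values))
```

Note that being falsy is not the same as being missing: a salary of 0 may be a perfectly valid answer, so the decision still depends on the context of the data.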

NumPy has a special **nullable** value for numbers, ```np.nan```. It stands for ***NaN***: **Not a Number**.

In [65]:
np.nan

nan

The ```np.nan``` value behaves like a virus: every arithmetic operation it touches results in ```np.nan```:

In [66]:
3 + np.nan

nan

In [67]:
a = np.array([1,2,3, np.nan, np.nan, 4])

In [68]:
a.sum()

nan

In [69]:
a.mean()

nan

This is better than a regular ```None``` value, which would have raised an exception in the previous examples:

In [70]:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

For a numeric array with a float dtype, the ```None``` value is replaced by ```np.nan```:

In [71]:
a = np.array([1,2,3, np.nan, None, 4], dtype='float')

In [72]:
a

array([ 1.,  2.,  3., nan, nan,  4.])
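
The explicit float dtype matters for this conversion. As a small sketch (the names `obj_arr` and `float_arr` are illustrative), the same list built without a dtype keeps `None` as a plain Python object:

```python
import numpy as np

# Without an explicit dtype, NumPy falls back to a generic object array
# and None stays as None
obj_arr = np.array([1, 2, 3, np.nan, None, 4])
print(obj_arr.dtype)

# Requesting a float dtype converts None to nan
float_arr = np.array([1, 2, 3, np.nan, None, 4], dtype=float)
print(float_arr.dtype)
print(float_arr)
```

Object arrays lose the fast numeric operations, so it is usually worth asking for a float dtype when missing values are expected.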

NumPy also supports an **infinite** value:

In [73]:
np.inf

inf

It also behaves like a virus, and some operations on it produce ```np.nan```:

In [74]:
3 + np.inf

inf

In [75]:
np.inf / 3

inf

In [76]:
np.inf / np.inf

nan

In [77]:
b = np.array([1,2,3,np.inf, np.nan, 4], dtype=float)

In [78]:
b.sum()

nan
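
A few more cases may help to see when infinity stays infinite and when it collapses to `nan`. This is only a sketch with illustrative variable names:

```python
import numpy as np

# Infinity is signed and compares as larger (or smaller) than any finite number
is_ordered = -np.inf < 0 < np.inf
print(is_ordered)

# Operations with no well-defined result give nan instead of inf
difference = np.inf - np.inf
product = np.inf * 0
print(difference, product)
```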

### Checking for nan or inf
There are two functions, ```np.isnan``` and ```np.isinf```, that perform the desired checks:

In [79]:
np.isnan(np.nan)

True

In [80]:
np.isinf(np.inf)

True

The joint check can be performed with ```np.isfinite```, which returns ```False``` for both ```nan``` and ```inf```:

In [81]:
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

```np.isnan```, ```np.isinf``` and ```np.isfinite``` also take arrays as inputs, and return boolean arrays as results:

In [82]:
np.isnan(np.array([1,2,3,np.nan,np.inf,4]))

array([False, False, False,  True, False, False])

In [83]:
np.isinf(np.array([1,2,3,np.nan,np.inf,4]))

array([False, False, False, False,  True, False])

In [84]:
np.isfinite(np.array([1,2,3,np.nan,np.inf,4]))

array([ True,  True,  True, False, False,  True])
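
Because the result is a boolean array, it can also be used to check whether an array has missing values, count them, and locate them. A minimal sketch, where the array `values` and the other names are hypothetical:

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf, 4])

nan_mask = np.isnan(values)

# True counts as 1, so any() and sum() summarize the mask
has_nan = nan_mask.any()
nan_count = nan_mask.sum()

# Positions of the nan values within the array
nan_positions = np.flatnonzero(nan_mask)

print(has_nan, nan_count, nan_positions)
```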

***Note:*** *It's not so common to find infinite values. From now on, we'll keep working with only ```np.nan```*

### Filtering them out
Whenever you're trying to perform an operation with a **numpy array** and you know there might be missing values, you'll need to filter them out before proceeding, to avoid ```nan``` propagation. We'll combine ```np.isnan``` with **boolean array** indexing for this purpose:

In [85]:
a = np.array([1,2,3,np.nan,np.nan,4])

In [86]:
a[~np.isnan(a)]

array([1., 2., 3., 4.])

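This is equivalent to the following, as long as the array contains no infinite values (```np.isfinite``` also drops those):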

In [87]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
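
A minimal sketch of where the two filters differ, using a hypothetical array `c` that also contains `np.inf`:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

# Removing only nan keeps the infinite value
only_nan_removed = c[~np.isnan(c)]
print(only_nan_removed)

# Keeping only finite values drops both nan and inf
finite_only = c[np.isfinite(c)]
print(finite_only)
```

Which filter is appropriate depends on whether infinite values are meaningful in the data at hand.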

With that result, all the operations can now be performed:

In [88]:
a[np.isfinite(a)].sum()

10.0

In [89]:
a[~np.isnan(a)].mean()

2.5
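
As an alternative to filtering by hand, NumPy also provides `np.nansum` and `np.nanmean`, which skip `nan` values during the computation. A short sketch on the same array:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# Same results as filtering with ~np.isnan(a) first
total = np.nansum(a)
average = np.nanmean(a)

print(total, average)
```

These functions only ignore `nan`; an infinite value in the array would still propagate into the result.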