In [26]:
import numpy as np
import pandas as pd

What does "missing data" mean? What is a missing value? It depends on the origin of the data and the context in which it was generated. For example, in a survey, a Salary field with an empty value, the number 0, or an invalid value (a string, for example) can all be considered "missing data". These concepts are related to the values that Python considers "falsy":

In [3]:
falsy_values = (0, False, None, '', [], {})

For Python, all the values above are considered "falsy":

In [4]:
any(falsy_values)

False
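As a small sketch of what "falsy" means in practice, each value can be converted with `bool()` individually (the variable names `truthiness` and `nan_is_truthy` are only illustrative). Note that `np.nan` itself is truthy, so a plain truthiness test is not a reliable way to detect missing numeric values:

```python
import numpy as np

falsy_values = (0, False, None, '', [], {})

# Convert each value to its boolean equivalent.
truthiness = [bool(v) for v in falsy_values]
print(truthiness)  # [False, False, False, False, False, False]

# np.nan is a non-zero float, so Python treats it as truthy.
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)  # True
```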

NumPy has a special "nullable" value for numbers: np.nan, which stands for "Not a Number":

In [5]:
np.nan

nan

The np.nan value is kind of a virus. Everything that it touches becomes np.nan:

In [6]:
3 + np.nan

nan

In [7]:
a = np.array([1, 5, 6, 7, 5])
a + np.nan

array([nan, nan, nan, nan, nan])

In [8]:
a = np.array([1,2,3, np.nan, np.nan, 4])
a

array([ 1.,  2.,  3., nan, nan,  4.])

In [9]:
a.sum()

nan

In [10]:
a.mean()

nan

This is better than regular None values, which in the previous examples would have raised an exception:

In [11]:
3 + None

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

In a numeric array created with a float dtype, None is converted to np.nan:

In [12]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')
a

array([ 1.,  2.,  3., nan, nan,  4.])
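The explicit float dtype matters here. As a minimal sketch (the names `obj` and `flt` are illustrative), compare what NumPy does with and without it:

```python
import numpy as np

# Without an explicit dtype, the None forces a generic object array
# and the None is stored as-is.
obj = np.array([1, 2, 3, None])
print(obj.dtype)  # object

# With dtype='float', the None is converted to nan.
flt = np.array([1, 2, 3, None], dtype='float')
print(flt.dtype)         # float64
print(np.isnan(flt[3]))  # True
```

An object array still holds a real None, so arithmetic on it can raise the same TypeError shown earlier.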

As we said, np.nan is like a virus. If an array contains any nan value, aggregations such as the mean or the sum return nan as well:

In [13]:
a = np.array([1, 2, 3, np.nan, None, 4], dtype='float')
a.mean()

nan

In [14]:
a.sum()

nan

NumPy also supports an "infinite" value:

In [15]:
np.inf

inf

It also behaves like a virus:

In [16]:
3+np.inf

inf

In [17]:
np.inf/3

inf

In [18]:
np.inf / np.inf

nan

In [20]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=np.float32)
b.sum()

nan


# Checking for nan or inf
There are two functions, np.isnan and np.isinf, that perform the desired checks:

In [21]:
np.isnan(np.nan)

True
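A dedicated function is needed because NaN never compares equal to anything, not even to itself. A short sketch (the array `values` is made up for illustration):

```python
import numpy as np

# NaN is defined to compare unequal to everything, including itself.
equal_to_itself = np.nan == np.nan
print(equal_to_itself)  # False

values = np.array([1.0, np.nan, 3.0])

# An equality test therefore never finds the NaN...
equality_mask = values == np.nan
print(equality_mask.any())  # False

# ...while np.isnan does.
isnan_mask = np.isnan(values)
print(isnan_mask.tolist())  # [False, True, False]
```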

In [22]:
b = np.array([1, 2, 3, np.inf, np.nan, 4], dtype=np.float32)
np.isnan(b)

array([False, False, False, False,  True, False])

In [23]:
np.isinf(np.inf)

True

The joint check can be performed with **np.isfinite**, which returns True only for values that are neither nan nor infinite:

In [24]:
np.isfinite(np.nan), np.isfinite(np.inf)

(False, False)

**np.isnan** and **np.isinf** also take arrays as inputs, and return boolean arrays as results:

In [27]:
np.isnan(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False,  True, False, False])

In [28]:
np.isinf(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([False, False, False, False,  True, False])

In [29]:
np.isfinite(np.array([1, 2, 3, np.nan, np.inf, 4]))

array([ True,  True,  True, False, False,  True])
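Because these functions return boolean arrays, they combine naturally with `.any()` and `.sum()` to answer "is anything missing?" and "how many values are missing?". A minimal sketch, with illustrative variable names:

```python
import numpy as np

data = np.array([1, 2, 3, np.nan, np.inf, 4])

# Is there at least one NaN in the array?
has_nan = np.isnan(data).any()

# True counts as 1 when summed, so this is the number of NaNs.
nan_count = np.isnan(data).sum()

# Number of values that are either NaN or infinite.
nonfinite_count = (~np.isfinite(data)).sum()

print(has_nan, nan_count, nonfinite_count)  # True 1 2
```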

Note: Infinite values are not very common. From now on, we'll work only with np.nan.


# Filtering them out




Whenever you're trying to perform an operation with a Numpy array and you know there might be missing values, you'll need to filter them out before proceeding, to avoid nan propagation. We'll use a combination of the previous np.isnan + boolean arrays for this purpose:

In [35]:
a = np.array([1, 2, 3, np.nan, np.nan, 4])
a[~np.isnan(a)]  # np.isnan marks NaNs as True; ~ inverts the mask, so indexing keeps only the non-NaN values

array([1., 2., 3., 4.])

Which is equivalent to the following, as long as the array contains no infinite values (np.isfinite also excludes np.inf):

In [36]:
a[np.isfinite(a)]

array([1., 2., 3., 4.])
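The two masks only select the same elements when the array has no infinite values. A small sketch with a made-up array `c` that contains both np.inf and np.nan shows the difference:

```python
import numpy as np

c = np.array([1, 2, np.inf, np.nan, 4])

# ~np.isnan removes only the NaN, so inf is still present.
not_nan = c[~np.isnan(c)]

# np.isfinite removes both the NaN and the inf.
finite = c[np.isfinite(c)]

print(not_nan.tolist())  # [1.0, 2.0, inf, 4.0]
print(finite.tolist())   # [1.0, 2.0, 4.0]
```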

With that result, all the operations can now be performed:

In [37]:
a[np.isfinite(a)].sum()

10.0

In [38]:
a[np.isfinite(a)].mean()

2.5
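As an alternative to filtering by hand, NumPy also ships NaN-aware aggregations such as `np.nansum` and `np.nanmean`, which skip NaN values. This is a sketch of an alternative technique, not the approach used above; note that these functions ignore only NaN, not infinite values:

```python
import numpy as np

a = np.array([1, 2, 3, np.nan, np.nan, 4])

# Sum and mean computed over the non-NaN values only.
total = np.nansum(a)
average = np.nanmean(a)

print(total)    # 10.0
print(average)  # 2.5
```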