### Missing Data

In [1]:
import pandas as pd
import numpy as np

Python's `float` type includes some special values that represent infinity and "not a number". These come from the IEEE 754 floating-point standard, so they are not specific to Python.

In [2]:
float('inf')

inf

This is an ordinary `float` object:

In [3]:
type(float('inf'))

float

We can also get that `float` object from the `math` module:

In [4]:
import math
math.inf

inf

Or even from the NumPy library:

In [5]:
np.inf

inf

Having this "infinity" float can be useful when you need to define upper and lower bounds.
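As a small illustration of using infinity as a bound, here is a minimal sketch that starts a running minimum at `math.inf`. The helper name `smallest` is hypothetical and only used for this example.

```python
import math

def smallest(values):
    # Hypothetical helper: start at +infinity so that any real
    # number we encounter compares smaller than the initial bound.
    lowest = math.inf
    for v in values:
        if v < lowest:
            lowest = v
    return lowest

result = smallest([7.5, -2.0, 3.1])
empty_result = smallest([])  # nothing to compare, so the bound is returned
print(result, empty_result)
```

Starting at infinity avoids special-casing the first element, and an empty input simply returns the initial bound.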

The other special float, and the one we're going to use in this video, is `NaN` (not a number). It indicates a float whose value is undefined or cannot be represented.

It is also often used to indicate missing data in an array.

In [6]:
float('NaN'), float('nan')

(nan, nan)

It is also available from `math` and `numpy`:

In [7]:
math.nan, np.nan

(nan, nan)

**CAUTION**: Do not test for `NaN` using equality - a `NaN` value never compares equal to anything, not even to another `NaN`!

In [8]:
float('nan') == float('nan')

False

In [9]:
float('nan') is math.nan

False

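This kind of makes sense: if two numbers are undefined, who's to say whether they are equal? (The `is` test above is also `False`, but for a different reason: `float('nan')` creates a new object that is distinct from `math.nan`.)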
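To make the consequences concrete, here is a short sketch (the variable names are just for illustration) showing that a `NaN` is not even equal to itself, and that an equality-based search never finds it:

```python
import math

x = float('nan')

# NaN is the only float value that is not equal to itself.
self_equal = (x == x)

# Ordering comparisons that involve NaN are all False as well.
ordered = (x < 1.0, x > 1.0, x <= x)

data = [1.0, float('nan'), 3.0]
# An equality test never matches the NaN in the list...
found_by_eq = [v for v in data if v == math.nan]
# ...whereas math.isnan() does.
found_by_isnan = [v for v in data if math.isnan(v)]

print(self_equal, ordered, found_by_eq, len(found_by_isnan))
```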

So how do we check if a number is `NaN` if we cannot use equality tests?

The `math` module has the function `isnan()` which we can use:

In [10]:
math.isnan(float('NAN'))

True

In [11]:
math.isnan(np.nan)

True

The NumPy module has that function as well:

In [12]:
np.isnan(math.nan)

True

If you're wondering why NumPy defines this function when it already exists in the `math` module, keep in mind that `np.isnan()` is a universal function: it works element-wise on whole arrays, whereas `math.isnan()` only accepts a single number.

In [13]:
a = np.array([1, 2, np.nan, 3, np.nan])
a

array([ 1.,  2., nan,  3., nan])

In [14]:
np.isnan(a)

array([False, False,  True, False,  True])
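The difference between the two functions can be sketched as follows: `math.isnan()` rejects an array, while the mask returned by `np.isnan()` can be used to count or filter the missing entries. The variable names here are only illustrative.

```python
import math
import numpy as np

a = np.array([1, 2, np.nan, 3, np.nan])

# math.isnan() expects a single real number, so passing a
# multi-element array raises a TypeError.
try:
    math.isnan(a)
    math_works = True
except TypeError:
    math_works = False

# np.isnan() returns a boolean mask with one entry per element.
mask = np.isnan(a)
n_missing = int(mask.sum())   # True counts as 1
clean = a[~mask]              # keep only the non-NaN values

print(math_works, n_missing, clean)
```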

#### Working with Missing Values in Series

Let's look at `NaN` in the context of `Series` objects:

In [15]:
s = pd.Series([3.14, 2.5, None, 5])
s

0    3.14
1    2.50
2     NaN
3    5.00
dtype: float64

As you can see, the array was interpreted as a float64 array, and the `None` value was converted to the `NaN` float:

In [16]:
type(s[2])

numpy.float64

So we can have `NaN` values for floats, but what about integers?

Pandas is built on top of NumPy, and NumPy integer arrays have no way to represent `NaN`, so neither do the default integer dtypes in Pandas. (Pandas does provide a nullable `Int64` extension dtype, but we won't use it here.)

What ends up happening is that Pandas casts the series to `float64` (or `object`) when it encounters a `None` or `NaN`:

In [17]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [18]:
pd.Series([1, 2, 3, None])

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

In [19]:
pd.Series([1, 2, 3, np.nan])

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

So for numeric types, we can use either `NaN` or `None`, and Pandas will convert it to the float `NaN` if needed.
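As an aside, and not something the rest of this section relies on, the following sketch contrasts the default casting behavior with the nullable `Int64` extension dtype that Pandas also provides. It assumes a Pandas version that includes that dtype.

```python
import pandas as pd

# Default behavior: the missing value forces a cast to float64.
default = pd.Series([1, 2, None])

# Assumption: the nullable 'Int64' extension dtype is available.
# It keeps the integers as integers and marks the gap as missing.
nullable = pd.Series([1, 2, None], dtype='Int64')

print(default.dtype)
print(nullable.dtype)
print(nullable.isnull().tolist())
```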

But we have to be more careful with series that are of `object` type, such as strings:

In [20]:
s = pd.Series(['a', 'b', None, np.nan])
s

0       a
1       b
2    None
3     NaN
dtype: object

You'll notice here that `None` was **not** converted to `NaN`.

So how do we test if a value in this series is "missing"? 

In [21]:
s[2] is None

True

In [22]:
s[3] is None

False

So we can't test using `is None` for both cases, and neither can we use `isnan()`:

In [23]:
try:
    math.isnan(s[2])
except TypeError as ex:
    print('TypeError:', ex)

TypeError: must be real number, not NoneType


Fortunately, Pandas offers a few functions we can use to deal with missing values, represented either by `None` or by `NaN`:

- `isnull()`
- `notnull()`
- `dropna()`
- `fillna()`

Let's take a look at each one and see how they work, in the context of `Series` objects first.
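Before looking at series, here is a quick sketch showing that `pd.isnull()` also works on individual values, and treats both `None` and `NaN` as missing. The list of candidate values is made up for the example.

```python
import numpy as np
import pandas as pd

candidates = [None, np.nan, float('nan'), 'a', 0, '']

# pd.isnull() accepts scalars and returns a plain boolean for each.
flags = [bool(pd.isnull(v)) for v in candidates]

print(flags)
```

Note that an empty string and `0` are not considered missing.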

##### isnull()

This function is applied element-wise to every value in the series:

In [24]:
s = pd.Series(['aaa', 'bbb', None, 'ddd', np.nan], index=list('abcde'))
s

a     aaa
b     bbb
c    None
d     ddd
e     NaN
dtype: object

In [25]:
pd.isnull(s)

a    False
b    False
c     True
d    False
e     True
dtype: bool

As you can see, this returns a series of boolean type, which we can then use for boolean masking:

In [26]:
s[pd.isnull(s)]

c    None
e     NaN
dtype: object

Here we were able to essentially extract all the `NaN` and `None` values.
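The same mask can also be used to count the missing values or to find their labels. A minimal sketch, using the method form `s.isnull()` rather than `pd.isnull(s)`:

```python
import numpy as np
import pandas as pd

s = pd.Series(['aaa', 'bbb', None, 'ddd', np.nan], index=list('abcde'))

mask = s.isnull()

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(mask.sum())

# Indexing the index with the mask gives the labels of the missing rows.
missing_labels = list(s.index[mask])

print(n_missing, missing_labels)
```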

##### notnull()

The `isnull()` function basically created a mask where the missing values returned `True` - we could reverse this mask by using not (`~`):

In [27]:
s[~pd.isnull(s)]

a    aaa
b    bbb
d    ddd
dtype: object

Or, we can just use the `notnull()` function:

In [28]:
s[pd.notnull(s)]

a    aaa
b    bbb
d    ddd
dtype: object

##### dropna()

This function will drop any missing values from the series:

In [29]:
s.dropna()

a    aaa
b    bbb
d    ddd
dtype: object

This does not affect the original series; it returns a new series, and as usual the index labels are maintained in the new series.
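A short sketch to confirm that behavior, using a small numeric series made up for the example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan], index=list('abcd'))

cleaned = s.dropna()

# The original still has all four rows, including the missing ones.
original_length = len(s)
cleaned_length = len(cleaned)

# The surviving rows keep their original labels.
cleaned_labels = list(cleaned.index)

print(original_length, cleaned_length, cleaned_labels)
```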

##### fillna()

This function can be used to replace missing values in a series with some other value:

In [30]:
s

a     aaa
b     bbb
c    None
d     ddd
e     NaN
dtype: object

In [31]:
s.fillna('missing')

a        aaa
b        bbb
c    missing
d        ddd
e    missing
dtype: object

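Where `fillna()` becomes interesting is when we use other values in the same series to impute the missing ones. Pandas offers two common methods. (In recent versions of Pandas the `method` argument of `fillna()` is deprecated; the dedicated `ffill()` and `bfill()` methods perform the same fills.)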

The first one is to use the preceding (non-missing) value. This is called a **forward fill**:

In [32]:
s.fillna(method='ffill')

a    aaa
b    bbb
c    bbb
d    ddd
e    ddd
dtype: object

We can also use a back fill, which uses the next non-missing value:

In [33]:
s.fillna(method='bfill')

a    aaa
b    bbb
c    ddd
d    ddd
e    NaN
dtype: object

Notice how the last value was not replaced - that's because there was no value after the `e` row. And a similar thing happens with forward filling if the first value is missing.
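The forward fill case can be sketched like this, using the `ffill()` method on a made-up numeric series whose first value is missing:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0], index=list('abcd'))

filled = s.ffill()

# 'a' has no preceding value, so it stays missing;
# 'c' picks up the value from 'b'.
first_still_missing = bool(pd.isnull(filled['a']))
c_value = filled['c']

print(first_still_missing, c_value)
```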

You could use both a back fill and a forward fill to handle the edges:

In [34]:
s.fillna(method='ffill').fillna(method='bfill')

a    aaa
b    bbb
c    bbb
d    ddd
e    ddd
dtype: object
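Note that the order in which the two fills are chained matters for gaps in the middle of the series. A small sketch, using the `ffill()` and `bfill()` methods on a numeric series made up for the example:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan], index=list('abcde'))

# Forward fill first: interior gaps take the preceding value,
# then the back fill only has the leading gap left to handle.
forward_then_back = s.ffill().bfill()

# Back fill first: interior gaps take the following value,
# then the forward fill only has the trailing gap left to handle.
back_then_forward = s.bfill().ffill()

print(forward_then_back.tolist())
print(back_then_forward.tolist())
```

Both results are fully filled, but the interior gap at `c` gets `1.0` in the first case and `3.0` in the second.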

Related to `fillna()` is the `interpolate()` function, which supports more advanced techniques for imputing missing values. We won't study it in detail, but we'll look at it briefly.

For more information, you should refer to:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

In [35]:
s = pd.Series([1, 2, None, 4, None, 7])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
5    7.0
dtype: float64

In [36]:
s.interpolate(method='linear')

0    1.0
1    2.0
2    3.0
3    4.0
4    5.5
5    7.0
dtype: float64

The linear method treats the values as equally spaced (it ignores the index) and fills each gap along a straight line between the neighboring non-missing values.
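To see what "equally spaced" means in practice, here is a rough sketch that reproduces one interpolated value by hand, and then compares `method='linear'` with `method='index'` on a series whose index labels are unevenly spaced. The `uneven` series is made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None, 4, None, 7])

# The gap at position 4 sits halfway between 4.0 and 7.0.
by_hand = s[3] + (s[5] - s[3]) / 2

# Two consecutive gaps are spread evenly between the neighbors.
gap = pd.Series([0.0, np.nan, np.nan, 3.0]).interpolate(method='linear')

# With an uneven index, 'linear' ignores the labels,
# while 'index' uses them as the x-coordinates.
uneven = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])
by_position = uneven.interpolate(method='linear')
by_index = uneven.interpolate(method='index')

print(by_hand, gap.tolist(), by_position[1], by_index[1])
```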

#### Working with Missing Values in DataFrames

Now let's look at how all this works in the context of data frames.

In [37]:
d = {
    'col1': {'row1': 1, 'row2': 10, 'row3': 100, 'row4': 1000, 'row5': 10000},
    'col2': {'row1': 2, 'row2': None, 'row3': None, 'row4': 2000, 'row5': 20000},
    'col3': {'row1': 3, 'row2': 30, 'row3': 300, 'row4': None, 'row5': 40000},
    'col4': {'row1': 4, 'row2': 40, 'row3': 400, 'row4': 4000, 'row5': 40000}
}

df = pd.DataFrame(d)
df

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


We can use the `isnull()` and `notnull()` functions on the entire dataframe:

In [38]:
df.isnull()

       col1   col2   col3   col4
row1  False  False  False  False
row2  False   True  False  False
row3  False   True  False  False
row4  False  False   True  False
row5  False  False  False  False


As you can see, we get a mask value for every element of the data frame.
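Because the mask is itself a data frame of booleans, it can be aggregated. The following sketch counts the missing values per column and finds the rows that contain at least one missing value:

```python
import pandas as pd

d = {
    'col1': {'row1': 1, 'row2': 10, 'row3': 100, 'row4': 1000, 'row5': 10000},
    'col2': {'row1': 2, 'row2': None, 'row3': None, 'row4': 2000, 'row5': 20000},
    'col3': {'row1': 3, 'row2': 30, 'row3': 300, 'row4': None, 'row5': 40000},
    'col4': {'row1': 4, 'row2': 40, 'row3': 400, 'row4': 4000, 'row5': 40000}
}
df = pd.DataFrame(d)

mask = df.isnull()

# Summing the mask down each column counts the missing values per column.
missing_per_column = mask.sum()

# any(axis=1) flags each row that has at least one missing value.
rows_with_missing = list(df.index[mask.any(axis=1)])

print(missing_per_column.to_dict())
print(rows_with_missing)
```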

The `fillna()` method can be applied to the entire data frame as well:

In [39]:
df.fillna(0)

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      0.0     30.0     40
row3    100      0.0    300.0    400
row4   1000   2000.0      0.0   4000
row5  10000  20000.0  40000.0  40000
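A single fill value is not always appropriate for every column. As a sketch, `fillna()` also accepts a dictionary that maps column names to fill values; the values `0` and `-1` below are arbitrary choices for the example.

```python
import pandas as pd

d = {
    'col1': {'row1': 1, 'row2': 10, 'row3': 100, 'row4': 1000, 'row5': 10000},
    'col2': {'row1': 2, 'row2': None, 'row3': None, 'row4': 2000, 'row5': 20000},
    'col3': {'row1': 3, 'row2': 30, 'row3': 300, 'row4': None, 'row5': 40000},
    'col4': {'row1': 4, 'row2': 40, 'row3': 400, 'row4': 4000, 'row5': 40000}
}
df = pd.DataFrame(d)

# Use a different (arbitrary) replacement value for each column.
filled = df.fillna({'col2': 0, 'col3': -1})

print(filled)
```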


We can also use a back/forward fill, but the question then becomes: do we fill along the columns or along the rows?

Let's try it:

In [40]:
print(df)
df.fillna(method='ffill')

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      2.0     30.0     40
row3    100      2.0    300.0    400
row4   1000   2000.0    300.0   4000
row5  10000  20000.0  40000.0  40000


As you can see, this forward filled down each column, which is the default (`axis=0`).

But we might want to fill values based on the rows, not the columns.

The `axis` argument allows us to specify this:

In [41]:
print(df)
df.fillna(method='ffill', axis=1)

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


         col1     col2     col3     col4
row1      1.0      2.0      3.0      4.0
row2     10.0     10.0     30.0     40.0
row3    100.0    100.0    300.0    400.0
row4   1000.0   2000.0   2000.0   4000.0
row5  10000.0  20000.0  40000.0  40000.0


The backfill method works the same way.
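For completeness, here is a sketch of the back fill along both axes, written with the `bfill()` method:

```python
import pandas as pd

d = {
    'col1': {'row1': 1, 'row2': 10, 'row3': 100, 'row4': 1000, 'row5': 10000},
    'col2': {'row1': 2, 'row2': None, 'row3': None, 'row4': 2000, 'row5': 20000},
    'col3': {'row1': 3, 'row2': 30, 'row3': 300, 'row4': None, 'row5': 40000},
    'col4': {'row1': 4, 'row2': 40, 'row3': 400, 'row4': 4000, 'row5': 40000}
}
df = pd.DataFrame(d)

# Down each column: a gap takes the next non-missing value below it.
down_columns = df.bfill()

# Across each row: a gap takes the next non-missing value to its right.
across_rows = df.bfill(axis=1)

print(down_columns)
print(across_rows)
```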

The `interpolate` function also works on data frames, and just like the `fillna` method, we can also specify the axis we want to fill along.

In [42]:
print(df)
df.interpolate(method='linear')

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10    668.0     30.0     40
row3    100   1334.0    300.0    400
row4   1000   2000.0  20150.0   4000
row5  10000  20000.0  40000.0  40000


Or we could interpolate across the columns, that is, within each row:

In [43]:
print(df)
df.interpolate(method='linear', axis=1)

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


         col1     col2     col3     col4
row1      1.0      2.0      3.0      4.0
row2     10.0     20.0     30.0     40.0
row3    100.0    200.0    300.0    400.0
row4   1000.0   2000.0   3000.0   4000.0
row5  10000.0  20000.0  40000.0  40000.0


Lastly we have the `dropna()` method.

This can be used to drop rows or columns that contain null values. We just have to specify which axis we want to operate on (`0` to drop rows, `1` to drop columns, with the default being `0`):

In [44]:
print(df)
df.dropna()

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


       col1     col2     col3   col4
row1      1      2.0      3.0      4
row5  10000  20000.0  40000.0  40000


As you can see, this dropped any **row** that contained a null value.

On the other hand, we may want to drop **columns** that contain null values:

In [45]:
print(df)
df.dropna(axis=1)

       col1     col2     col3   col4
row1      1      2.0      3.0      4
row2     10      NaN     30.0     40
row3    100      NaN    300.0    400
row4   1000   2000.0      NaN   4000
row5  10000  20000.0  40000.0  40000


       col1   col4
row1      1      4
row2     10     40
row3    100    400
row4   1000   4000
row5  10000  40000
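Dropping every row or column that has any missing value can be quite aggressive. As a brief sketch of two ways to narrow it down, `dropna()` also accepts a `subset` argument (only look at certain columns when dropping rows) and a `thresh` argument (keep anything with at least that many non-missing values). The choice of `col3` and the threshold of `4` are arbitrary for this example.

```python
import pandas as pd

d = {
    'col1': {'row1': 1, 'row2': 10, 'row3': 100, 'row4': 1000, 'row5': 10000},
    'col2': {'row1': 2, 'row2': None, 'row3': None, 'row4': 2000, 'row5': 20000},
    'col3': {'row1': 3, 'row2': 30, 'row3': 300, 'row4': None, 'row5': 40000},
    'col4': {'row1': 4, 'row2': 40, 'row3': 400, 'row4': 4000, 'row5': 40000}
}
df = pd.DataFrame(d)

# Drop only the rows where col3 is missing; gaps in col2 are ignored.
rows_kept = df.dropna(subset=['col3'])

# Keep only the columns that have at least 4 non-missing values.
cols_kept = df.dropna(axis=1, thresh=4)

print(list(rows_kept.index))
print(list(cols_kept.columns))
```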
