# Session 18: Missing data, interpolation and filling strategies

## Missing data

`pandas` objects expose two methods for detecting missing values: `df.isna()` and `df.isnull()`. They perform the same task; the documentation describes `isnull` as an alias of `isna`.

`pandas` treats both `None` and `np.nan` as the same thing: a missing value, usually displayed as `NaN`. It's this missing value that `isna` and `isnull` detect.

Let's create a DF to compare the methods.

In [2]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["x", "y", np.nan, "z"],
    "C": [True, True, False, None]
})

df

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


In [3]:
# the behavior is the same, because of the under-the-hood conversion from pandas
df.isna() == df.isnull()

Unnamed: 0,A,B,C
0,True,True,True
1,True,True,True
2,True,True,True
3,True,True,True
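
The element-wise comparison above can be collapsed into a single answer. The following is a minimal sketch on a reduced version of the same DataFrame; the variable names `same_mask` and `all_equal` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["x", "y", np.nan, "z"],
})

# None in a numeric column is stored as NaN, so the column becomes float64
print(df["A"].dtype)

# Both methods return the same boolean mask
same_mask = df.isna().equals(df.isnull())
print(same_mask)

# Alternative: reduce the element-wise comparison over columns, then over the result
all_equal = bool((df.isna() == df.isnull()).all().all())
print(all_equal)
```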


## Handling `NaN`

Consider a `pd.DataFrame` with _n_ rows and _m_ columns.

`df.isna()` returns a _mask_: a DataFrame with the same _n_ rows and _m_ columns, filled with `True` and `False`:
* `True` marks a position that holds `None` or `NaN`
* `False` marks a position that holds a valid value
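
Because `True` counts as 1 and `False` as 0, the mask can be summed or averaged. A small sketch, reusing the DataFrame from above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["x", "y", np.nan, "z"],
    "C": [True, True, False, None],
})

mask = df.isna()

# The mask has exactly the same shape as the DataFrame it was built from
print(mask.shape == df.shape)

# Summing the mask counts missing values per column
nan_count_per_column = mask.sum()

# Averaging the mask gives the share of missing values per column
nan_share_per_column = mask.mean()

# Summing twice gives the total number of missing values
total_nan = int(mask.sum().sum())
print(total_nan)
```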


In [4]:
# share of NaN values per column:

df.isna().mean()

A    0.25
B    0.25
C    0.25
dtype: float64

In [5]:
# share of NaN values in the whole dataset
df.isna().mean().mean()

0.25

In [6]:
df

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


In [7]:
df["A"].isna()

0    False
1    False
2    False
3     True
Name: A, dtype: bool

### In pandas, the logical operators differ from Python's

* `not` is `~`
* `and` is `&`
* `or` is `|`
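
A short sketch of the three operators on boolean masks, and of what happens when the Python keyword is used instead. The mask names (`a_missing`, `b_missing`) are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["x", "y", np.nan, "z"],
})

a_missing = df["A"].isna()
b_missing = df["B"].isna()

# `|` keeps rows where at least one of the two columns is missing
either_missing = df[a_missing | b_missing]

# `~` negates a mask, `&` requires both conditions to hold
both_present = df[~a_missing & ~b_missing]

# The Python keyword `and` tries to reduce each Series to a single bool,
# which pandas refuses to do
try:
    df[a_missing and b_missing]
except ValueError as err:
    print("ValueError:", err)
```

When a condition is written inline (for example `df["A"] > 1`), wrap it in parentheses before combining it, because `&` and `|` bind more tightly than comparison operators.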

In [9]:
# filtering a DataFrame with boolean masks:
# rows where A is NaN and B is not NaN

df[
    (df["A"].isna())&
    (~df["B"].isna())
]

Unnamed: 0,A,B,C
3,,z,


In [10]:
# filtering a DataFrame with a boolean mask:
# excluding rows where A is NaN

df[~df["A"].isna()] # `~` negates the mask

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False


In [11]:
# the opposite of `df.isna()` is `df.notna()`
df.notna()

Unnamed: 0,A,B,C
0,True,True,True
1,True,True,True
2,True,False,True
3,False,True,False


### Removing `NaN` values

In `pandas`, the `df.dropna()` method removes the rows or columns that contain null values.

By default, `df.dropna()` removes **every** row that contains at least one NaN.
* passing `axis=1` changes the behavior so that it removes the columns that contain NaN instead

In [12]:
df.dropna(axis=0)

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True


In [13]:
# every column contains a NaN, so no columns remain (only the index is left)
df.dropna(axis=1)

0
1
2
3


In [14]:
df["A"].dropna()

0    1.0
1    2.0
2    3.0
Name: A, dtype: float64

The `how` argument gives finer control over what is removed:
* `how="any"` (the default): a row/column is removed if it contains any NaN
* `how="all"`: a row/column is removed only if **all** of its values are NaN
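
The DataFrame used in this session has no row that is entirely NaN, so the difference between the two options is easier to see on a small hypothetical frame (`demo` is an illustrative name, not part of the session data):

```python
import numpy as np
import pandas as pd

# Row 1 is entirely NaN, row 2 is only partially NaN
demo = pd.DataFrame({
    "A": [1.0, np.nan, np.nan],
    "B": [10.0, np.nan, 30.0],
})

# how="any": drop a row as soon as it holds one NaN -> only row 0 survives
dropped_any = demo.dropna(how="any")

# how="all": drop a row only if every value is NaN -> rows 0 and 2 survive
dropped_all = demo.dropna(how="all")

print(dropped_any.index.tolist())
print(dropped_all.index.tolist())
```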

In [15]:
df.dropna(how="any")

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True


In [16]:
# no row is entirely NaN, so nothing is removed
df.dropna(how="all")

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


## Filling `NaN`

We have several approaches for filling missing values:
* Fill NaN with a constant value
* Forward Fill or Backward Fill NaN
* Fill NaN with Mean, Median or Mode of the data
* Interpolate Data and Fill NaN

In `pandas` we can fill the missing values in our DFs with `df.fillna()`. Which arguments we set depends on the approach:
* `value`: the value that replaces `NaN`
* `method`:
    * `bfill`: use the next valid observation to fill the gap
    * `ffill`: propagate the last valid observation forward to the next valid one
* `axis`: the axis along which values are filled, 0 (default, down each column) or 1 (across each row)
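
As a hedged sketch of these options: `value` also accepts a dict that maps column names to fill values, and recent pandas releases deprecate the `method` argument of `fillna` in favor of the dedicated `df.ffill()` and `df.bfill()` methods. The fill values `0` and `"missing"` below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, np.nan],
    "B": ["x", "y", np.nan, "z"],
})

# A dict fills each column with its own constant
filled_per_column = df.fillna({"A": 0, "B": "missing"})

# Forward fill: each NaN takes the last valid value above it
forward = df.ffill()

# Backward fill: each NaN takes the next valid value below it;
# the NaN in the last row of A has nothing below it and stays NaN
backward = df.bfill()

print(filled_per_column)
print(forward)
print(backward)
```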

### Fill NaN with a constant value

In [17]:
df

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


In [18]:
value_to_fill = 3
df.fillna(value_to_fill)

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,3,False
3,3.0,z,3


### Fill NaN with `bfill` or `ffill`

In [19]:
df

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


In [20]:
# using `ffill`
df.fillna(method="ffill")

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,y,False
3,3.0,z,False


In [21]:
# using `bfill`: the last row has no later value to copy, so its NaN remain
df.fillna(method="bfill")

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,z,False
3,,z,


### Fill NaN with Mean, Median or Mode of the data

In [22]:
df

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,


In [23]:
df.mean()

A    2.000000
C    0.666667
dtype: float64

In [24]:
df.fillna(df.mean())

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,2.0,z,0.666667
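
Depending on the pandas version, `df.mean()` may raise an error when the DataFrame contains string columns, unless `numeric_only=True` is passed. The following sketch assumes we only want to fill the numeric column `A`; passing a Series to `fillna` fills each column with the value stored under that column's name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, None],
    "B": ["x", "y", np.nan, "z"],
})

# Restrict the statistic to numeric columns; the result is a Series indexed by column name
column_means = df.mean(numeric_only=True)

# Columns that are absent from `column_means` (here B) are left untouched
filled = df.fillna(column_means)
print(filled)

# The same idea for a single column, here with the median
a_filled = df["A"].fillna(df["A"].median())
print(a_filled)
```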


In [25]:
df.median()

A    2.0
C    1.0
dtype: float64

In [26]:
df.fillna(df.median())

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,2.0,z,1


In [27]:
df.dtypes

A    float64
B     object
C     object
dtype: object

In [28]:
df.mode(numeric_only=True)

Unnamed: 0,A
0,1.0
1,2.0
2,3.0


In [29]:
df.fillna(df.mode(numeric_only=True))

Unnamed: 0,A,B,C
0,1.0,x,True
1,2.0,y,True
2,3.0,,False
3,,z,
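
Note that the output above still contains every NaN. `df.mode()` returns a DataFrame (one row per tied mode, indexed 0, 1, 2 here), and `fillna` aligns a DataFrame argument on both index and columns, so the NaN in row 3 finds no matching value. A minimal sketch of one way around this, assuming we accept the first mode of each column as the fill value (the data below is a hypothetical variant with a single mode per column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, 2.0, np.nan],
    "B": ["x", "x", np.nan, "z"],
})

# One row per tied mode; here each column has a single mode
modes = df.mode()

# Taking the first row gives a Series indexed by column name,
# which `fillna` applies column by column
first_mode = modes.iloc[0]
filled = df.fillna(first_mode)
print(filled)
```

When several values are tied, `iloc[0]` picks the smallest one, since `mode` returns the tied values sorted.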


### Interpolate Data

In pandas we can fill the NaN with interpolated values according to multiple methods:
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate

The main argument in `df.interpolate()` is `method`:
* By default, the method is `linear` -- linear interpolation
* Other methods, from docs:
    * ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
    * ‘index’, ‘values’: use the actual numerical values of the index.
    * ‘pad’: Fill in NaNs using existing values.
    * ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
    * ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
    * ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
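
To illustrate how the choice of `method` matters, here is a small sketch that compares `linear` with `index` on a Series whose index is unevenly spaced. The data is made up for the example; neither of these two methods needs SciPy.

```python
import numpy as np
import pandas as pd

# The missing point sits at index 1, much closer to 0 than to 10
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and treats the values as equally spaced
linear = s.interpolate(method="linear")
print(linear.iloc[1])    # 5.0

# 'index' uses the numerical values of the index as x coordinates
by_index = s.interpolate(method="index")
print(by_index.iloc[1])  # 1.0
```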

In [30]:
import pandas as pd
import numpy as np

In [31]:
df = pd.DataFrame({
    "col_a": [np.log(i) for i in range(1, 26)],
    "col_b": [i**2 for i in range(26, 51)],
    "col_c": [i*2 for i in range(51, 76)]
})

In [32]:
df.head()

Unnamed: 0,col_a,col_b,col_c
0,0.0,676,102
1,0.693147,729,104
2,1.098612,784,106
3,1.386294,841,108
4,1.609438,900,110


In [33]:
df_nan = pd.DataFrame({
    "col_a": [None if i%2==0 else np.log(i) for i in range(1, 26)],
    "col_b": [None if i%10==0 else i**2 for i in range(26, 51)],
    "col_c": [None if i%3==0 else i*2 for i in range(51, 76)]
})

df_nan.head()

Unnamed: 0,col_a,col_b,col_c
0,0.0,676.0,
1,,729.0,104.0
2,1.098612,784.0,106.0
3,,841.0,
4,1.609438,,110.0


In [34]:
df_nan.isna().mean()

col_a    0.48
col_b    0.12
col_c    0.36
dtype: float64

In [35]:
df_interpolate = df_nan.interpolate()

df_interpolate

Unnamed: 0,col_a,col_b,col_c
0,0.0,676.0,
1,0.549306,729.0,104.0
2,1.098612,784.0,106.0
3,1.354025,841.0,108.0
4,1.609438,901.0,110.0
5,1.777674,961.0,112.0
6,1.94591,1024.0,114.0
7,2.071567,1089.0,116.0
8,2.197225,1156.0,118.0
9,2.29756,1225.0,120.0
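
Row 0 of `col_c` is still NaN in the output above: with the default settings, linear interpolation fills forward only, so a leading NaN stays, while a trailing NaN is filled by repeating the last valid value. The sketch below shows this on a tiny made-up Series, together with the `limit_area` and `limit_direction` arguments that change the behavior.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default: the leading NaN stays, the trailing NaN repeats the last valid value
default = s.interpolate()
print(default.tolist())      # [nan, 1.0, 2.0, 3.0, 3.0]

# limit_area="inside": only fill NaN surrounded by valid values
inside_only = s.interpolate(limit_area="inside")
print(inside_only.tolist())  # [nan, 1.0, 2.0, 3.0, nan]

# limit_direction="both": also fill the leading NaN, using the first valid value
both_ways = s.interpolate(limit_direction="both")
print(both_ways.tolist())    # [1.0, 1.0, 2.0, 3.0, 3.0]
```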


In [36]:
df - df_interpolate

Unnamed: 0,col_a,col_b,col_c
0,0.0,0.0,
1,0.143841,0.0,0.0
2,0.0,0.0,0.0
3,0.032269,0.0,0.0
4,0.0,-1.0,0.0
5,0.014085,0.0,0.0
6,0.0,0.0,0.0
7,0.007874,0.0,0.0
8,0.0,0.0,0.0
9,0.005025,0.0,0.0


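As we can see in the rows shown, linear interpolation works quite well on these simple series: it recovers the linear `col_c` exactly, is off by 1 on the quadratic `col_b`, and its error on the logarithmic `col_a` shrinks as the curve flattens (0.14 at row 1, 0.005 at row 9). The leading NaN in `col_c` (row 0) is not filled, so the difference there is NaN as well.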

In [37]:
# interpolating data that has no NaN leaves it unchanged
df.interpolate()

Unnamed: 0,col_a,col_b,col_c
0,0.0,676,102
1,0.693147,729,104
2,1.098612,784,106
3,1.386294,841,108
4,1.609438,900,110
5,1.791759,961,112
6,1.94591,1024,114
7,2.079442,1089,116
8,2.197225,1156,118
9,2.302585,1225,120
