# Handling Missing Data

In [2]:
import pandas as pd
import numpy as np

To work with missing data, we first need a way to tell pandas that a value is missing. One way to do that is to use the nan object from NumPy. When the DataFrame is displayed, the missing entry shows up as NaN, which stands for Not a Number.

In [3]:
df_ = pd.DataFrame({'x': [1, 2, np.nan, 4 ]})
df_

Unnamed: 0,x
0,1.0
1,2.0
2,NaN
3,4.0
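
As a small aside that the lesson does not show: in a numeric column, pandas treats Python's None the same way as np.nan, and the .isna() method reports which entries are missing. A minimal sketch, where the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Both np.nan and None end up as missing values in a numeric Series.
s = pd.Series([1, None, np.nan, 4])

# isna() returns True for every missing entry.
missing_mask = s.isna()
n_missing = int(missing_mask.sum())

print(missing_mask.tolist())  # [False, True, True, False]
print(n_missing)              # 2
```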


By default, statistical computations on this DataFrame ignore NaN values. For example, if we compute the mean of the values, as we did in the previous lesson, pandas skips the NaN and averages only the remaining values.

In [4]:
df_.mean()

x    2.333333
dtype: float64

Here pandas ignored the NaN value and averaged the three remaining values: (1 + 2 + 4) / 3 ≈ 2.33. This behavior is controlled by the skipna keyword argument, whose default value is True. To have NaN values included in the computation, pass skipna=False.

In [5]:
df_.mean(skipna=False)

x   NaN
dtype: float64

When a computation in pandas, or in most other numerical libraries, includes a NaN value, the whole result comes out as NaN. That is what happens here. For example, evaluating 1 + 2 + np.nan gives nan.



In [6]:
1 + 2 + np.nan

nan
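
To tie the two behaviors together, here is a minimal sketch that compares NaN propagation in plain arithmetic with the skipna keyword on the same data. It rebuilds the example column under the hypothetical name df so that it can run on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, np.nan, 4]})

# Plain arithmetic propagates NaN through the whole expression.
plain_total = 1 + 2 + np.nan
print(np.isnan(plain_total))  # True

# skipna=True (the default): the mean is taken over the 3 valid values.
mean_skip = df['x'].mean()
print(np.isclose(mean_skip, (1 + 2 + 4) / 3))  # True

# skipna=False: the NaN takes part in the computation, so the result is NaN.
mean_noskip = df['x'].mean(skipna=False)
print(np.isnan(mean_noskip))  # True
```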

In certain situations, you may want to replace a NaN value with some default value. In pandas, the method that does this is .fillna(), and it has several options.

The simplest option is to pass .fillna() a single value that replaces every NaN. For example, to set all of the NaN values to 0, pass 0 to the value keyword argument.



In [7]:
df_.fillna(value=0)

Unnamed: 0,x
0,1.0
1,2.0
2,0.0
3,4.0


By default, this returns a new DataFrame in which every NaN value is replaced by the value passed to the value keyword; df_ itself is unchanged.

Alternatively, you can pass the keyword argument inplace=True. The call then returns None and modifies df_ directly, replacing all of its NaN values with 0.
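
The difference between the returned copy and the original can be checked directly. The sketch below uses hypothetical names (df, filled) and sticks to the non-inplace form, keeping the result by assignment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, np.nan, 4]})

# fillna() without inplace returns a new DataFrame.
filled = df.fillna(value=0)
print(filled['x'].tolist())  # [1.0, 2.0, 0.0, 4.0]

# The original still contains its missing value.
n_missing_original = int(df['x'].isna().sum())
print(n_missing_original)  # 1

# To keep the filled data under the same name, reassign the result.
df = df.fillna(value=0)
n_missing_after = int(df['x'].isna().sum())
print(n_missing_after)  # 0
```

Reassigning the result has the same visible effect as inplace=True while keeping each step explicit.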

Sometimes, instead of filling with a fixed value such as 0, you want to carry the previous non-NaN value forward so that it is copied over any NaN values that follow it.

We can do this with the method keyword argument by passing it the string 'ffill' (forward fill).

With this option, each NaN value is replaced by the most recent non-NaN value above it. Here, that is 2.0.

In [8]:
df_.fillna(method='ffill')

Unnamed: 0,x
0,1.0
1,2.0
2,2.0
3,4.0


Alternatively, we may want to replace NaN values with the next valid value that follows them. In other words, we would like to do a backward fill. In this case, the method is 'bfill', and so the NaN value is replaced by 4.0.

In [9]:
df_.fillna(method='bfill')

Unnamed: 0,x
0,1.0
1,2.0
2,4.0
3,4.0
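
Recent pandas releases (2.1 and later, to the best of my knowledge) deprecate the method keyword of .fillna() and recommend the dedicated .ffill() and .bfill() methods instead. A minimal sketch of the same two fills written that way, plus an edge case the lesson does not cover: a NaN with no valid value on the side being filled from stays NaN. The names df and edge are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, np.nan, 4]})

# Forward fill: copy the last valid value down into the gap.
forward = df.ffill()
print(forward['x'].tolist())   # [1.0, 2.0, 2.0, 4.0]

# Backward fill: copy the next valid value up into the gap.
backward = df.bfill()
print(backward['x'].tolist())  # [1.0, 2.0, 4.0, 4.0]

# Edge case: nothing precedes the first NaN and nothing follows the last one.
edge = pd.Series([np.nan, 1.0, np.nan])
print(edge.ffill().isna().tolist())  # [True, False, False]
print(edge.bfill().isna().tolist())  # [False, False, True]
```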


In [10]:
df_

Unnamed: 0,x
0,1.0
1,2.0
2,NaN
3,4.0


Another common way to fill in missing values is mathematical interpolation. Interpolation is useful when you only have a sample of what you’re measuring and you need estimates at points where you weren’t able to measure. For example, looking at the DataFrame again, a natural way to fill in the missing value and continue the pattern is to use the value halfway between 2.0 and 4.0.

This is called linear interpolation. Calling the .interpolate() method returns a new DataFrame in which each missing value is obtained by interpolating between the values before and after the NaN, in this case 2.0 and 4.0.

In [12]:
df_.interpolate()

Unnamed: 0,x
0,1.0
1,2.0
2,3.0
3,4.0
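
To see what linear interpolation does when a gap is longer than one entry, the sketch below fills two consecutive missing values and compares the result with the straight-line formula y0 + (y1 - y0) * k / n. The series and the variable names are made up for illustration, and the sketch assumes the default method of .interpolate(), which treats the entries as equally spaced.

```python
import numpy as np
import pandas as pd

# Two consecutive missing values between 1.0 and 4.0.
s = pd.Series([1.0, np.nan, np.nan, 4.0])
interpolated = s.interpolate()  # default method is 'linear'
print(interpolated.round(6).tolist())  # [1.0, 2.0, 3.0, 4.0]

# The same values computed by hand: n equal steps from y0 to y1.
y0, y1 = 1.0, 4.0
n_steps = 3
manual = [y0 + (y1 - y0) * k / n_steps for k in range(n_steps + 1)]
print(np.allclose(interpolated, manual))  # True
```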


In certain situations, you may want to simply remove any row or any column that contains a NaN value. pandas provides the .dropna() method for this.

The default behavior of .dropna() is to remove any row that contains a NaN value. This corresponds to passing 0 to the axis keyword, which is its default. So whether or not you pass axis=0, if we run this,


In [13]:
df_.dropna(axis=0)

Unnamed: 0,x
0,1.0
1,2.0
3,4.0


we get a new DataFrame in which the row with label 2, which had a NaN value, is removed. The axis keyword controls whether rows or columns containing a NaN value are deleted.

Since the default for axis is 0, calling .dropna() with no arguments gives the same result. Now, if we pass in a value of 1 to the axis keyword argument,



In [14]:
df_.dropna(axis=1)

0
1
2
3


we get a DataFrame that has no data. That is because our DataFrame had only one column, and that column had a NaN value. Regardless of whether you pass 0 or 1 to remove rows or columns, you can also pass True for the inplace keyword.



In [15]:
df_.dropna(axis=0, inplace=True)


Here we remove the row that has a NaN value. This modifies the df_ DataFrame in place, so it now has no rows containing a NaN value, and the row with label 2 is gone.
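
The effect of the axis keyword is easier to see with more than one column. The sketch below adds a second, complete column y, which is not part of the lesson's data and is assumed here purely for illustration:

```python
import numpy as np
import pandas as pd

# 'x' has a missing value in row 2; 'y' is complete (hypothetical column).
df = pd.DataFrame({'x': [1, 2, np.nan, 4], 'y': [10, 20, 30, 40]})

# axis=0 (the default) drops every row that contains a NaN.
rows_dropped = df.dropna(axis=0)
print(rows_dropped.index.tolist())    # [0, 1, 3]

# axis=1 drops every column that contains a NaN.
cols_dropped = df.dropna(axis=1)
print(cols_dropped.columns.tolist())  # ['y']

# Without inplace=True, the original DataFrame is left as it was.
print(df.shape)  # (4, 2)
```

Dropping rows loses one observation, while dropping columns loses the whole x variable, so the right choice depends on which of the two matters more for the analysis.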


