
# 7.1 Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible. For example, **all of the descriptive statistics on pandas objects exclude missing data by default**.
* The way that missing data is represented in pandas objects is somewhat imperfect, but it is functional for a lot of users.
* For numeric data, pandas uses the floating-point value `NaN` (Not a Number) to represent missing data. We call this a **sentinel value** that can be easily detected:

In [1]:
import pandas as pd
import numpy as np

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool
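
The earlier claim that descriptive statistics exclude missing data by default can be checked with a small sketch. The Series `values` is made up for illustration; `skipna` is the pandas parameter that turns the behavior off.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the text): one missing value among three entries.
values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped: (1.0 + 3.0) / 2
mean_strict = values.mean(skipna=False)  # NaN propagates into the result
non_missing = values.count()             # counts only non-NA entries

print(mean_default, mean_strict, non_missing)
```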

* In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as `NA`, **which stands for not available.**
* In statistics applications, NA data may either be data that does not exist or data that exists but was not observed.
* The built-in Python `None` value is also treated as `NA` in object arrays:

In [4]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
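
As a rough sketch of how `None` and `NaN` relate, the following uses made-up values. In an array of Python objects both are flagged as missing, while in numeric data pandas converts `None` to the `NaN` sentinel when the Series is built, which is why the result is floating point.

```python
import numpy as np
import pandas as pd

# Non-numeric data: None and NaN are both reported as missing.
labels = pd.Series(['a', None, np.nan, 'd'])
missing_mask = labels.isnull()

# Numeric data: None is converted to NaN on construction.
numbers = pd.Series([1, None, 3])

print(missing_mask.tolist())
print(numbers.dtype)
```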

####  See Table 7-1 for a list of some functions related to missing data handling.
![image.png](attachment:image.png)

### 1. Filtering Out Missing Data
There are a few ways to filter out missing data.
* While you always have the option to do it by hand using `pandas.isnull` and boolean indexing, the `dropna` method can be helpful. On a Series, it returns the Series with only the non-null data and index values:

In [5]:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

#### This is equivalent to:

In [6]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

#### With DataFrame objects, things are a bit more complex. 
* `dropna` by default drops any row containing a missing value:

In [7]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [8]:
cleaned = data.dropna()
cleaned

     0    1    2
0  1.0  6.5  3.0


#### Passing `how='all'` will only drop rows that are all NA:

In [9]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


#### To drop columns in the same way, pass `axis=1`:

In [10]:
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [11]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
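
The difference between the default `how='any'` and `how='all'` along columns may be easier to see on a frame where one column is partly missing and another is entirely missing. This is a sketch on made-up data; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up frame: one complete column, one partly missing, one entirely missing.
frame = pd.DataFrame({'full': [1.0, 2.0],
                      'partial': [np.nan, 3.0],
                      'empty': [np.nan, np.nan]})

cols_any = frame.dropna(axis=1)             # default how='any': drop columns with any NA
cols_all = frame.dropna(axis=1, how='all')  # drop only columns that are all NA

print(list(cols_any.columns))
print(list(cols_all.columns))
```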


A related way to filter out DataFrame rows tends to concern time series data. 
* Suppose you want to **keep only rows containing at least a certain number of non-NA observations.** You can indicate this with the `thresh` argument:

In [12]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

          0         1         2
0 -0.952921       NaN       NaN
1  0.276267       NaN       NaN
2  0.760786       NaN -0.808315
3 -0.713527       NaN  0.416105
4 -0.324210  1.736784  1.096385
5 -0.256604  0.308013  1.808138
6 -0.199655  0.575004 -0.107670


In [13]:
df.dropna()

          0         1         2
4 -0.324210  1.736784  1.096385
5 -0.256604  0.308013  1.808138
6 -0.199655  0.575004 -0.107670


In [14]:
df.dropna(thresh=2)

          0         1         2
2  0.760786       NaN -0.808315
3 -0.713527       NaN  0.416105
4 -0.324210  1.736784  1.096385
5 -0.256604  0.308013  1.808138
6 -0.199655  0.575004 -0.107670
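
Because the frame above is random, the effect of `thresh` may be easier to verify on fixed data. The sketch below uses a made-up frame and compares `dropna(thresh=2)` with the same filter written by hand, counting non-NA values per row.

```python
import numpy as np
import pandas as pd

# Made-up frame: rows hold 1, 2, and 3 non-NA values respectively.
frame = pd.DataFrame([[1.0, np.nan, np.nan],
                      [2.0, np.nan, 5.0],
                      [3.0, 4.0, 6.0]])

# Keep rows that have at least two non-NA values.
kept = frame.dropna(thresh=2)

# The same filter by hand: count non-NA values in each row.
by_hand = frame[frame.notnull().sum(axis=1) >= 2]

print(kept.index.tolist())
```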


### 2. Filling In Missing Data
For most purposes, the `fillna` method is the workhorse function to use. Calling `fillna` with a constant replaces missing values with that value:

In [15]:
df.fillna(0)

          0         1         2
0 -0.952921  0.000000  0.000000
1  0.276267  0.000000  0.000000
2  0.760786  0.000000 -0.808315
3 -0.713527  0.000000  0.416105
4 -0.324210  1.736784  1.096385
5 -0.256604  0.308013  1.808138
6 -0.199655  0.575004 -0.107670


Calling `fillna` with a dict, you can **use a different fill value for each column:**

In [17]:
df.fillna({1: 0.5, 2: 0})

          0         1         2
0 -0.952921  0.500000  0.000000
1  0.276267  0.500000  0.000000
2  0.760786  0.500000 -0.808315
3 -0.713527  0.500000  0.416105
4 -0.324210  1.736784  1.096385
5 -0.256604  0.308013  1.808138
6 -0.199655  0.575004 -0.107670


#### `fillna` returns a new object, but you can modify the existing object in-place:
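
A minimal sketch of the two forms, on a made-up frame named `scores`: the plain call leaves the original untouched, while `inplace=True` (an existing `fillna` parameter) modifies the object it is called on.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
scores = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})

# Default behavior: a new object is returned and scores keeps its NA values.
filled = scores.fillna(0)
still_missing = int(scores.isnull().sum().sum())

# In-place form: scores itself is modified.
scores.fillna(0, inplace=True)

print(still_missing, int(scores.isnull().sum().sum()))
```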

The same interpolation methods available for reindexing, such as forward fill (`ffill`), can be used to fill missing values:

In [18]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

          0         1         2
0  0.260642  0.017750  1.036211
1 -0.717592 -0.472621  0.602104
2  0.074856       NaN -0.381256
3  1.286412       NaN -1.157350
4 -0.686835       NaN       NaN
5 -0.213994       NaN       NaN


In [19]:
df.ffill()  # forward fill; replaces the older df.fillna(method='ffill') spelling, which newer pandas versions deprecate

          0         1         2
0  0.260642  0.017750  1.036211
1 -0.717592 -0.472621  0.602104
2  0.074856 -0.472621 -0.381256
3  1.286412 -0.472621 -1.157350
4 -0.686835 -0.472621 -1.157350
5 -0.213994 -0.472621 -1.157350


In [20]:
df.ffill(limit=2)  # same forward fill, capped at two consecutive NA values per column

          0         1         2
0  0.260642  0.017750  1.036211
1 -0.717592 -0.472621  0.602104
2  0.074856 -0.472621 -0.381256
3  1.286412 -0.472621 -1.157350
4 -0.686835       NaN -1.157350
5 -0.213994       NaN -1.157350
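
Forward filling and the `limit` cap can also be checked on deterministic data. This sketch uses the `ffill` and `bfill` methods on a made-up Series with a run of three missing values; `bfill` is the backward counterpart and is included only for comparison.

```python
import numpy as np
import pandas as pd

# Made-up Series: three consecutive missing values between two observations.
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = readings.ffill()                # carry the last valid value forward
forward_capped = readings.ffill(limit=2)  # fill at most two consecutive gaps
backward = readings.bfill()               # pull the next valid value backward

print(forward.tolist())
print(forward_capped.isnull().tolist())
print(backward.tolist())
```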


#### With `fillna` you can do lots of other things with a little creativity. 
* For example, you might pass the mean or median value of a Series:

In [21]:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
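
The same idea extends to a DataFrame. As a sketch on made-up data, passing `frame.mean()` (a Series indexed by column name) fills each column with its own mean; the median works the same way.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration.
frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 10.0, 30.0]})

# frame.mean() is indexed by column name, so each column gets its own fill value.
filled_mean = frame.fillna(frame.mean())
filled_median = frame.fillna(frame.median())

print(filled_mean)
```

With only two observed values per column, the mean and median coincide here; on skewed data they would generally differ.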

#### See Table 7-2 for a reference on `fillna`.
![image.png](attachment:image.png)