<h1>Handling Missing Data with Pandas</h1>


In [3]:
import numpy as np
import pandas as pd

<h3>Pandas Utility Functions</h3>
<p>pandas provides utility functions for detecting null values. They are useful for identifying missing values in imported data.</p>

In [4]:
pd.isnull(np.nan)  # np.nan is treated as a null value

True

In [5]:
pd.isnull(None)  # None is also treated as null, so this returns True

True

In [6]:
pd.isna(np.nan)  # isna is an alias of isnull

True

In [7]:
pd.isna(None)

True

In [8]:
pd.notnull(None)

False

In [9]:
pd.notnull(np.nan)

False

In [10]:
pd.notnull(3)

True
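
As a small illustration of which scalars these functions flag, the sketch below checks a few values. The lists <code>missing</code> and <code>not_missing</code> are names chosen for this example only; the point is that "empty-looking" values such as 0 or an empty string are not treated as null.

```python
import numpy as np
import pandas as pd

# Values pandas treats as missing
missing = [np.nan, None, pd.NaT]
# Values that may look "empty" but are not considered missing
not_missing = [0, "", False]

missing_flags = [bool(pd.isna(v)) for v in missing]
not_missing_flags = [bool(pd.isna(v)) for v in not_missing]

print(missing_flags)
print(not_missing_flags)
```

If imported data encodes missing entries as empty strings or sentinel numbers, they would need to be converted to <code>nan</code> before these checks can find them.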

These functions also work with Series and DataFrames:

In [11]:
# Check to see if there are null values in pd.Series([1,np.nan,7])
## Should return [0: False, 1:True, 2: False]

pd.isnull(pd.Series([1,np.nan, 7]))

0    False
1     True
2    False
dtype: bool

In [12]:
# Check to see if there are non-null values in pd.Series([1, np.nan, 7])
## Should return [0: True, 1: False, 2: True]

pd.notnull(pd.Series([1, np.nan,7]))

0     True
1    False
2     True
dtype: bool

In [13]:
pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 2],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True


<h4>Pandas Operations with Missing Values</h4>
pandas handles missing values more gracefully than NumPy. In NumPy, a single <b>nan</b> propagates through an aggregation and turns the whole result into <b>nan</b>; pandas aggregations skip <b>nan</b> values by default.

In [14]:
pd.Series([1,2,np.nan]).count()  # counts only the non-null values

2

In [15]:
pd.Series([1,2,np.nan]).sum()  # skips the NaN when summing elements

3.0

In [16]:
pd.Series([1,2,np.nan]).mean()  # skips the NaN when computing the mean

1.5
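
To make the contrast with NumPy concrete, here is a minimal sketch comparing the same sum in both libraries. The variable names are illustrative; <code>skipna</code> is the pandas parameter that controls whether <code>nan</code> values are skipped.

```python
import numpy as np
import pandas as pd

values = [1, 2, np.nan]

# Plain NumPy aggregation: the NaN propagates into the result
numpy_sum = np.array(values).sum()

# NumPy's NaN-aware variant skips the NaN explicitly
nan_aware_sum = np.nansum(np.array(values))

# pandas skips NaN by default (skipna=True)
pandas_sum = pd.Series(values).sum()

# Passing skipna=False restores the propagating behavior
strict_sum = pd.Series(values).sum(skipna=False)

print(numpy_sum, nan_aware_sum, pandas_sum, strict_sum)
```

Setting <code>skipna=False</code> can be useful when a missing value should invalidate the result rather than be silently ignored.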

<b>Filtering Missing Data</b><br>
<p>We can combine boolean selection with <b>pd.notnull</b> to filter out the <b>nan</b> and null values.</p>

In [17]:
s = pd.Series([1,2,3,np.nan,np.nan,4])

In [18]:
pd.notnull(s)

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [19]:
pd.notnull(s).count()  # counts every element of the boolean mask, not only the True values

6
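
The result of 6 is the length of the boolean mask, because <code>True</code> and <code>False</code> are both non-null. A minimal sketch of how the mask could be used to count non-null and null values instead, relying on <code>True</code> being summed as 1 (variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
mask = pd.notnull(s)

total = mask.count()        # every element of the mask is non-null
non_null = mask.sum()       # True counts as 1, so this counts non-null values
null_count = (~mask).sum()  # inverting the mask counts the null values

print(total, non_null, null_count)
```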

In [20]:
s[pd.notnull(s)]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

Both <b>notnull</b> and <b>isnull</b> are also methods of Series and DataFrames, so we can call them directly on the object:

In [21]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

In [22]:
s.notnull()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

In [23]:
s[s.notnull()]

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

<b>Dropping null values</b>

In [24]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

In [25]:
s.dropna()

0    1.0
1    2.0
2    3.0
5    4.0
dtype: float64

<b>Dropping null values on DataFrames</b><br>
On a DataFrame you can't drop single values, only entire rows or columns.

In [26]:
df = pd.DataFrame({
    'Column A': [1,np.nan,30, np.nan], 
    'Column B': [2,8,33,np.nan], 
    'Column C': [np.nan, 9, 32, 100], 
    'Column D': [5,8,34,110]
})

In [27]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      33.0      32.0        34
3       NaN       NaN     100.0       110


In [28]:
df.shape

(4, 4)

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column A  2 non-null      float64
 1   Column B  3 non-null      float64
 2   Column C  3 non-null      float64
 3   Column D  4 non-null      int64  
dtypes: float64(3), int64(1)
memory usage: 256.0 bytes


In [30]:
df.isnull()

   Column A  Column B  Column C  Column D
0     False     False      True     False
1      True     False     False     False
2     False     False     False     False
3      True      True     False     False


In [31]:
df.isnull().sum()

Column A    2
Column B    1
Column C    1
Column D    0
dtype: int64
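
With the nulls located, a sketch of how <code>dropna</code> could be applied to this DataFrame follows. It rebuilds the same <code>df</code> so it runs on its own; the result variable names are illustrative, while <code>axis</code>, <code>how</code>, and <code>thresh</code> are parameters of <code>DataFrame.dropna</code>.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 33, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

# Default: drop every row that contains at least one null value
rows_without_nulls = df.dropna()

# axis=1 (or axis='columns'): drop every column that contains a null value
cols_without_nulls = df.dropna(axis=1)

# how='all': drop a row only if all of its values are null
not_all_null = df.dropna(how='all')

# thresh=3: keep only rows with at least 3 non-null values
at_least_three = df.dropna(thresh=3)

print(rows_without_nulls)
print(cols_without_nulls)
print(not_all_null.shape)
print(at_least_three)
```

On this data the default call keeps only row 2, and <code>axis=1</code> keeps only Column D, which shows how aggressive dropping can be when nulls are spread across the table.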

<h2>Filling Null Values</h2>
<p>Sometimes, instead of dropping null values, we need to replace them.<br>
    This section covers the main ways to fill null values once they have been identified.</p>

In [32]:
s

0    1.0
1    2.0
2    3.0
3    NaN
4    NaN
5    4.0
dtype: float64

<b>Filling nulls with an arbitrary value</b>

In [33]:
s.fillna(0)  # fills the NaN positions with 0

0    1.0
1    2.0
2    3.0
3    0.0
4    0.0
5    4.0
dtype: float64

In [34]:
s.fillna(s.mean()) #fills missing value with mean

0    1.0
1    2.0
2    3.0
3    2.5
4    2.5
5    4.0
dtype: float64

<b>Filling nulls with contiguous (close) values</b><br>
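The <u>method</u> argument fills null values with neighboring values: 'ffill' propagates the last valid value forward, and 'bfill' propagates the next valid value backward. Recent pandas releases deprecate this argument in favor of the equivalent <code>ffill()</code> and <code>bfill()</code> methods.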

In [35]:
s.fillna(method='ffill') #Forward fill

0    1.0
1    2.0
2    3.0
3    3.0
4    3.0
5    4.0
dtype: float64

In [36]:
s.fillna(method='bfill') #Backward fill

0    1.0
1    2.0
2    3.0
3    4.0
4    4.0
5    4.0
dtype: float64

Forward and backward filling can still leave null values at the ends of the Series/DataFrame:

In [37]:
pd.Series([np.nan, 3,np.nan,9]).fillna(method='ffill')

0    NaN
1    3.0
2    3.0
3    9.0
dtype: float64

In [38]:
pd.Series([1,np.nan,3,np.nan,np.nan]).fillna(method='bfill')

0    1.0
1    3.0
2    3.0
3    NaN
4    NaN
dtype: float64
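
One possible way to cover the ends is to chain a forward fill with a backward fill. The sketch below uses the <code>ffill()</code> and <code>bfill()</code> methods directly; the example Series and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, np.nan, 9, np.nan])

forward = s.ffill()       # leading NaN remains: nothing before it to copy
backward = s.bfill()      # trailing NaN remains: nothing after it to copy
both = s.ffill().bfill()  # forward fill first, then backward fill the rest

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

Whether copying values across gaps like this is appropriate depends on the data; for ordered measurements it is often reasonable, for unrelated rows it usually is not.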

<b>Filling null values on DataFrames</b><br>
The <u>fillna</u> method works similarly on DataFrames. You can specify the <u>axis</u> along which to fill, and you have more control over the values passed:

In [39]:
df

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       NaN       8.0       9.0         8
2      30.0      33.0      32.0        34
3       NaN       NaN     100.0       110


In [40]:
# Fill nulls per column: Column A with 0, Column B with 99, Column C with its mean
df.fillna({'Column A': 0, 'Column B':99, 'Column C':df['Column C'].mean()})

   Column A  Column B  Column C  Column D
0       1.0       2.0      47.0         5
1       0.0       8.0       9.0         8
2      30.0      33.0      32.0        34
3       0.0      99.0     100.0       110


In [41]:
# Forward fill along axis=1: each row is filled left to right across the columns
df.fillna(method='ffill', axis=1)

   Column A  Column B  Column C  Column D
0       1.0       2.0       2.0       5.0
1       NaN       8.0       9.0       8.0
2      30.0      33.0      32.0      34.0
3       NaN       NaN     100.0     110.0


In [42]:
# Forward fill along axis=0: each column is filled top to bottom
df.fillna(method='ffill', axis=0)

   Column A  Column B  Column C  Column D
0       1.0       2.0       NaN         5
1       1.0       8.0       9.0         8
2      30.0      33.0      32.0        34
3      30.0      33.0     100.0       110
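
Another option, sketched here as an illustration rather than a recommendation, is to fill every column with its own mean by passing the Series returned by <code>df.mean()</code> to <code>fillna</code>. pandas matches the Series index against the column names. The DataFrame is rebuilt so the block runs on its own, and <code>filled</code> is a name chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 33, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

# df.mean() returns one mean per column, computed while skipping NaN
column_means = df.mean()

# Each NaN is replaced by the mean of the column it belongs to
filled = df.fillna(column_means)

print(column_means)
print(filled)
```

This gives the same value for Column C as the dictionary form above (47.0), without listing each column by hand.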


<h3>Checking for NA Values</h3>
The question 'does the Series/DataFrame have missing values?' has a True/False answer.

<b>Example 1: Checking the length</b><br>
If there are missing values, <code>s.dropna()</code> will have fewer elements than <code>s</code>.

In [43]:
s.dropna().count()

4

In [44]:
missing_values = len(s.dropna()) != len(s)
missing_values

True

There's also a <code>count</code> method that excludes <code>nan</code> from its results:

In [45]:
len(s)

6

In [46]:
s.count()

4

So we could just do:

In [50]:
missing_values = s.count() != len(s)
missing_values

True

<b>More Pythonic solution:</b> <code>any</code><br>
    The methods <code>any</code> and <code>all</code> check whether any value in a Series is True, or whether all values are True.

In [52]:
pd.Series([True, False, False]).any()

True

In [49]:
pd.Series([True, False, False]).all()

False

The <code>isnull()</code> method returns a Boolean Series with True wherever there is a <code>nan</code>:

In [53]:
s.isnull()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

So we can just call the <code>any</code> method on the Boolean Series returned:

In [54]:
pd.Series([1,np.nan]).isnull().any()

True

In [55]:
pd.Series([1,2,]).isnull().any()

False

In [56]:
s.isnull().any()

True

In [57]:
s.isnull().values.any()  # same check, run on the underlying NumPy array

True
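
The same idea extends to DataFrames. The following sketch, with illustrative variable names, shows how <code>any</code> could be applied per column, per row, and to the whole table; it also uses the <code>hasnans</code> property that Series expose as a shortcut.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 33, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110]
})

# One boolean per column: does the column contain any null value?
per_column = df.isnull().any()

# One boolean per row: does the row contain any null value?
per_row = df.isnull().any(axis=1)

# A single boolean for the whole DataFrame
any_missing = df.isnull().any().any()

# Series shortcut for the same question
series_has_nans = s.hasnans

print(per_column)
print(per_row)
print(any_missing, series_has_nans)
```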