<a href='http://www.scienceacademy.ca'> <img style="float: left;height:70px" src="Logo_SA.png"></a>

Hi Guys,<br>
Welcome back to the pandas essentials. This time we are going to talk about missing data!<br>
## Handling Missing Data
Missing data is very common in data analysis applications. pandas marks missing values as `NaN` and provides methods to detect, drop, and fill them. <br>Let's learn some convenient methods to deal with **missing data in pandas**:<br>

In [1]:
import numpy as np
import pandas as pd

Creating a dataframe with missing data

In [2]:
data_dic = {'A':[1,2,np.nan,4,np.nan],
            'B':[np.nan,np.nan,np.nan,np.nan,np.nan],
            'C':[11,12,13,14,15],
            'D':[16,np.nan,18,19,20]}
df = pd.DataFrame(data_dic)  # build the dataframe from the dictionary

In [3]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0



**isnull(), notnull() -- Check for missing data in the dataset!**

In [4]:
# isnull() returns True if the data is missing
df.isnull()

Unnamed: 0,A,B,C,D
0,False,True,False,False
1,False,True,False,True
2,True,True,False,False
3,False,True,False,False
4,True,True,False,False


In [5]:
# notnull() returns True for non-NaN values
df.notnull()

Unnamed: 0,A,B,C,D
0,True,False,True,True
1,True,False,True,False
2,False,False,True,True
3,True,False,True,True
4,False,False,True,True
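
The boolean masks above are often more useful when summarised. As a minimal sketch (it rebuilds the same dataframe so it can run on its own; the variable names `missing_per_column` and `missing_fraction` are only illustrative), summing or averaging the mask gives the count and share of missing values per column:

```python
import numpy as np
import pandas as pd

# Same data as the dataframe created earlier in this notebook
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# True counts as 1, so summing the mask counts the NaN values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Averaging the mask gives the fraction of missing values in each column
missing_fraction = df.isnull().mean()
print(missing_fraction)
```

Here column "B" is entirely missing and column "C" has no missing values at all.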


&#9758; `sum()` skips NaN values, so they effectively count as 0.

In [6]:
# Sum of column "A" (NaN values count as 0)
df['A'].sum()

7.0

&#9758; `mean()` ignores NaN values: only the non-NaN entries are averaged.

In [7]:
df['A'].mean()

2.3333333333333335
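
Skipping NaN is only the default behaviour. The sketch below, which rebuilds column "A" as a standalone Series so it can run on its own, shows the `skipna` and `min_count` parameters that change how `sum()` treats missing values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

col_a = pd.Series([1, 2, np.nan, 4, np.nan], name='A')

total_skip = col_a.sum()                 # NaN skipped: 1 + 2 + 4
mean_skip = col_a.mean()                 # 7 / 3, only the 3 non-NaN values count
total_strict = col_a.sum(skipna=False)   # NaN propagates, so the result is NaN

# An all-NaN column sums to 0.0 by default;
# min_count=1 asks for at least one valid value, otherwise NaN is returned
all_nan = pd.Series([np.nan] * 5, name='B')
sum_default = all_nan.sum()
sum_min_count = all_nan.sum(min_count=1)

print(total_skip, mean_skip, total_strict, sum_default, sum_min_count)
```

Using `skipna=False` can be a cheap way to notice that a column still contains missing values before trusting an aggregate.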

**dropna(), fillna() -- Cleaning / filling the missing data**

In [8]:
# drop any row (the default, axis=0) that contains at least one NaN value
df.dropna()

Unnamed: 0,A,B,C,D


In [9]:
# to drop columns instead, pass axis=1
df.dropna(axis=1)

Unnamed: 0,C
0,11
1,12
2,13
3,14
4,15


thresh : int, default None<br>
`thresh=3` with `axis=1` keeps only the columns that have at least 3 non-NaN values, so any column with fewer than 3 is dropped.

In [10]:
df.dropna(thresh=3, axis=1)

Unnamed: 0,A,C,D
0,1.0,11,16.0
1,2.0,12,
2,,13,18.0
3,4.0,14,19.0
4,,15,20.0
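
The same idea works along rows, and `how='all'` is an alternative when only completely empty columns should go. A minimal sketch on the same data, with illustrative variable names:

```python
import numpy as np
import pandas as pd

# Same data as the dataframe created earlier in this notebook
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# how='all' drops a column only if every value in it is NaN (column "B" here)
no_empty_cols = df.dropna(how='all', axis=1)
print(no_empty_cols)

# thresh on rows (default axis=0): keep rows with at least 3 non-NaN values
rows_kept = df.dropna(thresh=3)
print(rows_kept)
```

Only rows 0 and 3 have three valid values (in "A", "C" and "D"), so they are the only rows that survive the second call.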


We can use fillna() to fill in the values.<br>
Pass `inplace=True` to modify the dataframe itself instead of returning a new one.

In [11]:
df.fillna(value='Filled')

Unnamed: 0,A,B,C,D
0,1,Filled,11,16
1,2,Filled,12,Filled
2,Filled,Filled,13,18
3,4,Filled,14,19
4,Filled,Filled,15,20


Let's fill in the values using mean of the column. 

In [12]:
df['A'].fillna(value = df['A'].mean())

0    1.000000
1    2.000000
2    2.333333
3    4.000000
4    2.333333
Name: A, dtype: float64
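
The mean fill can be extended to the whole dataframe at once. As a sketch (the name `filled` is illustrative), passing `df.mean()` to `fillna()` fills each column with its own mean. It also shows that `fillna()` returns a new object, so assigning the result is an alternative to `inplace=True`:

```python
import numpy as np
import pandas as pd

# Same data as the dataframe created earlier in this notebook
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean; "B" has no valid values, so its
# mean is NaN and the column stays empty
filled = df.fillna(df.mean())
print(filled)

# The original dataframe is untouched because the result was assigned
# to a new name instead of using inplace=True
print(df['A'].isnull().sum())
```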

In [13]:
# ffill (also known as pad): forward fill, carry the last valid observation forward to the next NaN
df.ffill()  # replaces df.fillna(method='ffill'), which is deprecated since pandas 2.1

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,16.0
2,2.0,,13,18.0
3,4.0,,14,19.0
4,4.0,,15,20.0


In [14]:
# bfill (also known as backfill): use the NEXT valid observation to fill the gap
df.bfill()  # replaces df.fillna(method='bfill'), which is deprecated since pandas 2.1

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,18.0
2,4.0,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [15]:
# fill with your own given value
df.fillna(0)

Unnamed: 0,A,B,C,D
0,1.0,0.0,11,16.0
1,2.0,0.0,12,0.0
2,0.0,0.0,13,18.0
3,4.0,0.0,14,19.0
4,0.0,0.0,15,20.0
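
A single fill value is not always appropriate for every column. As a sketch, `fillna()` also accepts a dictionary that maps column names to fill values. The choice of 0 for "A" and the median for "D" is only an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Same data as the dataframe created earlier in this notebook
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Keys are column names; columns not listed (such as "B") are left as they are
per_column = df.fillna({'A': 0, 'D': df['D'].median()})
print(per_column)
```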


# Good Job!