### C. Dealing with Missing data

Real-world datasets are messy and often contain missing values. By default, pandas represents missing values as NaN, which stands for "not a number".

Missing values can be ignored, dropped, or filled.
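
As a minimal sketch of the "ignored" option: most pandas aggregations skip NaN by default through their `skipna=True` parameter. The small `demo` frame below is a hypothetical example, not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame, used only for this illustration
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Aggregations skip NaN by default (skipna=True)
col_means = demo.mean()

# With skipna=False, a NaN anywhere in a column makes that column's result NaN
strict_means = demo.mean(skipna=False)

print(col_means)
print(strict_means)
```

Ignoring works well for quick summaries, but many downstream tools do not accept NaN, which is why dropping and filling are often needed.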

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Creating a dataframe

df3 = pd.DataFrame(np.array([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]]),
                   columns=['column 1', 'column 2', 'column 3'])

In [4]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


#### Checking Missing values

In [5]:
# Recognizing the missing values

df3.isnull()

   column 1  column 2  column 3
0     False     False     False
1     False      True     False
2     False      True      True


In [6]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [7]:
# Counting the missing values in each column

df3.isnull().sum()

column 1    0
column 2    2
column 3    1
dtype: int64
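
A related check, sketched here under the assumption that `df3` is built as above: because `True` counts as 1, the mean of the boolean mask gives the fraction of missing values per column, and summing twice gives the total count for the whole frame.

```python
import numpy as np
import pandas as pd

# Same frame as in this section, rebuilt so the snippet runs on its own
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]]),
                   columns=['column 1', 'column 2', 'column 3'])

# Fraction of missing values per column (True is treated as 1, False as 0)
missing_fraction = df3.isnull().mean()

# Total number of missing values in the whole DataFrame
total_missing = int(df3.isnull().sum().sum())

print(missing_fraction)
print(total_missing)
```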

In [8]:
# Recognizing the non-missing values

df3.notna()

   column 1  column 2  column 3
0      True      True      True
1      True     False      True
2      True     False     False


In [9]:
df3.notna().sum()

column 1    3
column 2    1
column 3    2
dtype: int64

#### Removing the missing values

In [11]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [10]:
# Dropping missing values

df3.dropna()

   column 1  column 2  column 3
0       1.0       2.0       3.0


Only row 0 remains because, by default, dropna() removes every row that contains at least one missing value.
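
Dropping every row with any NaN can discard a lot of data. As a hedged sketch, assuming `df3` as defined earlier, the `how`, `thresh`, and `subset` parameters of `dropna` make the rule less strict. The result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same frame as in this section, rebuilt so the snippet runs on its own
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]]),
                   columns=['column 1', 'column 2', 'column 3'])

# how='all' drops a row only when every value in it is missing
all_missing_dropped = df3.dropna(how='all')

# thresh=2 keeps only rows with at least 2 non-missing values
at_least_two = df3.dropna(thresh=2)

# subset limits the check to the listed columns
col3_present = df3.dropna(subset=['column 3'])

print(all_missing_dropped)
print(at_least_two)
print(col3_present)
```

Here no row is entirely missing, so `how='all'` keeps all three rows, while the other two calls keep rows 0 and 1.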

In [12]:
# You can drop NaNs from a single column (this returns a Series)

df3['column 3'].dropna()

0    3.0
1    6.0
Name: column 3, dtype: float64

In [13]:
# You can drop along either axis
# axis=1 drops every column that contains a NaN
# df3.dropna(axis='columns') is equivalent

df3.dropna(axis=1)

   column 1
0       1.0
1       4.0
2       7.0


In [14]:
# axis=0 (the default) drops every row that contains a NaN
# df3.dropna(axis='index') is equivalent

df3.dropna(axis=0)

   column 1  column 2  column 3
0       1.0       2.0       3.0


#### Filling the missing values

In [16]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [15]:
# Filling Missing values

df3.fillna(10)

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0      10.0       6.0
2       7.0      10.0      10.0


In [17]:
df3.fillna('fillme')

   column 1 column 2 column 3
0       1.0      2.0      3.0
1       4.0   fillme      6.0
2       7.0   fillme   fillme


In [19]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [18]:
# You can forward fill (ffill) or backward fill (bfill):
# each missing value is replaced with the previous or the next value in its column
# Note: df3.fillna(method='ffill') is deprecated in recent pandas versions

df3.ffill()

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       2.0       6.0
2       7.0       2.0       6.0


In [20]:
# Backward fill changes nothing here: the NaNs are in the last rows, so there is no later value to copy back

df3.bfill()

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [21]:
df3

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       NaN       6.0
2       7.0       NaN       NaN


In [22]:
# If we change the axis to columns, the NaN at row 1, 'column 2' is backfilled with 6.0 from 'column 3'
# Row 2 is unchanged because there is no value to the right of its NaNs

df3.bfill(axis='columns')

   column 1  column 2  column 3
0       1.0       2.0       3.0
1       4.0       6.0       6.0
2       7.0       NaN       NaN
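
Constant and neighbor-based fills are not the only choices. As a hedged sketch, assuming `df3` as defined earlier, `fillna` also accepts a Series or a dict keyed by column name, which allows a different fill value per column, for example each column's mean. The fill values 0 and -1 below are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

# Same frame as in this section, rebuilt so the snippet runs on its own
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, np.nan, 6], [7, np.nan, np.nan]]),
                   columns=['column 1', 'column 2', 'column 3'])

# df3.mean() is a Series indexed by column name, so each column
# is filled with its own mean (computed over the non-missing values)
filled_mean = df3.fillna(df3.mean())

# A dict gives an explicit fill value per column (arbitrary values here)
filled_per_col = df3.fillna({'column 2': 0, 'column 3': -1})

print(filled_mean)
print(filled_per_col)
```

Like the other calls in this section, both return a new DataFrame and leave `df3` itself unchanged.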
