Hi Guys,<br>
Welcome back to the pandas essentials. In this section we are going to talk about missing data!<br>
## Handling Missing Data
Missing data is very common in data analysis. pandas provides convenient methods for detecting, dropping, and filling missing values. <br>Let's learn the main methods for dealing with **missing data in pandas**:<br>

* isnull(), isna(), notnull(), dropna(), fillna()

In [1]:
import numpy as np
import pandas as pd

In [2]:
data_dic = {'A':[1,2,np.nan,4,np.nan],
            'B':[np.nan,np.nan,np.nan,np.nan,np.nan],
            'C':[11,12,13,14,15],
            'D':[16,np.nan,18,19,20]}
df = pd.DataFrame(data_dic) 
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0



**isnull(), isna(), notnull() -- Check for missing data in the dataset!**

In [None]:
df.isnull()

In [None]:
df.isnull().sum().sum()

In [None]:
df['A'].isnull()

In [None]:
df['A'].isnull().sum()

In [None]:
df.isna()

In [None]:
df.isna().sum().sum()

In [None]:
df.loc[1].isnull().sum()

In [None]:
df.notnull()

In [None]:
df.shape

In [None]:
df.notnull().sum()

In [None]:
df.notnull().sum().sum()
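
The checks above can be condensed into a short summary. This is a minimal sketch that rebuilds the same toy frame; the variable names (`same_mask`, `missing_per_col`, and so on) are illustrative only. It shows that `isna()` and `isnull()` return the same mask, that `notnull()` is the complement, and how to get missing counts per column.

```python
import numpy as np
import pandas as pd

# Same toy frame as at the top of the notebook.
df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# isna() and isnull() are aliases, so the boolean masks are identical.
same_mask = df.isna().equals(df.isnull())

# notnull() is the element-wise complement of isnull().
complement = df.notnull().equals(~df.isnull())

# Number of missing values per column, and the fraction of rows missing.
missing_per_col = df.isna().sum()
missing_fraction = df.isna().mean()

print(same_mask, complement)
print(missing_per_col)
print(missing_fraction)
```

The per-column fraction is often more useful than the grand total when deciding which columns to drop and which to fill.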

In [None]:
df

In [None]:
# Sum of column "A" (NaN values are skipped, i.e. treated as 0)
df['A'].sum()

&#9758; NaN values are ignored by mean().

In [None]:
df['A'].mean()

In [None]:
df.loc[3].sum()
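
To make the NaN handling in `sum()` and `mean()` explicit, here is a small sketch on a Series with the same values as column A. The names `a`, `total`, and `avg` are just for illustration. It relies on the `skipna` and `min_count` parameters of the pandas reductions.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, np.nan, 4, np.nan])

# By default NaN values are skipped: 1 + 2 + 4 = 7.
total = a.sum()

# mean() divides by the number of non-NaN values (3), not by the length (5).
avg = a.mean()

# With skipna=False a single NaN makes the result NaN.
total_strict = a.sum(skipna=False)

# The sum of an all-NaN Series is 0.0 unless min_count asks for at least one value.
empty_sum = pd.Series([np.nan, np.nan]).sum()
empty_sum_strict = pd.Series([np.nan, np.nan]).sum(min_count=1)

print(total, avg, total_strict, empty_sum, empty_sum_strict)
```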

**dropna(), fillna() -- Cleaning / filling the missing data**

In [None]:
df.dropna(axis=0)

In [None]:
df

In [None]:
# to drop columns instead of rows, pass axis=1
df.dropna(axis=1)

In [None]:
df

thresh : int, default None
With axis=1, thresh=3 drops any column that has fewer than 3 non-NaN values.

In [None]:
df.dropna(thresh=3, axis=1)
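
The `dropna()` options are easier to compare side by side. The following is a sketch on the same toy frame; the result names are made up for this example. Besides `axis` and `thresh`, it uses the `how` and `subset` parameters of `DataFrame.dropna`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Default: drop every row that contains at least one NaN.
# Column B is all NaN, so no row survives.
rows_complete = df.dropna()

# Drop every column that contains at least one NaN: only C is left.
cols_complete = df.dropna(axis=1)

# Drop only the columns that are entirely NaN: B goes away.
cols_not_empty = df.dropna(axis=1, how='all')

# Keep columns with at least 3 non-NaN values: A (3), C (5), D (4).
cols_thresh = df.dropna(axis=1, thresh=3)

# Only look at column A when deciding which rows to drop.
rows_with_a = df.dropna(subset=['A'])

print(rows_complete.shape, list(cols_complete.columns))
print(list(cols_not_empty.columns), list(cols_thresh.columns))
print(list(rows_with_a.index))
```

None of these calls changes `df` itself; each returns a new DataFrame.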

We can use fillna() to fill in the missing values.<br>
By default it returns a new object; pass inplace=True to change the DataFrame itself.

In [None]:
df.fillna(value=2)

In [None]:
df

Let's fill the missing values of a column with the mean of that column.

In [None]:
df['A'].fillna(value = df['A'].mean())
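
Since `fillna()` returns a new object, the original column still has its NaN values unless the result is assigned back. A minimal sketch of this, plus two common variations (filling every column with its own mean, and passing a dict of per-column values), is given below. The result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Returns a new Series; df['A'] itself is left untouched.
filled_a = df['A'].fillna(df['A'].mean())
still_missing = df['A'].isna().sum()

# Fill each column with its own mean. The mean of the all-NaN
# column B is itself NaN, so B stays empty.
filled_means = df.fillna(df.mean())

# A dict gives a different fill value per column.
filled_mixed = df.fillna({'A': 0, 'D': df['D'].median()})

print(filled_a)
print(still_missing)
print(filled_means)
print(filled_mixed)
```

To keep the result, assign it back, for example `df['A'] = df['A'].fillna(df['A'].mean())`.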

In [3]:
df

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [None]:
# forward fill; fillna(method='ffill') is deprecated in recent pandas versions
df.ffill()

In [None]:
# backward fill; fillna(method='bfill') is deprecated in recent pandas versions
df.bfill()
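
Forward and backward fill are easiest to see on a small Series. This is a sketch with made-up values (`s` and `lead` are not part of the notebook's data). It also shows the `limit` parameter and what happens to a NaN at the very start.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, np.nan, 5])

# Forward fill: carry the last valid value forward.
forward = s.ffill()

# Backward fill: pull the next valid value backward.
backward = s.bfill()

# limit caps how many consecutive NaN values get filled.
forward_one = s.ffill(limit=1)

# A leading NaN has nothing before it, so ffill leaves it;
# chaining bfill afterwards fills it from the other side.
lead = pd.Series([np.nan, 2, np.nan])
lead_filled = lead.ffill().bfill()

print(forward.tolist(), backward.tolist())
print(forward_one.tolist(), lead_filled.tolist())
```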

In [None]:
df.fillna(0)

In [6]:
from sklearn.impute import SimpleImputer

In [9]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [10]:
imputer = SimpleImputer(strategy='constant', fill_value= -1) 
df2['A'] = imputer.fit_transform(df2[['A']])
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,-1.0,,13,18.0
3,4.0,,14,19.0
4,-1.0,,15,20.0

In [11]:
df2 = df.copy()
imputer = SimpleImputer(strategy='mean') 
df2['A'] = imputer.fit_transform(df2[['A']])
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,2.333333,,13,18.0
3,4.0,,14,19.0
4,2.333333,,15,20.0


In [12]:
df2 = df.copy()
imputer = SimpleImputer(strategy='median') 
df2['A'] = imputer.fit_transform(df2[['A']])
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,2.0,,13,18.0
3,4.0,,14,19.0
4,2.0,,15,20.0


In [13]:
data_dic = {'A':[1,2,np.nan,4,np.nan],
            'B':[np.nan,np.nan,np.nan,np.nan,np.nan],
            'C':[11,12,13,14,15],
            'D':[16,np.nan,18,19,18]}
df2 = pd.DataFrame(data_dic) 
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,18.0


In [14]:
imputer = SimpleImputer(strategy='most_frequent') 
df2['D'] = imputer.fit_transform(df2[['D']])
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,18.0
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,18.0


In [15]:
from sklearn.impute import KNNImputer

In [16]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
0,1.0,,11,16.0
1,2.0,,12,
2,,,13,18.0
3,4.0,,14,19.0
4,,,15,20.0


In [17]:
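knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")
# fit_transform returns a NumPy array, not a DataFrame;
# the all-NaN column B is dropped from the result
df2 = knn_imputer.fit_transform(df2)
df2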

array([[ 1., 11., 16.],
       [ 2., 12., 17.],
       [ 3., 13., 18.],
       [ 4., 14., 19.],
       [ 3., 15., 20.]])
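
The array above has lost the column labels and the empty column B. One way to get a labelled DataFrame back is sketched below, assuming the same toy frame. The names `kept_cols` and `imputed` are illustrative, and selecting the non-empty columns by hand before imputing is a choice made for this example, not something the imputer requires.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({'A': [1, 2, np.nan, 4, np.nan],
                   'B': [np.nan] * 5,
                   'C': [11, 12, 13, 14, 15],
                   'D': [16, np.nan, 18, 19, 20]})

# Keep only the columns that have at least one observed value,
# so the output shape matches the labels we attach afterwards.
kept_cols = df.columns[df.notna().any()]

knn_imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Wrap the returned array in a DataFrame with the original labels.
imputed = pd.DataFrame(knn_imputer.fit_transform(df[kept_cols]),
                       columns=kept_cols, index=df.index)

print(imputed)
```

Each missing value is replaced by the average of that column over the 2 nearest rows, where distance is measured on the columns both rows have observed.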

In [None]:
df

In [None]:
# fill with your own value, modifying df in place
df.fillna(0, inplace=True)

In [None]:
df

# Good Job!