# Dealing with Missing Data
### Datasets that we are given or download often contain missing values, and many machine learning algorithms either cannot handle them or give poor results when trained on such data. So, in most cases, we have to deal with missing values before training.
<br>There are two approaches to dealing with missing data:
<li> Eliminate the training examples or features with missing data
<li> Imputation
<br><br> Before we implement these techniques, let's look at how to find training examples with missing values.

In [5]:
#import libraries
import pandas as pd
import numpy as np
from io import StringIO

In [6]:
csv_data=\
'''
P,Q,R,S
11,12,15,13
5.9,9.8,,8.8
12,34,32,
'''
df=pd.read_csv(StringIO(csv_data))
print(df)

      P     Q     R     S
0  11.0  12.0  15.0  13.0
1   5.9   9.8   NaN   8.8
2  12.0  34.0  32.0   NaN


In [7]:
# isnull() returns a Boolean mask that marks the missing values in the DataFrame.
df.isnull()

       P      Q      R      S
0  False  False  False  False
1  False  False   True  False
2  False  False  False   True


In [9]:
df.isnull().sum()

P    0
Q    0
R    1
S    1
dtype: int64

We find that columns R and S each contain one missing (NaN) value.
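The counts above are per column. To find the training examples (rows) that contain missing values, the same mask can be reduced along `axis=1`. The following is a minimal sketch that rebuilds the example DataFrame; the names `missing_per_row`, `has_missing` and `rows_with_missing` are only illustrative.

```python
from io import StringIO

import pandas as pd

csv_data = """P,Q,R,S
11,12,15,13
5.9,9.8,,8.8
12,34,32,
"""
df = pd.read_csv(StringIO(csv_data))

# Count the missing values in each row (axis=1 sums across the columns).
missing_per_row = df.isnull().sum(axis=1)

# Boolean mask that is True for rows containing at least one NaN.
has_missing = df.isnull().any(axis=1)

# Keep only the training examples that have missing values.
rows_with_missing = df[has_missing]

print(missing_per_row)
print(rows_with_missing)
```

Here rows 1 and 2 are selected, because each of them has exactly one NaN.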

## Technique 1 - Eliminating training examples or features with missing values

In [10]:
# dropna() is our tool to eliminate missing data
df.dropna(axis=0)

      P     Q     R     S
0  11.0  12.0  15.0  13.0


Here axis=0 means: drop every row that contains at least one NaN value.

In [11]:
df.dropna(axis=1)

      P     Q
0  11.0  12.0
1   5.9   9.8
2  12.0  34.0


Here axis=1 means: drop every column that contains at least one NaN value.

In [12]:
df.dropna(thresh=2)

      P     Q     R     S
0  11.0  12.0  15.0  13.0
1   5.9   9.8   NaN   8.8
2  12.0  34.0  32.0   NaN


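Here thresh=2 means: keep only the rows that have at least 2 non-NaN values, i.e. drop the rows with fewer than 2. Every row in this DataFrame has at least 3 non-NaN values, so nothing is dropped.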
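Because thresh=2 leaves this DataFrame unchanged, a stricter threshold makes the effect easier to see. The sketch below rebuilds the same data and is only an illustration; `strict_rows` and `full_columns` are made-up names.

```python
from io import StringIO

import pandas as pd

csv_data = """P,Q,R,S
11,12,15,13
5.9,9.8,,8.8
12,34,32,
"""
df = pd.read_csv(StringIO(csv_data))

# Require all 4 values to be present: rows 1 and 2 have only 3, so they go.
strict_rows = df.dropna(thresh=4)

# thresh also works column-wise: keep columns with at least 3 non-NaN values.
full_columns = df.dropna(axis=1, thresh=3)

print(strict_rows)
print(full_columns)
```

With thresh=4 only row 0 survives, and the column-wise call keeps only P and Q.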

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html for the other parameters of dropna.
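Two of those other parameters are `how` and `subset`. As a rough sketch on the same data (the variable names are illustrative only), `how='all'` drops a row only when every value in it is NaN, and `subset` restricts the NaN check to the listed columns.

```python
from io import StringIO

import pandas as pd

csv_data = """P,Q,R,S
11,12,15,13
5.9,9.8,,8.8
12,34,32,
"""
df = pd.read_csv(StringIO(csv_data))

# Drop a row only if ALL of its values are NaN (no such row exists here).
all_missing_dropped = df.dropna(how="all")

# Look for NaN only in column R: row 1 is dropped, row 2 is kept
# even though it has a NaN in column S.
r_required = df.dropna(subset=["R"])

print(all_missing_dropped)
print(r_required)
```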

## Technique 2 - Imputation

If we remove training examples or features that have only a few missing values, we can lose meaningful information or insights, so elimination is often not a practical option. <br> In practice, we usually impute instead: we fill in the missing data with other values, such as the column mean.

In [13]:
# Fill each NaN with the mean of its column
df.fillna(df.mean())

      P     Q     R     S
0  11.0  12.0  15.0  13.0
1   5.9   9.8  23.5   8.8
2  12.0  34.0  32.0  10.9
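`fillna` also accepts a dictionary that maps column names to fill values, so each column can get its own replacement. The sketch below is an illustration only; filling S with 0.0 is an arbitrary choice made for the example, not a recommendation.

```python
from io import StringIO

import pandas as pd

csv_data = """P,Q,R,S
11,12,15,13
5.9,9.8,,8.8
12,34,32,
"""
df = pd.read_csv(StringIO(csv_data))

# One fill value per column: the median for R, an arbitrary constant for S.
fill_values = {"R": df["R"].median(), "S": 0.0}

# fillna returns a new DataFrame; df itself is left unchanged.
filled = df.fillna(fill_values)

print(filled)
```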


In [16]:
from sklearn.impute import SimpleImputer
# Replace each NaN with the median of its column
simpleImputer = SimpleImputer(strategy='median', missing_values=np.nan)
simpleImputer = simpleImputer.fit(df)        # learn the per-column medians
imputed_data = simpleImputer.transform(df)   # fill the NaNs with them
print(imputed_data)


[[11.  12.  15.  13. ]
 [ 5.9  9.8 23.5  8.8]
 [12.  34.  32.  10.9]]


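strategy='median' means that we fill missing data with the median of each column. Columns R and S each have only two observed values, so here the median equals the mean.
<br>Other values for strategy include:
<li> mean
<li> most_frequent
<li> constant (used together with fill_value)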
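The `most_frequent` strategy is easiest to see on data with repeated values, so the sketch below uses a small made-up array `X` rather than the DataFrame above. It also shows `strategy='constant'`, which `SimpleImputer` supports together with the `fill_value` parameter; the fill value 0.0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 holds 1, 1, 3 and column 1 holds 7, 9, 9.
X = np.array([
    [1.0, 7.0],
    [1.0, np.nan],
    [np.nan, 9.0],
    [3.0, 9.0],
])

# Fill each NaN with the most frequent value of its column.
most_frequent_imputer = SimpleImputer(strategy="most_frequent")
X_mode = most_frequent_imputer.fit_transform(X)

# Fill every NaN with the same fixed value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_const = constant_imputer.fit_transform(X)

print(X_mode)
print(X_const)
```

In `X_mode` the missing entry in column 0 becomes 1.0 and the one in column 1 becomes 9.0; in `X_const` both become 0.0.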

In [19]:
simpleImputer_mean = SimpleImputer(strategy='mean')
simpleImputer_mean = simpleImputer_mean.fit(df)
imputedMean_data = simpleImputer_mean.transform(df)
print(imputedMean_data)

[[11.  12.  15.  13. ]
 [ 5.9  9.8 23.5  8.8]
 [12.  34.  32.  10.9]]
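Because `fit` and `transform` are separate steps, the statistics learned from one dataset can be reused to fill gaps in another. The following is a minimal sketch of that idea; `train` repeats the example data, while `new_data` is a made-up row used purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    "P": [11.0, 5.9, 12.0],
    "Q": [12.0, 9.8, 34.0],
    "R": [15.0, np.nan, 32.0],
    "S": [13.0, 8.8, np.nan],
})

# Hypothetical unseen example with two missing values.
new_data = pd.DataFrame({"P": [7.0], "Q": [np.nan], "R": [20.0], "S": [np.nan]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)

# The learned per-column means are stored in the statistics_ attribute.
print(imputer.statistics_)

# Missing values in new_data are filled with the means learned from train.
new_imputed = imputer.transform(new_data)
print(new_imputed)
```

The missing Q and S values in `new_data` are filled with the training means of those columns (about 18.6 and 10.9), not with statistics computed from `new_data` itself.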


# Check out more tutorials on building good training data

In [24]:
 %load_ext watermark
 %watermark -a 'Ankur Wasnik' -u -d

Author: Ankur Wasnik

Last updated: 2021-01-14

