# Handling Missing Values

In [21]:
# importing libraries
import pandas as pd
import numpy as np
df = pd.read_csv("missing_img/gender_submission.csv")
df.head()

Unnamed: 0,PassengerId,Survived
0,892.0,0.0
1,893.0,1.0
2,894.0,0.0
3,895.0,0.0
4,896.0,1.0


In [22]:
df.isnull().sum()

PassengerId    25
Survived        3
dtype: int64

# Deleting Rows

This method is commonly used to handle null values. We either delete a row that has a null value for a particular feature, or delete a column in which more than 70-75% of the values are missing. It is advisable only when the data set has enough samples, and one has to make sure that deleting the data does not introduce bias. Removing data also loses information, which can hurt the results when predicting the output.

Pros:

Removing the rows or columns with missing values leaves a complete data set, which can give a robust model when little data is lost
Deleting a row or a column that carries no specific information is reasonable, since it does not have a high weightage


Cons:

Loss of information and data
Works poorly if the percentage of missing values is high (say 30%) relative to the whole dataset
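
The deletion rules described above can be sketched on a small frame. This is a minimal illustration, not the notebook's own cell: the `toy` data, its column names, and the 0.70 cut-off (taken from the "70-75%" guideline in the text) are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75% of the values are missing
    "c": [10.0, 20.0, 30.0, 40.0],
})

# Fraction of missing values in each column.
missing_ratio = toy.isnull().mean()

# Drop the columns whose missing fraction reaches the assumed 70% cut-off.
col_threshold = 0.70
kept_cols = toy.loc[:, missing_ratio < col_threshold]

# Then drop every remaining row that still contains a null.
cleaned = kept_cols.dropna(axis=0)
print(cleaned)
```

Dropping the mostly empty column first means fewer rows are lost in the second step than if `dropna()` were applied to the whole frame at once.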


There are two types of deletion:

1- List Wise Deletion:

In list wise deletion, we delete every observation in which any of the variables is missing. Simplicity is the major advantage of this method, but it reduces the power of the model because it reduces the sample size.

2- Pair Wise Deletion:

In pair wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. The disadvantage is that it uses a different sample size for different variables.

only cases relating to each pair of variables with missing data ...

![title](missing_img/mis1.png)
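
The difference between the two deletion types can be seen in a short sketch. The `toy` frame and its column names are made up for illustration. It relies on pandas' `DataFrame.corr()`, which excludes nulls pair by pair, as a simple stand-in for a pair wise analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null in each column.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
    "z": [5.0, 3.0, np.nan, 1.0, 0.0],
})

# List wise deletion: keep only the rows in which every variable is present.
listwise = toy.dropna()
listwise_n = len(listwise)

# Pair wise deletion: for each pair of variables, count the rows where both are present.
present = toy.notnull().astype(int)
pairwise_n = present.T.dot(present)

# DataFrame.corr() excludes nulls pair by pair, so each coefficient
# is computed from that pair's own available cases.
pairwise_corr = toy.corr()

print(listwise_n)
print(pairwise_n)
```

Here list wise deletion keeps fewer rows than any single pair has available, which is the loss of sample size that the text describes.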


In [23]:
df.dropna(inplace=True)
df.isnull().sum()

PassengerId    0
Survived       0
dtype: int64

# Replacing With Mean / Mode / Median Imputation

Imputation is a method of filling in missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean / Mode / Median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute by the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. It can be of two types:

Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and replace each missing value with it. As in the table above, the variable “Manpower” has missing values, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing values with it.

Similar Case Imputation: In this case, we calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender. For “Male”, we replace missing values of “Manpower” with 29.75, and for “Female” with 25.
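
Both variants can be sketched with pandas. The frame below is hypothetical and is not the table the text refers to, so its means differ from the figures quoted above. Only the column names `Gender` and `Manpower` are borrowed from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data; not the table referred to in the text.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "Manpower": [30.0, np.nan, 20.0, np.nan, 40.0, 30.0],
})

# Generalized imputation: one overall mean replaces every missing value.
overall_mean = toy["Manpower"].mean()
generalized = toy["Manpower"].fillna(overall_mean)

# Similar case imputation: use the mean of the non-missing values within each gender.
group_mean = toy.groupby("Gender")["Manpower"].transform("mean")
similar_case = toy["Manpower"].fillna(group_mean)

print(pd.DataFrame({"generalized": generalized, "similar_case": similar_case}))
```

`transform("mean")` returns a Series aligned with the original rows, so it can be passed straight to `fillna`.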


Pros:
This is a better approach when the data size is small
It prevents the data loss that results from removing rows and columns

Cons:
Imputing approximations distorts the variance and can add bias
Works poorly compared to multiple-imputation methods


	KNN Imputation: In this method of imputation, the missing values of an attribute are imputed using a given number of instances that are most similar to the instance whose values are missing. The similarity of two instances is determined using a distance function. It has certain advantages and disadvantages.
    
Advantages:

	k-nearest neighbour can predict both qualitative & quantitative attributes
	Creation of predictive model for each attribute with missing data is not required
	Attributes with multiple missing values can be easily treated
	Correlation structure of the data is taken into consideration

Disadvantages:

	The KNN algorithm is very time-consuming on large databases, because it searches through the whole dataset for the most similar instances.
	The choice of k is critical. A higher value of k includes neighbours that are significantly different from what we need, whereas a lower value of k misses significant neighbours.
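
The idea behind KNN imputation can be sketched by hand for the simplest case: one column with nulls and fully observed numeric features. The helper `knn_impute_column`, the `toy` frame, and the choice `k=2` are all hypothetical. A library implementation such as scikit-learn's `KNNImputer` handles the general case.

```python
import numpy as np
import pandas as pd

def knn_impute_column(frame, target, k=2):
    """Fill nulls in `target` with the mean of its k nearest complete rows.

    Hypothetical helper: assumes every other column is numeric and fully observed.
    """
    features = [c for c in frame.columns if c != target]
    known = frame[frame[target].notnull()]
    known_x = known[features].to_numpy(dtype=float)
    known_y = known[target].to_numpy(dtype=float)
    result = frame[target].copy()
    for idx in frame.index[frame[target].isnull()]:
        # Euclidean distance from the incomplete row to every complete row.
        row = frame.loc[idx, features].to_numpy(dtype=float)
        dist = np.sqrt(((known_x - row) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        result.loc[idx] = known_y[nearest].mean()
    return result

# Hypothetical data: the last weight is missing.
toy = pd.DataFrame({
    "height": [150.0, 152.0, 180.0, 182.0, 151.0],
    "weight": [50.0, 52.0, 80.0, 82.0, np.nan],
})
filled = knn_impute_column(toy, "weight", k=2)
print(filled)
```

The sketch compares raw feature values. With several features on different scales, they would normally be standardised first so that no single feature dominates the distance.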


In [24]:
df['PassengerId'].isnull().sum()

0

In [25]:
df['PassengerId'].mean()

1105.5127551020407

In [26]:
df['PassengerId'].fillna(df['PassengerId'].mean()).head(5)

0    892.0
1    893.0
2    894.0
3    895.0
4    896.0
Name: PassengerId, dtype: float64

In [27]:
# The same can be done with the median and mode values

df['PassengerId'].median()

1106.5

In [29]:
df['PassengerId'].mode().head()

0    892.0
1    893.0
2    894.0
3    895.0
4    896.0
dtype: float64
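
As the output above shows, `mode()` returns a Series rather than a single value, because several values can tie for most frequent. A minimal sketch of filling with the median and the mode, on a hypothetical frame whose column names `age` and `port` are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one quantitative and one qualitative column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0, 100.0],
    "port": ["S", "C", None, "S", "S"],
})

# Median for a quantitative column; it is less sensitive to the outlier 100 than the mean.
age_filled = toy["age"].fillna(toy["age"].median())

# mode() returns a Series because ties are possible; take its first entry.
port_filled = toy["port"].fillna(toy["port"].mode()[0])

print(age_filled.tolist())
print(port_filled.tolist())
```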

# Assigning a Unique Category

A categorical feature has a definite number of possible values, such as gender. Since it has a definite number of classes, we can assign another class to the missing values. Here, the features Cabin and Embarked have missing values, which can be replaced with a new category, say U for ‘unknown’. This strategy adds more information to the dataset, which changes the variance. Since the features are categorical, we need to apply one-hot encoding to convert them to a numeric form that the algorithm can understand. Let us look at how it can be done in Python:

Pros:

Fewer possibilities with one extra category, resulting in low variance after one-hot encoding, since it is categorical
Negates the loss of data by adding a unique category

Cons:

Adds less variance
Adds another feature to the model while encoding, which may result in poor performance
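
A minimal sketch of the two steps, replacing nulls with the category `U` and then one-hot encoding, is given here. The `toy` frame is hypothetical: it borrows the column name `Embarked` from the text, but the values are made up.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
toy = pd.DataFrame({"Embarked": ["S", None, "C", "Q", None]})

# Replace the nulls with a new category 'U' (unknown).
toy["Embarked"] = toy["Embarked"].fillna("U")

# One-hot encode; the extra category becomes its own indicator column.
encoded = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(encoded.columns.tolist())
```

The additional `Embarked_U` column is the extra feature that the last point under Cons refers to.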

In [31]:
df.head()

Unnamed: 0,PassengerId,Survived
0,892.0,0.0
1,893.0,1.0
2,894.0,0.0
3,895.0,0.0
4,896.0,1.0


In [30]:
df['PassengerId'].fillna('U').head(5)

0    892.0
1    893.0
2    894.0
3    895.0
4    896.0
Name: PassengerId, dtype: float64

# Predicting The Missing Values
Using the features which do not have missing values, we can predict the nulls with the help of a machine learning algorithm. This method may result in better accuracy, unless a missing value is expected to have a very high variance. We will use linear regression to replace the nulls in the feature ‘PassengerId’, using the other available features.

Pros:

Imputing the missing variable is an improvement as long as the bias from the same is smaller than the omitted variable bias
Yields unbiased estimates of the model parameters

Cons:

Bias also arises when an incomplete conditioning set is used for a categorical variable
Considered only as a proxy for the true values
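
Regression-based imputation can be sketched with a straight-line fit. This is an illustration under assumptions rather than the notebook's own cell: the `toy` frame with columns `x` and `y` is hypothetical, and `np.polyfit` with `deg=1` stands in for a linear regression model with a single predictor.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = 2x + 1 wherever it is observed.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, 11.0],
})

observed = toy["y"].notnull()

# Fit a straight line on the rows where the target is observed.
slope, intercept = np.polyfit(
    toy.loc[observed, "x"].to_numpy(),
    toy.loc[observed, "y"].to_numpy(),
    deg=1,
)

# Keep the observed values and use the prediction only where the target is missing.
predicted = slope * toy["x"] + intercept
toy["y_imputed"] = toy["y"].where(observed, predicted)

print(toy)
```

The imputed value lies exactly on the fitted line, which is why such values are considered only a proxy for the true ones.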

# Using Algorithms Which Support Missing Values

KNN is a machine learning algorithm that works on the principle of a distance measure. It can be used when nulls are present in the dataset.

Another algorithm that can be used here is RandomForest. This model produces a robust result because it works well on non-linear and categorical data. It adapts to the data structure, taking high variance or bias into consideration, and produces better results on large datasets.

Pros:

Does not require creation of a predictive model for each attribute with missing data in the dataset
Correlation of the data is neglected

Cons:
It is a very time-consuming process, which can be critical in data mining where large databases are being processed
The choice of distance function (Euclidean, Manhattan, etc.) does not always yield a robust result