# **Methods for Dealing with Missing Values.**

> [**Missing Data - Wikipedia**](https://en.wikipedia.org/wiki/Missing_data)

In statistics, missing data, or missing values, occur when no data value gets stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can occur because of nonresponse, i.e., no information is provided for one or more items or a whole unit ("subject"). These forms of missingness take different types, with different impacts on the validity of conclusions from the research.

*   **Missing Completely at Random**
*   **Missing at Random**
*   **Missing Not at Random**

>  **Missing Completely at Random:** Values in a dataset are **missing completely at random (MCAR)** if the events that lead to any particular data item being missing are independent both of observable variables and unobservable parameters of interest and occur entirely at random. When data are MCAR, the analysis performed on the data is unbiased; however, data are rarely MCAR. In the case of MCAR, the missingness of data is unrelated to any study variable. Thus, the participants with completely observed data are in effect a random sample of all the participants assigned a particular intervention. With MCAR, the random assignment of treatments is assumed to be preserved, but that is usually an unrealistically strong assumption in practice.

>  **Missing at Random:** Missing at random (MAR) occurs when the missingness is not random, but where missingness can be fully accounted for by variables where there is complete information. Since MAR is an assumption that is impossible to verify statistically, we must rely on its substantive reasonableness. An example is that males are less likely to fill in a depression survey, but this has nothing to do with their level of depression after accounting for maleness. Depending on the analysis method, these data can still induce parameter bias in analyses due to the contingent emptiness of cells (male, very high depression may have zero entries). However, if the parameter gets estimated with Full Information Maximum Likelihood, MAR will provide asymptotically unbiased estimates.

>  **Missing Not at Random:** Missing not at random (MNAR) is data that is neither MAR nor MCAR (i.e., the value of the variable that's missing is related to the reason it's missing). To extend the previous example, this would occur if men failed to fill in a depression survey because of their level of depression.
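
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables `male` and `score`, the missingness probabilities, and the cut-off of 60 are all assumptions chosen for the demonstration, not values taken from any study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical survey: 'male' is always observed, 'score' may go missing.
male = rng.random(n) < 0.5
score = rng.normal(loc=50.0 + 5.0 * male, scale=10.0)

mechanisms = {
    # MCAR: every score has the same 30% chance of being missing.
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on the observed variable 'male'.
    "MAR": rng.random(n) < np.where(male, 0.5, 0.1),
    # MNAR: missingness depends on the (unobserved) score itself.
    "MNAR": rng.random(n) < np.where(score > 60.0, 0.6, 0.1),
}

true_mean = score.mean()
results = {name: score[~missing].mean() for name, missing in mechanisms.items()}
for name, observed_mean in results.items():
    print(f"{name}: observed mean {observed_mean:.2f} vs. true mean {true_mean:.2f}")

# Under MAR the bias disappears once we condition on the observed variable.
mar_male_observed = score[male & ~mechanisms["MAR"]].mean()
print(f"MAR, males only: {mar_male_observed:.2f} vs. {score[male].mean():.2f}")
```

In this setup the complete-case mean stays close to the truth under MCAR, drifts under MAR until the comparison is made within each group, and stays biased under MNAR because the cause of the missingness is never observed.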

## **Techniques of Dealing with Missing Data.**

Missing data reduces the representativeness of the sample and can therefore distort inferences about the population. Generally speaking, there are three main approaches to handle missing data:

1.   **Imputation:** where values are filled in the place of missing data.

2.   **Omission:** where samples with invalid data are discarded from further analysis.

3.   **Analysis:** by directly applying methods unaffected by the missing values.

In [1]:
# Import Library.
import pandas as pd
import numpy as np

# Load Dataset.
data = pd.read_csv(
    "http://www.creditriskanalytics.net/uploads/1/9/5/1/19511601/hmeq.csv"
)
data.head()

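|   | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|-----|------|---------|-------|--------|-----|-----|-------|--------|-------|------|------|---------|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |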


In [2]:
# Copy Dataframe.
df_copy = data.copy()

In [3]:
# Check for Missing Values.
print(data.isnull().sum())

BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64


## **Delete Rows (or Columns) with Missing Values.**

This is a commonly used way to handle null values. We either delete a row if it has a null value for a particular feature, or delete a column if more than $70-75\%$ of its values are missing. This method is advised only when the dataset has enough samples. One has to make sure that deleting the data does not introduce bias. Removing data also loses information, which can degrade the model's predictions.

**Pros:**

*   Removing data with missing values can yield a robust and accurate model, provided enough complete data remains.
*   Deleting a row or a column that carries little information costs little, since it has a low weight in the analysis.

**Cons:**
*   Loss of information and data.
*   Works poorly if the percentage of missing values is high (say $30\%$), compared to the whole dataset.
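
The column-then-row rule described above can be sketched as a small helper. The function name `drop_sparse`, the default threshold of 0.7, and the toy frame are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse(df, col_threshold=0.7):
    """Drop columns whose missing fraction exceeds col_threshold, then drop incomplete rows."""
    missing_fraction = df.isnull().mean()
    keep = missing_fraction[missing_fraction <= col_threshold].index
    return df[keep].dropna()


toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, np.nan, 4.0, 5.0],
        "b": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
        "c": ["x", "y", "z", None, "x"],
    }
)

cleaned = drop_sparse(toy)
print(cleaned)
print(f"kept {len(cleaned)} of {len(toy)} rows")
```

Printing how many rows survive is a cheap check on the second drawback listed above: if most rows disappear, deletion is probably the wrong tool.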

In [4]:
# Delete Entire Column (Feature) with Missing Values.
del df_copy["DEBTINC"]
print(df_copy.isnull().sum())
print(df_copy.shape)

BAD          0
LOAN         0
MORTDUE    518
VALUE      112
REASON     252
JOB        279
YOJ        515
DEROG      708
DELINQ     580
CLAGE      308
NINQ       510
CLNO       222
dtype: int64
(5960, 12)


In [5]:
# Delete Rows with Missing Values.
df_copy.dropna(inplace=True)
print(df_copy.isnull().sum())
print(df_copy.shape)

BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
dtype: int64
(4247, 12)


## **Impute Missing Values with Mean, Median, and Mode.**

This strategy applies to a feature with numeric data, such as a person's age or a ticket fare. We calculate the mean, median, or mode of the feature and use it to replace the missing values. This is an approximation: it keeps every row, which often gives better results than removing rows and columns, but it distorts the feature's distribution, because filling in a single repeated value shrinks the variance. Substituting one of these three statistics is a simple statistical approach to handling missing values. If the statistic is computed on the full dataset before the train/test split, information from the test set leaks into training (data leakage). Another way is to approximate a missing value from its neighboring values, for example by interpolation. This works better if the data is roughly linear.
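
A quick way to see the effect on the spread of a feature is to impute a synthetic column and compare standard deviations. This is a minimal sketch on made-up data; the distribution, the 30% missing rate, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
values = pd.Series(rng.normal(loc=100.0, scale=15.0, size=1_000))

# Hide roughly 30% of the values completely at random.
with_gaps = values.mask(rng.random(values.size) < 0.3)

mean_filled = with_gaps.fillna(with_gaps.mean())
median_filled = with_gaps.fillna(with_gaps.median())

print("std of observed values:     ", with_gaps.std())
print("std after mean imputation:  ", mean_filled.std())
print("std after median imputation:", median_filled.std())
```

The mean of the column is preserved by mean imputation, but the standard deviation drops, since every filled cell sits exactly at the center of the distribution.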

**Pros:**

*   It is a better approach when the data size is small.
*   It prevents the data loss that results from removing rows and columns.

**Cons:**

*   Imputing approximate values distorts the variance and can introduce bias.
*   Works poorly compared to multiple-imputation methods.
*   Mean and median imputation work only with numerical variables.
*   Can cause data leakage if the statistic is computed before the train/test split.
*   Does not account for the covariance between features.
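
One common way to limit the leakage risk is to learn the fill value from the training rows only and reuse it on the test rows. The sketch below uses scikit-learn's `SimpleImputer` on synthetic data; the array shapes, missing rate, and split size are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan  # toy data with about 20% missing cells

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)  # medians learned from training rows only
X_test_filled = imputer.transform(X_test)        # the same medians reused on the test rows

print("medians learned from the training split:", imputer.statistics_)
```

Calling `fillna` on the whole frame before splitting would instead let the test rows influence the fill value.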

In [6]:
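# Imputation Using the Mean Values.

# Copy Dataframe.
df_mean_impute = data.copy()

# Replace missing values in a single column:
# df_mean_impute["CLNO"] = df_mean_impute["CLNO"].fillna(df_mean_impute["CLNO"].mean())

# Replace missing values in every numeric column with that column's mean.
# numeric_only=True skips the string columns (REASON, JOB), which stay unfilled.
df_mean_impute = df_mean_impute.fillna(df_mean_impute.mean(numeric_only=True))
print(df_mean_impute.isnull().sum())
print(df_mean_impute.shape)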

BAD          0
LOAN         0
MORTDUE      0
VALUE        0
REASON     252
JOB        279
YOJ          0
DEROG        0
DELINQ       0
CLAGE        0
NINQ         0
CLNO         0
DEBTINC      0
dtype: int64
(5960, 13)




In [7]:
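# Imputation Using the Median Values.

# Copy Dataframe.
df_median_impute = data.copy()

# Replace missing values in a single column:
# df_median_impute["CLNO"] = df_median_impute["CLNO"].fillna(df_median_impute["CLNO"].median())

# Replace missing values in every numeric column with that column's median.
# numeric_only=True skips the string columns (REASON, JOB), which stay unfilled.
df_median_impute = df_median_impute.fillna(df_median_impute.median(numeric_only=True))
print(df_median_impute.isnull().sum())
print(df_median_impute.shape)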

BAD          0
LOAN         0
MORTDUE      0
VALUE        0
REASON     252
JOB        279
YOJ          0
DEROG        0
DELINQ       0
CLAGE        0
NINQ         0
CLNO         0
DEBTINC      0
dtype: int64
(5960, 13)




In [8]:
# Imputation Using the Mode Values.

# Copy Dataframe.
df_mode_impute = data.copy()

# Replace missing values in a single column:
# df_mode_impute["CLNO"] = df_mode_impute["CLNO"].fillna(df_mode_impute["CLNO"].mode()[0])

# mode() returns a DataFrame (one row per tied mode). Passing it to fillna()
# directly aligns on the row index and leaves most gaps unfilled, so take the
# first row to get one fill value per column.
df_mode_impute = df_mode_impute.fillna(df_mode_impute.mode().iloc[0])
print(df_mode_impute.isnull().sum())
print(df_mode_impute.shape)

BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
(5960, 13)


## **Imputation Method for Categorical Columns (Assigning a Unique Category):**

When the missing values are in categorical columns (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can instead be replaced with a new category.

**Pros:**

*   Prevents the data loss that deleting rows or columns would cause.
*   Works well with a small dataset and is easy to implement.
*   Negates the loss of data by adding a unique category.
*   Adds only one extra category, so one-hot encoding gains few extra columns and the added variance stays low.

**Cons:**

*   Addition of new features to the model while encoding, which may result in poor performance.
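
Both options for a categorical column can be sketched on a toy series. The values below and the placeholder label `"Missing"` are assumptions for illustration; any label that does not collide with a real category would do.

```python
import numpy as np
import pandas as pd

jobs = pd.Series(["Office", "Other", np.nan, "Other", np.nan, "Sales"], name="JOB")

# Option 1: replace missing entries with the most frequent category.
mode_filled = jobs.fillna(jobs.mode()[0])

# Option 2: keep missingness visible as its own category.
flagged = jobs.fillna("Missing")

# One-hot encoding shows the extra column created by option 2.
mode_dummies = pd.get_dummies(mode_filled)
flagged_dummies = pd.get_dummies(flagged)
print(mode_dummies.columns.tolist())
print(flagged_dummies.columns.tolist())
```

The second encoding has one more column than the first, which is the extra feature the drawback above refers to.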

In [9]:
# Imputation Using (Zero/Constant) Values.

# Copy Dataframe.
df_constant_impute = data.copy()

# Replace missing values in a categorical column with a new category:
# df_constant_impute["REASON"] = df_constant_impute["REASON"].fillna("NA")

# Replace every missing value with a number.
# Note: this also writes the integer 0 into the string columns (REASON, JOB).
df_constant_impute = df_constant_impute.fillna(0)
print(df_constant_impute.isnull().sum())
print(df_constant_impute.shape)

BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64
(5960, 13)


# **Using Algorithms that Support Missing Values.**

KNN is a machine learning algorithm that works on the principle of distance measure. It can be used when there are null values present in the dataset: each missing value is filled in from the $K$ nearest samples, taking the majority value for a categorical feature or the mean for a numeric one. scikit-learn's `KNNImputer` uses the mean of the neighbors.

> [**Imputation of Missing Values**](https://scikit-learn.org/stable/modules/impute.html#nearest-neighbors-imputation)
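
A tiny worked example helps show what nearest-neighbor imputation does. The four-row matrix below is made up for illustration; with `n_neighbors=2`, each gap is filled with the mean of that feature over the two closest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
# The gap in row 0 becomes 4.0 (mean of 3.0 and 5.0);
# the gap in row 2 becomes 5.5 (mean of 3.0 and 8.0).
```

Distances are computed only over the features both rows have observed, which is why rows with gaps can still act as neighbors.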

Another algorithm that can be used here is RandomForest. This model produces a robust result because it works well on non-linear and categorical data. It adapts to the structure of the data, accounting for high variance or bias, and tends to produce better results on large datasets.

**Pros:**

*  Does not require the creation of a predictive model for each attribute with missing data in the dataset.

*  The correlation of the data is neglected.

**Cons:**

*  It is a very time-consuming process, and it can be critical in data mining where large databases are being extracted.

*  The result depends on the choice of distance function (Euclidean, Manhattan, etc.), and no single choice yields a robust result for every dataset.

In [10]:
from sklearn.impute import SimpleImputer, KNNImputer


def imputeF20(X, method="none"):
    """Return X as a DataFrame, imputed with the chosen method."""
    if method == "none":
        return pd.DataFrame(X)
    if method == "drop":
        # Expects a DataFrame that contains the "DEBTINC" column.
        return pd.DataFrame(X.drop("DEBTINC", axis=1).values)
    if method in ("constant", "mean", "median", "most_frequent"):
        imp = SimpleImputer(strategy=method)
    elif method == "knn":
        imp = KNNImputer(n_neighbors=5)
    else:
        # Fail early instead of hitting an undefined `imp` below.
        raise ValueError(f"Unknown imputation method: {method!r}")

    return pd.DataFrame(imp.fit_transform(X))

In [11]:
# Select the numeric feature columns by position:
# LOAN, MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO.
slc = [1, 2, 3, 6, 7, 8, 9, 10, 11]
features = data.iloc[:, slc].to_numpy(dtype=float)
features_impute = imputeF20(features, "knn")

print(features_impute.isnull().sum())
print(features_impute.shape)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64
(5960, 9)


# **DataWig: A framework for learning models to impute missing values in tables.**

*  [**Welcome to DataWig’s documentation!**](https://datawig.readthedocs.io/en/latest/index.html)

*  [**DataWig Github**](https://github.com/awslabs/datawig)

# **Imputation by Multivariate Imputation by Chained Equation (MICE)**

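MICE replaces missing values in a dataset through multiple imputations: each feature with missing values is modeled in turn as a function of the other features, and the cycle repeats until the imputed values stabilize.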

> [**MICE**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/)

> [**The MICE Algorithm**](https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html)
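
The "multiple" part of multiple imputation can be sketched with scikit-learn's `IterativeImputer` by drawing several completed datasets and pooling an estimate across them. This is an approximation of the MICE workflow under stated assumptions, not a full implementation: the synthetic two-column data, the five draws, and the pooled statistic (a column mean) are illustrative choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=300)

X_missing = X.copy()
hidden = rng.random(300) < 0.3
X_missing[hidden, 1] = np.nan  # hide about 30% of the second column

# sample_posterior=True draws imputations from the predictive distribution,
# so different seeds give different completed datasets.
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]

# Pool a simple statistic (the mean of column 1) across the completed datasets.
estimates = [c[:, 1].mean() for c in completed]
print("per-imputation means:", np.round(estimates, 3))
print("pooled mean:", np.mean(estimates))
```

The spread of the per-imputation estimates gives a rough sense of how much uncertainty the missing values add, which a single fill-in cannot show.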

In [12]:
# Import Library.
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Load Dataset.
data = sns.load_dataset("titanic")

# Feature Engineering.
# .copy() avoids pandas' SettingWithCopyWarning when "sex" is re-encoded below.
data = data[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare"]].copy()
data["sex"] = (data["sex"] == "male").astype(int)
data.head()

# Handling Missing Values.
imputer = IterativeImputer(
    imputation_order="ascending", max_iter=10, random_state=42, n_nearest_features=5
)
imputed_dataset = imputer.fit_transform(data)