# Handling Missing Values

Missing data is a common problem in real-world data analysis. It can arise from data collection or data entry errors, data corruption, or data loss. Because incomplete data can reduce the accuracy and reliability of analysis results, handling missing values is an important part of data preprocessing in data science.

There are several ways to handle missing data, and the most commonly used methods are as follows:

## Deletion

Deletion refers to removing the rows or columns that contain missing values from the dataset. This is the simplest approach and is often used when the amount of missing data is small. There are three types of deletion:

- **Listwise deletion (or complete case analysis):** This involves removing any row that contains missing values. This can lead to a significant loss of data, especially if the percentage of missing values is high.
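- **Pairwise deletion (or available-case analysis):** Instead of dropping a row everywhere, each analysis uses every row in which the variables it needs are observed. For example, a correlation between two columns uses all rows where both columns are present. This approach retains more data than listwise deletion, but different statistics end up based on different subsets of rows, and the estimates may be biased unless the values are missing completely at random (MCAR).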
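- **Column-wise deletion:** This involves removing every variable (column) that has at least one missing value. It also discards all the observed values in those columns, so it is usually reserved for columns that are mostly empty.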

In [1]:
import pandas as pd
import numpy as np

# create a dataframe with missing values
df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})

# display the dataframe
print("Original data:\n", df)

# List-wise deletion
new_df = df.dropna()
print("\nDataframe after list-wise deletion:\n", new_df)

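# Pair-wise deletion, approximated here by dropping a row only when col1 or col2
# is missing; a NaN in col3 does not cause the row to be dropped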
new_df = df.dropna(subset=['col1', 'col2'])
print("\nDataframe after pair-wise deletion:\n", new_df)

# Column-wise deletion (every column in this toy example contains a NaN,
# so no columns survive and the result is empty)
new_df = df.dropna(axis=1)
print("\nDataframe after column-wise deletion:\n", new_df)

Original data:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   NaN   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Dataframe after list-wise deletion:
    col1  col2  col3
0   1.0   6.0  11.0
4   5.0  10.0  15.0

Dataframe after pair-wise deletion:
    col1  col2  col3
0   1.0   6.0  11.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Dataframe after column-wise deletion:
 Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
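
Note that `dropna(subset=...)` still removes whole rows, so it behaves like listwise deletion restricted to a few columns. In the stricter sense, pairwise deletion is applied per analysis. As a minimal sketch (assuming the same toy dataframe, and relying on pandas' `DataFrame.corr` excluding missing values pair by pair; the names `observed` and `pair_counts` are only illustrative), the two approaches can be compared like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})

# Pairwise deletion: each correlation uses all rows where BOTH columns are present
pairwise_corr = df.corr()

# Listwise deletion: every correlation uses only the fully complete rows
listwise_corr = df.dropna().corr()

# Number of rows available to each pair of columns under pairwise deletion
observed = df.notna().astype(int)
pair_counts = observed.T @ observed

print("Rows used per column pair (pairwise):\n", pair_counts)
print("\nRows used under listwise deletion:", len(df.dropna()))
```

In this example every pair of columns shares three rows, while listwise deletion leaves only two rows for all statistics.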


## Imputation

Imputation involves filling in the missing values with a substitute value. The substitute value can be a fixed value such as the mean or median of the column or a predicted value based on the values of other variables in the dataset. There are several types of imputation methods:

- **Mean Imputation:** Mean imputation is a simple technique that involves replacing missing values with the mean value of the non-missing values in the same column. This method is commonly used for continuous variables with a roughly symmetric distribution. Mean imputation is quick and easy, but the mean itself is distorted when the data has outliers or is skewed, and filling every gap with the same value artificially reduces the variance of the column and weakens its correlations with other variables.
- **Median Imputation:** Median imputation is similar to mean imputation, but instead of using the mean, it uses the median value of the non-missing values in the same column to replace missing values. Median imputation is a better option for data with outliers or a skewed distribution. However, like mean imputation, it shrinks the variance of the column and can bias estimates that depend on relationships between variables.
- **Mode Imputation:** Mode imputation is used for categorical variables and involves replacing missing values with the mode (most common value) of the non-missing values in the same column. This method is quick and easy, but it can lead to biased estimates if the mode is not representative of the population or if there are multiple modes.
- **Regression Imputation:** Regression imputation is a more sophisticated technique that fits a regression model on the rows where the variable is observed, using other variables in the dataset as predictors, and then predicts the missing values from that model. It can give more accurate estimates than mean or median imputation, but only when the variable is strongly related to the predictors, and because the imputed values fall exactly on the fitted line it tends to overstate the strength of that relationship.
- **Hot-Deck Imputation:** Hot-deck imputation is a method that involves replacing missing values with values from similar cases in the dataset. This method is similar to regression imputation, but instead of using a regression model, it uses the values from similar cases in the dataset. Hot-deck imputation can be useful when there is a high correlation between the missing variable and other variables in the dataset. However, it can also lead to biased estimates if the selected cases are not representative of the population.
- **K-Nearest Neighbors (KNN) imputation:** This involves predicting the missing values using the values of the k-nearest neighbors in the dataset. This can be a more flexible approach than regression imputation as it does not require a linear relationship between the variables.
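
As a quick check on how mean imputation affects the spread of a column, the sketch below uses synthetic data (the distribution parameters, the 30% missing rate, and the seed are all arbitrary assumptions made for illustration) and compares the standard deviation before and after imputation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility

# Synthetic column: 1000 values, of which 300 are then marked as missing
full = pd.Series(rng.normal(loc=50, scale=10, size=1000))
observed = full.copy()
missing_idx = rng.choice(len(full), size=300, replace=False)
observed.loc[missing_idx] = np.nan

# Mean imputation: every gap receives the same value
imputed = observed.fillna(observed.mean())

print("Std of observed values:       ", observed.std())
print("Std after mean imputation:    ", imputed.std())
print("Mean before / after imputation:", observed.mean(), imputed.mean())
```

The mean is unchanged, but the standard deviation after imputation is smaller, because the imputed points sit exactly at the center of the distribution.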

In [2]:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# create a dataframe with missing values
df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})

# display the dataframe
print("Original data:\n", df)

# Mean Imputation
new_df = df.copy()
mean_value = new_df['col1'].mean()
# assign the result back; in-place fillna on a selected column is deprecated in recent pandas
new_df['col1'] = new_df['col1'].fillna(mean_value)
print("\nMean Imputation:\n", new_df)

# Median Imputation
new_df = df.copy()
median_value = new_df['col1'].median()
new_df['col1'] = new_df['col1'].fillna(median_value)
print("\nMedian Imputation:\n", new_df)

# Mode Imputation
new_df = df.copy()
# all observed values are unique here, so mode() returns them all sorted and [0] picks the smallest
mode_value = new_df['col1'].mode()[0]
new_df['col1'] = new_df['col1'].fillna(mode_value)
print("\nMode Imputation:\n", new_df)

# Regression Imputation
new_df = df.copy()
model = LinearRegression()
complete_rows = new_df.dropna()  # rows without any missing values
x_train = complete_rows[['col2', 'col3']]  # predictors
y_train = complete_rows['col1']  # target variable
model.fit(x_train, y_train)
missing_col1 = new_df['col1'].isna()  # rows where col1 must be predicted
x_test = new_df.loc[missing_col1, ['col2', 'col3']]  # predictors must be observed in these rows
new_df.loc[missing_col1, 'col1'] = model.predict(x_test)
print("\nRegression Imputation:\n", new_df)

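# Hot-Deck Imputation (sequential variant: the donor is the previous observation)
new_df = df.copy()
missing_index = np.where(new_df['col1'].isnull())[0]
for i in missing_index:
    if i > 0:  # the first row has no previous observation; without this check i-1 would wrap to the last row
        new_df.iloc[i, 0] = new_df.iloc[i-1, 0]  # fill missing value with the value of the previous observation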
print("\nHot-Deck Imputation:\n", new_df)

# K-Nearest Neighbors (KNN) imputation
imputer = KNNImputer(n_neighbors=2)
df_impute_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print("\nK-Nearest Neighbors (KNN) imputation:\n", df_impute_knn)

Original data:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   NaN   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Mean Imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   3.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Median Imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   3.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Mode Imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   1.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Regression Imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   3.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Hot-Deck Imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   2.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

K-Nearest Neighbors (KNN) imputation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   7.0  12.0
2   3.0   8.0  13.0
3   4.0   9.0  14.0
4   5.0  10.0  15.0
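
The hot-deck cell above takes the previous row as the donor. Another common variant, random hot-deck, draws the donor at random from the rows where the value is observed. A minimal sketch, assuming the same toy dataframe (the seed and the names `is_missing` and `donors` are illustrative choices, not part of the original example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})

rng = np.random.default_rng(42)  # fixed seed, only for reproducibility

new_df = df.copy()
is_missing = new_df['col1'].isna()

# Donor pool: the observed values of the same column
donors = new_df.loc[~is_missing, 'col1'].to_numpy()

# Draw one donor value (with replacement) for each missing entry
new_df.loc[is_missing, 'col1'] = rng.choice(donors, size=is_missing.sum())
print("Random hot-deck imputation:\n", new_df)
```

Because the imputed values are drawn from the observed ones, this variant preserves the spread of the column better than filling every gap with a single value, at the cost of adding random noise.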

## Prediction

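Prediction involves using a statistical or machine learning model to predict the missing values based on other variables in the dataset. It generalizes regression imputation: any supervised model can take the place of the linear regression, which allows for more complex, nonlinear relationships between the variables. However, it needs enough complete rows to train the model and can be computationally expensive. The example below uses a linear model for simplicity, so it produces the same result as the regression imputation shown earlier.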

In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# create a dataframe with missing values
df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})
print("Original data:\n", df)

model = LinearRegression()
x_train = df.dropna()[['col2', 'col3']]  # data points without missing values
y_train = df.dropna()['col1']  # target variable without missing values
model.fit(x_train, y_train)
x_test = df[df['col1'].isnull()][['col2', 'col3']]  # data points with missing values
df.loc[df['col1'].isnull(), 'col1'] = model.predict(x_test)  # fill missing values with predicted values
print("\nData after handling missing values:\n", df)

Original data:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   NaN   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Data after handling missing values:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   3.0   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0
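
The cell above fills only `col1`, and only in rows where both predictors are observed, so the gaps in `col2` and `col3` remain. One way to predict every column from the others is scikit-learn's `IterativeImputer`, which is still marked experimental and therefore needs the explicit enabling import. A minimal sketch, assuming the same toy dataframe and default settings apart from an arbitrary `max_iter` and `random_state`:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable IterativeImputer)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})

# Each column with missing values is modeled in turn as a function of the other columns
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("Data after iterative imputation:\n", filled.round(2))
```

Observed values are left untouched; only the missing entries are replaced by model predictions.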


## Interpolation

Interpolation involves estimating the missing values based on the values of the neighboring points in the dataset. This is often used for ordered data such as time series, where the observations just before and after a gap are informative about the missing value. There are several types of interpolation methods such as linear, cubic, and spline interpolation.

In [4]:
import numpy as np
import pandas as pd

# create a dataframe with missing values
df = pd.DataFrame({
    'col1': [1, 2, np.nan, 4, 5],
    'col2': [6, np.nan, 8, 9, 10],
    'col3': [11, 12, 13, np.nan, 15]
})
print("Original data:\n", df)

df = df.interpolate(method='linear')  # linear interpolation, treating rows as equally spaced
print("\nData after Interpolation:\n", df)

Original data:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   NaN  12.0
2   NaN   8.0  13.0
3   4.0   9.0   NaN
4   5.0  10.0  15.0

Data after Interpolation:
    col1  col2  col3
0   1.0   6.0  11.0
1   2.0   7.0  12.0
2   3.0   8.0  13.0
3   4.0   9.0  14.0
4   5.0  10.0  15.0
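
For time-series data the spacing between observations matters. The sketch below uses made-up dates and values, purely for illustration, to contrast pandas' `method='linear'`, which ignores the index, with `method='time'`, which weights by the actual time gaps:

```python
import numpy as np
import pandas as pd

# Unevenly spaced daily observations with two missing values
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-04', '2024-01-08'])
s = pd.Series([10.0, np.nan, np.nan, 80.0], index=dates)

# 'linear' treats the points as equally spaced, regardless of the dates
by_position = s.interpolate(method='linear')

# 'time' interpolates in proportion to the elapsed time between observations
by_time = s.interpolate(method='time')

print("Linear (by position):\n", by_position)
print("\nTime-weighted:\n", by_time)
```

With seven days between the two observed values, the time-weighted result rises by 10 per day (20 on January 2 and 40 on January 4), while the positional result splits the gap into three equal steps.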
