# **Handling Missing Values**

Several techniques can be used to handle missing values:

| **Technique**                 | **Description**                                        | **When to Use**                                       | **Advantages**                                                              | **Disadvantages**                                                             | **Python Implementation**                                        | **Package**                    |
|-------------------------------|-------------------------------------------------------|-------------------------------------------------------|----------------------------------------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------------|--------------------------------|
| **Row Deletion**               | Deletes rows with missing values                      | When there are few missing values                     | Simple and easy to apply                                                   | Loss of important data/information if there are many missing values           | `data.dropna()`                                                  | **pandas**                     |
| **Column Deletion**            | Deletes columns with many missing values              | When the column is irrelevant or has more than 50% missing | Simple, can improve efficiency                                              | Loss of important variables if applied carelessly                             | `data.drop(columns=['column'])`                                  | **pandas**                     |
| **Mean/Median/Mode Imputation**| Replaces missing values with the mean, median, or mode | When data is numerical (mean/median) or categorical (mode) | Quick and simple                                                           | Ignores data variability, can introduce bias                                 | `SimpleImputer(strategy='mean')`                                 | **sklearn.impute**              |
| **K-Nearest Neighbors (KNN)**  | Replaces missing values based on nearest neighbors    | When there is a complex relationship between variables | Considers relationships between variables                                  | Time-consuming and resource-heavy for large datasets                         | `KNNImputer(n_neighbors=5)`                                      | **sklearn.impute**              |
| **Regression Imputation**      | Replaces missing values based on regression prediction| When variables are related                            | Considers deeper relationships between variables                            | More complex and requires assumptions about variable relationships            | Create your own regression model using `LinearRegression()`      | **sklearn.linear_model**        |
| **Indicator for Missingness**  | Marks whether values are missing with a new variable  | When missing values themselves can be important       | Retains information from missing values                                     | Expands the dataset by adding extra columns                                 | `data['missing_indicator'] = data['column'].isnull().astype(int)`| **pandas**                     |
| **Algorithms That Tolerate Missing Values** | Uses models that handle missing values internally | When using tree-based models such as gradient boosting | No need to perform imputation | Not all algorithms support missing values | `HistGradientBoostingClassifier` and XGBoost accept `NaN` inputs; `RandomForestClassifier` does so only in recent scikit-learn releases (1.4+) | **sklearn.ensemble**, **xgboost**|
| **Interpolation**              | Uses data patterns or trends to estimate missing values| For time series or sequential data                    | Uses existing data trends                                                   | Works only with sequential or time series data                               | `data['column'].interpolate(method='linear')`                   | **pandas**                     |
| **Hot Deck Imputation**        | Replaces missing values with values from similar observations| In census or survey data                             | Considers similarity between observations                                  | Requires in-depth understanding of the data to identify similarities          | Not available in standard libraries, must be implemented manually | Manual implementation          |
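| **Multiple Imputation**        | Generates several imputed datasets and combines the results | When there are many missing values and complex analysis | Considers uncertainty in the imputation                                     | Complex and requires more time and resources                                 | `IterativeImputer(sample_posterior=True)` run with several seeds (requires `from sklearn.experimental import enable_iterative_imputer`), or `MICE` in statsmodels | **sklearn.impute**, **statsmodels** |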
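
Two rows of the table, the missingness indicator and interpolation, need only pandas. The sketch below shows them on a small made-up series; the `temperature` column and its values are assumptions for illustration, not part of the student dataset used later.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a short ordered series with two gaps
toy = pd.DataFrame({"temperature": [20.0, np.nan, 24.0, np.nan, 28.0]})

# Indicator for missingness: record where values were missing before filling them
toy["temperature_missing"] = toy["temperature"].isna().astype(int)

# Linear interpolation: estimate each gap from its neighbouring values
toy["temperature_filled"] = toy["temperature"].interpolate(method="linear")

print(toy)
```

Creating the indicator before filling matters: once the gaps are interpolated, the information about which values were originally missing is gone.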


## **Packages and Dataset**

In [10]:
# Packages
import pandas as pd
import numpy as np
import seaborn as sns

dataset = pd.read_csv('/Users/Shared/Cloud Drive/repo_adi/dataset/Students_Performance_knn.csv')
dataset.isna().sum()

# access the rows where the 'lunch' column is NaN
dataset[dataset['lunch'].isnull()].head()

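|    | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|----|--------|----------------|-----------------------------|-------|-------------------------|------------|---------------|---------------|
| 6  | female | group B | some college | NaN | completed | 88 | 95 | 92 |
| 41 | female | group C | associate's degree | NaN | none | 58 | 73 | 68 |
| 53 | male | group D | high school | NaN | none | 88 | 78 | 75 |
| 70 | female | group D | some college | NaN | completed | 58 | 63 | 73 |
| 95 | male | group C | associate's degree | NaN | completed | 78 | 81 | 82 |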


## **A. Row & Column Deletion**

In [11]:
# implement Row Deletion
row_del_dataset = dataset.dropna(axis=0)  # drop every row that has a null value in any column
print(row_del_dataset.isna().sum())

# use `subset` to check for null values only in specific columns
row_del_dataset_with_targeted_col = dataset.dropna(subset=['lunch'], axis=0)
row_del_dataset_with_targeted_col.isna().sum()



gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64


gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [12]:
# implement Column Deletion

col_del_dataset = dataset.dropna(axis=1)  # drop every column that has a null value in any row
print(col_del_dataset.isna().sum())


gender                         0
race/ethnicity                 0
parental level of education    0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64
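
`dropna(axis=1)` removes a column as soon as it contains a single null value, which is stricter than the "more than 50% missing" guideline in the table. A minimal sketch of a threshold-based alternative follows. The toy frame and the `max_missing` cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): column "b" is 75% missing, column "c" is 25% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values per column
missing_ratio = toy.isna().mean()

# Keep only the columns whose missing fraction is at most the cut-off
max_missing = 0.5  # assumed cut-off, adjust to the problem at hand
kept = toy.loc[:, missing_ratio <= max_missing]

print(missing_ratio)
print(kept.columns.tolist())
```

Here only column `b` is dropped, while `c` is kept and can be imputed with one of the techniques below.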


## **B. Mean/Median/Mode Imputation**

When imputing with scikit-learn's `SimpleImputer`, the `strategy` parameter accepts the following options:

1. `mean`, `median` : numerical variables only

2. `most_frequent` : the mode; works for both numerical and string variables

3. `constant` : fills with the value given in `fill_value`; works for both numerical and string variables

4. callable : a custom function applied to the observed values of each column (available in recent scikit-learn versions); works for both numerical and string variables
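
Because the right strategy depends on the column type, it can help to impute numerical and string columns separately. The sketch below assumes a small made-up frame with one column of each kind; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one numerical and one string column, each with a gap
toy = pd.DataFrame({
    "score": [50.0, 70.0, np.nan, 90.0],
    "lunch": ["standard", np.nan, "standard", "free/reduced"],
})

# Numerical column: fill the gap with the median of the observed values
num_imputer = SimpleImputer(strategy="median")
toy["score"] = num_imputer.fit_transform(toy[["score"]]).ravel()

# String column: fill the gap with the most frequent value
cat_imputer = SimpleImputer(strategy="most_frequent")
toy["lunch"] = cat_imputer.fit_transform(toy[["lunch"]]).ravel()

print(toy)
```

Applying `most_frequent` to every column at once also works on mixed data, but it replaces missing numbers with the mode rather than the mean or median, which may not be what is wanted for continuous scores.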

In [13]:
# for central-tendency imputation, we can use scikit-learn's SimpleImputer

# import packages
from sklearn.impute import SimpleImputer

# implement 
imputer = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')
imputed_dataset = imputer.fit_transform(dataset)
imputed_dataset = pd.DataFrame(data=imputed_dataset, columns=dataset.columns)

imputed_dataset.isna().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

## **C. K Nearest Neighbors Imputation**

* `KNNImputer` in scikit-learn works on numerical data only, because it fills each missing value from the nearest rows, as measured by a distance between data points.

* To use `KNNImputer` with categorical data, the categories must first be encoded as numbers, and the imputed values decoded back into categories afterwards.

In [14]:
# KNN imputation on the numerical columns
# import packages
from sklearn.impute import KNNImputer

# build the working dataset
new_dataset = dataset.select_dtypes('number').copy()
new_dataset['gender'] = dataset['gender']
new_dataset.loc[new_dataset['math score'] <= 50, 'math score'] = np.nan  # blank out low math scores to simulate missing values

# create the model and impute every column except 'gender' (the last one)
knn_imputer = KNNImputer(n_neighbors=5, missing_values=np.nan)
knn_imputed_dataset = knn_imputer.fit_transform(new_dataset.iloc[:, :-1])

# rebuild the DataFrame and reattach 'gender'
knn_imputed_dataset = pd.DataFrame(data=knn_imputed_dataset, columns=new_dataset.columns[:-1], index=new_dataset.index)
knn_imputed_dataset['gender'] = new_dataset['gender']
knn_imputed_dataset.isna().sum()

math score       0
reading score    0
writing score    0
gender           0
dtype: int64

## **D. Regression Imputation**

* Regression Imputation is a method used to handle missing data by predicting missing values based on other observed variables in the dataset. In this technique, a regression model is built where the feature (or column) with missing data is the dependent variable, and the other features are independent variables (predictors). The regression model is then used to estimate and impute the missing values.

* For example, if a dataset has missing values in the column `income`, a regression model can be built from other features such as age and education level to predict the missing `income` values.

* To use regression imputation with categorical predictors, encode them as numbers first (for example with `pd.get_dummies`), because linear regression requires numerical features.

* **Steps to Use Regression Imputation:**

1. Identify missing values in your dataset.

2. Preprocess the data: Encode categorical features and split the dataset into observed (non-missing) and missing parts.

3. Fit a regression model using the rows without missing values.

4. Predict and impute missing values using the trained model.
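
The steps above can be sketched on the `income` example. The data is made up so that income is exactly 1000 times age, which makes the imputed values easy to check by hand. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): income = 1000 * age, with two incomes missing
toy = pd.DataFrame({
    "age": [25.0, 30.0, 35.0, 40.0, 45.0],
    "income": [25000.0, 30000.0, np.nan, 40000.0, np.nan],
})
target = "income"
predictors = ["age"]

# Steps 1-2: locate the missing rows; this mask splits observed from missing
missing_mask = toy[target].isna()

# Step 3: fit the regression on the rows where the target is observed
model = LinearRegression()
model.fit(toy.loc[~missing_mask, predictors], toy.loc[~missing_mask, target])

# Step 4: predict the target for the missing rows and write it back
toy.loc[missing_mask, target] = model.predict(toy.loc[missing_mask, predictors])

print(toy)
```

Real data will not lie on a perfect line, so the imputed values will be less variable than true observations would be. This is the main caveat of single regression imputation.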

In [15]:
# import relevant packages
from sklearn.linear_model import LinearRegression
import pandas as pd

# 1. identify missing values
new_dataset[new_dataset.isna().any(axis=1)].index  # axis=1: check across the columns of each row

# 2 encode the dataset
encoded_dataset = pd.get_dummies(data=new_dataset, columns=['gender'], drop_first=True, dtype=int) # 1 = male, 0 = female

# 3. isolate the dataset as missing value dataset and non missing values dataset
missing_values_dataset = encoded_dataset[encoded_dataset.isna().any(axis=1)]
non_missing_values_dataset = encoded_dataset[~encoded_dataset.isna().any(axis=1)]

# 4. prepare feature + response then construct regression model and train the model

# 4.1 Prepare data for training (predictors and target)
X = non_missing_values_dataset.drop(columns=['math score']) # features var
y = non_missing_values_dataset['math score'] # response var

# construct regression model and train the model
regression_imputer = LinearRegression() 
regression_imputer.fit(X, y)

# 5. Predict and Impute the Missing Values
X_pred = missing_values_dataset.drop(columns=['math score'])
y_pred = regression_imputer.predict(X_pred)

# 5.1 Fill in the missing values in the original dataset
encoded_dataset.loc[encoded_dataset['math score'].isna(),['math score']] = y_pred

print (encoded_dataset.isna().sum())
encoded_dataset['gender'] = np.where(encoded_dataset['gender_male'] == 1, 'male', 'female')  # decode the dummy column to show the result
encoded_dataset.head()

math score       0
reading score    0
writing score    0
gender_male      0
dtype: int64


|   | math score | reading score | writing score | gender_male | gender |
|---|------------|---------------|---------------|-------------|--------|
| 0 | 72.0 | 72 | 74 | 0 | female |
| 1 | 69.0 | 90 | 88 | 0 | female |
| 2 | 90.0 | 95 | 93 | 0 | female |
| 3 | 56.602528 | 57 | 44 | 1 | male |
| 4 | 76.0 | 78 | 75 | 1 | male |


This project does not cover some techniques for handling missing values, such as Hot Deck Imputation and Multiple Imputation.

* We will cover those techniques in the future :)