# Missing Value Handling

Missing value handling is one of the most important steps in model building and in any data analysis that is used for making decisions. If missing values are not handled properly, they can lead to poor decisions, which in turn can lead to severe business losses.



Missing values are usually represented as `NaN`, `null`, or `None` in the dataset.



![image.png](attachment:image.png)



The `dataframe.info()` function gives information about the dataset. It lists the column names along with the number of non-null values in each column.

![image-3.png](attachment:image-3.png)

The second way of finding whether we have null values in the data is by using the `isnull()` function.

![image-3.png](attachment:image-3.png)
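As a minimal sketch of the two checks above, the following uses a small hypothetical DataFrame (the column names and values are made up for illustration) and counts the missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
    "salary": [50000.0, 62000.0, np.nan, np.nan],
})

# Prints column names, non-null counts, and dtypes.
df.info()

# isnull() returns a boolean mask; summing it counts missing values per column.
missing_per_column = df.isnull().sum()

# Taking the mean of the mask gives the fraction of missing values per column.
missing_share = df.isnull().mean()

print(missing_per_column)
print(missing_share)
```

Note that `isnull()` treats both `NaN` and `None` as missing.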

## Importance of handling missing data

- Many machine learning algorithms fail if the dataset contains missing values. Some algorithms, such as k-nearest neighbors and Naive Bayes, can be formulated to tolerate missing values, but whether they do depends on the implementation you use.
- You may end up building a biased machine learning model which will lead to incorrect results if the missing values are not handled properly.

## Types Of Missing Value

![image-4.png](attachment:image-4.png)

### 1. Missing Completely at Random (MCAR)

The probability that a value is missing is completely independent of the other data, observed or unobserved. There is no pattern.

In the case of MCAR, the data could be missing due to human error, a system or equipment failure, loss of a sample, or some other technical problem while recording the values.

### 2. Missing at Random (MAR)

Missing at random (MAR) means that the reason for the missing values can be explained by variables on which you have complete information, because there is some relationship between the missingness and the other observed data.

Example: Suppose the variables ‘Gender’ and ‘Age’ are related, and the missing values of ‘Age’ can be explained by ‘Gender’. You may find that everyone has answered ‘Gender’, but ‘Age’ is mostly missing for people who answered ‘female’ (perhaps because those respondents were less willing to reveal their age).

### 3. Missing not at Random (MNAR)

Missing values depend on the unobserved data.

If there is some structure/pattern in the missing data and the other observed data cannot explain it, then it is Missing Not At Random (MNAR).


For example, suppose a library poll asks for the respondent's name and the number of overdue books. People with no overdue books are likely to answer the poll, while people with more overdue books are less likely to answer it.

So in this case, whether the number of overdue books is missing depends on the value itself: it is mostly the people with more overdue books who leave it unanswered.

![image.png](attachment:image.png)
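To make the three mechanisms concrete, here is a rough simulation sketch. The data is synthetic, and the column names, probabilities, and thresholds are all assumptions chosen only for illustration; real missingness is rarely this clean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic, fully observed data (hypothetical columns).
df = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "age": rng.integers(18, 70, size=n).astype(float),
    "overdue_books": rng.poisson(1.0, size=n).astype(float),
})

# MCAR: every 'age' value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2
age_mcar = df["age"].mask(mcar_mask)

# MAR: missingness of 'age' depends on an observed column ('gender'),
# not on the age value itself.
p_mar = np.where(df["gender"] == "female", 0.4, 0.05)
mar_mask = rng.random(n) < p_mar
age_mar = df["age"].mask(mar_mask)

# MNAR: missingness of 'overdue_books' depends on the unobserved value itself.
p_mnar = np.where(df["overdue_books"] >= 2, 0.6, 0.05)
mnar_mask = rng.random(n) < p_mnar
overdue_mnar = df["overdue_books"].mask(mnar_mask)

# Under MAR, the missing rate differs between the observed groups.
mar_rates = age_mar.isnull().groupby(df["gender"]).mean()
print(mar_rates)
```

In practice only the masked columns would be available, which is why MNAR is hard to detect: the variable that explains the missingness is exactly the one you cannot see.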

## Handling Missing Data

There are 2 primary ways of handling missing values:

- `Deleting the Missing values`

- `Imputing the Missing Values`

### 1. Deleting/dropping the Missing value

This approach is generally not recommended because it throws away information. It is one of the quick and dirty techniques one can use to deal with missing values.

#### Deleting the rows with missing data
  - If a row has many missing values then we can drop the entire row.


![image.png](attachment:image.png)
    
#### Deleting the columns with missing data
  - If a certain column has many missing values then you can choose to drop the entire column.

![image-2.png](attachment:image-2.png)

`axis=1` is used to drop the column with `NaN` values.

`axis=0` is used to drop the row with `NaN` values.
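A minimal sketch of both deletion strategies with pandas `dropna()`, using a hypothetical three-column DataFrame. The `thresh` variant at the end is an optional extra, shown here as one way to drop only the rows that are mostly empty.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 'b' is mostly missing, column 'c' is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": [9.0, 10.0, 11.0, 12.0],
})

# axis=0: drop every row that contains at least one NaN.
rows_dropped = df.dropna(axis=0)

# axis=1: drop every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

# thresh=2: keep only rows that have at least 2 non-null values.
rows_mostly_full = df.dropna(axis=0, thresh=2)

print(rows_dropped)
print(cols_dropped)
print(rows_mostly_full)
```

On this toy data, dropping rows leaves a single row and dropping columns leaves a single column, which illustrates how much information deletion can cost.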

### 2. Imputing the Missing Value

1. Replacing With `Mean`
2. Replacing With `Mode`
3. Replacing With `Median`
4. Replacing With `Arbitrary Value`
5. Replacing with `previous value – Forward fill`
6. Replacing with `next value – Backward fill`
7. Imputation of Missing Value Using scikit-learn
   - Univariate Approach: `SimpleImputer`
   - Multivariate Approach: `IterativeImputer` (regression model), `KNNImputer` (Nearest Neighbors Imputation)

### Replacing With Mean / Median / Mode

- Mean is most commonly used for imputing missing values for numeric columns. If there are outliers then this technique will not be appropriate. In such cases, outliers need to be treated first.

- Median is the middlemost value. It’s better to use the median value for imputation in the case of outliers.

- Mode is the most frequently occurring value. It is used in the case of categorical features.

![image.png](attachment:image.png)
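The three statistics can be compared on a small hypothetical example. The `income` column below deliberately contains an outlier so that the difference between mean and median imputation is visible; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500.0 is an outlier in the numeric column.
df = pd.DataFrame({
    "income": [30.0, 35.0, np.nan, 40.0, 500.0],
    "color": ["red", "blue", "red", None, "red"],
})

# Mean imputation: pulled upwards by the outlier.
mean_filled = df["income"].fillna(df["income"].mean())

# Median imputation: robust to the outlier.
median_filled = df["income"].fillna(df["income"].median())

# Mode imputation for a categorical column; mode() returns a Series,
# so take its first entry.
mode_filled = df["color"].fillna(df["color"].mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[3])
```

Here the mean fills the gap with 151.25 while the median fills it with 37.5, which is much closer to the typical values in the column.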

### Replacing With `Arbitrary Value`

- If you can make an educated guess about the missing value, then you can replace it with that chosen (arbitrary) value.
- Alternatively, impute a value that does not otherwise occur in the column, so that the missing entries are treated as a separate category.
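A short sketch of both ideas follows. The label `"Unknown"` and the sentinel `-1` are arbitrary choices made for this example, and the extra indicator column is an optional addition that records which rows were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries in both columns.
df = pd.DataFrame({
    "city": ["Pune", None, "Delhi", None],
    "children": [2.0, np.nan, 0.0, np.nan],
})

# Categorical column: treat missing entries as their own category.
df["city"] = df["city"].fillna("Unknown")

# Optional: remember which rows were missing before filling them.
df["children_missing"] = df["children"].isnull()

# Numeric column: fill with an arbitrary sentinel outside the valid range.
df["children"] = df["children"].fillna(-1)

print(df)
```

A sentinel such as `-1` can distort distance-based or linear models, so this choice tends to suit tree-based models better.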

### Replacing with `previous value – Forward fill`

In some cases, we impute a missing value with the previous value. This is called forward fill. It is mostly used in **time series data**.

![image-2.png](attachment:image-2.png)
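A minimal forward-fill sketch on a hypothetical daily series, using pandas `ffill()`. The dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps, including one at the very start.
s = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Each NaN is replaced by the most recent earlier observation.
forward_filled = s.ffill()

print(forward_filled)
```

The first value stays `NaN` because there is no earlier observation to carry forward.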

### Replacing with `next value – Backward fill`

The missing value is imputed using the next value.

![image.png](attachment:image.png)
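The same idea in the other direction can be sketched with pandas `bfill()`, again on a hypothetical series.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps, including one at the very end.
s = pd.Series(
    [np.nan, 10.0, np.nan, 13.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Each NaN is replaced by the next later observation.
backward_filled = s.bfill()

print(backward_filled)
```

The last value stays `NaN` because there is no later observation to pull back. For time series, also keep in mind that backward fill uses information from the future.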

### Univariate Approach : SimpleImputer

In this approach, only a single feature is taken into consideration at a time.

We use SimpleImputer and replace the missing values with mean, mode, median or some constant value.

![image.png](attachment:image.png)
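A minimal sketch with scikit-learn's `SimpleImputer` on a hypothetical two-column array. Each column is imputed from its own statistics only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0],
])

# strategy can be "mean", "median", "most_frequent", or "constant".
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# With strategy="constant", fill_value supplies the replacement.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
X_constant = constant_imputer.fit_transform(X)

print(X_mean)
print(X_constant)
```

In a modelling workflow, the imputer would normally be fitted on the training data only and then applied to the test data with `transform`, so that test-set statistics do not leak into training.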

### Multivariate Approach : IterativeImputer(regression model)

In this case, the null values in one column are filled by fitting a regression model using other columns in the dataset.

![image.png](attachment:image.png)
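A small sketch with scikit-learn's `IterativeImputer`. The class is still marked experimental, so it has to be enabled explicitly with the import shown below. The data is hypothetical and built so that the second column is twice the first, and the default estimator is assumed rather than tuned.

```python
import numpy as np

# IterativeImputer is experimental and must be enabled before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data where column 1 is exactly 2 * column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

# Each feature with missing values is modelled as a function of the others.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)

# The linear relationship suggests the filled value should be near 8.
print(X_imputed[3, 1])
```

A univariate mean imputation would have filled this gap with 5.5, ignoring the relationship between the two columns.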

### Multivariate Approach : KNNImputer(Nearest Neighbors Imputations)

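Missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance that skips the missing coordinates is used to find the nearest neighbors. Each missing value is then filled with the average of that feature over the nearest neighbors.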

![image.png](attachment:image.png)
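A minimal sketch with scikit-learn's `KNNImputer` on a small hypothetical array. The choice of `n_neighbors=2` is only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the coordinates both rows share.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Here the missing entry in the first row becomes 4.0 (the mean of 3.0 and 5.0) and the missing entry in the third row becomes 5.5 (the mean of 3.0 and 8.0). Because the method is distance-based, features on very different scales usually need scaling first.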

### Conclusion

There is no perfect way of filling the missing values in a dataset. We have to understand the type of missing data in light of the business scenario, and experiment with different methods to see which one works best for the given dataset.

#  Thank You :)