## Handling Missing Values

Why do you need to fill in the missing data? 

Because most machine learning models raise an error if you pass NaN values to them. The easiest fix is to fill every gap with 0, but this can reduce your model's accuracy significantly.
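
As a small illustration of that error, the sketch below fits a scikit-learn `LinearRegression` on a toy array containing one NaN. The array values are made up for the example; other estimators may behave differently, and a few accept NaN natively.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy feature matrix with one missing entry (illustrative values only)
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as exc:
    # scikit-learn validates its input and rejects NaN for this estimator
    fit_failed = True
    print(f"fit failed: {exc}")
```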


Missing values are a common concern in real-life datasets. When observations are collected, values can go missing for reasons as diverse as:

* an error in machinery or equipment
* an error on the part of the researcher
* unavailable respondents
* accidental deletion of observations
* forgetfulness on the part of the respondents
* an error in accounting, etc.

In [None]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from datetime import date
%matplotlib inline

In [None]:
# Data 
df = pd.read_csv('data/titanic.csv')

In [None]:
df.head()

In [None]:
df.drop(columns=["Name", "Ticket", "PassengerId", "Cabin", "Embarked"], inplace=True)

In [None]:
df.head()

In [None]:
print(df.isna().sum())

In [None]:
# Alternative: label-encode with scikit-learn
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# df['Sex'] = le.fit_transform(df['Sex'])
# pandas get_dummies is the main method for one-hot encoding.
# drop_first=True leaves a single indicator column; store it as a 0/1 Series.
df['Sex'] = pd.get_dummies(df['Sex'], drop_first=True).iloc[:, 0].astype(int)

## The methods we will be discussing are:

### 1. Deleting the column with missing data

In [None]:
updated_df = df.dropna(axis=1)

In [None]:
updated_df.info()

### 2. Deleting the row with missing data

In [None]:
updated_df_1 = df.dropna(axis=0)

In [None]:
updated_df_1.info()
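
To make the cost of the two deletion strategies concrete, here is a minimal sketch on a hypothetical four-row frame (the values are invented, not taken from the Titanic file). Dropping columns loses a whole feature, while dropping rows loses observations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only Age has missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Survived": [0, 1, 1, 1],
})

no_missing_cols = toy.dropna(axis=1)  # removes the Age column entirely
no_missing_rows = toy.dropna(axis=0)  # removes the two rows where Age is NaN

print(no_missing_cols.shape)  # (4, 2)
print(no_missing_rows.shape)  # (2, 3)
```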

### 3. Filling the Missing Values – Imputation

In this case, we fill the missing values with a substitute value.

The possible ways to do this are:

* Filling the missing data with the mean or median value if it’s a numerical variable.
* Filling the missing data with mode if it’s a categorical value.
* Filling the numerical value with 0 or -999, or some other number that will not occur in the data. This lets the model recognize that these entries are not real observations.
* Filling the categorical value with a new type for the missing values.

You can use the pandas fillna() method to fill the null values in the dataset.

In [None]:
updated_df_3 = df.copy()
updated_df_3['Age'] = updated_df_3['Age'].fillna(updated_df_3['Age'].mean())
updated_df_3.info()
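
The other strategies from the list can be sketched the same way. The frame below is a hypothetical toy example (the `Embarked` values are invented so that a categorical column is available), and the new column names are only for side-by-side comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

filled = toy.copy()
# Numerical: median of the observed values
filled["Age_median"] = toy["Age"].fillna(toy["Age"].median())
# Numerical: sentinel value that cannot occur as a real age
filled["Age_sentinel"] = toy["Age"].fillna(-999)
# Categorical: most frequent category (mode() returns a Series, take the first)
filled["Embarked_mode"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
# Categorical: a new category that marks the value as missing
filled["Embarked_new"] = toy["Embarked"].fillna("Missing")

print(filled)
```

A sentinel such as -999 tends to suit tree-based models better than linear ones, since a linear model treats it as a genuine, very extreme value.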

### 4. Imputation with an additional column

In [None]:
updated_df_4 = df.copy()
updated_df_4['Ageismissing'] = updated_df_4['Age'].isnull()
updated_df_4.head()

In [None]:
from sklearn.impute import SimpleImputer, KNNImputer, MissingIndicator

my_imputer = SimpleImputer(strategy = 'mean')
data_new = my_imputer.fit_transform(updated_df_4)
pd.DataFrame(data_new, columns= updated_df_4.columns)
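
As an alternative to building the indicator column by hand, `SimpleImputer` can append it itself through its `add_indicator` parameter. The sketch below uses a hypothetical two-column array (think Age and Fare) rather than the Titanic frame.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: column 0 has one missing entry, column 1 has none
X = np.array([[22.0, 7.25],
              [np.nan, 71.28],
              [26.0, 7.92]])

# add_indicator=True appends one 0/1 column per feature that had missing values
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out)
```

The output has three columns: the two imputed features followed by the indicator for column 0, so the model can still see which values were filled in.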

#### Essence of the kNN algorithm

Univariate imputation methods are simplistic and may not always give an accurate picture. For example, suppose we have variables for the density of cars on the road and the level of pollutants in the air, and a few observations of the pollutant level are missing. Imputing them with the mean or median pollutant level may not be appropriate, because it ignores the relationship with car density.

In such scenarios, algorithms like k-Nearest Neighbors (kNN) can help to impute the values of missing data. Sociologists and community researchers suggest that human beings live in a community because neighbors generate a feeling of security and safety, attachment to community, and relationships that bring out a community identity through participation in various activities.

kNN imputation applies a similar idea to data: it identifies neighboring points through a measure of distance and estimates each missing value from the observed values of those neighbors.

In [None]:
my_imputer = KNNImputer(n_neighbors=2)
data_new = my_imputer.fit_transform(updated_df_4)
pd.DataFrame(data_new, columns= updated_df_4.columns)
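
To see the neighbor averaging at work, here is a small sketch on a toy matrix (the same kind of example used in the scikit-learn documentation, not Titanic data). With `n_neighbors=2`, each missing entry is replaced by the mean of that feature over the two closest rows that have it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Distances are computed on the coordinates both rows have observed
knn = KNNImputer(n_neighbors=2)
X_filled = knn.fit_transform(X)

print(X_filled)
# Row 0, column 2 becomes 4.0 (mean of 3 and 5 from rows 1 and 2)
# Row 2, column 0 becomes 5.5 (mean of 3 and 8 from rows 1 and 3)
```

Because the method relies on distances, features on very different scales are usually standardized first so that no single column dominates the neighbor search.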

### 5. Filling with a Regression Model
In this case, the null values in one column are filled by fitting a regression model using other columns in the dataset.

That is, the regression model uses every column except Age as the features X and Age as the target y.

After filling the values in the Age column, we will use logistic regression to calculate accuracy.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
df.info()

In [None]:
reg_df = df.copy()

In [None]:
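# Rows with a missing Age are the ones to predict; the rest train the regressor.
# .copy() avoids pandas' SettingWithCopyWarning on the in-place edits below.
testdf = reg_df[reg_df['Age'].isna()].copy()
traindf = reg_df[reg_df['Age'].notna()].copy()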

In [None]:
y = traindf['Age']
traindf.drop("Age",axis=1,inplace=True)

In [None]:
lr.fit(traindf, y)

In [None]:
testdf.drop("Age", axis=1, inplace=True)

In [None]:
pred = lr.predict(testdf)
testdf['Age'] = pred

In [None]:
testdf
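
The whole procedure, including the logistic regression step mentioned above, can be sketched end to end. This is only an illustration under stated assumptions: the frame is synthetic (randomly generated, not the Titanic data), the feature list is hypothetical, and `Survived` is left out of the Age regressor here so that the classifier's target does not leak into its own input.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic frame (hypothetical values, not real data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
toy["Age"] = 50 - 8 * toy["Pclass"] + rng.normal(0, 5, size=n)
toy["Survived"] = (toy["Fare"] + rng.normal(0, 30, size=n) > 50).astype(int)
# Blank out 40 ages at random to simulate missing data
toy.loc[rng.choice(n, size=40, replace=False), "Age"] = np.nan

# Step 1: fit a regressor for Age on the rows where Age is known
features = ["Pclass", "Fare"]
known = toy["Age"].notna()
age_model = LinearRegression().fit(toy.loc[known, features], toy.loc[known, "Age"])

# Step 2: predict Age only for the rows where it is missing
imputed = toy.copy()
imputed.loc[~known, "Age"] = age_model.predict(toy.loc[~known, features])

# Step 3: score a classifier on the completed data with 5-fold cross-validation
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(
    clf, imputed[["Pclass", "Fare", "Age"]], imputed["Survived"], cv=5
)
print(scores.mean())
```

Running the same scoring step on frames filled by the mean, kNN, and regression approaches gives a like-for-like comparison of the imputation methods.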