# **<ins style="color:aqua">Feature Engineering</ins>**
## **<ins style="color:green">Handling Missing Values</ins>**
1. ### **<ins style="color:red">(CCA : Complete Case Analysis)</ins>**
   - Remove the whole row in which a NaN value is present.
   - The data must be missing at random.
   - Complete Case Analysis (CCA), also called "list-wise deletion" of cases, consists of discarding observations (rows) where the value of any of the variables (columns) is missing.
   - Complete Case Analysis literally means analyzing only those observations for which there is information in all of the variables in the dataset.
   - __Assumption For CCA:__ MCAR : Missing Completely at Random
   - __Advantage__ :
     - Easy to implement, as no data manipulation is required.
     - Preserves the variable distribution: if the data is MCAR, the distribution of the variables in the reduced dataset should match the distribution in the original dataset.
   - __Disadvantage__:
     - It can exclude a large fraction of the original dataset (if missing data is abundant).
     - Excluded observations could be informative for the analysis (if the data is not missing at random).
     - When the model is used in production, it will not know how to handle missing data.
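   - __When to use CCA:__
     - The data is MCAR (Missing Completely At Random).
     - The percentage of missing data in the column should be low (a commonly cited rule of thumb is around 5% or less). If the percentage of missing data in a column is high, do not apply CCA, because too many rows would be discarded.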
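The CCA steps above can be sketched with pandas. This is a minimal illustration on a made-up toy DataFrame (the values are assumptions, not the Titanic data used later): it reports the percentage of missing values per column, drops every row that contains a NaN, and checks how much of the data survives.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, np.nan, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})

# Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# Complete Case Analysis: keep only rows with no missing value in any column.
df_cca = df.dropna()

# Fraction of the original rows that remain after list-wise deletion.
rows_kept = len(df_cca) / len(df)

print(missing_pct)
print(f"Rows kept: {rows_kept:.0%}")
```

On real data it is worth comparing the distribution of each column before and after `dropna()` to check that the MCAR assumption looks plausible.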

3. ### **<ins style="color:red">Impute (Fill NaN Value)</ins>**
   - #### **Univariate**: the _SimpleImputer_ class in scikit-learn handles _univariate_ imputation.
   - If a column has a missing value, fill it using the remaining data present in that same column.
     - <ins style="color:blue"> __Numerical Type Column__ </ins>
       - Methods for filling missing values in numerical columns:
         - Mean
         - Median
         - Random Value
         - End of Distribution Value
     - <ins style="color:blue"> __Categorical Type Column__ </ins>
       - Methods for filling missing values in categorical columns:
         - Mode
         - The word "Missing" (treated as a new category)
   - #### **Multivariate**
   - If a column has a missing value, fill it using the data from all the other columns.
     - __KNN Imputer__ Method
     - __Iterative Imputer__ Method
- __Missing Indicator__
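The univariate options listed above can be sketched with `SimpleImputer`. The toy arrays are made up for illustration. The sketch covers mean, median, mode, a constant "Missing" label, and the missing indicator (via `add_indicator=True`); random-value and end-of-distribution imputation are not shown.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns, made up for illustration only.
num = np.array([[20.0], [np.nan], [30.0], [70.0]])
cat = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# Numerical column: fill NaN with the column mean or median.
mean_filled = SimpleImputer(strategy="mean").fit_transform(num)
median_filled = SimpleImputer(strategy="median").fit_transform(num)

# Categorical column: fill NaN with the mode or with a constant label.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(cat)
const_filled = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(cat)

# Missing indicator: append a 0/1 column flagging which rows were imputed.
with_flag = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(num)

print(mean_filled.ravel(), median_filled.ravel())
print(mode_filled.ravel(), const_filled.ravel())
print(with_flag)
```

In a train/test setting the imputer would be fitted on the training data only and then applied to the test data with `transform`.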

# <b style="color:aqua">Multivariate Imputation Handling Missing Data</b>
## <b style="color:green">KNN Imputer</b>
- `class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False, keep_empty_features=False)`
- Fill the missing value using the data present in the other columns.
- A NaN value is filled using the values from the rows whose other columns are closest to those of the incomplete row.
- Closeness is measured with the __Euclidean Distance__ formula.
- If the value of __K__ in __KNN__ is 2, the NaN value is filled using the two nearest neighbours.  \
  If the value of __K__ in __KNN__ is __n__, the NaN value is filled using the __n__ nearest neighbours.
- First, find the K nearest neighbours by calculating the _Euclidean Distance_.
- Then take the neighbours' values for the missing column and compute their mean.
- **Euclidean_Distance Formula :**  \
  `dist(p, q) = sqrt((x1-x2)**2 + (y1-y2)**2)` for two points _p = (x1, y1)_ and _q = (x2, y2)_
- Because rows can contain NaN, we use **Nan_Euclidean_Distance:** \
  `dist(x, y) = sqrt(weight * squared distance from present coordinates)` where _weight = total number of coordinates / number of present coordinates_
- `sklearn.metrics.pairwise.nan_euclidean_distances(X, Y=None, *, squared=False, missing_values=nan, copy=True)`
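- __Pros__:
  - It can give more accurate imputations than univariate methods, because it uses the information in the other columns.
- __Cons__:
  - It requires a large number of distance calculations.
  - It is time-consuming.
  - The whole training set must be deployed on the server.
  - It consumes more memory and reduces speed.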
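The procedure described above (compute NaN-aware distances, pick the K nearest rows that have the column observed, average their values) can be sketched by hand and compared with `KNNImputer`. The helper `knn_impute_manual` is a hypothetical name introduced only for this illustration. It assumes uniform weights and a small toy matrix, and it does not handle edge cases such as rows that share no observed coordinate.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy matrix, made up for illustration only.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

def knn_impute_manual(X, k=2):
    """Hypothetical helper: uniform-weight KNN imputation, for illustration."""
    # Pairwise distances that skip missing coordinates and rescale by
    # (total coordinates / present coordinates).
    dist = nan_euclidean_distances(X, X)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate donors: rows where column j is observed.
        donors = np.where(~np.isnan(X[:, j]))[0]
        # Keep the k donors closest to row i.
        nearest = donors[np.argsort(dist[i, donors], kind="stable")[:k]]
        # Fill with the mean of the donors' values in column j.
        filled[i, j] = X[nearest, j].mean()
    return filled

manual = knn_impute_manual(X, k=2)
sklearn_result = KNNImputer(n_neighbors=2).fit_transform(X)

print(manual)
print(np.allclose(manual, sklearn_result))
```

For the first row, the distance to the second row uses only the two shared columns: `sqrt(3/2 * ((1-3)**2 + (2-4)**2)) = sqrt(12)`. Its two nearest donors have values 3 and 5 in the last column, so the NaN is filled with 4.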

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv("../data/csvData/train.csv")[['Age', 'Pclass', 'Fare', 'Survived']]
df.head(7)

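|   | Age | Pclass | Fare | Survived |
|---|-----|--------|------|----------|
| 0 | 22.0 | 3 | 7.25 | 0 |
| 1 | 38.0 | 1 | 71.2833 | 1 |
| 2 | 26.0 | 3 | 7.925 | 1 |
| 3 | 35.0 | 1 | 53.1 | 1 |
| 4 | 35.0 | 3 | 8.05 | 0 |
| 5 | NaN | 3 | 8.4583 | 0 |
| 6 | 54.0 | 1 | 51.8625 | 0 |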


In [3]:
df.isnull().mean()*100

Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64

In [4]:
X = df.drop(columns=['Survived'])
y = df['Survived']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
X_train.head()

|     | Age | Pclass | Fare |
|-----|-----|--------|------|
| 30 | 40.0 | 1 | 27.7208 |
| 10 | 4.0 | 3 | 16.7 |
| 873 | 47.0 | 3 | 9.0 |
| 182 | 9.0 | 3 | 31.3875 |
| 876 | 20.0 | 3 | 9.8458 |


`class sklearn.impute.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False, keep_empty_features=False)`
- __missing_values__ : _int, float, str, np.nan or None_, default=np.nan
- __n_neighbors__ : _int_, default=5
- __weights__ : _{‘uniform’, ‘distance’} or callable_, default=’uniform’
- __metric__ : _{‘nan_euclidean’} or callable_, default=’nan_euclidean’
- __copy__ : _bool_, default=True
- __add_indicator__ : _bool_, default=False
- __keep_empty_features__ : _bool_, default=False
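To make the effect of two of these parameters concrete, here is a small sketch on a made-up toy matrix. It compares `weights='uniform'` with `weights='distance'` and shows how `add_indicator=True` appends one 0/1 column for each feature that had missing values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix, made up for illustration only.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Uniform weights: plain mean of the 2 nearest neighbours' values.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# Distance weights: closer neighbours count more (weight = 1 / distance).
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# add_indicator=True appends a missing-value flag per affected feature.
flagged = KNNImputer(n_neighbors=2, add_indicator=True).fit_transform(X)

print(uniform[0, 2], distance[0, 2])
print(flagged.shape)
```

For the first row, the two neighbours have values 3 and 5 at distances `sqrt(12)` and `sqrt(48)`. The nearer one gets twice the weight, so the distance-weighted fill is `(2*3 + 5) / 3`, about 3.67, instead of the uniform mean of 4.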

In [6]:
knn = KNNImputer(n_neighbors=3, weights='distance')
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
X_train_trf.shape, X_test_trf.shape

((712, 3), (179, 3))

In [7]:
X_train_trf

array([[ 40.        ,   1.        ,  27.7208    ],
       [  4.        ,   3.        ,  16.7       ],
       [ 47.        ,   3.        ,   9.        ],
       ...,
       [ 71.        ,   1.        ,  49.5042    ],
       [ 32.66666667,   1.        , 221.7792    ],
       [ 49.76289518,   1.        ,  25.925     ]])

In [8]:
pd.DataFrame(X_train_trf, columns=X_train.columns)

|     | Age | Pclass | Fare |
|-----|-----|--------|------|
| 0 | 40.000000 | 1.0 | 27.7208 |
| 1 | 4.000000 | 3.0 | 16.7000 |
| 2 | 47.000000 | 3.0 | 9.0000 |
| 3 | 9.000000 | 3.0 | 31.3875 |
| 4 | 20.000000 | 3.0 | 9.8458 |
| ... | ... | ... | ... |
| 707 | 30.000000 | 3.0 | 8.6625 |
| 708 | 26.151292 | 3.0 | 8.7125 |
| 709 | 71.000000 | 1.0 | 49.5042 |
| 710 | 32.666667 | 1.0 | 221.7792 |


In [9]:
lr = LogisticRegression()
lr.fit(X_train_trf, y_train)
y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)*100

71.50837988826815

In [10]:
# Comparison with SimpleImputer (default strategy: mean)
si = SimpleImputer()
X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [11]:
lr = LogisticRegression()
lr.fit(X_train_trf2, y_train)
y_pred2 = lr.predict(X_test_trf2)
accuracy_score(y_test, y_pred2)*100

69.27374301675978
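The comparison above rests on a single train/test split, so the gap between the two scores may partly reflect that particular split. One way to check is to cross-validate each imputer inside a pipeline and try several values of `n_neighbors`. The sketch below uses synthetic data from `make_classification` with NaNs injected at random, which is an assumption made so that the example is self-contained. It is not the Titanic data, and its scores say nothing about the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): 4 features, about 20% NaN injected into column 0.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape[0]) < 0.2, 0] = np.nan

# Candidate imputers: mean imputation and KNN with several neighbour counts.
candidates = {"mean": SimpleImputer(strategy="mean")}
for k in (1, 3, 5, 10):
    candidates[f"knn_k={k}"] = KNNImputer(n_neighbors=k)

# Putting the imputer in the pipeline means it is re-fitted on each
# training fold, so no information leaks from the validation fold.
scores = {}
for name, imputer in candidates.items():
    model = make_pipeline(imputer, LogisticRegression(max_iter=1000))
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop could be run on `X` and `y` from the notebook to see whether KNN imputation keeps its advantage across folds.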